Microarray Experiment Data Representation and Annotations Working Group

May 2000 - for archive purposes

Alvis Brazma (May 16)

Background

In less than three weeks many of us will meet in Heidelberg (MGED 2, May 25-27, see http://www.mged.org/), where we would like to discuss and accept the next draft recommendations on how array-based gene expression data should be annotated. This is a long e-mail that summarizes my view of what we would like to achieve, together with the draft proposal. If I get your comments within the next week or so, I will try to incorporate them into the draft before the meeting.

Goal

I see the main goal of the microarray annotations group as developing a proposal for how microarray experiments should be annotated and what information laboratories should provide about their published experiments (initially concentrating on gene expression).

The main reason for this is to facilitate the establishment of databases for microarray data, and the interoperability of such databases and microarray data analysis tools. Additionally, we would like to propose a definition of the minimum information about microarray experiments that laboratories should disclose with scientific publications based on data from such experiments (much as macromolecular structure data must be disclosed for publication).

This discussion group was established at the first meeting of Microarray based Gene Expression Databases (MGED), held at the European Bioinformatics Institute in November 1999, where the first draft of the "minimum information" definition was accepted (see http://www.mged.org/). Based on this definition and the few comments on the mailing list, I am proposing the next draft. I hope that our recommendations will feed into the development of the XML-based data exchange format (so far this has happened), while we take into account recommendations from the Ontology and other discussion groups.

Below is my current proposal. I would like you to read and comment on this from several perspectives:

  1. what information about a microarray experiment should be given for the reproducibility of its results (this should roughly coincide with the "minimum information" definition);
  2. what information should be captured in a public repository for microarray data for the database to be usable by third parties (i.e., for meaningful queries to be possible);
  3. what is the optimal compromise between capturing enough information about the experiments for meaningful queries to be possible and not overburdening the authors with too much extra work in producing these annotations.

1 and 2 do not necessarily have to coincide. For instance, the full TIFF microarray images may be necessary for reproducibility, but may be excessive to keep in a public database. There are two ways the "minimum definition" can be treated: a) as a legal obligation, in which case essentially only controlled vocabulary fields have real meaning (anything can be entered in the free text fields), or b) on the assumption that the data depositors will be cooperative. In my opinion only b) is an option.


 

PROPOSAL, May 8 2000

 (compiled by Alvis Brazma)

The minimum information about a published microarray based gene expression experiment should include:

  1. expression level measurement results, in particular:
    1. the TIFF image file from the hybridised microarray scanning;
    2. the image analysis output (of the particular image analysis software) for each spot, for each channel;
    3. a derived value summarising each spot in the authors' interpretation (e.g., a background subtracted intensity typically used for Stanford or Incyte technologies);

      #COMMENT: is a) (the TIFF image file) necessary? Are there other opinions?

      and
       
  2. the following annotations:
    1. array (e.g., platform type, substrate, number of spots, provider),
    2. each element (spot) on the array (e.g., sequence or clone and relevant accession numbers),
    3. sample source and treatment (e.g., organism, development stage, tissue, drug treatment),
    4. controls in the sample and on the array,
    5. hybridisation extract preparation (e.g., cell rupture method, nucleic acid extraction and labelling protocol),
    6. hybridisation procedure (e.g., time, concentration, volumes, washes),
    7. scanning procedure (e.g., hardware, output TIFF file header),
    8. image analysis and quantification (e.g., software, version, parameters),
      • also, we would like to encourage the image analysis software developers to try to design methods for standard ways of summarising spot quality.
    9. description of the experiment as a whole (e.g., set of related samples and hybridisations submitted together and their relationships [time series, comparative hybridisations], reference if published).

      #COMMENT: 1 and 2 were the consensus of MGED. There was no absolute consensus about the details of the minimum information, particularly because of concerns about the burden on experimenters.
      The majority view, edited to reflect the later comments, is:

 

  1. the following annotations:
    1. array
      • array name (e.g., "Stanford Human 10K set"),
      • platform type: spotted vs. synthesised (spotted - oligos, PCR products, plasmids, colonies),
      • if commercial array:
        • provider
        • unique ID from the provider
      • array dimensions
      • spot dimensions
      • number of columns and rows, or number of spots
      • brief description
      • substrate
         
    2. each element (spot) on the array
      • clone info (obligatory for cDNA arrays)
        • clone ID (plus clone provider and date)  
      • sequence info (if known)
        • sequence accession number in DDBJ, EMBL, or GenBank, if known
        • sequence itself (if databases do not contain it)
        • number of oligos and the reference sequence (or accession number) for Affymetrix type chips, plus the oligo sequences, if given
      • quality information from the clone provider
      • checking of the DNA quality (none, resequenced, quality check by gel separation, amount of DNA)
      • if the element can be used for normalisation or control (e.g., element should have expected value)
         
    3. sample source and treatment:

      #COMMENT: From what we have heard from the Ontology group, this has turned out to be one of the most difficult parts of the annotations. I propose to revise this part after we get recommendations from the Ontology discussion group. At the same time, some way of annotating the sample will have to be accepted even before we have developed an ontology, which may take time.
       
      • cell source and type (if derived from primary source(s))
      • organism (NCBI taxonomy)
      • sex
      • age
      • development stage
      • organism part (tissue)
      • animal/plant strain or line (if applicable)
      • genetic variation (gene knockout, transgenic variation, ...)
      • individual (if applicable)
      • unique identifier for references in other hybridisations or overall experiment description
      • individual genetic characteristics (disease alleles, polymorphisms, etc.)
      • disease state (or normal)
      • target cell type
      • separation technique (none, trimming, microdissection, FACS, ...)
      • estimated % target
      • cell line and source (if applicable)
      • treatments
      • in vivo (organism or individual treatments)
      • in vitro treatments (cell culture conditions)
      • treatment type (e.g., small molecule, heat shock, cold shock, food deprivation, ...)
      • compound
      • verbal description (laboratory protocol)

        #COMMENT: Note that some of these categories are relevant only in particular contexts. For instance, tissue is relevant only to multi-cellular organisms. The categories listed above should be considered as a tree-like structure, and depending on the particular path along the tree, only some of the specified categories become relevant.
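
The tree-like relevance described in this comment could be sketched roughly as follows. This is only an illustration: the `SampleContext` flags, the `relevant_categories` function, and the choice of which categories apply on every path are hypothetical and not part of the proposal. Only the tissue rule and the "if applicable" status of cell line are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleContext:
    """Hypothetical flags choosing a path through the category tree."""
    multicellular: bool    # False for, e.g., a yeast culture
    from_cell_line: bool   # True if the sample comes from a cell line


def relevant_categories(ctx: SampleContext) -> list[str]:
    """Return the sample categories that apply for the given context."""
    # Categories assumed (for illustration only) to apply on every path.
    categories = ["organism (NCBI taxonomy)", "genetic variation", "treatments"]
    if ctx.multicellular:
        # Stated in the proposal: tissue is relevant only to
        # multi-cellular organisms.
        categories.append("organism part (tissue)")
    if ctx.from_cell_line:
        # Listed in the proposal as "if applicable".
        categories.append("cell line and source")
    return categories


yeast = relevant_categories(SampleContext(multicellular=False, from_cell_line=False))
mouse = relevant_categories(SampleContext(multicellular=True, from_cell_line=False))
```

A submission form built this way would ask a depositor only for the categories on their path, which fits the goal of not overburdening authors.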
         
    4. controls

      #COMMENT: we will need feedback from the Normalisation and Quality control group to finalise d)
       
      • control type (prelabeled and added at hybridisation [calibration of scan intensity to quantity]; added at sample labelling [quantitate sample labelling]; added at sample amplification [IVT or PCR control], ...)
      • ID for the controls
      • associated normalisation type array elements
         
    5. hybridisation extract preparation

      #COMMENT: we need input from the experimenters to finalise e) and f)
       
      • reference to the respective extract preparation protocol, if it exists in the database
      • cell rupture method
      • chemical extraction procedure
      • physical extraction procedure
      • whether total RNA, mRNA, or genomic is extracted
      • type of preselection
      • amount of nucleic acids extracted
      • target amplification (RNA polymerases, PCR)
      • which label is used (e.g., Cy3, Cy5, 33P)
      • the labelling ratio
      • laboratory protocol (free text)
         
    6. hybridisation procedure
       
      • reference to the respective hybridisation protocol, if it already exists in the database
      • the solution (e.g., concentration of solutes)
      • blocking agent
      • wash procedure
      • time, concentration, volume, temperature
      • description of the hybridisation instruments
      • laboratory protocol (free text)
         
    7. scanning
       
      • reference to the respective scanning protocol, if it already exists in the database
      • scanning hardware
      • scanning software
      • parsed header of the TIFF file, including laser power, spatial resolution, pixel space
      • laboratory protocol (free text)
         
    8. image analysis and quantitation
       
      • reference to the image analysis software or algorithm (including author, version)
      • relevant parameters
      • any normalisation that has been applied before the final output (e.g., was there a scalar adjustment of the measurements?)
      • also, we would like to encourage the image analysis software developers to try to design methods for standard ways of summarising spot quality.
         
    9. experiment as a whole (i.e., set of related hybridisations)

      #COMMENT: We need input from the Ontology group for i)
       
      • author (submitter), date
      • platforms used;
      • comparative or absolute measurements,
      • single or multiple hybridisations,
      • single or multiple arrays.
        For multiple hybridisations we have subtypes:
      • time course,
      • other ordered,
      • other unordered
      • relationship between samples (free text)
        We can also classify the experiments by the type of the question that has been asked:
      • the aim of the experiment (free text) plus one of the following:
      • effect of gene knock-out
      • effect of gene knock-in (transgenics)
      • shock
      • dose response
      • normal vs. diseased comparison
      • treated vs. untreated comparison
      • other
      • brief description
      • quality related indicators
      • publication (if exists)
      • number of replicates
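
As a rough illustration of how the experiment-level annotations might be captured, the sketch below uses enumerations for the controlled vocabulary fields and plain strings for the free text fields. The class names, field names, and the `problems` check are hypothetical and chosen only for this example. The proposal itself does not prescribe any data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MeasurementType(Enum):
    COMPARATIVE = "comparative"
    ABSOLUTE = "absolute"


class HybridisationSeries(Enum):
    """Subtypes listed for experiments with multiple hybridisations."""
    TIME_COURSE = "time course"
    OTHER_ORDERED = "other ordered"
    OTHER_UNORDERED = "other unordered"


class ExperimentAim(Enum):
    KNOCK_OUT = "effect of gene knock-out"
    KNOCK_IN = "effect of gene knock-in (transgenics)"
    SHOCK = "shock"
    DOSE_RESPONSE = "dose response"
    NORMAL_VS_DISEASED = "normal vs. diseased comparison"
    TREATED_VS_UNTREATED = "treated vs. untreated comparison"
    OTHER = "other"


@dataclass
class ExperimentAnnotation:
    author: str                                    # submitter
    platforms: list[str]
    measurement: MeasurementType
    aim: ExperimentAim
    aim_description: str                           # free text
    number_of_replicates: int
    series: Optional[HybridisationSeries] = None   # only for multiple hybridisations
    sample_relationship: str = ""                  # free text
    publication: Optional[str] = None              # if it exists


def problems(e: ExperimentAnnotation) -> list[str]:
    """List obviously missing entries (a hypothetical minimal check)."""
    issues = []
    if not e.author.strip():
        issues.append("author is empty")
    if not e.platforms:
        issues.append("no platform listed")
    if not e.aim_description.strip():
        issues.append("aim description is empty")
    return issues


example = ExperimentAnnotation(
    author="A. Submitter",
    platforms=["spotted cDNA array"],
    measurement=MeasurementType.COMPARATIVE,
    aim=ExperimentAim.TREATED_VS_UNTREATED,
    aim_description="Response to heat shock over time",
    number_of_replicates=2,
    series=HybridisationSeries.TIME_COURSE,
)
print(problems(example))
```

Only the enumerated fields can be checked mechanically. The free text fields can merely be tested for presence, so their usefulness depends on the cooperation of the data depositors.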

 

Last modified: 26 Sep, 2005.
This site is hosted by the EMBL -
European Bioinformatics Institute
The maintenance of these pages is partially supported by the European Commission as part of the
TEMBLOR project.