Recommendations for Microarray Data Standards, Annotations, Ontologies and Databases
April 2000 - for archive purposes
[ two files also available as RTF: recomm.rtf and details.rtf ]
[ Recommendations from November 1999 meeting and follow-up discussions in the several working groups ]
Recommendations and general conclusions from the meeting discussions
The meeting discussed draft recommendations to the microarray community proposed by the EBI and established a general consensus detailed below. These recommendations should not be regarded as an official view of the meeting, but as a starting point for wider discussions in the microarray community.
Establishing a well-organised public repository for gene expression data will provide the bioinformatics community with a powerful tool. Establishing such a repository would be facilitated by:
- accepting a standard for the minimum information that laboratories should be encouraged to provide about microarray based experiments, to ensure reproducibility of the results;
- defining the data communication standards for such experiments;
- developing ontologies for sample description;
- developing standards for normalisation, quality control, and cross-platform data comparison for microarray based experiments.
The meeting discussed these and related issues and accepted a list of recommendations and conclusions detailed below. The meeting established working groups to work out these recommendations in detail. The participants agreed to meet again in approximately six months to discuss and possibly accept the recommendations more formally.
The minimum information about a published microarray based gene expression experiment:
The minimum information about a published microarray based gene expression experiment should include:
1. expression level measurement results, in particular:
a) the TIFF image file from scanning the hybridised microarray;
b) the image analysis output (of the particular image analysis software) for each spot, for each channel;
c) a derived value summarising each spot in the authors' interpretation (e.g., a background subtracted intensity typically used for Stanford or Incyte technologies);
2. the following annotations:
a) array (e.g., platform type, substrate, number of spots, provider),
b) each element (spot) on the array (e.g., sequence or clone and relevant accession numbers),
c) sample source and treatment (e.g., organism, development stage, tissue, drug treatment),
d) controls in the sample and on the array,
e) hybridisation extract preparation (e.g., cell rupture method, nucleic acid extraction and labelling protocol),
f) hybridisation procedure (e.g., time, concentration, volumes, washes),
g) scanning procedure (e.g., hardware, output TIFF file header),
h) image analysis and quantification (e.g., software, version, parameters),
i) description of the experiment as a whole (e.g., set of related samples and hybridisations submitted together and their relationships [time series, comparative hybridisations], reference if published).
- also, we would like to encourage the image analysis software developers to try to design methods for standard ways of summarising spot quality.
The meeting accepted items 1a)-c) and 2a)-i) by consensus. There were two general opinions about the detailed specifications of each of the subitems. A clear majority considered that the level of detail given in the "Details of the minimum information" document is close to the minimum that has to be provided about any published experiment. Nevertheless, a considerable number of participants considered the proposed details excessive. It was agreed that the details will be specified by working groups and by e-mail discussion, and proposed for discussion at a follow-up meeting.
It was agreed by consensus that, once the definition of the minimal information about a public experiment is accepted by the community and public repositories supporting this specification are established, journals should be encouraged to require data submission to a public repository, where the information can remain confidential until publication.
Data storage and communication standards
A standard XML-based flat-file format for microarray data description and exchange, compatible with the minimum information definition discussed above, should be developed and accepted by the community. This will formalise the definition of the minimum information, as well as open a way to populate public repositories directly from laboratory databases and LIMS systems.
It was proposed that:
- the flat-file format should support simultaneous submission of data from multiple experiments (i.e., unrelated hybridisations), to facilitate the uploading of data from laboratory databases into public repositories;
- the working group for data communication standards consider ways that might allow the standards developed for data from microarray expression experiments to be extended to cover data from other kinds of microarray experiments;
- ideally, the format should support back-referencing to items submitted to a public repository earlier.
A working group for developing the XML standard was established at the meeting. The standard will be reviewed and accepted in a follow-up meeting.
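As a rough illustration of the two format properties proposed above (several unrelated experiments in one file, and back-references to items submitted earlier), the following Python sketch builds such a document with the standard library. The element and attribute names (`submission`, `experiment`, `array`, `array_ref`, `accession`), the dictionary keys and the accession `ARR-0001` are assumptions made for this example only; no XML standard had been agreed at the time of the meeting.

```python
import xml.etree.ElementTree as ET


def build_submission(experiments):
    """Build one XML document holding several, possibly unrelated, experiments.

    Each experiment is a dict with the hypothetical keys "id", "organism"
    and "array". "array" is either a dict describing a new array, or a
    string naming an array that was submitted earlier.
    """
    root = ET.Element("submission")
    for exp in experiments:
        node = ET.SubElement(root, "experiment", id=exp["id"])
        ET.SubElement(node, "organism").text = exp["organism"]
        array = exp["array"]
        if isinstance(array, str):
            # Back-reference: point at an array already held by the repository.
            ET.SubElement(node, "array_ref", accession=array)
        else:
            # Full array description supplied within this submission.
            arr = ET.SubElement(node, "array", id=array["id"])
            ET.SubElement(arr, "platform").text = array["platform"]
            ET.SubElement(arr, "spots").text = str(array["spots"])
    return ET.tostring(root, encoding="unicode")


# Illustrative values only.
xml_text = build_submission([
    {"id": "exp1", "organism": "Mus musculus",
     "array": {"id": "arr1", "platform": "spotted cDNA", "spots": 9216}},
    {"id": "exp2", "organism": "Drosophila melanogaster", "array": "ARR-0001"},
])
print(xml_text)
```

The second experiment carries only a reference to its array, which is the kind of reuse a laboratory database export would rely on when the array has already been deposited.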
Ontologies for sample source and treatment description.
Ontologies should be used for sample source and treatment description (e.g., organism, development stage, tissue, cell line type, cell line, treatment type) where possible. In particular, we use collections of categories, each of which has its own controlled vocabulary, where the categories are themselves organised, e.g., as a tree.
- Universally accepted ontologies or standard vocabularies currently do not exist, except for description of species (Taxonomy database). Ontologies for developmental stages and tissues are relatively well described for some organisms, mouse and fruit fly in particular.
- A working group was established to consider where introduction of an ontology is possible, and ways of achieving this. It is not feasible for the working group to develop the final ontology for any new category of sample description; rather, it will:
- identify categories which should be included in sample source and treatment description;
- identify and review relevant ontologies developed by independent groups;
- identify the subset of required categories that can be covered by incorporating and adapting available ontologies, and identify provisional means of handling remaining categories;
- document issues pertinent to use of other ontologies, and issues and possible approaches for fuller treatment of provisionally handled categories. The identification of high level categories and nodes where controlled vocabularies are possible will be considered for these latter categories.
Recommendations from this working group will be reflected in the minimum information definition and in the data exchange standard.
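The notion used in this section, categories organised as a tree with each category carrying its own controlled vocabulary, can be sketched in a few lines of Python. The class name `Category`, its methods, and the example categories and terms are all hypothetical and chosen only for illustration; they are not an agreed ontology.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in the category tree, with its own controlled vocabulary."""
    name: str
    vocabulary: set = field(default_factory=set)
    children: dict = field(default_factory=dict)

    def add_child(self, child):
        """Attach a sub-category and return it, so calls can be chained."""
        self.children[child.name] = child
        return child

    def find(self, path):
        """Follow a list of category names down from this node."""
        node = self
        for name in path:
            node = node.children[name]
        return node

    def accepts(self, path, term):
        """True if `term` is in the vocabulary of the category at `path`."""
        return term in self.find(path).vocabulary


# Illustrative categories and terms only.
root = Category("sample")
source = root.add_child(Category("source"))
source.add_child(Category("organism", {"Mus musculus", "Drosophila melanogaster"}))
source.add_child(Category("tissue", {"liver", "brain"}))
root.add_child(Category("treatment")).add_child(
    Category("treatment type", {"drug treatment", "heat shock"}))

print(root.accepts(["source", "tissue"], "liver"))
print(root.accepts(["source", "organism"], "liver"))
```

Keeping the vocabulary on each node means a category without an established vocabulary can still sit in the tree with an empty set, which matches the provisional handling of uncovered categories described above.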
Data normalisation and cross-platform comparison
The microarray community should determine common controls for their arrays and experiments. In particular there may be two types of controls:
- normalisation controls
- quality controls
Publicly available experiments that compare different platforms, with the aim of designing cross-platform normalisation procedures, should be encouraged.
The meeting established a working group that will develop detailed recommendations for normalisation, quality control and cross-platform comparison before the next meeting.
Database population and data submission issues.
An XML-based flat-file format will be a relatively straightforward means of submission by e-mail or FTP, enabling direct submissions from laboratory databases and LIMS systems.
Client side data submission tools (either Web-based or stand-alone) would complement such flat-file based submissions. Ease of use and the ability to back-reference objects from the database will be essential.
Information about experiments and arrays may be submitted separately, provided that each array description is included either in the same submission as the experiments using that array or in an earlier one.
Use of standard protocols for hybridisation extract preparation, hybridisation, scanning, and image analysis should be considered. Scanning hardware and image analysis software producers should be encouraged to adopt relevant standards.
Ideally, the database should support the reuse of objects submitted in earlier experiments (e.g., extraction and hybridisation protocols), which would facilitate standardisation of these categories. The XML data exchange format should support such "back-references".
The minimal information specified in the first section of this document should be provided by the submitter and supported by a public repository.
Data curation, quality, and ownership in a public repository
Database administrators, submitters, and users should take steps to assure the quality of data on the database.
Administrators of an open public database cannot police data quality, but can and should:
- verify that data meet the minimal information requirements given above and pass obvious data consistency checks. Where possible this should be done through automated checking at the time of data submission;
- flag database entries based on appropriately defined and accepted experimental quality assessment indicators. Possible bases for such indicators might include replication of experiments, use of recommended controls, publication of experiment in a peer-reviewed journal;
- reserve the right to remove from the database entries that have turned out to be obviously wrong. Working out formal criteria for reaching such conclusions may be difficult, however.
Submitters of data to the database should be willing and able to update data that have proved to be in error on later analysis. For instance, if, after an experiment using an array has been loaded on the database, DNA sequencing proves a spot on the array to be unreliable, the submitter should be able to update this on the database.
Users of the database should be able to submit annotations. These should be identifiable as third-party annotations.
To ensure quality control in the early stages of the database development, administrators may at first accept data from selected collaborators. When the database reaches development stability, data submissions will be made open and public. Database submissions should be open to the whole community before journals can make them an obligatory prerequisite.
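The automated checking at submission time mentioned in this section could look roughly like the Python sketch below. It assumes a submission arrives as a nested dictionary; the key names (`results`, `annotations`, `tiff_image`, `number_of_spots` and so on) are hypothetical labels for the items of the minimum information list, and the single consistency check shown is only one example of what "obvious" checks might mean.

```python
# Hypothetical key names standing in for the minimum information items.
REQUIRED_RESULTS = ("tiff_image", "image_analysis_output", "derived_values")
REQUIRED_ANNOTATIONS = (
    "array", "elements", "sample", "controls", "extract_preparation",
    "hybridisation", "scanning", "image_analysis", "experiment",
)


def check_submission(submission):
    """Return a list of problems; an empty list means the checks passed."""
    results = submission.get("results", {})
    annotations = submission.get("annotations", {})
    problems = []
    for key in REQUIRED_RESULTS:
        if not results.get(key):
            problems.append(f"missing result: {key}")
    for key in REQUIRED_ANNOTATIONS:
        if not annotations.get(key):
            problems.append(f"missing annotation: {key}")
    # One obvious consistency check: one derived value per spot on the array.
    spots = (annotations.get("array") or {}).get("number_of_spots")
    values = results.get("derived_values") or []
    if spots is not None and values and len(values) != spots:
        problems.append(f"expected {spots} derived values, found {len(values)}")
    return problems


# Illustrative, deliberately incomplete submission.
example = {
    "results": {"tiff_image": "scan01.tif",
                "image_analysis_output": "scan01.txt",
                "derived_values": [120.5, 87.0, 301.2]},
    "annotations": {"array": {"platform": "spotted cDNA", "number_of_spots": 4}},
}
for problem in check_submission(example):
    print(problem)
```

Returning a list of problems rather than stopping at the first one lets a submitter correct everything in a single round.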
A list of possible interesting queries.
Constructing an extensive list of possible interesting queries and data mining problems that have to be supported by the database will facilitate the design process. The meeting established a working group of potential future users of the database to compile such a list, and to research query languages and data mining approaches for gene expression data that need to be supported by the database.