Summary of Report on the DOE/NIH Genome Informatics Meeting

Dulles Hilton Hotel

April 2-3, 1998

On April 2 and 3, DOE’s OBER and NIH’s NHGRI convened a workshop to identify informatics needs and goals that could be part of the next genome five-year plan and that would begin to craft a vision for genome informatics over the next five years and beyond. In attendance were 46 invited informatics and genomics experts, and 6 DOE, 8 NHGRI, 2 NIGMS and 1 NSF staffers. The meeting was held at the Dulles Hilton in Herndon, VA.

Conclusions of the meeting:

Priorities:

A reference genome map and sequence database. The sequence data should be assembled into continuous sequence, with links to the maps. The sequence should be annotated and the information should be structured so that all sorts of queries can be run on the database. The data should be updated and curated by sets of editors rather than by anybody who wishes to correct or annotate it.
Integrated and linked databases
Variation database – organized by individual genotype and haplotype and by population. The genetic variation database should include or link to information on individual phenotypic variation.
Functional/expression database, including pathway/regulatory databases (e.g. WIT, KEGG, EcoCyc).
Comprehensive data capture – raw data and the summary or processed data should be captured in standard formats. The data should be well-structured using controlled vocabularies.

The breakout groups had been asked to address four sets of issues. Their conclusions on these and some other issues are summarized below:

Queries: Users want to be able to ask everything conceivable about sequences, genes, markers, regions, relationships, maps, proteins, functions, interactions, regulatory pathways, variation, phenotypes, and inter-species comparisons. They also want to know how the data were derived, under what experimental conditions, and by whom; to see the raw data (ABI traces, gel lanes, etc.); and to learn what methods were used to process the raw data into database entries (e.g. sequence) and what QA/QC measures were applied. In short, everything! It should be possible to answer all queries that could be supported by the data.

The need for all the underlying data arises especially for individual phenotypic data. Given the expense of phenotyping, it is important to be able to go back and check whether a particular SNP is really there. The ABI traces are not needed for the reference sequence since questionable regions can be sequenced again.

Tools: DNA sequencing has a bottleneck at finishing; tools to speed up this process are a critical need. Other needs are production tools, research tools (for analysis, for visualization, etc.), access tools (for visualizing data objects, for extracting objects from different databases, etc.), annotation tools, data capture tools, functional genomics tools, and data mining tools. Also needed are development and hardening of tools to promote easier dissemination, finishing and exporting; QA/QC of the different tools; tools that are interoperable; map integration tools; and outreach tools. A web site that collects and annotates these tools would be very useful.

Standards: There was strong support for intelligent standards that the various constituencies of the genome project (academic, government, and industry) could join in defining and implementing. These include a variety of controlled vocabularies for the objects that would be entered into appropriate databases. Today, industry standards are very distinct from the few that exist in the HGP (e.g. Phred/phrap for sequence QA/QC). One current standards group, the Object Management Group (OMG), is composed mostly of industry representatives, but should involve academic and government representatives. Explicit object definitions and access methods are desperately needed. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility and responsiveness to change. It was recognized that a balance must be struck between having standards and allowing change and flexibility.

Annotation: Automated annotation analyses should follow clearly defined standard operating procedures that are applied consistently and documented sufficiently. Automated annotation is a good starting point for biologists seeking a more detailed understanding of particular chromosome regions. Human participation in the annotation process is still important, however, for getting the most out of genomic information.

Quality checks: There were suggestions that the databases be subject to regular checks of quality. Users are frustrated by incorrect data and the unwillingness or inability of database providers to correct these mistakes. Official editors who curate information could resolve errors and improve the data quality. The success of the quality assessment exercise for sequence centers provided a model for the usefulness of database quality assessments.

Training/Environment issues: NSF S&T centers are models for needed genome informatics centers. Three to five such centers were proposed, where there would be a critical mass to allow interactions among various disciplines and training of students.

The workshop closed with some policy recommendations:

There should be open competition for supplying most database/informatics needs.
No one database can be expected to do everything for everybody; however, users should feel that they are interacting with only one entity. Data submission should be uniform.
Existing frameworks (database schema, submission tools, etc.) should be used where possible.
There should be continued support for the model organism databases.
Raw data should be captured to the maximum extent possible before it is irretrievably lost.
There should be investments made in hardening and exporting software tools from genome centers.