Workshop on the Functional Analysis of Genomic Sequences

Chevy Chase Holiday Inn
Chevy Chase, Md.

December 2-3, 1997

Workshop Summary
Summary of Recommendations
Agenda
Participants

Summary of Workshop on the Functional Analysis of Genomic Sequences

As part of the National Human Genome Research Institute's (NHGRI) five-year planning process, a workshop on the "Functional Analysis of Genomic Sequences" was held on December 2-3, 1997. The purposes of the workshop were (1) to define the biological questions that can be addressed using genomic approaches to gain insight into the function of genomic sequences, and (2) to explore what new technology and resource development will be required to facilitate genomic approaches to these questions.

The two-day meeting began with six talks to set the stage for discussions. Following the talks, the attendees were divided into three breakout groups, one in each of the general areas of DNA analysis, RNA analysis, and protein analysis, to discuss potential ideas for future genomic research. The following day, a preliminary set of recommendations from each breakout group was reported by the moderator and discussed by the entire group of participants. In the final afternoon, these recommendations were refined into a more concise, non-redundant set.

Overall, the workshop covered a very broad range of topics. Recommendations were made in the following four general areas:

Generation of Resources/"Production" Activities.
There was great enthusiasm for the production of full-insert human cDNA sequences, and possibly mouse cDNA sequences. With respect to sequencing the genomes of other model organisms, sequencing the mouse genome in the near future was unanimously recommended. With the cost of DNA sequencing still relatively high, development of a strict set of criteria for determining what other genomes should be sequenced was recommended. Given the value of having many genomic sequences, further support of technology development to reduce the cost of genomic sequencing was strongly endorsed.

There was a strong recommendation for the comprehensive analysis of RNA expression patterns in the human and in model organisms. It was thought that, although further technology development in this area is needed, it is appropriate to initiate the support of these types of studies now.

Technology Development.
Numerous opportunities were identified for technology development. These include technologies for determining the function of non-coding sequences; improving cDNA resources, especially the generation of full-length cDNAs; determining the function of proteins from structure; and analysis of protein expression and protein interactions.

Bioinformatics/Databases and Training/Access.
There were several recommendations in the areas of bioinformatics and training. It was recommended that NHGRI support the development of new tools for data representation, visualization, and analysis to build the capability to handle complex sets of data that will be forthcoming from genomic analyses. There was also a strong endorsement for training in the area of computational biology, but there was not as much consensus for support of other areas of interdisciplinary training.

Summary of Recommendations

Future NHGRI-Supported Research Efforts

Production/Resources
1. Reference Human DNA Sequence
  1. The completion of the sequence of the human genome was acknowledged to be of the highest priority for NHGRI.
2. Human SNPs
  1. There was strong endorsement for NHGRI to pursue, in conjunction with other National Institutes of Health (NIH) Institutes, the generation of human SNPs as well as the development of tools to exploit them.
3. Full-Insert cDNA Sequences
  1. There was consensus that these should be generated for the human; less consensus regarding the mouse (in part because of uncertainty as to what HHMI will support). An advantage of the mouse is that it will be possible to generate cDNA libraries with a different representation of genes than the human. Similar efforts for other model organisms, e.g. Drosophila, should be considered.
  2. There was general consensus that one-pass sequencing on each strand would provide adequate accuracy for human cDNAs, in part because it is anticipated that the genomic sequence will be done at a very high accuracy; accuracy for other organisms needs to be considered on a case-by-case basis. A confidence level should be assigned to each base.
4. Other Model Organisms
  1. There is a need to establish criteria for determining whether or not to sequence any additional model organisms. A potential list of criteria was generated during the RNA session (see below), including "phylogenetic power," and the capability to transfect the organism. Consideration should be given to alternative approaches for some organisms (e.g. low pass or sequence-sampling strategies for genomic sequencing, or EST sequencing). In some instances, only the generation of genomic resources, such as genetic or physical maps, may be appropriate.
5. Comprehensive "database" of RNA expression patterns in human and model systems
  1. It would be valuable to create a database of RNA expression patterns that contains information about which sets of transcripts are expressed, and at what level, in each cell at any given stage of development, differentiation, or time in the cell cycle.
  2. There was general consensus that the technology for RNA expression analysis is sufficiently developed to initiate these types of projects now. However, there is a critical need for the development of internal standards to allow for the cross-comparison of studies. Additional technology development, especially in the area of informatics, is also needed (see RNA section below).
  3. This is a long-term goal (beyond the next five years), whose comprehensive achievement may be more appropriate for NIH as a whole than for NHGRI alone.
Technology Development
Numerous opportunities for technology development were identified and recommended for support in the following areas:
RNA Expression
1. Synthesis of full-length cDNA clones
  1. NHGRI's role in supporting the generation and sequencing of these cDNAs, once the technology has been robustly developed, needs further discussion.
2. Discovery of rare/underrepresented transcripts.
3. Large-scale methods for RNA in situ analyses, including the development and use of multiple probes.
4. High-throughput cis-element analysis to study transcriptional regulation.
5. Defining regulatory hierarchies, such as the identification of all target genes regulated by a given factor or small combination of factors.
DNA Analysis
1. High-throughput analysis of non-coding sequences that function at the chromosomal level, such as centromeres and telomeres.
Protein Structure and Expression
1. Identification of the complete set of protein folds (thought to be finite in number, i.e., one to several thousand).
2. Production of a complete set of expressed proteins.
  1. Efficient methodology for heterologous expression of large quantities of proteins.
  2. Development of native protein microarrays.
3. Multiple, benign and readily recognizable protein tags for localization and other studies.
4. Large-scale protein expression analysis.
  1. Improvement of 2D gels and other front end separation technologies for mass spectrometry.
  2. Improvement of mass spectrometry.
  3. Development of novel technologies, e.g. arrays of specific protein ligands.
5. Protein interactions
  1. Comprehensive analysis of protein-protein interactions, including protein complexes; further discussion of technology development for comprehensive analyses of protein-DNA and protein-ligand interactions as well as other physiological interactors is needed.
Bioinformatics/Databases
1. New tools for data representation, visualization and analysis (including interactive/hierarchical data), e.g., computable pathway algorithms and electronic representation of metabolic pathways, are needed.
Training/Access
1. Computational biology training is critical.
2. There was less consensus regarding interdisciplinary training in other areas. One approach, thought by some to be more effective, is to build multidisciplinary research teams composed of individuals with specialized expertise and to nurture interdisciplinary collaborations.
3. Interdisciplinary training should be done at the post-Ph.D. level.

Comments: Although the participants endorsed sequencing of the mouse genome, there was no explicit discussion of this in the summary session. It was noted that another workshop, specifically focused on the mouse, will be held in March 1998.

There were several other points that were strongly endorsed by one or more breakout groups that were not discussed at length in the summary session and might be considered for further discussion by the Council subcommittee. These include:

Facilitating/subsidizing affordable chip resources, access to genome technologies.
Large-scale approaches to probe the function of gene products, e.g., mutagenesis/tagged insertions.

Summary of Recommendations by DNA Group
Maynard Olson

Reference "Databases"
1. The generation of the first complete human genomic sequence was endorsed to be of the highest priority for NHGRI.
2. The generation of a reference database for human polymorphisms was discussed at length. There was a strong consensus that NIH should be very active in this area, especially as it relates to the generation of a large number of polymorphic markers (e.g. 100,000 SNPs), as well as additional theory development. A second, longer-term component (for which there was less consensus) was the comprehensive analysis of human polymorphisms. This type of analysis poses significant scientific as well as ELSI challenges and would require significant technology development.
3. The sequencing of the mouse genome was not discussed at length, but should be considered for funding.
First-Pass Genome Resources
1. Of overwhelming interest is the development of a strategy to obtain a relatively complete set of human cDNA sequences (and a similar resource for additional organisms if possible). This would not necessarily be a comprehensive set (including e.g. all splice variants and very rare transcripts) and may not need to be of highest accuracy nor from full-length clones, depending on the level of investment.
2. There was somewhat less consensus on the development of additional first-pass resources. These include EST sets for a number of organisms, beyond the standard models. A number of these sets would allow for better phylogenetic definition for higher organisms. Additional resources suggested were high-quality germline clone libraries and improved genetic maps for a variety of organisms.
3. More research to study the function of germline sequences was endorsed by some members of the breakout group, and this topic engendered significant discussion during the morning recap session, perhaps because of the strong opinions of a minority of the participants. Areas to pursue include the analysis of cis-regulatory regions controlling transcription and the functional analysis of other regulatory elements, such as those involved in chromosome structure, i.e., studying the biology of the "genome" in addition to the genes. While there was considerable concern that this could be considered "the rest of biology," some thought that genomic approaches to study these biological questions could be developed. One approach to support is mutagenesis, especially in the mouse. Further discussion is needed with respect to the relative merits of targeted (insertional/tagged) vs. chemical mutagenesis, and this topic will be addressed in the March 1998 meeting on mouse genomic resources.
Technology Development
1. There should be a major effort to push for a reduction in the cost of DNA sequencing. The genomes (or biologically interesting portions of genomes) of many model organisms could then be readily sequenced, which would alleviate the pressure to set strict priorities for choosing which additional model organisms (if any) to sequence. It was recognized that this is a very difficult problem requiring a significant investment. NHGRI should seek less traditional partners than have historically been considered (e.g., DARPA).
2. Technology development for the generation of many of the first-pass resources discussed above is clearly needed.
Bioinformatics
1. There was a significant level of enthusiasm for continued development in this area. It was recognized that there is a need for ongoing training at all levels and an emphasis on keeping a viable academic culture. A vigorous small grants program is critically needed to produce innovation and to maintain faculty in academia.

Additional Points Raised During the Discussion:

NHGRI should take the lead in encouraging and facilitating the transfer of genomic resources to the general research community, not only from the large genome centers, but from individual labs as well.
Promote the use of chips and other related technologies by increasing access and lowering the costs to researchers.
It was stressed that there is significant value in sequencing model organisms beyond what will be learned about that given organism. If they are chosen in a phylogenetically informed manner, much can be learned about the human and other vertebrate organisms.
The study of polymorphisms such as SNPs will also facilitate the functional analysis of the genome; some changes will be functionally significant.

Summary of Recommendations by RNA Group
Barbara Wold

Human and Mouse EST Resources
There was widespread enthusiasm for the current EST resources and further investment was thought to be highly worthwhile.
1. Resource Generation
  1. Validate the source of clones used to generate the existing human and mouse EST sets and complete the sequence of these clones. Validation would take approximately 6 months at an estimated cost of $1.5M, creating a higher-quality resource for full-insert sequencing than currently exists.
  2. Construct an expression library for all existing full-length protein coding sequences.
  3. Generate more full-length cDNAs.
2. Technology Development
  1. Develop (and apply) new technologies for cloning underrepresented RNAs (low level expression; specific time and places).
  2. Improve expression vectors to allow for regulated expression in a variety of cell types and organisms.
3. Other Considerations
  1. Encourage trans-NIH funding for resource generation.
  2. Management and oversight of projects by NHGRI.
Complete Molecular Phenotyping for Model Organisms
Determine what set of transcripts or proteins are expressed in each cell at a given time and at what level. This is a long-term goal (beyond the next 5 years) requiring significant technology development. Execution may go beyond NHGRI.
1. Technology Development
  1. Develop and implement internal standards for each model organism for inclusion in each data set and for use in all methodological approaches. This will facilitate cross-comparisons.
  2. Increase sensitivity of input with goal of single cell inputs.
  3. Informatics to permit access; clear identifiers.
  4. Informatics to link to different kinds of data.
  5. Informatics/methods to assign a unique identifier, amount relative to standard and some kind of P value for this amount (analogous to quality standard for base calling) to each measurement.
  6. Methods for cell enrichment.
  7. Alternatives to array technology; alternate array technologies.
2. Resource Generation
  1. Build standard data sets for expression studies for model organisms (continually update until complete array of genes).
  2. Provide "chips" (either complete set or subsets of genes) to user community at reasonable cost.
  3. Provide technology access to R01 investigators.
  4. Improve technology for export (cheaper, lower capacity if necessary).
3. Other Considerations
  1. Start with RNA first since technology is more advanced, then move to protein.
  2. Challenge lies in determining site of resource generation: At center(s) vs. dissemination of technology.
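
The informatics items above (a unique identifier, an amount relative to an internal standard, and a P value for each measurement) can be pictured with a minimal sketch. It is illustrative only and was not discussed at the workshop: the record layout, the field names, the function `normalize_to_standard`, the identifier `STD_1`, and the sample numbers are all hypothetical, and the P values are assumed to be supplied by some upstream analysis rather than computed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpressionMeasurement:
    """One measurement record (hypothetical layout)."""
    transcript_id: str      # unique identifier for the transcript
    relative_amount: float  # raw signal divided by the internal-standard signal
    p_value: float          # confidence value supplied by an upstream analysis


def normalize_to_standard(raw_signals, standard_id, p_values):
    """Express every raw signal relative to the internal standard."""
    standard_signal = raw_signals[standard_id]
    if standard_signal <= 0:
        raise ValueError("internal standard signal must be positive")
    return [
        ExpressionMeasurement(tid, signal / standard_signal, p_values[tid])
        for tid, signal in sorted(raw_signals.items())
        if tid != standard_id  # the standard itself is not reported
    ]


# Made-up numbers, for illustration only.
raw = {"STD_1": 200.0, "geneA": 50.0, "geneB": 400.0}
pvals = {"geneA": 0.01, "geneB": 0.20}
records = normalize_to_standard(raw, "STD_1", pvals)
for record in records:
    print(record)
```

Because every amount is expressed relative to the same standard, records from different studies that include that standard could in principle be compared directly, which is the motivation given for internal standards.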
Characterizing "Wildtype" Mouse
Mouse phenotypes are poorly understood. Much underlying information is likely to have already been generated and there is a need to establish a means of capturing it in a central database.
1. Resource Generation
  1. Database of high-quality phenotypic measurements (physiology, endocrinology, behavior, anatomy, etc.) from standard strains used in knock-out experiments.
2. Other Considerations
  1. Combined informatics and new measurements.
  2. Mandate R01 grantees doing knockout studies to submit wildtype data to "control" database.
  3. Trans-NIH/other support.
  4. Combined RFA/R01 contributions.
Regulatory Architecture for Genome Expression
NHGRI should support technology development in this area; application of technology to specific areas may be more appropriately supported elsewhere.
1. Develop (and apply?) technology to identify all target genes (functional cis-elements) regulated by a given factor or small combination of factors.
2. Develop technologies for rapid, high-throughput cis-element discovery and characterization (couple biology and informatics).
3. Develop methods for visual representation of complex, multidimensional, and often hierarchical data. There is a need for these methods to analyze many other types of large, complex data sets as well.
Additional Model Organisms
1. Sequence the mouse genome.
2. Criteria for evaluation of candidates (to be used when sequencing costs come down).
  1. Transfection capability (essential)
  2. Phylogenetic power (essential)
  3. Mutagenesis/screening/strain maintenance
  4. Targeted mutagenesis (desirable)
  5. Availability of material, including embryos
  6. Genome size (preferably small)
3. Possible candidate: Amphioxus or a small-genome tunicate, which diverged prior to the tetraploidization of vertebrates; this would avoid gene redundancy.
4. Consider starting with EST projects for candidates; reduce pressure on genome size.
Protein Structure/Manipulations
1. High-throughput expression libraries for model organisms in which all or most of the proteins are known (e.g. a baculovirus resource), followed by a massively parallel protein production and crystallization effort. Provide those that work to the crystallography community.
2. Technology for improved crystallization methods designed to extend the range of proteins that can be handled. Support for the application of these methods should come from sources interested in the specific protein(s).
3. Develop methods to render glycosylated proteins amenable for analysis by mass spectrometry.
Additional New Technologies and Resources
These are clearly longer-term goals.
1. Generate libraries of chemical ligands or antibodies for arraying, detecting, affinity purification of each protein for the model organisms and the human.
2. Develop technology (where still needed) for genome-wide, systematic (tagged) disruption of all genes in model organisms.
3. Generate resources of disrupted tagged strains as technology and finances permit. [Strain storage issues for some organisms].
4. Methods for higher-order multiplexing of gene expression tags and in situ hybridization probes or protein detection probes (on the order of 10s-100s).

Additional Points Raised During the Discussion:

While technology development is very important, the money required is beyond our budget. We need to consider partnerships with industry relatively early on in the development; exploit SBIR/STTR program; support proof of principle and then transfer it over to industry. There was some discussion about the implications of this approach, including access.
Full-length cDNAs should be generated for all model organisms, or as many as possible.

Summary of Recommendations by Protein Group
Tony Pawson

General Recommendations (not related to proteins)
1. Sequence mouse.
2. Improve the quality of the EST database.
3. Sequence full-length cDNAs (for predicting ORFs) from multiple organisms; complete accuracy not necessary.
Protein Structure/Function
Work toward predicting function from protein sequence.
1. Understand totality of protein folds.
  1. Predict all possible folds.
  2. Analysis of novel folds by structural determination.
2. Improve homology modeling.
3. Improve alignments to assign protein families; take advantage of structural information.
4. Improve structural analysis of membrane proteins.
Proteomics
Better technology is needed for quantitative global analysis of protein expression and post-translational modification.
1. 2D gel technology
  1. Improve technology for quantifying individual protein levels, identification of post-translational modifications. Needs standardization/automation/increased sensitivity. Useful currently for small genomes, further technology development needed for display of proteins from more complex systems.
  2. Apply current technology to identify every protein in e.g., yeast/bacteria.
2. Mass Spectrometry
  1. Technology development needed for front end (automation, sample loading/interfacing with separation technology) and back end (software development, automated data collection and reference to databases)
3. Protein Microarrays
  1. Useful to identify protein ligands/physiological partners.
  2. Considered to be very important to develop, but highly challenging.
  3. Best done on domains.
  4. Should be group production effort using common technology; need to have specialists working with specific sets of proteins.
  5. Create analogous array of unique ligands to probe for protein expression.
  6. Develop novel methods for more rapid, automated technology for protein identification.
Protein Interactions/Function
Generate set of reagents to allow you to learn about protein interactions and pathways.
1. Generate the entire set of domains and identify the peptide motifs (or other ligands) that they interact with, e.g., using peptide libraries or phage display. Use these to establish a network of protein interactions.
2. Generate similar set of affinity probes, e.g., small molecules or antibodies.
3. Develop global approaches to activate or inactivate protein.
4. Develop better prediction methods for protein localization.
5. Develop new technology to identify low affinity protein-protein & protein-ligand interactions.
Bioinformatics
1. Develop proteome database of higher eukaryotes serving as central organization of all that is known about proteins, e.g., motifs, structure, interactions, function.
Training
1. Cross-discipline training important; suggested at the post-doctoral level rather than graduate student level.

Additional Points Raised During the Discussion:

Strong endorsement of the approach to identify the complete set of protein folds; the approach can be based on experimentally verified domains rather than the more brute-force approach recommended by the RNA group, which is to determine the structure of every protein for which a crystal can be made.
Suggested additional organism to sequence: one from the "bottom of the eukaryotic radiation." Many functions have been lost in yeast; study another unicellular organism.

Agenda

Tuesday, December 2nd

8:30 a.m. Welcome/Introductions

Dr. Feingold

8:40 a.m. Purpose of Workshop

Dr. Collins

9:00 a.m. Scientific Presentations

Dr. Feingold

Dr. Ronald Davis
Stanford University
"Whole Genome Analysis and the New Biology"

Dr. Judith Campbell
California Institute of Technology
"Whole Genome Analysis of Multienzyme Machines: A Fusion of Biochemistry and Genetics."

Dr. Steven Henikoff
Fred Hutchinson Cancer Research Center
"Using Protein Sequence Homology to Infer Function"

10:30 a.m. Coffee Break

10:45 a.m. Scientific Presentations (continued)

Dr. Gregory Petsko
Brandeis University
"A Structural Biologist's Perspective on Functional Genomics"

Dr. Thomas Pollard
The Salk Institute
"Molecular Mechanisms on a Genomic Scale?"

Dr. Eric Davidson
California Institute of Technology
"Hard-Wired Genomic Cis-Regulatory Information in Development and Evolution"

12:15 p.m. Orientation for Breakout Session

Dr. Chakravarti

12:30 p.m. Lunch

1:30 p.m. Breakout Session

6:00 p.m. Approximate Conclusion of Breakout Session

Wednesday, December 3rd

8:30 a.m. Reports from Breakout Groups and General Discussion

Dr. Hartwell
Dr. Olson
Dr. Wold
Dr. Pawson

12:00 p.m. Lunch

1:00 p.m. Development of Recommendations

Dr. Chakravarti

3:30 p.m. Conclusion of Workshop

Participants

Greg Barsh, Ph.D.
Stanford University School of Medicine
Pediatrics and Genetics
Stanford, CA 94305-5428

Mark Boguski, M.D., Ph.D.
National Institutes of Health
National Library of Medicine
8600 Rockville Pike
Bldg. 38A, Rm. 8N805
Bethesda, MD 20894

Anne Bowcock, Ph.D.
University of Texas Southwestern
Medical Center
5323 Harry Hines Blvd.
Dallas, TX 75235-8591

Allan Bradley, Ph.D.
Howard Hughes Medical Institute
Baylor College of Medicine
One Baylor Plaza
Houston, TX 77030

Patrick Brown, M.D., Ph.D.
Howard Hughes Medical Institute
& Dept of Biochemistry
Stanford University School of Medicine
Stanford, CA 94305-5428

Judith Campbell, Ph.D.
Braun Laboratories 147-75
California Institute of Technology
Pasadena, CA 91125 USA

Lewis Cantley, Ph.D.
Harvard Medical School
200 Longwood Avenue
Warren Alpert Building
Boston, MA 02115

Eric Davidson, Ph.D.
California Institute of Technology
Division of Biology, 156-29
Pasadena, CA 91125

Ronald Davis, Ph.D.
Stanford University School of Medicine
Department of Biochemistry
Stanford, CA 94305-5307

David Fenyo, Ph.D.
Rockefeller University
Box 170, 1230 York Avenue
New York, NY 10021

James Garrels, Ph.D.
PROTEOME, Inc
200 Cummings Center, Suite 425C
Beverly, MA 01915

Roger Hendrix, Ph.D.
University of Pittsburgh
Department of Biological Sciences
A234 Langley Hall
Pittsburgh, PA 15260

Steven Henikoff, Ph.D.
Howard Hughes Medical Institute
Fred Hutchinson Cancer
Research Center
1100 Fairview Ave. N, A1-162
P.O. Box 19024
Seattle, WA 98109-1024

Nancy Hopkins, Ph.D.
Massachusetts Institute of Technology
Center for Cancer Research
Department of Biology -E17-341
Cambridge, MA 02139

Gary Karpen, Ph.D.
The Salk Institute
10010 North Torrey Pines Road
La Jolla, CA 92037

Raju Kucherlapati, Ph.D.
Albert Einstein College of Medicine
1300 Morris Ave
Bronx, NY 10461

J. Richard McIntosh, Ph.D.
University of Colorado, Boulder
Campus Box 347
Biology Department - MCD
Boulder, CO 80309

William Pavan, Ph.D.
National Institutes of Health
National Human Genome Research Institute
49 Convent Drive
Bethesda, MD 20892

Gregory Petsko, Ph.D.
Brandeis University
Rosenstiel Center
415 South Street
Waltham, MA 02254-9110

John Postlethwait, Ph.D.
University of Oregon
Institute of Neuroscience
Eugene, OR 97403

Rudolf Raff, Ph.D.
Indiana University
Department of Biology
Bloomington, IN 47405

Lynne Regan, Ph.D.
Yale University
Dept. of Molecular Biophysics
and Biochemistry
266 Whitney Avenue
New Haven, CT 06520

Martin Ringwald, Ph.D.
Jackson Laboratory
600 Main Street
Bar Harbor, ME 04609

Richard Roman, Ph.D.
Medical College of Wisconsin
Dept. of Physiology
8701 Watertown Plank Road
Milwaukee, WI 53226

George Rose, Ph.D.
Johns Hopkins University
School of Medicine
725 N. Wolfe Street
Baltimore, MD 21205

Allan Spradling, Ph.D.
Carnegie Institution of Washington
Dept. of Embryology
115 West University Parkway
Baltimore, MD 21210

D. Lansing Taylor, Ph.D.
BioDx, Inc.
635 William Pitt Way
Pittsburgh, PA 15238

Jeffrey M. Trent, Ph.D.
National Institutes of Health
National Human Genome Research Institute
9000 Rockville Pike,
49 Convent Drive
Bethesda, MD 20892

Michael Waterman, Ph.D.
University of Southern California
Denney Research Bldg., DRB 155
Los Angeles, CA 90089-1113

Huntington Willard, Ph.D.
Case Western Reserve School of Medicine
Dept. of Genetics
2109 Adelbert Rd., BRB731
Cleveland, OH 44106-4955

Alan Wolffe, Ph.D.
National Institutes of Health
National Institute of Child Health
and Human Development
Bldg. 18T, Rm. 106
Bethesda, MD 20892-5431

NHGRI Advisory Council Scientific Planning Subcommittee

Aravinda Chakravarti, Ph.D.
Case Western Reserve University
10900 Euclid Avenue
Room BRB 721
Cleveland, OH 44106

Lee Hartwell, Ph.D.
Fred Hutchinson Cancer
Research Center
1100 Fairview Avenue North
Mailstop LY 301
Seattle, WA 98109

Charles Langley, Ph.D.
University of California, Davis
Center for Population Biology and
Section of Evolution and Ecology
Davis, CA 95616

Maynard Olson, Ph.D.
University of Washington
Mail Stop GJ-10
Seattle, WA 98195

Anthony Pawson, Ph.D.
Samuel Lunenfeld Research
Institute - Mount Sinai Hospital
600 University Avenue - Rm. 989
Toronto, Ontario
CANADA M5G 1X5

Thomas Pollard, Ph.D.
The Salk Institute
10010 North Torrey Pines Road
La Jolla, CA 92037

Alan Williamson, Ph.D.
Merck Research Laboratories
126 East Lincoln Avenue
MSC-RY80K
Rahway, NJ 07065

Barbara Wold, Ph.D.
California Institute of Technology
1201 E. California Boulevard
Pasadena, CA 91125

Last updated: October 23, 2012
