Executive Summary
Workshop on Characterizing Human Genetic Variation


National Human Genome Research Institute
National Institutes of Health


Robert H. Waterston and David Altshuler, Co-chairs

Washington Dulles Airport Marriott
August 3-4, 2004

Purpose

The National Human Genome Research Institute (NHGRI) convened a workshop of scientists in the fields of genomic sequencing, sequencing technology development, population genetics, and ethical, legal and social implications (ELSI) research to discuss issues associated with resequencing the genomes of many additional people in order to further characterize human genetic variation. The workshop followed from a recommendation of an NHGRI working group advising NHGRI on the best ways to use the large-scale sequencing capacity supported by the Institute to annotate and interpret the human genome sequence. One component of the working group's recommendation was a proposal to include additional human sequencing in the Institute's program, specifically to produce 0.1-fold sequence coverage of 1000 individuals for the purpose of discovering additional variants in human DNA. In approving the working group's proposal, the National Advisory Council for Human Genome Research recommended that NHGRI convene a workshop to discuss the options, strategies, and costs for, as well as the ethical, legal and social issues raised by, a human resequencing project.

Current Status

The workshop began with a discussion of the current status of research on genetic variation in humans and other organisms. At present, the largest data set on human variation is being generated by the International HapMap Project [hapmap.ncbi.nlm.nih.gov], which is genotyping a few million single nucleotide polymorphisms (SNPs) in 270 individuals from four geographically separated sites around the world. The program also included presentations describing the sequencing of genomes from organisms that are very closely related to an organism (Drosophila melanogaster) whose genome has been sequenced to very high quality (www.dpgp.org), and the resequencing of coding regions from a number of humans. The workshop participants drew a number of conclusions from these presentations, including:

  1. The International HapMap Project has greatly increased the number of SNPs available to the research community to be used to study human variation and will produce a map of genome haplotypes in four populations with ancestry from parts of Africa, Asia, and Europe;
  2. Studies in Drosophila population genetics will develop methods of analysis that will be useful for studies in humans; and
  3. Resequencing of coding regions in many human samples from different populations can be a powerful approach to finding variants that may affect gene expression, but this approach will not find the large number of variants outside of coding regions that could play an important role in gene expression, nor will it find several types of non-SNP variation such as large repeats or deletions.

Proposed Approaches:

Two potential strategies to characterize human variation more fully were then discussed:

  1. Whole genome shotgun (wgs) sequencing at a low level (0.1-fold coverage) over 1000 individuals. This is the proposal that was made to NHGRI by its Working Group on Annotation of the Human Genome. The rationale for this approach is that it will build on the current SNP and HapMap program to document the set of all common variants more fully, will reveal many rare variants, will provide the least biased and most comprehensive view of the human genome over the human population, will utilize current technology (large-scale, gel-based, Sanger sequencing) and will minimize redundancy in the sequence data produced. Additionally, it will set NHGRI on a path toward developing technology for whole genome analysis of case-control studies. The disadvantage of such a low coverage approach is that it would not provide much information on the haplotypes that the newly discovered variants are found on. In addition, data on any variant would come from only a few samples, and the informative set of samples would be different for each variant. Thus this strategy would not provide a comprehensive data set on all the variants in all the samples. Deeper genome coverage would increase the haplotype information, but would be less efficient for discovering more variants because the redundancy in the sequence data will increase as the degree of coverage increases.


  2. PCR amplification of all of the genic regions of the human genome and resequencing of them, using standard technology. This approach will identify many rare variants in and near coding regions, thereby capturing those variants thought to be most likely to affect protein function and gene regulation and contribute to disease. PCR-based resequencing is a technology currently employed in many of the large-scale sequencing centers. However, it is primarily used in low throughput applications and has not yet been optimized for higher levels of production. Therefore, it can be expected that the cost (which is currently 1.5 to 2.5-fold greater than wgs sequencing) can be dramatically reduced for higher throughput applications through the kinds of process improvement that led to dramatic reductions in the cost of shotgun-based sequencing. The advantage of this directed approach is that it will identify what are thought to be many of the "important" variations in the genome using considerably (6 to 20-fold) less sequencing than would be needed in the whole genome shotgun approach. The PCR-based strategy will also provide the haplotype background for all the variants. However, it will fail to detect variation outside of the amplified regions and is not reliable in genomic regions of high GC content.
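The trade-off described for the low-coverage wgs strategy (few informative samples per variant, and growing redundancy at deeper coverage) can be illustrated with a simple model. The sketch below is not part of the workshop material. It assumes that the number of reads covering any given base in one sample follows a Poisson distribution whose mean is the fold coverage; the function names are hypothetical and chosen only for this illustration.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read in one sample.

    Assumption: per-base read depth is Poisson with mean `coverage`.
    """
    return 1.0 - math.exp(-coverage)


def expected_samples_covering(coverage: float, n_samples: int) -> float:
    """Expected number of samples with at least one read at a given base."""
    return n_samples * fraction_covered(coverage)


def redundant_fraction(coverage: float) -> float:
    """Fraction of sequenced bases that fall on a position already covered
    by another read in the same sample (the 'redundancy' in the text)."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    # Sequenced bases per genome position is `coverage`;
    # distinct positions covered per genome position is fraction_covered().
    return 1.0 - fraction_covered(coverage) / coverage


if __name__ == "__main__":
    for c in (0.1, 1.0, 4.0):
        print(
            f"coverage {c}: "
            f"covered fraction {fraction_covered(c):.3f}, "
            f"redundant fraction {redundant_fraction(c):.3f}, "
            f"samples covering a site (of 1000) "
            f"{expected_samples_covering(c, 1000):.0f}"
        )
```

Under this model, 0.1-fold coverage reaches only about 9.5% of the bases in any one sample, so a given site is observed in a different, small subset of the 1000 samples, while very little sequence is spent re-reading the same bases. Raising the coverage increases the share of redundant sequence, which is the efficiency argument made above.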

Samples:

A critical resource needed to pursue either of the two strategies under consideration is a set of properly consented samples from individuals from appropriate populations. There are many challenges in collecting such samples, including obtaining individual informed consent and addressing group-based concerns, risks about group stigmatization and discrimination, concerns about the use of such data in population history studies, and concerns that the data could contribute to the inappropriate reification of race as a highly meaningful biological construct. Through recent efforts, some of which were undertaken as part of the International HapMap Project, samples will be available in early 2005 from more than 1,000 individuals from a balance of populations with ancestry from parts of Africa, Europe, Asia, and North America. The communities from which the samples have been obtained have been actively involved in a process of community engagement, and individual informed consent that would apply to these studies will have been given.

Resequencing Technology:

As mentioned above, the current high throughput centers have not emphasized the development of pipelines for PCR-based resequencing. However, whole genome shotgun (wgs) sequencing using gel-based Sanger sequencing is highly efficient, high quality and low cost. The current cost of wgs is about $1 per kb of high quality (Q20) sequence. There is optimism that the costs of wgs sequencing can be reduced further, but at the same time, it is recognized that the field is nearing the point at which further reductions will be more difficult to obtain. A number of promising new technologies that might be utilized for resequencing were discussed and are summarized below:

Method: Electrophoresis
  Cost per 1X genome: $2.62M (PCR-based is 50-100% costlier)
  Sequence quality: High
  Technical limitations: Limit to further decreases in cost and throughput (10-fold more?)

Method: Hybridization
  Cost per 1X genome: Estimate: $3M currently; projected to be $100K in less than 2 years, and $10K long term
  Sequence quality: Variable, 99+% accuracy, call rate of 90%
  Technical limitations: Can't find all variants, indels, repeats; samples must be homozygous

Method: Amplify & Synthesize
  Cost per 1X genome: Estimate: projected $100K in 2 years; $10K long term
  Sequence quality: Variable, unproven in some cases. Expect >99% accuracy
  Technical limitations: Early stages for most of these technologies. Short reads; difficulty assembling repeats
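As a rough illustration of how the per-genome costs in the summary above scale to a project, the sketch below multiplies a cost per 1X genome by the total fold coverage produced. It is simple arithmetic on the figures quoted in this document, not a budget estimate from the workshop; the function name and the assumption that cost scales linearly with total coverage are introduced here for illustration only.

```python
# Cost per 1X genome figures quoted in the summary above (US dollars).
ELECTROPHORESIS_COST_PER_1X = 2.62e6
PROJECTED_NEW_TECH_COST_PER_1X = 100e3   # projected within about 2 years
LONG_TERM_NEW_TECH_COST_PER_1X = 10e3    # long-term projection


def project_cost(coverage_per_sample: float, n_samples: int,
                 cost_per_1x: float) -> float:
    """Total cost, assuming cost scales linearly with total fold coverage.

    Assumption: no fixed costs, sample handling costs, or economies of scale.
    """
    genome_equivalents = coverage_per_sample * n_samples
    return genome_equivalents * cost_per_1x


if __name__ == "__main__":
    # The working group proposal: 0.1-fold coverage of 1000 individuals,
    # i.e. 100 genome equivalents in total.
    for label, cost in (
        ("electrophoresis", ELECTROPHORESIS_COST_PER_1X),
        ("new technology, projected", PROJECTED_NEW_TECH_COST_PER_1X),
        ("new technology, long term", LONG_TERM_NEW_TECH_COST_PER_1X),
    ):
        total = project_cost(0.1, 1000, cost)
        print(f"{label}: ${total:,.0f}")
```

The linear model ignores sample preparation, data analysis, and any process improvements, so the outputs are only order-of-magnitude guides for comparing the technologies.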

Because of the difficulties of comparing the output of these differing technologies, it was suggested that NHGRI sponsor an assessment exercise in which the developers of different new technologies use their technology to resequence standard samples provided by NHGRI. It is hoped that such an exercise would begin to provide an estimate of the capabilities of those technologies that are tested and allow comparison of the outputs of the new sequencing technologies in a noncommercial setting. The results of such an exercise would be used by NHGRI for program planning, and would not be released publicly. Given the many technical difficulties in doing a reasonable comparison of emerging technologies, it is likely that the assessments would have to be done several times over the next few years.

Conclusions:

The conclusions of the workshop were that the NHGRI should:

  1. Include a human resequencing component within its current sequencing pipeline;
  2. In the long term, support a portfolio of grants that includes both resequencing of targeted regions and random wgs resequencing;
  3. Sponsor a resequencing technology assessment exercise to assess the strengths and applicability of the technologies in the next few years for studies of human variation and its relation to health and disease; and
  4. Continue to assess the ethical issues associated with the collection of samples and phenotypic information from identifiable individuals and develop appropriate mechanisms to address those issues through community consultation and informed consent processes.

Last Reviewed: February 19, 2012

Last updated: February 19, 2012