Genomic Summary Results Update FAQs
Frequently Asked Questions regarding the Update to NIH Management of Genomic Summary Results (GSR) Access.
Skip to a specific question
- What are Genomic Summary Results?
- How are GSR different from individual-level genomic data?
- Why did the NIH change the way it manages access to GSR?
- How are GSR shared or used?
- What are the privacy risks associated with sharing GSR?
- What are the benefits of sharing GSR through open (unrestricted) access?
- Who will calculate GSR and how?
- What are the options under the NIH GDS Policy for sharing GSR?
- Can approved data users share GSR that they derive from individual-level genomic data obtained from NIH-designated data repositories?
What are Genomic Summary Results?
Genomic summary results (GSR) are the output of analyses of genomic data across the many individual participants included within a specific study’s dataset or across many studies. For most studies in NIH-designated data repositories (such as dbGaP or AnVIL), for example, this means that GSR represent a summary of the information generated from hundreds, or thousands, of research participants. There are two broad classes of GSR information: allele frequency information and association analysis statistics.
- An allele frequency is the proportion of a specific allele, or variation in the DNA code, relative to other possible alleles at the same position in the code in a given population, or in some cases, an entire species. Allele frequency information is used in the fields of Genomics, Population Genetics, and Clinical Genetics to help interpret the potential for links between the presence of specific alleles and observed “outcomes”, such as physical traits or disease risks.
- In genomics, association analysis statistics are the information generated when investigators evaluate the correlation of genotype to phenotype. Phenotypes studied may be diseases (e.g., diabetes), traits (e.g., height), or molecular traits (e.g., mRNA or protein expression levels). Examples of these kinds of statistics are: p-values, beta values in regression, odds ratios, and effect sizes.
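As a rough illustration of the two classes of GSR described above, the sketch below computes an allele frequency and an odds ratio from made-up counts. The function names, genotypes, and counts are hypothetical and are not drawn from any NIH dataset or tool; real analyses use dedicated statistical software and adjust for many additional factors.

```python
def allele_frequency(genotypes, allele):
    """Proportion of `allele` among all allele copies observed at one position."""
    copies = [a for genotype in genotypes for a in genotype]
    return copies.count(allele) / len(copies)


def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of the allele among cases divided by the odds among controls."""
    return (case_with / case_without) / (control_with / control_without)


# Hypothetical diploid genotypes for five participants at a single position.
genotypes = [("A", "G"), ("A", "A"), ("G", "G"), ("A", "G"), ("A", "A")]
freq_a = allele_frequency(genotypes, "A")  # 6 of the 10 allele copies are "A"

# Hypothetical allele counts: 30 vs. 70 in the disease group, 15 vs. 85 in controls.
example_or = odds_ratio(30, 70, 15, 85)

print(f"Frequency of allele A: {freq_a:.2f}")
print(f"Odds ratio: {example_or:.2f}")
```

Both outputs are summaries across a group: neither value records which participant carried which allele.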
Sometimes, GSR can be included in analytical models. For instance, Polygenic Risk Score (PRS) models include scoring files that contain the variants, effect alleles/weights (i.e., GSR) necessary to make use of the model, as estimated from relevant study(ies) used to develop the model.
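To make the role of a scoring file concrete, the following is a minimal sketch that assumes a simple additive model: each variant's weight is multiplied by the number of copies of the effect allele a person carries, and the products are summed. The variant identifiers, weights, and the `polygenic_score` function are hypothetical; actual PRS models may use different file layouts and normalization steps.

```python
def polygenic_score(scoring_rows, genotypes):
    """Sum of weight * (copies of the effect allele) over the variants in the scoring file."""
    score = 0.0
    for variant_id, effect_allele, weight in scoring_rows:
        genotype = genotypes.get(variant_id)
        if genotype is None:
            continue  # variant not genotyped for this person; skipped in this sketch
        score += weight * genotype.count(effect_allele)
    return score


# Hypothetical scoring file rows: (variant, effect allele, weight).
scoring_rows = [
    ("var1", "A", 0.5),
    ("var2", "T", -0.25),
    ("var3", "G", 0.125),
]

# Hypothetical genotypes for one person; "var3" was not genotyped.
participant = {"var1": ("A", "A"), "var2": ("T", "T")}

score = polygenic_score(scoring_rows, participant)
print(f"Polygenic score: {score}")
```

The weights in `scoring_rows` are the GSR in this picture: they are estimated from a study population as a whole, while the genotypes they are applied to are individual-level data.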
How are GSR different from individual-level genomic data?
“Individual-level data” provide the specific genomic sequence for a single research participant and are usually only available through controlled-access pathways. The privacy risks for individual-level data are greater than those for GSR because they refer to the unique pattern in the DNA code of a single participant, rather than calculations about the patterns seen across a group of people.
Why did the NIH change the way it manages access to GSR?
NIH has carefully considered the risks and benefits of access to GSR since 2008, when it was first described that individuals could potentially be ‘re-identified’ through the use of GSR. Specifically, the agency held public workshops and solicited stakeholder comments through requests for information on the risks and benefits of different models of GSR access.
Public input over the years increasingly noted that the benefits of expanded access to GSR from most genomic studies outweighed the potential risks. Respondents highlighted the significant scientific value of GSR and the fact that there would be minimal risk to most participants if GSR were to be moved from controlled-access to an unrestricted access model. Based on this input, NIH changed the data access model for most GSR to make it more proportional to the risks for this type of information. However, because there are some studies where there might be additional privacy concerns, such as those that include populations from isolated geographic areas or with rare or stigmatizing traits, the access model includes a pathway for GSR from some studies to remain under controlled access.
How are GSR shared or used?
Currently, some GSR are included by investigators in the manuscripts that they publish to share the key findings from their research studies with the scientific community. GSR can be used to assess the validity and potential significance of results seen in other studies. They can also be useful for assessing the frequency of an individual genomic variant in different populations and for interpreting the possible pathologic importance of specific genomic test results. While publications only share a small number of GSR relevant to the specific research questions discussed, sharing the complete set of GSR across a dataset or many datasets creates the opportunity for the information to be used to answer different research questions.
On May 1, 2019, NIH released a Notice (NOT-OD-19-023) stating that GSR from most studies that are shared through NIH-designated data repositories would be shared through open access (unrestricted) pathways. This means that controlled-access NIH-designated data repositories could begin to share publicly more of the statistical findings for most of the studies hosted within the repository. This allows more GSR to be used by the broader scientific community to promote scientific or health-related research. Investigators requesting access to individual-level data through controlled-access (secondary investigators) can continue to share GSR calculations for others to use (e.g., through a publication). However, if these investigators wish to disseminate GSR generated from individual-level data more broadly (e.g., through an online resource), this should be described in a data access request, which will be reviewed by the Data Access Committee.
What are the privacy risks associated with sharing GSR?
GSR can be used to determine whether an individual was in a particular group of a study (e.g., the disease group vs. the control group), but ONLY IF someone already has access to the research participant’s individual-level genomic information. While the risk is very low, knowing that a person is part of a group (e.g., a disease group) could potentially reveal sensitive information that was not already known from the individual-level genomic information itself.
It is possible that certain study populations may be more vulnerable to this privacy risk if they are from a small or isolated population or have a rare condition or trait. In other cases, the potential stigma of certain conditions or traits included in a study population may also increase privacy concerns.
What are the benefits of sharing GSR through open (unrestricted) access?
Sharing GSR through openly accessible mechanisms means that these summary findings can be used to address many different research questions or to inform the interpretation of clinical test results by health care providers. When GSR are available through unrestricted access, they also become easier for a range of scientists from different fields to use in developing new methods to interpret genomic information and its connection to phenotypes. In addition, since GSR can be used to assess the validity or potential significance of results seen in other studies, the need to request individual-level data from a study may decrease, thereby focusing access to individual-level data on those secondary studies that truly require it.
Who will calculate GSR and how?
Currently, any GSR shared through controlled-access repositories for an individual study are generated and submitted by the study Principal Investigator(s) (PIs). NCBI also calculates allele frequencies across all non-sensitive datasets within dbGaP via the Allele Frequency Aggregator (ALFA) and shares the results, displayed by population, through unrestricted access.
What are the options under the NIH GDS Policy for sharing GSR?
If an institution’s IRB determines that there are substantive individual privacy or group harm concerns for a particular study population, the institution may designate the study as “sensitive” on the Institutional Certification. If an institution designates GSR as “sensitive,” these GSR will only be shared through controlled access, in conjunction with and under the same terms of access and use as the individual-level data for the study.
Can approved data users share GSR that they derive from individual-level genomic data obtained from NIH-designated data repositories?
For individual-level human genomic data in NIH-designated data repositories, which are usually only available through controlled access, a data access request that is reviewed by a Data Access Committee (DAC) is always required. For non-sensitive datasets, data requesters who wish to post GSR more broadly than publication within the scientific literature (where GSR serve as an intrinsic piece of evidence to support a study’s conclusions) can indicate plans to generate and disseminate GSR in their research use statement, and this may be approved by a DAC. Requesters do not need to indicate what specific GSR they plan to generate and disseminate.
For datasets that are designated as sensitive, DACs will not approve research use statements that indicate plans to disseminate GSR more broadly than publication within the scientific literature to support a study’s conclusions.
Last updated: July 18, 2024