A deep catalogue of protein-coding variation in 983,578 individuals

Senior Partnerships and Business Operations, Business Operations and Administrative Coordinators, RGC-ME Cohort Partners, Accelerated Cures, African Descent and Glaucoma Evaluation Study (ADAGES) III, Age-related macular degeneration in the Amish, Albert Einstein College of Medicine, Amish Connectome Project, Amish Research Clinic, The Australia and New Zealand MS Genetics Consortium, Center for Non-Communicable Diseases (CNCD), Cincinnati Children’s Hospital, Columbia University, Dallas Heart Study, Diabetic Retinopathy Clinical Research (DRCR) Retina Network, Duke University, Flinders University of South Australia, Indiana Biobank, Indiana University School of Medicine, Kaiser PermanenteMayo Clinic, Mexico City Prospective Study (MCPS), MyCode-DiscovEHR Geisinger Health System Biobank, National Institute of Mental Health, Northwestern University, Penn Medicine BioBank, Primary Open-Angle African American Glaucoma Genetics (POAAG) study, Regeneron–Mt. Sinai BioMe Biobank, UAB GWAS in African Americans with rheumatoid arthritis, UAB Whole exome sequencing of systemic lupus erythematosus patients, University of California, Los Angeles, University of Colorado School of Medicine, University of Michigan Medical School, University of Ottawa, University of Pennsylvania, University of Pittsburgh, University of Texas Health Science Center at Houston, Vanderbilt University Medical Center, Regeneron Genetics Center, RGC Management and Leadership Team, Sequencing and Lab Operations, Clinical Informatics, Genome Informatics and Data Engineering, Analytical Genetics and Data Science, Therapeutic Area Genetics, Research Program Management and Strategic Initiatives

Research output: Contribution to journalArticlepeer-review

4 Scopus citations

Abstract

Rare coding variants that substantially affect function provide insights into the biology of a gene1–3. However, ascertaining the frequency of such variants requires large sample sizes4–8. Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more than 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss of function (LOF), we identify 3,988 LOF-intolerant genes, including 86 that were previously assessed as tolerant and 1,153 that lack established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions that are depleted of missense variants despite being tolerant of pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this resource of coding variation from the RGC-ME dataset publicly accessible through a variant allele frequency browser.

Original languageEnglish (US)
Pages (from-to)583-592
Number of pages10
JournalNature
Volume631
Issue number8021
DOIs
StatePublished - Jul 18 2024

Funding

This work was supported in part by the Intramural Research Program of the National Institute of Mental Health (ZIA-MH002843) and a grant from R01 NCI R01 CA157823. Ethical approval for the UK Biobank was previously obtained from the North West Centre for Research Ethics Committee (11/ NW/0382). The work described herein was approved by the UK Biobank under application number 26041. Informed consent was obtained for all study participants. Approval for Geisinger Health System MyCode analyses was provided by the Geisinger Health System Institutional Review Board under project number 2006-0258. Informed consent was obtained for all study participants. Appropriate consent for the Penn Medicine BioBank was obtained from each participant regarding the storage of biological specimens, genetic sequencing and genotyping and access to all available EHR data. This study was approved by the Institutional Review Board of the University of Pennsylvania and complied with the principles set out in the Declaration of Helsinki. All individuals participating in the Mayo\u2013RGC project generation provided informed consent for the use of specimens and data in genetic and health research and ethical approval for project generation was provided by the Mayo Clinic Institutional Review Board (09-007763). All research performed in this study used de-identified data (without any Protected Health Information data) with no possibility of re-identifying any of the participants. Approval for the Indiana Biobank was provided by the Indiana University Institutional Review Board under project number 1105005445. For participants in the Mexico City Prospective Study, approval for the study was given by the Mexican Ministry of Health, the Mexican National Council of Science and Technology (0595 P-M) and the Central Oxford Research Ethics Committee (C99.260) and the Ethics and Research commissions from the Medicine Faculty at the National Autonomous University of Mexico (UNAM) (FMED/CI/SPLR/067/2015). All study participants provided written informed consent. Study participants were recruited from the BioMe Biobank Program of the Charles Bronfman Institute for Personalized Medicine at Mount Sinai Medical Center from 2007 onward. The BioMe Biobank Program (Institutional Review Board 07-0529) operates under a Mount Sinai Institutional Review Board-approved research protocol. All study participants provided written informed consent. The authors thank the RGC-ME Cohort Partners for contributions to this initiative. A list of cohorts (with links to the projects where available) and the data contributors can be accessed from the RGC-ME browser ( https://rgc-research.regeneron.com/me/data-contributors ). The authors thank everyone who made this work possible, the professionals from the member institutions who contributed to and supported this work and, most especially, all of the participants, without whom this research would not be possible. This study is funded by Regeneron Genetics Center and Regeneron Pharmaceuticals.

ASJC Scopus subject areas

  • General

Fingerprint

Dive into the research topics of 'A deep catalogue of protein-coding variation in 983,578 individuals'. Together they form a unique fingerprint.

Cite this