Deep learning of human polyadenylation sites at nucleotide resolution reveals molecular determinants of site usage and relevance in disease

Emily Kunce Stroup, Zhe Ji*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

The genomic distribution of cleavage and polyadenylation (polyA) sites should be co-evolutionally optimized with the local gene structure. Otherwise, spurious polyadenylation can cause premature transcription termination and generate aberrant proteins. To obtain mechanistic insights into polyA site optimization across the human genome, we develop deep/machine learning models to identify genome-wide putative polyA sites at unprecedented nucleotide-level resolution and calculate their strength and usage in the genomic context. Our models quantitatively measure position-specific motif importance and their crosstalk in polyA site formation and cleavage heterogeneity. The intronic site expression is governed by the surrounding splicing landscape. The usage of alternative polyA sites in terminal exons is modulated by their relative locations and distance to downstream genes. Finally, we apply our models to reveal thousands of disease- and trait-associated genetic variants altering polyadenylation activity. Altogether, our models represent a valuable resource to dissect molecular mechanisms mediating genome-wide polyA site expression and characterize their functional roles in human diseases.

Original language: English (US)
Article number: 7378
Journal: Nature Communications
Volume: 14
Issue number: 1
DOIs
State: Published - Dec 2023

Funding

This work was supported by grants to Z.J.: the National Institutes of Health (R35GM138192, and R01HL161389), and the Lynn Sage Scholar fund. E.S. was supported by the Predoctoral Training Program in Biomedical Data Driven Discovery (T32LM012203). We thank Alfred George and the members of the Ji lab for helpful discussions.

ASJC Scopus subject areas

  • General Chemistry
  • General Biochemistry, Genetics and Molecular Biology
  • General Physics and Astronomy
