Developing novel deep-learning based methods for deciphering non-coding gene regulatory code

Project: Research project

Project Details


Goals: This project will contribute novel pre-trained DNA Bidirectional Encoder Representations from Transformers, called DNABERT, and associated deep-learning tools to decipher the language of non-coding DNA and to facilitate integration of gene regulatory information from rapidly accumulating sequence data with NLM’s genetic databases (for example, dbSNP, dbGaP and ClinVar), which serve both scientists and public health by helping identify the genetic components of disease. The proposed research will further develop DNABERT for a variety of sequence prediction tasks and benchmark it against existing state-of-the-art deep-learning based methods.

Specific aims are:

1. Develop novel deep-learning methods by adapting BERT. This aim will be accomplished by developing:
   a. DNABERT, a novel pre-trained transformer-based neural network model for human DNA representations
   b. Fine-tuned DNABERT modules for diverse sequence prediction tasks with labeled data
   c. DNABERT-viz, a novel visualization module for model interpretability and visualization of important regions, contexts and sequence motifs

2. Apply the proposed methods to specifically target non-coding DNA sequence analyses and predictions: We will fine-tune DNABERT to develop predictive models for (a) proximal promoters; (b) core-promoters; (c) splice sites; and (d) cis-regulatory elements (CREs). In conjunction, we will evaluate how effectively the fine-tuned DNABERT distinguishes tissue/cell-type specific promoters and polysemous CREs, by investigating contextual differences in binding specificities between different TF family members and their isoforms, making use of publicly available ChIP-seq and transcriptome data. Since separate pre-training of DNABERT for different organisms can be both very time-consuming and resource-intensive, we will evaluate whether DNABERT pre-trained on the human genome can be applied to other mammalian organisms, by applying the human-pre-trained DNABERT to mouse ENCODE ChIP-seq datasets.

3. Predict and validate functional non-coding genetic variants by applying DNABERT prediction models: We will develop a set of tools to identify functional non-coding variants using the short variants in the dbSNP database, and validate candidate predictions by integrating information from dbGaP and ClinVar and by performing ChIP, dual-luciferase reporter assays and CRISPR/Cas9 genome editing in human cell lines.

Dr. Han Liu's group (Northwestern University) will participate in completing Aim 1. The Davuluri group will participate in completing the proposed objectives in Aims 2 and 3, which will be performed at Stony Brook University School of Medicine. Sequence analysis, interpretation and correlation with phenotypic outcomes, and manuscript writing will be performed at both Northwestern and Stony Brook Universities in joint collaboration between the Liu and Davuluri groups.
Effective start/end date: 8/1/21 to 4/30/25


  • State University of New York at Stony Brook (5R01LM013722-03 // 1170742/2/92468 Amendment 2)
  • National Library of Medicine (5R01LM013722-03 // 1170742/2/92468 Amendment 2)

