Abstract
Background: The efficiency of genetic modifications in the CRISPR-Cas9 system varies with the single guide RNA (sgRNA) used. Several sgRNA properties are predictive of CRISPR cleavage efficiency, including position-specific sequence composition, global sequence properties, and thermodynamic features. Although existing deep learning-based approaches provide competitive prediction accuracy, a more interpretable model is desirable to help understand how different features contribute to CRISPR-Cas9 cleavage efficiency.

Results: We propose a gradient boosting approach that uses LightGBM to develop an integrated tool, BoostMEC (Boosting Model for Efficient CRISPR), for predicting wild-type CRISPR-Cas9 editing efficiency. We benchmark BoostMEC against 10 popular models on 13 external datasets and show that its performance is competitive.

Conclusions: BoostMEC can provide state-of-the-art predictions of CRISPR-Cas9 cleavage efficiency for sgRNA design and selection. Because it relies on direct and derived features of sgRNA sequences and on conventional machine learning, BoostMEC has an advantage over state-of-the-art deep learning-based CRISPR efficiency prediction models: it produces more interpretable feature insights and predictions.
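The abstract names position-specific sequence composition and global sequence properties as predictive sgRNA features. As a minimal sketch of how features of this kind might be derived from a guide sequence, the following is illustrative only and is not BoostMEC's actual feature set: the function names, the `pos{i}_{nt}` naming scheme, and the restriction to A/C/G/T are assumptions made for this example, and thermodynamic features are omitted.

```python
from typing import Dict

NUCLEOTIDES = "ACGT"


def _validate(sgrna: str) -> str:
    """Upper-case the sequence and reject empty or non-ACGT input."""
    seq = sgrna.upper()
    if not seq or any(base not in NUCLEOTIDES for base in seq):
        raise ValueError(f"expected a non-empty A/C/G/T sequence, got {sgrna!r}")
    return seq


def position_specific_features(sgrna: str) -> Dict[str, int]:
    """One-hot encode each position (hypothetical naming: pos{i}_{nt})."""
    seq = _validate(sgrna)
    return {
        f"pos{i}_{nt}": int(base == nt)
        for i, base in enumerate(seq, start=1)
        for nt in NUCLEOTIDES
    }


def global_features(sgrna: str) -> Dict[str, float]:
    """Whole-sequence properties: GC fraction and per-nucleotide counts."""
    seq = _validate(sgrna)
    features: Dict[str, float] = {
        "gc_content": (seq.count("G") + seq.count("C")) / len(seq)
    }
    for nt in NUCLEOTIDES:
        features[f"count_{nt}"] = float(seq.count(nt))
    return features


def featurize(sgrna: str) -> Dict[str, float]:
    """Combine position-specific and global features into one flat dict."""
    combined: Dict[str, float] = {}
    combined.update(position_specific_features(sgrna))
    combined.update(global_features(sgrna))
    return combined
```

A table of such feature dictionaries, one row per sgRNA, could then be passed to a gradient boosting regressor such as LightGBM's `LGBMRegressor`. Because each column corresponds to a named sequence property, feature importances from the fitted trees can be read directly in terms of the sequence, which is the kind of interpretability the abstract emphasizes.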
Original language | English (US) |
---|---|
Article number | 446 |
Journal | BMC Bioinformatics |
Volume | 23 |
Issue number | 1 |
State | Published - Dec 2022 |
Keywords
- CRISPR-Cas9
- Feature engineering
- Interpretability
- LightGBM
- Machine learning
- Regression trees
- sgRNA
ASJC Scopus subject areas
- Applied Mathematics
- Molecular Biology
- Structural Biology
- Biochemistry
- Computer Science Applications