TY - JOUR
T1 - Complementary feature selection from alternative splicing events and gene expression for phenotype prediction
AU - Labuzzetta, Charles J.
AU - Antonio, Margaret L.
AU - Watson, Patricia M.
AU - Wilson, Robert C.
AU - Laboissonniere, Lauren A.
AU - Trimarchi, Jeffrey M.
AU - Genc, Baris
AU - Ozdinler, P. Hande
AU - Watson, Dennis K.
AU - Anderson, Paul E.
N1 - Publisher Copyright: © 2016 The Author 2016. Published by Oxford University Press. All rights reserved.
PY - 2016/9/1
Y1 - 2016/9/1
AB - Motivation: A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods to predict phenotypes in RNA-Seq datasets utilize machine learning algorithms trained via gene expression. Isoforms generated from alternative splicing, however, may provide a novel and complementary set of transcripts for phenotype prediction. In contrast to gene expression, the number of isoforms increases significantly due to numerous alternative splicing patterns, resulting in a prioritization problem for many machine learning algorithms. This study identifies the empirically optimal methods of transcript quantification, feature engineering and filtering steps using phenotype prediction accuracy as a metric. At the same time, the complementary nature of gene and isoform data is analyzed and the feasibility of identifying isoforms as biomarker candidates is examined. Results: Isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm, which selects up to the N highest-ranking features for phenotype prediction, is described and evaluated in this study. An empirical comparison of pipelines for isoform quantification is reported by performing cross-validation prediction tests with datasets from human non-small cell lung cancer (NSCLC) patients, human patients with chronic obstructive pulmonary disease (COPD) and amyotrophic lateral sclerosis (ALS) transgenic mice, each including samples of diseased and non-diseased phenotypes. Availability and Implementation: https://github.com/clabuzze/Phenotype-Prediction-Pipeline.git.
UR - http://www.scopus.com/inward/record.url?scp=84990882905&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990882905&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btw430
DO - 10.1093/bioinformatics/btw430
M3 - Article
C2 - 27587658
AN - SCOPUS:84990882905
SN - 1367-4803
VL - 32
SP - i421
EP - i429
JO - Bioinformatics
JF - Bioinformatics
IS - 17
ER -