A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets.

Xia Jiang*, Binghuang Cai, Diyang Xue, Xinghua Lu, Gregory F. Cooper, Richard E. Neapolitan

*Corresponding author for this work

    Research output: Contribution to journal, Article, peer-review

    9 Scopus citations


    The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). Our evaluation uses one hundred simulated 1000-single nucleotide polymorphism (SNP) datasets, ten 10,000-SNP datasets, six semi-synthetic datasets, and two real genome-wide association study (GWAS) datasets. In fivefold cross-validation studies, the SVM performed best on the 1000-SNP datasets, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. EBMC performed better than NB when there were several strong predictors, whereas NB performed better when there were many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC. Published by the BMJ Publishing Group Limited.
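The contrast the abstract draws between in-sample testing and fivefold cross-validation can be illustrated with a small sketch. This is not the authors' code and it does not implement EBMC. It uses a plain categorical naive Bayes (one of the eight compared methods) on a hypothetical simulated dataset. The simulation rule (two interacting SNPs, risk probabilities 0.8 and 0.2), the dataset size, the smoothing constant `alpha`, and the choice of AUROC as the score are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random

def simulate_snps(n_samples, n_snps, rng):
    """Hypothetical generator: genotypes in {0, 1, 2}; SNP 0 and SNP 1 interact."""
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.choice((0, 1, 2)) for _ in range(n_snps)]
        # Assumed epistatic-like rule: risk is high only when both loci are nonzero.
        p_case = 0.8 if (row[0] > 0 and row[1] > 0) else 0.2
        X.append(row)
        y.append(1 if rng.random() < p_case else 0)
    return X, y

def fit_nb(X, y, alpha=1.0):
    """Categorical naive Bayes with Laplace smoothing; returns log-probability tables."""
    n_snps = len(X[0])
    counts = {c: [[alpha] * 3 for _ in range(n_snps)] for c in (0, 1)}
    class_n = {0: alpha, 1: alpha}
    for row, label in zip(X, y):
        class_n[label] += 1
        for j, g in enumerate(row):
            counts[label][j][g] += 1
    total = class_n[0] + class_n[1]
    log_prior = {c: math.log(class_n[c] / total) for c in (0, 1)}
    log_like = {c: [[math.log(v / sum(col)) for v in col] for col in counts[c]]
                for c in (0, 1)}
    return log_prior, log_like

def score_nb(model, row):
    """Log-odds of case versus control; larger means more case-like."""
    log_prior, log_like = model
    s = log_prior[1] - log_prior[0]
    for j, g in enumerate(row):
        s += log_like[1][j][g] - log_like[0][j][g]
    return s

def auroc(scores, labels):
    """AUROC as the fraction of case/control pairs ranked correctly (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_auroc(X, y, k, rng):
    """Score every sample with a model that never saw it, then pool the scores."""
    scores = [0.0] * len(y)
    for test_idx in kfold_indices(len(y), k, rng):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit_nb([X[i] for i in train], [y[i] for i in train])
        for i in test_idx:
            scores[i] = score_nb(model, X[i])
    return auroc(scores, y)

rng = random.Random(0)  # fixed seed so the run is repeatable
X, y = simulate_snps(400, 50, rng)
full_model = fit_nb(X, y)
in_sample = auroc([score_nb(full_model, r) for r in X], y)
cv = cross_validated_auroc(X, y, 5, rng)
print(f"in-sample AUROC:   {in_sample:.3f}")
print(f"fivefold CV AUROC: {cv:.3f}")
```

A gap between the two printed values is the kind of signal the abstract refers to when it says some methods tend to overfit: the in-sample score is computed on the same samples used for fitting, while each cross-validated score comes from a model trained without that sample. Pooling held-out scores across folds before computing AUROC is one common convention; averaging per-fold AUROC is another.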

    Original language: English (US)
    Pages (from-to): e312-319
    Journal: Journal of the American Medical Informatics Association : JAMIA
    Issue number: e2
    State: Published - Oct 2014

    ASJC Scopus subject areas

    • Health Informatics

