Evaluation of a two-stage framework for prediction using big genomic data

Xia Jiang, Richard E. Neapolitan*

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    4 Scopus citations


    We are in the era of abundant 'big' or 'high-dimensional' data. These data afford us the opportunity to discover predictors of an event of interest, and to estimate occurrence of the event based on values of these predictors. For example, 'genome-wide association studies' examine millions of single-nucleotide polymorphisms (SNPs), along with disease status. We can learn SNPs that affect disease status from these data sets, and use the knowledge learned to predict disease likelihood. Owing to the large number of features, it is difficult for many prediction methods to use all the features directly. The ReliefF algorithm ranks a set of features in terms of how well they predict a target. It can be used to identify good predictors, which can then be provided to a prediction method. We compared the performance of eight prediction methods when predicting binary outcomes using high-dimensional discrete data sets. We performed two-stage prediction, where ReliefF is used in the first stage to identify good predictors. Bayesian network (BN)-based methods performed best overall. Furthermore, ReliefF did not improve their performance. The BN-based methods use the Bayesian Dirichlet Equivalent Uniform score to evaluate candidate models, and use BN inference algorithms to perform prediction. This score and these algorithms were developed for discrete variables. This perhaps explains why they perform better in this domain. Many prediction methods are available, and researchers have little reason for choosing one over the other in the domain of binary prediction using high-dimensional data sets. Our results indicate that the best choices overall are BN-based methods.
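The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ranking step is a simplified Relief-style score that uses a single nearest hit and nearest miss per instance (the full ReliefF algorithm averages over several neighbours), and the second stage uses a 1-nearest-neighbour classifier purely as a stand-in for whichever prediction method is being evaluated. The function names, the toy genotype matrix `X` (SNPs coded 0/1/2), and the labels `y` are all hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two discrete feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def relief_scores(X: List[List[int]], y: List[int]) -> List[float]:
    """Simplified Relief-style feature scores for discrete data, binary target.

    Assumption: one nearest hit and one nearest miss per instance.
    A feature gains weight when it differs from the nearest miss and
    loses weight when it differs from the nearest hit.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for i in range(n):
        hit, miss = None, None
        hit_d, miss_d = None, None
        for j in range(n):
            if j == i:
                continue
            d = hamming(X[i], X[j])
            if y[j] == y[i]:
                if hit is None or d < hit_d:
                    hit, hit_d = j, d
            elif miss is None or d < miss_d:
                miss, miss_d = j, d
        if hit is None or miss is None:
            continue  # no same-class or no other-class neighbour available
        for f in range(p):
            diff_miss = int(X[i][f] != X[miss][f])
            diff_hit = int(X[i][f] != X[hit][f])
            w[f] += (diff_miss - diff_hit) / n
    return w


def top_features(scores: List[float], k: int) -> List[int]:
    """Stage 1: indices of the k highest-scoring features."""
    return sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)[:k]


def predict_1nn(X_train, y_train, x, features) -> int:
    """Stage 2 (stand-in predictor): 1-nearest-neighbour on the selected features."""
    def dist(row):
        return sum(row[f] != x[f] for f in features)
    best = min(range(len(X_train)), key=lambda i: dist(X_train[i]))
    return y_train[best]


# Hypothetical toy data: feature 0 tracks the outcome, features 1 and 2 are noise.
X = [[0, 1, 2], [0, 2, 0], [0, 0, 1], [2, 1, 0], [2, 2, 1], [2, 0, 2]]
y = [0, 0, 0, 1, 1, 1]

scores = relief_scores(X, y)
selected = top_features(scores, k=1)
prediction = predict_1nn(X, y, [2, 0, 0], selected)
```

In this sketch the first stage only decides which columns the second stage sees, so the predictor can be swapped without touching the ranking code. Skipping stage one and passing all features corresponds to the one-stage setting that the abstract reports as sufficient for the BN-based methods.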

    Original language: English (US)
    Pages (from-to): 912-921
    Number of pages: 10
    Journal: Briefings in Bioinformatics
    Issue number: 6
    State: Published - Feb 6 2015


    • Bayesian network
    • Big data
    • GWAS
    • High-dimensional data
    • Prediction
    • SNP

    ASJC Scopus subject areas

    • Information Systems
    • Molecular Biology

