Colon cancer survival prediction using ensemble data mining on SEER data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Scopus citations

Abstract

We analyze the colon cancer data available from the SEER program with the aim of developing accurate survival prediction models for colon cancer. Carefully designed preprocessing steps removed several attributes, after which we applied several supervised classification methods. We also adopt the synthetic minority over-sampling technique (SMOTE) to balance the survival and non-survival classes. In our experiments, ensemble voting of three of the top-performing classifiers yielded the best prediction performance in terms of prediction accuracy and area under the ROC curve. We evaluated multiple classification schemes to estimate the risk of mortality 1 year, 2 years, and 5 years after diagnosis on three attribute sets: a subset of 65 attributes remaining after the data cleanup process, 13 attributes carefully selected using attribute selection techniques while trying to retain the predictive power of the original set of attributes, and a SMOTE-balanced set of the same 13 attributes. Moreover, we demonstrate the importance of balancing the classes of the data set to yield better results.
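
The two techniques named in the abstract, SMOTE balancing and ensemble voting, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the three classifiers or the voting rule, so hard (majority) voting is an assumption here, and the function names `smote` and `majority_vote`, the parameter `k`, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    Assumes `minority` holds at least two numeric feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def majority_vote(predictions):
    """Hard voting (an assumption): `predictions` holds one list of labels
    per classifier; the most common label wins for each sample."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


# Toy minority-class feature vectors (made up, not SEER data).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote(minority, n_new=6, k=3)

# Labels from three hypothetical classifiers for three patients.
votes = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
print(votes)  # [1, 0, 1]
```

Because each synthetic sample lies on a line segment between two real minority samples, the oversampled class stays within the range of the observed data. With three voters and two classes, majority voting never ties.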

Original language: English (US)
Title of host publication: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013
Pages: 9-16
Number of pages: 8
DOIs
State: Published - Dec 1 2013
Event: 2013 IEEE International Conference on Big Data, Big Data 2013 - Santa Clara, CA, United States
Duration: Oct 6 2013 to Oct 9 2013

Publication series

Name: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013

Other

Other: 2013 IEEE International Conference on Big Data, Big Data 2013
Country: United States
City: Santa Clara, CA
Period: 10/6/13 to 10/9/13

Keywords

  • Colon Cancer
  • Ensemble
  • Machine Learning
  • Prediction

ASJC Scopus subject areas

  • Software

