TY - GEN
T1 - A lung cancer outcome calculator using ensemble data mining on SEER data
AU - Agrawal, Ankit
AU - Misra, Sanchit
AU - Narayanan, Ramanathan
AU - Polepeddi, Lalith
AU - Choudhary, Alok
PY - 2011
Y1 - 2011
N2 - We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer using data mining techniques. Carefully designed preprocessing steps resulted in the removal, modification, or splitting of several attributes, and 2 of the 11 derived attributes were found to have significant predictive power. Several data mining classification techniques were applied to the preprocessed data, along with various data mining optimizations and validations. In our experiments, ensemble voting of five decision-tree-based classifiers and meta-classifiers gave the best prediction performance in terms of accuracy and area under the ROC curve. Further, we have developed an on-line lung cancer outcome calculator for estimating the risk of mortality at 6 months, 9 months, 1 year, 2 years, and 5 years after diagnosis, for which a smaller non-redundant subset of 13 attributes was carefully selected using attribute selection techniques, while trying to retain the predictive power of the original set of attributes. The on-line lung cancer outcome calculator developed as a result of this study is available at http://info.eecs.northwestern.edu:8080/LungCancerOutcome-Calculator/.
AB - We analyze the lung cancer data available from the SEER program with the aim of developing accurate survival prediction models for lung cancer using data mining techniques. Carefully designed preprocessing steps resulted in the removal, modification, or splitting of several attributes, and 2 of the 11 derived attributes were found to have significant predictive power. Several data mining classification techniques were applied to the preprocessed data, along with various data mining optimizations and validations. In our experiments, ensemble voting of five decision-tree-based classifiers and meta-classifiers gave the best prediction performance in terms of accuracy and area under the ROC curve. Further, we have developed an on-line lung cancer outcome calculator for estimating the risk of mortality at 6 months, 9 months, 1 year, 2 years, and 5 years after diagnosis, for which a smaller non-redundant subset of 13 attributes was carefully selected using attribute selection techniques, while trying to retain the predictive power of the original set of attributes. The on-line lung cancer outcome calculator developed as a result of this study is available at http://info.eecs.northwestern.edu:8080/LungCancerOutcome-Calculator/.
KW - Ensemble data mining
KW - Lung cancer
KW - Outcome calculator
KW - Predictive modeling
UR - http://www.scopus.com/inward/record.url?scp=85147403425&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147403425&partnerID=8YFLogxK
U2 - 10.1145/2003351.2003356
DO - 10.1145/2003351.2003356
M3 - Conference contribution
AN - SCOPUS:85147403425
SN - 9781450308397
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
BT - 10th International Workshop on Data Mining in Bioinformatics, BIOKDD 2011 - Held in Conjunction with the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-2011
PB - Association for Computing Machinery
T2 - 10th International Workshop on Data Mining in Bioinformatics, BIOKDD 2011 - Held in Conjunction with the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD-2011
Y2 - 21 August 2011 through 24 August 2011
ER -