TY - JOUR
T1 - Improving automated pediatric bone age estimation using ensembles of models from the 2017 RSNA machine learning challenge
AU - Pan, Ian
AU - Thodberg, Hans Henrik
AU - Halabi, Safwan S.
AU - Kalpathy-Cramer, Jayashree
AU - Larson, David B.
N1 - Publisher Copyright: © RSNA, 2019.
PY - 2019/11
Y1 - 2019/11
AB - Purpose: To investigate improvements in performance for automatic bone age estimation that can be gained through model ensembling. Materials and Methods: A total of 48 submissions from the 2017 RSNA Pediatric Bone Age Machine Learning Challenge were used. Participants were provided with 12 611 pediatric hand radiographs with bone ages determined by a pediatric radiologist to develop models for bone age determination. The final results were determined using a test set of 200 radiographs labeled with the weighted average of six ratings. The mean pairwise model correlation and performance of all possible model combinations for ensembles of up to 10 models using the mean absolute deviation (MAD) were evaluated. A bootstrap analysis using the 200 test radiographs was conducted to estimate the true generalization MAD. Results: The estimated generalization MAD of a single model was 4.55 months. The best-performing ensemble consisted of four models with an MAD of 3.79 months. The mean pairwise correlation of models within this ensemble was 0.47. In comparison, the lowest achievable MAD by combining the highest-ranking models based on individual scores was 3.93 months using eight models with a mean pairwise model correlation of 0.67. Conclusion: Combining less-correlated, high-performing models resulted in better performance than naively combining the top-performing models. Machine learning competitions within radiology should be encouraged to spur development of heterogeneous models whose predictions can be combined to achieve optimal performance.
UR - http://www.scopus.com/inward/record.url?scp=85086500892&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85086500892&partnerID=8YFLogxK
U2 - 10.1148/ryai.2019190053
DO - 10.1148/ryai.2019190053
M3 - Article
C2 - 32090207
AN - SCOPUS:85086500892
SN - 2638-6100
VL - 1
JO - Radiology: Artificial Intelligence
JF - Radiology: Artificial Intelligence
IS - 6
M1 - e190053
ER -