TY - JOUR
T1 - A semi-supervised learning approach to enhance health care community-based question answering
T2 - A case study in alcoholism
AU - Wongchaisuwat, Papis
AU - Klabjan, Diego
AU - Jonnalagadda, Siddhartha Reddy
N1 - Publisher Copyright: © 2016 JMIR Publications Inc. All Rights Reserved.
PY - 2016/7
Y1 - 2016/7
N2 - Background: Community-based question answering (CQA) sites play an important role in addressing health information needs. However, a significant number of posted questions remain unanswered. Automatically answering the posted questions can provide a useful source of information for Web-based health communities. Objective: In this study, we developed an algorithm to automatically answer health-related questions based on past questions and answers (QA). We also aimed to understand which information embedded within Web-based health content provides good features for identifying valid answers. Methods: Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA. To rank these candidates, we implemented a semi-supervised learning algorithm that extracts the best answer to a question. We assessed this approach on a curated corpus from Yahoo! Answers and compared it against a rule-based string similarity baseline. Results: On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. Unified Medical Language System-based (health-related) features used in the model enhance the algorithm's performance by approximately 8%. A reasonably high rate of accuracy is obtained given that the data are considerably noisy. Important features distinguishing a valid answer from an invalid answer include text length, the number of stop words contained in a test question, the distance between the test question and other questions in the corpus, and the number of overlapping health-related terms between questions. Conclusions: Overall, our automated QA system based on historical QA pairs is shown to be effective according to the dataset in this case study. It was developed for general use in the health care domain and can also be applied to other CQA sites.
AB - Background: Community-based question answering (CQA) sites play an important role in addressing health information needs. However, a significant number of posted questions remain unanswered. Automatically answering the posted questions can provide a useful source of information for Web-based health communities. Objective: In this study, we developed an algorithm to automatically answer health-related questions based on past questions and answers (QA). We also aimed to understand which information embedded within Web-based health content provides good features for identifying valid answers. Methods: Our proposed algorithm uses information retrieval techniques to identify candidate answers from resolved QA. To rank these candidates, we implemented a semi-supervised learning algorithm that extracts the best answer to a question. We assessed this approach on a curated corpus from Yahoo! Answers and compared it against a rule-based string similarity baseline. Results: On our dataset, the semi-supervised learning algorithm has an accuracy of 86.2%. Unified Medical Language System-based (health-related) features used in the model enhance the algorithm's performance by approximately 8%. A reasonably high rate of accuracy is obtained given that the data are considerably noisy. Important features distinguishing a valid answer from an invalid answer include text length, the number of stop words contained in a test question, the distance between the test question and other questions in the corpus, and the number of overlapping health-related terms between questions. Conclusions: Overall, our automated QA system based on historical QA pairs is shown to be effective according to the dataset in this case study. It was developed for general use in the health care domain and can also be applied to other CQA sites.
KW - Consumer health informatics
KW - Machine learning
KW - Natural language processing
KW - Question answering
KW - Web-based health communities
UR - http://www.scopus.com/inward/record.url?scp=85040527238&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85040527238&partnerID=8YFLogxK
U2 - 10.2196/medinform.5490
DO - 10.2196/medinform.5490
M3 - Article
C2 - 27485666
AN - SCOPUS:85040527238
SN - 2291-9694
VL - 4
JO - JMIR Medical Informatics
JF - JMIR Medical Informatics
IS - 3
M1 - e24
ER -