TY - GEN
T1 - A Flash in the Pan
T2 - 31st International Conference on Computational Linguistics, COLING 2025
AU - de Carvalho, Gustavo Adolpho Lucas
AU - Benigeri, Simon
AU - Healey, Jennifer
AU - Bursztyn, Victor
AU - Demeter, David
AU - Birnbaum, Lawrence
N1 - Publisher Copyright: © 2025 Association for Computational Linguistics.
PY - 2025
Y1 - 2025
AB - Conversational Recommendation Systems (CRSs) are a particularly interesting application for out-of-the-box LLMs due to their potential for eliciting user preferences and making recommendations in natural language across a wide set of domains. Somewhat surprisingly, we find however that in such a conversational application, the more questions a user answers about their preferences, the worse the model's recommendations become. We demonstrate this phenomenon on a previously published dataset as well as two novel datasets which we contribute. We also explain why earlier benchmarks failed to detect this round-over-round performance loss, highlighting the importance of the evaluation strategy we use and expanding upon Li et al. (2023a). We also present preference elicitation and recommendation strategies that mitigate this degradation in performance, beating state-of-the-art results, and show how three underlying models, GPT-3.5, GPT-4, and Claude 3.5 Sonnet, differently impact these strategies. Our datasets and code are available at https://github.com/CtrlVGustavo/A-Flash-in-the-Pan-CRS.
UR - http://www.scopus.com/inward/record.url?scp=85218491757&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85218491757&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85218491757
T3 - Proceedings - International Conference on Computational Linguistics, COLING
SP - 8385
EP - 8398
BT - Main Conference
A2 - Rambow, Owen
A2 - Wanner, Leo
A2 - Apidianaki, Marianna
A2 - Al-Khalifa, Hend
A2 - Di Eugenio, Barbara
A2 - Schockaert, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 19 January 2025 through 24 January 2025
ER -