Abstract
Large Language Models (LLMs) excel at software development, but can they troubleshoot post-deployment failures? This talk explores the limitations of how we evaluate LLMs for Root Cause Analysis (RCA) in software systems.
Our study reveals that existing RCA benchmarks are too simple, allowing basic rule-based methods to outperform state-of-the-art models. To address this, we introduce OpenRCA, a benchmark dataset and evaluation framework for assessing LLMs' RCA ability, which shows substantial room for model improvement. In addition, through step-wise causal process supervision, we find that even top LLMs often guess the correct root cause despite following entirely flawed reasoning paths. Finally, we discuss the transition towards agentic software engineering, outlining future research directions such as building dynamic benchmarks and enhancing process-level reasoning via self-play.
About the speaker
Dr. Pinjia He is an Assistant Professor at The Chinese University of Hong Kong, Shenzhen. His research interests include software engineering, AI for SE, large language models, and trustworthy AI. He has published 70+ papers in top-tier conferences and journals such as ICSE, FSE, ICLR, NeurIPS, and CSUR. He received the IEEE TCSE Rising Star Award and the IEEE Open Source Software Services Award. His work has been cited over 9,000 times according to Google Scholar, and the open-source projects he leads have been starred 7,000+ times on GitHub and downloaded 100K+ times by 450+ organizations.
