Date of Defense
28-10-2025 10:00 AM
Location
Room 0004, H4 Building
Document Type
Thesis Defense
Degree Name
Master of Science in Software Engineering
College
College of Information Technology
First Advisor
Salah Bouktif
Keywords
Artificial Intelligence, Large Language Models, Code Generation
Abstract
The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has opened up new possibilities for automating intricate programming tasks with greater accuracy. Although contemporary foundational models demonstrate promising results, researchers continue to explore optimal post-training strategies to enhance code quality, including supervised fine-tuning, retrieval-augmented generation (RAG), and debugging. In this thesis, I combine two widely used post-training approaches, namely (1) multi-agent collaboration and (2) debugging based on runtime execution information, to improve the functionality, reliability, and practical applicability of generated code. I perform an empirical study that extends the evaluation of both individual strategies and of their combined application. The study uses 19 LLMs to examine the performance of each strategy as well as their composition, offering comprehensive insights into how different post-training strategies influence code generation effectiveness. In particular, I implement a chained system that integrates both strategies and assess their combined impact on functional accuracy, code reliability, and generation latency using two benchmark datasets commonly used for code generation. My findings offer guidance for organizations seeking robust AI-driven coding solutions by helping them select models that adapt well to complex post-training strategies, ultimately fostering the adoption of more effective and reliable code generation technologies. This research addresses the lack of extensive evaluation of post-training techniques designed to enhance LLM-based code generation. By covering a broad range of LLMs, multiple approaches, and several evaluation dimensions, such as code accuracy, generation latency, and code rigor, I propose a comprehensive framework for combining LLM code generation techniques and evaluating them thoroughly.
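For illustration only, the following is a minimal sketch of the kind of chained pipeline described above: a multi-agent collaboration stage (a programmer agent drafts code, a reviewer agent critiques it) followed by a debugging loop driven by runtime execution information (test failures and tracebacks fed back to the model). This is not the thesis implementation; all names (call_llm, run_tests, chained_generation) are hypothetical placeholders, and call_llm must be replaced with an actual model client.

import traceback
from typing import Callable

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns generated code or feedback text."""
    raise NotImplementedError("plug in an actual model client here")

def run_tests(code: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code against unit tests, capturing runtime information."""
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the candidate function(s)
        exec(tests, namespace)  # assertions raise on failure
        return True, ""
    except Exception:
        return False, traceback.format_exc()

def chained_generation(task: str, tests: str, llm: Callable[[str], str],
                       max_debug_rounds: int = 3) -> str:
    # Stage 1: multi-agent collaboration (programmer drafts, reviewer critiques, programmer revises).
    draft = llm(f"Write a Python solution for:\n{task}")
    review = llm(f"Review this code for correctness and edge cases:\n{draft}")
    code = llm(f"Revise the code using the review.\nCode:\n{draft}\nReview:\n{review}")

    # Stage 2: debugging loop based on runtime execution information.
    for _ in range(max_debug_rounds):
        passed, trace = run_tests(code, tests)
        if passed:
            break
        code = llm(f"The code failed with this traceback:\n{trace}\nFix the code:\n{code}")
    return code

The latency dimension studied in the thesis would correspond here to the wall-clock cost of the additional review and debug-round calls relative to a single generation call.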
Title
ENHANCING LLM CODE GENERATION: A SYSTEMATIC EVALUATION OF MULTI-AGENT COLLABORATION AND RUNTIME DEBUGGING FOR IMPROVED ACCURACY, RELIABILITY, AND LATENCY