ENHANCING LLM CODE GENERATION: A SYSTEMATIC EVALUATION OF MULTI-AGENT COLLABORATION AND RUNTIME DEBUGGING FOR IMPROVED ACCURACY, RELIABILITY, AND LATENCY

Date of Defense

28-10-2025 10:00 AM

Location

Room 0004, H4 Building

Document Type

Thesis Defense

Degree Name

Master of Science in Software Engineering

College

College of Information Technology

First Advisor

Salah Bouktif

Keywords

Artificial Intelligence, Large Language Models, Code Generation

Abstract

The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has opened up new possibilities for automating intricate programming tasks with greater accuracy. Although contemporary foundational models demonstrate promising results, researchers continue to explore optimal post-training strategies to enhance code quality, including supervised fine-tuning, retrieval-augmented generation (RAG), debugging, and many others. In this thesis, I combine two such widely used post-training approaches, namely (1) multi-agent collaboration and (2) debugging based on runtime execution information, to improve the functionality, reliability, and practical applicability of generated code. I perform an empirical study that extends the evaluation of both individual strategies and their combined application. My study uses 19 LLMs to examine the performance of each strategy as well as their composition, offering comprehensive insights into how different post-training strategies influence code generation effectiveness. In particular, I implement a chained system that integrates both strategies to assess their combined impact on functional accuracy, code reliability, and generation latency using two benchmark datasets commonly used for code generation. My findings provide valuable insights for organizations seeking robust AI-driven coding solutions by guiding them in selecting models that adapt better to complex post-training strategies, ultimately fostering the adoption of more effective and reliable code generation technologies. This research addresses the lack of extensive evaluation of post-training techniques designed to enhance code generation with large language models. By covering a broad range of LLMs, a variety of approaches, and multiple dimensions for evaluating LLM-based approaches, such as code accuracy, generation latency, and code rigor, I propose a comprehensive framework for combining LLM code generation techniques and evaluating them thoroughly.
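The abstract describes a chained system in which multi-agent collaboration first produces candidate code and a debugging stage then refines it using runtime execution information. The following Python sketch illustrates one possible shape of such a chain; it is not the thesis implementation, and the llm callable, the agent prompts, and the max_debug_rounds parameter are hypothetical placeholders introduced only for illustration.

"""A minimal sketch (assumptions noted above) of a chained pipeline:
multi-agent collaboration followed by debugging driven by runtime
execution information."""

import traceback
from collections.abc import Callable


def run_tests(code: str, tests: str) -> str | None:
    """Execute candidate code together with its unit tests; return the
    traceback (runtime execution information) on failure, None on success."""
    namespace: dict = {}
    try:
        exec(code + "\n" + tests, namespace)  # illustration only; sandbox in practice
        return None
    except Exception:
        return traceback.format_exc()


def chained_generation(llm: Callable[[str], str], task: str, tests: str,
                       max_debug_rounds: int = 3) -> str:
    # Stage 1: multi-agent collaboration (a programmer agent drafts a
    # solution, a reviewer agent revises it).
    draft = llm(f"You are a programmer. Write Python code for this task:\n{task}")
    code = llm(f"You are a reviewer. Improve this solution to the task:\n{task}\n\n{draft}")

    # Stage 2: debugging loop that feeds runtime execution information
    # (the failing traceback) back to the model.
    for _ in range(max_debug_rounds):
        error = run_tests(code, tests)
        if error is None:
            break  # all tests pass; stop debugging
        code = llm(
            "The following code fails its tests with this runtime error:\n"
            f"{error}\nReturn a corrected version of the code.\n\n{code}"
        )
    return code

The two stages are kept decoupled in this sketch so that each strategy could also be run and measured in isolation, mirroring the per-strategy and combined comparisons the abstract describes.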
