Date of Defense
March 31, 2026, 3:00 PM
Location
Microsoft Teams
Document Type
Thesis Defense
Degree Name
Master of Science in Software Engineering
College
CIT
Department
Computer Science and Software Engineering
First Advisor
Dr. Yasir Mahmood
Keywords
Coding Agent, Pull Requests, Large Language Models (LLM), Human-Computer Interaction (HCI), Prompt Engineering, Software Engineering.
Abstract
This thesis investigates the real-world behavior of LLM-driven coding agents that generate code changes and submit pull requests (PRs) to public software repositories. As these tools evolve from autocomplete-style assistants into more autonomous agents, their contributions increasingly interact with socio-technical review processes (human reviewers, bots, CI/CD gates, and project norms). The thesis focuses on understanding why agent-generated PRs are accepted or rejected and what these outcomes reveal about current agent limitations in practical development workflows. Its main objective is to systematically characterize rejection patterns and failure modes of agent-generated pull requests in real repositories. Specifically, the thesis aims to (i) quantify acceptance vs. rejection trends across agents and time, (ii) derive a structured taxonomy of rejection reasons grounded in reviewer comments and repository signals, and (iii) translate the empirical findings into actionable recommendations for practitioners and for designers of more reliable, reflexive coding agents. The research employs an empirical software engineering methodology combining large-scale data mining, quantitative analysis, and qualitative labeling. We collect and curate agent-generated PRs and their associated artifacts (review comments, CI outcomes, and repository context), then conduct longitudinal and comparative analyses across agents and deployment settings. To operationalize failure-mode identification, we construct a hierarchical taxonomy of rejection reasons from observed review and gatekeeping signals, supported by systematic coding procedures and statistical validation (e.g., effect-size estimates and regression-based analyses) to relate rejection outcomes to agent-, repository-, and process-level factors.
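The effect-size estimation mentioned above can be illustrated with a minimal sketch. The thesis does not specify its statistical tooling; the example below assumes a simple 2x2 contingency comparison (rejected vs. accepted PRs for two agents) and computes an odds ratio with a Wald confidence interval. The counts are hypothetical, not results from the study.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with ~95% Wald CI from a 2x2 table:
    a = rejected PRs (agent 1), b = accepted PRs (agent 1),
    c = rejected PRs (agent 2), d = accepted PRs (agent 2)."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) via the Wald approximation
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)

# Hypothetical counts: agent 1 has 120 rejected / 380 accepted PRs,
# agent 2 has 60 rejected / 440 accepted PRs.
or_, (lo, hi) = odds_ratio_ci(120, 380, 60, 440)
print(f"OR = {or_:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

An odds ratio above 1 with a confidence interval excluding 1 would indicate that the first agent's PRs face significantly higher rejection odds; regression-based analyses extend this idea by adjusting for repository- and process-level covariates.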
The study shows that agent-generated PR rejections are driven by a mix of technical defects (e.g., logic errors, inadequate tests, CI failures) and socio-technical misalignments (e.g., spec/intent mismatch, design fit, policy and maintainability concerns). The results further reveal that rejection behavior differs across agents and evolves over time, suggesting interactions between agent upgrades, developer learning, and repository guardrails. The derived taxonomy exposes recurring, high-impact rejection families that explain a substantial portion of negative PR outcomes and provide interpretable evidence for why current agents fail under real review constraints. A significant contribution of this work is a practitioner-oriented taxonomy of rejection reasons for agent-generated pull requests, grounded in real repository feedback and designed to be reusable for future measurement and benchmarking. Additionally, the thesis provides comparative evidence on how rejection patterns vary by agent and operational context and distills these findings into design recommendations for more reflexive agents (e.g., stronger spec alignment checks, test-aware patching, CI-informed iteration, and repository-norm adaptation). This research addresses the lack of large-scale, empirically grounded explanations for why agent-generated PRs fail in real-world repositories. By moving beyond purely benchmark-centric evaluations and linking PR outcomes to concrete reviewer and gatekeeping signals, the thesis provides both an explanatory framework and practical guidance for improving the reliability, acceptance, and governance of autonomous coding agents in modern software engineering.
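The qualitative labeling step described above can be sketched as a rule-based first pass over reviewer comments. This is an assumption-laden illustration: the family names and cue phrases below are hypothetical stand-ins, not the thesis's actual taxonomy, and the study's systematic coding procedure relies on human annotators rather than keyword matching.

```python
# Hypothetical rejection-reason families and cue phrases (illustrative only;
# a real study would derive these via systematic human coding).
FAMILIES = {
    "ci_failure": ("ci failed", "build is red", "tests failing"),
    "spec_mismatch": ("not what the issue asks", "doesn't address", "off-spec"),
    "test_quality": ("no tests", "add a test", "coverage"),
    "maintainability": ("hard to read", "style guide", "refactor"),
}

def label_comment(comment: str) -> list[str]:
    """Return every family whose cue phrases appear in the comment."""
    text = comment.lower()
    hits = [fam for fam, cues in FAMILIES.items()
            if any(cue in text for cue in cues)]
    return hits or ["unlabeled"]

print(label_comment("Please add a test; CI failed on lint as well."))
```

A pass like this can pre-sort comments for human coders; disagreements and "unlabeled" cases are where the hierarchical taxonomy is refined.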
Title
SOFTWARE ENGINEERING IN THE AGE OF CODING AGENTS: FAILURE MODES AND REJECTION PATTERNS