Date of Award

3-2026

Document Type

Thesis

DOI

https://doi.org/10.83048/wnwc-sx73

Degree Name

Master of Science in Software Engineering

Department

Computer Science and Software Engineering

First Advisor

Dr. Yasir Mahmood

Abstract

This thesis investigates the real-world behavior of LLM-driven coding agents that generate code changes and submit pull requests (PRs) to public software repositories. As these tools evolve from autocomplete-style assistants into more autonomous agents, their contributions increasingly interact with socio-technical review processes (human reviewers, bots, CI/CD gates, and project norms). The thesis focuses on understanding why agent-generated PRs are accepted or rejected and what these outcomes reveal about current agent limitations in practical development workflows.

The main objective of this thesis is to systematically characterize rejection patterns and failure modes of agent-generated pull requests in real repositories. Specifically, the thesis aims to (i) quantify acceptance vs. rejection trends across agents and time, (ii) derive a structured taxonomy of rejection reasons grounded in reviewer comments and repository signals, and (iii) translate the empirical findings into actionable recommendations for practitioners and for designers of more reliable, reflexive coding agents.

The research employs an empirical software engineering methodology combining large-scale data mining, quantitative analysis, and qualitative labeling. We collect and curate agent-generated PRs and their associated artifacts (review comments, CI outcomes, and repository context), then conduct longitudinal and comparative analyses across agents and deployment settings. To operationalize failure-mode identification, we construct a hierarchical taxonomy of rejection reasons from observed review and gatekeeping signals, supported by systematic coding procedures and statistical validation (e.g., effect-size estimates and regression-based analyses) to relate rejection outcomes to agent-, repository-, and process-level factors.

The study shows that agent-generated PR rejections are driven by a mix of technical defects (e.g., logic errors, inadequate tests, CI failures) and socio-technical misalignments (e.g., spec/intent mismatch, design fit, policy and maintainability concerns). The results further reveal that rejection behavior differs across agents and evolves over time, suggesting interactions between agent upgrades, developer learning, and repository guardrails. The derived taxonomy exposes recurring, high-impact rejection families that explain a substantial portion of negative PR outcomes and provide interpretable evidence for why current agents fail under real review constraints.

A significant contribution of this work is a practitioner-oriented taxonomy of rejection reasons for agent-generated pull requests, grounded in real repository feedback and designed to be reusable for future measurement and benchmarking. Additionally, the thesis provides comparative evidence on how rejection patterns vary by agent and operational context, and distills these findings into design recommendations for more reflexive agents (e.g., stronger spec alignment checks, test-aware patching, CI-informed iteration, and repository-norm adaptation).

This research addresses the lack of large-scale, empirically grounded explanations for why agent-generated PRs fail in real-world repositories. By moving beyond purely benchmark-centric evaluations and linking PR outcomes to concrete reviewer and gatekeeping signals, the thesis provides both an explanatory framework and practical guidance for improving the reliability, acceptance, and governance of autonomous coding agents in modern software engineering.

Arabic Abstract (English translation)

Software Engineering in the Age of Coding Agents: Failure Modes and Rejection Patterns

This thesis investigates the real-world behavior of coding agents based on large language models (LLMs) that generate code changes and submit pull requests to public software repositories. As these tools evolve from autocomplete-style assistants into more autonomous agents, their contributions increasingly interact with socio-technical review processes involving human reviewers, bots, continuous integration/delivery (CI/CD) gates, and project standards and norms. The thesis focuses on understanding why agent-generated pull requests are accepted or rejected, and what these outcomes reveal about the limits of current agents within actual software development workflows.

The main objective of this thesis is to systematically characterize rejection patterns and causes of failure in agent-generated pull requests within real repositories. More precisely, the thesis aims to: (i) measure acceptance versus rejection trends across agents and over time, (ii) derive a structured taxonomy of rejection reasons based on reviewer comments and repository signals, and (iii) turn the empirical results into practical recommendations for practitioners and for designers of more reliable and "reflexive" coding agents.

The study adopts an empirical software engineering methodology that combines large-scale data mining, quantitative analysis, and qualitative labeling and coding. We collect and clean agent-generated pull requests and their associated artifacts, such as review comments, CI outputs, and repository context, then conduct longitudinal analyses and comparisons across agents and deployment modes. To operationalize failure identification, we build a hierarchical taxonomy of rejection reasons from review signals and acceptance gates, supported by systematic coding procedures and statistical validation (such as effect-size estimates and regression models) to relate rejection outcomes to factors at the agent, repository, and process levels.

The study shows that the rejection of agent-generated pull requests results from a mix of technical defects (such as logic errors, weak tests, and CI failures) and socio-technical misalignment (such as specification/intent mismatch, poor design fit, and policy and maintainability concerns). The results also reveal that rejection behavior differs between agents and changes over time, indicating an interaction among agent upgrades, developer learning, and repository governance mechanisms and guardrails. The derived taxonomy highlights recurring, high-impact rejection families that explain a large part of the negative outcomes and provide clear explanatory evidence for why current agents falter under real-world review constraints.

A central contribution of this work is a practical, practitioner-oriented taxonomy of rejection reasons for agent-generated pull requests, built on feedback from real repositories and designed to be reusable for future measurement and comparison. In addition, the thesis provides comparative evidence on how rejection patterns differ by agent and operational context, and distills design recommendations for more "reflexive" agents, such as strengthening specification/intent alignment checks, generating test-aware patches, iterating on CI results, and adapting to repository norms.

This study addresses the lack of large-scale empirical evidence and explanations for why agent-generated pull requests fail in real repositories. By going beyond evaluations that rely on benchmarks alone, and linking pull request outcomes to concrete signals from reviewers and acceptance gates, the thesis offers an explanatory framework and practical guidance for improving the reliability, acceptance, and governance of autonomous coding agents in modern software engineering.

Recommended Citation

Hindi, Mahd Mohd, "SOFTWARE ENGINEERING IN THE AGE OF CODING AGENTS: FAILURE MODES AND REJECTION PATTERNS" (2026). Theses. 1428.
https://scholarworks.uaeu.ac.ae/all_theses/1428

Included in

Software Engineering Commons

Theses

SOFTWARE ENGINEERING IN THE AGE OF CODING AGENTS: FAILURE MODES AND REJECTION PATTERNS