
FORGE - AI-augmented SDLC

14 min read · Updated May 4, 2026

FORGE (Framework for Orchestrated, Reviewed, & Governed Engineering) governs how AI participates in every phase of software delivery, from requirements through deployment to maintenance, without trading long-term quality for short-term velocity.

TL;DR

Every phase of the SDLC can now be augmented by AI agents: requirements generation, architecture proposals, code writing, testing, and deployment. But augmentation without governance creates a new class of invisible technical debt: code that ships fast, passes tests, and quietly rots. This framework defines where AI accelerates delivery, where humans must remain in the loop by design, and how to build the quality gates between them.

I apply this daily in my current role at Caylent, where AI-native delivery is the methodology, not a layer on top.


The problem

Three things break when teams adopt AI coding tools without rethinking how their SDLC absorbs the output.

The velocity illusion

The first month feels transformative. Pull requests multiply. Lines of code spike. Cycle time shrinks. Leadership starts quoting throughput numbers in all-hands meetings. Then, quietly, the curve flattens.

A Carnegie Mellon University study tracked 807 GitHub repositories that adopted Cursor and compared them against 1,380 matched control repositories over a sixteen-month period. The pattern was consistent: an initial surge in output (lines of code jumped roughly threefold and commits rose about 55% in the first month), followed by a return to baseline by month three. The velocity spike was real but temporary. What persisted was the damage: static analysis warnings increased by 30% and code complexity rose by 41% in adopting repositories, and neither metric recovered after the speed gains faded. The study’s authors describe a feedback loop: the technical debt accumulated during the velocity spike dampened subsequent development velocity, leaving teams slower than they would have been without AI assistance.

Carnegie Mellon / Cursor study

This isn’t an argument against AI coding tools. It’s an argument against adopting them without changing the system they feed into.

Invisible debt accumulation

Traditional technical debt is a conscious tradeoff. A team takes a shortcut, knows they took it, and accepts the cost of fixing it later. AI-generated debt is different: developers often don’t realize they’re taking it on.

GitClear analyzed 211 million changed lines of code across five years (2020–2024) and found structural shifts in how codebases are evolving. Duplicated code blocks increased eightfold in 2024. Refactoring, the practice of consolidating and improving existing code, collapsed from 25% of all changes in 2021 to under 10% in 2024. Code churn, defined as lines reverted or substantially revised within two weeks of being written, climbed from 3.1% in 2020 to 5.7% in 2024. For the first time, copy-pasted lines exceeded moved lines, suggesting that developers are reaching for quick insertions rather than reusing existing modules.

GitClear 2025 code quality research

A separate large-scale study published in March 2026 tracked 302,600 AI-authored commits across 6,299 GitHub repositories and five major AI coding assistants. The researchers identified 484,366 distinct issues introduced by AI commits. Code smells accounted for 89% of them: the kind of issue that doesn’t break anything immediately but compounds over time. More than 15% of commits from every AI assistant introduced at least one issue. And critically, when the researchers tracked each issue forward to the latest repository snapshot, 24.2% of AI-introduced issues were still present and unresolved. The cumulative count of surviving issues exceeded 100,000 by February 2026 and was climbing. For correctness and security issues specifically, AI commits introduced roughly 1.5 times as many problems as they fixed.

Debt Behind the AI Boom

The pattern is clear: AI tools make it easy to produce code that works, and nearly as easy to produce code that quietly degrades the system around it.

The governance vacuum

Most teams bolt AI tools onto an existing SDLC without rethinking the gates. Code review processes designed for human output, where a developer might submit a few hundred lines in a thoughtful pull request (PR) for peer review, can’t scale to agent output, where thousands of lines arrive in a single PR. The traditional model of a developer reviewing every line doesn’t hold at that volume. But removing human oversight entirely is reckless.

Stack Overflow’s 2025 Developer Survey captures the tension: 84% of developers are using or planning to use AI tools, and 51% of professional developers use them daily. But trust is eroding, not building. Only 29% of developers said they trust AI output, down eleven percentage points from the prior year. The biggest frustration, cited by 66% of respondents, is AI solutions that are almost right but not quite — close enough to accept, wrong enough to cause problems downstream. And 45% of professional developers rated AI tools as bad or very bad at handling complex tasks.

Stack Overflow 2025 Developer Survey

The gap between adoption and trust is a governance problem, not a tooling problem. Teams are using AI faster than they’re building the structures to catch its mistakes.

You need an AI-augmented SDLC not because AI assistance is optional (at this point, it isn’t) but because ungoverned AI assistance is worse than no assistance at all. It creates the illusion of progress while building a codebase nobody fully understands.

The model

The core idea is straightforward: every phase of the SDLC now has three explicit layers, and FORGE is the contract between them.

Layer 1: AI execution. What AI does autonomously within defined guardrails: generating stories from briefs, proposing architecture patterns, writing code from specs, producing test cases, validating pipelines, triaging alerts.

Layer 2: Human judgment. What humans own: validating business intent, making architectural tradeoffs, reviewing AI output for coherence with the broader system, approving production deployments, deciding what not to build.

Layer 3: The gate between them. The mechanism by which work passes from AI to human and back. The gate is where governance lives. Get this wrong and you get either rubber-stamped AI output (fast, fragile) or human bottlenecks on everything (slow, unsustainable).
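To make the contract concrete, here is a minimal sketch of the three layers as a data structure. This is not FORGE tooling; the names (`GateType`, `PhaseContract`) and the example entries are my own illustration.

```python
# Illustrative sketch only: FORGE is a process framework, not a library.
from dataclasses import dataclass
from enum import Enum

class GateType(Enum):
    DETERMINISTIC = "deterministic"  # linters, tests, type checks
    PROBABILISTIC = "probabilistic"  # AI review, conformance analysis
    HUMAN = "human"                  # judgment, approval

@dataclass
class PhaseContract:
    """One SDLC phase expressed as the three FORGE layers."""
    phase: str
    ai_executes: list[str]  # Layer 1: autonomous within guardrails
    humans_own: list[str]   # Layer 2: judgment and accountability
    gates: list[GateType]   # Layer 3: where governance lives

implementation = PhaseContract(
    phase="implementation",
    ai_executes=["write code from specs", "generate unit tests"],
    humans_own=["architectural fit", "merge approval"],
    gates=[GateType.DETERMINISTIC, GateType.PROBABILISTIC, GateType.HUMAN],
)
```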

The gate taxonomy

Not all gates are equal. FORGE defines three types, each with a different cost, volume, and purpose.

Deterministic gates are binary. Compiler, linter, type checker, test suite, security scanner. The code either passes or it doesn’t. These gates are cheap, fast, and should be ruthlessly automated. Every piece of AI-generated output should clear deterministic gates before a human ever sees it. Think of these as the floor: the minimum bar that filters out mechanical errors so human attention isn’t wasted on things a machine can catch.

Probabilistic gates are AI-powered quality checks. AI-assisted code review, semantic duplication detection, architectural conformance analysis, complexity scoring. These catch what deterministic gates miss: not syntactic errors, but structural problems. A function that passes all tests but duplicates logic from three other files. A refactor that works but violates the project’s architectural decision records. Probabilistic gates reduce the surface area humans need to review, but they are not final authority. They’re the filter that makes human review tractable at AI-output volume.

Human gates are judgment calls. Does this feature align with what the customer actually needs? Does this architecture hold up under the constraints we haven’t documented? Should we build this at all? Human gates are expensive and slow by design. The goal is to reduce their volume (fewer things for humans to review) while increasing their importance: the things that reach a human are the ones that genuinely require human context, domain knowledge, and accountability.

The key principle: as you move from deterministic to probabilistic to human gates, cost per review increases and volume decreases. A well-governed SDLC pushes as much as possible through the cheaper gates so the expensive ones are reserved for the decisions that actually matter.
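Operationally, the principle looks something like the sketch below: run gates in cost order, and let only what clears the cheap gates consume expensive attention. The gate functions here are placeholders, not real tool integrations.

```python
# Hedged sketch: the gate functions are placeholders for real tools.
from typing import Callable

GateFn = Callable[[str], bool]  # takes an artifact (e.g., a diff), returns pass/fail

def run_gates(artifact: str,
              deterministic: list[GateFn],
              probabilistic: list[GateFn]) -> str:
    # Cheap, binary gates first: a human should never see mechanical errors.
    if not all(gate(artifact) for gate in deterministic):
        return "rejected: mechanical failures (lint, types, tests)"
    # Probabilistic gates shrink the review surface; they are not final authority.
    flags = [gate.__name__ for gate in probabilistic if not gate(artifact)]
    if flags:
        return f"routed to human review with flags: {flags}"
    # Whatever survives still gets human judgment on intent and architectural fit.
    return "queued for human gate"
```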

[Figure: gate cost and volume tradeoff. Deterministic gates (linters, tests, types) sit at the low-cost, high-volume end; probabilistic gates (AI review, analysis) in the middle; human gates (judgment, approval) at the high-cost, low-volume end.]

Phase-by-phase breakdown

FORGE applies the three layers and the gate taxonomy to seven SDLC phases. Here’s the logic behind each, with a gate-density sketch after the list.

Requirements and discovery. The spec is the primary artifact. AI drafts it; humans own it. No deterministic check can tell you whether a feature should exist; this gate is always human.

Architecture and design. AI can propose three valid approaches. Only a human who knows the team, the timeline, and the business constraints can choose between them. The architectural decision record is a human document.

Implementation. All three gate types converge here because volume and risk are both high. The human gate at implementation should be narrower than you think; if requirements and architecture were governed well, most implementation decisions are already constrained.

Code review. AI review handles the mechanical checks at scale so the human reviewer can focus entirely on intent, coherence, and architectural fit. Volume shifts; importance doesn’t.

Testing. One of the phases where AI augmentation has the highest ROI and lowest risk, because tests are inherently verifiable. Human gates stay light here by design.

CI/CD and deployment. No AI should autonomously push to production in an environment with real users and real consequences. This is one of the few places where the human gate should stay heavy regardless of team maturity.

Maintenance and observability. Auto-remediation without human approval is tempting and dangerous; save it for well-understood, low-risk scenarios with extensive rollback coverage.
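One way to capture that uneven distribution is as explicit configuration. The density levels below are my illustration, not FORGE doctrine; tune them to your team and risk profile.

```python
# Illustrative defaults, not prescribed by FORGE. "heavy" = mandatory and
# blocking, "light" = narrow or sampled, "none" = not applicable.
GATE_DENSITY = {
    # phase:           (deterministic, probabilistic, human)
    "requirements":    ("none",  "light", "heavy"),  # should it exist? always human
    "architecture":    ("none",  "light", "heavy"),  # the ADR is a human document
    "implementation":  ("heavy", "heavy", "light"),  # narrow if upstream was governed
    "code_review":     ("heavy", "heavy", "heavy"),  # human focus: intent, not syntax
    "testing":         ("heavy", "light", "light"),  # tests are inherently verifiable
    "cicd_deployment": ("heavy", "light", "heavy"),  # no autonomous pushes to prod
    "maintenance":     ("heavy", "light", "heavy"),  # auto-remediation needs approval
}
```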

Spec-driven development as the connective tissue

FORGE works because the spec, not the code, becomes the primary artifact. Architecture decisions, coding standards, domain glossary, threat models, and interface contracts are maintained as versioned, machine-readable documents. AI agents read from this context layer when generating output. Humans write and maintain it.

This prevents the most common agent failure mode: hallucinating patterns because the agent lacks project context. An AI coding tool working from a blank context will produce generic, plausible output that may or may not fit your system. The same tool working from a rich spec produces output that’s constrained by reality.

Spec-driven development isn’t a nice-to-have addition to the framework. It’s the input that makes the gate taxonomy work.

Deterministic gates verify against the spec. Probabilistic gates analyze against the spec. Human gates evaluate against the spec. Without the spec, every gate is operating in the dark.
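As a toy illustration of the spec feeding every gate type, consider the sketch below. The spec format is invented for this example; real teams might use ADR files, JSON Schema, OpenAPI, or plain markdown. The point is that all three gates consume the same versioned document.

```python
# The spec format below is invented for illustration.
SPEC = {
    "version": "2026-05-04",
    "forbidden_imports": ["requests"],  # e.g., the project standardized on httpx
    "max_function_lines": 60,
    "glossary": {"tenant": "a paying organization, never an end user"},
}

def deterministic_gate(source: str) -> bool:
    """Verify against the spec: binary and fully automatable."""
    return not any(f"import {mod}" in source for mod in SPEC["forbidden_imports"])

def probabilistic_gate_prompt(diff: str) -> str:
    """Analyze against the spec: the same document grounds the AI reviewer."""
    return (f"Review this diff for conformance with project spec "
            f"version {SPEC['version']}:\n{SPEC}\n\nDIFF:\n{diff}")

# The human gate evaluates against the same spec: does the change still serve
# the documented intent, constraints, and vocabulary?
```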

When to use it

FORGE fits when your team has adopted or is actively adopting AI coding assistants and needs to formalize how AI output enters the codebase. It’s particularly relevant if you’re building a new practice or team from scratch and want to design the SDLC around AI from day one, rather than bolting governance onto existing habits. It also applies when you’re seeing early signs of the velocity illusion: throughput metrics are up, but defect rates, code churn, or review bottlenecks are rising in parallel.

When not to use it

Skip the formal framework if your team is one to three developers working on a single product. At that scale, the gate overhead exceeds the benefit. Don’t force this model onto highly regulated environments where the SDLC is prescribed by compliance standards; augment within the existing mandated framework rather than replacing it. And don’t impose governance before adoption: if your team isn’t using AI tools at all yet, start with tooling, training, and experimentation before introducing process structure. Governance that arrives before the thing it governs is just bureaucracy.

Anti-patterns

“AI writes it, AI reviews it”

Using one AI tool to generate code and a different AI tool to review it, with no human in the loop, creates a closed system where both sides share the same architectural blind spots. Two AI models agreeing that a piece of code is fine tells you less than you think; they may be confident about the same things and blind to the same things.

In my current and previous roles, the most serious delivery issues were rarely caused by code that failed to compile. They came from choices that looked reasonable in isolation but did not fit the architecture, the operating model, or the client’s constraints. AI review can miss that because it often reviews the artifact, not the delivery context around it.

The rubber-stamp merge

AI generates a 400-line PR. The developer glances at it, sees green CI checks, clicks approve. This is the most common failure mode, and it’s how invisible debt accumulates. The deterministic gates passed; the code compiles, the tests are green, the linter is satisfied. But nobody checked whether the code should exist at all. The fix: human gates should review intent and architecture, not correctness. If you’re spending your review time checking syntax, your deterministic gates are too weak.

I have seen many teams mistake activity for progress. A large PR, a passing pipeline, and a fast merge can create a false sense of control. In consulting and client delivery, that debt shows up later as missed dates, rework, and hard conversations with stakeholders. The merge is not the finish line. It is a commitment that the team can support what it just added.
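A merge policy can encode that commitment. Here is a hedged sketch; the field names and the 400-line threshold (mirroring the example above) are invented for illustration.

```python
# Sketch only: field names and thresholds are invented, not a real API.
from dataclasses import dataclass

@dataclass
class PullRequest:
    changed_lines: int
    ci_green: bool              # deterministic gates
    ai_review_clean: bool       # probabilistic gates
    human_intent_signoff: bool  # "should this code exist, and does it fit?"

def may_merge(pr: PullRequest, split_threshold: int = 400) -> tuple[bool, str]:
    if not (pr.ci_green and pr.ai_review_clean):
        return False, "fix mechanical issues before a human looks at this"
    if pr.changed_lines > split_threshold:
        return False, "split the PR: intent review is intractable at this size"
    if not pr.human_intent_signoff:
        return False, "green CI is necessary, not sufficient: review intent first"
    return True, "merge"
```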

Overindexing on speed metrics

Measuring PRs per week, lines shipped, or cycle time without paired quality metrics is how you optimize for fast garbage. AI will optimize whatever you measure. Add AI-specific indicators: percentage of AI-generated code that survives two weeks without revision, duplication rate in AI-authored commits, and the ratio of AI-introduced issues to AI-resolved issues.

I have learned that velocity only matters when it converts into value. I have been in situations where the team was moving fast on paper, but the real signal was somewhere else: the backlog was growing, senior engineers were being pulled into fixes, and the client was losing confidence. AI can make that pattern worse if leaders reward volume without asking what survived production use.
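The survival metric is easy to define once lines have been attributed to AI-authored commits (extracting that attribution from git history is its own project). A sketch of the definition, assuming the extraction has already happened:

```python
# Sketch of the metric definition; extraction from git history is assumed done.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Line:
    written_at: datetime
    ai_authored: bool
    revised_at: datetime | None  # when the line was next changed or deleted

def two_week_survival_rate(lines: list[Line]) -> float:
    """Share of AI-authored lines untouched for at least 14 days."""
    ai_lines = [ln for ln in lines if ln.ai_authored]
    survived = [ln for ln in ai_lines
                if ln.revised_at is None
                or ln.revised_at - ln.written_at >= timedelta(days=14)]
    return len(survived) / len(ai_lines) if ai_lines else 1.0
```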

Context starvation

Deploying AI agents against your codebase without providing a shared context layer (ADRs, coding standards, domain glossary, interface contracts, threat models) is asking them to generate from generic training data rather than from your project’s actual architecture. Context engineering isn’t a secondary concern; it’s the primary input that determines whether agent output is useful or harmful.

In enterprise environments, the hardest part is often not writing the code. It is understanding the environment around the code: legacy systems, integration limits, security, business ownership, and the parts of the architecture that exist for reasons not obvious in the repository. I have seen strong engineers lose time because that context lived in people’s heads. AI has the same problem, only faster and at larger scale.
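The mechanical part of the fix is making every agent invocation read from the same versioned context layer. A minimal sketch, with hypothetical file paths:

```python
# Illustrative sketch: the paths are hypothetical; the point is one versioned
# context layer that every agent call reads from.
from pathlib import Path

CONTEXT_DOCS = [
    "docs/adr",                    # architectural decision records
    "docs/coding-standards.md",
    "docs/domain-glossary.md",
    "docs/interface-contracts.md",
    "docs/threat-model.md",
]

def build_agent_context(repo_root: str) -> str:
    """Concatenate the context layer into the agent's prompt preamble."""
    parts = []
    for entry in CONTEXT_DOCS:
        path = Path(repo_root) / entry
        files = sorted(path.rglob("*.md")) if path.is_dir() else [path]
        parts.extend(f.read_text() for f in files if f.is_file())
    return "\n\n---\n\n".join(parts)
```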

Uniform gate density

Applying the same level of governance rigor to every SDLC phase is waste at one end and recklessness at the other. Requirements and architecture need heavy human gates because the decisions are strategic and difficult to reverse. Test generation needs mostly deterministic gates because tests are inherently verifiable. Match gate cost to decision stakes.

In my roles leading cloud and software delivery, I have seen governance fail in two ways. Some teams create too many gates and slow the work down. Others avoid gates until a major decision has already been made. The better pattern is selective control. Spend leadership attention where the decision has long-term impact. Automate the rest.

“We’ll add governance later”

This is the most damaging anti-pattern because it compounds silently. Teams start with ungoverned AI adoption, plan to add structure once things stabilize, and discover too late that the codebase is full of code that nobody fully understands. Governance is a day-one decision. The cost of adding it later isn’t just the process overhead; it’s the archaeological effort of understanding what the ungoverned period produced.

I have been part of enough escalations to know that “we will clean it up later” usually means “we are moving risk into the future.” In services work, that future arrives fast. It shows up during UAT, production readiness, security review, or the first support handoff. AI governance should start small, but it cannot start late.

The shift that matters

The tools will keep changing. The models will get better. The agents will become more autonomous. None of that changes the fundamental problem this framework addresses: someone has to own the judgment layer, and that ownership has to be designed into the process, not assumed.

The teams that will build durable software in this era aren’t the ones that adopt AI the fastest. They’re the ones that figure out where to slow down on purpose: where to insert the gate that catches the thing no test suite can see, where to keep a human in the loop not because the machine can’t do it but because the decision matters more than the speed.

Every framework eventually becomes a checklist if you let it. Don’t let it. The gate taxonomy, the phase-by-phase breakdown, the spec-driven context layer: these are starting points, not rules. Adapt them to your team, your domain, your risk tolerance.

The only part that isn’t negotiable is the principle underneath: the faster AI lets you move, the more deliberate your governance has to be.

Build fast. Gate deliberately. Own what ships.


Sources

  1. Zhao et al. (2025). “Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects.” Carnegie Mellon University. arxiv.org/html/2511.04427v2

  2. GitClear (2025). “AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones.” Analysis of 211 million changed lines of code, 2020–2024. gitclear.com

  3. Liu et al. (2026). “Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild.” arxiv.org/abs/2603.28592

  4. Stack Overflow (2025). “2025 Developer Survey.” survey.stackoverflow.co/2025