The Engineering Manager's Guide to Governing Agentic Development

Experienced developers using AI coding tools were 19% slower than developers working without them. Not juniors on unfamiliar codebases, where the deployment math runs the other way. Maintainers averaging five years on repositories with 22,000+ stars and over a million lines of code. METR ran the randomized controlled trial across 246 tasks. The measurement was clear.

The counterintuitive part: the developers didn't notice. Before the study, they predicted they'd be 24% faster. After completing every task, they still believed they'd been 20% faster. A 39-point gap between perception and measurement. And direct experience didn't close it.

Your team has Claude Code. Most of your engineers believe they're more productive with it. Some of them are right. The difference between consistent gains and an invisible slowdown isn't the tool itself. It's the workflow around the tool.

Fewer than one in five organizations have formal AI code governance policies, despite 99% of development teams using AI-generated code. Your team might have managed CLAUDE.md files and solid infrastructure. But if every engineer uses Claude Code their own way, with no shared workflow and no verification step, you have the same risk as the METR developers. A team that feels faster while the output tells a different story.

This isn't a training problem. I covered why more training doesn't fix adoption in an earlier post. It's a governance problem. The answer is not a policy document in Confluence. It is a reference workflow encoded into the tooling your team uses every day. This is the operational shape of production agentic delivery: the work is agentic, but every artifact passes through gates the manager owns.

The thesis: standardize the outcomes, not the keystrokes. Encode your non-negotiables into the tooling. Provide a reference workflow engineers can adapt. Stop writing policy documents nobody reads.

The Governance Gap Is Wider Than You Think

The data on ungoverned AI code is consistent across every source I've found. Only 18% of organizations have formal AI governance policies, while 99% of development teams use AI-generated code. That is an 81-point gap between shipping and governing. Most organizations are in it.

The quality consequences are measurable. GitClear analyzed 211 million changed lines of code and found code churn -- new code revised or reverted within two weeks -- nearly doubled, from 3.1% in 2020 to 5.7% in 2024. Duplicated code blocks rose eightfold in the same period.

The security picture is worse. Veracode's 2025 GenAI Code Security Report found AI-generated code introduced 2.74x more vulnerabilities than the baseline across the 100-plus models it tested. By mid-2025, Apiiro's analysis found AI code was adding more than 10,000 new security findings per month across studied repositories.

These are not projections. They are measurements from codebases where AI tools were adopted without workflow standards. The pattern: teams adopt AI coding tools, skip the governance step, and absorb a quality tax they don't see until it compounds.

If your team has Claude Code deployed with managed CLAUDE.md files but no standardized development workflow, you have solved the infrastructure problem. You have not solved the consistency problem. The 2026 DORA data found that AI amplifies existing conditions: teams with clear processes get stronger, teams without them see problems magnified. Governance is the variable that determines which category yours falls into.

Three Tiers of Standards: Mandated, Recommended, Discretionary

The mistake most teams make is binary thinking. Either govern everything (mandate a rigid process that every engineer must follow step-by-step) or govern nothing (trust everyone to figure it out). Both fail. Rigid processes kill the speed advantage that made AI coding tools worth adopting. No process at all creates the quality gap the research documents.

The framework that works separates standards into three tiers based on how they're enforced.

Mandated standards must happen on every task, every time, with zero trust required. You encode these into hooks and managed settings. Lint checks on every file write. Security scans before every commit. Test execution before every push. These fire automatically. The developer never has to remember them and can't skip them. A PreToolUse hook returning a deny decision blocks the action even in bypassPermissions mode. Enforcement by design, not by policy. The same principle scales to multi-agent diligence pipelines, where the artifact schema becomes the hook-equivalent. The AI Diligence Operating System makes the executive-tier argument.

Recommended standards are the reference workflow and conventions your team follows by default. You encode these into CLAUDE.md files, custom skills, and documented processes. The Plan-Implement-Verify-Review cycle I'll outline in the next section. Architecture conventions. Test coverage expectations. These are the path of least resistance, not the only path. An experienced engineer who has found a more effective personal workflow can adapt, as long as their output meets the mandated gates.

Discretionary standards are everything else. How the engineer interacts with Claude. Whether they use plan mode or conversation mode. Which surface they reach for first on a given task. Their personal CLAUDE.md preferences. Prompt style. These are not your business as a manager.

This three-tier separation resolves the core tension: "How do we standardize without micromanaging?" Mandate the outputs (hooks enforce quality gates automatically). Recommend the process (the reference workflow provides structure for engineers who want it). Leave the interaction to the individual.

Anthropic's 2026 Agentic Coding Trends Report documents what it calls the delegation gap: developers use AI in 60% of their work but fully delegate only 0-20% of tasks. I read that as a mandate for bounded autonomy: clear operational limits, escalation paths for high-stakes decisions, audit trails for everything. Engineers already self-limit. Your governance should match how they work, not fight it.

The Reference Workflow: Plan, Implement, Verify, Review

The Plan-Audit-Implement-Verify cycle gives individual developers a structured workflow. The governance question is different: how do you standardize that workflow across a team so every engineer produces consistent, reviewable output?

Here is the reference workflow I recommend. Four phases. Each produces a defined artifact and includes a governance gate.

Plan

Before Claude writes a line of code, the engineer reviews the task and breaks it into subtasks. For complex work, Claude proposes a plan using TodoWrite or plan mode. The engineer reviews it. Adjusts if needed. Then implementation begins.

This is the phase most ungoverned workflows skip. Developers jump straight from ticket to "Claude, build this." The METR study's acceptance rate tells you why that fails: developers used fewer than 44% of AI generations. More than half of what the AI produced was thrown away. Planning reduces that waste. Align Claude with the engineer's intent before code is written. Not after.

Governance gate: The plan exists. For high-risk changes (database migrations, auth changes, public API work), a second engineer reviews the plan first. Encode this in your project CLAUDE.md.

Implement

Claude works within the constraints defined by your project CLAUDE.md: coding conventions, architecture boundaries, test requirements. Hooks fire on every file write. Commits happen incrementally, not as one giant push at the end.

Governance gate: Hooks enforce lint, type checking, and security scanning automatically on every file operation. No engineer action required. The constraints are invisible until violated.

Verify

The engineer reviews the diff before pushing. Runs the full test suite. Then checks for AI-generated code patterns that slip past automated checks. Over-abstraction: extracting helpers for one-time operations. Phantom dependencies: importing packages that aren't installed. Unnecessary error handling for scenarios that can't happen. Defensive validation that duplicates what the framework already guarantees.

This phase separates teams getting consistent gains from teams accumulating hidden quality debt. The code compiled. The linter passed. But did the engineer read what was generated? Across the dozens of engineers I've onboarded to agentic workflows, verification is where the process breaks first. The speed feels so good that reviewing the diff feels like friction. One team I worked with took three weeks and two production bugs before diff review became non-negotiable. Both bugs had passed every automated check.

Governance gate: Tests must pass (hook-enforced). The human review of the diff is recommended, not automated. You cannot force someone to read their own code. You can build the expectation into the culture and surface gaps in code review.

Review

PR submitted with structured context. Claude Code can generate PR descriptions automatically, pulling context from the changed files, the commit history, and the project conventions. The reviewer focuses on three things: architecture decisions (does this approach fit the system?), business logic correctness (does it do the right thing?), and AI-specific anti-patterns (over-abstraction, duplicate code, pattern drift from the rest of the codebase).

Governance gate: PR approval required. Review checklist includes AI-specific items. And here is the feedback loop that makes the whole system improve over time: when a reviewer catches a recurring pattern, encode the correction in your project CLAUDE.md. Claude won't make that mistake in the next session.

The workflow is a recommendation. The gates are not. Hooks enforce the automated checks regardless of which workflow path the engineer takes. That distinction is the whole point.

Your Governance Stack: CLAUDE.md, Hooks, Skills, and Code Review

Claude Code's extension layer provides four governance primitives. Most teams use one or two. The advantage comes from using all four as a coordinated system, where each primitive handles a different type of governance need.

Put the wrong standard in the wrong primitive and you get either over-enforcement or under-enforcement. CLAUDE.md instructions that say "always run tests before committing" rely on Claude following the instruction. It usually does. Usually is not governance. That rule belongs in a hook.

Start with CLAUDE.md as your advisory layer. Project conventions, architecture decisions, the reference workflow. These files live in version control. Every Claude Code session loads the same project context. The .claude/rules/ directory scopes rules to file paths. API rules activate only when Claude touches src/api/**/*.ts. Frontend conventions apply only to React components. Relevant guidance, not noise.

Then add hooks as your enforcement layer. Anything that must happen every time goes here. Claude Code hooks fire at over two dozen lifecycle events. PreToolUse hooks block actions before they happen. PostToolUse hooks validate results after.

On one project, I configured 6 distinct hook types across a multi-agent pipeline running 10+ coordinated Claude Code sessions. Guard rails for data protection and recovery ran on every session. Zero manual oversight after the initial setup.

The one-way ratchet matters: hooks can deny actions, but they can never grant permissions beyond what settings allow. Governance tightens through hooks. It never loosens.

Use skills as your workflow layer. They encode repeatable processes. A PR creation skill that generates structured descriptions. A test-writing skill that follows your team's conventions. A deployment checklist that verifies every step. Skills use progressive disclosure: descriptions load at session start, full content loads only on invocation. Lightweight until needed. For the practitioner-level decision tree on where each directive belongs, see Where Does That Rule Go?.

Reserve code review for human judgment. Architecture decisions, business logic, and the patterns automated tools miss. Encode the AI-specific review checklist in your CLAUDE.md so reviewers know what to look for. The judgment itself requires a person.

"Won't This Kill the Speed Advantage?"

The strongest counter-argument comes from the same METR data I opened with. Developers were already 19% slower with AI tools. No governance overhead at all. Adding planning phases and review checklists could widen that gap for senior engineers who already know the codebase.

METR's February 2026 update makes this sharper. Recruitment became harder because developers increasingly refused to work without AI. The most AI-productive developers self-selected out of the study. Those productive developers likely found effective workflows through individual experimentation. Not standardization.

That is a fair point. It also misreads the thesis.

"Standardize the outcomes, not the keystrokes" means the mandated layer operates on output: did the code pass lint, did the tests run, did the security scan clear. These checks add milliseconds per file write. The recommended layer (reference workflow, CLAUDE.md conventions, skills) is a default, not a mandate. An experienced engineer who has found a faster path keeps it. Their output still clears the same gates as everyone else's.

The teams I've worked with that encoded governance into the tooling from the start didn't lose speed. One enterprise engineering organization hit a step-change in delivery acceleration on PI-level initiatives. Engineers produced 1,600 lines per day. Hooks, managed CLAUDE.md files, and structured workflows were running the entire time. The governance was the foundation for those gains, not a ceiling on them. Jellyfish data tells a similar story at the industry level: companies with 80-100% developer adoption see productivity gains exceeding 110%, but below 50% adoption the results are noise. Governance is what drives adoption from pockets to organization-wide.

The risk is not that governance slows your best engineers down. The risk is that no governance means only your best engineers get results while everyone else produces inconsistent output that erodes trust in the tooling.

Where to Start

The gap between "everyone has Claude Code" and "everyone ships consistently" closes when you encode standards into the tooling and provide a workflow engineers can adapt. Three tiers: mandate the quality gates, recommend the process, leave the interaction style alone. Four primitives: CLAUDE.md for conventions, hooks for enforcement, skills for workflows, code review for judgment.

You don't need to implement all of this in a week. Start with the mandated tier. Set up hooks for lint, type checking, and test execution. That's the highest-impact change: automated quality gates that apply to every engineer's output regardless of how they use Claude Code. Then document the reference workflow in your project CLAUDE.md. Then build the first skill (start with PR creation, since it touches every task). Layer it in.

If your team is starting from a previous AI rollout that didn't land, the re-engagement playbook for engineers who already tried AI is the prerequisite move. Governance lands cleanly only on a team that's willing to try the second attempt.

If you're working through this at your organization, the AI Readiness Assessment surfaces the governance and workflow gaps most teams overlook. Fifteen minutes of diagnostic before six months of misaligned rollout.

Or if your team needs help designing the governance stack and building the infrastructure that enforces it, book a 15-minute call and I'll walk through your specific situation.

Glossary terms used

Claude Code hook Production agentic delivery

The Engineering Manager's Guide to Governing Agentic Development

The Governance Gap Is Wider Than You Think

Three Tiers of Standards: Mandated, Recommended, Discretionary

The Reference Workflow: Plan, Implement, Verify, Review

Plan

Implement

Verify

Review

Your Governance Stack: CLAUDE.md, Hooks, Skills, and Code Review

"Won't This Kill the Speed Advantage?"

Where to Start

Claude Code Hooks in Production: The Gate the Model Doesn't Get to Skip

Claude Code Skills in Production: Two-Axed Discoverability and the Patterns That Make Skills Compound

Agentic AI Governance in Production: Who Owns the Bar When the Agent Ships

Continue reading: more in Roll out Claude

Half Your Team Is on Opus 4.6, Half Is on 4.7. The Problem Isn't the Model.

90 Minutes With 400 Engineers, PMs, and Ops on Claude Code: Here's What They Wanted to Know

Where Does That Rule Go? A Decision Tree for CLAUDE.md, Settings, Skills, and Hooks in Claude Code

Sources

The Governance Gap Is Wider Than You Think

Three Tiers of Standards: Mandated, Recommended, Discretionary

The Reference Workflow: Plan, Implement, Verify, Review

Plan

Implement

Verify

Review

Your Governance Stack: CLAUDE.md, Hooks, Skills, and Code Review

"Won't This Kill the Speed Advantage?"

Where to Start

Reference guides for this topic

Claude Code Hooks in Production: The Gate the Model Doesn't Get to Skip

Claude Code Skills in Production: Two-Axed Discoverability and the Patterns That Make Skills Compound

Agentic AI Governance in Production: Who Owns the Bar When the Agent Ships

Continue reading: more in Roll out Claude→

Half Your Team Is on Opus 4.6, Half Is on 4.7. The Problem Isn't the Model.

90 Minutes With 400 Engineers, PMs, and Ops on Claude Code: Here's What They Wanted to Know

Where Does That Rule Go? A Decision Tree for CLAUDE.md, Settings, Skills, and Hooks in Claude Code

Sources

Continue reading: more in Roll out Claude