Every pitch for an AI diligence operating system names the same six agents. Ingestion. Synthesis. Memo generation. Red-flag detection. Source verification. Follow-up questioning. What the pitches omit: the six agents are only as reliable as the memo schema they write against. The schema is the architectural decision your AI hire is being authorized to make. Not the agents themselves.
A UC Berkeley team presenting at NeurIPS 2025 analyzed 1,642 multi-agent execution traces and measured failure rates from 41% to 86.7%. The four most common failure modes split between two diagnoses. Three describe the same root problem in different surface forms: step repetition (15.7%) is the agent not knowing it already did this work, termination unawareness (12.4%) is the agent not knowing when it's finished, and specification disobedience (11.8%) is the agent not knowing what it was supposed to produce. The fourth, reasoning-action mismatch (13.2%), is a different failure mode entirely: the agent reasons one way and acts another, which needs runtime validation, not just a schema. Schema-first handles the first three. The fourth is what the validator layer is for. For the boutique private-equity, commercial diligence, or strategy firm hiring a Claude or MCP architect, this is the most consequential decision your architect will make in the first thirty days.
What an AI diligence operating system actually is
An AI diligence operating system is a memo schema with subagents attached. The schema is the artifact graph: every section of the final memo, every required field type, every evidence-tier expectation, every piece of metadata partners need to trust the output. Subagents are functions over that graph. Each one owns the production or verification of specific fields.
A reasonable objection: our partners already use a memo template (management assessment, market sizing, deal-specific red flags, capital efficiency). Isn't that the schema? No, that's the raw material. A schema is machine-checkable, with field types and evidence-tier rules a validator runs before the next subagent fires.
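To make "machine-checkable" concrete, here is a minimal Python sketch rather than a prescribed format. The `FieldSpec` type, the `MARKET_SIZING` section, its field names, and the convention that Tier 1 is the strongest evidence are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    kind: type      # expected Python type of the value
    max_tier: int   # weakest evidence tier allowed (assumption: 1 = strongest)


# Hypothetical schema for one memo section.
MARKET_SIZING = [
    FieldSpec("tam_usd", float, max_tier=2),
    FieldSpec("growth_rate", float, max_tier=2),
    FieldSpec("sizing_method", str, max_tier=3),
]


def validate_section(specs, section):
    """Return a list of violations; an empty list means the section passes."""
    errors = []
    for spec in specs:
        entry = section.get(spec.name)
        if entry is None:
            errors.append(f"{spec.name}: missing")
            continue
        if not isinstance(entry["value"], spec.kind):
            errors.append(f"{spec.name}: expected {spec.kind.__name__}")
        if entry["tier"] > spec.max_tier:
            errors.append(f"{spec.name}: evidence tier too weak")
    return errors


# A made-up draft: one good field, one badly typed and weakly sourced, one absent.
draft = {
    "tam_usd": {"value": 4.2e9, "tier": 1},
    "growth_rate": {"value": "about 12%", "tier": 3},
}
problems = validate_section(MARKET_SIZING, draft)
```

A validator of this shape can sit between subagents: a non-empty `problems` list blocks the next step instead of passing a plausible-looking draft downstream.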
Three things this is not. It's not a chatbot over a data room, which is retrieval. It's not a fleet of research agents running in parallel, which is parallel research. It's not a vendor subscription, which is a product you bought rather than architecture you authored.
Anthropic's canonical guidance on building effective agents treats structured output as appendix material. AG2 prescribes the opposite pattern and shows schema definition before agent instantiation as the canonical example. Claude's structured outputs documentation is more emphatic: without a schema, you get parsing failures, missing fields, and retry loops. Production frameworks are converging on schema-first. Executive pitch decks haven't caught up.
The subagent-first failure pattern at day 60
In October 2025, Deloitte's Australian arm refunded AUD 97,000 against its AUD 440,000 government assurance contract. Academics had caught fabricated citations in the firm's independent review for the Department of Employment and Workplace Relations, including a quote attributed to a Federal Court judgment that the judge had never made. Deloitte then issued a corrected version disclosing the use of generative AI in drafting. That's an architecture story, not a hallucination story. AI output shipped through to a federal department without a verification layer designed to catch fabricated citations before the report left the firm.
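One shape such a verification layer could take is a check that every quoted passage actually appears in the source it cites. The sketch below is an illustration under stated assumptions, not a description of any firm's process: the `Citation` type, the in-memory `sources` dict, the sample text, and the whitespace-insensitive matching are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    source_id: str
    quote: str


def _normalize(text):
    # Collapse whitespace and case so line wraps do not cause false alarms.
    return " ".join(text.lower().split())


def unverified_citations(citations, sources):
    """Return citations whose source is unknown or whose quote is not in it."""
    failures = []
    for c in citations:
        body = sources.get(c.source_id)
        if body is None or _normalize(c.quote) not in _normalize(body):
            failures.append(c)
    return failures


# Made-up source text and citations, for illustration only.
sources = {"doc-001": "Revenue grew 12% year over year,\ndriven by enterprise renewals."}
citations = [
    Citation("doc-001", "revenue grew 12% year over year, driven by enterprise renewals"),
    Citation("doc-001", "revenue doubled year over year"),
    Citation("doc-999", "any quote at all"),
]
blocked = unverified_citations(citations, sources)
```

Exact-substring matching is deliberately strict. It will not catch a real quote used misleadingly, but it does stop a quote that exists nowhere in the cited document from leaving the pipeline.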
The day-sixty failure inside a multi-agent stack has the same shape, at fleet scale. Each subagent works in isolation. The ingestion agent extracts data. The red-flag agent flags risks. The memo agent writes paragraphs. Each output looks plausible. The synthesis layer can't assemble them into a memo your partner will sign, and the architect rerunning the pipeline can't trace where the inconsistency entered. No specification exists at any layer.
The MAST failure taxonomy from the Berkeley study isn't software-specific. The study measured engineering agents, but specification disobedience and termination unawareness happen whenever the artifact specification is absent. The failure modes are task-structural, not domain-bound.
Zalando's engineering team published a postmortem in September 2025 on an internal agentic incident-analysis pipeline. Small-model hallucination hit roughly 40%. The fix wasn't a single intervention: tighter prompting, a larger model, and human-evaluable digest artifacts inserted between every pipeline stage. Hallucination dropped below 15%. The structural lever was the artifact-layer checkpoint that let humans audit intermediate output before the next agent consumed it. GitHub's engineering team reached the same conclusion one quarter later: typed schemas at agent boundaries erase the field-name and type drift that breaks unstructured handoffs.
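Field-name and type drift at a handoff is easy to show in miniature. The following Python sketch assumes a hypothetical red-flag handoff contract; `HANDOFF_CONTRACT`, `check_handoff`, and the payload are invented for illustration and are not taken from either team's write-up.

```python
# Hypothetical contract for what a red-flag agent hands to the memo agent.
HANDOFF_CONTRACT = {"claim": str, "severity": int, "source_id": str}


def check_handoff(payload, contract):
    """Compare a payload against the contract and report each kind of drift."""
    shared = contract.keys() & payload.keys()
    return {
        "missing": sorted(contract.keys() - payload.keys()),
        "unexpected": sorted(payload.keys() - contract.keys()),
        "wrong_type": sorted(
            key for key in shared if not isinstance(payload[key], contract[key])
        ),
    }


# The producing agent renamed a field and changed a type between runs.
drifted = {"claim": "Customer concentration", "severity": "high", "source": "doc-007"}
report = check_handoff(drifted, HANDOFF_CONTRACT)
```

Here `report` names the renamed field (`source` instead of `source_id`) and the retyped one (`severity` as a string), which is the information a downstream agent would otherwise have to guess at.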
I learned this at smaller scale. Early in my day-job team's adoption of skills and subagents for code-quality governance, we deployed agents before specifying their outputs. Engineers and PMs expected consistent results; we got wildly different ones across runs and teams. The fix wasn't better prompts. We split the work into dedicated persona agents (one for code review, one for security scan, one for documentation), each with its own task list and defined output. Then we standardized the JIRA ticket format so downstream skills could rely on shape instead of writing per-source adapters.
Once the artifact layer was specified, four things changed at once. The orchestrator stopped skipping code review at will because it could now verify whether a previous run had populated the artifact instead of guessing from the presence of any output at all. PM search behavior changed next: the standardized JIRA ticket shape meant they stopped having to learn which skill wrote which kind of ticket. Downstream skills produced fewer defects because they consumed a known shape and could drop the per-source adapter code that had been silently swallowing edge cases. And leadership stopped questioning documentation differences across teams. The architecture problem had been masquerading as a model problem. This is the multi-agent version of the string-function-wrapper failure mode at fleet scale.
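The orchestrator change described above, checking that the artifact is populated rather than that some output exists, might look roughly like this in Python. The stage name, the `REVIEW_FIELDS` tuple, and the helper functions are assumptions made for the sketch, not the actual pipeline code.

```python
EMPTY = (None, "", [], {})


def is_populated(artifact, required):
    """True only if every required field is present and non-empty."""
    if not artifact:
        return False
    return all(artifact.get(key) not in EMPTY for key in required)


def run_stage(store, stage, required, produce):
    """Run `produce` unless the stage's artifact is already complete."""
    if is_populated(store.get(stage), required):
        return "skipped"
    artifact = produce()
    if not is_populated(artifact, required):
        raise ValueError(f"{stage}: producer returned an incomplete artifact")
    store[stage] = artifact
    return "ran"


REVIEW_FIELDS = ("findings", "reviewer", "commit")

# Partial output left by an earlier run: present, but not a finished review.
store = {"code_review": {"findings": ["unused import"]}}


def fake_review():
    # Stand-in for the real code-review agent.
    return {"findings": ["unused import"], "reviewer": "code_review_agent", "commit": "abc123"}


first = run_stage(store, "code_review", REVIEW_FIELDS, fake_review)
second = run_stage(store, "code_review", REVIEW_FIELDS, fake_review)
```

The first call runs the review because the stored artifact is only partly filled in; the second is skipped because the artifact now satisfies the required fields. The same check addresses both step repetition and premature skipping.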
The memo IS the architecture: five tests your architect should pass before any subagent ships
The memo schema is the architecture. The subagents are functions over it. Before any subagent ships, your architect should be able to pass these five tests.
| # | Test | What passing looks like |
|---|---|---|
| 1 | Can you draw the complete memo on one page? | A blank template with every required heading. Sections, not contents. |
| 2 | Can you name field types and evidence-tier expectations per section? | "Financial: numeric claims, Tier 1-2 sources. Narrative: named-source attribution." |
| 3 | Can you draw the artifact graph for one complete run? | A diagram, not a list. Subagent X produces artifact Y; validator Z reads it. |
| 4 | Can you say which fields are non-negotiable for partner sign-off? | The signworthy threshold lives in the schema, not in a partner's head. |
| 5 | Does the schema include trust metadata? | Source confidence, evidence tier, model that produced it, time-to-revision. |
If your architect can't pass all five before authorizing a single subagent prompt, the architecture isn't ready. NIST's OSCAL project has been pointing at this kind of machine-readable, auditable architecture for years; the open standard is available and the discipline is the part teams keep skipping.
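Tests 3 and 5 lend themselves to a mechanical check. The sketch below assumes a hypothetical three-section memo and invented agent and validator names; it flags memo fields with no producing subagent or no validator, and lists which trust-metadata keys an entry is missing.

```python
# Trust metadata from test 5; the key names are assumptions for this sketch.
TRUST_KEYS = {"source_confidence", "evidence_tier", "produced_by_model", "time_to_revision"}

MEMO_FIELDS = ["management_assessment", "market_sizing", "red_flags"]

# Hypothetical artifact graph: memo field -> producing subagent and validators.
ARTIFACT_GRAPH = {
    "market_sizing": {"producer": "synthesis_agent", "validators": ["evidence_tier_check"]},
    "red_flags": {"producer": "red_flag_agent", "validators": []},
}


def graph_gaps(memo_fields, graph):
    """List memo fields that lack a producer or a validator."""
    gaps = []
    for name in memo_fields:
        node = graph.get(name)
        if node is None:
            gaps.append((name, "no producer"))
        elif not node["validators"]:
            gaps.append((name, "no validator"))
    return gaps


def missing_trust_metadata(entry):
    """Return the trust keys an artifact entry has not supplied."""
    return sorted(TRUST_KEYS - set(entry.get("trust", {})))


gaps = graph_gaps(MEMO_FIELDS, ARTIFACT_GRAPH)
missing = missing_trust_metadata(
    {"value": 4.2e9, "trust": {"evidence_tier": 1, "source_confidence": 0.9}}
)
```

In this made-up graph, `gaps` shows that nobody produces the management assessment and nothing validates the red flags, both of which are visible before a single prompt is written.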
The pipeline that publishes this post was built schema-first. Twelve specialist subagents do bounded judgment work. Sixteen deterministic validators back them up. Five architectural layers govern the system. The orchestrating skill owns every disk write. Each subagent and each validator exists because the artifact (a published post with a fact-check ledger, claim audit, and provenance block) was specified first. The agents are downstream of the artifact, not upstream. The four sub-agent orchestration patterns become coherent once the schema contract is in place.
The strongest counterexample is Anthropic's own multi-agent research system, built subagent-first. The lead agent develops a strategy. Subagents spawn to explore aspects in parallel. Read the published prompts, though, and the subagent instructions already include explicit output-format sections as behavioral constraints. What Anthropic informally encoded in research-context prompts, diligence-context systems must encode formally. A research agent that misses a citation embarrasses an engineer. A diligence agent that misses a citation costs a partner's signature.
A second objection: peer-reviewed accounting research argues that financial diligence judgment is irreducibly contextual. The "good enough" call can't be systematized. That's correct, and it doesn't contradict the thesis. Schema-first doesn't mean schema-replaces-judgment. Partner judgment lives in the annotated fields: confidence scores, evidence-tier markers, time-to-revision flags, escalation triggers. The schema scaffolds judgment. It doesn't eliminate it. QED Investors put the same point from the investor's side: AI excels at generating artifacts, but most regulatory obligations still require a human to demonstrate that judgment was exercised. The artifact carries the evidence; the partner signs.
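As a rough illustration of annotated fields scaffolding judgment rather than replacing it, the Python sketch below routes fields to a partner instead of deciding anything itself. The thresholds, field names, and the `escalate` flag are hypothetical; in practice the partners would set them.

```python
def review_reasons(field, min_confidence=0.8, max_tier=2):
    """Return the reasons a field should go to a partner; empty means none."""
    reasons = []
    if field["confidence"] < min_confidence:
        reasons.append("low confidence")
    if field["evidence_tier"] > max_tier:  # assumption: higher number = weaker evidence
        reasons.append("weak evidence")
    if field.get("escalate"):
        reasons.append("flagged upstream")
    return reasons


# Made-up annotated memo fields.
memo = {
    "market_sizing": {"confidence": 0.92, "evidence_tier": 1},
    "management_assessment": {"confidence": 0.55, "evidence_tier": 3},
    "capital_efficiency": {"confidence": 0.90, "evidence_tier": 2, "escalate": True},
}

partner_queue = {}
for name, field in memo.items():
    reasons = review_reasons(field)
    if reasons:
        partner_queue[name] = reasons
```

The code only builds the queue. Whether a low-confidence management assessment is still good enough to act on remains the partner's call.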
Re-sequencing the 30/60/90
The standard ninety-day plan reads in three windows. Initial architecture by day thirty. Multi-agent system by day sixty. Production by day ninety. This is the sequence that guarantees the day-sixty wall. The architect ships a demo on day thirty, a multi-agent demo on day sixty, and discovers on day eighty-five that partners won't sign the memos.
A schema-first ninety days looks different.
Days 0-15: lock the artifact. Your architect writes the blank memo template, codifies field types, codifies evidence-tier rules per section, and codifies the trust metadata partners need. The deliverable is a one-page schema and the partner sign-off that reads "yes, if my analyst handed me a memo populated to this schema, I would read it and act on it." Anthropic frames its evaluation guidance the same way in "Demystifying evals for AI agents": evals force a team to specify what success means before any model is graded against it. The memo schema is that same forcing function, one layer up.
Days 16-45: one end-to-end vertical slice. One subagent per artifact field, end-to-end through the orchestrator, producing one complete memo on one real diligence target. The goal is to prove the schema can be populated. Not to optimize. Not to demo to leadership.
Days 46-90: horizontal expansion. Add subagents, parallelize, build the red-flag detection layer, build the follow-up questioning layer. Each addition is a defined contribution to the artifact graph. Nothing ships unless its place in the schema is already named.
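The rule that nothing ships without a named place in the schema can be enforced at registration time. This is a minimal Python sketch under assumed names (`SchemaRegistry`, the field list, and the agent names are all hypothetical), not a specific framework's API.

```python
class SchemaRegistry:
    """Refuses subagents that write to fields the memo schema does not name."""

    def __init__(self, schema_fields):
        self._fields = set(schema_fields)
        self._writers = {}  # field -> list of subagents registered to write it

    def register(self, subagent, writes):
        unknown = sorted(set(writes) - self._fields)
        if unknown:
            raise ValueError(f"{subagent} writes fields not in the schema: {unknown}")
        for name in writes:
            self._writers.setdefault(name, []).append(subagent)

    def unowned(self):
        """Schema fields that no registered subagent produces yet."""
        return sorted(self._fields - set(self._writers))


registry = SchemaRegistry(["market_sizing", "red_flags", "follow_up_questions"])
registry.register("synthesis_agent", ["market_sizing"])
registry.register("red_flag_agent", ["red_flags"])
remaining = registry.unowned()
```

Registering an agent that writes an unnamed field, say `registry.register("sentiment_agent", ["sentiment_score"])`, raises a `ValueError`, so the schema has to change first. `remaining` shows which parts of the memo the expansion phase still has to cover.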
This slows the first thirty days' velocity. No demo on day thirty, only a schema and a partner sign-off. That's the production-readiness investment your architect is being authorized to make. The demo without a schema is the day-sixty failure pre-staged. The governance-by-design principle scales here. Enforcement that has to hold every time lives in the schema, not in a prose instruction your orchestrator will route around when the pipeline's under pressure.
Stop authorizing the agents. Authorize the schema.
The architecture decision your AI hire is being authorized to make is upstream of every subagent. The memo schema is the architecture. The subagents are functions over it. If you've already authorized a Claude or MCP architect and the first thirty days are pointing anywhere other than a locked schema with partner sign-off, you have ninety days of work scheduled to produce something other than what your firm needs.
I run Agentic Workflow/Automation Development engagements that compress that decision into a sequence partners can sign off on. Schema design review. One end-to-end vertical slice. Then multi-agent expansion. The hardest decision goes first, while the cost of changing it is still paragraphs, not pull requests. If your operator is in place and the architect hire is authorized, book a fifteen-minute architecture review. Bring the blank memo your partner would sign. We'll write down what the schema needs to enforce before a single subagent prompt ships.