What 20-Turn Conversations Taught Me About the Claude API

I audited my own production Claude API agent against the patterns I recommended in the first two posts in this series. Fourteen of seventeen practices checked out. The AI Readiness Assessment uses tool use, prompt caching, streaming, extended thinking, and structured outputs. Exactly as I described.

The audit also found nine patterns the agent uses daily that I had never written about.

These aren't obscure API features. They're design choices that surfaced only after running a multi-turn agent in production for months. Real users broke the assumptions I made on day one. Turn-based effort tuning. Dynamic system prompt injection. Confidence caps that prevent the model from scoring what it can't see. None of these appeared in a tutorial or documentation page. They emerged from the gap between "it works in testing" and "it works on turn 18 when someone says something unexpected."

The first post in this series diagnosed why most Claude API integrations are wrappers around a string function. The second covered five foundational capabilities every production integration should use. This post goes further. Nine patterns, three sections, all extracted from code that's been running in production for months.

System Prompts That Rewrite Themselves Mid-Conversation

The assessment tool starts every conversation with a universal system prompt. Same prompt for a solo marketing consultant and a senior engineering leader. By turn 3, those two users are talking to what feels like a different agent. The system prompt changed between turns, and neither user knows it happened.

Here's how. On turn 2, the user describes their role. A parallel Haiku call classifies them along two axes: functional lens (technical builder, creative operator, business operator, product strategist) and altitude (solo, individual contributor, team lead, senior leader). Structured outputs with constrained decoding mean the result always parses cleanly. If Haiku errors out, a default keeps the conversation moving.

The result triggers a conditional injection. For the "technical builder" archetype, a 24KB module gets pushed into the system prompt. It contains role-specific rubrics, drill questions, vocabulary, and scoring traps:

const systemBlocks = [
  { type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral", ttl: "1h" } },
];
if (sessionMeta) {
  const module = getArchetypeModule(sessionMeta);
  if (module) {
    systemBlocks.push({
      type: "text", text: module, cache_control: { type: "ephemeral", ttl: "1h" },
    });
  }
}

The base system prompt stays cached at a 1-hour TTL. The archetype module gets its own cache breakpoint, also at 1 hour, shared across every session with the same archetype. Only a small <session_context> block with per-user metadata sits outside the cache. This is the production lesson: you can inject dynamic content without killing your cache. You just need to understand the cache hierarchy. Tools cache first, then system blocks in order. Changing block 3 doesn't touch blocks 1 and 2.

The module itself is invisible to the user. Claude shifts its vocabulary from "sprint velocity" and "CI/CD" for an engineer to "campaign performance" and "content calendar" for a marketer. The scoring rubrics change. The drill questions change. The user just experiences a conversation that feels like it understands their world.

This pattern resembles how Claude Code custom skills encode team-specific instructions. Same idea, different layer. Skills inject at the IDE level. Archetype modules inject at the API level, mid-conversation, without the user seeing the seam.

The second prompt-level pattern is simpler but equally production-driven: a banned word list embedded directly in the system prompt.

Banned words: "genuinely" / "genuine" / "truly" / "landscape" /
"delve" / "comprehensive" / "robust" / "streamline" / "foster" /
"facilitate" / "leverage" (as verb) / "utilize" / "navigate"
(metaphorical) / "it's worth noting" / "importantly" /
"at the end of the day" / "here's the thing" / "the reality is."

Why does an assessment tool care about word choice? Because sounding like a chatbot kills trust. Every one of those words signals to a savvy reader that they're talking to generic AI output, not a purpose-built agent. I enforce the ban at the prompt level, not in post-processing. Post-processing can flag a bad sentence. It can't rewrite one. The prompt stops the sentence from being written in the first place.

The same section bans em dashes from every output field and enforces sentence rhythm variation. These aren't cosmetic preferences. They're trust mechanics. An assessment that reads like it was generated by a default Claude completion undercuts the entire value proposition of a custom agent.

Scoring That Stays Honest Across 20 Turns

A single-turn API call extracts data from one response. A 20-turn conversation builds a picture over time. The picture changes as new evidence arrives. Three patterns keep the scoring accurate across that full arc.

Altitude-based confidence caps. An individual contributor (IC) can tell you exactly which AI tools their team uses daily. They can't tell you whether the board has approved an AI budget. The tool encodes this visibility constraint into the system prompt:

IC: adoption_capability and tool_workflow_maturity = high max.
    data_risk and accountability_process = medium max.
    strategic_alignment = low max.

Team Lead: adoption_capability through accountability_process = high max.
           strategic_alignment = medium max.

Senior Leader: tool_workflow_maturity = medium max.
               data_risk through strategic_alignment = high max.

The cap is a ceiling, not a default. If an IC gives a vague answer about tool adoption, it scores low confidence under normal rules. The cap kicks in only when the respondent's role doesn't support high certainty. The model checks this in its thinking block before every tool call. Without caps, early versions of the tool scored strategic alignment with high confidence based on an IC saying "leadership seems supportive." That's hearsay, not signal.

Cross-dimensional signal extraction. Someone answers a question about tool adoption and mentions "no formal budget, leadership is supportive but it all comes from existing eng spend." That's a strategic alignment signal hiding inside an adoption answer. The system prompt tells Claude to scan every response for signals beyond the current question. Up to two extra dimensions per turn:

Cross-dimensional scoring cap: Maximum 2 additional dimensions per turn.
All cross-dimensional scores use low confidence. Emit them immediately
via update_dashboard.

This builds a scoring scaffold early. By the time the conversation reaches strategic alignment, the model already has a low-confidence anchor from turn 4. It confirms, revises, or upgrades that anchor instead of starting from nothing. Fewer questions per dimension. Tighter conversations. Scores that reflect everything the user said, not just their answer to the "right" question.

Score revision tracking. A 20-turn conversation reveals things that contradict earlier impressions. Someone scores a 2 on data risk in turn 5. Then in turn 12, while answering a different question, they mention a written data policy. The score needs to change. The update_dashboard tool schema includes explicit revision fields:

{
  name: "update_dashboard",
  strict: true,
  input_schema: {
    properties: {
      dimension: { type: "string", enum: DIMENSION_KEYS },
      score: { type: "integer", enum: [1, 2, 3, 4, 5] },
      confidence: { type: "string", enum: CONFIDENCE_LEVELS },
      revision: {
        type: "boolean",
        description: "True if this updates a previously scored dimension.",
      },
      previous_score: {
        type: "integer",
        description: "Required when revision is true. The score being replaced.",
      },
    },
  },
}

The revision and previous_score fields give the model a structured way to say "I was wrong earlier, here's the fix." Without them, early versions would either ignore new evidence (anchoring bias) or re-score without noting the change. The revision field also feeds the admin dashboard. I can see how scores evolved across a session and flag conversations where they swung hard.

These three patterns address one problem that no tutorial covers: keeping scores accurate when information arrives across 20 turns instead of all at once. Confidence caps prevent overreach. Cross-dimensional extraction catches signal you didn't ask for. Revision tracking corrects mistakes when better evidence appears. Together, they turn a stateless API into something that reasons about uncertainty over time.

The Runtime Layer

The first two sections covered what the model sees and how it reasons. This section covers what makes the system viable in production: cost control, latency, and the line between what the model should compute and what the server should handle.

Turn-based effort tuning. Does every turn in a 20-turn session need the same depth of reasoning? Not even close. When Claude processes a tool result (the browser sent back a dashboard update), it needs maybe 50 tokens of bridge text to the next question. Giving it a full thinking budget for that is waste. But when Claude writes the final report, with narrative analysis, scoring insights, and a strategic roadmap, it needs every thinking token it can get.

The assessment tool classifies each turn and routes it to a different configuration:

function detectTurnType(body) {
  if (body.max_tokens_override) return "override";
  const lastMessage = body.messages[body.messages.length - 1];
  if (lastMessage.role === "user" && Array.isArray(lastMessage.content)) {
    if (lastMessage.content.every((b) => b.type === "tool_result")) {
      return "continuation";
    }
  }
  return "normal";
}

const TURN_CONFIG = {
  continuation: { effort: "low",    maxTokens: 4096  },
  normal:       { effort: "medium", maxTokens: 16384 },
  override:     { effort: "high",   maxTokens: 32768 },
};

The effort parameter controls how much Claude thinks before responding. low means Claude barely pauses before moving on. high gives it the full budget for complex analysis. The cost difference across a full session adds up fast. Ten continuation turns at low effort versus high effort: that's the gap between a viable product and one that bleeds tokens on turns where the model has nothing complex to decide.

Opus 4.7 adds xhigh between high and max. The effort ladder is now five rungs: low, medium, high, xhigh, max. It also ships a task_budget primitive. Anthropic's docs describe it as a soft hint, not a hard cap. max_tokens is still the ceiling. task_budget is Messages-API only. It is not supported on Claude Code at launch. When the workload is long enough to justify Opus, those two levers are the governance story. More on where Opus 4.7 earns its premium in a separate post.

Non-blocking persistence with ctx.waitUntil. How much data does a 20-turn session need to persist? More than you'd think. Session metadata to KV. Conversation snapshots on every turn. D1 database updates for the admin dashboard. Token usage logs. If any of these writes blocked the streaming response, the user would see a delay between turns. And once the response finishes, Cloudflare Workers give you only 30 seconds of background execution before the runtime shuts down.

ctx.waitUntil() solves both problems. It registers a promise that the runtime will resolve after the response has been sent. Fire and forget.

ctx.waitUntil(
  env.ASSESS_RESULTS.put(
    sessionMetaKey, JSON.stringify(sessionMeta),
    { expirationTtl: 24 * 60 * 60 }
  ).catch(err => console.log(JSON.stringify({
    event: "session_meta_write_error",
    session_id: body.session_id, error: err?.message || String(err),
  })))
);

The pattern appears four times in the chat handler: session metadata, conversation snapshots, D1 score enrichment, and token usage logging. Each runs in the background while the stream is already flowing to the user. The .catch() on every call is essential. A failed background write can't throw an unhandled rejection, and it can't kill the worker process. Log it and move on. The user's conversation is more important than a metadata write.

Server-side reference range computation. The final assessment includes benchmark scores adjusted for industry, archetype, and altitude. A healthcare team lead sees different benchmarks than a solo SaaS founder. The naive approach: put the tables in the prompt, let Claude do the math. That works. But it burns tokens on arithmetic and adds one more place where the model can hallucinate a number.

The assessment tool computes reference ranges on the Worker instead:

function computeReferenceRanges(archetype, industry, altitude) {
  const ranges = { ...REFERENCE_BASELINES };
  const archetypeAdj = ARCHETYPE_ADJUSTMENTS[archetype] || {};
  const industryAdj = INDUSTRY_ADJUSTMENTS[industry] || {};
  for (const dim of DIMENSION_KEYS) {
    ranges[dim] += (archetypeAdj[dim] || 0) + (industryAdj[dim] || 0);
    if (altitude === "solo")
      ranges[dim] += (dim === "strategic_alignment" ? -0.3 : 0);
    ranges[dim] = Math.min(STACKING_CEILING, Math.round(ranges[dim] * 10) / 10);
  }
  return ranges;
}

The computed ranges go into the <session_context> block. The system prompt tells Claude to echo them into the final tool call. Claude's job is assessment and conversation. The server's job is math. Keep them separate.

Eager input streaming on the final assessment. The complete_assessment tool call generates the largest JSON payload in the conversation. Scores, narrative report, cross-dimensional insights, strategic roadmap, quick wins, metadata. Without eager input streaming, the API buffers the entire tool call JSON before sending it. With eager_input_streaming: true, the API streams tool input parameters as they generate. One property on one tool definition:

{
  name: "complete_assessment",
  eager_input_streaming: true,
  description: "Generate the full results payload...",
}

I enable this only on the final assessment tool, not on update_dashboard. The dashboard tool produces tiny payloads where streaming overhead would add latency, not reduce it. Selective application by payload size is the production pattern here. The docs describe the feature; they don't tell you when to skip it.

When You Don't Need Any of This

Anthropic's own Building Effective Agents guide says it plainly: "Start with simple prompts... and add multi-step agentic systems only when simpler solutions fall short." That advice is correct. A sentiment classifier doesn't need archetype injection. Effort tuning makes no sense for a one-shot endpoint. And if your chatbot handles three turns of form data, confidence caps are overhead you'll never recoup.

These nine patterns exist because the assessment tool demands them. Twenty-turn conversations. Five scored dimensions with varying confidence. Role-adapted language and rubrics. Real-time dashboard updates. A full analysis that synthesizes everything into a strategic roadmap. If your use case is simpler, your architecture should be too. The worst outcome would be treating this post as a checklist and bolting archetype modules onto a system that doesn't classify users.

The line is this: if you're building a multi-turn agent where state piles up, where the model needs to adapt based on what it's learned, and where reliability over many turns matters more than any single response, these patterns will find you. They found me. I didn't plan any of them before launch. Every one emerged from watching real conversations reveal a failure mode that testing never surfaced.

The AI Readiness Assessment runs all nine of these patterns in production. Taking it might be more useful than reading about it. The conversation adapts to your role. It scores five dimensions with confidence-capped accuracy and generates a personalized strategic roadmap. Ten minutes, no email required.

For the foundations these patterns build on, start with the first post in this series (the diagnosis) and the second (the five core capabilities). The agentic development starter guide covers the broader workflow patterns that apply beyond the API level. For patterns that extend past a single 20-turn agent into multi-agent orchestration, the four sub-agent orchestration patterns post names the architectures most teams hit next.

If you're building a production Claude API agent and want to skip the months of discovering these patterns the hard way, Claude API development and support is what I do.

Glossary terms used

Extended thinking Tool use Prompt caching

What 20-Turn Conversations Taught Me About the Claude API

System Prompts That Rewrite Themselves Mid-Conversation

Scoring That Stays Honest Across 20 Turns

The Runtime Layer

When You Don't Need Any of This

Claude API in Production: A Runtime, Not a String Function, and What It Leaves to You

Agentic AI Governance in Production: Who Owns the Bar When the Agent Ships

Running Claude Code as a Production Engineering Practice

Continue reading: more in Lead with Claude

Model Availability Is a Production Dependency: Build the Fallback Ladder Before the Next Model Vanishes

Claude Fable 5's Silent Degradation: The Safety Tier You Couldn't See, Log, or Turn Off

Claude Fable 5 Is 'Mostly Drop-In.' The Word Doing the Work Is 'Mostly.'

Sources

System Prompts That Rewrite Themselves Mid-Conversation

Scoring That Stays Honest Across 20 Turns

The Runtime Layer

When You Don't Need Any of This

Reference guides for this topic

Claude API in Production: A Runtime, Not a String Function, and What It Leaves to You

Agentic AI Governance in Production: Who Owns the Bar When the Agent Ships

Running Claude Code as a Production Engineering Practice

Continue reading: more in Lead with Claude→

Model Availability Is a Production Dependency: Build the Fallback Ladder Before the Next Model Vanishes

Claude Fable 5's Silent Degradation: The Safety Tier You Couldn't See, Log, or Turn Off

Claude Fable 5 Is 'Mostly Drop-In.' The Word Doing the Work Is 'Mostly.'

Sources

Continue reading: more in Lead with Claude