Independent build by the Ready Solutions founder

59/60 Voice Fidelity: Multi-Agent AI Persona Profiler

April 11, 2026 · AI/ML Tools · 8 days
Agentic Workflow/Automation Development · Claude Code Infrastructure Setup · Custom Skill Development
Built with Claude Code · Claude Opus · Bash · Python

Why AI Personas Fail

Most AI persona tools treat personality as a static profile: a name, a few adjectives, maybe a paragraph of backstory. Load it as a system prompt and the AI "becomes" that person. For simple use cases, this works.

For anything requiring authentic representation, it falls apart fast. The AI defaults to helpful, diplomatic, verbose behavior regardless of the persona instructions. Ask it to be terse and it writes paragraphs. Tell it someone is blunt and informal and it smooths every edge out. Describe someone as conflict-avoidant and it handles disagreements with carefully balanced language the real person would never use.

The root problem: standard personas capture what someone IS but not how they COMMUNICATE across different emotional states, energy levels, and social contexts. They lack the behavioral depth for an LLM to produce responses that the real person would recognize as their own.

How the AI Persona Profiler Works

The system runs as a four-stage pipeline, each stage in a separate Claude Code session with Claude Opus.

Stage 1: Structured Interview. An AI interviewer runs a 60-100+ exchange interview across seven phases. These cover personality beyond the obvious use case, reaction prompts with real scenarios, communication stress tests (11+ scripted confrontation patterns), relationship dynamics, and cross-validation. The interviewer adapts in real time. It tracks energy shifts in message length and follows deflection patterns. When a response contradicts an earlier answer, it pushes back. Every exchange gets auto-captured with provenance tags from the moment it enters the system.
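
Auto-capture with provenance could be sketched roughly like this. This is an illustrative Python analogue, not the actual hook code; the tag names and function are hypothetical (the real system defines six tags, per "By the Numbers" below):

```python
import json, hashlib
from datetime import datetime, timezone

# Illustrative subset of provenance tags; the real system defines six.
PROVENANCE_TAGS = {"interview_raw", "interview_followup", "analyst_inference"}

def capture_exchange(transcript_path, speaker, text, tag):
    """Append one exchange to the transcript with a provenance tag."""
    if tag not in PROVENANCE_TAGS:
        raise ValueError(f"unknown provenance tag: {tag}")
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "speaker": speaker,
        "text": text,
        "provenance": tag,
        # A content hash lets later stages verify the exchange was not altered.
        "sha256": hashlib.sha256(text.encode()).hexdigest(),
    }
    with open(transcript_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Tagging at capture time, rather than during analysis, is what makes the lineage claims downstream auditable.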

Stage 2: Adversarial Dual-Analysis. Two independent Opus instances analyze the same transcript at the same time. Analyst A runs a standard 9-step pass covering extraction, gap analysis, contradiction analysis, psychological profiling, attachment patterns, narrative identity, communication architecture, section drafts, and data gaps. Analyst B gets a skeptical framing: "When you see a pattern, look for moments that break it. Privilege the quiet data." Then a Challenger agent reconciles divergences through evidence arbitration. Every finding gets tagged HIGH, MEDIUM, LOW, or DIVERGENT. Divergent findings keep both readings instead of forcing a resolution.
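
The reconciliation step might look like this minimal sketch, assuming each analyst's findings arrive keyed by trait (the function and data shapes are illustrative, not from the actual codebase; the LOW tag for weak single-source evidence is omitted for brevity):

```python
def reconcile(findings_a, findings_b):
    """Merge two analysts' findings into confidence-tagged results."""
    merged = {}
    for trait in set(findings_a) | set(findings_b):
        a, b = findings_a.get(trait), findings_b.get(trait)
        if a and b and a == b:
            # Both analysts converged independently: strong evidence.
            merged[trait] = {"reading": a, "confidence": "HIGH"}
        elif a and b:
            # Keep both readings instead of forcing a resolution.
            merged[trait] = {"reading": [a, b], "confidence": "DIVERGENT"}
        else:
            # Only one analyst surfaced it: plausible but unconfirmed.
            merged[trait] = {"reading": a or b, "confidence": "MEDIUM"}
    return merged
```

The key design choice is the DIVERGENT branch: disagreement is data about the person, not noise to be averaged away.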

Stage 3: Multi-Agent Update. An orchestrator spawns five specialized subagents (psychology, voice, relationships, calibration, QA) that refine the persona file through parallel and sequential waves. The QA agent runs voice simulation testing before the file is finalized.
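
The parallel-and-sequential wave structure can be sketched as follows; the wave assignments here are an assumption for illustration, and `run_agent` stands in for whatever actually launches a subagent session:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical wave layout: agents in one wave run in parallel; waves run
# sequentially so later agents can read earlier agents' output.
WAVES = [["psychology", "voice", "relationships"],  # independent analyses
         ["calibration"],                           # depends on the wave above
         ["qa"]]                                    # validates the final file

def run_waves(waves, run_agent):
    """Run each wave's agents concurrently, waves in order; collect results."""
    results = {}
    for wave in waves:
        with ThreadPoolExecutor(max_workers=len(wave)) as pool:
            for agent, out in zip(wave, pool.map(run_agent, wave)):
                results[agent] = out
    return results
```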

Stage 4: Relationship Matrix Sync. Cross-persona relationship data is reconciled across all active personas, checking for drift between the authoritative matrix and per-persona views.
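
Drift detection between the authoritative matrix and per-persona views reduces to a comparison like this sketch (the data shapes are assumed for illustration):

```python
def find_drift(matrix, persona_views):
    """Return (persona, other, matrix_value, view_value) tuples wherever a
    per-persona view disagrees with the authoritative relationship matrix."""
    drift = []
    for persona, views in persona_views.items():
        for other, view_value in views.items():
            canonical = matrix.get((persona, other))
            if canonical != view_value:
                drift.append((persona, other, canonical, view_value))
    return drift
```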

The output: persona files exceeding 12,000 words per profile. Each one covers core personality with behavioral evidence and emotional state variations with per-state communication shifts. Verbatim samples from stress tests are included word-for-word, not paraphrased. They include a contradiction map preserving both sides of every genuine contradiction. Anti-patterns document what this person would NEVER say or do. LLM tuning notes list adversarial test cases and known failure modes.

Measuring AI Personality Modeling Accuracy

Most AI persona projects have zero quantitative validation. The persona "works" if someone reads it and nods. This system does better. A 12-point rubric scores generated responses across six dimensions on a 0-2 scale: message length, punctuation patterns, language intensity, humor type, sentence structure, and register. Thresholds: 10-12 = PASS, 7-9 = DRIFT, 0-6 = FAIL.
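
The scoring itself is simple arithmetic; here is a minimal sketch, assuming the per-dimension 0-2 scores are produced upstream by the QA agent (dimension names follow the article; the function is illustrative):

```python
DIMENSIONS = {"message_length", "punctuation", "language_intensity",
              "humor", "sentence_structure", "register"}

def score_response(dimension_scores):
    """Sum six 0-2 dimension scores and map the total to a verdict."""
    if set(dimension_scores) != DIMENSIONS:
        raise ValueError("expected exactly the six rubric dimensions")
    if any(not 0 <= s <= 2 for s in dimension_scores.values()):
        raise ValueError("each dimension is scored 0-2")
    total = sum(dimension_scores.values())
    if total >= 10:
        verdict = "PASS"     # 10-12
    elif total >= 7:
        verdict = "DRIFT"    # 7-9
    else:
        verdict = "FAIL"     # 0-6
    return total, verdict
```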

The difference from a simple checklist: adversarial anchoring. Before scoring, the QA agent pulls 6-8 verbatim quotes from the real transcript. It computes voice metrics: word count distribution, sentence length, punctuation density, capitalization patterns. These numbers set min/max envelopes per register. Generated responses must land inside the real voice envelope. This prevents the most common validation failure: cherry-picking one similar quote as proof of accuracy.
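
An envelope check of this kind might be sketched as follows. The metric choices and slack factor are assumptions for illustration; the real system computes more metrics and separates them per register:

```python
def voice_envelope(quotes, slack=0.25):
    """Compute min/max envelopes for word count and punctuation density
    from verbatim transcript quotes, widened by a slack factor."""
    def band(values):
        lo, hi = min(values), max(values)
        pad = (hi - lo) * slack
        return lo - pad, hi + pad
    word_counts = [len(q.split()) for q in quotes]
    punct = [sum(c in ".,!?" for c in q) / max(len(q), 1) for q in quotes]
    return {"words": band(word_counts), "punct_density": band(punct)}

def inside_envelope(text, env):
    """A generated response must land inside the real-voice envelope."""
    wc = len(text.split())
    pd = sum(c in ".,!?" for c in text) / max(len(text), 1)
    return (env["words"][0] <= wc <= env["words"][1]
            and env["punct_density"][0] <= pd <= env["punct_density"][1])
```

Anchoring to a numeric envelope means a single flattering quote cannot carry the validation; the generated text has to match the distribution.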

The rubric also includes two tests designed to fail. Test 4 generates a deliberately bad response that violates known anti-patterns. It expects a FAIL score, proving the rubric catches deviations. Test 5 uses the wrong persona's voice entirely and expects a FAIL, proving the rubric tells people apart. The first completed persona scored 59/60 across all five tests.

If you're exploring how multi-agent AI systems could work for your team, book a free intro call -- 15 minutes, no pitch.

By the Numbers

  • 59/60 voice simulation score across 5 tests with a 12-point rubric
  • 10+ coordinated Claude Opus instances per full persona pipeline run
  • 12,000+ words per completed persona profile with provenance-tagged data
  • 12 automated validation gates enforcing quality throughout the pipeline
  • 20 reusable prompt templates generating up to 15 persona-specific agent prompts
  • 8,553 lines of orchestration code in Bash and Python, zero external dependencies
  • 7 interview phases covering 60-100+ substantive exchanges per session
  • 6 provenance tags tracking data lineage from interview to final output
  • 8 days from first commit to working pipeline, built entirely with Claude Code

Architecture

The brain: A single CLAUDE.md file programs every agent session. It covers stage separation (interview agents never analyze; analysis agents never update files), agent privacy rules, and the behavioral constraints that keep 10+ agents consistent. Every new session loads it automatically. No configuration server. One file controls everything.

Runtime: The entire system runs inside Claude Code sessions. No application server or web framework. No containers. Each pipeline stage is a standalone Claude Opus session with Bash and Python scripts handling orchestration and validation between stages.

Orchestration layer: Fifteen hook scripts across six event types form an automatic pipeline. One user message triggers four hooks: raw capture, auto-commit, gap detection, handoff sync. PreToolUse guards enforce write rules on interview files. When a session ends, a six-step chain fires: reconcile, merge, detect gaps, sync, validate, commit-push.
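
The SessionEnd chain is fail-fast: each step only runs if the previous one succeeded. A minimal Python analogue, with the step script names shown purely as placeholders for the real hooks:

```python
import subprocess

# Hypothetical step names mirroring the six-step SessionEnd chain.
SESSION_END_CHAIN = ["reconcile.sh", "merge.sh", "detect_gaps.sh",
                     "sync.sh", "validate.sh", "commit_push.sh"]

def run_chain(steps):
    """Run hook steps in order; stop at the first non-zero exit.
    Returns the name of the failing step, or None if all passed."""
    for step in steps:
        result = subprocess.run(step, shell=True)
        if result.returncode != 0:
            return step
    return None
```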

Data integrity: Three safety nets prevent data loss. First: hooks enforce commit-and-push on every transcript write. Second: auto-commit hooks fire on every user message. Third: a SessionEnd chain pushes everything as a final catch. File operations use flock for atomic staging to prevent race conditions.
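
The flock pattern, sketched here in Python via `fcntl` rather than the shell `flock` the system actually uses, pairs an exclusive lock with an atomic rename so concurrent hooks cannot interleave read-modify-write cycles (the lockfile convention is an assumption):

```python
import fcntl, json, os

def atomic_update(path, mutate):
    """Read-modify-write a shared JSON file under an exclusive lock,
    then swap the new version into place with an atomic rename."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)   # blocks until the lock is free
        try:
            try:
                with open(path) as f:
                    data = json.load(f)
            except FileNotFoundError:
                data = {}
            mutate(data)
            tmp = path + ".tmp"
            with open(tmp, "w") as out:
                json.dump(data, out)
            os.replace(tmp, path)          # atomic on POSIX filesystems
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```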

Prompt generation: A bash template engine fills 20 templates to produce persona-specific agent prompts. This builds on the custom skill pattern where behavioral instructions persist across sessions. Persona name, display name, and available analysis files get injected via sed/awk. A validation pass catches un-replaced patterns. Templates only generate when matching analysis artifacts exist on disk.
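
The fill-and-validate pass reduces to a few lines. This Python sketch stands in for the real sed/awk engine, and the `{{NAME}}` placeholder syntax is an assumption for illustration:

```python
import re

def fill_template(template, values):
    """Replace {{NAME}} placeholders and fail loudly on any left over."""
    out = template
    for key, val in values.items():
        out = out.replace("{{" + key + "}}", val)
    leftover = re.findall(r"\{\{[A-Z_]+\}\}", out)
    if leftover:
        raise ValueError(f"un-replaced patterns: {leftover}")
    return out
```

Failing on leftover placeholders is the cheap insurance that keeps a half-filled prompt from silently reaching an agent.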

The pipeline follows a plan-audit-implement-verify cycle at every stage transition. No artifact moves forward without passing its validation gate.

Early Testing Results

The first fully profiled persona is running in live group conversations with other AI-represented personas. Where the system hits the mark: tone calibration, informal language patterns, and what I call the "vulnerability wrapper" pattern. The real person deflects emotion with humor, then follows with genuine affirmation. The AI representation reproduces this two-step naturally without being explicitly told to sequence it that way. That was the moment the system proved its depth.

Where it drifts: the AI occasionally defaults to higher energy than the real person and slips into what I call "helpfulness bleed," where Claude's base personality seeps through the persona mask and the AI volunteers information instead of matching the real person's low-effort style. Early testing puts accuracy at roughly 80%; the remaining gaps map to identifiable calibration targets, addressable through follow-up interviews and refined LLM tuning notes.

Frequently Asked Questions

How does adversarial dual-analysis improve persona accuracy?
Two independent AI analysts process the same interview transcript with different analytical stances, one standard and one explicitly skeptical. A third challenger agent reconciles divergences through evidence arbitration. This adversarial structure catches blind spots and confirmation bias that a single-pass analysis would miss.
What does the voice simulation rubric measure?
Six dimensions scored on a 0-2 scale, for a 12-point total: message length, punctuation patterns, language intensity, humor presence and type, sentence structure, and register appropriateness. Scores are anchored against real transcript metrics with min/max envelopes per register to prevent cherry-picking.
Can this approach work for business personas or customer profiles?
The interview-to-analysis-to-validation pipeline is domain-agnostic. The same structured interview, adversarial analysis, and voice simulation methodology applies to any scenario where AI needs to authentically represent a specific person's communication style.

Need a multi-agent AI system designed for your use case? Book a free intro call -- 15 minutes, no pitch.
