A UX researcher I work with described the problem this way: software development keeps accelerating, the research repository cannot keep up, and structured reports are easier to generate than ever and harder than ever to read in depth. She'd been creating digestible per-individual Claude instance reports for the product leads and designers who needed to act on the findings. Fragmented. Manual. Each one disconnected from every new insight that arrived after it shipped.

For a long time I'd have agreed the fix was better reports. Tighter executive summaries, more visual one-pagers, a stronger top-of-document headline. Helping engineering teams hit the same wall and work through it changed my read.

Your research repository has a retrieval problem, not a reports problem. And every fix that leaves retrieval untouched, including the habit of loading reports into individual Claude instances one at a time, leaves the disconnect intact.

The Problem Isn't Reports. It's Retrieval.

This section complicates the standard "make reports better" framing by showing what changes when AI becomes the retrieval layer.

The conventional read of stalled UX research adoption goes like this. The team writes long research reports, the rest of the organization stops reading them, and the answer is to make reports shorter or more visual. Atomic insights as a presentational format. Executive one-pagers. Video summaries. All of those are improvements. None of them addresses why a product manager opening Claude on a Tuesday afternoon can't get an answer to "what did our users say about the onboarding flow last quarter?" without finding, downloading, and pasting in a specific report first.

The Nielsen Norman Group surveyed research repositories across organizations and found that only 9% reach what they call mature and thriving status. (Why Research Repositories Fail and How to Get Them Right, NN/G, 2024-07-26.) 29% have no owner. The pattern is consistent at every organization size. NN/G's read of the underlying issue is direct: "A research repository will not address the root cause problem your company likely has, which is a broken decision-making process based on political power and gut feelings."

That point is fair. A repository doesn't fix organizational dysfunction. But the pre-LLM framing of the repository was a place where researchers stored reports so other humans could find them later. The post-LLM question is different. The repository is now a place where structured findings live so any team member's AI can pull from them on demand. That shift changes what counts as a well-formed repository, and it changes the failure mode when the structure doesn't match.

The current state of AI use in research is striking.

0 % of research repositories reach mature, thriving status (NN/G)
0 % of orgs say research is essential to all strategy levels (up from 8%, Maze 2026)
0 % of UX researchers now use AI tools (a 24-point jump in one year)
0 % cite hallucinations as their top AI concern (User Interviews)

80% of researchers now use AI tools, a 24-point jump in a single year. 91% cite hallucinations as their primary concern. Only 21% are satisfied with how their team measures research impact. (State of User Research 2025, User Interviews.) Maze's 2026 report found that organizations where research is essential to all strategy levels tripled in one year, from 8% to 22%. (Future of User Research 2026, Maze.) 66% of research teams report increased demand for research, up from 55% the prior year. Supply is not keeping pace. The gap is what teams are trying to close, and the closing strategy on offer (more reports, more frequently, in more digestible formats) leaves the retrieval problem untouched.

This is the same shape engineering teams hit two years ago and worked through. The diagnosis is wrong-layer routing: every directive forced through one mechanism (a CLAUDE.md file in their case, a one-shot Claude chat session in yours), and the team concluding the model is not ready instead of recognizing the architecture is. The fix is the same. Build the infrastructure layer that lets every team member's AI reach into the same source of truth.

Why Loading One Report Into Claude at a Time Hits a Wall

This section explains the mechanism. The per-instance load fails for a specific reason that's worth seeing in detail before prescribing the fix.

When a designer or product manager loads a single PDF report into Claude and asks "what should we do about the onboarding flow?" the model is reasoning over a tiny, biased, time-bounded slice of what your team knows. The report covered three usability sessions from Q3 last year. It didn't cover the support-ticket pattern your CX team flagged in February. It didn't cover the diary study your team ran in March that contradicts one of the report's recommendations. The model can't pull either, because neither was loaded into the context window.

Then a stranger thing happens. The model produces a confident answer anyway.

Google Research published findings in May 2025 measuring this directly. They found that Gemma's incorrect-answer rate jumped from 10.2% with no context to 66.1% when the model was given context that was incomplete for the question being asked. (Deeper Insights into Retrieval Augmented Generation: The Role of Sufficient Context, Google Research.) The mechanism is straightforward and worth absorbing. With no context, the model often abstains or hedges. With incomplete context, it commits. State-of-the-art models including Gemini, GPT, and Claude share the failure pattern. They lack the ability to recognize and avoid generating incorrect answers when the provided context is insufficient.

This is the failure mode of the per-instance load. Every individual report load is, by definition, insufficient context. The model can't see the studies you didn't load. It'll answer anyway, with the confidence it would bring to a complete reference.

The architecture pattern Anthropic recommends for its own agents is the inverse. The team's engineering blog on context engineering describes maintaining lightweight identifiers (file paths, stored queries, web links) and pulling content just in time. Their canonical example is CLAUDE.md plus glob plus grep. The CLAUDE.md is short and stable; the agent retrieves specific files when a question requires them. Pre-loading every research report into a chat session is the opposite shape. It's the engineering equivalent of pasting your entire codebase into a system prompt. Works for tiny systems. Breaks the moment the corpus crosses a threshold.

There is one more layer. The chunking question (what unit of research artifact a model should retrieve at a time) does not have one right answer. An arxiv paper on retrieval granularity found that fixed-chunk-size retrieval underperforms because different queries need different sized chunks. A specific question like "what verbatim phrase did users use to describe the cancellation flow?" wants a small chunk. A broad question like "what have we learned about new-user onboarding over the last year?" wants larger ones. The corpus needs to support both. A research report loaded one at a time supports neither.

Before

Loading reports one at a time

  • One PDF in context per Claude session
  • Cross-study patterns invisible to the model
  • Confident answers from incomplete context (Gemma error rate: 10.2% to 66.1%)
  • Each team member's Claude has a different, unsynchronized view
  • Re-loading is manual every session, every person
After

Querying a shared corpus

  • Lightweight index loaded once; full findings retrieved on demand
  • Cross-study queries return matched findings from anywhere in the corpus
  • Granularity matches the question (nugget for specific, study summary for broad)
  • Every team member's Claude reads from the same source of truth
  • Adding a study updates what every future query can see

The Engineering KM Pattern Already Solved This

This section is the load-bearing synthesis. The architecture pattern engineering teams worked through for the same shape of problem transfers to research with one adjustment, which I'll address inside the section.

The instinct to drop a report into Claude is the same instinct engineers had two years ago: stuff every directive into a long prompt and hope the model honors it. The fix in the engineering domain has three load-bearing pieces.

A short shared file describes the rules of engagement (what the team values, what to avoid, what conventions matter). Skills encode reusable, on-demand procedures (how this specific team writes API integrations, how it runs a code review, how it interprets a flag). The Model Context Protocol (MCP) connects the model to live data sources so the AI can pull from a corpus on demand instead of being pre-loaded with everything. Each piece does work the others are bad at. None replaces the others.

This is the architecture pattern UX research repositories already need.

The atomic insight, by the way, isn't new to AI. Tomer Sharon (former Head of UX at WeWork) wrote in 2016 that "a report is not the atomic unit of a research insight" and built the Polaris system at WeWork around tagged, evidence-linked nuggets. (The atomic unit of a research insight, Tomer Sharon.) The User Interviews UX Research Field Guide formalizes the structure: a nugget is an observation backed by evidence (video clip, quote, screenshot) and tagged for retrieval. (Atomic Research Nuggets, User Interviews.) That structure was designed for human retrieval and human pattern-recognition. It happens to be exactly the right structure for an LLM to query.

There is a fair objection here. Engineering knowledge tends to be propositional. A function returns this; an API expects that. UX research is interpretive, qualitative, contextual. Atomized findings can lose the situational meaning that made them valuable in the original study. That objection deserves the section it gets later in this post. For the architecture question, what matters is whether the retrieval pattern still helps once you grant the difference. It does. A well-structured research corpus stores findings, study context, and methodology metadata as separately retrievable resources. The interpretive layer (how to weigh a 2024 mobile finding against a 2026 desktop question) lives in a conventions skill that travels with the queries, not stripped out of the data. The analogy holds at the level of architecture even when the content type differs.

The pattern generalizes beyond research and engineering. The same shape (corpus structured for retrieval, conventions encoded as instructions, decisions captured as rubrics) is what A Field Guide to Training AI on Your Company's Knowledge walks through across eight business roles.

I have shipped this pattern twice. At one enterprise organization, hundreds of engineers had been using Claude as individual chat sessions for months with negligible team-level effect. After shared CLAUDE.md routing and a small library of internal skills went into the build, the same engineers (including dozens who had been burned by earlier AI-coding tools) recovered to consistent all-team adoption inside a quarter. The shape mattered. The infrastructure layer was the unlock, not the model upgrade.

The second is in an adjacent research domain. The AI Persona Profiler is a multi-agent system that ingests research artifacts (transcripts, observations, persona criteria), structures them into queryable findings, and runs adversarial dual-analysis over the corpus to produce a high-fidelity persona model. The voice-fidelity score across a 12-point rubric is 59 out of 60, achieved by structuring the corpus first and querying second rather than loading everything into one massive prompt. Personality research is not UX research. The pipeline shape is the same argument.

What does the corpus look like on disk? Smaller than you think.

research/
research/
├── README.md # how the corpus is structured
├── conventions.md # tagging vocabulary; confidence levels
├── studies/
│ ├── 2026-q1-onboarding/
│ │ ├── summary.md # 200-word study summary + methodology
│ │ ├── findings/
│ │ │ ├── F-001.md # one finding = observation + evidence + tags
│ │ │ ├── F-002.md
│ │ │ └── F-003.md
│ │ └── sources/ # transcripts, recordings, raw notes
│ └── 2026-q1-cancellation/
│ └── ...
└── tags.yaml # canonical tag list with definitions

Every finding file is a small markdown document with a YAML header (id, study, date, tags, confidence, linked evidence) and prose. Studies have a 200-word summary that lets a model orient quickly without reading the full study. Tags live in one canonical file the team agrees on. There is no proprietary platform here. There is no enterprise vector database. This is a folder structure your existing tools (Notion, Confluence, Dovetail exports, Google Docs) can produce by following a convention.

The Nugget, Not the Report, Is the Retrieval Unit

This is the prescriptive section. Two layers (the unit of retrieval and how the corpus is exposed) plus the convention skill that interprets queries against it.

Once the corpus is structured this way, the retrieval architecture has two layers.

Start with the unit of retrieval. A finding (observation + evidence + tags) is the atomic unit. A study summary (200 words plus methodology metadata) is the broader unit. Both live in the corpus. A specific question pulls findings; a broad question pulls study summaries with links to the underlying findings. The granularity routing arxiv paper above is the academic justification; the lived practitioner version is that "what did users say about pricing?" wants different chunks than "what's the state of our cancellation research over the last year?"

Exposure is the second layer. The corpus needs to be reachable by any team member's AI without that person manually pasting context. Two viable shapes exist today.

One is AI as a feature layer, which is what most vendors ship. Dovetail's AI Chat and Search lets non-researchers query the Dovetail workspace from inside Dovetail itself, Slack, or Microsoft Teams. Marvin's Ask AI Search plays the same role inside Marvin. Notably and Maze are similar. The corpus stays inside the platform's walled garden, queryable from the vendor's surfaces but invisible to a researcher's standalone Claude or a designer's ChatGPT subscription. Fine if your whole team agrees to do all research-related thinking inside the vendor's chat interface or its bot integrations. Not fine if anyone wants to use the AI tools they already pay for, on the surfaces they already use.

The other shape exposes the corpus via MCP. MCP is an open protocol Anthropic introduced that lets any compatible AI client (Claude, ChatGPT, Cursor, VS Code) read from any MCP server, regardless of who built the server. The metaphor in the spec is "USB-C for AI." Build one MCP server for your research corpus and every team member's AI can query it from their preferred tool. Great Question's MCP integration is the first UX research vendor I've seen ship this pattern, with 70+ MCP operations documented. It's in enterprise early access as of this writing, which means the pattern is buildable today and adoption is early. The vendors lagging here will be playing catch-up within twelve months.

Think of the MCP server as the retrieval pipe. The convention layer (how findings are tagged, what confidence levels mean, how to weight a 2022 study against a 2026 one) is encoded as a Claude Code skill or its equivalent. Skills aren't engineering-only. A research-conventions skill can be a 200-line markdown file that tells any Claude session how to interpret your team's tag vocabulary and what to do when two findings conflict. Pair the MCP server (the data) with a skill (the interpretation) and a non-engineer team member's Claude can answer a research question correctly without them ever touching a file.

If the build sounds like a lot, it is not.

1

Pick the next study, not the backlog

Do not migrate every legacy report. Start with the next research study going out the door. Write its findings as atomic markdown files (observation + evidence + tags) instead of a 30-page PDF.

2

Define the tag vocabulary in one file

tags.yaml lives at the corpus root. 20 tags covering products, surfaces, user types, confidence levels. Edit it once a quarter. Everyone uses the same vocabulary.

3

Write a one-paragraph conventions file

conventions.md describes how to interpret findings: what confidence levels mean, how to weigh older studies, how to flag contradictions. This is the skill content later.

4

Expose the folder via MCP

Use an off-the-shelf filesystem MCP server, a vendor MCP integration if your repository tool offers one, or build a thin custom server. The server reads the conventions file and exposes findings as Resources.

5

Load the convention skill on team Claudes

Every team member adds the skill to their Claude environment. Now any researcher, PM, or designer asking 'what do we know about X?' gets a corpus-grounded answer with citations to the underlying finding files.

Tip

If your team's research is currently in Dovetail or Notion, do not delete those tools. The corpus convention can sit on top of them. The folder structure is a logical view; the source of truth can be any system that supports markdown export. The migration is additive, not destructive.

This is the kind of work my MCP Server Setup and Integration and Custom Skill Development engagements cover when teams want help getting from "we have research scattered across three tools" to "every team member's AI can pull from a unified corpus." But the architecture is documented; the protocol is open; and a research team with one engineer-friendly ally can ship a usable v1 in a sprint.

When This Doesn't Help (And Why You Should Build It Anyway)

The strongest counter-argument to corpus-style architecture comes from NN/G. Their July 2023 article on AI-powered tools for UX research documents a specific failure mode: an AI tool recommended adding filters to a UI where filters already existed but weren't being used. The tool retrieved a finding ("users want filters") without retrieving the context that would have told it the filters were already shipped. NN/G's argument generalizes. A finding stripped of methodology, participant background, interface version, and study conditions cannot be assessed for current applicability. No amount of tagging fixes that.

The architectural response is partial. Linked evidence helps; expose the study summary as an MCP Resource alongside every finding so the AI client pulls both. Methodology metadata helps; date the study, name the version of the product it tested, name the cohort. A skill that tells the model to refuse a recommendation if the cited study predates the current product version helps. None of this fully eliminates de-contextualization risk. The model can still pull a finding without grounding it. The risk reduces from "no way to ground findings" to "ground them by default and watch for cases where you forgot to."

The other counter is the one NN/G makes head-on. A repository doesn't fix a broken decision-making process. If your organization's product leadership ignores research findings regardless of how they're presented, structuring the corpus differently won't change the outcome. That objection is correct, and it's also a different problem. The question for a research lead is whether the corpus shift is cheaper than the cultural shift. Corpus restructuring takes a quarter. Decision-making culture takes years. The cheaper move comes first, and the resulting visibility is what puts pressure on the cultural problem.

There's also a recovery dynamic worth knowing. If your team has researchers or product partners who already tried "drop a report into Claude" and concluded AI can't help with research synthesis, they're not your hardest re-engagement segment. They're your easiest. The pattern they're reacting to (per-instance load, fragmented context, confident wrong answers) is the failure mode the corpus shape removes. Show them one query against a structured corpus, with grounded citations to specific findings, and they'll move faster than they would after a year of keynote talks.

Adoption tip: don't announce the corpus migration as a tools program. Announce it as a research-impact program. Borrow the framing from The Engineering Manager's Guide to Governing Agentic Development: standardize the corpus, not the analyst's prompt. The team standardizes how findings are tagged and where they live. Every analyst writes their own queries. The infrastructure constrains the inputs; the practitioners stay free to interpret.

What to Build This Quarter

The disconnect your team is feeling isn't a reports problem. Your reports are probably fine. The issue is that they're the wrong primitive for the kind of retrieval AI assistants now make possible. Switching the primitive from the report to the finding, exposing the corpus via MCP, and encoding interpretation conventions as a skill is roughly a quarter of focused work. The payoff is permanent: every future researcher's output compounds into something the rest of the organization can pull from.

If your research team is sitting on years of insights and wondering why no one acts on them, the next study is the inflection point. Write that one as a corpus instead of a report. Pull the rest in as you have time.

If you want a structured way to figure out where your organization stands on AI readiness (research, engineering, ops, all of it), the AI Readiness Assessment takes about 15 minutes and produces a report you can share. Teams that assess early move faster when it's time to build. If you'd rather talk through what this looks like for a specific research org, grab fifteen minutes.

Small change in shape. Big change in what the rest of your stack can do with research. Build it once. Skip it and the disconnect keeps growing as everything around it accelerates.