Why skills, not just better prompts?

I write skills the way other people write internal libraries: a small piece of repeatable shape that lives in version control, gets reviewed, and earns its keep by reducing the cost of the next similar task. What changes when the capability lives in a skill rather than as text I paste into chat is not the wording. It is the lifecycle.

A pasted prompt has no shape past the session that used it. The next person who needs the same procedure has to find the conversation that worked, copy it again, and trust their own redaction. A Claude Code skill is a versioned module: a directory with a markdown body and YAML frontmatter describing what the skill does and when it applies. The model reads the frontmatter and can route to the skill on its own when a request matches; the user can also invoke it by name through the slash menu. The body loads only on invocation, so long reference material costs almost nothing in context until it's needed.

Anthropic's own Claude Code skills documentation draws the boundary between memory and skills directly. CLAUDE.md is for facts Claude should hold in every session: build commands, project layout, "always do X" rules. If an entry is a multi-step procedure or only matters for one part of the codebase, the recommendation is to move it into a skill or a path-scoped rule. The architectural framing on the Agent Skills overview is more explicit: at session start, roughly a hundred tokens of metadata per installed skill load into context, and only that metadata. The skill body is read only when the model decides the skill is relevant. Most of the disk footprint of a skill is invisible at session start by design.

That progressive loading is the difference between a writer's prompt library and a team's skill catalog. A prompt library scales with the writer; a skill catalog scales with the codebase. By the time a team has shared CLAUDE.md routing plus a small cluster of skills, the cost of the next "we always do it this way" capability is the cost of writing one more skill, not the cost of one more conversation that has to bootstrap from zero.

In this repository, the skill count is currently eighteen, alongside sixteen specialist subagents and forty-two named validators. Each layer catches a failure class the others cannot. The skill layer specifically catches procedural-knowledge gaps: things the model can do well only when it follows the right shape, and that the wider context doesn't carry. The day-job analog has held the same way. Hundreds of engineers across the team I work with now use Claude Code for all development tasks; the structural difference from the early Copilot trials is not better prompting but a shared infrastructure layer in which routing rules, hooks, and skills are deliberate.

The value proposition collapses to one sentence: a skill is the abstraction that turns repeatable procedural knowledge into versioned, reviewable, on-demand capability that the model and the user can each find.

What does a skill unlock?

Three things the prompt and the project file cannot do on their own.

The first is shared procedural knowledge that doesn't pay rent in context. A long reference document inside CLAUDE.md is loaded into every conversation whether the conversation needs it or not. The same reference inside a skill body is invisible until the description matches what the user asked. The reference document I keep as the migration record for this repo (an opt-in tool-capability summary) is structured as a skill specifically for this reason: the file is several thousand tokens, but it costs nothing until a request triggers it.

The second is two-axed discoverability. The same skill answers a typed slash command and a natural-language request, without you having to choose between them at write time. A user who knows the slash name types it directly. A user who doesn't know the slash name describes the task; the model reads its description and routes. This is different in kind from CLAUDE.md (one surface, always loaded), inline prompts (one surface, hand-pasted), and hooks (no user surface at all, runtime-only). It's the closest single primitive in the Claude Code stack to "a capability the team can both teach by name and discover by intent."

The third is portability under a standard. The Agent Skills open standard was published as the agentskills.io specification in December 2025, and the agentskills.io directory lists over forty tools (as of May 2026) as adopters today. The base spec defines six frontmatter fields; Claude Code extends it with additional fields that govern invocation control, model selection, tool scoping, and subagent forking. A skill written to the base spec is portable across the ecosystem; a skill written to the Claude Code extensions is most powerful inside Claude Code but degrades gracefully elsewhere. The portability is real but not symmetrical: a Claude-Code-specific behavior like the disable-model-invocation flag has no equivalent in tools that haven't adopted that extension. The implication for an author is that the choice of which fields to use is also a choice about how portable the resulting skill is.

What does not survive the move from prompt to skill is informality. A pasted prompt can drift in voice or detail because no other reader will see it. A skill that the model and users both invoke is a public interface inside the team; it earns the same review discipline as a function signature. That is the cost of the unlock.

How does the two-axed surface change your job?

You're writing for two readers now, both of whom are choosing whether to pull the skill in.

The first reader is the model. It sees the skill's description (and the optional when_to_use text) at session start as part of its routing surface, decides if a request matches, and routes accordingly. Anthropic's skill authoring best practices is unusually direct about this surface:

The description is critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills. Anthropic, "Skill authoring best practices"

Three rules in that documentation are load-bearing for routing accuracy. Descriptions are written in third person; inconsistent point-of-view causes what Anthropic calls "discovery problems." Descriptions name specific keywords or triggers a request would naturally include, not vague phrasing like "Helps with documents." Descriptions stay short; the per-skill character cap on the description plus the optional when_to_use text is 1,536 characters in Claude Code, and that combined listing competes with other skills for a finite budget. When the budget overflows, the least-invoked skills' descriptions are dropped first; routing for those skills silently goes blind.

The second reader is the user typing the slash menu. They see the skill's name, an optional argument hint, and (depending on settings) a brief surface gloss. A user-facing surface has different rules than a model-facing one. A slash name should be short, memorable, and unambiguous in the menu. It should not collide with sibling skill names that do different things. Argument hints, when present, are decoration for autocomplete, not gating on what the skill accepts.

Designing for both surfaces deliberately means treating them as a co-authored constraint. The description has to be sharp enough for the model to route on without colliding with neighbor skills. The slash name has to be memorable enough for the user to invoke without consulting documentation. The scope has to be tight enough that both surfaces still resolve to one coherent capability. A skill whose description fires for many unrelated requests will route incorrectly; a skill whose slash name is generic will be ambiguous in the menu; a skill whose scope is sprawling will fail one of the two readers no matter how the names are tuned.

When a workflow only belongs on one surface, you can opt out of the other. Setting disable-model-invocation to true takes the skill off the model's routing surface and leaves it as a user-invocable command, useful for actions with side effects where the user should sign off, like deploying or committing. Setting user-invocable to false hides the skill from the slash menu and leaves it as model-routed reference knowledge. Neither is a default. Both are deliberate choices, and choosing them well is the substance of skill authorship.

A practitioner observation from inside this codebase: the skill that taught me the most about the two-axed surface was one of the earliest I wrote, called code-archeology. The first version was a vanilla-Claude prompt that I copied between sessions. The second version, rebuilt through Anthropic's /skill-creator interview and tightened with description-optimization, raised activation accuracy meaningfully without changing the underlying procedure. The procedure was right the first time; the description wasn't. That is the lesson the two-axed surface tries to surface at write time: the procedure and the description are different artifacts, and both need their own iteration.

When is a skill the right shape (vs. subagent, hook, MCP server, or inline knowledge)?

This is the decision that, when made wrong, makes a skill underperform the alternative it should have been. Anthropic's features-overview carries a trigger-to-primitive table I keep close to the keyboard. The shorthand version:

If the trigger is...The right primitive is...Why
You keep typing the same prompt to start a taskA user-invocable skillThe procedure is reusable; both surfaces help different users discover it
You keep pasting a multi-step reference into chatA skill (reference type)Long material loads on demand instead of every session
Claude gets a convention wrong twiceCLAUDE.md (or path-scoped rule)The fact should hold in every session, not be discovered
Something has to happen every time without askingA hookSkills are advisory; hooks are deterministic
You keep copying data from a system Claude can't seeAn MCP serverExternal-system access; per-user OAuth lives here
A side task floods your conversationA subagentContext isolation is the property

The most common mistake I watch new authors make is reaching for a skill when the requirement is enforcement. The features-overview comparison is unambiguous about this:

An instruction like "never edit .env" in CLAUDE.md or a skill is a request, not a guarantee. A PreToolUse hook that blocks the edit is enforcement. Anthropic, "Extend Claude Code" (features overview)

That contrast is the difference between a skill and a hook, compressed into two sentences. A skill says "here is how to do this thing." A hook says "this thing will not happen unless the gate passes." When a project needs the latter, the production secrets that must never be edited, the validator that must always run before commit, the push-to-main from a worktree that must always be refused, the right shape is a hook (Anthropic's hooks-guide is the canonical reference), and a skill that "really, really tries to enforce" the same rule is structurally a wish.

The second-most common mistake is reaching for a skill when the requirement is external-system access. An MCP server gives the model a way to read and write to systems it cannot otherwise see (issue trackers, monitoring dashboards, internal databases), and the Anthropic MCP documentation carries the native OAuth surface that a skill cannot replicate. A skill that wraps an external API by pasting credentials into context is a security incident waiting for an audit. The right shape there is MCP with per-resource OAuth, and a skill that names the external system as part of the workflow it documents.

The third mistake is reaching for a skill when the requirement is context isolation. Long-running exploration, file-heavy research, or any work whose intermediate token stream would overwhelm the main conversation belongs in a subagent (the canonical Anthropic reference is Create custom subagents), not a skill. The subagent-orchestration cornerstone documents the operational shape: each worker runs in its own context window, returns only a distilled finding, and the lead never holds the intermediate state. A skill running in the main conversation does the opposite: it adds to the same window.

The composable case is worth naming because it changes the trade-off. Claude Code supports a context: fork mode in which a skill runs inside a fresh subagent context, combining the skill's procedural shape with the subagent's isolation guarantee. That is the right pattern when a procedure is repeatable enough to deserve a skill but expensive enough in context that running it inline would crowd the main conversation. The pattern shows up most in skills that read many files, run validations across a corpus, or fan out to other tools.

There is one more primitive that authors sometimes overlook: the existing project memory file (CLAUDE.md) and its sibling project-rule files. Anthropic's memory documentation is explicit that CLAUDE.md content is delivered as a user message, "not as part of the system prompt itself. Claude reads it and tries to follow it, but there's no guarantee of strict compliance." CLAUDE.md and skills are both probabilistic: neither is the right answer when the requirement is enforcement. The split between them is about lifecycle. Facts that hold every session belong in CLAUDE.md. Procedures that only matter on certain requests belong in skills.

The honest decision rule for me runs in this order. If the requirement is "this must happen every time," I write a hook. If the requirement is "the model needs access to a system it cannot see, with per-user authorization," I configure MCP. If the requirement is "this side task will flood the context window or risks hallucination contamination," I dispatch a subagent. If the requirement is a fact that should hold in every conversation, I put it in CLAUDE.md. If the requirement is a procedure I will repeat, that two different users might discover differently, that I want to version and review, I write a skill.

What patterns make a skill compound?

The thesis of this cornerstone says the pattern compounds: each well-shaped skill lifts the next, because the model sees a sharper catalog and the user learns a sharper menu. The honest version is that this is true up to a threshold, and beyond the threshold the catalog itself becomes the operational surface that has to be maintained.

The structural evidence is the most useful place to start. Published academic research on skill routing at scale measures a log-decay relationship between library size and routing accuracy: in the formal model from the Scaling Laws of Skills in LLM Agent Systems paper, doubling the catalog costs roughly three percentage points of routing accuracy across the model families tested. The same paper names a discrete failure mode I now watch for in my own catalogs: "black-hole skills," abstract descriptions that capture probability mass from many unrelated queries and ruin the routing for their neighbors. The SkillRouter paper from March 2026, working on a registry of eighty thousand skills, measured a 31 to 44 percentage point routing-accuracy gap between description-only and full-body retrieval, and the gap remained at or above 26 percentage points even at the top quartile of description quality. Description engineering changes the slope of decay; it does not lift the structural ceiling.

This matters because Anthropic's own skills troubleshooting section, in plain language, acknowledges the same failure modes. Skill descriptions can get truncated when the listing budget overflows, which means the model goes blind to less-recently-used skills. The model can also choose other approaches over a loaded skill, even when the skill is the better answer. Anthropic's documented remedy for the second case is to "use hooks to enforce behavior deterministically." That sentence, read closely, is the company saying: when you need a guarantee, this is not the abstraction.

What compounds in spite of all this is the catalog when it is curated. Five patterns hold in my experience, and each one corresponds to a documented failure mode of the unmaintained catalog.

Pattern one: tight scope per skill. The skill answers one task with one shape, regardless of which surface invoked it. A skill that does three loosely related things forces both the model and the user to disambiguate at the wrong moment. Splitting the skill into three skills with sharper descriptions tightens routing for all three; merging them recreates the ambiguity.

Pattern two: description-as-routing-input, not as documentation. The description is short, third-person, and packed with the keywords a real request would carry. A Builder.io practitioner taxonomy names three failure modes here that I have hit personally: "Secret Handshake" descriptions are too abstract for the model to recognize the trigger; "Everything Bagel" descriptions apply universally and route the skill into requests it shouldn't handle; "Encyclopedia" descriptions signal a narrow trigger for a sprawling body that bloats context. The body of the skill is for the model after it routes in. The description is for the routing decision itself. Conflating the two destroys the routing surface.

Pattern three: catalog hygiene as a recurring discipline. Once the catalog approaches the threshold where description-routing accuracy starts to decay measurably, the cost of NOT deleting low-value skills exceeds the cost of writing the next new one. The discipline is to audit which skills fire (and which fire for the wrong requests), remove black-hole descriptions, and keep the total under a threshold the team's routing accuracy can absorb. The published log-decay law gives the principle; the practical threshold in any given codebase will depend on how distinct each skill's description is and how often the catalog gets curated.

Pattern four: posture per skill, deliberately. Setting disable-model-invocation or user-invocable: false is not a default to flip out of habit. It is a deliberate trade-off: removing one of the two surfaces removes one of the two failure modes (an accidental model invocation or a missed slash-menu entry) and removes one of the two ways the skill is discoverable. A skill that disables the model surface should be a skill the user is reliably going to remember to invoke; a skill that disables the user surface should be one whose description is sharp enough that the model will reliably route. Picking the wrong posture per skill is recoverable; making the choice without intent is not.

Pattern five: hooks for the floor, skills for the ceiling. Workflows that must hold every time become hooks. Workflows that should hold most of the time but reward judgment become skills. CLAUDE.md carries the facts every workflow assumes. MCP carries the external-system access skills cannot legally provide. The skill catalog is a discoverable shape for procedural knowledge; it is not a substitute for the deterministic layer that the project needs underneath.

A structural alternative is worth naming before this section closes. The SkillRouter paper that documents the description-only routing ceiling proposes a retrieval-augmented routing layer (a 1.2-billion-parameter rerank pipeline over the full skill body) as the way past that ceiling. For a generalist registry of tens of thousands of skills, that approach is the right move. For a team's curated catalog of dozens, the engineering cost of standing up a retrieval-and-rerank service exceeds the catalog-hygiene cost, and the academic decay slope is the wrong threat model. Catalog hygiene is the answer when the catalog is human-curated and bounded; retrieval-augmented routing is the answer when it isn't. The cornerstone's "compounds" claim holds in the bounded-catalog regime, which is the regime most teams operate in.

Two reliability dynamics also need acknowledgment because they shape what "compounding" looks like in operation. The first is that description routing depends on the model's reasoning quality, which Anthropic controls at the runtime layer. The InfoQ postmortem on Anthropic's Q1 2026 quality complaints documented an undisclosed default-effort downgrade in early March 2026 that measurably affected routing across the user base before it was reverted. The second is that the two surfaces have, on at least two recent release versions, drifted in implementation: the disable-model-invocation flag has been observed to contaminate slash-command invocation in one release, and slash-command skill loading has been observed to regress entirely in another. The right structural inference is not that the surfaces are unreliable; it is that they are coupled enough to break together when a runtime change in one direction surfaces in the other. Catalog hygiene cannot fix either of these. What it can do is shrink the blast radius when they happen, because the catalog the model has to route through is smaller and sharper.

The cumulative point is that the thesis holds with a qualifier the original framing didn't carry. A well-shaped skill lifts the next skill, until the catalog crosses a threshold where description-routing degrades. After that, the same catalog hygiene that prevents black-hole descriptions is what keeps the compounding going. That is not a limitation. It is a design rule worth knowing exists.

How does this page stay current?

This guide is on the spine, not the freshness reference. Cornerstones engage time-sensitive vendor surfaces (Claude Code documentation, Anthropic best practices, academic preprints), and I want this page to age honestly.

The dual-cap rule the site uses on cornerstone sources: AI and SaaS tool-capability claims are revisited at three months, general tool-capability claims at six months. Any row in the ## Sources table that falls past its cap without a fresher equivalent is documented in the "Fresher alternatives considered" footer of the sibling sources file with a search trail explaining why no better source qualified.

Three specific triggers will pull this page back for an update: an Anthropic change to the two-axed invocation surface (a new posture flag, a deprecation, or a change in default), a structural change to the Agent Skills open standard that affects portability, or new academic work on skill-routing at scale that shifts the threshold framing in the catalog-hygiene section. If none of those fire, the next refresh window is the three-month AI/SaaS cap on the most time-sensitive Sources rows.

What I want this page to never need to retract is the value claim. The architecture of progressive disclosure plus a two-axed invocation surface is what makes a skill a different kind of artifact from a prompt or a project file. The patterns that make a catalog compound or decay are findable in the public record. Your job as a skill author is to write each skill so that both surfaces still resolve to one coherent capability, and to maintain the catalog the way an engineering team maintains any other shared surface. The rest is detail.