Claude Code: How to Build, Evaluate, and Tune AI Agent Skills

Mastering Claude Code Agent Skills: Effective Strategies for Building, Evaluating, and Optimizing AI Workflows

Claude Code: Build, Evaluate, and Tune AI Skills

Unlock the full potential of AI with "Supercharging Claude Code"! Discover how to build, evaluate, and fine-tune AI skills, implement skill evaluations, and master trigger tuning for your Claude Code agent library.

Summary: Claude Code Agent Skills are SKILL.md files that enhance Claude's functionality for specific workflows, categorized into Capability Uplift and Encoded Preference skills. The Skill Creator automates creation, modification, evaluation, and trigger optimization. Skill evaluations use parallel subagents and binary assertions to measure performance. Trigger tuning resolves false triggers and misfires through iterative description rewriting. The Agent Skills open standard at agentskills.io enables portability across AI coding tools.

Learn how Claude Code Skills work, how to use the Skill Creator for trigger tuning and evaluations, and how to maintain a durable skill library across model updates.

TL;DR

Claude Code Skills are SKILL.md files that extend Claude's behavior for your specific workflows -- and the Skill Creator is the official tool for building and maintaining them
There are two types: Capability Uplift skills (teach Claude to think differently) and Encoded Preference skills (encode your specific workflows)
Skill evaluations use parallel subagents and verifiable assertions to measure whether your skill still outperforms the base model
Skill trigger tuning fixes the false trigger problem, when you have 10 or more skills and the wrong one keeps firing, using iterative description rewriting
The Agent Skills open standard at agentskills.io makes your Claude Code skills portable to Cursor, VS Code, and other tools

What You'll Learn

What Claude Code Skills are and how the SKILL.md format works
The strategic difference between Capability Uplift and Encoded Preference skills
How to use the Skill Creator for creation, modification, and evaluation
How to run skill evaluations and interpret benchmark results
How to fix false triggers and misfires with Claude Code skill trigger tuning
How to maintain a durable skill library across model updates

Prerequisites

Familiarity with Claude Code (you have used it for at least a few tasks)
Basic comfort with YAML and Markdown
Optional: existing Claude Code skills you want to evaluate or tune

You have 15 skills. You type "clean up my code." Three skills think you are talking to them. One of them fires. It is the wrong one.

That is the false trigger problem. If you have spent any time building a serious Claude Code skill library, you have hit it. The wrong skill activates because two descriptions overlap semantically. Or no skill activates because the description was too narrow. Either way, you get unpredictable behavior at exactly the moment you need precision.

The Skill Creator's trigger tuning exists for exactly this. To understand why Claude Code skill trigger tuning matters, you have to understand the whole system: what Claude Code Agent Skills are, how they work, and why the description field is the single most important thing you will write.

This article covers the complete picture: what Claude Code Agent Skills are, the two strategic types you can build, how the Skill Creator handles evaluation and trigger tuning, and how to maintain a library that stays useful as models evolve.

What Claude Code Skills Are

A Claude Code skill is a SKILL.md file with YAML frontmatter. That is the whole format. The name and description fields in the frontmatter are what Claude reads to decide whether to activate the skill. Everything else -- the body of the file -- is the instruction set Claude follows when the skill fires.

---
name: commit-message
description: Use when creating a git commit message. Generates conventional commit format with scope and breaking change detection.
---
# Commit Message Skill
[skill body here]

This YAML snippet is deceptively simple. The description field does all the heavy lifting. It is what Claude reads to decide whether your skill is relevant to the current task. The name identifies it. The description activates it.

Claude Code Agent Skills live in two locations depending on scope. Personal skills at ~/.claude/skills/<name>/SKILL.md apply to all your projects. Project skills at .claude/skills/<name>/SKILL.md apply only within that repository.

Agent Skills Standard

The skill format follows the Agent Skills open standard, documented at agentskills.io. This is not a Medium-specific or Claude-specific format. It is designed to be portable.

That portability matters more than it looks. Your encoded workflows transfer as the ecosystem expands. Write the skill once, use it across tools.

Two Types of Claude Code Skills: Capability Uplift and Encoded Preference

The most important strategic decision when building Claude Code Skills is understanding which type you are building. Get this wrong and you will invest heavily in skills that age out quickly.

AI Agent: Agent Skills — Capability Uplift and Encoded Preference

Every skill you build falls into one of two branches. The branch determines how long the skill stays useful.

Capability Uplift Skills

Capability uplift skills teach Claude to think better than the base model on a specific task. The anti-AI-slop design skill is the canonical example. Claude, without guidance, converges on certain patterns: the same sentence structures, the same hedging language, the same generic frameworks. Left to its own devices, it produces competent but predictable output.

An uplift skill overrides that convergence with specific taste. It avoids the AI Slop. Specify exactly which fonts to avoid, which argument structures to prefer, which clichés to remove on sight. The skill transfers your aesthetic judgment into Claude's behavior.

The limitation: capability uplift skills are a bet that the model will not catch up fast enough to make your skill obsolete. As Claude's base capability improves, some uplift skills get outgrown. The model eventually learns what you encoded.

Uplift skills have a shelf life. Plan for it.

Encoded Preference Skills

Encoded preference skills are different in kind. They encode your specific workflow: the sequence of steps, tools, and source-of-truth systems that only you know about. The model cannot outgrow these because they are not teaching Claude to think differently -- they are teaching Claude your specific process.

Here are some examples of encoded preferences:

The Customer Support Triage example makes this concrete. The skill might: ingest incoming tickets from Zendesk, classify urgency and intent, retrieve relevant knowledge base articles, draft responses, and flag edge cases for human review. Every step is specific to your organization's system. The model cannot replicate this without the skill.

The Sales Intelligence Pipeline example makes this concrete. The skill might: monitor target accounts for signals (job changes, funding, product launches), enrich contacts via APIs, score opportunities, and push updates to your CRM. The workflow is unique to your sales process.

The DevOps Incident Response example makes this concrete. The skill might: watch logs and metrics across environments, detect anomalies, correlate signals across services, spin up runbooks, and alert the right humans in the right channels. Your incident response process is not generic.

The Content Repurposing Engine example makes this concrete. The skill might: ingest long-form content (podcasts, blogs, transcripts), extract key themes, generate platform-specific formats (LinkedIn posts, Twitter threads, email newsletters), and publish to your distribution channels. Your content workflow is specific to your channels and voice.

The Data Quality Monitoring example makes this concrete. The skill might: scan incoming data pipelines, validate against business rules, detect anomalies, trigger remediation workflows, and generate summary reports. Your data quality rules are specific to your domain.

Encoded preference skills compound. These are your workflows. They work with your access skills and MCPs. Each one represents an accumulation of decisions that took you years to make. The model cannot replicate that without the skill.

The practical guidance: build encoded preference skills first. Build them aggressively. Capability uplift skills are worth building, but they require ongoing maintenance. Encoded preference skills are more durable because the model cannot outgrow your specific workflow.

If more than 20% of your Claude Code skill library is capability uplift skills, you are holding skills that will age out faster than you expect.

The Skill Creator: Anthropic's Official Meta-Skill

The Skill Creator is an official Anthropic skill, available at github.com/anthropics/skills. It builds and maintains other skills. It is a meta-skill: a skill that creates skills.

It handles four operations: creation, modification, evaluation, and trigger optimization. Creation and modification are straightforward -- they generate and refine SKILL.md files. The interesting operations are evaluation and trigger tuning.

The evaluation and trigger tuning capabilities are what make the Skill Creator worth installing. Those are the operations that tell you whether your skill library is working, and fix it when it is not.

Running Skill Evaluations in Claude Code

The core problem skill evaluations solve: how do you know if your Claude Code skill is still working?

Skills are static text. Models change. A skill written six months ago against an earlier Claude model might be interpreted differently by the current model. Evaluations give you a measurement instead of a guess.

The Skill Creator's eval system stores test cases in evals/evals.json. Each test case has an input (what the user types) and assertions (what the response must contain). Assertions are concrete binary checks.

{
  "test_cases": [
    {
      "input": "I fixed a bug in the auth middleware that was causing session tokens to expire early",
      "assertions": [
        "response starts with 'fix(' or 'fix:'",
        "response mentions 'auth' or 'middleware' or 'session'",
        "response is under 72 characters"
      ]
    }
  ]
}

Notice the assertion design. Each assertion is a concrete, binary check: either the response starts with fix( or it does not. Either it mentions auth or it does not. Binary assertions are the only kind that give you a reliable pass rate.

When you run a skill evaluation, the Skill Creator spawns parallel subagents. One run executes with the skill enabled. One run executes without it (the baseline). The result is a pass rate comparison: your skill versus the base model.

Two things skill evaluations tell you that nothing else will:

Catching regressions. Your commit-message skill's pass rate drops from 92% to 71% after a Claude model update. The model changed how it interprets the description. Without an eval, you would not notice until a commit message came out wrong in production.

Spotting outgrowth. Your eval shows the baseline (no skill) achieves 88% pass rate, and your skill achieves 89%. The delta is too small to justify the overhead. The model has caught up. Archive the skill.

AI Agents — Agent Skills — Running Skill Evaluations in Claude Code

The diagram maps three failure modes: outgrowth (base model caught up), regression (model update broke the skill), and underperformance (skill never worked as well as expected). All three are invisible without measurements.

Spotting outgrowth is where skill evaluations pay for themselves in maintenance savings. A library you never prune grows into a liability. Skills that no longer add value still fire, still consume context, and still confuse trigger matching.

Skill Benchmarking: pass_rate, duration_ms, and Token Costs

Skill benchmarking extends evaluations with three specific metrics: pass_rate, duration_ms, and total_tokens. Each metric includes mean, standard deviation, and delta versus baseline.

pass_rate

What it measures: Percentage of assertions passing
When it matters: Always: this is the quality signal

duration_ms

What it measures: Execution time per run
When it matters: When you care about latency in your workflow

total_tokens

What it measures: Token consumption per run
When it matters: When cost or rate limits matter

The delta column is the number you care about. A skill that improves pass_rate by 15 percentage points but doubles token consumption is a different tradeoff than one that improves pass rate by 3 points with no token cost increase.

Standard deviation matters more than people expect. A skill with a mean pass_rate of 85% but high standard deviation is unreliable. You want consistent performance, not a skill that scores 100% sometimes and 60% others.

Run a skill benchmark at these specific moments:

When you first create a skill (this is your baseline).
When you update a skill (this is your before/after evidence).
When a Claude model version changes.
When you suspect a capability uplift skill has been outgrown.

The benchmark result is your evidence for the decision you are about to make. Without it, you are guessing.

One practical note on benchmark inputs: use inputs that reflect your real usage, not idealized clean examples. A commit-message skill that scores 95% on perfect inputs might score 70% on the messy, half-formed messages you actually type.

Claude Code Skill Trigger Tuning: Fixing False Triggers and Misfires

This is the capability most skill library owners eventually need, and the one most people do not discover until they are frustrated enough to search for it.

Once you have more than 10 Claude Code Skills, false triggers and misfires are not edge cases. They are the dominant maintenance burden. Two skills with semantically overlapping descriptions will conflict. The wrong one fires. Or neither fires when the request is slightly unusual.

A false trigger: you say "clean up this code" and the PR review skill fires instead of the refactor skill. Both descriptions say something like "Use when improving code quality." Claude reads them and picks the wrong one.

A misfire: you say "set up cron scheduling" and nothing fires. Your Tokio cron skill's description reads "Use when implementing async scheduling with Tokio timers." You did not say Tokio. The description was too specific.

Both problems are solvable with the Skill Creator's trigger tuning process.

AI Agent: Agent Skills — Claude Code Skill Trigger Tuning: Fixing False Triggers and Misfires

Trigger tuning is a loop, not a one-shot fix. Each iteration rewrites the description and tests it against labeled queries. Here is how it works.

Step 1: Generate Labeled Queries

Run trigger tuning on a skill. The Skill Creator analyzes the current description and generates 20 labeled queries: 10 that should trigger the skill, 10 that should not. These queries are generated to test the semantic boundary of the description.

Step 1: Generate Labeled Queries

The generated queries are a diagnostic first. If you look at the "should trigger" list and some entries feel wrong, that is a signal your description is either too broad or mischaracterizes the skill's purpose.

Output 1 from Step 1

Output 2 from Step 2

Step 2: Review in the HTML UI

The trigger tuning UI presents the 20 queries as a table with their labels. You review each one and correct any that are wrong. This is where domain knowledge enters: you know which queries should trigger your skill and which should not.

Review in the HTML UI

Review that we don't trigger skills we did not intend to.

Step 3: Iterative Description Rewriting

The scripts/run_loop Python script runs up to five iterations. Each iteration rewrites the description field and tests it against your corrected labels.

The rewriting is informed by what it learned: which queries were triggering that should not, and which were not triggering that should. The script tries to find a description that satisfies the corrected labels.

Step 4: Review the Result

The before/after difference is usually dramatic. Here is a realistic example for a refactoring skill that was being confused with a code review skill:

Before:

description: Use when the user wants to improve code quality

After:

description: Use when the user explicitly says "refactor", "clean up", "improve readability", or "restructure" on existing working code. Do NOT trigger for PR reviews, new feature development, or debugging sessions.

The "Do NOT trigger for" pattern is what makes the difference. Claude's description matching is semantic, not keyword-based. Explicit exclusions help Claude discriminate between overlapping skills.

Notice what the tuning process would do to the Tokio cron skill misfire. The original description anchored on implementation details ("Tokio timers"). The tuned description would include broader trigger phrases ("cron scheduling", "recurring tasks", "scheduled jobs") while excluding irrelevant ones.

When to Run Trigger Tuning

Run trigger tuning on every new Claude Code skill that overlaps thematically with existing skills. If you have a "commit-message" skill and add a "changelog" skill, run trigger tuning on both. They share semantic territory.

Run trigger tuning after you add your tenth skill. Then run it again each time you add a skill that enters territory already covered by existing skills.

Maintaining Your Claude Code Skill Library

Claude Code Skills are not set-and-forget. The Skill Creator implies a maintenance cycle, and following it consistently is what separates a useful skill library from a liability.

Perfecting your Agent Skills

The state diagram shows the full lifecycle. Skills move from Draft to Active, enter Evaluated states periodically, and either get archived or refined based on evaluation results.

The maintenance cycle works like this:

Create with the Skill Creator. Get the format right, get an initial description, get the body structured properly. Do not try to get the description perfect on the first pass.

Accumulate test cases in evals.json. Add one test case every time a skill produces unexpected output. Not every run -- only when something breaks your expectation. After ten unexpected outputs, you have a meaningful eval set.

Run skill evaluations after every Claude model update. The release of a new Claude model is the highest-risk event for your skill library. Run evaluations before you use the new model in production contexts where your skills matter.

Benchmark when you suspect outgrowth. This applies primarily to capability uplift skills. If the base model has released two versions since you wrote the skill, it is worth checking whether the skill still adds value.

Trigger-tune when you add your tenth skill, and any time you notice misfire patterns in your own usage. Misfires leave a trail -- you notice them in your day-to-day work before you formally measure them.

Archive when outgrowth is confirmed. Archiving is not failure. A skill that successfully transferred a capability to the model has done its job. Archiving it keeps your library clean.

The health signal for your library: if more than 20% of your Claude Code Skills are capability uplift skills, you are holding skills that will age out faster than you expect. Rebalance toward encoded preference skills.

The Future of Claude Code Skill Creation

The Skill Creator, as it exists today, requires you to know what assertions are verifiable and what trigger conditions are precise. Both take experience to get right.

The direction Anthropic has signaled: eventually, high-level natural language descriptions will be enough to generate a complete skill with evals and optimized triggers. You describe the outcome, the system generates the skill, the evals, and the tuned description.

That matters most for non-technical users and for lowering the barrier to skill creation. Today, there is a gap between "I want a skill that does X" and "I have a working skill with passing evals and tuned triggers." That gap requires technical work. Future versions of the Skill Creator will close it.

What will not change: encoded preference skills will still require your context. The model cannot know your PR review checklist, your deploy sequence, or your incident response runbook. Those will always require authoring.

The Agent Skills open standard at agentskills.io is designed for this future. A standard format means the ecosystem that grows around skill creation -- tooling, sharing, evaluation -- works across all tools that adopt it.

Three Actions to Take After Reading

First: install the Skill Creator from github.com/anthropics/skills. If you already have more than five skills and have never run trigger tuning, start there.

Second: audit your existing skills. Label each one: Capability Uplift or Encoded Preference. For every uplift skill, ask whether the base model has gotten close enough to retire it. For every encoded preference skill, ask whether the workflow it encodes is still your workflow.

Third: add one test case to evals.json every time a skill produces unexpected output. Start today. The first test case is the hardest. After that, the pattern is automatic.

The trigger is the whole game. A skill that fires accurately every time is worth ten skills that fire unpredictably. The description is not a label -- it is a routing instruction. Write it like one.

The Skill Creator makes the full maintenance loop fast enough that there is no excuse for letting skill libraries drift. Run evaluations. Tune triggers. Archive what the model has outgrown.

Ready to start?

Install the Skill Creator: github.com/anthropics/skills
Explore the Agent Skills open standard: agentskills.io

Check out these related articles on Agent Rules, Agent Skills and Claude Code Auto memory:

Learn more about Agent Rules to avoid stuffing everything in your CLAUDE.md file.
Learn more about Claude Code's Automatic Memory: No More Re-Explaining Your Project.
Learn more about Claude's built in tasks to do scheduling: Put Claude on Autopilot: Scheduled Tasks with /loop and /schedule built-in skills
Learn more about Skills 2.0: Claude Code Agent Skills 2.0: From Custom Instructions to Programmable Agents
Here is a deep dive on building your first useful skill that expands what Claude Code can do for memory: Build Your First Claude Code Skill: A Simple Project Memory System That Saves Hours

Sample output from an actual run

This is just here so you can see how I ran it in Claude Code. I just cut and paste the output of Claude Code terminal and asked Claude to clean it up.

Here's a terminal session cleaned up and structured for readability where I tuned a skill.

My inputs are clarified (but not changed in meaning)
Outputs are preserved as-is (only spacing/formatting improved)
Sections are organized into a logical flow
It was all very automated.

Skill Creator Eval Session (Formatted)

Source:

Goal

Rick Hightower (me) asked:

Use the Skill Creator eval system

Store test cases in evals/evals.json
Extend benchmarking with:
pass_rate
duration_ms
total_tokens
Each metric includes:
mean
stddev
delta vs baseline
Focus on: Fixing false triggers and misfires

Claude Loaded up the Skill Session Start

⏺ Skill(skill-creator)
  ⎿ Successfully loaded skill

Claude Did an Exploration Phase

⏺ Now let me explore the skill-creator's eval system and the codebase-mentor skill structure to set this up properly.

⏺ Explore(Explore skill-creator evals system)
  ⎿ Done (24 tool uses · 110.9k tokens · 7m 24s)

Claude Code Insight #1 — No Built-in Eval Framework

★ Insight
The skill-creator doesn't have a built-in evals framework yet.
We need to create the eval infrastructure ourselves.

Key insight:
- False triggers → skill activates when it shouldn't
- Misfires → skill doesn't activate when it should
Both are driven by:
👉 the description field in SKILL.md frontmatter

Rick Hightower: Note this was my first clue that I need to upadte the skill creator to the newest version

Reading Skill Description

⏺ Let me first read the current skill description to generate accurate labeled queries.
⏺ Read 1 file

Creating Eval Directory

⏺ Bash(mkdir -p /Users/richardhightower/clients/spillwave/src/codebase-mentor/evals)
  ⎿ Done

Insight #2 — Trigger Tuning Strategy

★ Insight
Trigger tuning tests the semantic boundary of the skill description.

- 10 "should trigger" → coverage
- 10 "should NOT trigger" → precision
Critical edge cases:
👉 near-miss queries (sound similar, wrong intent)

Creating evals.json

⏺ Write(evals/evals.json)
  ⎿ Wrote 154 lines to evals/evals.json

Key Structure (excerpt)

{
  "skill": "codebase-wizard",
  "skill_path": "plugins/codebase-wizard/skills/explaining-codebase/SKILL.md",
  "description_snapshot": "Universal explainer for codebases...",
  "generated_at": "2026-03-22",
  "baseline_run": null,
  "queries": [
    {
      "id": "T01",
      "query": "explain this codebase to me"
    }
  ]
}

Insight #3 — High-Risk False Triggers

★ Insight
Critical near-miss categories:

- F06 → PR review (evaluates vs explains)
- F07 → doc generation (produces vs teaches)
- F10 → migration ("walk me through" ambiguity)

Eval Dataset Summary

evals/evals.json — 20 labeled queries
T01–T10 → Should trigger
F01–F10 → Should NOT trigger

Metrics Defined (Stubbed)

pass_rate
duration_ms
total_tokens

Each includes:

mean
stddev
delta_vs_baseline

Suggested Next Steps

1. Run baseline
2. Tune description
3. Re-run + compare deltas

Runner Script Offer

Want me to build the runner script?

My Response

❯ yes go ahead

Environment Check

⏺ Bash(python3 -c "import anthropic; print(anthropic.__version__)" 2>&1)
  ⎿ 0.40.0

Good — Anthropic SDK v0.40.0 is available.

Insight #4 — How Runner Works

★ Insight
Claude doesn't expose:
👉 "would this skill trigger?"
So we simulate it by:
- Sending query + skill description
- Adding decoy skills
- Asking Claude to choose
Model used:
👉 claude-haiku-4-5-20251001
(low cost + fast)

Creating Runner Script

⏺ Write(evals/run_evals.py)
  ⎿ Wrote 284 lines to evals/run_evals.py

Make Script Executable

⏺ Bash(chmod +x /Users/richardhightower/clients/spillwave/src/codebase-mentor/evals/run_evals.py)
  ⎿ Done

Running Baseline

⏺ Bash(python3 evals/run_evals.py --set-baseline --verbose 2>&1)
  ⎿ Error: Exit code 1
     Error: ANTHROPIC_API_KEY not set

Checking Environment Variables

⏺ Bash(env | grep -i anthropic 2>/dev/null; env | grep -i claude 2>/dev/null | head -5)
  ⎿ CLAUDE_CODE_ENTRYPOINT=cli
     CLAUDECODE=1

Missing API Key

No ANTHROPIC_API_KEY is set.

Options:

Option A — set in shell (I ended up exiting and then setting it)

export ANTHROPIC_API_KEY=sk-ant-...

Option B — load from .env (I read through the release notes and there is a version that does not need the ANTHROPIC_API_KEY)

Tell me location → script can be updated

Insight #5 — Full Runner Behavior

★ Insight
1. Simulates skill dispatch with decoy skills:
   - test-runner
   - pr-reviewer
   - refactor-helper
   - deploy-assistant
   - doc-generator
2. Metrics:
   - pass_rate → accuracy
   - duration_ms → latency
   - total_tokens → cost
3. Multi-run mode:
   --runs 5 → enables stddev
4. Baseline:
   --set-baseline → enables delta tracking
5. CI-friendly:
   exit code 1 on failure

Final Command to Run

python3 evals/run_evals.py --set-baseline --verbose

What This Clean Version Gives You

Clear flow of intent → action → output
Highlights the real system design insights
Makes this usable for:
docs
blog post
internal playbook
PR description

Skill Creator Eval Session — Baseline Run

API Key Set + Baseline Run Requested

Once you set the API key, run:
  python3 evals/run_evals.py --set-baseline --verbose

✻ Crunched for 1m 26s
❯ I set the key - go ahead and run:
  python3 evals/run_evals.py --set-baseline --verbose

Running the Eval Runner

⏺ Bash(python3 evals/run_evals.py --set-baseline --verbose 2>&1)
  ⎿ Running 1 eval run(s) against 'codebase-wizard' skill...
     Model: claude-haiku-4-5-20251001
     Queries: 20 (10 true, 10 false)
     … +29 lines (ctrl+o to expand)
  ⎿ (timeout 3m)

Result Summary

⏺ 100% pass rate — baseline is clean. All 20 queries classified correctly.

Insight From Claude Code — What the Results Mean

★ Insight
What these results tell us:
1. Zero false triggers
   None of the 10 "should NOT trigger" queries matched codebase-wizard.
   The decoy skills absorbed them correctly:
   - refactor-helper caught F01 / F08 / F10
   - deploy-assistant caught F05 / F09
   - etc.
2. Zero misfires
   All 10 "should trigger" queries matched codebase-wizard,
   including subtle cases such as:
   - T02 onboarding
   - T07 pasted code
3. The dangerous near-misses held
   - F06 → PR review → pr-reviewer
   - F07 → doc generation → doc-generator
   - F10 → migration → refactor-helper
   These routed away correctly despite strong semantic overlap.
4. Baseline metrics are now frozen
   - pass_rate = 100%
   - duration_ms mean = 700
   - total_tokens mean = 405
   Future runs after description edits will show deltas versus this baseline.

Eval Toolkit Commands

python3 evals/run_evals.py --verbose

Quick check after editing the description

python3 evals/run_evals.py --runs 5

Benchmark with stddev to measure consistency

python3 evals/run_evals.py --set-baseline

Reset baseline after confirmed improvements

Recommended Trigger-Tuning Workflow

Workflow for tuning triggers:

1. Edit the description in SKILL.md
2. Update description_snapshot in evals.json to match
3. Run:
   python3 evals/run_evals.py --verbose
4. Check delta
   If pass_rate improved, run:
   python3 evals/run_evals.py --set-baseline

Suggested Next Experiments

Want to add harder edge cases?
Example:
- "walk me through migrating to TypeScript"
This is a good stress test because it combines:
- a trigger-like phrase ("walk me through")
- a non-trigger intent (implementation / migration)
Or:
- run a multi-run benchmark for variance analysis

Compact Readable Version

API key was set and the eval runner was executed with:
python3 evals/run_evals.py --set-baseline --verbose
The runner executed 20 labeled queries against the codebase-wizard skill
using claude-haiku-4-5-20251001.
Result:
- 100% pass rate
- 0 false triggers
- 0 misfires
Important near-miss cases also routed correctly:
- PR review → pr-reviewer
- doc generation → doc-generator
- migration → refactor-helper
Baseline metrics were frozen at:
- pass_rate: 100%
- duration_ms mean: 700
- total_tokens mean: 405
This means future description changes can now be measured against a clean benchmark.

Skill Creator Eval Session — HTML Review UI

Goal

❯ Can we use the skill-creator to review in an HTML UI?

Desired Behavior

Show all 20 queries in a table

Display:

query
expected label (trigger / not trigger)
Allow:
flipping labels
correcting mistakes

Save changes back to evals.json

This enables human-in-the-loop tuning

Concept — Why This Matters

The trigger tuning UI presents queries as a table with labels.

You review each and correct mistakes.

This is where domain knowledge comes in:
- Maybe a "should trigger" case should NOT trigger
- Maybe a "false" case actually belongs to this skill

These corrected labels become:
👉 the training signal for improving the skill description

Proposed Approach

⏺ Build a local Python server with an HTML UI

Insight — Why Not Static HTML?

★ Insight

Static HTML cannot write back to evals.json.

A local Python server (~30 lines):
- Serves the HTML page
- Loads evals.json on page load
- Saves updates directly to disk

Result:
👉 No copy/paste
👉 No manual file editing
👉 One-click persistence

At this point, I realized that I need to update my version of the creator skill. I had a fairly recent one but not the full version with the eval viewer. So I updated it.

Interruption / Plugin Attempt

❯ /plugin install github.com/anthropics/skills
⎿ Marketplace not found

❯ /plugin install https://github.com/anthropics/skills
⎿ Marketplace not found

Manual Install Request

❯ install this:
https://github.com/anthropics/skills/tree/main/skills/skill-creator

Installing Latest Skill-Creator

⏺ Bash(cd /tmp && git clone --depth 1 --filter=blob:none --sparse https://github.com/anthropics/skills.git anthropic-skills)
⎿ Cloning into 'anthropic-skills'...

⎿ Shell cwd reset to:
  /Users/richardhightower/clients/spillwave/src/codebase-mentor

Comparing Local vs Repo

The official repo includes:
- eval-viewer/
- agents/
- assets/
- references/

Your local version was missing these.

Replacing Local Skill

⏺ Bash(
  mv ~/.claude/skills/skill-creator ~/.claude/skills/skill-creator.bak &&
  cp -r /tmp/anthropic-skills/skills/skill-creator ~/.claude/skills/
)
⎿ Installed

Inspecting eval-viewer

⏺ Read 2 files

Key Discovery

The official eval-viewer is NOT for trigger tuning.

It is for:
👉 output quality evaluation

Example:
"Did the skill produce good output?"

Insight — Two Types of Eval UIs

★ Insight

There are TWO fundamentally different evaluation problems:

1. Output Quality Evaluation
   - Used by official eval-viewer
   - Focus: output correctness
   - UI: shows outputs, artifacts

2. Trigger Tuning Evaluation (what you need)
   - Focus: classification (should this trigger?)
   - UI: table + toggle labels

Conclusion

The official eval-viewer cannot be reused directly.

You need:
👉 a purpose-built trigger tuning UI

What You're Actually Building

You are effectively creating:

Skill Eval Stack

Layer 1 — Generation
  → skill-creator produces labeled queries

Layer 2 — Human Review (NEW UI)
  → correct labels (inject domain knowledge)

Layer 3 — Runner
  → measures performance vs labels

Layer 4 — Feedback Loop
  → refine SKILL.md description

Key Takeaway

This is not just tooling — it's a training loop.

You are:
- defining semantic boundaries
- measuring them
- correcting them
- iterating toward precision

Then the session was over.

About the Author

Rick Hightower is a technology executive and data engineer who led ML/AI development at a Fortune 100 financial services company. He is the creator of skillzwave.ai and multiple Claude Code skills and agents.

Rick has been actively developing generative AI systems, agents, and agentic workflows for years. He is the author of numerous articles on Claude Code, AI agents, and AI-driven development workflows. Connect on LinkedIn or follow on Medium.