# Evaluating skills
Use the skill-creator Claude Code skill to test plugin commands and skills against defined test cases. This helps verify that commands produce consistent, expected output.
## Eval structure
Each plugin can include an `evals/` directory with test definitions:
```
plugins/my-plugin/
├── .claude-plugin/
│   └── plugin.json
├── commands/
│   └── my-command.md
├── evals/
│   └── evals.json
└── README.md
```
## Writing evals
Define test cases in `evals/evals.json`:
```json
{
  "skill_name": "my-plugin-command",
  "evals": [
    {
      "id": 0,
      "prompt": "/my-plugin:command",
      "expected_output": "Description of expected result",
      "files": [],
      "assertions": [
        {
          "name": "descriptive_assertion_name",
          "type": "contains",
          "value": "expected text in output"
        }
      ]
    }
  ]
}
```
## Assertion types
| Type | Description |
|---|---|
| `contains` | Output contains the specified value |
| `not_contains` | Output does not contain the specified value |
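For example, an assertions list can mix both types. The names and values below are illustrative, not taken from a real plugin:

```json
"assertions": [
  {
    "name": "includes_welcome_line",
    "type": "contains",
    "value": "Welcome to"
  },
  {
    "name": "no_error_message",
    "type": "not_contains",
    "value": "Error:"
  }
]
```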
## Writing good assertions
- Give each assertion a descriptive `name` that reads clearly in results (e.g., `correct_greeting_format`, not `test1`)
- Focus assertions on what differentiates the skill from baseline behavior
- Assertions that pass both with and without the skill are "non-discriminating": they don't prove the skill adds value
- For subjective outputs (writing style, tone), rely on qualitative review instead of assertions
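As a sketch, suppose a skill adds a required sign-off to generated replies (a hypothetical example, not a real plugin). The first assertion below is discriminating, because a baseline run without the skill is unlikely to produce the exact sign-off; the second would likely pass in both runs and proves nothing:

```json
"assertions": [
  {
    "name": "adds_required_signoff",
    "type": "contains",
    "value": "Regards, Red Hat Docs Team"
  },
  {
    "name": "mentions_docs",
    "type": "contains",
    "value": "docs"
  }
]
```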
## Running evals
Use the skill-creator skill to run and evaluate test cases.
The skill-creator will:
- Spawn parallel test runs — one with the skill and one without (baseline)
- Grade each run against the assertions
- Aggregate results into a benchmark
- Generate an HTML viewer for qualitative review
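The grading step can be pictured as checking each assertion against a run's output. This is a minimal sketch of that logic, not the skill-creator's actual implementation; the function and field names are illustrative:

```python
def grade(output: str, assertions: list[dict]) -> list[dict]:
    """Check each assertion against a single run's output.

    Supports the two documented assertion types:
    'contains' passes when the value appears in the output,
    'not_contains' passes when it does not.
    """
    results = []
    for a in assertions:
        found = a["value"] in output
        passed = found if a["type"] == "contains" else not found
        results.append({"name": a["name"], "passed": passed})
    return results

# Example run against a greeting-style output
out = "Hello, Alice! Welcome to Red Hat Docs Agent Tools."
for r in grade(out, [
    {"name": "greets_by_name", "type": "contains", "value": "Hello, Alice!"},
    {"name": "no_error", "type": "not_contains", "value": "Error"},
]):
    print(r["name"], "PASS" if r["passed"] else "FAIL")
```

Running the same grading against the baseline (without-skill) output is what makes an assertion discriminating or not.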
## Eval workspace
Test outputs are written to a `<plugin-name>-workspace/` directory (gitignored by default). The workspace is organized by iteration:
```
my-plugin-workspace/
└── iteration-1/
    ├── test-case-name/
    │   ├── with_skill/
    │   │   └── outputs/
    │   ├── without_skill/
    │   │   └── outputs/
    │   └── eval_metadata.json
    ├── benchmark.json
    └── review.html
```
## Reviewing results
The generated `review.html` has two tabs:
- Outputs — browse each test case, compare with-skill vs. without-skill output, and leave feedback
- Benchmark — view pass rates, timing, and token usage across configurations
After reviewing, click **Submit All Reviews** to export `feedback.json`. Empty feedback means the output looked good.
## Iterating
If results need improvement, edit the skill and rerun. The skill-creator tracks iterations (`iteration-1/`, `iteration-2/`, etc.) and can show diffs between versions.
## Example: hello-world
The hello-world plugin includes a reference eval set at `plugins/hello-world/evals/evals.json` with three test cases:
| Test case | Prompt | Expected output |
|---|---|---|
| No argument | `/hello-world:greet` | Hello! Welcome to Red Hat Docs Agent Tools. |
| With name | `/hello-world:greet Alice` | Hello, Alice! Welcome to Red Hat Docs Agent Tools. |
| Multi-word name | `/hello-world:greet Dr. Smith` | Hello, Dr. Smith! Welcome to Red Hat Docs Agent Tools. |
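Based on the table above, the eval file for hello-world might look roughly like the following. This is a sketch for orientation; the actual `plugins/hello-world/evals/evals.json` (including the real `skill_name` value) is the source of truth:

```json
{
  "skill_name": "hello-world-greet",
  "evals": [
    {
      "id": 0,
      "prompt": "/hello-world:greet",
      "expected_output": "Hello! Welcome to Red Hat Docs Agent Tools.",
      "files": [],
      "assertions": [
        {
          "name": "default_greeting",
          "type": "contains",
          "value": "Hello! Welcome to Red Hat Docs Agent Tools."
        }
      ]
    },
    {
      "id": 1,
      "prompt": "/hello-world:greet Alice",
      "expected_output": "Hello, Alice! Welcome to Red Hat Docs Agent Tools.",
      "files": [],
      "assertions": [
        {
          "name": "greets_by_name",
          "type": "contains",
          "value": "Hello, Alice!"
        }
      ]
    }
  ]
}
```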
Run the example: