Improving skill-creator: Test, measure, and refine Agent Skills

2026-03-18T15:18:02+00:00

Full Report

Claude classifies skills into two buckets: capability uplift and encoded preference. The former gets Claude to perform actions that it cannot do by itself. The latter covers things that Claude can already do, but that should be done according to a specific process. This distinction changes how the skills are evaluated. The skill-creator helps write tests that check that Claude did what you expected for a given prompt, much like software tests. This matters for A) understanding the skill's capabilities, B) catching regressions and C) detecting whether the base model has outgrown the skill. The metrics show the elapsed time, the pass/fail rate and the number of tokens used. All really important things to measure! Skill-creator now has A/B testing too. Another new aspect of skill-creator is tuning skill descriptions so that Claude automatically triggers them at the proper times. Overall, a good post on a feature for improving Claude skills that I had no idea about.
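To make the metrics and the A/B idea concrete, here is a minimal sketch of how per-run results could be aggregated and compared with and without a skill. This is not skill-creator's actual format or API: `EvalResult`, `summarize`, `compare`, the prompts and all of the numbers are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    """One eval run (hypothetical record): the prompt, pass/fail, and cost."""
    prompt: str
    passed: bool
    seconds: float
    tokens: int


def summarize(results):
    """Aggregate pass rate, mean time, and mean token count for a set of runs."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        "pass_rate": sum(r.passed for r in results) / len(results),
        "mean_seconds": mean(r.seconds for r in results),
        "mean_tokens": mean(r.tokens for r in results),
    }


def compare(with_skill, without_skill):
    """A/B comparison: per-metric difference (with skill minus without)."""
    a, b = summarize(with_skill), summarize(without_skill)
    return {key: a[key] - b[key] for key in a}


# Made-up numbers, purely illustrative.
with_skill = [
    EvalResult("fill in the PDF form", True, 12.0, 1800),
    EvalResult("extract the table", True, 10.0, 1600),
]
without_skill = [
    EvalResult("fill in the PDF form", False, 9.0, 1400),
    EvalResult("extract the table", True, 11.0, 1500),
]

delta = compare(with_skill, without_skill)
print(delta)
```

Read this way, a pass-rate difference close to zero would hint that the base model has outgrown the skill, while a positive difference has to be weighed against the extra time and tokens the skill costs.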

Analysis Summary