Full Report
Skills for Claude fall into two buckets: capability uplift and encoded preference. The former lets Claude perform tasks that it cannot do by itself. The latter covers tasks that Claude can already do, but which should follow a specific process. The distinction changes how each kind is evaluated. The skill-creator helps write tests that check that Claude did what you expected for a given prompt, much like software tests. These tests matter for (A) understanding capabilities, (B) catching regressions, and (C) detecting whether the base model has outgrown the skill. The metrics report elapsed time, pass/fail rate, and token usage. All really important evals! Skill-creator now supports A/B testing too. Another new feature is tuning skill descriptions so that Claude triggers them automatically at the right times. Overall, a good post on a feature I had no idea about for improving Claude skills.
Analysis Summary
# Best Practices: Agent Skill Development & Evaluation
## Overview
These practices address the security, reliability, and lifecycle management of AI "Skills" (specialized agentic workflows). By applying software engineering rigor—specifically testing and benchmarking—to LLM instructions, organizations can prevent "hallucination regressions," ensure process compliance (Encoded Preferences), and identify when legacy prompt engineering (Capability Uplift) is no longer necessary or secure.
## Key Recommendations
### Immediate Actions
1. **Audit Skill Descriptions:** Use the "skill-creator" tool to tune descriptions. Precise descriptions prevent "False Triggers" where the agent might execute a skill (and potentially access data or tools) in an inappropriate context.
2. **Define Success Criteria:** For every agent skill, document "what good looks like" in plain language. This forms the basis of the automated evaluation (eval) process.
3. **Categorize Skills:** Separate skills into "Capability Uplift" (technical workarounds) and "Encoded Preference" (business/security workflows) to determine their long-term maintenance strategy.
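
A description audit can be reduced to counting false triggers and missed triggers over a small labeled prompt set. The sketch below is illustrative only: `TriggerCase` and `trigger_report` are hypothetical names, not part of skill-creator, and the `did_trigger` values are assumed to have been recorded by hand or by whatever harness you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerCase:
    """One labeled prompt for description tuning (hypothetical structure)."""
    prompt: str
    should_trigger: bool  # what a human reviewer expects
    did_trigger: bool     # what was observed when the agent ran


def trigger_report(cases: list[TriggerCase]) -> dict[str, float]:
    """Count false and missed triggers, then derive precision and recall."""
    tp = sum(c.should_trigger and c.did_trigger for c in cases)
    fp = sum((not c.should_trigger) and c.did_trigger for c in cases)
    fn = sum(c.should_trigger and (not c.did_trigger) for c in cases)
    # With no positives at all, report 1.0 rather than dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {
        "false_triggers": fp,
        "missed_triggers": fn,
        "precision": precision,
        "recall": recall,
    }


# Made-up observations for an imagined NDA-review skill.
cases = [
    TriggerCase("Review this NDA for risky clauses", True, True),
    TriggerCase("Summarize the attached NDA", True, False),
    TriggerCase("What's the weather in Paris?", False, True),
    TriggerCase("Write a haiku about autumn", False, False),
]
report = trigger_report(cases)
print(report)
```

A falling false-trigger count across description revisions suggests the description is getting more precise; a rising missed-trigger count suggests it has become too narrow.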
### Short-term Improvements (1-3 months)
1. **Implement Benchmarking:** Establish a baseline "pass rate" for all production skills. Track elapsed time and token usage to identify inefficient or costly logic patterns.
2. **Conduct A/B Testing:** Use comparator agents to test skill updates against previous versions. This ensures that "improvements" don't introduce new security vulnerabilities or logic flaws.
3. **Clean Context Execution:** Use multi-agent support to run each test in its own independent agent instance, in parallel. Isolated contexts prevent "context bleed," where information from one session inappropriately influences the next.
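
The benchmark and A/B steps above come down to aggregating per-run results and comparing two sets of them. The following is a minimal sketch under stated assumptions: `RunResult`, `summarize`, and `compare` are hypothetical helpers, the numbers are invented, and nothing here reflects skill-creator's actual output format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunResult:
    """Outcome of one eval run (hypothetical record layout)."""
    passed: bool
    elapsed_s: float
    tokens: int


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate pass rate, mean elapsed time, and mean token usage."""
    if not runs:
        raise ValueError("need at least one run to benchmark")
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_elapsed_s": mean(r.elapsed_s for r in runs),
        "mean_tokens": mean(r.tokens for r in runs),
    }


def compare(baseline: list[RunResult], candidate: list[RunResult]) -> dict[str, float]:
    """Return candidate minus baseline for each metric."""
    old, new = summarize(baseline), summarize(candidate)
    return {metric: new[metric] - old[metric] for metric in old}


# Invented numbers: the previous skill version versus an updated one.
baseline = [RunResult(True, 12.0, 1200), RunResult(False, 14.0, 1400)]
candidate = [RunResult(True, 10.0, 1000), RunResult(True, 11.0, 1100)]
delta = compare(baseline, candidate)
print(delta)
```

A positive `pass_rate` delta with negative time and token deltas is the outcome you hope for; any other combination is a trade-off worth reviewing before shipping the update.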
### Long-term Strategy (3+ months)
1. **CI/CD Integration:** Integrate your skill evals and benchmark results into your standard software Continuous Integration (CI) systems to catch regressions before they reach production.
2. **Deprecation Monitoring:** Regularly test base models without skills loaded. If the base model meets the eval criteria, retire the custom skill to reduce attack surface and minimize "prompt injection" vulnerabilities inherent in complex instructions.
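
Both long-term practices can be expressed as simple threshold checks once pass rates are available. The sketch below assumes the pass rates have already been computed elsewhere; the function names, the tolerance parameter, and the exit-code convention are illustrative choices, not an existing tool's interface.

```python
def regression_gate(candidate_pass_rate: float,
                    baseline_pass_rate: float,
                    tolerance: float = 0.0) -> bool:
    """True when the candidate has not dropped below the stored baseline."""
    return candidate_pass_rate >= baseline_pass_rate - tolerance


def retirement_candidate(base_model_pass_rate: float,
                         required_pass_rate: float) -> bool:
    """True when the base model, with no skill loaded, already meets the bar."""
    return base_model_pass_rate >= required_pass_rate


def ci_exit_code(candidate_pass_rate: float, baseline_pass_rate: float) -> int:
    """Map the gate to a process exit code: 0 lets the build pass, 1 fails it."""
    return 0 if regression_gate(candidate_pass_rate, baseline_pass_rate) else 1


# Invented figures: an update that regressed, and a base model that caught up.
print(ci_exit_code(0.90, 0.95))
print(retirement_candidate(base_model_pass_rate=1.0, required_pass_rate=0.95))
```

In a CI job, the returned code would typically be passed to `sys.exit` so that a regression blocks the merge.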
## Implementation Guidance
### For Small Organizations
- Focus on **Description Tuning**. Use the automated tool to ensure your agents aren't firing skills unnecessarily, which saves tokens and prevents errors.
- Store evals locally in a version-controlled repository (e.g., Git) to track changes over time.
### For Medium Organizations
- Implement **Benchmark Mode** during model updates. When a new version of Claude is released, run your suite of evals to ensure business workflows (Encoded Preferences) remain intact.
- Use **Comparator Agents** to validate that workflow changes have been correctly interpreted by the AI.
### For Large Enterprises
- **Automated Regression Testing:** Require passing evals for every skill update.
- **Resource Optimization:** Use token and time metrics to monitor the "Cost of Intelligence" across the organization.
- **Compliance Mapping:** Map "Encoded Preference" skills directly to corporate policy/standard operating procedures (SOPs) and use evals as an audit trail of compliance.
## Configuration Examples
While specific code varies, the framework for an eval involves:
- **Test Prompt:** The user input designed to trigger the skill.
- **Context/Files:** Supporting data (e.g., an NDA PDF).
- **Assertion:** "Claude must extract exactly four text anchors" or "Claude must not use external tools if the file is encrypted."
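
The three parts listed above map naturally onto a small record plus a check function. This is a sketch, not skill-creator's real eval format: `EvalCase`, `run_case`, and `fake_agent` are hypothetical, the file name is a placeholder, and it is an assumption that the skill returns JSON with an `anchors` list.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """Test prompt, supporting files, and an assertion (hypothetical layout)."""
    name: str
    test_prompt: str
    assertion: Callable[[str], bool]
    files: list[str] = field(default_factory=list)


def has_exactly_four_anchors(output: str) -> bool:
    """Assumes the skill emits JSON shaped like {"anchors": [...]}."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    anchors = data.get("anchors") if isinstance(data, dict) else None
    return isinstance(anchors, list) and len(anchors) == 4


def run_case(case: EvalCase, agent: Callable[[str, list[str]], str]) -> bool:
    """Send the prompt and files to an agent callable, then apply the assertion."""
    return case.assertion(agent(case.test_prompt, case.files))


def fake_agent(prompt: str, files: list[str]) -> str:
    """Stand-in for a real model call so the example runs offline."""
    return json.dumps({"anchors": ["Party A", "Party B", "Effective Date", "Signature"]})


case = EvalCase(
    name="nda-anchors",
    test_prompt="Extract the text anchors from the attached NDA.",
    assertion=has_exactly_four_anchors,
    files=["nda.pdf"],  # placeholder path
)
result = run_case(case, fake_agent)
print(result)
```

Keeping the assertion as a plain function makes each case easy to store in version control and to reuse in benchmarks.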
## Compliance Alignment
- **NIST AI RMF:** Aligns with the "Measure" and "Manage" functions by quantifying performance and risks.
- **ISO/IEC 42001:** Supports the requirement for monitoring and evaluation of AI system performance.
- **CIS Controls:** Aligns with software engineering best practices for testing and documentation.
## Common Pitfalls to Avoid
- **Context Bleed:** Running multiple tests in one session can cause the AI to remember previous results, leading to false positives. Use independent agent instances for testing.
- **Over-Engineering:** Continuing to maintain "Capability Uplift" skills for tasks the base model can now handle natively increases complexity and maintenance debt.
- **Trigger Bloat:** Developing too many skills with broad descriptions, causing the AI to select the wrong tool for the task.
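
One rough way to spot trigger bloat is to flag pairs of skill descriptions that share most of their words. The sketch below uses word-level Jaccard similarity as a stand-in heuristic; it is not how Claude selects skills, and the skill names, descriptions, and 0.5 threshold are all invented for illustration.

```python
import re
from itertools import combinations


def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude proxy for what a description covers."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words across both texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def overlapping_pairs(descriptions: dict[str, str],
                      threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return every pair of skills whose descriptions overlap at or above threshold."""
    flagged = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(descriptions.items(), 2):
        score = jaccard(desc_a, desc_b)
        if score >= threshold:
            flagged.append((name_a, name_b, score))
    return flagged


# Hypothetical skills; the first two are deliberately near-duplicates.
descriptions = {
    "nda-review": "Review legal documents and contracts for risky clauses",
    "contract-check": "Review contracts and legal documents for missing clauses",
    "pdf-fill": "Fill in form fields inside a PDF file",
}
flagged = overlapping_pairs(descriptions)
print(flagged)
```

Flagged pairs are candidates for merging or for narrower descriptions; word overlap is only a hint, so confirm with trigger evals before changing anything.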
## Resources
- **Skill-creator official repo:** github[.]com/anthropics/skills/tree/main/skills/skill-creator
- **Claude Code Plugins:** github[.]com/anthropics/claude-plugins-official/tree/main/plugins/skill-creator
- **Official Documentation:** anthropic[.]com (Search for "Agent Skills")