When this matters
- A skill works well on a demo but fails when the input is incomplete.
- A developer needs to set usage limits for a paid Team plan.
- A marketplace wants to compare skills across different agent runtimes.
How to run the workflow
- Create representative tasks for normal, edge, ambiguous, and failure-prone cases.
- Estimate context and tool load for each task class.
- Model the retries triggered when acceptance criteria are not met or external tools fail.
- Calculate average tokens, p90 tokens, elapsed time, and failure rate.
- Publish a clear forecast with plan limits and buyer-facing assumptions.
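As a rough illustration of the calculation step above, the sketch below summarizes a batch of runs into average tokens, p90 tokens, average elapsed time, and failure rate. The `Run` record, its field names, and the nearest-rank percentile method are assumptions made for this example; they are not taken from SkillCost Meter.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One simulated or recorded skill run (hypothetical record shape)."""
    task_class: str   # e.g. "normal", "edge", "ambiguous", "failure-prone"
    tokens: int       # total tokens consumed, including any retries
    seconds: float    # elapsed wall-clock time
    succeeded: bool   # whether the run met its acceptance criteria


def p90(values):
    """Nearest-rank 90th percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceil of 0.9 * n, 1-based
    return ordered[rank - 1]


def summarize(runs):
    """Return average tokens, p90 tokens, average seconds, and failure rate."""
    if not runs:
        raise ValueError("need at least one run")
    tokens = [r.tokens for r in runs]
    return {
        "avg_tokens": mean(tokens),
        "p90_tokens": p90(tokens),
        "avg_seconds": mean(r.seconds for r in runs),
        "failure_rate": sum(not r.succeeded for r in runs) / len(runs),
    }


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [Run("normal", 1000, 2.0, True) for _ in range(8)]
    sample.append(Run("edge", 2000, 4.0, True))
    sample.append(Run("failure-prone", 9000, 10.0, False))
    print(summarize(sample))
```

Grouping the runs by `task_class` before calling `summarize` would give one forecast row per task class, which may be easier to map onto plan limits.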
Common risks
- Averages hide costly p90 runs, so a limit sized to the mean can be exceeded by the most expensive tenth of runs.
- Tool timeouts can create retries even when model prompts are strong.
- Runtime assumptions go stale when the skill or model changes, so revisit them after each change.
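To make the retry risk concrete, here is a minimal sketch of how retries inflate expected token cost. It assumes each attempt fails independently with the same probability and costs the same number of tokens, which is a simplification; the function names and parameters are hypothetical.

```python
def expected_attempts(fail_prob, max_attempts):
    """Expected attempts when a run stops at first success or at the cap.

    Assumes independent attempts with a constant failure probability.
    Attempt k+1 happens only if the first k attempts all failed, so the
    expectation is the sum of fail_prob**k for k = 0 .. max_attempts - 1.
    """
    if not 0.0 <= fail_prob <= 1.0:
        raise ValueError("fail_prob must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return sum(fail_prob ** k for k in range(max_attempts))


def expected_tokens(tokens_per_attempt, fail_prob, max_attempts):
    """Expected tokens per run, assuming every attempt costs the same."""
    return tokens_per_attempt * expected_attempts(fail_prob, max_attempts)


if __name__ == "__main__":
    # Illustrative numbers only: a 20% tool-timeout rate with up to 3 attempts.
    print(expected_tokens(4000, 0.2, 3))
```

Under these assumptions, a forecast built only from single-attempt costs understates spend whenever the failure probability is above zero, even if the prompts themselves are unchanged.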
Where SkillCost Meter fits
SkillCost Meter generates a standard task set and models 20 typical runs so teams can price and govern skills with evidence.