Tag · 504 skills

Agent skills tagged evaluation

504 SKILL.md skills are tagged evaluation. The most complete ones are listed below, all usable across Hermes, Cursor, Codex, Gemini CLI, OpenCode, Claude Code and 30+ more agents.

Research
assess-outline
Evaluates research paper outlines against empirical research criteria for research question clarity, contributions, hypotheses, data description, results approach, and robustness c…
claude-code · codex · cursor · gemini-cli · research · outline · evaluation
AI / ML
forge-evals
Design evaluations for LLM features including golden datasets, rubric scoring, LLM-as-judge calibration, CI regression detection, online A/B tests, cost and latency budgets, and ad…
claude-code · codex · cursor · gemini-cli · ai:llm · llm · evaluation
AI / ML
ce-optimize
Run metric-driven iterative optimization loops. Define a goal, add measurement scaffolding, execute parallel experiments across approaches, score results against gates or quality j…
claude-code · codex · cursor · gemini-cli · ai:llm · type:generator · optimization
AI / ML
experiment-lab
Run reproducible CS/AI experiments, model evaluation, regression/classification/clustering analyses, bioinformatics workflows, QC, differential expression, single-cell starter anal…
claude-code · codex · cursor · gemini-cli · experiments · ml · bioinformatics
Testing
os-eval-runner
Stateless evaluation engine that scores and gates skill improvement iterations using headless Python scripts. Use for "evaluate this skill", "run autoresearch loop", "optimize this…
claude-code · codex · cursor · gemini-cli · lang:python · evaluation · python
AI / ML
evaluating-cosmos-policy
Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations…
claude-code · codex · cursor · gemini-cli · robotics · simulation · nvidia
Content
都市悬疑-平台签约评估框架
Evaluates urban mystery manuscripts for platform serialization potential using strict, conservative criteria that prioritize negative evidence. Routes through the core platform ass…
claude-code · codex · cursor · gemini-cli · urban-mystery · assessment · platform
Engineering
choose-stack
Evaluates and selects technologies using a weighted decision matrix. Activates for technology choices, framework comparisons, technical alternatives, decision matrices, stack evalu…
claude-code · codex · cursor · gemini-cli · stack · frameworks · databases
Testing
paper-autoraters
Execute the four paper-quality autoraters: Citation F1, Literature Review Quality (6-axis), SxS Overall Paper Quality, and SxS Literature Review Quality. Triggers on requests to sc…
claude-code · codex · cursor · gemini-cli · type:review · evaluation · quality
Testing
gloss-review
Stage 8 interlinear evaluator — audit completeness and correctness of candidate interlinear artifacts, benchmark them against ground truth, and decide pass/block/promote. Use for e…
claude-code · codex · cursor · gemini-cli · type:audit · type:review · interlinear
Research
ai-scientist-evaluator
Critically review, score, compare, and rank AI scientist outputs for biology, bioinformatics, and life science research. Evaluate notebooks, code, analyses, manuscripts, and report…
claude-code · codex · cursor · gemini-cli · type:audit · type:review · evaluation
Design
visual-plan
Reviews plan.md against an existing resolved design.md, then runs generate+evaluate loops to produce iters/*.png with status completed/partial/aborted. Requires finalized visual-sp…
claude-code · codex · cursor · gemini-cli · type:review · design.md · plan.md
Research
deeptutor
Academic advisor evaluation system that investigates professors worldwide. Searches publications, maps co-author networks, tracks student outcomes, classifies advisor types, and as…
claude-code · codex · cursor · gemini-cli · academic · advisor · publications
Business
workflow-mieterhoehung-entscheidung
Rent increase decision workflow for rental and condominium law: evaluates consent requirements, caps, blocking periods, and objections with cold-start, deadline checks, evidence ma…
claude-code · codex · cursor · gemini-cli · rent-increase · rental-law · evaluation
Business
azubi-zeugnis-analyse
Analyzes apprenticeship certificates under BBiG. Evaluates learning progress, vocational school performance, practical training tasks, and workplace behavior with industry-neutral …
claude-code · codex · cursor · gemini-cli · bbig · apprenticeship · evaluation
Testing
evaluation
Applies when evaluating agent performance, building test frameworks, measuring agent quality, or creating evaluation rubrics. Covers LLM-as-judge methods, multi-dimensional evaluat…
claude-code · codex · cursor · gemini-cli · ai:llm · llm-judge · rubrics
AI / ML
eval-driven-development
Build language-model-integrated systems by writing evaluations first. Covers statistical eval nature, five primitives, judgment taxonomy, system evals vs benchmarks, and how result…
claude-code · codex · cursor · gemini-cli · evaluation · primitives · benchmarks
Business
framework-fit-analysis
Evaluates frameworks, libraries, SDKs, runtimes, databases, UI kits, or platforms by fit criteria including constraints, team skill, ecosystem maturity, migration cost, operability…
claude-code · codex · cursor · gemini-cli · evaluation · frameworks · migration
Content
youtube-verdict
Evaluate YouTube videos before watching. Returns a firm WATCH or SKIP recommendation, 0-10 score, best viewing range, substance density, and audience fit when a URL is provided wit…
claude-code · codex · cursor · gemini-cli · youtube · evaluation · recommendation
AI / ML
arize-prompt-optimization
Optimize, debug, or improve LLM prompts with production trace data, evaluations, and annotations. Extract prompts from spans, gather performance signals, and run data-driven optimi…
claude-code · codex · cursor · gemini-cli · arize · llm · prompts
AI / ML
evaluating-llms
Evaluates LLM systems using automated metrics, LLM-as-judge methods, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety, or comparing model…
claude-code · codex · cursor · gemini-cli · ai:llm · llm · evaluation
Design
ux-evaluator
Evaluate UI components against UX best practices using a 3-dimension framework (Position, Visual Weight, Spacing) aligned with industry standards. Reviews buttons, navigation, spac…
claude-code · codex · cursor · gemini-cli · type:review · ux · evaluation
Research
approach-evaluation
Research industry standards, identify viable approaches for a technical or architectural problem, and deliver a structured factual comparison against project constraints. Reports o…
claude-code · codex · cursor · gemini-cli · evaluation · comparison · standards
AI / ML
council-mode
Runs identical tasks across multiple model tiers or providers in parallel to expose disagreements, failures, and hallucinations. Surfaces failure modes for routing decisions rather…
claude-code · codex · cursor · gemini-cli · multi-model · evaluation · routing
AI / ML
simmer-judge
Scores candidate artifacts against user-defined criteria on a 1-10 scale and generates ASI guidance for the next round. Supports judge-only, runnable evaluator, and hybrid modes as…
claude-code · codex · cursor · gemini-cli · type:generator · simmer · evaluation
DevOps
autopilot-deploy
Advances agent or skill deployments through four canary stages (1 percent, 10 percent, 50 percent, 100 percent) gated by continuous evaluation and SLO compliance, with automatic ro…
claude-code · codex · cursor · gemini-cli · canary · slo · rollback
Business
通用-平台签约评估框架
Evaluates a work's signing or approval potential on target platforms using strict, conservative criteria with negative evidence prioritized. Includes phased admission checks and co…
claude-code · codex · cursor · gemini-cli · evaluation · platforms · signing
Content
Création de fiches d'évaluation et questionnaires en bureautique
Generates structured templates for evaluating office software skills, including main evaluation forms, evaluator comment templates, and general competency questionnaires formatted …
claude-code · codex · cursor · gemini-cli · templates · evaluation · bureautique
Research
paper-repo-evaluator
Evaluates code repository quality linked to research papers, covering GitHub metrics, language, and integration effort. Auto-triggers during paper-pipeline workflows at the code as…
claude-code · codex · cursor · gemini-cli · github · repository · evaluation
AI / ML
langsmith-sdk-for-llm-tracing-and-evaluation
Provides tracing, evaluation, and debugging workflows for LLM applications. Useful when an agent team needs structured observability around prompts, chains, tool calls, datasets, a…
claude-code · codex · cursor · gemini-cli · ai:llm · langsmith · sdk
Content
content-evaluation-framework
Applies a weighted 6-category rubric to assess book chapters, lessons, or educational content for technical accuracy, pedagogy, writing quality, structure, AI-first teaching, and c…
claude-code · codex · cursor · gemini-cli · type:review · evaluation · rubric
Automation
skill-iter-tune
Runs iterative tuning via execute-evaluate-improve loops. Executes the skill, evaluates output quality, and applies improvements until thresholds are met. Triggers on tuning-relate…
claude-code · codex · cursor · gemini-cli · ai:gemini · tuning · iteration
Research
wuerfel-aufbauen
Builds the three-dimensional cube structure for a new review project across three axes: columns as data points, rows as documents with optional row prompts, and depth as evaluation…
claude-code · codex · cursor · gemini-cli · type:review · review · structure
Engineering
tech-radar
Evaluates, compares, or recommends technologies, frameworks, or tools for formal adoption. Triggers on /tech-radar, technology assessment, or questions about adding items to a tech…
claude-code · codex · cursor · gemini-cli · tech-radar · evaluation · frameworks
Engineering
search-before-building
Evaluate existing repository capabilities, external libraries, MCP options, and maintenance risks before authoring custom code, then decide adopt, wrap, or build using explicit cri…
claude-code · codex · cursor · gemini-cli · evaluation · research · libraries
Research
paper-writing-bench
Reverse-engineers raw materials from an existing AI research paper to create benchmark cases for evaluating paper-writing pipelines, following the PaperWritingBench construction me…
claude-code · codex · cursor · gemini-cli · benchmarking · ai-research · paper-writing
Productivity
thoroughly-rate-review
Reviews, rates, scores, assesses, evaluates, grades, or benchmarks quality using a custom weighted scoring model with multiple checks per category, explicit evidence, and a final s…
claude-code · codex · cursor · gemini-cli · type:review · evaluation · scoring
Productivity
skilleval
Professional skills assessment system providing evidence-based evaluation across technical abilities, soft skills, and domain expertise with multi-dimensional scoring and gap analy…
claude-code · codex · cursor · gemini-cli · evaluation · skills · scoring
AI / ML
evaluating-machine-learning-models
Evaluates trained machine learning models using appropriate metrics and comparison logic. Use for benchmark review, threshold selection, calibration, validation, and model comparis…
claude-code · codex · cursor · gemini-cli · type:audit · type:review · ml
Research
ground-truth
Stage 8 ground-truth provider that acquires external benchmark corpora, builds gloss benchmarks and generates dictionary-gap reports as read-only evidence for gloss-review evaluati…
claude-code · codex · cursor · gemini-cli · type:review · benchmark · gloss
AI / ML
launching-evals
Runs, monitors, analyzes, and debugs LLM evaluations using nemo-evaluator-launcher. Supports status checks, progress tracking, failure debugging, artifact export, and result analys…
claude-code · codex · cursor · gemini-cli · llm · evaluation · nemo
Content
女频爱情-平台签约评估框架
Evaluates romance manuscripts for platform acquisition and serialization using strict, conservative criteria focused on negative evidence. Loads the generic platform assessment ski…
claude-code · codex · cursor · gemini-cli · romance · evaluation · publishing
General
os-skill-improvement
Improves existing agent skills using RED-GREEN-REFACTOR based on evaluation results. Runs baseline tests, applies targeted patches, and refactors until accuracy thresholds are met.
claude-code · codex · cursor · gemini-cli · type:generator · refactoring · testing
AI / ML
simmer-setup
Inspects artifacts or workspaces, infers evaluation contracts and search space, then proposes a complete assessment. Conversational setup that presents findings after confirmation.
claude-code · codex · cursor · gemini-cli · simmer · setup · evaluation
AI / ML
simmer-judge-board
Dispatches a panel of judges with varied lenses, runs deliberation to challenge scores, then synthesizes consensus plus ASI. Drop-in replacement for simmer-judge under board mode.
claude-code · codex · cursor · gemini-cli · simmer · evaluation · panel
General
thinking-lindy-effect
For non-perishable things, future life expectancy is proportional to current age. Use for technology selection, evaluating frameworks and libraries, and predicting tool longevity.
claude-code · codex · cursor · gemini-cli · technology-selection · longevity · evaluation
AI / ML
aice
AI Confidence Engine scoring across five bidirectional domains. Supports agent and user evaluation with triggers for task completion, idea validation, and pooled runtime scoring.
claude-code · codex · cursor · gemini-cli · confidence · scoring · evaluation
Testing
llm-ai-pipeline-test-review
Review LLM or AI pipeline evaluation setup including metrics, golden datasets, thresholds, adversarial coverage, and regression gating to catch unsafe outputs before production.
claude-code · codex · cursor · gemini-cli · ai:llm · type:review · llm
Testing
agent-evals
Design evaluation frameworks for AI agents. Apply when testing reasoning quality, building graders, analyzing errors, or adding regression protection. Works with any agent SDK.
claude-code · codex · cursor · gemini-cli · ai:agent · ai-agents · evaluation
Research
argument-evaluation
Evaluate arguments for structure, validity, soundness, and charitable interpretation through premise identification, conclusion extraction, argument mapping, and steel-manning.
claude-code · codex · cursor · gemini-cli · argument · validity · evaluation
AI / ML
write-judge-prompt
Design LLM-as-Judge evaluators for subjective criteria such as tone, faithfulness, relevance, or completeness. Use only when code-based checks cannot validate the failure mode.
claude-code · codex · cursor · gemini-cli · ai:llm · llm · evaluation
Productivity
exploration-optimizer
Evaluates and improves exploration-cycle skills, prompts, routing, and artifact quality through baseline-first iteration loops, keep-discard decisions, and experiment ledgers.
claude-code · codex · cursor · gemini-cli · optimization · iteration · evaluation
Testing
agency-evaluation-criteria
Quality evaluation criteria for AI agency project output covering design quality, originality, completeness, functionality scoring, and Playwright-based testing requirements.
claude-code · codex · cursor · gemini-cli · evaluation · criteria · quality
AI / ML
langsmith-observability
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, and monitoring production.
claude-code · codex · cursor · gemini-cli · langsmith · tracing · evaluation
AI / ML
langsmith-observability
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, and monitoring production.
claude-code · codex · cursor · gemini-cli · langsmith · llm · observability
Business
bvssh-check
Evaluates whether current work aligns with Better Value Sooner Safer Happier principles. Invoke at diamond completion milestones and at regular intervals during development.
claude-code · codex · cursor · gemini-cli · principles · evaluation · milestones
Design
design-evaluation-audit
Systematically evaluate existing designs for cognitive alignment, conduct reviews or critiques, diagnose usability issues, and compare alternatives with objective criteria.
claude-code · codex · cursor · gemini-cli · type:audit · type:debug · type:review
AI / ML
generate-synthetic-data
Create diverse synthetic test inputs for LLM pipeline evaluation via dimension-based tuple generation. Use when bootstrapping eval datasets or stress-testing failure modes.
claude-code · codex · cursor · gemini-cli · type:generator · synthetic-data · llm
AI / ML
agent-ready-eval
Evaluate codebases for agent-friendliness using autonomous agent best practices. Assess infrastructure suitability for unattended agent execution and autonomous workflows.
claude-code · codex · cursor · gemini-cli · ai:agent · type:audit · type:review
AI / ML
ai-prompt-engineering
Apply operational prompt engineering for production LLM applications including structured outputs, RAG grounding, tool workflows, safety controls, and evaluation testing.
claude-code · codex · cursor · gemini-cli · ai:claude · ai:gemini · ai:llm

Showing the top 60 of 504.