Tag · 976 skills

Agent skills tagged evaluation

976 SKILL.md skills tagged evaluation — the most complete ones are below, all usable across Hermes, Cursor, Codex, Gemini CLI, OpenCode, Claude Code and 30+ more agents.


Skill Forge 技能熔炉

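Skill Forge: forge, evaluate, and improve skills. Say "技能熔炉" (Skill Forge) to run the full workflow (including R5, improving an existing skill); say "技能评估", "skill评估", or "评估技能" (skill evaluation) to run only the comparison against similar skills plus Tencent's 9 dimensions. Optional capabilities: searching SkillHub for similar skills (via TRAE's built-in tools) and modifying existing skill files (R5 diagnose-and-fix path only, requires user confirmation). For the publishing step, use skill-pu...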

openclaw · adaptive-interview · authoring · authoring-principles

Diagnose and test Claude Code skills against Anthropic's 7 principles. Scans SKILL.md files, checks 8 rules (gotchas, description, allowed-tools, file-size, structure, frontmatter,…

claude-code · claude-code · skills · validation

Darwin Skill (达尔文.skill): autonomous skill optimizer inspired by Karpathy's autoresearch. Evaluates SKILL.md files using an 8-dimension rubric (structure + effectiveness), runs hil…

claude-code · skill-optimization · evaluation · git

empirical-prompt-tuning

Fetch and execute mizchi's empirical-prompt-tuning skill at runtime. Use when evaluating or iteratively refining an agent-facing prompt (skill / slash command / task prompt / CLAUD…

claude-code · codex · cursor · gemini-cli · prompt-tuning · evaluation · metrics

research-idea-benchmark

This skill should be used when the user asks to "evaluate my research idea", "benchmark my research", "score my paper idea", "assess research quality", "check if my idea is publish…

claude-code · codex · cursor · gemini-cli · research · benchmarking · academic

Design, evaluate, and iterate on Claude Code agents, commands, and skills with structural best practices. Use when creating a new .claude/agents/ file, updating an existing agent's…

claude-code · codex · cursor · gemini-cli · claude-code · agents · evaluation

Evaluates research paper outlines against empirical research criteria for research question clarity, contributions, hypotheses, data description, results approach, and robustness c…

claude-code · codex · cursor · gemini-cli · research · outline · evaluation

Design evaluations for LLM features including golden datasets, rubric scoring, LLM-as-judge calibration, CI regression detection, online A/B tests, cost and latency budgets, and ad…

claude-code · codex · cursor · gemini-cli · ai:llm · llm · evaluation

Run metric-driven iterative optimization loops. Define a goal, add measurement scaffolding, execute parallel experiments across approaches, score results against gates or quality j…

claude-code · codex · cursor · gemini-cli · ai:llm · type:generator · optimization

通用-平台签约评估框架

Evaluates creative works for platform acceptance or serialization potential using strict, conservative criteria that prioritize negative evidence. Provides a unified assessment fra…

claude-code · codex · cursor · gemini-cli · platform · evaluation · serialization

Designs the evaluation harness for agent loops, establishing trustworthy verification through a 7-layer suite with false-completion-rate and repair-productivity as core metrics. En…

claude-code · codex · cursor · gemini-cli · ai:agent · loops · evaluation

Run reproducible CS/AI experiments, model evaluation, regression/classification/clustering analyses, bioinformatics workflows, QC, differential expression, single-cell starter anal…

claude-code · codex · cursor · gemini-cli · experiments · ml · bioinformatics

legw-rmap-evaluierung-und-aenderung

Manages the full lifecycle of a rulemap norm: versioning, drag-and-drop edits in the builder, NKRG/GGO evaluation, impact monitoring, and feedback from implementation. Outputs chan…

claude-code · codex · cursor · gemini-cli · rule management · evaluation · versioning

Stateless evaluation engine that scores and gates skill improvement iterations using headless Python scripts. Use for "evaluate this skill", "run autoresearch loop", "optimize this…

claude-code · codex · cursor · gemini-cli · lang:python · evaluation · python

data-source-eval

Evaluates new data sources for feasibility, integration effort, and expected value. Runs a standardized spike covering objectives, API discovery, technical viability, architecture,…

claude-code · codex · cursor · gemini-cli · data-source · evaluation · api

shadow-mode-runner

Coordinates SHADOW mode operation where the agent runs parallel to human input without delivering output or incurring billing. Measures agreement rates and generates promotion repo…

claude-code · codex · cursor · gemini-cli · shadow · evaluation · metrics

new-project-gate

Pre-build gate that evaluates new ideas against both agent criteria (legible, drivable, automatable) and human criteria (usefulness, maintainability) before creating any tool or re…

claude-code · codex · cursor · gemini-cli · gate · evaluation · checklist

capstone-final-eval

Evaluates Ewha Womans University capstone design final deliverables including team presentations, PDF reports, GitHub repositories, and Project Briefs. Use for end-of-term capstone…

claude-code · codex · cursor · gemini-cli · capstone · evaluation · university

light-idea-critique

Evaluates research ideas against top-tier publication standards to distinguish genuine breakthroughs from incremental or superficial combinations. Delivers blind-then-explicit scor…

claude-code · codex · cursor · gemini-cli · research · evaluation · critique

loop-design-check

Designs and reviews goal-oriented agent loops for failure modes like token waste, verifier gaming, and completing wrong answers. Writes loops with decidable goals and skeletons; au…

claude-code · codex · cursor · gemini-cli · type:review · loops · agent

Calibrates prompts, agents, and skills by comparing AI outputs against author-edited versions. Identifies recurring divergence patterns and generates targeted instruction fixes in …

claude-code · codex · cursor · gemini-cli · prompt-engineering · evaluation · agent-calibration

Applies one of nine structured reasoning methods—pre-mortem, inversion, first principles, red team/blue team, Socratic, constraint removal, stakeholder mapping, analogical, or seco…

claude-code · codex · cursor · gemini-cli · reasoning · analysis · critical-thinking

evaluating-cosmos-policy

Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations…

claude-code · codex · cursor · gemini-cli · robotics · simulation · nvidia

Profiles models from any provider family against a fixed set of 10 work categories, mapping public benchmarks into tier rankings with full provenance. Emits routing-table.json, aud…

claude-code · codex · cursor · gemini-cli · llm · benchmarking · routing

legalsearchqa-eval

Benchmarks retrieval of current legal information from external sources and reasoning over it for multiple-choice legal questions. Measures factual accuracy, uncertainty calibratio…

claude-code · codex · cursor · gemini-cli · legal · benchmarking · retrieval

eval-case-author

Generates evaluation cases for any artifact using real human/agent pairs or declared synthetic cases, then persists them in evals/{artifact_id}/cases/case-{n}.md. Supports C4 (≥30 …

claude-code · codex · cursor · gemini-cli · evaluation · cases · ground-truth

Evaluates job fit by discovering matching roles across boards and scoring provided postings A–F; researches compensation, company signals, and legitimacy, then outputs ranked brief…

claude-code · codex · cursor · gemini-cli · job-search · career · evaluation

eval-harness-first

Builds the evaluation harness that gates every fine-tuning run — golden sets, per-failure-mode graders, judge calibration, and base-model baselines. Use when launching fine-tuning,…

claude-code · codex · cursor · gemini-cli · ai:llm · evaluation · fine-tuning

raven-research-looped-reasoning-eval

Scaffolds reproducible experiments comparing looped/recurrent-depth transformers against matched vanilla models on vulnerability discovery after security-corpus pretraining. Use fo…

claude-code · codex · cursor · gemini-cli · type:generator · transformers · evaluation

Pre-evaluate any study for The Lancet by checking clinical or public-health significance, global reach, practice-changing potential, and equity considerations to determine fit for …

claude-code · codex · cursor · gemini-cli · journal-fit · peer-review · medical

workflow-debate

Structured adversarial debate between AI councillors representing distinct perspectives to rigorously evaluate ideas, plans, or decisions. Produces nuanced verdicts (PROCEED / PROC…

claude-code · codex · cursor · gemini-cli · lang:ruby · debate · decision-making

Generalised autonomous optimisation loop for measurable artifacts. Uses connector-backed memory first, then project-pack, then none. Supports iterative experiments, multi-dimension…

claude-code · codex · cursor · gemini-cli · ai:claude · type:integration · optimization

Creates rigorous eval suites for AI agents and LLM systems using trace-driven analysis, binary judges, cross-validation, and agreement metrics. Delivers failure taxonomies, rubrics…

claude-code · codex · cursor · gemini-cli · type:audit · type:review · evaluation

09. RAG 与语言智能体

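This page is the site's ninth CS224n topic, corresponding to the official slide deck `Lecture 10: RAG and Language Agents`. The first few slides wrap up the previous lecture's Adapter and efficient-adaptation material, which this page only briefly situates; the main body covers open-domain question answering, retrieval augmented generation, langu…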

claude-code · codex · cursor · gemini-cli · agents-md · cross-agent · rag

agent-eval-adoption

Reference implementation for adopting agent evaluation primitives: manifest definitions, runLoop topologies, prompt optimization, MCP delegation, trace sinks, and CI gates. Pairs w…

claude-code · codex · cursor · gemini-cli · ai:agent · evaluation · manifest

AI-powered team performance auditing using the Elon Algorithm for ruthless evaluation of output velocity, quality, independence, and initiative. Delivers A/B/C stack ranking with p…

claude-code · codex · cursor · gemini-cli · type:audit · performance · stack-ranking

都市悬疑-平台签约评估框架

Evaluates urban mystery manuscripts for platform serialization potential using strict, conservative criteria that prioritize negative evidence. Routes through the core platform ass…

claude-code · codex · cursor · gemini-cli · urban-mystery · assessment · platform

Designs and implements rigorous evaluations for LLM agents, multi-agent systems, skills, and prompts—covering frameworks like DeepEval, Braintrust, and RAGAS plus metrics such as p…

claude-code · codex · cursor · gemini-cli · ai:agent · llm · evaluation

Evaluates and selects technologies using a weighted decision matrix. Activates for technology choices, framework comparisons, technical alternatives, decision matrices, stack evalu…

claude-code · codex · cursor · gemini-cli · stack · frameworks · databases

OPC — One Person Company. Digraph-based task pipeline with independent multi-role evaluation. Builds, reviews, analyzes, and brainstorms with specialist agents. Every path ends wit…

claude-code · openclaw · pipeline · agents · evaluation

paper-autoraters

Execute the four paper-quality autoraters: Citation F1, Literature Review Quality (6-axis), SxS Overall Paper Quality, and SxS Literature Review Quality. Triggers on requests to sc…

claude-code · codex · cursor · gemini-cli · type:review · evaluation · quality

analyze-generative-diffusion-model

Evaluates pre-trained generative diffusion models by computing FID, IS, CLIP scores, precision/recall, inspecting noise schedules, extracting attention maps, and probing latent spa…

claude-code · codex · cursor · gemini-cli · diffusion · generative · evaluation

vendor-analysis

Comprehensive vendor evaluation covering financial stability, operational metrics, technical alignment, risk factors, and growth opportunities. Delivers HTML reports, Excel data, a…

claude-code · codex · cursor · gemini-cli · type:review · vendor · evaluation

langsmith-observability

LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, o…

claude-code · codex · cursor · gemini-cli · llm · observability · tracing

Stage 8 interlinear evaluator — audit completeness and correctness of candidate interlinear artifacts, benchmark them against ground truth, and decide pass/block/promote. Use for e…

claude-code · codex · cursor · gemini-cli · type:audit · type:review · interlinear

ai-scientist-evaluator

Critically review, score, compare, and rank AI scientist outputs for biology, bioinformatics, and life science research. Evaluate notebooks, code, analyses, manuscripts, and report…

claude-code · codex · cursor · gemini-cli · type:audit · type:review · evaluation

business-ideation

A general-purpose skill for developing business or service ideas through three phases—divergence, deepening, and evaluation—usable with any theme or industry. Maintains notes as th…

claude-code · codex · cursor · gemini-cli · ideation · brainstorming · business-models

Measures variance of a stochastic evaluator by running N parallel executions, aggregates scores deterministically, and flags noise versus regression against a baseline. Use when re…

claude-code · codex · cursor · gemini-cli · type:review · evaluation · stochastic

model-migration-plan

Creates a phased migration plan for swapping LLM models in production, including eval gates, shadow and canary stages, prompt adaptation, and rollback criteria. Use when deprecatin…

claude-code · codex · cursor · gemini-cli · llm · migration · evaluation

Reviews plan.md against an existing resolved design.md, then runs generate+evaluate loops to produce iters/*.png with status completed/partial/aborted. Requires finalized visual-sp…

claude-code · codex · cursor · gemini-cli · type:review · design.md · plan.md

Academic advisor evaluation system that investigates professors worldwide. Searches publications, maps co-author networks, tracks student outcomes, classifies advisor types, and as…

claude-code · codex · cursor · gemini-cli · academic · advisor · publications

talk-moss-skills-team-workflow

Guides teams in structuring skill governance: decomposition, ownership, versioning, evaluation scenarios, quality reviews, and lifecycle maintenance. Use when shifting from ad-hoc …

claude-code · codex · cursor · gemini-cli · type:review · skills · governance

Scans LLM call sites in any codebase to improve code quality, validate business value through user-task testing, and benchmark model performance vs cost. Stores findings in an Obsi…

claude-code · codex · cursor · gemini-cli · ai:llm · llm · benchmarking

weak-agent-test

Executes adversarial testing of document agents across six real-world scenarios—five editing tasks and one multi-column poetry authoring—measuring tool efficiency, rendering fideli…

claude-code · codex · cursor · gemini-cli · ai:agent · adversarial · document

Scores a generated visual asset against its brief for fidelity and brand alignment, returning a verdict plus diagnosis inside an evaluation loop. Works on one asset per cycle; excl…

claude-code · codex · cursor · gemini-cli · type:debug · type:generator · visual-asset

agsy-data-and-model-evaluation

Evaluate models and analyze results for Agricultural Systems manuscripts. Includes observed-vs-simulated comparisons, fit statistics, sensitivity and uncertainty analysis, and trad…

claude-code · codex · cursor · gemini-cli · type:review · agriculture · evaluation

evaluate-submission

Run complete TAM partnership evaluations on WooCommerce Marketplace submissions: analyze threads, detect gaps, score across six fit dimensions with adversarial checks, draft respon…

claude-code · codex · cursor · gemini-cli · woocommerce · marketplace · partnership

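A cyber HR system that runs performance reviews for digital beings. Evaluates skills generated by projects such as colleague-skill and packages the evaluation results in big-tech KPI methodology (OKR, 360 feedback, promotion defense, PIP). Includes the built-in linkskill-bench evaluation engine. | Cyber HR system that evaluates digital personas using big-tech KPI…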

claude-code · kpi · performance · evaluation

oss-kpi-evaluation

Evaluates open source projects against KPIs via multi-agent research. Pulls live web data on community health, maintenance, security, documentation, adoption, and code quality for …

claude-code · codex · cursor · gemini-cli · ai:agent · open-source · kpi

workflow-mieterhoehung-entscheidung

Rent increase decision workflow for rental and condominium law: evaluates consent requirements, caps, blocking periods, and objections with cold-start, deadline checks, evidence ma…

claude-code · codex · cursor · gemini-cli · rent-increase · rental-law · evaluation

Showing the top 60 of 976.