Research
assess-outline
Evaluates research paper outlines against empirical research criteria for research question clarity, contributions, hypotheses, data description, results approach, and robustness c…
AI / ML
forge-evals
Design evaluations for LLM features including golden datasets, rubric scoring, LLM-as-judge calibration, CI regression detection, online A/B tests, cost and latency budgets, and ad…
AI / ML
ce-optimize
Run metric-driven iterative optimization loops. Define a goal, add measurement scaffolding, execute parallel experiments across approaches, score results against gates or quality j…
AI / ML
experiment-lab
Run reproducible CS/AI experiments, model evaluation, regression/classification/clustering analyses, bioinformatics workflows, QC, differential expression, single-cell starter anal…
Testing
os-eval-runner
Stateless evaluation engine that scores and gates skill improvement iterations using headless Python scripts. Use for "evaluate this skill", "run autoresearch loop", "optimize this…
AI / ML
evaluating-cosmos-policy
Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations…
Content
都市悬疑-平台签约评估框架
Evaluates urban mystery manuscripts for platform serialization potential using strict, conservative criteria that prioritize negative evidence. Routes through the core platform ass…
Engineering
choose-stack
Evaluates and selects technologies using a weighted decision matrix. Activates for technology choices, framework comparisons, technical alternatives, decision matrices, stack evalu…
Testing
paper-autoraters
Execute the four paper-quality autoraters: Citation F1, Literature Review Quality (6-axis), SxS Overall Paper Quality, and SxS Literature Review Quality. Triggers on requests to sc…
Testing
gloss-review
Stage 8 interlinear evaluator — audit completeness and correctness of candidate interlinear artifacts, benchmark them against ground truth, and decide pass/block/promote. Use for e…
Research
ai-scientist-evaluator
Critically review, score, compare, and rank AI scientist outputs for biology, bioinformatics, and life science research. Evaluate notebooks, code, analyses, manuscripts, and report…
Design
visual-plan
Reviews plan.md against an existing resolved design.md, then runs generate+evaluate loops to produce iters/*.png with status completed/partial/aborted. Requires finalized visual-sp…
Research
deeptutor
Academic advisor evaluation system that investigates professors worldwide. Searches publications, maps co-author networks, tracks student outcomes, classifies advisor types, and as…
Business
workflow-mieterhoehung-entscheidung
Rent increase decision workflow for rental and condominium law: evaluates consent requirements, caps, blocking periods, and objections with cold-start, deadline checks, evidence ma…
Business
azubi-zeugnis-analyse
Analyzes apprenticeship certificates under BBiG. Evaluates learning progress, vocational school performance, practical training tasks, and workplace behavior with industry-neutral …
Testing
evaluation
Applies when evaluating agent performance, building test frameworks, measuring agent quality, or creating evaluation rubrics. Covers LLM-as-judge methods, multi-dimensional evaluat…
AI / ML
eval-driven-development
Build language-model-integrated systems by writing evaluations first. Covers statistical eval nature, five primitives, judgment taxonomy, system evals vs benchmarks, and how result…
Business
framework-fit-analysis
Evaluates frameworks, libraries, SDKs, runtimes, databases, UI kits, or platforms by fit criteria including constraints, team skill, ecosystem maturity, migration cost, operability…
Content
youtube-verdict
Evaluate YouTube videos before watching. Returns a firm WATCH or SKIP recommendation, 0-10 score, best viewing range, substance density, and audience fit when a URL is provided wit…
AI / ML
arize-prompt-optimization
Optimize, debug, or improve LLM prompts with production trace data, evaluations, and annotations. Extract prompts from spans, gather performance signals, and run data-driven optimi…
AI / ML
evaluating-llms
Evaluates LLM systems using automated metrics, LLM-as-judge methods, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety, or comparing model…
Design
ux-evaluator
Evaluate UI components against UX best practices using a 3-dimension framework (Position, Visual Weight, Spacing) aligned with industry standards. Reviews buttons, navigation, spac…
Research
approach-evaluation
Research industry standards, identify viable approaches for a technical or architectural problem, and deliver a structured factual comparison against project constraints. Reports o…
AI / ML
council-mode
Runs identical tasks across multiple model tiers or providers in parallel to expose disagreements, failures, and hallucinations. Surfaces failure modes for routing decisions rather…
AI / ML
simmer-judge
Scores candidate artifacts against user-defined criteria on a 1-10 scale and generates ASI guidance for the next round. Supports judge-only, runnable evaluator, and hybrid modes as…
DevOps
autopilot-deploy
Advances agent or skill deployments through four canary stages (1 percent, 10 percent, 50 percent, 100 percent) gated by continuous evaluation and SLO compliance, with automatic ro…
Business
通用-平台签约评估框架
Evaluates a work's signing or approval potential on target platforms using strict, conservative criteria with negative evidence prioritized. Includes phased admission checks and co…
Content
Création de fiches d'évaluation et questionnaires en bureautique
Generates structured templates for evaluating office software skills, including main evaluation forms, evaluator comment templates, and general competency questionnaires formatted …
Research
paper-repo-evaluator
Evaluates code repository quality linked to research papers, covering GitHub metrics, language, and integration effort. Auto-triggers during paper-pipeline workflows at the code as…
AI / ML
langsmith-sdk-for-llm-tracing-and-evaluation
Provides tracing, evaluation, and debugging workflows for LLM applications. Useful when an agent team needs structured observability around prompts, chains, tool calls, datasets, a…
Content
content-evaluation-framework
Applies a weighted 6-category rubric to assess book chapters, lessons, or educational content for technical accuracy, pedagogy, writing quality, structure, AI-first teaching, and c…
Automation
skill-iter-tune
Runs iterative tuning via execute-evaluate-improve loops. Executes the skill, evaluates output quality, and applies improvements until thresholds are met. Triggers on tuning-relate…
Research
wuerfel-aufbauen
Builds the three-dimensional cube structure for a new review project across three axes: columns as data points, rows as documents with optional row prompts, and depth as evaluation…
Engineering
tech-radar
Evaluates, compares, or recommends technologies, frameworks, or tools for formal adoption. Triggers on /tech-radar, technology assessment, or questions about adding items to a tech…
Engineering
search-before-building
Evaluate existing repository capabilities, external libraries, MCP options, and maintenance risks before authoring custom code, then decide adopt, wrap, or build using explicit cri…
Research
paper-writing-bench
Reverse-engineers raw materials from an existing AI research paper to create benchmark cases for evaluating paper-writing pipelines, following the PaperWritingBench construction me…
Productivity
thoroughly-rate-review
Reviews, rates, scores, assesses, evaluates, grades, or benchmarks quality using a custom weighted scoring model with multiple checks per category, explicit evidence, and a final s…
Productivity
skilleval
Professional skills assessment system providing evidence-based evaluation across technical abilities, soft skills, and domain expertise with multi-dimensional scoring and gap analy…
AI / ML
evaluating-machine-learning-models
Evaluates trained machine learning models using appropriate metrics and comparison logic. Use for benchmark review, threshold selection, calibration, validation, and model comparis…
Research
ground-truth
Stage 8 ground-truth provider that acquires external benchmark corpora, builds gloss benchmarks and generates dictionary-gap reports as read-only evidence for gloss-review evaluati…
AI / ML
launching-evals
Runs, monitors, analyzes, and debugs LLM evaluations using nemo-evaluator-launcher. Supports status checks, progress tracking, failure debugging, artifact export, and result analys…
Content
女频爱情-平台签约评估框架
Evaluates romance manuscripts for platform acquisition and serialization using strict, conservative criteria focused on negative evidence. Loads the generic platform assessment ski…
General
os-skill-improvement
Improves existing agent skills using RED-GREEN-REFACTOR based on evaluation results. Runs baseline tests, applies targeted patches, and refactors until accuracy thresholds are met.
AI / ML
simmer-setup
Inspects artifacts or workspaces, infers evaluation contracts and search space, then proposes a complete assessment. Conversational setup that presents findings after confirmation.
AI / ML
simmer-judge-board
Dispatches a panel of judges with varied lenses, runs deliberation to challenge scores, then synthesizes consensus plus ASI. Drop-in replacement for simmer-judge under board mode.
General
thinking-lindy-effect
For non-perishable things, future life expectancy is proportional to current age. Use for technology selection, evaluating frameworks and libraries, and predicting tool longevity.
AI / ML
aice
AI Confidence Engine scoring across five bidirectional domains. Supports agent and user evaluation with triggers for task completion, idea validation, and pooled runtime scoring.
Testing
llm-ai-pipeline-test-review
Review LLM or AI pipeline evaluation setup including metrics, golden datasets, thresholds, adversarial coverage, and regression gating to catch unsafe outputs before production.
Testing
agent-evals
Design evaluation frameworks for AI agents. Apply when testing reasoning quality, building graders, analyzing errors, or adding regression protection. Works with any agent SDK.
Research
argument-evaluation
Evaluate arguments for structure, validity, soundness, and charitable interpretation through premise identification, conclusion extraction, argument mapping, and steel-manning.
AI / ML
write-judge-prompt
Design LLM-as-Judge evaluators for subjective criteria such as tone, faithfulness, relevance, or completeness. Use only when code-based checks cannot validate the failure mode.
Productivity
exploration-optimizer
Evaluates and improves exploration-cycle skills, prompts, routing, and artifact quality through baseline-first iteration loops, keep-discard decisions, and experiment ledgers.
Testing
agency-evaluation-criteria
Quality evaluation criteria for AI agency project output covering design quality, originality, completeness, functionality scoring, and Playwright-based testing requirements.
AI / ML
langsmith-observability
LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, and monitoring production.
Business
bvssh-check
Evaluates whether current work aligns with Better Value Sooner Safer Happier principles. Invoke at diamond completion milestones and at regular intervals during development.
Design
design-evaluation-audit
Systematically evaluate existing designs for cognitive alignment, conduct reviews or critiques, diagnose usability issues, and compare alternatives with objective criteria.
AI / ML
generate-synthetic-data
Create diverse synthetic test inputs for LLM pipeline evaluation via dimension-based tuple generation. Use when bootstrapping eval datasets or stress-testing failure modes.
AI / ML
agent-ready-eval
Evaluate codebases for agent-friendliness using autonomous agent best practices. Assess infrastructure suitability for unattended agent execution and autonomous workflows.
AI / ML
ai-prompt-engineering
Apply operational prompt engineering for production LLM applications including structured outputs, RAG grounding, tool workflows, safety controls, and evaluation testing.