docs: prefer rubrics over llm-grader with inline prompt by christso · Pull Request #1097 · EntityProcess/agentv

christso · 2026-04-14T05:54:39Z

Summary

Replaces all instances of the llm-grader-with-inline-`prompt:` antipattern across examples, skill docs, and web docs with rubrics assertions.

Why: `rubrics` gives structured, per-criterion scores that are easier to debug when a test fails. An inline `prompt:` on an llm-grader is an opaque free-form blob: it can reach the same verdict, but it gives no criterion breakdown to show which expectation was missed.
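To make the debugging argument concrete, here is a minimal sketch of the difference between a single opaque score and a per-criterion breakdown. The `CriterionResult` and `OpaqueResult` shapes, the helper names, and the unweighted pass fraction are all assumptions made for illustration; they are not agentv's actual result schema or scoring rule.

```typescript
// Hypothetical shapes for illustration; agentv's real result schema may differ.
interface CriterionResult {
  criterion: string;
  passed: boolean;
}

// In this sketch, an inline-prompt grader yields only one number.
interface OpaqueResult {
  score: number;
}

// Collapse per-criterion results into a single score.
// Assumption: an unweighted fraction of passed criteria.
function overallScore(results: CriterionResult[]): number {
  if (results.length === 0) return 0;
  const passedCount = results.filter((r) => r.passed).length;
  return passedCount / results.length;
}

// The breakdown still lets us name exactly which criteria failed.
function failedCriteria(results: CriterionResult[]): string[] {
  return results.filter((r) => !r.passed).map((r) => r.criterion);
}

const rubricResults: CriterionResult[] = [
  { criterion: "Mentions the refund policy", passed: true },
  { criterion: "Gives a concrete next step", passed: false },
];
const opaque: OpaqueResult = { score: overallScore(rubricResults) };

console.log(opaque.score); // 0.5
console.log(failedCriteria(rubricResults)); // the one failing criterion
```

Both forms arrive at the same number, but only the per-criterion form says what to fix.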

Preferred hierarchy (now reflected in examples):

1. Deterministic assertions (contains, regex, equals, is-json): no LLM cost, fully reproducible
2. rubrics / string shorthand: structured LLM judgment with per-criterion scores
3. llm-grader (no prompt): implicit eval against criteria, fine for simple cases
4. llm-grader with `prompt: ./file.md`: custom prompts; the legitimate use case for a file-backed grader
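The first tier of the hierarchy above is cheap and reproducible because each check is a pure function of the output string. The sketch below illustrates that idea only; the helper names (`contains`, `matchesRegex`, `equals`, `isJson`, `runChecks`) are hypothetical and are not agentv's implementation or API.

```typescript
// Hypothetical helpers, not agentv's implementation: each check is a pure
// function of the output string, so repeated runs always agree.
type Check = (output: string) => boolean;

const contains = (needle: string): Check => (output) => output.includes(needle);
const matchesRegex = (pattern: RegExp): Check => (output) => pattern.test(output);
const equals = (expected: string): Check => (output) => output === expected;
const isJson: Check = (output) => {
  try {
    JSON.parse(output);
    return true;
  } catch {
    return false;
  }
};

// Run every check and return the names of the ones that failed.
function runChecks(output: string, checks: Record<string, Check>): string[] {
  return Object.entries(checks)
    .filter(([, check]) => !check(output))
    .map(([name]) => name);
}

const failures = runChecks('{"status": "ok"}', {
  "is-json": isJson,
  contains: contains('"status"'),
  regex: matchesRegex(/"ok"/),
  equals: equals("ok"),
});
console.log(failures); // only the `equals` check fails for this output
```

Anything that can be expressed this way needs no model call, which is why it sits above rubrics in the ordering.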

Files changed:

examples/features/preprocessors/evals/dataset.eval.yaml
examples/features/threshold-evaluator/evals/dataset.eval.yaml
examples/features/experiments/evals/coding-ability.eval.yaml
examples/features/default-evaluators/evals/dataset.eval.yaml
plugins/agentv-dev/skills/agentv-bench/SKILL.md
apps/web/src/content/docs/docs/evaluation/examples.mdx
apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx

Deleted baselines (need regeneration with real LLM calls):

examples/features/threshold-evaluator/evals/dataset.eval.baseline.jsonl
examples/features/default-evaluators/evals/dataset.eval.baseline.jsonl

To regenerate: bun scripts/check-eval-baselines.ts --update --eval-file <file>

Not changed: apps/cli/test/commands/eval/pipeline/fixtures/input-test.eval.yaml — this test fixture explicitly tests llm-grader pipeline behavior (prompt content writing, llm_grader_results/ directory) and is not a teaching example.

🤖 Generated with Claude Code

Replace all instances of the llm-grader-with-inline-prompt antipattern in examples, skill docs, and web docs with rubrics assertions. Rubrics give per-criterion score breakdowns and make failure debugging easier; inline llm-grader prompts are an opaque free-form alternative.

Also update agent-eval-layers.mdx tables to reference `rubrics` instead of `llm-grader` for LLM-based quality checks, and delete two stale baseline files (need regeneration after grader type change).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-04-14T05:55:13Z

Deploying agentv with Cloudflare Pages

Latest commit:	`2f63c97`
Status:	✅ Deploy successful!
Preview URL:	https://42ae0559.agentv.pages.dev
Branch Preview URL:	https://docs-prefer-rubrics-over-inl.agentv.pages.dev

Replace full `type: rubrics` + `criteria:` blocks with inline string shorthand where there are no weights, required gates, or composite naming concerns. Composite children in threshold-evaluator keep named rubrics graders to preserve threshold aggregation semantics.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

christso merged commit 209db97 into main Apr 14, 2026
4 checks passed

christso deleted the docs/prefer-rubrics-over-inline-llm-grader branch April 14, 2026 22:25

christso merged 2 commits into main from docs/prefer-rubrics-over-inline-llm-grader
