Verifier Registry

38 verifiers across 19 domains, organised into three tiers: HARD (deterministic state probes), SOFT (LLM-judged rubrics), and AGENTIC (multi-step tool-use checks).

aiv.calendar.event_created

AGENTIC

Queries the calendar HTTP API to verify that an event was created with the correct title, date, and participants.

aiv.email.sent_folder_confirmed

AGENTIC

Opens an IMAP connection to the Sent folder and searches for a matching message by recipient and subject fragment. Demonstrates the 'latent state' insight: a UI confirmation does not by itself show that the message was actually delivered.

15.0s

net:imap

3+ / 3− / 3 adv
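A minimal sketch of the Sent-folder check described above, using Python's standard `imaplib`. The function names, the `Sent` folder name, and the connection parameters are assumptions for illustration, not the registry's actual implementation; the default timeout mirrors the 15.0s on the card on the assumption that it is a time budget. Folder names differ between mail providers.

```python
import imaplib


def build_search_criteria(recipient: str, subject_fragment: str) -> list[str]:
    """Build IMAP SEARCH criteria for recipient and subject (hypothetical helper)."""
    def quote(value: str) -> str:
        # IMAP quoted strings escape backslashes and double quotes.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return ["TO", quote(recipient), "SUBJECT", quote(subject_fragment)]


def sent_folder_contains(host: str, user: str, password: str,
                         recipient: str, subject_fragment: str,
                         folder: str = "Sent", timeout_s: float = 15.0) -> bool:
    """Return True if the Sent folder holds at least one matching message."""
    with imaplib.IMAP4_SSL(host, timeout=timeout_s) as conn:
        conn.login(user, password)
        # readonly=True so the probe cannot change message flags.
        status, _ = conn.select(folder, readonly=True)
        if status != "OK":
            return False
        status, data = conn.search(
            None, *build_search_criteria(recipient, subject_fragment)
        )
        return status == "OK" and bool(data and data[0].split())
```

The check reads mailbox state directly instead of trusting what the agent or the UI reported.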

aiv.shell.state_probe

AGENTIC

Executes a sandboxed read-only shell command and compares stdout to an expected value, probing whether an agent action actually changed system state.
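As a rough illustration of the stdout comparison described above, the sketch below runs a command without a shell and compares its trimmed output to an expected value. The function name and parameters are hypothetical, and the sandboxing and read-only enforcement mentioned on the card are not implemented here.

```python
import subprocess
import sys


def shell_state_probe(argv: list[str], expected: str, timeout_s: float = 5.0) -> bool:
    """Run argv (no shell) and compare trimmed stdout to the expected value."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        # A missing binary or a hung command counts as a failed probe.
        return False
    return result.returncode == 0 and result.stdout.strip() == expected


# Usage: probe a command whose output is known in advance.
ok = shell_state_probe([sys.executable, "-c", "print('ready')"], "ready")
```

Passing the command as an argument list avoids shell interpolation of agent-controlled strings.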

api.http.header_present

HARD

Verifies that an HTTP response contains expected headers.

api.http.response_matches

HARD

Verifies that an HTTP response body contains expected substrings.

api.http.status_ok

Verifies that an HTTP endpoint returns the expected status code.
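The three `api.http.*` checks above can be sketched as small predicates over a fetched response. All names here are hypothetical, and the case-insensitive header match and default status of 200 are assumptions, not documented behaviour of the registry.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url: str, timeout_s: float = 10.0) -> tuple[int, dict[str, str], str]:
    """Return (status, headers, body); HTTP error responses are returned, not raised."""
    try:
        with urlopen(url, timeout=timeout_s) as resp:
            return resp.status, dict(resp.headers.items()), resp.read().decode("utf-8", "replace")
    except HTTPError as err:
        return err.code, dict(err.headers.items()), err.read().decode("utf-8", "replace")


def status_ok(status: int, expected: int = 200) -> bool:
    return status == expected


def header_present(headers: dict[str, str], expected_names: list[str]) -> bool:
    # Header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name in headers}
    return all(name.lower() in present for name in expected_names)


def response_matches(body: str, substrings: list[str]) -> bool:
    return all(fragment in body for fragment in substrings)
```

Keeping the predicates separate from `fetch` lets them be exercised on recorded responses without network access.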

code.python.lint_ruff

HARD

Verifies that agent-generated Python code passes ruff lint checks with configurable violation thresholds.

code.python.tests_pass

HARD

Writes agent-generated Python code to a temp directory and runs pytest via the sandbox runner. Score is based on the fraction of tests that pass.
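One way to derive the pass fraction is from a JUnit-style XML report such as the one pytest writes with `--junitxml`. The sketch below assumes that report format and assumes skipped tests count as not passed; the registry's actual scoring rule is not stated in the card.

```python
import xml.etree.ElementTree as ET


def pass_fraction(junit_xml: str) -> float:
    """Fraction of tests that passed, computed from testsuite attributes."""
    root = ET.fromstring(junit_xml)
    total = 0
    passed = 0
    for suite in root.iter("testsuite"):
        tests = int(suite.get("tests", 0))
        not_passed = sum(
            int(suite.get(key, 0)) for key in ("failures", "errors", "skipped")
        )
        total += tests
        passed += tests - not_passed
    # No collected tests scores zero (an assumption, not a documented rule).
    return passed / total if total else 0.0


report = (
    '<testsuites><testsuite name="pytest" tests="4" failures="1" '
    'errors="0" skipped="0"/></testsuites>'
)
score = pass_fraction(report)
```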

database.row.exists

Verifies that a row matching given criteria exists in a database table.

database.row.updated

Verifies that a database row was updated to contain expected values.

database.table.row_count

HARD

Verifies that a database table has the expected number of rows.
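A minimal sketch of a row-count check, assuming an SQLite database for illustration (the card does not name a database engine). The function name and the identifier validation are hypothetical choices.

```python
import re
import sqlite3


def table_row_count_matches(conn: sqlite3.Connection, table: str, expected: int) -> bool:
    """Return True if the table holds exactly the expected number of rows."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    return count == expected


# Usage against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)", [("open",), ("shipped",)])
```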

document.csv.row_count

HARD

Verifies that a CSV file has the expected number of data rows.

document.json.valid

Verifies that a file contains valid JSON with optional type and key checks.

document.pdf.page_count

HARD

Verifies that a PDF has the expected number of pages.

document.text.contains

HARD

Verifies that a text file contains all expected substrings.

document.yaml.valid

Verifies that a file contains valid YAML with optional key checks.

filesystem.file_created

HARD

Verifies that a file was created at the expected path with optional size and content hash checks.

git.commit_present

Verifies that a specific git commit exists in a repository by SHA prefix or message substring.

rubric.code.logic_correct

SOFT

LLM-judge rubric verifier scoring code logic correctness on 4 criteria: algorithm, edge cases, logic errors, requirements.

rubric

2.5s

3+ / 3− / 3 adv

rubric.email.tone_professional

SOFT

Four-component rubric scored by an LLM judge: greeting, formality, key information, and absence of inappropriate content. Score = sum/4. MUST be composed with vr/aiv.email.sent_folder_confirmed.

2.0s

3+ / 3− / 3 adv
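The scoring rule and the composition requirement on this card can be illustrated with a short sketch. It assumes each component is judged on a 0 to 1 scale and that composition means the tone score only counts when the Sent-folder check passes; both are assumptions, as are the component key names.

```python
COMPONENTS = ("greeting", "formality", "key_info", "no_inappropriate_content")


def tone_score(judgements: dict[str, float]) -> float:
    """Score = sum of the four component scores divided by 4."""
    missing = [name for name in COMPONENTS if name not in judgements]
    if missing:
        raise KeyError(f"missing rubric components: {missing}")
    return sum(float(judgements[name]) for name in COMPONENTS) / 4


def composed_score(sent_confirmed: bool, judgements: dict[str, float]) -> float:
    """Gate the soft score on the hard check (assumed composition semantics)."""
    return tone_score(judgements) if sent_confirmed else 0.0


judged = {"greeting": 1, "formality": 0, "key_info": 1, "no_inappropriate_content": 1}
```

Gating this way keeps a well-written email that was never sent from earning credit.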

rubric.summary.faithful

SOFT

Scores summary faithfulness to source text via a 3-component LLM rubric: factual accuracy, key points coverage, no hallucinations.

tau2.airline.rebooking_correct

HARD

Queries airline API to confirm flight rebooking fields match expected values (date, cabin class, passengers).

tau2.policy.constraint_not_violated

HARD

Pure-logic verifier checking agent action traces against domain policy rules. Works for any domain with codifiable constraints.
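A pure-logic check of this kind can be sketched as a set of named predicates evaluated over an action trace. The trace shape (a list of dicts with a `name` field) and both example rules are hypothetical; real policy rules would come from the domain in question.

```python
from typing import Callable

Action = dict
Rule = Callable[[list[Action]], bool]  # returns True when the rule is satisfied


def identity_before_refund(trace: list[Action]) -> bool:
    """Hypothetical rule: a refund may only follow an identity check."""
    verified = False
    for action in trace:
        if action["name"] == "verify_identity":
            verified = True
        elif action["name"] == "issue_refund" and not verified:
            return False
    return True


def max_refund_amount(limit: float) -> Rule:
    """Hypothetical rule factory: no single refund may exceed the limit."""
    def rule(trace: list[Action]) -> bool:
        return all(
            action.get("amount", 0) <= limit
            for action in trace
            if action["name"] == "issue_refund"
        )
    return rule


def violated_rules(trace: list[Action], rules: dict[str, Rule]) -> list[str]:
    """Names of all rules the trace violates; an empty list means compliant."""
    return [name for name, rule in rules.items() if not rule(trace)]


RULES = {"identity_before_refund": identity_before_refund,
         "refund_limit": max_refund_amount(100.0)}
```

Because the rules need no network or model calls, the same trace always yields the same verdict.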

tau2.retail.inventory_updated

HARD

Queries retail API to confirm that a product SKU has the expected quantity in inventory after an agent action.

tau2.retail.order_cancelled

HARD

Queries retail API to confirm an order is in cancelled state with matching reason code.

tau2.retail.refund_processed

HARD

Queries a mock or real retail API to confirm that a refund has been processed with the expected amount and status. Catches agents that claim a refund was issued when the actual state shows otherwise.

tau2.telecom.plan_changed

HARD

Verifies that a customer's telecom plan was changed to the expected plan with correct effective date via CRM API.

web.browser.element_visible

AGENTIC

Navigates to a URL via a headless browser and checks whether an element matching a specific CSS selector is present in the DOM. Catches agents that claim UI actions succeeded but never actually modified the page.

web.browser.screenshot_match

AGENTIC

Captures a live screenshot via browser automation and compares it to a reference image using SSIM (Structural Similarity Index).
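For reference, the SSIM formula can be illustrated with a simplified single-window version over two equal-length lists of grayscale pixel values. Production implementations usually average SSIM over local sliding windows; this global variant, the function name, and the conventional constants (0.01 and 0.03 times the data range) are assumptions and say nothing about how the verifier itself is configured.

```python
def global_ssim(x: list[float], y: list[float], data_range: float = 255.0) -> float:
    """Single-window SSIM over two flattened grayscale images of equal size."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Stabilising constants, using the conventional K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [0.0, 255.0, 0.0, 255.0]
inverted = [255.0, 0.0, 255.0, 0.0]
```

Identical images score 1, and a threshold below 1 would tolerate small rendering differences between the live screenshot and the reference.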

web.ecommerce.order_placed