Verifier Registry

38 verifiers across 19 domains, organised into three tiers: HARD (deterministic state probes), SOFT (LLM-judged rubrics), and AGENTIC (multi-step tool-use checks). Browse below or filter by tier and domain.


aiv.calendar.event_created

AGENTIC

Verifies that a calendar event was created with correct title, date, and participants via HTTP API query.

aiv
12.0s
$$
net:http
3+ / 3− / 3 adv

aiv.email.sent_folder_confirmed

AGENTIC

Opens an IMAP connection to the Sent folder and searches for a matching message by recipient and subject fragment. Illustrates the 'latent state' insight: a confirmation in the UI does not guarantee that the message was actually delivered.

email
15.0s
$$
net:imap
3+ / 3− / 3 adv

aiv.shell.state_probe

AGENTIC

Executes a sandboxed read-only shell command and compares stdout to an expected value, probing whether an agent action actually changed system state.

aiv
12.0s
$$
subprocess:readonly
3+ / 3− / 3 adv
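The probe-and-compare idea behind aiv.shell.state_probe can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `probe_state`, its return shape, and the timeout value are hypothetical, and nothing here enforces the sandbox or read-only restriction that the registry's `subprocess:readonly` permission implies.

```python
import subprocess
import sys


def probe_state(argv, expected, timeout=10.0):
    """Run a command (no shell) and compare its trimmed stdout to `expected`.

    Hypothetical helper: real sandboxing and read-only enforcement are
    out of scope for this sketch.
    """
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    observed = result.stdout.strip()
    passed = result.returncode == 0 and observed == expected
    return {"passed": passed, "observed": observed}


# Usage: the "system state" here is just the output of a tiny Python command.
matching = probe_state([sys.executable, "-c", "print('ready')"], "ready")
mismatch = probe_state([sys.executable, "-c", "print('stale')"], "ready")
```

Passing the command as an argument list rather than a shell string avoids shell interpolation of agent-controlled values, which seems a sensible default for a verifier.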

api.http.header_present

HARD

Verifies that an HTTP response contains expected headers.

api
40ms
free
net:http
3+ / 3− / 3 adv

api.http.response_matches

HARD

Verifies that an HTTP response body contains expected substrings.

api
40ms
free
net:http
3+ / 3− / 3 adv

api.http.status_ok

HARD

Verifies that an HTTP endpoint returns the expected status code.

api
40ms
free
net:http
3+ / 3− / 3 adv

code.python.lint_ruff

HARD

Verifies that agent-generated Python code passes ruff lint checks with configurable violation thresholds.

code
35ms
free
fs:write_tmp
exec:ruff
3+ / 3− / 3 adv

code.python.tests_pass

HARD

Writes agent-generated Python code to a temp directory and runs pytest via the sandbox runner. Score is based on the fraction of tests that pass.

code
35ms
free
fs:write_tmp
exec:pytest
3+ / 3− / 3 adv
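One way the "fraction of tests that pass" score for code.python.tests_pass could be computed is from pytest's JUnit XML report. The sketch below is an assumption about the approach, not the registry's implementation: the helper names, the file names `solution.py` and `test_solution.py`, and the choice to count skipped tests as not passed are all hypothetical.

```python
import subprocess
import sys
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def score_from_junit(xml_text):
    """Return passed/total from a JUnit XML report (0.0 if no tests ran)."""
    root = ET.fromstring(xml_text)
    total = passed = 0
    for suite in root.iter("testsuite"):
        tests = int(suite.get("tests", 0))
        # Assumption: failures, errors, and skips all count as "not passed".
        not_passed = sum(int(suite.get(k, 0)) for k in ("failures", "errors", "skipped"))
        total += tests
        passed += tests - not_passed
    return passed / total if total else 0.0


def run_tests(code, tests, timeout=60):
    """Write code and tests to a temp dir, run pytest there, and score the run."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        (tmp_path / "solution.py").write_text(code)
        (tmp_path / "test_solution.py").write_text(tests)
        report = tmp_path / "report.xml"
        subprocess.run(
            [sys.executable, "-m", "pytest", "-q", f"--junitxml={report}"],
            cwd=tmp_path, capture_output=True, text=True, timeout=timeout,
        )
        return score_from_junit(report.read_text()) if report.exists() else 0.0


# Usage: score a hand-written report without invoking pytest.
sample_report = '<testsuites><testsuite tests="4" failures="1" errors="0" skipped="0"/></testsuites>'
sample_score = score_from_junit(sample_report)
```

`run_tests` requires pytest to be installed in the interpreter that runs it; `score_from_junit` has no such dependency and can be checked on its own.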

database.row.exists

HARD

Verifies that a row matching given criteria exists in a database table.

database
30ms
free
db:read
3+ / 3− / 3 adv

database.row.updated

HARD

Verifies that a database row was updated to contain expected values.

database
30ms
free
db:read
3+ / 3− / 3 adv

database.table.row_count

HARD

Verifies that a database table has the expected number of rows.

database
30ms
free
db:read
3+ / 3− / 3 adv

document.csv.row_count

HARD

Verifies that a CSV file has the expected number of data rows.

document
15ms
free
fs:read
3+ / 3− / 3 adv
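A row-count check like document.csv.row_count might look like the following sketch. It assumes that "data rows" excludes a single header row and that blank lines are ignored; the function names and the `has_header` parameter are hypothetical, and the input is passed as text rather than read from a file.

```python
import csv
import io


def count_data_rows(csv_text, has_header=True):
    """Count non-empty CSV rows, excluding the header row if present."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if has_header and rows:
        rows = rows[1:]
    return len(rows)


def check_row_count(csv_text, expected, has_header=True):
    """Return True when the number of data rows equals `expected`."""
    return count_data_rows(csv_text, has_header) == expected


# Usage: one header, two data rows, and a trailing blank line.
sample = "id,name\n1,Ada\n2,Grace\n\n"
row_count_ok = check_row_count(sample, expected=2)
```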

document.json.valid

HARD

Verifies that a file contains valid JSON with optional type and key checks.

document
15ms
free
fs:read
3+ / 3− / 3 adv
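The "valid JSON with optional type and key checks" behaviour of document.json.valid could be approximated as below. The function name, parameters, and the `(passed, reason)` return shape are assumptions for illustration; the sketch takes the file's text as input and leaves file reading to the caller.

```python
import json


def check_json(text, expected_type=None, required_keys=()):
    """Validate JSON text, then optionally check top-level type and keys.

    Returns (passed, reason). Hypothetical interface.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if expected_type is not None and not isinstance(data, expected_type):
        return False, f"expected top-level {expected_type.__name__}"
    if required_keys:
        if not isinstance(data, dict):
            return False, "key checks require a JSON object"
        missing = [key for key in required_keys if key not in data]
        if missing:
            return False, f"missing keys: {missing}"
    return True, "ok"


# Usage
valid = check_json('{"id": 7, "status": "done"}', dict, ["id", "status"])
missing_key = check_json('{"id": 7}', dict, ["id", "status"])
```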

document.pdf.page_count

HARD

Verifies that a PDF has the expected number of pages.

document
15ms
free
fs:read
3+ / 3− / 3 adv

document.text.contains

HARD

Verifies that a text file contains all expected substrings.

document
15ms
free
fs:read
3+ / 3− / 3 adv

document.yaml.valid

HARD

Verifies that a file contains valid YAML with optional key checks.

document
15ms
free
fs:read
3+ / 3− / 3 adv

filesystem.file_created

HARD

Verifies that a file was created at the expected path with optional size and content hash checks.

filesystem
10ms
free
fs:read
3+ / 3− / 3 adv
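The existence check with optional size and content hash checks described for filesystem.file_created can be sketched as follows. The function name and parameters are hypothetical, and SHA-256 is an assumed choice of hash; the entry does not say which algorithm the verifier uses.

```python
import hashlib
import tempfile
from pathlib import Path


def check_file_created(path, min_size=None, sha256=None):
    """Return True if `path` is a regular file passing the optional checks."""
    target = Path(path)
    if not target.is_file():
        return False
    if min_size is not None and target.stat().st_size < min_size:
        return False
    if sha256 is not None:
        # Assumption: content hash is a hex-encoded SHA-256 digest.
        actual = hashlib.sha256(target.read_bytes()).hexdigest()
        if actual != sha256.lower():
            return False
    return True


# Usage: probe before and after the file is written.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "report.txt"
    missing_before = check_file_created(target)
    target.write_bytes(b"hello")
    digest = hashlib.sha256(b"hello").hexdigest()
    created_ok = check_file_created(target, min_size=1, sha256=digest)
    wrong_hash = check_file_created(target, sha256="0" * 64)
```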

git.commit_present

HARD

Verifies that a specific git commit exists in a repository by SHA prefix or message substring.

code
35ms
free
exec:git
3+ / 3− / 3 adv

rubric.code.logic_correct

SOFT

LLM-judge rubric verifier scoring code logic correctness on 4 criteria: algorithm, edge cases, logic errors, requirements.

rubric
2.5s
$
3+ / 3− / 3 adv

rubric.email.tone_professional

SOFT

4-component rubric scored by LLM judge: greeting, formality, key info, no inappropriate content. Score = sum/4. MUST be composed with vr/aiv.email.sent_folder_confirmed.

email
2.0s
$
3+ / 3− / 3 adv
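The "Score = sum/4" rule for rubric.email.tone_professional, and its required composition with the sent-folder check, might be combined as in this sketch. The component key names and both function names are hypothetical. Treating the HARD check as a gate that zeroes the score is an assumption about what "composed with" means here. The LLM judge itself is not called; its per-component scores are taken as input.

```python
COMPONENTS = ("greeting", "formality", "key_info", "no_inappropriate_content")


def rubric_score(component_scores):
    """Average the four component scores (each assumed to lie in [0, 1])."""
    missing = [name for name in COMPONENTS if name not in component_scores]
    if missing:
        raise ValueError(f"missing components: {missing}")
    return sum(component_scores[name] for name in COMPONENTS) / len(COMPONENTS)


def composed_score(sent_confirmed, component_scores):
    """Gate the rubric on the sent-folder result (assumed semantics)."""
    if not sent_confirmed:
        return 0.0  # a well-written email that was never sent earns nothing
    return rubric_score(component_scores)


# Usage
judged = {"greeting": 1.0, "formality": 1.0, "key_info": 0.5, "no_inappropriate_content": 1.0}
score_when_sent = composed_score(True, judged)
score_when_not_sent = composed_score(False, judged)
```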

rubric.summary.faithful

SOFT

Scores summary faithfulness to source text via a 3-component LLM rubric: factual accuracy, key points coverage, no hallucinations.

nlp
2.0s
$
net:llm
3+ / 3− / 3 adv

tau2.airline.rebooking_correct

HARD

Queries airline API to confirm flight rebooking fields match expected values (date, cabin class, passengers).

airline
25ms
free
net:http
3+ / 3− / 3 adv

tau2.policy.constraint_not_violated

HARD

Pure-logic verifier checking agent action traces against domain policy rules. Works for any domain with codifiable constraints.

cross-domain
25ms
free
3+ / 3− / 3 adv
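A pure-logic trace check in the spirit of tau2.policy.constraint_not_violated could be structured as below. Everything specific in this sketch is hypothetical: the `Rule` type, the trace format (a list of dicts with an `action` key), and the example refund rule are invented for illustration and are not taken from the registry.

```python
from typing import Callable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    name: str
    # Returns True if the action at index i of the trace violates the rule.
    violated_by: Callable[[List[dict], int], bool]


def check_trace(trace: List[dict], rules: List[Rule]) -> List[Tuple[str, int]]:
    """Return (rule name, step index) for every violation; empty means pass."""
    return [
        (rule.name, i)
        for i in range(len(trace))
        for rule in rules
        if rule.violated_by(trace, i)
    ]


def refund_without_verification(trace, i):
    """Hypothetical rule: a refund needs an earlier identity verification."""
    if trace[i]["action"] != "issue_refund":
        return False
    return not any(step["action"] == "verify_identity" for step in trace[:i])


rules = [Rule("refund_requires_verified_identity", refund_without_verification)]

# Usage: the first trace refunds before verifying, the second does not.
bad_trace = [{"action": "issue_refund"}, {"action": "verify_identity"}]
good_trace = [{"action": "verify_identity"}, {"action": "issue_refund"}]
bad_violations = check_trace(bad_trace, rules)
good_violations = check_trace(good_trace, rules)
```

Because the rules are plain predicates over the trace, the check needs no network or filesystem access, which is consistent with this entry listing no permissions.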

tau2.retail.inventory_updated

HARD

Queries retail API to confirm that a product SKU has the expected quantity in inventory after an agent action.

retail
25ms
free
net:http
3+ / 3− / 3 adv

tau2.retail.order_cancelled

HARD

Queries retail API to confirm an order is in cancelled state with matching reason code.

retail
25ms
free
net:http
3+ / 3− / 3 adv

tau2.retail.refund_processed

HARD

Queries a mock or real retail API to confirm that a refund has been processed with the expected amount and status. Catches agents that claim a refund was issued when the actual state shows otherwise.

retail
25ms
free
net:http
3+ / 3− / 3 adv

tau2.telecom.plan_changed

HARD

Verifies that a customer's telecom plan was changed to the expected plan with correct effective date via CRM API.

tau2
25ms
free
net:http
3+ / 3− / 3 adv

web.browser.element_visible

AGENTIC

Navigates to a URL via headless browser and checks whether a specific CSS selector is present in the DOM. Catches agents that claim UI actions succeeded but never actually modified the page.

web
20.0s
$$
net:browser
3+ / 3− / 3 adv

web.browser.screenshot_match

AGENTIC

Captures a live screenshot via browser automation and compares it to a reference image using SSIM (Structural Similarity Index).

web
20.0s
$$
net:browser
3+ / 3− / 3 adv

web.ecommerce.order_placed

HARD

Verifies that an e-commerce order was placed with correct items and total via HTTP API query.

web
25ms
free
net:http
3+ / 3− / 3 adv

ci.github.workflow_passed

HARD

Verifies that a specific GitHub Actions workflow completed successfully.

ci
250ms
free
api:github
3+ / 3− / 3 adv

git.ci.passed

HARD

Verifies that all CI check runs passed for a given commit SHA.

git
200ms
free
api:github
3+ / 3− / 3 adv

git.pr.merged

HARD

Verifies that a GitHub Pull Request was merged to the target branch.

git
200ms
free
api:github
3+ / 3− / 3 adv

messaging.slack.message_sent

HARD

Verifies that a message containing expected text exists in a Slack channel.

messaging
300ms
free
api:slack
3+ / 3− / 3 adv

messaging.slack.reaction_added

HARD

Verifies that a specific reaction was added to a Slack message.

messaging
250ms
free
api:slack
3+ / 3− / 3 adv

payment.stripe.charge_succeeded

HARD

Verifies that a Stripe charge was completed successfully.

payment
300ms
free
api:stripe
3+ / 3− / 3 adv

payment.stripe.refund_processed

HARD

Verifies that a Stripe refund was processed successfully.

payment
300ms
free
api:stripe
3+ / 3− / 3 adv

project.jira.ticket_transitioned

HARD

Verifies that a Jira ticket has been transitioned to the expected status.

project
350ms
free
api:jira
3+ / 3− / 3 adv
