HermesBench: Workflow Reliability Evals for Personal AI Agents

Most agent benchmarks measure whether a model can answer a question or generate code. HermesBench measures whether a personal agent can finish a workflow when tools fail, sessions pause, and context windows fill up.

The project ships 27 workflow recipes that test complete agent configurations: prompt, model, tools, memory, gateway behavior, delegation, and stability. The current public baseline scores 78.2. Every result links to redacted traces showing tool timelines, assistant replies, checks, and judge summaries.

Why Workflow Reliability Matters

Personal agents run multi-step tasks across days. A calendar agent might:

Parse a meeting request from email
Check availability in Google Calendar
Propose three slots
Wait for user confirmation (session ends)
Resume next day and book the meeting
Send confirmation email

Single-turn benchmarks like MMLU or HumanEval do not capture this. They measure knowledge retrieval or code generation, not state persistence, tool chaining, or recovery when an API times out mid-task.

Real agent failures happen during:

Tool call sequences when one API returns unexpected data
Context window exhaustion after 15 tool calls
Session resumption when state was not persisted correctly
Graceful degradation when a required tool is unavailable

HermesBench tests these failure modes explicitly.

Architecture: Recipes, Profiles, and Traces

The eval system has three layers:

Recipes define user-like jobs with goals, criteria, and side-effect boundaries. Example: “Schedule a meeting with three participants, handle conflicts, send confirmation.” Each recipe specifies:

Prompt template
Expected tool sequence
Success criteria (deterministic checks and LLM-as-judge axes)
Allowed side effects (read-only vs. write operations)

Profiles describe the agent configuration under test:

Model and provider
Available tools (calendar, email, search, etc.)
Memory strategy (stateless, session-scoped, long-term vector store)
Gateway behavior (rate limits, retries, fallback models)
Safety boundaries (approval required for destructive actions)

Traces capture what happened:

Tool call timeline with latencies
Assistant reasoning and replies
Deterministic check results (did the meeting get booked?)
Judge summaries (was the confirmation email polite and accurate?)

Traces are redacted to remove private payloads but preserve structure. You can see that the agent called calendar.search twice, waited 1.2 seconds, then called email.send, without seeing the actual email body.

Scoring: Capability, Reliability, UX

HermesBench scores three axes:

Axis	What It Measures	Example Failure
Capability	Did the agent complete the task?	Meeting scheduled but wrong time zone
Reliability	Did it handle errors gracefully?	Crashed when calendar API returned 429
UX	Was the interaction smooth?	Asked user to confirm three times for one action

Each recipe defines deterministic checks (meeting exists in calendar) and LLM-as-judge criteria (confirmation email tone). The final score is a weighted average across all recipes in a suite.

The current baseline (78.2) reflects one specific configuration. The site deliberately avoids a leaderboard because the goal is not to rank models but to expose workflow failure modes.

State Persistence Across Sessions

Most personal agent tasks span multiple sessions. The agent starts a task, waits for user input or external data, then resumes hours or days later.

HermesBench tests this with recipes that include explicit pause points:

# Simplified recipe structure
recipe = {
    "steps": [
        {"action": "parse_email", "tool": "email.read"},
        {"action": "check_calendar", "tool": "calendar.search"},
        {"action": "propose_slots", "output": "user_message"},
        {"pause": "wait_for_user_confirmation"},  # Session ends here
        {"action": "book_meeting", "tool": "calendar.create"},
        {"action": "send_confirmation", "tool": "email.send"}
    ],
    "state_requirements": [
        "proposed_slots must persist across pause",
        "user_confirmation must be retrievable after resume"
    ]
}

The eval driver simulates session boundaries by serializing agent state, clearing memory, then reloading. If the agent loses context (forgets which slots were proposed), the reliability score drops.

Tool Chaining and Recovery

A common failure mode: the agent chains three tool calls, the second one returns unexpected data, and the third call uses malformed input.

Example sequence:

search.query("best Italian restaurant Brooklyn") returns 10 results
maps.get_details(place_id) returns {"status": "ZERO_RESULTS"} (API changed response format)
calendar.create_event(location=???) fails because location is undefined

HermesBench recipes include “fault injection” variants:

Tool returns empty result
Tool returns rate limit error
Tool returns data in unexpected schema

The agent must either retry with adjusted parameters, fall back to a different tool, or ask the user for clarification. Crashing or hallucinating data counts as a reliability failure.

Measuring Context Drift

Personal agents accumulate context over time. After 50 interactions, does the agent still remember the user prefers morning meetings? Or does it start suggesting 6pm slots again?

HermesBench tests this with “preference consistency” recipes:

User states preference (“I only take meetings before noon”)
Agent completes 10 unrelated tasks
Agent schedules a new meeting

The eval checks whether the meeting time respects the original preference. If the agent forgets, the UX score drops.

This requires memory strategies beyond simple conversation history:

Session-scoped memory: Preferences expire when session ends (fails this test)
Long-term vector store: Preferences embedded and retrieved (passes if retrieval works)
Explicit preference schema: Preferences stored in structured DB (most reliable)

The trace shows which memory strategy was used and whether retrieval succeeded.

Agent-Driven Quick Start

HermesBench is designed to run through a coding agent. Copy this prompt into Codex, Claude, or another agent:

Use the HermesBench skill and run one default scenario recipe 
for my current Hermes configuration.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Follow the skill's "Run Current Hermes Configuration" workflow. 
Use the Python API default single-recipe path, save artifacts, 
and summarize the score and main findings.

Do not run the full bundle unless I explicitly ask.

The agent loads the skill, runs one recipe, and returns a score with trace link. Full bundle runs (all 27 recipes) are opt-in because they take longer and cost more in API calls.

This workflow tests the agent’s ability to:

Load external skills from a GitHub URL
Follow multi-step instructions
Manage API credentials securely
Save artifacts to disk
Summarize structured results

If your agent cannot complete this workflow, that is itself a useful signal.

Observability: Redacted Traces

Every published result links to a redacted trace. Example structure:

{
  "recipe_id": "schedule_meeting_with_conflicts",
  "profile": {
    "model": "gpt-4",
    "tools": ["calendar", "email"],
    "memory": "session-scoped"
  },
  "timeline": [
    {"step": 1, "tool": "email.read", "latency_ms": 340, "status": "success"},
    {"step": 2, "tool": "calendar.search", "latency_ms": 1200, "status": "success"},
    {"step": 3, "tool": "calendar.search", "latency_ms": 890, "status": "success"},
    {"step": 4, "output": "user_message", "redacted": true},
    {"step": 5, "pause": "wait_for_user_confirmation"},
    {"step": 6, "tool": "calendar.create", "latency_ms": 450, "status": "success"},
    {"step": 7, "tool": "email.send", "latency_ms": 320, "status": "success"}
  ],
  "checks": {
    "meeting_created": true,
    "correct_time_zone": true,
    "confirmation_sent": true
  },
  "judge_summary": "Confirmation email was polite and accurate. No unnecessary user confirmations."
}

The trace shows tool call order, latencies, and check results without exposing private data. You can see that the agent called calendar.search twice (likely checking multiple participants’ availability) before creating the event.

Deployment Shape

HermesBench runs as a Python library, not a hosted service. You install it locally, configure your agent, and run evals against your own infrastructure.

This avoids:

Sending private data to a third-party eval service
Trusting someone else’s tool mocks
Waiting for eval queue availability

The trade-off: you manage your own tool credentials, API rate limits, and result storage.

Typical deployment:

Install HermesBench via pip
Configure agent profile (model, tools, memory strategy)
Run one recipe to verify setup
Run full suite (27 recipes) and save traces
Publish results to internal dashboard or public site

The project includes a static site generator that turns trace JSON into browsable HTML.

Likely Failure Modes

Context window exhaustion: After 15 tool calls, the agent runs out of tokens and drops early conversation context. The eval detects this when state persistence checks fail.

Tool schema drift: Your calendar API changes response format. The agent’s tool definitions are stale. Calls succeed but return unexpected data. The eval catches this when deterministic checks fail (meeting created but wrong time zone).

Memory retrieval failure: The agent stores a preference in a vector DB but retrieval returns irrelevant results due to poor embedding. The eval detects this when preference consistency checks fail.

Approval loop deadlock: The agent requires user approval for every destructive action. A recipe with 5 write operations triggers 5 approval prompts. The UX score drops.

Graceful degradation absence: A required tool is unavailable. The agent crashes instead of falling back or asking the user. The reliability score drops to zero.

Technical Verdict

Use HermesBench when:

You are deploying personal agents that run multi-step workflows across sessions
You need to test state persistence, tool chaining, and error recovery
You want redacted traces that show what happened without exposing private data
You prefer local evals over sending data to a hosted service

Avoid HermesBench when:

You only need single-turn model benchmarks (use MMLU or HumanEval instead)
Your agent does not persist state across sessions
You need real-time production monitoring (this is a dev-time eval, not observability)
You want a leaderboard to compare base models (this tests configurations, not models)

The project is in alpha. The 27 recipes cover common personal agent tasks (scheduling, email, search, reminders) but not specialized domains (code execution, data analysis, creative work). If your agent does those things, you will need to write custom recipes.

The scoring methodology is documented but opinionated. Capability, reliability, and UX weights are fixed. If you disagree with the weights, fork the repo and adjust them.