mech.app

HermesBench: Workflow Reliability Evals for Personal AI Agents

How to benchmark multi-step agent workflows across sessions, tool chains, and API failures instead of single-turn accuracy.

Source: verkyyi.github.io

Most agent benchmarks measure whether a model can answer a question or generate code. HermesBench measures whether a personal agent can finish a workflow when tools fail, sessions pause, and context windows fill up.

The project ships 27 workflow recipes that test complete agent configurations: prompt, model, tools, memory, gateway behavior, delegation, and stability. The current public baseline scores 78.2. Every result links to redacted traces showing tool timelines, assistant replies, checks, and judge summaries.

Why Workflow Reliability Matters

Personal agents run multi-step tasks across days. A calendar agent might:

  1. Parse a meeting request from email
  2. Check availability in Google Calendar
  3. Propose three slots
  4. Wait for user confirmation (session ends)
  5. Resume next day and book the meeting
  6. Send confirmation email

Single-turn benchmarks like MMLU or HumanEval do not capture this. They measure knowledge retrieval or code generation, not state persistence, tool chaining, or recovery when an API times out mid-task.

Real agent failures happen during:

  • Tool call sequences when one API returns unexpected data
  • Context window exhaustion after 15 tool calls
  • Session resumption when state was not persisted correctly
  • Graceful degradation when a required tool is unavailable

HermesBench tests these failure modes explicitly.

Architecture: Recipes, Profiles, and Traces

The eval system has three layers:

Recipes define user-like jobs with goals, criteria, and side-effect boundaries. Example: “Schedule a meeting with three participants, handle conflicts, send confirmation.” Each recipe specifies:

  • Prompt template
  • Expected tool sequence
  • Success criteria (deterministic checks and LLM-as-judge axes)
  • Allowed side effects (read-only vs. write operations)
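As a rough illustration, the four recipe fields above could be modeled as a small Python dataclass. The class name `Recipe`, its field names, and the `permits_write` helper are assumptions made for this sketch, not HermesBench's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe shape; not the real HermesBench schema."""

    recipe_id: str
    prompt_template: str
    expected_tools: tuple           # expected tool sequence, in order
    deterministic_checks: tuple     # names of pass/fail checks
    judge_axes: tuple               # LLM-as-judge criteria
    allowed_write_tools: frozenset  # side-effect boundary: writes the agent may perform

    def permits_write(self, tool: str) -> bool:
        """Return True if a write through `tool` is inside the recipe's boundary."""
        return tool in self.allowed_write_tools


scheduling = Recipe(
    recipe_id="schedule_meeting_with_conflicts",
    prompt_template="Schedule a meeting with {participants} and send a confirmation.",
    expected_tools=("email.read", "calendar.search", "calendar.create", "email.send"),
    deterministic_checks=("meeting_created", "confirmation_sent"),
    judge_axes=("confirmation_tone",),
    allowed_write_tools=frozenset({"calendar.create", "email.send"}),
)

print(scheduling.permits_write("calendar.create"))
print(scheduling.permits_write("calendar.delete"))
```

Keeping the side-effect boundary as data means a driver can reject an out-of-bounds write before the tool is ever called.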

Profiles describe the agent configuration under test:

  • Model and provider
  • Available tools (calendar, email, search, etc.)
  • Memory strategy (stateless, session-scoped, long-term vector store)
  • Gateway behavior (rate limits, retries, fallback models)
  • Safety boundaries (approval required for destructive actions)

Traces capture what happened:

  • Tool call timeline with latencies
  • Assistant reasoning and replies
  • Deterministic check results (did the meeting get booked?)
  • Judge summaries (was the confirmation email polite and accurate?)

Traces are redacted to remove private payloads but preserve structure. You can see that the agent called calendar.search twice, waited 1.2 seconds, then called email.send, without seeing the actual email body.
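One way to picture structure-preserving redaction is an allowlist of keys that describe the shape of an event rather than its content. The sketch below is an assumption about how such a filter could look; the key set and the `redact_event` function are hypothetical and not taken from the HermesBench code.

```python
# Keys assumed safe to publish because they describe structure, not content.
STRUCTURAL_KEYS = {"step", "tool", "latency_ms", "status", "pause"}


def redact_event(event: dict) -> dict:
    """Keep structural fields and flag the event if anything was dropped."""
    redacted = {k: v for k, v in event.items() if k in STRUCTURAL_KEYS}
    if len(redacted) < len(event):
        redacted["redacted"] = True
    return redacted


raw_event = {
    "step": 7,
    "tool": "email.send",
    "latency_ms": 320,
    "status": "success",
    "body": "Hi Sam, confirming our meeting...",  # private payload
}
print(redact_event(raw_event))
```

An allowlist fails closed: a new payload field added to a tool later is dropped by default instead of leaking into a published trace.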

Scoring: Capability, Reliability, UX

HermesBench scores three axes:

Axis         | What It Measures                  | Example Failure
-------------|-----------------------------------|--------------------------------------------------
Capability   | Did the agent complete the task?  | Meeting scheduled but wrong time zone
Reliability  | Did it handle errors gracefully?  | Crashed when calendar API returned 429
UX           | Was the interaction smooth?       | Asked user to confirm three times for one action

Each recipe defines deterministic checks (meeting exists in calendar) and LLM-as-judge criteria (confirmation email tone). The final score is a weighted average across all recipes in a suite.
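The arithmetic behind a weighted average can be sketched as follows. The axis weights, the 0 to 100 scale, and the two-level aggregation (axes within a recipe, then recipes within a suite) are all assumptions for illustration; the source does not state HermesBench's actual weights or formula.

```python
from typing import Optional

# Hypothetical axis weights; the real HermesBench weights are not given here.
AXIS_WEIGHTS = {"capability": 0.5, "reliability": 0.3, "ux": 0.2}


def recipe_score(axis_scores: dict) -> float:
    """Weighted average of one recipe's axis scores (assumed 0-100 scale)."""
    total_weight = sum(AXIS_WEIGHTS.values())
    weighted = sum(AXIS_WEIGHTS[axis] * axis_scores[axis] for axis in AXIS_WEIGHTS)
    return weighted / total_weight


def suite_score(recipes: list, recipe_weights: Optional[list] = None) -> float:
    """Weighted average of recipe scores; equal weights unless given."""
    weights = recipe_weights if recipe_weights is not None else [1.0] * len(recipes)
    if len(weights) != len(recipes) or not recipes:
        raise ValueError("need one weight per recipe and at least one recipe")
    return sum(w * recipe_score(r) for w, r in zip(weights, recipes)) / sum(weights)


results = [
    {"capability": 100, "reliability": 50, "ux": 100},
    {"capability": 50, "reliability": 100, "ux": 50},
]
print(round(suite_score(results), 1))
```

With these made-up weights, the first recipe scores 85 and the second 65, so the equally weighted suite score is 75.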

The current baseline (78.2) reflects one specific configuration. The site deliberately avoids a leaderboard because the goal is not to rank models but to expose workflow failure modes.

State Persistence Across Sessions

Most personal agent tasks span multiple sessions. The agent starts a task, waits for user input or external data, then resumes hours or days later.

HermesBench tests this with recipes that include explicit pause points:

# Simplified recipe structure
recipe = {
    "steps": [
        {"action": "parse_email", "tool": "email.read"},
        {"action": "check_calendar", "tool": "calendar.search"},
        {"action": "propose_slots", "output": "user_message"},
        {"pause": "wait_for_user_confirmation"},  # Session ends here
        {"action": "book_meeting", "tool": "calendar.create"},
        {"action": "send_confirmation", "tool": "email.send"}
    ],
    "state_requirements": [
        "proposed_slots must persist across pause",
        "user_confirmation must be retrievable after resume"
    ]
}

The eval driver simulates session boundaries by serializing agent state, clearing memory, then reloading. If the agent loses context (forgets which slots were proposed), the reliability score drops.
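The serialize, clear, reload cycle can be sketched with a toy agent. `ToyAgent`, `simulate_pause`, and `state_survived` are hypothetical names invented for this example, and JSON is an assumed serialization format; the real eval driver may work differently.

```python
import json


class ToyAgent:
    """Stand-in for a real agent; holds only a state dict."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def propose_slots(self, slots):
        self.state["proposed_slots"] = list(slots)

    def dump_state(self) -> str:
        return json.dumps(self.state)

    @classmethod
    def load_state(cls, blob: str) -> "ToyAgent":
        return cls(json.loads(blob))


def simulate_pause(agent: ToyAgent) -> ToyAgent:
    """Rebuild a fresh agent from the serialized blob alone."""
    blob = agent.dump_state()
    return ToyAgent.load_state(blob)


def state_survived(resumed: ToyAgent, expected: dict) -> bool:
    """Check that every required key came back with its original value."""
    return all(resumed.state.get(key) == value for key, value in expected.items())


agent = ToyAgent()
agent.propose_slots(["Tue 09:00", "Tue 10:30", "Wed 11:00"])
resumed = simulate_pause(agent)
print(state_survived(resumed, {"proposed_slots": ["Tue 09:00", "Tue 10:30", "Wed 11:00"]}))
```

Because the resumed agent is built only from the blob, anything the original kept in memory but never serialized shows up as a failed check.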

Tool Chaining and Recovery

A common failure mode: the agent chains three tool calls, the second one returns unexpected data, and the third call uses malformed input.

Example sequence:

  1. search.query("best Italian restaurant Brooklyn") returns 10 results
  2. maps.get_details(place_id) returns {"status": "ZERO_RESULTS"} instead of the expected place details (an empty result, or the API changed its response format)
  3. calendar.create_event(location=???) fails because location is undefined

HermesBench recipes include “fault injection” variants:

  • Tool returns empty result
  • Tool returns rate limit error
  • Tool returns data in unexpected schema

The agent must either retry with adjusted parameters, fall back to a different tool, or ask the user for clarification. Crashing or hallucinating data counts as a reliability failure.
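The three fault variants and one possible recovery path can be sketched as a wrapper around a tool function. Everything here is hypothetical: `inject_fault`, `RateLimitError`, `get_location`, and the `address` field are illustrative names, not part of HermesBench or of any real maps API.

```python
class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from a tool."""


FAULTS = {"empty", "rate_limit", "bad_schema"}


def inject_fault(tool, fault: str, times: int = 1):
    """Wrap `tool` so its first `times` calls exhibit the chosen fault."""
    if fault not in FAULTS:
        raise ValueError(f"unknown fault: {fault}")
    remaining = {"n": times}

    def wrapped(*args, **kwargs):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            if fault == "rate_limit":
                raise RateLimitError("429 Too Many Requests")
            if fault == "empty":
                return {}
            return {"status": "ZERO_RESULTS"}  # bad_schema: no expected fields
        return tool(*args, **kwargs)

    return wrapped


def get_location(get_details, place_id: str, max_attempts: int = 3):
    """Return an address, or None so the caller can ask the user instead."""
    for _ in range(max_attempts):
        try:
            details = get_details(place_id)
        except RateLimitError:
            continue  # a real agent would back off before retrying
        address = details.get("address") if isinstance(details, dict) else None
        if address:
            return address
    return None


def real_get_details(place_id: str) -> dict:
    return {"address": "123 Example St, Brooklyn"}


flaky = inject_fault(real_get_details, "rate_limit", times=2)
print(get_location(flaky, "abc"))
```

Returning `None` instead of a guessed address gives the caller a clear signal to ask the user, which is the behavior the recipes reward over hallucinated data.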

Measuring Context Drift

Personal agents accumulate context over time. After 50 interactions, does the agent still remember the user prefers morning meetings? Or does it start suggesting 6pm slots again?

HermesBench tests this with “preference consistency” recipes:

  1. User states preference (“I only take meetings before noon”)
  2. Agent completes 10 unrelated tasks
  3. Agent schedules a new meeting

The eval checks whether the meeting time respects the original preference. If the agent forgets, the UX score drops.
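A deterministic check for the "before noon" preference might look like the sketch below. The function names are hypothetical, and the code assumes meeting start times are already expressed in the user's local time zone.

```python
from datetime import datetime, time

NOON = time(12, 0)


def respects_before_noon(meeting_start: datetime) -> bool:
    """True if the meeting starts strictly before 12:00 local time."""
    return meeting_start.time() < NOON


def preference_consistency(meeting_starts: list) -> float:
    """Fraction of scheduled meetings that respect the preference."""
    if not meeting_starts:
        return 1.0  # nothing scheduled, so nothing violated
    kept = sum(respects_before_noon(start) for start in meeting_starts)
    return kept / len(meeting_starts)


scheduled = [datetime(2025, 1, 6, 9, 30), datetime(2025, 1, 7, 18, 0)]
print(preference_consistency(scheduled))
```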

This requires memory strategies beyond simple conversation history:

  • Session-scoped memory: Preferences expire when session ends (fails this test)
  • Long-term vector store: Preferences embedded and retrieved (passes if retrieval works)
  • Explicit preference schema: Preferences stored in structured DB (most reliable)

The trace shows which memory strategy was used and whether retrieval succeeded.
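To make the third option concrete, here is a minimal sketch of an explicit preference store using Python's built-in `sqlite3` module. The table layout and the helper names (`open_store`, `set_preference`, `get_preference`) are assumptions for illustration, not something HermesBench prescribes.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a preference store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS preferences ("
        "user_id TEXT NOT NULL, name TEXT NOT NULL, value TEXT NOT NULL, "
        "PRIMARY KEY (user_id, name))"
    )
    return conn


def set_preference(conn, user_id: str, name: str, value: str) -> None:
    """Insert the preference, overwriting any earlier value for the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO preferences (user_id, name, value) VALUES (?, ?, ?)",
        (user_id, name, value),
    )
    conn.commit()


def get_preference(conn, user_id: str, name: str, default=None):
    """Exact-key lookup: returns the stored value or `default`."""
    row = conn.execute(
        "SELECT value FROM preferences WHERE user_id = ? AND name = ?",
        (user_id, name),
    ).fetchone()
    return row[0] if row else default


store = open_store()
set_preference(store, "user-1", "latest_meeting_start", "12:00")
print(get_preference(store, "user-1", "latest_meeting_start"))
```

An exact-key lookup either finds the preference or does not, which removes the embedding-quality failure mode that a vector store introduces.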

Agent-Driven Quick Start

HermesBench is designed to run through a coding agent. Copy this prompt into Codex, Claude, or another agent:

Use the HermesBench skill and run one default scenario recipe 
for my current Hermes configuration.

Skill: https://github.com/verkyyi/hermesbench/blob/main/agent-skills/hermesbench/SKILL.md

Follow the skill's "Run Current Hermes Configuration" workflow. 
Use the Python API default single-recipe path, save artifacts, 
and summarize the score and main findings.

Do not run the full bundle unless I explicitly ask.

The agent loads the skill, runs one recipe, and returns a score with a trace link. Full bundle runs (all 27 recipes) are opt-in because they take longer and cost more in API calls.

This workflow tests the agent’s ability to:

  • Load external skills from a GitHub URL
  • Follow multi-step instructions
  • Manage API credentials securely
  • Save artifacts to disk
  • Summarize structured results

If your agent cannot complete this workflow, that is itself a useful signal.

Observability: Redacted Traces

Every published result links to a redacted trace. Example structure:

{
  "recipe_id": "schedule_meeting_with_conflicts",
  "profile": {
    "model": "gpt-4",
    "tools": ["calendar", "email"],
    "memory": "session-scoped"
  },
  "timeline": [
    {"step": 1, "tool": "email.read", "latency_ms": 340, "status": "success"},
    {"step": 2, "tool": "calendar.search", "latency_ms": 1200, "status": "success"},
    {"step": 3, "tool": "calendar.search", "latency_ms": 890, "status": "success"},
    {"step": 4, "output": "user_message", "redacted": true},
    {"step": 5, "pause": "wait_for_user_confirmation"},
    {"step": 6, "tool": "calendar.create", "latency_ms": 450, "status": "success"},
    {"step": 7, "tool": "email.send", "latency_ms": 320, "status": "success"}
  ],
  "checks": {
    "meeting_created": true,
    "correct_time_zone": true,
    "confirmation_sent": true
  },
  "judge_summary": "Confirmation email was polite and accurate. No unnecessary user confirmations."
}

The trace shows tool call order, latencies, and check results without exposing private data. You can see that the agent called calendar.search twice (likely checking multiple participants’ availability) before creating the event.
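A trace in this shape is easy to summarize programmatically. The sketch below copies the timeline from the example trace and computes per-tool call counts and total tool latency; the `summarize` function is a hypothetical helper, not part of HermesBench.

```python
from collections import Counter

# Timeline copied from the example trace above.
timeline = [
    {"step": 1, "tool": "email.read", "latency_ms": 340, "status": "success"},
    {"step": 2, "tool": "calendar.search", "latency_ms": 1200, "status": "success"},
    {"step": 3, "tool": "calendar.search", "latency_ms": 890, "status": "success"},
    {"step": 4, "output": "user_message", "redacted": True},
    {"step": 5, "pause": "wait_for_user_confirmation"},
    {"step": 6, "tool": "calendar.create", "latency_ms": 450, "status": "success"},
    {"step": 7, "tool": "email.send", "latency_ms": 320, "status": "success"},
]


def summarize(events: list) -> dict:
    """Summarize tool steps only; messages and pauses carry no latency."""
    tool_steps = [e for e in events if "tool" in e]
    return {
        "calls_per_tool": Counter(e["tool"] for e in tool_steps),
        "total_latency_ms": sum(e["latency_ms"] for e in tool_steps),
        "failed_steps": [e["step"] for e in tool_steps if e["status"] != "success"],
    }


summary = summarize(timeline)
print(summary["calls_per_tool"]["calendar.search"], summary["total_latency_ms"])
```

For this trace the summary reports two `calendar.search` calls and 3200 ms of total tool latency, with no failed steps.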

Deployment Shape

HermesBench runs as a Python library, not a hosted service. You install it locally, configure your agent, and run evals against your own infrastructure.

This avoids:

  • Sending private data to a third-party eval service
  • Trusting someone else’s tool mocks
  • Waiting for eval queue availability

The trade-off: you manage your own tool credentials, API rate limits, and result storage.

Typical deployment:

  1. Install HermesBench via pip
  2. Configure agent profile (model, tools, memory strategy)
  3. Run one recipe to verify setup
  4. Run full suite (27 recipes) and save traces
  5. Publish results to internal dashboard or public site

The project includes a static site generator that turns trace JSON into browsable HTML.

Likely Failure Modes

Context window exhaustion: After a long chain of tool calls (15, for example), the agent runs out of tokens and drops early conversation context. The eval detects this when state persistence checks fail.

Tool schema drift: Your calendar API changes response format. The agent’s tool definitions are stale. Calls succeed but return unexpected data. The eval catches this when deterministic checks fail (meeting created but wrong time zone).

Memory retrieval failure: The agent stores a preference in a vector DB but retrieval returns irrelevant results due to poor embedding. The eval detects this when preference consistency checks fail.

Approval loop deadlock: The agent requires user approval for every destructive action. A recipe with 5 write operations triggers 5 approval prompts. The UX score drops.

No graceful degradation: A required tool is unavailable. The agent crashes instead of falling back or asking the user. The reliability score drops to zero.

Technical Verdict

Use HermesBench when:

  • You are deploying personal agents that run multi-step workflows across sessions
  • You need to test state persistence, tool chaining, and error recovery
  • You want redacted traces that show what happened without exposing private data
  • You prefer local evals over sending data to a hosted service

Avoid HermesBench when:

  • You only need single-turn model benchmarks (use MMLU or HumanEval instead)
  • Your agent does not persist state across sessions
  • You need real-time production monitoring (this is a dev-time eval, not observability)
  • You want a leaderboard to compare base models (this tests configurations, not models)

The project is in alpha. The 27 recipes cover common personal agent tasks (scheduling, email, search, reminders) but not specialized domains (code execution, data analysis, creative work). If your agent does those things, you will need to write custom recipes.

The scoring methodology is documented but opinionated. Capability, reliability, and UX weights are fixed. If you disagree with the weights, fork the repo and adjust them.