AI Agent Benchmarks: The 2026 Enterprise Evaluation Guide

Tested on Princeton/Sierra τ-bench — the industry standard for AI agent evaluation — our base agents outperformed every published leaderboard result available at the time of submission, across all pass levels.

Using GBA-Bench, our proprietary enterprise evaluation suite, we then tested the impact of providing agents with Context Intelligence: enterprise-level memory and procedural context that helps them adapt to organization-specific workflows. The result was a significant improvement in both Trajectory Accuracy and Goal Completion.

The takeaway: model choice matters, but the architecture around the model matters just as much. How an agent plans, uses tools, recovers from errors, and applies enterprise context shapes how reliably it performs in real workflows.

Here's what we measured, how we measured it, and what it means.

Evaluation note: τ-bench results reflect Automation Anywhere's evaluation runs submitted to the public leaderboard in May 2026 (pending merge at time of publication). All comparisons reference scores published at time of submission.

Introduction

In our earlier paper, A Framework for Evaluating Goal-Based AI Agents, we introduced a dual-metric evaluation framework that measures not only whether an agent completes a task, but whether it follows the correct reasoning path to get there - because an agent that arrives at the right answer through fragmented or unreliable execution is a production liability, even when the output looks right. That paper established the methodology.

This paper applies it in two ways.

First, we ran our agents against τ-bench to establish an external point of comparison. Developed by Princeton and Sierra, τ-bench is one of the most rigorous publicly available benchmarks for evaluating agent performance on general-purpose service tasks. Covering 375 multi-turn tasks across airline, retail, telecom, and banking domains, it gives us a way to show how our agents stack up against a widely recognized industry reference point.

But external comparability is only part of the picture. τ-bench is valuable because it tells us how agents perform on standardized service workflows. It does not fully capture the enterprise conditions our framework was designed to evaluate: workflows grounded in real source documents, domain-specific policy validation, organization-specific tool schemas, and the business rules that govern how work actually gets done.

To test those conditions, we built GBA-Bench, our proprietary enterprise evaluation suite. GBA-Bench applies the same dual-metric framework to a more demanding set of enterprise workflows across seven domains: Banking, Insurance, Healthcare, Supply Chain, Sales, Finance, and Vendor Onboarding. In total, we evaluated more than 30 frontier models.

Using GBA-Bench as the evaluation standard, we then tested what happens when agents are given memory - measuring not just whether memory improves task completion, but whether it improves the quality, reliability, and enterprise-readiness of the execution path.

τ-bench: External validation

We ran our base agents through the full τ-bench evaluation, using our core agent framework.

Across all four pass levels, our agents achieved the highest scores among published leaderboard results at the time of submission. At pass^1, our agents achieved 74.5%, a +4.3 point lead over the next-best published result, ahead of GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. That lead held across each subsequent pass level, widening slightly at pass^2 and remaining +4.1 points at pass^4.

Pass Level	Automation Anywhere base agent	Leaderboard #1	Delta
pass¹	74.50%	70.20%	+4.3 pts
pass²	67.90%	63.10%	+4.8 pts
pass³	63.60%	59.30%	+4.3 pts
pass⁴	60.30%	56.20%	+4.1 pts

Table 2.1: τ-bench pass-level results, Automation Anywhere base agents vs. Leaderboard #1, pooled across 375 tasks and 4 domains. Results are ahead of Qwen3.5, GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro.

The pass^k structure is what makes this result especially relevant for enterprise deployment. pass^1 measures raw task accuracy. pass^4 measures consistency: the agent must complete the same task correctly across four independent runs. A production agent does not handle a workflow once. It handles the same workflow type hundreds of times a day. Performance that holds across pass^2, pass^3, and pass^4 is a stronger signal of architectural reliability than a single successful run.
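To make the pass^k idea concrete, here is a minimal sketch of how such a score can be computed from repeated runs. It assumes the combinatorial estimator C(c, k) / C(n, k) per task (c successful runs out of n), averaged over tasks. The function name and the sample data are hypothetical and are not taken from our evaluation runs.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k independent runs of a task all succeed.

    Assumes the per-task estimator C(c, k) / C(n_trials, k), where c is the
    number of successful runs for that task, averaged across tasks.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: successful runs out of 4 for five tasks.
successes = [4, 4, 3, 2, 0]
for k in range(1, 5):
    print(f"pass^{k} = {pass_hat_k(successes, n_trials=4, k=k):.3f}")
```

With this sample, the score falls as k grows, because a task that succeeds only some of the time contributes less and less as more consecutive successes are required. That is why a small gap between pass^1 and pass^4 signals consistency rather than luck.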

The results also point to the importance of how agents are built. When we ran the same underlying LLMs used by other top-performing agents through our agent framework, performance improved, in some cases meaningfully. Model capability matters, but how the model is run matters as well. Agent architecture, tool use, and planning are themselves performance drivers.

Execution speed and domain breakdown

Domain	AA Score	vs. Leader	Execution Speed	Rank
Airline	84.50%	+0.5 pts	1.6× slower (230s vs 145s)	#1
Retail	82.90%	−1.5 pts	3.2× faster (223s vs 703s)	~#2
Telecom	98.20%	+0.4 pts	2.6× faster (330s vs 841s)	#1
Banking	31.70%	+0.5 pts	2.7× faster (584s vs 1568s)	#1

Table 2.2: τ-bench results by domain — accuracy and execution speed vs. leaderboard leader.

In our τ-bench evaluation, our base agents were faster than the published leaderboard comparison point in three of the four domains. Airline was the speed exception, running 1.6× slower while still achieving the highest accuracy in the comparison set.

Two of those domains, Telecom and Banking, also paired faster execution with the highest accuracy among published results at the time of submission. The full breakdown by domain:

Telecom: 2.6× faster (330s vs 841s), highest accuracy in the comparison set
Banking: 2.7× faster (584s vs 1568s), highest accuracy in the comparison set
Airline: 1.6× slower (230s vs 145s), highest accuracy in the comparison set
Retail: 3.2× faster (223s vs 703s), our largest speed advantage in the comparison set and a domain we are targeting for accuracy improvement in the next evaluation cycle

Banking deserves specific attention. Absolute scores are low across all competitors. Our agents reached 31.7% in this domain, the highest score in the comparison set at the time of submission, but the number reflects a broader domain-wide bottleneck: retrieval latency. The agent must retrieve policy and account information in real time, and that constraint suppresses scores regardless of model quality.

That bottleneck is also precisely the kind of problem Context Intelligence is designed to address. As that layer matures, Banking is where we expect to see some of the greatest improvement.

GBA-Bench: Our enterprise evaluation standard

τ-bench gives us an external benchmark for comparison. GBA-Bench gives us the evaluation environment that reflects how enterprise agents actually get used.

GBA-Bench is our proprietary evaluation suite for goal-based agents running real enterprise workflows. Test cases are generated from actual source documents, including SOPs, support tickets, and workflow definitions. Those documents are converted through a four-stage pipeline into structured agent definitions, scenario-milestone pairs, and executable Python test classes.
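As an illustration only, a scenario-milestone pair and the test class generated from it might look roughly like the sketch below. Everything here is hypothetical: the Scenario type, the stubbed agent run, the tool names, and the assertions are placeholders, not the actual GBA-Bench pipeline output.

```python
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    user_request: str
    milestones: tuple  # ordered tool names the agent is expected to call


def run_agent_stub(scenario):
    """Stand-in for a real agent run; returns (final_status, tool_calls)."""
    return "resolved", ["lookup_policy", "validate_claim", "update_claim_status"]


def milestones_in_order(milestones, tool_calls):
    """True if every milestone appears in tool_calls in the expected order."""
    remaining = iter(tool_calls)
    return all(m in remaining for m in milestones)


class TestClaimDetailsScenario(unittest.TestCase):
    scenario = Scenario(
        name="claim_details_basic",
        user_request="What is the status of my claim?",
        milestones=("lookup_policy", "update_claim_status"),
    )

    def test_goal_completed(self):
        status, _ = run_agent_stub(self.scenario)
        self.assertEqual(status, "resolved")

    def test_milestones_followed(self):
        _, tool_calls = run_agent_stub(self.scenario)
        self.assertTrue(milestones_in_order(self.scenario.milestones, tool_calls))
```

Saved as a module, a class like this can be run with `python -m unittest`. Keeping the goal check and the milestone check as separate tests mirrors the dual-metric idea: a run can pass one and fail the other.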

Coverage spans seven enterprise domains: Banking, Insurance, Healthcare, Supply Chain, Sales, Finance, and Vendor Onboarding.

We have formally evaluated more than 30 frontier models across every major model family, including Anthropic, OpenAI, Google, Meta, Qwen, DeepSeek, Mistral, and Zhipu/GLM. Each evaluation uses the same dual-metric framework introduced in our earlier paper: Task Success and Trajectory Accuracy. Both are required.

GBA-Bench is also designed for rapid iteration. Because the pipeline can generate new test cases in hours, we can evaluate new frontier models shortly after release and understand not just whether they perform well in general, but whether they can handle the domain-specific rules, tools, and decision paths that enterprise workflows require.

The limitation of stateless agents

GBA-Bench also makes it possible to isolate a core limitation of base agents: stateless execution.

Even well-built agents start each task with no memory of previous runs. They do not retain which tool parameters failed, which paths were inefficient, or which recovery strategies worked. As a result, the same errors recur. The same unnecessary steps repeat. The same fragile reasoning patterns show up run after run.

This limitation is visible in our Customer Churn Prevention agent. Without memory, the agent achieved a baseline Trajectory Accuracy of 0.12. In other words, only 12% of runs followed the correct reasoning path, even when the agent sometimes reached a plausible-looking outcome.

That means the issue is not simply whether the model can complete the task. It is whether the agent can learn from repeated execution and avoid recreating the same failure modes. A Scrappy Win (high task success, low trajectory accuracy) is not a model-quality problem. It is an architectural limitation. And it is fixable.
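The dual-metric view can be sketched as a simple classifier over the two scores. This is a minimal illustration, not the framework's actual implementation: the 0.8 threshold is an assumed value, and every label other than "Scrappy Win" is a placeholder name.

```python
def classify_run(task_success: bool, trajectory_accuracy: float,
                 threshold: float = 0.8) -> str:
    """Label a run using both metrics.

    threshold is an assumed cut-off for a "good" trajectory; only the
    "Scrappy Win" label comes from the text, the rest are placeholders.
    """
    if not 0.0 <= trajectory_accuracy <= 1.0:
        raise ValueError("trajectory_accuracy must be between 0 and 1")
    clean_path = trajectory_accuracy >= threshold
    if task_success and clean_path:
        return "Clean Win"
    if task_success:
        return "Scrappy Win"
    if clean_path:
        return "Clean Path, Missed Goal"
    return "Failure"


print(classify_run(True, 0.12))   # task done, but the path was degraded
print(classify_run(True, 0.95))
```

A task-success-only metric would score both example runs identically. The second metric is what separates them.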

PRE & Context Intelligence: From baseline reasoning to enterprise memory

Process Reasoning Engine: Baseline workflow intelligence

The Process Reasoning Engine gives agents a baseline understanding of common workflow failure patterns, derived from aggregated execution data across the 400 million automations we see on our platform each year. It is part of the core agent framework: a generalized reasoning layer that improves planning, tool use, and recovery behavior across tasks, without relying on organization-specific memory or context.

That is what the τ-bench results reflect. Our base agents were evaluated with PRE as part of the core agent framework.

Context Intelligence: Enterprise-level memory and context

Context Intelligence addresses the next limitation: even with strong baseline reasoning, an agent starts each enterprise task without access to the organization's accumulated context. Relevant business rules, workflow-specific constraints, prior execution lessons, and procedural patterns from that environment are all absent. As a result, the same tenant-specific errors can recur. The same inefficient paths can repeat.

Context Intelligence adds that missing layer. It retrieves relevant enterprise-specific guidance before and during execution, so the agent can adapt to the organization's rules, tools, and workflow history rather than treating each run as isolated.

The key is quality filtering. Successful executions are preserved as replicable patterns. Imperfect runs are distilled into high-impact lessons about what to avoid or correct. The goal is not to remember everything. It is to surface the context most likely to improve the next execution.
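One way to picture quality filtering is a function that turns each finished run into either a reusable pattern, a lesson, or nothing. The sketch below is an assumption-laden illustration: the RunRecord fields, the threshold, and the entry format are all hypothetical and do not describe the production system.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    task: str
    goal_completed: bool
    trajectory_accuracy: float
    tool_calls: list
    errors: list = field(default_factory=list)  # error messages seen in the run


def distill(record, clean_threshold=1.0):
    """Return a memory entry for this run, or None if nothing is worth keeping."""
    if record.goal_completed and record.trajectory_accuracy >= clean_threshold:
        # Clean success: keep the step sequence as a replicable pattern.
        return {"kind": "pattern", "task": record.task,
                "steps": list(record.tool_calls)}
    if record.errors:
        # Imperfect run: keep only the distinct things to avoid or correct.
        return {"kind": "lesson", "task": record.task,
                "avoid": sorted(set(record.errors))}
    return None


clean = RunRecord("deal_alert", True, 1.0, ["send_deal_alert"])
scrappy = RunRecord("deal_alert", True, 0.5,
                    ["send_deal_alert", "send_deal_alert"],
                    errors=["invalid alert_type"])
print(distill(clean))
print(distill(scrappy))
```

The design choice worth noting is the `None` branch: runs that neither succeeded cleanly nor produced a clear lesson are dropped, which keeps the store small and the retrieved context relevant.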

We also tested a dual-tier variant that separates strategy-level and procedural context. Strategy-level context captures high-level workflow patterns and is retrieved at task start. Procedural context captures granular state-transition records and is retrieved mid-task, with queries constructed from the tools the agent has just executed. This grounds retrieval in the agent's current state rather than only in its starting prompt.
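The two retrieval points can be sketched as follows. This is a toy illustration that ranks entries by tag overlap; the stores, tags, guidance text, and the `get_deal` tool name are hypothetical, and a real system would likely use a proper retrieval backend rather than keyword matching.

```python
def retrieve(entries, query_terms, top_k=1):
    """Rank entries by how many query terms appear in their tags."""
    terms = set(query_terms)
    scored = [(len(e["tags"] & terms), e) for e in entries]
    scored = [(score, e) for score, e in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e["text"] for _, e in scored[:top_k]]


# Strategy tier: keyed by workflow vocabulary, queried once at task start.
STRATEGY = [
    {"tags": {"deal", "escalation"},
     "text": "Confirm the deal stage before escalating."},
    {"tags": {"refund"}, "text": "Check refund eligibility first."},
]

# Procedural tier: keyed by tool names, queried mid-task.
PROCEDURAL = [
    {"tags": {"get_deal"},
     "text": "Verify valid alert types before calling send_deal_alert."},
    {"tags": {"send_deal_alert"},
     "text": "Log the alert id after a successful send."},
]


def context_at_task_start(task_description):
    return retrieve(STRATEGY, task_description.lower().split())


def context_mid_task(recent_tool_calls):
    # The query is built from the tools the agent has just executed.
    return retrieve(PROCEDURAL, recent_tool_calls)


print(context_at_task_start("Escalate stalled deal"))
print(context_mid_task(["get_deal"]))
```

Because the mid-task query comes from executed tool names rather than the original prompt, the guidance changes as the agent's state changes.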

The results: Up to 32-point goal completion uplift

We tested Context Intelligence on top of PRE-enabled base agents across four enterprise agent types on GBA-Bench.

Agent Type	Baseline (no memory)	With PRE+CI	Absolute gain
Claim Details	0.70	0.90	+0.20
Customer Churn Prevention	0.12	0.59	+0.47
Finance Credit Hold	0.35	0.55	+0.20
Sales Deal Acceleration	0.33	0.66	+0.33

Table 4.2: Trajectory Accuracy — Baseline vs. PRE + Context Intelligence (GBA-Bench)
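The absolute gains in the table can be re-derived directly from its two score columns. The short sketch below uses only the values shown in Table 4.2; the function name is ours.

```python
# Trajectory Accuracy from Table 4.2: (baseline, with PRE + CI)
RESULTS = {
    "Claim Details": (0.70, 0.90),
    "Customer Churn Prevention": (0.12, 0.59),
    "Finance Credit Hold": (0.35, 0.55),
    "Sales Deal Acceleration": (0.33, 0.66),
}


def gains_in_points(results):
    """Absolute gain per agent type, in percentage points."""
    return {agent: round((after - before) * 100)
            for agent, (before, after) in results.items()}


gains = gains_in_points(RESULTS)
for agent, pts in gains.items():
    print(f"{agent}: +{pts} pts")
print(f"range: {min(gains.values())} to {max(gains.values())} pts")
```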

The gains were consistent across agent types. Trajectory Accuracy improved by 20 to 47 percentage points. Goal Completion improved by up to 32 percentage points. The Customer Churn Prevention agent saw the largest lift, with Trajectory Accuracy rising from 0.12 to 0.59, roughly a 4.9x improvement.

Context-enabled agents also reduced average tool calls per run by roughly 20% on complex workflows. Fewer tool calls means fewer error-and-retry cycles. The agent is not just doing less work; it is taking the right path sooner. In production, that translates into lower API costs, faster execution, and more predictable behavior at scale.

One example makes the change concrete. On the Sales Deal Acceleration agent, the baseline repeatedly called send_deal_alert with an invalid alert_type parameter, received an error, retried with the correct value, and completed the task. Under a task-success-only metric, that looks like a win. Under our framework, it is a Scrappy Win: the outcome is right, but the execution path is degraded.

With Context Intelligence enabled, the agent retrieved the relevant enterprise-level guidance before repeating the same mistake: verify valid alert types before sending escalation notifications. It invoked the tool correctly on the first attempt. Trajectory Accuracy: 100%. No retry required.
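The difference between the two runs can be shown with a toy trace comparison. The score below, the share of tool calls that succeeded, is only a stand-in chosen for illustration and is not the Trajectory Accuracy definition from our framework. The `alert_type` values are hypothetical.

```python
def step_efficiency(calls):
    """Share of tool calls that succeeded; 1.0 means no failed or retried call."""
    if not calls:
        raise ValueError("a run must contain at least one tool call")
    return sum(1 for c in calls if c["ok"]) / len(calls)


# Baseline: invalid parameter, error, then a successful retry.
baseline_run = [
    {"tool": "send_deal_alert", "alert_type": "urgent", "ok": False},
    {"tool": "send_deal_alert", "alert_type": "escalation", "ok": True},
]

# With retrieved guidance: correct parameter on the first attempt.
guided_run = [
    {"tool": "send_deal_alert", "alert_type": "escalation", "ok": True},
]

for name, run in [("baseline", baseline_run), ("guided", guided_run)]:
    goal_completed = run[-1]["ok"]
    print(f"{name}: goal_completed={goal_completed}, "
          f"step_efficiency={step_efficiency(run):.2f}")
```

Both runs complete the goal, so an outcome-only metric cannot tell them apart. Only the path-level score shows that the baseline wasted a call.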

PRE provides baseline workflow intelligence. Context Intelligence adds enterprise-specific context and memory. Together they represent two distinct levels of agent improvement: stronger general execution, and better adaptation to the enterprise environment.

Conclusion: What enterprise agent readiness requires

The results point to a clear conclusion: enterprise agent performance is not determined by model choice alone.

On τ-bench, our base agents achieved the highest scores among published leaderboard results at the time of submission across all four pass levels, while also running faster than the published comparison point in three out of four domains. These results reflect the strength of the core agent framework, including PRE's baseline workflow intelligence.

But τ-bench is only one part of the readiness picture. Enterprise agents do not operate only on standardized service tasks. They operate inside organization-specific workflows, with domain-specific policies, custom tools, procedural constraints, and recurring execution patterns. That is what GBA-Bench was built to evaluate.

The GBA-Bench results show that stateless execution remains a fundamental limitation. Even strong base agents can complete tasks through inefficient or unreliable paths, creating Scrappy Wins that look correct at the output layer but are not production-ready underneath.

Context Intelligence addresses that gap. By giving agents access to relevant enterprise-level memory and procedural context, we saw Trajectory Accuracy improve by 20 to 47 percentage points, Goal Completion improve by up to 32 percentage points, and average tool calls per run fall by roughly 20% on complex workflows.

Together, these results show two distinct requirements for enterprise-grade agents. First, they need strong baseline reasoning: the ability to plan, use tools, and recover from common workflow failures. Second, they need enterprise adaptation: the ability to apply organization-specific context and improve from repeated execution.

That is the shift this paper measures. The next generation of enterprise agents will not be judged only by whether they can produce the right answer once. They will be judged by whether they can produce the right answer consistently, through the right path, at production speed, while becoming more reliable over time.

For the full methodology, experimental data, and GBA-Bench leaderboard results across 30+ frontier models, download the AI Agent benchmark report 2026. For the evaluation framework that underpins this work, read A Framework for Evaluating Goal-Based AI Agents.

This post references two Automation Anywhere technical white papers: A Framework for Evaluating Goal-Based AI Agents and the AI Agent benchmark report 2026. τ-bench results reflect evaluation runs submitted to the public leaderboard in May 2026 (pending merge at time of publication). GBA-Bench results are based on Automation Anywhere's proprietary evaluation suite. Content reviewed prior to publication.
