AI Agent Benchmark: New Safety Standards Revealed

AI agents abound, and they’re increasingly gaining autonomy. From navigating the web to recursively improving their own coding skills, agentic AI promises to reorder the online economy and redefine the internet.

For enterprise environments, however, AI agents pose a huge risk. Shifting from augmentation to automation can be a precarious move, especially when the entities involved will be given full rein to perform crucial actions—from fulfilling a simple financial transaction to coordinating complex supply chains.

To mitigate the risk, researchers at Carnegie Mellon University and Fujitsu developed three benchmarks that measure when AI agents are safe or effective enough to run business operations without human oversight. These benchmarks were presented at a workshop on 26 January as part of the 2026 AAAI Conference on Artificial Intelligence held in Singapore.

Safety first

The first benchmark, called FieldWorkArena, evaluates AI agents deployed in the field, particularly logistics and manufacturing environments like factories and warehouses. FieldWorkArena calculates the accuracy rate of agents tasked with detecting safety rule violations and deviations from work procedures, as well as generating incident reports. For instance, an AI agent that checks compliance with wearing personal protective equipment (PPE) in a high-risk zone will need to understand PPE standards, identify workers within the zone, analyze what they’re wearing and whether it adheres to the standards, and report on the number of compliant personnel.
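To make the idea of an accuracy rate concrete, here is a minimal sketch in Python. It is not FieldWorkArena's actual scoring code, which the article does not describe. The `TaskResult` type, the task names, the answers, and the exact-match rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One benchmark task: the ground truth and what the agent reported (hypothetical)."""
    task_id: str
    expected: str  # ground-truth answer, e.g. the number of compliant workers
    reported: str  # the agent's answer


def normalize(answer: str) -> str:
    """Trim and lowercase so trivial formatting differences do not count as errors."""
    return answer.strip().lower()


def accuracy_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks where the agent's answer exactly matches the ground truth."""
    if not results:
        return 0.0
    correct = sum(normalize(r.expected) == normalize(r.reported) for r in results)
    return correct / len(results)


# Made-up results for illustration only; these are not data from the study.
results = [
    TaskResult("ppe-count-zone-a", expected="4", reported="4"),
    TaskResult("ppe-count-zone-b", expected="7", reported="5"),
    TaskResult("procedure-deviation", expected="yes", reported="Yes"),
    TaskResult("incident-report-flag", expected="no", reported="yes"),
]

print(accuracy_rate(results))  # 2 of 4 answers match, so this prints 0.5
```

Exact matching is deliberately strict here: a count that is off by one scores the same as one that is off by ten, which is one reason imprecise counting can pull an accuracy score down sharply.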

Instead of simulations, the benchmark employs real-world data sources, including work manuals, safety regulations, and images and videos captured on-site. Hideo Saito, a professor at Japan’s Keio University who isn’t involved with the research but is one of the workshop’s organizers, emphasizes the importance of data privacy when collecting input datasets for agentic AI benchmarks, “especially when you want to deploy such a dataset for commercial, nonacademic use.” Data for FieldWorkArena, for example, was obtained with the consent of those appearing in video footage, while faces and sensitive work areas were blurred to prevent identification.

The researchers assessed three multimodal large language models (LLMs) capable of processing both image and text data: Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2.0 Flash, and OpenAI’s GPT-4o. The results were bleak, with all three models obtaining low accuracy scores. Although they excelled in information extraction and image recognition, the LLMs sometimes hallucinated and struggled with counting objects precisely and measuring specific distances.

These findings demonstrate the need for agentic AI benchmarks for businesses that are grounded in enterprise contexts and rooted in realistic tasks. That’s why Fujitsu spearheaded FieldWorkArena, noticing a growing demand from its customers to gauge the efficiency of AI agents fine-tuned for field work, says Hiro Kobashi, a senior project director of the AI Lab at Fujitsu Research. “Customers are uncertain and concerned about LLMs, so we want to provide good, sufficient benchmarks for them,” he adds.

Overall system configuration of FieldWorkArena. Image credit: Atsunori Moteki, Shoichi Masui et al.

Data access without hallucination

While FieldWorkArena can be accessed through its GitHub repository, Kobashi notes that the other two benchmarks presented at the workshop, ECHO (EvidenCe-prior Hallucination Observation) and an enterprise retrieval-augmented generation (RAG) benchmark, will be made available to the public within a month. ECHO evaluates the effectiveness of hallucination mitigation strategies for vision language models (VLMs), which are designed to answer questions about images or generate text from visual inputs. The results indicate that techniques such as cropping images so models focus their attention on relevant regions and applying reinforcement learning for reasoning can minimize hallucinations in VLMs.

Meanwhile, the enterprise RAG benchmark appraises the ability of AI agents to retrieve data from an authoritative knowledge base and use that data to augment their generated responses. Metrics include whether the agent retrieves the parts of the knowledge base relevant to a query and whether it reasons correctly from the retrieved information.
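The two kinds of metrics can be sketched roughly as follows. This is an illustration under stated assumptions, not the benchmark's published method: the function names, the document IDs, and the use of recall and exact-match scoring are all hypothetical.

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Share of the relevant documents that appear among the retrieved ones."""
    if not relevant_ids:
        return 0.0
    hits = relevant_ids.intersection(retrieved_ids)
    return len(hits) / len(relevant_ids)


def answer_is_correct(answer: str, reference: str) -> bool:
    """Crude stand-in for a reasoning check: exact match after trimming and lowercasing."""
    return answer.strip().lower() == reference.strip().lower()


# Made-up example: the agent retrieved three documents, one of which is relevant.
retrieved = ["doc-3", "doc-7", "doc-9"]
relevant = {"doc-3", "doc-4"}

print(retrieval_recall(retrieved, relevant))     # 1 of 2 relevant documents found: 0.5
print(answer_is_correct(" 30 days", "30 Days"))  # True
```

Scoring retrieval and reasoning separately helps show where a failure comes from: an agent can fetch the right document and still draw the wrong conclusion from it, or answer correctly without having retrieved the supporting evidence.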

In the future, Kobashi and his team aim to expand the capabilities of the benchmarks they’ve created to accommodate other industries and use cases. “Customer requests are so diverse. We can’t cover all requests by utilizing one single benchmark, so we need to have many kinds of benchmarks,” he says.

Continuously updating benchmarks is another crucial next step the team plans to take. As AI agents evolve, their benchmark scores could rise until they plateau, leaving little room to measure further progress. That saturation will signal the need for newer, more comprehensive benchmarks that guide the development of better enterprise AI agents.
