ExACT: Improving AI agents’ decision-making via test-time compute scaling


Autonomous AI agents are transforming the way we approach multi-step decision-making processes, streamlining tasks like web browsing, video editing, and file management. By applying advanced machine learning, they automate workflows, optimize performance, and reduce the need for human input.

However, these systems struggle in complex, dynamic environments. A key challenge lies in balancing exploitation (using known strategies for immediate gains) with exploration (seeking new strategies that could yield long-term benefits). They also often have difficulty adapting to unpredictable changes in conditions and objectives and generalizing knowledge across contexts, which limits their ability to transfer learned strategies between domains.

In response, we developed ExACT, an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies. ExACT combines two key techniques: Reflective-MCTS (R-MCTS) and Exploratory Learning.

R-MCTS builds on the traditional Monte Carlo Tree Search (MCTS) algorithm, introducing features like contrastive reflection and a multi-agent debate function. Through contrastive reflection, the agent refines its decision-making by comparing expected outcomes with actual results, allowing it to learn from both its successes and mistakes. The multi-agent debate function provides various evaluations of a given state, where multiple agents offer contrasting perspectives to help provide a balanced and reliable assessment.
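To make the exploration and exploitation trade-off concrete, here is a minimal sketch of the selection step in classic MCTS using the standard UCT rule. It is not the ExACT implementation: the `Node` class, the exploration constant `c`, and the `debate_value` helper (which simply averages scores from several evaluators) are assumptions made for illustration.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state in the search tree (hypothetical structure)."""
    action: str
    visits: int = 0
    total_value: float = 0.0
    children: list[Node] = field(default_factory=list)

    @property
    def mean_value(self) -> float:
        return self.total_value / self.visits if self.visits else 0.0


def uct_score(parent: Node, child: Node, c: float = 1.414) -> float:
    """Standard UCT: average value (exploitation) plus a visit-count bonus (exploration)."""
    if child.visits == 0:
        return math.inf  # unvisited actions are always tried first
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return child.mean_value + explore


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(parent, ch))


def debate_value(estimates: list[float]) -> float:
    """Assumed aggregation: average the state scores given by several evaluators."""
    return sum(estimates) / len(estimates)


root = Node("root", visits=10, children=[
    Node("a", visits=8, total_value=4.0),  # well explored, mean value 0.5
    Node("b", visits=2, total_value=0.8),  # rarely explored, mean value 0.4
])
chosen = select_child(root)
```

In this example `chosen` is node `b`: its average value is lower, but its exploration bonus is larger because it has been visited far less often.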

Exploratory Learning trains agents to navigate environments effectively. Together, these techniques show strong computational scalability during both training and testing, as demonstrated on VisualWebArena—a benchmark for evaluating multimodal autonomous language agents (Figure 1).

Figure 1. Evaluation demonstrates the compute scaling properties of GPT-4o during both training and testing. The assessment includes two scenarios: (1) applying the GPT-4o-based R-MCTS agent to all 234 tasks from the Classifieds category in VisualWebArena (left), and (2) testing fine-tuned GPT-4o on 169 previously unseen tasks from Classifieds without using search algorithms (right).

R-MCTS extends the classic MCTS by enabling real-time improvements in decision-making. As shown in Figure 2, an iterative feedback loop allows R-MCTS to learn from past experiences, avoid prior mistakes, and focus on more effective actions in similar contexts.
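One way to picture this feedback loop is the sketch below. It assumes each step records the value the agent expected and the value it actually observed, picks the step where the two differ most, and stores a short note for later episodes. The `Step` and `ReflectionMemory` names, the note wording, and the exact-match lookup by task are all hypothetical stand-ins, not the mechanism used in ExACT.

```python
from dataclasses import dataclass


@dataclass
class Step:
    state: str
    action: str
    expected_value: float  # value predicted before acting
    actual_value: float    # value observed once the outcome is known


def most_surprising_step(trajectory: list[Step]) -> Step:
    """Return the step where prediction and outcome disagree the most."""
    return max(trajectory, key=lambda s: abs(s.expected_value - s.actual_value))


class ReflectionMemory:
    """Hypothetical store of lessons, keyed by task description."""

    def __init__(self) -> None:
        self._notes: dict[str, list[str]] = {}

    def add(self, task: str, step: Step) -> None:
        kind = "overestimated" if step.expected_value > step.actual_value else "underestimated"
        note = f"In state '{step.state}', action '{step.action}' was {kind}."
        self._notes.setdefault(task, []).append(note)

    def recall(self, task: str) -> list[str]:
        """Return stored notes for this task (exact match; an assumption)."""
        return list(self._notes.get(task, []))


trajectory = [
    Step("home page", "click 'Cars'", expected_value=0.8, actual_value=0.1),
    Step("search page", "type query", expected_value=0.5, actual_value=0.6),
]
memory = ReflectionMemory()
memory.add("find a red bike", most_surprising_step(trajectory))
notes = memory.recall("find a red bike")
```

Here `notes` holds a single entry about the first step, which is where the expected and actual values diverged most. On the next attempt at the same task, those notes could be placed in the agent's context before it plans.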

Figure 2. Overview of the R-MCTS process in ExACT.

Evaluating R-MCTS

R-MCTS demonstrates state-of-the-art performance across all VisualWebArena environments, surpassing the previous best-performing method, Search Agent, with improvements ranging from 6% to 30% (Table 1). Additionally, as of January 2025, it holds the second position on the OSWorld leaderboard and demonstrates state-of-the-art performance in the blind test setting, where there is no prior access to the test environment, reflecting its advanced capabilities (Table 2).

Rank	Model	Score
1	GPT-4o + ExACT	33.70
2	GPT-4o + Search	26.40
3	GPT-4o + WebDreamer	23.60
4	GPT-4o + ICAL	23.40
5	GPT-4o	19.78
6	Llama-3-70B + Search	16.70

Table 1. The VisualWebArena leaderboard highlights R-MCTS as achieving state-of-the-art performance as of December 2024.
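As a quick check on the overall numbers, the short calculation below computes the gain of the top entry over Search Agent from the Table 1 scores. It assumes improvement is measured relative to the baseline score; the per-environment range quoted above cannot be derived from this table, which reports only overall scores.

```python
# Scores copied from Table 1 (VisualWebArena leaderboard).
scores = {
    "GPT-4o + ExACT": 33.70,
    "GPT-4o + Search": 26.40,
    "GPT-4o": 19.78,
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


absolute = scores["GPT-4o + ExACT"] - scores["GPT-4o + Search"]
relative = relative_gain(scores["GPT-4o + ExACT"], scores["GPT-4o + Search"])
print(f"absolute gain: {absolute:.1f} points, relative gain: {relative:.1f}%")
```

Under that assumption, the overall gain over Search Agent is 7.3 points, or about 27.7% in relative terms.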

Rank	Model	Blind Test	Score
1	learn-by-interact w/ Claude-3.5-sonnet		22.50
2	ExACT w/ GPT-4o	✔	16.60
3	GPT-4	✔	12.24
4	GPT-4o	✔	11.36
5	GPT-4 Vision (0409)	✔	10.82
6	learn-by-interact w/ Gemini-1.5-pro	✔	10.30

Table 2. The OSWorld leaderboard for the category of A11y tree inputs shows that ExACT with GPT-4o ranks second and demonstrates state-of-the-art performance in the blind test setting, as of December 2024.

How Exploratory Learning works

Exploratory Learning enables agents to dynamically search and adjust their computational resources during testing without depending on MCTS. In contrast to Imitation Learning, which centers on training models using the optimal actions identified through search, Exploratory Learning focuses on cultivating the agent’s ability to navigate its environment by teaching it to evaluate states, explore different pathways, and efficiently backtrack from unpromising paths to identify more favorable alternatives.

Figure 3. In contrast to Imitation Learning, Exploratory Learning uses the entire search trajectory for training.
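The difference between the two training signals can be sketched with a toy search tree. The example assumes children are stored in the order the search explored them and that the last child explored at each level lies on the final path. The `TreeNode` class, the action names, and the literal `"backtrack"` token are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TreeNode:
    action: str
    value: float  # value estimate assigned during search
    children: list[TreeNode] = field(default_factory=list)


def imitation_targets(root: TreeNode) -> list[str]:
    """Keep only the best action at each level, as Imitation Learning would."""
    actions, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.value)
        actions.append(node.action)
    return actions


def exploratory_targets(root: TreeNode) -> list[str]:
    """Keep the whole traversal, including dead ends and explicit backtracks."""
    steps: list[str] = []

    def visit(node: TreeNode) -> None:
        for i, child in enumerate(node.children):
            steps.append(child.action)
            visit(child)
            if i < len(node.children) - 1:
                steps.append("backtrack")  # leave an unpromising branch

    visit(root)
    return steps


tree = TreeNode("start", 0.0, [
    TreeNode("click_category", 0.2, [TreeNode("open_listing_3", 0.1)]),
    TreeNode("search_red_bike", 0.8, [TreeNode("open_listing_1", 0.9)]),
])
print(imitation_targets(tree))
print(exploratory_targets(tree))
```

The first list contains only `search_red_bike` and `open_listing_1`. The second also contains the abandoned `click_category` branch and the `backtrack` step that follows it, which is the kind of signal a model needs in order to learn to explore and recover.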

Evaluating Exploratory Learning

We conducted experiments using GPT-4o fine-tuned with Exploratory Learning in the VisualWebArena environment. Results demonstrate the following key benefits:

Improved performance: GPT-4o fine-tuned with Exploratory Learning achieves a performance improvement comparable to that of scaling test-time compute with MCTS, even without search.
Test-time compute scaling: Fine-tuned GPT-4o performs better when allowed more actions per task, which improves decision-making and raises task completion from 5% to 12.4%.
Improved generalization on unseen tasks: Exploratory Learning helps fine-tuned GPT-4o handle unseen tasks more effectively than agents trained with Imitation Learning or no additional training.

The following video provides a detailed demonstration of how R-MCTS and Exploratory Learning function.

Continued exploration

Advancing autonomous AI agents is key to enabling them to handle complex, multi-step tasks with greater precision and adaptability. ExACT represents a significant step toward creating agents that can perform complex decision-making before taking action, leading to improved performance, but challenges remain. How can AI agents improve decision-making in real-world scenarios, where they may be constrained by time or resources? How can they learn effectively and efficiently from environmental feedback? We are currently investigating these questions, and we invite you to explore them with us by building on the ExACT framework. Access the ExACT code at our GitHub repository.
