🏆 Online-Mind2Web Leaderboard
Online-Mind2Web is a benchmark designed to evaluate the real-world performance of web agents on live websites, featuring 300 tasks across 136 popular sites in diverse domains. Based on the number of steps required by human annotators, tasks are divided into three difficulty levels: Easy (1–5 steps), Medium (6–10 steps), and Hard (11+ steps).
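As a point of reference, the step-based split can be written as a tiny helper. This is only an illustrative sketch; `reference_length` is a placeholder name for the annotator step count, not an official field of the dataset.

```python
# Illustrative only: map a human-annotated step count to the benchmark's
# difficulty buckets (Easy: 1-5, Medium: 6-10, Hard: 11+ steps).
def difficulty(reference_length: int) -> str:
    if reference_length <= 5:
        return "Easy"
    if reference_length <= 10:
        return "Medium"
    return "Hard"

print(difficulty(3), difficulty(8), difficulty(14))  # Easy Medium Hard
```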
Leaderboard
Our goal is to conduct a rigorous assessment of the current state of web agents. We maintain two leaderboards—one for automatic evaluation and another for human evaluation. Please click "Submission Guideline" for details.
Human Evaluation
| Agent | Model | Organization | Evaluated By | Easy (%) | Medium (%) | Hard (%) | Avg. SR (%) | Date |
|---|---|---|---|---|---|---|---|---|
| Operator | OpenAI Computer-Using Agent | OpenAI | OSU NLP | 83.1 | 58.0 | 43.2 | 61.3 | 2025-3-22 |
Visualization
This figure presents a fine-grained heatmap of task-level completion across agents. Each row corresponds to an agent, and each column represents a task (identified by its task ID). Blue bars indicate successful completions, while white spaces denote failures. The "Any agent" row marks a task as successful if at least one agent completes it. (This style of visualization is inspired by HAL.)
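As a rough sketch of how such a heatmap can be produced (this is not the authors' plotting code; the completion matrix below is random placeholder data), one can stack a binary agent-by-task matrix with an extra "Any agent" row:

```python
import numpy as np
import matplotlib.pyplot as plt

agents = ["Operator", "Browser Use", "SeeAct", "Claude Computer Use", "Agent-E"]
num_tasks = 300

# Placeholder data: results[i, j] = 1 if agent i completed task j, else 0.
rng = np.random.default_rng(0)
results = (rng.random((len(agents), num_tasks)) > 0.6).astype(int)

# "Any agent" row: a task counts as completed if at least one agent completed it.
any_agent = results.max(axis=0, keepdims=True)
matrix = np.vstack([results, any_agent])

# Blue cells mark successes, white cells mark failures.
fig, ax = plt.subplots(figsize=(12, 3))
ax.imshow(matrix, aspect="auto", cmap="Blues", interpolation="nearest")
ax.set_yticks(range(len(agents) + 1))
ax.set_yticklabels(agents + ["Any agent"])
ax.set_xlabel("Task ID")
plt.tight_layout()
plt.savefig("task_completion_heatmap.png")
```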
In certain scenarios, testing on the full Online-Mind2Web dataset may not be feasible due to cost, privacy, or legal constraints. To facilitate fair, apples-to-apples comparisons, we release both our human evaluation labels and auto-eval details.
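For instance, if you only evaluate a subset of tasks, a script along these lines (a sketch with placeholder file names, assuming the label files follow the `human_result.json` schema shown in the submission guideline below) can compare your labels against the released human labels on the shared tasks:

```python
import json

# Placeholder paths: the released human labels and your own per-task labels.
with open("released_human_labels.json") as f:
    reference = {r["task_id"]: r["human_label"] for r in json.load(f)}
with open("my_results.json") as f:
    mine = {r["task_id"]: r["human_label"] for r in json.load(f)}

# Restrict the comparison to tasks present in both files.
common = sorted(set(reference) & set(mine))
assert common, "no overlapping tasks to compare"

agreement = sum(reference[t] == mine[t] for t in common) / len(common)
success_rate = sum(mine[t] for t in common) / len(common)
print(f"{len(common)} shared tasks, agreement = {agreement:.3f}, SR = {success_rate:.3f}")
```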
Automatic Evaluation
| Agent | Model | Organization | Evaluated By | Easy (%) | Medium (%) | Hard (%) | Avg. SR (%) | Date |
|---|---|---|---|---|---|---|---|---|
| Operator | OpenAI Computer-Using Agent | OpenAI | OSU NLP | 80.3 | 73.4 | 59.0 | 71.8 | 2025-3-22 |
| Browser Use | gpt-4o-2024-08-06 | Browser Use | OSU NLP | 58.6 | 37.5 | 24.3 | 40.1 | 2025-3-22 |
| SeeAct | gpt-4o-2024-08-06 | OSU | OSU NLP | 65.1 | 36.1 | 18.5 | 39.8 | 2025-3-22 |
| Claude Computer Use | claude-3-5-sonnet-20241022 | Anthropic | OSU NLP | 61.9 | 28.1 | 21.2 | 35.8 | 2025-3-22 |
| Agent-E | gpt-4o-2024-08-06 | Emergence AI | OSU NLP | 57.4 | 31.9 | 14.4 | 34.7 | 2025-3-22 |
Please use the script provided in our GitHub repository to run the automatic evaluation yourself, and submit the results along with all trajectories. To ensure the authenticity and reliability of reported results, we will also verify the auto-eval results. If you have conducted your own human evaluation, please attach those results as well; we will spot-check them before adding them to the human-eval table.
⚠ Please submit your trajectories in the following format:
The result of each task is stored in a folder named after its task_id, containing:
- trajectory/: screenshots of each step.
- result.json: task metadata and action history.
Here is an example of the format.
Structure:
```
main_directory/
└── task_id/
    ├── result.json
    └── trajectory/
        ├── 0_screenshot.png
        ├── 1_screenshot.png
        └── ...
```
result.json format:
```json
{
  "task_id": 123,
  "task": "abc",
  "action_history": ["abc", "xyz", "..."]
}
```
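Before emailing a submission, a quick structural check along these lines can catch missing files. This is only a sketch (not the official verification script), and `main_directory` is a placeholder path:

```python
import json
from pathlib import Path

def check_submission(main_directory: str) -> None:
    """Verify each task folder contains result.json and a trajectory/ of screenshots."""
    for task_dir in sorted(Path(main_directory).iterdir()):
        if not task_dir.is_dir():
            continue
        result_file = task_dir / "result.json"
        trajectory_dir = task_dir / "trajectory"
        assert result_file.is_file(), f"{task_dir.name}: missing result.json"
        assert trajectory_dir.is_dir(), f"{task_dir.name}: missing trajectory/"

        result = json.loads(result_file.read_text())
        for key in ("task_id", "task", "action_history"):
            assert key in result, f"{task_dir.name}: result.json missing '{key}'"

        screenshots = sorted(trajectory_dir.glob("*_screenshot.png"))
        assert screenshots, f"{task_dir.name}: no screenshots in trajectory/"

check_submission("main_directory")
```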
human_result.json format (human_label is 0 for failure, 1 for success):
```json
[
  {
    "task_id": 123,
    "task": "abc",
    "human_label": 1
  },
  {
    "task_id": 456,
    "task": "def",
    "human_label": 0
  }
]
```
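If you attach human evaluation results, a quick sanity check of the file (again just a sketch) can confirm the labels are binary and report the overall success rate:

```python
import json

with open("human_result.json") as f:
    records = json.load(f)

labels = [r["human_label"] for r in records]
assert all(label in (0, 1) for label in labels), "human_label must be 0 or 1"
print(f"{len(labels)} tasks, success rate = {sum(labels) / len(labels):.3f}")
```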
Please email your agent's name, model family, and organization to xue.681@osu.edu, and include the trajectory directory and auto-eval results file as attachments (optional: human evaluation results).