Terminal-Bench Leaderboard

Note: submissions must use terminal-bench-core==0.1.1. Run the evaluation with:

tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"
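For example, a filled-in invocation might look like the line below. The agent and model identifier strings here are illustrative assumptions, not verified values; the submission guide lists the exact names the harness accepts.

tb run -d terminal-bench-core==0.1.1 -a "terminus-2" -m "anthropic/claude-sonnet-4-5"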
| Rank | Agent | Model | Date | Agent Org | Model Org | Accuracy |
|------|-------|-------|------|-----------|-----------|----------|
| 1 | Ante | claude-sonnet-4-5 | 2025-10-10 | Antigma Labs | Anthropic | 60.3% ± 1.1 |
| 2 | Droid | claude-opus-4-1 | 2025-09-24 | Factory | Anthropic | 58.8% ± 0.9 |
| 3 | Droid | claude-sonnet-4-5 | 2025-09-29 | Factory | Anthropic | 57.5% ± 0.8 |
| 4 | OB-1 | Multiple | 2025-09-10 | OpenBlock | Multiple | 56.7% ± 0.6 |
| 5 | Ante | claude-sonnet-4 | 2025-09-30 | Antigma Labs | Anthropic | 54.8% ± 1.5 |
| 6 | Droid | gpt-5 | 2025-09-24 | Factory | OpenAI | 52.5% ± 2.1 |
| 7 | Chaterm | claude-sonnet-4-5 | 2025-10-10 | Chaterm | Anthropic | 52.5% ± 0.5 |
| 8 | Warp | Multiple | 2025-06-23 | Warp | Anthropic | 52.0% ± 1.0 |
| 9 | Terminus 2 | claude-sonnet-4-5 | 2025-09-30 | Stanford | Anthropic | 51.0% ± 0.8 |
| 10 | Droid | claude-sonnet-4 | 2025-09-24 | Factory | Anthropic | 50.5% ± 1.4 |
| 11 | Chaterm | claude-sonnet-4 | 2025-09-10 | Chaterm | Anthropic | 49.3% ± 1.3 |
| 12 | Goose | claude-opus-4 | 2025-09-03 | Block | Anthropic | 45.3% ± 1.5 |
| 13 | Engine Labs | claude-sonnet-4 | 2025-07-14 | Engine Labs | Anthropic | 44.8% ± 0.8 |
| 14 | Terminus 2 | claude-opus-4-1 | 2025-08-11 | Stanford | Anthropic | 43.8% ± 1.4 |
| 15 | Claude Code | claude-opus-4 | 2025-05-22 | Anthropic | Anthropic | 43.2% ± 1.3 |
| 16 | Codex CLI | gpt-5-codex | 2025-09-14 | OpenAI | OpenAI | 42.8% ± 2.1 |
| 17 | Letta | claude-sonnet-4 | 2025-08-04 | Letta | Anthropic | 42.5% ± 0.8 |
| 18 | Goose | claude-opus-4 | 2025-07-12 | Block | Anthropic | 42.0% ± 1.3 |
| 19 | OpenHands | claude-sonnet-4 | 2025-07-14 | OpenHands | Anthropic | 41.3% ± 0.7 |
| 20 | Terminus 2 | gpt-5 | 2025-08-11 | Stanford | OpenAI | 41.3% ± 1.1 |
| 21 | Goose | claude-sonnet-4 | 2025-09-03 | Block | Anthropic | 41.3% ± 1.3 |
| 22 | Orchestrator | claude-opus-4-1 | 2025-09-23 | Dan Austin | Anthropic | 40.5% ± 0.3 |
| 23 | Terminus 1 | GLM-4.5 | 2025-07-31 | Stanford | Z.ai | 39.9% ± 1.0 |
| 24 | Terminus 2 | claude-opus-4 | 2025-08-05 | Stanford | Anthropic | 39.0% ± 0.4 |
| 25 | Alpha | claude-sonnet-4-5 | 2025-10-12 | Ataraxy Labs Inc. | Anthropic | 38.3% ± 1.1 |
| 26 | Orchestrator | claude-sonnet-4 | 2025-09-01 | Dan Austin | Anthropic | 37.0% ± 2.0 |
| 27 | Terminus 2 | claude-sonnet-4 | 2025-08-05 | Stanford | Anthropic | 36.4% ± 0.6 |
| 28 | Claude Code | claude-sonnet-4 | 2025-05-22 | Anthropic | Anthropic | 35.5% ± 1.0 |
| 29 | Terminus 1 | glaive-swe-v1 | 2025-08-14 | Stanford | OpenAI | 35.3% ± 0.7 |
| 30 | Claude Code | claude-3-7-sonnet | 2025-05-16 | Anthropic | Anthropic | 35.2% ± 1.3 |
| 31 | Goose | claude-sonnet-4 | 2025-07-12 | Block | Anthropic | 34.3% ± 1.0 |
| 32 | Terminus 2 | grok-4-fast | 2025-09-21 | Stanford | xAI | 31.3% ± 1.4 |
| 33 | Terminus 1 | claude-3-7-sonnet | 2025-05-16 | Stanford | Anthropic | 30.6% ± 1.9 |
| 34 | Terminus 1 | gpt-4.1 | 2025-05-15 | Stanford | OpenAI | 30.3% ± 2.1 |
| 35 | Terminus 1 | o3 | 2025-05-15 | Stanford | OpenAI | 30.2% ± 0.9 |
| 36 | Terminus 1 | gpt-5 | 2025-08-07 | Stanford | OpenAI | 30.0% ± 0.9 |
| 37 | Goose | o4-mini | 2025-05-18 | Block | OpenAI | 27.5% ± 1.3 |
| 38 | Terminus 1 | gemini-2.5-pro | 2025-05-15 | Stanford | Google | 25.3% ± 2.8 |
| 39 | Codex CLI | o4-mini | 2025-05-15 | OpenAI | OpenAI | 20.0% ± 1.5 |
| 40 | Orchestrator | Qwen3-Coder-480B | 2025-09-01 | Dan Austin | Alibaba | 19.7% ± 2.0 |
| 41 | Terminus 1 | o4-mini | 2025-05-15 | Stanford | OpenAI | 18.5% ± 1.4 |
| 42 | Terminus 1 | grok-3-beta | 2025-05-17 | Stanford | xAI | 17.5% ± 4.2 |
| 43 | Terminus 1 | gemini-2.5-flash | 2025-05-17 | Stanford | Google | 16.8% ± 1.3 |
| 44 | Terminus 1 | Llama-4-Maverick-17B | 2025-05-15 | Stanford | Meta | 15.5% ± 1.7 |
| 45 | TerminalAgent | Qwen3-32B | 2025-07-31 | Dan Austin | Alibaba | 15.5% ± 1.1 |
| 46 | Mini SWE-Agent | claude-sonnet-4 | 2025-08-23 | SWE-Agent | Anthropic | 12.8% ± 0.2 |
| 47 | Codex CLI | codex-mini-latest | 2025-05-18 | OpenAI | OpenAI | 11.3% ± 1.6 |
| 48 | Codex CLI | gpt-4.1 | 2025-05-15 | OpenAI | OpenAI | 8.3% ± 1.4 |
| 49 | Terminus 1 | Qwen3-235B | 2025-05-15 | Stanford | Alibaba | 6.6% ± 1.4 |
| 50 | Terminus 1 | DeepSeek-R1 | 2025-05-15 | Stanford | DeepSeek | 5.7% ± 0.7 |

Results in this leaderboard correspond to terminal-bench-core==0.1.1.

Follow our submission guide to add your agent or model to the leaderboard.

For verified entries, a Terminal-Bench team member ran the evaluation and confirmed the results.