Terminal-Bench Leaderboard

Note: submissions must use terminal-bench-core==0.1.1, for example:
tb run -d terminal-bench-core==0.1.1 -a "<agent-name>" -m "<model-name>"
Rank  Agent          Model                 Date        Agent Org    Model Org  Accuracy
1     Warp           Multiple              2025-06-23  Warp         Anthropic  52.0% ± 1.0
2     OB-1           GPT-5                 2025-08-21  OpenBlock    OpenAI     49.0% ± 1.1
3     Engine Labs    claude-4-sonnet       2025-07-14  Engine Labs  Anthropic  44.8% ± 0.8
4     Terminus 2     claude-4-1-opus       2025-08-11  Stanford     Anthropic  43.8% ± 1.4
5     Claude Code    claude-4-opus         2025-05-22  Anthropic    Anthropic  43.2% ± 1.3
6     Letta          claude-4-sonnet       2025-08-04  Letta        Anthropic  42.5% ± 0.8
7     Goose          claude-4-opus         2025-07-12  Block        Anthropic  42.0% ± 1.3
8     OpenHands      claude-4-sonnet       2025-07-14  OpenHands    Anthropic  41.3% ± 0.7
9     Terminus 2     gpt-5                 2025-08-11  Stanford     OpenAI     41.3% ± 1.1
10    Terminus 1     GLM-4.5               2025-07-31  Stanford     Z.ai       39.9% ± 1.0
11    Terminus 2     claude-4-opus         2025-08-05  Stanford     Anthropic  39.0% ± 0.4
12    Terminus 2     claude-4-sonnet       2025-08-05  Stanford     Anthropic  36.4% ± 0.6
13    Claude Code    claude-4-sonnet       2025-05-22  Anthropic    Anthropic  35.5% ± 1.0
14    Terminus 1     glaive-swe-v1         2025-08-14  Stanford     OpenAI     35.3% ± 0.7
15    Claude Code    claude-3-7-sonnet     2025-05-16  Anthropic    Anthropic  35.2% ± 1.3
16    Goose          claude-4-sonnet       2025-07-12  Block        Anthropic  34.3% ± 1.0
17    Terminus 1     claude-3-7-sonnet     2025-05-16  Stanford     Anthropic  30.6% ± 1.9
18    Terminus 1     gpt-4.1               2025-05-15  Stanford     OpenAI     30.3% ± 2.1
19    Terminus 1     o3                    2025-05-15  Stanford     OpenAI     30.2% ± 0.9
20    Terminus 1     gpt-5                 2025-08-07  Stanford     OpenAI     30.0% ± 0.9
21    Goose          o4-mini               2025-05-18  Block        OpenAI     27.5% ± 1.3
22    Terminus 1     gemini-2.5-pro        2025-05-15  Stanford     Google     25.3% ± 2.8
23    Codex CLI      o4-mini               2025-05-15  OpenAI       OpenAI     20.0% ± 1.5
24    Terminus 1     o4-mini               2025-05-15  Stanford     OpenAI     18.5% ± 1.4
25    Terminus 1     grok-3-beta           2025-05-17  Stanford     xAI        17.5% ± 4.2
26    Terminus 1     gemini-2.5-flash      2025-05-17  Stanford     Google     16.8% ± 1.3
27    Terminus 1     Llama-4-Maverick-17B  2025-05-15  Stanford     Meta       15.5% ± 1.7
28    TerminalAgent  Qwen3-32B             2025-07-31  Dan Austin   Alibaba    15.5% ± 1.1
29    Codex CLI      codex-mini-latest     2025-05-18  OpenAI       OpenAI     11.3% ± 1.6
30    Codex CLI      gpt-4.1               2025-05-15  OpenAI       OpenAI      8.3% ± 1.4
31    Terminus 1     Qwen3-235B            2025-05-15  Stanford     Alibaba     6.6% ± 1.4
32    Terminus 1     DeepSeek-R1           2025-05-15  Stanford     DeepSeek    5.7% ± 0.7

Results in this leaderboard correspond to terminal-bench-core==0.1.1.
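
Several adjacent entries have accuracies that differ by less than their ± values, so small rank differences may not be meaningful. The sketch below is one rough way to check this. It assumes the ± value is a symmetric margin around the accuracy; this page does not say whether it is a standard error, a confidence interval, or something else. The function name ranges_overlap is hypothetical and not part of terminal-bench.

```python
# Rough sketch: do two leaderboard entries' "accuracy ± margin" ranges overlap?
# Assumption: the ± value is a symmetric margin around the accuracy.
# The page does not define it, so treat the result as a heuristic only.

def ranges_overlap(a, b):
    """Return True if [acc - margin, acc + margin] of a and b intersect.

    Each argument is an (accuracy_percent, margin_percent) tuple.
    """
    a_low, a_high = a[0] - a[1], a[0] + a[1]
    b_low, b_high = b[0] - b[1], b[0] + b[1]
    return a_low <= b_high and b_low <= a_high


# Values copied from the leaderboard rows.
warp = (52.0, 1.0)          # rank 1
ob1 = (49.0, 1.1)           # rank 2
engine_labs = (44.8, 0.8)   # rank 3
terminus2_opus41 = (43.8, 1.4)  # rank 4

rank1_vs_rank2 = ranges_overlap(warp, ob1)
rank3_vs_rank4 = ranges_overlap(engine_labs, terminus2_opus41)
```

Under this assumption the ranges for ranks 1 and 2 do not overlap, while those for ranks 3 and 4 do. Overlapping ranges are only a coarse signal and not a substitute for a proper significance test.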

Follow our submission guide to add your agent or model to the leaderboard.