terminal-bench

Introduction

A benchmark to measure the capabilities of AI agents in a terminal environment.

#####################################################################
#  _____                   _             _     ______________       #
# |_   _|__ _ __ _ __ ___ (_)_ __   __ _| |   ||            ||      #
#   | |/ _ \ '__| '_ ` _ \| | '_ \ / _` | |   || >          ||      #
#   | |  __/ |  | | | | | | | | | | (_| | |   ||            ||      #
#   |_|\___|_|  |_| |_| |_|_|_| |_|\__,_|_|   ||____________||      #
#   ____                  _                   |______________|      #
#  | __ )  ___ _ __   ___| |__                 \\############\\     #
#  |  _ \ / _ \ '_ \ / __| '_ \                 \\############\\    # 
#  | |_) |  __/ | | | (__| | | |                 \      ____    \   #
#  |____/ \___|_| |_|\___|_| |_|                  \_____\___\____\  #
#                                                                   #
#####################################################################

Terminal-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks in a terminal environment.

Examples of tasks include:

  • Compiling and packaging a code repository
  • Downloading a dataset and training a classifier on it
  • Setting up a server

Terminal-Bench consists of two parts: a dataset of tasks and an execution harness.

Dataset of tasks

Each task in Terminal-Bench includes:

  • a description in English
  • a Docker environment
  • a test script to verify whether the agent completed the task successfully
  • a reference ("oracle") solution that solves the task
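
As a rough illustration, the four components of a task can be pictured as one record. The sketch below is not the project's actual schema: the class name TaskSpec, its field names, and the example paths are all hypothetical and chosen only to mirror the list above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical in-memory view of one task; not the project's real schema."""

    description: str       # English instruction shown to the agent
    docker_context: Path   # directory holding the Docker environment definition
    test_script: Path      # script that checks whether the task was completed
    oracle_solution: Path  # reference solution known to solve the task

    def missing_files(self) -> list[str]:
        """Return the names of components whose paths do not exist on disk."""
        components = {
            "docker_context": self.docker_context,
            "test_script": self.test_script,
            "oracle_solution": self.oracle_solution,
        }
        return [name for name, path in components.items() if not path.exists()]


# Example paths are placeholders, not locations in the real repository.
task = TaskSpec(
    description="Compile and package the repository in /app.",
    docker_context=Path("tasks/example/environment"),
    test_script=Path("tasks/example/run-tests.sh"),
    oracle_solution=Path("tasks/example/solution.sh"),
)
print(task.missing_files())
```

A check like missing_files() is one simple way to catch an incomplete task before it is run; whether the real project validates tasks this way is not stated here.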

Check out our existing tasks here.

Want to help?

Terminal-Bench is an active open-source project, and we're always looking for new tasks! If you have a task you'd like to see in Terminal-Bench, please follow our quickstart guide to add it, or join our Discord server to chat with us.

Execution harness

The execution harness connects a language model to a sandboxed terminal environment, sets up the task environments, and evaluates the agent's ability to complete the tasks.
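
The flow described above (set up an environment, let the agent act, then evaluate) can be sketched in a few lines. This is a toy model under stated assumptions, not the harness's implementation: the sandbox is a plain dictionary standing in for a containerized terminal, the agent is hard-coded rather than driven by a language model, and every name (Sandbox, Task, run_task, toy_agent) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a sandboxed terminal: file paths mapped to contents.
Sandbox = Dict[str, str]


@dataclass
class Task:
    description: str
    setup: Callable[[], Sandbox]     # builds a fresh environment for each run
    test: Callable[[Sandbox], bool]  # returns True if the task was completed


def run_task(task: Task, agent: Callable[[str, Sandbox], None]) -> bool:
    """Set up the environment, let the agent act in it, then evaluate the result."""
    sandbox = task.setup()
    agent(task.description, sandbox)
    return task.test(sandbox)


def toy_agent(description: str, sandbox: Sandbox) -> None:
    # A real agent would ask a language model for terminal commands;
    # this one simply writes the expected file.
    sandbox["/app/hello.txt"] = "hello"


task = Task(
    description="Create /app/hello.txt containing the word hello.",
    setup=dict,
    test=lambda sb: sb.get("/app/hello.txt") == "hello",
)

results = {
    "toy_agent": run_task(task, toy_agent),
    "idle_agent": run_task(task, lambda description, sb: None),
}
print(results)  # {'toy_agent': True, 'idle_agent': False}
```

Keeping setup, agent, and test as separate callables mirrors the separation in the text: the same task can be run against different agents, and each run starts from a fresh environment.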

After installing the package, you can run the harness with the tb run command.