Introduction
A benchmark to measure the capabilities of AI agents in a terminal environment.
#####################################################################
# _____ _ _ ______________ #
# |_ _|__ _ __ _ __ ___ (_)_ __ __ _| | || || #
# | |/ _ \ '__| '_ ` _ \| | '_ \ / _` | | || > || #
# | | __/ | | | | | | | | | | | (_| | | || || #
# |_|\___|_| |_| |_| |_|_|_| |_|\__,_|_| ||____________|| #
# ____ _ |______________| #
# | __ ) ___ _ __ ___| |__ \\############\\ #
# | _ \ / _ \ '_ \ / __| '_ \ \\############\\ #
# | |_) | __/ | | | (__| | | | \ ____ \ #
# |____/ \___|_| |_|\___|_| |_| \_____\___\____\ #
# #
#####################################################################
Terminal-Bench is a framework and set of tasks for evaluating how well AI agents can accomplish complex tasks in a terminal environment.
Examples of tasks include:
- Compiling and packaging a code repository
- Downloading a dataset and training a classifier on it
- Setting up a server
Terminal-Bench consists of two parts: a dataset of tasks and an execution harness.
Dataset of tasks
Each task in Terminal-Bench includes:
- a description in English
- a docker environment
- a test script to verify if the agent completed the task successfully
- a reference ("oracle") solution that solves the task
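As a rough sketch, a task directory might be laid out as follows. The file names here are illustrative assumptions, not the canonical format; see the repository for the actual task structure:

```
tasks/example-task/
├── task.yaml         # English description and task metadata
├── Dockerfile        # defines the sandboxed terminal environment
├── solution.sh       # reference ("oracle") solution
└── tests/
    └── test_outputs.py   # verifies the agent completed the task
```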
Check out our existing tasks here.
Want to help?
Terminal-Bench is an active open-source project, and we're always looking for new tasks! If you have a task you'd like to see in Terminal-Bench, please follow our quickstart guide to add it, or join our Discord server to chat with us.
Execution harness
The execution harness connects a language model to a sandboxed terminal environment, sets up the task environments, and evaluates the agent's ability to complete the tasks.
After installing the package, you can run the harness via tb run.
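A minimal session might look like the following. The package name and flag values are assumptions for illustration; consult tb run --help for the options supported by your installed version:

```
# Install the harness (package name assumed; see the install docs for the canonical command)
pip install terminal-bench

# Run the harness on the task set; the agent and model names here are placeholders
tb run --agent <agent-name> --model <model-name>

# List the available options
tb run --help
```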