Terminal-Bench Science: Contribute your scientific workflows as tasks for AI Agents

A Benchmark for Evaluating AI Agents on Computational Workflows in the Natural Sciences

Contribution deadline: Aug 17, 2026

Propose → (feedback & approval) → Pull request → (review & merge) → Contributor → (invite-only) → Reviewer & maintainer

ABOUT

What is Terminal-Bench Science?

Terminal-Bench Science is a benchmark for evaluating AI agents on real computational workflows from scientific research. It builds on Terminal-Bench, which has been adopted by frontier labs including Anthropic, OpenAI, and Google DeepMind. By defining what those labs measure and optimize for, Terminal-Bench has helped drive progress in AI agents on software engineering tasks. Terminal-Bench Science brings the same approach to the natural sciences.

Why do we need Terminal-Bench Science?

Most existing "AI for Science" benchmarks test textbook knowledge, not real workflows. Terminal-Bench Science closes this gap with real computational workflow tasks from research labs, evaluated in containerized environments with programmatic verification. Our goal is to give scientists a direct voice in shaping AI progress: domain experts contribute scientific workflows as benchmark tasks, frontier labs evaluate and improve their AI agents against them, and the improved agents, with stronger scientific capabilities, flow back to researchers as better tools.

[Figure: cycle linking the Natural Science Community, Frontier AI Agents & Models, and AI for Science Progress through three steps: Contribute Tasks, Evaluate & Improve, Accelerate Science.]

Domain Coverage

Terminal-Bench Science targets 100+ benchmark tasks across the life sciences, physical sciences, and earth sciences, and is also open to tasks from the mathematical sciences and other domains with computational workflows.

Domain	Areas
Life Sciences	Biology, Ecology, Medicine, Neuroscience
Physical Sciences	Astronomy, Chemistry, Materials Science, Physics
Earth Sciences	Atmospheric Sciences, Environmental Sciences, Geosciences, Ocean Sciences
Mathematical Sciences	Applied Mathematics, Autoformalization, Scientific Computing, Statistics
Other Sciences	Engineering Sciences, Interdisciplinary Sciences, Miscellaneous Sciences

CONTRIBUTE

Why Contribute?

Make AI better at your science. Frontier labs optimize for what benchmarks measure. Your tasks directly incentivize them to improve their AI systems on the scientific problems in your domain.
Gain experience in agentic evaluation. Get hands-on with evaluating frontier AI agents — learn how to design rigorous benchmarks and see firsthand where today's best models succeed and fail on real scientific work.
Become a co-author. Contributors with merged tasks receive co-authorship on the Terminal-Bench Science paper, targeting submission to a high-impact scientific journal.

What We're Looking For

We're looking for complex, real-world computational workflows from practicing scientists across the natural sciences that meet the following three key criteria:

Scientifically grounded. Tasks should reflect computational workflows from real research in the natural sciences — ideally drawn from your own work or replicating published results in your domain of expertise.
Objectively verifiable. Solutions must be programmatically checkable with deterministic pytest-based evaluation. We are not looking for open-ended tasks like hypothesis generation or literature review.
Genuinely difficult. We target tasks that today's best AI agents cannot yet reliably solve. Hard tasks expose real gaps and push capabilities forward — we're aiming for a 10–20% solve rate at release.

Tasks follow the Harbor Task Format. Check out Example Tasks and the Task Dashboard for reference.
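To make the "objectively verifiable" criterion concrete, here is a minimal sketch of what a deterministic pytest-style check could look like. It is an illustration only and not taken from the Harbor Task Format or the project's Example Tasks: the output file name `result.json`, the key `period_days`, the reference value, and the tolerance are all hypothetical.

```python
import json
import math
from pathlib import Path

# Hypothetical reference value and tolerance. A real task would take these
# from the published result or workflow being replicated.
EXPECTED_PERIOD_DAYS = 3.5247
REL_TOL = 1e-3


def load_result(path: Path) -> dict:
    """Read the JSON file the agent was asked to produce (hypothetical format)."""
    with path.open() as f:
        return json.load(f)


def check_period(result: dict) -> bool:
    """Return True if the reported value matches the reference within tolerance."""
    value = result.get("period_days")
    # Reject missing keys and non-numeric values (bool is excluded explicitly
    # because it is a subclass of int in Python).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isclose(value, EXPECTED_PERIOD_DAYS, rel_tol=REL_TOL)


def test_period_matches_reference(tmp_path):
    # In a real task the agent would write this file. Here the test writes a
    # stand-in so that the sketch runs on its own under pytest.
    out = tmp_path / "result.json"
    out.write_text(json.dumps({"period_days": 3.5250}))
    assert check_period(load_result(out))
```

Comparing against a fixed reference with an explicit tolerance keeps the check deterministic: the same output file always yields the same verdict, with no human judgment involved.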

How to Contribute

Before you start, join our Discord, introduce yourself in #tb-science, and optionally pitch your task idea there for early feedback. Follow #tb-science-announcements for updates and our weekly meetings (Mondays, 9am PT). We then follow a curated three-stage contribution process to maintain quality:

Propose. When you're ready, submit your idea via the Task Proposal Form. Proposals are posted on our Task Proposal Board and in #tb-science-task-proposals. An LLM judge evaluates each proposal against our Task Proposal Rubric, and human reviewers use that evaluation to decide on approval and guide you toward implementation.
Build. Once your proposal is approved, build the task in the Harbor Task Format and submit a Pull Request following our Contributing Guide. Your implementation is evaluated against our Task Implementation Rubric, and human reviewers also assess difficulty, scientific quality, and overall fit. We work with you iteratively through review until the task is ready to merge. Once it is merged, you earn contributor status and authorship credit on the Terminal-Bench Science paper.
Review. Top contributors are invited to reviewer & maintainer status, a senior role with elevated authorship credit and area chair candidacy for a scientific domain. An area chair leads a specific scientific area: they recruit new contributors within their domain, manage the reviewing team and progress, and set the scientific bar for tasks in their area. It is one of the highest roles in the project.

You can follow every open proposal, pull request, status, and domain coverage in real time on the Task Dashboard. Once the task collection is complete, we run frontier AI agents against it to calibrate difficulty. Tasks that pass are included in the official Terminal-Bench Science release on the Terminal-Bench Benchmarks and Terminal-Bench Leaderboards.
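The page does not describe how difficulty calibration is computed. As one possible illustration of the 10–20% target solve rate mentioned under "What We're Looking For", the sketch below computes a solve rate from repeated agent trials. The function names, the data layout, and the choice to average per-task rates are assumptions, not the project's actual procedure.

```python
from typing import Dict, List

# Target band for the benchmark-level solve rate, taken from the text above.
TARGET_LOW, TARGET_HIGH = 0.10, 0.20


def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of trials in which the agent passed every check for one task."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def benchmark_solve_rate(trials: Dict[str, List[bool]]) -> float:
    """Mean of per-task solve rates, so that each task counts equally.

    Assumption: tasks are weighted equally regardless of trial count.
    """
    if not trials:
        raise ValueError("need at least one task")
    rates = [solve_rate(outcomes) for outcomes in trials.values()]
    return sum(rates) / len(rates)


def in_target_band(rate: float) -> bool:
    """Check whether a solve rate falls inside the stated 10-20% band."""
    return TARGET_LOW <= rate <= TARGET_HIGH


if __name__ == "__main__":
    # Hypothetical pass/fail outcomes for three tasks, five trials each.
    example = {
        "task_a": [False, False, False, False, False],
        "task_b": [True, False, False, False, False],
        "task_c": [False, False, True, False, False],
    }
    rate = benchmark_solve_rate(example)
    print(f"solve rate: {rate:.3f}, in target band: {in_target_band(rate)}")
```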

Weekly Meetings and Office Hours

We host a weekly meeting every Monday at 9am PT for project updates and open discussion. Reviewers also run office hours throughout the week for feedback on proposals, implementation questions, and review guidance. You can subscribe to the Terminal-Bench Science calendar to see all upcoming sessions. Drop into any session — no RSVP needed.

Session	Areas	Time (PT)	Notes	Meeting
Weekly Meeting	General	Monday 9am	Notes	Join
Office Hour: Steven Dillmann	General, Applied Mathematics, Astronomy, Physics, Statistics	Monday 10am	Notes	Join
Office Hour: Allen Hart	General, Applied Mathematics, Physics, Autoformalization, Scientific Computing	Thursday 10am	Notes	Join

Authorship & Credit

Contributors with merged tasks earn contributor status — co-authorship on the Terminal-Bench Science paper and a listing on the Terminal-Bench Contributors page. Author order is determined by the number and impact of accepted tasks.

Top contributors are invited into reviewer & maintainer status — a senior role that comes with elevated authorship credit, voting rights on proposal approvals and PR merges, and the chance to shape the benchmark's scientific direction.

Reviewers in good standing become eligible for area chair, a role that leads a specific scientific area, manages its reviewing team and progress, and recruits new contributors. It is one of the highest roles in the project.

Faculty who bring in contributors or review tasks as domain experts are eligible for senior authorship.

Deadline

Pull requests must be submitted by August 17, 2026. Review, iteration, and merge happen after the deadline, but no new PRs will be accepted past that date. Starting early is highly recommended — most tasks require a few rounds of feedback and iteration before they're ready to merge.

RESOURCES

Join our Discord and reach out to @stevendi11 on Discord or stevendi@stanford.edu to get involved. Key channels: #tb-science for general discussion and early feedback on task ideas, #tb-science-announcements for project updates, and #tb-science-task-proposals for submitted proposals, automated reviews, and reviewer feedback. Drop into our weekly meetings and office hours — see the Terminal-Bench Science calendar for the schedule.

Acknowledgements

Terminal-Bench Science is an open academic collaboration hosted by Stanford University and the Laude Institute. As part of the Terminal-Bench franchise, it is built by the Terminal-Bench & Harbor Framework team and scientific contributors. We thank Snorkel AI for support via the Open Benchmarks Grants program, the Laude Institute for support via the Slingshots program, and 2077AI for API credits that power benchmark evaluations.

Contact

For questions, feedback, or if you're interested in contributing, reach out to Steven Dillmann at stevendi@stanford.edu.