Overview
Inspect is an open-source framework for systematic, extensible, and reproducible evaluations of large language models (LLMs). Created by the UK AI Security Institute (AISI) and hosted under the UKGovernmentBEIS GitHub organization, it bundles common evaluation patterns and utilities so teams can focus on designing and running meaningful tests rather than reimplementing evaluation scaffolding.
Key features
- Built-in components for prompt engineering: helpers and templates to manage prompts consistently across experiments.
- Tool usage and multi-turn dialogue support: evaluate models in interactive, multi-step scenarios including tool calls and stateful conversations.
- Model-graded evaluations: support for scoring strategies in which a model grades another model's responses or judgments (the task sketch after this list shows where a scorer slots into a task).
- Large collection of pre-built evaluations: includes 100+ ready-to-run evaluation suites (Inspect Evals) that can be executed against any compatible model.
- Extensible architecture: additional elicitation and scoring techniques can be provided via separate Python packages, making it straightforward to add new metrics or evaluation styles.
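These pieces come together in a task: a dataset of samples, a solver chain that elicits model output, and a scorer that grades it. The sketch below illustrates that pattern with Inspect's Python API; the task name, sample content, and choice of the built-in match() scorer are illustrative rather than taken from the project's documentation, and the names assume a recent inspect_ai release.

    from inspect_ai import Task, task
    from inspect_ai.dataset import Sample
    from inspect_ai.scorer import match
    from inspect_ai.solver import generate, system_message

    @task
    def capital_quiz():
        # A one-sample dataset; real evaluations typically load many samples
        # from files or remote datasets.
        dataset = [Sample(input="What is the capital of France?", target="Paris")]
        return Task(
            dataset=dataset,
            solver=[
                system_message("Answer with a single word."),  # prompt step
                generate(),                                     # model generation step
            ],
            scorer=match(),  # compares the model output to the sample target
        )

Swapping match() for a model-graded scorer such as model_graded_qa() turns the same task into a model-graded evaluation.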
Who it's for
Inspect targets researchers, safety engineers, and practitioners who need a repeatable framework for benchmarking and stress-testing LLMs. It’s useful for comparing model behavior across prompts, tool configurations, multi-turn interactions, and scoring strategies.
Quick start (high-level)
- Clone the repository and install in editable mode with dev dependencies (recommended for contributors):
    git clone https://github.com/UKGovernmentBEIS/inspect_ai.git
    cd inspect_ai
    pip install -e ".[dev]"
- Optionally install pre-commit hooks and run checks/tests (make hooks, make check, make test).
- See the documentation site for detailed guides and the collection of pre-built evaluations: https://inspect.aisi.org.uk/.
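Once installed, evaluations can be launched with the inspect eval CLI or programmatically. The snippet below is a minimal sketch of the programmatic route, assuming the illustrative task above was saved as capital_quiz.py; the model identifier is a placeholder for any provider/model combination Inspect supports.

    from inspect_ai import eval
    from capital_quiz import capital_quiz  # illustrative task from the sketch above

    # Run the task against the chosen model and collect the evaluation logs.
    logs = eval(capital_quiz(), model="openai/gpt-4o")
    print(logs[0].status)  # e.g. "success" when the run completes

Completed runs can then be browsed interactively with the inspect view log viewer.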
Integration and extensibility
Inspect is designed as evaluation infrastructure rather than a model runtime: it can be used with different model backends and extended with new elicitation techniques, scoring methods, or datasets via Python extension packages. For contributors working on docs, the project also explains how to build and preview the documentation site with Quarto.
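As an illustration of that extensibility, the sketch below registers a hypothetical custom scorer with the @scorer decorator; the scorer name and keyword check are invented for illustration, and the imports assume a recent inspect_ai release.

    from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
    from inspect_ai.solver import TaskState

    @scorer(metrics=[accuracy()])
    def contains_keyword(keyword: str = "Paris"):
        # Mark a sample correct if the model's completion mentions the keyword.
        async def score(state: TaskState, target: Target) -> Score:
            answer = state.output.completion
            return Score(
                value=CORRECT if keyword.lower() in answer.lower() else INCORRECT,
                answer=answer,
                explanation=f"Checked for keyword '{keyword}' in the output.",
            )

        return score

A task can then pass scorer=contains_keyword() in place of a built-in scorer, and extension packages can distribute such components so they are reusable across evaluations.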
Project signals and community
The GitHub repository (UKGovernmentBEIS/inspect_ai) is published with community-facing documentation and the companion Inspect Evals collection. It contains contributor guidance (linting, tests, recommended VS Code extensions) to support development and review. The project README links to the official Inspect documentation (inspect.aisi.org.uk) for full usage details.
