Superred — AI red-teaming and AI system security framework

Why superred

Attacks, systems, and benchmarks no longer live in separate worlds

Published attacks on AI systems and the benchmarks that measure them are scattered across one-off codebases. Each attack is wired to the system it was first demonstrated on; each benchmark ships its own runner. Answering a simple question — does this attack beat that system, and how much access does it take? — usually means re-implementing one against the other by hand.

Superred makes the three moving parts independent and composable. The attacker (an optimizer), the system under test (a target), and the definition of success (a security claim) are separate, interchangeable modules. Any attacker can be pointed at any target and scored by any claim — under a threat model you state explicitly.
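As a rough illustration of that separation, the sketch below models the three roles as small Python interfaces and shows one attacker run against two different targets. All of the names here (`Target`, `Optimizer`, `SecurityClaim`, `evaluate`, and the toy classes) are hypothetical and chosen for this example; they are not Superred's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Target(Protocol):
    """The system under test (hypothetical interface)."""
    def respond(self, prompt: str) -> str: ...


class Optimizer(Protocol):
    """The attacker (hypothetical interface)."""
    def propose(self, goal: str, last_response: Optional[str]) -> str: ...


class SecurityClaim(Protocol):
    """The definition of success (hypothetical interface)."""
    goal: str
    def is_violated(self, response: str) -> bool: ...


class PoliteTarget:
    """Toy target: complies only when asked politely."""
    def respond(self, prompt: str) -> str:
        return f"OK: {prompt}" if "please" in prompt.lower() else "REFUSED"


class StrictTarget:
    """Toy target: always refuses."""
    def respond(self, prompt: str) -> str:
        return "REFUSED"


class RephraseOptimizer:
    """Toy attacker: retries the goal with 'please' after a first response."""
    def propose(self, goal: str, last_response: Optional[str]) -> str:
        return goal if last_response is None else f"please {goal}"


@dataclass
class NoComplianceClaim:
    """Toy claim: the target never answers with 'OK'."""
    goal: str
    def is_violated(self, response: str) -> bool:
        return response.startswith("OK")


def evaluate(optimizer: Optimizer, target: Target, claim: SecurityClaim,
             max_attempts: int = 3) -> bool:
    """Return True if the optimizer breaks the claim within max_attempts."""
    response: Optional[str] = None
    for _ in range(max_attempts):
        prompt = optimizer.propose(claim.goal, response)
        response = target.respond(prompt)
        if claim.is_violated(response):
            return True
    return False
```

Because `evaluate` depends only on the three interfaces, swapping `PoliteTarget` for `StrictTarget` changes the outcome without touching the attacker or the claim.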

Who it is for

For red-teamers

A growing library of faithful re-implementations of published attacks — AutoDAN-Turbo, PAIR, TAP, Crescendo, GPTFuzzer, many-shot, and more — ready to run against any target and scored by real safety benchmarks such as HarmBench, StrongREJECT, and AgentDojo. A fixed model and a fixed budget per attempt keep every comparison fair.

For system builders

Describe your system once: its trust boundaries, the points an attacker could influence, and the facts they could read. Superred then measures exactly which attacks succeed at each level of access — so you learn not just whether your system can be broken, but how much an attacker must control before it is.

High-level features

What the framework gives you

Mix and match

Any optimizer, any target, any security claim. Components are independent packages that combine freely.

Precise threat models

Attacker access is scoped to a forest of trust boundaries, so you measure behavior under exactly the access a real attacker would hold.
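One way to picture a forest of trust boundaries is as a set of trees, where an attacker is granted some boundaries and the scope follows from that grant. The sketch below is a minimal model under a stated assumption: granting a boundary also grants every boundary nested inside it. That rule, and the names `Boundary`, `reachable_boundaries`, and `allowed_points`, are illustrative assumptions and not taken from Superred.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Boundary:
    """A node in a trust-boundary tree (hypothetical type)."""
    name: str
    children: list[Boundary] = field(default_factory=list)


def reachable_boundaries(forest: list[Boundary], granted: set[str]) -> set[str]:
    """Granted boundaries plus everything nested inside them (assumed rule)."""
    reachable: set[str] = set()

    def walk(node: Boundary, inherited: bool) -> None:
        inside = inherited or node.name in granted
        if inside:
            reachable.add(node.name)
        for child in node.children:
            walk(child, inside)

    for root in forest:
        walk(root, False)
    return reachable


def allowed_points(forest: list[Boundary], granted: set[str],
                   injection_points: dict[str, str]) -> set[str]:
    """Injection points whose owning boundary is within the attacker's scope."""
    scope = reachable_boundaries(forest, granted)
    return {point for point, owner in injection_points.items() if owner in scope}


# Two trees: direct user input, and a tool layer with two nested tools.
forest = [
    Boundary("user_input"),
    Boundary("tools", [Boundary("web_search"), Boundary("email")]),
]
# Each injection point belongs to exactly one boundary.
injection_points = {
    "chat_message": "user_input",
    "search_result": "web_search",
    "inbox_item": "email",
}
```

With this model, an attacker granted only `web_search` can influence `search_result`, while one granted `tools` can influence both tool outputs but still not the chat message.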

Event-driven by design

Attackers act only through declared injection points and observe only through declared signals and the run trajectory — no hidden coupling to the target.
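The idea of acting and observing only through declared names can be sketched with a small mediating object. In the example below, a hypothetical `Channel` rejects any injection point or signal that the target did not declare and records every permitted interaction in a trajectory. The class, its methods, and the toy target are assumptions made for illustration, not the framework's real types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Channel:
    """Mediates all attacker/target interaction (hypothetical type)."""
    injection_points: frozenset
    signals: frozenset
    trajectory: list = field(default_factory=list)
    pending: dict = field(default_factory=dict)
    values: dict = field(default_factory=dict)

    def inject(self, point: str, payload: Any) -> None:
        """Attacker side: write to a declared injection point only."""
        if point not in self.injection_points:
            raise PermissionError(f"undeclared injection point: {point}")
        self.pending[point] = payload
        self.trajectory.append(("inject", point, payload))

    def publish(self, signal: str, value: Any) -> None:
        """Target side: emit a declared signal."""
        if signal not in self.signals:
            raise KeyError(f"undeclared signal: {signal}")
        self.values[signal] = value
        self.trajectory.append(("signal", signal, value))

    def observe(self, signal: str) -> Any:
        """Attacker side: read a declared signal only."""
        if signal not in self.signals:
            raise PermissionError(f"undeclared signal: {signal}")
        return self.values.get(signal)


def toy_target_step(channel: Channel) -> None:
    """Toy target: consumes the pending user message and publishes a reply."""
    message = channel.pending.pop("user_message", "")
    channel.publish("reply", message.upper())


channel = Channel(injection_points=frozenset({"user_message"}),
                  signals=frozenset({"reply"}))
channel.inject("user_message", "hi")
toy_target_step(channel)
```

Here the attacker never holds a reference to the target itself, so anything the target does not publish as a signal stays out of reach.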

Faithful to the source

Ported attacks and benchmarks stay byte-identical to their upstream prompts and data wherever possible, with every deviation recorded.

Fair and reproducible

Every attacker is bound by the same fixed model and the same per-task cost budget, and results are written to disk for later analysis.
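A per-task budget of this kind can be sketched as a counter that every model call must charge before it runs. The example below is a minimal illustration, assuming costs are tracked in integer cents and results are stored as one JSON file per attacker; `Budget`, `BudgetExceeded`, `run_task`, and the file layout are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the limit."""


@dataclass
class Budget:
    """Per-task spending cap in integer cents (hypothetical type)."""
    limit_cents: int
    spent_cents: int = 0

    def charge(self, cents: int) -> None:
        if self.spent_cents + cents > self.limit_cents:
            raise BudgetExceeded(f"limit of {self.limit_cents} cents reached")
        self.spent_cents += cents


def run_task(attacker_name: str, cost_per_call_cents: int,
             limit_cents: int, out_dir: Path) -> dict:
    """Let an attacker make calls until its budget runs out, then save the result."""
    if cost_per_call_cents <= 0:
        raise ValueError("cost per call must be positive")
    budget = Budget(limit_cents)
    calls = 0
    try:
        while True:
            budget.charge(cost_per_call_cents)  # charge before each model call
            calls += 1
    except BudgetExceeded:
        pass
    result = {"attacker": attacker_name, "calls": calls,
              "spent_cents": budget.spent_cents}
    # One JSON file per attacker, for later analysis.
    (out_dir / f"{attacker_name}.json").write_text(json.dumps(result))
    return result
```

Under the same 10-cent limit, an attacker paying 3 cents per call gets three calls and one paying 5 cents gets two, so the comparison is by cost and not by attempt count.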

Built to scale

Run many threat models in parallel, each against its own independent instance of the system under test.
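The isolation described here can be sketched with a target factory: each threat model builds its own target instance, so no state leaks between parallel runs. The example below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the names `CounterTarget`, `run_threat_model`, and `run_all` are hypothetical, and a threat model is reduced to a name and a number of prompts for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


class CounterTarget:
    """Toy stateful target: counts how many prompts it has received."""
    def __init__(self) -> None:
        self.calls = 0

    def respond(self, prompt: str) -> str:
        self.calls += 1
        return f"call {self.calls}: {prompt}"


def run_threat_model(name: str, make_target: Callable[[], CounterTarget],
                     n_prompts: int) -> Tuple[str, int]:
    """Run one threat model against a fresh, private target instance."""
    target = make_target()  # never shared with other threat models
    for i in range(n_prompts):
        target.respond(f"{name} prompt {i}")
    return name, target.calls


def run_all(threat_models: Dict[str, int],
            make_target: Callable[[], CounterTarget]) -> Dict[str, int]:
    """Run every threat model in parallel and collect per-model call counts."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_threat_model, name, make_target, n)
                   for name, n in threat_models.items()]
        return dict(f.result() for f in futures)
```

Because each run owns its target, the call count reported for one threat model is unaffected by the others, which would not hold if a single instance were shared.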

In one sentence

A controller runs an optimizer against a target to pursue an adversarial goal from a security claim, while enforcing a security scope — recording every event and response along the way.

Learn the core concepts — models and events →

Easy to get started

Install, wire three pieces together, run an evaluation

Install the framework and whichever modules you want to combine, connect an optimizer, a target, and a claim in a short script, and run an evaluation end to end.

# install the framework and the pieces you want to combine
pip install -e ./superred
pip install -e ./superred-modules/optimizers/pair             # an attacker
pip install -e ./superred-modules/targets/chatbot             # a system to test
pip install -e ./superred-modules/security_claims/harmbench   # what counts as broken

# wire them together in a short script, then run
python run.py

Run your first evaluation — the Quick Start →

Documentation

Two ways in

Guide

Hands-on and example-driven: getting started, the core concepts of models and events, and worked examples. Start here.

Open the guide →

Reference

The framework interfaces in technical detail: the controller, optimizer, target, task, security claim, and core types.

Open the reference →