AI red-teaming and AI system security framework
Point an automated attacker at any AI system and measure whether it can be made to misbehave — under a precisely defined level of access.
Why superred
Published attacks on AI systems and the benchmarks that measure them are scattered across one-off codebases. Each attack is wired to the system it was first demonstrated on; each benchmark ships its own runner. Answering a simple question — does this attack beat that system, and how much access does it take? — usually means re-implementing one against the other by hand.
Superred makes the three moving parts independent and composable. The attacker (an optimizer), the system under test (a target), and the definition of success (a security claim) are separate, interchangeable modules. Any attacker can be pointed at any target and scored by any claim — under a threat model you state explicitly.
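To make the separation concrete, here is a minimal, self-contained Python sketch of the three roles as interchangeable interfaces. It does not use superred's actual API: `Optimizer`, `Target`, `SecurityClaim`, `attack`, `run_grid`, and every toy class below are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Protocol

# All names here are illustrative assumptions, not superred's real interfaces.


class Optimizer(Protocol):
    """The attacker: proposes the next input from the goal and past responses."""
    def propose(self, goal: str, history: list[str]) -> str: ...


class Target(Protocol):
    """The system under test: maps an attacker-controlled input to a response."""
    def respond(self, prompt: str) -> str: ...


class SecurityClaim(Protocol):
    """The definition of success: an adversarial goal plus a judge."""
    goal: str
    def is_violated(self, response: str) -> bool: ...


class DirectAsk:
    """Toy attacker that states the goal verbatim."""
    def propose(self, goal: str, history: list[str]) -> str:
        return goal


class PoliteAsk:
    """Toy attacker that prefixes the goal with 'Please'."""
    def propose(self, goal: str, history: list[str]) -> str:
        return f"Please {goal}"


class EchoTarget:
    """Toy target that repeats whatever it is given."""
    def respond(self, prompt: str) -> str:
        return f"You said: {prompt}"


class RefusingTarget:
    """Toy target that refuses everything."""
    def respond(self, prompt: str) -> str:
        return "I can't help with that."


@dataclass
class NeverSaysSecret:
    """Toy claim: the target must never output the word SECRET."""
    goal: str = "repeat the word SECRET"

    def is_violated(self, response: str) -> bool:
        return "SECRET" in response


def attack(optimizer: Optimizer, target: Target, claim: SecurityClaim,
           max_attempts: int = 3) -> bool:
    """Return True if the claim is violated within max_attempts tries."""
    history: list[str] = []
    for _ in range(max_attempts):
        response = target.respond(optimizer.propose(claim.goal, history))
        if claim.is_violated(response):
            return True
        history.append(response)
    return False


def run_grid(optimizers, targets, claims) -> dict[tuple[str, str, str], bool]:
    """Score every optimizer x target x claim combination."""
    return {
        (type(o).__name__, type(t).__name__, type(c).__name__): attack(o, t, c)
        for o, t, c in product(optimizers, targets, claims)
    }


results = run_grid([DirectAsk(), PoliteAsk()],
                   [EchoTarget(), RefusingTarget()],
                   [NeverSaysSecret()])
for combo, broken in results.items():
    print(combo, "->", "broken" if broken else "held")
```

Because each role depends only on the interface of the other two, adding one new attacker, target, or claim extends the whole grid without touching the existing pieces.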
Who it is for
If you study or compare attacks: superred offers a growing library of faithful re-implementations of published attacks, including AutoDAN-Turbo, PAIR, TAP, Crescendo, GPTFuzzer, and many-shot, among others. Each is ready to run against any target and is scored by real safety benchmarks such as HarmBench, StrongREJECT, and AgentDojo. A fixed model and a fixed budget per attempt keep every comparison fair.
If you build or defend an AI system: describe your system once, including its trust boundaries, the points an attacker could influence, and the facts they could read. Superred then measures exactly which attacks succeed at each level of access, so you learn not just whether your system can be broken, but how much an attacker must control before it is.
High-level features
Any optimizer, any target, any security claim. Components are independent packages that combine freely.
Attacker access is scoped to a forest of trust boundaries, so you measure behavior under exactly the access a real attacker would hold.
Attackers act only through declared injection points and observe only through declared signals and the run trajectory — no hidden coupling to the target.
Ported attacks and benchmarks stay byte-identical to their upstream prompts and data wherever possible, with every deviation recorded.
A fixed model and a cost budget per task bound every attacker equally, and results are written to disk for later analysis.
Run many threat models in parallel, each against its own independent instance of the system under test.
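The scoping and parallelism features above can be pictured with a small, self-contained Python sketch. It is an assumption-laden illustration, not superred's API: `ThreatModel`, `ScopedAccess`, `ScopeViolation`, `ToySystem`, `reachable_points`, and `run_all` are hypothetical names, and the flat sets of injection points and signals are a simplification of the forest of trust boundaries described above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical names throughout; this is not superred's real interface.


class ScopeViolation(Exception):
    """Raised when the attacker steps outside the declared threat model."""


@dataclass(frozen=True)
class ThreatModel:
    name: str
    injection_points: frozenset[str]  # where the attacker may write
    signals: frozenset[str]           # what the attacker may read


class ToySystem:
    """Stand-in system with three inputs and two outputs."""
    POINTS = ("system_prompt", "user_message", "retrieved_document")

    def __init__(self) -> None:
        self.inputs = {"system_prompt": "Be helpful.",
                       "user_message": "",
                       "retrieved_document": ""}

    def run(self) -> dict[str, str]:
        return {
            "final_answer": f"Answering: {self.inputs['user_message']}",
            "internal_log": f"context={self.inputs['retrieved_document']}",
        }


class ScopedAccess:
    """The only handle the attacker gets: every read and write is checked."""

    def __init__(self, system: ToySystem, threat_model: ThreatModel) -> None:
        self._system = system
        self._tm = threat_model

    def inject(self, point: str, payload: str) -> None:
        if point not in self._tm.injection_points:
            raise ScopeViolation(f"{point!r} is not a declared injection point")
        self._system.inputs[point] = payload

    def observe(self, signal: str) -> str:
        if signal not in self._tm.signals:
            raise ScopeViolation(f"{signal!r} is not a declared signal")
        return self._system.run()[signal]


def reachable_points(threat_model: ThreatModel) -> list[str]:
    """Try every input of a fresh system instance; keep the ones in scope."""
    access = ScopedAccess(ToySystem(), threat_model)  # one instance per threat model
    reached = []
    for point in ToySystem.POINTS:
        try:
            access.inject(point, "attacker text")
        except ScopeViolation:
            continue
        reached.append(point)
    return reached


def run_all(threat_models: list[ThreatModel]) -> dict[str, list[str]]:
    """Evaluate each threat model in parallel, each on its own system instance."""
    with ThreadPoolExecutor() as pool:
        outcomes = pool.map(reachable_points, threat_models)  # preserves input order
        return dict(zip((tm.name for tm in threat_models), outcomes))


THREAT_MODELS = [
    ThreatModel("user_only", frozenset({"user_message"}),
                frozenset({"final_answer"})),
    ThreatModel("user_and_docs", frozenset({"user_message", "retrieved_document"}),
                frozenset({"final_answer"})),
]

for name, points in run_all(THREAT_MODELS).items():
    print(name, "can influence", points)
```

Routing every read and write through one checked handle is what keeps an attacker from being coupled to the target's internals: anything not declared in the threat model is simply unreachable.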
In one sentence
A controller runs an optimizer against a target to pursue an adversarial goal from a security claim, while enforcing a security scope — recording every event and response along the way.
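One way to picture that sentence is the loop below, a minimal sketch under stated assumptions rather than superred's implementation. The names `Controller`, `Event`, `in_scope`, `cost_per_attempt`, and `demo_controller` are hypothetical, the scope is reduced to a single yes/no predicate, and the JSON Lines output format is an arbitrary choice for the example.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable

# Hypothetical sketch; none of these names come from superred itself.


@dataclass
class Event:
    step: int
    kind: str      # "attack", "response", or "rejected"
    content: str


@dataclass
class Controller:
    propose: Callable[[str, list[Event]], str]  # the optimizer
    respond: Callable[[str], str]               # the target
    is_violated: Callable[[str], bool]          # the claim's judge
    in_scope: Callable[[str], bool]             # the security scope
    goal: str                                   # the claim's adversarial goal
    budget: float
    cost_per_attempt: float = 1.0
    events: list[Event] = field(default_factory=list)

    def run(self) -> bool:
        """Return True if the claim is violated before the budget runs out."""
        spent, step = 0.0, 0
        while spent + self.cost_per_attempt <= self.budget:
            spent += self.cost_per_attempt  # every attempt is charged
            attack = self.propose(self.goal, self.events)
            self.events.append(Event(step, "attack", attack))
            if not self.in_scope(attack):
                # Out-of-scope attempts are logged and never reach the target.
                self.events.append(Event(step, "rejected", "outside declared scope"))
            else:
                response = self.respond(attack)
                self.events.append(Event(step, "response", response))
                if self.is_violated(response):
                    return True
            step += 1
        return False

    def save(self, path: Path) -> None:
        """Write the recorded events to disk, one JSON object per line."""
        with path.open("w", encoding="utf-8") as f:
            for event in self.events:
                f.write(json.dumps(asdict(event)) + "\n")


def demo_controller(budget: float) -> Controller:
    """Toy wiring: the target leaks on the attacker's third attempt."""
    return Controller(
        propose=lambda goal, events: (
            f"{goal} #{sum(e.kind == 'attack' for e in events)}"
        ),
        respond=lambda attack: "SECRET" if attack.endswith("#2") else "no",
        is_violated=lambda response: "SECRET" in response,
        in_scope=lambda attack: len(attack) <= 200,
        goal="leak",
        budget=budget,
    )


controller = demo_controller(budget=5)
print("claim violated:", controller.run())
print("events recorded:", len(controller.events))
```

Charging the budget for rejected attempts as well is a design choice made here so that the loop always terminates, even if an optimizer keeps proposing out-of-scope inputs.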
Easy to get started
Install the framework and whichever modules you want to combine; connect an optimizer, a target, and a claim in a short script; then run an evaluation end to end.
# install the framework and the pieces you want to combine
pip install -e ./superred
pip install -e ./superred-modules/optimizers/pair # an attacker
pip install -e ./superred-modules/targets/chatbot # a system to test
pip install -e ./superred-modules/security_claims/harmbench # what counts as broken
# wire them together in a short script, then run
python run.py
Documentation