Core Concepts
This page defines the vocabulary and shows the loop that ties everything together. Read it once and the rest of the guide will make sense.
The five roles
SecurityClaim = a bundle of Tasks (what to test)
Task = one adversarial objective: set up the target, judge the result
Target = the AI system under test
Optimizer = the attacker
Controller = the orchestrator: runs ONE threat model
Three of these (Target, Optimizer, SecurityClaim) are things you implement and ship as separate packages. The Controller is part of the framework; you construct it but do not subclass it. A Task is the unit inside a claim.
Target
The AI system you want to test. You wrap your system once and reuse it across many attacks. A target exposes:
- Controllables - injection points the attacker may manipulate (the user message, a tool’s return value, a document fetched from a database, the system prompt, …). Each controllable carries a security domain tag.
- Observables - facts about the system the attacker is allowed to read (the model name, the current system prompt text, …). Each carries a security domain tag too.
- Config specs - named slots the task fills in before a run (e.g. the
system prompt, a database seed). Set via
set_config. - Query specs - named questions the evaluator asks after a run (e.g. “what
was the last response?”). Answered via
query. run(emit, send_event)- executes one interaction. It records facts by callingemit(...)and pauses at each controllable by callingawait send_event(...)to ask the attacker what to inject.
The guiding principle: build the target to be general and reusable, and keep benchmark-specific logic out of it. The chatbot target wraps “any LLM”; the AgentDojo target wraps “the AgentDojo agent”. The specifics of a given benchmark live in the SecurityClaim, not the target. See Writing a Target.
Optimizer
The attacker. It is an actor: the Controller launches optimizer.run(channel)
as its own concurrent asyncio task, and the optimizer pulls events off the
channel at its own pace. Most optimizers inherit the default run() and just
implement on_event(event), a small state machine that reacts to each event.
The optimizer:
- is
initialized with the goal, the controllables and observables it is allowed to see (already filtered to its scope), and an LLM client; - handles events one at a time: a
RunStartEvent, then a controllable event for each injection point the target reaches, then aRunEndEvent; - decides what to inject at each controllable, and whether to stop (by
returning
done=Trueon theRunEndEvent); - may call
self.llm.complete(...)to generate attacks, if the experiment granted it an LLM. The model and the spending budget are chosen by the experiment, not by the optimizer: that is part of the threat model.
See Writing an Optimizer.
Task
One adversarial objective. A task:
- configures the target before each run (e.g. plants a secret, sets a benign user goal);
- evaluates the trajectory afterwards and returns an
EvaluationResultwith asuccessflag and a numericprimary_score.
Tasks are stateless: they never store a reference to the target. They get
the target handed to them in configure_target and again in evaluate. This is
what lets a claim be iterated many times. See Writing Tasks.
SecurityClaim
A re-iterable collection of tasks. This is what you hand to the Controller. Claims compose:
from superred.core.interfaces.security_claim import SecurityClaim
claim = SecurityClaim.from_tasks([task_a, task_b, task_c])
# Build a bigger claim out of smaller ones (lazy, re-iterable):
combined = SecurityClaim.from_claims([claim_1, claim_2])
In practice you rarely build claims by hand: a module ships a factory
function (e.g. harmbench_claim(...), sorry_bench_claim(...)) that loads a
dataset and produces one task per prompt. See Writing Tasks and Security
Claims.
Controller = one threat model
A threat model is the answer to “what can the attacker do?”. In superred it is captured by two things:
- a scope: which security domains (trust boundaries) the attacker controls and can observe;
- an
llm_config: which model the attacker may call, and how much it may spend.
One Controller instance evaluates one claim under one (scope, llm_config)
threat model. To compare several threat models (a weak attacker vs a strong
one, with feedback vs without), you build several Controllers. That is covered
in Running Evaluations and
Advanced Patterns.
The Controller never shares a target between concurrent tasks: it gets each task
a fresh target from the TargetFactory and a fresh optimizer from the
optimizer_factory. The factory also declares how many tasks may run in
parallel (concurrency).
The run loop
For each task in the claim, the Controller does this (simplified):
target = target_factory.create() # fresh instance for this task
task.configure_target(target) # set up the scenario (or skip if NotApplicable)
optimizer = optimizer_factory() # fresh attacker
optimizer.initialize(goal, controllables, observables, llm_client)
# controllables/observables are scope-filtered
launch optimizer.run(channel) # attacker runs concurrently
LOOP (until the optimizer says done, or max_runs_per_task):
send RunStartEvent -> optimizer
target.run(emit, send_event):
target emits ObservableEvent facts via emit(...)
target pauses at each controllable: send_event(...) -> optimizer injects
evaluation = task.evaluate(trajectory, target) # did it work?
send RunEndEvent(evaluation) -> optimizer # optimizer may answer done=True
target.reset_ephemeral_state() # reset for the next run
optimizer.teardown()
target.reset_ephemeral_state(); target.teardown() # instance is then discarded
A “run” is one full pass of the target plus its evaluation. A task can take many
runs: the attacker keeps trying until it gives up (done=True), exhausts its
budget, or hits the Controller’s max_runs_per_task safety cap (default 100).
Events and responses
The target and optimizer never call each other directly. They communicate through typed events carried on a channel. As an optimizer author, these are the events you will see:
| Event | When the optimizer sees it | Valid responses |
|---|---|---|
RunStartEvent |
Just before a new run begins | any EventResponse |
ControllablePreCallEvent |
Target reached an injection point | ControllableInjection, ControllableNoInjection |
ControllablePostCallEvent |
Target finished using an injection (lets you observe the effect) | ControllableInjection, ControllableNoInjection |
RunEndEvent |
Run finished and was evaluated | RunEndResponse(done=...) |
ControllableInjection(value=...) supplies a value; ControllableNoInjection
declines (the target falls back to its own default). The Controller itself sends
ControllableNoInjection automatically for any controllable that is out of
scope, so the optimizer is never even asked about surfaces it does not control.
There is a second one-way event, ObservableEvent, that the target emits to
record what happened (the prompt it sent, the response it got, a tool call). The
optimizer does not respond to these; it reads them from the trajectory.
Trajectory
Every run produces a trajectory: an ordered list of Event | EventResponse
objects, the single source of truth for what happened. It contains:
ObservableEvent- one-way facts emitted by the target (e.g. the model request and the model response). Each has anobservable(which names it and carries its security domain) and acontentpayload.ControllablePreCallEvent/ControllablePostCallEventand their responses (ControllableInjection/ControllableNoInjection) - recorded by the Controller around each injection point.RunEndEvent- written by the Controller after evaluation; it carries theevaluationresult so the optimizer can read feedback from past runs.
RunStartEvent is not stored in the trajectory (it carries the trajectory
itself and adds nothing as a record). RunEndEvent is stored.
The optimizer does not see the raw trajectory. It sees a FilteredTrajectory:
a read-only view containing only entries whose security domain is in scope (plus
entries explicitly tagged with no domain). The evaluator, by contrast, sees the
full unfiltered trajectory.
Security domains
A security domain is a trust boundary: a labelled surface that an attacker either does or does not control. The target defines them as a tree (or a forest of independent trees). A parent tag includes all its descendants.
system (root: full control of the system)
├── system_prompt (can override the system prompt)
└── user (can send user messages)
When you scope a Controller to {user}, the Controller filters everything the
optimizer can see or touch to that boundary:
- only
usercontrollables are offered to it, - only
userobservables are passed to it, - its trajectory view contains only
user-tagged entries, - evaluation sub-scores outside
userare hidden, - and any out-of-scope controllable is auto-answered
ControllableNoInjection.
This is how you ask precise questions like “what can an attacker achieve if they control only the user message, and nothing else?” Choosing these boundaries well, based on the real trust structure of the system, is the most important modelling decision you make. Security Domains is devoted to it.
Scores
An EvaluationResult carries a success: bool, a primary_score: Score, and
optional named sub_scores. A Score has:
value: float- higher is better (the scale is whatever the task defines);security_domain: SecurityDomainTag | None- on a sub-score, which boundary it pertains to (Nonemeans always visible); theprimary_scorecarries nosecurity_domain;name: str- the dimension name (default"primary").
The Controller tracks the best primary score across a task’s runs. sub_scores
carrying an out-of-scope security_domain are filtered out of the feedback the
optimizer sees (an untagged sub-score, security_domain=None, stays visible);
the primary_score, success, and rationale are always shown, because the
attacker needs the main signal to improve. The primary_score is never
scope-filtered: it is the unscoped optimization signal.