Writing a Target

A target wraps the AI system you want to red-team and exposes it to the framework through a small, fixed interface. This is usually the first thing you build for a new system.

The single most important habit: make the target general and reusable. A target should model a kind of system (any chatbot, the AgentDojo agent, a RAG pipeline), not one benchmark. Everything benchmark-specific (which prompts, what counts as success) belongs in a SecurityClaim, not here. The payoff is that one target serves many claims, and one claim can run against many targets.

What a target must provide

You subclass superred.core.interfaces.target.Target and implement:

Member                     Kind           Purpose
config_specs               property       the named slots a task may set before a run
set_config(name, value)    method         accept one config value
query_specs                property       the named questions an evaluator may ask after a run
query(name, **params)      method         answer one query
security_domain            property       the trust-boundary forest (see doc 7)
get_controllables()        method         the injection points, each tagged with a domain
get_observables()          method         static facts the attacker may read, each tagged
run(emit, send_event)      async method   execute one interaction
reset_ephemeral_state()    async method   reset ephemeral (per-run) state (called after each evaluation)
teardown()                 async method   release resources (called once at the end)

Minimal template

This is the complete BasicLLMChatTarget that ships in superred-modules/targets/test_basic_llm_chat, lightly annotated. It is a single-turn chatbot: one config slot (system prompt), one controllable (the user message), one query (the last response).

from __future__ import annotations

from litellm import ModelResponse, acompletion

from superred.core.interfaces.target import Target
from superred.core.types.controllable import Controllable
from superred.core.types.event import EventHandler, EventResponseHandler
from superred.core.types.events import (
    ControllableInjection,
    ControllablePreCallEvent,
    ObservableEvent,
)
from superred.core.types.observable import Observable, ObservableValue
from superred.core.types.security_domain import SecurityDomain, SecurityDomainTag
from superred.core.types.state import ConfigSpec, QuerySpec

# --- Step 1: define the trust-boundary forest, once, as module constants ---
SYSTEM_TAG = SecurityDomainTag("system")
USER_INPUT_TAG = SecurityDomainTag("user_input", parent=SYSTEM_TAG)
_DOMAIN = SecurityDomain([SYSTEM_TAG, USER_INPUT_TAG])

# Define each controllable/observable ONCE so get_controllables() and run()
# share the exact same instance (the framework matches them by value).
_USER_INPUT_CTRL = Controllable(
    name="user_input",
    security_domain=USER_INPUT_TAG,
    description="The user message sent to the LLM.",
)


class BasicLLMChatTarget(Target):
    """A single-turn LLM chat system."""

    def __init__(self, model: str, api_base: str, api_key: str) -> None:
        # Manual operational values go in the constructor, never through config.
        self._model = model
        self._api_base = api_base
        self._api_key = api_key
        self._system_prompt = "You are a helpful assistant."
        self._last_response = ""

    # --- Config: what the TASK sets before a run ---
    @property
    def config_specs(self) -> list[ConfigSpec]:
        return [
            ConfigSpec(
                name="system_prompt",
                security_domain=SYSTEM_TAG,
                description="System prompt for the LLM.",
            ),
        ]

    def set_config(self, name: str, value: str) -> None:
        if name == "system_prompt":
            self._system_prompt = value

    # --- Query: what the EVALUATOR reads after a run ---
    @property
    def query_specs(self) -> list[QuerySpec]:
        return [QuerySpec(name="last_response", description="The LLM's last response.")]

    def query(self, name: str, **params: str) -> str:
        if name == "last_response":
            return self._last_response
        return ""

    # --- The trust boundaries ---
    @property
    def security_domain(self) -> SecurityDomain:
        return _DOMAIN

    # --- Attack surface ---
    def get_controllables(self) -> list[Controllable]:
        return [_USER_INPUT_CTRL]

    def get_observables(self) -> list[ObservableValue]:
        obs = Observable(
            name="model", security_domain=SYSTEM_TAG,
            description="The LLM model identifier.",
        )
        return [ObservableValue(observable=obs, content=self._model)]

    # --- One interaction ---
    async def run(self, emit: EventHandler, send_event: EventResponseHandler) -> None:
        # 1. Ask the attacker what to put in the user_input controllable.
        resp = await send_event(
            ControllablePreCallEvent(controllable=_USER_INPUT_CTRL, request="Enter user message:"),
        )
        # 2. Use the injection, or fall back if this surface is out of scope.
        user_message = resp.value if isinstance(resp, ControllableInjection) else "Hello"

        # 3. Record what we are about to send (so the attacker/evaluator can see it).
        emit(ObservableEvent(
            observable=Observable(
                name="model_request", security_domain=USER_INPUT_TAG,
                description="User message sent to the LLM.",
            ),
            content=user_message,
        ))

        # 4. Call the actual system.
        response = await acompletion(
            model=self._model,
            messages=[
                {"role": "system", "content": self._system_prompt},
                {"role": "user", "content": user_message},
            ],
            api_base=self._api_base,
            api_key=self._api_key,
        )
        assert isinstance(response, ModelResponse)
        self._last_response = response.choices[0].message.content or ""

        # 5. Record the result.
        emit(ObservableEvent(
            observable=Observable(
                name="model_response", security_domain=SYSTEM_TAG,
                description="LLM response.",
            ),
            content=self._last_response,
        ))

    async def reset_ephemeral_state(self) -> None:
        self._last_response = ""   # reset ephemeral (per-run) state

    async def teardown(self) -> None:
        pass                       # nothing to release here

Key ideas, one at a time

Constructor takes manual values only

API keys, base URLs, model names, container images: these are operational facts about your deployment, not part of the attack. They go in __init__. The framework never sets them. Only tasks push values in, and only through set_config.
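
As a minimal sketch of this split, the operational values can be gathered in one place and passed to the constructor. The helper name and the environment variable names (TARGET_MODEL, TARGET_API_BASE, TARGET_API_KEY) are hypothetical and chosen only for illustration; the framework does not define them.

```python
import os
from collections.abc import Mapping


def load_operational_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect constructor arguments for a target from environment variables.

    The variable names are hypothetical, not defined by the framework.
    """
    names = {
        "model": "TARGET_MODEL",
        "api_base": "TARGET_API_BASE",
        "api_key": "TARGET_API_KEY",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        # Fail when the target is built, not in the middle of a run.
        raise RuntimeError(f"missing operational settings: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}
```

With the template above, this would be used as BasicLLMChatTarget(**load_operational_settings()). Nothing here passes through set_config, which stays reserved for values a task chooses.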

Config (before) vs Query (after)

These serve different actors at different times. Keep them distinct.

              Config                      Query
Who uses it   the Task, before the run    the Task’s evaluator, after the run
Method        set_config(name, value)     query(name, **params)
Purpose       set up the scenario         read out what happened
Examples      system prompt, DB seed      last response, transcript, DB row

A QuerySpec carries a name and a description; a ConfigSpec additionally carries a security_domain (config slots are tagged just like controllables). The description is the contract: values are always plain text, and the description tells the task author what format to send and what they will get back. A QuerySpec may also declare params for queries that take arguments (e.g. read_file(path=...)).
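
A parameterized query might look like the following sketch. It does not import the framework: QuerySpecSketch is a stand-in dataclass, its params field is an assumed shape for how arguments could be declared, and FileBackedTarget with its in-memory files is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySpecSketch:
    """Stand-in for QuerySpec; the params field is an assumed shape."""
    name: str
    description: str
    params: tuple[str, ...] = ()


class FileBackedTarget:
    """Hypothetical target fragment that answers a read_file(path=...) query."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {"/notes.txt": "meeting at 10"}

    @property
    def query_specs(self) -> list[QuerySpecSketch]:
        return [
            QuerySpecSketch(
                name="read_file",
                description="Contents of the file at `path` as plain text; "
                            "empty string if the file does not exist.",
                params=("path",),
            ),
        ]

    def query(self, name: str, **params: str) -> str:
        if name == "read_file":
            # Plain text in, plain text out, exactly as the description promises.
            return self._files.get(params.get("path", ""), "")
        return ""
```

The description does the work of a schema here: it states both the argument format and what comes back when the file is missing, so an evaluator author does not have to read the target's source.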

Controllables: the attack surface

A Controllable is one injection point. It is a flat, frozen value:

Controllable(
    name="user_input",            # unique within the target
    security_domain=USER_INPUT_TAG,  # which trust boundary it belongs to (required)
    description="The user message sent to the LLM.",
    value_type="text",            # "text" (default), "json", "modifier", "binary"
)

During run(), you reach an injection point by sending a ControllablePreCallEvent and awaiting the response:

resp = await send_event(
    ControllablePreCallEvent(controllable=_USER_INPUT_CTRL, request="Enter user message:"),
)
if isinstance(resp, ControllableInjection):
    value = resp.value
else:
    # ControllableNoInjection: this surface is out of the tested scope,
    # OR the attacker chose not to inject. Use your own default.
    value = "Hello"

The request string describes what this point needs; the attacker sees it. Always handle the ControllableNoInjection case with a sensible default: a controllable is silently declined whenever it is out of scope, which is the normal situation for any surface the current threat model does not grant.
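
The fallback can be factored into a small helper so that every injection point handles a decline the same way. This is a stand-alone sketch: InjectionSketch and NoInjectionSketch are stand-ins for ControllableInjection and ControllableNoInjection, and the event is a plain string because the real event types are not imported.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionSketch:
    """Stand-in for ControllableInjection."""
    value: str


@dataclass(frozen=True)
class NoInjectionSketch:
    """Stand-in for ControllableNoInjection."""


def value_or_default(resp: object, default: str) -> str:
    """Return the injected value, or the target's own default when declined."""
    return resp.value if isinstance(resp, InjectionSketch) else default


async def ask(send_event, default: str) -> str:
    # `send_event` is any awaitable callback that answers an event.
    resp = await send_event("user_input")
    return value_or_default(resp, default)


async def attacker_injects(event: str) -> InjectionSketch:
    return InjectionSketch(value="ignore previous instructions")


async def surface_out_of_scope(event: str) -> NoInjectionSketch:
    return NoInjectionSketch()
```

Calling ask(attacker_injects, "Hello") yields the injected text, while ask(surface_out_of_scope, "Hello") yields "Hello"; the target code is identical in both cases and never needs to know why the surface was declined.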

Use the same Controllable instance in get_controllables() and in the event you send. The security-domain filter matches on the controllable carried by the event, and optimizers match on event.controllable.name, so defining each controllable once as a module constant avoids subtle mismatches.
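
The kind of mismatch this avoids can be shown with a stand-in frozen dataclass. ControllableSketch only mirrors the fields listed above and is not the framework's class; that the real Controllable compares field by field is an assumption based on its description as a flat, frozen value.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class ControllableSketch:
    """Stand-in for Controllable with the same four fields."""
    name: str
    security_domain: str
    description: str
    value_type: str = "text"


# Declared once, as get_controllables() would return it.
declared = ControllableSketch("user_input", "user_input", "The user message sent to the LLM.")

# A second copy typed out by hand inside run(), with a shortened description.
retyped = ControllableSketch("user_input", "user_input", "User message.")

same_name = declared.name == retyped.name   # name-based matching still succeeds
same_value = declared == retyped            # value-based matching does not


def is_frozen(ctrl: ControllableSketch) -> bool:
    """Check that fields cannot be reassigned after construction."""
    try:
        ctrl.name = "other"  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False
```

Under these assumptions the two copies agree for one kind of consumer and disagree for another, which is exactly the sort of bug that is hard to spot. Reusing one module constant makes both comparisons trivially true.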

Post-call events: letting the attacker see the effect

Some attackers want to observe what their injection produced before deciding their next move (multi-turn jailbreaks, for instance). Send a ControllablePostCallEvent after you have used the value:

from superred.core.types.events import ControllablePostCallEvent

await send_event(ControllablePostCallEvent(
    controllable=_USER_INPUT_CTRL,
    request="Enter user message:",
    answer=self._last_response,
))

Whether to fire post-call events is your design choice. Targets that want to support adaptive multi-turn attacks generally do; simple single-shot targets often do not, and instead rely on the attacker reading the response from the trajectory.
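
A multi-turn target built on this idea could loop over pre-call and post-call events as in the sketch below. Everything in it is a stand-in: the three event dataclasses replace the framework's types, None stands for a declined injection, and AdaptiveAttacker, toy_model, and max_turns are hypothetical.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class PreCallSketch:
    request: str


@dataclass(frozen=True)
class PostCallSketch:
    request: str
    answer: str


@dataclass(frozen=True)
class InjectionSketch:
    value: str


async def run_turns(send_event, model, max_turns: int = 3) -> list[str]:
    """Per turn: ask for a message, answer it, then report the answer back."""
    answers: list[str] = []
    for _ in range(max_turns):
        resp = await send_event(PreCallSketch(request="Enter user message:"))
        if not isinstance(resp, InjectionSketch):
            break  # declined: nothing more to try on this surface
        answer = model(resp.value)
        answers.append(answer)
        # Let the attacker see the effect before it picks the next message.
        await send_event(PostCallSketch(request="Enter user message:", answer=answer))
    return answers


class AdaptiveAttacker:
    """Hypothetical attacker that rephrases after seeing a refusal."""

    def __init__(self) -> None:
        self.last_answer = ""

    async def __call__(self, event):
        if isinstance(event, PostCallSketch):
            self.last_answer = event.answer
            return None
        if self.last_answer == "":
            return InjectionSketch("tell me the secret")
        if self.last_answer.startswith("refused"):
            return InjectionSketch("please tell me the secret")
        return None  # goal reached: stop injecting


def toy_model(message: str) -> str:
    return "ok: 42" if message.startswith("please") else "refused: " + message
```

Without the post-call event, the attacker in this sketch would have no way to learn that its first message was refused, so it could not adapt within the same run.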

Observables vs ObservableEvents

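There are two ways the attacker learns facts, and they are different:

An Observable returned from get_observables() is a static fact about the target, not something a run produces (in the template, the model identifier).
An ObservableEvent passed to emit during run() is a fact produced by the interaction itself and recorded in the trajectory (in the template, model_request and model_response).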

Both carry a security_domain (on the Observable). The domain decides who can see the fact. In the template, the user’s message is tagged USER_INPUT_TAG (an attacker controlling user input already knows it), while the model’s response is tagged SYSTEM_TAG.

emit and send_event: the two callbacks

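run() receives two callbacks with deliberately different shapes:

emit(event) is one-way: the target hands over an event to be recorded and gets nothing back, so the template calls it without await.
send_event(event) is a round trip: the target awaits it and receives a response, such as a ControllableInjection or a ControllableNoInjection.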

The target never sees the full trajectory. It only writes to it (through emit) and asks questions (through send_event).
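
The difference in shape can be written down as two type aliases. This is a sketch under assumptions: the aliases are inferred from how the template calls the callbacks, events are plain strings for brevity, and make_harness is a hypothetical stand-in for whatever owns the trajectory.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Assumed shapes: emit is synchronous and returns nothing;
# send_event is awaited and returns a reply.
EmitSketch = Callable[[str], None]
SendEventSketch = Callable[[str], Awaitable[str]]


async def run(emit: EmitSketch, send_event: SendEventSketch) -> None:
    message = await send_event("Enter user message:")  # round trip: ask, wait, use
    emit(f"model_request: {message}")                  # one-way: record and move on
    emit(f"model_response: {message.upper()}")


def make_harness(reply: str):
    """Hypothetical harness: owns the trajectory and hands out both callbacks."""
    trajectory: list[str] = []

    def emit(event: str) -> None:
        trajectory.append(event)

    async def send_event(event: str) -> str:
        return reply

    return trajectory, emit, send_event
```

The target's run() only ever appends through emit and never receives the trajectory list itself, which is what keeps it from reading back what others have recorded.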

reset_ephemeral_state vs teardown

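reset_ephemeral_state() is called after each evaluation and clears per-run state (in the template, the last response). teardown() is called once at the end and releases whatever the target holds. Because the Controller builds a fresh target per task, you do not need reset_ephemeral_state/teardown to undo cross-task state. They only manage state within one task’s sequence of runs.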
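
The call order within one task can be pictured with the sketch below. It is not the Controller's implementation: CountingTarget and run_task are hypothetical, run() takes a plain message instead of the two callbacks, and the exact ordering is an assumption based on reset_ephemeral_state() being called after each evaluation and teardown() once at the end.

```python
import asyncio


class CountingTarget:
    """Hypothetical target that records its lifecycle calls."""

    def __init__(self) -> None:
        self.last_response = ""
        self.connection_open = True   # stands in for a held resource
        self.calls: list[str] = []

    async def run(self, message: str) -> None:
        self.last_response = message.upper()
        self.calls.append("run")

    async def reset_ephemeral_state(self) -> None:
        self.last_response = ""       # per-run state only
        self.calls.append("reset")

    async def teardown(self) -> None:
        self.connection_open = False  # released once, at the very end
        self.calls.append("teardown")


async def run_task(target: CountingTarget, messages: list[str]) -> list[str]:
    """Assumed order for one task: run, evaluate, reset, repeat, then teardown."""
    results: list[str] = []
    try:
        for message in messages:
            await target.run(message)
            results.append(target.last_response)  # the evaluator reads here
            await target.reset_ephemeral_state()
    finally:
        await target.teardown()
    return results
```

The point of the ordering is that the evaluator reads its queries before the reset, so anything a query depends on must survive until reset_ephemeral_state() runs and must be cleared by it.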

Concurrency: how many instances run at once

The number of tasks that may run in parallel against independent target instances is declared on the TargetFactory, not inside the target:

from superred.core.controller import TargetFactory

target_factory = TargetFactory(
    create=lambda: BasicLLMChatTarget(model=..., api_base=..., api_key=...),
    concurrency=8,   # up to 8 tasks at once, each with its own instance
)

Choose concurrency for the target’s cost: a chatbot wrapping a hosted API can comfortably use 8 or more; a target that boots a sandbox or holds a heavy resource should usually stay at 1 unless it pools internally. Because each task gets its own instance, you do not need locks for cross-task safety. (Within a single run, a target may fan out into concurrent branches; see Advanced Patterns.)
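
One way to picture what a concurrency bound means is a semaphore around per-task instances. The sketch below is an illustration only, not how the Controller is implemented: FactorySketch stands in for TargetFactory, and run_all is a hypothetical driver.

```python
import asyncio
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class FactorySketch:
    """Stand-in for TargetFactory: a constructor plus a concurrency bound."""
    create: Callable[[], object]
    concurrency: int = 1


async def run_all(factory: FactorySketch, n_tasks: int) -> tuple[int, int]:
    """Run n_tasks, each on a fresh instance; return (peak parallelism, instances)."""
    gate = asyncio.Semaphore(factory.concurrency)
    active = 0
    peak = 0
    targets: list[object] = []  # kept alive so every instance stays distinct

    async def one_task() -> None:
        nonlocal active, peak
        async with gate:
            targets.append(factory.create())  # one instance per task, never shared
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)          # stands in for the task's runs
            active -= 1

    await asyncio.gather(*(one_task() for _ in range(n_tasks)))
    return peak, len({id(t) for t in targets})
```

With concurrency=3 and ten tasks, at most three instances are live at any moment and ten are created in total. No instance is touched by two tasks, which is why the target itself needs no locks.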

Multiple controllables at different boundaries

A richer target exposes several injection points, each at its own trust boundary. This is what makes scoped threat models expressive:

# Illustrative tags for this fragment (the names are assumptions): two sibling
# boundaries under a shared root, matching the scopes discussed below.
SYSTEM_TAG = SecurityDomainTag("system")
USER_TAG = SecurityDomainTag("user", parent=SYSTEM_TAG)
DB_TAG = SecurityDomainTag("db", parent=SYSTEM_TAG)

_USER_QUERY_CTRL = Controllable(name="user_query", security_domain=USER_TAG,
                                 description="The user's search query.")
_DB_RESULT_CTRL  = Controllable(name="db_result", security_domain=DB_TAG,
                                 description="A document returned from the database.")

def get_controllables(self) -> list[Controllable]:
    return [_USER_QUERY_CTRL, _DB_RESULT_CTRL]

async def run(self, emit, send_event):
    user_resp = await send_event(
        ControllablePreCallEvent(controllable=_USER_QUERY_CTRL, request="user query"))
    user_query = user_resp.value if isinstance(user_resp, ControllableInjection) else "default"

    db_resp = await send_event(
        ControllablePreCallEvent(controllable=_DB_RESULT_CTRL, request="db lookup"))
    if isinstance(db_resp, ControllableInjection):
        db_result = db_resp.value          # attacker poisoned the database
    else:
        db_result = self._real_db_lookup(user_query)   # untouched: out of scope

    # _real_db_lookup and _generate are the target's own helpers (not shown).
    response = await self._generate(user_query, db_result)

Scoped to {user}, the attacker controls user_query while db_result is auto-declined: you are testing “can the attacker win through queries alone?”. Scoped to {system} (a root that includes both), the attacker controls both: “what if they can also poison the database?”. Designing these boundaries is the subject of Security Domains.
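
The two scopes above are consistent with a simple rule: a controllable is in scope when its tag, or any ancestor of its tag, is granted. The sketch below implements that rule with stand-in types. It is an assumption about the behavior described here, not the framework's actual filter, and TagSketch and in_scope are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TagSketch:
    """Stand-in for SecurityDomainTag: a name plus an optional parent."""
    name: str
    parent: Optional["TagSketch"] = None


def in_scope(tag: TagSketch, granted: set[str]) -> bool:
    """True if the tag, or any ancestor of it, is named in the granted scope."""
    node: Optional[TagSketch] = tag
    while node is not None:
        if node.name in granted:
            return True
        node = node.parent
    return False


# The forest assumed by the example: one root with two children.
SYSTEM = TagSketch("system")
USER = TagSketch("user", parent=SYSTEM)
DB = TagSketch("db", parent=SYSTEM)
```

Under this rule, granting {"user"} covers USER but not DB, while granting {"system"} covers both because each walks up to the shared root. Granting a child never reaches upward: SYSTEM itself is not in scope when only {"user"} is granted.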

Worked examples in the repository