AI red teaming starts with the application, not only the model.
Most production AI systems combine a model with prompts, retrieval, memory, tools, APIs, user roles, and business workflows. Security testing must evaluate the complete system and the decisions it is allowed to make.
1. Map the AI trust boundaries
- Identify every model, prompt, retrieval source, tool, API, memory store, and human approval step.
- Document which inputs are trusted, untrusted, tenant-specific, or externally sourced.
- Define what the AI system must never reveal, change, approve, or execute.
2. Test instruction handling
- Attempt direct and indirect prompt injection through user input, uploaded content, retrieved documents, and connected tools.
- Test whether system instructions, hidden prompts, or internal reasoning artifacts can be exposed.
- Evaluate whether lower-trust content can override higher-priority instructions.
3. Validate data isolation
- Test cross-user and cross-tenant retrieval boundaries.
- Check whether sensitive data appears in logs, caches, traces, embeddings, or model responses.
- Confirm that authorization is enforced by the application, not delegated to the model.
4. Challenge tools and agents
- Test whether the model can invoke tools outside the user’s permissions.
- Attempt parameter manipulation, chained actions, unsafe retries, and approval bypasses.
- Verify that high-impact actions require deterministic controls and appropriate human review.
5. Measure guardrail effectiveness
Record which controls prevent, detect, or limit each scenario. A useful AI red-team report explains the attack path, the business impact, the control gap, and the remediation pattern.
