Control Reference: ICO-SC-2

Clause Description
Organisations should implement content filtering and guardrails to prevent the generation or dissemination of harmful, illegal, biased, or otherwise inappropriate content by AI systems. This includes technical measures to detect and block unsafe outputs, as well as processes to ensure guardrails are effective, regularly tested, and updated in response to emerging risks.

Why This Control Exists
Agentic AI can generate or act on content that violates laws (hate speech, misinformation, explicit material), causes harm (self-harm encouragement, radicalisation), breaches privacy (PII exposure), or undermines trust (offensive or deceptive outputs). The ICO requires proactive filtering and guardrails to protect individuals, uphold data protection principles (especially lawfulness and fairness), and prevent reputational or legal liability for organisations deploying autonomous AI.

How Katyar Helps Achieve Compliance

Katyar implements content filtering and guardrails through its semantic firewall and built-in guardrail engine, which scan both inputs and outputs in real time and take preventive action when harmful content is detected.

Evaluation Criteria
Katyar considers the control satisfied when both of the following are true:
- Guardrails have actively scanned content
- At least one harmful or unsafe output has been blocked (or masked/redacted) by guardrails

Evidence collected includes:

- Number of guardrail scan events (inputs/outputs checked)
- Number of blocked/masked/redacted events due to content violations
- Breakdown by content threat type: harmful language, explicit content, hate speech, misinformation flags, PII in output, etc.
- Actions taken: full block, partial redaction, safe fallback response, escalation to HITL
- Recent detection timestamps and associated agent/tool/context
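The two evaluation criteria and the evidence breakdown above can be sketched as a check over a guardrail event stream. The event shapes and field names below are assumptions for illustration, not Katyar's documented export format:

```python
from collections import Counter

# Hypothetical guardrail event records; real Katyar events may differ in shape.
events = [
    {"type": "guardrail.scanned", "threat": None, "action": None},
    {"type": "guardrail.scanned", "threat": "toxic_language", "action": "block"},
    {"type": "guardrail.blocked", "threat": "toxic_language", "action": "block"},
    {"type": "output.masked", "threat": "pii_in_output", "action": "redact"},
]

scans = sum(1 for e in events if e["type"] == "guardrail.scanned")
preventions = sum(
    1 for e in events if e["type"] in ("guardrail.blocked", "output.masked")
)

# Breakdown by content threat type and by response action taken.
by_threat = Counter(e["threat"] for e in events if e["threat"])
by_action = Counter(e["action"] for e in events if e["action"])

# The control is satisfied only when both criteria hold:
# content has been scanned AND at least one prevention occurred.
satisfied = scans > 0 and preventions > 0
print(satisfied, dict(by_threat), dict(by_action))
```

Requiring both conditions matters: a deployment where guardrails scan but never block anything during testing gives no evidence that the blocking path actually works.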
Key capabilities:

- Real-time Content Scanning: Every prompt and output is scanned by the semantic firewall for harmful, illegal, or policy-violating content.
- Multi-Layer Guardrails: Built-in detectors for:
  - Prompt injection & jailbreak attempts
  - Harmful/toxic language
  - Explicit or violent content
  - Misinformation patterns
  - PII / sensitive data leakage in outputs
- Flexible Response Options: Configurable actions on detection:
  - Block execution entirely
  - Mask/redact sensitive parts
  - Return a safe fallback message
  - Escalate to HITL for borderline cases
- Audit-Ready Logging: Every scan and block event is logged with the full input/output, threat type, confidence score, response action, and timestamp.
- Dashboard Guardrail Insights: Real-time view of scan volume, block rate, top threat types, and blocked content examples (anonymised).
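The scan, respond, and log capabilities above fit together as a single pipeline. The following is a minimal sketch under stated assumptions: the detector patterns, threat names, and response map are illustrative stand-ins, not Katyar's real rules or configuration schema:

```python
import re
from datetime import datetime, timezone

# Illustrative detectors only; a real semantic firewall uses far richer
# (model-based) classifiers than regular expressions.
DETECTORS = {
    "pii_in_output": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-like pattern
    "prompt_injection": re.compile(r"ignore (all )?previous instructions", re.I),
}

# Configurable response per threat type: block, redact, fallback, or escalate.
RESPONSES = {"pii_in_output": "redact", "prompt_injection": "block"}

audit_log = []  # audit-ready trail: one record per scan

def scan(text: str) -> dict:
    """Scan text, apply the configured response, and append an audit record."""
    for threat, pattern in DETECTORS.items():
        if pattern.search(text):
            action = RESPONSES.get(threat, "escalate")  # default: send to HITL
            if action == "redact":
                text = pattern.sub("[REDACTED]", text)
            elif action == "block":
                text = "This request was blocked by content guardrails."
            record = {"threat": threat, "action": action,
                      "timestamp": datetime.now(timezone.utc).isoformat()}
            audit_log.append(record)
            return {"allowed": action != "block", "output": text, **record}
    audit_log.append({"threat": None, "action": "allow",
                      "timestamp": datetime.now(timezone.utc).isoformat()})
    return {"allowed": True, "output": text, "threat": None, "action": "allow"}

result = scan("Customer SSN is 123-45-6789")
print(result["action"], result["output"])
```

Note that the audit record captures threat type, response action, and timestamp for every scan, including clean ones, which is what makes the scan-volume and block-rate dashboard figures computable later.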
To demonstrate this control:

- Ensure guardrail scanning is enabled (default in workspace settings).
- Run agent scenarios that could trigger content filtering (e.g., test harmful prompts, include mock toxic language or PII in outputs).
- Confirm both scanning and blocking occur:
  - Scan events appear as guardrail.scanned
  - Block events appear as guardrail.blocked or output.masked
- Check Compliance dashboard → ICO-SC-2 card to verify both scan and block events exist.
- (Recommended) Review blocked events in Observability tab and adjust guardrail sensitivity/response rules if needed.
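The confirmation step above can be automated as a small check over an exported event stream. The export format and helper below are assumptions for illustration; only the event names (guardrail.scanned, guardrail.blocked, output.masked) come from the documentation above:

```python
# Verify ICO-SC-2 evidence exists in a captured event stream.
def verify_ico_sc_2(events: list[dict]) -> dict:
    """Return whether both scanning and blocking evidence are present."""
    types = {e["type"] for e in events}
    return {
        "scanning_active": "guardrail.scanned" in types,
        "blocking_observed": not types.isdisjoint(
            {"guardrail.blocked", "output.masked"}
        ),
    }

# Example: events captured after running mock harmful-prompt scenarios.
exported = [
    {"type": "guardrail.scanned"},
    {"type": "guardrail.blocked"},
]
status = verify_ico_sc_2(exported)
assert all(status.values()), f"ICO-SC-2 not satisfied: {status}"
print(status)
```

A check like this can run in CI against staging traffic so that a misconfiguration that silently disables blocking is caught before it reaches the compliance dashboard.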
Together, this demonstrates:

- Evidence of active scanning of inputs and outputs
- Preventive actions — actual blocks, redactions, or safe fallbacks
- Effectiveness — harmful content prevented from reaching users or external systems
- Coverage — guardrails applied to relevant risk vectors (toxic language, PII, illegal content)
- Traceability — full audit trail of scan → detection → response
Read the full UK ICO Guidance on AI and data protection (including content safety):
ICO Guidance on AI and data protection
(Relevant sections: “Safety and security” and “Preventing harm”)
