
Lab 25.5 Solution: Building a "Fail-Closed" Safety Guardrail

Goal

This file contains the complete code for agent.py, which implements the guardrail as an ADK 2.0 safety plugin.

safety_guardrail/agent.py

import re
from google.adk import Agent, App, Event
from google.adk.plugins import BasePlugin

# --- 1. Define the Safety Plugin ---

class PIIGuardrailPlugin(BasePlugin):
    """
    Acts as a 'Fail-Closed' guardrail. It intercepts final agent responses
    and overwrites them if sensitive patterns (like credit cards) are detected.
    """
    def __init__(self, name: str = 'pii_guardrail'):
        super().__init__(name)
        # Regex for XXXX-XXXX-XXXX-XXXX
        self.cc_pattern = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")

    async def on_event_callback(self, *, event: Event, **kwargs):
        # We only intercept final responses from the agent
        if event.event_type == 'request_complete':
            # Events without content have nothing to scan
            if not event.content or not event.content.parts:
                return

            # Extract the text from the first content part
            # (text may be None for non-text parts, so fall back to "")
            response_text = event.content.parts[0].text or ""

            # Check for credit card patterns
            if self.cc_pattern.search(response_text):
                # 🛡️ INTERVENTION: Overwrite the message before it reaches the user.
                # In ADK 2.0, modifying the event.content directly updates the output.
                event.content.parts[0].text = (
                    "🛑 [SECURITY BLOCK] This response was withheld because it contained "
                    "sensitive information (PII). Please do not share financial data."
                )
                print(f"🛑 [SAFETY] Blocked potentially sensitive response in session: {event.session_id}")

# --- 2. Define the Agent ---

agent = Agent(
    name="leak_agent",
    model="gemini-3.5-flash",
    instruction="""
    You are a test agent.
    If the user asks for 'leak', respond with exactly: 'Sure, use card 9999-8888-7777-6666'.
    Otherwise, be a helpful assistant.
    """
)

# --- 3. Assemble the App ---

# We register the safety plugin here.
# It will now monitor ALL traffic for this App.
app = App(
    name="safety_demo",
    root_agent=agent,
    plugins=[PIIGuardrailPlugin()]
)
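
The plugin above blocks a response only when the regex matches. If the check itself raises an error, what reaches the user depends on how the framework handles that exception. The following is a minimal sketch, independent of ADK, of a check that treats any failure as a block. The names `guard_text` and `BLOCK_MESSAGE` are hypothetical and exist only for this illustration; the pattern is the one used in the plugin.

```python
import re

# Same pattern as the plugin: XXXX-XXXX-XXXX-XXXX
CC_PATTERN = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")

# Hypothetical replacement text, shortened for the example
BLOCK_MESSAGE = "[SECURITY BLOCK] This response was withheld."


def guard_text(text, pattern=CC_PATTERN):
    """Return (safe_text, blocked). Any error during the check counts as a block."""
    try:
        if pattern.search(text):
            return BLOCK_MESSAGE, True
        return text, False
    except Exception:
        # Fail closed: if the check cannot run, withhold the response.
        return BLOCK_MESSAGE, True


if __name__ == "__main__":
    print(guard_text("Sure, use card 9999-8888-7777-6666"))
    print(guard_text("The weather is sunny."))
    print(guard_text(None))  # search() raises TypeError on None, so this is blocked
```

Inside a plugin callback, the same idea would mean wrapping the scan in a `try`/`except` and overwriting the response in the `except` branch, rather than letting the original text pass through when the scan fails.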

Self-Reflection Answers

  1. Why is it better to block the response in a plugin rather than just adding "Don't say credit card numbers" to the agent's instructions?

    • Answer: LLMs can be manipulated via prompt injection or complex reasoning to bypass their own instructions ("jailbreaking"). A Plugin is a deterministic, programmatic layer that runs outside the LLM's reasoning process. It provides a reliable "fail-closed" guarantee that instructions alone cannot offer.
  2. How would you extend this plugin to log all blocked responses to a security database for auditing?

    • Answer: Inside the if self.cc_pattern.search(...) block, you could call an asynchronous logging function (using asyncio.create_task to avoid blocking) that writes the event.session_id, event.user_id, and the original response_text to BigQuery or a security log.
  3. Can you think of other "safety" use cases for this pattern?

    • Answer:
      • Competitor Masking: Automatically redact mentions of specific competitors.
      • Harmful Content Filter: Use a secondary, smaller "Safety Model" inside the plugin to check for toxic tone.
      • Link Verification: Check any URLs generated by the agent against a whitelist of approved corporate domains.