ShieldAgent

Shielding Agents via Verifiable Safety Policy Reasoning

Zhaorun Chen¹, Mintong Kang², Bo Li¹,²

¹University of Chicago, ²University of Illinois at Urbana-Champaign

ICML 2025

Abstract: LLM-based autonomous agents have demonstrated wide adoption in various applications, leveraging their ability to access sensitive data and make autonomous decisions. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first LLM-based guardrail agent designed to enforce explicit safety policy compliance of the action trajectory of other agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. During inference, it retrieves relevant rule circuits based on the invoked action and generates a shielding plan, leveraging a comprehensive tool library and executable code for formal verification. Then it performs probabilistic inference to assign safety labels and report rule violations. Recognizing the lack of guardrail benchmarks for agents, we introduce SA-Guard, a dataset with 2K safety-related instructions across 7 risk categories and 6 web environments, each paired with risky trajectories generated under 2 SOTA attacks. Experiments show that ShieldAgent achieves SOTA on SA-Guard and existing benchmarks, outperforming prior methods by 11.3% on average with a high rule recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding LLM agents.

Policy Model Construction

Automatic Extraction: Parses policy documents and extracts verifiable rules

Rule Refinement: Iteratively refines rules to ensure they can be concretely verified

Action-Based Circuits: Organizes rules into probabilistic circuits for each agent action
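The idea of organizing rules per action can be illustrated with a minimal sketch. The `Rule` and `PolicyModel` classes, their field names, and the example rules and weights below are illustrative assumptions, not the ShieldAgent implementation; in particular, the sketch simply indexes rules by action name and omits the extraction, refinement, and clustering steps.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """One verifiable rule (hypothetical schema)."""
    rule_id: str
    action: str          # the agent action this rule constrains
    predicates: tuple    # predicate names that need truth values at verification time
    weight: float        # soft weight; a made-up number here


@dataclass
class PolicyModel:
    """Maps each action to the list of rules (its 'circuit') that constrain it."""
    circuits: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, rule: Rule) -> None:
        self.circuits[rule.action].append(rule)

    def retrieve(self, action: str) -> list:
        # Look up only the rules tied to the invoked action;
        # .get() avoids creating an empty entry for unknown actions.
        return list(self.circuits.get(action, []))


model = PolicyModel()
model.add(Rule("r1", "send_email", ("has_attachment", "attachment_scanned"), 2.0))
model.add(Rule("r2", "send_email", ("recipient_is_authorized",), 1.5))
model.add(Rule("r3", "buy_stock", ("user_confirmed_order",), 3.0))

print([r.rule_id for r in model.retrieve("send_email")])  # ['r1', 'r2']
```

Keying the circuits by action means that, at inference time, only the rules for the invoked action need to be checked rather than the whole policy.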

Verification Pipeline

Efficient Localization: Identifies relevant rule circuits for each action

Formal Verification: Executes specialized tools to verify each policy rule

Probabilistic Inference: Assigns safety labels and identifies rule violations
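The three stages above can be sketched end to end as follows. This is a simplified stand-in under stated assumptions: rules are plain Python callables over a hypothetical `state` dictionary, and a logistic score over signed rule weights substitutes for the circuit-level probabilistic inference. The `CIRCUITS` table, the `shield` function, the weights, and the 0.5 threshold are all invented for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

# A rule is (rule_id, weight, check); check(state) returns True when the rule is satisfied.
RuleSpec = Tuple[str, float, Callable[[dict], bool]]

# Hypothetical rule circuits, keyed by action name.
CIRCUITS: Dict[str, List[RuleSpec]] = {
    "send_email": [
        ("no_unscanned_attachment", 2.0,
         lambda s: not s["has_attachment"] or s["attachment_scanned"]),
        ("authorized_recipient", 1.5,
         lambda s: s["recipient"] in s["allowed_recipients"]),
    ],
}


def shield(action: str, state: dict, threshold: float = 0.5):
    """Return (label, p_safe, violated_rule_ids) for one invoked action."""
    rules = CIRCUITS.get(action, [])               # 1. localize relevant rules
    if not rules:
        return "safe", 1.0, []                     # assumption: unconstrained actions pass
    results = [(rid, w, bool(check(state))) for rid, w, check in rules]  # 2. verify each rule
    violated = [rid for rid, _, ok in results if not ok]
    # 3. soft inference: satisfied rules add their weight, violated rules subtract it
    score = sum(w if ok else -w for _, w, ok in results)
    p_safe = 1.0 / (1.0 + math.exp(-score))
    label = "safe" if p_safe >= threshold else "unsafe"
    return label, p_safe, violated


state = {
    "has_attachment": True,
    "attachment_scanned": False,
    "recipient": "alice@example.com",
    "allowed_recipients": {"alice@example.com"},
}
label, p_safe, violated = shield("send_email", state)
print(label, round(p_safe, 3), violated)  # unsafe 0.378 ['no_unscanned_attachment']
```

Because the weights are soft, a violated low-weight rule can be outvoted by satisfied high-weight rules in this sketch; the violated rule IDs are still returned so that they can be reported alongside the label.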

ShieldAgent-Bench

Comprehensive Coverage: 2K instructions across 7 risk categories and 6 web environments

Attack Simulation: Risky trajectories generated under agent-based and environment-based attacks

Explicit Risk Definition: Clear safety policies and violation mechanisms

Demo Videos

Safeguarding Email Agents

ShieldAgent protects email agents against malicious poisoning attacks, preventing them from sending malware attachments and blocking other harmful actions.

Code Security, OWASP Policy, Poisoning Attacks

Safeguarding Financial Agents

ShieldAgent monitors financial transactions and prevents purchases of unintended stocks and items, ensuring compliance with financial safety policies.

Financial Security, FINRA Policy, Prompt Injection Attacks

Overview of ShieldAgent

Overview of ShieldAgent: (Top) From government regulations or platform-wide policies, ShieldAgent first extracts verifiable rules and iteratively refines them to ensure each rule is accurate, concrete, and atomic. It then applies spectral clustering to assemble these rules into a robust action-based safety policy model, associating actions with their corresponding constraints (with weights learned from real or simulated data). (Bottom) During inference, ShieldAgent retrieves the rule circuits relevant to the invoked action and verifies the rules one by one. Referencing existing workflows from a hybrid memory module, it first generates a step-by-step shielding plan, whose operations are supported by a rich tool library, to assign truth values to all predicates; it then produces executable code to formally verify each rule. Finally, ShieldAgent runs probabilistic inference within the rule circuits to produce a safety label with explanations and to report any violated rules.

ShieldAgent-Bench

ShieldAgent-Bench: The first comprehensive agent guardrail benchmark with 2K safety-related instructions across 7 risk categories and 6 web environments. Each instruction is paired with risky agent trajectories generated under two types of attacks, capturing risks during both training and deployment.
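To make the benchmark's structure concrete, one sample could be represented roughly as below. This is a hypothetical schema: the `BenchSample` class, its field names, and the placeholder values (`env_a`, `risk_a`, and so on) are assumptions for illustration and are not taken from the released dataset.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class BenchSample:
    """One instruction paired with an agent trajectory (hypothetical schema)."""
    instruction: str
    environment: str      # one of the web environments
    risk_category: str    # one of the risk categories
    attack_type: str      # "agent-based" or "environment-based"
    trajectory: List[str]


# Placeholder records; these are not real benchmark data.
samples = [
    BenchSample("placeholder instruction 1", "env_a", "risk_a", "agent-based", ["click", "type"]),
    BenchSample("placeholder instruction 2", "env_a", "risk_b", "environment-based", ["click"]),
    BenchSample("placeholder instruction 3", "env_b", "risk_a", "agent-based", ["goto"]),
]


def coverage(items: List[BenchSample]) -> Counter:
    """Count samples per (environment, risk category) cell."""
    return Counter((s.environment, s.risk_category) for s in items)


print(coverage(samples))
```

A per-cell count like this is one simple way to check how evenly a set of samples spans environments and risk categories.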

BibTeX

@article{chen2025shieldagent,
    title={{ShieldAgent}: Shielding Agents via Verifiable Safety Policy Reasoning},
    author={Chen, Zhaorun and Kang, Mintong and Li, Bo},
    journal={arXiv preprint arXiv:2503.22738},
    year={2025}
}