
ShieldAgent
Shielding Agents via Verifiable Safety Policy Reasoning
¹University of Chicago, ²University of Illinois at Urbana-Champaign

Automatic Extraction: Parses policy documents and extracts verifiable rules
Rule Refinement: Iteratively refines rules to ensure they can be concretely verified
Action-Based Circuits: Organizes rules into probabilistic circuits for each agent action
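
As a rough illustration of the action-based organization described above, the sketch below groups extracted rules by the agent action they constrain. It is a minimal sketch, not ShieldAgent's implementation: the `Rule` fields, the `weight` attribute, the `build_action_circuits` helper, and the example rules are all hypothetical, and a plain list stands in for each probabilistic circuit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One extracted policy rule (hypothetical schema)."""
    rule_id: str
    description: str
    actions: tuple   # agent actions this rule constrains
    weight: float    # assumed soft-rule weight, used only for illustration


def build_action_circuits(rules):
    """Group rules by action so each action maps to its own rule set."""
    circuits = defaultdict(list)
    for rule in rules:
        for action in rule.actions:
            circuits[action].append(rule)
    return dict(circuits)


# Illustrative rules only; they are not taken from any real policy document.
rules = [
    Rule("r1", "Do not send attachments flagged as malware", ("send_email",), 2.0),
    Rule("r2", "Purchases must match the user's instruction", ("buy_stock", "buy_item"), 1.5),
    Rule("r3", "Do not email credentials to external recipients", ("send_email",), 3.0),
]
circuits = build_action_circuits(rules)
```

Keying the structure by action means that checking a proposed action only has to touch the rules attached to it, not the full policy.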

Efficient Localization: Identifies relevant rule circuits for each action
Formal Verification: Executes specialized tools to verify each policy rule
Probabilistic Inference: Assigns safety labels and identifies rule violations
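
The three steps above (localize, verify, infer) can be sketched as a single function. This is an illustrative sketch under stated assumptions rather than the system's actual pipeline: the `shield` function, the tuple layout of each rule, the lambda checks standing in for verification tools, and the logistic scoring of weighted rule outcomes are all hypothetical choices made for the example.

```python
import math


def shield(action, args, circuits, threshold=0.5):
    """Return (label, p_safe, violated_rule_ids) for a proposed action."""
    # Localization: fetch only the rules attached to this action.
    rules = circuits.get(action, [])

    # Verification: run each rule's check (a stand-in for a verification tool).
    margin = 0.0
    violated = []
    for rule_id, weight, check in rules:
        if check(args):
            margin += weight
        else:
            margin -= weight
            violated.append(rule_id)

    # Inference: assumed logistic score over the weighted margin.
    p_safe = 1.0 / (1.0 + math.exp(-margin))
    label = "safe" if p_safe >= threshold else "unsafe"
    return label, p_safe, violated


# Hypothetical circuit for one action: (rule id, weight, check function).
circuits = {
    "send_email": [
        ("no_malware_attachment", 3.0, lambda a: not a.get("attachment_flagged", False)),
        ("has_recipient", 1.0, lambda a: bool(a.get("to"))),
    ],
}

risky = {"to": "bob@example.com", "attachment_flagged": True}
label, p_safe, violated = shield("send_email", risky, circuits)
```

In this sketch an action with no attached rules scores exactly 0.5 and is labeled safe at the default threshold; a real deployment would likely need an explicit default for that case.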

Comprehensive Coverage: 2K instructions across 7 risk categories and 6 web environments
Attack Simulation: Risky trajectories generated under agent-based and environment-based attacks
Explicit Risk Definition: Clear safety policies and violation mechanisms
ShieldAgent protects email agents against malicious poisoning attacks, preventing them from sending malware attachments and blocking other harmful actions.
ShieldAgent monitors financial transactions and prevents unintended purchases of stocks and items, ensuring compliance with financial safety policies.


BibTeX
@article{chen2025shieldagent,
title={{ShieldAgent}: Shielding Agents via Verifiable Safety Policy Reasoning},
author={Chen, Zhaorun and Kang, Mintong and Li, Bo},
journal={arXiv preprint arXiv:2503.22738},
year={2025}
}