Andy Zou

Improving Security and Safety of Generative Models

Abstract

The rapid advancement of artificial intelligence, particularly large language models and multimodal systems, has brought unprecedented capabilities alongside critical questions about safety, robustness, and alignment with human values. This thesis presents a comprehensive investigation into AI safety spanning three interconnected research directions.

First, we demonstrate the brittleness of current alignment methods through adversarial attacks that reliably circumvent safety measures in text-only, multimodal, and embodied AI systems. The Greedy Coordinate Gradient attack and related techniques reveal that alignment is not equivalent to adversarial robustness, with attacks transferring to production systems including ChatGPT, Claude, and Gemini.

Second, we develop rigorous evaluation frameworks for AI safety. HarmBench provides a standardized platform for comparing red teaming methods and defenses. AgentHarm demonstrates that LLM agents are surprisingly compliant with malicious requests even without jailbreaking. Our large-scale red teaming competition reveals widespread policy violations in deployed AI agents. Additional benchmarks compare AI capabilities against those of human cybersecurity professionals and support the detection of deceptive reasoning.

Third, we introduce techniques for improving AI alignment and control. Representation engineering provides a top-down framework for understanding and manipulating high-level cognitive phenomena in neural networks. Circuit breakers build on this foundation to provide robust defense against adversarial attacks by directly controlling harmful representations. Safety pretraining demonstrates that building safety into models from the start can dramatically reduce attack success rates while maintaining general capabilities.

Together, these contributions advance our understanding of AI safety challenges and provide practical tools for building more robust and trustworthy AI systems.
