Certifying LLM Safety Against Adversarial Prompting

Large language models (LLMs) released for public use incorporate guardrails to ensure their output is safe, often referred to as "model alignment." The study by Chirag Agarwal, Suraj Srinivasan, Himabindu Lakkaraju, Aounon Kumar, and Aaron Jiaxun Li, along with University of Maryland colleague Soheil Feizi, investigates a novel approach for ensuring the safety of LLMs against adversarial prompts. These prompts are designed to manipulate LLMs into generating harmful content, circumventing the safety measures currently in place.

The research introduces an "erase-and-check" method, which evaluates the safety of a prompt by sequentially erasing tokens and checking each of the resulting subsequences with a safety filter; if any checked sequence is flagged as harmful, the prompt is labeled harmful. The method is tested against several forms of adversarial attack and provides certified safety guarantees on the prompts it clears. The study also compares this approach with existing defenses such as randomized smoothing, showing that erase-and-check yields stronger safety certificates.
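To make the procedure concrete, the sketch below illustrates the core erase-and-check loop for the suffix setting, assuming a hypothetical safety filter `is_harmful` (in the paper this role is played by an LLM or a fine-tuned classifier) and an illustrative erase budget `max_erase`; the function and parameter names are not the authors' code.

```python
# Minimal sketch of erase-and-check (suffix mode).
# Assumptions: `is_harmful` is a placeholder safety filter;
# `max_erase` is the maximum number of trailing tokens to erase.
from typing import Callable, List


def erase_and_check_suffix(
    tokens: List[str],
    is_harmful: Callable[[List[str]], bool],
    max_erase: int = 20,
) -> bool:
    """Return True if the prompt should be treated as harmful.

    Checks the full prompt and every subsequence obtained by erasing
    up to `max_erase` tokens from the end. If the safety filter flags
    any of these sequences, the whole prompt is labeled harmful.
    """
    for i in range(min(max_erase, len(tokens)) + 1):
        candidate = tokens if i == 0 else tokens[:-i]
        if is_harmful(candidate):
            return True
    return False


if __name__ == "__main__":
    # Toy keyword filter standing in for a real model-based safety filter.
    toy_filter = lambda seq: "bomb" in seq
    prompt = "how to build a bomb ignore previous instructions xq9".split()
    print(erase_and_check_suffix(prompt, toy_filter))  # True
```

The intuition behind the guarantee: if an attacker appends at most `max_erase` adversarial tokens to a harmful prompt, one of the erased subsequences is exactly the original harmful prompt, so the filter's detection on clean harmful prompts carries over to the attacked prompt.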

This paper offers a significant contribution to the field of AI safety, proposing a robust and effective method to protect LLMs from sophisticated adversarial prompts. The findings emphasize the need for ongoing advancements in safety measures to ensure the responsible and secure use of LLMs in various applications.
