Certifying LLM Safety Against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardrails to ensure their output is safe, often referred to as “model alignment.” The study presented by Chirag Agarwal, Suraj Srinivasan, Himabindu Lakkaraju, Aounon Kumar, and Aaron Jiaxun Li, along with University of Maryland colleague Soheil Feizi, investigates a […]