As AI continues to evolve at a breakneck pace, aligning these systems with human values has become paramount. However, a recent study, “More RLHF, More Trust? On The Impact of Preference Alignment on Trustworthiness”, revealed that the methods currently used to achieve this alignment may have unexpected consequences for AI trustworthiness. The study was conducted by Aaron J. Li, a master’s student at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS); Satyapriya Krishna, a PhD graduate from Harvard SEAS and the Trustworthy AI Lab; and Himabindu Lakkaraju, Assistant Professor of Business Administration at Harvard Business School and Principal Investigator of the Trustworthy AI Lab at the Digital Data Design (D^3) Institute at Harvard. The study explored the complex relationship between AI alignment techniques and various aspects of trustworthiness, and offered crucial insights for business leaders navigating this new technology landscape.
Key Insight: The Misalignment Paradox
“We identify a significant misalignment between generic human preferences and specific trustworthiness criteria, uncovering conflicts between alignment goals and exposing limitations in conventional RLHF datasets and workflows.” [1]
The team’s research uncovered a surprising paradox in AI development: the techniques designed to align AI with human preferences may inadvertently compromise its trustworthiness. In the study, Reinforcement Learning from Human Feedback (RLHF), a common method for fine-tuning language models so that their outputs better reflect human preferences, showed mixed results across different trustworthiness metrics. While it improved performance in machine ethics (observing ethical principles) by an average of 31%, it led to concerning increases in stereotypical bias (150% increase) and privacy leakage (12% increase), and a 25% decrease in truthfulness.
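To make the mechanics a little more concrete, the sketch below illustrates the reward-modeling step at the core of RLHF: a model is trained on pairwise human preferences with a Bradley-Terry style loss so that it scores preferred responses higher, and the language model is later optimized against that learned reward. This is a minimal, hypothetical illustration in plain NumPy with toy feature vectors, not the training setup used in the study.

```python
import numpy as np

# Toy pairwise preference data. In real RLHF each row would be a representation
# of a (prompt, response) pair labeled by human annotators; here the features
# are random placeholders, purely for illustration.
rng = np.random.default_rng(0)
dim = 8
chosen = rng.normal(size=(64, dim))    # responses the annotators preferred
rejected = rng.normal(size=(64, dim))  # responses the annotators rejected

w = np.zeros(dim)  # parameters of a linear "reward model"

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry preference loss: -log sigmoid(reward(chosen) - reward(rejected)).
# Gradient descent pushes the reward model to rank preferred responses higher.
lr = 0.1
for _ in range(200):
    margin = chosen @ w - rejected @ w
    grad = -((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

print("mean reward margin after training:", float((chosen @ w - rejected @ w).mean()))
```

In the full pipeline, a policy model is then fine-tuned (typically with PPO) to maximize this learned reward, which is how mismatches between generic preference data and specific trustworthiness criteria can propagate into the final model.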
Key Insight: The Ethics Exception
“Empirically, RLHF does not improve performance on key trustworthiness benchmarks such as toxicity, bias, truthfulness, and privacy, with machine ethics being the only exception.” [2]
The study showed that machine ethics stood out as the only aspect of large language model (LLM) trustworthiness that consistently improved through RLHF. The researchers found that the false negative rate (FNR) for ethical decision-making, the rate at which a model fails to recognize a morally wrong action, decreased significantly across all tested models. This suggests that current AI alignment techniques are particularly effective at instilling ethical behavior but struggle with the other trustworthiness metrics: truthfulness (providing accurate information), toxicity (avoiding harmful or inappropriate content), stereotypical bias (avoiding biased outputs), and privacy (protecting user data and preventing data leaks).
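For reference, the false negative rate is a standard classification metric. Below is a hypothetical snippet, assuming an ethics benchmark where the positive class marks a morally wrong action and a "false negative" means the model fails to flag it:

```python
def false_negative_rate(predictions, labels):
    """FNR = FN / (FN + TP), computed over the positive class (here: 'morally wrong')."""
    fn = sum(1 for p, y in zip(predictions, labels) if y == 1 and p == 0)
    tp = sum(1 for p, y in zip(predictions, labels) if y == 1 and p == 1)
    return fn / (fn + tp) if (fn + tp) else 0.0

# Toy example: 1 = the model flags the scenario as wrong, 0 = it does not.
labels      = [1, 1, 1, 1, 0, 0]   # ground truth: the first four scenarios are wrong
predictions = [1, 1, 0, 1, 0, 0]   # the model misses one wrong scenario
print(false_negative_rate(predictions, labels))  # 0.25
```

A lower FNR after RLHF means the aligned model recognizes morally wrong actions it previously let pass, which is the improvement the study reports.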
Key Insight: The Data Attribution Dilemma
“To address this, we propose a novel data attribution analysis to identify fine-tuning samples detrimental to trustworthiness, which could potentially mitigate the misalignment issue.” [3]
Li, Krishna, and Lakkaraju introduced an innovative approach to understanding the root causes of trustworthiness issues in AI alignment. By analyzing the contribution of individual data samples to changes in trustworthiness, they developed a tool to identify and quantify the effects of problematic training data.
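The general idea can be sketched with a first-order, gradient-based attribution score (in the spirit of influence-function and TracIn-style methods, not necessarily the exact estimator used in the paper): each fine-tuning sample is scored by how its training gradient aligns with the gradient of a trustworthiness evaluation loss, so that samples whose updates push the trustworthiness loss upward are flagged as detrimental. The model, losses, and data below are toy placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16

# Toy stand-ins: per-sample gradients from preference fine-tuning, and the
# gradient of a trustworthiness evaluation loss (e.g. a bias or privacy probe).
# In practice both would be computed from the actual fine-tuned model.
train_grads = rng.normal(size=(100, dim))  # one gradient per fine-tuning sample
trust_grad = rng.normal(size=dim)          # gradient of the trustworthiness loss

# First-order approximation: a gradient step on sample i changes the
# trustworthiness loss by roughly -learning_rate * dot(trust_grad, train_grads[i]),
# so strongly negative dot products mean the sample drives that loss up.
scores = train_grads @ trust_grad

most_detrimental = np.argsort(scores)[:5]  # most negative alignment first
print("indices of the most detrimental fine-tuning samples:", most_detrimental.tolist())
```

The lowest-scoring samples could then be down-weighted or filtered out of the preference data before re-running alignment, one way such an analysis could help mitigate the misalignment the authors describe.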
Key Insight: The Scale of the Challenge
“Although our experiments focus on models up to 7 [billion] parameters, we expect similar trends in larger models because prior research […] suggests that larger models are not inherently more trustworthy in the aspects where we have observed negative RLHF effects.” [4]
The research indicated that the trustworthiness issues identified are not limited to smaller AI models. Even as AI systems grow in size and complexity, they remain susceptible to these alignment-induced trustworthiness problems. In fact, the study pointed to prior findings that larger models trained with RLHF expressed stronger political views and racial biases.
Why This Matters
For business leaders and executives, the insights from the team’s research are crucial for understanding the complexities of deploying AI systems. They highlight that simply focusing on aligning AI with human preferences is not enough to ensure trustworthy and reliable AI systems.
Companies investing in AI technologies must be aware of the potential trade-offs between different aspects of trustworthiness. While improvements in ethical decision-making are encouraging, the increased risks of bias, privacy breaches, and misinformation cannot be ignored. This research calls for a more nuanced approach to AI alignment that balances multiple dimensions of trustworthiness. By using the data attribution analysis the team proposed to identify problematic training data, companies can potentially improve the trustworthiness of their AI systems without compromising performance or alignment with human preferences.
References
[1] Aaron J. Li, Satyapriya Krishna, and Himabindu Lakkaraju, “More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness”, arXiv:2404.18870v2 [cs.CL] (December 21, 2024): 2.
[2] Li, Krishna, and Lakkaraju, “More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness”, 11.
[3] Li, Krishna, and Lakkaraju, “More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness”, 11.
[4] Li, Krishna, and Lakkaraju, “More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness”, 2.
Meet the Authors

Aaron J. Li is a master’s student in Computational Science & Engineering at the Harvard University John A. Paulson School of Engineering and Applied Sciences (SEAS). He obtained his BA in Mathematics from Harvard. His interests include mathematics, theoretical CS, and physics.

Satyapriya Krishna recently completed his PhD at the Harvard John A. Paulson School of Engineering and Applied Sciences (SEAS) and worked with the D^3 Trustworthy AI Lab, where his research focused on the trustworthy aspects of generative models. He earned his MS in Computer Science from Carnegie Mellon University and his BS in Computer Science and Engineering from the LNM Institute of Information Technology in Jaipur, India.

Himabindu Lakkaraju is an Assistant Professor of Business Administration at Harvard Business School and PI in D^3’s Trustworthy AI Lab. She is also a faculty affiliate in the Department of Computer Science at Harvard University, the Harvard Data Science Initiative, the Center for Research on Computation and Society, and the Laboratory for Innovation Science at Harvard. She teaches the first-year course on Technology and Operations Management, and has previously offered multiple courses and guest lectures on a diverse set of topics pertaining to artificial intelligence (AI) and machine learning (ML) and their real-world implications.