Dylan Hadfield-Menell, UC Berkeley/MIT: On the value alignment problem in AI


Last updated 16 Jun 2026

Kanjun Qiu

CEO, Co-founder

Josh Albrecht

CTO, Co-founder


RSS · Spotify · Apple Podcasts · Pocket Casts

Dylan Hadfield-Menell (Google Scholar) (Website) recently finished his PhD at UC Berkeley and is starting as an assistant professor at MIT. He works on the problem of designing AI algorithms that pursue the intended goals of their users, designers, and society in general. This is known as the value alignment problem. His most recent paper at NeurIPS is Consequences of Misaligned AI. It models the value alignment problem by looking at a common situation, where the user’s true goals are only expressed to an AI system through proxies. Optimizing the proxy initially yields positive utility, but utility declines and eventually turns negative as the AI system over-optimizes for the proxy objective. The authors’ solution is to give the user the ability to update their proxy goals, so that utility increases again. This model offers a general look at the consequences of misalignment and at how AI recommender systems can be improved.

Dylan would love to hear any questions or comments on his paper, so feel free to reach out!

Highlights from our conversation:

👨‍👩‍👧‍👦 How to align AI to human values

📉 Consequences of misaligned AI -> bias & misdirected optimization

📱 Better AI recommender systems

Below are the show notes and full transcript. As always, please feel free to reach out with feedback, ideas, and questions!

Some quotes we loved

[3:30] Dylan’s work on normative information about AI systems:

“Since then, my research has been, how to provide normative information about AI system behavior. We often talk about, in the world, the distinction between objective and subjective properties. Like predicting images from pixel to pixel is a fairly objective thing. There’s a clear well-defined right answer. You predict the right pixels or you don’t. For normative properties of the world, that’s not true.
When the right answer is not externally defined, you have to appeal to who built the system and what do they want it to do, in order to really answer that question. And I think most of my research is about trying to understand what are the channels by which we communicate that information. How do we make sure that system’s behavior aligns well enough with this subjective goal that we have.”

[18:25] The main result in Dylan’s 2020 NeurIPS paper:

“What we show is in this model, if you optimize for any fixed proxy utility function, eventually the overall utility is driven either towards a minimum at certain features or overall drives away unbounded if you don’t hit any environmental bounds. …
(Their solution:) We have a property of a proxy utility function and a true utility function such that local improvements in one lead to improvements in the other. And this implies that if you can update the features and your utility function fast enough, you can use proxy utility functions to maybe not define what you want in the long run, but to provide local directions of improvement for how your system should allocate its effort. And so this is another style of solution.”
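The effect described in this quote can be illustrated with a small toy simulation. This is only a sketch under our own assumptions, not the model from the paper: the four features, the log-shaped utility, the starting allocation, and the helper names (`utility`, `run`, `fixed_proxy`, `updated_proxy`) are all hypothetical. A fixed budget of resource units is split across four features, the true utility cares about all of them, and the proxy only sees two. The "updated" proxy stands in for a user who keeps pointing the system at whichever two features are currently most neglected.

```python
import math


def utility(x, features):
    """Sum of log(1 + units) over the given features (diminishing returns)."""
    return sum(math.log1p(x[i]) for i in features)


def run(x0, choose_proxy, n_steps=80):
    """Greedy proxy optimization under a fixed total budget.

    Each step moves one unit from the best-resourced feature the proxy
    ignores to the worst-resourced feature the proxy measures, which
    always raises the proxy utility. Returns the final allocation and
    the true utility recorded after every step.
    """
    x = list(x0)
    all_features = range(len(x))
    true_history = [utility(x, all_features)]
    for _ in range(n_steps):
        proxied = choose_proxy(x)
        donors = [i for i in all_features if i not in proxied and x[i] > 0]
        if not donors:
            break  # nothing left to take from the ignored features
        donor = max(donors, key=lambda i: x[i])
        receiver = min(proxied, key=lambda i: x[i])
        x[donor] -= 1
        x[receiver] += 1
        true_history.append(utility(x, all_features))
    return x, true_history


def fixed_proxy(x):
    """Assumption: the proxy only ever measures features 0 and 1."""
    return [0, 1]


def updated_proxy(x):
    """Assumption: the user re-targets the two most neglected features."""
    return sorted(range(len(x)), key=lambda i: x[i])[:2]


start = [4, 4, 36, 36]  # hypothetical starting allocation, 80 units in total
fixed_x, fixed_hist = run(start, fixed_proxy)
updated_x, updated_hist = run(start, updated_proxy)

print("fixed proxy:   final allocation", fixed_x)
print("  true utility start / peak / end:",
      round(fixed_hist[0], 2), round(max(fixed_hist), 2), round(fixed_hist[-1], 2))
print("updated proxy: final allocation", updated_x)
print("  true utility start / end:",
      round(updated_hist[0], 2), round(updated_hist[-1], 2))
```

In this toy setup, the fixed proxy first raises true utility (the neglected features catch up) and then lowers it below its starting value as the unmeasured features are drained to zero. With the updated proxy, the same greedy steps keep the allocation near the balanced point, which loosely mirrors the idea of using a proxy for local directions of improvement rather than as a long-run objective.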

[31:33] Obvious discoveries:

“I think that is the ultimate dream as a researcher: things that you didn’t realize before, but then seem so obvious, you can’t imagine not thinking of them in hindsight. If I can have a couple ideas like that in my career, I will call it a big success.”

[33:22] Current AI systems as an analog to compilers:

“I’ve come to believe that a lot of ML as we study it right now will fill a role in the future that’s analogous to what compilers fill in AI systems today. … Over the course of a long period of time is the combination of decision theory and statistics to effectively build a compiler that allows us to represent now more intelligent behaviors. Really just behaviors defined on complex open-world inputs in a higher level representation that can then be compiled down. That higher level representation in the form of supervised learning is a labeled dataset. … It’s a representation of an objective. It’s a ranking of different possible behaviors where those behaviors are encoded as the weights of the neural net, the parameters of a policy, something of that nature. We do have something like the general purpose programming language today. I think the supervised learning dataset like ImageNet is, in this analogy, it looks like Python or C or something like that.”

Show Notes

How did Dylan’s research interests start and evolve? [02:00]
Dylan’s current research interests on normative information about AI system behavior [03:00]
Framing AI safety as the principal-agent problem from economics [07:51]
Different approaches to AI safety [08:21]
How Dylan’s research relates to social media and content recommendation [09:31]
Language models and their pitfalls in data collection [12:06]
One solution to the problem: paying for data [14:20]
Dylan’s NeurIPS 2020 paper, Consequences of Misaligned AI [15:33]
CEO pay as an analog for AI systems optimization [16:04]
Dylan’s proposed solutions to over-optimization in his paper [25:04]
Why their solution lines up well with deployed AI systems in practice [27:07]
How does clickbait happen? [28:14]
Managing personal optimization in life [30:36]
Finding obvious phenomena as a researcher [31:23]
Current AI systems as an analog to compilers [33:22]
Creating datasets to allow “general purpose programming” to happen in AI, rather than “low-level code” [36:00]
Dylan’s paper on social norms: Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors [41:49]
Dylan’s “controversial” opinions for ML - choosing the wrong ways to fix biased datasets [47:36]
How to better teach ML systems design in academia [52:07]
Weakly supervised learning for general purpose ML programming [55:30]
Dylan’s concerns with current AI systems - shallow metrics and negative externalities [57:44]
The importance of subjectivity in AI and normative information [58:56]
Enabling better content recommendation and filters for users [1:04:11]
Do more choices lead to user happiness? [1:08:16]
Creating your own information diet [1:13:19]
Lack of power and feedback for users in content recommendation [1:14:48]
Problems with unsupervised learning regarding AI safety [1:19:41]
Mathematical models to measure manipulation [1:22:18]
Work that has impacted Dylan most [1:25:58]
Dylan’s request for the audience [1:30:24]

Thanks to Luke Cheng for writing drafts of this post and Tessa Hall for editing the podcast.