Jacob Steinhardt, UC Berkeley: On machine learning safety, alignment and measurement

Last updated 15 Jun 2026

Kanjun Qiu

CEO, Co-founder

Josh Albrecht

CTO, Co-founder

RSS · Spotify · Apple Podcasts · Pocket Casts

Jacob Steinhardt (Google Scholar) (Website) is an assistant professor at UC Berkeley. His main research interest is in designing machine learning systems that are reliable and aligned with human values. His specific research directions include robustness, reward specification and reward hacking, and scalable alignment. His most recent paper at ICLR 2021 proposes a new test to measure an NLP model’s accuracy on a wide variety of tasks, including mathematics, US history, law, and more. It provides a measurement tool to help researchers specify an important problem: while current models can achieve superhuman performance on benchmarks, they lack the ability to understand language as a whole. Another of Jacob’s papers at ICLR focuses on measuring a language model’s knowledge of basic concepts of morality. It shows that current language models have a promising but incomplete ability to predict basic human ethical judgements.

Highlights from our conversation:

📜 “Test accuracy is a very limited metric.”

👨‍👩‍👧‍👦 “You might not be able to get lots of feedback on human values.”

📊 “I’m interested in measuring the progress in AI capabilities.”

Below are the show notes and full transcript. As always, please feel free to reach out with feedback, ideas, and questions!

Some quotes we loved

[11:33] On the freedom of knowing how to communicate unusual ideas:

“But I had to learn how to write a good paper without having a template. I think it required me to learn to become a significantly better writer. And I think that helped later on, because it made me feel more comfortable pursuing unusual ideas. I knew I had the skills to present those ideas. As long as I believed in them, I could get other people to believe in them.”

[34:55] On learning hard phenomena from big data sets:

“People have historically been interested in these parts, like compositionality of objects and occlusion…but thinking about these complicated things directly is just not really the right way to go. You just want this very diverse distribution of things that are deeply ingrained in evolutionary history as opposed to being part of explicit reasoning.”

[21:10] Why measurement matters for AI safety:

“I’ve been really obsessed with this idea of measurement. First of all, test accuracy is a very limited metric. What are we trying to do with it? I’m kind of thinking in analogy with climate change as another field. For a while, there was a lot of climate skepticism or climate denial. At some point it becomes pretty clear, when there’s regular heat waves, fires, and that sort of thing. You probably wanted to do something about it before that point. Having these more subtle measurements that you can look at is important. And the other thing is I think it actually laid the groundwork for the more extreme weather events to become a convincing signal.”

Show Notes

Jacob’s original career plan and the exploration that led him to ML [01:40]
Jacob’s first research area in grad school: computationally bounded reasoning [05:47]
Pivoting to robustness research [09:16]
How Jacob’s early research directions helped him learn to communicate unusual ideas [10:40]
Two different types of adversarial robustness research and their significance [12:30]
Gap year at Open Philanthropy and OpenAI [17:11]
Learning about scaling at OpenAI [18:08]
Working on Covid research and lessons [19:37]
Working on measurement (not just test accuracy but other metrics) [21:09]
Measuring AI capability jumps for safety [24:23]
Measuring progress in AI capabilities - Jacob’s paper [27:14]
AI calibration for prediction accuracy [27:55]
A summary of different types of robustness [29:16]
Dan Hendrycks’ work on collecting data sets [30:39]
Jacob’s most unusual paper: measuring the ethics of AI models [33:19]
What work has impacted Jacob most? (GPT-3, scaling laws for NNs) [36:23]
Why does measurement matter? And Jacob’s interest in the history of science. [38:37]
Filtering your information diet [40:17]
Should researchers be hedgehogs or foxes? [42:25]
What methods are underrated in ML research? [46:34]
Jacob’s paper on Troubling Trends in Machine Learning Scholarship [47:49]
Attributes of a great research lab [53:02]
What makes for a great advisor? [56:07]
What makes for a great researcher? [57:21]

Thanks to Luke Cheng for writing drafts of this post and Tessa Hall for editing the podcast.