Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.
Some highlights from our conversation
On optimizing sparse masks
“If you optimize a sparse mask, all you’re saying, basically, is: I want to pick and choose the terms that I want — the parameter times the input, in each of these cases. And if I just optimize that, I can solve anything. And that’s really very expressive, it turns out. So when you think about what happens when you remove low-magnitude weights, it’s basically a mask where you’re removing the terms which, by the nature of that low magnitude, ended up being closest to zero.
And as a result, when you actually go and push it through the non-linearity and get your output for that node, it doesn’t actually change all that much. Which I think really goes to, when you think about how you should be optimizing these systems, understanding which components lead to big changes in the output, and which components don’t, is consistently a lens that works very well.”
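To make that lens concrete, here is a minimal sketch in plain NumPy (shapes and values are made up for illustration, not taken from the episode). It compares a magnitude-based mask with a random mask at the same sparsity: the terms dropped by magnitude pruning are exactly the ones closest to zero, so the output after the non-linearity moves far less than when terms are dropped at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense layer: each output node computes relu(sum_i W[j, i] * x[i]).
W = rng.normal(size=(64, 128))
x = rng.normal(size=128)

def relu(z):
    return np.maximum(z, 0.0)

sparsity = 0.5  # fraction of parameter * input terms to drop

# Magnitude mask: drop the terms whose weights are closest to zero.
threshold = np.quantile(np.abs(W), sparsity)
magnitude_mask = (np.abs(W) >= threshold).astype(W.dtype)

# Random mask at the same sparsity, for comparison.
random_mask = (rng.random(W.shape) >= sparsity).astype(W.dtype)

dense = relu(W @ x)
by_magnitude = relu((W * magnitude_mask) @ x)
by_chance = relu((W * random_mask) @ x)

def rel_change(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(b)

# The magnitude mask removed only near-zero terms, so its relative change
# is noticeably smaller than the random mask's at the same sparsity.
print("magnitude mask, relative output change:", rel_change(by_magnitude, dense))
print("random mask,    relative output change:", rel_change(by_chance, dense))
```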
On data washing out inductive bias
“Data has a really nice advantage, because if you understand what’s good or bad about data, it’s actually quite easy to make an improvement based off of that. Whereas if you understand what’s good about a representation, you can try to optimize for it […] and that sometimes works, but a lot of the time, it doesn’t. Also, I think one of the things that has become very clear over the last five years or so is that inductive biases consistently just get washed out by data. And that never used to be true because we never showed models enough data, but now that we’re showing models tons of data, the inductive bias just gets totally overwhelmed. And that also reduces the impact of crafting new inductive biases.”
On the “bitter lesson” of human-designed systems
“The key takeaway that I have taken from “The Bitter Lesson” is that, ultimately, as scientists, we like to think that we can design these systems, and that we’ll build a whole bunch of rules into a system that will create AI. But, over time, what has been shown is that strategies which can effectively leverage compute and data consistently outperform strategies which are hand-designed. And one of the things that’s nice about transformers is that they can very effectively leverage compute and data. They scale well, and there’s a very general-purpose way to make that work. But I think the bitter lesson for me was very bitter because I had been spending a lot of time trying to figure out how do I come up with better inductive biases for models to help them learn these things.”
On the usefulness of interpolation
“In many ways, by training on the whole internet, what we’ve done is kind of turned everything into an interpolation. Everything’s in distribution now, and maybe that’s just why it ends up working. It actually caused me to start thinking about what I do as a scientist — like, am I actually extrapolating? Or am I just interpolating? And the conclusion I came to, which is somewhat depressing as a scientist, is that I think I actually just interpolate most of the time. I think in practice what I do is I see a problem, and then I bucket that problem into various other categories of problems that I’ve seen in my career. […] It’s why interdisciplinary research ends up being so useful.”
On data redundancy and necessary variance
“One of the things that’s often really hard about identifying what data are good or bad is that redundancy is important. We can’t remove redundancy entirely, right? And in general, when you start going from exact deduplication to redundancy, it’s a fuzzy boundary. There are things which are semantically very similar that you might want to fully deduplicate, but then there are other things where they’re similar, but you actually do need to see that variance.
[…] The challenge is that you don’t need infinite redundancy, number one, and the amount of redundancy you need is likely not consistent with the distribution of the data. And different concepts will require different redundancy.”
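One way to picture that fuzzy boundary is the gap between exact deduplication (dropping byte-identical documents) and fuzzy, embedding-based deduplication, where a similarity threshold decides how much near-duplicate redundancy survives. The sketch below is only illustrative: `embed` is a hypothetical function mapping a document to a unit-length vector, and the 0.95 threshold is an arbitrary placeholder.

```python
import hashlib

def exact_dedup(docs):
    """Keep one copy of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def fuzzy_dedup(docs, embed, threshold=0.95):
    """Drop documents whose embedding is too close to one already kept.

    `embed` is a hypothetical text -> unit-vector function (any sentence
    embedding model would do); `threshold` is the knob that trades
    redundancy against necessary variance: lower it and more near-duplicates
    are removed, raise it and more of them survive.
    """
    kept, kept_vecs = [], []
    for doc in docs:
        vec = embed(doc)
        # Keep the document only if it is not too similar to anything kept so far.
        if all(float(vec @ other) < threshold for other in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept
```

A single global threshold is itself a simplification; the quote’s point is that the right amount of redundancy varies by concept, so in practice the knob would have to move across different parts of the distribution.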
On the challenge of using synthetic data
“The challenge is making sure that the generated data matches the distribution that you actually want. This, in general, is the challenge with synthetic data right now. Synthetic data is an incredibly exciting direction — it’s one that I think will have a ton of impact, definitely an area that we’re thinking very hard about at Datology and that we’ll be doing a lot of work in. And I think that there are clear places where it can make a huge impact, in particular with helping to augment tails and take areas of a distribution that are undersampled relative to where they should be, and helping to fill those in.
That said, if you kind of use synthetic data naively, it leads to all these problems. There have been a couple of really beautiful papers that have basically shown that you get model collapse if you do this. And the reason for this is fairly intuitive: any time you train a generative model on a dataset, it tends to overfit the modes, and it underfits the tails. So, if you then were to recursively do this n times, each time training on the outputs of the generative model, you would eventually completely lose the tails and you end up with a dumb function.”
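The recursive-training intuition can be reproduced even in a toy setting. The sketch below is not from the papers mentioned; it is a deliberately crude caricature with made-up numbers, in which the "generative model" underfits the tails by only reproducing values strictly inside its training data's 1st-99th percentile range. Retraining on its own outputs generation after generation steadily erases the tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal.
data = rng.standard_normal(100_000)
print(f"gen  0: std={data.std():.3f}  max|x|={np.abs(data).max():.2f}")

for gen in range(1, 11):
    # Crude stand-in for a generative model that captures the modes but
    # underfits the tails: it only reproduces values strictly inside its
    # training data's 1st-99th percentile range.
    lo, hi = np.quantile(data, [0.01, 0.99])
    samples = rng.choice(data, size=100_000)          # "generate" the next dataset
    data = samples[(samples > lo) & (samples < hi)]   # tail values never come back
    print(f"gen {gen:2d}: std={data.std():.3f}  max|x|={np.abs(data).max():.2f}")

# Each generation is trained only on what the previous model could express,
# so the spread and the extreme values shrink steadily toward the mode.
```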
Referenced in this podcast
- AlexNet / ImageNet Classification with Deep Convolutional Neural Networks
- Playing Atari with Deep Reinforcement Learning
- Mitchell Wortsman
- The Generalization-Stability Tradeoff In Neural Network Pruning
- Brian Bartoldson
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
- The Bitter Lesson by Rich Sutton
- Fundamental AI Research (FAIR), Core Machine Learning
- Measuring abstract reasoning in neural networks
- Felix Hill (DeepMind)
- Beyond neural scaling laws: beating power law scaling via data pruning
- Ben Sorscher
- Surya Ganguli
- Gemini: A Family of Highly Capable Multimodal Models
- Mayee Chen
- Skill-it! A data-driven skills framework for understanding and training language models
- Text Is All You Need: Learning Language Representations for Sequential Recommendation
- The perceptron: A probabilistic model for information storage and organization in the brain
- Yann LeCun cake metaphor
Thanks to Tessa Hall for editing the podcast.