Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.
Some highlights from our conversation
On optimizing sparse masks
“If you optimize a sparse mask, all you’re saying, basically, is: I want to pick and choose the terms that I want — the parameter times the input, in each of these cases. And if I just optimize that, I can solve anything. And that’s really very expressive, it turns out. So when you think about what happens when you remove low-magnitude weights, it’s basically a mask where you’re removing the terms which, by the nature of that low magnitude, ended up being closest to zero.
And as a result, when you actually go and push it through the non-linearity and get your output for that node, it doesn’t actually change all that much. Which I think really goes to, when you think about how you should be optimizing these systems, understanding which components lead to big changes in the output, and which components don’t, is consistently a lens that works very well.”
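To make that lens concrete, here is a minimal sketch in plain NumPy (shapes and values are made up for illustration, not taken from the episode). It compares a magnitude-based mask with a random mask at the same sparsity: the terms dropped by magnitude pruning are exactly the ones closest to zero, so the output after the non-linearity moves far less than when terms are dropped at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense layer: each output node computes relu(sum_i W[j, i] * x[i]).
W = rng.normal(size=(64, 128))
x = rng.normal(size=128)

def relu(z):
    return np.maximum(z, 0.0)

sparsity = 0.5  # fraction of parameter * input terms to drop

# Magnitude mask: drop the terms whose weights are closest to zero.
threshold = np.quantile(np.abs(W), sparsity)
magnitude_mask = (np.abs(W) >= threshold).astype(W.dtype)

# Random mask at the same sparsity, for comparison.
random_mask = (rng.random(W.shape) >= sparsity).astype(W.dtype)

dense = relu(W @ x)
by_magnitude = relu((W * magnitude_mask) @ x)
by_chance = relu((W * random_mask) @ x)

def rel_change(a, b):
    return np.linalg.norm(a - b) / np.linalg.norm(b)

# The magnitude mask removed only near-zero terms, so its relative change
# is noticeably smaller than the random mask's at the same sparsity.
print("magnitude mask, relative output change:", rel_change(by_magnitude, dense))
print("random mask,    relative output change:", rel_change(by_chance, dense))
```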
On data washing out inductive bias
“Data has a really nice advantage, because if you understand what’s good or bad about data, it’s actually quite easy to make an improvement based off of that. Whereas if you understand what’s good about a representation, you can try to optimize for it […] and that sometimes works, but a lot of the time, it doesn’t. Also, I think one of the things that has become very clear over the last five years or so is that inductive biases consistently just get washed out by data. And that never used to be true because we never showed models enough data, but now that we’re showing models tons of data, the inductive bias just gets totally overwhelmed. And that also reduces the impact of crafting new inductive biases.”
On the “bitter lesson” of human-designed systems
“The key takeaway that I have taken from “The Bitter Lesson” is that, ultimately, as scientists, we like to think that we can design these systems, and that we’ll build a whole bunch of rules into a system that will create AI. But, over time, what has been shown is that strategies which can effectively leverage compute and data consistently outperform strategies which are hand-designed. And one of the things that’s nice about transformers is that they can very effectively leverage compute and data. They scale well, and there’s a very general-purpose way to make that work. But I think the bitter lesson for me was very bitter because I had been spending a lot of time trying to figure out how do I come up with better inductive biases for models to help them learn these things.”
On the usefulness of interpolation
“In many ways, by training on the whole internet, what we’ve done is kind of turned everything into an interpolation. Everything’s in distribution now, and maybe that’s just why it ends up working. It actually caused me to start thinking about what I do as a scientist — like, am I actually extrapolating? Or am I just interpolating? And the conclusion I came to, which is somewhat depressing as a scientist, is that I think I actually just interpolate most of the time. I think in practice what I do is I see a problem, and then I bucket that problem into various other categories of problems that I’ve seen in my career. […] It’s why interdisciplinary research ends up being so useful.”
On data redundancy and necessary variance
“One of the things that’s often really hard about identifying what data are good or bad is that redundancy is important. We can’t remove redundancy entirely, right? And in general, when you start going from exact deduplication to redundancy, it’s a fuzzy boundary. There are things which are semantically very similar that you might want to fully deduplicate, but then there are other things where they’re similar, but you actually do need to see that variance.
[…] The challenge is that you don’t need infinite redundancy, number one, and the amount of redundancy you need is likely not consistent with the distribution of the data. And different concepts will require different redundancy.”
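One way to picture that fuzzy boundary is the gap between exact deduplication (dropping byte-identical documents) and fuzzy, embedding-based deduplication, where a similarity threshold decides how much near-duplicate redundancy survives. The sketch below is only illustrative: `embed` is a hypothetical function mapping a document to a unit-length vector, and the 0.95 threshold is an arbitrary placeholder.

```python
import hashlib

def exact_dedup(docs):
    """Keep one copy of each byte-identical document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def fuzzy_dedup(docs, embed, threshold=0.95):
    """Drop documents whose embedding is too close to one already kept.

    `embed` is a hypothetical text -> unit-vector function (any sentence
    embedding model would do); `threshold` is the knob that trades
    redundancy against necessary variance: lower it and more near-duplicates
    are removed, raise it and more of them survive.
    """
    kept, kept_vecs = [], []
    for doc in docs:
        vec = embed(doc)
        # Keep the document only if it is not too similar to anything kept so far.
        if all(float(vec @ other) < threshold for other in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept
```

A single global threshold is itself a simplification; the quote’s point is that the right amount of redundancy varies by concept, so in practice the knob would have to move across different parts of the distribution.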
On the challenge of using synthetic data
“The challenge is making sure that the generated data matches the distribution that you actually want. This, in general, is the challenge with synthetic data right now. Synthetic data is an incredibly exciting direction — it’s one that I think will have a ton of impact, definitely an area that we’re thinking very hard about at Datology and that we’ll be doing a lot of work in. And I think that there are clear places where it can make a huge impact, in particular with helping to augment tails and take areas of a distribution that are undersampled relative to where they should be, and helping to fill those in.
That said, if you kind of use synthetic data naively, it leads to all these problems. There have been a couple of really beautiful papers that have basically shown that you get model collapse if you do this. And the reason for this is fairly intuitive: any time you train a generative model on a dataset, it tends to overfit the modes, and it underfits the tails. So, if you then were to recursively do this n times, each time training on the outputs of the generative model, you would eventually completely lose the tails and you end up with a dumb function.”
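The recursive-training intuition can be reproduced even in a toy setting. The sketch below is not from the papers mentioned; it is a deliberately crude caricature with made-up numbers, in which the "generative model" underfits the tails by only reproducing values strictly inside its training data's 1st-99th percentile range. Retraining on its own outputs generation after generation steadily erases the tails.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal.
data = rng.standard_normal(100_000)
print(f"gen  0: std={data.std():.3f}  max|x|={np.abs(data).max():.2f}")

for gen in range(1, 11):
    # Crude stand-in for a generative model that captures the modes but
    # underfits the tails: it only reproduces values strictly inside its
    # training data's 1st-99th percentile range.
    lo, hi = np.quantile(data, [0.01, 0.99])
    samples = rng.choice(data, size=100_000)          # "generate" the next dataset
    data = samples[(samples > lo) & (samples < hi)]   # tail values never come back
    print(f"gen {gen:2d}: std={data.std():.3f}  max|x|={np.abs(data).max():.2f}")

# Each generation is trained only on what the previous model could express,
# so the spread and the extreme values shrink steadily toward the mode.
```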
Referenced in this podcast
- AlexNet / ImageNet Classification with Deep Convolutional Neural Networks
- Playing Atari with Deep Reinforcement Learning
- Mitchell Wortsman
- The Generalization-Stability Tradeoff In Neural Network Pruning
- Brian Bartoldson
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
- The Bitter Lesson by Rich Sutton
- Fundamental AI Research (FAIR), Core Machine Learning
- Measuring abstract reasoning in neural networks
- Felix Hill (DeepMind)
- Beyond neural scaling laws: beating power law scaling via data pruning
- Ben Sorscher
- Surya Ganguli
- Gemini: A Family of Highly Capable Multimodal Models
- Mayee Chen
- Skill-it! A data-driven skills framework for understanding and training language models
- Text Is All You Need: Learning Language Representations for Sequential Recommendation
- The perceptron: A probabilistic model for information storage and organization in the brain
- Yann LeCun cake metaphor
Thanks to Tessa Hall for editing the podcast.