Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.
Some highlights from our conversation
“I think if you looked at the structure of the masks on a simple dataset, like MNIST, you could actually see the structure, especially in the first layer; you can see that the weights that are masked are the weights that connect to empty space, usually right on the borders of the digit. That kind of analysis gets impossible as soon as you move away from MNIST. And I almost think—I’m pessimistic about taking a sparsity structure that you identify from lottery tickets to a general principle of how you should design architectures.”
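The mask-structure observation above can be sketched in code. This is a hypothetical illustration, not code from the paper: we fabricate a binary supermask for the first layer of an MNIST MLP (shape `n_hidden × 784`), zero out the border pixels where digits have no ink, and reshape the per-pixel survival rate into a 28×28 "importance image" — the kind of visualization that would reveal the structure described in the quote.

```python
import numpy as np

# Hypothetical first-layer supermask for an MNIST MLP:
# mask[i, j] == 0 means the weight from input pixel j to
# hidden unit i is pruned.
rng = np.random.default_rng(0)
n_hidden = 256
mask = rng.integers(0, 2, size=(n_hidden, 28 * 28))

# Mimic the observed structure: fully prune the border pixels,
# where MNIST digits are empty space (assumed 2-pixel border).
border = np.ones((28, 28), dtype=bool)
border[2:-2, 2:-2] = False
mask[:, border.ravel()] = 0

# Fraction of surviving weights per input pixel; reshaping to
# 28x28 gives an image of which pixels the mask keeps.
survival = mask.mean(axis=0).reshape(28, 28)

print(survival[0, 0])        # border pixel: everything pruned
print(survival[14, 14] > 0)  # center pixel: some weights survive
```

As the quote notes, this kind of direct inspection only works because MNIST inputs have a fixed spatial meaning per pixel; for larger datasets and deeper layers the mask has no such readable layout.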
“I have this speculation or hunch or hypothesis—people talk about system 1 and system 2, right? And they say deep learning does system 1 mostly. But for humans, we don’t always rely on system 2 to solve questions that are reasoning problems. […] You have to make a conscious decision when to do that. It takes energy. It’s not the default state. So my speculation is that these large pre-trained models also have both system 1 and system 2 capabilities, it’s just they may be in different proportions. And also the system 2 capabilities might be masked by system 1—by the more correlational and easier and probably stronger features in the model.”
“For humans, even though we don’t like the idea of forgetting, we think it’s a bad thing, it’s actually a very important mechanism in our brain that helps with learning, helps with processing information. So could it be the case that that same process could be beneficial to artificial neural networks?”
Referenced in this podcast
- The Lottery Ticket Hypothesis
- Hattie’s paper Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask
- Supermasks in Superposition
- The “system 1 and system 2” concept from Thinking, Fast and Slow by Daniel Kahneman
- Coherent Gradients
- LCA: Loss Change Allocation for Neural Network Training
- Compositional Languages Emerge in a Neural Iterated Learning Model
- Knowledge Evolution in Neural Networks
- Hattie’s paper Fortuitous Forgetting in Connectionist Networks
- RIFLE: Backpropagation in Depth for Deep Transfer Learning through Re-Initializing the Fully-connected LayEr
- The Primacy Bias in Deep Reinforcement Learning
- Chris Olah and his work on neural circuits
- Greg Yang and his work on neural tangent kernels
Thanks to Tessa Hall for editing the podcast.