Below are some highlights from our conversation as well as links to the papers, people, and groups referenced in the episode.
Some highlights from our conversation
“I think there are many paths to a high-performing language model. So right now there’s a proven strategy and people follow that. I think that doesn’t have to necessarily be the only path. I think my prior is that as long as your model architecture is reasonable and is hardware efficient, and you have lots of compute, and you have lots of data, the model would just do well.”
“So we’ve seen that sparsity now is proven to be more useful as people think about hardware-friendly sparsity. I would say the high-level point is we show that there are ways to make sparsity hardware-friendly and there are ways to maintain quality while using sparsity.”
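The "hardware-friendly sparsity" idea can be illustrated with block sparsity: instead of zeroing individual weights, whole contiguous tiles are zeroed, so every surviving tile is still a small dense matmul that maps well onto GPU matrix units. This is a minimal NumPy sketch of that idea, not code from the episode or the papers below; the function name and storage layout are mine.

```python
import numpy as np

def block_sparse_matmul(blocks, mask, x, block_size):
    """Multiply a block-sparse matrix by a dense matrix x.

    The sparse matrix is stored as a dict {(i, j): dense tile}; `mask`
    is a boolean array marking which tiles are nonzero. Each surviving
    tile is dense, so the inner work is ordinary dense matmuls, which
    is what makes this pattern hardware-friendly.
    """
    n_row_blocks, n_col_blocks = mask.shape
    out = np.zeros((n_row_blocks * block_size, x.shape[1]))
    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            if mask[i, j]:
                rows = slice(i * block_size, (i + 1) * block_size)
                cols = slice(j * block_size, (j + 1) * block_size)
                out[rows] += blocks[(i, j)] @ x[cols]
    return out
```

Compute and memory then scale with the number of nonzero tiles rather than the full matrix, while quality is maintained by choosing which tiles to keep.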
“So I think there’s gonna be a shift towards focusing a lot on inference. How can we make inference as efficient as possible from either model design or software framework or even hardware? We’ve seen some of the hardware designs are more catered to inference now—think, for example, Google TPU has a version for inference, and has a different version for training where they have different numbers of flops and memory bandwidth and so on.”
“So we want to understand, from an academic perspective, when or why do we need attention. Can we have other alternatives that scale better in terms of sequence length? Because the longer context length has been a big problem for attention for a long time. Yes, we worked on that. We spent tons of time on that. I looked around and maybe it’s a contrarian bet that I wanna work on something that maybe scaled better in terms of sequence length that, maybe in two to three years, would have a shot at not replacing transformer but augmenting transformer in some settings.”
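The long-context problem mentioned here comes from the score matrix in standard attention: for a sequence of length n, it has shape (n, n), so compute and memory grow quadratically with n. A minimal NumPy sketch of single-head attention (my own illustration, not code from the episode) makes the quadratic term explicit:

```python
import numpy as np

def naive_attention(q, k, v):
    """Standard softmax attention for one head.

    q, k, v each have shape (n, d). The intermediate `scores` matrix
    has shape (n, n) -- this quadratic blow-up in sequence length n is
    why long context is expensive for attention.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ v                                # (n, d)
```

Alternatives that "scale better in terms of sequence length" aim to replace that (n, n) intermediate with something linear or near-linear in n.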
Referenced in this podcast
- Steven Boyd, Stanford
- A Kernel Theory of Modern Data Augmentation by Tri Dao, Albert Gu, Alexander J. Ratner, Virginia Smith, Christopher De Sa, Christopher Ré
- Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations by Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, Christopher Ré
- Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models by Tri Dao*, Beidi Chen*, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, Christopher Ré
- Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps by Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, Christopher Ré
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training by Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
- ImageNet Classification with Deep Convolutional Neural Networks by Alex Krizhevsky, et al.
- LLaMA: Open and Efficient Foundation Language Models by Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample
- Fast Transformer Decoding: One Write-Head is All You Need by Noam Shazeer
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- Young-Jun Ko from Inflection
- Online normalizer calculation for softmax by Maxim Milakov (NVIDIA), Natalia Gimelshein (NVIDIA)
- Dan Fu
- Christopher Ré
- Albert Gu
- Phil Wang
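The "Online normalizer calculation for softmax" paper linked above describes the streaming trick that FlashAttention builds on: compute the softmax normalizer in one pass by tracking a running maximum m and a running sum s of exp(x − m), rescaling s whenever the maximum grows. A minimal sketch of that idea (my own paraphrase, not the paper's code):

```python
import numpy as np

def online_softmax(xs):
    """One-pass, numerically stable softmax.

    Maintains a running max `m` and a running sum `s` of exp(x - m);
    when a new maximum appears, the old sum is rescaled by
    exp(m_old - m_new) instead of being recomputed from scratch.
    """
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(np.asarray(xs, dtype=float) - m) / s
```

Because the normalizer never needs a second full pass over the inputs, attention scores can be processed in tiles that stay in fast on-chip memory, which is the IO-awareness at the heart of FlashAttention.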
Thanks to Tessa Hall for editing the podcast.