'State of the Art: Training >70B LLMs on 10,000 H100 clusters': Josh on the Latent Space podcast

June 25, 2024

Along with our three-part series on how we trained our 70B model, our CTO, Josh, went on the Latent Space podcast with Swyx and Jonathan Frankle, Chief AI Scientist at Databricks, to chat about the training process and our toolkit release. They discussed how to make significant progress even with imperfect evaluations, how Imbue and Databricks differ in their philosophies on building infrastructure, and why the tools we’re releasing today can be the difference between success and failure in model training.

You can find the full episode and transcript here.



“This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI ‘hello world’ that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There’s so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody’s released this.” — Jonathan Frankle
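To make Frankle’s point concrete, here is a rough sketch of the kind of GPU “hello world” sanity check he describes: every rank puts a tensor on its GPU and all-reduces it, proving the GPUs can actually talk to each other before any real training is attempted. This is an illustration, not code from the episode or our toolkit; it uses PyTorch’s NCCL backend rather than raw MPI, and the file name and launch command are assumptions.

```python
# gpu_hello.py (illustrative name) -- launch with e.g.:
#   torchrun --nproc_per_node=8 gpu_hello.py
import os
import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its own rank number; the all-reduce sums them,
    # so every rank should end up with 0 + 1 + ... + (world_size - 1).
    x = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    world_size = dist.get_world_size()
    expected = world_size * (world_size - 1) / 2
    print(f"rank {dist.get_rank()}: all-reduce sum = {x.item()} (expected {expected})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If a check like this hangs or crashes, the problem is in the cluster, drivers, or interconnect rather than in any model code, which is exactly why it is worth running before attempting a full training job.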

Read our blog post series here.

This model training run is one of many projects we are working on at Imbue. If you’re interested in learning more about our other projects and in building collaborative agents that can reason and code, we’re hiring!