Along with our three-part series on how we trained our 70B model, our CTO, Josh, went on the Latent Space podcast with Swyx and Jonathan Frankle, Chief AI Scientist at Databricks, to chat about the training process and our toolkit release. They discussed how to make significant progress even with imperfect evaluations, how Imbue's and Databricks' philosophies on building infrastructure differ, and why the tools we're releasing today can be the difference between success and failure in model training.
You can find the full episode and transcript here.
“This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI ‘hello world’ that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There’s so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody’s released this.” — Jonathan Frankle
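For a sense of what that first hurdle looks like in practice, here is a minimal sketch of the kind of GPU "hello world" Frankle describes. This version uses PyTorch's torch.distributed with the NCCL backend rather than raw MPI, and the file name and launch command are illustrative: if every rank all-reduces to the expected sum, the GPUs can at least talk to each other.

```python
# gpu_hello.py -- minimal multi-GPU communication sanity check.
# Launch with: torchrun --nproc_per_node=<num_gpus> gpu_hello.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor holding its own rank number.
    x = torch.tensor([float(dist.get_rank())], device="cuda")

    # If NCCL, the driver, and the interconnect are all healthy, every
    # rank ends up with the sum 0 + 1 + ... + (world_size - 1).
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    expected = sum(range(dist.get_world_size()))
    print(f"rank {dist.get_rank()}: all_reduce -> {x.item()} (expected {expected})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Even a check this small exercises the driver, the NCCL install, and the interconnect at once, which is why it so often fails on a freshly provisioned cluster.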
Read our blog post series here:
- Introduction: Training a 70B model from scratch: open-source tools, evaluation datasets, and learnings
- Evaluations: Sanitized open-source datasets for natural language and code understanding: how we evaluated our 70B model
- Infrastructure: From bare metal to a 70B model: infrastructure set-up and scripts
- CARBS: Open-sourcing CARBS: how we used our hyperparameter optimizer to scale up to a 70B-parameter language model
This model training is one of many projects we're working on at Imbue. If you're interested in learning more about our other projects and in building collaborative agents that can reason and code, we're hiring!