
Earlier this year, we pre-trained a 70B-parameter model and fine-tuned it on a range of multiple-choice reasoning benchmarks. On these benchmarks, our fine-tuned model outperforms GPT-4o zero-shot (which was not tuned on them). Despite being pre-trained on only 2T tokens, our fine-tuned model also approaches the performance of fine-tuned Llama 3 70B, which was pre-trained on more than seven times as much data.

Because we evaluated GPT-4o zero-shot without chain-of-thought, its performance above does not reflect the best possible scores it can achieve on these datasets. However, this is the most faithful comparison to the fine-tuned 70B model evaluations, which also do not include chain-of-thought.
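For concreteness, here is a minimal sketch of how a multiple-choice question can be scored without chain-of-thought: each answer option is appended to the prompt and the option the model assigns the highest log-likelihood wins. The `option_logprob` callable is a hypothetical stand-in for whichever model API is under test, not our actual evaluation harness.

```python
# Sketch of zero-shot multiple-choice scoring without chain-of-thought.
# `option_logprob(prompt, continuation)` is a hypothetical stand-in for the
# model being evaluated; it returns the log-likelihood of the continuation.
from typing import Callable, Sequence


def pick_answer(
    question: str,
    options: Sequence[str],
    option_logprob: Callable[[str, str], float],
) -> int:
    """Return the index of the option the model finds most likely."""
    prompt = f"Question: {question}\nAnswer:"
    scores = [option_logprob(prompt, f" {opt}") for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def accuracy(examples, option_logprob) -> float:
    """examples: iterable of (question, options, correct_index) tuples."""
    examples = list(examples)
    correct = sum(
        pick_answer(q, opts, option_logprob) == gold for q, opts, gold in examples
    )
    return correct / len(examples)
```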
Using our hyperparameter optimizer, CARBS, we scaled this system up to 70B parameters on our first attempt with minimal training instability and no loss spikes. This involved training, at a range of smaller sizes, thousands of dense transformer models with grouped-query attention, SwiGLU activations, RMS normalization, and a custom tokenizer.
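As an illustration of the building blocks mentioned above, here is a minimal PyTorch sketch of a decoder block combining RMS normalization, grouped-query attention, and a SwiGLU feed-forward layer. Dimensions are small and illustrative, positional embeddings are omitted, and this is a sketch of the general pattern, not the exact architecture or code we trained.

```python
# Minimal sketch of a decoder block with RMSNorm, grouped-query attention,
# and a SwiGLU feed-forward layer. Names and sizes are illustrative only;
# rotary position embeddings and caching are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class GroupedQueryAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SwiGLU: silu(gate) multiplied elementwise with the "up" projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 512, n_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GroupedQueryAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim=4 * dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        return x + self.ffn(self.ffn_norm(x))
```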
To help other teams train, scale, and evaluate models tailored to their own research and product goals, we’re releasing the tools that facilitated this work. The toolkit includes:

- Clean evaluation datasets: high-quality subsets of 11 public benchmarks, plus an original set of code-comprehension questions
- The infrastructure and training scripts, along with learnings from our cluster bring-up
- CARBS, the hyperparameter optimizer we used to scale from small experiments to the 70B run
For all of the above tools, we expand on our process for creating and using them in the detailed write-ups linked below.
We are sharing datasets for model evaluation, consisting of high-quality subsets of 11 public datasets, and a set of original questions for code comprehension. We found that both open-source and closed models achieved nearly 100% accuracy on some datasets when evaluated only on high-quality, unambiguous questions. For more on why we selected these particular datasets, as well as details about the process of creating the data and the actual datasets themselves, see our detailed write-up on evaluations.
Our infrastructure and training scripts are a critical (and often undisclosed) piece of training very large language models. We hope that our efforts will make it easier for others to experiment at larger scales without needing to reproduce this infrastructure code and knowledge. For more details, see our write-up of our training process and infrastructure bring-up.
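To give a flavor of what such scripts do, here is a small, hypothetical pre-flight check (not taken from our released code) that verifies a host can see its GPUs and complete a basic NCCL all-reduce before a large run starts.

```python
# Illustrative pre-flight check of the kind such scripts typically perform
# (a hypothetical example, not our released scripts): confirm CUDA devices
# are visible and that a small NCCL all-reduce completes across ranks.
import os
import torch
import torch.distributed as dist


def preflight_check() -> None:
    assert torch.cuda.is_available(), "No CUDA devices visible on this host"
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Requires the usual torchrun environment variables (RANK, WORLD_SIZE, ...).
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # Sanity-check cross-rank NCCL communication.
    expected = float(dist.get_world_size())
    assert x.item() == expected, f"all_reduce returned {x.item()}, expected {expected}"
    dist.destroy_process_group()


if __name__ == "__main__":
    preflight_check()
```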
CARBS allowed us to scale to our large training run with minimal training instability and loss spikes on the first attempt — eliminating a huge source of risk for smaller teams experimenting with novel model architectures. We published an extended write-up on how we used CARBS to scale up to our 70B model.
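CARBS itself is open source; the snippet below is only a schematic of the suggest-and-observe loop that a cost-aware tuner drives. The names `Tuner`, `suggest`, and `observe` are hypothetical placeholders rather than the real CARBS interface. The key idea it illustrates is that each observation reports both a quality metric and a cost, which is what lets the optimizer extrapolate good hyperparameters from cheap small-scale runs to larger ones.

```python
# Schematic of a cost-aware tuning loop in the spirit of CARBS. Tuner, suggest,
# and observe are hypothetical placeholders, NOT the actual CARBS API; see the
# CARBS repository for the real interface. Every observation records both the
# metric and the cost of the run, so a real optimizer can trade them off.
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Observation:
    params: Dict[str, float]
    metric: float  # e.g. validation loss of a small-scale training run
    cost: float    # e.g. GPU-hours or total training FLOPs


class Tuner:
    """Placeholder for a cost-aware optimizer; samples uniformly for illustration."""

    def __init__(self, search_space: Dict[str, Tuple[float, float]]):
        self.search_space = search_space
        self.history: List[Observation] = []

    def suggest(self) -> Dict[str, float]:
        # A real cost-aware optimizer fits a model over (params, metric, cost);
        # uniform random sampling stands in for that here.
        return {k: random.uniform(lo, hi) for k, (lo, hi) in self.search_space.items()}

    def observe(self, obs: Observation) -> None:
        self.history.append(obs)


def train_small_run(params: Dict[str, float]) -> Observation:
    # Stand-in for training a small model with these hyperparameters.
    loss = (params["lr"] - 3e-4) ** 2 + 1.0 / params["batch_size"]
    cost = params["batch_size"] * 0.01
    return Observation(params, metric=loss, cost=cost)


tuner = Tuner({"lr": (1e-5, 1e-2), "batch_size": (32, 512)})
for _ in range(20):
    candidate = tuner.suggest()
    tuner.observe(train_small_run(candidate))
best = min(tuner.history, key=lambda o: o.metric)
```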
We trained our model from scratch as an experiment to help answer a few critical questions:
Some key learnings from this experience:
This model training — including all of the above work on infrastructure, evaluations, and hyperparameter optimization — was completed by about a dozen of our engineers and researchers. It is one of many projects we are working on at Imbue. Our other focus areas include reinforcement learning, agent and reasoning architectures, data generation techniques, and experience design to make these powerful capabilities accessible and intuitive to users.