This is the first post in a three-part series on how we trained our 70B model. The series covers setting up infrastructure, conducting evaluations, and hyperparameter optimization.
Introduction
When training our 70B model, we sought to accurately evaluate models’ natural language understanding and reasoning abilities. To do so, we needed high-quality datasets on which to evaluate our models — free of confusingly worded, unclear, subjective, ambiguous, unanswerable, or mislabeled questions. Such questions can skew evaluation results: if a model “incorrectly” answers a highly ambiguous question, the issue lies more with the question itself than with the model’s reasoning capabilities.
To address this, we sanitized 11 publicly available multiple-choice question-answering datasets and created private versions consisting of handwritten questions by human annotators. After removing low-quality and mislabeled questions, we found that all evaluated open-source and closed models — our 70B model, Llama 2 70B, Llama 3 70B, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro — achieved high accuracy on both publicly available and private benchmarks.
Today, we are releasing the sanitized public datasets, our private datasets, a fine-tuned Llama 3 70B model that identifies question quality, and an entirely new dataset of questions related to reasoning about code. In this piece, we:
- Explain the datasets we’re releasing
- Detail our reasoning behind creating and sanitizing these datasets
- Outline our process for evaluating questions
- Share our findings from evaluating a variety of models on both the public and sanitized versions of each dataset
What we are releasing
As part of our 70B model toolkit, we are releasing a series of new and sanitized evaluation datasets to help robustly evaluate reasoning models. With these sanitized datasets, we were able to more accurately evaluate the performance of our 70B model against other frontier models.
These resources include:
- High-quality1 and correctly-labeled subsets of 11 academic reasoning benchmarks:
  - Up to 1,000 items from the original dataset that have been screened for quality
  - Up to 1,000 new human-written questions, so others can precisely evaluate their own models without fear of data contamination
- Tools for identifying and removing low-quality questions from an evaluation dataset:
  - A dataset of 450,000 human judgments that our question quality model was trained on, so that others can investigate factors in human uncertainty about real-world questions or train their own question quality assessment models
  - A fine-tuned 70B model, built with Meta Llama 3, to identify question quality in any evaluation dataset
- A completely new dataset about code understanding so that others can improve model performance on code-related reasoning
By releasing these tools and datasets, we hope to enable researchers to conduct accurate model evaluations and sanitize their own datasets to do so.
Why we created and sanitized these datasets
Why we selected these benchmarks
To evaluate our language model on natural language understanding and reasoning, we selected 11 benchmark datasets that fit a set of criteria. We rejected any benchmark dataset that was:
- Too small to provide statistical power for comparing language model performance (see the sketch after this list).
- Low-quality, meaning that a manual review of a small number of examples made it apparent that the dataset contained a large proportion of low-quality or mislabeled examples.
- Optimized for capabilities other than reasoning. For example, we eliminated benchmarks like MMLU, for which performance relies to a large extent on memorization. We believe that memorization tasks are better solved by tools that retrieve relevant information in real time, and that the pre-trained language model itself should primarily be optimized for reasoning, not factual knowledge.
- Uncommon in other AI/ML research.
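As a rough illustration of the statistical power criterion, the sketch below uses a normal-approximation two-proportion test to ask whether a benchmark with n questions can reliably separate two models whose true accuracies differ by a few points. This is purely illustrative — the accuracies, thresholds, and sample sizes are made up, and this is not the selection procedure we actually ran.

```python
# Illustrative sketch only: can a benchmark with n questions reliably detect a
# gap between two models' true accuracies? (Not our actual selection code.)
from math import sqrt
from statistics import NormalDist


def gap_is_detectable(n: int, acc_a: float, acc_b: float,
                      alpha: float = 0.05, power: float = 0.8) -> bool:
    """Two-proportion z-test: is |acc_a - acc_b| above the minimum detectable difference?"""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (acc_a + acc_b) / 2
    min_detectable = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(acc_a * (1 - acc_a) + acc_b * (1 - acc_b))
    ) / sqrt(n)
    return abs(acc_a - acc_b) >= min_detectable


print(gap_is_detectable(n=300, acc_a=0.84, acc_b=0.88))   # False: too few questions
print(gap_is_detectable(n=2000, acc_a=0.84, acc_b=0.88))  # True
```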
Given these specifications, we decided to use the following datasets: ANLI, ARC, BoolQ, ETHICS, GSM8K, HellaSwag, OpenBookQA, MultiRC, RACE, Social IQa, and WinoGrande.
Why we chose multiple-choice evaluations
Our goal was to assess the performance of our base language model, a single next-token predictor, without chain-of-thought reasoning or in-context learning. To evaluate this, we converted all evaluation questions to a multiple-choice format, using a simple prompt that asked the model to output single-token completions (e.g., “A”, “B”, “C”). For datasets, like GSM8K, that do not come with incorrect answer options, we created sensible options via dataset-specific approaches.2 For datasets that contain a single passage of text with multiple questions (like RACE), we converted the dataset to a list of passage-question pairs.
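As a minimal sketch of this conversion (the prompt wording and the GSM8K distractor options below are illustrative, not the ones we actually used), each item can be rendered as a lettered question whose expected completion is a single token:

```python
# Minimal sketch: render a benchmark item as a multiple-choice prompt whose
# answer is a single token ("A", "B", ...). Wording and distractors are
# illustrative only.
from string import ascii_uppercase


def to_multiple_choice(question: str, options: list[str]) -> str:
    letters = ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append(f"Answer with a single letter ({', '.join(letters)}).")
    return "\n".join(lines)


print(to_multiple_choice(
    "Natalia sold clips to 48 of her friends in April, and then she sold half "
    "as many clips in May. How many clips did Natalia sell altogether in April and May?",
    ["72", "96", "48", "24"],  # correct answer: 72
))
```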
Next, we standardized the model responses. To ensure that our model would always answer with a capital letter, we fine-tuned the model to follow a straightforward prompt (e.g., “Always answer A, B, or C”). We used CARBS, our hyperparameter optimizer, to determine the optimal parameters for instruction fine-tuning. For the open models we evaluated (Llama 2 70B and Llama 3 70B), we used the same prompt, procedure, and model-specific optimal hyperparameters as determined by CARBS. For the closed API-based models that were already fine-tuned internally (GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro), we engineered a prompt to ensure that these models would also answer the original question without additional reasoning traces or chain-of-thought. We scored all models on accuracy: the fraction of questions for which the highest-likelihood answer is correct.
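The scoring itself reduces to a simple comparison. A sketch of the accuracy computation might look like the following, assuming a hypothetical `answer_letter_logprobs` helper (not a real API) that returns the model’s log-likelihood of each candidate letter as the next token:

```python
# Sketch of the accuracy metric: the fraction of questions for which the
# highest-likelihood answer letter matches the label. `answer_letter_logprobs`
# is a hypothetical helper standing in for a real model call.
from typing import Callable

LogprobFn = Callable[[str, list[str]], dict[str, float]]


def accuracy(items: list[dict], answer_letter_logprobs: LogprobFn) -> float:
    """items: each has a "prompt", the candidate "letters", and the gold "label"."""
    correct = 0
    for item in items:
        logprobs = answer_letter_logprobs(item["prompt"], item["letters"])
        prediction = max(logprobs, key=logprobs.get)  # highest-likelihood single token
        correct += prediction == item["label"]
    return correct / len(items)
```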
Why we sanitized public datasets
The 11 public datasets we chose were not immediately usable, as all of them contained different amounts of low-quality questions: questions that were confusingly worded, unclear, subjective, ambiguous, unanswerable, or mislabeled.
An example of a low-quality question from the RACE dataset:
For this question, first read the passage below.