
October 1, 2024

Model Distillation in the API

Fine-tune a cost-efficient model with the outputs of a large frontier model, all on the OpenAI platform

DALL·E generated impressionist oil painting of stacked light green rectangles serving as columns, with emerald streams weaving repeatedly through each tier

We’re introducing a new Model Distillation offering to provide developers with an integrated workflow to manage the entire distillation pipeline directly within the OpenAI platform. This lets developers easily use the outputs of frontier models like o1-preview and GPT-4o to fine-tune and improve the performance of more cost-efficient models like GPT-4o mini.

Model distillation involves fine-tuning smaller, cost-efficient models using outputs from more capable models, allowing them to match the performance of advanced models on specific tasks at a much lower cost. Until now, distillation has been a multi-step, error-prone process, which required developers to manually orchestrate multiple operations across disconnected tools, from generating datasets to fine-tuning models and measuring performance improvements. Since distillation is inherently iterative, developers needed to repeatedly run each step, adding significant effort and complexity.

Our new Model Distillation suite includes:

  • Stored Completions: Developers can now easily generate datasets for distillation by automatically capturing and storing the input-output pairs generated by one of our models, like GPT-4o or o1-preview, through our API. With Stored Completions, you can easily build datasets with your production data to evaluate and fine-tune models. Developers can review this integration guide to learn how to opt in to storing completions (a short sketch of reviewing stored completions follows this list).

  • Evals (beta): Developers can now create and run custom evaluations on our platform to measure model performance on specific tasks. Instead of manually creating evaluation scripts and integrating disparate logging tools, Evals provides an integrated way to measure model performance. You can either use data from Stored Completions or upload existing datasets to set up your evaluations. Evals can also be used independently of fine-tuning to quantitatively evaluate model performance for your use cases.

  • Fine-tuning: Stored Completions and Evals are fully integrated with our existing fine-tuning offering. This means that developers can use datasets created with Stored Completions in their fine-tuning jobs and run evaluations on fine-tuned models using Evals, all within our platform.
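
Beyond capturing completions at request time, the Python SDK also exposes a listing endpoint for stored chat completions, which is handy when reviewing what will go into a dataset. The sketch below assumes that endpoint (client.chat.completions.list) and its filter parameters; check the integration guide linked above for the exact options.

python

from openai import OpenAI

client = OpenAI()

# List recently stored completions for gpt-4o; the filter parameters used
# here are assumptions, see the Stored Completions guide for the full set.
stored = client.chat.completions.list(model="gpt-4o", limit=20)

for completion in stored:
    # Review each stored input-output pair before tagging it for a
    # distillation or evaluation dataset.
    print(completion.id, completion.choices[0].message.content)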

How to use Model Distillation

First, create an evaluation to measure the performance of the model you want to distill into, which in this example will be GPT-4o mini. This evaluation will be used to continuously test the distilled model’s performance and help you decide whether to deploy it.

Example of evaluation used for model distillation
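
The evaluation above is created in the platform dashboard. If you would rather define it in code, a rough sketch using the Python SDK’s evals resource could look like the following; the data_source_config and testing_criteria shapes here are assumptions about the Evals API rather than something shown in this post, so check the Evals docs for the supported options.

python

from openai import OpenAI

client = OpenAI()

# A sketch of an eval definition for the distillation target (GPT-4o mini).
# Field shapes below are assumptions; see the Evals docs for supported graders.
capital_eval = client.evals.create(
    name="capital-cities-distillation",
    data_source_config={
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "question": {"type": "string"},
                "expected_answer": {"type": "string"},
            },
            "required": ["question", "expected_answer"],
        },
        "include_sample_schema": True,
    },
    testing_criteria=[
        {
            # Simple exact-match grader comparing the sampled output
            # against the reference answer in each test item.
            "type": "string_check",
            "name": "answer matches reference",
            "input": "{{ sample.output_text }}",
            "reference": "{{ item.expected_answer }}",
            "operation": "eq",
        }
    ],
)
print(capital_eval.id)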

Next, use Stored Completions to create a distillation dataset of real-world examples using GPT-4o’s outputs for the tasks on which you want to fine-tune GPT-4o mini. You can do this by setting store: true in the Chat Completions API request to automatically store these input-output pairs without any latency impact. These stored completions can be reviewed, filtered, and tagged to create high-quality datasets for fine-tuning or evaluation.

python

from openai import OpenAI

client = OpenAI()

# store=True saves this request/response pair as a Stored Completion;
# the metadata tags make it easy to filter later when building a dataset.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "what's the capital of the USA?"
                }
            ]
        }
    ],
    store=True,
    metadata={"username": "user123", "user_id": "123", "session_id": "123"},
)

Finally, use this dataset to fine-tune GPT-4o mini. Stored Completions can be used as a training file when creating a fine-tuned model. Once the model is fine-tuned, you can go back to Evals to test whether the fine-tuned GPT-4o mini model meets your performance criteria when compared to GPT-4o.
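
Kicking off the fine-tuning job itself uses the standard fine-tuning API. The sketch below is a minimal example and assumes the reviewed stored completions have already been exported or uploaded as a chat-format JSONL training file (the "file-abc123" ID is a placeholder):

python

from openai import OpenAI

client = OpenAI()

# "file-abc123" is a placeholder for a chat-format JSONL training file built
# from your reviewed Stored Completions.
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",
    model="gpt-4o-mini-2024-07-18",
)

# Check on the job later; once it succeeds, the fine-tuned model can be
# compared against GPT-4o using Evals.
job = client.fine_tuning.jobs.retrieve(job.id)
print(job.status, job.fine_tuned_model)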

Fine-tuning is an iterative process. If the initial results aren’t satisfactory, you may need to refine the dataset, adjust the training parameters, or capture more specific examples where the model is underperforming. The goal is to incrementally improve the distilled model until it performs well enough for production use.

Availability & Pricing

Model Distillation is available today to all developers and can be used to distill any of our models, including GPT-4o and o1-preview. As a reminder, we’re also offering 2M free training tokens per day on GPT-4o mini and 1M free training tokens per day on GPT-4o until October 31 to help developers get started with distillation. Beyond that limit, the cost of training and running a distilled model is the same as our standard fine-tuning prices, which you can find on our API pricing page.

Stored Completions is available for free. Evals, which are available in beta, are charged at standard model prices based on the tokens used. Through the end of the year, developers can run evaluations for free (up to 7 per week) when they opt in to share their Evals with OpenAI. Evals shared with us will be used to help us improve and evaluate our future models.

For more information, check out our Model Distillation docs.