Introducing vision to the fine-tuning API
Developers can now fine-tune GPT-4o with images and text to improve vision capabilities
Today, we’re introducing vision fine-tuning on GPT-4o, making it possible to fine-tune with images in addition to text. Developers can customize the model to have stronger image understanding capabilities, enabling applications like enhanced visual search functionality, improved object detection for autonomous vehicles or smart cities, and more accurate medical image analysis.
Since we first introduced fine-tuning on GPT-4o, hundreds of thousands of developers have customized our models using text-only datasets to improve performance on specific tasks. In many cases, however, fine-tuning on text alone doesn’t deliver the expected performance boost.
How it works
Vision fine-tuning follows a similar process to fine-tuning with text—developers can prepare their image datasets to follow the proper format and then upload that dataset to our platform. They can improve the performance of GPT-4o for vision tasks with as few as 100 images, and drive even higher performance with larger volumes of text and image data.
1{2 "messages": [3 { "role": "system", "content": "You are an assistant that identifies uncommon cheeses." },4 { "role": "user", "content": "What is this cheese?" },5 { "role": "user", "content": [6 {7 "type": "image_url",8 "image_url": {9 "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"10 }11 }12 ] 13 },14 { "role": "assistant", "content": "Danbo" }15 ]16}17
We collaborated with a small group of trusted partners to better understand the real-world applications of vision fine-tuning. We’re excited to showcase a few examples of what they built.
Grab improves image detection and understanding on the road
Grab, a leading food delivery and rideshare company, turns street-level imagery collected from their drivers into mapping data used to power GrabMaps, a mapping service enabling all of their Southeast Asia operations. Using vision fine-tuning with only 100 examples, Grab taught GPT-4o to correctly localize traffic signs and count lane dividers to refine their mapping data. As a result, Grab was able to improve lane count accuracy by 20% and speed limit sign localization by 13% over a base GPT-4o model, enabling them to better automate their mapping operations from a previously manual process.
Automat improves success rate of desktop bots automating business processes
Automat, an enterprise automation company, builds desktop and web agents that process documents and take UI-based actions to automate business processes. With vision fine-tuning and a dataset of screenshots, Automat trained GPT-4o to locate UI elements on a screen given a natural language description, improving the success rate of their RPA agent from 16.60% to 61.67%—a 272% uplift in performance compared to base GPT-4o. Additionally, Automat trained GPT-4o on just 200 images of unstructured insurance documents to achieve a 7% lift in F1 score on information extraction tasks.
Coframe enhances quality of digital content creation
Coframe is building an AI growth engineering assistant that helps businesses continuously create and test variations of their websites and UIs to optimize business metrics. A key part of this task is autonomously generating new branded sections of a website, based on the rest of the website. Coframe tasked GPT-4o with generating code for the next section of a website based on images and existing code. By fine-tuning GPT-4o with images and code, they improved the model’s ability to generate websites with consistent visual style and correct layout by 26% compared to base GPT-4o.
Safety & privacy
We continuously run automated safety evals on fine-tuned models and monitor usage to ensure applications adhere to our usage policies. As with all API services, vision fine-tuning is subject to our Enterprise privacy commitments. Fine-tuned models remain entirely under your control, with full ownership of your business data. We do not train our models on the inputs or outputs used in this service without your explicit permission.
Availability & pricing
Vision fine-tuning capabilities are available today for all developers on paid usage tiers. These capabilities are supported on the latest GPT-4o model snapshot, gpt-4o-2024-08-06. Developers can extend existing fine-tuning training data for images using the same format as our Chat endpoints.
We’re offering 1M training tokens per day for free through October 31, 2024 to fine-tune GPT-4o with images. After October 31, 2024, GPT-4o fine-tuning training will cost $25 per 1M tokens and inference will cost $3.75 per 1M input tokens and $15 per 1M output tokens. Image inputs are first tokenized based on image size, and then priced at the same per-token rate as text inputs. More details can be found on the API Pricing page.
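As a rough illustration of how these rates combine, here is a back-of-the-envelope cost sketch in Python; the 2M-token dataset size and the assumption that billed training tokens scale with the number of epochs are illustrative, not quoted from this post.

# Rates from this post (after October 31, 2024)
TRAINING_PER_M_TOKENS = 25.00   # $ per 1M training tokens
INPUT_PER_M_TOKENS = 3.75       # $ per 1M input tokens at inference
OUTPUT_PER_M_TOKENS = 15.00     # $ per 1M output tokens at inference

def training_cost(dataset_tokens: int, epochs: int = 1) -> float:
    # dataset_tokens counts text tokens plus tokenized image content;
    # scaling linearly with epochs is an assumption for this sketch
    return dataset_tokens * epochs / 1_000_000 * TRAINING_PER_M_TOKENS

# Example: a 2M-token dataset trained for 3 epochs -> $150.00
print(f"${training_cost(2_000_000, epochs=3):.2f}")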
To get started, visit the fine-tuning dashboard, click ‘create’, and select gpt-4o-2024-08-06 from the base model drop-down. To learn how to fine-tune GPT-4o with images, visit our docs.
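Once a job completes, the resulting model can be called like any other chat model, with images passed in the same content format used for training. A minimal sketch follows; the ft:... model ID is a placeholder for the one returned by your fine-tuning job.

from openai import OpenAI

client = OpenAI()

# The "ft:..." model ID below is a placeholder; use the ID returned
# when your fine-tuning job completes.
response = client.chat.completions.create(
    model="ft:gpt-4o-2024-08-06:my-org::example123",
    messages=[
        {"role": "system", "content": "You are an assistant that identifies uncommon cheeses."},
        {"role": "user", "content": [
            {"type": "text", "text": "What is this cheese?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/3/36/Danbo_Cheese.jpg"
            }},
        ]},
    ],
)
print(response.choices[0].message.content)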