March 25, 2025

Introducing 4o Image Generation

Unlocking useful and valuable image generation with a natively multimodal model capable of precise, accurate, photorealistic outputs.

At OpenAI, we have long believed that image generation should be a primary capability of our language models. That is why we have built our most advanced image generator yet into GPT‑4o. The result: image generation that is not only beautiful, but useful.

A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

The text reads:

(left)
"Transfer between Modalities:

Suppose we directly model
p(text, pixels, sound) [equation]
with one big autoregressive transformer.

Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack

Cons:
* varying bit-rate across modalities
* compute not adaptive"

(Right)
"Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram:
"tokens -> [transformer] -> [diffusion] -> pixels"

^{Best of 8}

selfie view of the photographer, as she turns around to high five him

^{Best of 8}

Useful image generation

From the first cave paintings to modern infographics, humans have used visual imagery to communicate, persuade, and analyze, not just to decorate. Today's generative models can create surreal, breathtaking scenes, but they struggle with the practical imagery people use to share and create information. From logos to diagrams, images can convey precise meaning when augmented with symbols that refer to shared language and experience.

GPT‑4o image generation excels at rendering text accurately, following prompts precisely, and drawing on 4o's inherent knowledge and the chat context, including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, helping you communicate more effectively through visuals and turning image generation into a practical tool with precision and power.

Improved capabilities

We trained our models on the joint distribution of online images and text, learning not just how images relate to language, but also how they relate to each other. Combined with aggressive post-training, the resulting model has surprising visual fluency and can generate images that are useful, consistent, and context-aware.

Text rendering

A picture is worth a thousand words, but sometimes generating a few words in the right place can elevate the meaning of an image. 4o's ability to blend precise symbols with imagery turns image generation into a tool for visual communication.

Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.

Context:
a city street in a random street in Williamsburg, NY with a pole covered entirely by numerous detailed street signs (e.g., street sweeping hours, parking permits required, vehicle classifications, towing rules), including few ridiculous signs at the middle: (paraphrase it to make these legitimate street signs)"Broom Parking for Witches Not Permitted in Zone C" and "Magic Carpet Loading and Unloading Only (15-Minute Limit)" and "Reindeer Parking by Permit Only (Dec 24–25)
Violators will be placed on Naughty List." The signpost is on the right of a street. Do not repeat signs. Signs must be realistic.

Characters:
one witch is holding a broom and the other has a rolled-up magic carpet. They are in the foreground, back slightly turned towards the camera and head slightly tilted as they scrutinize the signs.

Composition from background to foreground:
streets + parked cars + buildings -> street sign -> witches. Characters must be closest to the camera taking the shot

^{Best of ~8}

Multi-turn generation

Because image generation is now native to GPT‑4o, you can refine images through natural conversation. GPT‑4o can build on images and text in the chat context, keeping results consistent throughout. For example, if you are designing a video game character, the character's appearance stays consistent across iterations as you refine and experiment.

Give this cat a detective hat and a monocle

^{Best of 1}

turn this into a triple A video games made with a 4k game engine and add some User interface as overlay from a mystery RPG where we can see a health bar and a minimap at the top as well as spells at the bottom with consistent and iconography

^{Best of 1}

update to a landscape image 16:9 ratio, add more spells in the UI, and unzoom the visual so that we see the cat in a third person view walking through a steampunk manhattan creating beautiful contrast and lighting like in the best triple A game, with cool-toned colors

^{Best of 2}

create the interface when the player opens the menu and we see the cat's character profile with his equipment and another page showing active quests (and it should make sense in relationship with the universe worldbuilding we are describing in the image)

^{Best of 8}

credit creator: Manuel Sainsily

Instruction following

GPT‑4o image generation follows detailed prompts with meticulous attention. While other systems struggle with 5 to 8 objects, GPT‑4o can handle up to 10 to 20 different objects. The tighter binding of objects to their traits and relations allows for better control.

A square image containing a 4 row by 4 column grid containing 16 objects on a white background. Go from left to right, top to bottom. Here’s the list:
1. a blue star
2. red triangle
3. green square
4. pink circle
5. orange hourglass
6. purple infinity sign
7. black and white polka dot bowtie
8. tiedye "42"
9. an orange cat wearing a black baseball cap
10. a map with a treasure chest
11. a pair of googly eyes
12. a thumbs up emoji
13. a pair of scissors
14. a blue and white giraffe
15. the word "OpenAI" written in cursive
16. a rainbow-colored lightning bolt

^{Best of 5}

In-context learning

GPT‑4o can analyze and learn from user-uploaded images, seamlessly integrating their details into its context to inform image generation.

draw a design for a vehicle with triangular wheels, using these images as reference.
label the front wheel, the back wheel, and at the of the diagram say (in small caps)
TRIANGLE WHEELED VEHICLE. English Patent. 2025. OPENAI.

^{Best of ~16}

now put this in a photo taken in new york city.

^{Best of ~16}

World knowledge

Native image generation lets 4o link its knowledge between text and images, resulting in a model that feels smarter and more efficient.

Code Example (Three.js)

HTML

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />
    <title>OpenAI Banner</title>
    <style>
      body { margin: 0; overflow: hidden; }
      canvas { display: block; }
    </style>
  </head>
  <body>
    <script type="module">
      import * as THREE from 'https://cdn.jsdelivr.net/npm/three@0.160.0/build/three.module.js';
      import { OrbitControls } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/controls/OrbitControls.js';
      import { FontLoader } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/loaders/FontLoader.js';
      import { TextGeometry } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/geometries/TextGeometry.js';

      const scene = new THREE.Scene();
      const camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 0.1, 1000);
      const renderer = new THREE.WebGLRenderer({ antialias: true });
      renderer.setSize(window.innerWidth, window.innerHeight);
      document.body.appendChild(renderer.domElement);

      // Lighting
      const light = new THREE.AmbientLight(0xffffff, 1);
      scene.add(light);

      const dirLight = new THREE.DirectionalLight(0xffffff, 1);
      dirLight.position.set(0, 5, 10);
      scene.add(dirLight);

      // Camera position
      camera.position.z = 20;

      // Controls
      const controls = new OrbitControls(camera, renderer.domElement);

      // Banner background
      const bannerGeometry = new THREE.PlaneGeometry(20, 10);
      const bannerMaterial = new THREE.MeshStandardMaterial({ color: 0x1a1a1a });
      const banner = new THREE.Mesh(bannerGeometry, bannerMaterial);
      scene.add(banner);

      // OpenAI Logo texture (placeholder)
      const loader = new THREE.TextureLoader();
      loader.load('https://upload.wikimedia.org/wikipedia/commons/4/4d/OpenAI_Logo.svg', texture => {
        const logoGeometry = new THREE.PlaneGeometry(4, 4);
        const logoMaterial = new THREE.MeshBasicMaterial({ map: texture, transparent: true });
        const logo = new THREE.Mesh(logoGeometry, logoMaterial);
        logo.position.set(-5, 0, 0.1); // Slightly in front of the banner
        scene.add(logo);
      });

      // Load font and add text
      const fontLoader = new FontLoader();
      fontLoader.load('https://threejs.org/examples/fonts/helvetiker_regular.typeface.json', font => {
        const textGeometry = new TextGeometry("I am 4-o", {
          font: font,
          size: 1,
          height: 0.2,
          curveSegments: 12,
          bevelEnabled: true,
          bevelThickness: 0.02,
          bevelSize: 0.02,
          bevelOffset: 0,
          bevelSegments: 5
        });

        textGeometry.center();

        const textMaterial = new THREE.MeshStandardMaterial({ color: 0x00ffcc });
        const textMesh = new THREE.Mesh(textGeometry, textMaterial);
        textMesh.position.set(5, -0.5, 0.1); // Opposite side of logo
        scene.add(textMesh);
      });

      // Resize handler
      window.addEventListener('resize', () => {
        camera.aspect = window.innerWidth / window.innerHeight;
        camera.updateProjectionMatrix();
        renderer.setSize(window.innerWidth, window.innerHeight);
      });

      // Render loop
      function animate() {
        requestAnimationFrame(animate);
        controls.update();
        renderer.render(scene, camera);
      }

      animate();
    </script>
  </body>
</html>

make an image of what this means to you

Photorealism and style

Training on images that reflect a wide variety of styles allows the model to create or transform images convincingly.

A candid paparazzi-style photo of Karl Marx hurriedly walking through the parking lot of the Mall of America, glancing over his shoulder with a startled expression as he tries to avoid being photographed. He’s clutching multiple glossy shopping bags filled with luxury goods. His coat flutters behind him in the wind, and one of the bags is swinging as if he’s mid-stride. Blurred background with cars and a glowing mall entrance to emphasize motion. Flash glare from the camera partially overexposes the image, giving it a chaotic, tabloid feel.

A cat looking into a puddle of water on a street, but its reflection is that of a tiger, and both reflections are realistically distorted by ripples in the water

Limitations

Our model is not perfect. We are aware of several limitations at the moment, and we will work to address them through model improvements after the initial launch.

We have observed that GPT‑4o can occasionally crop longer images, such as posters, too tightly, especially near the bottom.

Safety

In line with our Model Spec, we aim to maximize creative freedom by supporting valuable use cases such as game development, historical exploration, and education, while maintaining strong safety standards. At the same time, it remains as important as ever to block requests that violate those standards. Below are evaluations of additional risk areas where we are working to enable safe, useful content and support broader creative expression for users.

Provenance via C2PA and internal reversible search
All generated images include C2PA metadata, which identifies an image as coming from GPT‑4o, to provide transparency. We have also built an internal search tool that uses technical attributes of generations to help verify whether content came from our model.
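
As a rough illustration of what "includes C2PA metadata" can mean at the file level, the sketch below checks whether a PNG carries a `caBX` chunk, which is where the C2PA specification places the manifest store for PNG files. It rests on several assumptions: that the downloaded image is a PNG, that the metadata survived any re-encoding, and that a presence check is enough for your purpose. The post does not describe the file format or embedding details, and this check does not validate signatures. The helper names (`png_chunk_types`, `has_c2pa_chunk`, `_chunk`) are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk_types(data: bytes) -> list:
    """Return the chunk type names found in a PNG byte string, in file order."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    types = []
    offset = len(PNG_SIGNATURE)
    while offset + 8 <= len(data):
        (length,) = struct.unpack(">I", data[offset:offset + 4])
        chunk_type = data[offset + 4:offset + 8].decode("latin-1")
        types.append(chunk_type)
        # Skip 4-byte length, 4-byte type, the payload, and the 4-byte CRC.
        offset += 12 + length
        if chunk_type == "IEND":
            break
    return types


def has_c2pa_chunk(data: bytes) -> bool:
    """Presence check only (assumption: PNG with a caBX chunk); no validation."""
    return "caBX" in png_chunk_types(data)


def _chunk(chunk_type: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk; used here only to create synthetic test data."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)


# Synthetic byte strings with PNG chunk framing, not real images.
plain = PNG_SIGNATURE + _chunk(b"IHDR", b"\x00" * 13) + _chunk(b"IEND", b"")
tagged = (PNG_SIGNATURE + _chunk(b"IHDR", b"\x00" * 13)
          + _chunk(b"caBX", b"placeholder") + _chunk(b"IEND", b""))
print(has_c2pa_chunk(plain), has_c2pa_chunk(tagged))
```

For real verification, a C2PA-aware tool that validates the manifest and its signature is the appropriate choice; the scan above only shows where such metadata would sit in a PNG.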

Blocking the bad stuff
We continue to block requests to generate images that may violate our content policies, such as child sexual abuse material and sexual content generated without consent. When images of real people are in context, we apply heightened restrictions on what kind of imagery can be created, with particularly robust safeguards around nudity and graphic violence. As with any launch, safety is never finished and is instead an ongoing area of investment. As we learn more about real-world use of the model, we will adjust our policies accordingly.

To learn more about our approach, see the image generation addendum to the GPT‑4o system card.

Using reasoning to power safety
Similar to our deliberative alignment work, we have trained a reasoning LLM to work directly from human-written, interpretable safety specifications. We use this reasoning LLM during development to help us identify and resolve ambiguities in our policies. Together with our multimodal advances and the existing safety techniques developed for ChatGPT and Sora, this allows us to moderate both input text and output images against our policies.

Access and availability

4o image generation starts rolling out today to Plus, Pro, Team, and Free users as the default image generator in ChatGPT, with access coming soon to Enterprise and Edu. It is also available in Sora. For those who want to keep using DALL·E, it can still be accessed through the DALL·E GPT.

Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out over the next few weeks.

Creating and customizing images is as simple as chatting with GPT‑4o: just describe what you need, including specifics such as aspect ratio, exact colors using hex codes, or a transparent background. Because this model creates more detailed images, they take longer to render, often up to one minute.
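
The specifics mentioned above (aspect ratio, hex color codes, transparent background) can be collected into a single prompt string before pasting it into the chat. The following is a minimal sketch of that idea; `build_image_prompt` and the exact wording it emits are hypothetical and not part of any OpenAI tool or API, and nothing here guarantees how the model will interpret the result.

```python
import re

HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")
ASPECT_RATIO = re.compile(r"[1-9]\d*:[1-9]\d*")


def build_image_prompt(description, aspect_ratio=None, hex_colors=(),
                       transparent_background=False):
    """Compose one prompt string from a description plus optional specifics."""
    parts = [description.strip().rstrip(".") + "."]
    if aspect_ratio is not None:
        if not ASPECT_RATIO.fullmatch(aspect_ratio):
            raise ValueError(f"aspect ratio must look like '16:9', got {aspect_ratio!r}")
        parts.append(f"Aspect ratio: {aspect_ratio}.")
    if hex_colors:
        for color in hex_colors:
            # Reject anything that is not a 6-digit hex code such as #00FFCC.
            if not HEX_COLOR.fullmatch(color):
                raise ValueError(f"not a 6-digit hex color: {color!r}")
        parts.append("Use exactly these colors: " + ", ".join(hex_colors) + ".")
    if transparent_background:
        parts.append("Transparent background.")
    return " ".join(parts)


prompt = build_image_prompt(
    "A flat-style logo of a paper crane",
    aspect_ratio="16:9",
    hex_colors=["#1A1A1A", "#00FFCC"],
    transparent_background=True,
)
print(prompt)
```

Validating the hex codes and ratio up front catches typos before a render that may take up to a minute.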

credit creator: [Alex Duffy](https://every.to/@AlxAi)

credit creator: [August Kamp](https://www.instagram.com/august.kamp/?igsh=MTRpeG9xd3F2MzEyeg#)

Author

OpenAI

Leadership

Gabriel Goh: Image Generation

Jackie Shannon: ChatGPT Product

Mengchao Zhong, Wayne Chang: ChatGPT Engineering

Rohan Sahai: Sora Product and Engineering

Brendan Quinn, Tomer Kaftan: Inference

Prafulla Dhariwal: Multimodal Organization

Research

Foundational Research

Allan Jabri, David Medina, Gabriel Goh, Kenji Hata, Lu Liu, Prafulla Dhariwal

Core Research

Aditya Ramesh, Alex Nichol, Casey Chu, Cheng Lu, Dian Ang Yap, Heewoo Jun, James Betker, Jianfeng Wang, Long Ouyang, Li Jing, Wesam Manassra

Research Contributors

Aiden Low, Brandon McKinzie, Charlie Nash, Huiwen Chang, Ishaan Gulrajani, Jamie Kiros, Ji Lin, Kshitij Gupta, Yang Song

Model Behavior

Laurentia Romaniuk

Multimodal Organization

Andrew Gibiansky, Yang Lu

Data

Data Leads

Gildas Chabot, James Park Lennon

Data

Arshi Bhatnagar, Dragos Oprica, Rohan Kshirsagar, Spencer Papay, Szi-chieh Yu, Wesam Manassra, Yilei Qian

Moderators

Hazel Byrne, Jennifer Luckenbill, Mariano López

Human Data Advisors

Long Ouyang

Scaling

Inference Leads

Brendan Quinn, Tomer Kaftan

Inference

Alyssa Huang, Jacob Menick, Nick Stathas, Ruslan Vasilev, Stanley Hsieh

Applied

ChatGPT Product Lead

Jackie Shannon

ChatGPT Engineering Leads

Mengchao Zhong, Wayne Chang

Product Design Lead

Matt Chan

Data Science

Xiaolin Hao

ChatGPT

Andrew Sima, Annie Cheng, Benjamin Goh, Boyang Niu, Dian Ang Yap, Duc Tran, Edede Oiwoh, Eric Zhang, Ethan Chang, Jeffrey Dunham, Jay Chen, Kan Wu, Karen Li, Kelly Stirman, Mengyuan Xu, Michelle Qin, Ola Okelola, Pedro Aguilar, Rocky Smith, Rohit Ramchandani, Sara Culver, Sean Fitzgerald, Vlad Fomenko, Wanning Jiang, Wesam Manassra, Xiaolin Hao, Yilei Qian

Sora

Sora Product Leads

Rohan Sahai, Wesam Manassra

Sora Product and Engineering

Boyang Niu, David Schnurr, Gilman Tolle, Joe Taylor, Joey Flynn, Mike Starr, Rajeev Nayak, Rohan Sahai, Wesam Manassra

Safety

Safety Lead

Somay Jain

Safety

Alex Beutel, Andrea Vallone, Botao Hao, Brendan Quinn, Cameron Raymond, Chong Zhang, David Robinson, Eric Wallace, Filippo Raso, Huiwen Chang, Ian Kivlichan, Irina Kofman, Keren Gu-Lemberg, Kristen Ying, Madelaine Boyd, Meghan Shah, Michael Lampe, Owen Campbell-Moore, Rohan Sahai, Rodrigo Riaza Perez, Sam Toizer, Sandhini Agarwal, Troy Peterson

Strategy

Adam Cohen, Adam Wells, Ally Bennett, Ashley Pantuliano, Carolina Paz, Claudia Fischer, Declan Grabb, Gaby Sacramone-Lutz, Lauren Jonas, Ryan Beiermeister, Shiao Lee, Tom Stasi, Tyce Walters, Ziad Reslan, Zoe Stoll

Marketing and Communications

Communications and Marketing Leads

Minnia Feng, Natalie Summers, Taya Christianson

Communications

Alex Baker-Whitcomb, Ashley Tyra, Bailey Richardson, Gaby Raila, Marselus Cayton, Scott Ethersmith, Souki Mansoor

Design and Creative

Leads

Kendra Rimbach, Veit Moeller

Design

Adam Brandon, Adam Koppel, Angela Baek, Cary Hudson, Dana Palmie, Freddie Sulit, Jeffrey Sabin Matsumoto, Leyan Lo, Matt Nichols, Thomas Degry, Vanessa Antonia Schefke, Yara Khakbaz

Special Thanks

Aditya Ramesh, Aidan Clark, Alex Beutel, Ben Newhouse, Ben Rossen, Che Chang, Greg Brockman, Hannah Wong, Ishaan Singal, Jason Kwon, Jiacheng Feng, Jiahui Yu, Joanne Jang, Johannes Heidecke, Kevin Weil, Mark Chen, Mia Glaese, Nick Turley, Raul Puri, Reiichiro Nakano, Rui Shu, Sam Altman, Shuchao Bi, Vinnie Monaco