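OpenAI o3 and o4-mini are the latest visual reasoning models in our o-series. For the first time, our models can think with images in their chain of thought, not just see them.
Similar to our earlier OpenAI o1 model, o3 and o4-mini think for longer before answering and use a long internal chain of thought before responding to the user. o3 and o4-mini extend this capability further by thinking with images in their chain of thought. Thinking with images is achieved by transforming user-uploaded images with tools, which lets the models crop, zoom in, and rotate images, in addition to other simple image processing techniques. More importantly, these capabilities are native and do not rely on separate specialized models.
ChatGPT's enhanced visual intelligence analyzes images more thoroughly, accurately, and reliably than before, helping you solve tougher problems. It can seamlessly combine advanced reasoning with web search and image manipulation tools such as automatic zooming, cropping, flipping, or enhancement, and it can extract insights even from imperfect photos. For example, you can upload a photo of an economics problem set to receive step-by-step explanations, or share a screenshot of a build error to quickly get a root-cause analysis.
This approach opens a new axis for test-time compute scaling that seamlessly blends visual and textual reasoning. This is reflected in state-of-the-art results on multimodal benchmarks and marks a significant step toward multimodal reasoning.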
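Visual reasoning in action
Thinking with images lets you interact with ChatGPT more easily. You can ask a question by taking a photo without worrying about how objects are positioned, for example whether the text is upside down or whether one photo contains multiple physics problems. Even when an object is not obvious at first glance, visual reasoning lets the model zoom in to see it more clearly.
All examples were completed with OpenAI o3.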

What is written on the notebook?
Reasoned about notebook contents for 20 seconds
Analyzed image

Analyzed image

- The user posted an image at /mnt/data/own_notebook.jpg and asked me to read what's written on it. I'll need to load the image so I can inspect the text. Once I view it, I realize the text is upside down, so I’ll rotate it so it's readable.
From there, I can check what's written and share my findings clearly with the user.
Analyzed image

- The user shared an image found at /mnt/data/own_notebook.jpg and asked me to read the text on it. I’ll load the image and zoom in to see the writing clearly. It turns out the text is upside down, so I’ll rotate it for easy reading. Afterward, I’ll crop the lower part of the image to focus on the relevant area, and rotate the cropped section to ensure the text is right side up.
Analyzed image

Done
It says: “4th February – finish roadmap.”
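The transcript above describes rotating an upside-down photo, cropping the region of interest, and zooming in, but it does not show the tool calls themselves. The following is a minimal sketch of those three steps, assuming the Pillow library; the helper name `upright_crop`, the crop box, and the blank stand-in image are hypothetical and are not taken from the original session.

```python
from PIL import Image


def upright_crop(img, box, zoom=2):
    """Rotate an upside-down image by 180 degrees, crop a region, and enlarge it."""
    rotated = img.rotate(180)        # a 180-degree turn keeps the canvas size
    cropped = rotated.crop(box)      # box = (left, upper, right, lower)
    w, h = cropped.size
    return cropped.resize((w * zoom, h * zoom))


# Stand-in for the uploaded photo: a blank 400x300 canvas (hypothetical data).
photo = Image.new("RGB", (400, 300), "white")
detail = upright_crop(photo, box=(100, 150, 300, 250), zoom=2)
print(detail.size)  # (400, 200)
```

In practice the crop box would come from inspecting the image first, which is what the repeated "Analyzed image" steps in the transcript appear to do.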
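Our latest visual reasoning models work together with other tools such as Python data analysis, web search, and image generation to solve more complex problems creatively and effectively, giving users a multimodal agentic experience for the first time.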

solve this maze and plot a path with red-colored line
Reasoned about maze solution and path for 1 minute 44 seconds
Analyzed image

- The user provided an image path, but I’ve yet to load it. First, I’ll need to load the image from the given path for analysis. Once that’s done, I can detect the maze path, potentially marked by black lines, and plot a red line to showcase it. The image is relatively small—404x
- I’m thinking the start could be at the top left missing border piece, or maybe near the left border. To help figure this out, I’ll look at the first row pixels to find any white areas near the border. Since the maze walls are black on white, I’ll try a thresholding approach. To invert it, I’ll use a threshold value of 200, which would set white to 255 and black closer to 0. I’ll apply this threshold using the inverse method.
- Okay, I initially thought I could approximate the maze start near the top-left opening. I applied some thresholding to detect white pixels but encountered an issue—nothing changed. It seems the image might have high noise. Oddly, when inspecting the grayscale, all pixel values are zero, which suggests the image is completely black. This doesn’t match what we saw earlier with Matplotlib, so I’ll delve deeper into how PIL loaded the image and inspect the underlying pixel array.
Analyzed
Python
print(imgcv.max(), imgcv.min(), imgcv.mean())
0 0 0.0
Analyzed
Python
import numpy as np
img_np = np.array(img)
print(img_np.shape, img_np.dtype, img_np.max(), img_np.min())
(404, 404, 4) uint8 255 0
It looks like the lines are black with a transparent background, but when PIL loads the image, it's showing a white background due to how it's handling the alpha channel. I’ll try converting the image to RGB, but the pixel values are still showing all zeros. This is odd, since the black lines should be visible. Hmm, earlier the display had a white background. Let me check the pixel values at coordinate (200, 200) to get a clearer picture.
Analyzed
Python
img_rgb = img.convert('RGB')
imgcv = cv2.cvtColor(np.array(img_rgb), cv2.COLOR_RGB2GRAY)
print(imgcv.max(), imgcv.min())
0 0
Analyzed
Python
print(img_np[200,200])
[0 0 0 0]
- The image has black lines with a transparent background, but it appears white due to how PIL handled the alpha channel. When displayed earlier, I saw white areas, which were actually transparent areas shown on a white grid (the Matplotlib background color). So, the lines are black with the transparent spaces representing the open path. I'll need to treat these transparent areas as the open path when interpreting the maze. It’s essentially a black line maze on a clear canvas.
- I see that the maze walls are black pixels, and the path is transparent with an alpha value of 0. I need to find the path from an open cell on one border to another on the opposite border. To do this, I'll create a binary matrix where the passable cells have an alpha of 0 and are not part of a line border. Now, I'll focus on processing the image channels to identify the path.
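The reasoning above hinges on how an RGBA image behaves when its alpha channel is dropped: black walls and transparent background both have RGB values of zero, so a plain RGB conversion looks entirely black. The sketch below reproduces that effect on a tiny image, assuming Pillow; the 4x4 image is a hypothetical stand-in for the maze, not the original file.

```python
from PIL import Image

# Hypothetical 4x4 stand-in: fully transparent canvas with one opaque black "wall" pixel.
maze = Image.new("RGBA", (4, 4), (0, 0, 0, 0))
maze.putpixel((1, 1), (0, 0, 0, 255))

# Dropping alpha keeps only the RGB channels, which are 0 everywhere.
print(maze.convert("RGB").getextrema())  # ((0, 0), (0, 0), (0, 0))

# Compositing onto an opaque white background first reproduces what a viewer shows.
white = Image.new("RGBA", maze.size, (255, 255, 255, 255))
flattened = Image.alpha_composite(white, maze).convert("L")
print(flattened.getpixel((0, 0)), flattened.getpixel((1, 1)))  # 255 0

# Alternatively, read walls directly from the alpha channel, as the transcript goes on to do.
alpha_band = maze.getchannel("A")
print(alpha_band.getpixel((0, 0)), alpha_band.getpixel((1, 1)))  # 0 255
```

Either route separates walls from open space; the transcript takes the second one and treats `alpha == 0` as passable.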
Analyzed
Python
alpha = img_np[:,:,3]
# passable pixel if alpha == 0 (transparent) else wall
passable = alpha == 0
# find openings along border
h,w = passable.shape
openings = []
# top row
for x in range(w):
    if passable[0,x]:
        openings.append((0,x))
# bottom row
for x in range(w):
    if passable[h-1,x]:
        openings.append((h-1,x))
# left col
for y in range(h):
    if passable[y,0]:
        openings.append((y,0))
# right col
for y in range(h):
    if passable[y,w-1]:
        openings.append((y,w-1))
len(openings)
1616
- The walls are black, and the path is transparent (alpha 0). I need to find a way from the open cell at one border to the open cell at the opposite border. I’ll create a binary matrix where the passable cells are transparent (alpha 0). However, there are a lot of transparent cells, and the paths might not be clear due to line gaps. I plan to treat the lines as about 3 pixels thick. I'll try BFS/DFS to analyze the passable area from the starting point.
- It seems the maze's border cells might connect to "air" outside the maze—this is the transparent area surrounding the maze. But the path is only open through legitimate openings in the border walls. Outside of these openings, the transparent background might merge with the outside area, making it harder to differentiate. I need to treat the inside walkway as enclosed by walls, ensuring it connects to the outside only at those openings, not through a continuous open background.
- I’m thinking of dilating the walls by 3 pixels. This might cause openings smaller than 3 pixels to close, which could block the path. However, this might inadvertently seal parts of the interior walkway, leading to a disconnected path. So, I should try morphological closing—using a small kernel to dilate the walls—and then check for passable areas. This could help me manage the maze more effectively while keeping the path clear.
Analyzed
Python
# find pattern of vertical walls
# examine alpha channel row 0 lines to see thickness
unique_rows = np.unique(np.sum(~passable, axis=1))
unique_rows[:20]
array([ 0, 16, 20, 22, 24, 26, 28, 30, 32, 34, 94, 150, 178, 192, 206, 220, 234, 248, 276, 290])
Analyzed
Python
import cv2
# create wall mask (1 where wall)
wall = ~passable
# dilate walls by 3 px
kernel = np.ones((3,3), np.uint8)
wall_dil = cv2.dilate(wall.astype(np.uint8), kernel, iterations=3)
pass_dil = wall_dil==0
# label connected regions in pass_dil
num_labels, labels = cv2.connectedComponents(pass_dil.astype(np.uint8))
print(num_labels)
2
- It might be easier to treat the entire transparent area outside the walls as open, and then use BFS starting from the top-left outside region. However, since the walkway is also transparent, using BFS would flood the inside walkway from the entrance openings. I think a better approach might be to switch to grid cells, considering the maze lines are about 4 pixels wide. Alternatively, I could use a shortest path algorithm on the pixel level, restricting movement to inside the maze’s bounding box to solve the path.
- I’m considering dilating the walls by around 3 pixels. If there are openings smaller than that, they could seal, disconnecting the interior walkway from the outside. The interior walkway might also get blocked, causing the path to be cut off. To test this, I’ll try using a morphological operation where I dilate the walls with a small kernel and check for passable regions. If needed, I could increase the dilation to 5 pixels to see how it affects the maze structure.
- One idea is to invert the walls and walkway
Analyzed
Python
wall_dil = cv2.dilate(wall.astype(np.uint8), kernel, iterations=5)
pass_dil = wall_dil==0
num_labels, labels = cv2.connectedComponents(pass_dil.astype(np.uint8))
print(num_labels)
2
- I’m thinking of analyzing the top border, specifically between 0 and around 10 pixel rows. I could look for positions where the wall (alpha not zero) is absent and detect where the top border has an opening — essentially where all the rows for a specific x position have alpha equal to zero (transparent). That should allow me to identify where the walkway might begin at the top border.
Analyzed
Python
# find opening columns on top border vertical line: for each x where all first maybe 10 rows walls present?
open_cols_top = []
for x in range(w):
    if np.all(alpha[0:10,x] == 0):
        open_cols_top.append(x)
open_cols_top[:20], len(open_cols_top)
([0, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 403], 16)
- I'm considering dilating the walls by 3 pixels, but if the openings are smaller, they might seal the walkway. This could block the path. I'll try using morphological closing to see if dilating the walls with a small kernel helps. After dilation, the walkway might still be connected, so I’ll need to check border openings and spot any gaps on the rectangle's border. I'll also inspect a few initial pixels to determine wall thickness, which seems to be about 4 pixels.
Analyzed
Python
1print(alpha[0,0:30])[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Analyzed
Python
1print(alpha[1,0:50])[ 0 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255 255]
- The walls are drawn with black lines, and walkway openings are the gaps in these lines. I want to check the top border for missing walls, so I’ll inspect the first 10 rows for each column where the "alpha" value is zero, indicating no wall. It looks like there are 16 such columns, mostly near the edges. Some areas (like positions 195-208) probably represent real openings, while others (like positions 0 and 403) may be due to missing borders outside the maze. Now, I'll check row 1 for walls.
- I’ll check row 1, where alpha equals 255, indicating pixels that are part of the walls. From this, I realize that the exterior walls are
Analyzed
Python
wall_rows = []
for y in range(h):
    if np.mean(alpha[y,:]>0) > 0.5:
        wall_rows.append(y)
# keep unique groups
rows=[]
prev=-10
for y in wall_rows:
    if y-prev>2: rows.append(y)
    prev=y
rows
[1, 17, 49, 97, 113, 145, 161, 177, 193, 209, 241, 257, 289, 305, 321, 337, 369, 401]
- I checked alpha at row 1, and 255 indicates the presence of a wall. The border walls are likely located at row 1, row h-2, and certain columns. The outside region and walkway are alpha==0, but not within the border lines. The walkway seems enclosed by walls drawn at intervals, every 4 pixels. I’ll check horizontal wall positions, looking for rows with mostly 255-alpha values. I noticed a pattern, with gaps roughly every 16 pixels, though it’s not consistent due to vertical walls. I plan to compute the differences in these gaps.
Analyzed
Python
diffs = [j-i for i,j in zip(rows[:-1],rows[1:])]
diffs
[16, 32, 48, 16, 32, 16, 16, 16, 16, 32, 16, 32, 16, 16, 16, 32, 32]
- I checked row 1 and observed that alpha 255 indicates a wall. The border walls may be at row 1 and row h-2, as well as certain columns. Outside the border area and the walkway both have alpha 0. It seems the walkway is enclosed by walls drawn at odd intervals. The grid cell width might be around 4 px. The horizontal wall positions appear roughly every 16 px, but the pattern isn't uniform due to vertical walls. I'll compute the differences for a clearer picture.
- Okay, there are still many openings along row 1, but I think clusters around columns 195–208 might actually correspond to the real entrance. This span is about 14 px wide. The openings at 0 and 403 could be corners, so they might not be true entrances. It seems the real entrance is in the 195–208 range. Now, I’ll look at the left border to check for vertical openings, focusing on specific y values.
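The row spacings printed above are all multiples of 16, which suggests one way to read them: if the maze is drawn on a regular grid (an assumption, not something the transcript states), the greatest common divisor of the gaps estimates the cell size. The sketch below uses only the `rows` values from the output above; the names `cell` and `cells_per_side` are introduced here for illustration.

```python
from functools import reduce
from math import gcd

# Indices of mostly-opaque rows, copied from the transcript output.
rows = [1, 17, 49, 97, 113, 145, 161, 177, 193, 209, 241, 257, 289, 305, 321, 337, 369, 401]
diffs = [b - a for a, b in zip(rows, rows[1:])]

# The common spacing shared by every gap is a candidate for the grid cell size.
cell = reduce(gcd, diffs)
cells_per_side = (rows[-1] - rows[0]) // cell
print(cell, cells_per_side)  # 16 25
```

Under that assumption the maze would be 25 cells tall with 16-pixel cells. Gaps of 32 or 48 pixels would then correspond to grid lines where fewer than half the pixels are wall, so they do not pass the 0.5 threshold used to build `rows`.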
Analyzed
Python
openings_top = [x for x in range(w) if alpha[1,x]==0]
len(openings_top), openings_top[:20]
(16, [0, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 403])
Analyzed
Python
openings_left = [y for y in range(h) if alpha[y,1]==0]
openings_left[:20], len(openings_left)
([0, 403], 2)
- I’m thinking of using morphological thinning to simplify the walkway into a one-pixel width, then running BFS to find a path. I’ll need to separate the walkway region from the outside. A flood fill from the border can help mark the outside, leaving the walkway region, but I need the walkway. A simpler plan is to target specific openings for entrances and exits. I can detect these gaps at row 1 in the border, where openings might appear as alpha[1,x]==0. Let’s check that approach.
- I plan to use morphological thinning to narrow the walkway down to a one-pixel width, then run BFS to find the best path. To separate the walkway region from the outside, I’ll flood-fill around the border where alpha==0 to mark the outside. Then, I can address gaps in the walkway with morphological closing. Instead of filling all gaps, I’ll focus on entrance/exit openings, which I can detect by finding regions where alpha[1,x] == 0. Let’s check row 1 for these openings.
- Looking at row 1, there are many potential openings, but the clusters between 195-208 (about 14 px wide) seem to be the real entrance. The row's openings might just be where there’s no wall. The left border also has vertical gaps, and I see that the top-left corner is open. The right and bottom borders also have openings, so I think the entrance and exit are at the top and bottom of the maze. With these connections, the walkway is accessible through these openings, so flood filling isn't viable.
- I’m trying to define the interior of the maze, focusing on the region between (2, h-3) for rows and (2, w-3) for columns. To set up the starting point, I’ll look for the first passable cell near the top opening, specifically looking between x=195 and x=208 in row 2. I’ll select the middle of that range, x=201, as my start point. For the end point, I’ll look for something similar near the bottom, around (h-3, 201).
Analyzed
Python
openings_right = [y for y in range(h) if alpha[y,w-2]==0]
openings_right[:20], len(openings_right)
([0, 403], 2)
Analyzed
Python
openings_bottom = [x for x in range(w) if alpha[h-2,x]==0]
openings_bottom[:20], len(openings_bottom)
([0, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 403], 16)
Analyzed
Python
start=(2,201)
end=(401,201)
print(passable[start], passable[end])
True True
Analyzed
Python
from collections import deque
import numpy as np

# Define mask of valid passable inside region
valid = passable.copy()
# exclude row0,row h-1, col0,col w-1
valid[0,:]=False
valid[h-1,:]=False
valid[:,0]=False
valid[:,w-1]=False

start_pt=start
end_pt=end

prev = -np.ones((h,w,2), dtype=int)
visited = np.zeros((h,w), dtype=bool)
q=deque()
q.append(start_pt)
visited[start_pt]=True
found=False
dirs=[(1,0),(-1,0),(0,1),(0,-1)]
while q:
    y,x=q.popleft()
    if (y,x)==end_pt:
        found=True
        break
    for dy,dx in dirs:
        ny,nx=y+dy,x+dx
        if 0<=ny<h and 0<=nx<w and valid[ny,nx] and not visited[ny,nx]:
            visited[ny,nx]=True
            prev[ny,nx]=[y,x]
            q.append((ny,nx))
print(found)
True
- I’ll start by selecting the starting and ending points as passable pixels near specific coordinates, using the average of the opening cluster for the start point. Then, I can compute the shortest path using BFS, restricting it to the interior region. After finding the path, I'll overlay it on the original image by drawing a red line. But to ensure the red line stands out, I’ll convert the image to RGBA and then display it for a clear view.
Analyzed
Python
# reconstruct path
path=[]
p=end_pt
while p!=(-1,-1):
    path.append(p)
    y,x=p
    prev_y, prev_x = prev[y,x]
    if prev_y==-1:
        break
    p=(prev_y, prev_x)
path = path[::-1]
len(path)
1144
Analyzed
Python
from PIL import ImageDraw
img_rgba = img.copy()
draw = ImageDraw.Draw(img_rgba)
for y,x in path:
    draw.point((x,y), fill=(255,0,0,255))
# Save
img_rgba.save('/mnt/data/maze_solved.png')
Analyzed image

Done
Matplotlib Chart

Here’s one possible solution path for your maze, traced in red.
Python
from PIL import Image, ImageDraw
import numpy as np
import cv2
import matplotlib.pyplot as plt
from collections import deque
# Load the original maze (walls are black, background is transparent)
maze_path = '/mnt/data/d19c14d3-e182-4dba-99dc-647e4f725536.png'
img = Image.open(maze_path)
alpha = np.array(img)[:, :, 3]
h, w = alpha.shape
passable = alpha == 0  # transparent pixels are walkable
# We determined the entrance & exit are roughly centered at the top & bottom openings
start = (2, 201)
end = (h - 3, 201)
# Breadth-first search on walkable pixels (inside the border)
valid = passable.copy()
valid[0, :] = valid[-1, :] = valid[:, 0] = valid[:, -1] = False
prev = -np.ones((h, w, 2), dtype=int)
visited = np.zeros((h, w), bool)
q = deque([start])
visited[start] = True
dirs = [(1, 0), (-1, 0), (0, 1), (0, -1)]
while q:
    y, x = q.popleft()
    if (y, x) == end:
        break
    for dy, dx in dirs:
        ny, nx = y + dy, x + dx
        if 0 <= ny < h and 0 <= nx < w and valid[ny, nx] and not visited[ny, nx]:
            visited[ny, nx] = True
            prev[ny, nx] = (y, x)
            q.append((ny, nx))
# Reconstruct the path
path = []
p = end
while (p != (-1, -1)):
    path.append(p)
    py, px = p
    p = tuple(prev[py, px])
path.reverse()
# Draw the solution in red
solved = img.copy()
draw = ImageDraw.Draw(solved)
for y, x in path:
    draw.point((x, y), fill=(255, 0, 0, 255))
# Show the solved maze
plt.figure(figsize=(6, 6))
plt.imshow(solved)
plt.axis('off')
plt.show()
# Save for download if the user wants it later
solved_path = '/mnt/data/maze_solved.png'
solved.save(solved_path)
print(f"Solved maze saved to {solved_path}")
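The script above depends on an image file that is not available outside the original session. As a way to check the same breadth-first search idea without that file, here is a minimal standard-library sketch on a tiny text maze. The function `bfs_path` and the 5x5 grid are hypothetical; unlike the transcript's code, it stores predecessors in a dict and returns `None` when the exit cannot be reached.

```python
from collections import deque


def bfs_path(grid, start, end):
    """Return the shortest 4-connected path of (row, col) cells, or None if unreachable."""
    h, w = len(grid), len(grid[0])
    prev = {start: None}          # maps each visited cell to its predecessor
    q = deque([start])
    while q:
        cell = q.popleft()
        if cell == end:
            # Walk the predecessor links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        y, x = cell
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == "." and (ny, nx) not in prev:
                prev[(ny, nx)] = cell
                q.append((ny, nx))
    return None


# Hypothetical 5x5 maze: '#' is a wall, '.' is walkable; entrance on top, exit on bottom.
toy = [
    "#.###",
    "#...#",
    "###.#",
    "#...#",
    "#.###",
]
path = bfs_path(toy, (0, 1), (4, 1))
print(len(path))  # 9
```

Because BFS explores cells in order of distance from the start, the first time it reaches the exit it has found a shortest path, which is why the transcript's 1144-pixel route is a shortest pixel path under 4-connectivity.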
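Benchmark performance
To highlight how visual reasoning improves on our previous multimodal models, we tested OpenAI o3 and o4-mini on a diverse set of human exams and machine learning benchmarks. The two new visual reasoning models significantly outperform their predecessors on every multimodal task we tested.
All models were evaluated at a high "reasoning effort" setting, similar to variants such as o4-mini-high in ChatGPT.
In particular, thinking with images without relying on browsing leads to significant gains across all perception benchmarks we evaluated. Our models set new state-of-the-art results in STEM question answering (MMMU, MathVista), chart reading and reasoning (CharXiv), perception primitives (VLMs are Blind), and visual search (V*). On V*, our visual reasoning approach reaches 95.7% accuracy, largely solving the benchmark.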
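Limitations and what's next
Thinking with images currently has the following limitations:
- Excessively long reasoning chains: The model may perform redundant or unnecessary tool calls and image manipulation steps, resulting in overly long chains of thought.
- Perception errors: The model can still make basic perception mistakes. Even when tool calls correctly advance the reasoning process, a visual misreading may still lead to an incorrect final answer.
- Reliability: Across multiple attempts at the same problem, the model may try different visual reasoning processes, some of which can lead to incorrect results.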
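OpenAI o3 and o4-mini significantly advance the state of the art in visual reasoning and represent an important step toward broader multimodal reasoning. These models deliver best-in-class performance on visual perception tasks and can solve problems that were previously out of reach.
We will keep refining the models' ability to reason with images so that it becomes more concise, less redundant, and more reliable. We look forward to continuing our research in multimodal reasoning and to helping people explore how these improvements can make their everyday work more productive.
Update on April 16: Results for o3 on Charxiv-r, Mathvista, and vlmsareblind were updated to reflect a system prompt change that was not present in the original evaluation.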
Author
Contributors
Aditya Ramesh, Aidan Clark, Aleksandra Spyra, Alex Tachard Passos, Alexander Kirillov, Ali Kalami, Amy McDonald Sandjideh, Andrei Gheorghe, Andrew Gibiansky, Andrew Tulloch, Angela Baek, Anubha Srivastava, Avital Oliver, Behrooz Ghorbani, Ben Leimberger, Borys Minaiev, Bowen Cheng, Brandon McKinzie, Carpus Chang, Cary Hudson, Casey Chu, Charlotte Cole, Chen Shen, Dan Roberts, Dana Palmie, Daniel Kappler, David Medina, Edmund Wong, Eric Mitchell, Eric Ning, Freddie Sulit, Haiming Bao, Haitang Hu, Hongyu Ren, Hyeonwoo Noh, Jakub Pachocki, James Betker, James Qin, Jamie Kiros, Jason Ai, Jerry Tworek, Jessica Liang, Ji Lin, Jiahui Yu, Jianfeng Wang, Joseph Mo, Kenji Hata, Kevin King, Kristian Georgiev, Kshitij Gupta, Lauren Yang, Li Jing, Lin Yang, Linden Li, Mark Chen, Martin Li, Max Schwarzer, Mia Glaese, Michael Malek, Minnia Feng, Nacho Soto, Nat McAleese, Niko Felix, Peter Faiman, Prafulla Dhariwal, Rajkumar Samuel, Rapha Gontijo Lopes, Ravi Teja Mullapudi, Reiichiro Nakano, Rennie Song, Ricky Xu, Sam Altman, Sean Fitzgerald, Shengjia Zhao, Shengli Hu, Shuchao Bi, Spencer Papay, Szi-chieh Yu, Wenda Zhou, Yang Lu, Yara Khakbaz, Yunxing Dai, Zhishuai Zhang