January 29, 2026

Inside OpenAI’s in-house data agent

By Bonnie Xu, Aravind Suresh, and Emma Tang


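Data powers how systems learn, how products evolve, and how companies make choices. But getting answers quickly, accurately, and with the right context is often harder than it looks. To simplify this work as OpenAI scales, we built a bespoke in-house AI data agent that explores and reasons over OpenAI’s own platform.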

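Our agent is a custom internal-only tool (not an external product), built around OpenAI’s data, permissions, and workflows. We’re showing how OpenAI built and uses it to illustrate the practical ways AI can support teams in their day-to-day work. The OpenAI tools we used to build and run the agent (Codex, our GPT‑5 flagship model, the Evals API, and the Embeddings API) are the same tools we make available to developers everywhere.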

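Our data agent lets employees go from question to insight in minutes, not days. This lowers the bar for pulling data and doing nuanced analysis across every function, not just our data team. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI rely on the agent to answer high-value data questions. For example, it can answer questions about how to evaluate launches and understand business health, all through the intuitive format of natural language. The agent combines Codex-powered table-level knowledge with product and organizational context. Its continuously learning memory system means it keeps improving over time.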

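A screenshot shows a user asking, on October 6, 2025, for ChatGPT weekly active users (WAU) and a comparison with DevDay 2023. The agent reports about 800 million WAU in 2025 and about 100 million in 2023, notes a change of more than 700 million (roughly 8x growth), and follows with explanatory context.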

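In this post, we’ll break down why we needed a bespoke AI data agent, why its code-enriched data context and self-learning matter so much, and the lessons we learned along the way.

Why we needed a custom tool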

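OpenAI’s data platform serves more than 3,500 internal users across Engineering, Product, and Research, spanning over 600 petabytes of data across 70,000 datasets. At that scale, simply finding the right table can be one of the most time-consuming parts of an analysis.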

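As one internal user explained:

“We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they’re different and which one to use. Some include logged-out users, some don’t. Some have overlapping fields, and it’s hard to tell what is what.”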

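Even with the correct tables selected, producing correct results is not easy. Analysts must reason about table data and table relationships to ensure that transformations and filters are applied correctly. Common failure modes, such as many-to-many joins, filter pushdown errors, and unhandled nulls, can silently invalidate results. At OpenAI’s scale, analysts shouldn’t sink their time into debugging SQL semantics or query performance: their focus should be on defining metrics, validating assumptions, and making data-driven decisions.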
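To make two of these failure modes concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The `orders` and `payments` tables and their values are invented for illustration and are not drawn from OpenAI’s warehouse. It shows how a many-to-many join can inflate a sum, and how NULL handling changes an average, without any error being raised.

```python
import sqlite3

# Hypothetical tables, made up for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (order_id INTEGER, user_id INTEGER, amount REAL);
    CREATE TABLE payments (user_id INTEGER, method TEXT);
    INSERT INTO orders   VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 11, NULL);
    INSERT INTO payments VALUES (10, 'card'), (10, 'wallet'), (11, 'card');
""")

# Baseline: SUM skips the NULL amount.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining on user_id is many-to-many for user 10 (2 orders x 2 payment rows),
# so each of that user's orders is counted twice and revenue is inflated.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN payments p ON o.user_id = p.user_id
""").fetchone()[0]

# AVG ignores NULLs; treating NULL as 0 yields a different answer.
avg_ignoring_nulls = conn.execute("SELECT AVG(amount) FROM orders").fetchone()[0]
avg_nulls_as_zero = conn.execute(
    "SELECT AVG(COALESCE(amount, 0)) FROM orders"
).fetchone()[0]

print(true_total, joined_total, avg_ignoring_nulls, avg_nulls_as_zero)
```

Every query here runs cleanly, which is the point: nothing signals that the joined total is double the true one.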

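A screenshot of SQL code defining two CTEs, order_enriched and monthly_segment, which join customer geography data, derive an order-month field, and compute monthly aggregates such as order count, total revenue, revenue including tax, and average days from shipment to receipt. Caption: This SQL statement is more than 180 lines long. It’s not easy to confirm that we’re joining the right tables and querying the right columns.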

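How it works

Let’s walk through OpenAI’s agent, how it curates context, and how it keeps improving itself.

Our agent is powered by GPT‑5.2 and is designed to reason over OpenAI’s data platform. It’s available wherever employees already work: as a Slack agent, through a web interface, inside IDEs, in the Codex CLI via MCP, and directly in OpenAI’s internal ChatGPT app through an MCP connector.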

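A diagram titled “How the data agent works.” Entry points (Agent-UI, Local Agent-MCP, Remote Agent-MCP, and Slack Agent) all feed into the Agent-API. The API connects to internal data knowledge and company context, syncs with the data warehouse and platform sources, and exchanges requests with the GPT-5.2 model through Agent-MCP.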

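Users can ask complex, open-ended questions that would typically require multiple rounds of manual exploration. Take this example prompt, which uses a test dataset: “For NYC taxi trips, which pickup-to-dropoff ZIP pairs are the most unreliable, with the largest gap between typical and worst-case travel times, and when does that variability occur?”

The agent handles the analysis end-to-end, from understanding the question to exploring the data, running queries, and synthesizing findings.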

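A screenshot shows a user asking which NYC taxi pickup→dropoff ZIP pairs are the most “unreliable.” The agent bases its explanation on a sample of about 21,000 trips from samples.nyctaxi.trips, defines typical (p50) versus worst-case (p95), applies filters, and explains how it identifies the longest trip times for each ZIP pair. Caption: The agent’s response to the question.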
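The typical-versus-worst-case comparison in this example can be sketched in a few lines of Python. This is an illustration only, not the agent’s actual query: the trip records are made up, the nearest-rank percentile method and the `min_trips` cutoff are assumptions, and the “when does the variability occur” part of the question (for instance, grouping by hour) is left out.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def reliability_gaps(trips, min_trips=5):
    """trips: iterable of (pickup_zip, dropoff_zip, minutes).

    Returns (pair, p50, p95, gap) tuples sorted by gap, largest first.
    """
    by_pair = defaultdict(list)
    for pickup, dropoff, minutes in trips:
        by_pair[(pickup, dropoff)].append(minutes)
    gaps = []
    for pair, durations in by_pair.items():
        if len(durations) < min_trips:  # skip pairs with too little data
            continue
        p50 = percentile(durations, 50)
        p95 = percentile(durations, 95)
        gaps.append((pair, p50, p95, p95 - p50))
    return sorted(gaps, key=lambda row: row[3], reverse=True)

# Made-up sample: one pair has a single very slow trip, the other is steady.
trips = (
    [("10001", "10002", m) for m in (10, 11, 12, 12, 13, 45)]
    + [("10001", "10003", m) for m in (20, 20, 21, 21, 22, 23)]
)
ranked = reliability_gaps(trips)
print(ranked[0])
```

With such small samples the p95 is effectively the maximum; on a real dataset the minimum-trip filter would need to be much higher for the tail estimate to mean anything.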

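One of the agent’s core strengths is how it reasons through problems. Rather than following a fixed script, the agent evaluates its own progress. If an intermediate result looks wrong (for example, zero rows caused by an incorrect join or filter), the agent investigates what went wrong, adjusts its approach, and tries again. Throughout this process, it retains full context and carries learnings forward between steps. This closed-loop, self-learning process shifts iteration from the user to the agent itself, producing results faster and delivering analyses of consistently higher quality than manual workflows.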
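The check-and-retry behavior can be illustrated with a deliberately simplified stand-in. In the sketch below, the revised query comes from a fixed list, whereas in the real agent it would come from the model’s own reasoning about what went wrong. The `events` table, its values, and the `run_with_retry` helper are hypothetical.

```python
import sqlite3

def run_with_retry(conn, candidate_queries):
    """Run candidates in order, treating a zero-row result as a signal to revise and retry."""
    attempts = []
    for sql in candidate_queries:
        rows = conn.execute(sql).fetchall()
        attempts.append((sql, len(rows)))
        if rows:  # accept the first non-empty result
            return rows, attempts
    return [], attempts

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, plan TEXT);
    INSERT INTO events VALUES (1, 'plus'), (2, 'plus'), (3, 'free');
""")

rows, attempts = run_with_retry(conn, [
    # First attempt filters on a value that never appears in the column
    # (SQLite's = comparison on text is case-sensitive by default).
    "SELECT user_id FROM events WHERE plan = 'Plus' ORDER BY user_id",
    # Revised attempt normalizes case before comparing.
    "SELECT user_id FROM events WHERE LOWER(plan) = 'plus' ORDER BY user_id",
])
print(rows, [count for _, count in attempts])
```

Keeping the list of attempts mirrors the idea of carrying context between steps: the record of what failed is available when deciding what to try next.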

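A screenshot of a task workflow showing the AI agent’s plan for analyzing the distribution of NYC taxi trip durations. It includes the goal, internal searches, schema inspection, code snippets, and reasoning about computing the p50/p95 gap, identifying unreliable ZIP pairs, and planning the SQL query. Caption: The agent’s reasoning to identify the most unreliable NYC taxi pickup-dropoff pairs.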

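The agent now covers the full analytics workflow: discovering data, running SQL, and publishing notebooks and reports. It understands internal company knowledge, can search the web for external information, and keeps improving as it accumulates memory.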

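Context is everything

High-quality answers depend on rich, accurate context. Without context, even strong models can produce wrong results, such as badly misestimating user counts or misinterpreting internal terminology.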

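Screenshot: a user asks “What was ChatGPT Image Gen logged-in DAU for the last 30 days?” The status line below shows the agent has been “Working for 22m 41s,” indicating that it is still processing a long-running query.

The agent without memory, unable to query effectively.

Screenshot: a user asks “What was ChatGPT Image Gen logged-in DAU for the last 30 days?” The status line below the message reads “Worked for 1m 22s,” showing that this run completed in about a minute and a half.

The agent’s memory enables faster queries by locating the correct tables.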

To avoid these failure modes, we built the agent around multiple layers of context that ground it in OpenAI’s data and institutional knowledge.

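A diagram titled “Layers of context in the data agent” shows six stacked layers: 1) Table usage, 2) Human annotations, 3) Codex enrichment, 4) Institutional knowledge, 5) Memory, and 6) Runtime context. Each layer is drawn as a bar in a pyramid shape.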

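Layer #1: Table Usage

Metadata grounding: The agent relies on schema metadata (column names and data types) to inform SQL writing, and uses table lineage (e.g., upstream and downstream table relationships) to understand how different tables relate.
Query inference: Ingesting historical queries helps the agent understand how to write its own queries and which tables are typically joined together.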
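As a rough illustration of query inference, the sketch below counts how often pairs of tables appear together in a query log. The regular expression is a crude assumption that only picks up names following FROM or JOIN; it would miss CTEs, subqueries, and quoted identifiers, where a real SQL parser would be needed. The query log itself is invented.

```python
import re
from collections import Counter
from itertools import combinations

# Crude pattern: grabs the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def co_queried_tables(query_log):
    """Count how often each pair of tables is referenced in the same query."""
    pair_counts = Counter()
    for sql in query_log:
        tables = sorted({name.lower() for name in TABLE_REF.findall(sql)})
        pair_counts.update(combinations(tables, 2))
    return pair_counts

# Invented query history.
query_log = [
    "SELECT * FROM users u JOIN sessions s ON u.id = s.user_id",
    "SELECT COUNT(*) FROM sessions s JOIN users u ON s.user_id = u.id",
    "SELECT * FROM users u JOIN billing b ON u.id = b.user_id",
]
pairs = co_queried_tables(query_log)
print(pairs.most_common(1))
```

Sorting the table names before pairing means `users JOIN sessions` and `sessions JOIN users` count toward the same pair.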

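Layer #2: Human Annotations

Curated descriptions: Descriptions of tables and columns, curated by domain experts, that capture intent, semantics, business meaning, and known caveats that are not easily inferred from schemas or past queries.

Metadata alone isn’t enough. To really tell tables apart, you need to understand how they were created and where they come from.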

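Layer #3: Codex Enrichment

By deriving a code-level definition of a table, the agent builds a deeper understanding of what the data actually contains.
- This adds nuance about what is stored in the table and how it is derived from analytics events. For example, it can give context on the uniqueness of values, how often the table data is updated, and the scope of the data (e.g., if the table excludes certain fields, that indicates its level of granularity).
- It provides richer usage context by showing how the table is used beyond SQL, in Spark, Python, and other data systems.
This means the agent can distinguish between tables that look similar but differ in critical ways. For example, it can tell whether a table only includes first-party ChatGPT traffic. This context is also refreshed automatically, so it stays up to date without manual maintenance.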

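A diagram titled “Codex enrichment knowledge pipeline.” Popular tables are fed into multiple Codex tasks that extract details from the OpenAI codebase, including each table’s purpose, grain and primary keys, downstream usage patterns, alternative table options, and data freshness.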

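Layer #4: Institutional Knowledge

The agent can access Slack, Google Docs, and Notion, which capture critical company context such as launches, reliability incidents, internal codenames and tools, and the canonical definitions and computation logic for key metrics.
These documents are ingested, embedded, and stored with metadata and permissions. A retrieval service handles access control and caching at runtime, enabling the agent to pull in this information efficiently and safely.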
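A minimal sketch of permission-aware retrieval, under stated assumptions: the `Doc` record, the group-based access model, and the word-overlap scoring are all hypothetical stand-ins (the real service uses embeddings, and caching is omitted here). The point it illustrates is ordering: the access check runs before ranking, so documents a user cannot read never become candidates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical access model: group names

def retrieve(docs, query, user_groups, limit=3):
    """Return IDs of docs the user may read, ranked by words shared with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in docs:
        if not (doc.allowed_groups & user_groups):  # access check before ranking
            continue
        overlap = len(query_words & set(doc.text.lower().split()))
        if overlap:
            scored.append((overlap, doc.doc_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:limit]]

# Invented documents for illustration.
docs = [
    Doc("launch-notes", "connector launch logging change", frozenset({"eng"})),
    Doc("finance-memo", "connector revenue forecast", frozenset({"finance"})),
    Doc("metric-defs", "weekly active users definition", frozenset({"eng", "finance"})),
]

print(retrieve(docs, "connector logging", frozenset({"eng"})))
print(retrieve(docs, "connector logging", frozenset({"finance"})))
```

The same question returns different documents for different users, which is the behavior a pass-through permission model implies.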

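A screenshot of a user asking why connector usage appeared to decline in December. The agent explains that the drop stems from a logging issue on November 13, 2025, which caused usage after the ChatGPT 5.1 launch to be undercounted. Legacy telemetry stayed blank until newer events became the trusted source.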

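Layer #5: Memory

When the agent receives corrections or discovers nuances about certain data questions, it can save these learnings for next time, so it keeps improving as it interacts with users.
- As a result, future answers start from a more accurate baseline rather than running into the same issues again.
- The goal of memory is to retain and reuse non-obvious corrections, filters, and constraints that are critical for data correctness but hard to infer from the other layers alone.
- For example, in one case the agent didn’t know how to filter for a particular analytics experiment (it needed to match a specific string defined in an experiment gate). Memory was crucial here: it ensured the agent filtered correctly instead of guessing at string matches.
When you give the agent a correction, or when it finds a learning in your conversation, it prompts you to save that memory for next time.
- Users can also create and edit memories manually.
- Memories are scoped at the global and personal level, and the agent’s tooling makes them easy to edit.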
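The global and personal scopes can be pictured with a small in-memory store. This is a sketch under assumptions, not the agent’s actual design: the `MemoryStore` class, its method names, and the keying by topic are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    global_notes: dict = field(default_factory=dict)    # topic -> note, shared by everyone
    personal_notes: dict = field(default_factory=dict)  # (user, topic) -> note

    def save(self, topic, note, user=None):
        """Save a note globally, or personally when a user is given."""
        if user is None:
            self.global_notes[topic] = note
        else:
            self.personal_notes[(user, topic)] = note

    def recall(self, topic, user):
        """Return notes for a topic: the global note first, then the user's own."""
        notes = []
        if topic in self.global_notes:
            notes.append(self.global_notes[topic])
        if (user, topic) in self.personal_notes:
            notes.append(self.personal_notes[(user, topic)])
        return notes

store = MemoryStore()
store.save("experiment_filter", "Match the exact gate string; do not fuzzy-match.")
store.save("experiment_filter", "I usually want the last 14 days.", user="alice")

print(store.recall("experiment_filter", user="alice"))
print(store.recall("experiment_filter", user="bob"))
```

Saving under the same key overwrites the earlier note, which doubles as a simple form of editing.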

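A notification banner reads “The data agent wants to save 2 learnings to its memory,” with one item labeled “ChatGPT top-level metrics.” A confirmation message on the right reads “Saved to global memory” and includes a green checkmark.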

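Layer #6: Runtime Context

When no prior context exists for a table, or when the existing information is stale, the agent can issue live queries to the data warehouse to inspect and query the table directly. This lets it validate schemas, understand the data in real time, and respond accordingly.
The agent can also talk to other data platform systems (metadata service, Airflow, Spark) as needed to get broader data context that lives outside the warehouse.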
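As an illustration of live inspection, the sketch below reads a table’s columns and row count directly from the database. It uses SQLite’s `PRAGMA table_info` purely as a stand-in, since the article does not say which warehouse or introspection commands the agent uses; the `inspect_table` helper and the `trips` table are hypothetical.

```python
import sqlite3

def inspect_table(conn, table):
    """Return column names and types plus a row count by querying the database live."""
    if not table.isidentifier():  # the name is interpolated into SQL, so restrict it
        raise ValueError(f"unsafe table name: {table!r}")
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    if not columns:
        raise ValueError(f"table not found: {table}")
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"columns": columns, "row_count": row_count}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_zip TEXT, minutes REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [("10001", 12.5), ("10002", 8.0)])

info = inspect_table(conn, "trips")
print(info)
```

On a warehouse with tables of real size, a full `COUNT(*)` may be expensive; sampling or catalog statistics would be the more likely choice there.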

We run a daily offline pipeline that aggregates table usage, human annotations, and Codex-derived enrichment into a single, normalized representation. This enriched context is then converted into embeddings using the OpenAI embeddings API and stored for retrieval. At query time, the agent pulls only the most relevant embedded context via retrieval-augmented generation (RAG) instead of scanning raw metadata or logs. This makes table understanding fast and scalable, even across tens of thousands of tables, while keeping runtime latency predictable and low. Runtime queries are issued to our data warehouse live as needed.
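The split between an offline embedding step and query-time retrieval can be sketched as follows. To keep the example self-contained and runnable offline, `toy_embed` is a word-count stand-in for a real embedding model (the pipeline described above uses the OpenAI embeddings API instead), and the table names and descriptions are invented.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(table_context):
    """Offline step: embed each table's enriched description once."""
    return {name: toy_embed(desc) for name, desc in table_context.items()}

def top_k(index, question, k=2):
    """Query-time step: rank stored vectors by similarity to the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda name: cosine(index[name], q), reverse=True)
    return ranked[:k]

# Invented table descriptions.
table_context = {
    "daily_active_users": "daily active users per product excludes deleted accounts",
    "billing_invoices": "monthly invoices and payment status per customer",
    "image_gen_events": "image generation events per logged-in user",
}
index = build_index(table_context)
print(top_k(index, "daily active users for image generation"))
```

Because the index is built ahead of time, the query-time cost is one embedding of the question plus a similarity ranking, which is what keeps latency predictable as the number of tables grows.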

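A diagram titled “Context retrieval in the data agent.” The offline preprocessing layer includes table usage, human annotations, Codex enrichment, institutional knowledge, and memory, all of which feed into RAG embeddings. Live retrieval shows the agent querying the database through semantic search or exact text search to produce runtime context.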

Together, these layers ensure the agent’s reasoning is grounded in OpenAI’s data, code, and institutional knowledge, dramatically reducing errors and improving answer quality.

Built to think and work like a teammate

One-shot answers work when the problem is clear, but most questions aren’t. More often, arriving at the correct result requires back-and-forth refinement and some course correction.

The agent is built to behave like a teammate you can reason with. It’s conversational and always-on, and it handles both quick answers and iterative exploration.

It carries over complete context across turns, so users can ask follow-up questions, adjust their intent, or change direction without restating everything. If the agent starts heading down the wrong path, users can interrupt mid-analysis and redirect it, just like working with a human collaborator who listens instead of plowing ahead.

When instructions are unclear or incomplete, the agent proactively asks clarifying questions. If no response is provided, it applies sensible defaults to make progress. For example, if a user asks about business growth with no date range specified, it may assume the last seven or 30 days. These priors allow it to stay responsive and non-blocking while still converging on the right outcome.
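One way to picture a non-blocking default is a small resolver that fills in a missing date range and records that it did so, so the assumption can be surfaced in the answer. The `resolve_date_range` function and the 30-day constant are assumptions for illustration; the text above only says the agent may assume the last seven or 30 days.

```python
from datetime import date, timedelta

DEFAULT_LOOKBACK_DAYS = 30  # assumed default; the text mentions seven or 30 days

def resolve_date_range(start=None, end=None, today=None):
    """Fill in a missing date range and report whether a default was assumed."""
    today = today or date.today()
    end = end or today
    assumed = start is None
    if assumed:
        start = end - timedelta(days=DEFAULT_LOOKBACK_DAYS)
    return start, end, assumed

# No range given: fall back to the default window and flag the assumption.
print(resolve_date_range(today=date(2026, 1, 29)))
# Explicit start: nothing is assumed.
print(resolve_date_range(start=date(2026, 1, 1), today=date(2026, 1, 29)))
```

Returning the `assumed` flag alongside the dates lets the caller state the default explicitly rather than applying it silently.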

The result is an agent that works well when you know exactly what you want (e.g., “Tell me about this table”) and is just as strong when you’re exploring (e.g., “I’m seeing a dip here, can we break this down by customer type and timeframe?”).

After rollout, we observed that users frequently ran the same analyses for routine repetitive work. To expedite this, the agent's workflows package recurring analyses into reusable instruction sets. Examples include workflows for weekly business reports and table validations. By encoding context and best practices once, workflows streamline repeat analyses and ensure consistent results across users.

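A user interface (UI) input bar with the placeholder text “Ask a data question.” Below the input field is a button labeled “Use workflow,” with microphone and send icons on the right. The bar has rounded corners and sits on a dark background.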

Moving fast without breaking trust

Building an always-on, evolving agent means quality can drift just as easily as it can improve. Without a tight feedback loop, regressions are inevitable and invisible. The only way to scale capability without breaking trust is through systematic evaluation.

In this section, we’ll discuss how we leverage OpenAI’s Evals API to measure and protect the agent’s response quality.

The agent’s evals are built on curated sets of question-answer pairs. Each question targets an important metric or analytical pattern we care deeply about getting right, paired with a manually authored “golden” SQL query that produces the expected result. For each eval, we send the natural language question to the agent’s query-generation endpoint, execute the generated SQL, and compare the output against the result of the expected SQL.

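A diagram titled “Data agent evaluation pipeline.” Question-answer evals, together with the expected SQL, feed into a generation step that produces SQL and results. OpenAI Evals compares the generated results with the expected results using dataframe and SQL comparison, and outputs a score and reasoning.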

Evaluation doesn’t rely on naive string matching. Generated SQL can differ syntactically while still being correct, and result sets may include extra columns that don’t materially affect the answer. To account for this, we compare both the SQL and the resulting data, and feed these signals into OpenAI’s Evals grader. The grader produces a final score along with an explanation, capturing both correctness and acceptable variation.
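The data-comparison half of this check can be sketched as below. It covers only result-set comparison that tolerates extra columns and row-order differences; the SQL comparison and the Evals grader are not modeled. The `results_match` helper, the `signups` table, and the queries are hypothetical.

```python
import sqlite3

def run_query(conn, sql):
    """Execute SQL and return (column names, rows)."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    return columns, cursor.fetchall()

def results_match(conn, generated_sql, golden_sql):
    """True if the generated result has every golden column with identical values.

    Extra columns in the generated result and differences in row order are tolerated.
    """
    gen_cols, gen_rows = run_query(conn, generated_sql)
    gold_cols, gold_rows = run_query(conn, golden_sql)
    if not set(gold_cols) <= set(gen_cols):
        return False
    positions = [gen_cols.index(col) for col in gold_cols]
    projected = [tuple(row[i] for i in positions) for row in gen_rows]
    # Sort by repr so rows containing NULLs (None) can still be ordered.
    return sorted(projected, key=repr) == sorted(gold_rows, key=repr)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (day TEXT, country TEXT, n INTEGER);
    INSERT INTO signups VALUES
        ('2026-01-01', 'US', 5), ('2026-01-01', 'DE', 3), ('2026-01-02', 'US', 4);
""")

golden = "SELECT day, SUM(n) AS total FROM signups GROUP BY day"
# Different syntax, an extra column, and a different row order, but the same answer.
generated_ok = (
    "SELECT day, COUNT(*) AS row_count, SUM(n) AS total "
    "FROM signups GROUP BY day ORDER BY day DESC"
)
# Same shape, wrong aggregate.
generated_bad = "SELECT day, MAX(n) AS total FROM signups GROUP BY day"

print(results_match(conn, generated_ok, golden), results_match(conn, generated_bad, golden))
```

Matching columns by name is itself an assumption: a generated query that aliases the metric differently would fail this check even if its numbers were right, which is one reason a grader with more context is useful on top of a mechanical comparison.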

These evals work like unit tests that run continuously during development to catch regressions, and like canaries in production. This allows us to catch issues early and iterate confidently as the agent's capabilities expand.

Agent security

Our agent plugs directly into OpenAI’s existing security and access-control model. It operates purely as an interface layer, inheriting and enforcing the same permissions and guardrails that govern OpenAI’s data.

All of the agent’s access is strictly pass-through, meaning users can only query tables they already have permission to access. When access is missing, it flags this or falls back to alternative datasets the user is authorized to use.
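A pass-through check of this kind might look like the sketch below. The `authorize_tables` helper, the grant set, and the mapping of restricted tables to permitted alternatives are all hypothetical; in practice the decision would be delegated to the existing access-control system rather than reimplemented.

```python
class AccessDenied(Exception):
    """Raised when a table is not accessible and no permitted alternative exists."""

def authorize_tables(requested_tables, user_grants, alternatives=None):
    """Resolve requested tables against the user's own grants.

    Returns (tables to query, requested tables that were swapped for an alternative).
    """
    alternatives = alternatives or {}
    resolved, flagged = [], []
    for table in requested_tables:
        if table in user_grants:
            resolved.append(table)
        elif alternatives.get(table) in user_grants:
            resolved.append(alternatives[table])  # fall back to a permitted dataset
            flagged.append(table)                 # and tell the user about the swap
        else:
            raise AccessDenied(f"no access to {table} and no permitted alternative")
    return resolved, flagged

# Invented grants and fallback mapping.
grants = {"users", "events_aggregated"}
fallbacks = {"raw_events": "events_aggregated"}

print(authorize_tables(["raw_events", "users"], grants, fallbacks))
```

The agent never holds broader permissions than the user in this model: every table in the resolved list comes from the user’s own grants.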

Finally, it's built for transparency. Like any system, it can make mistakes. It exposes its reasoning process by summarizing assumptions and execution steps alongside each answer. When queries are executed, it links directly to the underlying results, allowing users to inspect raw data and verify every step of the analysis.

Lessons learned

Building our agent from scratch surfaced practical lessons about how agents behave, where they struggle, and what actually makes them reliable at scale.

Lesson #1: Less is More

Early on, we exposed our full tool set to the agent and quickly ran into problems with overlapping functionality. This redundancy can be helpful in specific custom cases, and a human invoking the tools manually can easily tell them apart, but it confuses agents. To reduce ambiguity and improve reliability, we restricted and consolidated certain tool calls.

Lesson #2: Guide the Goal, Not the Path

We also discovered that highly prescriptive prompting degraded results. While many questions share a general analytical shape, the details vary enough that rigid instructions often pushed the agent down incorrect paths. By shifting to higher-level guidance and relying on GPT‑5’s reasoning to choose the appropriate execution path, the agent became more robust and produced better results.

Lesson #3: Meaning Lives in Code

Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata. By crawling the codebase with Codex, our agent understands how datasets are actually constructed and is able to better reason about what each table actually contains. It can answer “what’s in here” and “when can I use it” far more accurately than from warehouse signals alone.

Same vision, new tools

We’re constantly working to improve our agent by increasing its ability to handle ambiguous questions, improving its reliability and accuracy with stronger validations, and integrating it more deeply into workflows. We believe it should blend naturally into how people already work, instead of functioning like a separate tool.

While our tooling will keep benefiting from underlying improvements in agent reasoning, validation, and self-correction, our team’s mission remains the same: seamlessly deliver fast, trustworthy data analysis across OpenAI’s data ecosystem.