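Data drives how systems learn, how products evolve, and how companies make decisions. Yet getting answers quickly and accurately, with the right context, is often harder than expected. To make working with data easier as OpenAI scales, we built our own bespoke in-house AI data agent that explores and analyzes data on our own platform.
Our agent is a custom, internal-only tool (not an external offering), built specifically around OpenAI’s data, permissions, and workflows. We are showing how we built and use it, with concrete examples, to illustrate how AI can support teams in their day-to-day work in real, impactful ways. The OpenAI tools we used to build and run the agent (Codex, our GPT‑5 flagship model, the Evals API, and the Embeddings API) are exactly the same tools we make available to developers everywhere.
Our data agent lets employees go from question to insight in minutes rather than days. This lowers the barrier to pulling data and doing nuanced analysis for every function, not just the data team. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI rely on the agent to answer high-impact data questions. For example, it can help evaluate product launches and assess business health, all through intuitive natural-language conversation. The agent combines Codex-powered, table-level knowledge with product and organizational context. Its continuously learning memory system means its performance also improves with every interaction.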

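In this post, we’ll break down why we needed a bespoke AI data agent, what makes its code-enriched data context and self-learning so useful, and the lessons we learned along the way.
OpenAI’s data platform serves more than 3,500 internal users across Engineering, Product, and Research, spanning 70,000 datasets and more than 600 petabytes of data. At that scale, simply finding the right table can be one of the most time-consuming parts of an analysis.
As one internal user put it:
“We have a lot of tables that are very similar, and I spend a lot of time trying to figure out how they differ and which one to use. Some include logged-out users, some don’t. Some have overlapping fields, and it’s hard to tell them apart.”
Even with the correct tables selected, producing accurate results can be challenging. Analysts must reason about table data and table relationships to ensure transformations and filters are applied correctly. Common failure modes, such as many-to-many joins, filter pushdown errors, and unhandled nulls, can silently invalidate results. At OpenAI’s scale, analysts should not have to spend time debugging SQL semantics or query performance; their focus should be on defining metrics, validating assumptions, and making data-driven decisions.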
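To make the many-to-many failure mode concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The tables `orders` and `user_plans` and their contents are hypothetical, invented for illustration; they are not OpenAI tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    CREATE TABLE user_plans (user_id INTEGER, plan TEXT);
    INSERT INTO orders VALUES (1, 10.0), (1, 20.0), (2, 5.0);
    -- user 1 appears twice (a plan change), so user_id is not unique here
    INSERT INTO user_plans VALUES (1, 'free'), (1, 'plus'), (2, 'free');
""")

# Ground truth: total order amount without any join.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Joining on a non-unique key duplicates every order row for user 1,
# so the sum is inflated without any error being raised.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o JOIN user_plans p ON o.user_id = p.user_id
""").fetchone()[0]

# One possible guard: count keys that occur more than once before joining.
duplicate_keys = conn.execute("""
    SELECT COUNT(*) FROM (
        SELECT user_id FROM user_plans GROUP BY user_id HAVING COUNT(*) > 1
    )
""").fetchone()[0]

print(true_total, joined_total, duplicate_keys)
```

The query runs without complaint and returns a plausible-looking number, which is why this class of error is easy to miss in a long statement.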

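This SQL statement is more than 180 lines long. It is not easy to tell whether we are joining the right tables and querying the right columns.
Let’s walk through what our agent is, how it curates context, and how it keeps improving itself.
Our agent is powered by GPT‑5.2 and is designed to reason over OpenAI’s data platform. It is available wherever employees already work: as a Slack agent, through a web interface, inside IDEs, in the Codex CLI via MCP, and directly in OpenAI’s internal ChatGPT app through an MCP connector.
Users can ask complex, open-ended questions that would typically require multiple rounds of manual exploration. Take this example prompt, which uses a test dataset: “For NYC taxi trips, which pickup-to-dropoff ZIP pairs are the most unreliable, with the largest gap between typical and worst-case travel times, and when does that variability occur?”
The agent handles the analysis end to end, from understanding the question and exploring the data to running queries and synthesizing the findings.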
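One way to read the example prompt is as a per-route gap between a typical and a near-worst-case travel time. The sketch below assumes that "typical" means the median and "worst-case" means the 95th percentile; both choices, the nearest-rank percentile method, and the sample trips are assumptions made for illustration, not the agent’s actual query.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical trips: (pickup_zip, dropoff_zip, duration_minutes)
trips = [
    ("10001", "11201", 20), ("10001", "11201", 22), ("10001", "11201", 21),
    ("10001", "11201", 60), ("10001", "11201", 23),
    ("10016", "10022", 10), ("10016", "10022", 11), ("10016", "10022", 12),
    ("10016", "10022", 13), ("10016", "10022", 12),
]

def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Group durations by (pickup, dropoff) pair.
durations = defaultdict(list)
for pickup, dropoff, minutes in trips:
    durations[(pickup, dropoff)].append(minutes)

# Unreliability score: 95th percentile minus median, in minutes.
gaps = {
    pair: nearest_rank_percentile(vals, 95) - median(vals)
    for pair, vals in durations.items()
}
most_unreliable = max(gaps, key=gaps.get)
print(most_unreliable, gaps[most_unreliable])
```

Answering the "when" part of the question would follow the same pattern with the hour of day or day of week added to the grouping key.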

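The agent’s response to the question.
One of the agent’s greatest strengths is how it reasons through problems. Rather than following a fixed script, it evaluates its own progress. If an intermediate result looks wrong (for example, zero rows caused by an incorrect join or filter), the agent investigates what went wrong, adjusts its approach, and tries again. Throughout this process, it retains full context and carries what it has learned into later steps. This closed-loop, self-learning process shifts iteration from the user to the agent itself, producing results faster than manual workflows and keeping analysis quality consistently high.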
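The check-investigate-retry loop described above can be sketched in a few lines. This is a simplified illustration, not the agent’s implementation: the `events` table is hypothetical, and a case-insensitive value match stands in for the model’s reasoning about why a filter returned nothing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, platform TEXT);
    INSERT INTO events VALUES (1, 'web'), (2, 'web'), (3, 'ios');
""")

def run_with_self_check(conn, column, table, wanted):
    """Run a filtered count; if it returns zero, inspect real values and retry once.

    Identifiers are interpolated directly, so this is only safe with trusted input.
    """
    notes = []  # context carried forward between steps
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} = ?"
    count = conn.execute(query, (wanted,)).fetchone()[0]
    if count == 0:
        notes.append(f"{wanted!r} matched zero rows; checking actual values")
        actual = [r[0] for r in conn.execute(f"SELECT DISTINCT {column} FROM {table}")]
        # Stand-in for model reasoning: look for a value that differs only by case.
        matches = [v for v in actual if v.lower() == wanted.lower()]
        if matches:
            notes.append(f"retrying with {matches[0]!r}")
            count = conn.execute(query, (matches[0],)).fetchone()[0]
    return count, notes

count, notes = run_with_self_check(conn, "platform", "events", "Web")
print(count, notes)
```

The `notes` list plays the role of retained context: each step records what was tried and why, so later steps do not repeat the same mistake.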

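The agent’s reasoning as it identifies the most unreliable NYC taxi pickup-to-dropoff pairs.
The agent covers the full analytics workflow: discovering data, running SQL, and publishing notebooks and reports. It understands internal company knowledge, can search the web for external information, and improves over time through learned usage and memory.
High-quality answers depend on rich, accurate context. Without context, even the strongest models can produce wrong results, such as badly misestimating user counts or misinterpreting internal terminology.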

Without memory, the agent is unable to query effectively.

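The agent’s memory enables faster queries by locating the correct tables.
To avoid these failure modes, the agent is built around multiple layers of context that ground it in OpenAI’s data and institutional knowledge.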
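- Metadata grounding: The agent relies on schema metadata (column names and data types) to inform SQL writing, and uses table lineage (for example, upstream and downstream table relationships) to provide context on how different tables relate.
- Query inference: By analyzing historical queries, the agent learns how to write its own queries and which tables are typically joined together.
- Curated descriptions of tables and columns, provided by domain experts, capture design intent, semantics, business meaning, and known caveats that are hard to infer from schemas or past queries.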
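As a rough illustration of the query-inference idea in the list above, the sketch below counts which tables appear together in past queries. The query log is made up, and the regular expression is a deliberately naive stand-in for a real SQL parser.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical query log; real history would come from the warehouse's own logs.
query_log = [
    "SELECT * FROM sessions s JOIN users u ON s.user_id = u.id",
    "SELECT u.id FROM users u JOIN sessions s ON u.id = s.user_id",
    "SELECT * FROM users u JOIN billing b ON u.id = b.user_id",
]

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def tables_in(sql):
    """Extract table names following FROM/JOIN, de-duplicated and sorted."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql)})

# Count every unordered pair of tables that appears in the same query.
pair_counts = Counter()
for sql in query_log:
    pair_counts.update(combinations(tables_in(sql), 2))

most_common_pair, uses = pair_counts.most_common(1)[0]
print(most_common_pair, uses)
```

Pairs that co-occur often are reasonable join candidates to suggest, although frequency alone says nothing about whether a join is correct for a given question.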
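Metadata alone is not enough. To really tell tables apart, you need to understand how they were created and where they come from.
- By deriving a code-level definition of a table, the agent builds a deeper understanding of what the data actually contains.
- Nuances about what is stored in a table, and how it is derived from analytics events, provide extra information. For example, this can give context on the uniqueness of values, how often the table is updated, and the scope of the data (for example, that a table excludes certain fields and therefore has a particular level of granularity).
- This layer also provides richer usage context by showing how a table is used beyond SQL, in Spark, Python, and other data systems.
- In other words, the agent can distinguish between tables that look similar but differ in critical ways. For example, it can tell whether a table includes only first-party ChatGPT traffic. This context is refreshed automatically, so it stays up to date without manual maintenance.
- The agent can access Slack, Google Docs, and Notion, which capture critical company context such as product launches, reliability incidents, internal codenames and tools, and the canonical definitions and computation logic for key metrics.
- These documents are ingested, embedded, and stored together with metadata and permissions. A retrieval service handles access control and caching at runtime, enabling the agent to pull in this information efficiently and safely.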

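- When the agent is given a correction or discovers a nuance about a particular data question, it can save that learning for next time, allowing it to keep improving alongside its users.
- As a result, future answers start from a more accurate baseline instead of running into the same issues again.
- The goal of memory is to retain and reuse non-obvious corrections, filters, and constraints that are critical for data correctness but hard to infer from the other layers alone.
- For example, in one case the agent initially did not know how to filter for a particular analytics experiment (the filter relied on matching a specific string defined in an experiment gate). Memory was crucial here to ensure the agent filtered correctly instead of attempting fuzzy string matching.
- When you correct the agent, or when it finds something worth learning from your conversation, it prompts you to save that memory for next time.
- Users can also create and edit memories manually.
- Memories are scoped at the global and personal levels, and the agent’s tooling makes them easy to edit.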
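The global and personal scoping described in the list above can be pictured with a small in-memory store. This is a toy sketch: the class name, the rule that personal memories take precedence over global ones, and the example gate string are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy memory store: global memories apply to everyone, personal ones to one user."""
    global_memories: dict = field(default_factory=dict)
    personal_memories: dict = field(default_factory=dict)

    def save(self, topic, note, user=None):
        """Create or overwrite a memory; pass `user` to scope it to that person."""
        if user is None:
            self.global_memories[topic] = note
        else:
            self.personal_memories.setdefault(user, {})[topic] = note

    def recall(self, topic, user):
        """Return the user's personal memory if present, else the global one (assumed rule)."""
        personal = self.personal_memories.get(user, {})
        return personal.get(topic, self.global_memories.get(topic))

store = MemoryStore()
# Hypothetical correction saved globally after a user fixed the agent's filter.
store.save("experiment_x filter", "match gate string exactly: 'exp_x_gate_v2'")
# Hypothetical personal preference layered on top for one user.
store.save("experiment_x filter", "also exclude internal accounts", user="alice")

print(store.recall("experiment_x filter", user="bob"))
print(store.recall("experiment_x filter", user="alice"))
```

Because `save` overwrites by topic, the same call also covers manual editing of an existing memory.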

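- When no prior context exists for a table, or when existing information is stale, the agent can issue live queries to the data warehouse to inspect and query the table directly. This allows it to validate schemas, understand the data in real time, and respond accordingly.
- The agent can also talk to other data platform systems (such as the metadata service, Airflow, and Spark) as needed to get broader data context that exists outside the warehouse.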
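A live schema check of the kind described above might look like the following sketch. SQLite stands in for the data warehouse, `PRAGMA table_info` is SQLite-specific, and the table `daily_active` and the cached column list are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_active (day TEXT, user_id INTEGER, is_logged_in INTEGER)")
conn.execute("INSERT INTO daily_active VALUES ('2024-01-01', 1, 1)")

def inspect_table(conn, table):
    """Return live column names and types plus a row count for a table.

    The table name is interpolated directly, so only use trusted identifiers.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    columns = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"columns": columns, "row_count": row_count}

def stale_columns(cached_columns, live_columns):
    """Columns the cached context lists that no longer exist in the live table."""
    return sorted(set(cached_columns) - set(live_columns))

live = inspect_table(conn, "daily_active")
# Hypothetical cached context that still mentions a dropped `country` column.
stale = stale_columns(["day", "user_id", "country"], live["columns"])
print(live, stale)
```

A non-empty `stale` list is a signal that the offline context should not be trusted for this table until it is refreshed.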
We run a daily offline pipeline that aggregates table usage, human annotations, and Codex-derived enrichment into a single, normalized representation. This enriched context is then converted into embeddings using the OpenAI embeddings API and stored for retrieval. At query time, the agent pulls only the most relevant embedded context via retrieval-augmented generation (RAG) instead of scanning raw metadata or logs. This makes table understanding fast and scalable, even across tens of thousands of tables, while keeping runtime latency predictable and low. Runtime queries are issued to our data warehouse live as needed.
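The embed-offline, retrieve-at-query-time split can be sketched without any external service. In the example below, `embed` is a toy bag-of-words function standing in for a real embedding model, and the table names and descriptions are invented; only the overall shape (precomputed index, cosine-similarity lookup) reflects the paragraph above.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Offline step: enriched table descriptions are embedded once and stored.
table_context = {
    "chatgpt_dau": "daily active users for first-party chatgpt traffic logged-in only",
    "api_requests": "api request logs with latency and token counts",
    "billing_invoices": "monthly invoices and payment status per customer",
}
index = {name: embed(desc) for name, desc in table_context.items()}

def retrieve(question, k=1):
    """Query-time step: return the k table names most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:k]

top = retrieve("how many daily active users did chatgpt have")
print(top)
```

Only the top-ranked context would be passed to the model, which is what keeps the prompt small even when the index covers many tables.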
Together, these layers ensure the agent’s reasoning is grounded in OpenAI’s data, code, and institutional knowledge, dramatically reducing errors and improving answer quality.
One-shot answers work when the problem is clear, but most questions aren’t. More often, arriving at the correct result requires back-and-forth refinement and some course correction.
The agent is built to behave like a teammate you can reason with. It’s conversational, always-on, and handles both quick answers and iterative exploration.
It carries over complete context across turns, so users can ask follow-up questions, adjust their intent, or change direction without restating everything. If the agent starts heading down the wrong path, users can interrupt mid-analysis and redirect it, just like working with a human collaborator who listens instead of plowing ahead.
When instructions are unclear or incomplete, the agent proactively asks clarifying questions. If no response is provided, it applies sensible defaults to make progress. For example, if a user asks about business growth with no date range specified, it may assume the last seven or 30 days. These priors allow it to stay responsive and non-blocking while still converging on the right outcome.
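The fallback-to-defaults behavior can be illustrated with a small helper that fills in a missing date range and records what it assumed. The function name and the choice of a 30-day lookback are assumptions for this sketch; the post only says the agent may assume the last seven or 30 days.

```python
from datetime import date, timedelta

DEFAULT_LOOKBACK_DAYS = 30  # assumed default; the post mentions seven or 30 days

def resolve_date_range(start=None, end=None, today=None):
    """Fill in a missing date range and report which assumptions were made."""
    today = today or date.today()
    assumptions = []
    if end is None:
        end = today
        assumptions.append("end date defaulted to today")
    if start is None:
        start = end - timedelta(days=DEFAULT_LOOKBACK_DAYS)
        assumptions.append(f"start date defaulted to {DEFAULT_LOOKBACK_DAYS} days before end")
    return start, end, assumptions

# A question with no date range at all, evaluated on a fixed "today" for reproducibility.
start, end, assumptions = resolve_date_range(today=date(2025, 3, 31))
print(start, end, assumptions)
```

Returning the assumptions alongside the range lets the answer state them explicitly, so the user can correct them in a follow-up turn.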
The result is an agent that works well when you know exactly what you want (e.g., “Tell me about this table”) and is just as strong when you’re exploring (e.g., “I’m seeing a dip here, can we break this down by customer type and timeframe?”).
After rollout, we observed that users frequently ran the same analyses for routine repetitive work. To expedite this, the agent's workflows package recurring analyses into reusable instruction sets. Examples include workflows for weekly business reports and table validations. By encoding context and best practices once, workflows streamline repeat analyses and ensure consistent results across users.

Building an always-on, evolving agent means quality can drift just as easily as it can improve. Without a tight feedback loop, regressions are inevitable and invisible. The only way to scale capability without breaking trust is through systematic evaluation.
In this section, we’ll discuss how we leverage OpenAI’s Evals API to measure and protect the agent’s response quality.
The agent’s evals are built on curated sets of question-answer pairs. Each question targets an important metric or analytical pattern we care deeply about getting right, paired with a manually authored “golden” SQL query that produces the expected result. For each eval, we send the natural language question to the agent’s query-generation endpoint, execute the generated SQL, and compare the output against the result of the expected SQL.
Evaluation doesn’t rely on naive string matching. Generated SQL can differ syntactically while still being correct, and result sets may include extra columns that don’t materially affect the answer. To account for this, we compare both the SQL and the resulting data, and feed these signals into OpenAI’s Evals grader. The grader produces a final score along with an explanation, capturing both correctness and acceptable variation.
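The result-comparison signal can be approximated with a check that ignores row order and extra columns. This sketch covers only that data comparison; it does not reproduce the Evals grader, and the function names and sample rows are hypothetical.

```python
def project(rows, columns):
    """Keep only the given columns and sort rows so ordering does not matter."""
    return sorted(tuple(row[c] for c in columns) for row in rows)

def results_match(expected_rows, generated_rows):
    """True if the generated rows contain the expected data, ignoring extra columns."""
    if not expected_rows:
        return not generated_rows
    columns = sorted(expected_rows[0])
    # A generated result missing an expected column cannot match.
    if any(c not in row for row in generated_rows for c in columns):
        return False
    return project(expected_rows, columns) == project(generated_rows, columns)

# Output of the hypothetical "golden" SQL.
expected = [{"week": "2025-W01", "wau": 120}, {"week": "2025-W02", "wau": 150}]
# Generated SQL returned the same data in a different order, plus an extra column.
generated = [
    {"week": "2025-W02", "wau": 150, "region": "all"},
    {"week": "2025-W01", "wau": 120, "region": "all"},
]
# Generated SQL that is off by one for the second week.
wrong = [{"week": "2025-W01", "wau": 120}, {"week": "2025-W02", "wau": 151}]

print(results_match(expected, generated), results_match(expected, wrong))
```

A boolean like this is one input among several; a grader that also sees the two SQL texts can explain why a mismatch occurred or judge a variation acceptable.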
These evals work like unit tests that run continuously during development and like canaries in production to identify regressions; this allows us to catch issues early and iterate confidently as the agent’s capabilities expand.
Our agent plugs directly into OpenAI’s existing security and access-control model. It operates purely as an interface layer, inheriting and enforcing the same permissions and guardrails that govern OpenAI’s data.
All of the agent’s access is strictly pass-through, meaning users can only query tables they already have permission to access. When access is missing, it flags this or falls back to alternative datasets the user is authorized to use.
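Pass-through access with a fallback to authorized alternatives can be sketched as a simple lookup. Everything here is hypothetical: the permission map, the fallback list, and the table names are invented, and a real system would defer to the existing access-control service rather than a dictionary.

```python
# Hypothetical permission map; a real system would ask the access-control service.
USER_TABLE_ACCESS = {
    "alice": {"revenue_detailed", "revenue_summary"},
    "bob": {"revenue_summary"},
}
# Hypothetical alternatives to try, in order, when the preferred table is off limits.
FALLBACKS = {"revenue_detailed": ["revenue_summary"]}

def choose_table(user, preferred):
    """Return (table, note); never returns a table the user cannot already query."""
    allowed = USER_TABLE_ACCESS.get(user, set())
    if preferred in allowed:
        return preferred, None
    for alternative in FALLBACKS.get(preferred, []):
        if alternative in allowed:
            return alternative, f"no access to {preferred}; using {alternative} instead"
    return None, f"no access to {preferred} and no authorized alternative"

print(choose_table("alice", "revenue_detailed"))
print(choose_table("bob", "revenue_detailed"))
print(choose_table("carol", "revenue_detailed"))
```

The note returned with each decision is what lets the agent flag missing access to the user instead of silently switching datasets.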
Finally, it's built for transparency. Like any system, it can make mistakes. It exposes its reasoning process by summarizing assumptions and execution steps alongside each answer. When queries are executed, it links directly to the underlying results, allowing users to inspect raw data and verify every step of the analysis.
Building our agent from scratch surfaced practical lessons about how agents behave, where they struggle, and what actually makes them reliable at scale.
Early on, we exposed our full tool set to the agent, and quickly ran into problems with overlapping functionality. While this redundancy can be helpful for specific custom cases and is easier for a human to navigate when invoking tools manually, it’s confusing to agents. To reduce ambiguity and improve reliability, we restricted and consolidated certain tool calls.
We also discovered that highly prescriptive prompting degraded results. While many questions share a general analytical shape, the details vary enough that rigid instructions often pushed the agent down incorrect paths. By shifting to higher-level guidance and relying on GPT‑5’s reasoning to choose the appropriate execution path, the agent became more robust and produced better results.
Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata. By crawling the codebase with Codex, our agent understands how datasets are actually constructed and is able to better reason about what each table actually contains. It can answer “what’s in here” and “when can I use it” far more accurately than from warehouse signals alone.
We’re constantly working to improve our agent by increasing its ability to handle ambiguous questions, improving its reliability and accuracy with stronger validations, and integrating it more deeply into workflows. We believe it should blend naturally into how people already work, instead of functioning like a separate tool.
While our tooling will keep benefiting from underlying improvements in agent reasoning, validation, and self-correction, our team’s mission remains the same: seamlessly deliver fast, trustworthy data analysis across OpenAI’s data ecosystem.


