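Data drives how systems learn, products evolve, and companies make decisions. But getting answers quickly, correctly, and with the right context is often harder than it should be. To make this easier as OpenAI scales, we built our own bespoke in-house AI data agent that explores and reasons over our own platform.
Our agent is a custom internal-only tool (not an external offering), built specifically around OpenAI’s data, permissions, and workflows. We’re showing how we built and use it to illustrate the real, impactful ways AI can support day-to-day work across our teams. The OpenAI tools we used to build and run it (Codex, our GPT‑5 flagship model, the Evals API, and the Embeddings API) are the same tools we make available to developers everywhere.
Our data agent lets employees go from question to insight in minutes, not days. This lowers the bar for pulling data and doing nuanced analysis across all functions, not just within our data team. Today, teams across Engineering, Data Science, Go-To-Market, Finance, and Research at OpenAI rely on the agent to answer high-impact data questions. For example, it can help answer how to evaluate product launches and understand business health, all through the intuitive format of natural language. The agent combines Codex-powered table-level knowledge with product and organizational context. Its continuously learning memory system means it improves with every use.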

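In this post, we’ll break down why we needed a bespoke AI data agent, what makes its code-enriched data context and self-learning so useful, and the lessons we learned along the way.
OpenAI’s data platform serves more than 3,500 internal users across Engineering, Product, and Research, spanning over 600 petabytes of data across 70,000 datasets. At that scale, simply finding the right table can be one of the most time-consuming parts of doing analysis.
As one internal user put it:
“We have a lot of tables that are fairly similar, and I spend tons of time trying to figure out how they’re different and which to use.” Some include logged-out users, some don’t. Some have overlapping fields, and it’s hard to tell what each one is for.
Even with the correct tables selected, producing correct results can be challenging. Analysts must reason about table data and table relationships to ensure transformations and filters are applied correctly. Common failure modes (many-to-many joins, filter pushdown errors, and unhandled nulls) can silently invalidate results. At OpenAI’s scale, analysts should not have to sink time into debugging SQL semantics or query performance: their focus should be on defining metrics, validating assumptions, and making data-driven decisions.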
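To make two of these failure modes concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `orders` and `sessions` tables and their contents are invented for illustration and are not from OpenAI’s warehouse. It shows how a many-to-many join inflates a sum and how a null join key silently drops a row.

```python
import sqlite3

# Hypothetical toy tables; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    CREATE TABLE sessions (user_id INTEGER, device TEXT);
    INSERT INTO orders VALUES (1, 10.0), (1, 20.0), (2, 5.0), (NULL, 7.0);
    INSERT INTO sessions VALUES (1, 'web'), (1, 'ios'), (2, 'web');
""")

# Ground truth: total revenue straight from the orders table.
true_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

# Many-to-many join: user 1 has two orders and two sessions, so each of that
# user's orders is counted twice. The order with a NULL user_id matches
# nothing and silently disappears from the result.
joined_total = conn.execute("""
    SELECT SUM(o.amount)
    FROM orders o
    JOIN sessions s ON o.user_id = s.user_id
""").fetchone()[0]

# A cheap guard: compare row counts before and after the join.
rows_before = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
rows_after = conn.execute("""
    SELECT COUNT(*)
    FROM orders o
    JOIN sessions s ON o.user_id = s.user_id
""").fetchone()[0]

print(true_total, joined_total, rows_before, rows_after)
```

The joined total differs from the true total even though the query runs without error, which is why row-count checks before and after a join are a useful habit.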

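This SQL statement is more than 180 lines long. It’s not easy to know whether we’re joining the right tables and querying the right columns.
Let’s walk through what our agent is, how it curates context, and how it keeps improving itself.
Our agent is powered by GPT‑5.2 and is designed to reason over OpenAI’s data platform. It’s available wherever employees already work: as a Slack agent, through a web interface, inside IDEs, in the Codex CLI via MCP, and directly in OpenAI’s internal ChatGPT app through an MCP connector.
Users can ask complex, open-ended questions that would typically require multiple rounds of manual exploration. Take this example prompt: “For NYC taxi trips, which pickup-to-dropoff ZIP pairs are the most unreliable (that is, have the largest gap between typical and worst-case travel times), and when does that variability occur?”
The agent handles the analysis end to end, from understanding the question to exploring the data, running queries, and synthesizing findings.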

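The agent’s response to the question.
One of the agent’s superpowers is how it reasons through problems. Rather than following a fixed script, the agent evaluates its own progress. If an intermediate result looks wrong (for example, zero rows caused by an incorrect join or filter), the agent investigates what went wrong, adjusts its approach, and tries again. Throughout this process, it retains full context and carries learnings forward between steps. This closed-loop, self-learning process shifts iteration from the user to the agent itself, enabling faster results and consistently higher-quality analyses than manual workflows.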
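The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_with_self_check`, the `revise` callback, and the `trips` table are hypothetical names, and the hard-coded `revise` function stands in for the model’s reasoning step.

```python
import sqlite3

def run_with_self_check(conn, first_sql, revise, max_attempts=3):
    """Run a query; if it returns zero rows, ask `revise` for a new query."""
    sql = first_sql
    history = []  # context carried forward across attempts
    for _ in range(max_attempts):
        rows = conn.execute(sql).fetchall()
        history.append((sql, len(rows)))
        if rows:  # the result looks plausible, so stop iterating
            return rows, history
        sql = revise(sql, history)  # stand-in for the model's next attempt
    return [], history

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trips (trip_id INTEGER, city TEXT);
    INSERT INTO trips VALUES (1, 'NYC'), (2, 'NYC'), (3, 'SF');
""")

def revise(sql, history):
    # Hypothetical fix: the first filter used the wrong letter case.
    return sql.replace("city = 'nyc'", "UPPER(city) = 'NYC'")

rows, history = run_with_self_check(
    conn,
    "SELECT trip_id FROM trips WHERE city = 'nyc' ORDER BY trip_id",
    revise,
)
print(rows, [count for _, count in history])
```

The `history` list is what lets a later attempt see why earlier ones failed, mirroring how the agent keeps context between steps.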

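The agent’s reasoning to identify the most unreliable NYC taxi pickup-dropoff pairs.
The agent covers the full analytics workflow: discovering data, running SQL, and publishing notebooks and reports. It understands internal company knowledge, can search the web for external information, and improves over time through learned usage and memory.
High-quality answers depend on rich, accurate context. Without context, even strong models can produce wrong results, such as vastly misestimating user counts or misinterpreting internal terminology.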

The agent without memory, unable to query effectively.

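The agent’s memory enables faster queries by locating the correct tables.
To avoid these failure modes, the agent is built around multiple layers of context that ground it in OpenAI’s data and institutional knowledge.
- Metadata grounding: The agent relies on schema metadata (column names and data types) to inform SQL writing and uses table lineage (for example, upstream and downstream table relationships) to provide context on how different tables relate.
- Query inference: Ingesting historical queries helps the agent understand how to write its own queries and which tables are typically joined together.
- Curated descriptions of tables and columns provided by domain experts, capturing intent, semantics, business meaning, and known caveats that are not easily inferred from schemas or past queries.
Metadata alone isn’t enough. To really tell tables apart, you need to understand how they were created and where they originate.
- By deriving a code-level definition of a table, the agent builds a deeper understanding of what the data actually contains.
- Nuances about what is stored in the table and how it is derived from analytics events provide extra information. For example, this can give context on the uniqueness of values, how often the table data is updated, and the scope of the data (for example, which fields the table excludes and what level of granularity it has).
- This provides richer usage context by showing how the table is used beyond SQL in Spark, Python, and other data systems.
- This means the agent can distinguish between tables that look similar but differ in critical ways. For example, it can tell whether a table only includes first-party ChatGPT traffic. This context is also refreshed automatically, so it stays up to date without manual maintenance.
- The agent can access Slack, Google Docs, and Notion, which capture critical company context such as launches, reliability incidents, internal codenames and tools, and the canonical definitions and computation logic for key metrics.
- These documents are ingested, embedded, and stored with metadata and permissions. A retrieval service handles access control and caching at runtime, enabling the agent to pull in this information efficiently and safely.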
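As a rough illustration of retrieval that respects permissions and caches results, consider the sketch below. It assumes a hypothetical in-memory document list, group-based permissions, and simple keyword matching in place of embeddings; none of these details come from the actual retrieval service.

```python
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document

# Hypothetical documents, for illustration only.
DOCS = (
    Doc("launch-notes", "Launch notes for the new onboarding flow",
        frozenset({"product", "eng"})),
    Doc("finance-metric", "Canonical definition of weekly revenue",
        frozenset({"finance"})),
)

@lru_cache(maxsize=1024)
def retrieve(query: str, user_groups: frozenset) -> tuple:
    """Return ids of documents that match the query and that the user may read."""
    terms = query.lower().split()
    hits = []
    for doc in DOCS:
        if not (doc.allowed_groups & user_groups):
            continue  # access control is applied before any matching
        if any(term in doc.text.lower() for term in terms):
            hits.append(doc.doc_id)
    return tuple(hits)

print(retrieve("revenue definition", frozenset({"finance"})))
print(retrieve("revenue definition", frozenset({"eng"})))
```

Because the cache key includes the caller’s groups, a cached answer for one permission set is never served to a user with different permissions.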

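- When the agent is given corrections or discovers nuances about certain data questions, it can save these learnings for next time, allowing it to keep improving alongside its users.
- As a result, future answers begin from a more accurate baseline rather than repeatedly running into the same issues.
- The goal of memory is to retain and reuse non-obvious corrections, filters, and constraints that are critical for data correctness but difficult to infer from the other layers alone.
- For example, in one case, the agent didn’t know how to filter for a particular analytics experiment (it relied on matching against a specific string defined in an experiment gate). Memory was crucial here to ensure it could filter correctly, instead of attempting a fuzzy string match.
- When you give the agent a correction or when it learns something new from your conversation, it will prompt you to save that memory for next time.
- Users can also create and edit memories manually.
- Memories are scoped at the global and personal levels, and the agent’s tooling makes it easy to edit them.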
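One way to picture global and personal memory scopes is the small store below. It is a sketch under assumptions: the `MemoryStore` class, its method names, and the rule that a personal memory takes precedence over a global one are all hypothetical choices, not details from the source.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryStore:
    global_memories: dict = field(default_factory=dict)    # topic -> note
    personal_memories: dict = field(default_factory=dict)  # user -> {topic: note}

    def save(self, topic: str, note: str, user: Optional[str] = None) -> None:
        """Save a memory globally, or personally when a user is given."""
        if user is None:
            self.global_memories[topic] = note
        else:
            self.personal_memories.setdefault(user, {})[topic] = note

    def recall(self, topic: str, user: str) -> Optional[str]:
        """Assumed precedence: a personal memory wins over a global one."""
        personal = self.personal_memories.get(user, {})
        return personal.get(topic, self.global_memories.get(topic))

store = MemoryStore()
store.save("experiment_filter", "Match the exact gate string; do not fuzzy match.")
store.save("default_region", "Report on EMEA unless told otherwise.", user="alice")

print(store.recall("experiment_filter", user="alice"))
print(store.recall("default_region", user="bob"))
```

Keeping the two scopes separate means one person’s preference never changes the baseline for everyone else, while corrections that matter for correctness can be shared.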

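- When no prior context exists for a table or when existing information is stale, the agent can issue live queries to the data warehouse to inspect and query the table directly. This allows it to validate schemas, understand the data in real time, and respond accordingly.
- The agent can also talk to other data platform systems (metadata service, Airflow, Spark) as needed to get broader data context that exists outside the warehouse.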
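A live inspection step might look like the following sketch, which uses SQLite as a stand-in for the warehouse. The `inspect_table` helper and the `events` table are hypothetical; a real warehouse would expose schema information through its own catalog rather than `PRAGMA table_info`.

```python
import sqlite3

def inspect_table(conn, table):
    """Collect column names and types, a row count, and a small sample."""
    if not table.isidentifier():
        # Guard against injecting SQL through the table name.
        raise ValueError(f"unsafe table name: {table!r}")
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    sample = conn.execute(f"SELECT * FROM {table} LIMIT 3").fetchall()
    return {"columns": columns, "row_count": row_count, "sample": sample}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event_name TEXT);
    INSERT INTO events VALUES (1, 'login'), (2, 'logout');
""")

info = inspect_table(conn, "events")
print(info["columns"], info["row_count"])
```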
We run a daily offline pipeline that aggregates table usage, human annotations, and Codex-derived enrichment into a single, normalized representation. This enriched context is then converted into embeddings using the OpenAI embeddings API and stored for retrieval. At query time, the agent pulls only the most relevant embedded context via retrieval-augmented generation (RAG) instead of scanning raw metadata or logs. This makes table understanding fast and scalable, even across tens of thousands of tables, while keeping runtime latency predictable and low. Runtime queries are issued to our data warehouse live as needed.
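The split between an offline indexing step and query-time retrieval can be sketched as follows. To keep the example self-contained, a bag-of-words counter stands in for a real embedding model, and the table names and descriptions are invented; a production pipeline would call an embeddings API in `embed` instead.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Offline step: embed the enriched description of every table once.
# These tables and descriptions are hypothetical.
TABLE_CONTEXT = {
    "daily_active_users": "daily active users per day including logged-out users",
    "chat_messages": "one row per chat message with user id and timestamp",
    "billing_invoices": "monthly invoices and payments per customer",
}
INDEX = {name: embed(description) for name, description in TABLE_CONTEXT.items()}

def top_k(question, k=2):
    """Query-time step: return the k tables most similar to the question."""
    query_vector = embed(question)
    ranked = sorted(INDEX, key=lambda name: cosine(query_vector, INDEX[name]), reverse=True)
    return ranked[:k]

print(top_k("how many active users per day", k=1))
```

Only the top-ranked context is passed to the model, which is what keeps query-time work small even when the index covers many tables.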
Together, these layers ensure the agent’s reasoning is grounded in OpenAI’s data, code, and institutional knowledge, dramatically reducing errors and improving answer quality.
One-shot answers work when the problem is clear, but most questions aren’t. More often, arriving at the correct result requires back-and-forth refinement and some course correction.
The agent is built to behave like a teammate you can reason with. It’s conversational, always-on, and handles both quick answers and iterative exploration.
It carries over complete context across turns, so users can ask follow-up questions, adjust their intent, or change direction without restating everything. If the agent starts heading down the wrong path, users can interrupt mid-analysis and redirect it, just like working with a human collaborator who listens instead of plowing ahead.
When instructions are unclear or incomplete, the agent proactively asks clarifying questions. If no response is provided, it applies sensible defaults to make progress. For example, if a user asks about business growth with no date range specified, it may assume the last 7 or 30 days. These priors allow it to stay responsive and non-blocking while still converging on the right outcome.
The result is an agent that works well both when you know exactly what you want (e.g., “Tell me about this table”) and when you’re exploring (e.g., “I’m seeing a dip here, can we break this down by customer type and timeframe?”).
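A default of this kind is easy to express in code. The sketch below assumes a hypothetical `resolve_date_range` helper and a 30-day default window; it also returns a flag so the assumption can be surfaced to the user rather than applied silently.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

def resolve_date_range(
    start: Optional[date],
    end: Optional[date],
    today: date,
    default_days: int = 30,
) -> Tuple[date, date, bool]:
    """Fill in a missing date range and report whether a default was applied."""
    assumed = start is None or end is None
    end = end or today
    start = start or end - timedelta(days=default_days)
    return start, end, assumed

start, end, assumed = resolve_date_range(None, None, today=date(2025, 1, 31))
if assumed:
    print(f"No date range given; assuming {start} to {end}.")
```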
After rollout, we observed that users frequently ran the same analyses for routine repetitive work. To expedite this, the agent's workflows package recurring analyses into reusable instruction sets. Examples include workflows for weekly business reports and table validations. By encoding context and best practices once, workflows streamline repeat analyses and ensure consistent results across users.
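As a rough sketch of what a reusable instruction set could look like, the example below models a workflow as a named prompt template with required parameters. The `Workflow` class, the template wording, and the parameter names are assumptions for illustration, not the agent’s actual format.

```python
from dataclasses import dataclass
from string import Template

@dataclass(frozen=True)
class Workflow:
    name: str
    instructions: Template  # instructions written once, reused by every user

    def render(self, **params) -> str:
        # substitute() raises KeyError if a required parameter is missing,
        # so every run of the workflow is fully specified.
        return self.instructions.substitute(**params)

WEEKLY_REPORT = Workflow(
    name="weekly_business_report",
    instructions=Template(
        "Summarize $metric for the week ending $week_end. "
        "Compare against the prior week and flag changes above $threshold percent."
    ),
)

prompt = WEEKLY_REPORT.render(metric="weekly revenue", week_end="2025-01-31", threshold=5)
print(prompt)
```

Because the instructions live in one place, two people running the same workflow with the same parameters send the agent the same request.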

Building an always-on, evolving agent means quality can drift just as easily as it can improve. Without a tight feedback loop, regressions are inevitable and invisible. The only way to scale capability without breaking trust is through systematic evaluation.
In this section, we’ll discuss how we leverage OpenAI’s Evals API to measure and protect the agent’s response quality.
The agent’s evals are built on curated sets of question-answer pairs. Each question targets an important metric or analytical pattern we care deeply about getting right, paired with a manually authored “golden” SQL query that produces the expected result. For each eval, we send the natural language question to the agent’s query-generation endpoint, execute the generated SQL, and compare the output against the result of the expected SQL.
Evaluation doesn’t rely on naive string matching. Generated SQL can differ syntactically while still being correct, and result sets may include extra columns that don’t materially affect the answer. To account for this, we compare both the SQL and the resulting data, and feed these signals into OpenAI’s Evals grader. The grader produces a final score along with an explanation, capturing both correctness and acceptable variation.
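The result-comparison signal can be illustrated with a small check that tolerates extra columns and different row order. This is a simplified sketch on SQLite with an invented `signups` table; the `results_match` helper is hypothetical and covers only the deterministic data comparison, not the grader that produces the final score and explanation.

```python
import sqlite3

def results_match(conn, generated_sql, golden_sql):
    """True if the generated result contains the golden result's columns and rows."""
    generated = conn.execute(generated_sql)
    generated_cols = [d[0] for d in generated.description]
    generated_rows = generated.fetchall()

    golden = conn.execute(golden_sql)
    golden_cols = [d[0] for d in golden.description]
    golden_rows = golden.fetchall()

    if not set(golden_cols) <= set(generated_cols):
        return False  # an expected column is missing entirely
    # Project the generated rows onto the golden columns, ignoring extras.
    positions = [generated_cols.index(col) for col in golden_cols]
    projected = [tuple(row[i] for i in positions) for row in generated_rows]
    # Compare as order-insensitive collections of rows.
    return sorted(projected, key=repr) == sorted(golden_rows, key=repr)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (country TEXT, day TEXT);
    INSERT INTO signups VALUES ('US', 'd1'), ('US', 'd1'), ('DE', 'd2');
""")

golden = "SELECT country, COUNT(*) AS n FROM signups GROUP BY country"
extra_column = (
    "SELECT country, COUNT(*) AS n, MIN(day) AS first_day "
    "FROM signups GROUP BY country ORDER BY n DESC"
)
wrong_metric = "SELECT country, COUNT(DISTINCT day) AS n FROM signups GROUP BY country"

print(results_match(conn, extra_column, golden), results_match(conn, wrong_metric, golden))
```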
These evals act like unit tests that run continuously during development and like canaries in production, identifying regressions; this allows us to catch issues early and iterate confidently as the agent’s capabilities expand.
Our agent plugs directly into OpenAI’s existing security and access-control model. It operates purely as an interface layer, inheriting and enforcing the same permissions and guardrails that govern OpenAI’s data.
All of the agent’s access is strictly pass-through, meaning users can only query tables they already have permission to access. When access is missing, it flags this or falls back to alternative datasets the user is authorized to use.
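The pass-through behavior can be sketched as a lookup that either confirms access, falls back to an authorized alternative, or flags the gap. The grant table, fallback map, and `resolve_table` helper below are hypothetical; in practice the decision would come from the existing access-control system rather than an in-memory dictionary.

```python
# Hypothetical grants and fallbacks, for illustration only.
GRANTS = {
    "alice": {"sales_summary"},
    "bob": {"sales_raw", "sales_summary"},
}
FALLBACKS = {"sales_raw": ["sales_summary"]}

def resolve_table(user, table):
    """Return (table_to_query, note), never granting more than the user has."""
    allowed = GRANTS.get(user, set())
    if table in allowed:
        return table, None
    for alternative in FALLBACKS.get(table, []):
        if alternative in allowed:
            return alternative, f"No access to {table}; using {alternative} instead."
    # Nothing the user can read: flag it rather than querying on their behalf.
    raise PermissionError(f"{user} has no access to {table} or its alternatives.")

print(resolve_table("bob", "sales_raw"))
print(resolve_table("alice", "sales_raw"))
```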
Finally, it's built for transparency. Like any system, it can make mistakes. It exposes its reasoning process by summarizing assumptions and execution steps alongside each answer. When queries are executed, it links directly to the underlying results, allowing users to inspect raw data and verify every step of the analysis.
Building our agent from scratch surfaced practical lessons about how agents behave, where they struggle, and what actually makes them reliable at scale.
Early on, we exposed our full tool set to the agent, and quickly ran into problems with overlapping functionality. While this redundancy can be helpful for specific custom cases and is easier for a human to navigate when invoking tools manually, it’s confusing to agents. To reduce ambiguity and improve reliability, we restricted and consolidated certain tool calls.
We also discovered that highly prescriptive prompting degraded results. While many questions share a general analytical shape, the details vary enough that rigid instructions often pushed the agent down incorrect paths. By shifting to higher-level guidance and relying on GPT‑5’s reasoning to choose the appropriate execution path, the agent became more robust and produced better results.
Schemas and query history describe a table’s shape and usage, but its true meaning lives in the code that produces it. Pipeline logic captures assumptions, freshness guarantees, and business intent that never surface in SQL or metadata. By crawling the codebase with Codex, our agent understands how datasets are actually constructed and is able to better reason about what each table actually contains. It can answer “what’s in here” and “when can I use it” far more accurately than from warehouse signals alone.
We’re constantly working to improve our agent by increasing its ability to handle ambiguous questions, improving its reliability and accuracy with stronger validations, and integrating it more deeply into workflows. We believe it should blend naturally into how people already work, instead of functioning like a separate tool.
While our tooling will keep benefiting from underlying improvements in agent reasoning, validation, and self-correction, our team’s mission remains the same: seamlessly deliver fast, trustworthy data analysis across OpenAI’s data ecosystem.


