Codex CLI(jinfetaħ f’tieqa ġdida) is our cross-platform local software agent, designed to produce high-quality, reliable software changes while operating safely and efficiently on your machine. We’ve learned a tremendous amount about how to build a world-class software agent since we first launched the CLI in April. To unpack those insights, this is the first post in an ongoing series where we’ll explore various aspects of how Codex works, as well as hard-earned lessons. (For an even more granular view on how the Codex CLI is built, check out our open source repository at https://github.com/openai/codex(jinfetaħ f’tieqa ġdida). Many of the finer details of our design decisions are memorialized in GitHub issues and pull requests if you’d like to learn more.)
To kick off, we’ll focus on the agent loop, which is the core logic in Codex CLI that is responsible for orchestrating the interaction between the user, the model, and the tools the model invokes to perform meaningful software work. We hope this post gives you a good view into the role our agent (or “harness”) plays in making use of an LLM.
Before we dive in, a quick note on terminology: at OpenAI, “Codex” encompasses a suite of software agent offerings, including Codex CLI, Codex Cloud, and the Codex VS Code extension. This post focuses on the Codex harness, which provides the core agent loop and execution logic that underlies all Codex experiences and is surfaced through the Codex CLI. For ease here, we’ll use the terms “Codex” and “Codex CLI” interchangeably.
At the heart of every AI agent is something called “the agent loop.” A simplified illustration of the agent loop looks like this:
To start, the agent takes input from the user to include in the set of textual instructions it prepares for the model known as a prompt.
The next step is to query the model by sending it our instructions and asking it to generate a response, a process known as inference. During inference, the textual prompt is first translated into a sequence of input tokens(jinfetaħ f’tieqa ġdida)—integers that index into the model’s vocabulary. These tokens are then used to sample the model, producing a new sequence of output tokens.
The output tokens are translated back into text, which becomes the model’s response. Because tokens are produced incrementally, this translation can happen as the model runs, which is why many LLM-based applications display streaming output. In practice, inference is usually encapsulated behind an API that operates on text, abstracting away the details of tokenization.
As the result of the inference step, the model either (1) produces a final response to the user’s original input, or (2) requests a tool call that the agent is expected to perform (e.g., “run ls and report the output”). In the case of (2), the agent executes the tool call and appends its output to the original prompt. This output is used to generate a new input that’s used to re-query the model; the agent can then take this new information into account and try again.
This process repeats until the model stops emitting tool calls and instead produces a message for the user (referred to as an assistant message in OpenAI models). In many cases, this message directly answers the user’s original request, but it may also be a follow-up question for the user.
Because the agent can execute tool calls that modify the local environment, its “output” is not limited to the assistant message. In many cases, the primary output of a software agent is the code it writes or edits on your machine. Nevertheless, each turn always ends with an assistant message—such as “I added the architecture.md you asked for”—which signals a termination state in the agent loop. From the agent’s perspective, its work is complete and control returns to the user.
The journey from user input to agent response shown in the diagram is referred to as one turn of a conversation (a thread in Codex). Though this conversation turn can include many iterations between the model inference and tool calls. Every time you send a new message to an existing conversation, the conversation history is included as part of the prompt for the new turn, which includes the messages and tool calls from previous turns:
This means that as the conversation grows, so does the length of the prompt used to sample the model. This length matters because every model has a context window, which is the maximum number of tokens it can use for one inference call. Note this window includes both input and output tokens. As you might imagine, an agent could decide to make hundreds of tool calls in a single turn, potentially exhausting the context window. For this reason, context window management is one of the agent’s many responsibilities. Now, let’s dive in to see how Codex runs the agent loop.
The Codex CLI sends HTTP requests to the Responses API(jinfetaħ f’tieqa ġdida) to run model inference. We’ll examine how information flows through Codex, which uses the Responses API to drive the agent loop.
The Responses API endpoint that the Codex CLI uses is configurable(jinfetaħ f’tieqa ġdida), so it can be used with any endpoint that implements the Responses API(jinfetaħ f’tieqa ġdida):
- When using ChatGPT login(jinfetaħ f’tieqa ġdida) with the Codex CLI, it uses
https://chatgpt.com/backend-api/codex/responsesas the endpoint - When using API-key authentication(jinfetaħ f’tieqa ġdida) with OpenAI hosted models, it uses
https://api.openai.com/v1/responsesas the endpoint - When running Codex CLI with
--ossto use gpt-oss with ollama 0.13.4+(jinfetaħ f’tieqa ġdida) or LM Studio 0.3.39+(jinfetaħ f’tieqa ġdida), it defaults tohttp://localhost:11434/v1/responsesrunning locally on your computer - Codex CLI can be used with the Responses API hosted by a cloud provider such as Azure
Let’s explore how Codex creates the prompt for the first inference call in a conversation.
As an end user, you don’t specify the prompt used to sample the model verbatim when you query the Responses API. Instead, you specify various input types as part of your query, and the Responses API server decides how to structure this information into a prompt that the model is designed to consume. You can think of the prompt as a “list of items”; this section will explain how your query gets transformed into that list.
In the initial prompt, every item in the list is associated with a role. The role indicates how much weight the associated content should have and is one of the following values (in decreasing order of priority): system, developer, user, assistant.
The Responses API(jinfetaħ f’tieqa ġdida) takes a JSON payload with many parameters. We’ll focus on these three:
instructions(jinfetaħ f’tieqa ġdida): system (or developer) message inserted into the model’s contexttools(jinfetaħ f’tieqa ġdida): a list of tools the model may call while generating a responseinput(jinfetaħ f’tieqa ġdida): a list of text, image, or file inputs to the model
In Codex, the instructions field is read from the model_instructions_file(jinfetaħ f’tieqa ġdida) in ~/.codex/config.toml, if specified; otherwise, the base_instructions associated with a model(jinfetaħ f’tieqa ġdida) are used. Model-specific instructions live in the Codex repo and are bundled into the CLI (e.g., gpt-5.2-codex_prompt.md(jinfetaħ f’tieqa ġdida)).
The tools field is a list of tool definitions that conform to a schema defined by the Responses API. For Codex, this includes tools that are provided by the Codex CLI, tools that are provided by the Responses API that should be made available to Codex, as well as tools provided by the user, usually via MCP servers:
Finally, the input field of the JSON payload is a list of items. Codex inserts the following items(jinfetaħ f’tieqa ġdida) into the input before adding the user message:
1. A message with role=developer that describes the sandbox that applies only to the Codex-provided shell tool defined in the tools section. That is, other tools, such as those provided from MCP servers, are not sandboxed by Codex and are responsible for enforcing their own guardrails.
The message is built from a template where the key pieces of content come from snippets of Markdown bundled into the Codex CLI, such as workspace_write.md(jinfetaħ f’tieqa ġdida) and on_request.md(jinfetaħ f’tieqa ġdida):
2. (Optional) A message with role=developer whose contents are the developer_instructions value read from the user’s config.toml file.
3. (Optional) A message with role=user whose contents are the “user instructions,” which are not sourced from a single file but are aggregated across multiple sources(jinfetaħ f’tieqa ġdida). In general, more specific instructions appear later:
- Contents of
AGENTS.override.mdandAGENTS.mdin$CODEX_HOME - Subject to a limit (32 KiB, by default), look in each folder from the Git/project root of the
cwd(if it it exists) up to thecwditself: add the contents of any ofAGENTS.override.md,AGENTS.md, or any filename specified byproject_doc_fallback_filenames in config.toml - If any skills(jinfetaħ f’tieqa ġdida) have been configured:
- a short preamble about skills
- the skill metadata(jinfetaħ f’tieqa ġdida) for each skill
- a section on how to use skills(jinfetaħ f’tieqa ġdida)
4. A message with role=user that describes the local environment in which the agent is currently operating. This specifies the current working directory and the user’s shell(jinfetaħ f’tieqa ġdida):
Ladarba Codex ikun għamel il-komputazzjoni kollha ta’ hawn fuq biex jinizzjalizza l-input, iżid il-messaġġ tal-utent biex tibda l-konverżazzjoni.
L-eżempji preċedenti ffukaw fuq il-kontenut ta’ kull messaġġ, iżda innota li kull element tal-input huwa oġġett JSON bi type, role(jinfetaħ f’tieqa ġdida), u content kif ġej:
Ladarba Codex jibni l-payload JSON sħiħ biex jibgħatu lill-Responses API, imbagħad jagħmel it-talba HTTP POST b’header Authorization skont kif il-punt ta' tmiem tal-Responses API ikun ikkonfigurat f’~/.codex/config.toml (jiżdiedu headers HTTP addizzjonali u query parameters jekk ikunu speċifikati).
Meta server tal-OpenAI Responses API jirċievi t-talba, juża l-JSON biex joħloq il-prompt għall-mudell kif ġej (żgur, implimentazzjoni personalizzata tal-Responses API tista’ tagħmel għażla differenti):
Kif tista’ tara, l-ordni tal-ewwel tliet oġġetti fil-prompt huwa determinat mis-server, mhux mill-klijent. Madankollu, minn dawk it-tliet oġġetti, il-kontenut biss tal-messaġġ system huwa wkoll ikkontrollat mis-server, peress li l-tools u l-instructions huma determinati mill-klijent. Dawn jiġu segwiti mill-input mill-payload JSON biex jitlesta l-prompt.
Issa li għandna l-prompt tagħna, lesti biex nieħdu kampjun mill-mudell.
Din it-talba HTTP lill-Responses API tibda l-ewwel “turn” ta’ konverżazzjoni f’Codex. Is-server iwieġeb bi stream ta’ Server-Sent Events (SSE(jinfetaħ f’tieqa ġdida)). Id-data ta’ kull event hija payload JSON b’"type" li jibda b’"response", li tista’ tkun xi ħaġa bħal din (lista sħiħa tal-events tinsab fid-dokumentazzjoni tal-API(jinfetaħ f’tieqa ġdida) tagħna):
Codex jikkonsma l-stream tal-events(jinfetaħ f’tieqa ġdida) u jerġa’ jippubblikahom bħala oġġetti interni ta’ event li klijent jista’ juża. Events bħal response.output_text.delta jintużaw biex jappoġġjaw l-istreaming fil-UI, filwaqt li events oħra bħal response.output_item.added jinbidlu f’oġġetti li jiżdiedu mal-input għal sejħiet sussegwenti tal-Responses API.
Ejja nassumu li l-ewwel talba lill-Responses API tinkludi żewġ events response.output_item.done: waħda b’type=reasoning u waħda b’type=function_call. Dawn l-events iridu jkunu rrappreżentati fil-qasam input tal-JSON meta nerġgħu nsaqsu lill-mudell bit-tweġiba għas-sejħa tal-għodda:
Il-prompt li jirriżulta użat biex jittieħed kampjun mill-mudell bħala parti mill-query sussegwenti jkun jidher hekk:
B’mod partikolari, innota kif il-prompt il-qadim huwa prefiss eżatt tal-prompt il-ġdid. Dan huwa intenzjonat, peress li jagħmel it-talbiet sussegwenti ferm aktar effiċjenti għax jippermettilna nieħdu vantaġġ mill-prompt caching (li se niddiskutu fit-taqsima li jmiss dwar il-prestazzjoni).
Meta nħarsu lura lejn l-ewwel dijagramma tagħna tal-loop tal-aġent, naraw li jista’ jkun hemm ħafna iterazzjonijiet bejn l-inferenza u s-sejħa tal-għodod. Il-prompt jista’ jkompli jikber sakemm finalment nirċievu assistant message, li jindika t-tmiem tat-turn:
F’Codex CLI, aħna nippreżentaw l-assistant message lill-utent u niffokaw il-composer biex nindikaw lill-utent li issa huwa t-“turn” tiegħu biex ikompli l-konverżazzjoni. Jekk l-utent iwieġeb, kemm l-assistant message mit-turn preċedenti, kif ukoll il-messaġġ il-ġdid tal-utent, iridu jiżdiedu mal-input fit-talba tal-Responses API biex jibda t-turn il-ġdid:
Għal darb’oħra, minħabba li qed inkomplu konverżazzjoni, it-tul tal-input li nibagħtu lill-Responses API jibqa’ jiżdied:
Ejja neżaminaw x’ifisser dan il-prompt li dejjem qed jikber għall-prestazzjoni.
Forsi qed tistaqsi lilek innifsek, “Stenna, mhux il-loop tal-aġent kwadratiku f’termini tal-ammont ta’ JSON mibgħut lill-Responses API tul il-konverżazzjoni?” U jkollok raġun. Filwaqt li l-Responses API jappoġġja parametru mhux obbligatorju previous_response_id(jinfetaħ f’tieqa ġdida) biex itaffi din il-problema, Codex ma jużahx illum, primarjament biex iżomm it-talbiet kompletament stateless u biex jappoġġja konfigurazzjonijiet ta’ l-ebda żamma tad-dejta (ZDR).
Li tevita previous_response_id tissimplifika l-affarijiet għall-fornitur tal-Responses API għax tiżgura li kull talba tkun stateless. Dan jagħmilha wkoll sempliċi biex jiġu appoġġjati klijenti li għażlu l-ebda żamma tad-dejta (ZDR)(jinfetaħ f’tieqa ġdida), peress li l-ħażna tad-dejta meħtieġa biex tappoġġja previous_response_id tmur kontra ZDR. Innota li l-klijenti ZDR ma jissagrifikawx il-kapaċità li jibbenefikaw minn messaġġi ta’ raġunament proprjetarji minn turns preċedenti, peress li l-encrypted_content assoċjat jista’ jiġi dekriptat fuq is-server. (OpenAI jippersisti ċ-ċavetta tad-dekriptaġġ ta’ klijent ZDR, iżda mhux id-dejta tiegħu.) Ara l-PRs #642(jinfetaħ f’tieqa ġdida) u #1641(jinfetaħ f’tieqa ġdida) għall-bidliet relatati f’Codex biex jappoġġja ZDR.
Ġeneralment, l-ispiża tal-kampjunar tal-mudell tiddomina l-ispiża tat-traffiku tan-network, u b’hekk il-kampjunar isir il-mira ewlenija tal-isforzi tagħna għall-effiċjenza. Huwa għalhekk li l-prompt caching huwa tant importanti, għax jippermettilna nerġgħu nużaw il-komputazzjoni minn sejħa ta’ inferenza preċedenti. Meta niksbu cache hits, il-kampjunar tal-mudell ikun lineari u mhux kwadratiku. Id-dokumentazzjoni tagħna dwar prompt caching (jinfetaħ f’tieqa ġdida) tispjega dan f’aktar dettall:
Cache hits huma possibbli biss għal matches ta’ prefiss eżatti fi prompt. Biex tirrealizza l-benefiċċji tal-caching, poġġi kontenut statiku bħal istruzzjonijiet u eżempji fil-bidu tal-prompt tiegħek, u poġġi kontenut varjabbli, bħal informazzjoni speċifika għall-utent, fit-tmiem. Dan japplika wkoll għal immaġnijiet u għodod, li jridu jkunu identiċi bejn it-talbiet.
B’dan f’moħħna, ejja nikkunsidraw x’tipi ta’ operazzjonijiet jistgħu jikkawżaw “cache miss” f’Codex:
- Nibdlu l-
toolsdisponibbli għall-mudell fin-nofs tal-konverżazzjoni. - Nibdlu l-
modelli huwa l-mira tat-talba tal-Responses API (fil-prattika, dan ibiddel it-tielet oġġett fil-prompt oriġinali, peress li fih istruzzjonijiet speċifiċi għall-mudell). - Nibdlu l-konfigurazzjoni tas-sandbox, il-modalità ta’ approvazzjoni, jew id-direttorju tax-xogħol kurrenti.
It-tim ta’ Codex irid ikun diliġenti meta jintroduċi karatteristiċi ġodda f’Codex CLI li jistgħu jikkompromettu l-prompt caching. Bħala eżempju, l-appoġġ inizjali tagħna għal għodod MCP introduċa bug fejn ma rnexxilniex nelenkaw l-għodod f’ordni konsistenti(jinfetaħ f’tieqa ġdida), u dan ikkawża cache misses. Innota li l-għodod MCP jistgħu jkunu partikolarment diffiċli għax is-servers MCP jistgħu jibdlu l-lista ta’ għodod li jipprovdu fuq il-fly permezz ta’ notifika notifications/tools/list_changed(jinfetaħ f’tieqa ġdida). Li tonora din in-notifika fin-nofs ta’ konverżazzjoni twila jista’ jikkawża cache miss għaljin.
Meta jkun possibbli, nimmaniġġjaw bidliet fil-konfigurazzjoni li jseħħu fin-nofs tal-konverżazzjoni billi nżidu messaġġ ġdid mal-input biex jirrifletti l-bidla minflok ma nimmodifikaw messaġġ preċedenti:
- Jekk il-konfigurazzjoni tas-sandbox jew il-modalità ta’ approvazzjoni tinbidel, ndaħħlu(jinfetaħ f’tieqa ġdida) messaġġ ġdid b’
role=developerbl-istess format tal-oġġett oriġinali<permissions instructions>. - Jekk jinbidel id-direttorju tax-xogħol kurrenti, ndaħħlu(jinfetaħ f’tieqa ġdida) messaġġ ġdid b’
role=userbl-istess format tal-<environment_context>oriġinali.
Nagħmlu ħafna sforz biex niżguraw cache hits għall-prestazzjoni. Hemm riżorsa ewlenija oħra li rridu nimmaniġġjaw: it-tieqa tal-kuntest.
L-istrateġija ġenerali tagħna biex nevitaw li jispiċċa l-kuntest hija li nikkompattaw il-konverżazzjoni ladarba n-numru ta’ tokens jaqbeż ċertu limitu. B’mod speċifiku, nissostitwixxu l-input b’lista ġdida u iżgħar ta’ oġġetti li tirrappreżenta l-konverżazzjoni, u tippermetti lill-aġent ikompli b’fehim ta’ x’ġara s’issa. Implimentazzjoni bikrija ta’ compaction(jinfetaħ f’tieqa ġdida) kienet teħtieġ li l-utent isejjaħ manwalment il-kmand /compact, li kien jagħmel query lill-Responses API bl-użu tal-konverżazzjoni eżistenti flimkien ma’ istruzzjonijiet personalizzati għal sommarizzazzjoni(jinfetaħ f’tieqa ġdida). Codex kien juża l-assistant message li kien jirriżulta u li kien fih is-sommarju bħala l-input(jinfetaħ f’tieqa ġdida) il-ġdid għal turns ta’ konverżazzjoni sussegwenti.
Minn dak iż-żmien ’l hawn, il-Responses API evolva biex jappoġġja punt ta' tmiem speċjali /responses/compact endpoint(jinfetaħ f’tieqa ġdida) li jwettaq il-compaction b’mod aktar effiċjenti. Dan jirritorna lista ta’ oġġetti(jinfetaħ f’tieqa ġdida) li tista’ tintuża minflok l-input preċedenti biex tkompli l-konverżazzjoni filwaqt li tillibera t-tieqa tal-kuntest. Din il-lista tinkludi oġġett speċjali type=compaction b’oġġett opak encrypted_content li jżomm il-fehim latenti tal-mudell tal-konverżazzjoni oriġinali. Issa, Codex juża dan il-punt ta' tmiem awtomatikament biex jikkompatta l-konverżazzjoni meta jinqabeż il-auto_compact_limit(jinfetaħ f’tieqa ġdida).
Introduċejna l-loop tal-aġent Codex u mxejna miegħek kif Codex jibni u jimmaniġġja l-kuntest tiegħu meta jagħmel query lil mudell. Tul it-triq, enfasizzajna kunsiderazzjonijiet prattiċi u l-aħjar prattiki li japplikaw għal kull min qed jibni loop ta’ aġent fuq il-Responses API.
Għalkemm il-loop tal-aġent jipprovdi l-pedament għal Codex, dan huwa biss il-bidu. Fil-posts li ġejjin, se nidħlu fl-arkitettura tal-CLI, nesploraw kif jiġi implimentat l-użu tal-għodod, u nagħtu ħarsa aktar mill-qrib lejn il-mudell ta’ sandboxing ta’ Codex.


