How the loop works inside the Codex CLI, what production teams at Datadog and OpenAI actually use it for, and how the same architecture transfers to media-generation agents on hiapi.

Codex shipped Sora's Android app in 28 days with four engineers. At Datadog, more than a thousand engineers now rely on it to catch the kind of system-level bugs that human reviewers used to miss. Both stories trace back to the same piece of architecture: a deceptively simple control loop that wraps a language model, a set of tools, and a sandboxed environment.
If you are building an AI coding assistant, a media-generation pipeline, or any agent that has to act on the world, the Codex agent loop is the cleanest reference you can study right now. This is a walk-through of how it works, what production teams actually do with it, and how the same architectural pattern transfers to media-generation agents you can build on top of hiapi.
This article is based on OpenAI's public engineering writeups: "Unrolling the Codex Agent Loop" (January 2026), the Datadog case study, and the Sora Android shipping retrospective. Hiapi platform prices quoted below are as of 2026-06.
The agent loop is the control flow that sits between you, a language model, and the tools the model is allowed to invoke. From the user's perspective, one "turn" looks like a chat message in and a chat message out. Underneath, a single turn can include many round trips between the model and external tools.
The shape of one turn:
AGENTS.md file), the conversation history, and the user's new message.The crucial property is that "output" is not just the assistant message you see in the terminal. Most of the real output is the side effects of the tool calls: files written, tests run, packages installed. The assistant message is the termination signal.
This is why agent loops are not just chat with extra steps. Chat has one round trip per turn. An agent loop can do dozens of inference → tool → inference cycles inside a single turn, and the user only sees the wrap-up message.
A naive implementation of the loop above works, but it gets slow and expensive within a few turns. Codex's design fixes this with three ideas worth stealing.
1. Prefix caching on the Responses API. Every time the agent talks to the model, the prompt starts with the same long prefix: system prompt, then AGENTS.md, then the early conversation. The first request computes and caches the model's internal key-value state for that prefix. Subsequent requests in the same session skip the recomputation for any shared prefix.
A practical consequence: do not edit AGENTS.md mid-session. The system prompt is usually the largest single cached block, and changing it invalidates the cache for the rest of the session. Codex's own engineers call this out as the most common self-inflicted slowdown.
2. Tool-call results live inside the loop, not on top of it. Each tool output gets appended to the same conversation. This sounds obvious, but it means the model has full memory of every command it ran, every error it saw, and every file it inspected — without you having to summarize anything by hand. The trade-off is a context window that grows fast; Codex handles this with automatic compaction once the conversation gets too long.
3. Termination is detected from the model's output shape, not by counting turns. The agent does not have a max-iteration safety net. It loops as long as the model keeps emitting tool calls, and stops the instant the model emits a normal assistant message. That decision lives with the model, which keeps the orchestration code small.

The most useful thing about the Datadog rollout is that it tells you what the agent loop is good at in production, and what it is not.
Datadog's AI Development Experience team wanted a code-review agent that could spot the kind of bugs human reviewers miss — not lint, not style, but cross-service regressions, performance cliffs, and changes that would re-create real incidents. Most prior tools, they note, behaved like "advanced linters." They flagged shallow patterns inside a single diff. They had no model of the system around it.
To validate the agent before rolling it out, the team built an incident replay harness: they reconstructed the pull requests that had contributed to past incidents, fed each one to Codex as if it were arriving fresh in review, and then asked the on-call engineers whether Codex's comments would have changed the outcome.
The published result: Codex surfaced useful new feedback on roughly 22% of historical incidents — more than any other tool the team evaluated. Because every one of those pull requests had already passed human review, the test was essentially measuring "things reviewers missed at the time."
Today more than 1,000 Datadog engineers use it on every PR. Engineers react to comments with 👍/👎 and either amend the code or skip the suggestion with a rationale. The interesting design choice is that Datadog measures impact through the back channel — engineers posting in Slack about useful catches — rather than a formal in-tool metric. The team treats the agent as augmentation, not gatekeeping; it surfaces signal, humans still make the call.
The pattern to take from this: the agent loop is at its best when it can reason across the whole system (deps, tests, related services), not just the diff in front of it. That requires giving it the right tools and the right read access. The agent loop is the chassis; the tool surface and the project context are what determine whether it surfaces real risks.

The Sora Android story is the more aggressive proof point. Between October 8 and November 5 of 2025, a team of four engineers used Codex to take Sora from "iOS-only" to a public Android launch — a 28-day cycle, with an internal employee build at day 18 and a public release ten days later. The app debuted at #1 on the Google Play Store; users created more than a million videos in the first 24 hours. Crash-free rate at launch: 99.9%.
The team has been candid about what worked, and — more usefully — what did not.
The thing that did not work was the obvious prompt: "Build the Sora Android app based on the iOS code. Go." They tried it, and they aborted it quickly. What came out was technically functional but the product experience was sub-par. Codex, in their words, "isn't yet great at inferring what it hasn't been told" and struggles with "deep architectural judgment" when left unguided.
The thing that did work was treating Codex as a semantic translator between platforms, with strong scaffolding around it:
AGENTS.md files to encode team conventions so Codex's output stayed consistent.The reported share of code generated by Codex was close to 85%, on a roughly 5-billion-token budget over the four weeks.
There are two lessons here that are easy to underestimate. First, the agent's quality is bounded by how well you scaffold it: exemplars, AGENTS.md, clear scopes. Second, "small team plus capable agent" beats "scale up engineering headcount" for greenfield work — the Fred Brooks observation finally has a counter-example in the right hands.

Strip away the specifics of code review and Android shipping and you are left with four patterns that any team building agents should copy.
Establish the patterns before you scale. Whether it is a navigation pattern (Sora) or what counts as a "risky change" (Datadog), the human work happens up front. Exemplar features and a stable AGENTS.md are how you transmit those patterns into every subsequent turn — and they happen to be exactly what the prefix cache rewards.
Give the agent the right tool surface, not the biggest one. Datadog's reviewer needs to read across services. Sora's translator needs to read the iOS source and the backend. Both teams curated the read surface carefully. A bigger tool list is not a better tool list.
Keep the human on the architecture, the agent on the implementation. Both case studies converge on this split. Humans set the structural decisions; the agent fills in inside well-bounded scopes. Inverting that split is where teams burn time.
Measure on outcomes, not on tool metrics. Datadog deliberately watches Slack reactions and incident outcomes rather than the agent's own emit rate. That keeps the optimization target honest. A code review tool that comments a lot but does not change incident behavior is not a useful tool.
The agent-loop architecture is not specific to code. Anywhere you have a model that needs to act, observe, and iterate, the same control flow works.
The hiapi platform exposes media-generation models through the same /v1/chat/completions endpoint shape that the Codex CLI uses for inference, which makes the substitution relatively mechanical. Concretely, you can build a video-production agent whose loop looks like this:
seedance-2-0 through /v1/chat/completions (priced at $0.15 per generation at base resolution, with 720P at a 2.2× multiplier on the hiapi platform as of 2026-06).The same shape works for an image-generation agent: brief → text-to-image call (for example, gpt-image-2 at $0.03 per image at 1K, or nano-banana-2 at $0.085 at 1K) → vision check → revise → retry. The Codex pattern of "prefix-cache a stable system prompt and let tool-call outputs accumulate in the conversation" applies directly: keep the role definition and brand guidelines in the system prompt, and let the loop accumulate each generation attempt and critique as turns in the same thread.
A few practical translations of the Codex lessons to a media agent:
AGENTS.md becomes a brand-guideline file. Stable, loaded once, never edited mid-session. That is what the prefix cache rewards.If you want to try the underlying models behind such a loop without writing the orchestration yet, the hiapi pricing page lists current per-image and per-second video rates, and individual model detail pages have a Playground you can run a single prompt through before you decide what to wire into your agent.
What is the Codex agent loop?
It is the control flow inside the Codex CLI that orchestrates the back-and-forth between you, the model, and the tools the model is allowed to call. Each "turn" can include many model-inference → tool-call cycles internally, and it ends when the model emits a normal assistant message instead of another tool call.
Why does prefix caching matter so much for an agent?
Because every turn re-sends a long prefix — system prompt, project instructions, conversation history. Without caching, the model would re-encode all of that on every request, which is slow and expensive. With caching, the shared prefix is computed once and reused across turns in the same session. The practical implication is that you do not edit your project instructions (e.g., AGENTS.md) in the middle of a session, because that invalidates the cache for the rest of it.
How many engineers does it take to ship something like Sora for Android?
OpenAI's published number is four engineers over 28 days, with Codex reportedly generating close to 85% of the codebase and consuming roughly 5 billion tokens across the project. The team explicitly chose to stay small — referencing Brooks's "adding people to a late project makes it later" — and to lean on the agent for implementation.
What did Codex actually catch in the Datadog evaluation?
In the team's "incident replay" methodology, they re-ran historical incident-causing PRs through Codex and asked the on-call engineers whether the feedback would have changed the outcome. Codex provided actionable feedback on roughly 22% of the incidents — issues like cross-service regressions and performance changes that human reviewers had missed when the PRs originally landed.
Can I use the same agent-loop pattern for image or video generation?
Yes. The architectural pattern — prepare prompt, call model, execute tool, append result, loop until termination — is model-agnostic. You can wire it around hiapi's media-generation endpoints (image models like gpt-image-2 or nano-banana-2, video models like seedance-2-0) and follow the same lessons: stable system prompt for prefix caching, exemplars for quality, sandboxed spend caps for safety, critique step for termination.
What should I put in AGENTS.md?
For Codex specifically: project conventions, build/test/lint commands, code style rules, and any "do not touch" boundaries. The published Sora and Datadog teams both treated it as the agent's onboarding document — the things you would tell a senior engineer on day one. Keep it stable during a session so it stays in the prefix cache.
AGENTS.md, not by prompting "go build it."/v1/chat/completions shape, and the Codex playbook for prefix caching, sandboxing, and exemplars maps one-to-one onto image and video agents.