Inside Codex's Agent Loop: Lessons from Datadog and Sora's 28-Day Android Launch

How the loop works inside the Codex CLI, what production teams at Datadog and OpenAI actually use it for, and how the same architecture transfers to media-generation agents on hiapi.

hiapi12

Inside Codex's Agent Loop: Lessons from Datadog and Sora's 28-Day Android Launch

28 daysSora Android shipped in

4Engineers on the Sora Android team

~22%Datadog incidents Codex caught in replay

Codex shipped Sora's Android app in 28 days with four engineers. At Datadog, more than a thousand engineers now rely on it to catch the kind of system-level bugs that human reviewers used to miss. Both stories trace back to the same piece of architecture: a deceptively simple control loop that wraps a language model, a set of tools, and a sandboxed environment.

If you are building an AI coding assistant, a media-generation pipeline, or any agent that has to act on the world, the Codex agent loop is the cleanest reference you can study right now. This is a walk-through of how it works, what production teams actually do with it, and how the same architectural pattern transfers to media-generation agents you can build on top of hiapi.

This article is based on OpenAI's public engineering writeups: "Unrolling the Codex Agent Loop" (January 2026), the Datadog case study, and the Sora Android shipping retrospective. Hiapi platform prices quoted below are as of 2026-06.

What the agent loop actually is

The agent loop is the control flow that sits between you, a language model, and the tools the model is allowed to invoke. From the user's perspective, one "turn" looks like a chat message in and a chat message out. Underneath, a single turn can include many round trips between the model and external tools.

The shape of one turn:

Prepare the prompt. The agent assembles a system prompt, project-level instructions (in Codex's case, an AGENTS.md file), the conversation history, and the user's new message.
Inference. The model returns either a final assistant message or a tool call.
If it's a tool call, the agent executes the call in a sandbox, captures the result, appends it to the conversation, and goes back to step 2.
If it's a final assistant message — something like "I added the architecture.md you asked for" — the turn terminates and control returns to the user.

The crucial property is that "output" is not just the assistant message you see in the terminal. Most of the real output is the side effects of the tool calls: files written, tests run, packages installed. The assistant message is the termination signal.

This is why agent loops are not just chat with extra steps. Chat has one round trip per turn. An agent loop can do dozens of inference → tool → inference cycles inside a single turn, and the user only sees the wrap-up message.

Three properties that make the loop fast

A naive implementation of the loop above works, but it gets slow and expensive within a few turns. Codex's design fixes this with three ideas worth stealing.

1. Prefix caching on the Responses API. Every time the agent talks to the model, the prompt starts with the same long prefix: system prompt, then AGENTS.md, then the early conversation. The first request computes and caches the model's internal key-value state for that prefix. Subsequent requests in the same session skip the recomputation for any shared prefix.

A practical consequence: do not edit AGENTS.md mid-session. The system prompt is usually the largest single cached block, and changing it invalidates the cache for the rest of the session. Codex's own engineers call this out as the most common self-inflicted slowdown.

2. Tool-call results live inside the loop, not on top of it. Each tool output gets appended to the same conversation. This sounds obvious, but it means the model has full memory of every command it ran, every error it saw, and every file it inspected — without you having to summarize anything by hand. The trade-off is a context window that grows fast; Codex handles this with automatic compaction once the conversation gets too long.

3. Termination is detected from the model's output shape, not by counting turns. The agent does not have a max-iteration safety net. It loops as long as the model keeps emitting tool calls, and stops the instant the model emits a normal assistant message. That decision lives with the model, which keeps the orchestration code small.

Case study one: Datadog uses the loop for system-level code review

The most useful thing about the Datadog rollout is that it tells you what the agent loop is good at in production, and what it is not.

Datadog's AI Development Experience team wanted a code-review agent that could spot the kind of bugs human reviewers miss — not lint, not style, but cross-service regressions, performance cliffs, and changes that would re-create real incidents. Most prior tools, they note, behaved like "advanced linters." They flagged shallow patterns inside a single diff. They had no model of the system around it.

To validate the agent before rolling it out, the team built an incident replay harness: they reconstructed the pull requests that had contributed to past incidents, fed each one to Codex as if it were arriving fresh in review, and then asked the on-call engineers whether Codex's comments would have changed the outcome.

The published result: Codex surfaced useful new feedback on roughly 22% of historical incidents — more than any other tool the team evaluated. Because every one of those pull requests had already passed human review, the test was essentially measuring "things reviewers missed at the time."

Today more than 1,000 Datadog engineers use it on every PR. Engineers react to comments with 👍/👎 and either amend the code or skip the suggestion with a rationale. The interesting design choice is that Datadog measures impact through the back channel — engineers posting in Slack about useful catches — rather than a formal in-tool metric. The team treats the agent as augmentation, not gatekeeping; it surfaces signal, humans still make the call.

The pattern to take from this: the agent loop is at its best when it can reason across the whole system (deps, tests, related services), not just the diff in front of it. That requires giving it the right tools and the right read access. The agent loop is the chassis; the tool surface and the project context are what determine whether it surfaces real risks.

Case study two: shipping Sora for Android in 28 days

The Sora Android story is the more aggressive proof point. Between October 8 and November 5 of 2025, a team of four engineers used Codex to take Sora from "iOS-only" to a public Android launch — a 28-day cycle, with an internal employee build at day 18 and a public release ten days later. The app debuted at #1 on the Google Play Store; users created more than a million videos in the first 24 hours. Crash-free rate at launch: 99.9%.

The team has been candid about what worked, and — more usefully — what did not.

The thing that did not work was the obvious prompt: "Build the Sora Android app based on the iOS code. Go." They tried it, and they aborted it quickly. What came out was technically functional but the product experience was sub-par. Codex, in their words, "isn't yet great at inferring what it hasn't been told" and struggles with "deep architectural judgment" when left unguided.

The thing that did work was treating Codex as a semantic translator between platforms, with strong scaffolding around it:

They established the project's architectural patterns first — navigation, dependency injection, networking — and wrote a small number of exemplar features by hand. These set the standard for the rest of the codebase.
They used AGENTS.md files to encode team conventions so Codex's output stayed consistent.
They frequently pointed Codex at the iOS Swift code and at the existing backend, and asked it to produce semantically equivalent Kotlin. Their summary line: "the future of cross-platform is just Codex."
They treated Codex like "a newly hired senior engineer" — humans owned architecture, system design, and user experience; Codex did the heavy lifting inside well-bounded scopes.

The reported share of code generated by Codex was close to 85%, on a roughly 5-billion-token budget over the four weeks.

There are two lessons here that are easy to underestimate. First, the agent's quality is bounded by how well you scaffold it: exemplars, AGENTS.md, clear scopes. Second, "small team plus capable agent" beats "scale up engineering headcount" for greenfield work — the Fred Brooks observation finally has a counter-example in the right hands.

The patterns that transfer

Strip away the specifics of code review and Android shipping and you are left with four patterns that any team building agents should copy.

Establish the patterns before you scale. Whether it is a navigation pattern (Sora) or what counts as a "risky change" (Datadog), the human work happens up front. Exemplar features and a stable AGENTS.md are how you transmit those patterns into every subsequent turn — and they happen to be exactly what the prefix cache rewards.

Give the agent the right tool surface, not the biggest one. Datadog's reviewer needs to read across services. Sora's translator needs to read the iOS source and the backend. Both teams curated the read surface carefully. A bigger tool list is not a better tool list.

Keep the human on the architecture, the agent on the implementation. Both case studies converge on this split. Humans set the structural decisions; the agent fills in inside well-bounded scopes. Inverting that split is where teams burn time.

Measure on outcomes, not on tool metrics. Datadog deliberately watches Slack reactions and incident outcomes rather than the agent's own emit rate. That keeps the optimization target honest. A code review tool that comments a lot but does not change incident behavior is not a useful tool.

Applying the loop to media-generation agents

The agent-loop architecture is not specific to code. Anywhere you have a model that needs to act, observe, and iterate, the same control flow works.

The hiapi platform exposes every media-generation model through one unified async endpoint — POST /v1/tasks creates a generation task, GET /v1/tasks/:id collects the result — so the "act" step of the loop is a two-call function you write once and reuse across all models. Concretely, you can build a video-production agent whose loop looks like this:

The agent receives a brief: "create a 6-second product shot of a watch on marble."
It assembles a prompt and creates a seedance-2-0 task via POST /v1/tasks, then polls the task ID until it lands (priced at $0.15 per second of video at 480P, with 720P at a 2.2× multiplier on the hiapi platform as of 2026-06 — the 6-second clip above runs $0.90 at 480P).
It downloads the resulting video, runs a quick critique step — "is the watch face actually visible? is the lighting consistent with the brief?" — and either accepts the output or revises the prompt.
If the critique fails N times in a row, the agent escalates to the user with the best-so-far and a description of what is going wrong.

The same shape works for an image-generation agent: brief → text-to-image call (for example, gpt-image-2 at $0.03 per image at 1K, or Nano-Banana-2 at $0.085 at 1K) → vision check → revise → retry. The Codex pattern of "prefix-cache a stable system prompt and let tool-call outputs accumulate in the conversation" applies directly: keep the role definition and brand guidelines in the system prompt, and let the loop accumulate each generation attempt and critique as turns in the same thread.

A few practical translations of the Codex lessons to a media agent:

AGENTS.md becomes a brand-guideline file. Stable, loaded once, never edited mid-session. That is what the prefix cache rewards.
Exemplar features become reference outputs. Hand-curated "what good looks like" images that the critique step compares against.
Sandboxing becomes rate and spend caps. A media agent that has not learned when to stop can rack up a lot of generations. Set per-turn spend ceilings the same way Codex restricts file-system writes by default.
Termination signal is identical. The loop ends when the critique step returns "accept" instead of another generation request.

If you want to try the underlying models behind such a loop without writing the orchestration yet, the hiapi pricing page lists current per-image and per-second video rates, and individual model detail pages have a Playground you can run a single prompt through before you decide what to wire into your agent.

FAQ

What is the Codex agent loop?

It is the control flow inside the Codex CLI that orchestrates the back-and-forth between you, the model, and the tools the model is allowed to call. Each "turn" can include many model-inference → tool-call cycles internally, and it ends when the model emits a normal assistant message instead of another tool call.

Why does prefix caching matter so much for an agent?

Because every turn re-sends a long prefix — system prompt, project instructions, conversation history. Without caching, the model would re-encode all of that on every request, which is slow and expensive. With caching, the shared prefix is computed once and reused across turns in the same session. The practical implication is that you do not edit your project instructions (e.g., AGENTS.md) in the middle of a session, because that invalidates the cache for the rest of it.

How many engineers does it take to ship something like Sora for Android?

OpenAI's published number is four engineers over 28 days, with Codex reportedly generating close to 85% of the codebase and consuming roughly 5 billion tokens across the project. The team explicitly chose to stay small — referencing Brooks's "adding people to a late project makes it later" — and to lean on the agent for implementation.

What did Codex actually catch in the Datadog evaluation?

In the team's "incident replay" methodology, they re-ran historical incident-causing PRs through Codex and asked the on-call engineers whether the feedback would have changed the outcome. Codex provided actionable feedback on roughly 22% of the incidents — issues like cross-service regressions and performance changes that human reviewers had missed when the PRs originally landed.

Can I use the same agent-loop pattern for image or video generation?

Yes. The architectural pattern — prepare prompt, call model, execute tool, append result, loop until termination — is model-agnostic. You can wire it around hiapi's media-generation endpoints (image models like gpt-image-2 or nano-banana-2, video models like seedance-2-0) and follow the same lessons: stable system prompt for prefix caching, exemplars for quality, sandboxed spend caps for safety, critique step for termination.

What should I put in AGENTS.md?

For Codex specifically: project conventions, build/test/lint commands, code style rules, and any "do not touch" boundaries. The published Sora and Datadog teams both treated it as the agent's onboarding document — the things you would tell a senior engineer on day one. Keep it stable during a session so it stays in the prefix cache.

Key takeaways

The Codex agent loop is a simple control flow: prepare prompt → model inference → optional tool call → append result → repeat until the model emits a normal assistant message.
Three implementation details make it production-grade: prefix caching, tool-call results that live inside the conversation, and termination detection driven by the model's output shape.
Datadog uses this loop for system-level code review across 1,000+ engineers; its incident-replay harness showed Codex catching ~22% of historical incident-related issues that human reviewers had missed.
OpenAI used Codex to ship the Sora Android app in 28 days with four engineers, ~85% Codex-generated code, and a 99.9% crash-free rate at launch — by scaffolding heavily with exemplars and AGENTS.md, not by prompting "go build it."
The same agent-loop architecture transfers cleanly to media-generation pipelines you can build on hiapi: every text-to-image and text-to-video model speaks the same /v1/tasks async shape, and the Codex playbook for prefix caching, sandboxing, and exemplars maps one-to-one onto image and video agents.