The Agentic Stack: How a State-of-the-Art AI Agent Actually Works
An agent is a loop. We read roughly 1,800 recent papers and turned them into one builder's map: the loop every agent runs, and for each part of it the state-of-the-art approach, how to actually build it, and the trap to avoid. Opinionated, and fully sourced.
An agent is a loop, not a model with a clever prompt. Almost all of its real-world capability comes from the engineering around that loop, not from the weights.
We read roughly 1,800 recent papers on how to build agents and turned them into one builder's map. Not a reading list: a working model of how an agent actually runs, and for each moving part, the approach that is state of the art today, how to build it, and the mistake to avoid. Every recommendation links to the work behind it. The harness is the product (AI Harness Engineering, WildClawBench).
An agent is a loop, not a model call. The model repeatedly assembles context, decides one next action, executes it against a real substrate, observes the result, and updates its state until it judges the task done. Everything else in agent engineering is making that loop reliable, cheap, inspectable, and safe. Read the rest of this guide as a walk around that loop: first what happens inside it, then what it takes to run many loops and to keep them improving, and finally the four planes that make a loop you can trust.
The engine: deciding the next step
Control Loop & Planning
The decision policy that sequences observe-think-act over a horizon, beyond linear ReAct.
Run a ReAct loop with a cheap learned world model that predicts each candidate action's natural-language state diff before committing, and use it to gate irreversible actions. It buys most of the safety of MCTS tree search at one-step lookahead cost. Reach for full MCTS+DPO (Agent Q) only when you can afford training and live rollouts. (See the paper.)
- Represent environment state to the planner as a natural-language diff (what changed) rather than the full page/observation.
- Before any irreversible action (payment, delete, send), run a one-step lookahead predicting the resulting state diff and require it to match intent.
- For tasks where sub-results must merge rather than chain, model reasoning as a graph (Graph-of-Thoughts) with an explicit aggregation step instead of forcing a linear chain.
- Audit your action set for what cannot be undone and route exactly those through the lookahead/approval gate.
Reaching for MCTS or asynchronous interruptible planning first. Even SOTA models collapse (47%→11%) on overlapping/interruptible tasks, so get a gated synchronous loop solid before adding concurrency.
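The lookahead gate for irreversible actions can be sketched in a few lines. This is a minimal illustration, not the method from any cited paper: the `IRREVERSIBLE` set, the `Action` type, and `toy_world_model` are hypothetical stand-ins, and keyword matching is a deliberately crude proxy for "the predicted diff matches intent".

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action names; audit your own action set for what cannot be undone.
IRREVERSIBLE = {"pay", "delete", "send"}

@dataclass
class Action:
    name: str
    args: dict

def gate(action: Action,
         intent_keywords: set[str],
         predict_diff: Callable[[Action], str]) -> bool:
    """Return True if the action may run.

    Reversible actions pass straight through. Irreversible ones get a
    one-step lookahead: predict the state diff in natural language and
    require every intent keyword to appear in it.
    """
    if action.name not in IRREVERSIBLE:
        return True
    predicted = predict_diff(action).lower()
    return all(k.lower() in predicted for k in intent_keywords)

def toy_world_model(action: Action) -> str:
    """Stand-in for the learned world model (an assumption, not a real API)."""
    if action.name == "delete":
        return f"file {action.args['path']} removed from workspace"
    return "no change"
```

With this toy model, deleting `tmp.log` when the intent keywords are `{"tmp.log", "removed"}` passes the gate, while deleting `prod.db` under the same intent is blocked. In a real system the comparison would more plausibly be a second model call or a structured diff check.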
How the agent touches the world
Action Substrate
The execution surface (shell, filesystem, code interpreter, or custom tool API) through which the model actually changes things.
Give the model a stateful Python interpreter as its unified action space (CodeAct): one action can call tools, branch, loop, and recover, and the interpreter's stdout/exception is the observation. This is strictly more expressive than JSON tool-call envelopes and far cheaper than orchestrating composition externally. (See the paper.)
- Stand up a sandboxed container with a persistent Python interpreter; expose it as the agent's primary action and feed stdout/traceback straight back as the next observation.
- Design the observation surface as a product, not a shell passthrough: paginate file views in fixed chunks, hard-cap output length, and return terse structured feedback instead of raw stdout.
- Make edits guarded: validate syntax before commit and auto-rollback on failure, so a bad edit never corrupts state.
- Prune any tool the agent reliably misuses; a smaller, well-described command set beats a large one.
Reusing human GUIs or dumping whole files into context. The substrate sets the ceiling on everything above it, and a default shell passthrough caps you low.
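Two of the substrate habits above, paginated file views and guarded edits, fit in a short sketch. The names `view`, `guarded_edit`, and `PAGE_LINES` are ours, chosen for illustration. The guard here only checks that Python source parses, so "rollback" reduces to never writing the bad file. A fuller version would also run linters or tests before committing.

```python
import ast
from pathlib import Path

PAGE_LINES = 100  # assumed chunk size; tune for your model

def view(path: Path, page: int = 0) -> str:
    """Return one fixed-size page of a file with a terse header."""
    lines = path.read_text().splitlines()
    total = max(1, -(-len(lines) // PAGE_LINES))  # ceiling division
    start = page * PAGE_LINES
    body = "\n".join(lines[start:start + PAGE_LINES])
    return f"[{path.name} page {page + 1}/{total}]\n{body}"

def guarded_edit(path: Path, new_source: str) -> str:
    """Write new_source only if it parses as Python; otherwise keep the old file."""
    try:
        ast.parse(new_source)
    except SyntaxError as err:
        # Terse structured feedback instead of a raw traceback.
        return f"REJECTED line {err.lineno}: {err.msg}"
    path.write_text(new_source)
    return "OK"
```

The agent sees either `OK` or a one-line rejection, and the file on disk is never left in a state that fails to parse.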
Tool Interface & Discovery
How an agent selects, parameterizes, and correctly calls the right tool when catalogs reach thousands.
At scale, use hierarchical category→tool→endpoint retrieval with a self-reflective 're-retrieve if unsolvable' loop (AnyTool) to keep only a handful of candidate tools in context per step. Flooding the window with thousands of tool specs degrades function-calling accuracy via distractors. (See the paper.)
- Build a staged retriever: resolve a category, then a tool, then an endpoint, so the LLM only ever sees a small candidate set.
- Author every tool description against a six-field rubric (purpose, input schema, output schema, side effects, preconditions, examples), and fix missing-purpose first (the most common defect).
- Treat tool docs as testable artifacts: fuzz them (ToolFuzz-style) to surface underspecification before deployment, and expose only the ~20% of any API surface the agent actually needs.
- Add a self-reflection branch: on a failed/unsolvable call, re-retrieve rather than one-shot committing to the first tool.
Augmenting every tool description to fix 'smells'. Blanket augmentation adds ~67% more execution steps; augment only the smells that correlate with actual task failures.
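The category→tool→endpoint descent with a re-retrieve branch can be sketched as follows. This is not AnyTool's implementation: `CATALOG`, the substring-based `score`, and the `try_call` hook are hypothetical placeholders for a real catalog, a real retriever, and a real executor.

```python
from typing import Callable, Iterable, Optional

# Hypothetical catalog: category -> tool -> endpoints.
CATALOG = {
    "weather": {"openweather": ["current", "forecast"]},
    "finance": {"stocks": ["quote", "history"], "fx": ["convert"]},
}

def score(query: str, name: str) -> int:
    """Toy relevance: 1 if the name occurs in the query, else 0."""
    return int(name in query.lower())

def rank(query: str, names: Iterable[str]) -> list[str]:
    """Order names best-first; ties keep their original order."""
    return sorted(names, key=lambda n: -score(query, n))

def retrieve(query: str, excluded: set[str], k: int = 3) -> list[str]:
    """Descend category, then tool, then endpoint; return at most k candidates."""
    candidates: list[str] = []
    for cat in rank(query, CATALOG):
        for tool in rank(query, CATALOG[cat]):
            for endpoint in rank(query, CATALOG[cat][tool]):
                path = f"{cat}/{tool}/{endpoint}"
                if path not in excluded:
                    candidates.append(path)
                if len(candidates) == k:
                    return candidates
    return candidates

def solve(query: str,
          try_call: Callable[[str], Optional[str]],
          max_rounds: int = 3) -> Optional[str]:
    """Try the small candidate set; if none works, exclude it and re-retrieve."""
    excluded: set[str] = set()
    for _ in range(max_rounds):
        candidates = retrieve(query, excluded)
        if not candidates:
            break
        for cand in candidates:
            result = try_call(cand)
            if result is not None:
                return result
        excluded.update(candidates)
    return None
```

The point of the shape is that the model only ever sees `k` specs per round, and a failed round widens the search instead of committing to the first guess.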
What the loop carries forward
Memory & Recall
Persistent state beyond the context window: what's stored, how it's indexed, and how relevant history is reconstructed at query time.
Expose read/write/search tools over an external memory tier that the model pages in and out itself (MemGPT virtual-context paging), with a self-edited working-memory block mutated via tool calls. Memory is a state-management problem the agent should actively manage, not a passive top-k vector lookup. (See the paper.)
- Split storage into tiers (in-context working memory, searchable recall of chat history, and archival key-value/vector), and give the model explicit tools to read/write/search each.
- Store raw events verbatim and defer extraction/reasoning to retrieval time; never discard information at write time, because you don't yet know the query.
- Build a multi-stage retrieval funnel (cheap keyword recall → semantic re-rank → relational reasoning → reader) instead of a single similarity search; SQLite alone is competitive, no vector DB required.
- Add a scheduled reflection pass that synthesizes episodic memories into higher-level abstractions, and score retrieval by recency + importance + relevance, not similarity alone.
Trusting any memory layer on dependency/invalidation reasoning. Current systems score ~1-3% on cascade/absence updates; if a fact can be superseded, model edge invalidation explicitly or a long-context baseline will beat you.
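A small sketch shows how far plain SQLite goes for the first two funnel stages: verbatim writes, cheap keyword recall in SQL, then a re-rank by recency + importance + relevance. The table layout, the equal weighting of the three terms, and the one-hour `half_life` are assumptions for illustration, and term overlap stands in for a real semantic re-ranker.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, ts REAL, importance REAL, text TEXT)"
)

def remember(ts: float, importance: float, text: str) -> None:
    """Store the raw event verbatim; no extraction happens at write time."""
    db.execute(
        "INSERT INTO events (ts, importance, text) VALUES (?, ?, ?)",
        (ts, importance, text),
    )

def recall(query: str, now: float, k: int = 3, half_life: float = 3600.0) -> list[str]:
    """Stage 1: keyword recall in SQL. Stage 2: re-rank the survivors in Python."""
    terms = query.lower().split()
    if not terms:
        return []
    where = " OR ".join("lower(text) LIKE ?" for _ in terms)
    rows = db.execute(
        f"SELECT ts, importance, text FROM events WHERE {where}",
        [f"%{t}%" for t in terms],
    ).fetchall()

    def score(row: tuple) -> float:
        ts, importance, text = row
        recency = 0.5 ** ((now - ts) / half_life)  # exponential decay with age
        relevance = sum(t in text.lower() for t in terms) / len(terms)
        return recency + importance + relevance

    return [row[2] for row in sorted(rows, key=score, reverse=True)[:k]]
```

Because events are stored verbatim, the scoring function can change later without re-ingesting anything.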
Context Lifecycle & Compression
Active in-flight management of the context window: what to keep, compress, evict, or page during a long run.
Optimize natural-language compression guidelines for observations and history via a failure-analysis feedback loop (ACON), which yields a 26-54% token reduction with improved success and zero fine-tuning, then distill the compressor into a small model for latency. It beats fixed summarization and needs no RL. (See the paper.)
- Compress observations and trajectory history separately, each tuned against its own task-failure signal, not with one generic summarizer.
- Place the most critical retrieved chunks at the top or bottom of the prompt and treat the middle as lossy storage; reorder RAG hits to front-load the top result.
- Shrink context aggressively even after retrieval succeeds: input length alone degrades accuracy 14-85% with perfect retrieval, so pass the shortest evidence-preserving span.
- If you can train, make context delete/insert (MemAct) or trajectory folding (AgentFold) first-class agent actions to cut context ~51% while preserving task signal.
Assuming more context is free or that retrieval quality is the only knob. Length itself is a quality/latency dial, and naively concatenating k docs in score order buries evidence in the dead middle.
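Reordering retrieved chunks so the strongest sit at the edges of the prompt is a one-function change. The sketch below assumes `hits` arrives sorted best-first; the function name is ours.

```python
def reorder_for_edges(hits: list[str]) -> list[str]:
    """Take hits sorted best-first and place the strongest at the two ends.

    Even ranks fill the front in order and odd ranks fill the back in
    reverse, so the weakest hits land in the middle of the prompt.
    """
    front = hits[0::2]
    back = hits[1::2][::-1]
    return front + back
```

For five hits ranked `a` (best) through `e` (worst), this yields `a, c, e, d, b`: the top result leads, the runner-up closes, and the weakest sits in the middle where loss matters least.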
Running many loops at once
Multi-Agent Coordination & Interop
Orchestrating several agents and exposing them to each other: roles, handoffs, discovery, and shared-state correctness.
Encode a known human workflow as an explicit phase DAG with schema-constrained handoffs and downstream-validates-upstream checks (MetaGPT SOP-as-prompt), not free-form chat. It breaks the hallucination cascade. Pair it with per-action compensating rollbacks (SagaLLM) so a failed phase recovers instead of silently corrupting. (See the paper.)
- Replace open-ended agent chatter with an explicit phase DAG; give each role a domain-expert system prompt and require structured (schema-validated) outputs between phases.
- Register a compensating undo action alongside every forward action at spawn time, and checkpoint at phase boundaries so rollback scope is bounded.
- Put an independent validation agent at each phase rather than letting actors self-validate.
- If agents share mutable state, use dedicated shards and reconstruct read-sets from observed traffic (S-Bus); self-reported read sets over-claim by 32-49%, so don't trust them.
Scaling agent count to beat context limits. Agents exchange distributed state competently but fail to integrate it into correct answers (the Communication-Reasoning Gap), so naive fan-out adds cost without capability.
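Registering an undo next to every forward action is the core of the compensating-rollback idea. The sketch below is a generic saga runner, not SagaLLM's implementation: the `Step` tuple and the seat-booking functions are hypothetical examples.

```python
from typing import Callable

# (phase name, forward action, compensating undo)
Step = tuple[str, Callable[[dict], None], Callable[[dict], None]]

def run_saga(steps: list[Step], state: dict) -> list[str]:
    """Run phases in order; on failure, undo completed phases in reverse."""
    log: list[str] = []
    done: list[tuple[str, Callable[[dict], None]]] = []
    for name, forward, undo in steps:
        try:
            forward(state)
            done.append((name, undo))
            log.append(f"ok:{name}")
        except Exception as err:
            log.append(f"fail:{name}:{err}")
            for prev_name, prev_undo in reversed(done):
                prev_undo(state)
                log.append(f"undo:{prev_name}")
            break
    return log

# Hypothetical phases for illustration.
def reserve(state: dict) -> None:
    state["seats"] -= 1

def release(state: dict) -> None:
    state["seats"] += 1

def charge(state: dict) -> None:
    raise RuntimeError("card declined")

def refund(state: dict) -> None:
    pass
```

Running `reserve` then `charge` against `{"seats": 5}` leaves the seat count at 5 and produces a log that names the failed phase and each undo, which is the behavior you want in place of silent corruption.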
Runtime & Resource Management
The OS layer for agent fleets: scheduling, lifecycle, KV-cache reuse, and spend governance.
Place all dynamic content (tool results, session state, date) at the END of the system prompt, after every static instruction, and exclude tool results from cached blocks (Don't Break the Cache): a 41-80% cost and 13-31% TTFT reduction across OpenAI/Anthropic/Google with zero model changes. It's the cheapest reliable lever before you build a scheduler. (See the paper.)
- Restructure prompts so the static prefix is byte-stable and all volatile values sit at the tail; never embed dynamic values inside cached tool definitions.
- Treat the context window as a managed resource with a three-tier lifecycle (active / compacted / hibernated), and add a zombie reaper for stuck tool calls and orphaned subprocesses (AgentRM).
- Use rate-limit-aware scheduling that backs off before the provider does, and MLFQ-style admission control instead of a fixed concurrency cap.
- Measure per-provider cache hit rate separately, because block-placement strategies diverge across vendors.
Leaning on test-time scaling for accuracy. Returns diminish while latency variance and energy cost explode; govern cost at the runtime layer instead of paying for more samples.
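Keeping the static prefix byte-stable is easy to enforce with a fingerprint check. The sketch below is provider-agnostic and makes no claim about any vendor's caching API: `STATIC_INSTRUCTIONS`, `TOOL_DEFS`, and both function names are illustrative.

```python
import hashlib
import json

# Illustrative static content; nothing volatile may appear here.
STATIC_INSTRUCTIONS = "You are a careful coding agent. Follow the repo style guide."
TOOL_DEFS = [{"name": "run_python", "description": "Execute code in the sandbox."}]

def build_prompt(session_state: dict, today: str) -> tuple[str, str]:
    """Return (static_prefix, dynamic_tail); only the prefix is meant to be cached."""
    prefix = STATIC_INSTRUCTIONS + "\n" + json.dumps(TOOL_DEFS, sort_keys=True)
    tail = f"\nDate: {today}\nSession: {json.dumps(session_state, sort_keys=True)}"
    return prefix, tail

def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so a regression test can assert it never changes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

A unit test that builds two prompts with different dates and session state and asserts identical fingerprints will catch the day someone interpolates a timestamp into the instructions.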
Getting better after deployment
Self-Improvement & Skill Acquisition
Inducing reusable routines and skills from experience, externalizing capability into a skill/workflow library rather than weights.
Induce reusable task-level routines from past trajectories and selectively inject them at task start (Agent Workflow Memory). It accumulates a workflow library online from live interactions and degrades gracefully when history is sparse, with no fine-tuning. It's the highest-leverage, lowest-cost self-improvement primitive available today. (See the paper.)
- After runs, distill recurring multi-step action sequences into named workflows; retrieve and inject the relevant ones into context at the start of new tasks.
- Drive skill discovery from failure traces and gate acceptance on held-out validation (EvoSkill Pareto selection) so the library only grows with skills that actually help.
- Make skills executable program-functions that trigger on detected failure-prone states and inject corrective actions (HASP), rather than passive textual advice the model may ignore.
- Where a cheap correctness oracle exists, seed an evolutionary loop with a working implementation and let an evaluator gate LLM mutations (AlphaEvolve).
Assuming more skills always help. On real SWE tasks 39/49 induced skills gave +0%, and unconstrained self-evolution can degrade alignment; gate every skill on measured benefit and watch for misevolution.
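Gating a skill on measured benefit can start as a plain held-out comparison. Note what this sketch is not: it is a simple gain threshold, not EvoSkill's Pareto selection. The `run_task` hook and the `min_gain` default are assumptions.

```python
from typing import Callable

def accept_skill(run_task: Callable[[str, bool], bool],
                 held_out: list[str],
                 min_gain: float = 0.05) -> bool:
    """Admit a skill only if it lifts held-out success rate by at least min_gain.

    run_task(task, with_skill) is a hypothetical harness hook that
    returns True when the task succeeds.
    """
    if not held_out:
        return False  # no evidence, no admission
    base = sum(run_task(t, False) for t in held_out) / len(held_out)
    boosted = sum(run_task(t, True) for t in held_out) / len(held_out)
    return (boosted - base) >= min_gain
```

With noisy agents, each task should be run several times per arm before comparing rates, otherwise a single lucky run can admit a useless skill.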
Knowing what the loop did, and whether it was good
Verification & Observability
Turning raw trajectories into structured traces, machine-checkable verdicts, and root-cause attributions.
Instrument every step as a Message-Action Trace with declarative step- and trace-level contracts that yield machine-checkable verdicts plus deterministic replay (Trace-Based Assurance). It pinpoints the first failing step and doubles as a runtime action mediator. This is the buildable spine; LLM-as-judge is a complement, not a substitute. (See the paper.)
- Log every agent step as a structured event (step ID, action type, payload, outcome) so the whole run is replayable deterministically.
- Write declarative step- and trace-level contracts that produce PASS/FAIL verdicts and name the first violating step.
- Add fault-injection stubs at service/retrieval/memory boundaries to verify the agent contains degraded conditions.
- For subjective quality, run LLM-as-judge with A/B-swap to cancel position bias and calibrate against human labels (target >80% agreement); wrap code-correctness judging in MCTS for reliability.
Trusting an LLM to name which agent failed and when. Even SOTA reasoning models can't attribute reliably, and judges exhibit self-bias; ground attribution in the trace/dependency graph, not the model's opinion.
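Structured steps plus declarative contracts give you a verdict and a first failing step without asking a model anything. The sketch below covers step-level contracts only, with trace-level contracts omitted for brevity. The `Step` fields mirror the event shape listed above, and the two contracts are hypothetical examples.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    step_id: int
    action: str
    payload: dict
    outcome: str

Contract = Callable[[Step], bool]

# Hypothetical contracts: each returns True when the step is acceptable.
CONTRACTS: dict[str, Contract] = {
    "no_delete_without_confirm": lambda s: s.action != "delete"
    or s.payload.get("confirmed") is True,
    "outcome_recorded": lambda s: s.outcome in {"ok", "error"},
}

def check_trace(trace: list[Step],
                contracts: dict[str, Contract]) -> tuple[str, Optional[int], Optional[str]]:
    """Return (verdict, first failing step id, violated contract name)."""
    for step in trace:
        for name, holds in contracts.items():
            if not holds(step):
                return "FAIL", step.step_id, name
    return "PASS", None, None
```

The same `check_trace` call can run offline over logged traces or inline before each step executes, which is how a contract checker doubles as a runtime mediator.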
Evaluation & Reliability
Measuring quality on final environment state and trial-to-trial reliability, plus auditing the benchmarks themselves.
Score on final database/environment state with a simulated user and policy rules, and report pass^k across repeated trials (τ-bench). Capability and reliability are different axes, and pass^k separates real skill from a single lucky run. Trace-matching and pass@1 hide the reliability gap that dominates production. (See the paper.)
- Evaluate by diffing final environment/database state, not by matching the trajectory; allow multiple correct paths via milestone scoring (ToolSandbox).
- Report pass^k, not just pass@1: re-run identical tasks and measure how often the agent succeeds every time.
- Drive a simulated user with the policy document as input to test rule adherence under realistic back-and-forth.
- Before trusting any benchmark, audit it: diff model patches against the issue text for solution leakage and cross-check passing patches against ground-truth tests (SWE-bench+ found 32% leakage, 31% weak-test passes).
Believing single-run task-completion numbers and public leaderboards. Agents that succeed once routinely fail on identical re-runs, and crowdsourced rankings are distorted by selective disclosure; keep immutable append-only result logs.
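pass^k is cheap to compute once you have repeated trials per task. The sketch assumes the combinatorial estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks; check it against the benchmark's own definition before publishing numbers. Function names are ours.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials of one task all succeed.

    Uses C(c, k) / C(n, k) with c successes observed in n trials.
    """
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)

def benchmark_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate; results[i] holds one task's trial outcomes."""
    return sum(pass_hat_k(sum(r), len(r), k) for r in results) / len(results)
```

A task that succeeds on 6 of 8 trials scores 0.75 at k=1 but 0.0 at k=8, which is exactly the gap between capability and reliability that a single run hides.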
Assuming the loop is under attack
Adversarial Surface & Red-Teaming
The catalogued, agent-specific attack surface: injection via ingested content, persistent memory, multi-server trust, and inter-agent messages.
There is no robust model; nearly every deployed agent violates policy within 10-100 queries and robustness doesn't correlate with model size. The least-bad move is to adopt ART as a drop-in red-team suite and measure your agent's policy-robustness as a distinct axis from capability, then defend deterministically below the model. (See the paper.)
- Run a deployment-policy test harness covering data-access, financial-action, and regulatory-compliance violation classes under a 10-100 query attack budget; measure attack transferability across models.
- Treat the inter-agent message bus as an adversarial boundary: sign message envelopes and schema-validate every inbound message before acting (AiTM rewrites messages without compromising any agent).
- Audit MCP servers before adoption: tool-poisoning sits at ~5.5% prevalence, so validate and sanitize tool metadata and isolate context per tool invocation.
- Model cross-temporal attacks: enumerate every persistence substrate (memory, scheduled jobs, filesystem) × firing-separation, since sleeper-channel payloads fire later through a different surface.
Relying on point-in-time, within-task evaluation. The live attacks are delayed-trigger and cross-surface, which single-session red-teaming structurally cannot detect.
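Signing and schema-validating inter-agent messages needs nothing beyond the standard library. This sketch uses a shared-secret HMAC. The `REQUIRED_FIELDS` schema and the envelope layout are hypothetical, and a real deployment would likely want per-agent keys and replay protection on top.

```python
import hashlib
import hmac
import json

# Hypothetical message schema: exact field set and types.
REQUIRED_FIELDS = {"sender": str, "task_id": str, "body": str}

def sign(message: dict, key: bytes) -> dict:
    """Wrap a message in an envelope carrying an HMAC-SHA256 tag."""
    payload = json.dumps(message, sort_keys=True, separators=(",", ":")).encode("utf-8")
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"message": message, "sig": tag}

def verify(envelope: dict, key: bytes) -> dict:
    """Return the message if signature and schema check out, else raise ValueError."""
    message = envelope.get("message")
    if not isinstance(message, dict):
        raise ValueError("missing message")
    expected = sign(message, key)["sig"].encode("utf-8")
    given = str(envelope.get("sig", "")).encode("utf-8")
    if not hmac.compare_digest(expected, given):
        raise ValueError("bad signature")
    if set(message) != set(REQUIRED_FIELDS):
        raise ValueError("unexpected fields")
    for field, kind in REQUIRED_FIELDS.items():
        if not isinstance(message[field], kind):
            raise ValueError(f"bad type for {field}")
    return message
```

A message whose body is rewritten in transit fails `verify` because the tag no longer matches, so the receiving agent never acts on it.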
Runtime Defense & Pre-Action Authorization
A deterministic enforcement layer that intercepts consequential actions and decides allow/deny on policy and provenance, not model alignment.
Put a synchronous, deterministic pre-action authorization shim between the agent loop and the executor that canonicalizes each tool call and evaluates it against a declarative policy before any side effect (OAP / Faramesh). Alignment is probabilistic and auditing is too late; only an external, non-bypassable gate is reliable. The same shim doubles as your spend/quality gate. (See the paper.)
- Insert a call-site interceptor that canonicalizes every tool call into a hashable form and returns PERMIT/DEFER/DENY (or allow/warn/block/review) before execution; require the executor to validate a signed decision artifact.
- Make DEFER a first-class outcome that routes to async human approval without blocking the stack, and write a cryptographically signed per-call audit record keyed by action hash.
- Gate injection specifically with masked re-execution (MELON): re-run the step with the user prompt masked and block when the action is unchanged, since an unchanged action means the agent is obeying injected content, not the user.
- For memory/RAG injection, enforce tool-gating at the memory layer (Memory Sandbox). It's the only defense that reaches 0% ASR; input- and retrieval-level filters are indistinguishable from no defense.
Granting static permissions decoupled from live trust, or filtering before the injection site. Defenses that cannot observe where untrusted content enters are blind to it.
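The core of a pre-action shim is small: canonicalize the call, hash it, and evaluate a declarative policy before anything runs. The sketch below is our own illustration, not the OAP or Faramesh API. The `POLICY` rules, tool names, and the `max_amount` field are hypothetical, and decision signing is left out.

```python
import hashlib
import json
from enum import Enum

class Decision(Enum):
    PERMIT = "permit"
    DEFER = "defer"
    DENY = "deny"

# Hypothetical declarative policy: first matching rule wins, default is DENY.
POLICY = [
    {"tool": "read_file", "decision": Decision.PERMIT},
    {"tool": "send_payment", "max_amount": 100, "decision": Decision.PERMIT},
    {"tool": "send_payment", "decision": Decision.DEFER},
]

def canonicalize(tool: str, args: dict) -> str:
    """Stable JSON form so equal calls always hash the same."""
    return json.dumps({"tool": tool, "args": args}, sort_keys=True, separators=(",", ":"))

def authorize(tool: str, args: dict) -> tuple[Decision, str]:
    """Return (decision, action_hash) before any side effect happens."""
    action_hash = hashlib.sha256(canonicalize(tool, args).encode("utf-8")).hexdigest()
    for rule in POLICY:
        if rule["tool"] != tool:
            continue
        limit = rule.get("max_amount")
        # A capped rule applies only when the amount is present and within the cap.
        if limit is not None and args.get("amount", float("inf")) > limit:
            continue
        return rule["decision"], action_hash
    return Decision.DENY, action_hash
```

Here a 50-unit payment is permitted, a 5,000-unit payment is deferred for human approval, and an unknown tool is denied. The action hash is the natural key for the audit record and for the signed decision artifact the executor should demand.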
Four hard truths
Four findings showed up across layer after layer. They are the most useful things we can hand another builder.
If you are building an agent today
Make the action substrate a sandboxed stateful Python interpreter with CodeAct (one action space for tools, branching, and recovery), and design terse paginated observations on top of it. Run a synchronous ReAct loop with a one-step world-model lookahead gating every irreversible action. Skip MCTS until you can train. For memory, give the model MemGPT-style read/write/search tools over a tiered store (SQLite is enough), store events verbatim, and compress observations and history separately with ACON-tuned guidelines. Keep tool catalogs small with hierarchical retrieval and six-field descriptions. For multi-agent work, use a MetaGPT-style phase DAG with schema-constrained handoffs and SagaLLM compensating rollbacks, not free chat. Govern runtime by placing all dynamic content at the tail of the prompt (41-80% cost cut, free) plus a zombie reaper and rate-aware scheduling. Wrap the whole loop in Message-Action Traces with declarative contracts for observability, evaluate on final-state pass^k via τ-bench, and put a deterministic OAP-style pre-action authorization shim in front of the executor with Memory-Sandbox tool-gating on the memory layer. Improve over time by inducing AWM workflows from past trajectories. The order of payoff: substrate and authorization shim first, then traces and pass^k eval, then memory and compression, then self-improvement.
This map is a snapshot of a field moving fast, and it is meant to be argued with. It was compiled at Immersive Commons from a large corpus of recent agent-engineering papers, every recommendation grounded in the work behind it. If you are building agents, here or anywhere, and you think a layer is wrong or missing, come tell us. That is what the Commons is for.