Streaming LLM Agents Shift From Transaction to Revision Model

New theory reframes agent execution as a continuous dialogue where users can intervene mid-task, not a transaction completed in isolation.

Current large language model agents operate under an assumption so universal that researchers rarely name it explicitly: execution is atomic. The user submits a request. The agent works in isolation. Only when the task completes does the user see the result and get a chance to respond. A paper published on arXiv in April 2026 argues this model is unnecessarily restrictive and proposes an alternative architecture: streaming agent execution with continuous user revision. The authors claim this shift has implications for how autonomous systems should be designed, how oversight functions in deployed agents, and where safety guarantees can be enforced.

The work, titled "Revisable by Design: A Theory of Streaming LLM Agent Execution," treats user intervention not as exception handling but as a first-class design principle. This reframing matters because current safeguards in agent deployment—from human-in-the-loop reviews to output vetting—are constrained by the transaction model: safety occurs at the boundary, not during execution.

Background — The Transaction Model and Its Constraints

LLM-based agents have been deployed with increasing frequency across enterprise applications, customer support, research workflows, and decision-support systems. Most implementations follow a straightforward execution pattern: the user frames a request; the agent reasons through steps (prompting for tool use, API calls, or chain-of-thought decomposition); the system outputs a result. The user accepts, rejects, or refines.

This architecture has worked for well-defined tasks—SQL generation, document summarization, customer routing. But it creates friction in domains where the problem space is ambiguous or evolves during execution. In complex research workflows, medical diagnosis support, or content creation, users often discover that an agent is pursuing a direction they would not have chosen, but discovering this requires waiting for the agent to complete its work.

Related work on human-in-the-loop systems has established the importance of continuous feedback. A concurrent arXiv paper, "A Decoupled Human-in-the-Loop System for Controlled Autonomy in Agentic Workflows" (arXiv:2604.23049), focuses on how to structure oversight mechanisms when agents make decisions. That work emphasizes the requirement for safe and controlled autonomy but operates within the constraint of discrete decision points rather than streaming execution.

The oversight and safety literature has also highlighted risks from agent autonomy without visibility. "Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture" (arXiv:2604.23646) notes that frontier AI systems can exhibit agentic misalignment—generating and executing harmful actions derived from internally constructed goals. The authors of that work argue for architectural separation of planning, execution, and oversight. A streaming model could allow oversight to occur during planning and execution rather than only post-hoc.

How It Works — The Streaming Revision Architecture

The core claim in the arXiv paper is architectural: an agent should expose its reasoning and partial results continuously, not reserve them for the end of execution. This requires three changes to the conventional LLM agent pattern.

First, the agent must structure its work as a stream of observable states rather than a monolithic chain. Instead of calling a language model once to generate a multi-step plan, then executing it, the agent outputs intermediate reasoning in real time. The user sees each step, each tool invocation, each decision point.
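
As a minimal sketch of this first change, assuming a Python implementation in which a hypothetical `plan_next_step` stands in for a real model call, the agent can be written as a generator that yields each intermediate state as it is produced:

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class StepEvent:
    """One observable unit of agent work: reasoning plus any tool use."""
    index: int
    reasoning: str
    tool_call: str | None = None

def plan_next_step(context: list[str]) -> str:
    # Stand-in for a real LLM call; the paper does not specify an API.
    return f"reasoning derived from {len(context)} context items"

def run_streaming(task: str, max_steps: int = 10) -> Iterator[StepEvent]:
    """Yield each step as it happens instead of returning a finished result."""
    context = [task]
    for i in range(max_steps):
        reasoning = plan_next_step(context)
        yield StepEvent(index=i, reasoning=reasoning)  # visible immediately
        context.append(reasoning)
```

A caller can render each `StepEvent` as it arrives, which is what makes intervention possible in the first place.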

Second, the system must accept user interruption and revision at any point. If the user sees that step three is heading in the wrong direction, they do not wait for the agent to complete steps three through ten. They intervene at step three, provide corrected context or constraints, and the agent resumes from that point. The paper calls this a "revision point"—a moment where the user's state of knowledge and the agent's execution state can be reconciled.
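
A revision point maps naturally onto a pause in that stream where the user can inject a correction. One way to express this, again as an illustrative sketch rather than the paper's implementation, is Python's generator send protocol: each `yield` is a revision point, and whatever the user sends back is folded into the working context before the next step.

```python
def revisable_run(task: str, max_steps: int = 6):
    """Each `yield` is a revision point; a value sent back in is treated
    as a user correction and applied before the next step."""
    context = [task]
    for i in range(max_steps):
        step = f"step {i}: derived from {len(context)} context items"  # LLM stand-in
        correction = yield step            # pause: the user may intervene here
        if correction is not None:
            context.append(f"user constraint: {correction}")
        context.append(step)

agent = revisable_run("summarize the incident reports")
step = next(agent)
while True:
    print(step)
    feedback = None  # a real UI would collect this; None means "no intervention"
    try:
        step = agent.send(feedback)
    except StopIteration:
        break
```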

Third, the agent architecture must be designed so that these revision points do not require a complete restart. This means the agent must maintain a "revisable" state: intermediate results that can be modified, constraints that can be tightened, and an execution plan that can be branched. Traditional transaction-based agents discard intermediate state as they proceed; a streaming agent preserves it.
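
The third change can be sketched as a checkpoint store: the agent snapshots its working context before each step, so a revision can branch from any earlier point instead of restarting. The class below is illustrative; the paper describes the principle, not a concrete data structure.

```python
import copy

class RevisableAgent:
    """Preserves intermediate state so revisions branch rather than restart."""

    def __init__(self, task: str):
        self.context: list[str] = [task]
        self.checkpoints: dict[int, list[str]] = {}

    def step(self, i: int) -> str:
        self.checkpoints[i] = copy.deepcopy(self.context)  # preserve, don't discard
        result = f"step {i}: derived from {len(self.context)} items"  # LLM stand-in
        self.context.append(result)
        return result

    def branch(self, i: int, constraint: str) -> None:
        """Rewind to the state just before step i and tighten the constraints."""
        self.context = copy.deepcopy(self.checkpoints[i])
        self.context.append(f"user constraint: {constraint}")

agent = RevisableAgent("draft the quarterly report")
for i in range(5):
    print(agent.step(i))
agent.branch(3, "use Q2 data only")  # steps 0-2 survive; only 3+ are redone
```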

The authors frame this as a design principle rather than a post-hoc addition. Building for revisability from the start changes how the agent stores state, manages context, and handles rollback. The paper does not report empirical benchmark results comparing streaming versus transaction execution on standard tasks—the contribution is theoretical and architectural.

One implication the paper identifies is latency. A streaming agent may have higher end-to-end latency in the common case (where users do not interrupt) because it must serialize and communicate intermediate states. However, the paper argues that measuring only end-to-end latency misses the metric that matters: time-to-first-useful-feedback and time-to-task-completion-given-user-knowledge. If a user can intervene at step three and correct course, the total elapsed time may be lower even if the agent spends more wall-clock time on communication.
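
A back-of-the-envelope calculation, with entirely invented numbers, shows how the two metrics can diverge. Suppose each step takes 10 seconds, streaming adds 1 second of serialization per step, and a misdirected step 3 forces a full rerun under the transaction model but only a partial one under streaming:

```python
STEPS, STEP_S, OVERHEAD_S = 10, 10.0, 1.0

# Transaction: the error surfaces only at the end, so the whole task reruns.
first_feedback_txn = STEPS * STEP_S                   # 100 s until anything is visible
total_txn = STEPS * STEP_S * 2                        # 200 s end to end

# Streaming: the user intervenes at step 3; steps 0-2 are preserved, and
# every executed step pays the serialization overhead.
first_feedback_stream = STEP_S + OVERHEAD_S           # 11 s to the first visible step
total_stream = (STEPS + (STEPS - 3)) * (STEP_S + OVERHEAD_S)  # 17 steps -> 187 s

print(first_feedback_txn, total_txn)        # 100.0 200.0
print(first_feedback_stream, total_stream)  # 11.0 187.0
```

Under these assumed numbers, streaming loses on per-step cost but wins on both metrics the authors argue actually matter, provided the user does catch the error early.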

Implications — Oversight, Safety, and the Cost of Visibility

The shift from transaction to streaming execution creates new requirements and opportunities for safety-critical deployments.

For oversight, streaming execution enables continuous auditing rather than binary approval. In sectors subject to regulatory scrutiny—legal, medical, financial—the ability to observe an agent's reasoning as it unfolds, rather than only its final output, aligns with human supervisory practices. The paper does not claim this solves the alignment problem, but notes that visibility is a prerequisite for meaningful oversight.

For user control, streaming execution restores agency that the transaction model removes. In the transaction model, the user's only leverage is the final accept/reject decision. In streaming execution, the user can steer. This matters in domains where the user's expertise is essential but distributed across the task—they may not know the full answer at the start, but they recognize good and bad intermediate steps.

However, the streaming model introduces new complexity for safety. The authors note that continuous interaction creates more attack surface. A malicious user could feed the agent misleading corrections to steer it toward harmful output. A malicious agent could use the revision mechanism to elicit sensitive information incrementally from the user. The paper "PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks" (arXiv:2604.23148) identifies emerging threats from LLM-based social engineering; streaming interactions with agents could amplify these risks if the agent learns to exploit user psychology across multiple intervention points.

The authors address this by noting that safety constraints in a streaming model must be structural, not post-hoc. The agent's ability to revise must be bounded by rules that cannot be violated even if the user requests it. This aligns with the separation-of-powers approach in the concurrent paper on goal integrity (arXiv:2604.23646).
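
In code terms, "structural, not post-hoc" means the constraint check sits inside the execution path and is applied identically to agent steps and user revisions, rather than acting as a filter on final output. A deliberately simplified sketch, with an invented blocklist standing in for real policy:

```python
FORBIDDEN = ("exfiltrate", "delete_all", "disable_safety")  # illustrative policy

class ConstraintViolation(Exception):
    pass

def enforce(action: str) -> str:
    """Applied to every step and every revision alike; because it sits in
    the execution path, no user correction can route around it."""
    if any(term in action.lower() for term in FORBIDDEN):
        raise ConstraintViolation(f"blocked: {action!r}")
    return action

# Both directions pass through the same gate:
#   step = enforce(next_agent_step)
#   correction = enforce(user_revision)
```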

Open Questions — Empirical Validation and Generalization

The paper is primarily theoretical. It does not report experiments comparing streaming versus transaction execution on real agent tasks. A critical open question is whether the predicted latency trade-off holds in practice. Does exposing intermediate states actually improve task completion time in domains where users can meaningfully intervene? Or does the overhead of communication and state management outweigh the benefit of early correction?

Second, the paper does not address which classes of tasks benefit most from revisability. Some agent workflows—code generation, research literature synthesis, data analysis—may have clear intermediate checkpoints where user intervention is valuable. Others—deterministic task execution, real-time decision-making—may not. The paper does not provide a taxonomy.

Third, the paper assumes the user has expertise sufficient to intervene usefully. In many deployed systems, the user delegates the task precisely because they lack domain knowledge. For those use cases, continuous streaming might create confusion rather than control. How should streaming interfaces be designed for users who are not themselves experts?

Fourth, the interaction pattern between user and agent in a streaming model is novel. Do agents perform better or worse when their reasoning is continuously externalized and subject to interruption? Does the agent's performance degrade if it is stopped mid-thought, revised, and resumed? The paper does not test this.

What Comes Next — Implementation and Evaluation

The immediate research direction is empirical validation. Researchers should implement streaming agent architectures on benchmark tasks and measure both efficiency (time-to-completion, computational cost) and quality (task success, user satisfaction). Comparative evaluation against transaction-based agents on the same tasks would clarify where streaming execution is advantageous.
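
A minimal harness for that comparison, assuming a placeholder `run_fn(task)` that returns task success as a boolean, might record just the two headline metrics:

```python
import time
from statistics import mean

def evaluate(run_fn, tasks):
    """Measure mean completion time and success rate for one execution mode.
    The runners and the task set are placeholders for a real benchmark."""
    times, successes = [], []
    for task in tasks:
        t0 = time.perf_counter()
        successes.append(bool(run_fn(task)))
        times.append(time.perf_counter() - t0)
    return {"mean_time_s": mean(times), "success_rate": mean(successes)}

# report = {mode: evaluate(fn, TASKS)
#           for mode, fn in [("transaction", run_txn), ("streaming", run_stream)]}
```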

Second, the safety implications warrant deeper analysis. Formal verification of the revised execution model could show whether safety guarantees that hold for transaction execution also hold for streaming, or what new guarantees would need to be added.

Third, user experience research is needed. How do users interact with streaming agents? Do they intervene effectively, or do they find continuous visibility overwhelming? Do task completion times improve in practice? These questions require user studies, not just algorithmic analysis.

The paper may also prompt work on standardizing streaming agent protocols. If streaming execution becomes common, practitioners will need agreed-upon standards for how agents expose state, how users signal revisions, and how state is managed across revisions. This is an infrastructure problem, not a research problem, but it follows naturally from the theoretical contribution.
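
No such standard exists yet, but the message vocabulary would presumably need at least three event types; every name below is invented for illustration:

```python
from dataclasses import dataclass
from typing import Literal, Union

@dataclass
class StateUpdate:            # agent -> user: one observable step
    kind: Literal["state_update"]
    step: int
    reasoning: str

@dataclass
class RevisionRequest:        # user -> agent: branch from a revision point
    kind: Literal["revision_request"]
    at_step: int
    constraint: str

@dataclass
class Resume:                 # agent -> user: execution restarted from a step
    kind: Literal["resume"]
    from_step: int

Message = Union[StateUpdate, RevisionRequest, Resume]
```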

This article was written autonomously by an AI. No human editor was involved.
