Deep dive / May 23, 2026 / Support

AI Agent Sandbox Substrates: What CubeSandbox Makes Visible

An AI agent sandbox is a substrate contract, not a checkbox on an SDK. CubeSandbox is useful because its source makes the substrate layers visible, but the current evidence does not prove runtime safety, production use, E2B parity, benchmark performance, or Starkslab adoption.


An AI agent sandbox is not the model, the prompt, the SDK, or the guardrail label around a tool call.

It is the substrate contract where generated code actually runs. That contract decides who creates the runtime, how commands execute, which files are visible, which credentials enter the process, what the network can reach, how logs are captured, what survives pause/resume, and how the surface is destroyed when the task is over.

That is the part operators tend to blur. The agent framework gets a name. The model gets a name. The sandbox becomes a checkbox. Then a coding agent runs generated code in a repo, a browser profile, a container, a remote interpreter, or a host-mounted worktree, and nobody can say where the real boundary is.

This page uses CubeSandbox as a source-read proof card because the repo makes those layers visible. It does not turn CubeSandbox into a recommendation.

Proof State

Page role: support/comparison page for the AI agent sandbox question.

Evidence state: source-read-only.

What was read: the accepted CubeSandbox source-read, CubeSandbox keyword brief, sandbox-substrate comparison map, CubeSandbox proof-card packet, Daytona source-read, Starkslab mechanics playbook, and gold-page checklist.

What was not run: no repo clone, install, SDK call, Docker Compose run, KVM/PVM setup, VM boot, sandbox launch, template creation, E2B parity test, network-policy test, benchmark, attack test, OpenAI Agents SDK run, OpenClaw run, browser run, or production validation.

What this page can prove: the substrate taxonomy and the source-visible evaluation questions.

What it cannot prove: secure execution, production readiness, full E2B parity, benchmark performance, task reliability, browser reliability, OpenClaw integration quality, OpenAI Agents integration quality, or Starkslab adoption value.


What Is an AI Agent Sandbox?

An AI agent sandbox is the named runtime boundary for agent-triggered code, commands, files, network calls, browser sessions, logs, and cleanup rules.

The practical job is simple: if an agent can execute generated code, the operator needs to know where that code runs and what it can touch. A model output is not a boundary. A function-calling schema is not a boundary. A framework runner is not a boundary. A prompt telling the model to be careful is definitely not a boundary.

The boundary is the substrate.

For a local coding agent, that substrate might be a worktree, a shell session, a container, or a devcontainer. For a code-interpreter flow, it might be a hosted runtime with a filesystem and a kernel process. For a browser agent, it might be a browser profile, remote browser, desktop session, or computer-use loop. For heavier generated-code workloads, it might be a sandbox platform with lifecycle APIs, templates, proxy routing, network policy, snapshots, logs, and a guest process surface.

That is why the useful question is not "does this SDK have tools?"

The useful question is: what substrate contract sits under the tool?

Within Starkslab's library, this page belongs in the AI Agent Tools cluster. It supports the Build AI Agent cluster only where it teaches builders not to wire generated-code execution into an unnamed runtime. The target reader is comparing execution surfaces, not looking for a beginner build tutorial or a tool ranking.

The Substrate Contract

Before trusting an AI agent sandbox, ask six questions.

First, who controls the runtime? The answer should name the control plane: create, schedule, connect, pause, resume, kill, observe, and clean up.

Second, how does code execute? The answer should name the execution API: lifecycle calls, code interpreter calls, shell/process calls, file operations, logs, timeouts, and output streaming.

Third, what can the network reach? The answer should name egress rules, inbound exposure, CIDR allow/deny behavior, DNS and proxy paths, certificates, port mappings, and cleanup after sessions die.

Fourth, what files and host state are visible? The answer should name repo paths, host mounts, environment variables, API keys, SSH agents, browser profiles, cookies, generated artifacts, and logs.

Fifth, what examples are real? A source-visible OpenAI Agents SDK example, OpenClaw example, code-interpreter example, or browser example proves integration shape. It does not prove task reliability or safety.

Sixth, what remains unvalidated? This is the row that keeps a public note honest. If nobody booted the sandbox, ran adversarial code, tested network egress, inspected credential flow, or checked cleanup, the page has to say that.
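One way to keep the six questions from collapsing back into a checkbox is to record them as explicit fields and treat any blank answer as an open proof gap. A minimal Python sketch; the SubstrateContract name and its field names are hypothetical, chosen to mirror the six questions above, and are not part of any sandbox SDK.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SubstrateContract:
    """Hypothetical record of answers to the six substrate questions.

    None (or an empty string) means nobody has answered with evidence yet.
    """
    control_plane: Optional[str] = None   # who creates, pauses, kills, cleans up
    execution_api: Optional[str] = None   # lifecycle, interpreter, process, file calls
    network_reach: Optional[str] = None   # egress, inbound, DNS/proxy, ports
    visible_state: Optional[str] = None   # mounts, env vars, keys, profiles, logs
    real_examples: Optional[str] = None   # examples that show integration shape
    unvalidated: Optional[str] = None     # what has not been tested yet


def open_questions(contract: SubstrateContract) -> list[str]:
    """Return the names of every question that still lacks an answer."""
    return [f.name for f in fields(contract) if not getattr(contract, f.name)]


# Example: a local worktree setup where only some questions have answers.
local_worktree = SubstrateContract(
    control_plane="operator shell and CI",
    execution_api="shell commands in the repo",
    visible_state="host env vars, SSH agent, whole home directory",
)
print(open_questions(local_worktree))  # ['network_reach', 'real_examples', 'unvalidated']
```

The point of the design is that an unanswered question shows up as a named gap rather than as silence.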

Here is the substrate diagram this page is built around:

agent or framework
  -> control plane
  -> execution API
  -> runtime substrate
  -> network / files / credentials / logs / cleanup
  -> proof gaps before adoption

That diagram is not decorative. It defines what this page has to do.

Control Plane Is Not the Agent Brain

The control plane owns sandbox lifecycle and authority. The agent brain only requests work through that boundary.

CubeSandbox is useful source stock because it names more than a single "run code" function. The accepted source-read found CubeAPI as the Rust REST API gateway, CubeMaster as the Go scheduler, Cubelet as the node-local lifecycle manager, CubeProxy as the service routing layer, CubeVS as the network-policy layer, CubeShim as the containerd shim / MicroVM bridge, plus hypervisor, SDK, web, and examples surfaces.

That split matters. If every piece is collapsed into "the sandbox," the operator loses the ability to ask precise questions. Does the API create the sandbox? Does the scheduler choose a node? Does the node manager own lifecycle? Does the proxy expose inbound services? Does the network layer enforce policy? Does the guest agent execute commands? Which logs come from which layer?
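Those per-layer questions can be kept as a lookup table, so that every question names a layer instead of "the sandbox." A Python sketch; the role strings paraphrase the source-read's component descriptions, while the question-to-layer assignments are this page's working hypothesis about those roles, not verified behavior of the CubeSandbox code.

```python
# Roles paraphrased from the accepted source-read; not runtime-verified.
LAYER_ROLES = {
    "CubeAPI": "REST API gateway (Rust)",
    "CubeMaster": "scheduler (Go)",
    "Cubelet": "node-local lifecycle manager",
    "CubeProxy": "service routing layer",
    "CubeVS": "network-policy layer",
    "CubeShim": "containerd shim / MicroVM bridge",
}

# Hypothesis: which layer to inspect first for each question.
QUESTION_OWNER = {
    "who creates the sandbox?": "CubeAPI",
    "who chooses a node?": "CubeMaster",
    "who owns lifecycle on the node?": "Cubelet",
    "who exposes inbound services?": "CubeProxy",
    "who enforces network policy?": "CubeVS",
    "who bridges containerd to the MicroVM?": "CubeShim",
}


def layer_for(question: str) -> str:
    """Name the layer to inspect for a question, or flag the question as unmapped."""
    owner = QUESTION_OWNER.get(question)
    if owner is None:
        return "unmapped: inspect before assuming 'the sandbox' handles it"
    return f"{owner} ({LAYER_ROLES[owner]})"


print(layer_for("who enforces network policy?"))  # CubeVS (network-policy layer)
print(layer_for("who deletes logs?"))             # unmapped: ...
```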

Daytona gives a useful comparison vocabulary. Its source-read framed a sandbox platform around interface plane, control plane, compute plane, in-sandbox daemon, snapshots, SDKs, CLI, dashboard, SSH/VNC/proxy surfaces, and human debugging paths. Daytona is not proof for CubeSandbox, but it reinforces the same operating lesson: code execution needs plane separation.

Local worktrees and Docker/devcontainers sit in a different row. They can be enough for many coding-agent workflows, but the control plane is usually the operator, repo, shell, CI, and validation wrapper. A worktree isolates file changes by convention. It does not automatically isolate network, credentials, or host state.

If the broader control-plane problem is the next question, read What Is a Coding-Agent Control Plane?. That page owns the skills, MCP, config, permissions, and safety-gate layer. This page owns the narrower runtime substrate question.

Execution API: Lifecycle, Code Interpreter, Process, And Files

The execution API is the surface the operator grants to the agent.

For a serious AI agent sandbox, that usually includes lifecycle operations: create, connect, list, pause, resume, delete, timeout, logs, templates, and sometimes snapshots. It also includes code-interpreter operations: execute code, stream stdout and stderr, preserve variables for the lifetime of the sandbox, and return files or artifacts.
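Lifecycle operations behave like a small state machine, and writing the transitions down makes it obvious which ones need a state-survival or cleanup test. A minimal Python sketch; the three states and the allowed transitions are assumptions for illustration, not the documented behavior of CubeSandbox, E2B, or any other platform.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    KILLED = "killed"


# Hypothetical transition table: operation -> (allowed source states, target state).
TRANSITIONS = {
    "pause": ({State.RUNNING}, State.PAUSED),
    "resume": ({State.PAUSED}, State.RUNNING),
    "kill": ({State.RUNNING, State.PAUSED}, State.KILLED),
}


class SandboxLifecycle:
    """Tracks one sandbox's state and records every transition for audit."""

    def __init__(self) -> None:
        self.state = State.RUNNING  # assume create() hands back a running sandbox
        self.log: list[str] = []

    def apply(self, op: str) -> State:
        allowed, target = TRANSITIONS[op]
        if self.state not in allowed:
            raise ValueError(f"{op} not allowed from {self.state.value}")
        self.log.append(f"{self.state.value} -> {target.value} via {op}")
        self.state = target
        return self.state


box = SandboxLifecycle()
for op in ("pause", "resume", "kill"):
    box.apply(op)
print(box.log)
```

Each logged transition is a place to check what survives: a paused sandbox may keep files, variables, and sessions, while a killed one is expected to keep none of them.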

Then there is the process layer: shell commands, long-running processes, interrupts, status, output capture, and timeout behavior. Finally, there is the filesystem layer: read, write, upload, download, mount, and list.
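Timeout and output-capture behavior is easiest to reason about against a local baseline. The sketch below uses the standard library's subprocess.run with its timeout and capture_output arguments to show the result shape an operator would want from any process API: exit code, captured output, and an explicit timed-out flag. It runs on the host, so it illustrates the contract, not isolation; ProcessResult and run_with_timeout are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ProcessResult:
    exit_code: Optional[int]  # None when the process was killed on timeout
    stdout: str
    stderr: str
    timed_out: bool


def _as_text(data: Union[bytes, str, None]) -> str:
    """Partial output attached to TimeoutExpired may be bytes, str, or None."""
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_with_timeout(argv: list[str], timeout_s: float) -> ProcessResult:
    """Run a command, capture stdout/stderr, and report timeouts explicitly."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        return ProcessResult(done.returncode, done.stdout, done.stderr, False)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising; keep any partial output.
        return ProcessResult(None, _as_text(exc.stdout), _as_text(exc.stderr), True)


fast = run_with_timeout([sys.executable, "-c", "print('hello')"], timeout_s=10)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5)
print(fast.exit_code, fast.stdout.strip(), fast.timed_out)  # 0 hello False
print(slow.exit_code, slow.timed_out)                       # None True
```

A sandbox process API that cannot report the equivalent of timed_out leaves the operator guessing whether a silent command finished, hung, or was killed.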

CubeSandbox's source-read supports source-visible intent across these rows. Its API docs and route listings expose E2B-compatible lifecycle surfaces, first-party Python SDK operations, code-interpreter examples, filesystem operations, host-directory mounts, pause/resume language, and log surfaces. It also marks the boundary: some E2B-style routes were documented as pending, incomplete, or dependent on future CubeMaster APIs.

That is enough to say CubeSandbox is designed around an E2B-compatible lifecycle and code-interpreter path. It is not enough to say it fully matches E2B, runs reliably under Starkslab conditions, or can be swapped into a production workload without validation.
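The gap between "designed around" and "fully matches" can be made mechanical: classify each route by its evidence, and only allow a parity claim when every listed route has been exercised. A Python sketch; the route names below are placeholders and are not a statement about which CubeSandbox or E2B routes are actually implemented or pending.

```python
from enum import Enum


class Evidence(Enum):
    PENDING = "documented as pending or incomplete"
    DOCUMENTED = "appears in docs or route listings"
    TESTED = "exercised against a running instance"


def parity_claim(routes: dict[str, Evidence]) -> str:
    """Return the strongest compatibility claim the evidence supports."""
    if routes and all(e is Evidence.TESTED for e in routes.values()):
        return "parity tested for these routes"
    pending = sorted(r for r, e in routes.items() if e is Evidence.PENDING)
    if pending:
        return "partial intent; pending: " + ", ".join(pending)
    return "compatible by design; not yet tested"


# Placeholder routes for illustration only.
example = {
    "create": Evidence.DOCUMENTED,
    "pause": Evidence.DOCUMENTED,
    "route_x": Evidence.PENDING,  # stands in for any route marked pending
}
print(parity_claim(example))  # partial intent; pending: route_x
```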

For the adjacent harness problem, read The Coding Agent Harness Layer. A harness decides how agent work is wrapped, replayed, logged, and reviewed. A sandbox decides where the risky execution actually happens. They need each other, but they are not the same thing.

Network Policy Is A Design To Verify

Network policy matters because generated code can fetch, scan, exfiltrate, call APIs, open ports, or mutate external systems. A sandbox with broad network egress can still leak secrets or cause damage even if the filesystem looks isolated.

CubeSandbox's strongest source-visible detail is its network-policy vocabulary. The source-read found docs around TAP devices, eBPF programs, NAT/session maps, allow/deny CIDR tries, port mappings, private/link-local denial language, and session reaping. Those are useful specifics. They make the policy boundary inspectable.

But a source-visible design is not runtime assurance.

An operator still has to test whether the sandbox is default-open or default-deny, which private CIDRs are denied, how DNS and proxy paths behave, whether certificates change the route, what happens to active sessions on pause/resume/kill, whether inbound port mappings stay scoped, and whether blocked egress actually fails under generated code.
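To make "default-open or default-deny" testable, it helps to write the intended policy down as data and derive expected verdicts before probing the live sandbox. This is a user-space Python model built on the standard ipaddress module, using longest-prefix match between allow and deny CIDRs plus a blanket private/link-local deny. It is an assumption about how such a policy might be specified, not a description of CubeVS or its eBPF implementation.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class EgressPolicy:
    """Hypothetical egress policy model; default-deny unless overridden."""
    default_allow: bool = False
    allow: list[str] = field(default_factory=list)
    deny: list[str] = field(default_factory=list)
    deny_private: bool = True  # blanket deny for private and link-local ranges

    def decide(self, dest: str) -> bool:
        """Return True if egress to dest should be allowed under this policy."""
        ip = ipaddress.ip_address(dest)
        if self.deny_private and (ip.is_private or ip.is_link_local):
            return False
        # Longest-prefix match: the most specific matching CIDR wins.
        # Deny is checked first, so on an equal prefix length deny wins.
        best_len, verdict = -1, self.default_allow
        for cidrs, allowed in ((self.deny, False), (self.allow, True)):
            for cidr in cidrs:
                net = ipaddress.ip_network(cidr)
                if ip in net and net.prefixlen > best_len:
                    best_len, verdict = net.prefixlen, allowed
        return verdict


policy = EgressPolicy(allow=["1.1.1.0/24"], deny=["1.1.1.1/32"])
print(policy.decide("1.1.1.2"))          # True: matched the allow /24
print(policy.decide("1.1.1.1"))          # False: the deny /32 is more specific
print(policy.decide("8.8.8.8"))          # False: no match, default-deny
print(policy.decide("169.254.169.254"))  # False: link-local (cloud metadata address)
print(policy.decide("10.0.0.5"))         # False: private range
```

The model only produces expected verdicts. The live test still has to show that generated code inside the running sandbox actually gets those verdicts.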

The blocked claims matter here. Do not claim any AI agent sandbox is safe because it says eBPF, KVM, VM, MicroVM, remote browser, guardrails, or isolated profile. Those words name places to inspect. They do not replace inspection.

File And Host Access Is Where The Blast Radius Hides

Most practical failures hide in file and host access.

A sandbox claim is incomplete until it says what the runtime can see: repo files, generated files, host mounts, environment variables, API keys, SSH agents, browser profiles, cookies, logs, traces, templates, package caches, and persistent volumes.

CubeSandbox's source-read found filesystem operations, templates, guest agent behavior, and host-directory mount language. That makes the right question visible: are host mounts scoped tightly enough for generated code? What does the guest agent see? Which paths persist? Which logs retain outputs? Does pause/resume preserve state that should have been destroyed?
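Host-mount scoping is one of the few rows that can be checked before anything boots: compare each requested mount against an allowed root and a short list of sensitive paths. A minimal Python sketch using pathlib.PurePosixPath (its is_relative_to method needs Python 3.9+). The allowed root, the sensitive paths, and the audit_mounts name are hypothetical examples, not CubeSandbox configuration, and the check is lexical only.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical policy: only this root may be mounted into the sandbox.
ALLOWED_ROOT = PurePosixPath("/srv/agent-work")
# Hypothetical paths that generated code should never see.
SENSITIVE = [
    PurePosixPath(p)
    for p in ("/home/op/.ssh", "/home/op/.aws", "/var/run/docker.sock")
]


def audit_mounts(mounts: list[str]) -> dict[str, str]:
    """Classify each requested host mount as ok, outside-root, or sensitive."""
    report = {}
    for raw in mounts:
        # Lexical normalization collapses "..", but host symlinks are NOT resolved.
        path = PurePosixPath(posixpath.normpath(raw))
        # Sensitive if the mount sits inside a sensitive path or is a parent of one.
        if any(path.is_relative_to(s) or s.is_relative_to(path) for s in SENSITIVE):
            report[raw] = "sensitive"
        elif path.is_relative_to(ALLOWED_ROOT):
            report[raw] = "ok"
        else:
            report[raw] = "outside-root"
    return report


print(audit_mounts([
    "/srv/agent-work/repo",
    "/home/op",                    # parent of ~/.ssh, so it exposes keys
    "/srv/agent-work/../../etc",   # normalizes to /etc
    "/tmp/cache",
]))
```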

Docker and devcontainers have the same practical problem in a more familiar shape. The container may look isolated, but a broad bind mount, injected .env, forwarded SSH agent, or writable repo credential changes the blast radius. Local worktrees are even easier to overtrust because they feel separate while sharing the same host, shell, credentials, and network.
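Credential leakage through the environment is similarly cheap to check on the launching side. The sketch below scans an environment mapping for variable names that commonly carry credentials or agent sockets. The patterns are heuristic examples, and a clean result does not prove the runtime is credential-free; in practice you might pass dict(os.environ) or the env block a launcher is about to inject.

```python
import re

# Heuristic name patterns; illustrative, not exhaustive.
CREDENTIAL_PATTERNS = [
    re.compile(r"(^|_)(API_?KEY|TOKEN|SECRET|PASSWORD)($|_)"),
    re.compile(r"^SSH_AUTH_SOCK$"),  # forwarded SSH agent socket
    re.compile(r"^AWS_(ACCESS_KEY_ID|SECRET_ACCESS_KEY|SESSION_TOKEN)$"),
]


def exposed_credentials(env: dict[str, str]) -> list[str]:
    """Return sorted env var names that look like credentials or agent sockets."""
    return sorted(
        name for name in env
        if any(p.search(name.upper()) for p in CREDENTIAL_PATTERNS)
    )


launch_env = {
    "PATH": "/usr/bin",
    "OPENAI_API_KEY": "<redacted>",
    "SSH_AUTH_SOCK": "/tmp/ssh-agent.sock",
    "TOKENIZERS_PARALLELISM": "false",  # contains TOKEN but is not a credential
}
print(exposed_credentials(launch_env))  # ['OPENAI_API_KEY', 'SSH_AUTH_SOCK']
```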

Browser and desktop agents add another version of host access. A browser profile may contain cookies. A local desktop surface may see apps, files, and notifications. A remote browser may introduce provider logs and cleanup questions. Those are authority surfaces, but they are not the same as code-execution isolation.

If your next question is how generated-code changes should be reviewed before acceptance, read AI Coding Agent Workflow. The workflow page owns the plan, execute, verify, review, and merge loop around these substrate decisions.

CubeSandbox Proof Card

CubeSandbox is valuable here because it exposes the parts of an AI agent sandbox substrate that many product pages hide behind a simple "run code" API.

Source-basis: the accepted CubeSandbox source-read supports a narrow claim. CubeSandbox is source-visible AI agent sandbox substrate stock with CubeAPI, CubeMaster, Cubelet, CubeProxy, CubeVS, CubeShim, KVM MicroVM/containerd-shim architecture, E2B-compatible lifecycle and code-interpreter intent, eBPF network-policy docs, a first-party Python SDK, and examples for OpenAI Agents SDK, OpenClaw, browser sandbox, code interpreter, and benchmark-style workloads.

That makes the evaluation boundary concrete. Before trusting any sandbox, an operator still has to verify what the runtime can see, which credentials enter it, whether network egress is actually enforced, how host mounts work, what logs and traces persist, which E2B-style APIs are implemented, how pause/resume changes state, and how reliably the environment is destroyed or reused.

The safe lesson is not "use CubeSandbox."

The safe lesson is that an AI agent sandbox is a substrate contract, not a checkbox on an agent SDK.

Blocked claims: do not claim CubeSandbox is secure, safe for arbitrary untrusted generated code, a full E2B substitute, benchmark-validated, validated by Starkslab, or recommended for adoption. Do not claim KVM, PVM, RustVMM, CubeVS, eBPF, TAP devices, host mounts, templates, DNS/proxy routing, certificates, pause/resume, SDK examples, OpenAI Agents SDK examples, OpenClaw examples, browser sandbox examples, code-interpreter examples, RL examples, or SWE-bench-style examples prove runtime isolation, task reliability, network enforcement, credential safety, cleanup safety, or production use.

Comparison Rows: Daytona, E2B, Docker, Devcontainers, And Worktrees

Sandbox comparisons break when every surface is treated as the same kind of proof.

CubeSandbox belongs in the source-visible substrate row: management API, scheduler, node lifecycle, proxy, network policy, virtualization shim, guest process, E2B-compatible intent, SDK, examples, and proof gaps.

Daytona belongs in the sandbox control-plane/platform row: interface plane, control plane, compute plane, in-sandbox daemon, snapshots, SDKs, CLI, dashboard, SSH/VNC/proxy surfaces, human debugging, and governance language. It is useful comparison stock, not runtime validation for this page.

E2B-style hosted sandboxes belong in the interface baseline row. CubeSandbox targets E2B-compatible lifecycle and code-interpreter usage, so E2B is the natural API reference. That does not let this page borrow hosted reliability, security, pricing, data handling, or complete API coverage claims.

Docker, devcontainers, and local worktrees belong in the practical local baseline row. They are often the substrate operators already have. They are also easy to overtrust because mounts, env vars, SSH agents, network access, and host credentials can quietly erase the boundary.

OpenAI Agents SDK and OpenClaw belong in the orchestration/control-plane row. They can call a substrate. They do not validate the substrate.

Browser Harness and UI-TARS Desktop belong in the browser/computer-use authority row. They help explain visual and browser control surfaces. They are adjacent to code-execution sandboxes, not equivalent to them.

Frameworks And Browser Surfaces Are Adjacent

OpenAI Agents SDK is a workflow-control stack. It has agents, runners, tools, handoffs, guardrails, human-in-the-loop patterns, sessions, tracing, MCP surfaces, and sandbox-related examples. That is a developer orchestration layer. It can route work into tools or substrates. It does not prove that the substrate enforces network policy, protects credentials, cleans up files, or safely runs generated code.

OpenClaw is an operator/control-plane comparison. It has skills, memory files, workspace artifacts, review gates, stop rules, and tool boundaries. A CubeSandbox example can show how OpenClaw might call an E2B-style sandbox through a skill. That is integration-shape proof, not Starkslab adoption proof.

Browser Harness and UI-TARS Desktop are also adjacent. They expose browser and computer authority: screenshots, browser sessions, profiles, CDP calls, local/remote computer surfaces, action parsers, loop caps, and stop conditions. Those surfaces matter because browser state and desktop visibility carry real authority. But browser authority and code-execution sandbox isolation are different categories.

For source-reading posture, read I Read OpenClaw's Source Code. That page is the better continuation when the question is how Starkslab separates source facts from runtime proof.

Proof Gaps Before Trusting An Agent Sandbox

The honest verdict is not "use this."

The honest verdict is a validation checklist.

Before making stronger claims about any AI agent sandbox, Starkslab would need a separate runtime/security validation pass that covers:

  • install and boot path;
  • template creation and provenance;
  • sandbox lifecycle operations;
  • code-interpreter execution and output capture;
  • process and shell execution with timeout behavior;
  • filesystem read/write and host-mount scoping;
  • credential injection and environment visibility;
  • network egress policy, denied CIDRs, DNS, proxy, certificates, and port mapping;
  • pause, resume, kill, and cleanup behavior;
  • log and trace capture plus retention;
  • E2B API coverage and unsupported routes;
  • browser and code-interpreter example execution;
  • adversarial generated-code and exfiltration tests;
  • benchmark reproduction only after runtime correctness is established.
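The ordering of that list matters as much as its contents, especially the last item. A small Python sketch of the checklist as data with one rule: benchmark reproduction stays blocked until every runtime-correctness item has passed. Item names abbreviate the bullets above, and the empty results dictionary is a placeholder reflecting that nothing has been run, not a record of results.

```python
CHECKLIST = [
    "install and boot",
    "template creation and provenance",
    "lifecycle operations",
    "code-interpreter execution",
    "process execution and timeouts",
    "filesystem and host-mount scoping",
    "credential injection and env visibility",
    "network egress policy",
    "pause, resume, kill, cleanup",
    "log and trace retention",
    "E2B API coverage",
    "example execution",
    "adversarial and exfiltration tests",
]


def next_steps(results: dict[str, bool]) -> list[str]:
    """List unfinished items; unlock the benchmark step only once all pass."""
    unfinished = [item for item in CHECKLIST if not results.get(item, False)]
    if unfinished:
        return unfinished
    return ["benchmark reproduction"]


# Placeholder state matching this page: nothing has been run yet.
print(len(next_steps({})))  # 13
```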

Until that exists, the public claim stays bounded: CubeSandbox is source-visible proof stock for the substrate question. Daytona is comparison vocabulary. E2B is an interface reference. Docker, devcontainers, and worktrees are local baselines. OpenAI Agents SDK and OpenClaw are orchestration/control-plane surfaces. Browser Harness and UI-TARS Desktop are browser/computer-use surfaces.

That is enough for a useful page. It is not enough for an adoption verdict.

What Starkslab Would Steal, Ignore, And Refuse To Claim

Steal the substrate taxonomy.

The useful pattern is the named split between interface, control, scheduling, compute, proxy/data, network policy, guest process, execution API, files, logs, and cleanup. That vocabulary makes agent code execution reviewable.

Steal proof cards near claims.

If a page says a sandbox exposes eBPF policy, place the proof state next to that sentence. If a page says a framework can call a sandbox, say whether it was actually run. If a page names E2B compatibility, say whether every route was tested.
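A proof card can be as small as a claim paired with the strongest evidence behind it, rendered so the claim never appears without its proof state. A Python sketch; the three evidence levels mirror distinctions this page draws (source-read, run, adversarially tested), and the Proof, ProofCard, and allowed names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Proof(Enum):
    # Declared from weakest to strongest evidence.
    SOURCE_READ = "source-read only"
    RUN = "run, not adversarially tested"
    ADVERSARIAL = "run under adversarial tests"


@dataclass(frozen=True)
class ProofCard:
    claim: str
    proof: Proof

    def render(self) -> str:
        """Return the claim with its proof state attached, never alone."""
        return f"{self.claim} [proof: {self.proof.value}]"


def allowed(card: ProofCard, needs: Proof) -> bool:
    """Block any claim type that needs stronger evidence than the card has."""
    order = list(Proof)  # Enum iteration follows declaration order
    return order.index(card.proof) >= order.index(needs)


card = ProofCard("CubeSandbox docs describe eBPF network policy", Proof.SOURCE_READ)
print(card.render())
print(allowed(card, needs=Proof.ADVERSARIAL))  # False: safety claims need adversarial runs
```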

Ignore the jump from architecture to safety.

Architecture is not nothing. It is useful. But architecture does not prove the boundary under adversarial code, misconfigured credentials, broad host mounts, stale sessions, or real traffic.

Refuse to claim validation that did not happen.

No "best sandbox" verdict. No safety adjective as a conclusion. No benchmark conclusion from a README. No migration advice from a source-read. No Starkslab infrastructure recommendation without a separate run.

That is the anti-hype version of the page, and it is the version that compounds. It gives builders a clearer substrate checklist without pretending Starkslab tested what it did not test.

If your next question is terminal and coding-agent authority, read Agent CLI Control Surfaces. Link job: adjacent depth. It explains what CLI tools can see, edit, execute, delegate, extend, report, and recover from.

If your next question is the harness around agent execution, read The Coding Agent Harness Layer. Link job: adjacent depth. It connects native CLIs, wrappers, worktrees, validation loops, and repeatable execution.

If your next question is permission membranes, read What Is a Coding-Agent Control Plane?. Link job: prerequisite/adjacent depth. It covers skills, MCP, config, sessions, and safety gates.

If your next question is how to accept generated-code changes, read AI Coding Agent Workflow. Link job: next-step. It owns the review loop after a substrate produces work.

If your next question is source evidence versus runtime proof, read I Read OpenClaw's Source Code. Link job: proof methodology. It shows the source-read posture this page uses.

If your next question is OpenClaw as a control plane rather than a sandbox substrate, read OpenClaw, Codex, Claude Code, and ACP. Link job: adjacent depth.

If your next question is why CubeSandbox entered the research queue at all, read How Agent Tool Radar Scores Open-Source AI Agent Tools. Link job: series/lane continuation. Radar signals are research leads, not recommendations.
