This note is for engineers who wire frameworks, gateways, and runners—not for debating model weights. It pairs with our LangGraph checkpoint and quota matrix, the OpenClaw IDE bridge sandbox playbook, and JSON Schema tool retries and timeout fuses on remote nodes. Skim the homepage for product context, compare regions on pricing, and browse the full Tech Blog; viewing plans does not require signing in.
| Dimension | E2B-class cloud sandbox | Local / remote Mac strategy |
|---|---|---|
| Isolation story | Fresh VM or microVM per task; kernel boundary; easy tear-down. | Separate macOS user or role account, dedicated APFS volume or dataset, no shared login session with humans. |
| Permission surface | IAM + hypervisor; attach secrets per job; metadata service scoped tokens. | ACLs, SIP-aware paths, TCC prompts avoided via headless service design; tool runner never holds developer Apple ID. |
| Path stability | Ephemeral /workspace; predictable image layers. | Explicit AGENT_ROOT, TMPDIR, and cache redirects (no automation caches under ~/Library); session-scoped subdirs only. |
| Timeout model | VM wall clock + cgroup-style limits where available. | Per-tool asyncio/subprocess caps, ulimit -t where appropriate, HTTP deadlines, graph-level fuse—see linked OpenClaw article. |
| Observability | Serial console, cloud logging agent, snapshot before destroy. | Structured JSON lines, git diff --stat trails, launchd throttled health checks to loopback gateways. |
| Best default | Untrusted codegen from the public internet. | Apple-only stacks, Metal, MLX, Xcode-derived tooling, on-prem compliance. |
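The Observability row for the Mac column relies on structured JSON lines. Below is a minimal bash sketch of that idea, assuming the per-session `logs/` directory from the session boot snippet later in this note. The `json_escape` and `log_event` helpers and the `events.jsonl` file name are hypothetical. The escaping covers only backslashes, double quotes, and newlines, so a production runner should use a real JSON encoder.

```bash
#!/usr/bin/env bash
# Hypothetical JSON-lines logger for the per-session logs/ directory.
# json_escape handles only backslash, double quote, and newline; control
# characters such as tabs or carriage returns are NOT escaped.

json_escape() {
  local s=$1 bs='\' q='"'
  s=${s//"$bs"/"$bs$bs"}    # escape backslashes first so later escapes survive
  s=${s//"$q"/"$bs$q"}      # then double quotes
  s=${s//$'\n'/"${bs}n"}    # then newlines
  printf '%s' "$s"
}

log_event() {
  # usage: log_event <tool> <status> <message>
  local tool=$1 status=$2 msg=$3 ts
  if [[ -z ${AGENT_ROOT:-} ]]; then
    echo "log_event: AGENT_ROOT is not set" >&2
    return 1
  fi
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"ts":"%s","session":"%s","tool":"%s","status":"%s","msg":"%s"}\n' \
    "$ts" "$(json_escape "${SESSION_ID:-unknown}")" "$(json_escape "$tool")" \
    "$(json_escape "$status")" "$(json_escape "$msg")" \
    >> "$AGENT_ROOT/logs/events.jsonl"
}
```

A runner could call it after each invocation, for example `log_event git_diff ok "$(git diff --stat | tail -n 1)"`, so diff summaries and outcomes land in the same per-session file.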
Threat model (one-line table)
Keep this table beside your runbook: each row should map to at least one automated test or alert.
| Actor / vector | Asset at risk | Failure mode | Primary control |
|---|---|---|---|
| Model-proposed shell | User home, SSH keys | Recursive delete or archive exfil | Allowlisted argv[0], cwd lock, read-only repo mount |
| Benign tool + malicious input | Adjacent projects | Path traversal into sibling repos | Chroot-like prefix enforcement, realpath checks |
| Dependency install hooks | Network egress | Supply-chain phone-home | Egress proxy deny-by-default, offline mirror for CI |
| Stuck subprocess | Worker pool | Queue stall, GPU lock | SIGTERM grace then SIGKILL; lease TTL on session id |
| Prompt injection via tool output | Downstream LLM calls | Recursive tool fan-out | Output byte caps, schema validation, max tool depth |
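The first two rows depend on an argv[0] allowlist and realpath-based prefix enforcement. The bash sketch below shows both checks. `ALLOWED_TOOLS`, `is_allowed_tool`, and `inside_root` are hypothetical names, the allowlist entries are only examples, and the code assumes a `realpath` command is available on PATH.

```bash
#!/usr/bin/env bash
# Hypothetical guards for the "model-proposed shell" and "path traversal" rows.
# Assumes a realpath(1) command is available for paths that already exist.

ALLOWED_TOOLS=(/usr/bin/git /usr/bin/grep /bin/ls)   # example allowlist only

is_allowed_tool() {
  # Exact match on the absolute argv[0]; bare names like "git" are rejected.
  local t
  for t in "${ALLOWED_TOOLS[@]}"; do
    [[ $1 == "$t" ]] && return 0
  done
  return 1
}

inside_root() {
  # usage: inside_root <root_dir> <candidate_path>
  # Succeeds only if the candidate resolves (symlinks and .. included) under root.
  local root target parent
  root=$(cd "$1" 2>/dev/null && pwd -P) || return 1
  if [[ -e $2 || -L $2 ]]; then
    target=$(realpath "$2" 2>/dev/null) || return 1
  else
    # Not created yet: resolve the parent directory, then re-append the name.
    parent=$(cd "$(dirname "$2")" 2>/dev/null && pwd -P) || return 1
    target="$parent/$(basename "$2")"
  fi
  # The trailing /* keeps a sibling such as ws2 from matching root ws.
  [[ $target == "$root" || $target == "$root"/* ]]
}
```

Resolving the parent directory for paths that do not exist yet lets a tool create new files while still catching `../` escapes and symlinks that point outside the workspace. A check-then-use gap remains between validation and the actual write, which is why the table also lists a read-only repo mount.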
Local sandbox configuration snippets
Apple’s sandbox-exec / Seatbelt profile workflow inspired many teams to think in terms of deny by default. On current macOS versions, where the sandbox-exec man page marks the command as deprecated, treat its profiles as documentation for constraints you reimplement in your runner: wrap tools in a dedicated user, a fixed working directory, a sanitized environment, and explicit open file descriptors, rather than assuming a single CLI still ships on every fleet machine.
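A runner-side wrapper that applies a fixed working directory, a rebuilt environment, and a closed stdin can be sketched in bash with `env -i`, which starts the child from an empty environment. `run_tool_clean` is a hypothetical name, and the variable list is an example. The sketch assumes the workspace/tmp/cache layout created by the session boot snippet that follows. Switching to the dedicated user (for example via `sudo -u`) is left out.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run one binary with a fixed cwd, a rebuilt environment,
# and stdin closed. Inherited variables such as AWS_* or SSH_AUTH_SOCK never
# reach the child because env -i discards the caller's environment.

run_tool_clean() {
  # usage: run_tool_clean <agent_root> <absolute-binary> [args...]
  local root=$1
  shift
  if [[ $1 != /* ]]; then
    echo "run_tool_clean: binary must be an absolute path: $1" >&2
    return 2
  fi
  (
    cd "$root/workspace" || exit 1      # fixed working directory
    exec env -i \
      HOME="$root" \
      PATH=/usr/bin:/bin \
      TMPDIR="$root/tmp" \
      XDG_CACHE_HOME="$root/cache" \
      LANG=C \
      "$@" </dev/null                   # no inherited stdin
  )
}
```

Building the environment from an explicit list makes the child's environment the allowlist itself, so the runner does not have to guess which inherited keys are sensitive.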
```bash
# Session boot (zsh/bash): mirror "disposable root" semantics locally
export SESSION_ID="${SESSION_ID:-$(uuidgen | tr '[:upper:]' '[:lower:]')}"
export AGENT_ROOT="$HOME/llm-agent-sand/$SESSION_ID"
mkdir -p "$AGENT_ROOT"/{workspace,tmp,cache,logs}
chmod 700 "$AGENT_ROOT"
cd "$AGENT_ROOT/workspace" || exit 1
export TMPDIR="$AGENT_ROOT/tmp"
export XDG_CACHE_HOME="$AGENT_ROOT/cache"
# Hard cap CPU seconds for child builds (tune per tool)
ulimit -t 180 2>/dev/null || true
```

Pair the above with a tool manifest checked into git: absolute paths to binaries, maximum argument length, forbidden environment keys (AWS_, SSH_, corporate tokens), and a rule that $HOME for the agent user is only the sandbox prefix. For gateway-mediated flows, reuse the read-only repo plus writable scratch split from the IDE bridge sandbox article.
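A pre-flight check for the manifest rules above might look like the following bash sketch. `preflight`, `FORBIDDEN_ENV_PREFIXES`, and `MAX_ARGV_CHARS` are hypothetical, and the limits are illustrative rather than recommended. The function inspects the runner's own exported variables through the bash builtin `compgen -e`, because those are what a child process would inherit.

```bash
#!/usr/bin/env bash
# Hypothetical manifest pre-flight; the prefix list and limit are examples only.
FORBIDDEN_ENV_PREFIXES=(AWS_ SSH_ GITHUB_TOKEN)
MAX_ARGV_CHARS=4096

preflight() {
  # usage: preflight <absolute-binary> [args...]
  if [[ $1 != /* ]]; then
    echo "preflight: binary is not an absolute path: $1" >&2
    return 1
  fi
  local joined="$*" name p
  if (( ${#joined} > MAX_ARGV_CHARS )); then
    echo "preflight: argv longer than $MAX_ARGV_CHARS characters" >&2
    return 1
  fi
  # compgen -e lists the names of exported variables, i.e. what a child inherits.
  while read -r name; do
    for p in "${FORBIDDEN_ENV_PREFIXES[@]}"; do
      if [[ $name == "$p"* ]]; then
        echo "preflight: forbidden environment key: $name" >&2
        return 1
      fi
    done
  done < <(compgen -e)
}
```

Failing closed here keeps secrets out of tool processes even when a developer launches the runner from a shell that still has cloud or SSH credentials exported.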
CI gate integration
Production sandboxes fail when only humans remember the knobs. Promote the same contracts into CI:
- Registry diff: fail the build if a new tool name appears without a JSON Schema, an owner, and a default timeout.
- Dry-run harness: execute each tool against a synthetic workspace on a Mac runner; assert writes stay under AGENT_ROOT.
- Budget table: export max wall time per tool class; CI fails if the sum exceeds the nightly job SLA (align with the timeout layering in the LangGraph timeouts article).
- Secret scanners: block commits that embed loopback gateway tokens or .p12 blobs into agent templates.
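The budget-table gate can be as small as the following bash sketch. It assumes a hypothetical tab-separated file with one `tool_class<TAB>max_seconds` row per class and `#` comment lines. `check_budget` is an illustrative name, and the SLA is passed in seconds.

```bash
#!/usr/bin/env bash
# Hypothetical CI gate: sum per-class wall-time budgets and compare to the SLA.
# Input: tab-separated lines "tool_class<TAB>max_seconds"; lines starting with # are skipped.

check_budget() {
  # usage: check_budget <budget.tsv> <sla_seconds>
  local total
  total=$(awk -F'\t' '!/^#/ && NF >= 2 { sum += $2 } END { print sum + 0 }' "$1") || return 2
  echo "total=${total}s sla=${2}s"
  (( total <= $2 ))   # exit status 1 when the summed budgets exceed the SLA
}
```

For example, `check_budget ci/tool-budgets.tsv 3600` (a hypothetical path) in a nightly job fails the pipeline once the summed budgets exceed one hour, and exit status 2 distinguishes a missing or unreadable budget file from a blown budget.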
Merge queues become your “mini cloud”: every PR proves the agent still respects path and timeout invariants before code reaches the remote Mac fleet.
Timeout & path acceptance checklist
Ship only when every item is objectively true—paste into tickets or SOC reviews.
- SESSION_ID prefixes every write path; deleting a session removes workspace, tmp, and cache.
- TMPDIR and XDG_CACHE_HOME point inside AGENT_ROOT; no tool falls back to shared /tmp.
- Each tool has a per-invocation wall timeout plus process-group kill on expiry.
- Graph or workflow has a global fuse shorter than the infrastructure idle shutdown.
- Read-only bind or ACL on source repos; agents cannot git push without a second, human-approved step.
- Egress is deny-by-default; allowlist only required hosts and ports per tool profile.
- Logs redact tokens; failures return structured JSON to the model, not raw stderr dumps.
- Soak test on a dedicated remote Mac: 2× expected concurrency for one hour with no zombie PIDs.
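For the per-invocation timeout with process-group kill, GNU `timeout` is not guaranteed on stock macOS, so a runner may need its own helper. The bash sketch below uses the hypothetical name `run_with_deadline`. It enables job control (`set -m`) so the background command gets its own process group, polls once per second, signals the whole group with SIGTERM at the deadline, waits a grace period, then sends SIGKILL, and reaps the child with `wait` so no zombie remains. Returning 124 mirrors GNU `timeout`'s convention and is an assumption, not a requirement.

```bash
#!/usr/bin/env bash
# Hypothetical deadline helper: own process group, SIGTERM, grace period, SIGKILL.
# Returns the command's exit status, or 124 if the deadline expired.

run_with_deadline() {
  # usage: run_with_deadline <seconds> <grace_seconds> <cmd> [args...]
  local secs=$1 grace=$2 pid waited=0 rc=0 had_jobctl=0
  shift 2
  [[ $- == *m* ]] && had_jobctl=1
  set -m                        # job control: the next background job gets its own group
  "$@" &
  pid=$!                        # process-group id == pid of the job leader
  (( had_jobctl )) || set +m    # restore the caller's setting
  while kill -0 "$pid" 2>/dev/null; do
    if (( waited >= secs )); then
      kill -TERM -- "-$pid" 2>/dev/null || true   # negative pid: whole group
      sleep "$grace"
      kill -KILL -- "-$pid" 2>/dev/null || true   # anything that ignored SIGTERM
      wait "$pid" 2>/dev/null || true             # reap the leader: no zombie
      return 124
    fi
    sleep 1
    waited=$((waited + 1))
  done
  wait "$pid" || rc=$?
  kill -KILL -- "-$pid" 2>/dev/null || true       # sweep stragglers left in the group
  return "$rc"
}
```

The loop only watches the group leader, which is why the final sweep also signals the group after a normal exit. A tool that calls `setsid` or otherwise leaves its process group escapes both kills, so the soak test should still look for orphans.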
Failure mode FAQ
Why do allowlisted tools still leak data? Allowlists gate names, not composition. Combine argv validation, cwd locks, sanitized TMPDIR, and output size caps.
When is cloud strictly safer than Mac? When code is fully untrusted and you do not need Apple-only APIs—spin a disposable VM and keep Mac runners for signed pipelines only.
What breaks first under load? Usually not the model—orphan shells, filled scratch disks, and checkpoint directories competing with tool caches. Watch inode and byte usage together.
How do I phrase timeouts for the LLM? Return TOOL_TIMEOUT with retry guidance only when the operation is idempotent; otherwise return a terminal policy code so the agent does not loop forever.
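The timeout guidance in the last answer can be encoded as a small mapping from exit status to the JSON the model receives. This bash sketch assumes exit status 124 means a deadline fired (as with GNU `timeout` or a helper that follows the same convention) and that the runner knows whether the tool is idempotent. `TOOL_TIMEOUT` comes from the answer above. `TOOL_TIMEOUT_TERMINAL`, `TOOL_FAILED`, `tool_result_json`, and the hint strings are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical mapping from a tool's exit status to the JSON returned to the model.
# Only TOOL_TIMEOUT is named in the FAQ; the other codes are illustrative.

tool_result_json() {
  # usage: tool_result_json <exit_status> <idempotent: yes|no>
  local rc=$1 idempotent=$2
  if (( rc == 0 )); then
    printf '{"status":"ok"}\n'
  elif (( rc == 124 )) && [[ $idempotent == yes ]]; then
    printf '{"status":"error","code":"TOOL_TIMEOUT","retryable":true,"hint":"retry once with a narrower input"}\n'
  elif (( rc == 124 )); then
    printf '{"status":"error","code":"TOOL_TIMEOUT_TERMINAL","retryable":false,"hint":"do not retry; report to the user"}\n'
  else
    printf '{"status":"error","code":"TOOL_FAILED","exit_status":%d,"retryable":false}\n' "$rc"
  fi
}
```

Because the non-idempotent branch sets `retryable` to false, an agent loop that honors that field stops instead of re-issuing a write that may already have partially applied.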
Executable policy (copy into your README)
- No tool runs as the developer’s interactive user; automation uses a dedicated account with empty login items.
- Every remote invocation carries SESSION_ID, AGENT_ROOT, and a maximum tool depth counter decremented per nested call.
- Timeouts are configured in three layers (HTTP client, subprocess, and graph) and documented in the same file as the tool registry.
- CI must prove path confinement and timeout behavior on macOS before promoting agent templates to production branches.
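The depth counter in this policy can be enforced with a guard the runner calls before every nested tool spawn. In this bash sketch, `TOOL_DEPTH_REMAINING`, `enter_nested_tool`, and the `MAX_TOOL_DEPTH` and `MISSING_SESSION` codes are hypothetical. An unset counter is treated as zero, so the guard fails closed.

```bash
#!/usr/bin/env bash
# Hypothetical nested-call guard. The runner exports TOOL_DEPTH_REMAINING
# alongside SESSION_ID and AGENT_ROOT; each nested call consumes one unit.

enter_nested_tool() {
  if [[ -z ${SESSION_ID:-} || -z ${AGENT_ROOT:-} ]]; then
    echo '{"status":"error","code":"MISSING_SESSION","retryable":false}'
    return 1
  fi
  local remaining=${TOOL_DEPTH_REMAINING:-0}   # unset counts as exhausted
  if (( remaining <= 0 )); then
    echo '{"status":"error","code":"MAX_TOOL_DEPTH","retryable":false}'
    return 1
  fi
  export TOOL_DEPTH_REMAINING=$((remaining - 1))
}
```

Call it in the shell that will spawn the child, not inside `$( ... )`, which would discard the decremented value. Because the counter is exported, each nested level inherits one fewer remaining call.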