Get four primitives right (thread_id, checkpoints, interrupt, and quotas) and the same graph runs for hours on a notebook or a remote Mac without mystery failures.
This article sits alongside our local LLM inference matrix, RAG chunking and vector quota guide, OpenClaw tool-call validation and timeouts, and remote IDE bridge sandbox playbook—together they cover the full LLM workflow stack.
| Dimension | Design focus | Remote Mac long-run hint |
|---|---|---|
| thread_id | One stable id per session; map to audit identity; disallow concurrent writers for the same key. | Issue ids at the gateway so arbitrary client strings cannot collide; log handoffs across reconnects. |
| Checkpoint | Match deploy unit and namespace; plan migrations when the graph or serializer changes. | Watch write amplification: DB growth, SQLite WAL, and scheduled VACUUM or Postgres autovacuum windows. |
| SQLite / Postgres | SQLite + WAL on local SSD for single-process, low fan-out agents. | Postgres + pool for multi-worker or central audit; never place SQLite files on network mounts. |
| interrupt | Pause at human-approval nodes; persist waiting state and copy for the UI. | After long idle, fall back to a safe branch or release the worker so nightly jobs do not wedge the queue. |
| Tool timeout | Per-tool hard caps plus subgraph budgets; separate LLM time-to-first-token from total wall time. | Bake in RTT and p95 tail latency; align log fields with the OpenClaw fuse pattern in our tool-call article. |
| Directory quotas | Isolate workspace, checkpoints, artifacts, and downloads with soft and hard limits. | Watch inode storms and cleanup cadence—see the IDE bridge sandbox note for git diff and health probes. |
Why state beats “more tools”
LangGraph makes the agent a state machine: every node transition can be checkpointed. That is powerful until two clients reuse the same thread_id, or a deploy changes the graph without migrating stored tuples. Treat checkpoints as part of your schema: version the graph, document serializer upgrades, and test resume paths the way you test database migrations. On Apple Silicon, CPU and unified memory are rarely the first bottleneck—ambiguous threading and unbounded scratch directories are.
Checkpoint backends: SQLite vs Postgres
SQLite shines when one process owns the checkpointer, concurrency is modest, and the file lives on a fast local volume with WAL enabled. Keep it outside iCloud or Dropbox sync roots, and monitor file size during soak tests: long autonomous runs can append many rows per thread. Postgres is the default when multiple workers write checkpoints, you need row-level audit, or centralized storage sits closer to your API tier than the Mac itself. Use a pool, explicit migrations for checkpointer tables, and integration tests that cover clock skew and reconnect storms. If the database is remote while the agent runs on a rented Mac, validate latency to the checkpointer—slow commits turn into visible tail latency on every node.
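Before trusting a checkpointer volume, it helps to measure commit latency directly. The probe below is a sketch using a synthetic table (the schema is a stand-in, not the real checkpointer schema) to surface slow commits before they become per-node tail latency.

```python
import os
import sqlite3
import statistics
import tempfile
import time

# Illustrative probe: time synthetic checkpoint-style commits on the target
# volume. Table and payload are stand-ins, not the checkpointer's own schema.
def probe_commit_latency(db_path: str, samples: int = 20) -> float:
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")  # WAL for single-process agents
    conn.execute("CREATE TABLE IF NOT EXISTS ckpt (thread_id TEXT, blob BLOB)")
    payload = b"x" * 4096
    times = []
    for i in range(samples):
        t0 = time.perf_counter()
        conn.execute("INSERT INTO ckpt VALUES (?, ?)", (f"thr-{i}", payload))
        conn.commit()
        times.append(time.perf_counter() - t0)
    conn.close()
    return statistics.quantiles(times, n=100)[94]  # ~p95 commit latency, seconds

with tempfile.TemporaryDirectory() as d:
    p95 = probe_commit_latency(os.path.join(d, "ckpt.db"))
    print(f"p95 commit: {p95 * 1000:.2f} ms")
```

Run the same probe against the remote Postgres endpoint (with its driver) when the agent lives on a rented Mac; comparing the two numbers tells you whether the checkpointer or the network dominates.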
Interrupts and human-in-the-loop
interrupt is your off-ramp for approvals, CAPTCHAs, or policy gates. Persist enough context that a human can approve out-of-band: checkpoint id, proposed tool args, and a short rationale string. On resume, verify the graph binary matches what wrote the waiting checkpoint; mismatches are how you get “it resumed but skipped a node” bug reports. For remote sessions, add idle TTL: if nobody answers within N minutes, route to a safe failure node so GPUs and workers are not reserved indefinitely.
Layering tool timeouts
Expose three clocks: per-tool execution, subgraph or branch budget, and LLM call budget (first token vs total). Map each timeout to either a structured retry or a terminal error the model is allowed to see. Log correlation ids across tool subprocesses so you can tell a hung HTTP client from a saturated CPU. This lines up with the JSON Schema and circuit-breaker story in OpenClaw tool calls on a remote Mac—reuse the same vocabulary in dashboards.
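A minimal sketch of the first two clocks: each tool runs under the smaller of its own cap and the remaining branch budget, and every outcome carries a correlation id. The constants and the `run_tool` helper are assumptions for illustration; the LLM budget is shown only as constants because its split (first token vs total) lives in the model client, not here.

```python
import subprocess
import time

# Three clocks, sketched with assumed values. LLM budgets are tracked in the
# model client; they appear here only to name the split.
BRANCH_BUDGET_S = 30.0
TOOL_CAP_S = 5.0
LLM_FIRST_TOKEN_S, LLM_TOTAL_S = 10.0, 60.0

def run_tool(cmd: list[str], branch_deadline: float, corr_id: str) -> dict:
    """Run one tool under min(per-tool cap, remaining branch budget)."""
    remaining = branch_deadline - time.monotonic()
    budget = min(TOOL_CAP_S, max(remaining, 0.0))
    if budget <= 0:
        # Terminal error the model is allowed to see.
        return {"corr_id": corr_id, "status": "branch_budget_exhausted"}
    try:
        out = subprocess.run(cmd, capture_output=True, timeout=budget)
        return {"corr_id": corr_id, "status": "ok", "rc": out.returncode}
    except subprocess.TimeoutExpired:
        # Structured retry candidate: the subprocess hung, not the CPU.
        return {"corr_id": corr_id, "status": "tool_timeout"}

deadline = time.monotonic() + BRANCH_BUDGET_S
print(run_tool(["echo", "hello"], deadline, corr_id="req-1"))
```

Keeping `corr_id` in every branch of the return value is what lets dashboards distinguish a hung HTTP client from a saturated CPU across the tool subprocess boundary.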
Sandbox directory quotas
Agents write: cloned repos, downloaded models, diff artifacts, and temporary conversions. Split roots so a runaway download cannot fill the checkpoint volume. Enforce soft warnings at 70–80% and hard stops before the filesystem errors, and alert on inode usage when tools emit many small files. Pair quotas with periodic cleanup jobs tied to thread_id lifecycle events so abandoned sessions do not leak disk.
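The soft/hard thresholds above can be checked per sandbox root with a filesystem stat. A minimal sketch on POSIX, assuming illustrative threshold values; `quota_status` is a hypothetical helper, and it deliberately takes the worst of byte and inode usage so inode storms trip the same alert.

```python
import os

# Illustrative soft/hard quota check per sandbox root (POSIX only).
# Threshold fractions and the helper name are assumptions.
SOFT, HARD = 0.75, 0.90   # fractions of bytes or inodes used

def quota_status(path: str) -> str:
    st = os.statvfs(path)
    byte_used = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inode_used = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    worst = max(byte_used, inode_used)   # inode storms trip this too
    if worst >= HARD:
        return "hard_stop"               # refuse writes before the fs errors
    if worst >= SOFT:
        return "soft_warn"
    return "ok"

print(quota_status("/tmp"))
```

Wiring the `hard_stop` branch into the tool layer (refuse the write, return a structured error) keeps a runaway download from reaching the checkpoint volume at all.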
Acceptance checklist before you ship
- thread_id links to scratch prefixes so deleting a session cascades to its workspace.
- Checkpoint storage is separated from user upload directories and from model caches.
- Per-tool file, directory, and inode ceilings are configured with alerts.
- interrupt waiting states have queue TTL or worker reassignment rules.
- After tool timeout, child processes receive SIGTERM then SIGKILL—no zombies.
- Crash drills: kill -9 the runner, restart, resume from latest checkpoint, assert invariants.
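The SIGTERM-then-SIGKILL item in the checklist can be sketched in a few lines; the grace period is an assumed value. For tools that spawn their own children, start them with `start_new_session=True` and signal the process group with `os.killpg` instead of the single pid.

```python
import subprocess
import sys

# Sketch of SIGTERM-then-SIGKILL escalation after a tool timeout.
# The 5 s grace period is an assumption; tune per tool.
def terminate_tree(proc: subprocess.Popen, grace_s: float = 5.0) -> int:
    proc.terminate()                 # SIGTERM: let the tool flush and exit
    try:
        return proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()                  # SIGKILL: no zombies left behind
        return proc.wait()

p = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
rc = terminate_tree(p)
print("exit code:", rc)              # negative on POSIX = killed by signal
```

Calling `wait()` in both branches is what actually reaps the child; skipping it is how zombies accumulate during long runs.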
Remote Mac long-run operations
Under launchd or a container shim, colocate logs, checkpoints, and scratch on the same local APFS volume so disk alerts are meaningful. Re-run multi-hour jobs on dedicated hardware: observe checkpoint growth curves, WAL behavior, tail latency for tools, and sandbox byte usage over time. This is the same operational mindset as long RAG indexing in our vector quota matrix, except the dominant risk is state drift rather than embedding batch size.
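During a soak run, the growth curves above only exist if something records them. A minimal sampler sketch, with illustrative paths: it snapshots the checkpoint DB size, its SQLite WAL sidecar, and scratch bytes so you can plot drift over a multi-hour job.

```python
import os
import tempfile
import time

# Minimal soak-test sampler. Paths are illustrative; emit one dict per
# interval and plot the series after the run.
def dir_bytes(root: str) -> int:
    total = 0
    for dirpath, _, files in os.walk(root):
        for f in files:
            try:
                total += os.path.getsize(os.path.join(dirpath, f))
            except OSError:
                pass                 # file vanished mid-walk; skip it
    return total

def sample(ckpt_db: str, scratch: str) -> dict:
    wal = ckpt_db + "-wal"           # SQLite's WAL sidecar file
    return {
        "t": time.time(),
        "ckpt_bytes": os.path.getsize(ckpt_db) if os.path.exists(ckpt_db) else 0,
        "wal_bytes": os.path.getsize(wal) if os.path.exists(wal) else 0,
        "scratch_bytes": dir_bytes(scratch),
    }

scratch_dir = tempfile.mkdtemp()
print(sample(os.path.join(scratch_dir, "ckpt.db"), scratch_dir))
```

A WAL that grows while the main DB file stays flat usually means checkpointing (in the SQLite sense) is being starved; that is exactly the signal to schedule the VACUUM window mentioned in the table.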
HowTo recap
Pin thread_id → choose SQLite or Postgres and migrate → define interrupt/resume contracts → layer timeouts → enforce quotas and cleanup in CI → soak on a dedicated remote Mac with realistic traffic.
```bash
# Naming example—adjust prefixes for your org
CHECKPOINT_DIR="$HOME/agent-state/checkpoints"
SANDBOX_WORKSPACE="$HOME/agent-state/scratch/$THREAD_ID"
```