On this page: Dependencies and directory layout · Gateway tokens and egress · QueryEngine integration · Logs and doctor
This playbook targets RAG and agent engineers who already index documents with LlamaIndex and now must meet production gates on a rented Apple Silicon host. It complements the LlamaIndex Workflows cost matrix and mirrors gateway discipline from the Haystack OpenClaw how-to, but stays inside QueryEngine composition trees and Python tool surfaces.
Pain points:
1. Models propose tool names your tenant never approved, which defeats audit if HTTP slips past policy.
2. Vector queries spike tail latency; without a timeout fuse, the whole answer path stalls and operators blame the LLM.
3. Silent failures teach the model to confabulate sources, so you need a compact failure envelope the ResponseSynthesizer can read.
Dependencies and directory layout
Start from a clean repo on the remote Mac. Use python -m venv .venv, pin llama-index-core plus your vector client, and keep embeddings colocated with the index volume so cold starts do not cross the WAN during demos. Reserve schemas/tools/ for JSON Schema files, src/retrieval/ for wrapped retrievers, tokens/ (mode 0600, ignored by git) for credentials, and logs/jsonl/ for worker output. Document the loopback ports for OpenClaw and any local reranker so CI and humans share one table.
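A minimal scaffold for that layout, as a sketch; the token filename and the .gitignore contents are assumptions, not fixed by this playbook:

```python
from pathlib import Path

# Directory layout from the text; root is a parameter so the sketch
# stays testable outside the real repo.
LAYOUT = ("schemas/tools", "src/retrieval", "tokens", "logs/jsonl")

def scaffold(root: Path) -> None:
    for rel in LAYOUT:
        (root / rel).mkdir(parents=True, exist_ok=True)
    token = root / "tokens" / "gateway.token"  # filename is an assumption
    token.touch()
    token.chmod(0o600)  # owner-only: the Bearer token lives here
    (root / ".gitignore").write_text("tokens/\n")  # keep tokens out of git
```

Run it once per checkout; mkdir with exist_ok=True makes it safe to re-run from CI.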
Gateway tokens and egress restrictions
Align with OpenClaw 2026.4.x practice: install Node 22 LTS, upgrade the CLI, then run openclaw gateway listen bound to 127.0.0.1 with --token-file sourced from the dashboard. Reach the port from your laptop through SSH reverse tunnels only. Treat the Bearer string as a tenant-scoped capability: it should authorize just the tool routes your QueryEngine will ever call, not blanket internet egress from the Python process.
Enforce deny-by-default outbound traffic from the Mac, excepting only the gateway loopback and the vector database socket. Side-effecting HTTP from LlamaIndex tool functions must go through the gateway client class so every path inherits the same headers, correlation ids, and allowlist checks.
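A sketch of that client discipline, using only the standard library so it stays self-contained; the class name, environment variable names, and the /tools/&lt;name&gt; route shape are assumptions, and production code would likely use httpx with explicit connect and read timeouts:

```python
import json
import os
import urllib.request
import uuid

# Checked-in allowlist; reject unknown tool names before bytes hit the wire.
ALLOWED_TOOLS = {"search_ticket", "post_audit_log"}

class GatewayToolClient:
    def __init__(self):
        # Env var names are assumptions, not OpenClaw-defined.
        self.base_url = os.environ.get("OPENCLAW_GATEWAY_URL", "http://127.0.0.1:8800")
        self.token = os.environ.get("OPENCLAW_GATEWAY_TOKEN", "")

    def post(self, name: str, payload: dict) -> urllib.request.Request:
        if name not in ALLOWED_TOOLS:
            raise PermissionError(f"tool {name!r} not in allowlist")
        return urllib.request.Request(
            f"{self.base_url}/tools/{name}",  # route shape is an assumption
            data=json.dumps(payload).encode(),
            headers={
                "Authorization": f"Bearer {self.token}",
                "X-Correlation-Id": str(uuid.uuid4()),
                "Content-Type": "application/json",
            },
            method="POST",
        )
        # Caller sends with urllib.request.urlopen(req, timeout=8)
```

Building the Request separately from sending it keeps the allowlist and header logic unit-testable without a live gateway.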
| Concern | OpenClaw gateway layer | LlamaIndex QueryEngine layer |
|---|---|---|
| Policy | Tool name and route allowlist, Bearer rotation, request size caps. | Retriever choice, chunking, synthesis prompts, metadata filters. |
| Timeouts | HTTP connect and read deadlines for tool POST bodies. | asyncio.wait_for around vector queries; fuse counter and cooldown. |
| Retries | Bounded backoff for transient 429 or connect errors on gateway calls. | Zero or single retry while fuse closed; never loop expensive search. |
| Failure shape | HTTP codes and gateway validation errors. | Structured retrieval_skipped flags fed into synthesis context. |
QueryEngine integration: code-level steps
Work in seven ordered moves so diffs stay reviewable.
1. Wrap the retriever. Subclass or decorate your VectorIndexRetriever so every aretrieve path runs under asyncio.wait_for with a deadline such as 450 ms in development and 900 ms when indexes are warm. Catch TimeoutError and return an empty node list plus metadata retrieval_skipped=true.
2. Add a fuse counter. Track consecutive breaches in module-level state keyed by index name. After two breaches inside a five-minute sliding window, open the fuse: short-circuit further retrieval until cooldown expires and attach fuse_opened_at to metadata.
3. Centralize gateway HTTP. Implement GatewayToolClient.post(name, payload) that reads base URL and token from environment, injects X-Correlation-Id, and rejects any name absent from your checked-in allowlist before bytes hit the wire.
4. Validate payloads locally. Run jsonschema.validate or Pydantic models that mirror schemas/tools/*.json so malformed tool arguments fail fast without spending model turns.
5. Register tools with LlamaIndex. Expose only the wrapped functions as FunctionTool instances passed into QueryEngineTool or agent runners. Never pass raw httpx sessions into prompts.
6. Build the QueryEngine. Instantiate your engine with the fused retriever, attach tools, and extend the response_synthesizer template or callbacks so metadata containing failure envelopes becomes visible to the final prompt as a short bullet, not raw JSON dumps.
7. Map errors to summaries. Translate gateway HTTP failures and fuse events into one object with keys stage, code, correlation_id, and hint. Pass that object through source_nodes metadata or parallel dicts your synthesis step explicitly reads.
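Steps 1 and 2 can be sketched together. FusedRetriever is a hypothetical wrapper, not a LlamaIndex class; real code would subclass or decorate VectorIndexRetriever and attach the metadata to the response rather than returning a tuple:

```python
import asyncio
import time

DEADLINE_S = 0.45   # 450 ms development deadline
FUSE_BREACHES = 2   # consecutive timeouts before the fuse opens
WINDOW_S = 300      # five-minute sliding window
COOLDOWN_S = 120    # keep the fuse open at least this long

class FusedRetriever:
    """Wraps any object exposing aretrieve(); returns (nodes, metadata)."""

    def __init__(self, inner):
        self.inner = inner
        self.breaches = []        # monotonic timestamps of recent timeouts
        self.fuse_opened_at = None

    async def aretrieve(self, query):
        now = time.monotonic()
        if self.fuse_opened_at is not None:
            if now - self.fuse_opened_at < COOLDOWN_S:
                # Fuse open: short-circuit and tell synthesis why.
                return [], {"retrieval_skipped": True,
                            "fuse_opened_at": self.fuse_opened_at}
            self.fuse_opened_at = None  # cooldown elapsed, probe again
            self.breaches.clear()
        try:
            nodes = await asyncio.wait_for(self.inner.aretrieve(query), DEADLINE_S)
            self.breaches.clear()
            return nodes, {"retrieval_skipped": False}
        except asyncio.TimeoutError:
            # Keep only breaches inside the sliding window, then record this one.
            self.breaches = [t for t in self.breaches if now - t < WINDOW_S]
            self.breaches.append(now)
            if len(self.breaches) >= FUSE_BREACHES:
                self.fuse_opened_at = now
            return [], {"retrieval_skipped": True,
                        "fuse_opened_at": self.fuse_opened_at}
```

The empty node list plus retrieval_skipped=true is the failure envelope the synthesis step reads; nothing here raises into the answer path.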
# Conceptual guard: never skip local validation
# ALLOWED_TOOLS = {"search_ticket", "post_audit_log"}
# assert tool_name in ALLOWED_TOOLS
# jsonschema.validate(instance=payload, schema=load_schema(tool_name))
# async with asyncio.timeout(0.45): nodes = await retriever.aretrieve(q)

Logs and doctor inspection
Emit one JSON line per QueryEngine invocation with pipeline id, correlation id, elapsed retrieval milliseconds, fuse state, and redacted tool names. Rotate files daily under logs/jsonl/ so support can join Mac syslog and gateway access logs without grepping prompts.
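A sketch of one such line; the field names are assumptions chosen to match the list above, not a fixed schema:

```python
import json
import time

def invocation_line(pipeline_id, correlation_id, retrieval_ms, fuse_state, tool_names):
    """One JSON line per QueryEngine invocation, tool names only, never arguments."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "pipeline_id": pipeline_id,
        "correlation_id": correlation_id,
        "retrieval_ms": retrieval_ms,
        "fuse_state": fuse_state,    # "closed" or "open"
        "tools": sorted(tool_names), # redacted to names so prompts never leak
    }
    return json.dumps(record, separators=(",", ":"))
```

Append each line to the current file under logs/jsonl/ and rotate daily; the shared correlation_id is what lets support join these lines against gateway access logs.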
After every dependency bump, run openclaw doctor --json and archive the output beside your release notes. Curl /health on the gateway with the production token during deploy hooks; fail the deploy if latency or auth differs from staging. For broader agent patterns, compare notes with the PydanticAI gateway guide, which shares the same JSON Schema and breaker vocabulary.
Citable parameters for runbooks
- Retriever deadline: 450 ms dev, 900 ms warm index; fuse after two consecutive timeouts.
- Gateway HTTP: connect 2 s, read 8 s; backoff base 250 ms with full jitter; no retry on 401.
- Cooldown: keep the fuse open for at least 120 s before probing retrieval again during incidents.
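Those numbers translate into a small retry policy. This is a sketch; MAX_RETRIES and the exact transient-code set are assumptions beyond what the runbook states:

```python
import random

BASE_S = 0.25            # 250 ms backoff base from the runbook
MAX_RETRIES = 3          # assumption; the runbook bounds retries but not the count
NO_RETRY_CODES = {401}   # auth failures never retry
TRANSIENT_CODES = {429}  # rate limits are worth a bounded retry

def backoff_delay(attempt):
    # Full jitter: uniform between 0 and base * 2^attempt.
    return random.uniform(0, BASE_S * (2 ** attempt))

def should_retry(status_code, attempt):
    if status_code in NO_RETRY_CODES:
        return False
    return status_code in TRANSIENT_CODES and attempt < MAX_RETRIES
```

Full jitter spreads concurrent retries across the whole interval, which matters when several workers hit the same 429 at once.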
When you are ready to move from a laptop tunnel to a fleet node, colocate the gateway, indexes, and MLX-friendly embeddings on the same Mac mini M4 class machine so correlation ids stay meaningful under load.