On this page: Pain points · Decision matrix · Remote acceptance checklist · Modelfile and CLI steps · Citable signals · FAQ
This matrix helps you ship multi-skill local inference on a Mac without silently blending personas. Pair it with the baseline llama.cpp versus Ollama inference framing, the gateway economics in multi-model routing and remote cost acceptance, and—when agents keep state—LangGraph checkpoint and sandbox quotas. Export token spans with OpenTelemetry GenAI semantics when you connect dashboards.
Pain points when adapters multiply
1. Hidden state beats new weights. Clients often keep prior messages while you swap tags, so KV cache and system prompts interleave and the model sounds like two specialists arguing.
2. Unified memory is the real budget. Apple Silicon does not give you a clean VRAM wall: context length, concurrent chats, and adapter residency all compete for the same pool, so spreadsheets that assume discrete GPU memory mis-size batches. A back-of-envelope sizing sketch follows this list.
3. Remote nodes add economics noise. Identical digests can still diverge once you add SSH hops, log shipping, and hourly rental burn. Without a checklist you cannot defend rollback to pure local inference.
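
To see why the pool drains faster than a discrete-GPU spreadsheet predicts, run a rough KV-cache estimate. Every constant below is an assumption for a generic 7B-class model (grouped-query attention shrinks the real figure); substitute your model's actual shape before trusting the output.

```
# Back-of-envelope KV-cache sizing in shell arithmetic; all constants are assumptions.
layers=32      # transformer blocks
kv_dim=1024    # kv_heads * head_dim, well below hidden size under GQA
bytes=2        # fp16 per element
ctx=8192       # PARAMETER num_ctx you plan to set
chats=3        # concurrent sessions sharing unified memory

kv_bytes=$((2 * layers * kv_dim * bytes * ctx * chats))  # factor 2: K and V
echo "KV cache ~ $((kv_bytes / 1024 / 1024)) MiB before weights or adapters"
```

With these numbers the cache alone claims about 3 GiB, which is why three long chats can evict an adapter that fit comfortably in a one-session test.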
Decision matrix: swap pattern, memory, throughput versus latency
| Dimension | Single base, sequential adapters | Single base, overlapping sessions |
|---|---|---|
| Adapter switching model | One active tag at a time; lowest blast radius; easiest audits | Multiple tags referenced without queue discipline; highest risk of blended tone |
| VRAM / unified memory signal | Track one context plus one adapter footprint; predictable peaks | Summed KV and adapter pressure can spike without obvious UI cues |
| Throughput versus latency | Prefers steady tokens per second for batch-like jobs | Interactive bursts optimize time to first token; batch wins may hide tail latency |
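
If you want the left column enforced rather than hoped for, Ollama's server reads environment variables that cap residency and parallelism. The values below are a sketch biased toward sequential swaps; tune them against your own memory budget.

```
# Bias the server toward "single base, sequential adapters":
# one resident model, one in-flight request, idle weights dropped after 5m.
OLLAMA_MAX_LOADED_MODELS=1 \
OLLAMA_NUM_PARALLEL=1 \
OLLAMA_KEEP_ALIVE=5m \
ollama serve
```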
Remote node cost acceptance checklist
- Digest parity: ollama list output matches the run card you used on the laptop baseline (a parity sketch follows this checklist).
- Latency envelope: document p50 and p95 round trip separately from model compute so network noise is visible.
- Token economics: tokens per billed hour must clear finance thresholds you already used for gateway routes.
- Log and checkpoint shipping: bytes per hour for telemetry stay inside storage budgets from observability planning.
- Rollback: a signed path returns traffic to local inference when any line item breaches for two consecutive windows.
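
A minimal digest parity sketch, assuming the baseline digest was recorded in a runcard.txt and the rented node answers SSH as rented-mac (both names are placeholders, and the awk field index assumes the current NAME/ID column order of ollama list):

```
# Compare the remote tag's ID against the digest recorded at baseline.
remote_id=$(ssh rented-mac "ollama list" | awk '$1 == "skill-a-v3:latest" {print $2}')
if grep -q "$remote_id" runcard.txt; then
  echo "digest parity OK: $remote_id"
else
  echo "DIGEST MISMATCH: remote=$remote_id" >&2
fi
```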
Modelfile discipline, context defrag, and smoke tests
1. Encode intent. Keep one Modelfile per tag with explicit FROM, ADAPTER, and TEMPLATE blocks so reviewers can see which LoRA pairs with which system prompt; a minimal example follows this list.
2. Create immutable tags. Run ollama create with semantic names, then paste digest hashes next to quantization notes in your operations card.
3. Size context honestly. Set PARAMETER num_ctx and expected parallel chats against unified memory, not an imaginary VRAM ceiling.
4. Defrag before swap. Start a fresh thread or clear history whenever you change adapters so fragments from the previous skill do not ride along.
5. Measure with one prompt. For each tag, capture time to first token and streaming tokens per second using the same user prompt so comparisons stay honest.
6. Promote remote hosts last. Replay the same CLI acceptance on the rented Mac, attach transfer costs, and only then wire agents that fan out across tools. Align tool timeouts with checkpoint storage so LoRA swaps do not fight long-running threads.
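
A minimal Modelfile sketch for one tag, as promised above. The file paths, context size, and system prompt are illustrative assumptions; ADAPTER expects a LoRA built against the exact base named in FROM.

```
# ./models/skill-a/Modelfile (illustrative paths and values)
# Base weights; record this file's hash on the ops card.
FROM ./base-7b-q4_K_M.gguf
# LoRA trained against that exact base; a mismatched base blends voices.
ADAPTER ./skill-a-lora.gguf
# Sized against unified memory and planned parallel chats, not a VRAM wall.
PARAMETER num_ctx 8192
# TEMPLATE is inherited from the base here; override only if the chat format differs.
SYSTEM You are the policy summarization specialist. Answer in five bullets.
```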
```
# Example: build two adapter tags from Modelfiles in ./models
cd ~/models/skill-a
ollama create skill-a-v3 -f Modelfile
ollama list | grep skill-a
cd ../skill-b
ollama create skill-b-v2 -f Modelfile
# Smoke: fixed prompt, swap tags, record TTFT and tokens per second
ollama run skill-a-v3 "Summarize this policy in five bullets: $(cat ./fixture.txt)"
ollama run skill-b-v2 "Summarize this policy in five bullets: $(cat ./fixture.txt)"
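# Optional: --verbose appends prompt/eval counts, durations, and rates,
# a workable proxy for time to first token and streaming tokens per second.
ollama run skill-a-v3 --verbose "Summarize this policy in five bullets: $(cat ./fixture.txt)"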
# Ops: confirm only expected models load before remote promotion
ollama ps
```

Citable runbook snippets
- Every adapter tag lists base GGUF hash, LoRA file hash, PARAMETER num_ctx, and max concurrent sessions on the same card you attach to incidents.
- Hot swap playbooks require a client-side reset step so support teams know to clear history before blaming weights.
- Remote acceptance adds network milliseconds explicitly so product owners do not confuse geography with quantization regressions.
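
One way to keep those fields citable is a tiny per-tag card committed next to the Modelfile. The field names below are an illustrative schema, not a mandated format:

```
# Write the ops card beside the Modelfile; field names are illustrative.
cat > ~/models/skill-a/ops-card.txt <<'CARD'
tag: skill-a-v3
base_gguf_sha256: <hash>
lora_sha256: <hash>
num_ctx: 8192
max_concurrent_sessions: 2
quantization: q4_K_M
ttft_ms_p50: <measured>
tokens_per_sec: <measured>
CARD
```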
FAQ: adapters, memory, remote billing
Voice still mixes after a swap. You likely reused a long conversation. Clear thread state, shorten system instructions, and retry the smoke prompt.
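
In the interactive ollama run REPL, the built-in /clear command drops the session's accumulated context in place; from scripts, a fresh ollama run invocation gives the same guarantee.

```
ollama run skill-a-v3
>>> /clear
```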
Throughput collapsed but latency looks fine. Check whether you increased parallel sessions without raising context. Trim concurrency before touching quantization.
Remote costs jumped without more users. Inspect log shipping, repeated model pulls, and idle GPU hours. Align billing windows with the checklist above before scaling agents.
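
Two quick spot checks before escalating, assuming telemetry lands under one directory on the node (the path is a placeholder; the find and du flags below are the BSD variants that ship with macOS):

```
# Telemetry written in the last hour; compare against the storage budget.
find /var/log/telemetry -type f -mmin -60 -exec du -ch {} + | tail -1
# Idle burn: confirm nothing stays resident between test windows.
ollama ps
```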