This note is a checklist you can run on a Mac mini M4 class machine. It connects to our llama.cpp vs Ollama inference matrix for GGUF-centric stacks, the multi-model routing and cost matrix for gateways, the OpenTelemetry GenAI observability matrix for span fields, and the local RAG chunk and embedding quota matrix so retrieval jobs do not steal RAM from your generator.
Scenario selection · Batch size vs memory · Acceptance and metrics · FAQ
Scenario selection
Start from workload geometry, not framework marketing. Interactive assistants care about tail latency and time-to-first-token. Offline eval and labeling care about sustained tokens per second. Research and fine-tuning care about Hugging Face Trainer hooks, PEFT adapters, and reproducible checkpoints. The table below is a decision aid—always confirm with your model size, quantization, and tokenizer on real prompts.
| Scenario | MLX-LM (MLX) | Transformers (MPS) |
|---|---|---|
| Ship inference optimized for Apple Silicon | Default first look; MLX graphs and memory layout align with unified memory; batch and quant paths are usually explicit | Works; watch MPS batching, dtype, and attention padding—measure end-to-end, not a micro-benchmark |
| Stay inside the Hugging Face training stack | Strong for conversion and serving converted weights; training story depends on your toolchain | Trainer, Accelerate, PEFT, and eval harnesses—keep this as the spine if fine-tuning is weekly |
| Long context plus concurrent sessions | Stress KV residency against one memory envelope; prefer explicit cache reuse policies | Same KV physics; isolate interactive and batch jobs across processes with separate caps |
| Coexist with Ollama or llama.cpp | Parallel track for native MLX weights; do not mix CLI semantics blindly | Python-side eval track; route production traffic using the multi-model article above |
Remember that KV cache bytes grow with sequence length and layer count, while batch multiplies active activations and often increases peak memory during prefill. A “safe” batch on paper becomes unsafe when prompts vary wildly in length or when embeddings and the LLM share one machine without scheduling—hence the RAG link above.
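A quick way to make that growth concrete is to estimate KV-cache bytes from the model geometry. A minimal sketch, assuming a hypothetical 8B-class config (32 layers, 8 KV heads, head dimension 128, fp16 cache); read the real values from your model's config.json:

```python
# KV cache ~= 2 (K and V) * layers * kv_heads * head_dim
#             * bytes_per_elem * tokens_in_flight
# Placeholder geometry for an assumed 8B-class model; substitute your config.
def kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                   bytes_per_elem=2, seq_len=8192, batch=1):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch

for batch in (1, 2, 4):
    print(f"batch={batch}: ~{kv_cache_bytes(batch=batch) / 2**30:.1f} GiB at 8k context")
```

With those placeholder numbers the cache alone is about 1 GiB per sequence at 8k context, before prefill activation spikes are counted.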
Batch size and unified memory reference table
The numbers below are starting points for acceptance testing, not guarantees. Treat Activity Monitor memory pressure, swap, and a ten-minute steady run as the gate—not a sixty-second leaderboard screenshot. Leave headroom for macOS, browsers, and telemetry; on a 24 GB class M4, many teams keep at least 8–12 GB free before calling a setting production-ready.
| Workload | Starting batch / context | What to watch in unified memory |
|---|---|---|
| Interactive chat (low tail latency) | Batch 1; set context to the shortest window that preserves quality | Peak RSS, memory pressure color, swap I/O; regressions in time-to-first-token |
| Offline eval or export | Increase batch and max_tokens in steps; log throughput at each step (see the ramp sketch after this table) | KV plus large prefill spikes; schedule away from indexing jobs |
| RAG answer synthesis | Cap retrieved chunk tokens before the LLM call; separate embedding batch from decode | Combined peak of retriever plus generator; see the RAG matrix for quotas |
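To turn the offline row into a repeatable ramp, here is a minimal sketch. It assumes `psutil` is installed and a hypothetical `run_batch(batch, max_tokens)` callable that wraps your own generation loop and returns (tokens generated, elapsed seconds); neither is provided by MLX-LM or Transformers.

```python
import psutil  # assumed installed: pip install psutil

HEADROOM_GIB = 10  # keep roughly 8-12 GiB free on a 24 GB-class M4

def ramp(run_batch, batches=(1, 2, 4, 8), max_tokens=512):
    """Step the batch up, log throughput, and stop before headroom is gone."""
    for batch in batches:
        avail_gib = psutil.virtual_memory().available / 2**30
        if avail_gib < HEADROOM_GIB:
            print(f"stop: only {avail_gib:.1f} GiB available before batch={batch}")
            break
        tokens, seconds = run_batch(batch, max_tokens)   # your generation loop
        swap_gib = psutil.swap_memory().used / 2**30
        print(f"batch={batch}: {tokens / seconds:.1f} tok/s, "
              f"avail={avail_gib:.1f} GiB, swap={swap_gib:.1f} GiB")
```

Run each step for the full ten-minute steady window mentioned above rather than a single pass before recording the number.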
Executable MLX-LM placeholders. Replace the uppercase placeholders with your model id, prompt file, and temperature. Add --max-kv-size or similar flags when your build exposes them to bound cache growth.
```bash
# One-shot generation (CLI); --prompt expects text, so read the file via the shell
python -m mlx_lm.generate --model MODEL_ID_OR_PATH \
  --prompt "$(cat PROMPT.txt)" --max-tokens 512 --temp 0.7

# Serve locally (when available in your mlx-lm version)
python -m mlx_lm.server --model MODEL_ID_OR_PATH --port 8080
```
Executable Transformers + MPS placeholders. Enable the CPU fallback for ops not yet implemented on MPS; pin the dtype; when batching, set the padding side and attention mask explicitly.
```bash
export PYTORCH_ENABLE_MPS_FALLBACK=1
python - <<'PY'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "ORG/MODEL_ID"
tok = AutoTokenizer.from_pretrained(mid)
# For batched prompts, also set tok.padding_side and pass padding=True below.
model = AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.float16).to("mps")

inputs = tok(["Hello world"], return_tensors="pt").to("mps")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
PY
```
Acceptance steps and monitoring metrics
Steps:
1. Freeze model revision, weight format, and tokenizer checksum.
2. Split scripts for interactive vs offline loads; record p50 and p95 latency separately.
3. Log prefill and decode token counts, batch, device, cache reuse, and errors per request (a minimal record sketch follows this list).
4. Capture peak memory, swap, and thermal throttling if fans ramp during steady decode.
5. Map fields to the GenAI observability matrix so dashboards stay portable across environments.
6. Replay the same harness for two to four hours on a dedicated host to catch sleep, Spotlight, and background sync issues.
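A minimal per-request record for step 3, written as JSON Lines so one file can feed both the dashboard mapping and later replays; the field names here are illustrative, not a fixed schema, so rename them to match your observability matrix.

```python
import json
import time

def log_request(path, *, model, device, batch, prefill_tokens, decode_tokens,
                ttft_s, decode_s, cache_hit, error=None):
    """Append one request as a JSON line; keys are placeholders to rename."""
    record = {
        "ts": time.time(),
        "model": model,
        "device": device,
        "batch": batch,
        "prefill_tokens": prefill_tokens,
        "decode_tokens": decode_tokens,
        "time_to_first_token_s": ttft_s,
        "tokens_per_sec_decode": decode_tokens / decode_s if decode_s else None,
        "cache_hit": cache_hit,
        "error": error,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
```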
Metrics. Track at least time_to_first_token, tokens_per_sec_decode, an estimate of kv_cache_bytes or a layer-wise proxy, mem_pressure_peak, and oom_or_fallback_count. Threshold on regression versus your baseline curve instead of a single universal number; model families shift the knee point.
- Pass: steady tokens per second within agreed band, no sustained swap, tail latency inside SLO, error rate near zero.
- Investigate: decode stalls with idle GPU gaps—often memory bandwidth or batch mis-sizing.
- Fail: compression or swap for more than a short burst; repeated MPS fallback warnings; unbounded queue depth in front of the model.
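To produce time_to_first_token and tokens_per_sec_decode with the phase split the Investigate rule relies on, a minimal sketch around the Transformers path above; it runs the library's TextIteratorStreamer in a background thread, and the helper name and token counting are illustrative conventions, not part of either stack. With MLX-LM, take the same two timestamps around its streaming generator.

```python
import time
from threading import Thread

from transformers import TextIteratorStreamer

def timed_generate(model, tok, prompt, max_new_tokens=128):
    """Return (time_to_first_token_s, tokens_per_sec_decode, text)."""
    inputs = tok([prompt], return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    start = time.perf_counter()
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, streamer=streamer,
                                max_new_tokens=max_new_tokens, do_sample=False))
    thread.start()
    first, pieces = None, []
    for piece in streamer:                  # decoded text chunks as they arrive
        if first is None:
            first = time.perf_counter()     # first chunk marks the end of prefill
        pieces.append(piece)
    end = time.perf_counter()
    thread.join()
    text = "".join(pieces)
    decode_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    ttft = (first or end) - start
    tok_per_s = decode_tokens / max(end - (first or end), 1e-9)
    return ttft, tok_per_s, text
```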
FAQ
Can I run MLX-LM and Transformers on the same M4? Yes, but use separate processes and explicit memory or concurrency caps so unified memory bandwidth is not silently contended.
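On the Transformers side, a minimal cap sketch, assuming a recent PyTorch build that exposes the MPS allocator controls; for the MLX process, check your mlx version's documentation for its memory-limit setter rather than assuming a name.

```python
import torch

# Cap this process's MPS allocations to a fraction of the recommended
# working-set size so a co-resident MLX process keeps its share of
# unified memory. The 0.5 here is an assumption; tune it and measure.
if torch.backends.mps.is_available():
    torch.mps.set_per_process_memory_fraction(0.5)
```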
Why does my table batch not match real throughput? Prefill-heavy prompts and decode-heavy sessions hit different bottlenecks; KV reuse changes the curve. Split timings by phase.
What should I do after acceptance passes locally? Check the scripts and metric names into version control alongside your routing runbook, then run the same package on a remote Mac mini M4 node to validate clock, disk, and long-run stability, especially before you attach customer traffic.
When you are ready to move benchmarks off a laptop, you can browse plans without logging in on our pricing page and pick a node on the purchase page. Product context lives on the homepage; deeper setup notes are in the help center.
Summary: scenario picks the primary stack; batch and KV decide whether unified memory stays operable; reproducible scripts plus observability fields turn a one-off demo into something you can ship.