On Apple M4 unified memory, batch size, context length, and KV cache ride the same budget curve. Pick the workload shape first, then tune parameters—otherwise you get a demo that runs once and an operator who cannot reproduce it next week.

This note is a checklist you can run on a Mac mini M4 class machine. It connects to our llama.cpp vs Ollama inference matrix for GGUF-centric stacks, the multi-model routing and cost matrix for gateways, the OpenTelemetry GenAI observability matrix for span fields, and the local RAG chunk and embedding quota matrix so retrieval jobs do not steal RAM from your generator.


Scenario selection

Start from workload geometry, not framework marketing. Interactive assistants care about tail latency and time-to-first-token. Offline eval and labeling care about sustained tokens per second. Research and fine-tuning care about Hugging Face Trainer hooks, PEFT adapters, and reproducible checkpoints. The table below is a decision aid—always confirm with your model size, quantization, and tokenizer on real prompts.

Scenario: Ship inference optimized for Apple Silicon
  MLX-LM (MLX): Default first look; MLX graphs and memory layout align with unified memory; batch and quant paths are usually explicit.
  Transformers (MPS): Works; watch MPS batching, dtype, and attention padding; measure end-to-end, not a micro-benchmark.

Scenario: Stay inside the Hugging Face training stack
  MLX-LM (MLX): Strong for conversion and serving converted weights; the training story depends on your toolchain.
  Transformers (MPS): Trainer, Accelerate, PEFT, and eval harnesses; keep this as the spine if fine-tuning is weekly.

Scenario: Long context plus concurrent sessions
  MLX-LM (MLX): Stress KV residency against one memory envelope; prefer explicit cache reuse policies.
  Transformers (MPS): Same KV physics; isolate interactive and batch jobs across processes with separate caps.

Scenario: Coexist with Ollama or llama.cpp
  MLX-LM (MLX): Parallel track for native MLX weights; do not mix CLI semantics blindly.
  Transformers (MPS): Python-side eval track; route production traffic using the multi-model article above.

Remember that KV cache bytes grow with sequence length and layer count, while batch multiplies active activations and often increases peak memory during prefill. A “safe” batch on paper becomes unsafe when prompts vary wildly in length or when embeddings and the LLM share one machine without scheduling—hence the RAG link above.
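The KV growth above is easy to budget on a napkin. A back-of-envelope sketch (the 8B-class config values in the example are assumptions for illustration; read layer count, KV head count, and head dim from your model's actual config):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Rough KV cache footprint: two tensors (K and V) per layer, each
    shaped [batch, n_kv_heads, seq_len, head_dim], at dtype_bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 8B-class config (assumed values): 32 layers, 8 KV heads (GQA),
# head_dim 128, fp16 cache.
per_session = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)
print(f"{per_session / 2**30:.2f} GiB per 8k-token session")  # → 1.00 GiB per 8k-token session
```

Note that the number scales linearly with both context and concurrent sessions, which is why the table below treats them as one budget.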

Batch size and unified memory reference table

The numbers below are starting points for acceptance testing, not guarantees. Treat Activity Monitor memory pressure, swap, and a ten-minute steady run as the gate—not a sixty-second leaderboard screenshot. Leave headroom for macOS, browsers, and telemetry; on a 24 GB class M4, many teams keep at least 8–12 GB free before calling a setting production-ready.

Workload: Interactive chat (low tail latency)
  Starting batch / context: Batch 1; set context to the shortest window that preserves quality.
  Watch in unified memory: Peak RSS, memory pressure color, swap I/O; regressions in time-to-first-token.

Workload: Offline eval or export
  Starting batch / context: Increase batch and max_tokens in steps; log throughput at each step.
  Watch in unified memory: KV plus large prefill spikes; schedule away from indexing jobs.

Workload: RAG answer synthesis
  Starting batch / context: Cap retrieved chunk tokens before the LLM call; separate embedding batch from decode.
  Watch in unified memory: Combined peak of retriever plus generator; see the RAG matrix for quotas.
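The 8–12 GB headroom guidance can be turned into a mechanical gate before any setting is called production-ready. A minimal sketch, assuming you already have a projected peak from your own measurements (the function name and the 8 GiB default reserve are illustrative; tune per host):

```python
def headroom_ok(total_bytes, projected_peak_bytes, reserve_gib=8):
    """Gate a candidate batch/context setting: does the projected peak leave
    at least `reserve_gib` of unified memory for macOS, browsers, and telemetry?
    The default follows the 8-12 GB guidance above."""
    reserve = reserve_gib * 2**30
    return projected_peak_bytes + reserve <= total_bytes

# 24 GB-class M4: a 14 GiB projected peak passes, an 18 GiB peak does not.
total = 24 * 2**30
print(headroom_ok(total, 14 * 2**30))  # True
print(headroom_ok(total, 18 * 2**30))  # False
```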

Executable MLX-LM placeholders. Replace bracketed values with your model id, prompt file, and temperature. Add --max-kv-size or similar flags when your build exposes them to bound cache growth.

```shell
# One-shot generation (CLI)
python -m mlx_lm.generate --model MODEL_ID_OR_PATH \
  --prompt PROMPT.txt --max-tokens 512 --temp 0.7

# Serve locally (when available in your mlx-lm version)
python -m mlx_lm.server --model MODEL_ID_OR_PATH --port 8080
```

Executable Transformers + MPS placeholders. Enable fallback for ops not on MPS; pin dtype; when batching, set padding side and attention mask explicitly.

```shell
export PYTORCH_ENABLE_MPS_FALLBACK=1
python - <<'PY'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "ORG/MODEL_ID"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid, torch_dtype=torch.float16, device_map="mps"
)
inputs = tok(["Hello world"], return_tensors="pt").to("mps")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
PY
```
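On the padding point: decoder-only models usually want left padding when you batch variable-length prompts, so that generated tokens stay contiguous at the end of each row. A dependency-free sketch of the padding and mask logic (the helper name is illustrative; in practice set the tokenizer's padding_side to "left" and pass padding=True rather than hand-rolling this):

```python
def left_pad(batch_ids, pad_id):
    """Left-pad variable-length token id lists to a common length and build
    the matching attention mask (0 = padding, 1 = real token)."""
    width = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        pad = width - len(ids)
        input_ids.append([pad_id] * pad + ids)
        attention_mask.append([0] * pad + [1] * len(ids))
    return input_ids, attention_mask

ids, mask = left_pad([[5, 6, 7], [9]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 0, 9]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

Forgetting the explicit mask is a common source of silent quality regressions in batched MPS runs, which is why the checklist says to measure end-to-end.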

Acceptance steps and monitoring metrics

Steps.
  1. Freeze model revision, weight format, and tokenizer checksum.
  2. Split scripts for interactive vs offline loads; record p50 and p95 latency separately.
  3. Log prefill and decode token counts, batch, device, cache reuse, and errors per request.
  4. Capture peak memory, swap, and thermal throttling if fans ramp during steady decode.
  5. Map fields to the GenAI observability matrix so dashboards stay portable across environments.
  6. Replay the same harness for two to four hours on a dedicated host to catch sleep, Spotlight, and background sync issues.
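For the per-request logging step, one structured record per request is enough. A minimal sketch; the field names here are placeholders you should map onto your observability schema, not a fixed contract:

```python
import json
import time

def request_record(model_rev, prefill_tokens, decode_tokens, batch,
                   device, cache_hit, error=None):
    """One log line per request: token counts by phase, batch, device,
    cache reuse, and error status. Field names are illustrative."""
    return {
        "ts": time.time(),
        "model_rev": model_rev,
        "prefill_tokens": prefill_tokens,
        "decode_tokens": decode_tokens,
        "batch": batch,
        "device": device,
        "cache_hit": cache_hit,
        "error": error,
    }

rec = request_record("rev-abc123", prefill_tokens=512, decode_tokens=128,
                     batch=1, device="mps", cache_hit=True)
print(json.dumps(rec, sort_keys=True))
```

Keeping the record flat and JSON-serializable makes the replay harness in step 6 trivial to diff across runs.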

Metrics. Track at least time_to_first_token, tokens_per_sec_decode, an estimate of kv_cache_bytes or layer-wise proxy, mem_pressure_peak, and oom_or_fallback_count. Threshold on regression versus your baseline curve instead of a single universal number—model families shift the knee point.

  • Pass: steady tokens per second within agreed band, no sustained swap, tail latency inside SLO, error rate near zero.
  • Investigate: decode stalls with idle GPU gaps—often memory bandwidth or batch mis-sizing.
  • Fail: compression or swap for more than a short burst; repeated MPS fallback warnings; unbounded queue depth in front of the model.
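The pass/investigate/fail rubric can be encoded so acceptance runs gate themselves. A sketch under assumed thresholds; the swap and error-rate cutoffs are illustrative placeholders, and the throughput band should come from your own baseline curve:

```python
def acceptance_gate(tps, tps_band, p95_ms, slo_ms, swap_mb, error_rate,
                    gpu_idle_gap=False):
    """Classify a steady run against the rubric above.
    Thresholds are illustrative; set them from your baseline curve."""
    lo, hi = tps_band
    if swap_mb > 256 or error_rate > 0.01:
        return "fail"          # sustained swap or non-trivial errors
    if gpu_idle_gap or tps < lo:
        return "investigate"   # decode stalls or under-band throughput
    if lo <= tps <= hi and p95_ms <= slo_ms:
        return "pass"
    return "investigate"

print(acceptance_gate(tps=42, tps_band=(35, 50), p95_ms=900, slo_ms=1200,
                      swap_mb=0, error_rate=0.0))  # pass
```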

FAQ

Can I run MLX-LM and Transformers on the same M4? Yes, but use separate processes and explicit memory or concurrency caps so unified memory bandwidth is not silently contended.
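A concurrency cap can be as small as a bounded semaphore wrapped around each stack's generate call, one instance per process. A minimal sketch (the class name and cap value are illustrative; size the cap from your acceptance runs):

```python
import threading

class ConcurrencyCap:
    """Bound in-flight generate() calls in this process so two stacks
    sharing one unified-memory budget cannot both burst at once."""
    def __init__(self, max_inflight):
        self._sem = threading.BoundedSemaphore(max_inflight)

    def run(self, fn, *args, **kwargs):
        # Block until a slot frees up, then run the call.
        with self._sem:
            return fn(*args, **kwargs)

cap = ConcurrencyCap(max_inflight=2)
print(cap.run(lambda x: x * 2, 21))  # 42
```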

Why doesn't the batch size from the table match my real throughput? Prefill-heavy prompts and decode-heavy sessions hit different bottlenecks, and KV reuse changes the curve. Split timings by phase.
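Splitting by phase only needs a timestamp at the first streamed token. A sketch that works against any token iterator; `fake_stream` is a stand-in for your server's streaming client:

```python
import time

def timed_phases(stream):
    """Split a token stream into prefill (time to first token) and
    decode (tokens per second after the first)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        count += 1
        if first is None:
            first = time.perf_counter()  # end of prefill
    end = time.perf_counter()
    ttft = first - start if first else None
    decode_tps = (count - 1) / (end - first) if first and count > 1 and end > first else 0.0
    return ttft, decode_tps

def fake_stream(n=5, delay=0.01):
    # Stand-in for a real streaming response.
    for i in range(n):
        time.sleep(delay)
        yield i

ttft, tps = timed_phases(fake_stream())
print(ttft > 0, tps > 0)  # True True
```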

What should I do after acceptance passes locally? Check in the scripts and metric names with your routing runbook, then run the same package on a remote Mac mini M4 node to validate clock, disk, and long-run stability—especially before you attach customer traffic.

When you are ready to move benchmarks off a laptop, you can browse plans without logging in on our pricing page and pick a node on the purchase page. Product context lives on the homepage; deeper setup notes are in the help center.

Summary: scenario picks the primary stack; batch and KV decide whether unified memory stays operable; reproducible scripts plus observability fields turn a one-off demo into something you can ship.