This note is a checklist you can run on a Mac mini M4 class machine. It connects to our llama.cpp vs Ollama inference matrix for GGUF-centric stacks, the multi-model routing and cost matrix for gateways, the OpenTelemetry GenAI observability matrix for span fields, and the local RAG chunk and embedding quota matrix so retrieval jobs do not steal RAM from your generator.
Scenario selection · Batch size vs memory · Acceptance and metrics · FAQ
Scenario selection
Start from workload geometry, not framework marketing. Interactive assistants care about tail latency and time-to-first-token. Offline eval and labeling care about sustained tokens per second. Research and fine-tuning care about Hugging Face Trainer hooks, PEFT adapters, and reproducible checkpoints. The table below is a decision aid—always confirm with your model size, quantization, and tokenizer on real prompts.
| Scenario | MLX-LM (MLX) | Transformers (MPS) |
|---|---|---|
| Ship inference optimized for Apple Silicon | Default first look; MLX graphs and memory layout align with unified memory; batch and quant paths are usually explicit | Works; watch MPS batching, dtype, and attention padding—measure end-to-end, not a micro-benchmark |
| Stay inside the Hugging Face training stack | Strong for conversion and serving converted weights; training story depends on your toolchain | Trainer, Accelerate, PEFT, and eval harnesses—keep this as the spine if fine-tuning is weekly |
| Long context plus concurrent sessions | Stress KV residency against one memory envelope; prefer explicit cache reuse policies | Same KV physics; isolate interactive and batch jobs across processes with separate caps |
| Coexist with Ollama or llama.cpp | Parallel track for native MLX weights; do not mix CLI semantics blindly | Python-side eval track; route production traffic using the multi-model article above |
Remember that KV cache bytes grow with sequence length and layer count, while batch multiplies active activations and often increases peak memory during prefill. A “safe” batch on paper becomes unsafe when prompts vary wildly in length or when embeddings and the LLM share one machine without scheduling—hence the RAG link above.
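A quick way to make that growth concrete is to estimate KV-cache bytes from the model geometry. A minimal sketch, assuming a hypothetical 8B-class config (32 layers, 8 KV heads, head dimension 128, fp16 cache); read the real values from your model's config.json:

```python
# KV cache ~= 2 (K and V) * layers * kv_heads * head_dim
#             * bytes_per_elem * tokens_in_flight
# Placeholder geometry for an assumed 8B-class model; substitute your config.
def kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                   bytes_per_elem=2, seq_len=8192, batch=1):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch

for batch in (1, 2, 4):
    print(f"batch={batch}: ~{kv_cache_bytes(batch=batch) / 2**30:.1f} GiB at 8k context")
```

With those placeholder numbers the cache alone is about 1 GiB per sequence at 8k context, before prefill activation spikes are counted.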
Batch size and unified memory reference table
The numbers below are starting points for acceptance testing, not guarantees. Treat Activity Monitor memory pressure, swap, and a ten-minute steady run as the gate—not a sixty-second leaderboard screenshot. Leave headroom for macOS, browsers, and telemetry; on a 24 GB class M4, many teams keep at least 8–12 GB free before calling a setting production-ready.
| Workload | Starting batch / context | What to watch in unified memory |
|---|---|---|
| Interactive chat (low tail latency) | Batch 1; set context to the shortest window that preserves quality | Peak RSS, memory pressure color, swap I/O; regressions in time-to-first-token |
| Offline eval or export | Increase batch and max_tokens in steps; log throughput at each step (see the ramp sketch after this table) | KV plus large prefill spikes; schedule away from indexing jobs |
| RAG answer synthesis | Cap retrieved chunk tokens before the LLM call; separate embedding batch from decode | Combined peak of retriever plus generator; see the RAG matrix for quotas |
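To turn the offline row into a repeatable ramp, here is a minimal sketch. It assumes `psutil` is installed and a hypothetical `run_batch(batch, max_tokens)` callable that wraps your own generation loop and returns (tokens generated, elapsed seconds); neither is provided by MLX-LM or Transformers.

```python
import psutil  # assumed installed: pip install psutil

HEADROOM_GIB = 10  # keep roughly 8-12 GiB free on a 24 GB-class M4

def ramp(run_batch, batches=(1, 2, 4, 8), max_tokens=512):
    """Step the batch up, log throughput, and stop before headroom is gone."""
    for batch in batches:
        avail_gib = psutil.virtual_memory().available / 2**30
        if avail_gib < HEADROOM_GIB:
            print(f"stop: only {avail_gib:.1f} GiB available before batch={batch}")
            break
        tokens, seconds = run_batch(batch, max_tokens)   # your generation loop
        swap_gib = psutil.swap_memory().used / 2**30
        print(f"batch={batch}: {tokens / seconds:.1f} tok/s, "
              f"avail={avail_gib:.1f} GiB, swap={swap_gib:.1f} GiB")
```

Run each step for the full ten-minute steady window mentioned above rather than a single pass before recording the number.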
Executable MLX-LM placeholders. Replace the uppercase placeholders with your model id, prompt file, and temperature. Add --max-kv-size or similar flags when your build exposes them to bound cache growth.
```bash
# One-shot generation (CLI); --prompt expects text, so read the file via the shell
python -m mlx_lm.generate --model MODEL_ID_OR_PATH \
  --prompt "$(cat PROMPT.txt)" --max-tokens 512 --temp 0.7

# Serve locally (when available in your mlx-lm version)
python -m mlx_lm.server --model MODEL_ID_OR_PATH --port 8080
```
Executable Transformers + MPS placeholders. Enable the CPU fallback for ops not yet implemented on MPS; pin the dtype; when batching, set the padding side and attention mask explicitly.
```bash
export PYTORCH_ENABLE_MPS_FALLBACK=1
python - <<'PY'
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "ORG/MODEL_ID"
tok = AutoTokenizer.from_pretrained(mid)
# For batched prompts, also set tok.padding_side and pass padding=True below.
model = AutoModelForCausalLM.from_pretrained(mid, torch_dtype=torch.float16).to("mps")

inputs = tok(["Hello world"], return_tensors="pt").to("mps")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
PY
```
Acceptance steps and monitoring metrics
Steps:
1. Freeze model revision, weight format, and tokenizer checksum.
2. Split scripts for interactive vs offline loads; record p50 and p95 latency separately.
3. Log prefill and decode token counts, batch, device, cache reuse, and errors per request (a minimal record sketch follows this list).
4. Capture peak memory, swap, and thermal throttling if fans ramp during steady decode.
5. Map fields to the GenAI observability matrix so dashboards stay portable across environments.
6. Replay the same harness for two to four hours on a dedicated host to catch sleep, Spotlight, and background sync issues.
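A minimal per-request record for step 3, written as JSON Lines so one file can feed both the dashboard mapping and later replays; the field names here are illustrative, not a fixed schema, so rename them to match your observability matrix.

```python
import json
import time

def log_request(path, *, model, device, batch, prefill_tokens, decode_tokens,
                ttft_s, decode_s, cache_hit, error=None):
    """Append one request as a JSON line; keys are placeholders to rename."""
    record = {
        "ts": time.time(),
        "model": model,
        "device": device,
        "batch": batch,
        "prefill_tokens": prefill_tokens,
        "decode_tokens": decode_tokens,
        "time_to_first_token_s": ttft_s,
        "tokens_per_sec_decode": decode_tokens / decode_s if decode_s else None,
        "cache_hit": cache_hit,
        "error": error,
    }
    with open(path, "a") as fh:
        fh.write(json.dumps(record) + "\n")
```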
Metrics. Track at least time_to_first_token, tokens_per_sec_decode, an estimate of kv_cache_bytes or a layer-wise proxy, mem_pressure_peak, and oom_or_fallback_count. Threshold on regression versus your baseline curve instead of a single universal number; model families shift the knee point.
- Pass: steady tokens per second within agreed band, no sustained swap, tail latency inside SLO, error rate near zero.
- Investigate: decode stalls with idle GPU gaps—often memory bandwidth or batch mis-sizing.
- Fail: compression or swap for more than a short burst; repeated MPS fallback warnings; unbounded queue depth in front of the model.
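To produce time_to_first_token and tokens_per_sec_decode with the phase split the Investigate rule relies on, a minimal sketch around the Transformers path above; it runs the library's TextIteratorStreamer in a background thread, and the helper name and token counting are illustrative conventions, not part of either stack. With MLX-LM, take the same two timestamps around its streaming generator.

```python
import time
from threading import Thread

from transformers import TextIteratorStreamer

def timed_generate(model, tok, prompt, max_new_tokens=128):
    """Return (time_to_first_token_s, tokens_per_sec_decode, text)."""
    inputs = tok([prompt], return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tok, skip_prompt=True, skip_special_tokens=True)
    start = time.perf_counter()
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, streamer=streamer,
                                max_new_tokens=max_new_tokens, do_sample=False))
    thread.start()
    first, pieces = None, []
    for piece in streamer:                  # decoded text chunks as they arrive
        if first is None:
            first = time.perf_counter()     # first chunk marks the end of prefill
        pieces.append(piece)
    end = time.perf_counter()
    thread.join()
    text = "".join(pieces)
    decode_tokens = len(tok(text, add_special_tokens=False)["input_ids"])
    ttft = (first or end) - start
    tok_per_s = decode_tokens / max(end - (first or end), 1e-9)
    return ttft, tok_per_s, text
```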
FAQ
Can I run MLX-LM and Transformers on the same M4? Yes, but use separate processes and explicit memory or concurrency caps so unified memory bandwidth is not silently contended.
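On the Transformers side, a minimal cap sketch, assuming a recent PyTorch build that exposes the MPS allocator controls; for the MLX process, check your mlx version's documentation for its memory-limit setter rather than assuming a name.

```python
import torch

# Cap this process's MPS allocations to a fraction of the recommended
# working-set size so a co-resident MLX process keeps its share of
# unified memory. The 0.5 here is an assumption; tune it and measure.
if torch.backends.mps.is_available():
    torch.mps.set_per_process_memory_fraction(0.5)
```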
Why does my table batch not match real throughput? Prefill-heavy prompts and decode-heavy sessions hit different bottlenecks; KV reuse changes the curve. Split timings by phase.
What should I do after acceptance passes locally? Check the scripts and metric names into version control alongside your routing runbook, then run the same package on a remote Mac mini M4 node to validate clock, disk, and long-run stability, especially before you attach customer traffic.
When you are ready to move benchmarks off a laptop, you can browse plans without logging in on our pricing page and pick a node on the purchase page. Product context lives on the homepage; deeper setup notes are in the help center.
Summary: scenario picks the primary stack; batch and KV decide whether unified memory stays operable; reproducible scripts plus observability fields turn a one-off demo into something you can ship.