On this page: Latency and cost · Decision matrix · Batch scan method · Cache keys · Failure fallback · FAQ
This playbook targets RAG builders on Mac who must decide between a managed embedding API and a local multilingual model. You will get explicit cost tripwires, a reproducible batch sweep, cache contracts, and a remote Mac soak checklist so benchmarks survive laptops that sleep. Pair it with chunk economics in local RAG chunk and vector quota matrix, index mechanics in FAISS versus sqlite-vec on Mac, and multimodal parallels in CLIP versus SigLIP Core ML EP matrix.
Pain points RAG teams hit on M4
1. Silent dimension drift. Mixing 512 and 1536 vectors in one index destroys recall without obvious crashes; a minimal guard sketch follows this list.
2. Batch tables copied from Linux. Core ML EP fusion changes memory peaks; a naive batch-size-16 setting copied from a CUDA box can OOM on unified memory.
3. API spend spikes during reindex. Overnight backfills multiply token counts; finance only notices after the invoice lands.
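For the drift point above, a cheap guard at upsert time catches the mismatch before it poisons the index. The sketch below is a minimal illustration under stated assumptions: `EXPECTED_DIM` and the error message are placeholders, and the value should come from whatever manifest records your model id and quantization tag.
```python
# Minimal sketch of a dimension guard at upsert time; EXPECTED_DIM is an illustrative assumption
# and should come from the same manifest that records model id, quantization tag, and chunk policy.
EXPECTED_DIM = 1024


def guard_dimension(vector: list[float]) -> list[float]:
    """Refuse vectors whose length differs from the index manifest instead of failing silently."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"dimension drift: got {len(vector)}, index expects {EXPECTED_DIM}")
    return vector
```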
Latency and cost: where each path wins
OpenAI text-embedding-3-small keeps operational load low: you send UTF-8 text, choose a dimensions parameter, and pay per input token at published list rates. Latency is dominated by TLS, routing, and retries—not matmul on your Mac.
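For reference, a minimal call with the official OpenAI Python SDK looks like the sketch below; the input strings and the reduced dimensions value are illustrative assumptions, and the API key is read from the environment.
```python
# Hedged sketch: embed two placeholder chunks with text-embedding-3-small at a reduced dimension.
# Assumes OPENAI_API_KEY is set in the environment; the inputs and dimensions=512 are illustrative.
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["first chunk text", "second chunk text"],
    dimensions=512,  # smaller vectors cut storage and distance math; rerun retrieval regression first
)

vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # expect: 2 512
print(resp.usage.total_tokens)        # billed input tokens for this request
```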
bge-m3 locally moves work onto the Neural Engine and GPU clusters inside M4 when the graph lowers through Core ML EP. Capital cost shifts to engineering hours, regression tests after each macOS release, and unified memory headroom for concurrent retrieval jobs.
Use simple cost thresholds as guardrails. Example policy: if the projected monthly embedding spend at list API price exceeds roughly two engineer-days of fully loaded salary, fund a Core ML track. If p95 embedding latency from your office network exceeds interactive search budgets while local p95 stays inside fifty milliseconds for the same corpus slice, bias toward on-box inference.
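The tripwire is plain arithmetic; the sketch below encodes it with placeholder numbers, so check the current pricing page and your own finance figures before trusting the output.
```python
# Hedged sketch of the spend tripwire; every constant is a placeholder assumption to replace
# with the current pricing page, your fully loaded salary data, and your projected token volume.
LIST_PRICE_PER_M_TOKENS = 0.02              # USD per million input tokens (verify on the pricing page)
ENGINEER_DAY_FULLY_LOADED = 800.0           # USD per engineer-day, fully loaded (assumption)
PROJECTED_TOKENS_PER_MONTH = 2_000_000_000  # projected monthly embedding tokens (assumption)

monthly_api_spend = PROJECTED_TOKENS_PER_MONTH / 1_000_000 * LIST_PRICE_PER_M_TOKENS
tripwire = 2 * ENGINEER_DAY_FULLY_LOADED

if monthly_api_spend > tripwire:
    print(f"${monthly_api_spend:,.0f}/month exceeds the ${tripwire:,.0f} tripwire: fund a Core ML track")
else:
    print(f"${monthly_api_spend:,.0f}/month is under the ${tripwire:,.0f} tripwire: stay on the managed API")
```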
Decision matrix with acceptance hints
| Path | Primary latency driver | Cost shape | Choose when |
|---|---|---|---|
| OpenAI text-embedding-3-small | Network RTT plus provider queue | Linear with billed tokens and chosen dimensions | Small corpora, strict recall SLAs without ML staffing, or an approved exception to an air-gap policy |
| bge-m3 ONNX + Core ML EP | Batch size, EP partition quality, ANE eligibility | Fixed hardware minutes plus engineer time | High volume reindex, multilingual parity, privacy constraints, predictable offline windows |
| Hybrid | Whichever queue saturates first | Blended line items plus cache hits | Interactive queries use API while nightly ingest uses local EP with shared normalization |
Executable batch size scan on Core ML EP
Follow a geometric ladder so you capture knee points without guessing.
- Step 1. Pin tokenizer revision, max sequence length, pooling, and any instruction prefix identical to training assumptions.
- Step 2. Warm the session with one throwaway batch to stabilize caches.
- Step 3. For batch in powers of two from one to sixty-four, record p50 and p95 wall clock, tokens per second, and peak RSS including decode buffers.
- Step 4. Stop when latency grows faster than throughput or when memory crosses eighty percent of your Mac budget.
- Step 5. Compare the winning batch against the same text streamed through OpenAI with identical concurrency to produce an apples-to-apples chart.
- Step 6. Document intra-op thread caps; start near performance-core count then reduce if tail latency widens.
```bash
# Sketch: exponential batch probe (pseudo-metrics hooks)
for B in 1 2 4 8 16 32 64; do
  run_embed_benchmark --model bge-m3-onnx --batch "$B" --ep coreml \
    --log p50 p95 rss_telemetry
done
```
Dimension and quantization prompts
For OpenAI, shrinking dimensions cuts storage and distance math but needs a retrieval regression pass. For bge-m3, prefer INT8 dynamic quantization on linear layers when accuracy holds on your evaluation queries; keep a float32 shadow slice for disputes.
Always log which quantization artifact served each vector so reprocessing can replay deterministically.
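A minimal dynamic-quantization pass with ONNX Runtime looks like the sketch below; the file paths are placeholder assumptions, the float32 export stays on disk as the shadow copy, and you should re-run your evaluation queries against both artifacts before switching traffic.
```python
# Hedged sketch: INT8 dynamic quantization of an exported bge-m3 ONNX graph.
# Paths are placeholder assumptions; keep the float32 model as the shadow slice for disputes.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-m3.onnx",        # float32 export (assumed path)
    model_output="bge-m3.int8.onnx",  # INT8 artifact that will serve vectors
    weight_type=QuantType.QInt8,      # dynamic quantization of linear-layer weights
)
print("wrote bge-m3.int8.onnx; record this artifact tag next to every vector it produces")
```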
Cache keys that survive reindex chaos
Never key caches on raw user strings alone. Combine normalized text hash, model identifier, dimension flag, quantization tag, Core ML EP build id, and tokenizer vocabulary checksum; a minimal key-builder sketch follows the checklist below.
- Include chunking policy version from your chunk matrix so retrievers do not mix incompatible windows.
- Store provider request ids for API rows to reconcile billing disputes quickly.
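One way to express that contract is a single deterministic hash over the named components, as in the sketch below; the field names and example values are illustrative assumptions, not a fixed schema.
```python
# Hedged sketch: deterministic embedding-cache key built from the components listed above.
# Field names and example values are illustrative; adapt them to your own metadata store.
import hashlib
import json
import unicodedata


def embedding_cache_key(
    text: str,
    model_id: str,
    dimensions: int,
    quant_tag: str,
    coreml_ep_build: str,
    tokenizer_vocab_sha256: str,
    chunk_policy_version: str,
) -> str:
    normalized = unicodedata.normalize("NFC", text).strip()
    payload = json.dumps(
        {
            "text_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
            "model_id": model_id,
            "dimensions": dimensions,
            "quant_tag": quant_tag,
            "coreml_ep_build": coreml_ep_build,
            "tokenizer_vocab_sha256": tokenizer_vocab_sha256,
            "chunk_policy_version": chunk_policy_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative usage with placeholder values.
key = embedding_cache_key(
    text="Beispieltext für die Suche",
    model_id="bge-m3-onnx",
    dimensions=1024,
    quant_tag="int8-dynamic-v1",
    coreml_ep_build="ort-1.18.0-coreml",
    tokenizer_vocab_sha256="<vocab checksum>",
    chunk_policy_version="chunks-v3",
)
print(key)
```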
Failure fallback ladder
When Core ML EP rejects an operator, fall back to ONNX CPU EP for that subgraph while alerting telemetry. If CPU saturates, temporarily route overflow to OpenAI with the same normalization path. If API budgets trip, pause interactive expansion and queue offline work until counters cool.
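To make the ladder observable rather than implicit, the routing can live in one function. The sketch below is a minimal illustration under stated assumptions: the three embedder callables, the budget hook, and the exception types are hypothetical stand-ins for your own stack.
```python
# Hedged sketch of the fallback ladder; the embedder callables and the budget hook are
# hypothetical hooks supplied by your own stack, and the exception types are illustrative.
import logging
from typing import Callable, Optional

Embedder = Callable[[list[str]], list[list[float]]]


def embed_with_fallback(
    chunks: list[str],
    coreml_embed: Embedder,
    cpu_embed: Embedder,
    api_embed: Embedder,
    api_budget_remaining: Callable[[], float],
) -> Optional[list[list[float]]]:
    try:
        return coreml_embed(chunks)      # preferred: Core ML EP path
    except RuntimeError:
        logging.warning("Core ML EP rejected the subgraph; falling back to the CPU EP")
    try:
        return cpu_embed(chunks)         # same graph on the ONNX CPU EP
    except TimeoutError:
        logging.warning("CPU EP saturated; routing overflow to the managed API")
    if api_budget_remaining() > 0:
        return api_embed(chunks)         # identical normalization path, remote provider
    logging.warning("API budget tripped; queueing offline work until counters cool")
    return None
```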
Remote rented Mac acceptance for benchmarks
Laptops sleep, Spotlight indexes spike, and VPN jitter pollutes latency. Rent an isolated Mac mini M4 node, install the same binaries, disable opportunistic indexing for the benchmark window, and replay identical batch sweeps for at least two hours.
- Attach tenant tags to metrics so finance can map wall clock to billed minutes.
- Capture thermals: stable fans mean your batch table is trustworthy for capacity planning.
Citable guardrails
- Treat API list price per million input tokens as the baseline denominator when arguing hybrid budgets.
- Record ANE versus GPU placement percentages from ONNX Runtime logs whenever macOS changes (see the logging sketch after this list).
- Keep retrieval nDCG or MRR on a frozen evaluation set whenever dimensions or quantization shift.
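One way to capture placement evidence after a macOS update is to open the session with verbose logging, as in the sketch below; the model path is a placeholder, and the exact wording of ONNX Runtime's partition and placement messages can vary between releases.
```python
# Hedged sketch: open the graph with the Core ML EP and verbose logging so partition and
# node-placement messages reach stderr; the model path is a placeholder assumption.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 0  # 0 = VERBOSE, which includes graph partitioning output

session = ort.InferenceSession(
    "bge-m3.int8.onnx",
    sess_options=opts,
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirms which execution providers actually registered
```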
FAQ
Can I mix OpenAI embeddings with older bge-m3 vectors? Only after re-embedding or projecting with a documented mapper; otherwise distances lie.
Does Core ML EP always beat CPU for tiny batches? Often no—batch one pays fixed dispatch costs. Measure instead of assuming.
Where do I read LLM plus embedding bundles? Use public purchase and pricing pages, then continue with RAG articles linked above—no login wall for browsing.
Next reads: chunk and quota matrix, vector index shootout, Tech Blog index, and operator docs in the Help Center.