On this page: Latency and cost · Decision matrix · Batch scan method · Cache keys · Failure fallback · FAQ
This playbook targets RAG builders on Mac who must decide between a managed embedding API and a local multilingual model. You will get explicit cost tripwires, a reproducible batch sweep, cache contracts, and a remote Mac soak checklist so benchmarks survive laptops that sleep. Pair it with chunk economics in local RAG chunk and vector quota matrix, index mechanics in FAISS versus sqlite-vec on Mac, and multimodal parallels in CLIP versus SigLIP Core ML EP matrix.
Pain points RAG teams hit on M4
1. Silent dimension drift. Mixing 512 and 1536 vectors in one index destroys recall without obvious crashes; a minimal guard sketch follows this list.
2. Batch tables copied from Linux. Core ML EP fusion changes memory peaks; a naive batch-size-16 setting copied from a CUDA box can OOM on unified memory.
3. API spend spikes during reindex. Overnight backfills multiply token counts; finance only notices after the invoice lands.
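For the drift point above, a cheap guard at upsert time catches the mismatch before it poisons the index. The sketch below is a minimal illustration under stated assumptions: `EXPECTED_DIM` and the error message are placeholders, and the value should come from whatever manifest records your model id and quantization tag.
```python
# Minimal sketch of a dimension guard at upsert time; EXPECTED_DIM is an illustrative assumption
# and should come from the same manifest that records model id, quantization tag, and chunk policy.
EXPECTED_DIM = 1024


def guard_dimension(vector: list[float]) -> list[float]:
    """Refuse vectors whose length differs from the index manifest instead of failing silently."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"dimension drift: got {len(vector)}, index expects {EXPECTED_DIM}")
    return vector
```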
Latency and cost: where each path wins
OpenAI text-embedding-3-small keeps operational load low: you send UTF-8 text, choose a dimensions parameter, and pay per input token at published list rates. Latency is dominated by TLS, routing, and retries—not matmul on your Mac.
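For reference, a minimal call with the official OpenAI Python SDK looks like the sketch below; the input strings and the reduced dimensions value are illustrative assumptions, and the API key is read from the environment.
```python
# Hedged sketch: embed two placeholder chunks with text-embedding-3-small at a reduced dimension.
# Assumes OPENAI_API_KEY is set in the environment; the inputs and dimensions=512 are illustrative.
from openai import OpenAI

client = OpenAI()

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["first chunk text", "second chunk text"],
    dimensions=512,  # smaller vectors cut storage and distance math; rerun retrieval regression first
)

vectors = [item.embedding for item in resp.data]
print(len(vectors), len(vectors[0]))  # expect: 2 512
print(resp.usage.total_tokens)        # billed input tokens for this request
```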
bge-m3 locally moves work onto the Neural Engine and GPU clusters inside M4 when the graph lowers through Core ML EP. Capital cost shifts to engineering hours, regression tests after each macOS release, and unified memory headroom for concurrent retrieval jobs.
Use simple cost thresholds as guardrails. Example policy: if the projected monthly embedding spend at list API price exceeds roughly two engineer-days of fully loaded salary, fund a Core ML track. If p95 embedding latency from your office network exceeds interactive search budgets while local p95 stays inside fifty milliseconds for the same corpus slice, bias toward on-box inference.
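The tripwire is plain arithmetic; the sketch below encodes it with placeholder numbers, so check the current pricing page and your own finance figures before trusting the output.
```python
# Hedged sketch of the spend tripwire; every constant is a placeholder assumption to replace
# with the current pricing page, your fully loaded salary data, and your projected token volume.
LIST_PRICE_PER_M_TOKENS = 0.02              # USD per million input tokens (verify on the pricing page)
ENGINEER_DAY_FULLY_LOADED = 800.0           # USD per engineer-day, fully loaded (assumption)
PROJECTED_TOKENS_PER_MONTH = 2_000_000_000  # projected monthly embedding tokens (assumption)

monthly_api_spend = PROJECTED_TOKENS_PER_MONTH / 1_000_000 * LIST_PRICE_PER_M_TOKENS
tripwire = 2 * ENGINEER_DAY_FULLY_LOADED

if monthly_api_spend > tripwire:
    print(f"${monthly_api_spend:,.0f}/month exceeds the ${tripwire:,.0f} tripwire: fund a Core ML track")
else:
    print(f"${monthly_api_spend:,.0f}/month is under the ${tripwire:,.0f} tripwire: stay on the managed API")
```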
Decision matrix with acceptance hints
| Path | Primary latency driver | Cost shape | Choose when |
|---|---|---|---|
| OpenAI text-embedding-3-small | Network RTT plus provider queue | Linear with billed tokens and chosen dimensions | Small corpora, strict recall SLAs without ML staffing, or an approved exception to an air-gap policy |
| bge-m3 ONNX + Core ML EP | Batch size, EP partition quality, ANE eligibility | Fixed hardware minutes plus engineer time | High volume reindex, multilingual parity, privacy constraints, predictable offline windows |
| Hybrid | Whichever queue saturates first | Blended line items plus cache hits | Interactive queries use API while nightly ingest uses local EP with shared normalization |
Executable batch size scan on Core ML EP
Follow a geometric ladder so you capture knee points without guessing.
- Step 1. Pin tokenizer revision, max sequence length, pooling, and any instruction prefix identical to training assumptions.
- Step 2. Warm the session with one throwaway batch to stabilize caches.
- Step 3. For batch in powers of two from one to sixty-four, record p50 and p95 wall clock, tokens per second, and peak RSS including decode buffers.
- Step 4. Stop when latency grows faster than throughput or when memory crosses eighty percent of your Mac budget.
- Step 5. Compare the winning batch against the same text streamed through OpenAI with identical concurrency to produce an apples-to-apples chart.
- Step 6. Document intra-op thread caps; start near performance-core count then reduce if tail latency widens.
```bash
# Sketch: exponential batch probe (pseudo-metrics hooks)
for B in 1 2 4 8 16 32 64; do
  run_embed_benchmark --model bge-m3-onnx --batch "$B" --ep coreml \
    --log p50 p95 rss_telemetry
done
```
Dimension and quantization prompts
For OpenAI, shrinking dimensions cuts storage and distance math but needs a retrieval regression pass. For bge-m3, prefer INT8 dynamic quantization on linear layers when accuracy holds on your evaluation queries; keep a float32 shadow slice for disputes.
Always log which quantization artifact served each vector so reprocessing can replay deterministically.
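A minimal dynamic-quantization pass with ONNX Runtime looks like the sketch below; the file paths are placeholder assumptions, the float32 export stays on disk as the shadow copy, and you should re-run your evaluation queries against both artifacts before switching traffic.
```python
# Hedged sketch: INT8 dynamic quantization of an exported bge-m3 ONNX graph.
# Paths are placeholder assumptions; keep the float32 model as the shadow slice for disputes.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="bge-m3.onnx",        # float32 export (assumed path)
    model_output="bge-m3.int8.onnx",  # INT8 artifact that will serve vectors
    weight_type=QuantType.QInt8,      # dynamic quantization of linear-layer weights
)
print("wrote bge-m3.int8.onnx; record this artifact tag next to every vector it produces")
```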
Cache keys that survive reindex chaos
Never key caches on raw user strings alone. Combine normalized text hash, model identifier, dimension flag, quantization tag, Core ML EP build id, and tokenizer vocabulary checksum; a minimal key-builder sketch follows the checklist below.
- Include chunking policy version from your chunk matrix so retrievers do not mix incompatible windows.
- Store provider request ids for API rows to reconcile billing disputes quickly.
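One way to express that contract is a single deterministic hash over the named components, as in the sketch below; the field names and example values are illustrative assumptions, not a fixed schema.
```python
# Hedged sketch: deterministic embedding-cache key built from the components listed above.
# Field names and example values are illustrative; adapt them to your own metadata store.
import hashlib
import json
import unicodedata


def embedding_cache_key(
    text: str,
    model_id: str,
    dimensions: int,
    quant_tag: str,
    coreml_ep_build: str,
    tokenizer_vocab_sha256: str,
    chunk_policy_version: str,
) -> str:
    normalized = unicodedata.normalize("NFC", text).strip()
    payload = json.dumps(
        {
            "text_sha256": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
            "model_id": model_id,
            "dimensions": dimensions,
            "quant_tag": quant_tag,
            "coreml_ep_build": coreml_ep_build,
            "tokenizer_vocab_sha256": tokenizer_vocab_sha256,
            "chunk_policy_version": chunk_policy_version,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Illustrative usage with placeholder values.
key = embedding_cache_key(
    text="Beispieltext für die Suche",
    model_id="bge-m3-onnx",
    dimensions=1024,
    quant_tag="int8-dynamic-v1",
    coreml_ep_build="ort-1.18.0-coreml",
    tokenizer_vocab_sha256="<vocab checksum>",
    chunk_policy_version="chunks-v3",
)
print(key)
```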
Failure fallback ladder
When Core ML EP rejects an operator, fall back to ONNX CPU EP for that subgraph while alerting telemetry. If CPU saturates, temporarily route overflow to OpenAI with the same normalization path. If API budgets trip, pause interactive expansion and queue offline work until counters cool.
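To make the ladder observable rather than implicit, the routing can live in one function. The sketch below is a minimal illustration under stated assumptions: the three embedder callables, the budget hook, and the exception types are hypothetical stand-ins for your own stack.
```python
# Hedged sketch of the fallback ladder; the embedder callables and the budget hook are
# hypothetical hooks supplied by your own stack, and the exception types are illustrative.
import logging
from typing import Callable, Optional

Embedder = Callable[[list[str]], list[list[float]]]


def embed_with_fallback(
    chunks: list[str],
    coreml_embed: Embedder,
    cpu_embed: Embedder,
    api_embed: Embedder,
    api_budget_remaining: Callable[[], float],
) -> Optional[list[list[float]]]:
    try:
        return coreml_embed(chunks)      # preferred: Core ML EP path
    except RuntimeError:
        logging.warning("Core ML EP rejected the subgraph; falling back to the CPU EP")
    try:
        return cpu_embed(chunks)         # same graph on the ONNX CPU EP
    except TimeoutError:
        logging.warning("CPU EP saturated; routing overflow to the managed API")
    if api_budget_remaining() > 0:
        return api_embed(chunks)         # identical normalization path, remote provider
    logging.warning("API budget tripped; queueing offline work until counters cool")
    return None
```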
Remote rented Mac acceptance for benchmarks
Laptops sleep, Spotlight indexes spike, and VPN jitter pollutes latency. Rent an isolated Mac mini M4 node, install the same binaries, disable opportunistic indexing for the benchmark window, and replay identical batch sweeps for at least two hours.
- Attach tenant tags to metrics so finance can map wall clock to billed minutes.
- Capture thermals: stable fans mean your batch table is trustworthy for capacity planning.
Citable guardrails
- Treat API list price per million input tokens as the baseline denominator when arguing hybrid budgets.
- Record ANE versus GPU placement percentages from ONNX Runtime logs whenever macOS changes (see the logging sketch after this list).
- Keep retrieval nDCG or MRR on a frozen evaluation set whenever dimensions or quantization shift.
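One way to capture placement evidence after a macOS update is to open the session with verbose logging, as in the sketch below; the model path is a placeholder, and the exact wording of ONNX Runtime's partition and placement messages can vary between releases.
```python
# Hedged sketch: open the graph with the Core ML EP and verbose logging so partition and
# node-placement messages reach stderr; the model path is a placeholder assumption.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 0  # 0 = VERBOSE, which includes graph partitioning output

session = ort.InferenceSession(
    "bge-m3.int8.onnx",
    sess_options=opts,
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirms which execution providers actually registered
```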
FAQ
Can I mix OpenAI embeddings with older bge-m3 vectors? Only after re-embedding or projecting with a documented mapper; otherwise distances lie.
Does Core ML EP always beat CPU for tiny batches? Often no—batch one pays fixed dispatch costs. Measure instead of assuming.
Where do I read LLM plus embedding bundles? Use public purchase and pricing pages, then continue with RAG articles linked above—no login wall for browsing.
Next reads: chunk and quota matrix, vector index shootout, Tech Blog index, and operator docs in the Help Center.