Pair this page with the Mac vector index USearch versus FAISS matrix for downstream quotas, the local RAG chunk and embedding quota matrix for ingest economics, and the OpenTelemetry GenAI observability matrix for span field names when you export latency histograms.
On this page: Decision matrix · Model format conversion · Thread count · Memory peak · Batch inference queue · Remote acceptance
Decision matrix (CLIP vs SigLIP vs execution path)
Pick a row that matches your workload geometry, then treat the cells as hypotheses you validate with frozen preprocessing and identical tensors across runs.
| Stack focus | When CLIP-style ViT plus text tower fits | When SigLIP-style training signal fits | Vector dimension contract | Batch starting point on M4 class silicon |
|---|---|---|---|---|
| Interactive search and rerank | Mature tooling, wide zoo of ONNX exports, predictable cosine workflows. | Use when captions are noisy and fine-grained text alignment matters more than legacy benchmarks. | Freeze output dim such as 512 or 768 and document L2 normalization before dot products. | Begin near batch one for latency tails, then nudge upward only after thread sweep. |
| Offline catalog ingest | Excellent when ops already standardize on CLIP checkpoints for legacy indexes. | Strong when label noise is high and sigmoid training reduces false neighbors in pilot evals. | Match int8 or float quant tables to the ANN backend in the vector article above. | Sweep batch sizes of eight, sixteen, and thirty-two images while watching peak RSS, not just averages. |
| ONNX Runtime Core ML EP shipping path | Default path when graphs convert cleanly and you want Neural Engine coverage without rewriting Swift. | Same path, but watch attention and layernorm subgraphs for unsupported ops and fallbacks. | Record whether pooled image and text heads stay concatenated or are dual-written, for audit. | Recompute batch after EP promotion because fusion changes buffer lifetimes. |
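The dimension-contract column above calls for documenting L2 normalization before dot products. A minimal sketch of that invariant in plain Python (vector values are illustrative; production code would operate on your frozen 512- or 768-dim embeddings):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a plain dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity via dot product of two L2-normalized vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

Freezing this step in one place means the ANN backend can assume unit-length vectors and skip per-query renormalization.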
Model format conversion
Treat PyTorch to ONNX as the first contract, then ONNX to Core ML or direct Core ML export as a second contract. Log exporter versions, opset, dynamic axes for variable length text, and any inserted adapters. Keep a tiny golden tensor pack so cosine distance between ONNX CPU logits and Core ML EP logits stays below your agreed epsilon before you touch production traffic. When an operator refuses to lower, capture the subgraph name and decide whether rewrite, partial CPU execution, or a different checkpoint is cheaper than endless engineering time.
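The golden-tensor gate above can be as small as a worst-case cosine-distance check between the two runtimes' logits. A hedged sketch; loading the ONNX CPU and Core ML EP outputs is elided, and the default epsilon is a placeholder for whatever value your team agrees on:

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def parity_gate(onnx_logits, coreml_logits, epsilon=1e-3):
    """Fail the EP promotion if any golden tensor drifts past the agreed epsilon."""
    worst = max(cosine_distance(o, c) for o, c in zip(onnx_logits, coreml_logits))
    return worst <= epsilon, worst
```

Log the `worst` value alongside exporter version and opset so a later regression can be bisected to a specific conversion change.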
Thread count
Start intra-op threads near the performance-core count on Apple Silicon, then walk downward if p95 latency grows while throughput barely moves. Inter-op threads should stay conservative when your service already multiplexes HTTP workers because oversubscription shows up as tail inflation, not average GPU charts. Always log thread caps next to batch so remote replays reproduce the same schedule.
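A sketch of the downward walk as a sweep harness; `run_batch` is a hypothetical callable standing in for your inference session, and in practice you would apply each thread cap through the runtime's session options before timing:

```python
import time

def sweep_threads(run_batch, thread_counts, repeats=20):
    """Record p95 latency per intra-op thread cap so the downward walk has data."""
    results = {}
    for threads in thread_counts:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_batch(threads)  # hypothetical: your session call with this cap applied
            samples.append(time.perf_counter() - start)
        samples.sort()
        results[threads] = samples[int(0.95 * (len(samples) - 1))]
    return results
```

Persist the winning cap next to the batch size in your job metadata so remote replays reproduce the same schedule.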
Memory peak
Measure peak resident size during decode and resize phases, not only steady state after caches warm. Large side-by-side image tensors plus text token buffers spike unified memory when batch grows. Leave explicit headroom for macOS file cache and telemetry agents. If swap appears during a soak, treat that as a failed run even when averages look healthy. Align peaks with the RAG quota matrix so generators and retrievers never share one silent budget line.
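For the peak-versus-average distinction, the simplest in-process signal is `ru_maxrss` from the stdlib `resource` module; a sketch that normalizes the platform difference (macOS reports bytes, Linux kilobytes):

```python
import resource
import sys

def peak_rss_bytes():
    """Return the peak resident set size of this process, normalized to bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # macOS reports ru_maxrss in bytes; Linux reports kilobytes.
    return peak if sys.platform == "darwin" else peak * 1024
```

Sample this during decode and resize, not only after caches warm, and chart the maximum per batch step rather than the mean.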
Batch inference queue
Split online and offline queues so bursty ingest cannot starve user-facing rerank calls. Publish a maximum depth, a wait-time SLO, and a load-shedding policy such as dropping lowest-priority shards first. Use back-pressure at the HTTP layer instead of unbounded in-memory buffers because embedding requests are often larger than chat tokens. Persist poison messages to a dead-letter prefix with model revision tags so you can replay after conversion fixes without corrupting the index.
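The bounded-depth and dead-letter policy can be sketched with a stdlib queue; the class and field names here are illustrative, and the `dead_letter` list stands in for a dead-letter prefix in object storage:

```python
import queue

class EmbedQueue:
    """Bounded queue that surfaces back-pressure and parks poison jobs with tags."""

    def __init__(self, max_depth):
        self._q = queue.Queue(maxsize=max_depth)
        self.dead_letter = []  # stand-in for a dead-letter prefix in object storage

    def submit(self, job):
        """Accept a job, or refuse so the HTTP layer can return back-pressure."""
        try:
            self._q.put_nowait(job)
            return True
        except queue.Full:
            return False

    def fail(self, job, model_revision):
        """Park a poison job tagged with the model revision for safe replay later."""
        self.dead_letter.append({"job": job, "model_revision": model_revision})
```

Refusing at `submit` rather than buffering is what keeps a burst of offline ingest from inflating online tail latency.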
Remote node cost acceptance checklist
Finance should see the same artifacts operators use. Work through the list below on a rented host before you scale hours.
- Archive the exact ONNX and Core ML bundles plus checksums in object storage tied to the tenant id.
- Run a six-hour soak with production batch curves while recording images per hour and error counts.
- Compare wall clock from your scheduler with billed minutes including cold start windows.
- Attach OpenTelemetry fields for batch size, thread count, device, and queue wait to each job.
- Sign off only when swap never sustains and p95 latency stays inside the table you captured locally.
- Citable gate: keep the text token cap and image short-edge size within one percent of local benchmarks when validating remote parity.
- Citable gate: store peak RSS samples at least once per batch step in your nightly chart.
- Citable gate: require zero unexplained Core ML fallback warnings during soak or reopen conversion.
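The sign-off gates above can be collapsed into one mechanical check that finance and operators run against the same artifacts; the metric field names below are assumptions for illustration, not a fixed schema:

```python
def accept_remote_run(metrics, local_p95_s, billed_minutes, tolerance=0.01):
    """Apply the checklist: no sustained swap, p95 inside the local table, billing sane."""
    checks = {
        "no_sustained_swap": metrics["sustained_swap_seconds"] == 0,
        "p95_within_local": metrics["p95_latency_s"] <= local_p95_s,
        "billing_matches": abs(metrics["wall_clock_minutes"] - billed_minutes)
                           <= tolerance * billed_minutes,
        "no_fallback_warnings": metrics["coreml_fallback_warnings"] == 0,
    }
    return all(checks.values()), checks
```

Returning the per-gate dict, not just the boolean, gives the acceptance report the line items finance should see.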
FAQ
Does lower dimension always mean cheaper? Not if quantization or ANN parameters shift recall; treat dimension as a contract with the index team.
Should I run SigLIP on CPU for parity tests? Yes briefly, but spend most cycles on the EP you ship so acceptance matches reality.
Summary: Freeze preprocessing and dimension contracts, promote ONNX through Core ML EP with fresh batch and memory sweeps, cap threads and queues explicitly, export telemetry fields, and buy dedicated Apple Silicon hours only after remote soak minutes line up with invoices.