llama.cpp vs MLX/oMLX: Architecture Benchmark on Apple Silicon
TL;DR: llama.cpp is 21-59% faster than MLX for LLM inference on Apple Silicon M5 Max. Both use the same Metal API to access the GPU, but llama.cpp has 4 layers vs MLX’s 6 — eliminating Python interpreter overhead, GIL contention, and language boundary crossings. For pure inference workloads, fewer layers = faster.
1. The One Thing That Matters Most
Both stacks ultimately call the same API: Metal. Metal is Apple's low-level GPU API (comparable to Vulkan/DirectX). "MLX is optimized for Apple Silicon" does not mean "MLX is the only way to use the Apple GPU". Anyone can write Metal shaders and talk to the GPU directly. That is exactly what llama.cpp does.
Think of Metal as a highway. MLX is a tour bus built by Apple: comfortable, versatile, able to carry many different ML workloads. llama.cpp's ggml-metal is a tuned race car: it does one thing (transformer inference) and does it extremely fast. Both vehicles drive the same highway (the Metal API) to the same destination (GPU compute).
2. Layer-by-Layer Architecture
oMLX Stack (6 layers)
┌─────────────────────────────────────────────┐
│ Python 3.10+            [user must install] │
│ Interpreter + GIL (Global Interpreter Lock) │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ oMLX Server                        [Python] │
│ FastAPI HTTP wrapper, OpenAI-compatible API │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ mlx-lm                             [Python] │
│ Model loading, tokenization, sampling,      │
│ KV cache                                    │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple MLX Framework           [C++ / Apple] │
│ General ML compute graph, lazy evaluation,  │
│ unified memory                              │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Metal API                          [shared] │
│ Apple's low-level GPU API                   │
│ (shader dispatch, buffer management)        │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple Silicon GPU                           │
│ M5 Max — unified memory, Neural Engine,     │
│ GPU cores                                   │
└─────────────────────────────────────────────┘
llama.cpp Stack (4 layers)
┌─────────────────────────────────────────────┐
│ llama-server                        [C/C++] │
│ HTTP server + model loading + tokenization  │
│ + KV cache + sampling, all in one native    │
│ binary                                      │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ ggml                                    [C] │
│ Tensor library built only for transformers, │
│ no general-purpose ML overhead              │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Metal API                          [shared] │
│ Same API, same highway                      │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple Silicon GPU                           │
│ Same chip, same GPU cores                   │
└─────────────────────────────────────────────┘
Two fewer layers = less overhead:
- No Python interpreter
- No GIL
- No general-purpose ML framework abstraction
3. The Full Journey of a Request
A classify request starts at the Rust server, travels down through every layer to the GPU, and comes back.
oMLX Path (6 hops)
Rust server → HTTP POST → [Python oMLX server] → [mlx-lm (Python)] → MLX (C++) → Metal → GPU
Every request crosses the Python/C++ boundary multiple times, and the Python GIL limits concurrent request throughput.
llama.cpp Path (4 hops)
Rust server → HTTP POST → llama-server (C++) → ggml → Metal → GPU
Native code all the way. No language boundary crossings. Full concurrency (no GIL).
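For concreteness, here is a minimal sketch of that classify call as seen by either server. Both expose the same OpenAI-compatible /v1/chat/completions endpoint (see the test conditions in section 5), so the client code is identical and only the port differs. The port, model id, and prompt wording below are placeholders, not the actual Rust client.

```python
# Minimal sketch of the classify call. Only the endpoint path, the 80-token
# budget, and the {"phase": ..., "scope": ...} output shape come from this
# post; the port, model id, and prompt wording are placeholders.
import json
import urllib.request

def classify(context: str, base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST one phase-classification request to an OpenAI-compatible server."""
    payload = {
        "model": "qwen3.5-4b",      # placeholder model id
        "max_tokens": 80,           # 80-token JSON budget used in the benchmark
        "temperature": 0,
        "messages": [
            {"role": "system", "content":
                'Classify the phase of this coding session. '
                'Reply with JSON: {"phase": "...", "scope": "..."}'},
            {"role": "user", "content": context},
        ],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The model's reply is itself a JSON string, e.g.
    # {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    return json.loads(body["choices"][0]["message"]["content"])
```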
4. Why "Apple-Optimized" Doesn't Mean Fastest
| Factor | MLX (Apple) | ggml-metal (llama.cpp) |
|---|---|---|
| Design goal | General-purpose ML framework (training + inference + research) | Transformer inference only, no other overhead |
| Compute graph | Lazy evaluation, dynamic graph (flexible but adds overhead; illustrated below) | Static, pre-compiled graph (fast but less flexible) |
| Metal shaders | General-purpose kernels (support any ML operation) | Hand-tuned kernels for attention, GEMM, RoPE |
| Language overhead | Python → C++ boundary crossed on every op | C/C++ throughout, no interpreter overhead |
| Concurrency | Python GIL: only one thread executes Python code at a time | Full multi-threading, continuous batching |
| Memory management | Unified memory (an Apple strength) + Python GC overhead | Unified memory (same hardware) + manual management (no GC) |
| Quantization | MLX 4-bit (Apple's own format) | GGUF Q4_K_M (more mature, community-optimized for years) |
| Model support speed | Apple partners with HuggingFace, so new models usually arrive here first | Community-driven, but GGUF is the de facto standard; usually same-day |
| Training | Supports fine-tuning, LoRA, full training | Inference only, no training |
MLX's real strengths are research flexibility and training, not inference speed. For pure inference (our use case), llama.cpp's "do one thing well" philosophy wins. If you need to fine-tune a model, MLX is still the right choice; we only need to run a frozen model.
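To make the lazy-evaluation row above concrete: MLX records operations into a graph and performs no GPU work until the result is forced, which is great for research flexibility but adds per-call bookkeeping that ggml's pre-built graph skips. A small illustration, assuming the mlx Python package is installed (this is not part of the benchmark code):

```python
# Illustration of MLX lazy evaluation (assumes `pip install mlx`).
import mlx.core as mx

a = mx.ones((4, 4))
b = a @ a + 2     # no GPU work yet: this only records nodes in the compute graph
mx.eval(b)        # the graph is materialized and dispatched to Metal here
print(b)          # printing a lazy array also forces evaluation
```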
5. Experiment: Head-to-Head Benchmark
Test Conditions
| Setting | Value |
|---|---|
| Hardware | Apple M5 Max, 128GB unified memory |
| Model (oMLX) | Qwen3.5-4B-MLX-4bit (MLX safetensors, ~2.5GB) |
| Model (llama.cpp) | Qwen3.5-4B-Q4_K_M (GGUF, 2.7GB) |
| Task | Phase classification — system prompt + conversation context → 80 token JSON |
| Rounds | 12 per test |
| Fair play | Dev server stopped. Same model generation. Same prompts. Same hardware. |
| API | Both serve OpenAI-compatible /v1/chat/completions |
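The benchmark script itself (bench_h2h.py, see the footer) is not reproduced here. Below is a minimal sketch of the shape of TEST 1, assuming a classify_fn callable like the one sketched in section 3; the statistics are plain wall-clock percentiles over 12 rounds, and the helper names are illustrative.

```python
# Sketch of the TEST 1 latency loop: time 12 sequential calls for one payload
# and report p50/p90/avg/min/max plus how many replies were valid JSON.
import statistics
import time

def run_rounds(classify_fn, payload: str, rounds: int = 12) -> dict:
    latencies_ms, valid = [], 0
    for _ in range(rounds):
        t0 = time.perf_counter()
        try:
            reply = classify_fn(payload)
            valid += int("phase" in reply)   # well-formed classification
        except Exception:
            pass                             # malformed output counts as invalid
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    latencies_ms.sort()
    return {
        "p50": statistics.median(latencies_ms),
        "p90": latencies_ms[int(0.9 * (rounds - 1))],
        "avg": statistics.mean(latencies_ms),
        "min": latencies_ms[0],
        "max": latencies_ms[-1],
        "valid": f"{valid}/{rounds}",
    }
```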
5.1 — Raw Terminal Output
========================================================================
HEAD-TO-HEAD: oMLX (MLX) vs llama-server (GGUF) — Phase Classification
========================================================================
Rounds per test: 12
Hardware: M5 Max, 128GB unified memory
oMLX (MLX-4bit): READY (281ms warmup)
llama.cpp (GGUF-Q4KM): READY (443ms warmup)
────────────────────────────────────────────────────────────────────────
TEST 1: Latency by payload size
────────────────────────────────────────────────────────────────────────
Payload: small (176 chars)
oMLX (MLX-4bit) p50= 237ms p90= 247ms avg= 238ms min= 231ms max= 247ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 187ms p90= 188ms avg= 196ms min= 177ms max= 298ms valid=12/12
Payload: medium (828 chars)
oMLX (MLX-4bit) p50= 256ms p90= 266ms avg= 260ms min= 253ms max= 298ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 188ms p90= 189ms avg= 203ms min= 178ms max= 394ms valid=12/12
Payload: large (1747 chars)
oMLX (MLX-4bit) p50= 412ms p90= 418ms avg= 395ms min= 325ms max= 451ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 171ms p90= 172ms avg= 193ms min= 161ms max= 480ms valid=12/12
────────────────────────────────────────────────────────────────────────
TEST 2: Sequential throughput (medium payload, 10 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 3.84 calls/sec p50= 257ms p90= 297ms avg= 260ms
llama.cpp (GGUF-Q4KM) 5.38 calls/sec p50= 188ms p90= 195ms avg= 186ms
────────────────────────────────────────────────────────────────────────
TEST 3a: Concurrent throughput (2 slots, 12 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 5.52 calls/sec p50= 360ms p90= 367ms avg= 361ms
llama.cpp (GGUF-Q4KM) 6.21 calls/sec p50= 267ms p90= 576ms avg= 321ms
────────────────────────────────────────────────────────────────────────
TEST 3b: Concurrent throughput (4 slots, 24 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 6.21 calls/sec p50= 644ms p90= 655ms avg= 642ms
llama.cpp (GGUF-Q4KM) 7.92 calls/sec p50= 421ms p90= 886ms avg= 502ms
────────────────────────────────────────────────────────────────────────
TEST 4: Output quality samples (medium payload)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit):
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
llama.cpp (GGUF-Q4KM):
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to use JWT"}
========================================================================
VERDICT
========================================================================
small: oMLX p50=237ms llama.cpp p50=187ms → llama.cpp is 21% faster
medium: oMLX p50=256ms llama.cpp p50=188ms → llama.cpp is 27% faster
large: oMLX p50=412ms llama.cpp p50=171ms → llama.cpp is 59% faster
5.2 — Verdict Cards
| Payload | oMLX p50 | llama.cpp p50 | Difference | Winner |
|---|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | 21% | llama.cpp faster |
| Medium (828 chars) | 256ms | 188ms | 27% | llama.cpp faster |
| Large (1747 chars) | 412ms | 171ms | 59% | llama.cpp faster |
5.3 — Visualized
p50 Latency (ms) — lower is better
| Payload | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | llama.cpp |
| Medium (828 chars) | 256ms | 188ms | llama.cpp |
| Large (1747 chars) | 412ms | 171ms | llama.cpp |
Throughput (calls/sec) — higher is better
| Mode | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Sequential | 3.84/s | 5.38/s | llama.cpp |
| Concurrent (2 slots) | 5.52/s | 6.21/s | llama.cpp |
| Concurrent (4 slots) | 6.21/s | 7.92/s | llama.cpp |
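The concurrent figures in TEST 3a/3b are total completed calls divided by wall-clock time, with N client workers keeping N requests in flight. A minimal sketch follows, again assuming a classify_fn callable; the slot and call counts mirror the tests above, everything else is illustrative.

```python
# Sketch of the concurrent-throughput measurement (TEST 3a/3b shape).
import time
from concurrent.futures import ThreadPoolExecutor

def concurrent_throughput(classify_fn, payload: str, slots: int, calls: int) -> float:
    """Keep `slots` requests in flight until `calls` complete; return calls/sec."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # Client threads spend their time blocked on HTTP, so the measured
        # limit is the server: llama-server batches its slots natively,
        # while the Python stack is throttled by the GIL (section 4).
        list(pool.map(lambda _: classify_fn(payload), range(calls)))
    return calls / (time.perf_counter() - t0)

# e.g. concurrent_throughput(classify, medium_payload, slots=4, calls=24)
```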
5.4 — Where oMLX Wins
Tail latency under high concurrency. At 4 concurrent slots, oMLX p90 = 655ms vs llama.cpp p90 = 886ms. oMLX is more consistent, likely because MLX's lazy evaluation smooths out contention. llama.cpp has higher throughput but occasionally spikes. For background classification (not user-facing), higher throughput with occasional spikes is the better tradeoff.
5.5 — The First Run Was Wrong
Failed Benchmark: Qwen3.5 "Thinking Mode" Trap
First run showed llama.cpp producing 0/8 valid outputs:
llama.cpp (GGUF-Q4KM):
[BAD]
[BAD]
[BAD]
Raw response revealed the problem:
"content": "",
"reasoning_content": "Okay, let's see. The user wants me to classify..."
All 80 tokens consumed by internal reasoning. Content field empty.
Root cause: Qwen3.5 defaults to Chain-of-Thought "thinking mode".
oMLX disabled it via: chat_template_kwargs: {"enable_thinking": false}
llama-server needs: --reasoning off at startup
Fix applied → re-benchmarked → 12/12 valid (100%)
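For reference, this is roughly what the oMLX-side fix looks like in the request body; a minimal sketch, assuming the server forwards chat_template_kwargs to the chat template as described above. llama-server gets its fix as a startup flag instead, so its request body is unchanged.

```python
# Sketch: the classify payload with Qwen3.5 thinking mode disabled on the
# oMLX side. Without this, all 80 tokens land in reasoning_content and the
# content field comes back empty (the failure shown above).
payload = {
    "model": "qwen3.5-4b",                                # placeholder model id
    "max_tokens": 80,
    "messages": [...],                                    # same system + context messages as in section 3
    "chat_template_kwargs": {"enable_thinking": False},   # forwarded to the chat template
}
```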
6. Full Summary Table
| Metric | oMLX (MLX-4bit) | llama.cpp (GGUF-Q4KM) | Winner |
|---|---|---|---|
| p50 latency (small) | 237ms | 187ms | llama.cpp (21% faster) |
| p50 latency (medium) | 256ms | 188ms | llama.cpp (27% faster) |
| p50 latency (large) | 412ms | 171ms | llama.cpp (59% faster) |
| Sequential throughput | 3.84 calls/sec | 5.38 calls/sec | llama.cpp (1.4x) |
| Concurrent (2 slots) | 5.52 calls/sec | 6.21 calls/sec | llama.cpp (1.1x) |
| Concurrent (4 slots) | 6.21 calls/sec | 7.92 calls/sec | llama.cpp (1.3x) |
| p90 tail (4 slots) | 655ms | 886ms | oMLX (more consistent) |
| Quality (valid JSON) | 12/12 (100%) | 12/12 (100%) | Tie |
| Phase agreement | "testing" | "testing" | Same answer |
| Python required | Yes | No | llama.cpp |
| User setup steps | 5 steps | 0 steps | llama.cpp |
7. Dependency Comparison
oMLX: what the user must install
- Python 3.10+
- pip / venv
- mlx-lm
- MLX framework
- oMLX CLI
- manually run omlx serve
6 dependencies, 5 manual steps
llama.cpp: what the user must install
- (nothing)
0 dependencies, 0 manual steps. Binary bundled; model lazy-downloads on click.
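For completeness, a minimal sketch of what "binary bundled, model lazy-download on click" could look like on the app side. The paths, download URL, and port are hypothetical; -m and --port are standard llama-server flags, and the reasoning flag from section 5.5 would be appended at the same point.

```python
# Sketch of the zero-setup path: lazily fetch the GGUF on first use, then
# spawn the bundled llama-server binary. All paths/URLs are hypothetical.
import pathlib
import subprocess
import urllib.request

MODEL_URL = "https://example.com/Qwen3.5-4B-Q4_K_M.gguf"             # placeholder
MODEL_PATH = pathlib.Path.home() / ".cache/app/qwen3.5-4b-q4_k_m.gguf"
SERVER_BIN = pathlib.Path(__file__).parent / "bin" / "llama-server"  # bundled binary

def ensure_model() -> pathlib.Path:
    if not MODEL_PATH.exists():                   # lazy download on first click
        MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH

def start_server() -> subprocess.Popen:
    # -m and --port are standard llama-server flags; the reasoning flag from
    # section 5.5 would be added here as well.
    return subprocess.Popen(
        [str(SERVER_BIN), "-m", str(ensure_model()), "--port", "8080"]
    )
```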
8. Conclusion
"Optimized for Apple Silicon" ≠ fastest.
MLX is Apple's general-purpose ML framework, essentially PyTorch for Apple Silicon. It is optimized for unified memory access, training, and research flexibility.
llama.cpp / ggml is a C library that does LLM inference only: hand-tuned Metal shaders, a static compute graph, zero Python overhead.
Both use the same Metal API to drive the same GPU, but llama.cpp's path is two layers shorter, with no Python GIL and no general-purpose framework abstraction. That is why it is faster for pure inference.
Our use case (80-token classification, fire-and-forget) is a perfect match for llama.cpp's design goals.
Benchmarked 2026-03-29 — M5 Max 128GB — Qwen3.5-4B — llama.cpp b8500 — oMLX 0.2.23. Benchmark script: bench_h2h.py — 12 rounds per test — dev server stopped for fair GPU allocation.