llama.cpp vs MLX/oMLX: Architecture Benchmark on Apple Silicon
TL;DR: llama.cpp is 21-59% faster than MLX for LLM inference on Apple Silicon M5 Max. Both use the same Metal API to access the GPU, but llama.cpp has 4 layers vs MLX’s 6 — eliminating Python interpreter overhead, GIL contention, and language boundary crossings. For pure inference workloads, fewer layers = faster.
1. The One Thing That Matters Most
Both stacks ultimately call the same API: Metal. Metal is Apple's low-level GPU API (comparable to Vulkan/DirectX). "MLX is optimized for Apple Silicon" does not mean "MLX is the only way to use the Apple GPU". Anyone can write Metal shaders and talk to the GPU directly. That is exactly what llama.cpp does.
Think of Metal as a highway. MLX is a tour bus built by Apple: comfortable, versatile, able to carry many different ML workloads. llama.cpp's ggml-metal is a tuned race car: it does one thing (transformer inference) and does it extremely fast. Both vehicles drive the same highway (the Metal API) to the same destination (GPU compute).
2. Layer-by-Layer Architecture
oMLX Stack (6 layers)
┌─────────────────────────────────────────────┐
│ Python 3.10+            [user must install] │
│ Interpreter + GIL (Global Interpreter Lock) │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ oMLX Server                        [Python] │
│ FastAPI HTTP wrapper, OpenAI-compatible API │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ mlx-lm                             [Python] │
│ Model loading, tokenization, sampling,      │
│ KV cache                                    │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple MLX Framework           [C++ / Apple] │
│ General ML compute graph, lazy evaluation,  │
│ unified memory                              │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Metal API                          [shared] │
│ Apple's low-level GPU API                   │
│ (shader dispatch, buffer management)        │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple Silicon GPU                           │
│ M5 Max — unified memory, Neural Engine,     │
│ GPU cores                                   │
└─────────────────────────────────────────────┘
llama.cpp Stack (4 layers)
┌─────────────────────────────────────────────┐
│ llama-server                        [C/C++] │
│ HTTP server + model loading + tokenization  │
│ + KV cache + sampling, all in one native    │
│ binary                                      │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ ggml                                    [C] │
│ Tensor library built only for transformers, │
│ no general-purpose ML overhead              │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Metal API                          [shared] │
│ Same API, same highway                      │
├──────────────────────┬──────────────────────┤
│                      ↓                      │
│ Apple Silicon GPU                           │
│ Same chip, same GPU cores                   │
└─────────────────────────────────────────────┘
Two fewer layers = less overhead:
- No Python interpreter
- No GIL
- No general-purpose ML framework abstraction
3. The Full Journey of a Request
A classify request starts at the Rust server, travels down through every layer to the GPU, and comes back.
oMLX Path (6 hops)
Rust server → HTTP POST → [Python oMLX server] → [mlx-lm (Python)] → MLX (C++) → Metal → GPU
Every request crosses the Python/C++ boundary multiple times, and the Python GIL limits concurrent request throughput.
llama.cpp Path (4 hops)
Rust server → HTTP POST → llama-server (C++) → ggml → Metal → GPU
Native code all the way. No language boundary crossings. Full concurrency (no GIL).
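For concreteness, here is a minimal sketch of that classify call as seen by either server. Both expose the same OpenAI-compatible /v1/chat/completions endpoint (see the test conditions in section 5), so the client code is identical and only the port differs. The port, model id, and prompt wording below are placeholders, not the actual Rust client.

```python
# Minimal sketch of the classify call. Only the endpoint path, the 80-token
# budget, and the {"phase": ..., "scope": ...} output shape come from this
# post; the port, model id, and prompt wording are placeholders.
import json
import urllib.request

def classify(context: str, base_url: str = "http://127.0.0.1:8080") -> dict:
    """POST one phase-classification request to an OpenAI-compatible server."""
    payload = {
        "model": "qwen3.5-4b",      # placeholder model id
        "max_tokens": 80,           # 80-token JSON budget used in the benchmark
        "temperature": 0,
        "messages": [
            {"role": "system", "content":
                'Classify the phase of this coding session. '
                'Reply with JSON: {"phase": "...", "scope": "..."}'},
            {"role": "user", "content": context},
        ],
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The model's reply is itself a JSON string, e.g.
    # {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
    return json.loads(body["choices"][0]["message"]["content"])
```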
4. Why "Apple-Optimized" Doesn't Mean Fastest
| Factor | MLX (Apple) | ggml-metal (llama.cpp) |
|---|---|---|
| Design goal | General-purpose ML framework (training + inference + research) | Transformer inference only, no other overhead |
| Compute graph | Lazy evaluation, dynamic graph (flexible but adds overhead; illustrated below) | Static, pre-compiled graph (fast but less flexible) |
| Metal shaders | General-purpose kernels (support any ML operation) | Hand-tuned kernels for attention, GEMM, RoPE |
| Language overhead | Python → C++ boundary crossed on every op | C/C++ throughout, no interpreter overhead |
| Concurrency | Python GIL: only one thread executes Python code at a time | Full multi-threading, continuous batching |
| Memory management | Unified memory (an Apple strength) + Python GC overhead | Unified memory (same hardware) + manual management (no GC) |
| Quantization | MLX 4-bit (Apple's own format) | GGUF Q4_K_M (more mature, community-optimized for years) |
| Model support speed | Apple partners with HuggingFace, so new models usually arrive here first | Community-driven, but GGUF is the de facto standard; usually same-day |
| Training | Supports fine-tuning, LoRA, full training | Inference only, no training |
MLX's real strengths are research flexibility and training, not inference speed. For pure inference (our use case), llama.cpp's "do one thing well" philosophy wins. If you need to fine-tune a model, MLX is still the right choice; we only need to run a frozen model.
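To make the lazy-evaluation row above concrete: MLX records operations into a graph and performs no GPU work until the result is forced, which is great for research flexibility but adds per-call bookkeeping that ggml's pre-built graph skips. A small illustration, assuming the mlx Python package is installed (this is not part of the benchmark code):

```python
# Illustration of MLX lazy evaluation (assumes `pip install mlx`).
import mlx.core as mx

a = mx.ones((4, 4))
b = a @ a + 2     # no GPU work yet: this only records nodes in the compute graph
mx.eval(b)        # the graph is materialized and dispatched to Metal here
print(b)          # printing a lazy array also forces evaluation
```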
5. Experiment: Head-to-Head Benchmark
Test Conditions
| Setting | Value |
|---|---|
| Hardware | Apple M5 Max, 128GB unified memory |
| Model (oMLX) | Qwen3.5-4B-MLX-4bit (MLX safetensors, ~2.5GB) |
| Model (llama.cpp) | Qwen3.5-4B-Q4_K_M (GGUF, 2.7GB) |
| Task | Phase classification — system prompt + conversation context → 80 token JSON |
| Rounds | 12 per test |
| Fair play | Dev server stopped. Same model generation. Same prompts. Same hardware. |
| API | Both serve OpenAI-compatible /v1/chat/completions |
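The benchmark script itself (bench_h2h.py, see the footer) is not reproduced here. Below is a minimal sketch of the shape of TEST 1, assuming a classify_fn callable like the one sketched in section 3; the statistics are plain wall-clock percentiles over 12 rounds, and the helper names are illustrative.

```python
# Sketch of the TEST 1 latency loop: time 12 sequential calls for one payload
# and report p50/p90/avg/min/max plus how many replies were valid JSON.
import statistics
import time

def run_rounds(classify_fn, payload: str, rounds: int = 12) -> dict:
    latencies_ms, valid = [], 0
    for _ in range(rounds):
        t0 = time.perf_counter()
        try:
            reply = classify_fn(payload)
            valid += int("phase" in reply)   # well-formed classification
        except Exception:
            pass                             # malformed output counts as invalid
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    latencies_ms.sort()
    return {
        "p50": statistics.median(latencies_ms),
        "p90": latencies_ms[int(0.9 * (rounds - 1))],
        "avg": statistics.mean(latencies_ms),
        "min": latencies_ms[0],
        "max": latencies_ms[-1],
        "valid": f"{valid}/{rounds}",
    }
```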
5.1 — Raw Terminal Output
========================================================================
HEAD-TO-HEAD: oMLX (MLX) vs llama-server (GGUF) — Phase Classification
========================================================================
Rounds per test: 12
Hardware: M5 Max, 128GB unified memory
oMLX (MLX-4bit): READY (281ms warmup)
llama.cpp (GGUF-Q4KM): READY (443ms warmup)
────────────────────────────────────────────────────────────────────────
TEST 1: Latency by payload size
────────────────────────────────────────────────────────────────────────
Payload: small (176 chars)
oMLX (MLX-4bit) p50= 237ms p90= 247ms avg= 238ms min= 231ms max= 247ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 187ms p90= 188ms avg= 196ms min= 177ms max= 298ms valid=12/12
Payload: medium (828 chars)
oMLX (MLX-4bit) p50= 256ms p90= 266ms avg= 260ms min= 253ms max= 298ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 188ms p90= 189ms avg= 203ms min= 178ms max= 394ms valid=12/12
Payload: large (1747 chars)
oMLX (MLX-4bit) p50= 412ms p90= 418ms avg= 395ms min= 325ms max= 451ms valid=12/12
llama.cpp (GGUF-Q4KM) p50= 171ms p90= 172ms avg= 193ms min= 161ms max= 480ms valid=12/12
────────────────────────────────────────────────────────────────────────
TEST 2: Sequential throughput (medium payload, 10 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 3.84 calls/sec p50= 257ms p90= 297ms avg= 260ms
llama.cpp (GGUF-Q4KM) 5.38 calls/sec p50= 188ms p90= 195ms avg= 186ms
────────────────────────────────────────────────────────────────────────
TEST 3a: Concurrent throughput (2 slots, 12 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 5.52 calls/sec p50= 360ms p90= 367ms avg= 361ms
llama.cpp (GGUF-Q4KM) 6.21 calls/sec p50= 267ms p90= 576ms avg= 321ms
────────────────────────────────────────────────────────────────────────
TEST 3b: Concurrent throughput (4 slots, 24 calls)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit) 6.21 calls/sec p50= 644ms p90= 655ms avg= 642ms
llama.cpp (GGUF-Q4KM) 7.92 calls/sec p50= 421ms p90= 886ms avg= 502ms
────────────────────────────────────────────────────────────────────────
TEST 4: Output quality samples (medium payload)
────────────────────────────────────────────────────────────────────────
oMLX (MLX-4bit):
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
llama.cpp (GGUF-Q4KM):
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to JWT"}
[OK] {"phase": "testing", "scope": "Refactor auth middleware to use JWT"}
========================================================================
VERDICT
========================================================================
small: oMLX p50=237ms llama.cpp p50=187ms → llama.cpp is 21% faster
medium: oMLX p50=256ms llama.cpp p50=188ms → llama.cpp is 27% faster
large: oMLX p50=412ms llama.cpp p50=171ms → llama.cpp is 59% faster
5.2 — Verdict Cards
| Payload | oMLX p50 | llama.cpp p50 | Difference | Winner |
|---|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | 21% | llama.cpp faster |
| Medium (828 chars) | 256ms | 188ms | 27% | llama.cpp faster |
| Large (1747 chars) | 412ms | 171ms | 59% | llama.cpp faster |
5.3 — Visualized
p50 Latency (ms) — lower is better
| Payload | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Small (176 chars) | 237ms | 187ms | llama.cpp |
| Medium (828 chars) | 256ms | 188ms | llama.cpp |
| Large (1747 chars) | 412ms | 171ms | llama.cpp |
Throughput (calls/sec) — higher is better
| Mode | oMLX | llama.cpp | Winner |
|---|---|---|---|
| Sequential | 3.84/s | 5.38/s | llama.cpp |
| Concurrent (2 slots) | 5.52/s | 6.21/s | llama.cpp |
| Concurrent (4 slots) | 6.21/s | 7.92/s | llama.cpp |
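The concurrent figures in TEST 3a/3b are total completed calls divided by wall-clock time, with N client workers keeping N requests in flight. A minimal sketch follows, again assuming a classify_fn callable; the slot and call counts mirror the tests above, everything else is illustrative.

```python
# Sketch of the concurrent-throughput measurement (TEST 3a/3b shape).
import time
from concurrent.futures import ThreadPoolExecutor

def concurrent_throughput(classify_fn, payload: str, slots: int, calls: int) -> float:
    """Keep `slots` requests in flight until `calls` complete; return calls/sec."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=slots) as pool:
        # Client threads spend their time blocked on HTTP, so the measured
        # limit is the server: llama-server batches its slots natively,
        # while the Python stack is throttled by the GIL (section 4).
        list(pool.map(lambda _: classify_fn(payload), range(calls)))
    return calls / (time.perf_counter() - t0)

# e.g. concurrent_throughput(classify, medium_payload, slots=4, calls=24)
```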
5.4 — Where oMLX Wins
Tail latency under high concurrency. At 4 concurrent slots, oMLX p90 = 655ms vs llama.cpp p90 = 886ms. oMLX is more consistent, likely because MLX's lazy evaluation smooths out contention. llama.cpp has higher throughput but occasionally spikes. For background classification (not user-facing), higher throughput with occasional spikes is the better tradeoff.
5.5 — The First Run Was Wrong
Failed Benchmark: Qwen3.5 "Thinking Mode" Trap
First run showed llama.cpp producing 0/8 valid outputs:
llama.cpp (GGUF-Q4KM):
[BAD]
[BAD]
[BAD]
Raw response revealed the problem:
"content": "",
"reasoning_content": "Okay, let's see. The user wants me to classify..."
All 80 tokens consumed by internal reasoning. Content field empty.
Root cause: Qwen3.5 defaults to Chain-of-Thought "thinking mode".
oMLX disabled it via: chat_template_kwargs: {"enable_thinking": false}
llama-server needs: --reasoning off at startup
Fix applied → re-benchmarked → 12/12 valid (100%)
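For reference, this is roughly what the oMLX-side fix looks like in the request body; a minimal sketch, assuming the server forwards chat_template_kwargs to the chat template as described above. llama-server gets its fix as a startup flag instead, so its request body is unchanged.

```python
# Sketch: the classify payload with Qwen3.5 thinking mode disabled on the
# oMLX side. Without this, all 80 tokens land in reasoning_content and the
# content field comes back empty (the failure shown above).
payload = {
    "model": "qwen3.5-4b",                                # placeholder model id
    "max_tokens": 80,
    "messages": [...],                                    # same system + context messages as in section 3
    "chat_template_kwargs": {"enable_thinking": False},   # forwarded to the chat template
}
```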
6. Full Summary Table
| Metric | oMLX (MLX-4bit) | llama.cpp (GGUF-Q4KM) | Winner |
|---|---|---|---|
| p50 latency (small) | 237ms | 187ms | llama.cpp (21% faster) |
| p50 latency (medium) | 256ms | 188ms | llama.cpp (27% faster) |
| p50 latency (large) | 412ms | 171ms | llama.cpp (59% faster) |
| Sequential throughput | 3.84 calls/sec | 5.38 calls/sec | llama.cpp (1.4x) |
| Concurrent (2 slots) | 5.52 calls/sec | 6.21 calls/sec | llama.cpp (1.1x) |
| Concurrent (4 slots) | 6.21 calls/sec | 7.92 calls/sec | llama.cpp (1.3x) |
| p90 tail (4 slots) | 655ms | 886ms | oMLX (more consistent) |
| Quality (valid JSON) | 12/12 (100%) | 12/12 (100%) | Tie |
| Phase agreement | "testing" | "testing" | Same answer |
| Python required | Yes | No | llama.cpp |
| User setup steps | 5 steps | 0 steps | llama.cpp |
7. Dependency Comparison
oMLX: what the user must install
- Python 3.10+
- pip / venv
- mlx-lm
- MLX framework
- oMLX CLI
- manually run omlx serve
6 dependencies, 5 manual steps
llama.cpp: what the user must install
- (nothing)
0 dependencies, 0 manual steps. Binary bundled; model lazy-downloads on click.
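For completeness, a minimal sketch of what "binary bundled, model lazy-download on click" could look like on the app side. The paths, download URL, and port are hypothetical; -m and --port are standard llama-server flags, and the reasoning flag from section 5.5 would be appended at the same point.

```python
# Sketch of the zero-setup path: lazily fetch the GGUF on first use, then
# spawn the bundled llama-server binary. All paths/URLs are hypothetical.
import pathlib
import subprocess
import urllib.request

MODEL_URL = "https://example.com/Qwen3.5-4B-Q4_K_M.gguf"             # placeholder
MODEL_PATH = pathlib.Path.home() / ".cache/app/qwen3.5-4b-q4_k_m.gguf"
SERVER_BIN = pathlib.Path(__file__).parent / "bin" / "llama-server"  # bundled binary

def ensure_model() -> pathlib.Path:
    if not MODEL_PATH.exists():                   # lazy download on first click
        MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    return MODEL_PATH

def start_server() -> subprocess.Popen:
    # -m and --port are standard llama-server flags; the reasoning flag from
    # section 5.5 would be added here as well.
    return subprocess.Popen(
        [str(SERVER_BIN), "-m", str(ensure_model()), "--port", "8080"]
    )
```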
8. Conclusion
"Optimized for Apple Silicon" ≠ fastest.
MLX is Apple's general-purpose ML framework, essentially PyTorch for Apple Silicon. It is optimized for unified memory access, training, and research flexibility.
llama.cpp / ggml is a C library that does LLM inference only: hand-tuned Metal shaders, a static compute graph, zero Python overhead.
Both use the same Metal API to drive the same GPU, but llama.cpp's path is two layers shorter, with no Python GIL and no general-purpose framework abstraction. That is why it is faster for pure inference.
Our use case (80-token classification, fire-and-forget) is a perfect match for llama.cpp's design goals.
Benchmarked 2026-03-29 — M5 Max 128GB — Qwen3.5-4B — llama.cpp b8500 — oMLX 0.2.23. Benchmark script: bench_h2h.py — 12 rounds per test — dev server stopped for fair GPU allocation.