Picking the smallest model that does the job

MLOps 14 min 2026-01-20

Draft: the full body will be published through the editorial workflow.

Most teams pay for a model one size larger than the task needs. This post walks through the eval workflow we use to pick the smallest serving stack that still meets the task’s quality bar under a fixed VRAM budget, comparing vLLM, SGLang, and llama.cpp head-to-head on the same prompts and the same hardware.
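As a rough sketch of the selection step only: given one measured row per candidate, keep the rows that fit the VRAM budget and clear the quality bar, then take the one with the smallest footprint. Everything named here (`Candidate`, `pick_smallest`, the candidate labels and every number) is a hypothetical placeholder rather than a result from the post, and ranking by peak VRAM is an assumption about what "smallest" means.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One model + serving-stack pairing with its measured numbers (hypothetical)."""
    name: str
    vram_gb: float   # peak VRAM observed while serving the eval prompts
    quality: float   # task eval score, higher is better


def pick_smallest(
    candidates: Sequence[Candidate],
    vram_budget_gb: float,
    quality_bar: float,
) -> Optional[Candidate]:
    """Return the lowest-VRAM candidate that fits the budget and meets the bar."""
    eligible = [
        c for c in candidates
        if c.vram_gb <= vram_budget_gb and c.quality >= quality_bar
    ]
    # min() with default=None avoids raising when nothing qualifies.
    return min(eligible, key=lambda c: c.vram_gb, default=None)


# Placeholder measurements, invented purely to exercise the function.
candidates = [
    Candidate("small-q4", vram_gb=6.0, quality=0.78),
    Candidate("medium-q4", vram_gb=11.0, quality=0.86),
    Candidate("medium-fp16", vram_gb=19.0, quality=0.88),
    Candidate("large-q4", vram_gb=28.0, quality=0.91),
]

choice = pick_smallest(candidates, vram_budget_gb=24.0, quality_bar=0.85)
print(choice.name if choice else "no candidate meets the bar")  # prints medium-q4
```

Returning `None` when nothing qualifies keeps the "no model fits" case explicit, which is usually the signal to revisit either the budget or the quality bar.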