The Numbers: Benchmarking My LLM Gateway on an H100

A couple of weeks ago I wrote about rewriting my LLM gateway to bring it from MVP to production. The architectural claims were multi-tenancy, hybrid inference, and sub-5ms overhead. So I benchmarked it on an H100 against a naive HuggingFace serving baseline. This post covers how my architectural decisions affected the numbers and what the numbers actually mean.
Highlights
At concurrency 4, 100% success on both sides: Gateway delivers responses 4.91× faster end-to-end and 5.66× higher throughput — pure architectural win from continuous batching
At concurrency 1: Gateway shows 60–67× faster TTFT — but this number bundles two things (streaming + inference), and the post explains both honestly
At concurrency 16 sustained: Gateway returns HTTP 429 to 92% of requests by design — that's the rate limiter doing its job, not a system failure
Tested on H100 80GB serving Qwen 2.5 7B with vLLM AWQ vs HF Transformers fp16, both serving the same model
The headline here is not the 60x finding. It's the 4.91× total speedup at c=4 with 100% success on both sides — the cleanest architectural result in the dataset.
How to Read These Numbers
TTFT measurement asymmetry between targets.
The gateway streams responses via Server-Sent Events; TTFT measures true time-to-first-token. The baseline returns non-streaming JSON; for non-streaming targets, TTFT equals total response time because nothing reaches the client until generation completes. This means the dramatic TTFT speedups (60–67×) reflect two combined architectural choices, not just inference engine differences:
Streaming response delivery: the gateway shows partial output immediately; the baseline doesn't respond until generation completes
Inference optimization: vLLM AWQ generates ~1.3× more tokens per second than HF Transformers fp16 on the same H100
The Total speedup is the apples-to-apples inference comparison after streaming differences wash out.
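The measurement asymmetry can be made concrete with a small timing harness. This is a minimal sketch, not the benchmark client used for this post: `measure`, `streaming_response`, and `blocking_response` are hypothetical names, and the sleeps are stand-ins for generation time rather than measured values.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, total_seconds) for any iterable of response chunks."""
    start = time.perf_counter()
    ttft = None
    for _ in chunks:
        if ttft is None:
            # First chunk reached the client: this is time-to-first-token.
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    # An empty response has no first token, so fall back to total time.
    return (ttft if ttft is not None else total), total


def streaming_response(n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical SSE-style target: yields each token as it is generated."""
    for i in range(n_tokens):
        time.sleep(per_token_s)
        yield f"tok{i}"


def blocking_response(n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical JSON target: generates everything, then sends one body."""
    time.sleep(n_tokens * per_token_s)
    yield " ".join(f"tok{i}" for i in range(n_tokens))


if __name__ == "__main__":
    print("streaming:", measure(streaming_response(10, 0.01)))
    print("blocking: ", measure(blocking_response(10, 0.01)))
```

With identical per-token cost, the streaming target reports a TTFT of roughly one token's worth of time, while the blocking target's TTFT equals its total. That gap exists before any difference in inference engine is considered.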
The c=4 numbers tell a different story.
The 67× TTFT speedup at concurrency 4 is not just streaming-related; it's about parallelism. The naive baseline serializes inference via asyncio.Lock, so 4 concurrent client requests become 4 sequential server inferences (its 15.66s p50 is roughly 4× its 3.92s single-request time). The gateway processes all 4 in parallel via vLLM continuous batching. This is the cleanest architectural finding in the dataset, and it's what the rest of the post is really about.
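The serialization effect is easy to reproduce in isolation. The sketch below is an idealized illustration, not the baseline or gateway code: `locked_handler` and `batched_handler` are hypothetical, `asyncio.sleep` stands in for one inference, and real continuous batching does not overlap requests for free the way concurrent sleeps do.

```python
import asyncio
import time

INFERENCE_S = 0.05  # stand-in for one generation; not a measured value


async def locked_handler(lock: asyncio.Lock) -> None:
    """Baseline-style handler: one inference at a time behind a shared lock."""
    async with lock:
        await asyncio.sleep(INFERENCE_S)


async def batched_handler() -> None:
    """Gateway-style handler: requests overlap, as if batched by the engine."""
    await asyncio.sleep(INFERENCE_S)


async def wall_time(handlers) -> float:
    """Run all handlers concurrently and return the elapsed wall-clock time."""
    start = time.perf_counter()
    await asyncio.gather(*handlers)
    return time.perf_counter() - start


async def compare(concurrency: int = 4):
    lock = asyncio.Lock()
    serialized = await wall_time([locked_handler(lock) for _ in range(concurrency)])
    overlapped = await wall_time([batched_handler() for _ in range(concurrency)])
    return serialized, overlapped


if __name__ == "__main__":
    serialized, overlapped = asyncio.run(compare())
    print(f"serialized={serialized:.3f}s overlapped={overlapped:.3f}s")
```

Behind the lock, four requests take about four times one inference; without it they finish in about the time of one. The last request in the locked queue waits for the other three before it even starts.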
The Headline Finding: Continuous Batching at c=4
| Metric | Gateway | Baseline | Speedup |
|---|---|---|---|
| TTFT p50 | 233ms | 15.66s | 67× |
| Total p50 | 3.19s | 15.66s | 4.91× |
| Throughput (req/s) | 1.04 | 0.18 | 5.66× |
| Success rate | 100% | 100% | match |
Both implementations completed 100% of requests at c=4 under the same conditions: same model, same hardware, same inputs. The gateway is ~5× faster end-to-end because vLLM's scheduler runs the four requests in parallel, while the baseline's lock forces it to process them one at a time.
Single-Request: Streaming vs Inference
| Metric | Baseline (MVP) | Gateway | Speedup |
|---|---|---|---|
| TTFT p50 | 3.92s | 59ms | 66.69× |
| Total p50 | 3.92s | 3.00s | 1.31× |
~66× is what users feel, but 1.31× is the true inference-only number. A response that takes 3s in total feels instant to users because it starts in about 60ms.
Concurrency Sweep
| Pattern | c | req/s | tokens/s | Success |
|---|---|---|---|---|
| steady | 4 | 1.04 | 267 | 100% |
| steady | 8 | 1.6 | 405 | 99.1% |
| steady | 16 | 0.9 | 231 | 8.0% |
| burst | 8 | 0.6 | 149 | 50.0% |
| burst | 32 | 1.0 | 246 | 21.9% |
The baseline (the MVP) is absent above c=4 because lock queueing exceeds the client timeout, so it's functionally non-operational under moderate load.
c=16 — Graceful Degradation by Design
At c=16, success drops to 8%; the other 92% get HTTP 429. That's not a failure: the per-tenant limit is rpm=60 (~1 req/s) and the offered load is ~16 req/s. The gateway rejects the overflow with Retry-After, keeps latency bounded on served requests, and refuses to saturate the backend.
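A per-tenant limiter with this behavior can be sketched as a token bucket. The post does not say which algorithm the gateway uses, so treat this as an assumption: `TokenBucket`, `TenantLimiter`, and the `burst` parameter are hypothetical, and the clock is passed in explicitly so the example is deterministic.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate_per_s: float   # refill rate, e.g. rpm / 60
    capacity: float     # maximum burst size
    tokens: float       # tokens currently available
    updated_at: float   # timestamp of the last refill

    def try_acquire(self, now: float) -> Tuple[bool, int]:
        """Return (allowed, retry_after_seconds)."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        # Seconds until one full token is available, rounded up to at least 1.
        wait = (1.0 - self.tokens) / self.rate_per_s
        return False, max(1, math.ceil(wait))


class TenantLimiter:
    """Hypothetical per-tenant limiter; burst=1 is an assumption."""

    def __init__(self, rpm: int, burst: int = 1) -> None:
        self.rpm = rpm
        self.burst = burst
        self.buckets: Dict[str, TokenBucket] = {}

    def check(self, tenant: str, now: float) -> Tuple[int, Dict[str, str]]:
        """Return (http_status, headers) for one incoming request."""
        bucket = self.buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self.rpm / 60.0, float(self.burst), float(self.burst), now)
            self.buckets[tenant] = bucket
        allowed, retry_after = bucket.try_acquire(now)
        if allowed:
            return 200, {}
        return 429, {"Retry-After": str(retry_after)}


def simulate(rpm: int, offered_per_s: int, seconds: int) -> float:
    """Fraction of requests admitted when a tenant sends a steady offered load."""
    limiter = TenantLimiter(rpm=rpm)
    statuses = [
        limiter.check("tenant-a", float(t))[0]
        for t in range(seconds)
        for _ in range(offered_per_s)
    ]
    return statuses.count(200) / len(statuses)
```

Under these assumptions, `simulate(60, 16, 10)` admits 10 of 160 requests, or 6.25%. That is simply 1/16, and it is in the same ballpark as the 8% in the table, though the real limiter and load pattern may differ.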
Two Important Bugs I Caught During Benchmarking
During benchmarking and deployment I caught a few bugs, but these two would have broken the benchmark.
Baseline ran on CPU. A missing device_map caused HuggingFace to default to CPU at ~0.75 tok/s, which would've made the gateway look artificially superior.
Redis EVALSHA SHA mismatch. The ledger hashed its Lua script with SHA-256, but Redis's script cache uses SHA-1, which caused a NoScriptError and a full script resend on every call.
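The EVALSHA mismatch comes down to which digest identifies a cached script. Redis keys its script cache by the hex-encoded SHA-1 of the script body, so any other hash never matches. The sketch below uses only `hashlib`; `LUA_SCRIPT` is a made-up example, not the ledger's actual script.

```python
import hashlib

# Hypothetical ledger script, for illustration only.
LUA_SCRIPT = b"return redis.call('INCRBY', KEYS[1], ARGV[1])"


def evalsha_digest(script: bytes) -> str:
    """Digest Redis uses to identify a cached script: hex-encoded SHA-1."""
    return hashlib.sha1(script).hexdigest()


def wrong_digest(script: bytes) -> str:
    """The mismatch described above: a SHA-256 digest is never a cache key."""
    return hashlib.sha256(script).hexdigest()


if __name__ == "__main__":
    print(len(evalsha_digest(LUA_SCRIPT)), "hex chars (SHA-1)")
    print(len(wrong_digest(LUA_SCRIPT)), "hex chars (SHA-256)")
```

A SHA-1 digest is 40 hex characters and a SHA-256 digest is 64, so a length check on the stored digest is a cheap guard. In redis-py, using the value returned by `script_load`, or letting `register_script` manage the digest, avoids computing it by hand.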
Methodology
H100 80GB on RunPod, local networking, Qwen 2.5 7B (AWQ gateway / fp16 baseline), wall-clock client timing, 429/5xx/timeouts all counted as failures. Baseline tested only at c≤4 (above that, lock queueing times out). Full hardware, software, command matrix, and reproduction steps are in the Benchmark Protocol.
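The failure accounting can be sketched as follows. This is an assumed shape for the client-side summary, not the actual benchmark code: `Sample`, `is_success`, and `summarize` are hypothetical names, and counting only successful requests toward req/s is an assumption.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Sample:
    status: Optional[int]   # HTTP status, or None if the client timed out
    total_s: float          # wall-clock time measured on the client


def is_success(sample: Sample) -> bool:
    """429, 5xx, and timeouts all count as failures; only 2xx succeeds."""
    if sample.status is None:
        return False
    return 200 <= sample.status < 300


def summarize(samples: List[Sample], duration_s: float) -> dict:
    """Success rate, successful requests per second, and p50 total latency."""
    ok = [s for s in samples if is_success(s)]
    return {
        "success_rate": len(ok) / len(samples) if samples else 0.0,
        "req_per_s": len(ok) / duration_s,  # assumption: successes only
        "total_p50_s": median(s.total_s for s in ok) if ok else None,
    }
```

Latency percentiles are computed over successful requests only, so a fast 429 cannot pull the p50 down.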
Conclusion
4.91× at c=4, 100% on both sides — the result that matters.
1.31× inference / ~66× TTFT — both real, measuring different things.
MVP can't survive c>4 — production serving needs a production stack.
429 at c=16 is correct behavior, not failure.
💬 Next post: LLM Gateway on K8s with Karpenter / Knative / KServe


