LLM Gateway Benchmark on H100: vLLM vs HuggingFace

A couple of weeks ago I wrote about rewriting my LLM gateway to bring it from MVP to production. The architectural claims were multi-tenancy, hybrid inference, and sub-5ms overhead. So I benchmarked it on an H100 against a naive HuggingFace serving baseline. This post covers how my architectural decisions affected the numbers and what the numbers actually mean.

Highlights

At concurrency 4, 100% success on both sides: Gateway delivers responses 4.91× faster end-to-end and 5.66× higher throughput — pure architectural win from continuous batching
At concurrency 1: Gateway shows 60–67× faster TTFT — but this number bundles two things (streaming + inference), and the post explains both honestly
At concurrency 16 sustained: Gateway returns HTTP 429 to 92% of requests by design — that's the rate limiter doing its job, not a system failure
Tested on an H100 80GB with Qwen 2.5 7B on both sides: vLLM AWQ for the gateway, HF Transformers fp16 for the baseline

The headline here is not the 60x finding. It's the 4.91× total speedup at c=4 with 100% success on both sides — the cleanest architectural result in the dataset.

How to Read These Numbers

TTFT measurement asymmetry between targets.

The gateway streams responses via Server-Sent Events; TTFT measures true time-to-first-token. The baseline returns non-streaming JSON; for non-streaming targets, TTFT equals total response time because nothing reaches the client until generation completes. This means the dramatic TTFT speedups (60–67×) reflect two combined architectural choices, not just inference engine differences:

Streaming response delivery: the gateway shows partial output immediately; the baseline doesn't respond until generation completes.
Inference optimization: vLLM AWQ generates about 1.3× as many tokens per second as HF Transformers fp16 on the same H100.

The Total speedup is the apples-to-apples inference comparison after streaming differences wash out.
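To make the asymmetry concrete, here is a minimal sketch of client-side timing for a streaming and a non-streaming response. It is not the benchmark harness: `FakeClock`, `measure`, and the two simulated responses are hypothetical names, and the delays are simply the p50 values from the single-request table fed into a manually advanced clock.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


class FakeClock:
    """Manually advanced clock so the example is deterministic (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


def measure(
    chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> Tuple[float, float]:
    """Return (ttft_s, total_s) for one response delivered as an iterable of chunks."""
    start = clock()
    ttft = None
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start  # first bytes reached the client
    total = clock() - start
    # An empty response has no first chunk, so fall back to the total time.
    return (total if ttft is None else ttft), total


def streaming_response(clock: FakeClock) -> Iterator[str]:
    """Simulated SSE stream: first token at 59 ms, last token at 3.00 s."""
    clock.advance(0.059)
    yield "first token"
    clock.advance(3.00 - 0.059)
    yield "remaining tokens"


def blocking_response(clock: FakeClock) -> Iterator[str]:
    """Simulated non-streaming JSON: a single body after 3.92 s."""
    clock.advance(3.92)
    yield "full JSON body"


gateway_clock = FakeClock()
gateway_ttft, gateway_total = measure(streaming_response(gateway_clock), gateway_clock)

baseline_clock = FakeClock()
baseline_ttft, baseline_total = measure(blocking_response(baseline_clock), baseline_clock)

print(f"gateway:  ttft={gateway_ttft:.3f}s total={gateway_total:.2f}s")
print(f"baseline: ttft={baseline_ttft:.3f}s total={baseline_total:.2f}s")
```

For the non-streaming target the first chunk is also the last one, so TTFT and total collapse into the same number. That is why only the Total column isolates the inference difference.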

The c=4 numbers tell a different story.

The 67× TTFT speedup at concurrency 4 is not streaming-related; it's about parallelism. The naive baseline serializes inference via asyncio.Lock, so 4 concurrent client requests become 4 sequential server inferences. The gateway processes all 4 in parallel via vLLM continuous batching. This is the cleanest architectural finding in the dataset, and it's what the rest of the post is really about.

The Headline Finding: Continuous Batching at c=4

Metric	Gateway	Baseline	Speedup
TTFT p50	233ms	15.66s	67×
Total p50	3.19s	15.66s	4.91×
Throughput (req/s)	1.04	0.18	5.66×
Success rate	100%	100%	match

Both implementations completed 100% of requests at c=4 under the same conditions: same model, same hardware, same inputs. The gateway is ~5× faster end-to-end because vLLM's scheduler batches the four requests and runs them in parallel, whereas the baseline's lock forces it to process them one at a time.
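The serialization effect can be reproduced in a few lines. This is a toy model, not the baseline's actual code: `asyncio.sleep` stands in for one inference, `SERVICE_TIME_S` is a scaled-down made-up duration, and running the sleeps concurrently is only a rough stand-in for continuous batching (real GPU batching does not scale perfectly).

```python
import asyncio
import time
from typing import Optional

SERVICE_TIME_S = 0.05  # scaled-down stand-in for one inference (assumption)


async def infer(lock: Optional[asyncio.Lock]) -> None:
    """Simulate one inference, optionally guarded by a global lock."""
    if lock is None:
        await asyncio.sleep(SERVICE_TIME_S)
    else:
        async with lock:  # only one "inference" may run at a time
            await asyncio.sleep(SERVICE_TIME_S)


async def run_batch(concurrency: int, serialize: bool) -> float:
    """Fire `concurrency` requests at once and return the wall-clock time for all of them."""
    lock = asyncio.Lock() if serialize else None
    start = time.perf_counter()
    await asyncio.gather(*(infer(lock) for _ in range(concurrency)))
    return time.perf_counter() - start


serialized_s = asyncio.run(run_batch(4, serialize=True))
parallel_s = asyncio.run(run_batch(4, serialize=False))

print(f"with lock:    {serialized_s:.3f}s")
print(f"without lock: {parallel_s:.3f}s")
```

With the lock, the batch takes roughly four service times; without it, roughly one. The same arithmetic lines up with the measurements: if each baseline inference takes about 3.92s (the single-request total in the next section), four in a row come to about 15.68s, close to the 15.66s baseline p50 above.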

Single-Request: Streaming vs Inference

Metric	Baseline (MVP)	Gateway	Speedup
TTFT p50	3.92s	59ms	66.69×
Total p50	3.92s	3.00s	1.31×

The ~66× is what users feel, but the 1.31× is the true inference-only number. A response that takes 3s in total feels instant to users because it starts in about 60ms.

Concurrency Sweep

Pattern	c	req/s	tokens/s	Success
steady	4	1.04	267	100%
steady	8	1.6	405	99.1%
steady	16	0.9	231	8.0%
burst	8	0.6	149	50.0%
burst	32	1.0	246	21.9%

These rows are gateway-only. The MVP baseline is missing above c=4 because lock queueing exceeds the client timeout, so it is functionally non-operational under moderate load.

c=16 — Graceful Degradation by Design

At c=16, success drops to 8%; the other 92% get HTTP 429. That's not a failure: the per-tenant limit is rpm=60 (~1 req/s) and the offered load is ~16 req/s. The gateway rejects the overflow with Retry-After, keeps latency bounded on the requests it serves, and refuses to saturate the backend.
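A rough sketch of why this success rate is expected. The gateway's actual limiter algorithm isn't shown in this post, so this assumes a plain token bucket; `TokenBucket`, its burst capacity of 1, and the evenly spaced 16 req/s arrival pattern are all illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TokenBucket:
    """Per-tenant token bucket (hypothetical; not the gateway's real limiter)."""

    rate_per_s: float   # refill rate, e.g. rpm=60 -> 1 token per second
    capacity: float     # maximum burst size (assumed to be 1 here)
    tokens: float       # tokens currently available
    updated_at: float   # timestamp of the last refill

    def try_acquire(self, now: float) -> Tuple[int, int]:
        """Return (http_status, retry_after_s) for a request arriving at `now`."""
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200, 0
        # Retry-After is expressed in whole seconds, so round the wait up.
        wait_s = (1.0 - self.tokens) / self.rate_per_s
        return 429, math.ceil(wait_s)


bucket = TokenBucket(rate_per_s=60 / 60, capacity=1.0, tokens=1.0, updated_at=0.0)

# Offer 16 evenly spaced requests per second for 60 seconds.
total = 16 * 60
results = [bucket.try_acquire(now=i / 16) for i in range(total)]
accepted = sum(1 for status, _ in results if status == 200)

print(f"accepted {accepted}/{total} = {accepted / total:.2%}")
```

In this idealized model one request in sixteen gets through (60 of 960, or 6.25%), which is in the same ballpark as the measured 8.0%. The gap could come from burst allowance or from the real offered load not being exactly 16 req/s; the sketch can't tell those apart.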

Two Important Bugs I Caught during Benchmark

I caught a handful of bugs during benchmarking and deployment, but these two were the ones that would have broken the benchmark.

Baseline ran on CPU. A missing device_map caused HuggingFace to default to CPU at ~0.75 tok/s, which would've made the gateway look artificially superior.
Redis EVALSHA SHA mismatch. The ledger hashed its Lua script with SHA-256, but Redis's script cache uses SHA-1, which caused a NoScriptError and a full script resend on every call.
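The hash mismatch is easy to see with nothing but the standard library. Redis identifies a cached script by the hex SHA-1 of its body (the value SCRIPT LOAD returns), so any other digest will never match. The Lua script below is a made-up stand-in, not the ledger's real script, and the function names are hypothetical.

```python
import hashlib

# Hypothetical stand-in for the ledger's Lua script.
LUA_SCRIPT = "return redis.call('INCRBY', KEYS[1], ARGV[1])"


def evalsha_digest(script: str) -> str:
    """Digest Redis uses to look up a cached script: hex SHA-1 of the script body."""
    return hashlib.sha1(script.encode("utf-8")).hexdigest()


def mismatched_digest(script: str) -> str:
    """The kind of digest the bug produced: SHA-256, which Redis will not recognise."""
    return hashlib.sha256(script.encode("utf-8")).hexdigest()


correct = evalsha_digest(LUA_SCRIPT)
wrong = mismatched_digest(LUA_SCRIPT)

print(f"SHA-1   ({len(correct)} hex chars): {correct}")
print(f"SHA-256 ({len(wrong)} hex chars): {wrong}")
```

A SHA-1 hex digest is 40 characters and a SHA-256 one is 64, so the mismatch is visible from the length alone. Every EVALSHA with the longer digest misses the cache and falls back to resending the full script.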

Methodology

H100 80GB on RunPod, local networking, Qwen 2.5 7B (AWQ gateway / fp16 baseline), wall-clock client-side timing, 429/5XX/timeouts all counted as failures. Baseline tested only at c≤4 (above that, lock queueing times out). Full hardware, software, command matrix, and reproduction steps are in the Benchmark Protocol.
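As a small illustration of the failure-counting rule, here is a sketch of how a success rate could be computed from per-request outcomes. The function names and the sample data are invented for the example, and treating only 2xx responses as success is an assumption that goes slightly beyond the 429/5XX/timeout rule stated above.

```python
from typing import Optional, Sequence


def is_success(status: Optional[int]) -> bool:
    """A request succeeds only with a 2xx status; None marks a client-side timeout."""
    return status is not None and 200 <= status < 300


def success_rate(statuses: Sequence[Optional[int]]) -> float:
    """Fraction of requests that succeeded; 0.0 for an empty run."""
    if not statuses:
        return 0.0
    return sum(is_success(s) for s in statuses) / len(statuses)


# Made-up sample: 2 successes, 21 rate-limited, 1 server error, 1 timeout.
sample = [200, 200] + [429] * 21 + [503, None]
rate = success_rate(sample)

print(f"success rate: {rate:.1%}")
```

Counting 429s as failures is what makes the c=16 row look alarming at first glance, even though the rejected requests were turned away on purpose.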

Conclusion

4.91× at c=4, 100% on both sides — the result that matters.
1.31× inference / ~66× TTFT — both real, measuring different things.
MVP can't survive c>4 — production serving needs a production stack.
429 at c=16 is correct behavior, not failure.

💬 Next post: LLM Gateway on K8s with Karpenter / Knative / KServe
