The Numbers: Benchmarking My LLM Gateway on an H100


A couple of weeks ago I wrote about rewriting my LLM gateway to bring it from MVP to production. The architectural claims were multi-tenancy, hybrid inference, and sub-5ms overhead. So I benchmarked it on an H100 against a naive HuggingFace serving baseline. This post covers how my architectural decisions affected the numbers and what the numbers actually mean.

Highlights

  • At concurrency 4, 100% success on both sides: Gateway delivers responses 4.91× faster end-to-end and 5.66× higher throughput — pure architectural win from continuous batching

  • At concurrency 1: Gateway shows 60–67× faster TTFT — but this number bundles two things (streaming + inference), and the post explains both honestly

  • At concurrency 16 sustained: Gateway returns HTTP 429 to 92% of requests by design — that's the rate limiter doing its job, not a system failure

  • Tested on H100 80GB serving Qwen 2.5 7B with vLLM AWQ vs HF Transformers fp16, both serving the same model

The headline here is not the 60× finding. It's the 4.91× total speedup at c=4 with 100% success on both sides, the cleanest architectural result in the dataset.

How to Read These Numbers

TTFT measurement asymmetry between targets.

The gateway streams responses via Server-Sent Events; TTFT measures true time-to-first-token. The baseline returns non-streaming JSON; for non-streaming targets, TTFT equals total response time because nothing reaches the client until generation completes. This means the dramatic TTFT speedups (60–67×) reflect two combined architectural choices, not just inference engine differences:

  • Streaming response delivery: the gateway shows partial output immediately; the baseline doesn't respond until generation completes

  • Inference optimization: vLLM AWQ generates ~1.3× more tokens per second than HF Transformers fp16 on the same H100

The Total speedup is the apples-to-apples inference comparison after streaming differences wash out.
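The asymmetry can be seen in a minimal client-side sketch. The two generator functions below are hypothetical stand-ins for the streaming and non-streaming targets (they sleep instead of calling a model), and `measure` is an assumed helper, not the benchmark's actual harness.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure(chunks: Iterable[str],
            clock: Callable[[], float] = time.perf_counter) -> Tuple[float, float]:
    """Return (ttft, total) in seconds for one response.

    ttft is the delay until the first chunk arrives; total is the delay
    until the response is fully consumed.
    """
    start = clock()
    ttft = None
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start
    total = clock() - start
    if ttft is None:  # empty response: nothing ever arrived
        ttft = total
    return ttft, total


def streaming_response(n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical SSE-style target: yields each token as it is generated."""
    for i in range(n_tokens):
        time.sleep(per_token_s)  # stand-in for generating one token
        yield f"tok{i} "


def blocking_response(n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Hypothetical non-streaming target: one body after generation ends."""
    time.sleep(n_tokens * per_token_s)  # client sees nothing during this time
    yield "tok " * n_tokens


if __name__ == "__main__":
    print("streaming:", measure(streaming_response(10, 0.01)))
    print("blocking: ", measure(blocking_response(10, 0.01)))
```

Both simulated targets spend the same time generating, so their totals are similar. Only the streaming one reports a TTFT far below its total, while the blocking one reports a TTFT equal to its total.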

The c=4 numbers tell a different story.

The 67× TTFT speedup at concurrency 4 is not streaming-related; it's about parallelism. The naive baseline serializes inference via asyncio.Lock, so 4 concurrent client requests become 4 sequential server inferences. The gateway processes all 4 in parallel via vLLM continuous batching. This is the cleanest architectural finding in the dataset, and it's what the rest of the post is really about.
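A small asyncio sketch of the serialization effect, assuming a handler shaped roughly like the baseline's. `Tracker`, `handle`, and `peak_overlap` are illustrative names, and `asyncio.sleep` stands in for GPU work. Plain task overlap is only a stand-in for vLLM's continuous batching, where batched requests share the GPU rather than running for free.

```python
import asyncio
from typing import Optional


class Tracker:
    """Counts how many simulated inferences overlap at any moment."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0

    async def infer(self, seconds: float) -> None:
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(seconds)  # stand-in for model inference
        self.in_flight -= 1


async def handle(tracker: Tracker, lock: Optional[asyncio.Lock], seconds: float) -> None:
    if lock is None:
        await tracker.infer(seconds)      # requests may overlap
    else:
        async with lock:                  # one inference at a time
            await tracker.infer(seconds)


async def run(concurrency: int, serialized: bool, seconds: float = 0.01) -> int:
    tracker = Tracker()
    lock = asyncio.Lock() if serialized else None
    await asyncio.gather(*(handle(tracker, lock, seconds) for _ in range(concurrency)))
    return tracker.peak


def peak_overlap(concurrency: int, serialized: bool) -> int:
    """Highest number of inferences that were in flight at the same time."""
    return asyncio.run(run(concurrency, serialized))


if __name__ == "__main__":
    print("with lock:   ", peak_overlap(4, serialized=True))
    print("without lock:", peak_overlap(4, serialized=False))
```

With the lock, four concurrent requests never overlap, so the last one waits behind the other three. The baseline's c=4 total of 15.66s is close to four times its single-request 3.92s, which is roughly what this kind of serialization would predict.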

The Headline Finding: Continuous Batching at c=4

| Metric | Gateway | Baseline | Speedup |
|---|---|---|---|
| TTFT p50 | 233ms | 15.66s | 67× |
| Total p50 | 3.19s | 15.66s | 4.91× |
| Throughput (req/s) | 1.04 | 0.18 | 5.66× |
| Success rate | 100% | 100% | match |

Both implementations completed 100% of requests at c=4 under the same conditions: same model, same hardware, same inputs. The gateway is ~5× faster end-to-end because vLLM's scheduler batches the four requests and runs them in parallel. The baseline's lock forces it to process them one at a time.

Single-Request: Streaming vs Inference

| Metric | MVP (baseline) | Gateway | Speedup |
|---|---|---|---|
| TTFT p50 | 3.92s | 59ms | 66.69× |
| Total p50 | 3.92s | 3.00s | 1.31× |

The ~66× is what users feel, but the 1.31× is the true inference-only number. A response that takes 3s in total feels instant because it starts in about 60ms.

Concurrency Sweep

| Pattern | c | req/s | tokens/s | Success |
|---|---|---|---|---|
| steady | 4 | 1.04 | 267 | 100% |
| steady | 8 | 1.6 | 405 | 99.1% |
| steady | 16 | 0.9 | 231 | 8.0% |
| burst | 8 | 0.6 | 149 | 50.0% |
| burst | 32 | 1.0 | 246 | 21.9% |

The MVP does not appear in this sweep. Above c=4, its lock queueing exceeds the client timeout, so it is functionally non-operational under moderate load.

c=16 — Graceful Degradation by Design

At c=16, success drops to 8%; the other 92% get HTTP 429. That's not a failure: the per-tenant limit is rpm=60 (~1 req/s) and the offered load is ~16 req/s. The gateway rejects the overflow with Retry-After, keeps latency bounded on the requests it serves, and refuses to saturate the backend.
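To make the rejection path concrete, here is a minimal in-memory token bucket. It is a sketch under assumptions, not the gateway's limiter: the class name, the `burst` capacity, and the in-process state are all hypothetical, and a Redis-backed limiter would keep this state server-side.

```python
import math
from typing import Dict, Tuple


class TokenBucket:
    """Per-tenant limiter sketch: refills at rpm/60 tokens per second."""

    def __init__(self, rpm: int, burst: int = 1) -> None:
        self.rate = rpm / 60.0          # tokens added per second
        self.burst = float(burst)       # bucket capacity (assumed, not from the post)
        self.tokens = float(burst)      # start with a full bucket
        self.updated = 0.0              # time of the last refill, in seconds

    def allow(self, now: float) -> Tuple[int, Dict[str, str]]:
        """Return (status, headers) for a request arriving at time `now`."""
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200, {}
        wait = (1.0 - self.tokens) / self.rate   # seconds until one token exists
        return 429, {"Retry-After": str(math.ceil(wait))}


def simulate(rpm: int, offered_per_s: int, seconds: int) -> float:
    """Fraction of requests admitted at a steady offered load."""
    bucket = TokenBucket(rpm)
    total = offered_per_s * seconds
    admitted = sum(
        1 for i in range(total) if bucket.allow(i / offered_per_s)[0] == 200
    )
    return admitted / total


if __name__ == "__main__":
    print(simulate(rpm=60, offered_per_s=16, seconds=60))
```

With rpm=60 and a steady 16 req/s, this sketch admits about 1 request in 16, the same order of magnitude as the 8% measured above. It is meant to show the mechanism, not to reproduce the measurement.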

Two Important Bugs I Caught During Benchmarking

I caught several bugs during benchmarking and deployment, but these two would have broken the benchmark.

  1. Baseline ran on CPU. A missing device_map made HuggingFace default to CPU at ~0.75 tok/s, which would have made the gateway look artificially superior.

  2. Redis EVALSHA SHA mismatch. Ledger hashed its Lua script with SHA-256, but Redis's script cache uses SHA-1, which caused a NoScriptError and a full script resend on every call.
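The second bug comes down to which digest identifies a cached script. Redis keys its script cache by the hex SHA-1 of the script source, so a client that computes any other hash can never hit the cache. The sketch below uses a hypothetical Lua script (the real ledger script is not shown in this post) and illustrative function names.

```python
import hashlib

# Hypothetical Lua script, used only to illustrate the digests.
LUA_SCRIPT = b"return redis.call('INCRBY', KEYS[1], ARGV[1])"


def evalsha_digest(script: bytes) -> str:
    """Digest Redis uses to identify a cached script: hex SHA-1 of the source."""
    return hashlib.sha1(script).hexdigest()


def wrong_digest(script: bytes) -> str:
    """The mismatch described above: SHA-256 never matches the cache key."""
    return hashlib.sha256(script).hexdigest()


if __name__ == "__main__":
    print("sha1  :", evalsha_digest(LUA_SCRIPT))   # 40 hex characters
    print("sha256:", wrong_digest(LUA_SCRIPT))     # 64 hex characters
```

One way to avoid computing the digest by hand is to use the value returned by SCRIPT LOAD, or a client helper such as redis-py's `register_script`, which tracks the SHA-1 itself.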

Methodology

H100 80GB on RunPod, local networking, Qwen 2.5 7B (AWQ gateway / fp16 baseline), wall-clock client timing; 429s, 5xx responses, and timeouts all counted as failures. Baseline tested only at c≤4 (above that, lock queueing times out). Full hardware, software, command matrix, and reproduction steps are in the Benchmark Protocol.
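The failure accounting can be sketched as follows. The `Sample` type and `summarize` function are hypothetical, and two details are assumptions rather than facts from the protocol: that only 2xx responses count as success, and that latency percentiles are computed over successful requests only.

```python
import statistics
from typing import List, NamedTuple, Optional


class Sample(NamedTuple):
    status: Optional[int]   # HTTP status, or None if the client timed out
    total_s: float          # wall-clock seconds measured by the client


def is_success(sample: Sample) -> bool:
    """429, 5xx, and timeouts are failures; only 2xx counts as success (assumed)."""
    return sample.status is not None and 200 <= sample.status < 300


def summarize(samples: List[Sample]) -> dict:
    """Success rate over all requests, p50 latency over successes (assumed)."""
    ok = [s.total_s for s in samples if is_success(s)]
    return {
        "success_rate": len(ok) / len(samples) if samples else 0.0,
        "total_p50_s": statistics.median(ok) if ok else None,
    }


if __name__ == "__main__":
    demo = [
        Sample(200, 3.0),
        Sample(200, 3.2),
        Sample(200, 3.4),
        Sample(429, 0.01),   # rate limited
        Sample(503, 1.0),    # server error
        Sample(None, 30.0),  # client timeout
    ]
    print(summarize(demo))
```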

Conclusion

  • 4.91× at c=4, 100% on both sides — the result that matters.

  • 1.31× inference / ~66× TTFT — both real, measuring different things.

  • MVP can't survive c>4 — production serving needs a production stack.

  • 429 at c=16 is correct behavior, not failure.

💬 Next post: LLM Gateway on K8s with Karpenter / Knative / KServe


Orhun Kupeli


Hi, I'm Orhun. I'm a fullstack engineer who spends most of my time on backend systems and infrastructure. I write about architecture, performance, and the lessons that only show up after you ship. So, no tutorials, no hot takes.