Introduction
This post describes the architecture we've built for serving large language model (LLM) inference using a combination of:
- Nomad as the orchestrator
- Triton Inference Server with vLLM backend
- Envoy as the ingress/load balancer
- Consul for service discovery
- Prometheus + Nomad Autoscaler for observability and horizontal scaling
The goal was to design a serving stack that is modular, transparent, and responsive to traffic patterns — while remaining operationally simple.
High-level diagram of the architecture
Architecture overview
Triton + vLLM backend via Nomad
Each model is deployed as a Nomad job running a containerized Triton Inference Server with the vLLM backend. Each replica registers a service in HashiCorp Consul, which we use for basic service discovery (we don't use Consul Connect or mesh features).
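To make this concrete, here is a minimal sketch of what such a job can look like. The job name, container image tag, model repository path, and resource figures are illustrative assumptions rather than our production values; the important parts are the Consul `service` registration and Triton's standard HTTP/gRPC/metrics ports.

```hcl
# Sketch of a per-model Nomad job: Triton (vLLM backend) in Docker,
# registered in Consul so the control plane can discover each replica.
job "triton-llm-7b" {
  datacenters = ["dc1"]
  type        = "service"

  group "triton" {
    count = 2

    network {
      port "http"    { to = 8000 }  # Triton HTTP endpoint
      port "grpc"    { to = 8001 }  # Triton gRPC endpoint (used by Envoy)
      port "metrics" { to = 8002 }  # Prometheus metrics endpoint
    }

    # Each running replica registers this service in Consul.
    service {
      name = "triton-llm-7b"
      port = "grpc"

      check {
        type     = "http"
        port     = "http"
        path     = "/v2/health/ready"  # Triton readiness probe
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "server" {
      driver = "docker"

      config {
        image   = "nvcr.io/nvidia/tritonserver:24.08-vllm-python-py3"  # example tag
        ports   = ["http", "grpc", "metrics"]
        command = "tritonserver"
        args    = ["--model-repository=/models"]
        volumes = ["/opt/models:/models"]  # host path is a placeholder
      }

      resources {
        cpu    = 4000
        memory = 32768

        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```

The `metrics` port is the one Prometheus scrapes in the observability setup described further down.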
Envoy proxy + xDS Control Plane
All incoming inference requests go through Envoy, which serves as our load balancer.
Envoy is configured to:
- Load balance across Triton replicas using the Maglev algorithm.
- Forward requests to the Triton Inference Servers over Triton's native gRPC streaming interface.
Rather than hardcoding backend endpoints, Envoy retrieves them dynamically using the xDS API. We run a separate Envoy control plane (as a Nomad job), which:
- Pulls service data from Consul, where each Triton replica is registered.
- Generates a new xDS configuration snapshot.
- Serves that config to Envoy over xDS (Endpoint/Cluster/Listener Discovery Service).
This setup allows Envoy to stay up-to-date with available replicas and react to model deployments, restarts, or scale events — without requiring any restarts or manual updates.
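As an illustration of how little static configuration Envoy needs in this model, here is a sketch of the Envoy ingress itself deployed as a Nomad job. Everything in it (job and service names, image tag, ports, and the control-plane address) is an assumption for the example. The only static piece of Envoy config is the bootstrap, which points the ADS connection at the control plane; all listeners, clusters, and endpoints arrive over xDS.

```hcl
# Sketch: Envoy ingress as a Nomad job with an ADS-only bootstrap.
job "envoy-ingress" {
  datacenters = ["dc1"]
  type        = "service"

  group "envoy" {
    network {
      # The inference listener on this port is delivered dynamically via LDS.
      port "inference" { static = 10000 }
    }

    service {
      name = "llm-ingress"
      port = "inference"
    }

    task "envoy" {
      driver = "docker"

      config {
        image = "envoyproxy/envoy:v1.30-latest"  # example tag
        ports = ["inference"]
        args  = ["-c", "/local/bootstrap.yaml"]
      }

      # Nomad mounts the task's local/ directory at /local inside the container.
      template {
        destination = "local/bootstrap.yaml"
        data        = <<EOF
node:
  id: ingress-1
  cluster: llm-ingress
dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  cds_config:
    ads: {}
    resource_api_version: V3
  lds_config:
    ads: {}
    resource_api_version: V3
static_resources:
  clusters:
    # The only statically defined cluster: the xDS control plane itself,
    # found via Consul DNS (assumes *.service.consul resolves on the host).
    - name: xds_cluster
      type: STRICT_DNS
      connect_timeout: 1s
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
      load_assignment:
        cluster_name: xds_cluster
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: envoy-control-plane.service.consul
                      port_value: 18000
EOF
      }
    }
  }
}
```

The Maglev-balanced Triton cluster described above is one of the resources the control plane pushes over CDS/EDS, so nothing about the backends ever appears in this file.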
Observability and Autoscaling
Prometheus scrapes metrics from both:
- Envoy: request counts, error rates, latencies, etc.
- Triton: inference performance (per-model request counts, queue and compute times).
To manage autoscaling, we run a Nomad Autoscaler agent as a separate Nomad job. The agent continuously evaluates selected metrics, such as queued requests and system load, against per-job scaling policies and uses Nomad's API to adjust the replica count of each model job. This lets each service scale horizontally with real-time demand, maintaining performance without over-provisioning.
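For the model jobs, the scaling behaviour is declared next to the job itself. The sketch below shows roughly what such a policy looks like; the bounds, cooldowns, and the Prometheus query (here, in-flight requests per healthy backend on an Envoy cluster assumed to be named `triton-llm-7b`, via the autoscaler's Prometheus APM plugin) are placeholders rather than our production policy.

```hcl
# Sketch: horizontal scaling policy inside the model job's task group
# (e.g. the "triton" group from the job sketch earlier).
group "triton" {
  count = 2

  # Evaluated by the Nomad Autoscaler agent.
  scaling {
    enabled = true
    min     = 1
    max     = 8

    policy {
      cooldown            = "2m"
      evaluation_interval = "30s"

      # Placeholder signal: average in-flight requests per healthy Triton endpoint.
      check "inflight_per_replica" {
        source = "prometheus"
        query  = "sum(envoy_cluster_upstream_rq_active{envoy_cluster_name=\"triton-llm-7b\"}) / clamp_min(sum(envoy_cluster_membership_healthy{envoy_cluster_name=\"triton-llm-7b\"}), 1)"

        strategy "target-value" {
          target = 4
        }
      }
    }
  }

  # ... network, service, and task stanzas as in the job sketch above ...
}
```

This assumes the autoscaler agent is configured with the Prometheus APM plugin and has policy evaluation enabled for Nomad jobs.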
Benchmarking
To evaluate our architecture, we ran end-to-end load tests comparing two setups for serving the same 7B multimodal model:
- Nomad + Envoy + Triton (vLLM backend)
- KubeAI + vLLM on Kubernetes
We tested increasing levels of concurrency and request volume and compared mean response time, tail latency, and system stability.
| Concurrency | Total requests | Mean response time (KubeAI) | Mean response time (Nomad) | P95 latency (Nomad) | Notes |
|---|---|---|---|---|---|
| 1 | 100 | 4.10s | 4.23s | 5.68s | Similar under light load |
| 5 | 100 | 4.12s | 3.50s | 4.37s | 15% faster on Nomad |
| 10 | 100 | 5.12s | 3.58s | 4.41s | 30% faster on Nomad |
| 25 | 500 | 4.77s | 3.63s | 4.52s | Nomad faster and stable |
| 50 | 500 | 5.05s | 4.56s | 6.95s | KubeAI latency starts to climb |
| 100 | 1000 | unstable | 5.30s | 6.94s | Nomad remained responsive |
Summary
This architecture gives us a robust, scalable, and observable way to serve LLM inference. By combining Nomad for orchestration, Triton with the vLLM backend for serving, Envoy for load balancing, Consul for service discovery, and Prometheus with the Nomad Autoscaler for observability and scaling, we get a flexible, efficient system that adapts to changing traffic patterns and operational needs.