Tzafon LLM inference architecture

Introduction

This post describes the architecture we've built for serving large language model (LLM) inference using a combination of:

  • Nomad as the orchestrator
  • Triton Inference Server with vLLM backend
  • Envoy as the ingress/load balancer
  • Consul for service discovery
  • Prometheus + Nomad Autoscaler for observability and horizontal scaling

The goal was to design a serving stack that is modular, transparent, and responsive to traffic patterns — while remaining operationally simple.

High-level diagram of the architecture

[Figure: LLM inference architecture]

Architecture overview

Triton + vLLM backend via Nomad

Each model is deployed as a Nomad job running a containerized Triton Inference Server with vLLM backend. These jobs are registered in HashiCorp Consul, which we use for basic service discovery (we don't use Consul Connect or mesh features).
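
Service discovery here is just Consul's standard HTTP health API. As a small illustration, the sketch below lists the healthy Triton replicas for one model; the Consul address and the service name are placeholders for whatever your Nomad jobs actually register.

```python
# Minimal sketch: list the healthy Triton replicas registered in Consul.
# The Consul address and service name ("triton-vllm") are placeholders.
import requests

CONSUL_URL = "http://127.0.0.1:8500"
SERVICE_NAME = "triton-vllm"  # hypothetical service name

def healthy_triton_endpoints():
    # Consul's health API returns only instances whose checks pass
    # when passing=true is set.
    resp = requests.get(
        f"{CONSUL_URL}/v1/health/service/{SERVICE_NAME}",
        params={"passing": "true"},
        timeout=5,
    )
    resp.raise_for_status()
    endpoints = []
    for entry in resp.json():
        svc = entry["Service"]
        # Fall back to the node address if the service did not register one.
        address = svc.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, svc["Port"]))
    return endpoints

if __name__ == "__main__":
    for host, port in healthy_triton_endpoints():
        print(f"{host}:{port}")
```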

Envoy proxy + xDS Control Plane

All incoming inference requests go through Envoy, which serves as our load balancer.

Envoy is configured to:

  • Load balance across Triton replicas using the Maglev algorithm.
  • Forward requests via gRPC to Triton Inference Servers using Triton's native gRPC streaming interface.
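
To make that request path concrete, here is a minimal client sketch that sends one prompt through Envoy to Triton over gRPC using the tritonclient package. Because the vLLM backend streams its responses, the sketch uses Triton's gRPC streaming API; the Envoy address, model name, and tensor names (text_input/text_output, as in the stock vLLM backend model configuration) are assumptions to adapt to your own deployment.

```python
# Minimal sketch: one prompt through Envoy to Triton's gRPC streaming API.
# Envoy address, model name, and tensor names are assumptions; the tensor
# names match the stock vLLM backend model (text_input / text_output).
import queue

import numpy as np
import tritonclient.grpc as grpcclient

ENVOY_ADDR = "envoy.service.consul:10000"  # hypothetical Envoy gRPC listener
MODEL_NAME = "llm-7b"                      # hypothetical model name

responses = queue.Queue()

def on_response(result, error):
    # Called once per streamed chunk; errors arrive through the same callback.
    responses.put(error if error else result)

client = grpcclient.InferenceServerClient(url=ENVOY_ADDR)

prompt = np.array(["Explain service discovery in one sentence."], dtype=np.object_)
text_input = grpcclient.InferInput("text_input", list(prompt.shape), "BYTES")
text_input.set_data_from_numpy(prompt)

stream_flag = np.array([True])
stream_input = grpcclient.InferInput("stream", list(stream_flag.shape), "BOOL")
stream_input.set_data_from_numpy(stream_flag)

client.start_stream(callback=on_response)
client.async_stream_infer(
    model_name=MODEL_NAME,
    inputs=[text_input, stream_input],
    outputs=[grpcclient.InferRequestedOutput("text_output")],
)
client.stop_stream()  # half-closes the stream and waits for remaining responses

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("text_output"))
```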

Rather than hardcoding backend endpoints, Envoy retrieves them dynamically using the xDS API. We run a separate Envoy control plane (as a Nomad job), which:

  • Pulls service data from Consul, where each Triton replica is registered.
  • Generates a new xDS configuration snapshot.
  • Serves that configuration to Envoy over the xDS APIs (Endpoint, Cluster, and Listener Discovery Services).
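
The heart of that control plane is a small transformation: Consul health entries in, versioned xDS resources out. The sketch below shows the EDS half of that mapping using the ClusterLoadAssignment field names from Envoy's endpoint API; it only illustrates the data shape, and the actual gRPC xDS serving layer (and our real control plane) is not shown.

```python
# Minimal sketch of the transformation the control plane performs:
# Consul /v1/health/service entries -> an EDS ClusterLoadAssignment-shaped
# payload plus a snapshot version. Field names follow Envoy's
# envoy.config.endpoint.v3.ClusterLoadAssignment proto; serving the
# snapshot over the xDS gRPC API is omitted here.
import hashlib
import json

def cluster_load_assignment(cluster_name, consul_entries):
    lb_endpoints = []
    for entry in consul_entries:
        svc = entry["Service"]
        address = svc.get("Address") or entry["Node"]["Address"]
        lb_endpoints.append({
            "endpoint": {
                "address": {
                    "socket_address": {
                        "address": address,
                        "port_value": svc["Port"],
                    }
                }
            }
        })
    return {
        "cluster_name": cluster_name,
        "endpoints": [{"lb_endpoints": lb_endpoints}],
    }

def snapshot_version(payload):
    # A content hash makes a convenient snapshot version: the control plane
    # only needs to push a new snapshot when the version string changes.
    return hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()[:12]
```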

This setup allows Envoy to stay up-to-date with available replicas and react to model deployments, restarts, or scale events — without requiring any restarts or manual updates.

Observability and Autoscaling

Prometheus scrapes metrics from both:

  • Envoy: request counts, error rates, and latencies.
  • Triton: inference performance metrics such as request counts and queue/compute latencies.
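
As an example, the sketch below pulls two of the signals we watch from Prometheus' HTTP query API. The Prometheus address is a placeholder; the metric names are the standard ones exported by Envoy's /stats/prometheus endpoint and Triton's metrics port.

```python
# Minimal sketch: pull two of the signals we watch from Prometheus'
# HTTP query API. The Prometheus address is a placeholder; the metric
# names are standard Envoy and Triton exports.
import requests

PROMETHEUS_URL = "http://prometheus.service.consul:9090"  # hypothetical address

QUERIES = {
    # Per-second request rate seen by Envoy's upstream clusters (5m window).
    "envoy_request_rate": "rate(envoy_cluster_upstream_rq_total[5m])",
    # Average time requests spend queued inside Triton, in microseconds.
    "triton_avg_queue_us": (
        "rate(nv_inference_queue_duration_us[5m])"
        " / rate(nv_inference_request_success[5m])"
    ),
}

def instant_query(promql):
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=5
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

if __name__ == "__main__":
    for name, promql in QUERIES.items():
        print(name, instant_query(promql))
```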

To manage autoscaling, we run a Nomad Autoscaler agent as a separate Nomad job. It continuously evaluates selected metrics, such as queued requests and system load, and uses Nomad's API to adjust the replica count of each model job. This lets our services scale horizontally in response to real-time demand while staying resource-efficient.
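
Scaling behaviour is defined by the scaling policy attached to each job, and the Autoscaler acts on it through Nomad's job scale API. The sketch below shows the kind of call it issues, with a hypothetical job and task group name; in practice the Autoscaler makes these calls itself rather than a script.

```python
# Minimal sketch of Nomad's job scale API, which the Autoscaler uses to
# adjust replica counts. Job and task group names are hypothetical.
import os

import requests

NOMAD_ADDR = os.environ.get("NOMAD_ADDR", "http://127.0.0.1:4646")
NOMAD_TOKEN = os.environ.get("NOMAD_TOKEN", "")

def scale_task_group(job_id, group, count, reason):
    resp = requests.post(
        f"{NOMAD_ADDR}/v1/job/{job_id}/scale",
        headers={"X-Nomad-Token": NOMAD_TOKEN} if NOMAD_TOKEN else {},
        json={
            "Count": count,
            "Target": {"Group": group},
            "Message": reason,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # Hypothetical job and task group for one Triton + vLLM model.
    print(scale_task_group("triton-vllm-7b", "triton", count=3,
                           reason="illustrating the scale API"))
```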

Benchmarking

To evaluate our architecture, we ran end-to-end load tests comparing two setups for serving the same 7B multimodal model:

  • Nomad + Envoy + Triton (vLLM backend)
  • KubeAI + vLLM on Kubernetes

We tested various levels of concurrency and request volume to observe differences in mean response time, P95 latency, and system stability.
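
To give a sense of the methodology, the sketch below shows the general shape of such a test: a fixed concurrency level, a fixed number of total requests, and mean/P95 response times computed over all of them. The endpoint and payload are placeholders rather than our actual benchmark harness.

```python
# Minimal sketch of a fixed-concurrency load test reporting mean and P95
# response times. The endpoint and payload are placeholders, not our
# actual benchmark harness.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://load-balancer.example:8080/infer"  # placeholder
PAYLOAD = {"prompt": "hello"}                         # placeholder body

def one_request(_):
    start = time.perf_counter()
    requests.post(ENDPOINT, json=PAYLOAD, timeout=120).raise_for_status()
    return time.perf_counter() - start

def run(concurrency, total_requests):
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(one_request, range(total_requests)))
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return statistics.mean(latencies), p95

if __name__ == "__main__":
    for concurrency, total in [(1, 100), (5, 100), (10, 100), (25, 500)]:
        mean, p95 = run(concurrency, total)
        print(f"c={concurrency:>3} n={total:>4} mean={mean:.2f}s p95={p95:.2f}s")
```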

| Concurrency | Total requests | Mean response time (KubeAI) | Mean response time (Nomad) | P95 latency (Nomad) | Notes |
|---|---|---|---|---|---|
| 1 | 100 | 4.10s | 4.23s | 5.68s | Similar under light load |
| 5 | 100 | 4.12s | 3.50s | 4.37s | 15% faster on Nomad |
| 10 | 100 | 5.12s | 3.58s | 4.41s | 30% faster on Nomad |
| 25 | 500 | 4.77s | 3.63s | 4.52s | Nomad faster and stable |
| 50 | 500 | 5.05s | 4.56s | 6.95s | KubeAI latency starts to climb |
| 100 | 1000 | unstable | 5.30s | 6.94s | Nomad remained responsive |

Summary

This architecture provides a robust, scalable, and observable solution for serving LLM inference. By combining Nomad for orchestration, Triton with vLLM for serving, Envoy for load balancing, Consul for service discovery, and Prometheus with the Nomad Autoscaler for observability and scaling, we achieve a flexible and efficient system that adapts to changing traffic patterns and operational needs.