LLMOps 2026: Production-Grade Self-Hosted AI Deployment Patterns
Running LLMs in production is fundamentally different from running traditional microservices. Memory constraints, GPU scheduling, batch processing, and autoscaling challenges require new patterns. This guide covers the complete LLMOps landscape—from vLLM and TGI to multi-model routing, quantization strategies, and building inference infrastructure that scales.
The LLMOps Landscape in 2026
LLMOps emerged as a distinct discipline because Large Language Models break traditional MLOps assumptions. You can't just wrap a model in a Flask container and call it production-ready. The memory requirements are massive (70B parameter models need 140GB+ VRAM), inference is stateful (KV-cache management), and throughput depends heavily on batching and scheduling decisions.
In 2026, the LLMOps stack has stabilized around several key components:
- Inference Engines: vLLM, TGI (Text Generation Inference), TensorRT-LLM, llama.cpp
- Model Storage: Hugging Face Hub, S3-compatible object stores, model registries
- Serving Infrastructure: Kubernetes with GPU operators, specialized schedulers
- API Gateways: LiteLLM, OpenRouter, custom routing layers
- Observability: Token-level metrics, cost tracking, latency analysis
- Safety: Guardrails, prompt injection detection, output filtering
The 2026 AI Infrastructure Report shows that 62% of enterprises running LLMs in production use self-hosted or hybrid setups (up from 38% in 2024). The primary drivers: data privacy (87%), cost predictability at scale (64%), and model customization (71%). Managed APIs remain popular for prototyping, but production workloads increasingly move on-premises.
Why Self-Hosted LLMs?
Before diving into implementation, understand when self-hosting makes sense:
| Factor | Self-Hosted | Managed API |
|---|---|---|
| Data Privacy | Data never leaves infrastructure | Third-party data processing |
| Cost at Scale | Fixed hardware costs | Linear per-token pricing |
| Latency | Predictable, controllable | Variable, network-dependent |
| Model Control | Fine-tune, merge, quantize freely | Limited to provider's models |
| Initial Setup | Complex, requires expertise | Minutes with API key |
| Operational Overhead | Significant (GPU management, scaling) | Minimal (provider-managed) |
The breakeven point typically occurs around 10-50 million tokens per day, depending on hardware costs and model size. At enterprise scale (billions of tokens monthly), self-hosting can reduce costs by 60-80%.
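That breakeven can be sanity-checked with one division: fixed daily hardware cost over the per-token API rate. A minimal sketch, where the $10/hour GPU price and $15-per-1M blended API rate are illustrative placeholders, not quotes:

```python
# Rough breakeven estimate: fixed hardware cost vs. per-token API pricing.
# Both prices below are illustrative placeholders; substitute your own quotes.

def breakeven_tokens_per_day(gpu_cost_per_hour: float,
                             api_cost_per_million_tokens: float) -> float:
    """Daily token volume above which self-hosting is cheaper than the API."""
    daily_hardware_cost = gpu_cost_per_hour * 24
    return daily_hardware_cost / api_cost_per_million_tokens * 1_000_000

# Example: $10/hour spot GPU node vs. $15 per 1M blended API tokens
tokens = breakeven_tokens_per_day(10.0, 15.0)
print(f"Breakeven at ~{tokens / 1e6:.0f}M tokens/day")  # Breakeven at ~16M tokens/day
```

Below that volume, utilization is the real enemy: a GPU you pay for 24 hours a day but use for 4 is five times more expensive per token than the headline rate suggests.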
Inference Engines: The Heart of LLMOps
The inference engine is where model weights meet compute. Your choice fundamentally impacts throughput, latency, and hardware requirements.
Inference Engine Comparison
| Engine | Best For | Throughput | Latency | Memory Efficiency | Quantization |
|---|---|---|---|---|---|
| vLLM | High-throughput serving | Excellent (PagedAttention) | Good | Good (continuous batching) | FP16, INT8, AWQ, GPTQ |
| TGI | Hugging Face ecosystem | Very Good | Good | Very Good (FlashAttention) | FP16, INT8, AWQ, GPTQ, EETQ |
| TensorRT-LLM | NVIDIA GPUs, max throughput | Excellent | Excellent | Excellent (inflight batching) | FP16, INT8, INT4, FP8 |
| llama.cpp | CPU inference, edge devices | Moderate | High (on CPU) | Excellent (GGUF) | GGUF (Q4_K_M, Q5_K_M, Q8_0) |
| ExLlamaV2 | Local LLMs on consumer GPUs | High | Very Low | Excellent | ExL2, GPTQ |
| MLC LLM | Multi-platform deployment | Good | Low | Good | INT4, INT8 |
Understanding PagedAttention (vLLM's Innovation)
vLLM revolutionized LLM serving with PagedAttention, inspired by virtual memory in operating systems. Traditional inference allocates contiguous memory for the KV cache, leading to massive internal fragmentation. PagedAttention stores KV cache in fixed-size blocks (like OS pages), enabling:
- Near-zero memory waste—only used blocks allocated
- Continuous batching—new requests join ongoing batch
- 2-4x throughput improvement over naive batching
# Traditional Batching (inefficient)
Request 1: [Generate....................................................Done]
Request 2: [Generate....................................................Done]
Request 3: [Generate....................................................Done]
Time -> [===========================================================]
# Continuous Batching with PagedAttention (efficient)
Request 1: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Request 2: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Request 3: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Time -> [===============================================================]
# GPU constantly utilized, new requests join as others complete
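The memory side of the same idea can be shown with a toy allocator: contiguous allocation reserves the worst-case length for every request up front, while block-based allocation holds only the blocks a sequence has actually filled. The sequence lengths below are made up; the 16-token block size mirrors vLLM's default:

```python
# Toy illustration of why block-based (paged) KV-cache allocation wastes
# far less memory than contiguous worst-case pre-allocation.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

def contiguous_alloc(seq_lens, max_len):
    """Naive scheme: every sequence reserves max_len token slots up front."""
    return len(seq_lens) * max_len

def paged_alloc(seq_lens):
    """Paged scheme: each sequence holds only the blocks it actually fills."""
    # -(-l // BLOCK_SIZE) is integer ceiling division
    return sum(-(-l // BLOCK_SIZE) * BLOCK_SIZE for l in seq_lens)

seqs = [120, 800, 45, 2048, 300]  # actual generated lengths (hypothetical)
naive = contiguous_alloc(seqs, max_len=2048)
paged = paged_alloc(seqs)
print(f"contiguous: {naive} slots, paged: {paged} slots "
      f"({100 * (1 - paged / naive):.0f}% saved)")
```

The freed memory is what lets continuous batching admit new requests mid-flight instead of waiting for the whole batch to drain.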
Production vLLM Deployment
vLLM has become the default choice for production GPU serving. Here's how to deploy it properly.
Docker Deployment
# docker-compose.yml for vLLM
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:v0.4.0
    runtime: nvidia
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - /data/models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/Meta-Llama-3-70B-Instruct-AWQ
      --tensor-parallel-size 2
      --quantization awq
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --max-num-seqs 256
      --gpu-memory-utilization 0.95
      --swap-space 4
      --enable-prefix-caching
      --api-key ${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"

  # Optional: Redis for caching
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
Key Configuration Parameters
| Parameter | Description | Recommended |
|---|---|---|
| tensor-parallel-size | GPUs per model instance | 2-4 for 70B+ models |
| pipeline-parallel-size | Pipeline stages across nodes | 1 (unless multi-node) |
| max-model-len | Max context length | 4096-8192 (balance with memory) |
| max-num-seqs | Max concurrent sequences | 128-512 (higher = more throughput) |
| gpu-memory-utilization | Fraction of GPU memory to use | 0.90-0.95 (leave room for overhead) |
| swap-space | CPU swap space in GB | 4-8 (for KV cache offloading) |
| enable-prefix-caching | Cache common prompt prefixes | True (major speedup for RAG) |
| quantization | Weight quantization method | awq, gptq, fp8, or None |
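To see why max-model-len, max-num-seqs, and gpu-memory-utilization trade off against each other, estimate the KV cache footprint per token. The sketch below plugs in Llama-3-70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with FP16 cache entries:

```python
# Back-of-envelope KV-cache sizing: this is the memory that --max-model-len
# and --max-num-seqs are budgeting against.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors; FP16 = 2 bytes/entry
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
ctx = 8192
print(f"KV cache per token: {per_token / 1024:.0f} KiB")          # 320 KiB
print(f"One full {ctx}-token sequence: {per_token * ctx / 2**30:.1f} GiB")  # 2.5 GiB
```

At 2.5 GiB per full-context sequence, a few dozen long requests can consume as much GPU memory as the quantized weights themselves, which is why prefix caching and paged allocation matter so much in practice.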
API Usage Examples
# Python client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-api-key"
)
# Simple completion
response = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
temperature=0.7,
max_tokens=500,
stream=False
)
print(response.choices[0].message.content)
# Streaming (for real-time UI)
stream = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[{"role": "user", "content": "Write a haiku about AI."}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
# With structured output (JSON mode)
response = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[{
"role": "user",
"content": "Extract entities from: 'Apple launched iPhone 15 in September 2024'"
}],
response_format={"type": "json_object"},
extra_body={
"guided_json": {
"type": "object",
"properties": {
"company": {"type": "string"},
"product": {"type": "string"},
"date": {"type": "string"}
},
"required": ["company", "product", "date"]
}
}
)
import json
entities = json.loads(response.choices[0].message.content)
print(entities) # {'company': 'Apple', 'product': 'iPhone 15', 'date': 'September 2024'}
Quantization: Running Large Models on Smaller Hardware
Quantization reduces model precision to fit larger models on available hardware. In 2026, it's essentially mandatory for production deployment.
Quantization Methods Comparison
| Method | Bits | Memory Reduction | Quality Loss | Speed | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 50% | Minimal | Baseline | Maximum quality, H100/A100 only |
| AWQ | 4 | 75% | Low | Fast (GEMM kernels) | Production serving, consumer GPUs |
| GPTQ | 4, 3 | 75-81% | Low-Moderate | Fast | Maximum compression, edge cases |
| GGUF (llama.cpp) | 4-8 (Q4_K_M-Q8_0) | 60-75% | Low (Q5_K_M) | Moderate | CPU inference, consumer GPUs |
| FP8 | 8 | 50% | Very Low | Very Fast (Hopper) | H100/H200, maximum throughput |
| BitsAndBytes (NF4) | 4 | 75% | Low | Moderate | Training + inference (LoRA) |
Quantizing a Model with AutoAWQ
# Install AutoAWQ
pip install autoawq
# Quantize a model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "models/Meta-Llama-3-70B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Model quantized and saved to {quant_path}")
# Memory requirements after quantization:
# 70B model FP16: ~140GB VRAM
# 70B model AWQ: ~40GB VRAM
- For 70B parameter models: use AWQ 4-bit for production serving on A100 40GB or RTX 4090 pairs.
- For 7B-13B models: Q5_K_M GGUF runs on consumer GPUs with excellent quality.
- For maximum throughput on H100: use FP8 with TensorRT-LLM.
Always evaluate on your specific tasks: quantization quality varies by model and use case.
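For planning which quantization level fits which GPU, a weights-only estimate is a useful lower bound (real footprints run higher once group-size metadata, activations, and KV cache are added):

```python
# Weights-only memory estimate at different quantization levels.
# Treat these as lower bounds, not exact figures.

def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"70B @ {name}: ~{weight_gib(70, bits):.0f} GiB weights")
```

This is where the "140GB FP16 vs. ~40GB AWQ" rule of thumb comes from: 70B weights at 4 bits are about a quarter of the FP16 size, and the remainder of the quoted footprint is quantization metadata plus runtime overhead.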
LLMs on Kubernetes: Production Patterns
Running LLMs on Kubernetes requires understanding GPU scheduling, memory constraints, and the unique lifecycle of inference workloads.
Prerequisites
# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true
# Verify GPU nodes
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia")))'
# Should show: nvidia.com/gpu: "4" (or however many GPUs)
Production Deployment with K8s
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
  labels:
    app: llm-70b-instruct
    model: llama-3-70b
spec:
  replicas: 1  # Usually 1 per model (stateful, GPU-bound)
  selector:
    matchLabels:
      app: llm-70b-instruct
  template:
    metadata:
      labels:
        app: llm-70b-instruct
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        node-type: gpu-a100  # Ensure GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /models/Meta-Llama-3-70B-Instruct-AWQ
            - --tensor-parallel-size
            - "2"
            - --quantization
            - awq
            - --max-model-len
            - "8192"
            - --max-num-seqs
            - "256"
            - --gpu-memory-utilization
            - "0.95"
            - --swap-space
            - "4"
            - --enable-prefix-caching
            - --api-key
            - $(API_KEY)
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "4"
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-keys
                  key: primary
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
          volumeMounts:
            - name: models
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - llm-70b-instruct
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
spec:
  selector:
    app: llm-70b-instruct
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-70b-instruct-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-70b-instruct
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: External
      external:
        metric:
          name: vllm:queue_length
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-70b-instruct-pdb
  namespace: llm-serving
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-70b-instruct
Model Routing and Load Balancing
When running multiple models, you need intelligent routing. LiteLLM has become the standard proxy for this.
LiteLLM Configuration
# config.yaml for LiteLLM
model_list:
  # Primary models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llama-3-70b
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  - model_name: llama-3-8b
    litellm_params:
      model: openai/Meta-Llama-3-8B-Instruct
      api_base: http://llm-8b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  # Fallback models
  - model_name: llama-3-70b-fallback
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct-backup.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  proxy_batch_write_at: 60
  database_url: os.environ/DATABASE_URL

router_settings:
  routing_strategy: simple-shuffle  # or least-busy, weighted
  timeout: 30  # seconds
  num_retries: 2

# Rate limiting
litellm_settings:
  rate_limit:
    - model: llama-3-70b
      tpm: 100000
      rpm: 1000
    - model: llama-3-8b
      tpm: 500000
      rpm: 5000

# Guardrails
guardrails:
  - guardrail_name: "PII-detection"
    litellm_params:
      guardrail: presidio
      output:
        redact: true
  - guardrail_name: "content-moderation"
    litellm_params:
      guardrail: llamaguard
      mode: "during_call"

# Spend tracking
team_settings:
  - team_id: "engineering"
    models: ["llama-3-70b", "llama-3-8b"]
    max_budget: 1000
    budget_duration: "30d"
  - team_id: "research"
    models: ["gpt-4", "llama-3-70b", "llama-3-8b"]
    max_budget: 5000
    budget_duration: "30d"
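Beyond the proxy's own routing_strategy, many teams add a cheap application-side heuristic that picks the model name before the request ever reaches LiteLLM: small, simple prompts go to the 8B route, and only demanding requests escalate to 70B. A minimal sketch, where the chars-to-tokens ratio and the 2000-token cutoff are made-up tuning knobs:

```python
# App-side model selection: send cheap traffic to the 8B route and escalate
# to 70B only for demanding requests. Model names match the proxy config;
# the heuristic thresholds are illustrative and should be tuned per workload.

def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    approx_tokens = len(prompt) // 4  # rough chars-to-tokens heuristic
    if needs_reasoning or approx_tokens > 2000:
        return "llama-3-70b"
    return "llama-3-8b"

print(pick_model("Classify this ticket as bug or feature."))  # llama-3-8b
print(pick_model("x" * 20000))                                # llama-3-70b
```

Since the 8B model serves several times more tokens per GPU-hour, even a crude classifier like this can cut serving cost substantially without touching the proxy configuration.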
Autoscaling Patterns for LLMs
Traditional CPU-based autoscaling doesn't work for LLMs. GPU autoscaling requires different signals:
- GPU Utilization: When > 80%, scale up
- Queue Depth: Pending requests waiting for GPU time
- Time-to-First-Token (TTFT): When latency exceeds SLO
- Tokens-per-Second (TPS): When throughput drops
# Custom metrics adapter for GPU scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-70b
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on GPU utilization
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization
          selector:
            matchLabels:
              pod: vllm-70b
        target:
          type: AverageValue
          averageValue: "80"
    # Scale on request queue depth
    - type: External
      external:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "20"
  behavior:
    # Slow scale-up (GPU pods take time to start)
    scaleUp:
      stabilizationWindowSeconds: 180
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    # Very slow scale-down (avoid thrashing)
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600

# Important: GPU autoscaling is expensive and slow.
# Consider these patterns instead:
#   1. Pre-warmed pools for expected load
#   2. Multi-model routing to balance across existing capacity
#   3. Spot/preemptible instances for batch workloads
LLM Observability: Metrics That Matter
Standard application metrics aren't sufficient for LLMs. You need token-level observability:
| Metric | Description | Alert Threshold |
|---|---|---|
| TTFT (Time to First Token) | Latency from request to first response token | P99 > 500ms |
| TPOT (Time Per Output Token) | Inter-token latency during generation | P99 > 100ms |
| Throughput (tokens/sec) | Total tokens generated per second | Monitor trends |
| Queue Depth | Pending requests | > 50 requests |
| KV Cache Utilization | GPU memory used for context | > 90% |
| Cost per 1K tokens | Infrastructure cost efficiency | Compare to OpenAI pricing |
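TTFT and TPOT fall out of per-token arrival timestamps, which any streaming client can record around its response loop. A minimal calculation, independent of the serving stack (the timestamps below are synthetic):

```python
# Deriving TTFT and TPOT from per-token arrival timestamps recorded around
# a streaming response. Pure arithmetic, so it works with any client.

def token_latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_sent                      # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0             # mean inter-token gap
    return {"ttft_ms": ttft * 1000, "tpot_ms": tpot * 1000}

# Synthetic trace: first token after 300ms, then one token every 40ms
times = [0.300 + 0.040 * i for i in range(10)]
m = token_latency_metrics(0.0, times)
print(f"TTFT {m['ttft_ms']:.0f}ms, TPOT {m['tpot_ms']:.0f}ms")  # TTFT 300ms, TPOT 40ms
```

Aggregating these per-request values into P50/P99 histograms is what turns the table above into actionable alerts: a rising TTFT usually means queueing, while a rising TPOT points at batch pressure on the GPU itself.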
Security and Safety in LLMOps
Running LLMs introduces unique security challenges:
- Prompt Injection: Malicious inputs to manipulate model behavior
- Data Leakage: Model memorizing and exposing training data
- Jailbreaking: Bypassing safety constraints
- Resource Exhaustion: Denial of service via expensive generation requests
- Model Theft: Extracting model weights or architecture
# Security layers for production
1. Input Validation
- Max length limits
- Rate limiting per user/IP
- Content filtering (PII, toxic content)
2. Prompt Injection Detection
- LlamaGuard integration
- Custom rules for known attack patterns
- Sandboxing for untrusted inputs
3. Output Filtering
- PII redaction (Presidio)
- Content moderation
- Refusal detection
4. Resource Limits
- Max tokens per request
- Timeout thresholds
- Queue limits per tenant
5. Network Security
- TLS everywhere
- API key authentication
- VPC/isolated network for model servers
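Layers 1 and 4 can be enforced before a request ever reaches the GPU. A minimal pre-flight check, with a sliding-window rate limiter plus length and max_tokens caps (all limit values are illustrative):

```python
# Minimal pre-flight validation: length caps, per-user rate limiting, and a
# max_tokens ceiling. Limits are illustrative; tune them per deployment.
import time
from collections import defaultdict, deque

MAX_PROMPT_CHARS = 32_000
MAX_OUTPUT_TOKENS = 2_048
REQUESTS_PER_MINUTE = 30

_request_log: dict[str, deque] = defaultdict(deque)

def validate_request(user_id: str, prompt: str, max_tokens: int) -> tuple[bool, str]:
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt too long"
    if max_tokens > MAX_OUTPUT_TOKENS:
        return False, "max_tokens exceeds cap"
    now = time.monotonic()
    log = _request_log[user_id]
    while log and now - log[0] > 60:  # drop entries older than one minute
        log.popleft()
    if len(log) >= REQUESTS_PER_MINUTE:
        return False, "rate limit exceeded"
    log.append(now)
    return True, "ok"

print(validate_request("alice", "Hello!", max_tokens=256))  # (True, 'ok')
```

Checks like these are cheap to run in the gateway, which matters for resource-exhaustion attacks: rejecting an oversized request in microseconds is far better than letting it occupy GPU time for seconds.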
Cost Optimization Strategies
LLM infrastructure is expensive. Here's how to optimize:
- Quantization: 4-bit reduces VRAM by 75%, enabling smaller GPUs
- Prefix Caching: Cache common prompts (RAG contexts, system prompts)
- Multi-Model Routing: Use smaller models when sufficient
- Spot Instances: For batch inference workloads
- Request Batching: Increase throughput without linear cost
- Time-based Scaling: Scale down during off-hours
# Cost breakdown for 70B model inference (AWS p4d.24xlarge - 8xA100)
# On-demand: $32.77/hour
# Spot: $9.83/hour (70% savings)
# At 1000 requests/hour, 500 tokens average:
# Total tokens: 500,000/hour = 12M/day
# Cost: $32.77/hour = $787/day (on-demand)
# Cost: $9.83/hour = $236/day (spot)
# Comparable OpenAI API cost: ~$180/day (gpt-4-turbo)
# Comparable Anthropic API cost: ~$270/day (claude-3-opus)
# Self-hosting becomes cheaper than the ~$15 per 1M blended API rate
# implied by the figures above at roughly:
# - ~16M+ tokens/day with spot instances ($236/day at $15 per 1M)
# - ~52M+ tokens/day with on-demand instances ($787/day at $15 per 1M)
# Additional savings from:
# - No per-request latency
# - No rate limits
# - Model customization capability
Conclusion
Self-hosted LLMs have matured from experimental projects to production infrastructure. The combination of vLLM's PagedAttention, AWQ quantization, and Kubernetes GPU scheduling makes it feasible to run 70B+ parameter models on affordable hardware.
The key patterns for 2026: Use vLLM or TGI for high-throughput serving, quantize aggressively with AWQ 4-bit for production, route intelligently between models, and implement proper observability with token-level metrics. Security through LlamaGuard and cost optimization through spot instances and prefix caching complete the picture.
Start with a single model, measure your actual token throughput, and scale based on data. LLMOps is still evolving—the best practices today will be outdated in a year, but the fundamentals of efficient inference, observability, and security will remain constant.
The democratization of AI infrastructure is here. You no longer need OpenAI's budget to run production-grade LLMs—you just need the patterns in this guide.