LLMOps 2026: Production-Grade Self-Hosted AI Deployment Patterns
Running LLMs in production is fundamentally different from running traditional microservices. Memory constraints, GPU scheduling, batch processing, and autoscaling challenges require new patterns. This guide covers the complete LLMOps landscape—from vLLM and TGI to multi-model routing, quantization strategies, and building inference infrastructure that scales.
The LLMOps Landscape in 2026
LLMOps emerged as a distinct discipline because Large Language Models break traditional MLOps assumptions. You can't just wrap a model in a Flask container and call it production-ready. The memory requirements are massive (70B parameter models need 140GB+ VRAM), inference is stateful (KV-cache management), and throughput depends heavily on batching and scheduling decisions.
In 2026, the LLMOps stack has stabilized around several key components:
- Inference Engines: vLLM, TGI (Text Generation Inference), TensorRT-LLM, llama.cpp
- Model Storage: Hugging Face Hub, S3-compatible object stores, model registries
- Serving Infrastructure: Kubernetes with GPU operators, specialized schedulers
- API Gateways: LiteLLM, OpenRouter, custom routing layers
- Observability: Token-level metrics, cost tracking, latency analysis
- Safety: Guardrails, prompt injection detection, output filtering
The 2026 AI Infrastructure Report shows that 62% of enterprises running LLMs in production use self-hosted or hybrid setups (up from 38% in 2024). The primary drivers: data privacy (87%), cost predictability at scale (64%), and model customization (71%). Managed APIs remain popular for prototyping, but production workloads increasingly move on-premises.
Why Self-Hosted LLMs?
Before diving into implementation, understand when self-hosting makes sense:
| Factor | Self-Hosted | Managed API |
|---|---|---|
| Data Privacy | Data never leaves infrastructure | Third-party data processing |
| Cost at Scale | Fixed hardware costs | Linear per-token pricing |
| Latency | Predictable, controllable | Variable, network-dependent |
| Model Control | Fine-tune, merge, quantize freely | Limited to provider's models |
| Initial Setup | Complex, requires expertise | Minutes with API key |
| Operational Overhead | Significant (GPU management, scaling) | Minimal (provider-managed) |
The breakeven point typically occurs around 10-50 million tokens per day, depending on hardware costs and model size. At enterprise scale (billions of tokens monthly), self-hosting can reduce costs by 60-80%.
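That breakeven can be sanity-checked with one division: fixed daily hardware cost over the per-token API rate. A minimal sketch, where the $10/hour GPU price and $15-per-1M blended API rate are illustrative placeholders, not quotes:

```python
# Rough breakeven estimate: fixed hardware cost vs. per-token API pricing.
# Both prices below are illustrative placeholders; substitute your own quotes.

def breakeven_tokens_per_day(gpu_cost_per_hour: float,
                             api_cost_per_million_tokens: float) -> float:
    """Daily token volume above which self-hosting is cheaper than the API."""
    daily_hardware_cost = gpu_cost_per_hour * 24
    return daily_hardware_cost / api_cost_per_million_tokens * 1_000_000

# Example: $10/hour spot GPU node vs. $15 per 1M blended API tokens
tokens = breakeven_tokens_per_day(10.0, 15.0)
print(f"Breakeven at ~{tokens / 1e6:.0f}M tokens/day")  # Breakeven at ~16M tokens/day
```

Below that volume, utilization is the real enemy: a GPU you pay for 24 hours a day but use for 4 is five times more expensive per token than the headline rate suggests.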
Inference Engines: The Heart of LLMOps
The inference engine is where model weights meet compute. Your choice fundamentally impacts throughput, latency, and hardware requirements.
Inference Engine Comparison
| Engine | Best For | Throughput | Latency | Memory Efficiency | Quantization |
|---|---|---|---|---|---|
| vLLM | High-throughput serving | Excellent (PagedAttention) | Good | Good (continuous batching) | FP16, INT8, AWQ, GPTQ |
| TGI | Hugging Face ecosystem | Very Good | Good | Very Good (FlashAttention) | FP16, INT8, AWQ, GPTQ, EETQ |
| TensorRT-LLM | NVIDIA GPUs, max throughput | Excellent | Excellent | Excellent (inflight batching) | FP16, INT8, INT4, FP8 |
| llama.cpp | CPU inference, edge devices | Moderate | High (on CPU) | Excellent (GGUF) | GGUF (Q4_K_M, Q5_K_M, Q8_0) |
| ExLlamaV2 | Local LLMs on consumer GPUs | High | Very Low | Excellent | ExL2, GPTQ |
| MLC LLM | Multi-platform deployment | Good | Low | Good | INT4, INT8 |
Understanding PagedAttention (vLLM's Innovation)
vLLM revolutionized LLM serving with PagedAttention, inspired by virtual memory in operating systems. Traditional inference allocates contiguous memory for the KV cache, leading to massive internal fragmentation. PagedAttention stores KV cache in fixed-size blocks (like OS pages), enabling:
- Near-zero memory waste—only used blocks allocated
- Continuous batching—new requests join ongoing batch
- 2-4x throughput improvement over naive batching
# Traditional Batching (inefficient)
Request 1: [Generate....................................................Done]
Request 2: [Generate....................................................Done]
Request 3: [Generate....................................................Done]
Time -> [===========================================================]
# Continuous Batching with PagedAttention (efficient)
Request 1: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Request 2: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Request 3: [Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Gen][Done]
Time -> [===============================================================]
# GPU constantly utilized, new requests join as others complete
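The memory side of the same idea can be shown with a toy allocator: contiguous allocation reserves the worst-case length for every request up front, while block-based allocation holds only the blocks a sequence has actually filled. The sequence lengths below are made up; the 16-token block size mirrors vLLM's default:

```python
# Toy illustration of why block-based (paged) KV-cache allocation wastes
# far less memory than contiguous worst-case pre-allocation.

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

def contiguous_alloc(seq_lens, max_len):
    """Naive scheme: every sequence reserves max_len token slots up front."""
    return len(seq_lens) * max_len

def paged_alloc(seq_lens):
    """Paged scheme: each sequence holds only the blocks it actually fills."""
    # -(-l // BLOCK_SIZE) is integer ceiling division
    return sum(-(-l // BLOCK_SIZE) * BLOCK_SIZE for l in seq_lens)

seqs = [120, 800, 45, 2048, 300]  # actual generated lengths (hypothetical)
naive = contiguous_alloc(seqs, max_len=2048)
paged = paged_alloc(seqs)
print(f"contiguous: {naive} slots, paged: {paged} slots "
      f"({100 * (1 - paged / naive):.0f}% saved)")
```

The freed memory is what lets continuous batching admit new requests mid-flight instead of waiting for the whole batch to drain.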
Production vLLM Deployment
vLLM has become the default choice for production GPU serving. Here's how to deploy it properly.
Docker Deployment
# docker-compose.yml for vLLM
version: '3.8'
services:
  vllm:
    image: vllm/vllm-openai:v0.4.0
    runtime: nvidia
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - /data/models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/Meta-Llama-3-70B-Instruct-AWQ
      --tensor-parallel-size 2
      --quantization awq
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --max-num-seqs 256
      --gpu-memory-utilization 0.95
      --swap-space 4
      --enable-prefix-caching
      --api-key ${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"

  # Optional: Redis for caching
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:
Key Configuration Parameters
| Parameter | Description | Recommended |
|---|---|---|
| tensor-parallel-size | GPUs per model instance | 2-4 for 70B+ models |
| pipeline-parallel-size | Pipeline stages across nodes | 1 (unless multi-node) |
| max-model-len | Max context length | 4096-8192 (balance with memory) |
| max-num-seqs | Max concurrent sequences | 128-512 (higher = more throughput) |
| gpu-memory-utilization | Fraction of GPU memory to use | 0.90-0.95 (leave room for overhead) |
| swap-space | CPU swap space in GB | 4-8 (for KV cache offloading) |
| enable-prefix-caching | Cache common prompt prefixes | True (major speedup for RAG) |
| quantization | Weight quantization method | awq, gptq, fp8, or None |
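To see why max-model-len, max-num-seqs, and gpu-memory-utilization trade off against each other, estimate the KV cache footprint per token. The sketch below plugs in Llama-3-70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128) with FP16 cache entries:

```python
# Back-of-envelope KV-cache sizing: this is the memory that --max-model-len
# and --max-num-seqs are budgeting against.

def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    # Factor of 2 covers the separate K and V tensors; FP16 = 2 bytes/entry
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
ctx = 8192
print(f"KV cache per token: {per_token / 1024:.0f} KiB")          # 320 KiB
print(f"One full {ctx}-token sequence: {per_token * ctx / 2**30:.1f} GiB")  # 2.5 GiB
```

At 2.5 GiB per full-context sequence, a few dozen long requests can consume as much GPU memory as the quantized weights themselves, which is why prefix caching and paged allocation matter so much in practice.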
API Usage Examples
# Python client
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-api-key"
)
# Simple completion
response = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."}
],
temperature=0.7,
max_tokens=500,
stream=False
)
print(response.choices[0].message.content)
# Streaming (for real-time UI)
stream = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[{"role": "user", "content": "Write a haiku about AI."}],
stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
# With structured output (JSON mode)
response = client.chat.completions.create(
model="/models/Meta-Llama-3-70B-Instruct-AWQ",
messages=[{
"role": "user",
"content": "Extract entities from: 'Apple launched iPhone 15 in September 2024'"
}],
response_format={"type": "json_object"},
extra_body={
"guided_json": {
"type": "object",
"properties": {
"company": {"type": "string"},
"product": {"type": "string"},
"date": {"type": "string"}
},
"required": ["company", "product", "date"]
}
}
)
import json
entities = json.loads(response.choices[0].message.content)
print(entities) # {'company': 'Apple', 'product': 'iPhone 15', 'date': 'September 2024'}
Quantization: Running Large Models on Smaller Hardware
Quantization reduces model precision to fit larger models on available hardware. In 2026, it's essentially mandatory for production deployment.
Quantization Methods Comparison
| Method | Bits | Memory Reduction | Quality Loss | Speed | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 50% | Minimal | Baseline | Maximum quality, H100/A100 only |
| AWQ | 4 | 75% | Low | Fast (GEMM kernels) | Production serving, consumer GPUs |
| GPTQ | 4, 3 | 75-81% | Low-Moderate | Fast | Maximum compression, edge cases |
| GGUF (llama.cpp) | 4-8 (Q4_K_M-Q8_0) | 60-75% | Low (Q5_K_M) | Moderate | CPU inference, consumer GPUs |
| FP8 | 8 | 50% | Very Low | Very Fast (Hopper) | H100/H200, maximum throughput |
| BitsAndBytes (NF4) | 4 | 75% | Low | Moderate | Training + inference (LoRA) |
Quantizing a Model with AutoAWQ
# Install AutoAWQ
pip install autoawq
# Quantize a model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "models/Meta-Llama-3-70B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path,
device_map="auto",
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"Model quantized and saved to {quant_path}")
# Memory requirements after quantization:
# 70B model FP16: ~140GB VRAM
# 70B model AWQ: ~40GB VRAM
- For 70B parameter models: use AWQ 4-bit for production serving on A100 40GB or RTX 4090 pairs.
- For 7B-13B models: Q5_K_M GGUF runs on consumer GPUs with excellent quality.
- For maximum throughput on H100: use FP8 with TensorRT-LLM.
Always evaluate on your specific tasks: quantization quality varies by model and use case.
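For planning which quantization level fits which GPU, a weights-only estimate is a useful lower bound (real footprints run higher once group-size metadata, activations, and KV cache are added):

```python
# Weights-only memory estimate at different quantization levels.
# Treat these as lower bounds, not exact figures.

def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"70B @ {name}: ~{weight_gib(70, bits):.0f} GiB weights")
```

This is where the "140GB FP16 vs. ~40GB AWQ" rule of thumb comes from: 70B weights at 4 bits are about a quarter of the FP16 size, and the remainder of the quoted footprint is quantization metadata plus runtime overhead.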
LLMs on Kubernetes: Production Patterns
Running LLMs on Kubernetes requires understanding GPU scheduling, memory constraints, and the unique lifecycle of inference workloads.
Prerequisites
# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true
# Verify GPU nodes
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia")))'
# Should show: nvidia.com/gpu: "4" (or however many GPUs)
Production Deployment with K8s
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
  labels:
    app: llm-70b-instruct
    model: llama-3-70b
spec:
  replicas: 1  # Usually 1 per model (stateful, GPU-bound)
  selector:
    matchLabels:
      app: llm-70b-instruct
  template:
    metadata:
      labels:
        app: llm-70b-instruct
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        node-type: gpu-a100  # Ensure GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /models/Meta-Llama-3-70B-Instruct-AWQ
            - --tensor-parallel-size
            - "2"
            - --quantization
            - awq
            - --max-model-len
            - "8192"
            - --max-num-seqs
            - "256"
            - --gpu-memory-utilization
            - "0.95"
            - --swap-space
            - "4"
            - --enable-prefix-caching
            - --api-key
            - $(API_KEY)
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "4"
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-keys
                  key: primary
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
          volumeMounts:
            - name: models
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - llm-70b-instruct
                topologyKey: kubernetes.io/hostname
---
apiVersion: v1
kind: Service
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
spec:
  selector:
    app: llm-70b-instruct
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-70b-instruct-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-70b-instruct
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: External
      external:
        metric:
          name: vllm:queue_length
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-70b-instruct-pdb
  namespace: llm-serving
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-70b-instruct
Model Routing and Load Balancing
When running multiple models, you need intelligent routing. LiteLLM has become the standard proxy for this.
LiteLLM Configuration
# config.yaml for LiteLLM
model_list:
  # Primary models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
  - model_name: llama-3-70b
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  - model_name: llama-3-8b
    litellm_params:
      model: openai/Meta-Llama-3-8B-Instruct
      api_base: http://llm-8b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  # Fallback models
  - model_name: llama-3-70b-fallback
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct-backup.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  proxy_batch_write_at: 60
  database_url: os.environ/DATABASE_URL

router_settings:
  routing_strategy: simple-shuffle  # or least-busy, weighted
  timeout: 30  # seconds
  num_retries: 2

# Rate limiting
litellm_settings:
  rate_limit:
    - model: llama-3-70b
      tpm: 100000
      rpm: 1000
    - model: llama-3-8b
      tpm: 500000
      rpm: 5000

# Guardrails
guardrails:
  - guardrail_name: "PII-detection"
    litellm_params:
      guardrail: presidio
      output:
        redact: true
  - guardrail_name: "content-moderation"
    litellm_params:
      guardrail: llamaguard
      mode: "during_call"

# Spend tracking
team_settings:
  - team_id: "engineering"
    models: ["llama-3-70b", "llama-3-8b"]
    max_budget: 1000
    budget_duration: "30d"
  - team_id: "research"
    models: ["gpt-4", "llama-3-70b", "llama-3-8b"]
    max_budget: 5000
    budget_duration: "30d"
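Beyond the proxy's own routing_strategy, many teams add a cheap application-side heuristic that picks the model name before the request ever reaches LiteLLM: small, simple prompts go to the 8B route, and only demanding requests escalate to 70B. A minimal sketch, where the chars-to-tokens ratio and the 2000-token cutoff are made-up tuning knobs:

```python
# App-side model selection: send cheap traffic to the 8B route and escalate
# to 70B only for demanding requests. Model names match the proxy config;
# the heuristic thresholds are illustrative and should be tuned per workload.

def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    approx_tokens = len(prompt) // 4  # rough chars-to-tokens heuristic
    if needs_reasoning or approx_tokens > 2000:
        return "llama-3-70b"
    return "llama-3-8b"

print(pick_model("Classify this ticket as bug or feature."))  # llama-3-8b
print(pick_model("x" * 20000))                                # llama-3-70b
```

Since the 8B model serves several times more tokens per GPU-hour, even a crude classifier like this can cut serving cost substantially without touching the proxy configuration.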
Autoscaling Patterns for LLMs
Traditional CPU-based autoscaling doesn't work for LLMs. GPU autoscaling requires different signals:
- GPU Utilization: When > 80%, scale up
- Queue Depth: Pending requests waiting for GPU time
- Time-to-First-Token (TTFT): When latency exceeds SLO
- Tokens-per-Second (TPS): When throughput drops
# Custom metrics adapter for GPU scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-70b
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on GPU utilization
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization
          selector:
            matchLabels:
              pod: vllm-70b
        target:
          type: AverageValue
          averageValue: "80"
    # Scale on request queue depth
    - type: External
      external:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "20"
  behavior:
    # Slow scale-up (GPU pods take time to start)
    scaleUp:
      stabilizationWindowSeconds: 180
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    # Very slow scale-down (avoid thrashing)
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600

# Important: GPU autoscaling is expensive and slow.
# Consider these patterns instead:
#   1. Pre-warmed pools for expected load
#   2. Multi-model routing to balance across existing capacity
#   3. Spot/preemptible instances for batch workloads
LLM Observability: Metrics That Matter
Standard application metrics aren't sufficient for LLMs. You need token-level observability:
| Metric | Description | Alert Threshold |
|---|---|---|
| TTFT (Time to First Token) | Latency from request to first response token | P99 > 500ms |
| TPOT (Time Per Output Token) | Inter-token latency during generation | P99 > 100ms |
| Throughput (tokens/sec) | Total tokens generated per second | Monitor trends |
| Queue Depth | Pending requests | > 50 requests |
| KV Cache Utilization | GPU memory used for context | > 90% |
| Cost per 1K tokens | Infrastructure cost efficiency | Compare to OpenAI pricing |
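TTFT and TPOT fall out of per-token arrival timestamps, which any streaming client can record around its response loop. A minimal calculation, independent of the serving stack (the timestamps below are synthetic):

```python
# Deriving TTFT and TPOT from per-token arrival timestamps recorded around
# a streaming response. Pure arithmetic, so it works with any client.

def token_latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    ttft = token_times[0] - request_sent                      # time to first token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0             # mean inter-token gap
    return {"ttft_ms": ttft * 1000, "tpot_ms": tpot * 1000}

# Synthetic trace: first token after 300ms, then one token every 40ms
times = [0.300 + 0.040 * i for i in range(10)]
m = token_latency_metrics(0.0, times)
print(f"TTFT {m['ttft_ms']:.0f}ms, TPOT {m['tpot_ms']:.0f}ms")  # TTFT 300ms, TPOT 40ms
```

Aggregating these per-request values into P50/P99 histograms is what turns the table above into actionable alerts: a rising TTFT usually means queueing, while a rising TPOT points at batch pressure on the GPU itself.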
Security and Safety in LLMOps
Running LLMs introduces unique security challenges:
- Prompt Injection: Malicious inputs to manipulate model behavior
- Data Leakage: Model memorizing and exposing training data
- Jailbreaking: Bypassing safety constraints
- Resource Exhaustion: Denial of service via expensive generation requests
- Model Theft: Extracting model weights or architecture
# Security layers for production
1. Input Validation
- Max length limits
- Rate limiting per user/IP
- Content filtering (PII, toxic content)
2. Prompt Injection Detection
- LlamaGuard integration
- Custom rules for known attack patterns
- Sandboxing for untrusted inputs
3. Output Filtering
- PII redaction (Presidio)
- Content moderation
- Refusal detection
4. Resource Limits
- Max tokens per request
- Timeout thresholds
- Queue limits per tenant
5. Network Security
- TLS everywhere
- API key authentication
- VPC/isolated network for model servers
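Layers 1 and 4 can be enforced before a request ever reaches the GPU. A minimal pre-flight check, with a sliding-window rate limiter plus length and max_tokens caps (all limit values are illustrative):

```python
# Minimal pre-flight validation: length caps, per-user rate limiting, and a
# max_tokens ceiling. Limits are illustrative; tune them per deployment.
import time
from collections import defaultdict, deque

MAX_PROMPT_CHARS = 32_000
MAX_OUTPUT_TOKENS = 2_048
REQUESTS_PER_MINUTE = 30

_request_log: dict[str, deque] = defaultdict(deque)

def validate_request(user_id: str, prompt: str, max_tokens: int) -> tuple[bool, str]:
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt too long"
    if max_tokens > MAX_OUTPUT_TOKENS:
        return False, "max_tokens exceeds cap"
    now = time.monotonic()
    log = _request_log[user_id]
    while log and now - log[0] > 60:  # drop entries older than one minute
        log.popleft()
    if len(log) >= REQUESTS_PER_MINUTE:
        return False, "rate limit exceeded"
    log.append(now)
    return True, "ok"

print(validate_request("alice", "Hello!", max_tokens=256))  # (True, 'ok')
```

Checks like these are cheap to run in the gateway, which matters for resource-exhaustion attacks: rejecting an oversized request in microseconds is far better than letting it occupy GPU time for seconds.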
Cost Optimization Strategies
LLM infrastructure is expensive. Here's how to optimize:
- Quantization: 4-bit reduces VRAM by 75%, enabling smaller GPUs
- Prefix Caching: Cache common prompts (RAG contexts, system prompts)
- Multi-Model Routing: Use smaller models when sufficient
- Spot Instances: For batch inference workloads
- Request Batching: Increase throughput without linear cost
- Time-based Scaling: Scale down during off-hours
# Cost breakdown for 70B model inference (AWS p4d.24xlarge - 8xA100)
# On-demand: $32.77/hour
# Spot: $9.83/hour (70% savings)
# At 1000 requests/hour, 500 tokens average:
# Total tokens: 500,000/hour = 12M/day
# Cost: $32.77/hour = $787/day (on-demand)
# Cost: $9.83/hour = $236/day (spot)
# Comparable OpenAI API cost: ~$180/day (gpt-4-turbo)
# Comparable Anthropic API cost: ~$270/day (claude-3-opus)
# Self-hosting becomes cheaper than the ~$15 per 1M blended API rate
# implied by the figures above at roughly:
# - ~16M+ tokens/day with spot instances ($236/day at $15 per 1M)
# - ~52M+ tokens/day with on-demand instances ($787/day at $15 per 1M)
# Additional savings from:
# - No per-request latency
# - No rate limits
# - Model customization capability
Conclusion
Self-hosted LLMs have matured from experimental projects to production infrastructure. The combination of vLLM's PagedAttention, AWQ quantization, and Kubernetes GPU scheduling makes it feasible to run 70B+ parameter models on affordable hardware.
The key patterns for 2026: Use vLLM or TGI for high-throughput serving, quantize aggressively with AWQ 4-bit for production, route intelligently between models, and implement proper observability with token-level metrics. Security through LlamaGuard and cost optimization through spot instances and prefix caching complete the picture.
Start with a single model, measure your actual token throughput, and scale based on data. LLMOps is still evolving—the best practices today will be outdated in a year, but the fundamentals of efficient inference, observability, and security will remain constant.
The democratization of AI infrastructure is here. You no longer need OpenAI's budget to run production-grade LLMs—you just need the patterns in this guide.