
LLMOps 2026: Production-Grade Self-Hosted AI Deployment Patterns


Running LLMs in production is fundamentally different from running traditional microservices. Memory constraints, GPU scheduling, batch processing, and autoscaling challenges require new patterns. This guide covers the complete LLMOps landscape—from vLLM and TGI to multi-model routing, quantization strategies, and building inference infrastructure that scales.

The LLMOps Landscape in 2026

LLMOps emerged as a distinct discipline because Large Language Models break traditional MLOps assumptions. You can't just wrap a model in a Flask container and call it production-ready. The memory requirements are massive (70B parameter models need 140GB+ VRAM), inference is stateful (KV-cache management), and throughput depends heavily on batching and scheduling decisions.

In 2026, the LLMOps stack has stabilized around several key components:

📊 LLM Deployment Maturity 2026

The 2026 AI Infrastructure Report shows that 62% of enterprises running LLMs in production use self-hosted or hybrid setups (up from 38% in 2024). The primary drivers: data privacy (87%), cost predictability at scale (64%), and model customization (71%). Managed APIs remain popular for prototyping, but production workloads increasingly move on-premise.

Why Self-Hosted LLMs?

Before diving into implementation, understand when self-hosting makes sense:

| Factor | Self-Hosted | Managed API |
|---|---|---|
| Data Privacy | Data never leaves infrastructure | Third-party data processing |
| Cost at Scale | Fixed hardware costs | Linear per-token pricing |
| Latency | Predictable, controllable | Variable, network-dependent |
| Model Control | Fine-tune, merge, quantize freely | Limited to provider's models |
| Initial Setup | Complex, requires expertise | Minutes with API key |
| Operational Overhead | Significant (GPU management, scaling) | Minimal (provider-managed) |

The breakeven point typically occurs around 10-50 million tokens per day, depending on hardware costs and model size. At enterprise scale (billions of tokens monthly), self-hosting can reduce costs by 60-80%.
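The breakeven arithmetic is simple enough to sketch. The rates in the example below (a spot GPU node at $9.83/hour, $30 per million API tokens) are illustrative assumptions, not quotes — plug in your own numbers:

```python
# Back-of-envelope breakeven estimate: self-hosting vs a managed API.
# All rates here are illustrative assumptions, not vendor quotes.

def daily_hardware_cost(hourly_rate: float) -> float:
    """Fixed cost of a GPU node running 24/7, in dollars per day."""
    return hourly_rate * 24

def breakeven_tokens_per_day(hourly_rate: float, api_price_per_million: float) -> float:
    """Token volume at which fixed hardware cost equals per-token API spend."""
    return daily_hardware_cost(hourly_rate) / api_price_per_million * 1_000_000

# Hypothetical rates: $9.83/hr spot GPU node vs $30 per million API tokens
tokens = breakeven_tokens_per_day(9.83, 30.0)
print(f"Breakeven: {tokens / 1e6:.1f}M tokens/day")  # Breakeven: 7.9M tokens/day
```

Below the breakeven volume the managed API wins on cost; above it, the fixed hardware bill amortizes in your favor.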

Inference Engines: The Heart of LLMOps

The inference engine is where model weights meet compute. Your choice fundamentally impacts throughput, latency, and hardware requirements.

Inference Engine Comparison

| Engine | Best For | Throughput | Latency | Memory Efficiency | Quantization |
|---|---|---|---|---|---|
| vLLM | High-throughput serving | Excellent (PagedAttention) | Good | Good (continuous batching) | FP16, INT8, AWQ, GPTQ |
| TGI | Hugging Face ecosystem | Very Good | Good | Very Good (FlashAttention) | FP16, INT8, AWQ, GPTQ, EETQ |
| TensorRT-LLM | NVIDIA GPUs, max throughput | Excellent | Excellent | Excellent (in-flight batching) | FP16, INT8, INT4, FP8 |
| llama.cpp | CPU inference, edge devices | Moderate | High (on CPU) | Excellent (GGUF) | GGUF (Q4_K_M, Q5_K_M, Q8_0) |
| ExLlamaV2 | Local LLMs on consumer GPUs | High | Very Low | Excellent | ExL2, GPTQ |
| MLC LLM | Multi-platform deployment | Good | Low | Good | INT4, INT8 |

Understanding PagedAttention (vLLM's Innovation)

vLLM revolutionized LLM serving with PagedAttention, inspired by virtual memory in operating systems. Traditional inference allocates one contiguous memory region per request for the KV cache, leading to massive internal fragmentation. PagedAttention instead stores the KV cache in fixed-size blocks (like OS pages), which eliminates most of that waste and enables continuous batching:

# Traditional Batching (inefficient)
Request 1: [Generate....................................................Done]
Request 2: [Generate....................................................Done]
Request 3: [Generate....................................................Done]
Time ->    [===========================================================]

# Continuous Batching with PagedAttention (efficient)
Request 1: [Gen][Gen][Gen][Gen][Gen][Gen][Done]
Request 2:      [Gen][Gen][Gen][Gen][Done]
Request 3:           [Gen][Gen][Gen][Gen][Gen][Done]
Time ->    [=======================================]
# GPU constantly utilized, new requests join as others complete
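To make the paging analogy concrete, here is a toy allocator — a sketch of the idea only, not vLLM's actual implementation (which manages GPU tensor blocks in CUDA kernels). Sequences grab fixed-size blocks on demand and return them on completion, so no contiguous reservation is ever made:

```python
# Toy sketch of PagedAttention-style KV-cache block allocation.
# Illustrative only: real engines manage GPU memory, not Python lists.

class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size              # tokens per block (like an OS page)
        self.free_blocks = list(range(num_blocks))
        self.tables: dict[str, list[int]] = {}    # sequence id -> physical block ids
        self.lengths: dict[str, int] = {}         # sequence id -> tokens stored

    def append(self, seq: str) -> None:
        """Grow a sequence's KV cache by one token, allocating a block only when needed."""
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:              # current block is full -> grab a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must wait or be preempted")
            self.tables.setdefault(seq, []).append(self.free_blocks.pop())
        self.lengths[seq] = n + 1

    def free(self, seq: str) -> None:
        """Sequence finished: return its blocks so a new request can join the batch."""
        self.free_blocks.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):
    cache.append("req-1")                         # 20 tokens occupy exactly 2 blocks
print(len(cache.tables["req-1"]))                 # 2
```

The key property is that a 20-token sequence holds two 16-token blocks rather than a worst-case contiguous reservation sized for the full context window.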

Production vLLM Deployment

vLLM has become the default choice for production GPU serving. Here's how to deploy it properly.

Docker Deployment

# docker-compose.yml for vLLM
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:v0.4.0
    runtime: nvidia
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
    volumes:
      - /data/models:/models
    ports:
      - "8000:8000"
    command: >
      --model /models/Meta-Llama-3-70B-Instruct-AWQ
      --tensor-parallel-size 2
      --quantization awq
      --max-model-len 8192
      --max-num-batched-tokens 8192
      --max-num-seqs 256
      --gpu-memory-utilization 0.95
      --swap-space 4
      --enable-prefix-caching
      --api-key ${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s
    restart: unless-stopped
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "3"

  # Optional: Redis for caching
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

volumes:
  redis-data:

Key Configuration Parameters

| Parameter | Description | Recommended |
|---|---|---|
| tensor-parallel-size | GPUs per model instance | 2-4 for 70B+ models |
| pipeline-parallel-size | Pipeline stages across nodes | 1 (unless multi-node) |
| max-model-len | Max context length | 4096-8192 (balance with memory) |
| max-num-seqs | Max concurrent sequences | 128-512 (higher = more throughput) |
| gpu-memory-utilization | Fraction of GPU memory to use | 0.90-0.95 (leave room for overhead) |
| swap-space | CPU swap space in GB | 4-8 (for KV cache offloading) |
| enable-prefix-caching | Cache common prompt prefixes | True (major speedup for RAG) |
| quantization | Weight quantization method | awq, gptq, fp8, or None |

API Usage Examples

# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"
)

# Simple completion
response = client.chat.completions.create(
    model="/models/Meta-Llama-3-70B-Instruct-AWQ",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500,
    stream=False
)

print(response.choices[0].message.content)

# Streaming (for real-time UI)
stream = client.chat.completions.create(
    model="/models/Meta-Llama-3-70B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Write a haiku about AI."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

# With structured output (JSON mode)
response = client.chat.completions.create(
    model="/models/Meta-Llama-3-70B-Instruct-AWQ",
    messages=[{
        "role": "user",
        "content": "Extract entities from: 'Apple launched iPhone 15 in September 2024'"
    }],
    response_format={"type": "json_object"},
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {
                "company": {"type": "string"},
                "product": {"type": "string"},
                "date": {"type": "string"}
            },
            "required": ["company", "product", "date"]
        }
    }
)

import json
entities = json.loads(response.choices[0].message.content)
print(entities)  # {'company': 'Apple', 'product': 'iPhone 15', 'date': 'September 2024'}

Quantization: Running Large Models on Smaller Hardware

Quantization reduces model precision to fit larger models on available hardware. In 2026, it's essentially mandatory for production deployment.

Quantization Methods Comparison

| Method | Bits | Memory Reduction | Quality Loss | Speed | Best For |
|---|---|---|---|---|---|
| FP16 | 16 | 50% | Minimal | Baseline | Maximum quality, H100/A100 only |
| AWQ | 4 | 75% | Low | Fast (GEMM kernels) | Production serving, consumer GPUs |
| GPTQ | 4, 3 | 75-81% | Low-Moderate | Fast | Maximum compression, edge cases |
| GGUF (llama.cpp) | Q4_K_M, Q5_K_M, Q8_0 | 60-75% | Low (Q5_K_M) | Moderate | CPU inference, consumer GPUs |
| FP8 | 8 | 50% | Very Low | Very Fast (Hopper) | H100/H200, maximum throughput |
| BitsAndBytes (NF4) | 4 | 75% | Low | Moderate | Training + inference (LoRA) |
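A rough way to sanity-check the memory-reduction column: weight memory is approximately parameter count times bits per weight. This sketch deliberately ignores per-method overhead (group-wise scales, zero-points) and runtime memory for the KV cache and activations:

```python
# Approximate weight footprint of a model at a given quantization level.
# Ignores quantization metadata, KV cache, and activation memory.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: parameters * bits-per-weight / 8 bits-per-byte."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"70B FP16:     ~{weight_memory_gb(70, 16):.0f} GB")  # ~140 GB
print(f"70B AWQ 4-bit: ~{weight_memory_gb(70, 4):.0f} GB")  # ~35 GB (real-world ~40 GB with overhead)
```

This is why a 70B FP16 model needs multi-GPU setups while the AWQ version fits on a pair of 24 GB consumer cards.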

Quantizing a Model with AutoAWQ

# Install AutoAWQ
pip install autoawq

# Quantize a model
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "models/Meta-Llama-3-70B-Instruct-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, 
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f"Model quantized and saved to {quant_path}")

# Memory requirements after quantization:
# 70B model FP16: ~140GB VRAM
# 70B model AWQ: ~40GB VRAM

💡 Quantization Strategy

For 70B parameter models: Use AWQ 4-bit for production serving on A100 40GB or RTX 4090 pairs. For 7B-13B models: Q5_K_M GGUF runs on consumer GPUs with excellent quality. For maximum throughput on H100: Use FP8 with TensorRT-LLM. Always evaluate on your specific tasks—quantization quality varies by model and use case.
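One concrete way to "evaluate on your specific tasks" is an agreement check: run the same prompts through the FP16 and quantized endpoints and measure how often the answers match. The helper below is a minimal sketch, and the answer lists are stand-ins for real model outputs:

```python
# Minimal quantization quality check: agreement rate between a baseline
# model and its quantized counterpart on the same prompts.

def agreement_rate(baseline: list[str], quantized: list[str]) -> float:
    """Fraction of prompts where the quantized model matches the baseline.

    Exact match is a crude proxy; for generative tasks, swap in a
    task-specific scorer (F1, rubric grading, etc.).
    """
    if len(baseline) != len(quantized):
        raise ValueError("answer lists must be aligned to the same prompts")
    matches = sum(a.strip() == b.strip() for a, b in zip(baseline, quantized))
    return matches / len(baseline)

# Stand-in outputs; in practice collect these from the two serving endpoints
fp16_answers = ["Paris", "42", "O(n log n)", "1969"]
awq_answers  = ["Paris", "42", "O(n^2)",     "1969"]
print(agreement_rate(fp16_answers, awq_answers))  # 0.75
```

A sharp drop in agreement on your own prompt set is a stronger signal than any generic perplexity benchmark.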

LLMs on Kubernetes: Production Patterns

Running LLMs on Kubernetes requires understanding GPU scheduling, memory constraints, and the unique lifecycle of inference workloads.

Prerequisites

# Install NVIDIA GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true

# Verify GPU nodes
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia")))'

# Should show: nvidia.com/gpu: "4" (or however many GPUs)

Production Deployment with K8s

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
  labels:
    app: llm-70b-instruct
    model: llama-3-70b
spec:
  replicas: 1  # Usually 1 per model (stateful, GPU-bound)
  selector:
    matchLabels:
      app: llm-70b-instruct
  template:
    metadata:
      labels:
        app: llm-70b-instruct
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        node-type: gpu-a100  # Ensure GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.4.0
          command:
            - python
            - -m
            - vllm.entrypoints.openai.api_server
          args:
            - --model
            - /models/Meta-Llama-3-70B-Instruct-AWQ
            - --tensor-parallel-size
            - "2"
            - --quantization
            - awq
            - --max-model-len
            - "8192"
            - --max-num-seqs
            - "256"
            - --gpu-memory-utilization
            - "0.95"
            - --swap-space
            - "4"
            - --enable-prefix-caching
            - --api-key
            - $(API_KEY)
          ports:
            - containerPort: 8000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 2
              memory: "80Gi"
              cpu: "4"
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-keys
                  key: primary
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1"
          volumeMounts:
            - name: models
              mountPath: /models
            - name: shm
              mountPath: /dev/shm
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: models-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 10Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - llm-70b-instruct
                topologyKey: kubernetes.io/hostname

---
apiVersion: v1
kind: Service
metadata:
  name: llm-70b-instruct
  namespace: llm-serving
spec:
  selector:
    app: llm-70b-instruct
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-70b-instruct-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-70b-instruct
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm:gpu_utilization
        target:
          type: AverageValue
          averageValue: "80"
    - type: External
      external:
        metric:
          name: vllm:queue_length
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-70b-instruct-pdb
  namespace: llm-serving
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: llm-70b-instruct

Model Routing and Load Balancing

When running multiple models, you need intelligent routing. LiteLLM has become the standard proxy for this.

LiteLLM Configuration

# config.yaml for LiteLLM
model_list:
  # Primary models
  - model_name: gpt-4
    litellm_params:
      model: openai/gpt-4
      api_key: os.environ/OPENAI_API_KEY
  
  - model_name: llama-3-70b
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  
  - model_name: llama-3-8b
    litellm_params:
      model: openai/Meta-Llama-3-8B-Instruct
      api_base: http://llm-8b-instruct.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY
  
  # Fallback models
  - model_name: llama-3-70b-fallback
    litellm_params:
      model: openai/Meta-Llama-3-70B-Instruct
      api_base: http://llm-70b-instruct-backup.llm-serving:8000/v1
      api_key: os.environ/VLLM_API_KEY

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  proxy_batch_write_at: 60
  database_url: os.environ/DATABASE_URL

router_settings:
  routing_strategy: simple-shuffle  # or least-busy, weighted
  timeout: 30  # seconds
  retries: 2
  
# Rate limiting
litellm_settings:
  rate_limit:
    - model: llama-3-70b
      tpm: 100000
      rpm: 1000
    - model: llama-3-8b
      tpm: 500000
      rpm: 5000

# Guardrails
guardrails:
  - guardrail_name: "PII-detection"
    litellm_params:
      guardrail: presidio
      output:
        redact: true
  - guardrail_name: "content-moderation"
    litellm_params:
      guardrail: llamaguard
      mode: "during_call"

# Spend tracking
team_settings:
  - team_id: "engineering"
    models: ["llama-3-70b", "llama-3-8b"]
    max_budget: 1000
    budget_duration: "30d"
  - team_id: "research"
    models: ["gpt-4", "llama-3-70b", "llama-3-8b"]
    max_budget: 5000
    budget_duration: "30d"
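With the proxy in place, clients just pick a model_name and LiteLLM handles routing, retries, and fallbacks. A common pattern is to send cheap, short requests to the 8B model and reserve the 70B model for harder ones; the heuristic below is a deliberately simple sketch (the token threshold is an assumption to tune for your workload):

```python
# Client-side model selection in front of the LiteLLM proxy.
# Model names mirror the config above; the threshold is a placeholder.

def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Route long or reasoning-heavy prompts to the 70B model, the rest to 8B."""
    approx_tokens = len(prompt) // 4          # rough chars-per-token heuristic
    if needs_reasoning or approx_tokens > 1000:
        return "llama-3-70b"
    return "llama-3-8b"

# The chosen name then goes to the proxy via any OpenAI-compatible client:
#   client = OpenAI(base_url="http://litellm:4000/v1", api_key=...)
#   client.chat.completions.create(model=pick_model(prompt), messages=[...])

print(pick_model("Summarize this sentence."))  # llama-3-8b
print(pick_model("x" * 8000))                  # llama-3-70b
```

Even this crude split can cut serving cost substantially, since most traffic in typical workloads is short-prompt and handled fine by the smaller model.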

Autoscaling Patterns for LLMs

Traditional CPU-based autoscaling doesn't work for LLMs. GPU autoscaling requires different signals:

# Custom metrics adapter for GPU scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-gpu-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-70b
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale on GPU utilization
    - type: External
      external:
        metric:
          name: nvidia_gpu_utilization
          selector:
            matchLabels:
              pod: vllm-70b
        target:
          type: AverageValue
          averageValue: "80"
    
    # Scale on request queue depth
    - type: External
      external:
        metric:
          name: vllm_queue_length
        target:
          type: AverageValue
          averageValue: "20"
  behavior:
    # Slow scale-up (GPU pods take time to start)
    scaleUp:
      stabilizationWindowSeconds: 180
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300
    # Very slow scale-down (avoid thrashing)
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Pods
          value: 1
          periodSeconds: 600

# Important: GPU autoscaling is expensive and slow.
# Consider these patterns instead:
# 1. Pre-warmed pools for expected load
# 2. Multi-model routing to balance across existing capacity
# 3. Spot/preemptible instances for batch workloads

LLM Observability: Metrics That Matter

Standard application metrics aren't sufficient for LLMs. You need token-level observability:

| Metric | Description | Alert Threshold |
|---|---|---|
| TTFT (Time to First Token) | Latency from request to first response token | P99 < 500ms |
| TPOT (Time Per Output Token) | Inter-token latency during generation | P99 < 100ms |
| Throughput (tokens/sec) | Total tokens generated per second | Monitor trends |
| Queue Depth | Pending requests | > 50 requests |
| KV Cache Utilization | GPU memory used for context | > 90% |
| Cost per 1K tokens | Infrastructure cost efficiency | Compare to OpenAI pricing |
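TTFT and TPOT fall straight out of token arrival timestamps. This helper is a sketch you could wrap around the streaming loop shown earlier, recording time.monotonic() as each chunk arrives:

```python
# Compute TTFT and mean TPOT from token arrival timestamps (seconds).
# token_times[i] is when the i-th output token arrived, e.g. captured
# with time.monotonic() inside a streaming response loop.

def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start         # time to first token
    if len(token_times) == 1:
        return {"ttft_s": ttft, "tpot_s": 0.0}
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {"ttft_s": ttft, "tpot_s": sum(gaps) / len(gaps)}  # mean inter-token gap

m = latency_metrics(0.0, [0.35, 0.40, 0.45, 0.50])
print(round(m["ttft_s"], 3), round(m["tpot_s"], 3))  # 0.35 0.05
```

In production you would feed these values into per-request histograms (e.g. Prometheus) so the P99 thresholds in the table above can actually fire alerts.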

Security and Safety in LLMOps

Running LLMs introduces unique security challenges:

# Security layers for production

1. Input Validation
   - Max length limits
   - Rate limiting per user/IP
   - Content filtering (PII, toxic content)

2. Prompt Injection Detection
   - LlamaGuard integration
   - Custom rules for known attack patterns
   - Sandboxing for untrusted inputs

3. Output Filtering
   - PII redaction (Presidio)
   - Content moderation
   - Refusal detection

4. Resource Limits
   - Max tokens per request
   - Timeout thresholds
   - Queue limits per tenant

5. Network Security
   - TLS everywhere
   - API key authentication
   - VPC/isolated network for model servers
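The input-validation layer is easy to prototype. The length limit and the (very naive) email pattern below are placeholders for illustration; a production setup would use a dedicated tool like Presidio for PII detection:

```python
import re

# Placeholder limit; tune relative to max-model-len and your token budget
MAX_PROMPT_CHARS = 16_000
# Deliberately naive email pattern -- real PII detection needs a proper tool
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def validate_and_redact(prompt: str) -> str:
    """Reject oversized prompts and redact obvious email addresses."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds maximum length")
    return EMAIL_RE.sub("[REDACTED_EMAIL]", prompt)

print(validate_and_redact("Contact alice@example.com for access."))
# Contact [REDACTED_EMAIL] for access.
```

Running this before the request ever reaches the inference engine keeps obviously bad input from consuming GPU time.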

Cost Optimization Strategies

LLM infrastructure is expensive. Here's how to optimize:

# Cost breakdown for 70B model inference (AWS p4d.24xlarge - 8xA100)
# On-demand: $32.77/hour
# Spot: $9.83/hour (70% savings)

# At 1000 requests/hour, 500 tokens average:
# Total tokens: 500,000/hour = 12M/day
# Cost: $32.77/hour = $787/day (on-demand)
# Cost: $9.83/hour = $236/day (spot)
# Comparable OpenAI API cost: ~$180/day (gpt-4-turbo)
# Comparable Anthropic API cost: ~$270/day (claude-3-opus)

# Self-hosting becomes cheaper at:
# - ~8M+ tokens/day with spot instances
# - ~15M+ tokens/day with on-demand instances

# Additional savings from:
# - No per-request latency
# - No rate limits
# - Model customization capability

Conclusion

Self-hosted LLMs have matured from experimental projects to production infrastructure. The combination of vLLM's PagedAttention, AWQ quantization, and Kubernetes GPU scheduling makes it feasible to run 70B+ parameter models on affordable hardware.

The key patterns for 2026: Use vLLM or TGI for high-throughput serving, quantize aggressively with AWQ 4-bit for production, route intelligently between models, and implement proper observability with token-level metrics. Security through LlamaGuard and cost optimization through spot instances and prefix caching complete the picture.

Start with a single model, measure your actual token throughput, and scale based on data. LLMOps is still evolving—the best practices today will be outdated in a year, but the fundamentals of efficient inference, observability, and security will remain constant.

The democratization of AI infrastructure is here. You no longer need OpenAI's budget to run production-grade LLMs—you just need the patterns in this guide.