OpenTelemetry Complete Guide: Distributed Tracing, Metrics & Logs in 2026
OpenTelemetry has become the single standard for observability. In 2026, vendor-agnostic telemetry collection is no longer optional; it's foundational. This guide covers everything from auto-instrumentation to production collectors, correlating traces with metrics and logs, and building observable systems that actually help you debug problems.
Why OpenTelemetry Won the Observability War
Five years ago, observability was fragmented. You used Jaeger for traces, Prometheus for metrics, and ELK for logs. Each had different agents, different configuration formats, different query languages. Vendor lock-in was real: you committed to Datadog, New Relic, or Dynatrace, and switching meant re-instrumenting everything.
OpenTelemetry changed the game. Now a graduated Cloud Native Computing Foundation (CNCF) project, it provides a single, vendor-neutral standard for telemetry data. In 2026, it's the default choice for new implementations and the migration target for legacy systems.
The value proposition is simple:
- Instrument once, export anywhere: Same telemetry can go to Prometheus, Jaeger, Datadog, or any OTLP-compatible backend
- Auto-instrumentation: Zero-code telemetry for common frameworks
- Single agent: One collector instead of three agents per host
- Context propagation: Traces, metrics, and logs share the same context
- Community-driven: No vendor control, open governance
According to the 2026 CNCF Survey, 78% of Kubernetes users have adopted OpenTelemetry, up from 54% in 2024. The collector has become the second most deployed CNCF project after Kubernetes itself. Major cloud providers (AWS, GCP, Azure) now offer native OTLP endpoints.
The Three Pillars Unified: Telemetry as a Continuum
Traditional observability treated traces, metrics, and logs as separate systems. OpenTelemetry unifies them under a common data model:
| Signal Type | What It Captures | Cardinality | Use Case |
|---|---|---|---|
| Traces | Request path through services | High (unique per request) | Latency analysis, dependency mapping |
| Metrics | Aggregated measurements over time | Low (fixed dimensions) | Alerting, capacity planning |
| Logs | Discrete events with context | Medium (event-based) | Debugging, audit trails |
The key insight: these are not separate concerns; they're different projections of the same telemetry stream. A trace captures the request journey; metrics aggregate trace-derived data; logs provide detailed event context. OpenTelemetry's context propagation links them together.
The OpenTelemetry Data Model
Understanding the data model is essential for effective implementation:
- Resource: Static attributes describing the entity producing telemetry (service.name, k8s.pod.name, host.name)
- Scope: Instrumentation library information (library name, version)
- Attributes: Key-value pairs providing context (http.method, db.system, user.id)
- Events: Timestamped occurrences within a span (logs attached to traces)
- Links: Connections between spans across trace boundaries
- Status: Span success/error indication
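To make the shape of this model concrete, here is a toy sketch of the same concepts as plain Python dataclasses. This is illustrative only, not the OpenTelemetry SDK's actual API; the class and field names simply mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Resource:
    # Static attributes describing the producing entity
    attributes: dict

@dataclass
class Event:
    # A timestamped occurrence within a span (a log attached to a trace)
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)

@dataclass
class Span:
    name: str
    trace_id: str                        # 16 bytes, hex encoded
    span_id: str                         # 8 bytes, hex encoded
    parent_span_id: Optional[str] = None # links to the parent span
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "UNSET"                # OK, ERROR, or UNSET

resource = Resource({"service.name": "payment-service", "k8s.pod.name": "pay-7d9f"})
span = Span(name="process-payment",
            trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
            span_id="00f067aa0ba902b7",
            attributes={"http.method": "POST"})
span.events.append(Event("Charging customer", 1_700_000_000_000_000_000))
span.status = "OK"
```

Every signal the SDK emits carries its Resource alongside the signal-specific payload, which is what lets a backend group telemetry by service, pod, or host.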
Architecture: From Application to Backend
A production OpenTelemetry deployment typically follows this architecture:
# OpenTelemetry Architecture

APPLICATION LAYER
    Service A             Service B             Service C
    (auto-instr, OTLP)    (manual SDK, OTLP)    (auto-instr, OTLP)
         |                     |                     |
         +---------------------+---------------------+
                               |
                               v
COLLECTOR LAYER
    OpenTelemetry Collector (agent)
    Receivers  ->  Processors                      ->  Exporters
    OTLP           Batch, Memory Limiter,              Prometheus, Jaeger,
                   Resource, Attributes, Filter        OTLP
                               |
             +-----------------+-----------------+
             |                 |                 |
             v                 v                 v
BACKEND LAYER
    Prometheus (metrics)    Jaeger / Tempo (traces)    Loki (logs)
    Grafana (dashboards)    Alertmanager (alerts)
This layered approach provides flexibility: applications emit OTLP (OpenTelemetry Protocol); collectors process, filter, and route; backends store and visualize. You can swap backends without touching applications.
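The receive, process, export flow can be sketched as three composable stages. This is a toy model in Python, not the collector's real interfaces; the function names are invented for illustration.

```python
def otlp_receiver(raw_spans):
    # Receiver: normalize incoming OTLP payloads into internal records
    return [dict(span) for span in raw_spans]

def batch_processor(spans, max_batch=2):
    # Processor: group records into batches for efficient export
    return [spans[i:i + max_batch] for i in range(0, len(spans), max_batch)]

def exporter(batches, backend):
    # Exporter: fan batches out to a backend; here we just collect them
    for batch in batches:
        backend.extend(batch)

backend = []
spans = otlp_receiver([{"name": "checkout"}, {"name": "charge"}, {"name": "refund"}])
exporter(batch_processor(spans), backend)
# backend now holds all three spans, delivered in two batches
```

Because each stage only depends on the shape of the data, swapping a backend means swapping the exporter stage; the application-facing receiver never changes. That is exactly the decoupling the collector provides.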
Instrumentation Strategies: From Zero to Observable
OpenTelemetry offers multiple instrumentation approaches, from fully automatic to fully manual.
Auto-Instrumentation: Zero-Code Telemetry
For most applications, start with auto-instrumentation. It requires no code changes and captures common frameworks automatically.
Java Auto-Instrumentation
# Download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.0.0/opentelemetry-javaagent.jar
# Run with agent
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=payment-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.exporter=otlp \
  -Dotel.metrics.exporter=otlp \
  -Dotel.logs.exporter=otlp \
  -jar application.jar
Environment variables provide cleaner configuration:
# Kubernetes deployment snippet
env:
  - name: OTEL_SERVICE_NAME
    value: "payment-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production,service.version=2.1.0"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_TRACES_EXPORTER
    value: "otlp"
  - name: OTEL_METRICS_EXPORTER
    value: "otlp"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED
    value: "true"
  # Enable specific instrumentations
  - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
    value: "true"
  - name: OTEL_INSTRUMENTATION_KAFKA_ENABLED
    value: "true"
  # Sampling configuration
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"  # 10% sampling
Python Auto-Instrumentation
# Install instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-instrument your application
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --service_name order-service \
  --exporter_otlp_endpoint http://otel-collector:4318 \
  python app.py
# Or use environment variables
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
opentelemetry-instrument python app.py
Node.js Auto-Instrumentation
// Install: npm install --save @opentelemetry/auto-instrumentations-node

// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Enable/disable specific instrumentations
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-mongodb': { enabled: true },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// Then run: node -r ./tracing.js app.js
Start with auto-instrumentation, then add manual instrumentation for business-critical paths. Disable noisy instrumentations (such as filesystem operations) that create too many spans. Always test in staging: overhead is typically 3-5%, but it can spike with poorly configured exporters.
Manual SDK Instrumentation
For custom spans and business metrics, use the SDK directly:
// Java: Custom span with attributes
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public PaymentResult processPayment(PaymentRequest request) {
    // Get tracer (typically injected or held as a singleton)
    Tracer tracer = GlobalOpenTelemetry.getTracer("payment-service");

    Span span = tracer.spanBuilder("process-payment")
            .setAttribute("payment.id", request.getId())
            .setAttribute("payment.amount", request.getAmount())
            .setAttribute("payment.currency", request.getCurrency())
            .startSpan();

    try (Scope scope = span.makeCurrent()) {
        // Add events for important steps
        span.addEvent("Validating payment method");
        validatePaymentMethod(request);

        span.addEvent("Charging customer");
        ChargeResult result = paymentGateway.charge(request);

        span.setAttribute("charge.id", result.getChargeId());
        span.setStatus(StatusCode.OK);
        return PaymentResult.success(result);
    } catch (ValidationException e) {
        span.setStatus(StatusCode.ERROR, "Payment validation failed");
        span.recordException(e);
        return PaymentResult.failure(e.getMessage());
    } catch (PaymentGatewayException e) {
        span.setStatus(StatusCode.ERROR, "Gateway error");
        span.recordException(e);
        span.setAttribute("error.type", e.getErrorCode());
        return PaymentResult.failure("Gateway unavailable");
    } finally {
        span.end();
    }
}
Custom Metrics with SDK
// Java: Custom metrics
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;

public class PaymentMetrics {
    private final LongCounter paymentsProcessed;
    private final LongHistogram paymentLatency;
    private final LongCounter paymentFailures;

    public PaymentMetrics(Meter meter) {
        this.paymentsProcessed = meter.counterBuilder("payments.processed")
                .setDescription("Total payments processed")
                .setUnit("1")
                .build();
        this.paymentLatency = meter.histogramBuilder("payments.latency")
                .setDescription("Payment processing time")
                .setUnit("ms")
                .ofLongs()
                .build();
        this.paymentFailures = meter.counterBuilder("payments.failures")
                .setDescription("Failed payment attempts")
                .setUnit("1")
                .build();
    }

    public void recordPayment(String currency, double amount, long latencyMs, boolean success) {
        Attributes attrs = Attributes.of(
                AttributeKey.stringKey("currency"), currency,
                AttributeKey.booleanKey("success"), success);

        paymentsProcessed.add(1, attrs);
        paymentLatency.record(latencyMs, attrs);
        if (!success) {
            paymentFailures.add(1, attrs);
        }
    }
}
The OpenTelemetry Collector: The Swiss Army Knife
The collector is the most powerful component of OpenTelemetry. It's a vendor-agnostic proxy that receives, processes, and exports telemetry data.
Collector Modes
- Agent: Runs alongside application (DaemonSet, sidecar, or process)
- Gateway: Centralized collector cluster, handles fan-out
- Load Balancing: Stateless collectors with trace-aware load balancing
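Whichever mode you run, the batch processor's behavior is the same: a batch ships when it reaches its size limit or when a timeout elapses, whichever comes first. A toy Python sketch of that logic (illustrative only; the class name and methods are invented, not the collector's internals):

```python
class BatchProcessor:
    """Toy model of the collector's batch processor: flush on size or timeout."""

    def __init__(self, send_batch_size=1024, timeout_s=1.0):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.buffer = []
        self.flushed = []   # batches that have been "exported"

    def add(self, span, now):
        # The timeout clock starts when the first item enters an empty buffer
        if not self.buffer:
            self.deadline = now + self.timeout_s
        self.buffer.append(span)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()    # size trigger

    def tick(self, now):
        # Called periodically: flush a partial batch once the timeout passes
        if self.buffer and now >= self.deadline:
            self.flush()    # timeout trigger

    def flush(self):
        self.flushed.append(self.buffer)
        self.buffer = []

bp = BatchProcessor(send_batch_size=3, timeout_s=1.0)
bp.add("s1", now=0.0)
bp.add("s2", now=0.2)
bp.tick(now=0.5)   # timeout not yet reached: nothing flushed
bp.tick(now=1.1)   # timeout reached: the partial batch of 2 ships
```

This is why the `timeout` and `send_batch_size` settings in the configuration below trade latency against export efficiency: a small timeout ships fresher but smaller batches.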
Production Collector Configuration
# collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 64
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
          allowed_headers: ["*"]
  # Prometheus scrape endpoint
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  # Batch spans/metrics/logs for efficiency
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  # Memory protection
  memory_limiter:
    limit_mib: 1500
    spike_limit_mib: 512
    check_interval: 5s
  # Add resource attributes
  resource:
    attributes:
      - key: collector.hostname
        value: ${HOSTNAME}
        action: upsert
      - key: environment
        from_attribute: deployment.environment
        action: insert
  # Filter out health checks
  filter/spans:
    spans:
      exclude:
        match_type: strict
        services:
          - health-check-service
  # Tail-based sampling (requires trace-aware load balancing)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  # Traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        service.namespace: "service_namespace"
  # All signals to Tempo (Grafana Cloud)
  otlp/tempo:
    endpoint: tempo:4317
    headers:
      authorization: Basic ${TEMPO_AUTH_TOKEN}
  # Debug output (development only)
  debug:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/spans, batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
Kubernetes Deployment
# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      hostNetwork: true  # Required for some receivers
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          command:
            - "/otelcol-contrib"
            - "--config=/conf/collector-config.yaml"
          resources:
            limits:
              cpu: 1000m
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 512Mi
          ports:
            - containerPort: 4317  # OTLP gRPC
              hostPort: 4317
              protocol: TCP
            - containerPort: 4318  # OTLP HTTP
              hostPort: 4318
              protocol: TCP
            - containerPort: 13133  # Health check
              protocol: TCP
          volumeMounts:
            - name: collector-config
              mountPath: /conf
            - name: hostfs
              mountPath: /hostfs
              readOnly: true
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: collector-config
          configMap:
            name: otel-collector-config
        - name: hostfs
          hostPath:
            path: /
Distributed Tracing Deep Dive
Tracing captures the complete request lifecycle across services. This is where OpenTelemetry's value becomes most apparent.
Understanding Trace Structure
Trace: A distributed request journey (root span to completion)
├── Trace ID: Unique identifier (16 bytes, hex encoded)
├── Span: A unit of work with start/end time
│   ├── Span ID: 8 bytes, unique within trace
│   ├── Parent Span ID: Links to parent span
│   ├── Operation Name: Human-readable description
│   ├── Start/End Time: Nanosecond precision
│   ├── Attributes: Key-value context
│   ├── Events: Timestamped log entries
│   └── Status: Ok, Error, or Unset
└── Context Propagation: Headers carry trace context
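The IDs above are plain random bytes, and the W3C `traceparent` header is just their hex encoding joined with a version and flags field. A stdlib-only Python sketch (helper names are invented for illustration):

```python
import secrets

def new_trace_id() -> str:
    return secrets.token_hex(16)   # 16 random bytes -> 32 hex chars

def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 random bytes -> 16 hex chars

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context: version "00", trace-flags "01" means sampled
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

tp = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
# tp == "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
assert parse_traceparent(tp)["sampled"] is True
```

Every hop re-uses the trace ID, generates a fresh span ID, and forwards the header, which is all context propagation fundamentally is.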
Context Propagation
Context propagation is the magic that makes distributed tracing work. OpenTelemetry supports multiple propagation formats:
| Format | Header | When to Use |
|---|---|---|
| W3C TraceContext | traceparent, tracestate | Default, standards-compliant |
| Jaeger | uber-trace-id | Legacy Jaeger environments |
| B3 | X-B3-TraceId, X-B3-SpanId | Zipkin, older Istio |
| Baggage | baggage-key=value | User-defined context propagation |
// Propagating context via HTTP headers (Java)
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;

public class HttpClient {
    private final TextMapPropagator propagator =
            W3CTraceContextPropagator.getInstance();

    public Response callService(Request request, Context context) {
        HttpURLConnection connection = createConnection(request);
        // Inject trace context into the outgoing headers
        propagator.inject(context, connection,
                (conn, key, value) -> conn.setRequestProperty(key, value));
        return execute(connection);
    }
}

// Extracting context from an incoming request
private static final TextMapGetter<HttpServletRequest> HEADER_GETTER =
        new TextMapGetter<>() {
            @Override
            public Iterable<String> keys(HttpServletRequest carrier) {
                return Collections.list(carrier.getHeaderNames());
            }
            @Override
            public String get(HttpServletRequest carrier, String key) {
                return carrier.getHeader(key);
            }
        };

public void handleRequest(HttpServletRequest request) {
    // Extract the parent context from the incoming headers
    Context extractedContext = propagator.extract(
            Context.current(), request, HEADER_GETTER);

    // Create a span as a child of the extracted context
    Span span = tracer.spanBuilder("handle-request")
            .setParent(extractedContext)
            .startSpan();
}
Span Links vs Parent References
Not all relationships are parent-child. Links connect spans across trace boundaries:
- Parent: Standard request flow (calling service → called service)
- Link: Async relationships, batch processing, fan-out operations
// Linking a consumer span to the producer span
SpanContext producerSpanContext = extractFromMessage(message);

Span span = tracer.spanBuilder("process-message")
        .addLink(producerSpanContext)  // Not a parent, but linked
        .setAttribute("messaging.system", "kafka")
        .setAttribute("messaging.destination", "orders")
        .startSpan();

// This creates a connection without making the consumer a child
// of the producer, which is correct for async processing
Metrics: From Counters to Histograms
Metrics in OpenTelemetry follow the same OTLP protocol but have different semantics than traces.
Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Always increasing values | requests_total, errors_total |
| UpDownCounter | Values that go up and down | active_connections, queue_size |
| Histogram | Distributed values (latency, size) | request_duration_seconds |
| ObservableCounter | Asynchronous, cumulative values | bytes_transmitted |
| ObservableGauge | Asynchronous, current values | memory_usage, temperature |
Metrics SDK Example
// Java: Complete metrics setup
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;

public class ApplicationMetrics {
    private final Meter meter;
    private final LongCounter requestsCounter;
    private final DoubleHistogram requestDuration;

    public ApplicationMetrics(OpenTelemetry openTelemetry) {
        this.meter = openTelemetry.getMeter("payment-service");

        // Counter for total requests
        this.requestsCounter = meter.counterBuilder("http.server.requests")
                .setDescription("Total HTTP requests")
                .setUnit("1")
                .build();

        // Histogram for request duration
        this.requestDuration = meter.histogramBuilder("http.server.duration")
                .setDescription("HTTP request duration")
                .setUnit("ms")
                .setExplicitBucketBoundariesAdvice(
                        List.of(5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0))
                .build();

        // Observable gauge for heap memory
        meter.gaugeBuilder("jvm.memory.heap.used")
                .setDescription("JVM heap memory used")
                .setUnit("By")
                .buildWithCallback(result -> {
                    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
                    MemoryUsage heapUsage = memoryMXBean.getHeapMemoryUsage();
                    result.record(heapUsage.getUsed());
                });
    }

    public void recordRequest(String method, String path, int status, double durationMs) {
        Attributes attrs = Attributes.builder()
                .put("http.method", method)
                .put("http.route", path)
                .put("http.status_code", status)
                .build();

        requestsCounter.add(1, attrs);
        requestDuration.record(durationMs, attrs);
    }
}
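Under the hood, an explicit-bucket histogram just increments a counter for the first boundary the value does not exceed, plus a running sum and count. A stdlib Python sketch of that mechanism (illustrative, not the SDK's aggregation code), using a subset of the boundaries from the example:

```python
import bisect

class ExplicitBucketHistogram:
    """Toy histogram: count observations per explicit bucket boundary."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One count per boundary, plus an overflow bucket past the last boundary
        self.counts = [0] * (len(boundaries) + 1)
        self.sum = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left places a value equal to a boundary into that bucket
        # (upper bounds are inclusive, matching OTLP histogram semantics)
        idx = bisect.bisect_left(self.boundaries, value)
        self.counts[idx] += 1
        self.sum += value
        self.count += 1

h = ExplicitBucketHistogram([5.0, 10.0, 25.0, 50.0, 100.0])
for latency_ms in (3.2, 7.5, 42.0, 250.0):
    h.record(latency_ms)
# 3.2 -> first bucket, 7.5 -> second, 42.0 -> the <=50 bucket, 250.0 -> overflow
```

This is also why bucket boundaries matter so much for percentile accuracy: backends can only interpolate within a bucket, so boundaries should be dense around your SLO thresholds.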
Log Correlation: The Final Piece
Logs gain superpowers when correlated with traces. OpenTelemetry makes this automatic.
Structured Logging with Trace Context
// Java: Log with trace context automatically
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(PaymentRequest request) {
        Span span = Span.current();
        SpanContext ctx = span.getSpanContext();

        // Add trace context to logs
        MDC.put("trace_id", ctx.getTraceId());
        MDC.put("span_id", ctx.getSpanId());
        MDC.put("trace_flags", ctx.getTraceFlags().asHex());
        MDC.put("service.name", "payment-service");

        try {
            log.info("Processing payment: amount={}, currency={}",
                    request.getAmount(), request.getCurrency());
            // ... processing ...
            log.info("Payment processed successfully: id={}", request.getId());
        } finally {
            MDC.clear();
        }
    }
}

// Resulting log (JSON format)
{
  "timestamp": "2026-03-12T10:23:45.123Z",
  "level": "INFO",
  "message": "Payment processed successfully: id=pay_12345",
  "service.name": "payment-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": "01"
}
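The same correlation works in Python's stdlib logging with a `Filter` that stamps every record. The trace and span IDs are hard-coded here for illustration; in a real service they would come from the current span's context.

```python
import io
import json
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace context to every log record (values would come from the active span)."""

    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))

log = logging.getLogger("payment-service")
log.addHandler(handler)
log.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736",
                                 "00f067aa0ba902b7"))
log.setLevel(logging.INFO)

log.info("Payment processed")
entry = json.loads(buf.getvalue())
# entry["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"
```

A backend that indexes `trace_id` can then jump from any log line straight to the owning trace, which is the whole point of correlation.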
Log Appender Configuration
Most logging frameworks have OpenTelemetry appenders that automatically inject trace context:
For example, the OpenTelemetry Java agent can enrich Logback's MDC with trace_id and span_id automatically, and standalone OpenTelemetry appenders exist for Logback and Log4j that forward log records over OTLP alongside traces and metrics.
Sampling Strategies: Controlling Volume and Cost
Full trace capture in production is expensive. Sampling reduces volume while preserving diagnostic value.
Head-Based Sampling
Decision made at the start of the trace: simple and cheap, but it can't react to downstream errors or latency.
# Application configuration
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10% sampling
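The traceidratio sampler is deterministic: every service derives the same keep/drop decision from the trace ID itself, so a trace is never half-sampled. A stdlib sketch of one common implementation strategy (exact details vary between SDKs; the function name is invented):

```python
import random

def trace_id_ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head sampling: same trace ID -> same decision everywhere.

    One common approach compares the low 64 bits of the trace ID against
    ratio * 2**64; uniformly random IDs then pass at roughly the given ratio.
    """
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < ratio * (1 << 64)

random.seed(42)  # fixed seed so the demo is reproducible
ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(trace_id_ratio_sample(t, 0.10) for t in ids)
# kept is close to 1,000 -- about 10% of the traces
```

Because the decision is a pure function of the trace ID, a parent-based wrapper only needs to honor the parent's flag; no coordination between services is required.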
Tail-Based Sampling
Decision made after the trace completes: the sampler can keep traces based on latency, errors, or attributes. Requires collector configuration:
processors:
  tail_sampling:
    decision_wait: 10s  # Wait for the trace to complete
    num_traces: 100000
    expected_new_traces_per_sec: 100
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Sample slow requests
      - name: slow
        type: latency
        latency: {threshold_ms: 2000}
      # Sample specific operations
      - name: important_ops
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/checkout"]
          enabled_regex_matching: true
      # Probabilistic for the rest
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
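The policy evaluation above can be mimicked in a few lines: once a trace is complete, keep it if any policy matches. A Python sketch with the same thresholds as the config (the function is invented for illustration, not the collector's code):

```python
import random

def tail_sample(trace, probability=0.05, latency_threshold_ms=2000,
                important_routes=("/api/payments", "/api/checkout")):
    """Decide after the trace completes: any matching policy keeps it."""
    if trace["status"] == "ERROR":
        return True                       # always keep errors
    if trace["duration_ms"] > latency_threshold_ms:
        return True                       # keep slow requests
    if trace.get("http.route") in important_routes:
        return True                       # keep business-critical operations
    return random.random() < probability  # probabilistic for the rest

assert tail_sample({"status": "ERROR", "duration_ms": 12})
assert tail_sample({"status": "OK", "duration_ms": 3500})
assert tail_sample({"status": "OK", "duration_ms": 40, "http.route": "/api/checkout"})
```

The catch is visible in the signature: deciding after completion means buffering every in-flight trace for `decision_wait`, which is why tail sampling needs trace-aware load balancing and real memory budgets.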
Production Deployment Checklist
Instrumentation
- Auto-instrumentation enabled for frameworks
- Manual instrumentation for business operations
- Error tracking with exception recording
- Custom metrics for key business KPIs
- Log correlation configured

Collector Configuration
- Memory limiter configured
- Batch processor tuned (timeout, size)
- Resource attributes enriched
- Tail sampling for high-value traces
- Queue draining on shutdown
- Health checks exposed

Security
- TLS for OTLP endpoints
- Authentication for collectors
- Secrets in environment variables
- No sensitive data in attributes
- PII filtering configured

Performance
- Overhead measured (< 5% CPU, < 10% memory)
- Buffer sizes appropriate for load
- Retries and timeouts configured
- Circuit breakers for backends
- Backpressure handling verified

Reliability
- Collector runs as DaemonSet or Deployment
- HPA configured for gateway collectors
- Persistent queues for exporter failures
- Monitoring of the collector itself
- Alerting on dropped spans/metrics/logs

Cost Optimization
- Sampling reduces data volume
- Attributes don't explode cardinality
- Compression enabled
- Backend-specific batching configured
- Retention policies set
Backend Options Compared
| Stack | Traces | Metrics | Logs | Cost Model |
|---|---|---|---|---|
| Jaeger + Prometheus + Grafana | Jaeger | Prometheus | ELK/Loki | Self-hosted |
| Grafana Stack | Tempo | Prometheus | Loki | Self-hosted/SaaS |
| Honeycomb | ✅ | Derived | ✅ | Event-based SaaS |
| Datadog | ✅ | ✅ | ✅ | Per-host SaaS |
| New Relic | ✅ | ✅ | ✅ | Data volume |
| AWS X-Ray + CloudWatch | X-Ray | CloudWatch | CloudWatch | AWS usage |
| Google Cloud Ops | Cloud Trace | Cloud Monitoring | Cloud Logging | GCP usage |
Conclusion
OpenTelemetry has unified observability. What previously required three agents, three protocols, and three query languages now works through a single, vendor-neutral standard. The benefits are substantial: reduced vendor lock-in, simpler instrumentation, correlated telemetry, and a thriving ecosystem.
If you're starting fresh in 2026, there's no reason not to use OpenTelemetry. If you're migrating from proprietary agents, the path is well-trodden: incremental adoption is possible, and the collector's translation capabilities bridge the gap.
Start with auto-instrumentation, add custom spans for your critical paths, configure tail sampling to capture the traces that matter, and build dashboards that combine traces, metrics, and logs. The result is observability that actually helps you understand and improve your systems.
The future of observability is open, unified, and vendor-neutral. OpenTelemetry is that future.