OpenTelemetry Complete Guide: Distributed Tracing, Metrics & Logs in 2026
OpenTelemetry has become the single standard for observability. In 2026, vendor-agnostic telemetry collection is no longer optional; it's foundational. This guide covers everything from auto-instrumentation to production collectors, correlating traces with metrics and logs, and building observable systems that actually help you debug problems.
Why OpenTelemetry Won the Observability War
Five years ago, observability was fragmented. You used Jaeger for traces, Prometheus for metrics, and ELK for logs. Each had different agents, different configuration formats, different query languages. Vendor lock-in was real: you committed to Datadog, New Relic, or Dynatrace, and switching meant re-instrumenting everything.
OpenTelemetry changed the game. Now a graduated Cloud Native Computing Foundation (CNCF) project, it provides a single, vendor-neutral standard for telemetry data. In 2026, it's the default choice for new implementations and the migration target for legacy systems.
The value proposition is simple:
- Instrument once, export anywhere: Same telemetry can go to Prometheus, Jaeger, Datadog, or any OTLP-compatible backend
- Auto-instrumentation: Zero-code telemetry for common frameworks
- Single agent: One collector instead of three agents per host
- Context propagation: Traces, metrics, and logs share the same context
- Community-driven: No vendor control, open governance
According to the 2026 CNCF Survey, 78% of Kubernetes users have adopted OpenTelemetry, up from 54% in 2024. The collector has become the second most deployed CNCF project after Kubernetes itself. Major cloud providers (AWS, GCP, Azure) now offer native OTLP endpoints.
The Three Pillars Unified: Telemetry as a Continuum
Traditional observability treated traces, metrics, and logs as separate systems. OpenTelemetry unifies them under a common data model:
| Signal Type | What It Captures | Cardinality | Use Case |
|---|---|---|---|
| Traces | Request path through services | High (unique per request) | Latency analysis, dependency mapping |
| Metrics | Aggregated measurements over time | Low (fixed dimensions) | Alerting, capacity planning |
| Logs | Discrete events with context | Medium (event-based) | Debugging, audit trails |
The key insight: these are not separate concerns; they're different projections of the same telemetry stream. A trace captures the request journey; metrics aggregate trace-derived data; logs provide detailed event context. OpenTelemetry's context propagation links them together.
The OpenTelemetry Data Model
Understanding the data model is essential for effective implementation:
- Resource: Static attributes describing the entity producing telemetry (service.name, k8s.pod.name, host.name)
- Scope: Instrumentation library information (library name, version)
- Attributes: Key-value pairs providing context (http.method, db.system, user.id)
- Events: Timestamped occurrences within a span (logs attached to traces)
- Links: Connections between spans across trace boundaries
- Status: Span success/error indication
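To make the shape of this model concrete, here is a toy sketch of the same concepts as plain Python dataclasses. This is illustrative only, not the OpenTelemetry SDK's actual API; the class and field names simply mirror the list above.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Resource:
    # Static attributes describing the producing entity
    attributes: dict

@dataclass
class Event:
    # A timestamped occurrence within a span (a log attached to a trace)
    name: str
    timestamp_ns: int
    attributes: dict = field(default_factory=dict)

@dataclass
class Span:
    name: str
    trace_id: str                        # 16 bytes, hex encoded
    span_id: str                         # 8 bytes, hex encoded
    parent_span_id: Optional[str] = None # links to the parent span
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    status: str = "UNSET"                # OK, ERROR, or UNSET

resource = Resource({"service.name": "payment-service", "k8s.pod.name": "pay-7d9f"})
span = Span(name="process-payment",
            trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
            span_id="00f067aa0ba902b7",
            attributes={"http.method": "POST"})
span.events.append(Event("Charging customer", 1_700_000_000_000_000_000))
span.status = "OK"
```

Every signal the SDK emits carries its Resource alongside the signal-specific payload, which is what lets a backend group telemetry by service, pod, or host.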
Architecture: From Application to Backend
A production OpenTelemetry deployment typically follows this architecture:
# OpenTelemetry Architecture

APPLICATION LAYER
    Service A             Service B             Service C
    (auto-instr, OTLP)    (manual SDK, OTLP)    (auto-instr, OTLP)
         |                     |                     |
         +---------------------+---------------------+
                               |
                               v
COLLECTOR LAYER
    OpenTelemetry Collector (agent)
    Receivers  ->  Processors                      ->  Exporters
    OTLP           Batch, Memory Limiter,              Prometheus, Jaeger,
                   Resource, Attributes, Filter        OTLP
                               |
             +-----------------+-----------------+
             |                 |                 |
             v                 v                 v
BACKEND LAYER
    Prometheus (metrics)    Jaeger / Tempo (traces)    Loki (logs)
    Grafana (dashboards)    Alertmanager (alerts)
This layered approach provides flexibility: applications emit OTLP (OpenTelemetry Protocol); collectors process, filter, and route; backends store and visualize. You can swap backends without touching applications.
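The receive, process, export flow can be sketched as three composable stages. This is a toy model in Python, not the collector's real interfaces; the function names are invented for illustration.

```python
def otlp_receiver(raw_spans):
    # Receiver: normalize incoming OTLP payloads into internal records
    return [dict(span) for span in raw_spans]

def batch_processor(spans, max_batch=2):
    # Processor: group records into batches for efficient export
    return [spans[i:i + max_batch] for i in range(0, len(spans), max_batch)]

def exporter(batches, backend):
    # Exporter: fan batches out to a backend; here we just collect them
    for batch in batches:
        backend.extend(batch)

backend = []
spans = otlp_receiver([{"name": "checkout"}, {"name": "charge"}, {"name": "refund"}])
exporter(batch_processor(spans), backend)
# backend now holds all three spans, delivered in two batches
```

Because each stage only depends on the shape of the data, swapping a backend means swapping the exporter stage; the application-facing receiver never changes. That is exactly the decoupling the collector provides.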
Instrumentation Strategies: From Zero to Observable
OpenTelemetry offers multiple instrumentation approaches, from fully automatic to fully manual.
Auto-Instrumentation: Zero-Code Telemetry
For most applications, start with auto-instrumentation. It requires no code changes and captures common frameworks automatically.
Java Auto-Instrumentation
# Download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.0.0/opentelemetry-javaagent.jar
# Run with agent
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=payment-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.exporter=otlp \
  -Dotel.metrics.exporter=otlp \
  -Dotel.logs.exporter=otlp \
  -jar application.jar
Environment variables provide cleaner configuration:
# Kubernetes deployment snippet
env:
  - name: OTEL_SERVICE_NAME
    value: "payment-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production,service.version=2.1.0"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring:4317"
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: "grpc"
  - name: OTEL_TRACES_EXPORTER
    value: "otlp"
  - name: OTEL_METRICS_EXPORTER
    value: "otlp"
  - name: OTEL_LOGS_EXPORTER
    value: "otlp"
  - name: OTEL_INSTRUMENTATION_COMMON_DEFAULT_ENABLED
    value: "true"
  # Enable specific instrumentations
  - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
    value: "true"
  - name: OTEL_INSTRUMENTATION_KAFKA_ENABLED
    value: "true"
  # Sampling configuration
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"  # 10% sampling
Python Auto-Instrumentation
# Install instrumentation packages
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Auto-instrument your application
opentelemetry-instrument \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --logs_exporter otlp \
  --service_name order-service \
  --exporter_otlp_endpoint http://otel-collector:4318 \
  python app.py
# Or use environment variables
export OTEL_SERVICE_NAME=order-service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
export OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production
opentelemetry-instrument python app.py
Node.js Auto-Instrumentation
// Install: npm install --save @opentelemetry/auto-instrumentations-node

// tracing.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-grpc');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
    }),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Enable/disable specific instrumentations
      '@opentelemetry/instrumentation-fs': { enabled: false }, // Too noisy
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-mongodb': { enabled: true },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.log('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

// Then run: node -r ./tracing.js app.js
Start with auto-instrumentation, then add manual instrumentation for business-critical paths. Disable noisy instrumentations (such as filesystem operations) that create too many spans. Always test in staging: overhead is typically 3-5%, but it can spike with poorly configured exporters.
Manual SDK Instrumentation
For custom spans and business metrics, use the SDK directly:
// Java: Custom span with attributes
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public PaymentResult processPayment(PaymentRequest request) {
    // Get tracer (typically injected or held as a singleton)
    Tracer tracer = GlobalOpenTelemetry.getTracer("payment-service");

    Span span = tracer.spanBuilder("process-payment")
            .setAttribute("payment.id", request.getId())
            .setAttribute("payment.amount", request.getAmount())
            .setAttribute("payment.currency", request.getCurrency())
            .startSpan();

    try (Scope scope = span.makeCurrent()) {
        // Add events for important steps
        span.addEvent("Validating payment method");
        validatePaymentMethod(request);

        span.addEvent("Charging customer");
        ChargeResult result = paymentGateway.charge(request);

        span.setAttribute("charge.id", result.getChargeId());
        span.setStatus(StatusCode.OK);
        return PaymentResult.success(result);
    } catch (ValidationException e) {
        span.setStatus(StatusCode.ERROR, "Payment validation failed");
        span.recordException(e);
        return PaymentResult.failure(e.getMessage());
    } catch (PaymentGatewayException e) {
        span.setStatus(StatusCode.ERROR, "Gateway error");
        span.recordException(e);
        span.setAttribute("error.type", e.getErrorCode());
        return PaymentResult.failure("Gateway unavailable");
    } finally {
        span.end();
    }
}
Custom Metrics with SDK
// Java: Custom metrics
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;

public class PaymentMetrics {
    private final LongCounter paymentsProcessed;
    private final LongHistogram paymentLatency;
    private final LongCounter paymentFailures;

    public PaymentMetrics(Meter meter) {
        this.paymentsProcessed = meter.counterBuilder("payments.processed")
                .setDescription("Total payments processed")
                .setUnit("1")
                .build();
        this.paymentLatency = meter.histogramBuilder("payments.latency")
                .setDescription("Payment processing time")
                .setUnit("ms")
                .ofLongs()
                .build();
        this.paymentFailures = meter.counterBuilder("payments.failures")
                .setDescription("Failed payment attempts")
                .setUnit("1")
                .build();
    }

    public void recordPayment(String currency, double amount, long latencyMs, boolean success) {
        Attributes attrs = Attributes.of(
                AttributeKey.stringKey("currency"), currency,
                AttributeKey.booleanKey("success"), success);

        paymentsProcessed.add(1, attrs);
        paymentLatency.record(latencyMs, attrs);
        if (!success) {
            paymentFailures.add(1, attrs);
        }
    }
}
The OpenTelemetry Collector: The Swiss Army Knife
The collector is the most powerful component of OpenTelemetry. It's a vendor-agnostic proxy that receives, processes, and exports telemetry data.
Collector Modes
- Agent: Runs alongside application (DaemonSet, sidecar, or process)
- Gateway: Centralized collector cluster, handles fan-out
- Load Balancing: Stateless collectors with trace-aware load balancing
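Whichever mode you run, the batch processor's behavior is the same: a batch ships when it reaches its size limit or when a timeout elapses, whichever comes first. A toy Python sketch of that logic (illustrative only; the class name and methods are invented, not the collector's internals):

```python
class BatchProcessor:
    """Toy model of the collector's batch processor: flush on size or timeout."""

    def __init__(self, send_batch_size=1024, timeout_s=1.0):
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.buffer = []
        self.flushed = []   # batches that have been "exported"

    def add(self, span, now):
        # The timeout clock starts when the first item enters an empty buffer
        if not self.buffer:
            self.deadline = now + self.timeout_s
        self.buffer.append(span)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()    # size trigger

    def tick(self, now):
        # Called periodically: flush a partial batch once the timeout passes
        if self.buffer and now >= self.deadline:
            self.flush()    # timeout trigger

    def flush(self):
        self.flushed.append(self.buffer)
        self.buffer = []

bp = BatchProcessor(send_batch_size=3, timeout_s=1.0)
bp.add("s1", now=0.0)
bp.add("s2", now=0.2)
bp.tick(now=0.5)   # timeout not yet reached: nothing flushed
bp.tick(now=1.1)   # timeout reached: the partial batch of 2 ships
```

This is why the `timeout` and `send_batch_size` settings in the configuration below trade latency against export efficiency: a small timeout ships fresher but smaller batches.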
Production Collector Configuration
# collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 64
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
          allowed_headers: ["*"]
  # Prometheus scrape endpoint
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['0.0.0.0:8888']

processors:
  # Batch spans/metrics/logs for efficiency
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  # Memory protection
  memory_limiter:
    limit_mib: 1500
    spike_limit_mib: 512
    check_interval: 5s
  # Add resource attributes
  resource:
    attributes:
      - key: collector.hostname
        value: ${HOSTNAME}
        action: upsert
      - key: environment
        from_attribute: deployment.environment
        action: insert
  # Filter out health checks
  filter/spans:
    spans:
      exclude:
        match_type: strict
        services:
          - health-check-service
  # Tail-based sampling (requires trace-aware load balancing)
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  # Traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  # Metrics to Prometheus
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  # Logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      resource:
        service.name: "service_name"
        service.namespace: "service_namespace"
  # All signals to Tempo (Grafana Cloud)
  otlp/tempo:
    endpoint: tempo:4317
    headers:
      authorization: Basic ${TEMPO_AUTH_TOKEN}
  # Debug output (development only)
  debug:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, pprof, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter/spans, batch]
      exporters: [otlp/jaeger, otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [loki]
Kubernetes Deployment
# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: otel-collector
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      hostNetwork: true  # Required for some receivers
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          command:
            - "/otelcol-contrib"
            - "--config=/conf/collector-config.yaml"
          resources:
            limits:
              cpu: 1000m
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 512Mi
          ports:
            - containerPort: 4317  # OTLP gRPC
              hostPort: 4317
              protocol: TCP
            - containerPort: 4318  # OTLP HTTP
              hostPort: 4318
              protocol: TCP
            - containerPort: 13133  # Health check
              protocol: TCP
          volumeMounts:
            - name: collector-config
              mountPath: /conf
            - name: hostfs
              mountPath: /hostfs
              readOnly: true
          env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
      volumes:
        - name: collector-config
          configMap:
            name: otel-collector-config
        - name: hostfs
          hostPath:
            path: /
Distributed Tracing Deep Dive
Tracing captures the complete request lifecycle across services. This is where OpenTelemetry's value becomes most apparent.
Understanding Trace Structure
Trace: A distributed request journey (root span to completion)
├── Trace ID: Unique identifier (16 bytes, hex encoded)
├── Span: A unit of work with start/end time
│   ├── Span ID: 8 bytes, unique within trace
│   ├── Parent Span ID: Links to parent span
│   ├── Operation Name: Human-readable description
│   ├── Start/End Time: Nanosecond precision
│   ├── Attributes: Key-value context
│   ├── Events: Timestamped log entries
│   └── Status: Ok, Error, or Unset
└── Context Propagation: Headers carry trace context
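The IDs above are plain random bytes, and the W3C `traceparent` header is just their hex encoding joined with a version and flags field. A stdlib-only Python sketch (helper names are invented for illustration):

```python
import secrets

def new_trace_id() -> str:
    return secrets.token_hex(16)   # 16 random bytes -> 32 hex chars

def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 random bytes -> 16 hex chars

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context: version "00", trace-flags "01" means sampled
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

tp = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
# tp == "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
assert parse_traceparent(tp)["sampled"] is True
```

Every hop re-uses the trace ID, generates a fresh span ID, and forwards the header, which is all context propagation fundamentally is.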
Context Propagation
Context propagation is the magic that makes distributed tracing work. OpenTelemetry supports multiple propagation formats:
| Format | Header | When to Use |
|---|---|---|
| W3C TraceContext | traceparent, tracestate | Default, standards-compliant |
| Jaeger | uber-trace-id | Legacy Jaeger environments |
| B3 | X-B3-TraceId, X-B3-SpanId | Zipkin, older Istio |
| Baggage | baggage-key=value | User-defined context propagation |
// Propagating context via HTTP headers (Java)
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapPropagator;

public class HttpClient {
    private final TextMapPropagator propagator =
            W3CTraceContextPropagator.getInstance();

    public Response callService(Request request, Context context) {
        HttpURLConnection connection = createConnection(request);
        // Inject trace context into the outgoing headers
        propagator.inject(context, connection,
                (conn, key, value) -> conn.setRequestProperty(key, value));
        return execute(connection);
    }
}

// Extracting context from an incoming request
private static final TextMapGetter<HttpServletRequest> HEADER_GETTER =
        new TextMapGetter<>() {
            @Override
            public Iterable<String> keys(HttpServletRequest carrier) {
                return Collections.list(carrier.getHeaderNames());
            }
            @Override
            public String get(HttpServletRequest carrier, String key) {
                return carrier.getHeader(key);
            }
        };

public void handleRequest(HttpServletRequest request) {
    // Extract the parent context from the incoming headers
    Context extractedContext = propagator.extract(
            Context.current(), request, HEADER_GETTER);

    // Create a span as a child of the extracted context
    Span span = tracer.spanBuilder("handle-request")
            .setParent(extractedContext)
            .startSpan();
}
Span Links vs Parent References
Not all relationships are parent-child. Links connect spans across trace boundaries:
- Parent: Standard request flow (calling service → called service)
- Link: Async relationships, batch processing, fan-out operations
// Linking a consumer span to the producer span
SpanContext producerSpanContext = extractFromMessage(message);

Span span = tracer.spanBuilder("process-message")
        .addLink(producerSpanContext)  // Not a parent, but linked
        .setAttribute("messaging.system", "kafka")
        .setAttribute("messaging.destination", "orders")
        .startSpan();

// This creates a connection without making the consumer a child
// of the producer, which is correct for async processing
Metrics: From Counters to Histograms
Metrics in OpenTelemetry follow the same OTLP protocol but have different semantics than traces.
Metric Types
| Type | Use Case | Example |
|---|---|---|
| Counter | Always increasing values | requests_total, errors_total |
| UpDownCounter | Values that go up and down | active_connections, queue_size |
| Histogram | Distributed values (latency, size) | request_duration_seconds |
| ObservableCounter | Asynchronous, cumulative values | bytes_transmitted |
| ObservableGauge | Asynchronous, current values | memory_usage, temperature |
Metrics SDK Example
// Java: Complete metrics setup
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.List;

public class ApplicationMetrics {
    private final Meter meter;
    private final LongCounter requestsCounter;
    private final DoubleHistogram requestDuration;

    public ApplicationMetrics(OpenTelemetry openTelemetry) {
        this.meter = openTelemetry.getMeter("payment-service");

        // Counter for total requests
        this.requestsCounter = meter.counterBuilder("http.server.requests")
                .setDescription("Total HTTP requests")
                .setUnit("1")
                .build();

        // Histogram for request duration
        this.requestDuration = meter.histogramBuilder("http.server.duration")
                .setDescription("HTTP request duration")
                .setUnit("ms")
                .setExplicitBucketBoundariesAdvice(
                        List.of(5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0))
                .build();

        // Observable gauge for heap memory
        meter.gaugeBuilder("jvm.memory.heap.used")
                .setDescription("JVM heap memory used")
                .setUnit("By")
                .buildWithCallback(result -> {
                    MemoryMXBean memoryMXBean = ManagementFactory.getMemoryMXBean();
                    MemoryUsage heapUsage = memoryMXBean.getHeapMemoryUsage();
                    result.record(heapUsage.getUsed());
                });
    }

    public void recordRequest(String method, String path, int status, double durationMs) {
        Attributes attrs = Attributes.builder()
                .put("http.method", method)
                .put("http.route", path)
                .put("http.status_code", status)
                .build();

        requestsCounter.add(1, attrs);
        requestDuration.record(durationMs, attrs);
    }
}
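Under the hood, an explicit-bucket histogram just increments a counter for the first boundary the value does not exceed, plus a running sum and count. A stdlib Python sketch of that mechanism (illustrative, not the SDK's aggregation code), using a subset of the boundaries from the example:

```python
import bisect

class ExplicitBucketHistogram:
    """Toy histogram: count observations per explicit bucket boundary."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)
        # One count per boundary, plus an overflow bucket past the last boundary
        self.counts = [0] * (len(boundaries) + 1)
        self.sum = 0.0
        self.count = 0

    def record(self, value):
        # bisect_left places a value equal to a boundary into that bucket
        # (upper bounds are inclusive, matching OTLP histogram semantics)
        idx = bisect.bisect_left(self.boundaries, value)
        self.counts[idx] += 1
        self.sum += value
        self.count += 1

h = ExplicitBucketHistogram([5.0, 10.0, 25.0, 50.0, 100.0])
for latency_ms in (3.2, 7.5, 42.0, 250.0):
    h.record(latency_ms)
# 3.2 -> first bucket, 7.5 -> second, 42.0 -> the <=50 bucket, 250.0 -> overflow
```

This is also why bucket boundaries matter so much for percentile accuracy: backends can only interpolate within a bucket, so boundaries should be dense around your SLO thresholds.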
Log Correlation: The Final Piece
Logs gain superpowers when correlated with traces. OpenTelemetry makes this automatic.
Structured Logging with Trace Context
// Java: Log with trace context automatically
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class PaymentService {
    private static final Logger log = LoggerFactory.getLogger(PaymentService.class);

    public void processPayment(PaymentRequest request) {
        Span span = Span.current();
        SpanContext ctx = span.getSpanContext();

        // Add trace context to logs
        MDC.put("trace_id", ctx.getTraceId());
        MDC.put("span_id", ctx.getSpanId());
        MDC.put("trace_flags", ctx.getTraceFlags().asHex());
        MDC.put("service.name", "payment-service");

        try {
            log.info("Processing payment: amount={}, currency={}",
                    request.getAmount(), request.getCurrency());
            // ... processing ...
            log.info("Payment processed successfully: id={}", request.getId());
        } finally {
            MDC.clear();
        }
    }
}

// Resulting log (JSON format)
{
  "timestamp": "2026-03-12T10:23:45.123Z",
  "level": "INFO",
  "message": "Payment processed successfully: id=pay_12345",
  "service.name": "payment-service",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "trace_flags": "01"
}
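The same correlation works in Python's stdlib logging with a `Filter` that stamps every record. The trace and span IDs are hard-coded here for illustration; in a real service they would come from the current span's context.

```python
import io
import json
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace context to every log record (values would come from the active span)."""

    def __init__(self, trace_id, span_id):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record):
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))

log = logging.getLogger("payment-service")
log.addHandler(handler)
log.addFilter(TraceContextFilter("4bf92f3577b34da6a3ce929d0e0e4736",
                                 "00f067aa0ba902b7"))
log.setLevel(logging.INFO)

log.info("Payment processed")
entry = json.loads(buf.getvalue())
# entry["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"
```

A backend that indexes `trace_id` can then jump from any log line straight to the owning trace, which is the whole point of correlation.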
Log Appender Configuration
Most logging frameworks have OpenTelemetry appenders that automatically inject trace context:
For example, the OpenTelemetry Java agent can enrich Logback's MDC with trace_id and span_id automatically, and standalone OpenTelemetry appenders exist for Logback and Log4j that forward log records over OTLP alongside traces and metrics.
Sampling Strategies: Controlling Volume and Cost
Full trace capture in production is expensive. Sampling reduces volume while preserving diagnostic value.
Head-Based Sampling
Decision made at the start of the trace: simple and cheap, but it can't react to downstream errors or latency.
# Application configuration
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10% sampling
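The traceidratio sampler is deterministic: every service derives the same keep/drop decision from the trace ID itself, so a trace is never half-sampled. A stdlib sketch of one common implementation strategy (exact details vary between SDKs; the function name is invented):

```python
import random

def trace_id_ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head sampling: same trace ID -> same decision everywhere.

    One common approach compares the low 64 bits of the trace ID against
    ratio * 2**64; uniformly random IDs then pass at roughly the given ratio.
    """
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < ratio * (1 << 64)

random.seed(42)  # fixed seed so the demo is reproducible
ids = [f"{random.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(trace_id_ratio_sample(t, 0.10) for t in ids)
# kept is close to 1,000 -- about 10% of the traces
```

Because the decision is a pure function of the trace ID, a parent-based wrapper only needs to honor the parent's flag; no coordination between services is required.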
Tail-Based Sampling
Decision made after the trace completes: the sampler can keep traces based on latency, errors, or attributes. Requires collector configuration:
processors:
  tail_sampling:
    decision_wait: 10s  # Wait for the trace to complete
    num_traces: 100000
    expected_new_traces_per_sec: 100
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Sample slow requests
      - name: slow
        type: latency
        latency: {threshold_ms: 2000}
      # Sample specific operations
      - name: important_ops
        type: string_attribute
        string_attribute:
          key: http.route
          values: ["/api/payments", "/api/checkout"]
          enabled_regex_matching: true
      # Probabilistic for the rest
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
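The policy evaluation above can be mimicked in a few lines: once a trace is complete, keep it if any policy matches. A Python sketch with the same thresholds as the config (the function is invented for illustration, not the collector's code):

```python
import random

def tail_sample(trace, probability=0.05, latency_threshold_ms=2000,
                important_routes=("/api/payments", "/api/checkout")):
    """Decide after the trace completes: any matching policy keeps it."""
    if trace["status"] == "ERROR":
        return True                       # always keep errors
    if trace["duration_ms"] > latency_threshold_ms:
        return True                       # keep slow requests
    if trace.get("http.route") in important_routes:
        return True                       # keep business-critical operations
    return random.random() < probability  # probabilistic for the rest

assert tail_sample({"status": "ERROR", "duration_ms": 12})
assert tail_sample({"status": "OK", "duration_ms": 3500})
assert tail_sample({"status": "OK", "duration_ms": 40, "http.route": "/api/checkout"})
```

The catch is visible in the signature: deciding after completion means buffering every in-flight trace for `decision_wait`, which is why tail sampling needs trace-aware load balancing and real memory budgets.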
Production Deployment Checklist
Instrumentation
- Auto-instrumentation enabled for frameworks
- Manual instrumentation for business operations
- Error tracking with exception recording
- Custom metrics for key business KPIs
- Log correlation configured

Collector Configuration
- Memory limiter configured
- Batch processor tuned (timeout, size)
- Resource attributes enriched
- Tail sampling for high-value traces
- Queue draining on shutdown
- Health checks exposed

Security
- TLS for OTLP endpoints
- Authentication for collectors
- Secrets in environment variables
- No sensitive data in attributes
- PII filtering configured

Performance
- Overhead measured (< 5% CPU, < 10% memory)
- Buffer sizes appropriate for load
- Retries and timeouts configured
- Circuit breakers for backends
- Backpressure handling verified

Reliability
- Collector runs as DaemonSet or Deployment
- HPA configured for gateway collectors
- Persistent queues for exporter failures
- Monitoring of the collector itself
- Alerting on dropped spans/metrics/logs

Cost Optimization
- Sampling reduces data volume
- Attributes don't explode cardinality
- Compression enabled
- Backend-specific batching configured
- Retention policies set
Backend Options Compared
| Stack | Traces | Metrics | Logs | Cost Model |
|---|---|---|---|---|
| Jaeger + Prometheus + Grafana | Jaeger | Prometheus | ELK/Loki | Self-hosted |
| Grafana Stack | Tempo | Prometheus | Loki | Self-hosted/SaaS |
| Honeycomb | ✅ | Derived | ✅ | Event-based SaaS |
| Datadog | ✅ | ✅ | ✅ | Per-host SaaS |
| New Relic | ✅ | ✅ | ✅ | Data volume |
| AWS X-Ray + CloudWatch | X-Ray | CloudWatch | CloudWatch | AWS usage |
| Google Cloud Ops | Cloud Trace | Cloud Monitoring | Cloud Logging | GCP usage |
Conclusion
OpenTelemetry has unified observability. What previously required three agents, three protocols, and three query languages now works through a single, vendor-neutral standard. The benefits are substantial: reduced vendor lock-in, simpler instrumentation, correlated telemetry, and a thriving ecosystem.
If you're starting fresh in 2026, there's no reason not to use OpenTelemetry. If you're migrating from proprietary agents, the path is well-trodden: incremental adoption is possible, and the collector's translation capabilities bridge the gap.
Start with auto-instrumentation, add custom spans for your critical paths, configure tail sampling to capture the traces that matter, and build dashboards that combine traces, metrics, and logs. The result is observability that actually helps you understand and improve your systems.
The future of observability is open, unified, and vendor-neutral. OpenTelemetry is that future.