The Problem: Building Reliable Distributed Systems
Imagine you're building an e-commerce platform. When a customer places an order, you need to:
- Validate the order and check inventory
- Reserve payment from the customer's credit card
- Create a shipment with the logistics provider
- Send confirmation emails
- Update analytics and inventory systems
This sounds straightforward. But what happens when:
- The payment service is temporarily unavailable? (Retry? For how long?)
- The shipment is created but the email server is down? (Partial success?)
- Your service crashes mid-transaction? (Resume from where?)
- The customer cancels while operations are in flight? (Compensate?)
These assumptions are the classic eight fallacies of distributed computing: the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; the network is homogeneous. Every one of them is false.
Traditional approaches—saga orchestration with message queues, explicit state machines, or hand-rolled retry logic—quickly become unmaintainable. You write more code handling failure than business logic.
What is Temporal?
Temporal is an open-source, stateful orchestration platform for building reliable distributed systems. It began in 2019 as a fork of Uber's Cadence, created by the same team (who had earlier built AWS Simple Workflow Service). Companies like Stripe, Netflix, Snap, Datadog, and Coinbase use Temporal in production.
The Core Promise: Durable Execution
Temporal's key innovation is durable execution. Your workflow code runs in a way that automatically survives:
- Process crashes
- Network failures
- Data center outages
- Even Temporal server restarts
Temporal automatically records every step of your workflow in an event log. If your worker crashes after completing step 3 of a 10-step workflow, Temporal resumes execution from step 4 when a new worker comes online. From your code's perspective, it's as if nothing happened.
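The replay mechanics can be sketched in plain Go (illustrative only, nothing like the real SDK internals): each step's result is recorded in a history; on re-execution, recorded steps are skipped and their results replayed instead of redone.

```go
package main

import "fmt"

// history maps a step name to its recorded result - a toy stand-in for
// Temporal's event log (the real log is far richer).
type history map[string]string

// runStep replays the recorded result if present; otherwise it executes
// the step and records the result, so a restarted run never redoes work.
func runStep(h history, name string, execute func() string) string {
	if result, ok := h[name]; ok {
		fmt.Println("replayed:", name)
		return result
	}
	result := execute()
	h[name] = result
	fmt.Println("executed:", name)
	return result
}

func main() {
	h := history{}
	// First run: executes two steps, then the process "crashes"
	runStep(h, "validate", func() string { return "ok" })
	runStep(h, "charge", func() string { return "txn-1" })

	// Second run after the crash: the first two steps are replayed from
	// history, and execution resumes at the third step
	runStep(h, "validate", func() string { return "ok" })
	txn := runStep(h, "charge", func() string { return "txn-1" })
	runStep(h, "ship", func() string { return "ship-" + txn })
}
```

The key property is that the step functions never see the crash: they receive the same inputs and results on replay as they did the first time.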
Architecture Overview
- Temporal Server: The brains. Maintains workflow state, schedules tasks, manages durability
- Workers: Your code. Poll for tasks, execute workflows and activities
- Persistence: PostgreSQL, MySQL, or Cassandra store the event history
Core Concepts
Workflows
Workflows are the orchestration logic. They define what needs to happen, not how. Workflows are:
- Deterministic: Same input always produces same sequence of operations
- Durable: State is automatically persisted and recovered
- Long-running: Can run for days, months, or even years
Activities
Activities are the operations workflows coordinate. They're where the actual work happens:
- Call external APIs
- Query databases
- Publish messages
- Any operation with side effects
Workflows must be deterministic, so they never call external services, use randomness, or read the clock directly. Activities can do all of that (call APIs, use randomness, get timestamps), which is why workflows delegate every side effect to them.
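The determinism requirement is why Temporal wraps time and randomness: a value is drawn once, recorded in the history, and replayed on every subsequent execution. A plain-Go sketch of the idea (the SDK's `workflow.Now` and `workflow.SideEffect` do this for real):

```go
package main

import (
	"fmt"
	"time"
)

// recorder replays previously recorded values in order; new values are
// recorded on first execution - a toy version of workflow.SideEffect.
type recorder struct {
	log []string
	pos int
}

func (r *recorder) sideEffect(produce func() string) string {
	if r.pos < len(r.log) {
		v := r.log[r.pos] // replay: ignore produce, return the recorded value
		r.pos++
		return v
	}
	v := produce() // first execution: record the fresh value
	r.log = append(r.log, v)
	r.pos++
	return v
}

func main() {
	r := &recorder{}
	first := r.sideEffect(func() string { return time.Now().Format(time.RFC3339) })

	// Replay (e.g. after a worker crash): the same value comes back,
	// even though the wall clock has moved on
	r.pos = 0
	replayed := r.sideEffect(func() string { return time.Now().Format(time.RFC3339) })
	fmt.Println(first == replayed) // prints true: the recorded timestamp is stable
}
```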
Your First Workflow
Setup
# macOS/Linux
curl -sSf https://temporal.download/cli.sh | sh
# Verify
temporal --version
# temporal version 1.2.0 (your output may differ)
# Starts Temporal Server and Web UI, backed by an in-memory SQLite store
temporal server start-dev --ui-port 8080
# Access:
# - Temporal Server (gRPC): localhost:7233
# - Web UI: http://localhost:8080
Go Example: Order Processing Workflow
package workflows
import (
"time"
"go.temporal.io/sdk/temporal"
"go.temporal.io/sdk/workflow"
)
// OrderWorkflowInput defines the workflow input
type OrderWorkflowInput struct {
OrderID string
CustomerID string
Items []OrderItem
}
type OrderItem struct {
SKU string
Quantity int
Price float64
}
// OrderWorkflowResult contains the workflow output
type OrderWorkflowResult struct {
OrderID string
Status string
Total float64
ShipmentID string
CompletedAt time.Time
}
// OrderWorkflow is the main workflow definition
func OrderWorkflow(ctx workflow.Context, input OrderWorkflowInput) (*OrderWorkflowResult, error) {
// Set activity options - timeout and retry policy
ao := workflow.ActivityOptions{
StartToCloseTimeout: time.Minute,
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: time.Second,
BackoffCoefficient: 2.0,
MaximumInterval: time.Minute,
MaximumAttempts: 5,
},
}
ctx = workflow.WithActivityOptions(ctx, ao)
// Step 1: Validate order and check inventory
var validationResult ValidationResult
err := workflow.ExecuteActivity(ctx, ValidateOrder, input).Get(ctx, &validationResult)
if err != nil {
return nil, err
}
if !validationResult.Valid {
return &OrderWorkflowResult{
OrderID: input.OrderID,
Status: "REJECTED",
}, nil
}
// Step 2: Process payment
paymentInput := PaymentInput{
CustomerID: input.CustomerID,
Amount: validationResult.Total,
}
var paymentResult PaymentResult
err = workflow.ExecuteActivity(ctx, ProcessPayment, paymentInput).Get(ctx, &paymentResult)
if err != nil {
// Payment failed - no need to compensate yet, nothing was charged
return &OrderWorkflowResult{
OrderID: input.OrderID,
Status: "PAYMENT_FAILED",
}, nil
}
// Step 3: Create shipment
shipmentInput := ShipmentInput{
OrderID: input.OrderID,
Items: input.Items,
}
var shipmentResult ShipmentResult
err = workflow.ExecuteActivity(ctx, CreateShipment, shipmentInput).Get(ctx, &shipmentResult)
if err != nil {
// Shipment failed - need to refund payment (compensation)
_ = workflow.ExecuteActivity(ctx, RefundPayment, paymentResult.TransactionID).Get(ctx, nil)
return &OrderWorkflowResult{
OrderID: input.OrderID,
Status: "SHIPMENT_FAILED",
}, nil
}
// Step 4: Send confirmation. Not awaited, so a failure here doesn't affect
// the order - but the activity may be abandoned if the workflow returns first
workflow.ExecuteActivity(ctx, SendConfirmation, input.OrderID)
// Return success
return &OrderWorkflowResult{
OrderID: input.OrderID,
Status: "COMPLETED",
Total: validationResult.Total,
ShipmentID: shipmentResult.ShipmentID,
CompletedAt: workflow.Now(ctx),
}, nil
}
package activities
import (
"context"
"fmt"
"time"
)
// ValidateOrder checks inventory and calculates total
func ValidateOrder(ctx context.Context, input OrderWorkflowInput) (*ValidationResult, error) {
// This is where you'd call your inventory service
// Can fail, will be retried according to retry policy
total := 0.0
for _, item := range input.Items {
// Check inventory via API
available, err := inventoryClient.CheckStock(item.SKU)
if err != nil {
return nil, fmt.Errorf("inventory check failed: %w", err)
}
if available < item.Quantity {
return &ValidationResult{Valid: false}, nil
}
total += float64(item.Quantity) * item.Price
}
return &ValidationResult{
Valid: true,
Total: total,
}, nil
}
// ProcessPayment charges the customer's payment method
func ProcessPayment(ctx context.Context, input PaymentInput) (*PaymentResult, error) {
// Call payment processor (Stripe, etc.)
// Activities may execute more than once (retries, worker failures), so
// pass an idempotency key to the payment provider to avoid double charges
result, err := paymentClient.Charge(input.CustomerID, input.Amount)
if err != nil {
return nil, fmt.Errorf("payment failed: %w", err)
}
return &PaymentResult{
TransactionID: result.ID,
Status: result.Status,
}, nil
}
// CreateShipment reserves shipping with logistics provider
func CreateShipment(ctx context.Context, input ShipmentInput) (*ShipmentResult, error) {
// Call shipping API (FedEx, UPS, etc.)
shipment, err := shippingClient.CreateShipment(input.Items)
if err != nil {
return nil, fmt.Errorf("shipment creation failed: %w", err)
}
return &ShipmentResult{
ShipmentID: shipment.ID,
LabelURL: shipment.LabelURL,
}, nil
}
// RefundPayment compensates failed transactions
func RefundPayment(ctx context.Context, transactionID string) error {
return paymentClient.Refund(transactionID)
}
// SendConfirmation sends email/SMS confirmation
func SendConfirmation(ctx context.Context, orderID string) error {
// Fire-and-forget activity - failures don't block workflow
return emailClient.SendOrderConfirmation(orderID)
}
package main
import (
"context"
"log"
"go.temporal.io/sdk/client"
"go.temporal.io/sdk/worker"
"myapp/activities"
"myapp/workflows"
)
func main() {
// Create Temporal client
c, err := client.Dial(client.Options{
HostPort: "localhost:7233",
})
if err != nil {
log.Fatalln("Unable to create Temporal client:", err)
}
defer c.Close()
// Start workflow
workflowOptions := client.StartWorkflowOptions{
ID: "order-12345",
TaskQueue: "order-queue",
}
input := workflows.OrderWorkflowInput{
OrderID: "order-12345",
CustomerID: "cust-98765",
Items: []workflows.OrderItem{
{SKU: "WIDGET-001", Quantity: 2, Price: 29.99},
},
}
we, err := c.ExecuteWorkflow(context.Background(), workflowOptions, workflows.OrderWorkflow, input)
if err != nil {
log.Fatalln("Unable to execute workflow:", err)
}
log.Println("Started workflow:", we.GetID(), "RunID:", we.GetRunID())
// Block until complete (in production, you'd get result asynchronously)
var result workflows.OrderWorkflowResult
if err := we.Get(context.Background(), &result); err != nil {
log.Fatalln("Workflow failed:", err)
}
log.Printf("Workflow completed: %+v\n", result)
}
// Worker setup (typically runs in separate process)
func startWorker() {
c, err := client.Dial(client.Options{
HostPort: "localhost:7233",
})
if err != nil {
log.Fatalln("Unable to create Temporal client:", err)
}
defer c.Close()
// Create worker
w := worker.New(c, "order-queue", worker.Options{})
// Register workflow and activities
w.RegisterWorkflow(workflows.OrderWorkflow)
w.RegisterActivity(activities.ValidateOrder)
w.RegisterActivity(activities.ProcessPayment)
w.RegisterActivity(activities.CreateShipment)
w.RegisterActivity(activities.RefundPayment)
w.RegisterActivity(activities.SendConfirmation)
// Start worker
if err := w.Run(worker.InterruptCh()); err != nil {
log.Fatalln("Unable to start worker:", err)
}
}
Workflow Patterns
Pattern 1: Fan-Out/Fan-In (Parallel Execution)
func ProcessBatchWorkflow(ctx workflow.Context, orders []Order) error {
ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: time.Minute,
})
logger := workflow.GetLogger(ctx)
// Start all activities without waiting on them - they run in parallel
futures := make([]workflow.Future, len(orders))
for i, order := range orders {
futures[i] = workflow.ExecuteActivity(ctx, ProcessOrder, order)
}
// Wait for all to complete
for i, future := range futures {
var result OrderResult
if err := future.Get(ctx, &result); err != nil {
logger.Error("Order failed", "index", i, "error", err)
}
}
return nil
}
Pattern 2: Child Workflows
For complex processes, decompose into child workflows:
func OrderFulfillmentWorkflow(ctx workflow.Context, order Order) error {
// Payment workflow (child) - options are attached via the context
paymentCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
WorkflowID: "payment-" + order.ID,
})
paymentFuture := workflow.ExecuteChildWorkflow(paymentCtx, PaymentWorkflow, order.Payment)
// Inventory workflow (child, runs in parallel)
inventoryCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
WorkflowID: "inventory-" + order.ID,
})
inventoryFuture := workflow.ExecuteChildWorkflow(inventoryCtx, InventoryWorkflow, order.Items)
// Wait for both
var paymentResult PaymentResult
var inventoryResult InventoryResult
if err := paymentFuture.Get(ctx, &paymentResult); err != nil {
return err
}
if err := inventoryFuture.Get(ctx, &inventoryResult); err != nil {
// Compensate payment
_ = workflow.ExecuteActivity(ctx, RefundPayment, paymentResult.ID).Get(ctx, nil)
return err
}
// Continue with fulfillment...
return nil
}
Pattern 3: Human-in-the-Loop
func ExpenseApprovalWorkflow(ctx workflow.Context, expense Expense) error {
// Send notification to approver
_ = workflow.ExecuteActivity(ctx, NotifyApprover, expense).Get(ctx, nil)
// Set up signal handler for approval
selector := workflow.NewSelector(ctx)
var approval ApprovalDecision
// Signal channel
approvalCh := workflow.GetSignalChannel(ctx, "approval")
selector.AddReceive(approvalCh, func(ch workflow.ReceiveChannel, more bool) {
ch.Receive(ctx, &approval)
})
// Timeout after 7 days
timer := workflow.NewTimer(ctx, time.Hour*24*7)
selector.AddFuture(timer, func(f workflow.Future) {
approval = ApprovalDecision{Approved: false, Reason: "timeout"}
})
// Wait for either signal or timeout
selector.Select(ctx)
if approval.Approved {
return workflow.ExecuteActivity(ctx, ProcessExpense, expense).Get(ctx, nil)
}
return workflow.ExecuteActivity(ctx, RejectExpense, expense).Get(ctx, nil)
}
// External system sends signal:
// temporalClient.SignalWorkflow(ctx, "expense-workflow-123", "", "approval", ApprovalDecision{Approved: true})
Error Handling & Retries
Temporal provides sophisticated retry and error handling mechanisms:
ao := workflow.ActivityOptions{
StartToCloseTimeout: 5 * time.Minute,
// Retry policy determines behavior on activity failure
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: time.Second, // First retry after 1s
BackoffCoefficient: 2.0, // Double each time
MaximumInterval: time.Minute, // Cap at 1 minute
MaximumAttempts: 10, // Try 10 times total
NonRetryableErrorTypes: []string{"InvalidCreditCard", "InsufficientFunds"},
},
}
ctx = workflow.WithActivityOptions(ctx, ao)
Error Types
| Error Type | Description | Retry Behavior |
|---|---|---|
| ApplicationError | Business logic failure | Configurable via NonRetryableErrorTypes |
| TimeoutError | Activity exceeded time limit | Retries (if within schedule-to-close) |
| CanceledError | Activity was explicitly canceled | No retry |
| TerminatedError | Workflow was terminated | No retry (fatal) |
The Saga Pattern
Sagas manage long-running transactions across multiple services. When one step fails, compensating actions undo previous steps.
A saga is a sequence of local transactions where each service updates its data and publishes an event or calls the next service. If a step fails, compensating transactions are executed to undo the changes.
func SagaOrderWorkflow(ctx workflow.Context, order Order) error {
// Track completed steps for compensation
var completedSteps []string
var compensations []func() error
// Helper to execute with compensation tracking
execute := func(name string, fn, comp interface{}) error {
err := workflow.ExecuteActivity(ctx, fn, order).Get(ctx, nil)
if err != nil {
// Execute compensations in reverse order
for i := len(compensations) - 1; i >= 0; i-- {
_ = compensations[i]()
}
return err
}
completedSteps = append(completedSteps, name)
compensations = append(compensations, func() error {
return workflow.ExecuteActivity(ctx, comp, order).Get(ctx, nil)
})
return nil
}
// Step 1: Reserve inventory
if err := execute("reserve", ReserveInventory, ReleaseInventory); err != nil {
return err
}
// Step 2: Process payment
if err := execute("payment", ProcessPayment, RefundPayment); err != nil {
return err
}
// Step 3: Create shipment
if err := execute("shipment", CreateShipment, CancelShipment); err != nil {
return err
}
// Step 4: Send notification (no compensation needed)
_ = workflow.ExecuteActivity(ctx, SendNotification, order).Get(ctx, nil)
return nil
}
Caveats when designing sagas:
- Compensations can also fail, so design them to be idempotent
- Some operations cannot be truly undone (emails sent, notifications delivered)
- Consider the business implications: partial compensation may leave the system in a consistent but unexpected state
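The first caveat, idempotent compensations, usually comes down to checking state before acting. A minimal sketch, with an in-memory ledger standing in for the payment provider's transaction state (names are illustrative):

```go
package main

import "fmt"

// ledger is a stand-in for the payment provider's view of each transaction.
type ledger map[string]string // transaction ID -> "charged" | "refunded"

// refund is idempotent: refunding an already-refunded transaction is a
// no-op, so a retried compensation can never double-refund the customer.
func refund(l ledger, txn string) string {
	if l[txn] == "refunded" {
		return "already refunded"
	}
	l[txn] = "refunded"
	return "refunded"
}

func main() {
	l := ledger{"txn-1": "charged"}
	fmt.Println(refund(l, "txn-1")) // refunded
	fmt.Println(refund(l, "txn-1")) // already refunded - safe to retry
}
```

With this shape, the saga can retry a failed compensation aggressively without worrying about applying it twice.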
Schedules & Cron Workflows
Temporal supports scheduled workflows—perfect for recurring jobs:
// DailyReportWorkflow runs on a schedule
func DailyReportWorkflow(ctx workflow.Context) error {
// This workflow starts, completes, and Temporal schedules the next run
activityOpts := workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Minute,
}
ctx = workflow.WithActivityOptions(ctx, activityOpts)
// Generate daily report
yesterday := workflow.Now(ctx).Add(-24 * time.Hour)
var report Report
if err := workflow.ExecuteActivity(ctx, GenerateDailyReport, yesterday).Get(ctx, &report); err != nil {
return err
}
// Email report
return workflow.ExecuteActivity(ctx, EmailReport, report).Get(ctx, nil)
}
// Schedule creation via client:
// schedule, err := temporalClient.ScheduleClient().Create(ctx, client.ScheduleOptions{
// ID: "daily-report",
// Spec: client.ScheduleSpec{
// CronExpressions: []string{"0 9 * * *"}, // 9 AM daily
// },
// Action: &client.ScheduleWorkflowAction{
// ID: "daily-report-workflow",
// Workflow: DailyReportWorkflow,
// TaskQueue: "report-queue",
// },
// })
Production Deployment
Temporal Server Architecture
- Frontend: Accepts API calls, routes to appropriate services
- Matching: Pairs workflow tasks with available workers
- History: Manages workflow event persistence
- Worker: Background processing (archival, etc.)
Docker Compose Deployment
version: '3'
services:
postgresql:
image: postgres:15-alpine
environment:
POSTGRES_USER: temporal
POSTGRES_PASSWORD: temporal
POSTGRES_DB: temporal
volumes:
- postgresql_data:/var/lib/postgresql/data
elasticsearch:
image: elasticsearch:8.11.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
temporal-server:
image: temporalio/auto-setup:1.22.3 # auto-setup creates the DB schemas on first start
depends_on:
- postgresql
- elasticsearch
environment:
- DB=postgresql
- DB_PORT=5432
- POSTGRES_SEEDS=postgresql
- POSTGRES_USER=temporal
- POSTGRES_PWD=temporal
- ENABLE_ES=true
- ES_SEEDS=elasticsearch
- ES_PORT=9200
ports:
- "7233:7233"
temporal-ui:
image: temporalio/ui:2.25.0
depends_on:
- temporal-server
environment:
- TEMPORAL_ADDRESS=temporal-server:7233
ports:
- "8080:8080"
volumes:
postgresql_data:
elasticsearch_data:
Kubernetes Deployment
# Add Temporal Helm repo
helm repo add temporal https://temporalio.github.io/helm-charts
helm repo update
# Install with production values
helm install temporal temporal/temporal \
--values values-production.yaml \
--set server.replicaCount=3 \
--set cassandra.enabled=false \
--set postgresql.enabled=true \
--set elasticsearch.enabled=true \
--set prometheus.enabled=true \
--set grafana.enabled=true
# values-production.yaml highlights:
server:
config:
persistence:
default:
driver: "sql"
sql:
driver: "postgres12"
host: "postgresql.temporal.svc"
port: 5432
database: "temporal"
user: "temporal"
password: "${POSTGRES_PASSWORD}"
visibility:
driver: "es"
es:
version: "v8"
url:
scheme: "http"
host: "elasticsearch.temporal.svc:9200"
# Resource requests
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
# Deploy workers as separate Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: temporal-worker
spec:
replicas: 5
selector:
matchLabels:
app: temporal-worker
template:
metadata:
labels:
app: temporal-worker
spec:
containers:
- name: worker
image: myapp/temporal-worker:v1.0.0
env:
- name: TEMPORAL_HOST
value: "temporal-frontend.temporal.svc:7233"
resources:
requests:
memory: "512Mi"
cpu: "250m"
Observability & Debugging
Temporal Web UI
The Temporal Web UI provides:
- Workflow History: Every event, with timing and results
- Stack Trace: See exactly where a workflow is blocked
- Query Workflows: Execute read-only queries against running workflows
- Signal/Terminate: Manually interact with workflows
- Search: Find workflows by custom search attributes
Custom Search Attributes
// In your workflow, upsert searchable attributes
func OrderWorkflow(ctx workflow.Context, input OrderWorkflowInput) (*OrderWorkflowResult, error) {
// Make order attributes searchable (custom search attributes must be
// registered with the cluster before they can be upserted)
workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
"CustomOrderID": input.OrderID,
"CustomCustomerID": input.CustomerID,
"CustomOrderStatus": "PENDING",
})
// ... workflow logic ...
// Update status
workflow.UpsertSearchAttributes(ctx, map[string]interface{}{
"CustomOrderStatus": "COMPLETED",
})
return result, nil
}
// Then search via the CLI:
// temporal workflow list --query "CustomOrderStatus='FAILED' AND CustomCustomerID='cust-123'"
// Or in Web UI: CustomOrderStatus="FAILED"
Metrics and Alerting
Temporal exposes Prometheus metrics:
# Workflow success rate (SDK metric names; exact names vary by SDK and version)
sum(rate(temporal_workflow_completed[5m]))
/
(sum(rate(temporal_workflow_completed[5m])) + sum(rate(temporal_workflow_failed[5m])))
# Activity failure rate by type
sum(rate(temporal_activity_execution_failed_count[5m])) by (activity_type)
# Worker task backlog
temporal_task_scheduler_pending_tasks{namespace="default"}
# Workflow execution latency (p99)
histogram_quantile(0.99,
sum(rate(temporal_workflow_execution_latency_bucket[5m])) by (le)
)
Temporal vs Alternatives
| Feature | Temporal | Cadence | AWS Step Functions | Camunda |
|---|---|---|---|---|
| Self-hosted | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes |
| Code-first | ✅ Yes | ✅ Yes | ❌ JSON/YAML | ⚠️ BPMN |
| Multi-language | Go, Java, TS, Python, PHP, .NET | Go, Java | JSON only | Java, JavaScript |
| Max workflow duration | Unlimited | Unlimited | 1 year | Unlimited |
| Community | Active, growing | Smaller (maintained by Uber) | AWS only | Active |
| Cloud offering | Temporal Cloud | None | AWS Step Functions | Camunda Cloud |
Conclusion
Temporal solves one of the hardest problems in distributed systems: reliably orchestrating complex operations across unreliable infrastructure. By separating business logic from reliability concerns, it lets developers focus on what matters.
Key takeaways:
- Durable execution means your workflows survive anything short of a database wipe
- Code-first workflows are more maintainable than JSON/YAML definitions
- Activity retries and sagas handle failure automatically
- Observability is built-in—see exactly what happened and when
- Self-hosted or cloud—choose based on your compliance needs
Use Temporal when you have multi-step business processes that span services and time. If your code has retry loops, state machines, or saga orchestration—you should probably be using Temporal.
Start small with the local development server, then scale to production with confidence. The investment in learning Temporal pays dividends in reliability and maintainability.