# Jaeger Tracing
## Overview
Jaeger is the platform’s distributed tracing system. It tracks requests as they flow through multiple services, helping you:
- Find Performance Bottlenecks: Identify slow operations
- Debug Errors: See exactly where failures occur
- Understand Request Flow: Visualize service interactions
- Analyze Dependencies: See which services call which
## Accessing Jaeger UI
```bash
# Open Jaeger UI in browser
open http://localhost:16686

# Or direct URL
http://localhost:16686
```

No authentication required for local development.
## Key Concepts
### Trace

A complete request journey from start to finish:
```
Client Request → API Gateway → User Service → Database → Email Service
```

One trace = one complete request flow.
### Span

A single operation within a trace:
```
Trace: Create User Request
├─ Span: API Gateway Request
├─ Span: Message Bus Publish
├─ Span: Handle CreateUserCommand
│  ├─ Span: Validate Input
│  ├─ Span: Database Save
│  └─ Span: Publish UserCreatedEvent
└─ Span: Send Response
```

### Tags

Metadata attached to spans:
```
service.name: user-service
http.method: POST
http.status_code: 200
correlation.id: abc-123-def-456
error: true
```
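These tags are ordinary span attributes. As a rough sketch of how a handler might produce them with an OpenTelemetry-style API (the tracer name and `correlationId` plumbing are assumptions, not the platform’s actual telemetry wrapper):

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// Illustrative tracer name; the platform's telemetry layer presumably
// registers one tracer per service.
const tracer = trace.getTracer("user-service");

// Hypothetical handler showing how the tags above end up on a span.
async function handleCreateUser(input: { email: string }, correlationId: string) {
  return tracer.startActiveSpan("Handle CreateUserCommand", async (span) => {
    try {
      // Attributes become the searchable tags shown in the Jaeger UI.
      span.setAttribute("correlation.id", correlationId);
      span.setAttribute("message.type", "CreateUserCommand");
      // ... validate input, save to database, publish UserCreatedEvent ...
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR }); // shows up as error=true
      throw err;
    } finally {
      span.end(); // duration is measured from startActiveSpan to end()
    }
  });
}
```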
## Finding Traces

### Search by Service
```
1. Select Service: user-service
2. Click "Find Traces"
```

Shows all traces involving that service.
### Search by Operation
```
1. Select Service: user-service
2. Select Operation: CreateUserHandler
3. Click "Find Traces"
```

Shows only CreateUser operations.
### Search by Tags
```
1. Select Service: user-service
2. Tags: correlation.id="abc-123-def-456"
3. Click "Find Traces"
```

Finds a specific request by correlation ID.
### Search by Duration
```
1. Select Service: user-service
2. Min Duration: 1s
3. Click "Find Traces"
```

Finds slow requests (> 1 second).
### Search by Error Status
```
1. Select Service: user-service
2. Tags: error=true
3. Click "Find Traces"
```

Shows only failed requests.
### Search by Time Range
```
1. Lookback: Last 15m OR
2. Custom: 2024-01-15 12:00 to 13:00
3. Click "Find Traces"
```
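If you need to script these searches, the JSON API that backs the Jaeger UI can be queried directly. Note this API is internal and undocumented, so parameter names and the response shape may vary between Jaeger versions; the sketch below reflects common behavior:

```typescript
// Script the same search against Jaeger's internal query API (the one the
// UI uses). Timestamps are microseconds since the epoch.
async function findSlowTraces() {
  const end = Date.now() * 1000;
  const start = end - 15 * 60 * 1_000_000; // last 15 minutes

  const params = new URLSearchParams({
    service: "user-service",
    operation: "CreateUserHandler",
    minDuration: "1s",
    start: String(start),
    end: String(end),
    limit: "20",
  });

  const res = await fetch(`http://localhost:16686/api/traces?${params}`);
  const { data } = await res.json();

  for (const t of data) {
    // Span durations in the response are reported in microseconds.
    console.log(t.traceID, t.spans.length, "spans");
  }
}
```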
## Analyzing Traces

### Trace View
Click on a trace to see details:
```
Timeline View:
─────────────────────────────────────────────────
API Gateway         [════════════════] 500ms
  Auth Check          [══] 50ms
  Route Request       [══] 50ms
  Message Publish     [════] 100ms
User Service        [════════════] 400ms
  Handle Command      [═══] 150ms
  Database Save       [═══] 200ms
  Event Publish       [═] 50ms
─────────────────────────────────────────────────
Total Duration: 500ms
```
### Span Details

Click on a span to see:
- Duration: How long it took
- Start Time: When it started
- Tags: Metadata (service, operation, correlation ID)
- Logs: Events within the span
- Process: Which service executed it
### Identifying Bottlenecks
Look for:
- Long Spans: Operations taking the most time
- Sequential Delays: Operations not parallelized
- External Calls: Network/API latency
- Database Queries: Slow queries
Example bottleneck:
```
Handle CreateUserCommand [═══════════════] 5000ms
 ├─ Validate Input       [═] 100ms
 ├─ Check Email Unique   [════] 2000ms  ← SLOW!
 ├─ Create User          [═] 200ms
 └─ Send Welcome Email   [══════] 2500ms ← SLOW!
```

Optimization targets:
- Add database index for email lookup
- Send welcome email asynchronously (see the sketch after this list)
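The second fix usually means moving the email off the request path: the handler publishes the event and returns, and the email service reacts to it. A minimal sketch, with hypothetical `users` and `eventBus` stand-ins for the platform’s repository and message-bus APIs:

```typescript
// All names here (CreateUserCommand, users, eventBus) are hypothetical
// stand-ins, not the platform's actual CQRS interfaces.
interface CreateUserCommand { email: string }
declare const users: { save(cmd: CreateUserCommand): Promise<{ id: string }> };
declare const eventBus: { publish(type: string, payload: unknown): Promise<void> };

async function handleCreateUser(cmd: CreateUserCommand) {
  const user = await users.save(cmd); // stays on the request's critical path

  // Before: awaiting the mail send here added ~2.5s to every request.
  // After: publish the event and return immediately; the email service
  // consumes UserCreatedEvent and sends the welcome mail asynchronously.
  await eventBus.publish("UserCreatedEvent", { userId: user.id });

  return user;
}
```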
## Common Use Cases
### 1. Find Slow Requests
Goal: Identify requests taking > 1 second
Steps:
```
1. Service: (All)
2. Min Duration: 1s
3. Limit Results: 20
4. Click "Find Traces"
```

Analysis:
- Sort by duration (longest first)
- Click trace to see timeline
- Identify slowest span
- Optimize that operation
### 2. Debug Failed Request
Goal: Understand why a request failed
Get the correlation ID from the error response:
```bash
curl http://localhost:3000/api/endpoint | jq '.correlationId'
# Returns: "abc-123-def-456"
```

Search in Jaeger:
```
1. Service: (All)
2. Tags: correlation.id="abc-123-def-456"
3. Click "Find Traces"
```

Analysis:
- Find span with error tag
- Check span logs for error message
- View parent spans to see request flow
- Identify failure point
### 3. Trace Request Flow
Goal: See how a request flows through services
Steps:
```
1. Service: api-gateway
2. Operation: POST /api/create-user
3. Limit: 1
4. Click "Find Traces"
```

Analysis:
View trace to see:
```
API Gateway
└─> User Service
    ├─> Database
    └─> Event Bus
        ├─> Email Service
        └─> Notification Service
```
### 4. Compare Request Performance

Goal: See if performance degraded over time
Steps:
```
# Get recent requests
1. Service: user-service
2. Operation: CreateUserHandler
3. Lookback: 1h
4. Click "Find Traces"

# Compare durations
5. Sort by Duration
6. Check average duration
7. Identify outliers
```
### 5. Find Errors in Time Range

Goal: See all errors during a deployment window
Steps:
```
1. Service: (All)
2. Tags: error=true
3. Lookback: Custom (deployment time range)
4. Click "Find Traces"
```

Analysis:
- Group errors by service
- Identify common error patterns
- Check if error rate increased after deployment
## Span Tags Reference
### Standard Tags
All spans include:
```
service.name: Service that created the span
service.version: Service version
correlation.id: Request correlation ID
component: Platform component (e.g., "message-bus", "cqrs")
```
### HTTP Tags
API Gateway spans:
```
http.method: GET, POST, etc.
http.url: Request URL
http.status_code: Response status
http.user_agent: Client user agent
```
### CQRS Tags
Command/Query handler spans:
```
message.type: Command or query name
handler.name: Handler class name
user.id: Authenticated user ID
permissions: User permissions
```
### Message Bus Tags
Message publishing/consuming:
```
message.exchange: RabbitMQ exchange
message.routing_key: Routing key
message.correlation_id: Message correlation ID
```
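As a sketch of where these tags come from on the publish side (the platform’s real message-bus wrapper is assumed, not shown; this uses plain `amqplib` plus the OpenTelemetry API):

```typescript
import { trace } from "@opentelemetry/api";
import type { Channel } from "amqplib";

const tracer = trace.getTracer("message-bus");

// Illustrative publish wrapper: tags the span and forwards the correlation
// ID in the AMQP message properties so consumers can continue the trace.
async function publishTraced(
  channel: Channel,
  exchange: string,
  routingKey: string,
  body: unknown,
  correlationId: string,
): Promise<void> {
  await tracer.startActiveSpan("Message Bus Publish", async (span) => {
    span.setAttribute("message.exchange", exchange);
    span.setAttribute("message.routing_key", routingKey);
    span.setAttribute("message.correlation_id", correlationId);

    channel.publish(exchange, routingKey, Buffer.from(JSON.stringify(body)), {
      correlationId,
    });
    span.end();
  });
}
```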
### Database Tags
Database operation spans:
```
db.type: Database type (PostgreSQL, Redis)
db.statement: SQL query
db.instance: Database name
```
### Error Tags
Failed operations:
```
error: true
error.kind: Error type
error.message: Error message
error.stack: Stack trace (in logs)
```
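With an OpenTelemetry-style API, these tags typically come from marking the span as failed. A sketch (the `saveUser` operation is hypothetical): `recordException` puts the message and stack trace into the span’s logs, and the ERROR status is what Jaeger surfaces as `error=true`:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

declare function saveUser(): Promise<void>; // hypothetical operation

const tracer = trace.getTracer("user-service");

async function saveWithTracing(): Promise<void> {
  await tracer.startActiveSpan("Database Save", async (span) => {
    try {
      await saveUser();
    } catch (err) {
      span.recordException(err as Error); // → error.message / error.stack logs
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
}
```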
## Performance Analysis
### Identifying Slow Operations
Workflow:
- Search for traces with min duration > threshold
- Sort by duration
- Open slowest trace
- Look at timeline view
- Find longest span
- Check span tags and logs
- Identify optimization opportunity
Example:
```
Trace: Create User (Total: 3000ms)

Spans:
├─ API Gateway           [═══] 100ms
├─ Message Bus Publish   [═] 50ms
└─ User Service          [════════════] 2850ms
   ├─ Handle Command     [═] 100ms
   ├─ Validate Input     [═] 50ms
   ├─ Database Query     [═════════] 2500ms ← 83% of total time!
   └─ Event Publish      [══] 200ms
```

Finding: Database query taking 2.5s (83% of total)
Action: Add database index
### Comparing Service Performance
Goal: See which service is slowest in the request chain
View:
```
Service Performance (Average Duration):
- API Gateway:   50ms
- User Service:  2000ms ← Bottleneck
- Email Service: 200ms
- Notification:  100ms
```

Analysis: Focus optimization on User Service
### Detecting N+1 Problems
Symptom: Many sequential database queries
Trace View:
```
Handle ListUsers   [════════════════════] 5000ms
 ├─ Get Users      [═] 100ms
 ├─ Get Org (1)    [═] 50ms
 ├─ Get Org (2)    [═] 50ms
 ├─ Get Org (3)    [═] 50ms
 ... (100 more)
 └─ Get Org (100)  [═] 50ms
```

Solution: Batch load organizations or use a JOIN.
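A sketch of the batching fix, with a hypothetical `db.query` helper standing in for the platform’s data access layer:

```typescript
// Hypothetical query helper and row shapes, for illustration only.
declare const db: { query<T>(sql: string, params?: unknown[]): Promise<T[]> };
interface User { id: string; orgId: string }
interface Org { id: string; name: string }

async function listUsersWithOrgs() {
  const users = await db.query<User>('SELECT id, org_id AS "orgId" FROM users');

  // N+1 version (the 100 sequential "Get Org" spans above):
  // for (const u of users) await db.query("SELECT ... WHERE id = $1", [u.orgId]);

  // Batched version: one round trip for all organizations.
  const orgIds = [...new Set(users.map((u) => u.orgId))];
  const orgs = await db.query<Org>(
    "SELECT id, name FROM organizations WHERE id = ANY($1)",
    [orgIds],
  );
  const byId = new Map(orgs.map((o) => [o.id, o] as const));

  return users.map((u) => ({ ...u, org: byId.get(u.orgId) }));
}
```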
## Correlation ID Tracking
Every request has a unique correlation ID that links:
- Client request
- All service hops
- Database operations
- Event publications
- Logs
Finding Request by Correlation ID:
```
Tags: correlation.id="abc-123-def-456"
```

Correlation ID Sources:

- HTTP Response Header: `X-Correlation-Id`
- Error Response Body: `correlationId` field
- Service Logs: `correlationId` field
- Jaeger Trace Tags
See also: Correlation ID Tracking
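A small sketch of pulling the ID from a response so it can be pasted into the Jaeger tag search (the endpoint is illustrative; the header and field names follow the list above):

```typescript
async function getCorrelationId(): Promise<string | undefined> {
  const res = await fetch("http://localhost:3000/api/endpoint");

  // The same ID is exposed in the header and, on errors, in the body.
  const fromHeader = res.headers.get("X-Correlation-Id");
  const body = await res.json().catch(() => undefined);

  return fromHeader ?? body?.correlationId;
}

// Usage: paste into Jaeger → Tags: correlation.id="<returned value>"
```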
## Advanced Features
### Trace Comparison
Compare two traces side-by-side:
```
1. Find two traces (e.g., before/after optimization)
2. Click first trace
3. Click "Compare" button
4. Select second trace
5. View differences
```

Shows:
- Duration changes
- New/removed spans
- Tag differences
### Trace Graph View
Visualize service dependencies:
```
1. Open trace
2. Click "Trace Graph" tab
3. View service interaction diagram
```

Shows which services call which in this trace.
### Span Logs
View events within a span:
```
1. Click span
2. View "Logs" section
```

Example logs:
```
timestamp: 2024-01-15 12:00:00.100
event: command_received
message: "CreateUserCommand received"

timestamp: 2024-01-15 12:00:00.150
event: validation_started
message: "Validating input"

timestamp: 2024-01-15 12:00:00.200
event: validation_complete
message: "Input valid"
```
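In OpenTelemetry terms, these span logs are span *events*. A sketch of how a handler might emit them (the event names mirror the example above; the platform’s actual instrumentation layer is assumed):

```typescript
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("user-service");

tracer.startActiveSpan("Handle CreateUserCommand", (span) => {
  // Each addEvent call becomes one timestamped entry in the span's Logs.
  span.addEvent("command_received", { message: "CreateUserCommand received" });
  span.addEvent("validation_started", { message: "Validating input" });
  // ... validate ...
  span.addEvent("validation_complete", { message: "Input valid" });
  span.end();
});
```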
## Best Practices

### 1. Always Use Correlation IDs
When reporting bugs:
```
Issue: User creation fails
Correlation ID: abc-123-def-456
```

Makes debugging much faster.
### 2. Search with Specific Tags
Instead of browsing all traces, use tags:
```
# ✓ GOOD: Specific search
correlation.id="abc-123"
error=true
http.status_code=500

# ❌ BAD: Too broad
service.name="user-service"  # Returns thousands
```
### 3. Use Time Ranges

Narrow down the search with time ranges:
```
# ✓ GOOD: Specific time
Lookback: Last 15m
Custom: 12:00 to 12:15

# ❌ BAD: Too broad
Lookback: Last 7d  # Too many results
```
### 4. Focus on Outliers

When optimizing, focus on:
- Slowest 5% of requests
- Operations with high variance
- Errors
Don’t optimize already-fast operations.
### 5. Monitor Trends
Check Jaeger regularly:
- Daily: Are there new slow operations?
- After deployments: Did performance change?
- During incidents: What’s failing?
## Troubleshooting Jaeger
### No Traces Appearing
Causes:
- Jaeger not running:

  ```bash
  docker ps | grep jaeger

  # If not running:
  docker compose up -d jaeger
  ```

- Service not sending traces (a minimal initialization sketch follows this list):

  ```bash
  # Check telemetry initialization
  docker logs my-service | grep -i "telemetry\|jaeger"

  # Should see: "Telemetry initialized"
  ```

- Wrong time range:

  ```
  # Ensure looking at recent time
  Lookback: Last 15m
  ```
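For reference, a minimal telemetry initialization sketch with the OpenTelemetry Node SDK, assuming Jaeger’s OTLP endpoint on port 4318 (enabled by default in recent Jaeger versions); the platform’s actual bootstrap code may differ:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Minimal setup: export spans to a local Jaeger via OTLP over HTTP.
const sdk = new NodeSDK({
  serviceName: "my-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces",
  }),
});

sdk.start();
console.log("Telemetry initialized");
```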
### Incomplete Traces

Symptom: Trace shows only some spans, missing others
Causes:
- Service crashed before sending span
- Network issues between service and Jaeger
- Sampling - some spans randomly dropped (shouldn’t happen in dev)
Solution:
Check the service logs for errors associated with that correlation ID.
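One mitigation for the crash case is flushing spans on shutdown. A sketch, reusing the `sdk` instance from the initialization example in the previous section:

```typescript
import type { NodeSDK } from "@opentelemetry/sdk-node";

declare const sdk: NodeSDK; // the instance created at startup

// Flush buffered spans before exiting so the tail of a trace isn't lost.
process.on("SIGTERM", async () => {
  try {
    await sdk.shutdown();
  } finally {
    process.exit(0);
  }
});
```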
### High Jaeger Memory Usage
Solution:
Reduce trace retention:
```yaml
jaeger:
  environment:
    - SPAN_STORAGE_TYPE=memory
    - MEMORY_MAX_TRACES=10000  # Default: 10000, reduce if needed
```
## Integration with Other Tools

### Jaeger + Logs
- Find slow request in Jaeger
- Get correlation ID from trace
- Search logs for correlation ID:
```bash
docker logs my-service | grep "abc-123-def-456"
```
### Jaeger + Metrics (Grafana)

- Identify slow operation in Jaeger
- Find operation name
- Search Grafana for that operation’s metrics
- See trends over time
## Quick Reference
### Search Patterns
```
# Find by correlation ID
Tags: correlation.id="abc-123"

# Find errors
Tags: error=true

# Find slow requests
Min Duration: 1s

# Find specific operation
Service: user-service
Operation: CreateUserHandler

# Find HTTP errors
Tags: http.status_code=500

# Find by user
Tags: user.id="user-123"

# Time range
Lookback: Last 15m
Custom: 2024-01-15 12:00 to 13:00
```
### Keyboard Shortcuts

- `g` - Go to search
- `/` - Focus search box
- `Escape` - Close trace detail
- `j` / `k` - Next/previous trace
## Related Documentation
- Correlation ID Tracking - Using correlation IDs
- Log Analysis - Analyzing service logs
- Telemetry Architecture - How tracing works
- Performance Optimization - Optimization strategies
## Summary
Jaeger is essential for:
- Performance Analysis - Find slow operations (Min Duration filter)
- Error Debugging - Trace failures (error=true tag)
- Request Flow - Understand service interactions (Trace timeline)
- Correlation ID Tracking - Link requests across services
Always start troubleshooting by finding the trace for a specific correlation ID or searching for errors/slow requests.