Observability Guide
Overview
The banyan-core platform provides automatic observability with zero configuration required from developers. All logging, tracing, and metrics are captured and correlated automatically.
Core Principle
Developers write zero observability code. The platform automatically:
- Captures logs with correlation IDs
- Traces requests across service boundaries
- Collects performance metrics
- Correlates logs, traces, and metrics
- Exports to Jaeger, Elasticsearch, and Grafana
Distributed Tracing
Automatic Tracing
Every request automatically generates a distributed trace:
```
External Request → API Gateway → Service A → Service B
        │              │             │           │
        └──────────────┴─────────────┴───────────┘
               All correlated by trace ID
```

Trace Propagation
Correlation IDs propagate automatically through the following (a propagation sketch follows this list):
- HTTP requests (via the X-Correlation-ID header)
- Message bus (via MessageEnvelope.correlationId)
- Database queries (via query comments)
- Logs (included in every log entry)
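
For illustration, the gateway-side propagation amounts to roughly the following. This is a sketch, not the platform's code; `ensureCorrelationId` is a hypothetical helper, and the platform middleware does all of this for you automatically.

```typescript
// Sketch only: the platform middleware performs this automatically.
import { randomUUID } from 'node:crypto';

// Hypothetical helper: reuse an incoming X-Correlation-ID, otherwise mint one.
function ensureCorrelationId(headers: Record<string, string | undefined>): string {
  return headers['x-correlation-id'] ?? `cor_${randomUUID()}`;
}

// The same ID is then attached to every outbound hop (HTTP, envelope, logs).
async function forwardRequest(incomingHeaders: Record<string, string | undefined>) {
  const correlationId = ensureCorrelationId(incomingHeaders);
  return fetch('http://user-service/users', {
    method: 'POST',
    headers: { 'X-Correlation-ID': correlationId },
  });
}
```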
Viewing Traces
Jaeger UI
Access at: http://localhost:16686
Features:
- Search traces by service, operation, tags
- View trace timelines
- Analyze service dependencies
- Identify performance bottlenecks
- See error traces
Example Search:
```
Service: user-service
Operation: CreateUserHandler
Tags: http.status_code=200
Lookback: Last 1 hour
```

Trace Details
Each trace shows:
- Total duration across all services
- Service spans with individual timings
- Tags: HTTP method, status code, user ID, etc.
- Logs: Application logs tied to trace
- Errors: Stack traces for failures
Trace Context
W3C Trace Context is automatically included:
```typescript
interface TraceContextData {
  traceId: string;      // 32 hex characters
  spanId: string;       // 16 hex characters
  traceFlags: string;   // 2 hex characters
  traceState?: string;  // Optional vendor data
}
```

Example:
```
traceId:    0af7651916cd43dd8448eb211c80319c
spanId:     b7ad6b7169203331
traceFlags: 01
```
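
On the wire these fields travel in the W3C `traceparent` header, which joins them with dashes (spec version `00` first). A minimal sketch of composing it from the interface above (`toTraceparent` is illustrative, not a platform API):

```typescript
// Compose a W3C traceparent header: {version}-{traceId}-{spanId}-{traceFlags}
function toTraceparent(ctx: TraceContextData): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.traceFlags}`;
}

// → "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
toTraceparent({
  traceId: '0af7651916cd43dd8448eb211c80319c',
  spanId: 'b7ad6b7169203331',
  traceFlags: '01',
});
```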
Logging
Automatic Log Capture
All logs automatically include:
{ "timestamp": "2025-11-15T10:30:00.123Z", "level": "info", "message": "User created successfully", "serviceName": "user-service", "correlationId": "cor_abc123xyz", "traceId": "0af7651916cd43dd8448eb211c80319c", "spanId": "b7ad6b7169203331", "userId": "usr_1234567890", "context": { "email": "alice@example.com", "userId": "usr_1234567890" }}Logger Usage
Use the platform logger in handlers:
```typescript
import { Logger } from '@banyanai/platform-telemetry';

@CommandHandler(CreateUserContract)
export class CreateUserHandler {
  private readonly logger = Logger.getInstance();

  async handle(input: { email: string; name: string }) {
    this.logger.info('Creating user', { email: input.email, name: input.name });

    const user = await this.userRepository.create(input);

    this.logger.info('User created successfully', { userId: user.id, email: user.email });

    return user;
  }
}
```

Log Levels
```typescript
// Error (automatically captures stack traces)
this.logger.error('Failed to create user', error, { email });

// Warning
this.logger.warn('User email already exists', { email });

// Info
this.logger.info('User created', { userId });

// Debug (development only)
this.logger.debug('Validating user input', { input });
```

Sensitive Data Redaction
Sensitive data is automatically redacted:
```typescript
// Input
this.logger.info('User logged in', {
  email: 'alice@example.com',
  password: 'Secret123',   // ← Redacted
  ssn: '123-45-6789'       // ← Redacted
});
```

Output:

```json
{
  "message": "User logged in",
  "context": {
    "email": "alice@example.com",
    "password": "***REDACTED***",
    "ssn": "***REDACTED***"
  }
}
```

Redacted Fields (a redaction sketch follows this list):
- password, passwd, pwd
- token, apiKey, secret
- ssn, creditCard, cvv
- privateKey, accessToken
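
For a sense of what the redaction pass does, here is a minimal flat sketch; the platform's actual implementation may differ (for example, it may recurse into nested objects):

```typescript
// Illustrative sketch of field-based redaction; not the platform's actual code.
const SENSITIVE_KEYS = new Set([
  'password', 'passwd', 'pwd',
  'token', 'apikey', 'secret',
  'ssn', 'creditcard', 'cvv',
  'privatekey', 'accesstoken',
]);

// Replace values of sensitive top-level keys, leave everything else untouched.
function redact(context: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    result[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '***REDACTED***' : value;
  }
  return result;
}

// redact({ email: 'alice@example.com', password: 'Secret123' })
// → { email: 'alice@example.com', password: '***REDACTED***' }
```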
Viewing Logs
Elasticsearch
Logs are stored in Elasticsearch at: http://localhost:9200
Query logs:
```bash
curl "http://localhost:9200/logs-*/_search?q=correlationId:cor_abc123xyz"
```

Grafana Explore
Access at: http://localhost:5005
Features:
- Search logs by service, level, correlation ID
- Filter by time range
- Correlate with traces
- View log context around errors
Metrics
Automatic Metrics
The platform collects metrics automatically:
Request Metrics:
- Request count by service and operation
- Request duration (p50, p95, p99)
- Error rate by service
- Success rate by operation
Message Bus Metrics:
- Message throughput
- Queue depth
- Processing time
- Retry count
- Circuit breaker state
Database Metrics:
- Query count
- Query duration
- Connection pool usage
- Transaction count
Cache Metrics:
- Hit rate
- Miss rate
- Eviction count
- Memory usage
Custom Metrics
Add business metrics:
```typescript
import { MetricsManager } from '@banyanai/platform-telemetry';

@CommandHandler(ProcessOrderContract)
export class ProcessOrderHandler {
  private readonly metrics = MetricsManager.getInstance();

  async handle(input: { orderId: string }) {
    // Counter
    this.metrics.incrementCounter('orders.processed', { service: 'order-service' });

    // Histogram
    const startTime = Date.now();
    const order = await this.processOrder(input.orderId);
    this.metrics.recordHistogram(
      'orders.processing.duration',
      Date.now() - startTime
    );

    // Gauge
    this.metrics.recordGauge('orders.revenue', order.total, { currency: 'USD' });
  }
}
```

Viewing Metrics
Grafana Dashboards
Access at: http://localhost:5005
Pre-built Dashboards:
- Service Performance
- Message Bus Health
- Database Performance
- Error Rates
- Business Metrics
Custom Dashboards: Create dashboards for your own metrics in the Grafana UI.
Health Monitoring
Automatic Health Checks
Each service exposes a health endpoint:
```bash
# Check service health
curl http://localhost:3001/health

# Response
{
  "status": "healthy",
  "timestamp": "2025-11-15T10:30:00Z",
  "components": {
    "database": "healthy",
    "messageBus": "healthy",
    "cache": "healthy"
  },
  "uptime": 3600
}
```

Component Health
Individual component checks:
```typescript
import { HealthMonitoring } from '@banyanai/platform-telemetry';

const health = HealthMonitoring.getInstance();

// Check database
const dbHealth = await health.checkDatabase();
console.log(dbHealth.status); // 'healthy' | 'degraded' | 'unhealthy'

// Check message bus
const mbHealth = await health.checkMessageBus();
console.log(mbHealth.status);
```

Health Alerts
Configure alerts for unhealthy components:
```typescript
{
  alerting: {
    channels: ['email', 'slack'],
    thresholds: {
      errorRate: 0.05,      // Alert at 5% error rate
      responseTime: 1000,   // Alert if p95 > 1s
      queueDepth: 1000      // Alert if queue > 1000
    }
  }
}
```

Correlation
Request Correlation
Every request has a unique correlation ID:
```
HTTP Request:      X-Correlation-ID: cor_abc123xyz
        ↓
MessageEnvelope:   correlationId: cor_abc123xyz
        ↓
Log Entry:         correlationId: cor_abc123xyz
        ↓
Trace:             traceId: cor_abc123xyz
```

Finding Related Data
Search across all observability data:
```bash
# Logs in Elasticsearch
curl "http://localhost:9200/logs-*/_search?q=correlationId:cor_abc123xyz"

# Traces in Jaeger
curl "http://localhost:16686/api/traces?service=user-service&tag=correlationId:cor_abc123xyz"
```

Cross-Service Tracing
Trace requests across multiple services:
```
Trace: cor_abc123xyz
├─ Span: API Gateway (50ms)
│  └─ HTTP POST /api/users
├─ Span: User Service (150ms)
│  ├─ CreateUserHandler (100ms)
│  └─ Database Insert (50ms)
└─ Span: Email Service (200ms)
   └─ SendWelcomeEmail (200ms)
```

Total duration: 400ms (spans overlap)
OpenTelemetry Integration
Automatic Instrumentation
The platform uses OpenTelemetry for the following (a registration sketch follows this list):
- HTTP instrumentation (Express, fetch)
- Database instrumentation (PostgreSQL)
- Message bus instrumentation (RabbitMQ)
- Cache instrumentation (Redis)
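
Expressed with standard upstream OpenTelemetry packages, that instrumentation set looks roughly like the sketch below. The platform wires this up for you; the exact packages and options it uses internally may differ.

```typescript
// Rough equivalent of the platform's auto-instrumentation, using upstream OTel packages.
import { registerInstrumentations } from '@opentelemetry/instrumentation';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';
import { AmqplibInstrumentation } from '@opentelemetry/instrumentation-amqplib';
import { IORedisInstrumentation } from '@opentelemetry/instrumentation-ioredis';

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),      // Node HTTP/HTTPS
    new ExpressInstrumentation(),   // Express routes and middleware
    new PgInstrumentation(),        // PostgreSQL queries
    new AmqplibInstrumentation(),   // RabbitMQ publish/consume
    new IORedisInstrumentation(),   // Redis commands
  ],
});
```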
OTLP Export
Telemetry is exported via OTLP:
```
Services → OTLP Exporter → Jaeger → Elasticsearch
```

Jaeger endpoint: http://jaeger:4318/v1/traces
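
The platform configures this export internally. A hand-rolled equivalent using the standard OpenTelemetry Node SDK would look roughly like this (package names are the upstream OTel ones, not platform APIs):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Send spans over OTLP/HTTP to the collector endpoint shown above.
const sdk = new NodeSDK({
  serviceName: 'user-service',
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});

sdk.start();
```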
Custom Spans
Add custom spans for detailed tracing:
```typescript
import { TelemetrySDK } from '@banyanai/platform-telemetry';

@CommandHandler(ComplexOperationContract)
export class ComplexOperationHandler {
  private readonly telemetry = TelemetrySDK.getInstance();

  async handle(input: any) {
    return await this.telemetry.withSpan(
      'complex-operation',
      async (span) => {
        span.setAttribute('operation.type', 'complex');
        span.setAttribute('input.size', JSON.stringify(input).length);

        // Sub-operation 1
        await this.telemetry.withSpan('step-1', async () => {
          await this.processStep1(input);
        });

        // Sub-operation 2
        const result = await this.telemetry.withSpan('step-2', async () => {
          return this.processStep2(input);
        });

        return result;
      }
    );
  }
}
```

Performance Analysis
Identifying Bottlenecks
Use Jaeger to find slow operations (a query sketch follows these steps):
- Search for traces with high duration
- Sort by duration descending
- Analyze span timeline
- Identify slowest span
- Optimize that operation
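
The first two steps can also be scripted against the same Jaeger query API the UI calls; a sketch, where the 500ms threshold and service name are arbitrary choices:

```typescript
// Pull up to 20 recent user-service traces slower than 500ms from Jaeger.
const params = new URLSearchParams({
  service: 'user-service',
  minDuration: '500ms',
  limit: '20',
});

const res = await fetch(`http://localhost:16686/api/traces?${params}`);
const { data: traces } = await res.json();
console.log(`Found ${traces.length} slow traces`);
```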
Common Bottlenecks
| Bottleneck | Symptom | Solution |
|---|---|---|
| Database Query | Span shows slow DB time | Add index, optimize query |
| External API | Long span for HTTP call | Add caching, async processing |
| Message Processing | High queue depth | Scale consumers, optimize handler |
| Serialization | Slow message marshalling | Use compression, reduce payload |
Service Performance Metrics
Monitor in Grafana (a percentile sketch follows this list):
- P50 Response Time: Median response time
- P95 Response Time: 95th percentile (slow requests)
- P99 Response Time: 99th percentile (outliers)
- Error Rate: Percentage of failed requests
- Throughput: Requests per second
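
Grafana computes these for you, but as a reminder of what the numbers mean, a percentile is just a rank in the sorted list of request durations. A nearest-rank sketch:

```typescript
// Nearest-rank percentile over raw request durations (milliseconds).
function percentile(durationsMs: number[], p: number): number {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
}

const durations = [12, 15, 18, 22, 30, 45, 80, 120, 250, 900];
percentile(durations, 50); // 30  (median: half of the samples are ≤ 30ms)
percentile(durations, 95); // 900 (at least 95% of the samples are ≤ 900ms)
```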
Best Practices
1. Use Structured Logging
Section titled “1. Use Structured Logging”// Good: Structured contextthis.logger.info('User created', { userId: user.id, email: user.email, role: user.role});
// Avoid: String concatenationthis.logger.info(`User ${user.id} created with email ${user.email}`);2. Log at Appropriate Levels
```typescript
// Error: Requires immediate attention
this.logger.error('Database connection failed', error);

// Warning: Potential issue, not blocking
this.logger.warn('Cache miss, falling back to database');

// Info: Normal operation
this.logger.info('User logged in', { userId });

// Debug: Detailed debugging (dev only)
this.logger.debug('Query parameters', { params });
```

3. Add Context to Errors
```typescript
try {
  await this.userRepository.create(input);
} catch (error) {
  this.logger.error('Failed to create user', error, {
    email: input.email,
    attemptNumber: retryCount
  });
  throw error;
}
```

4. Use Meaningful Metric Names
Section titled “4. Use Meaningful Metric Names”// Good: Clear, hierarchicalthis.metrics.incrementCounter('orders.processed.success');this.metrics.incrementCounter('orders.processed.failed');
// Avoid: Ambiguousthis.metrics.incrementCounter('count');this.metrics.incrementCounter('processed');5. Don’t Over-Instrument
Section titled “5. Don’t Over-Instrument”// Good: Instrument critical pathsthis.metrics.recordHistogram('payment.processing.duration', duration);
// Avoid: Too granularthis.metrics.recordHistogram('variable.x.assignment.duration', 0.001);Troubleshooting
Traces Not Appearing
Cause: Jaeger not receiving traces
Solution:
```bash
# Check Jaeger is running
docker compose ps jaeger

# Check endpoint configuration
echo $JAEGER_ENDPOINT

# Verify telemetry provider initialized
curl http://localhost:3001/health | grep telemetry
```

Logs Not in Elasticsearch
Cause: Elasticsearch not running or wrong configuration
Solution:
```bash
# Check Elasticsearch health
curl http://localhost:9200/_cluster/health

# Check indices
curl http://localhost:9200/_cat/indices?v

# Verify logs index exists
curl "http://localhost:9200/logs-*/_count"
```

Missing Correlation IDs
Cause: Request missing correlation ID header
Solution: API Gateway automatically generates correlation IDs, but ensure you’re using the platform’s HTTP client for external calls.
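
If you do issue a raw fetch to an external system yourself, forward the header manually so downstream logs stay correlated. A sketch, where `currentCorrelationId` stands in for however you obtain the active ID in your handler:

```typescript
// Forward the active correlation ID on calls made outside the platform's HTTP client.
async function callExternalApi(url: string, currentCorrelationId: string) {
  return fetch(url, {
    headers: { 'X-Correlation-ID': currentCorrelationId },
  });
}
```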