# Monitoring Guide
## Overview

The banyan-core platform provides comprehensive monitoring through Grafana, Jaeger, and Elasticsearch. This guide covers monitoring setup, key metrics, and alerting strategies.
## Monitoring Stack

| Component | Purpose | URL |
|---|---|---|
| Grafana | Dashboards and visualization | http://localhost:5005 |
| Jaeger | Distributed tracing | http://localhost:16686 |
| Elasticsearch | Metrics and logs storage | http://localhost:9200 |
| RabbitMQ Management | Message bus monitoring | http://localhost:55672 |
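
A quick way to confirm the whole stack is up is to probe each URL from the table above. The sketch below assumes Node 18+ (for the global `fetch`); only the URLs come from this guide, everything else is illustrative.

```typescript
// Minimal reachability probe for the monitoring stack listed above.
const endpoints: Record<string, string> = {
  Grafana: "http://localhost:5005",
  Jaeger: "http://localhost:16686",
  Elasticsearch: "http://localhost:9200",
  "RabbitMQ Management": "http://localhost:55672",
};

async function checkStack(): Promise<void> {
  for (const [name, url] of Object.entries(endpoints)) {
    try {
      const res = await fetch(url);
      console.log(`${name.padEnd(20)} ${url} -> HTTP ${res.status}`);
    } catch (err) {
      console.log(`${name.padEnd(20)} ${url} -> unreachable (${(err as Error).message})`);
    }
  }
}

checkStack();
```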
## Grafana Setup

### Accessing Grafana

```
# URL
http://localhost:5005

# Default credentials (development)
Username: admin
Password: admin
```

### Pre-configured Datasources

Grafana automatically connects to:
- Elasticsearch: Logs and metrics
- Jaeger: Distributed traces
### Creating Dashboards

#### Service Performance Dashboard

```json
{
  "dashboard": {
    "title": "Service Performance",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          { "expr": "rate(requests_total[5m])", "datasource": "Elasticsearch" }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          { "expr": "histogram_quantile(0.95, response_time_seconds)", "datasource": "Elasticsearch" }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          { "expr": "rate(errors_total[5m])", "datasource": "Elasticsearch" }
        ]
      }
    ]
  }
}
```

#### Recommended Dashboards
Section titled “Recommended Dashboards”-
Service Overview
- Request rate
- Response time (p50, p95, p99)
- Error rate
- Active instances
-
Message Bus Health
- Queue depth by service
- Message rate (in/out)
- Consumer count
- Dead letter queue depth
-
Database Performance
- Query count
- Query duration
- Connection pool usage
- Slow queries
-
Infrastructure Health
- CPU usage by service
- Memory usage by service
- Disk I/O
- Network throughput
## Key Metrics

### Service Metrics

#### Request Metrics

```
// Automatically collected
requests_total{service, operation, status}
request_duration_seconds{service, operation}
request_errors_total{service, operation}
```

Monitor:
- High error rate (> 1%)
- Slow requests (p95 > 200ms)
- Request rate spikes
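
To make these thresholds concrete, here is a hypothetical check over a snapshot of the counters above. The `RequestMetricsSnapshot` shape and helper are illustrative; only the 1% and 200ms thresholds come from this guide.

```typescript
// Hypothetical snapshot of the request metrics listed above for one service/operation.
interface RequestMetricsSnapshot {
  requestsTotal: number;       // requests_total
  requestErrorsTotal: number;  // request_errors_total
  p95DurationSeconds: number;  // p95 of request_duration_seconds
}

// Apply the thresholds from the "Monitor" list: error rate > 1%, p95 > 200ms.
function evaluateRequestMetrics(m: RequestMetricsSnapshot): string[] {
  const findings: string[] = [];
  const errorRate = m.requestsTotal > 0 ? m.requestErrorsTotal / m.requestsTotal : 0;
  if (errorRate > 0.01) {
    findings.push(`High error rate: ${(errorRate * 100).toFixed(2)}%`);
  }
  if (m.p95DurationSeconds > 0.2) {
    findings.push(`Slow requests: p95 = ${(m.p95DurationSeconds * 1000).toFixed(0)}ms`);
  }
  return findings;
}

console.log(evaluateRequestMetrics({ requestsTotal: 10_000, requestErrorsTotal: 150, p95DurationSeconds: 0.35 }));
```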
#### Handler Metrics

```
// Automatically collected
handler_executions_total{service, handler, status}
handler_duration_seconds{service, handler}
handler_errors_total{service, handler}
```

Monitor:
- Handler failures
- Slow handlers
- Handler throughput
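
These handler metrics are collected automatically, but it can help to see roughly what that instrumentation amounts to. The sketch below uses a `Metrics` interface modeled on the custom-metrics calls shown under "Monitor Business Metrics" later in this guide; the wrapper and metric names are illustrative, not the platform's actual implementation.

```typescript
// Illustrative sketch of per-handler instrumentation.
// The Metrics interface mirrors the incrementCounter/recordHistogram calls
// shown in the Best Practices section; the platform records the real
// handler_* metrics for you.
interface Metrics {
  incrementCounter(name: string): void;
  recordHistogram(name: string, value: number): void;
}

async function instrumentedHandler<T>(
  metrics: Metrics,
  handlerName: string,
  run: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await run();
    metrics.incrementCounter(`handler.${handlerName}.executions`);
    return result;
  } catch (err) {
    metrics.incrementCounter(`handler.${handlerName}.errors`);
    throw err;
  } finally {
    metrics.recordHistogram(`handler.${handlerName}.duration_seconds`, (Date.now() - start) / 1000);
  }
}
```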
### Message Bus Metrics

#### Queue Metrics

```bash
# Via RabbitMQ Management API
curl -u admin:admin123 http://localhost:55672/api/queues | jq '.[].messages_ready'
```

Monitor:
- Queue depth > 100: Scale consumers
- Message rate dropping: Check service health
- Unacked messages > 50: Check handler performance
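
The queue-depth rule can be scripted against the same management API. A minimal sketch, assuming Node 18+ (global `fetch`) and the development credentials used throughout this guide:

```typescript
// Flag queues whose ready-message backlog exceeds the documented threshold (100).
const RABBITMQ_API = "http://localhost:55672/api/queues";
const AUTH = "Basic " + Buffer.from("admin:admin123").toString("base64");

interface QueueInfo {
  name: string;
  messages_ready: number;
  consumers: number;
}

async function findBackedUpQueues(threshold = 100): Promise<QueueInfo[]> {
  const res = await fetch(RABBITMQ_API, { headers: { Authorization: AUTH } });
  if (!res.ok) throw new Error(`Management API returned HTTP ${res.status}`);
  const queues = (await res.json()) as QueueInfo[];
  return queues.filter((q) => q.messages_ready > threshold);
}

findBackedUpQueues().then((queues) => {
  for (const q of queues) {
    console.log(`Queue ${q.name}: ${q.messages_ready} ready messages, ${q.consumers} consumers; consider scaling consumers`);
  }
});
```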
#### Exchange Metrics

```bash
# Event publish rate
curl -u admin:admin123 http://localhost:55672/api/exchanges/%2F/exchange.platform.events | jq '.message_stats.publish_in'
```

Monitor:
- Event publish rate
- Routing failures
- Exchange errors
### Database Metrics

#### Connection Pool

```
// Automatically collected
db_connections_active{service}
db_connections_idle{service}
db_connections_total{service}
```

Monitor:
- Connection pool exhaustion
- High idle connections
- Connection errors
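
Pool exhaustion maps directly onto the "connection pool > 90%" alert described under Critical Alerts below. A tiny illustrative check, with a made-up snapshot shape:

```typescript
// Hypothetical snapshot of the connection pool gauges listed above.
interface PoolSnapshot {
  active: number; // db_connections_active
  idle: number;   // db_connections_idle
  total: number;  // db_connections_total (pool size)
}

// Mirrors the "connection pool > 90%" rule from the Critical Alerts section.
function poolNearlyExhausted(pool: PoolSnapshot, threshold = 0.9): boolean {
  return pool.total > 0 && pool.active / pool.total > threshold;
}

console.log(poolNearlyExhausted({ active: 19, idle: 1, total: 20 })); // true (19/20 = 95%)
```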
#### Query Performance

```
// Automatically collected
db_query_duration_seconds{service, operation}
db_queries_total{service, operation}
db_slow_queries_total{service}
```

Monitor:
- Slow queries (> 1s)
- Query rate spikes
- Query errors
### Cache Metrics

```
// Automatically collected
cache_hits_total{service}
cache_misses_total{service}
cache_evictions_total{service}
cache_memory_bytes{service}
```

Monitor:
- Hit rate < 90%: Increase TTL or cache size
- High eviction rate: Increase memory
- Memory usage trending up
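
The 90% hit-rate guideline is just hits divided by total lookups. A small illustrative helper, using the counter names from the list above:

```typescript
// Compute the cache hit rate from cache_hits_total and cache_misses_total
// and compare it to the 90% guideline above.
function cacheHitRate(hitsTotal: number, missesTotal: number): number {
  const lookups = hitsTotal + missesTotal;
  return lookups === 0 ? 1 : hitsTotal / lookups;
}

const rate = cacheHitRate(91_000, 12_000);
if (rate < 0.9) {
  console.log(`Hit rate ${(rate * 100).toFixed(1)}% is below 90%: consider a larger cache or longer TTLs`);
}
```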
## Distributed Tracing

### Jaeger UI

Access at: http://localhost:16686

### Finding Traces

#### By Service
```
Service: user-service
Lookback: Last 1 hour
Limit Results: 20
```

#### By Operation

```
Service: user-service
Operation: CreateUserHandler
Tags: http.status_code=200
Min Duration: 100ms
```

#### By Error

```
Service: *
Tags: error=true
Lookback: Last 24 hours
```

### Analyzing Traces

#### Slow Requests

- Sort by duration (descending)
- Identify slowest span in timeline
- Check span tags for context
- Optimize identified bottleneck
#### Error Traces

- Filter by error tag
- View error details in span logs
- Check correlation ID for related logs
- Fix root cause
### Trace Metrics

Monitor trace statistics:
- Trace count by service
- Average trace duration
- Error trace percentage
- Service dependencies
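
These statistics can also be derived from Jaeger's query API (the same HTTP API the UI at http://localhost:16686 uses; it is internal and its response shape may vary between Jaeger versions). A sketch assuming Node 18+; the lookback, limit, and duration approximation are illustrative choices:

```typescript
// Derive simple trace statistics for one service from Jaeger's (internal) query API.
const JAEGER = "http://localhost:16686";

interface JaegerSpan {
  duration: number; // microseconds
  tags: { key: string; value: unknown }[];
}
interface JaegerTrace {
  spans: JaegerSpan[];
}

async function traceStats(service: string): Promise<void> {
  const url = `${JAEGER}/api/traces?service=${encodeURIComponent(service)}&lookback=1h&limit=100`;
  const res = await fetch(url);
  const body = (await res.json()) as { data?: JaegerTrace[] };
  const traces = body.data ?? [];

  // Approximate each trace's duration by its longest span, in milliseconds.
  const durationsMs = traces
    .filter((t) => t.spans.length > 0)
    .map((t) => Math.max(...t.spans.map((s) => s.duration)) / 1000);

  const errorTraces = traces.filter((t) =>
    t.spans.some((s) => s.tags.some((tag) => tag.key === "error" && (tag.value === true || tag.value === "true"))),
  );

  const avg = durationsMs.length ? durationsMs.reduce((a, b) => a + b, 0) / durationsMs.length : 0;
  const errorPct = (errorTraces.length / Math.max(traces.length, 1)) * 100;
  console.log(`${service}: ${traces.length} traces, avg ${avg.toFixed(1)}ms, ${errorPct.toFixed(1)}% with errors`);
}

traceStats("user-service");
```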
## Alerting

### Alert Configuration

#### Grafana Alerts

Create alert rules in Grafana:

```json
{
  "name": "High Error Rate",
  "condition": "WHEN avg() OF query(A, 5m) IS ABOVE 0.05",
  "frequency": "1m",
  "for": "5m",
  "notifications": [
    { "type": "email", "addresses": ["alerts@your-domain.com"] },
    { "type": "slack", "channel": "#alerts" }
  ]
}
```

#### Alert Channels

Configure notification channels:
Email:
{ "type": "email", "addresses": ["ops@your-domain.com"], "settings": { "singleEmail": true }}Slack:
{ "type": "slack", "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL", "settings": { "channel": "#alerts", "username": "Grafana Alert" }}Critical Alerts
Section titled “Critical Alerts”High Error Rate
```
ALERT: Error rate > 5% for 5 minutes
Service: user-service
Current Rate: 8.5%
Action: Check service logs and recent deployments
```

#### Queue Buildup

```
ALERT: Queue depth > 1000 for 5 minutes
Queue: service.order-service.commands.ProcessOrder
Depth: 1,543
Action: Scale order-service instances
```

#### Database Issues

```
ALERT: Database connection pool > 90% for 2 minutes
Service: user-service
Active: 18/20 connections
Action: Scale service instances or increase pool size
```

#### Slow Responses

```
ALERT: p95 response time > 1s for 10 minutes
Service: api-gateway
p95: 1.2s
Action: Check downstream services for bottlenecks
```

## Health Checks
### Service Health Endpoints

All services expose `/health`:
```bash
# Check API Gateway
curl http://localhost:3003/health

# Response
{
  "status": "healthy",
  "timestamp": "2025-11-15T10:30:00Z",
  "components": {
    "database": "healthy",
    "messageBus": "healthy",
    "cache": "healthy"
  },
  "version": "1.0.0",
  "uptime": 3600
}
```
### Component Health

Monitor individual components:
```bash
# PostgreSQL
docker exec flow-platform-postgres pg_isready

# RabbitMQ
curl -u admin:admin123 http://localhost:55672/api/healthchecks/node

# Redis
docker exec flow-platform-redis redis-cli ping

# Elasticsearch
curl http://localhost:9200/_cluster/health
```

### Automated Health Monitoring

Configure health check intervals:

```yaml
services:
  api-gateway:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3003/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

## Performance Monitoring
### Response Time Tracking

Track percentiles for all operations:
- p50 (median): Typical user experience
- p95: Most users’ experience
- p99: Tail latency; worst case excluding the slowest 1%
- max: Absolute worst case
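
For reference, this is how those percentiles fall out of a list of recorded durations (nearest-rank method; the sample values are made up):

```typescript
// Nearest-rank percentile over a list of response times (milliseconds).
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(rank, sorted.length) - 1];
}

const durationsMs = [42, 51, 48, 95, 60, 47, 300, 55, 49, 1200];
console.log({
  p50: percentile(durationsMs, 50),
  p95: percentile(durationsMs, 95),
  p99: percentile(durationsMs, 99),
  max: Math.max(...durationsMs),
});
```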
### Throughput Monitoring

Monitor requests per second:

```
// Service-level throughput
service_requests_per_second{service="user-service"} = 150

// Handler-level throughput
handler_executions_per_second{handler="CreateUserHandler"} = 50
```

### Resource Utilization

Track container resources:

```bash
# View real-time stats
docker stats

# Specific service
docker stats flow-platform-api-gateway --no-stream
```

## Log Monitoring
### Elasticsearch Queries

#### Error Logs

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "timestamp": { "order": "desc" } }]
}'
```

#### Service-Specific Logs

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "serviceName": "user-service" } },
        { "range": { "timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}'
```

### Log Aggregation

#### Error Trends

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "timestamp": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h"
      }
    }
  }
}'
```

## RabbitMQ Monitoring
### Management UI

Access at: http://localhost:55672
Key Metrics:
- Queue depths
- Message rates (publish/deliver)
- Consumer counts
- Connection/channel counts
### API Monitoring

```bash
# Queue statistics
curl -u admin:admin123 http://localhost:55672/api/queues | \
  jq '.[] | {name: .name, messages: .messages, consumers: .consumers}'

# Node health
curl -u admin:admin123 http://localhost:55672/api/nodes | \
  jq '.[] | {name: .name, mem_used: .mem_used, disk_free: .disk_free}'

# Exchange statistics
curl -u admin:admin123 http://localhost:55672/api/exchanges | \
  jq '.[] | {name: .name, type: .type}'
```

## Best Practices
Section titled “Best Practices”1. Set Baseline Metrics
Section titled “1. Set Baseline Metrics”Establish normal operating ranges:
Normal Response Time (p95): 100-150msNormal Error Rate: < 0.1%Normal Queue Depth: < 50Normal CPU Usage: 30-50%2. Alert on Trends
Section titled “2. Alert on Trends”# Alert if metric trending in wrong directionALERT: Error rate increased 50% in last hourPrevious: 0.2%Current: 0.3%Trend: Increasing3. Use Runbooks
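
A trend alert compares the current window against the previous one rather than a fixed threshold. An illustrative check matching the numbers above:

```typescript
// Flag a metric that has grown by at least the given fraction since the previous window.
// Matches the example above: 0.2% -> 0.3% is a 50% relative increase.
function trendingUp(previous: number, current: number, maxIncrease = 0.5): boolean {
  if (previous <= 0) return current > 0;
  return (current - previous) / previous >= maxIncrease;
}

console.log(trendingUp(0.002, 0.003)); // true: error rate rose 50% in the last hour
```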
### 3. Use Runbooks

Document response procedures:

```markdown
## High Queue Depth Alert

### Diagnosis
1. Check service instance count
2. Check handler errors in logs
3. Check downstream dependencies

### Resolution
1. Scale service instances: `docker compose up -d --scale service=10`
2. If errors, fix and redeploy
3. Monitor queue depth decrease
```

### 4. Monitor Business Metrics

Track business-relevant metrics:

```typescript
// Custom metrics
this.metrics.incrementCounter('orders.completed');
this.metrics.recordGauge('revenue.total', totalRevenue);
this.metrics.recordHistogram('order.value', orderValue);
```

### 5. Regular Reviews

- Daily: Check error rates and queue depths
- Weekly: Review performance trends
- Monthly: Capacity planning review
## Troubleshooting

### Missing Metrics

Cause: Service not reporting or Elasticsearch connection issue
Solution:
```bash
# Check Elasticsearch health
curl http://localhost:9200/_cluster/health

# Verify service telemetry configuration
echo $JAEGER_ENDPOINT
```

### Gaps in Traces

Cause: Service not instrumented or telemetry provider not initialized
Solution: Ensure the `TelemetryProvider` is initialized during service startup
### Alert Fatigue

Cause: Too many non-actionable alerts
Solution:
- Increase alert thresholds
- Reduce notification frequency
- Create separate channels for critical vs. warning alerts