Monitoring Guide

The banyan-core platform provides comprehensive monitoring through Grafana, Jaeger, and Elasticsearch. This guide covers monitoring setup, key metrics, and alerting strategies.

| Component | Purpose | URL |
| --- | --- | --- |
| Grafana | Dashboards and visualization | http://localhost:5005 |
| Jaeger | Distributed tracing | http://localhost:16686 |
| Elasticsearch | Metrics and logs storage | http://localhost:9200 |
| RabbitMQ Management | Message bus monitoring | http://localhost:55672 |

Access Grafana:

```
# URL
http://localhost:5005

# Default credentials (development)
Username: admin
Password: admin
```

Grafana automatically connects to:

  • Elasticsearch: Logs and metrics
  • Jaeger: Distributed traces

An example dashboard definition:

```json
{
  "dashboard": {
    "title": "Service Performance",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(requests_total[5m])",
            "datasource": "Elasticsearch"
          }
        ]
      },
      {
        "title": "Response Time (p95)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, response_time_seconds)",
            "datasource": "Elasticsearch"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(errors_total[5m])",
            "datasource": "Elasticsearch"
          }
        ]
      }
    ]
  }
}
```

Recommended dashboards:

  1. Service Overview

    • Request rate
    • Response time (p50, p95, p99)
    • Error rate
    • Active instances
  2. Message Bus Health

    • Queue depth by service
    • Message rate (in/out)
    • Consumer count
    • Dead letter queue depth
  3. Database Performance

    • Query count
    • Query duration
    • Connection pool usage
    • Slow queries
  4. Infrastructure Health

    • CPU usage by service
    • Memory usage by service
    • Disk I/O
    • Network throughput

Request metrics:

```
// Automatically collected
requests_total{service, operation, status}
request_duration_seconds{service, operation}
request_errors_total{service, operation}
```

Monitor:

  • High error rate (> 1%)
  • Slow requests (p95 > 200ms)
  • Request rate spikes
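
The 1% threshold above can be checked directly from the request and error counters. A minimal TypeScript sketch; the counter values are illustrative:

```typescript
// Derive an error rate from the counters above and apply the 1% threshold.
function errorRate(errorsTotal: number, requestsTotal: number): number {
  return requestsTotal === 0 ? 0 : errorsTotal / requestsTotal;
}

const rate = errorRate(120, 10_000); // 1.2%
if (rate > 0.01) {
  console.warn(`Error rate ${(rate * 100).toFixed(2)}% exceeds the 1% threshold`);
}
```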

Handler metrics:

```
// Automatically collected
handler_executions_total{service, handler, status}
handler_duration_seconds{service, handler}
handler_errors_total{service, handler}
```

Monitor:

  • Handler failures
  • Slow handlers
  • Handler throughput
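
Metrics like these are typically produced by a thin wrapper around each handler. A minimal sketch; the `MetricsRecorder` interface is illustrative and only mirrors the recording calls shown later in this guide, not the platform's actual API:

```typescript
// Illustrative recorder interface -- not the platform's actual metrics API.
interface MetricsRecorder {
  incrementCounter(name: string): void;
  recordHistogram(name: string, value: number): void;
}

// Wrap a handler so executions, errors, and duration are always recorded.
async function withHandlerMetrics<T>(
  metrics: MetricsRecorder,
  handlerName: string,
  handler: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await handler();
    metrics.incrementCounter(`handler.${handlerName}.success`);
    return result;
  } catch (err) {
    metrics.incrementCounter(`handler.${handlerName}.error`);
    throw err;
  } finally {
    metrics.recordHistogram(`handler.${handlerName}.duration_ms`, Date.now() - start);
  }
}
```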

Queue depth:

```bash
# Via RabbitMQ Management API
curl -u admin:admin123 http://localhost:55672/api/queues | jq '.[].messages_ready'
```

Monitor:

  • Queue depth > 100: Scale consumers
  • Message rate dropping: Check service health
  • Unacked messages > 50: Check handler performance
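
The same thresholds can be checked programmatically against the Management API. A minimal sketch, assuming the development credentials and port shown above and Node 18+ for the global `fetch`:

```typescript
// Poll the RabbitMQ Management API and flag queues over the depth thresholds above.
const RABBITMQ_API = "http://localhost:55672/api/queues";
const AUTH = "Basic " + Buffer.from("admin:admin123").toString("base64");
const QUEUE_DEPTH_THRESHOLD = 100;
const UNACKED_THRESHOLD = 50;

interface QueueStats {
  name: string;
  messages_ready: number;
  messages_unacknowledged: number;
  consumers: number;
}

async function checkQueueDepths(): Promise<void> {
  const res = await fetch(RABBITMQ_API, { headers: { Authorization: AUTH } });
  const queues = (await res.json()) as QueueStats[];

  for (const q of queues) {
    if (q.messages_ready > QUEUE_DEPTH_THRESHOLD) {
      console.warn(`${q.name}: ${q.messages_ready} ready, ${q.consumers} consumers -- scale consumers`);
    }
    if (q.messages_unacknowledged > UNACKED_THRESHOLD) {
      console.warn(`${q.name}: ${q.messages_unacknowledged} unacked -- check handler performance`);
    }
  }
}

checkQueueDepths().catch(console.error);
```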

Event flow:

```bash
# Event publish rate
curl -u admin:admin123 http://localhost:55672/api/exchanges/%2F/exchange.platform.events | jq '.message_stats.publish_in'
```

Monitor:

  • Event publish rate
  • Routing failures
  • Exchange errors

Connection pool metrics:

```
// Automatically collected
db_connections_active{service}
db_connections_idle{service}
db_connections_total{service}
```

Monitor:

  • Connection pool exhaustion
  • High idle connections
  • Connection errors

Query metrics:

```
// Automatically collected
db_query_duration_seconds{service, operation}
db_queries_total{service, operation}
db_slow_queries_total{service}
```

Monitor:

  • Slow queries (> 1s)
  • Query rate spikes
  • Query errors

Cache metrics:

```
// Automatically collected
cache_hits_total{service}
cache_misses_total{service}
cache_evictions_total{service}
cache_memory_bytes{service}
```

Monitor:

  • Hit rate < 90%: Increase TTL or cache size
  • High eviction rate: Increase memory
  • Memory usage trending up
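
The hit-rate threshold can be derived from the hit and miss counters. A minimal sketch with illustrative counter values:

```typescript
// Compute the cache hit rate from the counters above; warn below the 90% target.
function cacheHitRate(hits: number, misses: number): number {
  const total = hits + misses;
  return total === 0 ? 1 : hits / total;
}

const rate = cacheHitRate(9_200, 1_100); // ~89.3%
if (rate < 0.9) {
  console.warn(
    `Cache hit rate ${(rate * 100).toFixed(1)}% is below 90% -- consider a longer TTL or larger cache`
  );
}
```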

Access Jaeger at: http://localhost:16686

Find recent traces for a service:

```
Service: user-service
Lookback: Last 1 hour
Limit Results: 20
```

Find slow traces for an operation:

```
Service: user-service
Operation: CreateUserHandler
Tags: http.status_code=200
Min Duration: 100ms
```

Find error traces across all services:

```
Service: *
Tags: error=true
Lookback: Last 24 hours
```
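
These searches can also be automated against Jaeger's query endpoint. A minimal sketch; `/api/traces` is the HTTP API used by the Jaeger UI rather than a stable public contract, so treat the parameter names as an assumption to verify against your Jaeger version:

```typescript
// Fetch recent traces for a service from the Jaeger query endpoint.
// The query parameters mirror the UI search fields above.
async function findTraces(service: string): Promise<void> {
  const params = new URLSearchParams({
    service,
    lookback: "1h",
    limit: "20",
  });
  const res = await fetch(`http://localhost:16686/api/traces?${params}`);
  if (!res.ok) throw new Error(`Jaeger query failed: HTTP ${res.status}`);

  const body = await res.json();
  console.log(`Found ${body.data.length} traces for ${service}`);
}

findTraces("user-service").catch(console.error);
```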

To analyze a slow trace:

  1. Sort by duration (descending)
  2. Identify the slowest span in the timeline
  3. Check span tags for context
  4. Optimize the identified bottleneck

To analyze an error trace:

  1. Filter by the error tag
  2. View error details in span logs
  3. Check the correlation ID for related logs
  4. Fix the root cause

Monitor trace statistics:

  • Trace count by service
  • Average trace duration
  • Error trace percentage
  • Service dependencies

Create alert rules in Grafana:

```json
{
  "name": "High Error Rate",
  "condition": "WHEN avg() OF query(A, 5m) IS ABOVE 0.05",
  "frequency": "1m",
  "for": "5m",
  "notifications": [
    {
      "type": "email",
      "addresses": ["alerts@your-domain.com"]
    },
    {
      "type": "slack",
      "channel": "#alerts"
    }
  ]
}
```

Configure notification channels:

Email:

```json
{
  "type": "email",
  "addresses": ["ops@your-domain.com"],
  "settings": {
    "singleEmail": true
  }
}
```

Slack:

```json
{
  "type": "slack",
  "url": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL",
  "settings": {
    "channel": "#alerts",
    "username": "Grafana Alert"
  }
}
```

Example alerts:

```
ALERT: Error rate > 5% for 5 minutes
Service: user-service
Current Rate: 8.5%
Action: Check service logs and recent deployments
```

```
ALERT: Queue depth > 1000 for 5 minutes
Queue: service.order-service.commands.ProcessOrder
Depth: 1,543
Action: Scale order-service instances
```

```
ALERT: Database connection pool > 90% for 2 minutes
Service: user-service
Active: 18/20 connections
Action: Scale service instances or increase pool size
```

```
ALERT: p95 response time > 1s for 10 minutes
Service: api-gateway
p95: 1.2s
Action: Check downstream services for bottlenecks
```

All services expose /health:

```bash
# Check API Gateway
curl http://localhost:3003/health

# Response
{
  "status": "healthy",
  "timestamp": "2025-11-15T10:30:00Z",
  "components": {
    "database": "healthy",
    "messageBus": "healthy",
    "cache": "healthy"
  },
  "version": "1.0.0",
  "uptime": 3600
}
```
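
A health endpoint like this can be polled from a script as well as from Docker. A minimal sketch that fails on any unhealthy component, assuming the response shape shown above:

```typescript
// Probe a service's /health endpoint and throw if any component is unhealthy.
interface HealthResponse {
  status: string;
  components: Record<string, string>;
  version: string;
  uptime: number;
}

async function probeHealth(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/health`);
  if (!res.ok) throw new Error(`Health check failed: HTTP ${res.status}`);

  const health = (await res.json()) as HealthResponse;
  const unhealthy = Object.entries(health.components)
    .filter(([, state]) => state !== "healthy")
    .map(([name]) => name);

  if (health.status !== "healthy" || unhealthy.length > 0) {
    throw new Error(`Unhealthy components: ${unhealthy.join(", ") || health.status}`);
  }
  console.log(`${baseUrl} healthy (v${health.version}, up ${health.uptime}s)`);
}

probeHealth("http://localhost:3003").catch(console.error);
```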

Monitor individual components:

```bash
# PostgreSQL
docker exec flow-platform-postgres pg_isready

# RabbitMQ
curl -u admin:admin123 http://localhost:55672/api/healthchecks/node

# Redis
docker exec flow-platform-redis redis-cli ping

# Elasticsearch
curl http://localhost:9200/_cluster/health
```

Configure health check intervals:

```yaml
services:
  api-gateway:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3003/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
```

Track percentiles for all operations:

  • p50 (median): Typical user experience
  • p95: Upper bound for 95% of requests
  • p99: Near worst case, excluding extreme outliers
  • max: Absolute worst case
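
For reference, percentiles are order statistics over the collected durations. A minimal sketch over raw samples (in production these usually come from histogram metrics rather than raw arrays):

```typescript
// Compute a percentile from raw duration samples using the nearest-rank method.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

const durationsMs = [42, 47, 48, 49, 51, 55, 60, 95, 120, 300];
console.log({
  p50: percentile(durationsMs, 50), // 51
  p95: percentile(durationsMs, 95), // 300
  p99: percentile(durationsMs, 99), // 300
  max: Math.max(...durationsMs),    // 300
});
```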

Monitor requests per second:

```
// Service-level throughput
service_requests_per_second{service="user-service"} = 150

// Handler-level throughput
handler_executions_per_second{handler="CreateUserHandler"} = 50
```
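
Per-second rates like these are derived from counter deltas over a sampling interval. A minimal sketch with illustrative readings:

```typescript
// Derive requests-per-second from two cumulative counter readings.
function ratePerSecond(previousTotal: number, currentTotal: number, intervalSeconds: number): number {
  return (currentTotal - previousTotal) / intervalSeconds;
}

// e.g. requests_total read 60 seconds apart
console.log(ratePerSecond(10_000, 19_000, 60)); // 150 req/s
```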

Track container resources:

```bash
# View real-time stats
docker stats

# Specific service
docker stats flow-platform-api-gateway --no-stream
```

Search recent errors:

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "timestamp": { "order": "desc" } }]
}
'
```

Logs for a specific service:

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "serviceName": "user-service" } },
        { "range": { "timestamp": { "gte": "now-24h" } } }
      ]
    }
  }
}
'
```

Error counts over time:

```bash
curl "http://localhost:9200/logs-*/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "timestamp": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "1h"
      }
    }
  }
}
'
```
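
The same aggregation can be run from Node when you want to post-process the buckets. A minimal sketch mirroring the query above (Node 18+ for the global `fetch`):

```typescript
// Run the error-trend aggregation above and print hourly error counts.
async function errorTrend(): Promise<void> {
  const res = await fetch("http://localhost:9200/logs-*/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      size: 0,
      query: {
        bool: {
          must: [
            { match: { level: "error" } },
            { range: { timestamp: { gte: "now-7d" } } },
          ],
        },
      },
      aggs: {
        errors_over_time: {
          date_histogram: { field: "timestamp", calendar_interval: "1h" },
        },
      },
    }),
  });

  const body = await res.json();
  for (const bucket of body.aggregations.errors_over_time.buckets) {
    console.log(`${bucket.key_as_string}: ${bucket.doc_count} errors`);
  }
}

errorTrend().catch(console.error);
```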

Access the RabbitMQ Management UI at: http://localhost:55672

Key Metrics:

  • Queue depths
  • Message rates (publish/deliver)
  • Consumer counts
  • Connection/channel counts

```bash
# Queue statistics
curl -u admin:admin123 http://localhost:55672/api/queues | \
  jq '.[] | {name: .name, messages: .messages, consumers: .consumers}'

# Node health
curl -u admin:admin123 http://localhost:55672/api/nodes | \
  jq '.[] | {name: .name, mem_used: .mem_used, disk_free: .disk_free}'

# Exchange statistics
curl -u admin:admin123 http://localhost:55672/api/exchanges | \
  jq '.[] | {name: .name, type: .type}'
```

Establish normal operating ranges:

```
Normal Response Time (p95): 100-150ms
Normal Error Rate: < 0.1%
Normal Queue Depth: < 50
Normal CPU Usage: 30-50%
```

Alert on trends as well as absolute thresholds:

```
# Alert if metric trending in wrong direction
ALERT: Error rate increased 50% in last hour
Previous: 0.2%
Current: 0.3%
Trend: Increasing
```
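
The relative-increase check behind such an alert is straightforward. A minimal sketch matching the numbers above:

```typescript
// Flag a roughly 50% or greater relative increase in error rate between two windows.
function trendAlert(previousRate: number, currentRate: number): string | null {
  if (previousRate <= 0) return null; // no baseline to compare against
  const increasePct = Math.round(((currentRate - previousRate) / previousRate) * 100);
  return increasePct >= 50
    ? `Error rate increased ${increasePct}% (${previousRate}% -> ${currentRate}%)`
    : null;
}

console.log(trendAlert(0.2, 0.3)); // "Error rate increased 50% (0.2% -> 0.3%)"
```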

Document response procedures:

```markdown
## High Queue Depth Alert

### Diagnosis

1. Check service instance count
2. Check handler errors in logs
3. Check downstream dependencies

### Resolution

1. Scale service instances: `docker compose up -d --scale service=10`
2. If errors, fix and redeploy
3. Monitor queue depth decrease
```

Track business-relevant metrics:

```typescript
// Custom metrics
this.metrics.incrementCounter('orders.completed');
this.metrics.recordGauge('revenue.total', totalRevenue);
this.metrics.recordHistogram('order.value', orderValue);
```

Review monitoring data on a regular cadence:

  • Daily: Check error rates and queue depths
  • Weekly: Review performance trends
  • Monthly: Capacity planning review

No data in Grafana dashboards

Cause: Service not reporting metrics, or an Elasticsearch connection issue

Solution:

```bash
# Check Elasticsearch health
curl http://localhost:9200/_cluster/health

# Verify service telemetry configuration
echo $JAEGER_ENDPOINT
```

Missing traces in Jaeger

Cause: Service not instrumented, or telemetry provider not initialized

Solution: Ensure the TelemetryProvider is initialized during service startup

Alert fatigue

Cause: Too many non-actionable alerts

Solution:

  • Increase alert thresholds
  • Reduce notification frequency
  • Create separate channels for critical vs. warning alerts