HeliosDB Monitoring and Alerting Guide

Version: 7.0.0
Environment: Staging
Last Updated: 2025-11-17


Table of Contents

  1. Overview
  2. Monitoring Architecture
  3. Key Metrics
  4. Grafana Dashboards
  5. Alert Rules
  6. Alert Severity Levels
  7. Common Monitoring Scenarios
  8. Troubleshooting Metrics
  9. Best Practices
  10. Next Steps

Overview

HeliosDB Phase 1 includes comprehensive monitoring for all three production features:

  • Conversational BI: Query latency, LLM API performance, cache hit rates, NL2SQL accuracy
  • Auto-Compliance: Compliance violations, audit log health, check latency, storage usage
  • Embedded+Cloud: Sync performance, conflict rates, offline mode, WebSocket connections

Monitoring Stack

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization and dashboards
  • Service Metrics: Built-in Prometheus exporters in each service

Monitoring Architecture

┌─────────────────────────────────────────────────────┐
│                  Feature Services                   │
│ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │Conversational│ │Auto-Compliance│ │Embedded+Cloud│ │
│ │BI :9091      │ │ :9092         │ │ :9093        │ │
│ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
└────────┼─────────────────┼────────────────┼─────────┘
         │                 │                │
         │ Metrics         │                │
         │ (Prometheus     │                │
         │ format)         │                │
         ▼                 ▼                ▼
     ┌────────────────────────────────────────────┐
     │              Prometheus :9090              │
     │           (Scrape, Store, Alert)           │
     └────────────────┬───────────────────────────┘
                      │ Query
     ┌────────────────────────────────────────────┐
     │                Grafana :3000               │
     │       (Visualize, Dashboard, Notify)       │
     └────────────────────────────────────────────┘
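
The diagram implies that each service exposes a Prometheus-format endpoint on its own port. A minimal sketch for probing all three exporters in one pass is shown here. It assumes the services are reachable on localhost (as in the curl commands later in this guide) and uses the docker compose service names that appear in the scenarios below; the function names are illustrative and not part of HeliosDB.

```shell
#!/bin/sh
# Map a service name to its exporter port (ports taken from the diagram).
exporter_port() {
  case "$1" in
    conversational-bi)   echo 9091 ;;
    compliance)          echo 9092 ;;
    embedded-cloud-sync) echo 9093 ;;
    *) echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Build the metrics URL for a service (localhost is an assumption).
exporter_url() {
  port=$(exporter_port "$1") || return 1
  echo "http://localhost:${port}/metrics"
}

# Probe every exporter; -f makes curl fail on HTTP error status codes.
check_exporters() {
  for svc in conversational-bi compliance embedded-cloud-sync; do
    if curl -sf -o /dev/null --max-time 5 "$(exporter_url "$svc")"; then
      echo "$svc: up"
    else
      echo "$svc: DOWN"
    fi
  done
}

# Usage:
#   check_exporters
```

Running check_exporters before opening Grafana gives a quick answer to whether a "No data" panel is a scrape problem or a service problem.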

Key Metrics

Conversational BI Metrics

Query Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_query_duration_seconds | NL query to SQL generation latency | Histogram | p95 > 10s |
| heliosdb_conversational_bi_requests_total | Total query requests | Counter | - |
| heliosdb_conversational_bi_llm_errors_total | LLM API errors | Counter | rate > 0.1/s |
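
The p95 thresholds in these tables are evaluated over histogram buckets rather than read from a single series. The sketch below shows one way to build and run such a query. It assumes the exporters follow the standard Prometheus histogram convention (a `_bucket` series with an `le` label) and that Prometheus listens on localhost:9090; `quantile_query`, `prom_query`, and the `PROM_URL` variable are hypothetical helper names.

```shell
#!/bin/sh
# Build a PromQL quantile expression for a histogram metric.
# $1 = quantile (e.g. 0.95), $2 = histogram base name, $3 = rate window (default 5m)
quantile_query() {
  printf 'histogram_quantile(%s, sum by (le) (rate(%s_bucket[%s])))\n' \
    "$1" "$2" "${3:-5m}"
}

# Run an instant query against the Prometheus HTTP API.
# -G with --data-urlencode keeps the PromQL expression safely URL-encoded.
prom_query() {
  curl -s -G "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   prom_query "$(quantile_query 0.95 heliosdb_conversational_bi_query_duration_seconds)"
```

The same helper works for any Histogram-type metric in this guide by swapping the metric name and quantile.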

LLM Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_llm_tokens_used_total | Total tokens consumed | Counter | rate > 1M/hour |
| heliosdb_conversational_bi_llm_latency_seconds | LLM API call latency | Histogram | p95 > 5s |
| heliosdb_conversational_bi_llm_cost_estimate | Estimated cost in USD | Gauge | - |

Cache Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_cache_hits_total | Cache hits | Counter | - |
| heliosdb_conversational_bi_cache_misses_total | Cache misses | Counter | - |
| heliosdb_conversational_bi_cache_size_bytes | Cache memory usage | Gauge | > 1GB |
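
The hit and miss counters carry no threshold of their own; the cache hit rate used on the dashboard and in the ConversationalBILowCacheHitRate alert is derived from them as hits / (hits + misses). A minimal sketch of that arithmetic, both as a PromQL expression and on raw numbers, follows. The helper names and the 10m default window are assumptions for illustration.

```shell
#!/bin/sh
# PromQL for the hit rate over a window: hits / (hits + misses).
# $1 = rate window (default 10m)
cache_hit_rate_query() {
  w="${1:-10m}"
  h="rate(heliosdb_conversational_bi_cache_hits_total[$w])"
  m="rate(heliosdb_conversational_bi_cache_misses_total[$w])"
  echo "sum($h) / (sum($h) + sum($m))"
}

# The same ratio computed from two raw counter deltas.
# $1 = hits, $2 = misses; prints "n/a" when there was no traffic.
hit_rate() {
  awk -v h="$1" -v m="$2" 'BEGIN {
    if (h + m == 0) print "n/a"
    else printf "%.2f\n", h / (h + m)
  }'
}

# Usage:
#   hit_rate 30 70    # 30 hits and 70 misses, compare the result with 0.30
```

Guarding the zero-traffic case matters because a ratio of 0/0 would otherwise show up as a misleading value during quiet periods.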

Accuracy Metrics

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_nl2sql_accuracy | SQL generation accuracy | Gauge | < 0.7 |
| heliosdb_conversational_bi_sql_validation_errors_total | Invalid SQL generated | Counter | rate > 0.05/s |

Auto-Compliance Metrics

Compliance Violations

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_violations_total | Compliance violations by framework | Counter | any > 0 |
| heliosdb_compliance_checks_total | Compliance checks performed | Counter | - |
| heliosdb_compliance_check_duration_seconds | Compliance check latency | Histogram | p95 > 5s |

Audit Log Health

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_audit_log_write_duration_seconds | Audit log write latency | Histogram | p95 > 1s |
| heliosdb_compliance_audit_log_write_errors_total | Audit log write failures | Counter | any > 0 |
| heliosdb_compliance_audit_log_storage_bytes | Audit log storage size | Gauge | > 50GB |

Report Generation

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_reports_generated_total | Reports generated | Counter | - |
| heliosdb_compliance_report_generation_failures_total | Report generation failures | Counter | any > 0 |
| heliosdb_compliance_report_generation_duration_seconds | Report generation time | Histogram | p95 > 30s |

Embedded+Cloud Metrics

Sync Performance

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_sync_duration_seconds | Sync operation latency | Histogram | p95 > 30s |
| heliosdb_embedded_cloud_sync_success_total | Successful syncs | Counter | - |
| heliosdb_embedded_cloud_sync_failures_total | Failed syncs | Counter | rate > 0.01/s |

Conflict Resolution

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_conflicts_total | Data conflicts detected | Counter | rate > 0.05/s |
| heliosdb_embedded_cloud_conflicts_resolved_total | Conflicts resolved | Counter | - |
| heliosdb_embedded_cloud_conflict_resolution_failures_total | Conflicts unresolved | Counter | any > 0 |

WebSocket Connections

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_active_connections | Active WebSocket connections | Gauge | - |
| heliosdb_embedded_cloud_websocket_disconnects_total | WebSocket disconnections | Counter | rate > 10/s |
| heliosdb_embedded_cloud_websocket_message_errors_total | WebSocket message errors | Counter | rate > 0.1/s |

Offline Mode

| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_offline_mode_active | Devices in offline mode | Gauge | - |
| heliosdb_embedded_cloud_offline_cache_usage_bytes | Offline cache size | Gauge | > 80% of limit |
| heliosdb_embedded_cloud_offline_cache_evictions_total | Cache evictions | Counter | rate > 1/s |

Grafana Dashboards

Accessing Dashboards

  1. Open Grafana: http://localhost:3000
  2. Login with credentials from .env:
    • Username: admin
    • Password: <GRAFANA_ADMIN_PASSWORD>
  3. Navigate to Dashboards > Browse

Available Dashboards

1. Conversational BI Dashboard

Location: Dashboards > Conversational BI - Production Metrics

Panels:

  • Service Status (UP/DOWN indicator)
  • Requests Per Minute
  • NL2SQL Accuracy
  • Cache Hit Rate
  • LLM Error Rate
  • Query Latency Percentiles (p50, p95, p99)
  • Request Rate & Error Rate
  • LLM Token Usage by Provider
  • Cache Hit vs Miss Distribution
  • Memory Usage
  • Rate Limited Requests by Client
  • Query Type Distribution

Key Insights:

  • Are queries being served successfully?
  • Is the LLM API responding?
  • Is the cache effective?
  • Are we staying within rate limits?

2. Auto-Compliance Dashboard

Location: Dashboards > Auto-Compliance - Production Metrics

Panels:

  • Service Status
  • Total Violations (24h)
  • Compliance Check Latency (p95)
  • Audit Log Storage
  • Audit Log Write Errors
  • Reports Generated (24h)
  • Compliance Violations by Framework
  • Audit Log Write Performance
  • Compliance Checks by Framework
  • Violation Types Distribution
  • Audit Log Retention Compliance
  • Alert Delivery Success Rate
  • Recent Compliance Violations (Top 10)
  • Report Generation Success & Failures
  • Memory Usage
  • Audit Log Compression Ratio

Key Insights:

  • Are there any compliance violations?
  • Is the audit log healthy?
  • Are reports being generated successfully?
  • Is retention policy being met?

3. Embedded+Cloud Dashboard

Location: Dashboards > Embedded+Cloud Unified - Production Metrics

Panels:

  • Service Status
  • Active Connections
  • Sync Success Rate
  • Sync Latency (p95)
  • Conflicts (1h)
  • Offline Mode Devices
  • Sync Operations (Success vs Failures)
  • Sync Latency Percentiles
  • WebSocket Connections
  • Data Transfer Rate
  • Conflict Resolution Strategy Usage
  • Device Count by User
  • Offline Cache Usage
  • Cloud Storage Operation Latency
  • Top Error Types
  • Sync Queue Length
  • Device Auth Failures
  • Memory Usage
  • Cloud Storage Errors

Key Insights:

  • Are syncs completing successfully?
  • How many conflicts are occurring?
  • Is offline mode working?
  • Are WebSocket connections stable?

Alert Rules

Alert Configuration

Alert rules are defined in:

  • /home/claude/HeliosDB/deployment/staging/monitoring/prometheus/alerts/

Conversational BI Alerts

Critical Alerts:

  1. ConversationalBIServiceDown: Service unavailable for > 2 minutes
  2. ConversationalBICriticalLatency: p99 latency > 30s for > 3 minutes
  3. ConversationalBILLMAPIUnavailable: LLM API error rate > 0.5 errors/sec

Warning Alerts:

  1. ConversationalBIHighLatency: p95 latency > 10s for > 5 minutes
  2. ConversationalBIHighLLMErrorRate: LLM error rate > 0.1 errors/sec
  3. ConversationalBILowAccuracy: NL2SQL accuracy < 70% for > 10 minutes
  4. ConversationalBILowCacheHitRate: Cache hit rate < 30% for > 10 minutes
  5. ConversationalBIRateLimitExceeded: > 10 requests/sec being rate-limited
  6. ConversationalBIHighMemoryUsage: Memory usage > 3.5GB for > 5 minutes

Info Alerts:

  1. ConversationalBINoTraffic: No requests for > 10 minutes
  2. ConversationalBIHighLLMCost: Token usage > 1M tokens/hour
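
To make the list above concrete, here is a sketch of how one of these alerts could be expressed in the Prometheus rule-file format. The actual rule files in the alerts directory are not reproduced in this guide, so the group name, the 5m rate window, the severity label, and the annotation text are all assumptions; only the alert name, the p95 > 10s threshold, and the 5-minute duration come from the list.

```shell
#!/bin/sh
# Print an example rule group for the ConversationalBIHighLatency warning.
# Everything except the alert name, threshold, and "for" duration is illustrative.
render_high_latency_rule() {
  cat <<'EOF'
groups:
  - name: conversational_bi_example
    rules:
      - alert: ConversationalBIHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(heliosdb_conversational_bi_query_duration_seconds_bucket[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Conversational BI p95 query latency above 10s"
EOF
}

# Usage (promtool ships with Prometheus):
#   render_high_latency_rule > /tmp/example.rules.yml
#   promtool check rules /tmp/example.rules.yml
```

The `for: 5m` clause is what turns "p95 > 10s" into "p95 > 10s for > 5 minutes": the alert stays pending until the expression has held for the full duration.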

Auto-Compliance Alerts

Critical Alerts:

  1. ComplianceServiceDown: Service unavailable for > 2 minutes
  2. ComplianceViolationDetected: Any compliance violation detected
  3. ComplianceAuditLogWriteFailure: Audit log write errors detected
  4. ComplianceGDPRViolation: GDPR violation detected
  5. ComplianceHIPAAViolation: HIPAA violation detected
  6. CompliancePCIDSSViolation: PCI-DSS violation detected
  7. ComplianceAuditLogStorageCritical: Audit log storage > 80GB

Warning Alerts:

  1. ComplianceAuditLogHighLatency: p95 write latency > 1s
  2. ComplianceCheckHighLatency: p95 check latency > 5s
  3. ComplianceCheckFailures: Check failure rate > 0.05 failures/sec
  4. ComplianceReportGenerationFailed: Report generation failed
  5. ComplianceAuditLogStorageHigh: Audit log storage > 50GB
  6. ComplianceAlertDeliveryFailure: Alert delivery failing
  7. ComplianceRetentionPolicyViolation: Logs older than retention policy

Embedded+Cloud Alerts

Critical Alerts:

  1. EmbeddedCloudSyncServiceDown: Service unavailable for > 2 minutes
  2. EmbeddedCloudCriticalSyncLatency: p99 sync latency > 120s
  3. EmbeddedCloudHighSyncFailureRate: Sync failure rate > 0.1 failures/sec
  4. EmbeddedCloudStorageUnavailable: Cloud storage error rate > 0.1 errors/sec
  5. EmbeddedCloudCriticalSyncQueueBacklog: Sync queue > 10,000 items

Warning Alerts:

  1. EmbeddedCloudHighSyncLatency: p95 sync latency > 30s
  2. EmbeddedCloudSyncFailures: Sync failure rate > 0.01 failures/sec
  3. EmbeddedCloudConflictResolutionFailures: Unable to resolve conflicts
  4. EmbeddedCloudStorageHighLatency: p95 storage operation latency > 5s
  5. EmbeddedCloudOfflineCacheNearFull: Offline cache > 80% capacity
  6. EmbeddedCloudDeviceAuthFailures: Device auth failure rate > 0.05 failures/sec
  7. EmbeddedCloudHighWebSocketDisconnects: Disconnect rate > 10/s
  8. EmbeddedCloudSyncQueueBacklog: Sync queue > 1,000 items

Info Alerts:

  1. EmbeddedCloudHighConflictRate: Conflict rate > 0.05 conflicts/sec
  2. EmbeddedCloudOfflineModeActivated: Devices operating offline for > 10 minutes
  3. EmbeddedCloudDeviceLimitExceeded: Users hitting device limits
  4. EmbeddedCloudNoActiveConnections: No WebSocket connections for > 15 minutes
  5. EmbeddedCloudHighDataTransferRate: Data transfer > 100 MB/sec
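
Several metrics appear at two levels with different thresholds. For sync failures, the warning fires above 0.01 failures/sec and the critical alert above 0.1 failures/sec. The sketch below classifies an observed rate against those two thresholds; the function name is hypothetical and the thresholds are copied from the lists above.

```shell
#!/bin/sh
# Classify a sync failure rate (failures/sec) against the documented thresholds:
#   > 0.1  -> critical (EmbeddedCloudHighSyncFailureRate)
#   > 0.01 -> warning  (EmbeddedCloudSyncFailures)
sync_failure_severity() {
  awk -v r="$1" 'BEGIN {
    if (r > 0.1)       print "critical"
    else if (r > 0.01) print "warning"
    else               print "ok"
  }'
}

# Usage:
#   sync_failure_severity 0.05
```

Both comparisons are strict, matching the "greater than" wording of the alert definitions, so a rate of exactly 0.01 does not trigger the warning.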

Alert Severity Levels

Critical (P1)

Definition: Service is down, data loss risk, or compliance violation

Response Time: Immediate (< 15 minutes)

Actions:

  1. Page on-call engineer
  2. Begin incident response
  3. Check recent deployments
  4. Review logs immediately

Examples:

  • Service completely down
  • Database unavailable
  • Compliance violations detected
  • Audit log write failures

Warning (P2)

Definition: Service degraded, potential issue developing

Response Time: Within 1 hour

Actions:

  1. Notify team via Slack/email
  2. Investigate root cause
  3. Monitor for escalation
  4. Plan remediation

Examples:

  • High latency
  • Elevated error rates
  • Resource approaching limits
  • Report generation failures

Info (P3)

Definition: Notable event, no immediate action required

Response Time: Next business day

Actions:

  1. Log for investigation
  2. Review during normal hours
  3. Update runbooks if needed

Examples:

  • No traffic (off-hours)
  • High token usage
  • Offline mode activated
  • Informational events
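
The three levels above can be encoded as a small lookup for use in notification or triage scripts. This is a sketch only: the lowercase severity keys (critical, warning, info) and the pipe-separated output format are assumptions, while the priorities, response times, and first actions are taken from the definitions above.

```shell
#!/bin/sh
# Print "priority|response time|first action" for a severity level.
severity_policy() {
  case "$1" in
    critical) echo "P1|15 minutes|page on-call engineer" ;;
    warning)  echo "P2|1 hour|notify team via Slack/email" ;;
    info)     echo "P3|next business day|log for investigation" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   severity_policy warning | cut -d'|' -f2    # response time for a warning
```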

Common Monitoring Scenarios

Scenario 1: High Query Latency

Symptoms:

  • Dashboard shows p95 > 10s
  • Alert: ConversationalBIHighLatency

Investigation:

  1. Check Grafana: Conversational BI > Query Latency panel
  2. Check LLM API latency: Is the LLM slow?
  3. Check cache hit rate: Is cache effective?
  4. Check database latency: Is PostgreSQL slow?

Resolution:

# Check service logs
docker compose -f deployment/staging/docker-compose.yml logs conversational-bi | grep -i latency
# Check LLM API latency metrics
curl -s http://localhost:9091/metrics | grep llm_latency
# Restart service if needed
docker compose -f deployment/staging/docker-compose.yml restart conversational-bi

Scenario 2: Compliance Violation Detected

Symptoms:

  • Dashboard shows violation count > 0
  • Alert: ComplianceViolationDetected

Investigation:

  1. Check Grafana: Compliance > Violations by Framework panel
  2. Identify which framework (GDPR, HIPAA, etc.)
  3. Check audit logs for violation details

Resolution:

# Check compliance logs
docker compose -f deployment/staging/docker-compose.yml logs compliance | grep -i violation
# Access compliance dashboard (open is macOS; use xdg-open on Linux)
open http://localhost:8090
# Review violation details
curl -s http://localhost:8082/api/v1/compliance/violations

Scenario 3: Sync Failures

Symptoms:

  • Dashboard shows high sync failure rate
  • Alert: EmbeddedCloudSyncFailures

Investigation:

  1. Check Grafana: Embedded+Cloud > Sync Operations panel
  2. Check cloud storage connectivity
  3. Check for conflicts or errors

Resolution:

# Check sync service logs
docker compose -f deployment/staging/docker-compose.yml logs embedded-cloud-sync | grep -i sync
# Check cloud storage (S3) error metrics
curl -s http://localhost:9093/metrics | grep storage_errors
# Check sync queue
curl -s http://localhost:8083/api/v1/sync/queue/status

Troubleshooting Metrics

Metrics Not Appearing

Issue: Grafana shows “No data”

Diagnosis:

# 1. Check Prometheus targets (lists only targets that are not healthy)
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# 2. Check service metrics endpoints
curl -s http://localhost:9091/metrics # Should return Prometheus metrics
curl -s http://localhost:9092/metrics
curl -s http://localhost:9093/metrics
# 3. Check Prometheus logs
docker compose -f deployment/staging/docker-compose.yml logs prometheus

Resolution:

# Restart Prometheus
docker compose -f deployment/staging/docker-compose.yml restart prometheus
# Verify scrape config
docker compose -f deployment/staging/docker-compose.yml exec prometheus \
cat /etc/prometheus/prometheus.yml
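
When an endpoint responds but a specific panel is still empty, it helps to confirm that the expected metric family is actually being exported. The sketch below filters the Prometheus text exposition format for one metric name. It is a simplification that assumes label values contain no spaces and that samples carry no trailing timestamp; `metric_samples` is a hypothetical helper name.

```shell
#!/bin/sh
# Print "series value" for every sample of one metric read from stdin.
# $1 = exact metric name (without labels)
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    NF >= 2 {
      series = $1
      i = index(series, "{")           # position of the label set, if any
      base = (i ? substr(series, 1, i - 1) : series)
      if (base == name) print series, $2
    }'
}

# Usage:
#   curl -s http://localhost:9092/metrics | metric_samples heliosdb_compliance_violations_total
```

An empty result for a metric that a dashboard depends on points at the service rather than at Prometheus or Grafana.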

Alert Not Firing

Issue: Expected alert doesn’t trigger

Diagnosis:

# Check Prometheus rules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name == "YourAlertName")'
# Check if metric exists (quote the query so the shell does not interpret ? < >)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=<metric_name>'

Resolution:

# Reload Prometheus config (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Check alert state
curl -s http://localhost:9090/api/v1/alerts

Best Practices

  1. Check Dashboards Daily: Review key metrics every morning
  2. Investigate Warnings: Don’t ignore warning-level alerts
  3. Baseline Metrics: Understand normal operating ranges
  4. Document Incidents: Keep runbooks updated
  5. Regular Reviews: Weekly review of alert effectiveness
  6. Tune Thresholds: Adjust based on observed behavior
  7. Avoid Alert Fatigue: Retune or remove noisy alerts

Next Steps

  • Configure Alertmanager (optional): Set up email/Slack notifications
  • Create Custom Dashboards: Add business-specific metrics
  • Set Up SLOs: Define Service Level Objectives
  • Enable Tracing: Add distributed tracing for debugging
  • Log Aggregation: Integrate with ELK or Loki

Monitoring is operational! Your HeliosDB Phase 1 staging environment is fully observable.