HeliosDB Monitoring and Alerting Guide
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents
- Overview
- Monitoring Architecture
- Key Metrics
- Grafana Dashboards
- Alert Rules
- Alert Severity Levels
- Common Monitoring Scenarios
- Troubleshooting Metrics
Overview
HeliosDB Phase 1 includes comprehensive monitoring for all three production features:
- Conversational BI: Query latency, LLM API performance, cache hit rates, NL2SQL accuracy
- Auto-Compliance: Compliance violations, audit log health, check latency, storage usage
- Embedded+Cloud: Sync performance, conflict rates, offline mode, WebSocket connections
Monitoring Stack
- Prometheus: Metrics collection and storage
- Grafana: Visualization and dashboards
- Service Metrics: Built-in Prometheus exporters in each service
Monitoring Architecture
┌─────────────────────────────────────────────────────┐
│                  Feature Services                   │
│ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │Conversational│ │Auto-Compliance│ │Embedded+Cloud│ │
│ │BI :9091      │ │ :9092         │ │ :9093        │ │
│ └──────┬───────┘ └───────┬───────┘ └──────┬───────┘ │
└────────┼─────────────────┼────────────────┼─────────┘
         │                 │                │
         │ Metrics         │                │
         │ (Prometheus     │                │
         │  format)        │                │
         ▼                 ▼                ▼
    ┌────────────────────────────────────────────┐
    │              Prometheus :9090              │
    │           (Scrape, Store, Alert)           │
    └────────────────┬───────────────────────────┘
                     │
                     │ Query
                     ▼
    ┌────────────────────────────────────────────┐
    │               Grafana :3000                │
    │       (Visualize, Dashboard, Notify)       │
    └────────────────────────────────────────────┘
Key Metrics
Conversational BI Metrics
Query Performance
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_query_duration_seconds | NL query to SQL generation latency | Histogram | p95 > 10s |
| heliosdb_conversational_bi_requests_total | Total query requests | Counter | - |
| heliosdb_conversational_bi_llm_errors_total | LLM API errors | Counter | rate > 0.1/s |
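The p95 threshold above is evaluated over a histogram, not over raw samples. Prometheus typically estimates such percentiles with `histogram_quantile` over the cumulative `_bucket` series; the sketch below mirrors that linear-interpolation idea in Python for illustration only. The function name and the bucket boundaries and counts are hypothetical sample values, not taken from a running HeliosDB service.

```python
import math

def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    ending with math.inf (the equivalent of the le="+Inf" bucket).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, no meaningful quantile
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one):
                # report the last known finite boundary.
                return prev_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical latency buckets in seconds: (upper bound, cumulative count).
sample_buckets = [(1.0, 50), (5.0, 90), (10.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample_buckets)
print(f"p95 = {p95:.3f}s, alert = {p95 > 10}")  # p95 = 8.125s, alert = False
```

Because the estimate is interpolated within a bucket, its precision depends on how the bucket boundaries are chosen around the 10s threshold.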
LLM Performance
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_llm_tokens_used_total | Total tokens consumed | Counter | rate > 1M/hour |
| heliosdb_conversational_bi_llm_latency_seconds | LLM API call latency | Histogram | p95 > 5s |
| heliosdb_conversational_bi_llm_cost_estimate | Estimated cost in USD | Gauge | - |
Cache Performance
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_cache_hits_total | Cache hits | Counter | - |
| heliosdb_conversational_bi_cache_misses_total | Cache misses | Counter | - |
| heliosdb_conversational_bi_cache_size_bytes | Cache memory usage | Gauge | > 1GB |
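The hit and miss counters carry no threshold of their own; they are meant to be combined into a hit rate, which is what the ConversationalBILowCacheHitRate alert (< 30%) looks at. A minimal sketch of that arithmetic follows. The function name and the sample counts are hypothetical, and in practice the inputs would be the increase of each counter over a time window, not lifetime totals.

```python
def cache_hit_rate(hits, misses):
    """Return hits / (hits + misses), or None when there were no lookups."""
    lookups = hits + misses
    if lookups == 0:
        return None  # avoid dividing by zero on an idle service
    return hits / lookups

LOW_HIT_RATE = 0.30  # 30%, the ConversationalBILowCacheHitRate threshold

# Hypothetical counter increases over the last 10 minutes.
rate = cache_hit_rate(hits=120, misses=380)
status = "LOW" if rate is not None and rate < LOW_HIT_RATE else "ok"
print(f"hit rate: {rate:.0%} ({status})")  # hit rate: 24% (LOW)
```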
Accuracy Metrics
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_conversational_bi_nl2sql_accuracy | SQL generation accuracy | Gauge | < 0.7 |
| heliosdb_conversational_bi_sql_validation_errors_total | Invalid SQL generated | Counter | rate > 0.05/s |
Auto-Compliance Metrics
Compliance Violations
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_violations_total | Compliance violations by framework | Counter | any > 0 |
| heliosdb_compliance_checks_total | Compliance checks performed | Counter | - |
| heliosdb_compliance_check_duration_seconds | Compliance check latency | Histogram | p95 > 5s |
Audit Log Health
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_audit_log_write_duration_seconds | Audit log write latency | Histogram | p95 > 1s |
| heliosdb_compliance_audit_log_write_errors_total | Audit log write failures | Counter | any > 0 |
| heliosdb_compliance_audit_log_storage_bytes | Audit log storage size | Gauge | > 50GB |
Report Generation
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_compliance_reports_generated_total | Reports generated | Counter | - |
| heliosdb_compliance_report_generation_failures_total | Report generation failures | Counter | any > 0 |
| heliosdb_compliance_report_generation_duration_seconds | Report generation time | Histogram | p95 > 30s |
Embedded+Cloud Metrics
Sync Performance
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_sync_duration_seconds | Sync operation latency | Histogram | p95 > 30s |
| heliosdb_embedded_cloud_sync_success_total | Successful syncs | Counter | - |
| heliosdb_embedded_cloud_sync_failures_total | Failed syncs | Counter | rate > 0.01/s |
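Thresholds such as "rate > 0.01/s" apply to the per-second increase of a counter, not to its absolute value. The sketch below shows one simplified way to derive that rate from timestamped samples, including the case where the counter drops back after a service restart. It is an illustration under stated assumptions: the function name and sample values are hypothetical, and Prometheus's own `rate()` additionally extrapolates to the window edges, which this sketch does not.

```python
def per_second_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    A drop in value is treated as a counter reset (for example a service
    restart), so the new value counts as the increase since the reset.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

# Hypothetical sync-failure counter scraped every 60s; it resets at t=120.
failures = [(0, 100), (60, 101), (120, 2), (180, 3)]
rate = per_second_rate(failures)
print(f"{rate:.4f} failures/s, warning = {rate > 0.01}")
```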
Conflict Resolution
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_conflicts_total | Data conflicts detected | Counter | rate > 0.05/s |
| heliosdb_embedded_cloud_conflicts_resolved_total | Conflicts resolved | Counter | - |
| heliosdb_embedded_cloud_conflict_resolution_failures_total | Conflicts unresolved | Counter | any > 0 |
WebSocket Connections
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_active_connections | Active WebSocket connections | Gauge | - |
| heliosdb_embedded_cloud_websocket_disconnects_total | WebSocket disconnections | Counter | rate > 10/s |
| heliosdb_embedded_cloud_websocket_message_errors_total | WebSocket message errors | Counter | rate > 0.1/s |
Offline Mode
| Metric | Description | Type | Alert Threshold |
|---|---|---|---|
| heliosdb_embedded_cloud_offline_mode_active | Devices in offline mode | Gauge | - |
| heliosdb_embedded_cloud_offline_cache_usage_bytes | Offline cache size | Gauge | > 80% of limit |
| heliosdb_embedded_cloud_offline_cache_evictions_total | Cache evictions | Counter | rate > 1/s |
Grafana Dashboards
Accessing Dashboards
- Open Grafana: http://localhost:3000
- Log in with credentials from .env:
  - Username: admin
  - Password: <GRAFANA_ADMIN_PASSWORD>
- Navigate to Dashboards > Browse
Available Dashboards
1. Conversational BI Dashboard
Location: Dashboards > Conversational BI - Production Metrics
Panels:
- Service Status (UP/DOWN indicator)
- Requests Per Minute
- NL2SQL Accuracy
- Cache Hit Rate
- LLM Error Rate
- Query Latency Percentiles (p50, p95, p99)
- Request Rate & Error Rate
- LLM Token Usage by Provider
- Cache Hit vs Miss Distribution
- Memory Usage
- Rate Limited Requests by Client
- Query Type Distribution
Key Insights:
- Are queries being served successfully?
- Is the LLM API responding?
- Is the cache effective?
- Are we staying within rate limits?
2. Auto-Compliance Dashboard
Location: Dashboards > Auto-Compliance - Production Metrics
Panels:
- Service Status
- Total Violations (24h)
- Compliance Check Latency (p95)
- Audit Log Storage
- Audit Log Write Errors
- Reports Generated (24h)
- Compliance Violations by Framework
- Audit Log Write Performance
- Compliance Checks by Framework
- Violation Types Distribution
- Audit Log Retention Compliance
- Alert Delivery Success Rate
- Recent Compliance Violations (Top 10)
- Report Generation Success & Failures
- Memory Usage
- Audit Log Compression Ratio
Key Insights:
- Are there any compliance violations?
- Is the audit log healthy?
- Are reports being generated successfully?
- Is retention policy being met?
3. Embedded+Cloud Dashboard
Location: Dashboards > Embedded+Cloud Unified - Production Metrics
Panels:
- Service Status
- Active Connections
- Sync Success Rate
- Sync Latency (p95)
- Conflicts (1h)
- Offline Mode Devices
- Sync Operations (Success vs Failures)
- Sync Latency Percentiles
- WebSocket Connections
- Data Transfer Rate
- Conflict Resolution Strategy Usage
- Device Count by User
- Offline Cache Usage
- Cloud Storage Operation Latency
- Top Error Types
- Sync Queue Length
- Device Auth Failures
- Memory Usage
- Cloud Storage Errors
Key Insights:
- Are syncs completing successfully?
- How many conflicts are occurring?
- Is offline mode working?
- Are WebSocket connections stable?
Alert Rules
Alert Configuration
Alert rules are defined in:
/home/claude/HeliosDB/deployment/staging/monitoring/prometheus/alerts/
Conversational BI Alerts
Critical Alerts:
- ConversationalBIServiceDown: Service unavailable for > 2 minutes
- ConversationalBICriticalLatency: p99 latency > 30s for > 3 minutes
- ConversationalBILLMAPIUnavailable: LLM API error rate > 0.5 errors/sec
Warning Alerts:
- ConversationalBIHighLatency: p95 latency > 10s for > 5 minutes
- ConversationalBIHighLLMErrorRate: LLM error rate > 0.1 errors/sec
- ConversationalBILowAccuracy: NL2SQL accuracy < 70% for > 10 minutes
- ConversationalBILowCacheHitRate: Cache hit rate < 30% for > 10 minutes
- ConversationalBIRateLimitExceeded: > 10 requests/sec being rate-limited
- ConversationalBIHighMemoryUsage: Memory usage > 3.5GB for > 5 minutes
Info Alerts:
- ConversationalBINoTraffic: No requests for > 10 minutes
- ConversationalBIHighLLMCost: Token usage > 1M tokens/hour
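Several of the alerts above only fire after a condition has held "for > N minutes". A minimal sketch of that hold-duration behaviour is shown below, loosely modelled on the pending and firing states that Prometheus alerting rules go through. The function name and the sample latency values are hypothetical and are not taken from the actual rule files.

```python
def alert_state(samples, threshold, hold_seconds):
    """Return 'inactive', 'pending', or 'firing' as of the latest sample.

    samples: list of (timestamp_seconds, value), oldest first.
    The alert fires only once value > threshold has held continuously
    for at least hold_seconds; any sample at or below the threshold
    resets the clock.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None    # condition cleared, start over
    if breach_start is None:
        return "inactive"
    held = samples[-1][0] - breach_start
    return "firing" if held >= hold_seconds else "pending"

# Hypothetical p95 latency (seconds) sampled every 60s, checked against
# the ConversationalBIHighLatency condition: > 10s for 5 minutes.
latency = [(0, 8), (60, 12), (120, 11), (180, 13)]
print(alert_state(latency, threshold=10, hold_seconds=300))  # pending
latency += [(240, 12), (300, 14), (360, 11)]
print(alert_state(latency, threshold=10, hold_seconds=300))  # firing
```

The hold period is what keeps a single slow query from paging anyone: a brief spike moves the alert to pending and back without ever firing.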
Auto-Compliance Alerts
Critical Alerts:
- ComplianceServiceDown: Service unavailable for > 2 minutes
- ComplianceViolationDetected: Any compliance violation detected
- ComplianceAuditLogWriteFailure: Audit log write errors detected
- ComplianceGDPRViolation: GDPR violation detected
- ComplianceHIPAAViolation: HIPAA violation detected
- CompliancePCIDSSViolation: PCI-DSS violation detected
- ComplianceAuditLogStorageCritical: Audit log storage > 80GB
Warning Alerts:
- ComplianceAuditLogHighLatency: p95 write latency > 1s
- ComplianceCheckHighLatency: p95 check latency > 5s
- ComplianceCheckFailures: Check failure rate > 0.05 failures/sec
- ComplianceReportGenerationFailed: Report generation failed
- ComplianceAuditLogStorageHigh: Audit log storage > 50GB
- ComplianceAlertDeliveryFailure: Alert delivery failing
- ComplianceRetentionPolicyViolation: Logs older than retention policy
Embedded+Cloud Alerts
Critical Alerts:
- EmbeddedCloudSyncServiceDown: Service unavailable for > 2 minutes
- EmbeddedCloudCriticalSyncLatency: p99 sync latency > 120s
- EmbeddedCloudHighSyncFailureRate: Sync failure rate > 0.1 failures/sec
- EmbeddedCloudStorageUnavailable: Cloud storage error rate > 0.1 errors/sec
- EmbeddedCloudCriticalSyncQueueBacklog: Sync queue > 10,000 items
Warning Alerts:
- EmbeddedCloudHighSyncLatency: p95 sync latency > 30s
- EmbeddedCloudSyncFailures: Sync failure rate > 0.01 failures/sec
- EmbeddedCloudConflictResolutionFailures: Unable to resolve conflicts
- EmbeddedCloudStorageHighLatency: p95 storage operation latency > 5s
- EmbeddedCloudOfflineCacheNearFull: Offline cache > 80% capacity
- EmbeddedCloudDeviceAuthFailures: Device auth failure rate > 0.05 failures/sec
- EmbeddedCloudHighWebSocketDisconnects: Disconnect rate > 10/s
- EmbeddedCloudSyncQueueBacklog: Sync queue > 1,000 items
Info Alerts:
- EmbeddedCloudHighConflictRate: Conflict rate > 0.05 conflicts/sec
- EmbeddedCloudOfflineModeActivated: Devices operating offline for > 10 minutes
- EmbeddedCloudDeviceLimitExceeded: Users hitting device limits
- EmbeddedCloudNoActiveConnections: No WebSocket connections for > 15 minutes
- EmbeddedCloudHighDataTransferRate: Data transfer > 100 MB/sec
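Some signals appear at two levels, for example the sync queue (warning above 1,000 items, critical above 10,000) and the sync failure rate (warning above 0.01/s, critical above 0.1/s). The sketch below shows how such tiered thresholds can be reasoned about in one place; the helper name is hypothetical and this is not how the Prometheus rule files themselves are written.

```python
def severity_for(value, warning, critical):
    """Map a measured value to 'critical', 'warning', or None.

    Thresholds are strict (value must exceed them), matching the
    '>' comparisons used in the alert descriptions.
    """
    if critical <= warning:
        raise ValueError("critical threshold must be above warning threshold")
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return None

# Sync queue length: EmbeddedCloudSyncQueueBacklog (> 1,000 items) and
# EmbeddedCloudCriticalSyncQueueBacklog (> 10,000 items).
for queue_length in (500, 1_500, 12_000):
    print(queue_length, severity_for(queue_length, warning=1_000, critical=10_000))
```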
Alert Severity Levels
Critical (P1)
Definition: Service is down, data loss risk, or compliance violation
Response Time: Immediate (< 15 minutes)
Actions:
- Page on-call engineer
- Begin incident response
- Check recent deployments
- Review logs immediately
Examples:
- Service completely down
- Database unavailable
- Compliance violations detected
- Audit log write failures
Warning (P2)
Definition: Service degraded, potential issue developing
Response Time: Within 1 hour
Actions:
- Notify team via Slack/email
- Investigate root cause
- Monitor for escalation
- Plan remediation
Examples:
- High latency
- Elevated error rates
- Resource approaching limits
- Report generation failures
Info (P3)
Definition: Notable event, no immediate action required
Response Time: Next business day
Actions:
- Log for investigation
- Review during normal hours
- Update runbooks if needed
Examples:
- No traffic (off-hours)
- High token usage
- Offline mode activated
- Informational events
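The three severity levels can be summarised as a small lookup table. The sketch below encodes the priorities, first actions, and response times listed above; the class and function names are hypothetical, and real notification routing would normally be configured in Alertmanager, not in application code.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class SeverityPolicy:
    priority: str
    first_action: str
    respond_within: Optional[timedelta]  # None means next business day

POLICIES = {
    "critical": SeverityPolicy("P1", "page on-call engineer", timedelta(minutes=15)),
    "warning": SeverityPolicy("P2", "notify team via Slack/email", timedelta(hours=1)),
    "info": SeverityPolicy("P3", "log for investigation", None),
}

def route(alert_name, severity):
    """Describe how an alert of the given severity should be handled."""
    policy = POLICIES.get(severity.lower())
    if policy is None:
        raise ValueError(f"unknown severity: {severity!r}")
    if policy.respond_within is None:
        deadline = "next business day"
    else:
        minutes = int(policy.respond_within.total_seconds() // 60)
        deadline = f"within {minutes} minutes"
    return f"[{policy.priority}] {alert_name}: {policy.first_action}, respond {deadline}"

print(route("ComplianceViolationDetected", "critical"))
print(route("EmbeddedCloudOfflineModeActivated", "info"))
```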
Common Monitoring Scenarios
Scenario 1: High Query Latency
Symptoms:
- Dashboard shows p95 > 10s
- Alert: ConversationalBIHighLatency
Investigation:
- Check Grafana: Conversational BI > Query Latency panel
- Check LLM API latency: Is the LLM slow?
- Check cache hit rate: Is cache effective?
- Check database latency: Is PostgreSQL slow?
Resolution:
# Check service logs
docker compose -f deployment/staging/docker-compose.yml logs conversational-bi | grep -i latency
# Check LLM API status
curl http://localhost:9091/metrics | grep llm_latency
# Restart service if needed
docker compose -f deployment/staging/docker-compose.yml restart conversational-bi
Scenario 2: Compliance Violation Detected
Symptoms:
- Dashboard shows violation count > 0
- Alert: ComplianceViolationDetected
Investigation:
- Check Grafana: Compliance > Violations by Framework panel
- Identify which framework (GDPR, HIPAA, etc.)
- Check audit logs for violation details
Resolution:
# Check compliance logs
docker compose -f deployment/staging/docker-compose.yml logs compliance | grep -i violation
# Access compliance dashboard
open http://localhost:8090
# Review violation details
curl http://localhost:8082/api/v1/compliance/violations
Scenario 3: Sync Failures
Symptoms:
- Dashboard shows high sync failure rate
- Alert: EmbeddedCloudSyncFailures
Investigation:
- Check Grafana: Embedded+Cloud > Sync Operations panel
- Check cloud storage connectivity
- Check for conflicts or errors
Resolution:
# Check sync service logs
docker compose -f deployment/staging/docker-compose.yml logs embedded-cloud-sync | grep -i sync
# Check S3 connectivity
curl http://localhost:9093/metrics | grep storage_errors
# Check sync queue
curl http://localhost:8083/api/v1/sync/queue/status
Troubleshooting Metrics
Metrics Not Appearing
Issue: Grafana shows “No data”
Diagnosis:
# 1. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
# 2. Check service metrics endpoints
curl http://localhost:9091/metrics # Should return Prometheus metrics
curl http://localhost:9092/metrics
curl http://localhost:9093/metrics
# 3. Check Prometheus logs
docker compose -f deployment/staging/docker-compose.yml logs prometheus
Resolution:
# Restart Prometheus
docker compose -f deployment/staging/docker-compose.yml restart prometheus
# Verify scrape config
docker compose -f deployment/staging/docker-compose.yml exec prometheus \
  cat /etc/prometheus/prometheus.yml
Alert Not Firing
Issue: Expected alert doesn’t trigger
Diagnosis:
# Check Prometheus rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name == "YourAlertName")'
# Check if metric exists (quote the URL so the shell does not interpret ? or < >)
curl 'http://localhost:9090/api/v1/query?query=<metric_name>'
Resolution:
# Reload Prometheus config (this endpoint is only enabled when Prometheus runs with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Check alert state
curl http://localhost:9090/api/v1/alerts
Best Practices
- Check Dashboards Daily: Review key metrics every morning
- Investigate Warnings: Don’t ignore warning-level alerts
- Baseline Metrics: Understand normal operating ranges
- Document Incidents: Keep runbooks updated
- Regular Reviews: Weekly review of alert effectiveness
- Tune Thresholds: Adjust based on observed behavior
- Avoid Alert Fatigue: Reduce noisy alerts
Next Steps
- Configure Alertmanager (optional): Set up email/Slack notifications
- Create Custom Dashboards: Add business-specific metrics
- Set Up SLOs: Define Service Level Objectives
- Enable Tracing: Add distributed tracing for debugging
- Log Aggregation: Integrate with ELK or Loki
Monitoring is operational! Your HeliosDB Phase 1 staging environment is fully observable.