HeliosDB Health Check Guide
Version: 1.0 Last Updated: 2025-11-30
Health Check Endpoints
HTTP Endpoint
    # Simple health check
    curl http://localhost:5432/health
    # Returns: 200 OK if healthy

    # Detailed health check
    curl http://localhost:5432/health/detailed
    # Returns: JSON with full status

Response Format
    {
      "status": "healthy",
      "timestamp": "2025-11-30T10:00:00Z",
      "version": "7.0.0",
      "components": {
        "database": "healthy",
        "replication": "healthy",
        "cache": "healthy",
        "storage": "healthy"
      },
      "metrics": {
        "uptime_seconds": 86400,
        "connections": 42,
        "memory_usage_pct": 45.2,
        "disk_usage_pct": 62.5,
        "cpu_usage_pct": 12.3
      }
    }

Health Check SQL Commands
    -- Database health
    SELECT pg_is_in_recovery() AS is_replica;

    -- Replication health
    SELECT COUNT(*) AS replica_count FROM pg_stat_replication;

    -- Cache health
    SELECT cache_hit_rate FROM cache_statistics;

    -- Storage health
    SELECT pg_database_size(current_database()) / 1024 / 1024 / 1024 AS size_gb;

    -- VACUUM/ANALYZE status
    -- (pg_stat_user_tables names the table column relname in PostgreSQL)
    SELECT schemaname, relname AS tablename, last_vacuum, last_analyze
    FROM pg_stat_user_tables
    ORDER BY last_vacuum DESC;

Monitoring Integration
Prometheus Metrics
    # HELP heliosdb_up Database is up
    # TYPE heliosdb_up gauge
    heliosdb_up 1

    # HELP heliosdb_connections Active connections
    # TYPE heliosdb_connections gauge
    heliosdb_connections 42

    # HELP heliosdb_memory_usage_bytes Memory usage
    # TYPE heliosdb_memory_usage_bytes gauge
    heliosdb_memory_usage_bytes 1073741824

Kubernetes Probes
    livenessProbe:
      httpGet:
        path: /health
        port: 5432
      initialDelaySeconds: 30
      periodSeconds: 10

    readinessProbe:
      httpGet:
        path: /health/ready
        port: 5432
      initialDelaySeconds: 5
      periodSeconds: 5

Alerting Rules
    # Alert if the database is down
    - alert: HeliosDBDown
      expr: heliosdb_up == 0
      for: 1m
      annotations:
        summary: "HeliosDB is down"

    # Alert if memory usage is high (above 80%)
    - alert: HighMemoryUsage
      expr: heliosdb_memory_usage_pct > 80
      for: 5m
      annotations:
        summary: "High memory usage"

    # Alert if replication lag exceeds 1 GiB (1073741824 bytes)
    - alert: ReplicationLag
      expr: heliosdb_replication_lag_bytes > 1073741824
      for: 1m
      annotations:
        summary: "Replication lag detected"

Troubleshooting
Issue: Health check returns unhealthy
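A quick first step is to ask the detailed endpoint which component is reporting a problem. The following is a minimal Python sketch, assuming the response follows the format shown under Response Format; the function name `unhealthy_components` and the status value "degraded" in the sample are illustrative assumptions, not documented HeliosDB values.

```python
import json


def unhealthy_components(payload: str) -> dict[str, str]:
    """Return {component: status} for every component not reported as 'healthy'."""
    report = json.loads(payload)
    # A missing "components" key is treated as "nothing to report".
    components = report.get("components", {})
    return {name: status for name, status in components.items() if status != "healthy"}


# Hypothetical response body; "degraded" is an assumed status value.
sample = """
{
  "status": "unhealthy",
  "components": {
    "database": "healthy",
    "replication": "degraded",
    "cache": "healthy",
    "storage": "healthy"
  }
}
"""

print(unhealthy_components(sample))
```

In practice the payload would be the body returned by `curl http://localhost:5432/health/detailed`. The component names it reports indicate which of the SQL checks to run next.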
    -- Check what's unhealthy
    SELECT * FROM health_check_details;

    -- Check specific components
    SELECT * FROM component_health_status;

    -- Review logs
    SELECT * FROM system_logs
    WHERE severity = 'ERROR'
    ORDER BY timestamp DESC
    LIMIT 20;

Best Practices
- Check health every 30 seconds
- Set up alerts for failures
- Monitor trends over time
- Include in deployment checks
- Test failover with health checks
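The polling and deployment-check practices above can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not a reference implementation: the helper names `http_check` and `wait_until_healthy`, the retry count, and the timeout are illustrative choices. Only the `/health` URL and the 30-second interval come from this guide.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:5432/health"  # simple endpoint from this guide


def http_check(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers 200 OK, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError) for non-2xx responses.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    attempts: int = 10,           # assumed retry budget
    interval: float = 30.0,       # "check health every 30 seconds"
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(interval)  # no wait after the final failed attempt
    return False
```

A deployment script could call `wait_until_healthy(http_check)` and abort the rollout when it returns `False`. The `check` and `sleep` parameters are injectable so the loop can be exercised in tests without a running server or real delays.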