HeliosDB Rollback Procedures
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents
- Overview
- When to Rollback
- Rollback Decision Matrix
- Pre-Rollback Checklist
- Rollback Procedures
- Post-Rollback Validation
- Incident Documentation
Overview
This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical issues, bugs, or performance problems.
Rollback Strategy
HeliosDB supports multiple rollback strategies:
- Service-Level Rollback: Roll back an individual service (recommended)
- Full Stack Rollback: Roll back the entire deployment
- Data Rollback: Restore the database from backup (last resort)
Recovery Time Objectives (RTO)
- Service-Level Rollback: < 5 minutes
- Full Stack Rollback: < 15 minutes
- Data Rollback: < 30 minutes (depending on backup size)
When to Rollback
Rollback Triggers
Execute rollback immediately when:
1. Service Unavailability
- Service down for > 5 minutes
- Cannot be resolved by restart
- Affecting critical functionality
2. Data Corruption
- Inconsistent data detected
- Data loss occurring
- Database integrity compromised
3. Security Vulnerability
- Critical security flaw discovered
- Exploit actively being used
- Compliance breach
4. Performance Degradation (a Prometheus check for these thresholds is sketched after this list)
- Latency increase > 500%
- Error rate > 25%
- Resource exhaustion
5. Compliance Violations
- Audit log failures
- Compliance framework violations
- Regulatory breach risk
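The latency and error-rate triggers above can be checked directly against the Prometheus instance this stack already runs. Below is a minimal sketch, assuming a hypothetical http_requests_total counter with a status label; substitute whatever metrics your services actually export:
```bash
PROM=http://localhost:9090

# 5xx error rate over the last 5 minutes, as a fraction of all requests
# (http_requests_total is a placeholder metric name)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'

# A result above 0.25 exceeds the 25% error-rate trigger
```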
Don’t Rollback When
- Minor bugs that don’t affect core functionality
- Cosmetic issues
- Performance degradation is below the trigger thresholds (latency increase < 100%)
- Issues can be hotfixed quickly (< 30 minutes)
Rollback Decision Matrix
| Severity | Impact | Response | Rollback? |
|---|---|---|---|
| P1 Critical | Service down, data loss | Immediate | Yes, immediate |
| P1 Critical | Security breach | Immediate | Yes, immediate |
| P2 High | Major degradation | Within 15 min | Yes, if no quick fix |
| P2 High | Partial functionality loss | Within 30 min | Consider, after attempting a fix |
| P3 Medium | Minor degradation | Within 1 hour | No, fix forward |
| P4 Low | Cosmetic issues | Next business day | No |
Pre-Rollback Checklist
Before executing rollback:
1. Incident Assessment
- Confirm severity level (P1-P4)
- Identify affected services
- Document symptoms and error messages
- Check if issue is deployment-related
- Verify rollback is appropriate response
2. Communication
- Notify team via Slack/email
- Identify incident commander
- Create incident ticket/doc
- Prepare status update for stakeholders
3. Backup Verification (see the integrity-check sketch after this checklist)
- Verify backup availability
- Check backup timestamp
- Confirm backup integrity
- Test backup accessibility
4. Rollback Plan
- Determine rollback scope (service vs full stack)
- Identify target version/commit
- Review dependencies
- Prepare rollback commands
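For item 3 of the checklist, backup integrity can be confirmed without performing a restore: pg_restore --list reads a custom-format archive's table of contents and fails on a corrupt dump. A minimal sketch, assuming the /backups/heliosdb/ layout used in Procedure 5:
```bash
# Most recent backup (path layout assumed from Procedure 5)
BACKUP=$(ls -t /backups/heliosdb/*.dump | head -1)

# Availability and timestamp
ls -lh "$BACKUP"

# Read the archive's table of contents without restoring anything;
# a non-zero exit code indicates an unreadable or corrupt dump
docker exec -i heliosdb-postgres pg_restore --list < "$BACKUP" > /dev/null \
  && echo "OK: $BACKUP is readable"
```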
Rollback Procedures
Procedure 1: Service-Level Rollback (Docker Compose)
Use Case: Rollback individual service to previous version
Duration: 3-5 minutes
Step 1: Identify Previous Version
```bash
# Check deployment history
docker images | grep heliosdb

# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1
```
Step 2: Update Service to Previous Version
```bash
cd /home/claude/HeliosDB/deployment/staging

# Edit docker-compose.yml and change the image tag for the affected service
# Example:
#   conversational-bi:
#     image: heliosdb-conversational-bi:v7.0.0-rc1  # Rollback to this

# Or pull the previous version and retag it
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest
```
Step 3: Restart Service
```bash
# Restart the specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi

# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi
```
Step 4: Verify Rollback
```bash
# Check service health
curl http://localhost:8081/health

# Check service version
curl http://localhost:8081/version

# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'
```
Procedure 2: Full Stack Rollback (Docker Compose)
Use Case: Rollback entire deployment to previous stable state
Duration: 10-15 minutes
Step 1: Stop Current Deployment
```bash
cd /home/claude/HeliosDB/deployment/staging

# Stop all services
docker compose -f docker-compose.yml down
```
Step 2: Checkout Previous Version
```bash
cd /home/claude/HeliosDB

# List available tags
git tag -l

# Check out the previous stable version
git checkout v7.0.0-rc1  # Replace with target version

# Verify checkout
git log -1
```
Step 3: Rebuild Images (if needed)
```bash
# Rebuild all images from the previous version
docker compose -f deployment/staging/docker-compose.yml build

# Verify images
docker images | grep heliosdb
```
Step 4: Restore Configuration
```bash
# Restore the previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
   /home/claude/HeliosDB/deployment/staging/.env

# Verify configuration
cat deployment/staging/.env
```
Step 5: Start Services
```bash
# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis

# Wait for ready (30 seconds)
sleep 30

# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
  conversational-bi \
  compliance \
  embedded-cloud-sync

# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana

# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx
```
Step 6: Verify Full Stack
```bash
# Check all services
docker compose -f deployment/staging/docker-compose.yml ps

# Test all health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
```
Procedure 3: Service-Level Rollback (Kubernetes)
Use Case: Rollback individual Kubernetes deployment
Duration: 5-10 minutes
Step 1: Check Deployment History
```bash
# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Update to v7.0.0
# 3         Current deployment
```
Step 2: Rollback Deployment
```bash
# Roll back to the previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging

# Or roll back to a specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging
```
Step 3: Monitor Rollback
```bash
# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging

# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi
```
Step 4: Verify Rollback
```bash
# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging

# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health

# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi
```
Procedure 4: Full Stack Rollback (Kubernetes)
Use Case: Rollback all Kubernetes deployments
Duration: 15-20 minutes
Step 1: Rollback All Deployments
```bash
# Roll back all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging
```
Step 2: Monitor Rollbacks
```bash
# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging

# Check all pods
watch kubectl get pods -n heliosdb-staging
```
Step 3: Verify All Services
```bash
# Check all deployments
kubectl get deployments -n heliosdb-staging

# Check all service endpoints
kubectl get endpoints -n heliosdb-staging

# Test health checks (no -t flag, so the loop also works without a TTY)
for service in conversational-bi compliance embedded-cloud-sync; do
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/$service -- curl -s http://localhost:8081/health
done
```
Procedure 5: Database Rollback (LAST RESORT)
Use Case: Data corruption, need to restore from backup
Duration: 30-60 minutes (depending on backup size)
WARNING: This will result in data loss for transactions after backup timestamp!
Step 1: Assess Data Loss Window
```bash
# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5

# Calculate the data loss window
# Example: a backup from 6 hours ago means up to 6 hours of data loss
```
Step 2: Stop All Services
```bash
# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down

# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging
```
Step 3: Backup Current Database (Even if Corrupted)
```bash
# Create an emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres

docker exec heliosdb-postgres \
  pg_dump -U heliosdb_admin -Fc heliosdb > \
  /backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump

docker compose -f deployment/staging/docker-compose.yml stop postgres
```
Step 4: Restore from Backup
```bash
# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres

# Wait for ready
sleep 30

# Drop the current database (safe here: all services were stopped in
# Step 2, so no client connections should remain)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"

# Create a fresh database
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"

# Restore from backup
docker exec -i heliosdb-postgres \
  pg_restore -U heliosdb_admin -d heliosdb -v \
  < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump
```
Step 5: Verify Database Integrity
```bash
# Check database size
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "\l+"

# Check table counts
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "
    SELECT schemaname, tablename, n_live_tup AS row_count
    FROM pg_stat_user_tables
    ORDER BY n_live_tup DESC;"

# Refresh planner statistics and clean up after the restore
# (VACUUM ANALYZE is a health step, not a full integrity check)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"
```
Step 6: Restart Services
```bash
# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d

# Or, for Kubernetes
kubectl scale deployment --all --replicas=1 -n heliosdb-staging
```
Post-Rollback Validation
1. Health Checks
```bash
# Check all service health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# All should return: {"status": "healthy"}
```
2. Smoke Tests
Conversational BI
```bash
curl -X POST http://localhost:8081/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me users", "database": "heliosdb"}'
```
Compliance
```bash
curl http://localhost:8082/api/v1/compliance/status
```
Embedded+Cloud
```bash
curl http://localhost:8083/api/v1/sync/status
```
3. Metrics Validation
```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# All should show: "health": "up"
```
4. Log Analysis
```bash
# Check for errors in the last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error

# There should be no critical errors
```
5. Load Testing (Optional)
```bash
# Run a light load test to verify stability
# See VALIDATION_TEST_SUITE.md for test scripts
```
Incident Documentation
Rollback Checklist
After rollback completes:
- Document root cause of issue
- Record rollback timeline (a small logging helper is sketched after this checklist)
- Update incident ticket
- Notify stakeholders of resolution
- Schedule post-mortem meeting
- Create action items for prevention
- Update runbooks if needed
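To keep timeline entries consistent while the incident is still in progress, a tiny logging helper can be used from the moment the issue is detected. A sketch, with a hypothetical log path:
```bash
# Hypothetical incident log location; adjust to your incident tooling.
INCIDENT_LOG=/tmp/incident-$(date +%Y%m%d).md

log_event() {
  # Record events in UTC so the post-mortem timeline is unambiguous
  echo "- $(date -u +%H:%M) - $*" >> "$INCIDENT_LOG"
}

log_event "Issue detected: conversational-bi error rate > 25%"
log_event "Rollback initiated (Procedure 1)"
log_event "Rollback complete, health checks passing"
```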
Post-Mortem Template
```markdown
# Incident Post-Mortem: [Service Name] Rollback

**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored

## Root Cause
Detailed analysis of what caused the issue.

## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]

## Resolution
How the issue was resolved (rollback details).

## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]

## Lessons Learned
What we learned and how to prevent recurrence in the future.
```
Best Practices
- Always Backup First: Before any rollback, ensure backups exist
- Test Rollbacks: Regularly test rollback procedures in staging
- Version Tagging: Use semantic versioning and git tags
- Keep Previous Images: Retain the last 5 Docker images (a pruning sketch follows this list)
- Document Changes: Maintain deployment changelog
- Gradual Rollout: Use canary or blue-green deployments when possible
- Quick Decision: Don’t delay rollback if criteria are met
- Communicate: Keep team informed throughout process
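For the image-retention practice above, one pruning approach is sketched below; it assumes one image repository per service (e.g. heliosdb-conversational-bi) and relies on docker images listing newest images first:
```bash
# Remove all but the 5 most recent images for one service.
# `docker images` sorts newest first, so lines 6 onward are the old ones.
docker images --format '{{.Repository}}:{{.Tag}}' \
  | grep '^heliosdb-conversational-bi:' \
  | tail -n +6 \
  | xargs -r docker rmi
```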
Emergency Contacts
- On-Call Engineer: [PagerDuty/Phone]
- Team Lead: [Contact]
- DevOps: [Contact]
- Database Admin: [Contact]
Rollback Automation
For faster rollbacks, consider implementing:
- Automated Rollback Scripts: Pre-built scripts for common scenarios (a sketch follows this list)
- Feature Flags: Toggle features without deployment
- Blue-Green Deployments: Instant switchback capability
- Canary Releases: Gradual rollout with automatic rollback
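As a starting point for automated rollback scripts, the sketch below wraps Procedure 1. Service names, image naming, and the compose file path are taken from this document; the health-check port is an assumption, so treat it as a template rather than a finished tool:
```bash
#!/usr/bin/env bash
# rollback-service.sh: roll a single service back to a known-good tag.
# Usage: ./rollback-service.sh conversational-bi v7.0.0-rc1
set -euo pipefail

SERVICE=$1
TAG=$2
COMPOSE=/home/claude/HeliosDB/deployment/staging/docker-compose.yml

# Retag the known-good image as latest (Procedure 1, Step 2)
docker pull "heliosdb-$SERVICE:$TAG"
docker tag "heliosdb-$SERVICE:$TAG" "heliosdb-$SERVICE:latest"

# Recreate the service (Procedure 1, Step 3)
docker compose -f "$COMPOSE" up -d --force-recreate "$SERVICE"

# Basic health gate (Procedure 1, Step 4); 8081 is an assumption,
# so map ports per service (8081/8082/8083) in a real script
sleep 10
curl -fsS http://localhost:8081/health
echo "Rollback of $SERVICE complete"
```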
Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.