HeliosDB Rollback Procedures

Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17


Table of Contents

  1. Overview
  2. When to Roll Back
  3. Rollback Decision Matrix
  4. Pre-Rollback Checklist
  5. Rollback Procedures
  6. Post-Rollback Validation
  7. Incident Documentation

Overview

This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical issues, bugs, or performance problems.

Rollback Strategy

HeliosDB supports multiple rollback strategies:

  1. Service-Level Rollback: Roll back an individual service (recommended)
  2. Full Stack Rollback: Roll back the entire deployment
  3. Data Rollback: Restore the database from backup (last resort)

Recovery Time Objectives (RTO)

  • Service-Level Rollback: < 5 minutes
  • Full Stack Rollback: < 15 minutes
  • Data Rollback: 30-60 minutes (depending on backup size)

When to Roll Back

Rollback Triggers

Execute rollback immediately when:

  1. Service Unavailability

    • Service down for > 5 minutes
    • Cannot be resolved by restart
    • Affecting critical functionality
  2. Data Corruption

    • Inconsistent data detected
    • Data loss occurring
    • Database integrity compromised
  3. Security Vulnerability

    • Critical security flaw discovered
    • Exploit actively being used
    • Compliance breach
  4. Performance Degradation

    • Latency increase > 500%
    • Error rate > 25%
    • Resource exhaustion
  5. Compliance Violations

    • Audit log failures
    • Compliance framework violations
    • Regulatory breach risk

Don’t Roll Back When

  • Minor bugs that don’t affect core functionality
  • Cosmetic issues
  • Performance degradation is modest (latency increase < 100%)
  • Issues can be hotfixed quickly (< 30 minutes)

Rollback Decision Matrix

| Severity | Impact | Response | Rollback? |
| --- | --- | --- | --- |
| P1 Critical | Service down, data loss | Immediate | Yes, immediate |
| P1 Critical | Security breach | Immediate | Yes, immediate |
| P2 High | Major degradation | Within 15 min | Yes, if no quick fix |
| P2 High | Partial functionality loss | Within 30 min | Consider, after attempting a fix |
| P3 Medium | Minor degradation | Within 1 hour | No, fix forward |
| P4 Low | Cosmetic issues | Next business day | No |
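The matrix above can be encoded as a small helper so on-call tooling applies it consistently. A minimal sketch: the function name and output labels are illustrative assumptions, not an existing HeliosDB tool.

```shell
#!/bin/sh
# Map a severity level (P1-P4) from the decision matrix to the recommended
# response. Function name and labels are illustrative, not a shipped tool.
rollback_action() {
  case "$1" in
    P1) echo "rollback-immediate" ;;
    P2) echo "rollback-if-no-quick-fix" ;;
    P3) echo "fix-forward" ;;
    P4) echo "no-action" ;;
    *)  echo "unknown-severity" >&2; return 1 ;;
  esac
}

rollback_action P1   # -> rollback-immediate
```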

Pre-Rollback Checklist

Before executing rollback:

1. Incident Assessment

  • Confirm severity level (P1-P4)
  • Identify affected services
  • Document symptoms and error messages
  • Check if issue is deployment-related
  • Verify rollback is appropriate response

2. Communication

  • Notify team via Slack/email
  • Identify incident commander
  • Create incident ticket/doc
  • Prepare status update for stakeholders

3. Backup Verification

  • Verify backup availability
  • Check backup timestamp
  • Confirm backup integrity
  • Test backup accessibility

4. Rollback Plan

  • Determine rollback scope (service vs full stack)
  • Identify target version/commit
  • Review dependencies
  • Prepare rollback commands
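The backup-verification items above can be automated as a quick freshness check before any rollback begins. A hedged sketch: the /backups/heliosdb path matches the directory used in the database rollback procedure, while the 12-hour threshold and helper name are assumptions to adjust for your retention policy.

```shell
#!/bin/sh
# Abort early if no sufficiently recent backup exists. BACKUP_DIR matches
# this runbook's backup path; MAX_AGE_MIN (12 h) is an illustrative threshold.
BACKUP_DIR="${BACKUP_DIR:-/backups/heliosdb}"
MAX_AGE_MIN="${MAX_AGE_MIN:-720}"

newest_backup() {
  # Newest *.dump (by timestamped filename) modified within MAX_AGE_MIN minutes
  find "$1" -name '*.dump' -type f -mmin "-$MAX_AGE_MIN" 2>/dev/null | sort | tail -n 1
}

latest=$(newest_backup "$BACKUP_DIR")
if [ -n "$latest" ]; then
  echo "OK: recent backup found: $latest"
else
  echo "ABORT: no backup newer than ${MAX_AGE_MIN} minutes in $BACKUP_DIR" >&2
fi
```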

Rollback Procedures

Procedure 1: Service-Level Rollback (Docker Compose)

Use Case: Roll back an individual service to its previous version

Duration: 3-5 minutes

Step 1: Identify Previous Version

Terminal window
# Check deployment history
docker images | grep heliosdb
# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1

Step 2: Update Service to Previous Version

Terminal window
cd /home/claude/HeliosDB/deployment/staging
# Edit docker-compose.yml and change the image tag for the affected service:
#   conversational-bi:
#     image: heliosdb-conversational-bi:v7.0.0-rc1  # rollback target
# Or pull and retag the previous version
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest
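The manual compose edit can also be scripted. A sketch using `sed`, demonstrated on a throwaway copy: the tag values are the examples from this runbook, and `swap_image_tag` is an illustrative helper, not a HeliosDB tool.

```shell
#!/bin/sh
# Swap an image tag in a compose file: swap_image_tag <file> <image> <old> <new>.
# sed -i.bak writes a .bak copy first so the edit is reversible.
swap_image_tag() {
  sed -i.bak "s|$2:$3|$2:$4|" "$1"
}

# Demo on a temporary file so the real docker-compose.yml is untouched:
demo=$(mktemp)
printf 'image: heliosdb-conversational-bi:v7.0.0\n' > "$demo"
swap_image_tag "$demo" heliosdb-conversational-bi v7.0.0 v7.0.0-rc1
cat "$demo"   # image: heliosdb-conversational-bi:v7.0.0-rc1
```

After editing the real file, `docker compose config` can confirm it still parses before recreating the service.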

Step 3: Restart Service

Terminal window
# Restart specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi
# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi

Step 4: Verify Rollback

Terminal window
# Check service health
curl http://localhost:8081/health
# Check service version
curl http://localhost:8081/version
# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'

Procedure 2: Full Stack Rollback (Docker Compose)

Use Case: Roll back the entire deployment to the previous stable state

Duration: 10-15 minutes

Step 1: Stop Current Deployment

Terminal window
cd /home/claude/HeliosDB/deployment/staging
# Stop all services
docker compose -f docker-compose.yml down

Step 2: Checkout Previous Version

Terminal window
cd /home/claude/HeliosDB
# List available tags
git tag -l
# Checkout previous stable version
git checkout v7.0.0-rc1 # Replace with target version
# Verify checkout
git log -1

Step 3: Rebuild Images (if needed)

Terminal window
# Rebuild all images from previous version
docker compose -f deployment/staging/docker-compose.yml build
# Verify images
docker images | grep heliosdb

Step 4: Restore Configuration

Terminal window
# Restore previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
   /home/claude/HeliosDB/deployment/staging/.env
# Verify configuration
cat deployment/staging/.env

Step 5: Start Services

Terminal window
# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis
# Wait for ready (30 seconds)
sleep 30
# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
  conversational-bi \
  compliance \
  embedded-cloud-sync
# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana
# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx

Step 6: Verify Full Stack

Terminal window
# Check all services
docker compose -f deployment/staging/docker-compose.yml ps
# Test all health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
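Instead of re-running the curls by hand while services come up, verification can poll with a bounded retry. A sketch; the attempt count, sleep interval, and helper name are assumptions.

```shell
#!/bin/sh
# Poll a health endpoint until it responds 2xx or attempts run out.
# Usage: wait_healthy <url> <max-attempts> <sleep-seconds>
wait_healthy() {
  url=$1; attempts=$2; pause=$3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -sf --max-time 3 "$url" >/dev/null 2>&1; then
      echo "up: $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "timeout: $url" >&2
  return 1
}

# Example (assumed budget: 30 attempts x 2 s = 60 s per service):
# for port in 8081 8082 8083; do wait_healthy "http://localhost:$port/health" 30 2; done
```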

Procedure 3: Service-Level Rollback (Kubernetes)

Use Case: Roll back an individual Kubernetes deployment

Duration: 5-10 minutes

Step 1: Check Deployment History

Terminal window
# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging
# Example output:
# REVISION CHANGE-CAUSE
# 1 Initial deployment
# 2 Update to v7.0.0
# 3 Current deployment

Step 2: Rollback Deployment

Terminal window
# Rollback to previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
# Or rollback to specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging

Step 3: Monitor Rollback

Terminal window
# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi

Step 4: Verify Rollback

Terminal window
# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging
# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health
# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi

Procedure 4: Full Stack Rollback (Kubernetes)

Use Case: Roll back all Kubernetes deployments at once

Duration: 15-20 minutes

Step 1: Rollback All Deployments

Terminal window
# Rollback all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging

Step 2: Monitor Rollbacks

Terminal window
# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging
# Check all pods
watch kubectl get pods -n heliosdb-staging

Step 3: Verify All Services

Terminal window
# Check all deployments
kubectl get deployments -n heliosdb-staging
# Check all service endpoints
kubectl get endpoints -n heliosdb-staging
# Test health checks
for entry in conversational-bi:8081 compliance:8082 embedded-cloud-sync:8083; do
  service=${entry%:*}
  port=${entry#*:}
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/"$service" -- curl -s "http://localhost:$port/health"
done

Procedure 5: Database Rollback (LAST RESORT)

Use Case: Data corruption, need to restore from backup

Duration: 30-60 minutes (depending on backup size)

WARNING: This will lose all transactions committed after the backup timestamp!

Step 1: Assess Data Loss Window

Terminal window
# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5
# Calculate data loss window
# Example: Backup from 6 hours ago = 6 hours of data loss
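The loss-window arithmetic can be made explicit. A sketch; `age_hours` is an illustrative helper taking epoch timestamps, not part of the backup tooling.

```shell
#!/bin/sh
# Hours elapsed between a backup's epoch timestamp and "now" (second
# argument defaults to the current time). Integer division truncates.
age_hours() {
  now=${2:-$(date +%s)}
  echo $(( (now - $1) / 3600 ))
}

# A backup taken 21600 s (6 h) before "now" implies a 6-hour loss window:
age_hours 1700000000 1700021600   # -> 6
```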

Step 2: Stop All Services

Terminal window
# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down
# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging

Step 3: Backup Current Database (Even if Corrupted)

Terminal window
# Create emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres
docker exec heliosdb-postgres \
  pg_dump -U heliosdb_admin -Fc heliosdb \
  > /backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump
docker compose -f deployment/staging/docker-compose.yml stop postgres

Step 4: Restore from Backup

Terminal window
# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres
# Wait for ready
sleep 30
# Drop current database
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"
# Create fresh database
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"
# Restore from backup (pg_restore reads the dump from stdin)
docker exec -i heliosdb-postgres \
  pg_restore -U heliosdb_admin -d heliosdb -v \
  < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump

Step 5: Verify Database Integrity

Terminal window
# Check database size
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "\l+"
# Check table counts
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "
    SELECT schemaname, tablename,
           n_live_tup AS row_count
    FROM pg_stat_user_tables
    ORDER BY n_live_tup DESC;"
# Refresh planner statistics (VACUUM ANALYZE also surfaces read errors on damaged pages)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"

Step 6: Restart Services

Terminal window
# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d
# Or for Kubernetes (note: this scales every deployment to 1 replica;
# re-apply your manifests afterwards if any service normally runs more)
kubectl scale deployment --all --replicas=1 -n heliosdb-staging

Post-Rollback Validation

1. Health Checks

Terminal window
# Check all service health endpoints
curl http://localhost:8081/health # Conversational BI
curl http://localhost:8082/health # Compliance
curl http://localhost:8083/health # Embedded+Cloud
# All should return: {"status": "healthy"}
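The three checks above can be rolled into a single pass/fail sweep. A sketch: the ports and the expected `{"status": "healthy"}` body come from this runbook; the helper names are illustrative.

```shell
#!/bin/sh
# check_health <json>: succeed iff the body reports "status": "healthy",
# the response format expected from the health endpoints.
check_health() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Sweep all three services; returns non-zero if any endpoint is unhealthy.
check_all() {
  fail=0
  for port in 8081 8082 8083; do
    body=$(curl -s --max-time 5 "http://localhost:$port/health") || body=""
    if check_health "$body"; then
      echo "port $port: healthy"
    else
      echo "port $port: UNHEALTHY"
      fail=1
    fi
  done
  return $fail
}

check_health '{"status": "healthy"}' && echo "parser OK"
```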

2. Smoke Tests

Conversational BI

Terminal window
curl -X POST http://localhost:8081/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me users", "database": "heliosdb"}'

Compliance

Terminal window
curl http://localhost:8082/api/v1/compliance/status

Embedded+Cloud

Terminal window
curl http://localhost:8083/api/v1/sync/status

3. Metrics Validation

Terminal window
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
# All should show: "health": "up"

4. Log Analysis

Terminal window
# Check for errors in last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error
# Should see no critical errors
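For a pass/fail criterion rather than eyeballing grep output, the check can be turned into a counter. A sketch; the "error" pattern matches the grep above and will also count benign lines that merely mention the word.

```shell
#!/bin/sh
# Count log lines containing "error" (case-insensitive) on stdin.
# grep -c exits 1 on zero matches, so "|| true" keeps the function's
# exit status clean while it still prints 0.
error_count() {
  grep -ci 'error' || true
}

printf 'INFO started\nERROR timeout\n' | error_count   # -> 1
```

Pipe the same `docker compose ... logs --since 5m` output into `error_count` and treat any non-zero result as a validation failure.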

5. Load Testing (Optional)

Terminal window
# Run light load test to verify stability
# See VALIDATION_TEST_SUITE.md for test scripts

Incident Documentation

Rollback Checklist

After rollback completes:

  • Document root cause of issue
  • Record rollback timeline
  • Update incident ticket
  • Notify stakeholders of resolution
  • Schedule post-mortem meeting
  • Create action items for prevention
  • Update runbooks if needed

Post-Mortem Template

# Incident Post-Mortem: [Service Name] Rollback
**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]
## Summary
Brief description of what happened.
## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored
## Root Cause
Detailed analysis of what caused the issue.
## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]
## Resolution
How the issue was resolved (rollback details).
## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]
## Lessons Learned
What we learned and how to prevent similar incidents in the future.

Best Practices

  1. Always Backup First: Before any rollback, ensure backups exist
  2. Test Rollbacks: Regularly test rollback procedures in staging
  3. Version Tagging: Use semantic versioning and git tags
  4. Keep Previous Images: Retain last 5 Docker images
  5. Document Changes: Maintain deployment changelog
  6. Gradual Rollout: Use canary or blue-green deployments when possible
  7. Quick Decision: Don’t delay rollback if criteria are met
  8. Communicate: Keep team informed throughout process

Emergency Contacts

  • On-Call Engineer: [PagerDuty/Phone]
  • Team Lead: [Contact]
  • DevOps: [Contact]
  • Database Admin: [Contact]

Rollback Automation

For faster rollbacks, consider implementing:

  1. Automated Rollback Scripts: Pre-built scripts for common scenarios
  2. Feature Flags: Toggle features without deployment
  3. Blue-Green Deployments: Instant switchback capability
  4. Canary Releases: Gradual rollout with automatic rollback
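For item 1, a pre-built script could wrap Procedure 1's pull/tag/recreate sequence. A skeleton under stated assumptions: the image naming and compose invocation mirror this runbook, and the script itself is hypothetical, not shipped with HeliosDB.

```shell
#!/bin/sh
# Hypothetical rollback.sh: roll one service back to a given image tag,
# mirroring Procedure 1 (pull, retag, force-recreate).
usage() { echo "usage: rollback.sh <service> <target-tag>"; }

main() {
  [ "$#" -eq 2 ] || { usage >&2; return 1; }
  service=$1
  tag=$2
  docker pull "heliosdb-${service}:${tag}" || return 1
  docker tag "heliosdb-${service}:${tag}" "heliosdb-${service}:latest" || return 1
  docker compose -f docker-compose.yml up -d --force-recreate "$service"
}

# main "$@"   # uncomment when installing as a standalone script
```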

Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.