HeliosDB Rollback Procedures
Version: 7.0.0 | Environment: Staging | Last Updated: 2025-11-17
Table of Contents
- Overview
- When to Rollback
- Rollback Decision Matrix
- Pre-Rollback Checklist
- Rollback Procedures
- Post-Rollback Validation
- Incident Documentation
Overview
This document provides procedures for rolling back HeliosDB Phase 1 deployments in the event of critical issues, bugs, or performance problems.
Rollback Strategy
HeliosDB supports multiple rollback strategies:
- Service-Level Rollback: Roll back an individual service (recommended)
- Full Stack Rollback: Roll back the entire deployment
- Data Rollback: Restore the database from backup (last resort)
Recovery Time Objectives (RTO)
- Service-Level Rollback: < 5 minutes
- Full Stack Rollback: < 15 minutes
- Data Rollback: < 30 minutes (depending on backup size)
When to Rollback
Rollback Triggers
Execute rollback immediately when:
1. Service Unavailability
- Service down for > 5 minutes
- Cannot be resolved by restart
- Affecting critical functionality
2. Data Corruption
- Inconsistent data detected
- Data loss occurring
- Database integrity compromised
3. Security Vulnerability
- Critical security flaw discovered
- Exploit actively being used
- Compliance breach
4. Performance Degradation (a Prometheus check for these thresholds is sketched after this list)
- Latency increase > 500%
- Error rate > 25%
- Resource exhaustion
5. Compliance Violations
- Audit log failures
- Compliance framework violations
- Regulatory breach risk
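The latency and error-rate triggers above can be checked directly against the Prometheus instance this stack already runs. Below is a minimal sketch, assuming a hypothetical http_requests_total counter with a status label; substitute whatever metrics your services actually export:
```bash
PROM=http://localhost:9090

# 5xx error rate over the last 5 minutes, as a fraction of all requests
# (http_requests_total is a placeholder metric name)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
  | jq -r '.data.result[0].value[1]'

# A result above 0.25 exceeds the 25% error-rate trigger
```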
Don’t Rollback When
- Minor bugs that don’t affect core functionality
- Cosmetic issues
- Performance degradation is below the trigger thresholds (latency increase < 100%)
- Issues can be hotfixed quickly (< 30 minutes)
Rollback Decision Matrix
| Severity | Impact | Response | Rollback? |
|---|---|---|---|
| P1 Critical | Service down, data loss | Immediate | Yes, immediate |
| P1 Critical | Security breach | Immediate | Yes, immediate |
| P2 High | Major degradation | Within 15 min | Yes, if no quick fix |
| P2 High | Partial functionality loss | Within 30 min | Consider, after attempting a fix |
| P3 Medium | Minor degradation | Within 1 hour | No, fix forward |
| P4 Low | Cosmetic issues | Next business day | No |
Pre-Rollback Checklist
Before executing rollback:
1. Incident Assessment
- Confirm severity level (P1-P4)
- Identify affected services
- Document symptoms and error messages
- Check if issue is deployment-related
- Verify rollback is appropriate response
2. Communication
- Notify team via Slack/email
- Identify incident commander
- Create incident ticket/doc
- Prepare status update for stakeholders
3. Backup Verification (see the integrity-check sketch after this checklist)
- Verify backup availability
- Check backup timestamp
- Confirm backup integrity
- Test backup accessibility
4. Rollback Plan
- Determine rollback scope (service vs full stack)
- Identify target version/commit
- Review dependencies
- Prepare rollback commands
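For item 3 of the checklist, backup integrity can be confirmed without performing a restore: pg_restore --list reads a custom-format archive's table of contents and fails on a corrupt dump. A minimal sketch, assuming the /backups/heliosdb/ layout used in Procedure 5:
```bash
# Most recent backup (path layout assumed from Procedure 5)
BACKUP=$(ls -t /backups/heliosdb/*.dump | head -1)

# Availability and timestamp
ls -lh "$BACKUP"

# Read the archive's table of contents without restoring anything;
# a non-zero exit code indicates an unreadable or corrupt dump
docker exec -i heliosdb-postgres pg_restore --list < "$BACKUP" > /dev/null \
  && echo "OK: $BACKUP is readable"
```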
Rollback Procedures
Procedure 1: Service-Level Rollback (Docker Compose)
Use Case: Rollback individual service to previous version
Duration: 3-5 minutes
Step 1: Identify Previous Version
```bash
# Check deployment history
docker images | grep heliosdb

# Identify previous working image tag
# Example: heliosdb-conversational-bi:v7.0.0-rc1
```
Step 2: Update Service to Previous Version
```bash
cd /home/claude/HeliosDB/deployment/staging

# Edit docker-compose.yml and change the image tag for the affected service
# Example:
#   conversational-bi:
#     image: heliosdb-conversational-bi:v7.0.0-rc1  # Rollback to this

# Or pull the previous version and retag it
docker pull heliosdb-conversational-bi:v7.0.0-rc1
docker tag heliosdb-conversational-bi:v7.0.0-rc1 heliosdb-conversational-bi:latest
```
Step 3: Restart Service
```bash
# Restart the specific service
docker compose -f docker-compose.yml up -d --force-recreate conversational-bi

# Monitor logs
docker compose -f docker-compose.yml logs -f conversational-bi
```
Step 4: Verify Rollback
```bash
# Check service health
curl http://localhost:8081/health

# Check service version
curl http://localhost:8081/version

# Monitor metrics for 5 minutes
watch -n 2 'curl -s http://localhost:9091/metrics | grep up'
```
Procedure 2: Full Stack Rollback (Docker Compose)
Use Case: Rollback entire deployment to previous stable state
Duration: 10-15 minutes
Step 1: Stop Current Deployment
```bash
cd /home/claude/HeliosDB/deployment/staging

# Stop all services
docker compose -f docker-compose.yml down
```
Step 2: Checkout Previous Version
```bash
cd /home/claude/HeliosDB

# List available tags
git tag -l

# Check out the previous stable version
git checkout v7.0.0-rc1  # Replace with target version

# Verify checkout
git log -1
```
Step 3: Rebuild Images (if needed)
```bash
# Rebuild all images from the previous version
docker compose -f deployment/staging/docker-compose.yml build

# Verify images
docker images | grep heliosdb
```
Step 4: Restore Configuration
```bash
# Restore the previous .env file (if backed up)
cp /home/claude/HeliosDB/deployment/staging/.env.backup.YYYYMMDD \
   /home/claude/HeliosDB/deployment/staging/.env

# Verify configuration
cat deployment/staging/.env
```
Step 5: Start Services
```bash
# Start infrastructure first
docker compose -f deployment/staging/docker-compose.yml up -d postgres redis

# Wait for ready (30 seconds)
sleep 30

# Start feature services
docker compose -f deployment/staging/docker-compose.yml up -d \
  conversational-bi \
  compliance \
  embedded-cloud-sync

# Start monitoring
docker compose -f deployment/staging/docker-compose.yml up -d prometheus grafana

# Start load balancer
docker compose -f deployment/staging/docker-compose.yml up -d nginx
```
Step 6: Verify Full Stack
```bash
# Check all services
docker compose -f deployment/staging/docker-compose.yml ps

# Test all health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
```
Procedure 3: Service-Level Rollback (Kubernetes)
Use Case: Rollback individual Kubernetes deployment
Duration: 5-10 minutes
Step 1: Check Deployment History
```bash
# View rollout history
kubectl rollout history deployment/conversational-bi -n heliosdb-staging

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Update to v7.0.0
# 3         Current deployment
```
Step 2: Rollback Deployment
```bash
# Roll back to the previous revision
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging

# Or roll back to a specific revision
kubectl rollout undo deployment/conversational-bi --to-revision=2 -n heliosdb-staging
```
Step 3: Monitor Rollback
```bash
# Watch rollout status
kubectl rollout status deployment/conversational-bi -n heliosdb-staging

# Check pods
watch kubectl get pods -n heliosdb-staging -l app=conversational-bi
```
Step 4: Verify Rollback
```bash
# Check deployment status
kubectl describe deployment conversational-bi -n heliosdb-staging

# Check service health
kubectl exec -it -n heliosdb-staging deployment/conversational-bi -- curl http://localhost:8081/health

# View logs
kubectl logs -f -n heliosdb-staging deployment/conversational-bi
```
Procedure 4: Full Stack Rollback (Kubernetes)
Use Case: Rollback all Kubernetes deployments
Duration: 15-20 minutes
Step 1: Rollback All Deployments
```bash
# Roll back all feature services
kubectl rollout undo deployment/conversational-bi -n heliosdb-staging
kubectl rollout undo deployment/compliance -n heliosdb-staging
kubectl rollout undo deployment/embedded-cloud-sync -n heliosdb-staging
```
Step 2: Monitor Rollbacks
```bash
# Watch all rollouts
kubectl rollout status deployment/conversational-bi -n heliosdb-staging
kubectl rollout status deployment/compliance -n heliosdb-staging
kubectl rollout status deployment/embedded-cloud-sync -n heliosdb-staging

# Check all pods
watch kubectl get pods -n heliosdb-staging
```
Step 3: Verify All Services
```bash
# Check all deployments
kubectl get deployments -n heliosdb-staging

# Check all service endpoints
kubectl get endpoints -n heliosdb-staging

# Test health checks (no -t flag, so the loop also works without a TTY)
for service in conversational-bi compliance embedded-cloud-sync; do
  echo "Testing $service..."
  kubectl exec -n heliosdb-staging deployment/$service -- curl -s http://localhost:8081/health
done
```
Procedure 5: Database Rollback (LAST RESORT)
Use Case: Data corruption, need to restore from backup
Duration: 30-60 minutes (depending on backup size)
WARNING: This will result in data loss for transactions after backup timestamp!
Step 1: Assess Data Loss Window
```bash
# Check latest backup timestamp
ls -lh /backups/heliosdb/ | tail -5

# Calculate the data loss window
# Example: a backup from 6 hours ago means up to 6 hours of data loss
```
Step 2: Stop All Services
```bash
# Docker Compose
docker compose -f deployment/staging/docker-compose.yml down

# Kubernetes
kubectl scale deployment --all --replicas=0 -n heliosdb-staging
```
Step 3: Backup Current Database (Even if Corrupted)
```bash
# Create an emergency backup
docker compose -f deployment/staging/docker-compose.yml up -d postgres

docker exec heliosdb-postgres \
  pg_dump -U heliosdb_admin -Fc heliosdb > \
  /backups/heliosdb/emergency_backup_$(date +%Y%m%d_%H%M%S).dump

docker compose -f deployment/staging/docker-compose.yml stop postgres
```
Step 4: Restore from Backup
```bash
# Start PostgreSQL
docker compose -f deployment/staging/docker-compose.yml up -d postgres

# Wait for ready
sleep 30

# Drop the current database (safe here: all services were stopped in
# Step 2, so no client connections should remain)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "DROP DATABASE heliosdb;"

# Create a fresh database
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d postgres -c "CREATE DATABASE heliosdb;"

# Restore from backup
docker exec -i heliosdb-postgres \
  pg_restore -U heliosdb_admin -d heliosdb -v \
  < /backups/heliosdb/backup_YYYYMMDD_HHMMSS.dump
```
Step 5: Verify Database Integrity
```bash
# Check database size
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "\l+"

# Check table counts
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "
    SELECT schemaname, tablename, n_live_tup AS row_count
    FROM pg_stat_user_tables
    ORDER BY n_live_tup DESC;"

# Refresh planner statistics and clean up after the restore
# (VACUUM ANALYZE is a health step, not a full integrity check)
docker exec heliosdb-postgres \
  psql -U heliosdb_admin -d heliosdb -c "VACUUM ANALYZE;"
```
Step 6: Restart Services
```bash
# Restart all services
docker compose -f deployment/staging/docker-compose.yml up -d

# Or, for Kubernetes
kubectl scale deployment --all --replicas=1 -n heliosdb-staging
```
Post-Rollback Validation
1. Health Checks
```bash
# Check all service health endpoints
curl http://localhost:8081/health  # Conversational BI
curl http://localhost:8082/health  # Compliance
curl http://localhost:8083/health  # Embedded+Cloud

# All should return: {"status": "healthy"}
```
2. Smoke Tests
Conversational BI
```bash
curl -X POST http://localhost:8081/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "Show me users", "database": "heliosdb"}'
```
Compliance
```bash
curl http://localhost:8082/api/v1/compliance/status
```
Embedded+Cloud
```bash
curl http://localhost:8083/api/v1/sync/status
```
3. Metrics Validation
```bash
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# All should show: "health": "up"
```
4. Log Analysis
```bash
# Check for errors in the last 5 minutes
docker compose -f deployment/staging/docker-compose.yml logs --since 5m | grep -i error

# There should be no critical errors
```
5. Load Testing (Optional)
```bash
# Run a light load test to verify stability
# See VALIDATION_TEST_SUITE.md for test scripts
```
Incident Documentation
Rollback Checklist
After rollback completes:
- Document root cause of issue
- Record rollback timeline (a small logging helper is sketched after this checklist)
- Update incident ticket
- Notify stakeholders of resolution
- Schedule post-mortem meeting
- Create action items for prevention
- Update runbooks if needed
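To keep timeline entries consistent while the incident is still in progress, a tiny logging helper can be used from the moment the issue is detected. A sketch, with a hypothetical log path:
```bash
# Hypothetical incident log location; adjust to your incident tooling.
INCIDENT_LOG=/tmp/incident-$(date +%Y%m%d).md

log_event() {
  # Record events in UTC so the post-mortem timeline is unambiguous
  echo "- $(date -u +%H:%M) - $*" >> "$INCIDENT_LOG"
}

log_event "Issue detected: conversational-bi error rate > 25%"
log_event "Rollback initiated (Procedure 1)"
log_event "Rollback complete, health checks passing"
```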
Post-Mortem Template
```markdown
# Incident Post-Mortem: [Service Name] Rollback

**Date**: YYYY-MM-DD
**Incident Commander**: [Name]
**Severity**: P1/P2/P3
**Duration**: [Start] - [End]

## Summary
Brief description of what happened.

## Timeline
- HH:MM - Deployment started
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Rollback complete
- HH:MM - Service restored

## Root Cause
Detailed analysis of what caused the issue.

## Impact
- Services affected: [list]
- Users affected: [number]
- Data loss: [yes/no, how much]
- Duration of outage: [duration]

## Resolution
How the issue was resolved (rollback details).

## Action Items
1. [ ] Prevent recurrence: [action]
2. [ ] Improve detection: [action]
3. [ ] Update documentation: [action]
4. [ ] Training needed: [action]

## Lessons Learned
What we learned and how to prevent recurrence in the future.
```
Best Practices
- Always Backup First: Before any rollback, ensure backups exist
- Test Rollbacks: Regularly test rollback procedures in staging
- Version Tagging: Use semantic versioning and git tags
- Keep Previous Images: Retain the last 5 Docker images (a pruning sketch follows this list)
- Document Changes: Maintain deployment changelog
- Gradual Rollout: Use canary or blue-green deployments when possible
- Quick Decision: Don’t delay rollback if criteria are met
- Communicate: Keep team informed throughout process
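For the image-retention practice above, one pruning approach is sketched below; it assumes one image repository per service (e.g. heliosdb-conversational-bi) and relies on docker images listing newest images first:
```bash
# Remove all but the 5 most recent images for one service.
# `docker images` sorts newest first, so lines 6 onward are the old ones.
docker images --format '{{.Repository}}:{{.Tag}}' \
  | grep '^heliosdb-conversational-bi:' \
  | tail -n +6 \
  | xargs -r docker rmi
```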
Emergency Contacts
- On-Call Engineer: [PagerDuty/Phone]
- Team Lead: [Contact]
- DevOps: [Contact]
- Database Admin: [Contact]
Rollback Automation
For faster rollbacks, consider implementing:
- Automated Rollback Scripts: Pre-built scripts for common scenarios (a sketch follows this list)
- Feature Flags: Toggle features without deployment
- Blue-Green Deployments: Instant switchback capability
- Canary Releases: Gradual rollout with automatic rollback
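As a starting point for automated rollback scripts, the sketch below wraps Procedure 1. Service names, image naming, and the compose file path are taken from this document; the health-check port is an assumption, so treat it as a template rather than a finished tool:
```bash
#!/usr/bin/env bash
# rollback-service.sh: roll a single service back to a known-good tag.
# Usage: ./rollback-service.sh conversational-bi v7.0.0-rc1
set -euo pipefail

SERVICE=$1
TAG=$2
COMPOSE=/home/claude/HeliosDB/deployment/staging/docker-compose.yml

# Retag the known-good image as latest (Procedure 1, Step 2)
docker pull "heliosdb-$SERVICE:$TAG"
docker tag "heliosdb-$SERVICE:$TAG" "heliosdb-$SERVICE:latest"

# Recreate the service (Procedure 1, Step 3)
docker compose -f "$COMPOSE" up -d --force-recreate "$SERVICE"

# Basic health gate (Procedure 1, Step 4); 8081 is an assumption,
# so map ports per service (8081/8082/8083) in a real script
sleep 10
curl -fsS http://localhost:8081/health
echo "Rollback of $SERVICE complete"
```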
Remember: Rollback is a recovery tool, not a failure. The goal is to restore service quickly and analyze issues later.