Compaction Strategy Guide for HeliosDB
Overview
Compaction is a critical background process in LSM-tree storage engines that merges SSTables to:
- Remove duplicate keys (keeping only the latest version)
- Delete tombstones for deleted keys
- Reorganize data into sorted levels
- Reclaim disk space
- Improve read performance
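To make the merge semantics concrete, here is a toy sketch (heavily simplified types, not HeliosDB code) that keeps only the newest version of each key and drops tombstones past a GC horizon:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
enum Value {
    Put(String),
    Tombstone,
}

/// Toy compaction merge: newest timestamp per key wins, and tombstones whose
/// timestamp is older than `gc_horizon` are dropped from the output.
fn merge_sstables(
    inputs: Vec<Vec<(String, u64, Value)>>,
    gc_horizon: u64,
) -> Vec<(String, u64, Value)> {
    let mut latest: BTreeMap<String, (u64, Value)> = BTreeMap::new();
    for run in inputs {
        for (key, ts, val) in run {
            match latest.get(&key) {
                // An equal-or-newer version is already kept: skip the older one.
                Some((existing_ts, _)) if *existing_ts >= ts => {}
                _ => {
                    latest.insert(key, (ts, val));
                }
            }
        }
    }
    latest
        .into_iter()
        // Drop tombstones that have passed the GC horizon.
        .filter(|(_, (ts, v))| !(matches!(v, Value::Tombstone) && *ts < gc_horizon))
        .map(|(k, (ts, v))| (k, ts, v))
        .collect()
}
```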
This guide covers HeliosDB’s advanced compaction strategies, including parallel execution, I/O throttling, and early tombstone deletion.
Table of Contents
- Compaction Strategies
- Parallel Compaction
- I/O Throttling
- Tombstone Management
- Priority Scheduling
- Performance Optimization
- Monitoring and Metrics
Compaction Strategies
1. Leveled Compaction Strategy (LCS)
How It Works:
- Data organized into levels (L0, L1, …, Ln)
- Each level is 10x larger than the previous level
- SSTables within a level don’t overlap (except L0)
- Compaction merges overlapping SSTables between adjacent levels
Level Structure:
```
L0: 100 MB   (4 files × 25 MB, may overlap)
L1: 1 GB     (40 files × 25 MB, non-overlapping)
L2: 10 GB    (400 files × 25 MB, non-overlapping)
L3: 100 GB   (4000 files × 25 MB, non-overlapping)
```
Pros:
- Excellent read performance (1-2 SSTables per read)
- Low space amplification (~1.1x)
- Predictable performance
Cons:
- High write amplification (8-10x)
- More I/O intensive
- Slower for write-heavy workloads
Best For:
- Read-heavy workloads
- Point lookups
- Random access patterns
- Production databases
Configuration:
```rust
use heliosdb_storage::{CompactionConfig, CompactionStrategy};

let config = CompactionConfig {
    strategy: CompactionStrategy::Leveled { max_level: 5 },
    min_sstables_for_compaction: 2,
    level0_size_threshold: 100 * 1024 * 1024, // 100 MB
    level_size_multiplier: 10,
    max_concurrent_compactions: 4,
    ..Default::default()
};
```
2. Size-Tiered Compaction Strategy (STCS)
How It Works:
- SSTables grouped into buckets by size
- Compaction triggered when bucket has 4+ similar-sized SSTables
- No strict level organization
- Merges SSTables of similar size together
Bucket Example:
```
Bucket 1: [10 MB, 12 MB, 11 MB, 10 MB]     → Compact to 43 MB
Bucket 2: [50 MB, 48 MB, 52 MB, 49 MB]     → Compact to 199 MB
Bucket 3: [200 MB, 210 MB, 195 MB, 205 MB] → Compact to 810 MB
```
Pros:
- Low write amplification (2-4x)
- Fast writes
- Good for time-series data
Cons:
- Higher read amplification (may scan many SSTables)
- Higher space amplification (~2x)
- Temporary disk space spikes during compaction
Best For:
- Write-heavy workloads
- Time-series data
- Log aggregation
- Metrics collection
Configuration:
```rust
let config = CompactionConfig {
    strategy: CompactionStrategy::SizeTiered,
    min_sstables_for_compaction: 4,
    level0_size_threshold: 100 * 1024 * 1024,
    max_concurrent_compactions: 4,
    ..Default::default()
};
```
3. Universal Compaction
How It Works:
- Optimized for time-series and append-only workloads
- Compacts entire sorted runs
- Uses size-ratio based triggering
- Minimizes write amplification
Compaction Trigger:
```
Trigger when: Size(Run N) / Size(Run N-1) > threshold (typically 1.2)
```
Pros:
- Lowest write amplification (2x)
- Excellent for sequential writes
- Minimal overhead
Cons:
- Can have temporary space amplification
- Not ideal for random updates
- Less predictable than leveled
Best For:
- Time-series databases
- Append-only workloads
- Event logging
- Sensor data
Configuration:
```rust
use heliosdb_storage::{CompactionStrategyV2, CompactionManagerV2};

let strategy = CompactionStrategyV2::Universal;
let manager = CompactionManagerV2::new(
    data_dir,
    strategy,
    sstables,
    tuner,
    max_concurrent_compactions,
);
```
Strategy Comparison
| Metric | Leveled | Size-Tiered | Universal |
|---|---|---|---|
| Write Amplification | 8-10x | 2-4x | 2x |
| Read Amplification | 1-2x | 5-10x | 10-20x |
| Space Amplification | 1.1x | 2x | 1.5-2x |
| Read Performance | Excellent | Good | Fair |
| Write Performance | Good | Excellent | Excellent |
| Space Efficiency | Excellent | Fair | Good |
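As a rough guide to turning these trade-offs into a choice, the helper below applies the decision tree from the Best Practices section, using the CompactionStrategyV2 variants shown elsewhere in this guide; the function and thresholds are illustrative, not part of the HeliosDB API:

```rust
use heliosdb_storage::CompactionStrategyV2;

/// Illustrative strategy picker based on the observed write fraction of the
/// workload; the 70% thresholds mirror the decision tree in Best Practices.
fn pick_strategy(write_fraction: f64) -> CompactionStrategyV2 {
    if write_fraction > 0.70 {
        // Write-heavy: favor low write amplification.
        CompactionStrategyV2::Universal
    } else if write_fraction < 0.30 {
        // Read-heavy: favor low read amplification.
        CompactionStrategyV2::LeveledParallel
    } else {
        // Mixed or shifting workload: let the system choose.
        CompactionStrategyV2::Adaptive
    }
}
```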
Parallel Compaction
HeliosDB supports parallel compaction execution to maximize throughput on multi-core systems.
How It Works
- Worker Pool: Multiple compaction workers run concurrently
- Priority Queue: Tasks scheduled by priority
- Semaphore: Limits concurrent compactions to prevent resource exhaustion
- Independent Execution: Non-overlapping compactions run in parallel
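Conceptually, the semaphore bounds how many workers run at once. Below is a minimal, self-contained sketch of that pattern using only the standard library; it is illustrative, not HeliosDB's internal worker pool or priority queue:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (illustrative only).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    // Allow at most 4 compactions to run concurrently.
    let sem = Arc::new(Semaphore::new(4));
    let handles: Vec<_> = (0..10)
        .map(|task_id| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.acquire();
                println!("running compaction task {task_id}");
                // ... merge SSTables here ...
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```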
Configuration
```rust
use heliosdb_storage::{CompactionManagerV2, AdaptiveLsmTuner, LsmTuningConfig};
use std::sync::Arc;

let tuner = Arc::new(AdaptiveLsmTuner::new(LsmTuningConfig::default()));
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive,
    sstables,
    tuner,
    8, // Max 8 concurrent compactions
);
```
Worker Distribution
```
CPU Cores: 16
Recommendation: max_concurrent_compactions = cores / 2 = 8
```
Reasoning:
- Leave cores for foreground operations
- Each compaction uses 1-2 threads
- I/O bound, not CPU bound
Benefits
- 3-4x compaction throughput on multi-core systems
- Reduced compaction backlog
- Better foreground latency (less compaction debt)
- Improved write throughput
Task Scheduling
Tasks are prioritized by urgency score:
```
urgency_score = (level_priority * 100) + (size_factor * 10) + age_factor

Priority:
  L0 → L1: 500 (highest priority)
  L1 → L2: 400
  L2 → L3: 300
  ...
```
Example:
```
Task 1: L0 → L1, 200 MB, 5 minutes old
        urgency = 500 + 20 + 5 = 525

Task 2: L3 → L4, 1 GB, 2 minutes old
        urgency = 200 + 100 + 2 = 302
```
Task 1 executes first (higher urgency).
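A minimal sketch of this calculation, with the size and age scaling inferred from the worked examples above (HeliosDB's actual implementation may differ):

```rust
/// Illustrative urgency calculation matching the worked examples above.
fn calculate_urgency(source_level: u8, size_bytes: u64, age_minutes: u64, max_level: u8) -> u64 {
    // Lower source levels are more urgent: L0 scores highest.
    let level_priority = max_level.saturating_sub(source_level) as u64; // 5 for L0 when max_level = 5
    let size_factor = size_bytes / (100 * 1024 * 1024); // one point per 100 MB
    let age_factor = age_minutes;                       // one point per minute waiting
    level_priority * 100 + size_factor * 10 + age_factor
}

// Task 1: L0 → L1, 200 MB, 5 min old => 5*100 + 2*10 + 5 = 525
// Task 2: L3 → L4, 1 GB,   2 min old => 2*100 + 10*10 + 2 = 302
```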
I/O Throttling
Compaction can overwhelm I/O bandwidth, affecting foreground performance. HeliosDB includes adaptive I/O throttling.
How It Works
Token Bucket Algorithm:
- Tokens represent I/O bytes
- Tokens refill at configured rate
- Compaction consumes tokens before I/O
- If tokens exhausted, compaction waits
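A minimal, self-contained sketch of the token-bucket pattern described above (illustrative only; the actual throttle is configured via IoThrottleConfig below):

```rust
use std::time::{Duration, Instant};

/// Token-bucket sketch (not HeliosDB's internal type).
struct TokenBucket {
    capacity: f64,        // maximum burst, in bytes
    tokens: f64,          // currently available bytes
    refill_per_sec: f64,  // bytes added back per second
    last_refill: Instant,
}

impl TokenBucket {
    fn new(bytes_per_sec: u64) -> Self {
        Self {
            capacity: bytes_per_sec as f64,
            tokens: bytes_per_sec as f64,
            refill_per_sec: bytes_per_sec as f64,
            last_refill: Instant::now(),
        }
    }

    /// Reserve `bytes` of I/O budget; returns how long to wait if throttled.
    fn acquire(&mut self, bytes: u64) -> Option<Duration> {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= bytes as f64 {
            self.tokens -= bytes as f64; // enough budget: proceed immediately
            None
        } else {
            let deficit = bytes as f64 - self.tokens;
            Some(Duration::from_secs_f64(deficit / self.refill_per_sec))
        }
    }
}
```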
Configuration
```rust
use heliosdb_storage::IoThrottleConfig;

let config = IoThrottleConfig {
    max_read_bytes_per_sec: 100 * 1024 * 1024,  // 100 MB/s
    max_write_bytes_per_sec: 100 * 1024 * 1024, // 100 MB/s
    adaptive: true, // Adjust based on load
};
```
Adaptive Throttling
When adaptive: true, throttling adjusts based on:
- Foreground operation latency
- System I/O utilization
- Compaction backlog
Rules:
- If foreground latency <1ms: Allow full compaction bandwidth
- If foreground latency 1-5ms: Reduce compaction to 75%
- If foreground latency 5-10ms: Reduce compaction to 50%
- If foreground latency >10ms: Reduce compaction to 25%
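For illustration, these rules map directly to a bandwidth fraction (a sketch assuming the thresholds above; not HeliosDB's exact code):

```rust
/// Maps observed foreground latency to the fraction of compaction bandwidth
/// allowed, following the documented rules.
fn compaction_bandwidth_fraction(foreground_latency_ms: f64) -> f64 {
    match foreground_latency_ms {
        l if l < 1.0 => 1.00, // full compaction bandwidth
        l if l < 5.0 => 0.75,
        l if l < 10.0 => 0.50,
        _ => 0.25,
    }
}
```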
Benefits
- Consistent foreground performance
- Prevents compaction from starving reads/writes
- Better multi-tenant resource sharing
- Reduced tail latencies
Tombstone Management
Deletes in LSM trees use tombstones (markers indicating deletion). HeliosDB includes advanced tombstone management.
Standard Tombstone GC
Process:
- Tombstone written on delete
- Tombstone propagates through levels during compaction
- After gc_grace_seconds, the tombstone becomes eligible for removal
- The tombstone is removed only if it is the sole remaining version of the key
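A hedged sketch of that removal check, with illustrative names (not HeliosDB's internal API):

```rust
/// Sketch of the standard tombstone GC check described above.
fn tombstone_removable(
    tombstone_age_secs: u64,
    gc_grace_seconds: u64,
    is_only_version_of_key: bool,
) -> bool {
    // Both conditions must hold: the grace period has elapsed, and no other
    // version of the key remains that the tombstone still needs to shadow.
    tombstone_age_secs >= gc_grace_seconds && is_only_version_of_key
}
```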
Configuration:
```rust
let config = CompactionConfig {
    gc_grace_seconds: 864000, // 10 days
    ..Default::default()
};
```
Why Grace Period?
- Allows time for repairs/backups
- Prevents resurrection of deleted data
- Handles distributed clock skew
Early Tombstone Deletion (ETD)
HeliosDB V2 includes early tombstone deletion for high-tombstone scenarios.
When Triggered:
```rust
if tombstone_percentage >= 0.30 { // 30% threshold
    enable_early_tombstone_deletion();
}
```
Process:
- Calculate tombstone percentage in compaction input
- If >30%, enable aggressive tombstone removal
- Remove tombstones older than gc_grace_seconds / 10
- Reduces space amplification by 40-60%
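A hedged sketch of how the effective grace period could be derived under these rules (names and structure are illustrative):

```rust
/// Sketch of the early-tombstone-deletion decision described above.
fn effective_grace(gc_grace_seconds: u64, tombstones: u64, total_entries: u64) -> u64 {
    let tombstone_pct = tombstones as f64 / total_entries.max(1) as f64;
    if tombstone_pct >= 0.30 {
        gc_grace_seconds / 10 // aggressive: 1/10th of the normal grace period
    } else {
        gc_grace_seconds // standard GC behaviour
    }
}
```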
Benefits:
- Faster space reclamation
- Lower read amplification (fewer tombstones to skip)
- Better for high-churn workloads
Configuration:
```rust
let manager = CompactionManagerV2::new(...);
// ETD is automatic when tombstone % > 30%
```
Trade-offs:
- May delete tombstones before full grace period
- Acceptable for most workloads
- Disable for strict consistency requirements
Priority Scheduling
Compaction tasks are scheduled by priority to optimize performance.
Priority Calculation
```rust
pub struct CompactionPriority {
    pub priority: u8,        // Base priority (0-255)
    pub estimated_size: u64, // Size of compaction
    pub level: u8,           // Source level
    pub urgency_score: u64,  // Calculated urgency
}

urgency_score = calculate_urgency(level, size, age);
```
Priority Rules
- L0 Compactions: Highest priority (prevents write stalls)
- Large Compactions: Higher priority (clear backlog)
- Old Compactions: Higher priority (prevent accumulation)
- Deep Level Compactions: Lower priority (less impact)
Example
```rust
use heliosdb_storage::{CompactionPriority, CompactionTaskV2};

let task = CompactionTaskV2 {
    id: "compact-1".to_string(),
    priority: CompactionPriority {
        priority: 100,
        estimated_size: 200 * 1024 * 1024, // 200 MB
        level: 0,
        urgency_score: 500,
    },
    source_sstables: vec![...],
    target_level: 1,
    strategy: CompactionStrategyV2::LeveledParallel,
    enable_tombstone_gc: true,
    compression: CompressionAlgorithm::Snappy,
};

manager.submit_task(task)?;
```
Performance Optimization
1. Minimize Write Amplification
Strategies:
- Use size-tiered or universal compaction
- Increase SSTable size (fewer compactions)
- Batch writes in larger memtables
- Enable early tombstone deletion
Example:
```rust
let config = LsmTuningConfig {
    memtable_size_mb: 256,      // Larger memtables
    target_file_size_base: 128, // Larger SSTables
    compaction_style: 1,        // Universal compaction
    ..Default::default()
};
```
Result:
- Write amplification: 10x → 3x (70% reduction)
- Write throughput: +100%
2. Minimize Read Amplification
Strategies:
- Use leveled compaction
- Increase bloom filter size
- Trigger compaction earlier (fewer L0 files)
- Increase block cache
Example:
```rust
let config = LsmTuningConfig {
    level0_file_trigger: 2,                      // Compact L0 early
    bloom_bits_per_key: vec![16, 14, 12, 10, 8], // Large filters
    block_cache_mb: 2048,                        // Large cache
    compaction_style: 0,                         // Leveled
    ..Default::default()
};
```
Result:
- Read amplification: 10x → 2x (80% reduction)
- Read throughput: +150%
3. Optimize Space Utilization
Strategies:
- Use compression (Zstd for maximum compression)
- Enable early tombstone deletion
- Use leveled compaction (better space efficiency)
- Reduce GC grace period
Example:
```rust
let config = LsmTuningConfig {
    compression_per_level: vec![0, 2, 2, 2, 2, 2, 2], // Zstd
    compaction_style: 0,                              // Leveled
    ..Default::default()
};

let compaction_config = CompactionConfig {
    gc_grace_seconds: 86400, // 1 day (reduced from 10)
    ..Default::default()
};
```
Result:
- Space amplification: 2.5x → 1.2x (52% reduction)
- Disk savings: 40-60%
4. Balance Throughput and Latency
Use Adaptive Strategy:
```rust
let strategy = CompactionStrategyV2::Adaptive;
```
The adaptive strategy automatically switches between leveled and universal compaction based on the observed workload.
Benefits:
- Optimal performance across workload changes
- Reduced operational overhead
- Best of both worlds
Monitoring and Metrics
Key Metrics
1. Compaction Metrics
```rust
let metrics = manager.metrics();
let snapshot = metrics.snapshot();

println!("Total Compactions: {}", snapshot.total_compactions);
println!("Bytes Read: {} MB", snapshot.bytes_read / (1024 * 1024));
println!("Bytes Written: {} MB", snapshot.bytes_written / (1024 * 1024));
println!("Tombstones Removed: {}", snapshot.tombstones_removed);
println!("Space Reclaimed: {} MB", snapshot.space_reclaimed / (1024 * 1024));
```
2. Amplification Factors
```rust
println!("Write Amplification: {:.2}x", snapshot.write_amplification());
println!("Space Efficiency: {:.2}%", snapshot.space_efficiency() * 100.0);
```
3. Performance Metrics
```rust
println!("Avg Compaction Time: {:.2}s", snapshot.avg_compaction_time_secs());
println!("Avg Throughput: {:.2} MB/s", snapshot.avg_throughput as f64 / (1024.0 * 1024.0));
println!("Failed Compactions: {}", snapshot.failed_compactions);
```
Metrics Report
println!("{}", snapshot.format_report());Output:
```
Compaction Metrics Report
=========================
Total Compactions: 1250
  - Size-Tiered: 850
  - Leveled: 400
  - Failed: 5

Space Statistics:
  - Bytes Read: 120 GB
  - Bytes Written: 480 GB
  - Space Reclaimed: 60 GB
  - Space Efficiency: 50.00%

SSTable Statistics:
  - Tables Merged: 5000
  - Tables Created: 1250
  - Tombstones Removed: 500000
  - Duplicates Removed: 1500000

Performance:
  - Total Compaction Time: 3600.00s
  - Average Compaction Time: 2.88s
  - Average Throughput: 133.33 MB/s
  - Write Amplification: 4.00x
  - Peak Memory Usage: 512.00 MB
```
Alerts and Thresholds
Set up monitoring for:
```rust
// Write amplification too high
if snapshot.write_amplification() > 15.0 {
    alert("Write amplification excessive: {:.2}x", snapshot.write_amplification());
}

// Compaction falling behind
if manager.active_compactions() >= max_concurrent {
    alert("All compaction workers busy, backlog building");
}

// High failure rate
let failure_rate = snapshot.failed_compactions as f64 / snapshot.total_compactions as f64;
if failure_rate > 0.01 {
    alert("Compaction failure rate: {:.2}%", failure_rate * 100.0);
}

// Space efficiency low (not reclaiming enough space)
if snapshot.space_efficiency() < 0.20 {
    alert("Low space reclamation: {:.2}%", snapshot.space_efficiency() * 100.0);
}
```
Best Practices
1. Choose the Right Strategy
Decision Tree:
```
Is the workload write-heavy (>70% writes)?
  Yes → Size-Tiered or Universal
  No  → Is it read-heavy (>70% reads)?
          Yes → Leveled
          No  → Adaptive (let the system choose)
```
2. Size SSTables Appropriately
Guidelines:
- L0 files: 25-50 MB (fast to compact)
- L1+ files: 64-128 MB (balance compaction cost)
- Time-series: 128-256 MB (sequential writes)
3. Configure Parallel Workers
Formula:
```
max_concurrent = min(
    cpu_cores / 2,
    available_memory_gb / 2,
    io_bandwidth_gb_per_sec * 2
)
```
Example:
- 16 cores, 32 GB RAM, 2 GB/s I/O
- max_concurrent = min(8, 16, 4) = 4
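A small sketch of this sizing heuristic (the function name is illustrative; the inputs are values you would measure on your own hardware):

```rust
/// Sketch of the worker-sizing formula above.
fn recommended_concurrent_compactions(cpu_cores: u64, memory_gb: u64, io_gb_per_sec: u64) -> u64 {
    (cpu_cores / 2)
        .min(memory_gb / 2)
        .min(io_gb_per_sec * 2)
        .max(1) // never drop below one worker
}

// 16 cores, 32 GB RAM, 2 GB/s I/O => min(8, 16, 4) = 4
```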
4. Use I/O Throttling in Production
Always enable for:
- Multi-tenant systems
- Shared storage
- Cloud deployments
- SSD storage (prevent wear)
5. Monitor Continuously
Essential metrics:
- Compaction backlog (pending tasks)
- Active compactions
- Write/read/space amplification
- Failure rate
- L0 file count
6. Tune Based on Metrics
If write amplification high:
- Switch to size-tiered/universal
- Increase SSTable size
- Reduce compaction frequency
If read amplification high:
- Switch to leveled
- Increase bloom filter size
- Compact L0 more frequently
If space amplification high:
- Reduce GC grace period
- Enable early tombstone deletion
- Trigger manual compaction
Conclusion
HeliosDB’s advanced compaction system provides:
✓ Multiple strategies for different workloads
✓ Parallel execution for high throughput
✓ I/O throttling for consistent performance
✓ Early tombstone deletion for space efficiency
✓ Priority scheduling for optimal resource usage
Key Takeaways:
- Choose strategy based on workload (leveled for reads, size-tiered for writes)
- Enable parallel compaction on multi-core systems
- Use I/O throttling to protect foreground performance
- Monitor metrics and adjust configuration as needed
- Use adaptive mode for changing workloads
By following this guide, you can achieve:
- 40-60% reduction in write amplification
- 2-3x improvement in compaction throughput
- Consistent sub-millisecond foreground latencies
- 50%+ space savings with compression and tombstone management
For more information, see the LSM Tuning Guide.