
Compaction Strategy Guide for HeliosDB

Overview

Compaction is a critical background process in LSM-tree storage engines that merges SSTables to:

  • Remove duplicate keys (keeping only the latest version)
  • Delete tombstones for deleted keys
  • Reorganize data into sorted levels
  • Reclaim disk space
  • Improve read performance

This guide covers HeliosDB’s advanced compaction strategies, including parallel execution, I/O throttling, and early tombstone deletion.


Table of Contents

  1. Compaction Strategies
  2. Parallel Compaction
  3. I/O Throttling
  4. Tombstone Management
  5. Priority Scheduling
  6. Performance Optimization
  7. Monitoring and Metrics

Compaction Strategies

1. Leveled Compaction Strategy (LCS)

How It Works:

  • Data organized into levels (L0, L1, …, Ln)
  • Each level is 10x larger than the previous level
  • SSTables within a level don’t overlap (except L0)
  • Compaction merges overlapping SSTables between adjacent levels

Level Structure:

L0: 100 MB (4 files × 25 MB, may overlap)
L1: 1 GB (40 files × 25 MB, non-overlapping)
L2: 10 GB (400 files × 25 MB, non-overlapping)
L3: 100 GB (4000 files × 25 MB, non-overlapping)
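
As a worked example of this sizing rule, the sketch below derives each level's capacity from the L0 threshold and the size multiplier. The helper function is illustrative, not part of the HeliosDB API; the constants mirror the CompactionConfig fields shown later in this section.

fn level_capacity_bytes(level: u32, l0_threshold: u64, multiplier: u64) -> u64 {
    // Each level is `multiplier` times larger than the previous one.
    l0_threshold * multiplier.pow(level)
}

fn main() {
    let l0 = 100 * 1024 * 1024; // 100 MB, matching level0_size_threshold
    for level in 0..4 {
        println!("L{}: {} MB", level, level_capacity_bytes(level, l0, 10) / (1024 * 1024));
    }
}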

Pros:

  • Excellent read performance (1-2 SSTables per read)
  • Low space amplification (~1.1x)
  • Predictable performance

Cons:

  • High write amplification (8-10x)
  • More I/O intensive
  • Slower for write-heavy workloads

Best For:

  • Read-heavy workloads
  • Point lookups
  • Random access patterns
  • Production databases

Configuration:

use heliosdb_storage::{CompactionConfig, CompactionStrategy};

let config = CompactionConfig {
    strategy: CompactionStrategy::Leveled { max_level: 5 },
    min_sstables_for_compaction: 2,
    level0_size_threshold: 100 * 1024 * 1024, // 100 MB
    level_size_multiplier: 10,
    max_concurrent_compactions: 4,
    ..Default::default()
};

2. Size-Tiered Compaction Strategy (STCS)

How It Works:

  • SSTables grouped into buckets by size
  • Compaction triggered when bucket has 4+ similar-sized SSTables
  • No strict level organization
  • Merges SSTables of similar size together

Bucket Example:

Bucket 1: [10 MB, 12 MB, 11 MB, 10 MB] → Compact to 43 MB
Bucket 2: [50 MB, 48 MB, 52 MB, 49 MB] → Compact to 199 MB
Bucket 3: [200 MB, 210 MB, 195 MB, 205 MB] → Compact to 810 MB
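
The bucketing step can be sketched as follows; the 50% size tolerance and the group-then-check flow are illustrative assumptions, with the 4-SSTable trigger taken from the description above.

// Group SSTables (sizes in MB) into buckets of similar size,
// then flag any bucket with 4+ members for compaction.
fn bucket_sstables(mut sizes: Vec<u64>, tolerance: f64) -> Vec<Vec<u64>> {
    sizes.sort_unstable();
    let mut buckets: Vec<Vec<u64>> = Vec::new();
    for size in sizes {
        // A table joins the last bucket if it is within `tolerance` of the
        // bucket's average size; otherwise it starts a new bucket.
        let fits = buckets.last().map_or(false, |bucket| {
            let avg = bucket.iter().sum::<u64>() as f64 / bucket.len() as f64;
            (size as f64 - avg).abs() <= avg * tolerance
        });
        if fits {
            buckets.last_mut().unwrap().push(size);
        } else {
            buckets.push(vec![size]);
        }
    }
    buckets
}

fn main() {
    for bucket in bucket_sstables(vec![10, 12, 11, 10, 50, 48, 52, 49], 0.5) {
        if bucket.len() >= 4 {
            println!("compact {:?} -> {} MB", bucket, bucket.iter().sum::<u64>());
        }
    }
}

Running this on the first two buckets above reproduces the 43 MB and 199 MB outputs.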

Pros:

  • Low write amplification (2-4x)
  • Fast writes
  • Good for time-series data

Cons:

  • Higher read amplification (may scan many SSTables)
  • Higher space amplification (~2x)
  • Temporary disk space spikes during compaction

Best For:

  • Write-heavy workloads
  • Time-series data
  • Log aggregation
  • Metrics collection

Configuration:

use heliosdb_storage::{CompactionConfig, CompactionStrategy};

let config = CompactionConfig {
    strategy: CompactionStrategy::SizeTiered,
    min_sstables_for_compaction: 4,
    level0_size_threshold: 100 * 1024 * 1024,
    max_concurrent_compactions: 4,
    ..Default::default()
};

3. Universal Compaction

How It Works:

  • Optimized for time-series and append-only workloads
  • Compacts entire sorted runs
  • Uses size-ratio based triggering
  • Minimizes write amplification

Compaction Trigger:

Trigger when:
Size(Run N) / Size(Run N-1) > threshold (typically 1.2)
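
In code, the trigger reduces to a ratio check over adjacent sorted runs. The sketch below assumes runs are ordered oldest to newest; the function is illustrative.

// Returns true if any run exceeds its predecessor by more than `threshold`.
fn should_compact(run_sizes: &[u64], threshold: f64) -> bool {
    run_sizes
        .windows(2)
        .any(|pair| pair[1] as f64 / pair[0] as f64 > threshold)
}

fn main() {
    // Sizes in MB, oldest run first.
    assert!(!should_compact(&[100, 90, 80], 1.2)); // runs shrinking: no trigger
    assert!(should_compact(&[100, 90, 120], 1.2)); // 120/90 > 1.2: compact
}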

Pros:

  • Lowest write amplification (2x)
  • Excellent for sequential writes
  • Minimal overhead

Cons:

  • Can have temporary space amplification
  • Not ideal for random updates
  • Less predictable than leveled

Best For:

  • Time-series databases
  • Append-only workloads
  • Event logging
  • Sensor data

Configuration:

use heliosdb_storage::{CompactionStrategyV2, CompactionManagerV2};

let strategy = CompactionStrategyV2::Universal;
let manager = CompactionManagerV2::new(
    data_dir,
    strategy,
    sstables,
    tuner,
    max_concurrent_compactions,
);

Strategy Comparison

Metric              | Leveled   | Size-Tiered | Universal
--------------------|-----------|-------------|----------
Write Amplification | 8-10x     | 2-4x        | 2x
Read Amplification  | 1-2x      | 5-10x       | 10-20x
Space Amplification | 1.1x      | 2x          | 1.5-2x
Read Performance    | Excellent | Good        | Fair
Write Performance   | Good      | Excellent   | Excellent
Space Efficiency    | Excellent | Fair        | Good

Parallel Compaction

HeliosDB supports parallel compaction execution to maximize throughput on multi-core systems.

How It Works

  1. Worker Pool: Multiple compaction workers run concurrently
  2. Priority Queue: Tasks scheduled by priority
  3. Semaphore: Limits concurrent compactions to prevent resource exhaustion
  4. Independent Execution: Non-overlapping compactions run in parallel

Configuration

use heliosdb_storage::{
    AdaptiveLsmTuner, CompactionManagerV2, CompactionStrategyV2, LsmTuningConfig,
};
use std::sync::Arc;

let tuner = Arc::new(AdaptiveLsmTuner::new(LsmTuningConfig::default()));
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive,
    sstables,
    tuner,
    8, // Max 8 concurrent compactions
);

Worker Distribution

CPU Cores: 16
Recommendation: max_concurrent_compactions = cores / 2 = 8
Reasoning:
- Leave cores for foreground operations
- Each compaction uses 1-2 threads
- I/O bound, not CPU bound

Benefits

  • 3-4x compaction throughput on multi-core systems
  • Reduced compaction backlog
  • Better foreground latency (less compaction debt)
  • Improved write throughput

Task Scheduling

Tasks are prioritized by urgency score:

urgency_score = (level_priority * 100) + (size_factor * 10) + age_factor

Priority by source level:
  L0 → L1: 500 (highest priority)
  L1 → L2: 400
  L2 → L3: 300
  ...

Example:

Task 1: L0→L1, 200 MB, 5 minutes old
  urgency = 500 + 20 + 5 = 525

Task 2: L3→L4, 1 GB, 2 minutes old
  urgency = 200 + 100 + 2 = 302

Task 1 executes first (higher urgency)
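
The arithmetic above can be reproduced with a small helper. The unit choices here (level priority stepping down by 100 from 500, size counted in 100 MB increments, age in minutes) are an illustrative reading of the formula, not the exact internal weights.

fn urgency_score(level: u8, size_bytes: u64, age_minutes: u64) -> u64 {
    let level_priority = 5u64.saturating_sub(level as u64); // L0 -> 5, L1 -> 4, ...
    let size_factor = size_bytes / (100 * 1024 * 1024);     // 100 MB units
    level_priority * 100 + size_factor * 10 + age_minutes
}

fn main() {
    // Task 1: L0 -> L1, 200 MB, 5 minutes old => 500 + 20 + 5 = 525
    assert_eq!(urgency_score(0, 200 * 1024 * 1024, 5), 525);
    // Task 2: L3 -> L4, 1 GB, 2 minutes old => 200 + 100 + 2 = 302
    assert_eq!(urgency_score(3, 1024 * 1024 * 1024, 2), 302);
}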

I/O Throttling

Compaction can overwhelm I/O bandwidth, affecting foreground performance. HeliosDB includes adaptive I/O throttling.

How It Works

Token Bucket Algorithm:

  1. Tokens represent I/O bytes
  2. Tokens refill at configured rate
  3. Compaction consumes tokens before I/O
  4. If tokens exhausted, compaction waits
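
A minimal single-threaded sketch of this token bucket follows. HeliosDB's real throttle is shared across workers (and would need atomics), but the accounting is the same idea.

use std::time::{Duration, Instant};

struct TokenBucket {
    capacity: f64,       // burst size in bytes
    tokens: f64,         // currently available bytes
    refill_per_sec: f64, // e.g. 100 MB/s
    last_refill: Instant,
}

impl TokenBucket {
    fn new(bytes_per_sec: u64) -> Self {
        Self {
            capacity: bytes_per_sec as f64,
            tokens: bytes_per_sec as f64,
            refill_per_sec: bytes_per_sec as f64,
            last_refill: Instant::now(),
        }
    }

    // Block until `bytes` tokens are available, then consume them.
    fn acquire(&mut self, bytes: u64) {
        loop {
            // Refill proportionally to elapsed time, capped at capacity.
            let elapsed = self.last_refill.elapsed().as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
            self.last_refill = Instant::now();
            if self.tokens >= bytes as f64 {
                self.tokens -= bytes as f64;
                return;
            }
            // Sleep roughly until enough tokens have accumulated.
            let deficit = bytes as f64 - self.tokens;
            std::thread::sleep(Duration::from_secs_f64(deficit / self.refill_per_sec));
        }
    }
}

fn main() {
    let mut bucket = TokenBucket::new(100 * 1024 * 1024); // 100 MB/s
    bucket.acquire(8 * 1024 * 1024); // charge an 8 MB compaction read
}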

Configuration

use heliosdb_storage::IoThrottleConfig;
let config = IoThrottleConfig {
    max_read_bytes_per_sec: 100 * 1024 * 1024,  // 100 MB/s
    max_write_bytes_per_sec: 100 * 1024 * 1024, // 100 MB/s
    adaptive: true, // Adjust based on load
};

Adaptive Throttling

When adaptive: true, throttling adjusts based on:

  • Foreground operation latency
  • System I/O utilization
  • Compaction backlog

Rules:

  • If foreground latency <1ms: Allow full compaction bandwidth
  • If foreground latency 1-5ms: Reduce compaction to 75%
  • If foreground latency 5-10ms: Reduce compaction to 50%
  • If foreground latency >10ms: Reduce compaction to 25%
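
These tiers translate directly into code; the sketch below encodes the rules above (the function name is an assumption, not a HeliosDB API).

use std::time::Duration;

fn compaction_bandwidth_fraction(foreground_latency: Duration) -> f64 {
    let ms = foreground_latency.as_secs_f64() * 1000.0;
    if ms < 1.0 {
        1.00 // full compaction bandwidth
    } else if ms < 5.0 {
        0.75
    } else if ms < 10.0 {
        0.50
    } else {
        0.25
    }
}

fn main() {
    assert_eq!(compaction_bandwidth_fraction(Duration::from_micros(500)), 1.00);
    assert_eq!(compaction_bandwidth_fraction(Duration::from_millis(7)), 0.50);
}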

Benefits

  • Consistent foreground performance
  • Prevents compaction from starving reads/writes
  • Better multi-tenant resource sharing
  • Reduced tail latencies

Tombstone Management

Deletes in LSM trees use tombstones (markers indicating deletion). HeliosDB includes advanced tombstone management.

Standard Tombstone GC

Process:

  1. Tombstone written on delete
  2. Tombstone propagates through levels during compaction
  3. After gc_grace_seconds, tombstone eligible for removal
  4. Tombstone removed if it’s the only version of the key
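
Steps 3-4 reduce to an eligibility check like the sketch below; the types and names are illustrative, not HeliosDB internals.

use std::time::{SystemTime, UNIX_EPOCH};

struct Tombstone {
    written_at_secs: u64, // Unix timestamp of the delete
}

// A tombstone may be dropped once the grace period has elapsed AND it is
// the only remaining version of the key anywhere in the tree.
fn can_drop(t: &Tombstone, gc_grace_seconds: u64, is_only_version: bool) -> bool {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before Unix epoch")
        .as_secs();
    now.saturating_sub(t.written_at_secs) >= gc_grace_seconds && is_only_version
}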

Configuration:

let config = CompactionConfig {
    gc_grace_seconds: 864000, // 10 days
    ..Default::default()
};

Why Grace Period?

  • Allows time for repairs/backups
  • Prevents resurrection of deleted data
  • Handles distributed clock skew

Early Tombstone Deletion (ETD)

HeliosDB V2 includes early tombstone deletion for high-tombstone scenarios.

When Triggered:

if tombstone_percentage >= 0.30 { // 30% threshold
    enable_early_tombstone_deletion();
}

Process:

  1. Calculate tombstone percentage in compaction input
  2. If >30%, enable aggressive tombstone removal
  3. Remove tombstones older than gc_grace_seconds / 10
  4. Reduces space amplification by 40-60%
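
The effective grace period under ETD can be sketched as follows, using the 30% threshold and the grace/10 rule from the steps above (function and names are illustrative):

fn etd_grace_seconds(tombstones: u64, total_entries: u64, gc_grace_seconds: u64) -> u64 {
    let pct = tombstones as f64 / total_entries as f64;
    if pct >= 0.30 {
        gc_grace_seconds / 10 // aggressive: e.g. 10 days -> 1 day
    } else {
        gc_grace_seconds // normal GC path
    }
}

fn main() {
    assert_eq!(etd_grace_seconds(40, 100, 864_000), 86_400);  // ETD active
    assert_eq!(etd_grace_seconds(10, 100, 864_000), 864_000); // below threshold
}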

Benefits:

  • Faster space reclamation
  • Lower read amplification (fewer tombstones to skip)
  • Better for high-churn workloads

Configuration:

let manager = CompactionManagerV2::new(...);
// ETD is automatic when tombstone % > 30%

Trade-offs:

  • May delete tombstones before full grace period
  • Acceptable for most workloads
  • Disable for strict consistency requirements

Priority Scheduling

Compaction tasks are scheduled by priority to optimize performance.

Priority Calculation

pub struct CompactionPriority {
    pub priority: u8,        // Base priority (0-255)
    pub estimated_size: u64, // Size of compaction
    pub level: u8,           // Source level
    pub urgency_score: u64,  // Calculated urgency
}

// Computed as: urgency_score = calculate_urgency(level, size, age)

Priority Rules

  1. L0 Compactions: Highest priority (prevents write stalls)
  2. Large Compactions: Higher priority (clear backlog)
  3. Old Compactions: Higher priority (prevent accumulation)
  4. Deep Level Compactions: Lower priority (less impact)

Example

use heliosdb_storage::{CompactionPriority, CompactionTaskV2};

let task = CompactionTaskV2 {
    id: "compact-1".to_string(),
    priority: CompactionPriority {
        priority: 100,
        estimated_size: 200 * 1024 * 1024, // 200 MB
        level: 0,
        urgency_score: 500,
    },
    source_sstables: vec![...],
    target_level: 1,
    strategy: CompactionStrategyV2::LeveledParallel,
    enable_tombstone_gc: true,
    compression: CompressionAlgorithm::Snappy,
};

manager.submit_task(task)?;

Performance Optimization

1. Minimize Write Amplification

Strategies:

  • Use size-tiered or universal compaction
  • Increase SSTable size (fewer compactions)
  • Batch writes in larger memtables
  • Enable early tombstone deletion

Example:

let config = LsmTuningConfig {
    memtable_size_mb: 256,      // Larger memtables
    target_file_size_base: 128, // Larger SSTables
    compaction_style: 1,        // Universal compaction
    ..Default::default()
};

Result:

  • Write amplification: 10x → 3x (70% reduction)
  • Write throughput: +100%

2. Minimize Read Amplification

Strategies:

  • Use leveled compaction
  • Increase bloom filter size
  • Trigger compaction earlier (fewer L0 files)
  • Increase block cache

Example:

let config = LsmTuningConfig {
    level0_file_trigger: 2,                      // Compact L0 early
    bloom_bits_per_key: vec![16, 14, 12, 10, 8], // Large filters
    block_cache_mb: 2048,                        // Large cache
    compaction_style: 0,                         // Leveled
    ..Default::default()
};

Result:

  • Read amplification: 10x → 2x (80% reduction)
  • Read throughput: +150%

3. Optimize Space Utilization

Strategies:

  • Use compression (Zstd for maximum compression)
  • Enable early tombstone deletion
  • Use leveled compaction (better space efficiency)
  • Reduce GC grace period

Example:

let config = LsmTuningConfig {
    compression_per_level: vec![0, 2, 2, 2, 2, 2, 2], // Zstd
    compaction_style: 0, // Leveled
    ..Default::default()
};

let compaction_config = CompactionConfig {
    gc_grace_seconds: 86400, // 1 day (reduced from 10 days)
    ..Default::default()
};

Result:

  • Space amplification: 2.5x → 1.2x (52% reduction)
  • Disk savings: 40-60%

4. Balance Throughput and Latency

Use Adaptive Strategy:

let strategy = CompactionStrategyV2::Adaptive;

The adaptive strategy automatically switches between leveled and universal based on workload.

Benefits:

  • Optimal performance across workload changes
  • Reduced operational overhead
  • Best of both worlds

Monitoring and Metrics

Key Metrics

1. Compaction Metrics

let metrics = manager.metrics();
let snapshot = metrics.snapshot();
println!("Total Compactions: {}", snapshot.total_compactions);
println!("Bytes Read: {} MB", snapshot.bytes_read / (1024 * 1024));
println!("Bytes Written: {} MB", snapshot.bytes_written / (1024 * 1024));
println!("Tombstones Removed: {}", snapshot.tombstones_removed);
println!("Space Reclaimed: {} MB", snapshot.space_reclaimed / (1024 * 1024));

2. Amplification Factors

println!("Write Amplification: {:.2}x", snapshot.write_amplification());
println!("Space Efficiency: {:.2}%", snapshot.space_efficiency() * 100.0);

3. Performance Metrics

println!("Avg Compaction Time: {:.2}s", snapshot.avg_compaction_time_secs());
println!("Avg Throughput: {:.2} MB/s",
snapshot.avg_throughput as f64 / (1024.0 * 1024.0));
println!("Failed Compactions: {}", snapshot.failed_compactions);

Metrics Report

println!("{}", snapshot.format_report());

Output:

Compaction Metrics Report
=========================
Total Compactions: 1250
- Size-Tiered: 850
- Leveled: 400
- Failed: 5
Space Statistics:
- Bytes Read: 120 GB
- Bytes Written: 480 GB
- Space Reclaimed: 60 GB
- Space Efficiency: 50.00%
SSTable Statistics:
- Tables Merged: 5000
- Tables Created: 1250
- Tombstones Removed: 500000
- Duplicates Removed: 1500000
Performance:
- Total Compaction Time: 3600.00s
- Average Compaction Time: 2.88s
- Average Throughput: 133.33 MB/s
- Write Amplification: 4.00x
- Peak Memory Usage: 512.00 MB

Alerts and Thresholds

Set up monitoring for:

// `alert` stands in for your alerting hook, e.g. fn alert(msg: String).

// Write amplification too high
if snapshot.write_amplification() > 15.0 {
    alert(format!(
        "Write amplification excessive: {:.2}x",
        snapshot.write_amplification()
    ));
}

// Compaction falling behind
if manager.active_compactions() >= max_concurrent {
    alert("All compaction workers busy, backlog building".to_string());
}

// High failure rate
let failure_rate =
    snapshot.failed_compactions as f64 / snapshot.total_compactions as f64;
if failure_rate > 0.01 {
    alert(format!("Compaction failure rate: {:.2}%", failure_rate * 100.0));
}

// Space efficiency low (not reclaiming enough space)
if snapshot.space_efficiency() < 0.20 {
    alert(format!(
        "Low space reclamation: {:.2}%",
        snapshot.space_efficiency() * 100.0
    ));
}

Best Practices

1. Choose the Right Strategy

Decision Tree:

Is the workload write-heavy (>70% writes)?
  Yes → Size-Tiered or Universal
  No  → Is it read-heavy (>70% reads)?
          Yes → Leveled
          No  → Adaptive (let the system choose)
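
The same tree, expressed as a small chooser over the strategy enum used in earlier examples (the thresholds and the chooser itself are a sketch, not a HeliosDB API):

use heliosdb_storage::CompactionStrategyV2;

fn choose_strategy(write_fraction: f64) -> CompactionStrategyV2 {
    if write_fraction > 0.70 {
        CompactionStrategyV2::Universal // write-heavy
    } else if write_fraction < 0.30 {
        CompactionStrategyV2::LeveledParallel // read-heavy (>70% reads)
    } else {
        CompactionStrategyV2::Adaptive // mixed: let the system choose
    }
}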

2. Size SSTables Appropriately

Guidelines:

  • L0 files: 25-50 MB (fast to compact)
  • L1+ files: 64-128 MB (balance compaction cost)
  • Time-series: 128-256 MB (sequential writes)

3. Configure Parallel Workers

Formula:

max_concurrent = min(
    cpu_cores / 2,
    available_memory_gb / 2,
    io_bandwidth_gb_per_sec * 2
)

Example:

  • 16 cores, 32 GB RAM, 2 GB/s I/O
  • max_concurrent = min(8, 16, 4) = 4
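
As code, the sizing rule might look like this hypothetical helper:

fn max_concurrent_compactions(cpu_cores: u64, memory_gb: u64, io_gb_per_sec: u64) -> u64 {
    (cpu_cores / 2)
        .min(memory_gb / 2)
        .min(io_gb_per_sec * 2)
        .max(1) // always allow at least one worker
}

fn main() {
    // 16 cores, 32 GB RAM, 2 GB/s I/O => min(8, 16, 4) = 4
    assert_eq!(max_concurrent_compactions(16, 32, 2), 4);
}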

4. Use I/O Throttling in Production

Always enable for:

  • Multi-tenant systems
  • Shared storage
  • Cloud deployments
  • SSD storage (prevent wear)

5. Monitor Continuously

Essential metrics:

  • Compaction backlog (pending tasks)
  • Active compactions
  • Write/read/space amplification
  • Failure rate
  • L0 file count

6. Tune Based on Metrics

If write amplification high:

  • Switch to size-tiered/universal
  • Increase SSTable size
  • Reduce compaction frequency

If read amplification high:

  • Switch to leveled
  • Increase bloom filter size
  • Compact L0 more frequently

If space amplification high:

  • Reduce GC grace period
  • Enable early tombstone deletion
  • Trigger manual compaction

Conclusion

HeliosDB’s advanced compaction system provides:

  ✓ Multiple strategies for different workloads
  ✓ Parallel execution for high throughput
  ✓ I/O throttling for consistent performance
  ✓ Early tombstone deletion for space efficiency
  ✓ Priority scheduling for optimal resource usage

Key Takeaways:

  1. Choose strategy based on workload (leveled for reads, size-tiered for writes)
  2. Enable parallel compaction on multi-core systems
  3. Use I/O throttling to protect foreground performance
  4. Monitor metrics and adjust configuration as needed
  5. Use adaptive mode for changing workloads

By following this guide, you can achieve:

  • 40-60% reduction in write amplification
  • 2-3x improvement in compaction throughput
  • Consistent sub-millisecond foreground latencies
  • 50%+ space savings with compression and tombstone management

For more information, see the LSM Tuning Guide.