Compaction Strategy Guide for HeliosDB
Overview
Compaction is a critical background process in LSM-tree storage engines that merges SSTables to:
- Remove duplicate keys (keeping only the latest version)
- Delete tombstones for deleted keys
- Reorganize data into sorted levels
- Reclaim disk space
- Improve read performance
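To make the merge semantics concrete, here is a toy sketch (heavily simplified types, not HeliosDB code) that keeps only the newest version of each key and drops tombstones past a GC horizon:

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug)]
enum Value {
    Put(String),
    Tombstone,
}

/// Toy compaction merge: newest timestamp per key wins, and tombstones whose
/// timestamp is older than `gc_horizon` are dropped from the output.
fn merge_sstables(
    inputs: Vec<Vec<(String, u64, Value)>>,
    gc_horizon: u64,
) -> Vec<(String, u64, Value)> {
    let mut latest: BTreeMap<String, (u64, Value)> = BTreeMap::new();
    for run in inputs {
        for (key, ts, val) in run {
            match latest.get(&key) {
                // An equal-or-newer version is already kept: skip the older one.
                Some((existing_ts, _)) if *existing_ts >= ts => {}
                _ => {
                    latest.insert(key, (ts, val));
                }
            }
        }
    }
    latest
        .into_iter()
        // Drop tombstones that have passed the GC horizon.
        .filter(|(_, (ts, v))| !(matches!(v, Value::Tombstone) && *ts < gc_horizon))
        .map(|(k, (ts, v))| (k, ts, v))
        .collect()
}
```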
This guide covers HeliosDB’s advanced compaction strategies, including parallel execution, I/O throttling, and early tombstone deletion.
Table of Contents
- Compaction Strategies
- Parallel Compaction
- I/O Throttling
- Tombstone Management
- Priority Scheduling
- Performance Optimization
- Monitoring and Metrics
Compaction Strategies
1. Leveled Compaction Strategy (LCS)
How It Works:
- Data organized into levels (L0, L1, …, Ln)
- Each level is 10x larger than the previous level
- SSTables within a level don’t overlap (except L0)
- Compaction merges overlapping SSTables between adjacent levels
Level Structure:
```
L0: 100 MB   (4 files × 25 MB, may overlap)
L1: 1 GB     (40 files × 25 MB, non-overlapping)
L2: 10 GB    (400 files × 25 MB, non-overlapping)
L3: 100 GB   (4000 files × 25 MB, non-overlapping)
```
Pros:
- Excellent read performance (1-2 SSTables per read)
- Low space amplification (~1.1x)
- Predictable performance
Cons:
- High write amplification (8-10x)
- More I/O intensive
- Slower for write-heavy workloads
Best For:
- Read-heavy workloads
- Point lookups
- Random access patterns
- Production databases
Configuration:
```rust
use heliosdb_storage::{CompactionConfig, CompactionStrategy};

let config = CompactionConfig {
    strategy: CompactionStrategy::Leveled { max_level: 5 },
    min_sstables_for_compaction: 2,
    level0_size_threshold: 100 * 1024 * 1024, // 100 MB
    level_size_multiplier: 10,
    max_concurrent_compactions: 4,
    ..Default::default()
};
```
2. Size-Tiered Compaction Strategy (STCS)
How It Works:
- SSTables grouped into buckets by size
- Compaction triggered when bucket has 4+ similar-sized SSTables
- No strict level organization
- Merges SSTables of similar size together
Bucket Example:
```
Bucket 1: [10 MB, 12 MB, 11 MB, 10 MB]     → Compact to 43 MB
Bucket 2: [50 MB, 48 MB, 52 MB, 49 MB]     → Compact to 199 MB
Bucket 3: [200 MB, 210 MB, 195 MB, 205 MB] → Compact to 810 MB
```
Pros:
- Low write amplification (2-4x)
- Fast writes
- Good for time-series data
Cons:
- Higher read amplification (may scan many SSTables)
- Higher space amplification (~2x)
- Temporary disk space spikes during compaction
Best For:
- Write-heavy workloads
- Time-series data
- Log aggregation
- Metrics collection
Configuration:
```rust
let config = CompactionConfig {
    strategy: CompactionStrategy::SizeTiered,
    min_sstables_for_compaction: 4,
    level0_size_threshold: 100 * 1024 * 1024,
    max_concurrent_compactions: 4,
    ..Default::default()
};
```
3. Universal Compaction
How It Works:
- Optimized for time-series and append-only workloads
- Compacts entire sorted runs
- Uses size-ratio based triggering
- Minimizes write amplification
Compaction Trigger:
```
Trigger when: Size(Run N) / Size(Run N-1) > threshold (typically 1.2)
```
Pros:
- Lowest write amplification (2x)
- Excellent for sequential writes
- Minimal overhead
Cons:
- Can have temporary space amplification
- Not ideal for random updates
- Less predictable than leveled
Best For:
- Time-series databases
- Append-only workloads
- Event logging
- Sensor data
Configuration:
```rust
use heliosdb_storage::{CompactionStrategyV2, CompactionManagerV2};

let strategy = CompactionStrategyV2::Universal;
let manager = CompactionManagerV2::new(
    data_dir,
    strategy,
    sstables,
    tuner,
    max_concurrent_compactions,
);
```
Strategy Comparison
| Metric | Leveled | Size-Tiered | Universal |
|---|---|---|---|
| Write Amplification | 8-10x | 2-4x | 2x |
| Read Amplification | 1-2x | 5-10x | 10-20x |
| Space Amplification | 1.1x | 2x | 1.5-2x |
| Read Performance | Excellent | Good | Fair |
| Write Performance | Good | Excellent | Excellent |
| Space Efficiency | Excellent | Fair | Good |
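As a rough guide to turning these trade-offs into a choice, the helper below applies the decision tree from the Best Practices section, using the CompactionStrategyV2 variants shown elsewhere in this guide; the function and thresholds are illustrative, not part of the HeliosDB API:

```rust
use heliosdb_storage::CompactionStrategyV2;

/// Illustrative strategy picker based on the observed write fraction of the
/// workload; the 70% thresholds mirror the decision tree in Best Practices.
fn pick_strategy(write_fraction: f64) -> CompactionStrategyV2 {
    if write_fraction > 0.70 {
        // Write-heavy: favor low write amplification.
        CompactionStrategyV2::Universal
    } else if write_fraction < 0.30 {
        // Read-heavy: favor low read amplification.
        CompactionStrategyV2::LeveledParallel
    } else {
        // Mixed or shifting workload: let the system choose.
        CompactionStrategyV2::Adaptive
    }
}
```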
Parallel Compaction
HeliosDB supports parallel compaction execution to maximize throughput on multi-core systems.
How It Works
- Worker Pool: Multiple compaction workers run concurrently
- Priority Queue: Tasks scheduled by priority
- Semaphore: Limits concurrent compactions to prevent resource exhaustion
- Independent Execution: Non-overlapping compactions run in parallel
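Conceptually, the semaphore bounds how many workers run at once. Below is a minimal, self-contained sketch of that pattern using only the standard library; it is illustrative, not HeliosDB's internal worker pool or priority queue:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore (illustrative only).
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Self { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap();
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

fn main() {
    // Allow at most 4 compactions to run concurrently.
    let sem = Arc::new(Semaphore::new(4));
    let handles: Vec<_> = (0..10)
        .map(|task_id| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.acquire();
                println!("running compaction task {task_id}");
                // ... merge SSTables here ...
                sem.release();
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```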
Configuration
```rust
use heliosdb_storage::{CompactionManagerV2, AdaptiveLsmTuner, LsmTuningConfig};
use std::sync::Arc;

let tuner = Arc::new(AdaptiveLsmTuner::new(LsmTuningConfig::default()));
let manager = CompactionManagerV2::new(
    data_dir,
    CompactionStrategyV2::Adaptive,
    sstables,
    tuner,
    8, // Max 8 concurrent compactions
);
```
Worker Distribution
```
CPU Cores: 16
Recommendation: max_concurrent_compactions = cores / 2 = 8
```
Reasoning:
- Leave cores for foreground operations
- Each compaction uses 1-2 threads
- I/O bound, not CPU bound
Benefits
- 3-4x compaction throughput on multi-core systems
- Reduced compaction backlog
- Better foreground latency (less compaction debt)
- Improved write throughput
Task Scheduling
Tasks are prioritized by urgency score:
```
urgency_score = (level_priority * 100) + (size_factor * 10) + age_factor

Priority:
  L0 → L1: 500 (highest priority)
  L1 → L2: 400
  L2 → L3: 300
  ...
```
Example:
```
Task 1: L0 → L1, 200 MB, 5 minutes old
        urgency = 500 + 20 + 5 = 525

Task 2: L3 → L4, 1 GB, 2 minutes old
        urgency = 200 + 100 + 2 = 302
```
Task 1 executes first (higher urgency).
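A minimal sketch of this calculation, with the size and age scaling inferred from the worked examples above (HeliosDB's actual implementation may differ):

```rust
/// Illustrative urgency calculation matching the worked examples above.
fn calculate_urgency(source_level: u8, size_bytes: u64, age_minutes: u64, max_level: u8) -> u64 {
    // Lower source levels are more urgent: L0 scores highest.
    let level_priority = max_level.saturating_sub(source_level) as u64; // 5 for L0 when max_level = 5
    let size_factor = size_bytes / (100 * 1024 * 1024); // one point per 100 MB
    let age_factor = age_minutes;                       // one point per minute waiting
    level_priority * 100 + size_factor * 10 + age_factor
}

// Task 1: L0 → L1, 200 MB, 5 min old => 5*100 + 2*10 + 5 = 525
// Task 2: L3 → L4, 1 GB,   2 min old => 2*100 + 10*10 + 2 = 302
```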
I/O Throttling
Compaction can overwhelm I/O bandwidth, affecting foreground performance. HeliosDB includes adaptive I/O throttling.
How It Works
Token Bucket Algorithm:
- Tokens represent I/O bytes
- Tokens refill at configured rate
- Compaction consumes tokens before I/O
- If tokens exhausted, compaction waits
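A minimal, self-contained sketch of the token-bucket pattern described above (illustrative only; the actual throttle is configured via IoThrottleConfig below):

```rust
use std::time::{Duration, Instant};

/// Token-bucket sketch (not HeliosDB's internal type).
struct TokenBucket {
    capacity: f64,        // maximum burst, in bytes
    tokens: f64,          // currently available bytes
    refill_per_sec: f64,  // bytes added back per second
    last_refill: Instant,
}

impl TokenBucket {
    fn new(bytes_per_sec: u64) -> Self {
        Self {
            capacity: bytes_per_sec as f64,
            tokens: bytes_per_sec as f64,
            refill_per_sec: bytes_per_sec as f64,
            last_refill: Instant::now(),
        }
    }

    /// Reserve `bytes` of I/O budget; returns how long to wait if throttled.
    fn acquire(&mut self, bytes: u64) -> Option<Duration> {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= bytes as f64 {
            self.tokens -= bytes as f64; // enough budget: proceed immediately
            None
        } else {
            let deficit = bytes as f64 - self.tokens;
            Some(Duration::from_secs_f64(deficit / self.refill_per_sec))
        }
    }
}
```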
Configuration
```rust
use heliosdb_storage::IoThrottleConfig;

let config = IoThrottleConfig {
    max_read_bytes_per_sec: 100 * 1024 * 1024,  // 100 MB/s
    max_write_bytes_per_sec: 100 * 1024 * 1024, // 100 MB/s
    adaptive: true, // Adjust based on load
};
```
Adaptive Throttling
When adaptive: true, throttling adjusts based on:
- Foreground operation latency
- System I/O utilization
- Compaction backlog
Rules:
- If foreground latency <1ms: Allow full compaction bandwidth
- If foreground latency 1-5ms: Reduce compaction to 75%
- If foreground latency 5-10ms: Reduce compaction to 50%
- If foreground latency >10ms: Reduce compaction to 25%
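For illustration, these rules map directly to a bandwidth fraction (a sketch assuming the thresholds above; not HeliosDB's exact code):

```rust
/// Maps observed foreground latency to the fraction of compaction bandwidth
/// allowed, following the documented rules.
fn compaction_bandwidth_fraction(foreground_latency_ms: f64) -> f64 {
    match foreground_latency_ms {
        l if l < 1.0 => 1.00, // full compaction bandwidth
        l if l < 5.0 => 0.75,
        l if l < 10.0 => 0.50,
        _ => 0.25,
    }
}
```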
Benefits
- Consistent foreground performance
- Prevents compaction from starving reads/writes
- Better multi-tenant resource sharing
- Reduced tail latencies
Tombstone Management
Deletes in LSM trees use tombstones (markers indicating deletion). HeliosDB includes advanced tombstone management.
Standard Tombstone GC
Process:
- Tombstone written on delete
- Tombstone propagates through levels during compaction
- After gc_grace_seconds, the tombstone becomes eligible for removal
- The tombstone is removed only if it is the sole remaining version of the key
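A hedged sketch of that removal check, with illustrative names (not HeliosDB's internal API):

```rust
/// Sketch of the standard tombstone GC check described above.
fn tombstone_removable(
    tombstone_age_secs: u64,
    gc_grace_seconds: u64,
    is_only_version_of_key: bool,
) -> bool {
    // Both conditions must hold: the grace period has elapsed, and no other
    // version of the key remains that the tombstone still needs to shadow.
    tombstone_age_secs >= gc_grace_seconds && is_only_version_of_key
}
```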
Configuration:
```rust
let config = CompactionConfig {
    gc_grace_seconds: 864000, // 10 days
    ..Default::default()
};
```
Why Grace Period?
- Allows time for repairs/backups
- Prevents resurrection of deleted data
- Handles distributed clock skew
Early Tombstone Deletion (ETD)
HeliosDB V2 includes early tombstone deletion for high-tombstone scenarios.
When Triggered:
```rust
if tombstone_percentage >= 0.30 { // 30% threshold
    enable_early_tombstone_deletion();
}
```
Process:
- Calculate tombstone percentage in compaction input
- If >30%, enable aggressive tombstone removal
- Remove tombstones older than gc_grace_seconds / 10
- Reduces space amplification by 40-60%
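A hedged sketch of how the effective grace period could be derived under these rules (names and structure are illustrative):

```rust
/// Sketch of the early-tombstone-deletion decision described above.
fn effective_grace(gc_grace_seconds: u64, tombstones: u64, total_entries: u64) -> u64 {
    let tombstone_pct = tombstones as f64 / total_entries.max(1) as f64;
    if tombstone_pct >= 0.30 {
        gc_grace_seconds / 10 // aggressive: 1/10th of the normal grace period
    } else {
        gc_grace_seconds // standard GC behaviour
    }
}
```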
Benefits:
- Faster space reclamation
- Lower read amplification (fewer tombstones to skip)
- Better for high-churn workloads
Configuration:
```rust
let manager = CompactionManagerV2::new(...);
// ETD is automatic when tombstone % > 30%
```
Trade-offs:
- May delete tombstones before full grace period
- Acceptable for most workloads
- Disable for strict consistency requirements
Priority Scheduling
Compaction tasks are scheduled by priority to optimize performance.
Priority Calculation
```rust
pub struct CompactionPriority {
    pub priority: u8,        // Base priority (0-255)
    pub estimated_size: u64, // Size of compaction
    pub level: u8,           // Source level
    pub urgency_score: u64,  // Calculated urgency
}

urgency_score = calculate_urgency(level, size, age);
```
Priority Rules
- L0 Compactions: Highest priority (prevents write stalls)
- Large Compactions: Higher priority (clear backlog)
- Old Compactions: Higher priority (prevent accumulation)
- Deep Level Compactions: Lower priority (less impact)
Example
```rust
use heliosdb_storage::{CompactionPriority, CompactionTaskV2};

let task = CompactionTaskV2 {
    id: "compact-1".to_string(),
    priority: CompactionPriority {
        priority: 100,
        estimated_size: 200 * 1024 * 1024, // 200 MB
        level: 0,
        urgency_score: 500,
    },
    source_sstables: vec![...],
    target_level: 1,
    strategy: CompactionStrategyV2::LeveledParallel,
    enable_tombstone_gc: true,
    compression: CompressionAlgorithm::Snappy,
};

manager.submit_task(task)?;
```
Performance Optimization
1. Minimize Write Amplification
Strategies:
- Use size-tiered or universal compaction
- Increase SSTable size (fewer compactions)
- Batch writes in larger memtables
- Enable early tombstone deletion
Example:
```rust
let config = LsmTuningConfig {
    memtable_size_mb: 256,      // Larger memtables
    target_file_size_base: 128, // Larger SSTables
    compaction_style: 1,        // Universal compaction
    ..Default::default()
};
```
Result:
- Write amplification: 10x → 3x (70% reduction)
- Write throughput: +100%
2. Minimize Read Amplification
Strategies:
- Use leveled compaction
- Increase bloom filter size
- Trigger compaction earlier (fewer L0 files)
- Increase block cache
Example:
```rust
let config = LsmTuningConfig {
    level0_file_trigger: 2,                      // Compact L0 early
    bloom_bits_per_key: vec![16, 14, 12, 10, 8], // Large filters
    block_cache_mb: 2048,                        // Large cache
    compaction_style: 0,                         // Leveled
    ..Default::default()
};
```
Result:
- Read amplification: 10x → 2x (80% reduction)
- Read throughput: +150%
3. Optimize Space Utilization
Strategies:
- Use compression (Zstd for maximum compression)
- Enable early tombstone deletion
- Use leveled compaction (better space efficiency)
- Reduce GC grace period
Example:
```rust
let config = LsmTuningConfig {
    compression_per_level: vec![0, 2, 2, 2, 2, 2, 2], // Zstd
    compaction_style: 0,                              // Leveled
    ..Default::default()
};

let compaction_config = CompactionConfig {
    gc_grace_seconds: 86400, // 1 day (reduced from 10)
    ..Default::default()
};
```
Result:
- Space amplification: 2.5x → 1.2x (52% reduction)
- Disk savings: 40-60%
4. Balance Throughput and Latency
Use Adaptive Strategy:
```rust
let strategy = CompactionStrategyV2::Adaptive;
```
The adaptive strategy automatically switches between leveled and universal compaction based on the observed workload.
Benefits:
- Optimal performance across workload changes
- Reduced operational overhead
- Best of both worlds
Monitoring and Metrics
Key Metrics
1. Compaction Metrics
```rust
let metrics = manager.metrics();
let snapshot = metrics.snapshot();

println!("Total Compactions: {}", snapshot.total_compactions);
println!("Bytes Read: {} MB", snapshot.bytes_read / (1024 * 1024));
println!("Bytes Written: {} MB", snapshot.bytes_written / (1024 * 1024));
println!("Tombstones Removed: {}", snapshot.tombstones_removed);
println!("Space Reclaimed: {} MB", snapshot.space_reclaimed / (1024 * 1024));
```
2. Amplification Factors
```rust
println!("Write Amplification: {:.2}x", snapshot.write_amplification());
println!("Space Efficiency: {:.2}%", snapshot.space_efficiency() * 100.0);
```
3. Performance Metrics
```rust
println!("Avg Compaction Time: {:.2}s", snapshot.avg_compaction_time_secs());
println!("Avg Throughput: {:.2} MB/s", snapshot.avg_throughput as f64 / (1024.0 * 1024.0));
println!("Failed Compactions: {}", snapshot.failed_compactions);
```
Metrics Report
println!("{}", snapshot.format_report());Output:
```
Compaction Metrics Report
=========================
Total Compactions: 1250
  - Size-Tiered: 850
  - Leveled: 400
  - Failed: 5

Space Statistics:
  - Bytes Read: 120 GB
  - Bytes Written: 480 GB
  - Space Reclaimed: 60 GB
  - Space Efficiency: 50.00%

SSTable Statistics:
  - Tables Merged: 5000
  - Tables Created: 1250
  - Tombstones Removed: 500000
  - Duplicates Removed: 1500000

Performance:
  - Total Compaction Time: 3600.00s
  - Average Compaction Time: 2.88s
  - Average Throughput: 133.33 MB/s
  - Write Amplification: 4.00x
  - Peak Memory Usage: 512.00 MB
```
Alerts and Thresholds
Set up monitoring for:
```rust
// Write amplification too high
if snapshot.write_amplification() > 15.0 {
    alert("Write amplification excessive: {:.2}x", snapshot.write_amplification());
}

// Compaction falling behind
if manager.active_compactions() >= max_concurrent {
    alert("All compaction workers busy, backlog building");
}

// High failure rate
let failure_rate = snapshot.failed_compactions as f64 / snapshot.total_compactions as f64;
if failure_rate > 0.01 {
    alert("Compaction failure rate: {:.2}%", failure_rate * 100.0);
}

// Space efficiency low (not reclaiming enough space)
if snapshot.space_efficiency() < 0.20 {
    alert("Low space reclamation: {:.2}%", snapshot.space_efficiency() * 100.0);
}
```
Best Practices
1. Choose the Right Strategy
Decision Tree:
```
Is the workload write-heavy (>70% writes)?
  Yes → Size-Tiered or Universal
  No  → Is it read-heavy (>70% reads)?
          Yes → Leveled
          No  → Adaptive (let the system choose)
```
2. Size SSTables Appropriately
Guidelines:
- L0 files: 25-50 MB (fast to compact)
- L1+ files: 64-128 MB (balance compaction cost)
- Time-series: 128-256 MB (sequential writes)
3. Configure Parallel Workers
Formula:
```
max_concurrent = min(
    cpu_cores / 2,
    available_memory_gb / 2,
    io_bandwidth_gb_per_sec * 2
)
```
Example:
- 16 cores, 32 GB RAM, 2 GB/s I/O
- max_concurrent = min(8, 16, 4) = 4
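A small sketch of this sizing heuristic (the function name is illustrative; the inputs are values you would measure on your own hardware):

```rust
/// Sketch of the worker-sizing formula above.
fn recommended_concurrent_compactions(cpu_cores: u64, memory_gb: u64, io_gb_per_sec: u64) -> u64 {
    (cpu_cores / 2)
        .min(memory_gb / 2)
        .min(io_gb_per_sec * 2)
        .max(1) // never drop below one worker
}

// 16 cores, 32 GB RAM, 2 GB/s I/O => min(8, 16, 4) = 4
```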
4. Use I/O Throttling in Production
Always enable for:
- Multi-tenant systems
- Shared storage
- Cloud deployments
- SSD storage (prevent wear)
5. Monitor Continuously
Essential metrics:
- Compaction backlog (pending tasks)
- Active compactions
- Write/read/space amplification
- Failure rate
- L0 file count
6. Tune Based on Metrics
If write amplification high:
- Switch to size-tiered/universal
- Increase SSTable size
- Reduce compaction frequency
If read amplification high:
- Switch to leveled
- Increase bloom filter size
- Compact L0 more frequently
If space amplification high:
- Reduce GC grace period
- Enable early tombstone deletion
- Trigger manual compaction
Conclusion
HeliosDB’s advanced compaction system provides:
✓ Multiple strategies for different workloads
✓ Parallel execution for high throughput
✓ I/O throttling for consistent performance
✓ Early tombstone deletion for space efficiency
✓ Priority scheduling for optimal resource usage
Key Takeaways:
- Choose strategy based on workload (leveled for reads, size-tiered for writes)
- Enable parallel compaction on multi-core systems
- Use I/O throttling to protect foreground performance
- Monitor metrics and adjust configuration as needed
- Use adaptive mode for changing workloads
By following this guide, you can achieve:
- 40-60% reduction in write amplification
- 2-3x improvement in compaction throughput
- Consistent sub-millisecond foreground latencies
- 50%+ space savings with compression and tombstone management
For more information, see the LSM Tuning Guide.