Performance Tracing & Bottleneck Analysis

HeliosDB Full ships with a built-in, zero-cost-when-disabled performance tracing system that instruments every layer of the database — from SQL parsing through storage I/O. Use it to pinpoint bottlenecks during performance tests, diagnose slow queries, and generate production-grade reports without external tooling.

Key Differences from Nano/Lite

| Capability | Nano | Lite | Full |
|---|---|---|---|
| Granularity | Per-query phases | Per-query phases | Per-operation across 8 components |
| Span count | ~8 phase types | ~6 phase types | 57 instrumented operations |
| Overhead when disabled | Branch check per query | Option<QueryTrace> check | Single atomic load (AtomicBool) |
| Overhead when enabled | ~1-2us per query | ~1-2us per query | ~100ns per span (lock-free) |
| Percentiles | None | avg/p50/p95/p99 | min/avg/p50/p95/p99 per operation |
| Hierarchical spans | No | No | Yes (parent-child relationships) |
| Component breakdown | No | No | Yes (8 components, % of session) |
| Output formats | Structured log events | Text report | Text report + JSON + programmatic API |
| Thread safety | Single-threaded | Single-threaded | Lock-free DashMap + thread-local stacks |
| SQL interface | RUST_LOG env var | SET helios.trace_queries | Programmatic API (see below) |

Architecture

PerfTracer (global singleton)
├── enabled: AtomicBool       ← zero-cost gate
├── next_span_id: AtomicU64   ← lock-free ID generator
├── completed: DashMap        ← all finished spans
└── aggregated: DashMap       ← running per-op statistics

Thread 1               Thread 2               Thread 3
┌─────────────┐        ┌─────────────┐        ┌─────────────┐
│ SPAN_STACK  │        │ SPAN_STACK  │        │ SPAN_STACK  │
│ (thread-    │        │ (thread-    │        │ (thread-    │
│  local)     │        │  local)     │        │  local)     │
└──────┬──────┘        └──────┬──────┘        └──────┬──────┘
       │ drop guard          │ drop guard          │ drop guard
       └──────────────────────┴──────────────────────┘
                             │
                  ┌──────────▼──────────┐
                  │  DashMap shard N    │ ← lock-free aggregation
                  └─────────────────────┘

Zero-cost when disabled: Every perf_span! macro call starts with self.enabled.load(Ordering::Acquire). When tracing is off, this is a single CPU instruction (no branch misprediction after warmup) and the RAII guard stores span_id = 0, making Drop a no-op.

Lock-free when enabled: Each thread maintains its own span stack (thread_local!). On span completion, timing data is inserted into a DashMap (sharded concurrent hash map) — no global mutex, no contention between threads.
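
To make the pattern concrete, here is a minimal, self-contained sketch of the same gate-plus-guard idea (simplified names and a static flag; the real PerfTracer also assigns span IDs, tracks parent spans, and records into the sharded DashMap):

use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Instant;

static ENABLED: AtomicBool = AtomicBool::new(false);

// RAII guard: records elapsed time on drop, but only if it was armed.
struct SpanGuard {
    start: Option<Instant>, // None = tracing was disabled at creation
}

fn begin_span() -> SpanGuard {
    // The zero-cost gate: one atomic load when tracing is off.
    if ENABLED.load(Ordering::Acquire) {
        SpanGuard { start: Some(Instant::now()) }
    } else {
        SpanGuard { start: None }
    }
}

impl Drop for SpanGuard {
    fn drop(&mut self) {
        if let Some(start) = self.start {
            // The real tracer pushes this duration into the lock-free DashMap.
            println!("span took {:?}", start.elapsed());
        }
    }
}

fn main() {
    ENABLED.store(true, Ordering::Release);
    let _span = begin_span(); // recorded when it drops at end of scope
}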


Quick Start

use heliosdb_common::perf_tracer::{PerfTracer, Component};
use heliosdb_common::perf_span;
// 1. Enable tracing (resets previous data, starts session clock)
PerfTracer::global().enable();
// 2. Run your workload
// All instrumented operations automatically record timing spans.
// (execute queries, run benchmarks, etc.)
// 3. Generate the bottleneck report
let report = PerfTracer::global().generate_report();
// Human-readable text report
println!("{}", report.to_text());
// Machine-readable JSON (for dashboards, CI pipelines)
let json = report.to_json();
println!("{}", serde_json::to_string_pretty(&json).unwrap());
// 4. Disable tracing (returns to zero overhead)
PerfTracer::global().disable();

One-Liner for Quick Profiling

// Wrap any code block with enable/disable
PerfTracer::global().enable();
run_benchmark();
let report = PerfTracer::global().generate_report();
println!("{}", report.to_text());
PerfTracer::global().disable();

Instrumented Operations

Every operation below is automatically traced when the PerfTracer is enabled. No code changes needed — the spans are already wired into the hot paths.

Protocol Layer (10 spans)

| Operation | Description | File |
|---|---|---|
| sql_parse | SQL text to AST | heliosdb-protocols/src/postgres/sql_parser.rs |
| handle_query | Full query lifecycle (top-level) | heliosdb-protocols/src/postgres/handler.rs |
| execute_select | SELECT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_insert | INSERT execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_update | UPDATE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_delete | DELETE execution | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_table | DDL CREATE TABLE | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_function | DDL CREATE FUNCTION | heliosdb-protocols/src/postgres/handler.rs |
| execute_call_procedure | Stored procedure call | heliosdb-protocols/src/postgres/handler.rs |
| execute_create_trigger | DDL CREATE TRIGGER | heliosdb-protocols/src/postgres/handler.rs |

Compute Layer (9 spans)

| Operation | Description | File |
|---|---|---|
| execute_plan | Full query plan execution | heliosdb-compute/src/executor/query_executor.rs |
| table_scan | Sequential table scan | heliosdb-compute/src/executor/query_executor.rs |
| index_scan | Index-based lookup | heliosdb-compute/src/executor/query_executor.rs |
| filter | Predicate evaluation | heliosdb-compute/src/executor/query_executor.rs |
| project | Column projection | heliosdb-compute/src/executor/query_executor.rs |
| hash_join | Hash join build + probe | heliosdb-compute/src/executor/query_executor.rs |
| aggregate | GROUP BY / aggregation | heliosdb-compute/src/executor/query_executor.rs |
| sort | ORDER BY sort | heliosdb-compute/src/executor/query_executor.rs |
| window_function | Window function evaluation | heliosdb-compute/src/executor/query_executor.rs |

Optimizer Layer (2 spans)

| Operation | Description | File |
|---|---|---|
| optimize_query | Full optimization pass | heliosdb-compute/src/optimizer_legacy.rs |
| cost_estimation | Cost model evaluation | heliosdb-compute/src/optimizer_legacy.rs |

Parallel Execution (2 spans)

| Operation | Description | File |
|---|---|---|
| parallel_execute | Work-stealing parallel execution | heliosdb-compute/src/parallel_query.rs |
| partition_data | Data partitioning for parallelism | heliosdb-compute/src/parallel_query.rs |

Storage Layer (10 spans)

| Operation | Description | File |
|---|---|---|
| lsm_put | LSM tree write (memtable insert) | heliosdb-storage/src/lsm.rs |
| lsm_get | LSM tree point read | heliosdb-storage/src/lsm.rs |
| lsm_delete | LSM tree delete (tombstone write) | heliosdb-storage/src/lsm.rs |
| lsm_scan | LSM tree range scan (k-way merge) | heliosdb-storage/src/lsm.rs |
| wal_write | Write-ahead log entry | heliosdb-storage/src/lsm.rs |
| memtable_flush | Memtable flush to SSTable | heliosdb-storage/src/lsm.rs |
| compaction | SSTable level compaction | heliosdb-storage/src/lsm.rs |
| sstable_write | SSTable file creation | heliosdb-storage/src/sstable.rs |
| sstable_get | SSTable point lookup (bloom + index) | heliosdb-storage/src/sstable.rs |
| sstable_scan | SSTable range scan | heliosdb-storage/src/sstable.rs |

Cluster Layer (7 spans)

| Operation | Description | File |
|---|---|---|
| raft_append | Raft log append | heliosdb-cluster/src/consensus/mod.rs, log.rs |
| raft_election | Raft leader election | heliosdb-cluster/src/consensus/mod.rs, raft_node.rs |
| raft_commit | Raft entry commit | heliosdb-cluster/src/consensus/mod.rs |
| cache_fusion_lookup | Cache fusion remote lookup | heliosdb-cluster/src/cache_fusion.rs |
| cache_fusion_transfer | Cache fusion block transfer | heliosdb-cluster/src/cache_fusion.rs |
| distributed_scan | Distributed smart scan coordination | heliosdb-storage/src/distributed_smart_scan.rs |

Metadata Layer (2 operation types, 8 spans)

| Operation | Description | File |
|---|---|---|
| schema_lookup | Schema/catalog resolution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |
| ddl_execute | DDL operation execution | heliosdb-metadata/src/schema_service.rs, raft_service.rs |

Understanding the Report

Sample Output

+======================================================================+
|              HeliosDB Performance Bottleneck Report                  |
+======================================================================+
Session Duration: 12.45s | Total Spans: 45,230

-- Component Breakdown ----------------------------------------------------
Component      Total Time      Spans    % Sess
--------------------------------------------------------------
Storage             8.92s     22,100     71.6%
Compute             2.34s     11,500     18.8%
Protocol            0.89s      5,600      7.1%
Optimizer           0.21s      5,600      1.7%
Metadata            0.09s        430      0.7%

-- Top Bottlenecks (by total time) ----------------------------------------
Component   Operation          Total   Count      Avg      P99   % Sess
------------------------------------------------------------------------------------------
Storage     lsm_scan           5.12s    2800    1.8ms   12.3ms    41.1%
Storage     sstable_scan       2.45s    8400    291us    1.2ms    19.7%
Compute     hash_join          1.23s     450    2.7ms   15.1ms     9.9%
Compute     sort               0.67s     320    2.1ms    8.4ms     5.4%
Protocol    execute_select     0.56s    2800    200us    1.1ms     4.5%
Storage     lsm_put            0.45s    5000     90us    450us     3.6%
Storage     memtable_flush     0.38s      12   31.7ms   45.2ms     3.1%
Compute     aggregate          0.34s     280    1.2ms    4.5ms     2.7%
Optimizer   optimize_query     0.21s    5600     38us    120us     1.7%
Protocol    sql_parse          0.18s    5600     32us     95us     1.4%

-- All Operations ---------------------------------------------------------
Component   Operation          Count    Total      Min      Avg      P95      P99
---------------------------------------------------------------------------------------------------------
Storage     lsm_scan            2800    5.12s    120us    1.8ms    8.5ms   12.3ms
Storage     sstable_scan        8400    2.45s     15us    291us    850us    1.2ms
...

How to Read It

Component Breakdown shows where time is spent at the macro level. In the example above, 71.6% of traced time is in Storage — typical for I/O-bound workloads. If Optimizer or Protocol dominate, investigate complex queries or parsing overhead.

Top Bottlenecks ranks individual operations by total cumulative time. The most impactful optimization target is the #1 bottleneck. Key metrics:

  • Total: Cumulative wall time across all invocations. The primary ranking metric.
  • Count: Number of times this operation was called. High count with low avg = many small operations. Low count with high avg = few expensive operations.
  • Avg: Average duration per invocation.
  • P99: 99th percentile — the “worst case” that 1% of invocations exceed. If P99 >> Avg, you have tail latency outliers.
  • % Sess: Percentage of the total session duration. Tells you the maximum speedup you’d get by eliminating this bottleneck entirely (Amdahl’s law; see the worked example after this list).
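
As a worked example of the % Sess column: lsm_scan at 41.1% of the session caps the whole-session speedup from eliminating it at 1 / (1 - 0.411) ≈ 1.70x. A tiny sketch (max_speedup is an illustrative helper, not part of the API):

// Amdahl's law: a bottleneck taking fraction f of the session bounds
// the overall speedup from removing it at 1 / (1 - f).
fn max_speedup(pct_of_session: f64) -> f64 {
    1.0 / (1.0 - pct_of_session / 100.0)
}

fn main() {
    println!("{:.2}x", max_speedup(41.1)); // prints 1.70x
}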

All Operations provides the full statistical breakdown for every traced operation.


Performance Testing Workflow

1. Baseline Measurement

use heliosdb_common::perf_tracer::PerfTracer;
// Reset and enable
PerfTracer::global().enable();
// Run baseline workload
run_oltp_benchmark(1000); // e.g., 1000 transactions
// Capture baseline report
let baseline = PerfTracer::global().generate_report();
let baseline_json = baseline.to_json();
println!("=== BASELINE ===\n{}", baseline.to_text());
PerfTracer::global().disable();

2. Make Changes

// Add an index, tune a config parameter, change a query, etc.

3. Compare

PerfTracer::global().enable();
run_oltp_benchmark(1000); // same workload
let after = PerfTracer::global().generate_report();
println!("=== AFTER CHANGE ===\n{}", after.to_text());
PerfTracer::global().disable();
// Programmatic comparison
let b = &baseline_json["bottlenecks"][0];
let a = &after.to_json()["bottlenecks"][0];
println!(
    "Top bottleneck: {} -> {} (was {}us, now {}us)",
    b["operation"], a["operation"],
    b["avg_duration_us"], a["avg_duration_us"],
);

4. Drill Into Specific Operations

let report = PerfTracer::global().generate_report();
// Find all storage operations slower than 1ms average
for op in &report.operations {
    if op.component == heliosdb_common::perf_tracer::Component::Storage
        && op.avg_duration > std::time::Duration::from_millis(1)
    {
        println!(
            "SLOW: {} — avg {:?}, p99 {:?}, count {}",
            op.operation, op.avg_duration, op.p99_duration, op.count
        );
    }
}

Adding Custom Spans

Instrument your own code with the perf_span! macro. Spans nest automatically via the thread-local stack.

use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;
fn process_batch(batch: &[Row]) {
    // Simple span (operation name only)
    let _span = perf_span!(Component::Compute, "process_batch");
    for row in batch {
        // Span with key-value attributes (for debugging, not aggregation)
        let _row_span = perf_span!(
            Component::Compute, "process_row",
            "batch_size" => batch.len(),
            "row_key" => row.key
        );
        // ... processing ...
        // _row_span records timing on drop
    }
    // _span records timing on drop (includes all child spans)
}

Span Nesting

Spans automatically form parent-child relationships within each thread:

{
    let _outer = perf_span!(Component::Protocol, "handle_query");
    {
        let _parse = perf_span!(Component::Protocol, "sql_parse");
        // parse_span.parent_id == outer_span.span_id
    }
    {
        let _optimize = perf_span!(Component::Optimizer, "optimize_query");
        // optimize_span.parent_id == outer_span.span_id
    }
    {
        let _exec = perf_span!(Component::Compute, "execute_plan");
        {
            let _scan = perf_span!(Component::Storage, "lsm_scan");
            // scan_span.parent_id == exec_span.span_id
            // scan_span.depth == 2
        }
    }
}

Component Selection

Choose the component that owns the operation:

| Component | Use For |
|---|---|
| Component::Storage | Anything touching disk, memtable, WAL, SSTables, compaction |
| Component::Compute | Query execution, joins, sorts, aggregation, expression evaluation |
| Component::Optimizer | Query planning, cost estimation, rule application |
| Component::Protocol | SQL parsing, wire protocol handling, result serialization |
| Component::Network | Connection management, transport, RPC |
| Component::Cluster | Raft consensus, replication, distributed coordination |
| Component::Cache | Block cache, query cache, tiered cache lookups |
| Component::Metadata | Schema catalog, DDL operations |
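
For example, a custom block-prefetch routine (prefetch_blocks is a hypothetical name) is owned by the cache layer, so it would be tagged Component::Cache:

use heliosdb_common::perf_tracer::Component;
use heliosdb_common::perf_span;

fn prefetch_blocks() {
    // Aggregates under Cache in the component breakdown
    let _span = perf_span!(Component::Cache, "prefetch_blocks");
    // ... read-ahead into the block cache ...
}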

JSON Output for CI/Dashboards

The JSON report format is designed for automated analysis:

{
  "session_duration_us": 12450000,
  "total_spans": 45230,
  "components": [
    {
      "component": "Storage",
      "total_duration_us": 8920000,
      "span_count": 22100,
      "pct_of_session": 71.6
    }
  ],
  "bottlenecks": [
    {
      "component": "Storage",
      "operation": "lsm_scan",
      "total_duration_us": 5120000,
      "count": 2800,
      "avg_duration_us": 1828,
      "p99_duration_us": 12300,
      "pct_of_session": 41.1
    }
  ],
  "operations": [
    {
      "component": "Storage",
      "operation": "lsm_scan",
      "count": 2800,
      "total_duration_us": 5120000,
      "min_duration_us": 120,
      "avg_duration_us": 1828,
      "p50_duration_us": 1450,
      "p95_duration_us": 8500,
      "p99_duration_us": 12300
    }
  ]
}

CI Integration Example

let report = PerfTracer::global().generate_report();
let json = report.to_json();
// Fail CI if any operation's p99 exceeds threshold
for op in json["operations"].as_array().unwrap() {
    let p99 = op["p99_duration_us"].as_u64().unwrap();
    let name = op["operation"].as_str().unwrap();
    if p99 > 50_000 { // 50ms threshold
        panic!("PERF REGRESSION: {} p99 = {}us (threshold: 50000us)", name, p99);
    }
}

Interpreting Common Patterns

| Pattern | Likely Cause | Action |
|---|---|---|
| Storage > 70% of session | I/O-bound workload | Check lsm_scan and sstable_scan counts; add indexes; tune compaction |
| lsm_scan P99 >> Avg | Tail latency from scanning many SSTables | Run compaction; check level count |
| memtable_flush high avg | Large memtable or slow disk | Increase flush parallelism; check I/O bandwidth |
| compaction high total | Background compaction stealing I/O | Tune compaction rate limiter; schedule off-peak |
| hash_join dominates Compute | Large hash tables for joins | Check join order; add join indexes; increase memory |
| sort high total | Many ORDER BY without index | Add covering indexes for sorted access patterns |
| optimize_query > 10% session | Complex queries trigger expensive planning | Simplify queries; use prepared statements |
| sql_parse high total | Very long SQL strings or many distinct queries | Use parameterized queries; batch operations |
| wal_write P99 high | WAL fsync latency spikes | Check disk IOPS; enable group commit (DirectIoWal) |
| raft_append high in cluster | Consensus latency between nodes | Check network latency; reduce replication factor for reads |
| cache_fusion_transfer high | Remote cache misses | Co-locate hot data; increase local cache size |
| Optimizer cost low but Compute high | Good plans but slow execution | Focus on storage I/O and compute optimization |
| Protocol > 20% session | Parsing or serialization overhead | Use prepared statements; reduce result set size |

Overhead Characteristics

Disabled (Default) — Zero Cost

perf_span!() call:
  1. AtomicBool::load(Acquire)      ~1ns   (single CPU instruction)
  2. Branch: span_id == 0 → skip    ~0ns   (predicted after warmup)
  3. Drop: span_id == 0 → no-op     ~0ns

Total overhead per span: ~1ns (unmeasurable in practice)

The compiler inlines the #[inline(always)] is_enabled() check. After branch predictor warmup, the disabled path adds zero measurable overhead — verified by benchmarking with and without instrumentation.
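
A rough way to check the disabled path yourself, as a sketch rather than a rigorous benchmark (the compiler may optimize parts of the loop away, so treat the result as an upper bound on real overhead):

use std::time::Instant;
use heliosdb_common::perf_tracer::{PerfTracer, Component};
use heliosdb_common::perf_span;

fn main() {
    PerfTracer::global().disable(); // spans become no-ops
    const N: u32 = 10_000_000;
    let start = Instant::now();
    for _ in 0..N {
        let _span = perf_span!(Component::Compute, "noop_probe");
    }
    // Average per-span cost with tracing off, loop overhead included.
    println!("~{:?} per disabled span", start.elapsed() / N);
}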

Enabled — Low Overhead

perf_span!() call:
  1. AtomicU64::fetch_add(Relaxed)   ~5ns   (span ID)
  2. Instant::now()                  ~25ns  (clock read)
  3. thread_local push               ~10ns  (stack push)

Drop (span completion):
  4. Instant::elapsed()              ~25ns  (clock read)
  5. thread_local pop                ~10ns  (stack pop)
  6. DashMap::entry().record()       ~30ns  (lock-free insert)

Total overhead per span: ~100ns when enabled

For a workload producing 10,000 spans/second, that’s ~1ms of tracing overhead per second (0.1%). For 100,000 spans/second, ~10ms/second (1%). This is acceptable for staging and targeted production debugging.

Memory Usage

  • Per completed span: ~200 bytes (component, operation string, duration, attributes, IDs)
  • Per aggregated operation: ~500 bytes + 8 bytes per recorded duration
  • 10,000 spans session: ~2MB memory
  • 100,000 spans session: ~20MB memory

Call PerfTracer::global().reset() between test runs to free accumulated data.
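
For example, a multi-run harness (a sketch; run_benchmark stands in for your workload):

use heliosdb_common::perf_tracer::PerfTracer;

fn run_benchmark() { /* your workload */ }

fn main() {
    for run in 0..3 {
        PerfTracer::global().enable(); // also resets prior data
        run_benchmark();
        let report = PerfTracer::global().generate_report();
        println!("=== run {} ===\n{}", run, report.to_text());
        PerfTracer::global().reset(); // free accumulated span data
    }
    PerfTracer::global().disable();
}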


API Reference

PerfTracer

// Get the global singleton
PerfTracer::global() -> &'static PerfTracer
// Lifecycle
tracer.enable() // Enable + reset + start session clock
tracer.disable() // Disable (spans become no-ops)
tracer.is_enabled() -> bool // Fast check (atomic load)
tracer.reset() // Clear all data without changing enabled state
// Span creation
tracer.begin_span(component, operation, attributes) -> PerfSpanGuard
// Reporting
tracer.generate_report() -> PerfReport
tracer.span_count() -> usize

PerfReport

report.to_text() -> String // Human-readable table
report.to_json() -> serde_json::Value // Machine-readable JSON
// Fields
report.session_duration: Duration
report.total_spans: u64
report.operations: Vec<OperationStats> // All ops, sorted by total time desc
report.component_summary: Vec<ComponentSummary> // Per-component aggregation
report.bottlenecks: Vec<Bottleneck> // Top 10 by total time
report.spans: Vec<CompletedSpan> // All individual spans (for analysis)

perf_span! Macro

// Simple span (no attributes)
let _span = perf_span!(Component::Storage, "operation_name");
// Span with key-value attributes
let _span = perf_span!(Component::Storage, "operation_name",
"key1" => value1,
"key2" => value2
);
// The guard (_span) MUST be bound to a variable.
// If not bound, it drops immediately and records zero duration.
// WRONG: perf_span!(Component::Storage, "op"); // drops immediately!
// RIGHT: let _span = perf_span!(Component::Storage, "op");

Comparison with External Tracing Tools

| Feature | PerfTracer (built-in) | tracing crate | Jaeger/OpenTelemetry |
|---|---|---|---|
| Setup | Zero config | Subscriber setup | Agent + collector infrastructure |
| Overhead disabled | ~1ns (atomic load) | ~5ns (subscriber check) | N/A (not embedded) |
| Overhead enabled | ~100ns/span | ~200ns/span | ~1-5us/span (network export) |
| Percentile stats | Built-in (p50/p95/p99) | Requires aggregator | Requires backend query |
| Bottleneck ranking | Automatic | Manual analysis | Manual query |
| Thread safety | Lock-free (DashMap) | Layer-dependent | Async export |
| Distributed tracing | Span hierarchy only | Full distributed | Full distributed |
| Production use | Targeted debugging | Always-on possible | Always-on recommended |

Use PerfTracer for: Performance testing, benchmarking, targeted bottleneck analysis, CI regression checks.

Use tracing crate for: Always-on structured logging, distributed request tracing, log aggregation.

Use OpenTelemetry for: Multi-service distributed tracing, production observability, SLO monitoring.

PerfTracer is complementary to these tools — it provides instant, zero-dependency bottleneck analysis without any infrastructure setup.