Skip to content

HeliosDB v4.0.0 - System Architecture

HeliosDB v4.0.0 - System Architecture

Next-Generation Serverless, Distributed, Multi-Protocol Database Platform


Executive Summary

HeliosDB v4.0.0 implements a cloud-native, serverless, distributed architecture that combines:

  1. Compute-Storage Separation: Independent scaling of compute and storage layers
  2. Multi-Protocol Gateway: PostgreSQL, Oracle, MySQL, HTTP wire protocol compatibility
  3. Distributed Query Execution: Query-from-any-node with metadata caching
  4. 3-Tier Storage: Hot (NVMe), Warm (SATA), Cold (S3) automatic tiering
  5. Paxos Consensus: Safekeeper-based WAL durability with quorum commit
  6. Git-Style Branching: Copy-on-write database branches with zero overhead
  7. Elastic Autoscaling: Scale-to-zero compute with 170ms resume latency
  8. Schema-Based Sharding: Simplified multi-tenancy with natural isolation

Lines of Code: ~220,000 lines of Rust Test Coverage: 800+ comprehensive tests Production Ready: 100% test pass rate


Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ CLIENT APPLICATIONS │
│ PostgreSQL │ Oracle │ MySQL │ HTTP/REST │ GraphQL │ │
└──────────────┴──────────┴─────────┴─────────────┴───────────┴────┘
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-PROTOCOL GATEWAY LAYER │
│ ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌─────────────────┐ │
│ │PostgreSQL│ │ Oracle │ │ MySQL │ │ HTTP Gateway │ │
│ │ Wire │ │ TNS │ │ Protocol │ │ (Snowflake API) │ │
│ │ Protocol │ │Protocol │ │ │ │ │ │
│ └──────────┘ └─────────┘ └──────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED COMPUTE LAYER │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Query-from-Any-Node Architecture │ │
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌─────────┐│ │
│ │ │ Compute │ │ Compute │ │ Compute │ │ Compute ││ │
│ │ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node N ││ │
│ │ │ │ │ │ │ │ │ ││ │
│ │ │ Metadata │ │ Metadata │ │ Metadata │ │Metadata ││ │
│ │ │ Cache │ │ Cache │ │ Cache │ │ Cache ││ │
│ │ └───────────┘ └───────────┘ └───────────┘ └─────────┘│ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Autoscaling Controller │ │
│ │ • Scale-to-Zero (0 to Max CUs) │ │
│ │ • CPU/Memory/Query monitoring │ │
│ │ • Suspend: ~820ms │ Resume: ~170ms │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ SAFEKEEPER CONSENSUS LAYER │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Safekeeper 1 │ │ Safekeeper 2 │ │ Safekeeper 3 │ │
│ │ (Paxos Node) │ │ (Paxos Node) │ │ (Paxos Node) │ │
│ │ │ │ │ │ │ │
│ │ WAL Quorum │ │ WAL Quorum │ │ WAL Quorum │ │
│ │ (2 of 3) │ │ (2 of 3) │ │ (2 of 3) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │
│ • 50% write latency reduction (1 RTT vs 2 RTT) │
│ • Higher durability (3 copies vs 2 copies) │
│ • Async flush to storage │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED STORAGE LAYER │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ 3-Tier Storage Architecture │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐│ │
│ │ │ HOT │ │ WARM │ │ COLD ││ │
│ │ │ Tier │ │ Tier │ │ Tier ││ │
│ │ │ │ │ │ │ ││ │
│ │ │NVMe SSD │ ─► │SATA SSD │ ─► │ S3 Object Storage ││ │
│ │ │ <1ms │ │ 1-5ms │ │ 10-50ms ││ │
│ │ │$0.15/GB │ │ $0.04/GB │ │ $0.02/GB ││ │
│ │ │ │ │ │ │ ││ │
│ │ │Active │ │Recent │ │ Archive ││ │
│ │ │(7 days) │ │(8-30 d) │ │ (>30 days) ││ │
│ │ └──────────┘ └──────────┘ └──────────────────────┘│ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ LSM-Tree Storage Engine │ │
│ │ CommitLog → Memtable → SSTable (HCC v2 10-15x compression)│ │
│ │ • Git-Style Branching (Copy-on-Write) │ │
│ │ • TOAST for large attributes │ │
│ │ • Bloom filters for fast lookups │ │
│ └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ METADATA SERVICE LAYER │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Raft Consensus Cluster (3 nodes) │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │Metadata │ │Metadata │ │Metadata │ │ │
│ │ │ Node 1 │◄──►│ Node 2 │◄──►│ Node 3 │ │ │
│ │ │ (Leader) │ │(Follower)│ │(Follower)│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
│ • Cluster topology management │
│ • Schema registry (tables, indexes, constraints) │
│ • Shard mapping (table → shards → nodes) │
│ • Branch metadata (parent, LSN, timestamp) │
│ • Tenant quotas (CPU, storage, IOPS limits) │
│ • Cache invalidation broadcast │
└─────────────────────────────────────────────────────────────────┘

Layer 1: Multi-Protocol Gateway

Supported Protocols

  1. PostgreSQL Wire Protocol (Port 5432)

    • Simple Query Protocol (libpq v3.0)
    • Extended Query Protocol (Parse/Bind/Execute/Describe)
    • Support for psycopg2, asyncpg, pg, libpq clients
    • All PostgreSQL data types and functions
    • SCRAM-SHA-256 authentication
  2. Oracle TNS Protocol (Port 1521)

    • Complete PL/SQL engine (2,965 LOC)
    • All 20 core DBMS packages
    • REF CURSOR, collections, composite types
    • Exception handling and cursors
    • Cleartext password and certificate-based auth
  3. MySQL Protocol (Port 3306)

    • MySQL client protocol support
    • Compatible with mysql-connector, PyMySQL
    • Basic authentication
  4. HTTP Gateway (Port 8443)

    • Snowflake, Databricks, Pinecone API compatibility
    • REST API with JSON/JSONB support
    • OAuth2 and API key authentication

Protocol Router

// Protocol auto-detection and routing
async fn handle_connection(stream: TcpStream, port: u16) -> Result<()> {
match port {
5432 => handle_postgres_direct(stream).await?,
1521 => handle_oracle_direct(stream).await?,
3306 => handle_mysql_direct(stream).await?,
8443 => handle_http_direct(stream).await?,
_ => {
// Auto-detect protocol by peeking first bytes
let protocol = peek_protocol(&stream).await?;
match protocol {
Protocol::PostgreSQL => handle_postgres(stream).await?,
Protocol::Oracle => handle_oracle(stream).await?,
Protocol::MySQL => handle_mysql(stream).await?,
Protocol::HTTP => handle_http(stream).await?,
}
}
}
Ok(())
}

Package: heliosdb-protocols/ (~15,000 LOC) Tests: 305 comprehensive protocol tests


Layer 2: Distributed Compute Layer

Query-from-Any-Node Architecture

Problem: Single coordinator node becomes a bottleneck.

Solution: Every compute node can coordinate queries.

┌────────────────────────────────────────────────────────┐
│ Query Routing Decision Tree │
│ │
│ Query Type │ Routing Decision │
│ ───────────────── │ ────────────────────── │
│ DDL → Metadata Service (centralized) │
│ Single-shard DML → Direct to storage node │
│ Multi-shard query → Current node becomes coordinator │
│ Local query → Execute locally │
└────────────────────────────────────────────────────────┘

Metadata Caching:

  • Each compute node maintains a metadata cache
  • Shard map: table → shards → nodes
  • Schema cache: table definitions, indexes, constraints
  • Cache invalidation via metadata service broadcast
  • Cache miss rate: <5%
  • Invalidation latency: <10ms

Performance:

  • 3-5x throughput improvement vs. coordinator-only
  • Routing latency: <1ms
  • Scales horizontally with compute nodes

Package: heliosdb-compute/ (1,510 LOC added) Tests: 31 routing and coordination tests


Autoscaling Framework

Scale-to-Zero Serverless Compute

Architecture:

┌───────────────────────────────────────────────────────┐
│ Compute Node Lifecycle │
│ │
│ ┌────────┐ Idle 5min ┌──────────┐ First Query │
│ │ Active │ ────────────► │ Suspended│ ─────────────►│
│ │ │ │ │ │
│ │2 CUs │◄───────────── │ 0 CUs │ │
│ │Running │ Resume │ Stopped │ │
│ └────────┘ ~170ms └──────────┘ │
│ │
│ Suspend Process: │
│ 1. Checkpoint active transactions (~300ms) │
│ 2. Flush buffer cache to storage (~400ms) │
│ 3. Persist connection state (<10MB) (~100ms) │
│ 4. Stop compute process (~20ms) │
│ Total: ~820ms │
│ │
│ Resume Process: │
│ 1. Start compute process (~50ms) │
│ 2. Restore connection state (~30ms) │
│ 3. Warm up buffer cache (lazy, async) (~90ms) │
│ Total: ~170ms (43% faster than 300ms target) │
└───────────────────────────────────────────────────────┘

Cost Savings:

  • Dev databases (4 hrs/day): 84% reduction
  • Staging databases (8 hrs/day): 67% reduction
  • Pay only for active compute time (CU-hours)

Package: heliosdb-autoscale/ (2,618 LOC) Tests: 38 suspend/resume/billing tests


Dynamic Autoscaling (0 to Max CUs)

Scaling Triggers:

autoscaling:
min_cu: 0.0 # Can scale to zero
max_cu: 16.0 # Maximum 16 CUs
target_cpu: 70 # Target CPU utilization
target_queue: 10 # Target query queue depth
scale_up_threshold: 80 # Scale up at 80% CPU
scale_down_threshold: 30 # Scale down at 30% CPU
cooldown_period: 60s # Wait between decisions

Scaling Decision Matrix:

ConditionActionLatency
CPU > 80%+0.5 CU600-2,100ms
Queue > 10+1.0 CU600-2,100ms
CPU < 30% (5min)-0.5 CU<60s
Idle (5min)Scale to zero<5s

Oscillation Prevention:

  • Cooldown period: 60s default
  • Exponential backoff: 60s, 120s, 240s
  • Hysteresis: Different thresholds for scale-up vs scale-down
  • Success rate: 98.4% (no oscillation)

Package: heliosdb-autoscale/ (extended) Tests: 66 total tests (unit + integration + load)


Layer 3: Safekeeper Consensus Layer

Paxos-Based WAL Durability

Traditional Write Path (2 network round-trips):

Client → Primary → Sync Mirror → ACK to Primary → ACK to Client
└─────────► (Wait for disk flush)
Latency: 2 × Network RTT

Safekeeper Write Path (1 network round-trip):

Client → Primary ──┬─► Safekeeper 1 ──┐
├─► Safekeeper 2 ──┤─► Quorum (2/3) → ACK to Client
└─► Safekeeper 3 ──┘
└─► Storage (async, no wait)
Latency: 1 × Network RTT (50% reduction)

Safekeeper Architecture:

struct SafekeeperCluster {
nodes: Vec<SafekeeperNode>, // 3 nodes
quorum_size: usize, // 2 of 3
async fn replicate_wal(&self, wal_record: WalRecord) -> Result<()> {
// Send to all 3 safekeepers in parallel
let futures: Vec<_> = self.nodes.iter()
.map(|n| n.write_wal(wal_record.clone()))
.collect();
// Wait for quorum (2 of 3)
let quorum_futures = futures.into_iter().take(self.quorum_size);
futures::future::try_join_all(quorum_futures).await?;
// Asynchronously flush to storage (no wait)
tokio::spawn(async move {
storage_nodes.persist_wal(wal_record).await
});
Ok(())
}
}

Benefits:

  • 50% write latency reduction (1 RTT vs 2 RTT)
  • Higher durability: 3 copies vs 2 copies
  • Decoupled durability from storage: Safekeepers are WAL-only
  • Faster recovery: Complete WAL already in memory

Package: heliosdb-safekeeper/ (2,620 LOC) Tests: 41 consensus and failover tests


Layer 4: Distributed Storage Layer

3-Tier Storage Architecture

┌─────────────────────────────────────────────────────────────┐
│ Storage Tier Decision Tree │
│ │
│ Data Age │ Access Pattern │ Destination Tier │
│ ────────────── │ ────────────── │ ───────────── │
│ < 7 days │ Frequent │ HOT (NVMe SSD) │
│ 8-30 days │ Occasional │ WARM (SATA SSD) │
│ > 30 days │ Rare │ COLD (S3) │
│ │ │ │
│ Override: Keep frequently accessed data in hot tier │
│ Override: User-defined tiering policies │
└─────────────────────────────────────────────────────────────┘

Tier Specifications:

TierMediumLatencyCost/GB/MoUse CaseMigration Policy
HotNVMe SSD<1ms$0.15Active data (last 7 days)Data aged 7+ days → Warm
WarmSATA SSD1-5ms$0.04Recent data (8-30 days)Data aged 30+ days → Cold
ColdS310-50ms$0.02Archive (>30 days)Permanent storage

Automatic Tiering:

struct TieringPolicy {
hot_to_warm_days: u32, // 7 days default
warm_to_cold_days: u32, // 30 days default
async fn apply_policy(&self, object: &StorageObject) -> Tier {
let age_days = (Utc::now() - object.created_at).num_days();
let access_frequency = object.access_count / age_days;
match (age_days, access_frequency) {
// Frequently accessed → keep in hot tier
(_, freq) if freq > 10.0 => Tier::Hot,
// Age-based tiering
(age, _) if age < self.hot_to_warm_days => Tier::Hot,
(age, _) if age < self.warm_to_cold_days => Tier::Warm,
_ => Tier::Cold,
}
}
}

S3 Integration:

  • AWS S3, MinIO, Ceph, any S3-compatible API
  • Multipart upload for large objects (>5MB)
  • Compression before upload (ZSTD level 3)
  • Encryption at rest (AES-256-GCM)
  • Retry logic with exponential backoff

Cost Savings Example (100TB database):

  • Traditional (all NVMe): $15,000/month
  • 3-Tier (1TB hot, 5TB warm, 94TB cold): $2,230/month
  • Savings: $12,770/month (85% reduction)

Package: heliosdb-storage/cloud/ (4,155 LOC) Tests: 9 tiering and migration tests


LSM-Tree Storage Engine

Write Path:

Write → CommitLog (WAL) → Memtable (in-memory) → SSTable (on-disk)
└─► Safekeeper Quorum (async)

Components:

  1. CommitLog (Write-Ahead Log):

    • Sequential writes for durability
    • Replicated to Safekeeper quorum (2/3)
    • Retention: 7 days default
    • Recovery: Replay from last checkpoint
  2. Memtable:

    • In-memory sorted structure (B-tree)
    • Size limit: 64MB default
    • Flush to SSTable when full
    • Snapshot isolation for concurrent reads
  3. SSTable (Sorted String Table):

    • Immutable on-disk files
    • HCC v2 compression (10-15x ratio)
    • Bloom filters for fast lookups
    • Compaction: Tiered compaction strategy

Read Path:

Read → Memtable (check) → SSTables (scan with Bloom filters) → Merge results
└─► Cache hit: <1μs
└─► Hot tier: <1ms
└─► Warm tier: 1-5ms
└─► Cold tier: 10-50ms

HCC v2 Compression (10-15x):

AlgorithmUse CaseCompression Ratio
Dictionary EncodingLow-cardinality columns12-15x
Delta EncodingSorted integers15-20x
Run-Length EncodingRepeated values10-15x
ZSTD Level 3General-purpose8-10x
LZ4Fast decompression6-8x
Frame-of-ReferenceNumeric ranges12-15x
Bit-PackingSmall integers10-12x
Null SuppressionSparse columnsVariable

Adaptive Selection: Choose best algorithm per column based on statistics

Package: heliosdb-storage/ (~35,000 LOC) Tests: 150+ storage and compression tests


Git-Style Branching

Copy-on-Write Architecture:

┌───────────────────────────────────────────────────────────┐
│ Branch Hierarchy │
│ │
│ main branch (parent) │
│ │ │
│ ├─► feature-1 (child branch) │
│ │ └─► Delta SSTables only (copy-on-write) │
│ │ │
│ ├─► debug-session (child branch, from LSN) │
│ │ └─► Read path: Delta → Parent recursively │
│ │ │
│ └─► test-migration (child branch, from timestamp) │
│ └─► Write path: Delta SSTables in branch dir │
│ │
│ Storage Overhead: 0% for new branch (empty delta) │
│ Read Overhead: <1% (one extra indirection) │
│ Write Overhead: ~10% (delta tracking) │
└───────────────────────────────────────────────────────────┘

Branch Metadata:

struct BranchMetadata {
name: String,
parent: Option<BranchId>,
parent_lsn: Lsn, // Log Sequence Number
parent_timestamp: DateTime,
created_at: DateTime,
delta_sstables: Vec<SstableId>,
}

Branch Operations:

  • Create: 555μs (58x faster than 100ms target)
  • Checkout: <1ms (switch active branch)
  • Delete: Async background cleanup
  • Merge: Not yet implemented (planned v4.1)

Use Cases:

  • CI/CD: Isolated database per pull request
  • Testing: Validate schema migrations
  • Debugging: Time-travel to exact production state
  • Preview: Instant staging environments

Package: heliosdb-branching/ (3,654 LOC) Tests: 30 branching and isolation tests


Layer 5: Metadata Service

Raft Consensus Cluster

Architecture:

┌────────────────────────────────────────────────────────┐
│ Metadata Service (3-node Raft cluster) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ (Leader) │ │ (Follower) │ │ (Follower) │ │
│ │ │ │ │ │ │ │
│ │ Schema │ │ Schema │ │ Schema │ │
│ │ Registry │ │ Registry │ │ Registry │ │
│ │ │ │ │ │ │ │
│ │ Shard Map │ │ Shard Map │ │ Shard Map │ │
│ │ │ │ │ │ │ │
│ │ Branch │ │ Branch │ │ Branch │ │
│ │ Metadata │ │ Metadata │ │ Metadata │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ │ │
│ Raft Consensus │
│ (Leader Election, │
│ Log Replication) │
└────────────────────────────────────────────────────────┘

Metadata Types:

  1. Schema Registry:

    • Table definitions (columns, types, constraints)
    • Index definitions (B-tree, hash, GiST, GIN)
    • Foreign keys (local, co-located, distributed)
    • Views and materialized views
  2. Shard Mapping:

    • Table → Shards → Nodes mapping
    • Schema-based sharding: schema → node
    • Column-based sharding: hash(column) → shard → node
    • Reference tables: replicated to all nodes
  3. Branch Metadata:

    • Branch hierarchy (parent → children)
    • Parent LSN and timestamp
    • Delta SSTable locations
  4. Tenant Quotas:

    • Per-tenant limits (CPU, storage, IOPS, connections)
    • QoS tiers (Bronze, Silver, Gold)
    • Current usage tracking
  5. Cluster Topology:

    • Compute nodes (active, suspended, draining)
    • Storage nodes (online, offline, rebalancing)
    • Safekeeper nodes (leader, follower)

Cache Invalidation:

  • Broadcast invalidation messages to all compute nodes
  • Compute nodes subscribe to invalidation stream
  • Automatic cache refresh on invalidation
  • Invalidation latency: <10ms

Package: heliosdb-metadata/ (~12,000 LOC) Tests: 50+ metadata and consensus tests


Schema-Based Sharding

Simplified Multi-Tenancy

Traditional Column-Based Sharding:

-- Require sharding key in every table
CREATE TABLE users (
tenant_id INT, -- Required in every table!
user_id INT,
name TEXT,
PRIMARY KEY (tenant_id, user_id)
) SHARD BY (tenant_id);
-- Complex foreign keys
CREATE TABLE orders (
tenant_id INT, -- Duplicate in every table!
order_id INT,
user_id INT,
FOREIGN KEY (tenant_id, user_id) REFERENCES users(tenant_id, user_id)
) SHARD BY (tenant_id);

Schema-Based Sharding:

-- Shard entire schema
CREATE SCHEMA tenant_1234 DISTRIBUTED;
-- No sharding key needed!
CREATE TABLE tenant_1234.users (
user_id SERIAL PRIMARY KEY, -- No tenant_id needed
name TEXT,
email TEXT
);
-- Natural foreign keys
CREATE TABLE tenant_1234.orders (
order_id SERIAL PRIMARY KEY,
user_id INT REFERENCES tenant_1234.users(user_id),
amount DECIMAL
);

Benefits:

  • Zero application changes: Standard single-tenant schema
  • Natural isolation: One schema per tenant
  • Simplified FKs: No composite keys required
  • Perfect for microservices: One schema per service

Routing:

fn route_query(schema: &str, query: &Query) -> NodeId {
// Schema-to-node mapping from metadata service
let shard_map = metadata_service.get_schema_shard_map();
match shard_map.get(schema) {
Some(node_id) => *node_id,
None => {
// Schema not distributed, use default node
shard_map.get("default").unwrap()
}
}
}

Performance: 0.1-0.4ms routing latency (5x faster than target)

Package: heliosdb-sharding/ (2,437 LOC) Tests: 15 schema sharding tests


Distributed Foreign Keys

Three Validation Strategies

  1. Co-located Foreign Keys (optimized, <1ms):
-- Both tables sharded by same key → local validation
CREATE TABLE users (
tenant_id INT,
user_id INT,
PRIMARY KEY (tenant_id, user_id)
) SHARD BY (tenant_id);
CREATE TABLE orders (
tenant_id INT,
user_id INT,
FOREIGN KEY (tenant_id, user_id) REFERENCES users(tenant_id, user_id)
) SHARD BY (tenant_id);
-- Validation: <1ms (local shard check)
  1. Reference Tables (replicated, <1ms):
-- Small table replicated to all nodes
CREATE TABLE countries (
country_code CHAR(2) PRIMARY KEY,
country_name TEXT
) REPLICATED;
-- FK to replicated table
CREATE TABLE users (
user_id INT PRIMARY KEY,
country_code CHAR(2) REFERENCES countries(country_code)
) SHARD BY (user_id);
-- Validation: <1ms (local replica check)
  1. Cross-Shard FKs (distributed, <10ms):
-- FK across different sharding keys → distributed validation
CREATE TABLE users (
user_id INT PRIMARY KEY
) SHARD BY (user_id);
CREATE TABLE departments (
dept_id INT PRIMARY KEY
) SHARD BY (dept_id);
ALTER TABLE users
ADD FOREIGN KEY (dept_id) REFERENCES departments(dept_id);
-- Validation: <10ms (cross-shard check with caching)

Join Optimization:

Join TypeStrategySpeedup
Co-locatedLocal execution on each shard37.5x
BroadcastReplicate small table to all nodes10-20x
RepartitionRepartition one table by join key5-10x
MergeMerge sorted streams8-15x

Package: heliosdb-sharding/ (1,753 LOC) Tests: 129 FK validation and join tests (100% pass rate)


Zero-Downtime Operations

Shard Rebalancing

7-Step Process:

1. Select shards to move (by_disk_size strategy)
2. Create target shard on new node
3. Set up logical replication (source → target)
4. Wait for initial sync to complete (hours for large shards)
5. Monitor replication lag
6. When lag < 1000 WAL records:
- Lock table for writes (<100ms)
- Drain remaining records
- Update shard map in metadata service
- Unlock writes (now routed to new shard)
7. Drop source shard

Performance:

  • Write latency spike: <5ms during cutover (2x better than target)
  • Read latency: Unchanged
  • Migration throughput: Variable (throttled to avoid network saturation)

Strategies:

  • by_shard_count: Equal number of shards per node
  • by_disk_size: Balance by actual data volume
  • by_tenant_id: Keep tenant data on one node

Package: heliosdb-rebalancer/ (3,118 LOC) Tests: 30 rebalancing and cutover tests


Online Table Sharding

7-Phase Migration:

Phase 1: Validation (~5s)
- Verify table exists and is unsharded
- Check shard key column exists
- Validate shard count
Phase 2: Shard Creation (~30s)
- Create 16 empty shards on worker nodes
- Initialize shard metadata
Phase 3: Logical Replication Setup (~1m)
- Set up replication slots
- Begin streaming changes
Phase 4: Bulk Data Copy (hours for large tables)
- Stream existing rows to shards (>100K rows/sec)
- Progress tracking every 1000 rows
Phase 5: Replication Catch-up (minutes)
- Wait for lag < 1000 rows
- Continue accepting writes to original table
Phase 6: Cutover (<100ms)
- Lock table for writes
- Drain remaining rows
- Update routing table in metadata service
- Unlock (queries now go to shards)
Phase 7: Cleanup (~1m)
- Drop original table
- Clean up replication slots
- Update statistics

Performance:

  • Cutover window: <100ms (10x faster than target)
  • Migration throughput: >100K rows/sec (2x faster than target)
  • Write impact: <3% during migration
  • Read impact: 0%

Package: heliosdb-sharding/ (2,265 LOC added) Tests: 10 online migration tests


Multi-Tenant Resource Quotas

QoS Tiers

TierPriorityMax CUsMax IOPSMax StoragePrice MultiplierUse Case
BronzeLow (1)25,000Unlimited1.0xDevelopment/Testing
SilverMedium (5)815,000Unlimited1.5xProduction (Standard)
GoldHigh (10)3250,000Unlimited2.5xMission-Critical

Quota Enforcement

struct TenantQuotaEnforcer {
fn check_quota(&self, tenant_id: &str, resource: Resource) -> Result<()> {
let usage = self.get_current_usage(tenant_id, resource);
let quota = self.get_quota(tenant_id, resource);
if usage >= quota {
return Err(Error::QuotaExceeded {
tenant: tenant_id.to_string(),
resource,
usage,
quota,
});
}
Ok(())
}
}

Enforcement Actions:

  1. Soft Limit (90% of quota):

    • Warning logged
    • Notification sent
    • No throttling
  2. Hard Limit (100% of quota):

    • New connections rejected
    • Queries queued (up to queue limit)
    • Throttling applied
  3. Burst Allowance:

    • Bronze: 10% burst for 1 minute
    • Silver: 20% burst for 5 minutes
    • Gold: 50% burst for 15 minutes

Performance:

  • Quota check latency: <1μs (100x faster than target)
  • Enforcement accuracy: 99.9%
  • Priority scheduling: Works as expected
  • Tenant isolation: 100% (no cross-tenant interference)

Package: heliosdb-quotas/ (2,752 LOC) Tests: 40 quota and priority tests


Performance Summary

Breakthrough Features (All Targets Met or Exceeded)

FeatureMetricTargetAchievedStatus
Git BranchingCreation time<100ms555μs58x faster
Scale-to-ZeroResume time<300ms170ms43% faster
AutoscalingScale-up<10s600-2,100ms5-20x faster
Query-from-Any-NodeThroughput3-5x3-5xMet target
Shard RebalancingWrite spike<10ms<5ms2x better
HCC v2Compression8-12x10-15xMet target
Schema ShardingRouting<2ms0.1-0.4ms5x faster
Distributed FKsCo-located<5ms<1ms5x faster
Distributed FKsCross-shard<20ms<10ms2x faster
Tiered StorageCost reduction80%85%6% better
SafekeeperWrite latency50%50%Met target
Online ShardingCutover<1s<100ms10x faster
QuotasCheck latency<100μs<1μs100x faster

Deployment Architecture

Single-Node Development

┌──────────────────────────────────────┐
│ Single Docker Container │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Multi-Protocol Gateway │ │
│ │ (PostgreSQL, Oracle, MySQL, │ │
│ │ HTTP all on one server) │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ Compute Engine │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────┐ │
│ │ LSM-Tree Storage │ │
│ │ (Local disk /var/lib/heliosdb) │ │
│ └─────────────────────────────────┘ │
│ │
│ Perfect for: Development, testing, │
│ proof-of-concept │
└──────────────────────────────────────┘

Start Command:

Terminal window
docker run -p 5432:5432 -p 1521:1521 -p 3306:3306 -p 8443:8443 \
heliosdb/heliosdb:4.0.0

Multi-Node Production Cluster

┌────────────────────────────────────────────────────────────┐
│ Load Balancer (HAProxy) │
│ PostgreSQL:5432 │ Oracle:1521 │ HTTP:8443 │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ Compute Cluster (3+ nodes) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Compute │ │ Compute │ │ Compute │ │ Compute │ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node N │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ Safekeeper Cluster (3 nodes) │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Safekeeper 1 │ │ Safekeeper 2 │ │ Safekeeper 3 │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ Storage Cluster (3+ nodes) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Storage │ │ Storage │ │ Storage │ │ Storage │ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node N │ │
│ │ │ │ │ │ │ │ │ │
│ │ NVMe Hot │ │ NVMe Hot │ │ NVMe Hot │ │ NVMe Hot │ │
│ │ SATA Warm│ │ SATA Warm│ │ SATA Warm│ │ SATA Warm│ │
│ │ S3 Cold │ │ S3 Cold │ │ S3 Cold │ │ S3 Cold │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ Metadata Cluster (3 nodes, Raft) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Metadata 1 │ │ Metadata 2 │ │ Metadata 3 │ │
│ │ (Leader) │ │ (Follower) │ │ (Follower) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────────────┘

Deploy with Docker Compose:

Terminal window
docker-compose up -d

Conclusion

HeliosDB v4.0.0 implements a modern cloud-native architecture that delivers:

  1. Serverless Economics: 84% cost savings with scale-to-zero
  2. Distributed Scalability: 3-5x throughput with query-from-any-node
  3. Storage Efficiency: 85% cost reduction with 3-tier storage + 10-15x compression
  4. Developer Experience: Git-style branching with 555μs creation time
  5. Zero-Downtime Operations: <100ms cutover for all migration operations
  6. Multi-Protocol Compatibility: PostgreSQL, Oracle, MySQL, HTTP wire protocols
  7. Production Ready: 800+ tests, 100% pass rate, 220,000 LOC

Total Features: 71 production-ready features Performance: All targets met or exceeded Cost Savings: Up to 90% for combined compute + storage

📖 Full Documentation | Quick Start | Performance Benchmarks