HeliosDB Query Optimizer
Version: 7.0 | Status: Production Ready | Last Updated: 2026-01-04
Overview
The HeliosDB Query Optimizer is a sophisticated cost-based optimizer that transforms SQL queries into efficient execution plans. It combines traditional optimization techniques with advanced features like neural network-based planning, quantum-inspired optimization, and machine learning-driven cardinality estimation.
Key Capabilities
- Cost-Based Optimization: Evaluates multiple execution strategies and selects the lowest-cost plan
- Statistics-Driven Planning: Uses table and column statistics for accurate cost estimation
- Adaptive Query Processing: Adjusts plans at runtime based on actual data characteristics
- Multi-Model Support: Optimizes across SQL, document, graph, and time-series workloads
- Distributed Query Planning: Optimizes queries across sharded and replicated data
Architecture
                   +------------------+
                   |    SQL Parser    |
                   +--------+---------+
                            |
                            v
                   +------------------+
                   |  Logical Planner |
                   +--------+---------+
                            |
                            v
+------------+     +------------------+     +---------------+
| Statistics |---->|  Query Optimizer |<----| Configuration |
|  Catalog   |     +--------+---------+     |  Parameters   |
+------------+              |               +---------------+
                            v
                   +------------------+
                   | Physical Planner |
                   +--------+---------+
                            |
                            v
                   +------------------+
                   | Execution Engine |
                   +------------------+

Optimization Phases
- Parsing: SQL text converted to Abstract Syntax Tree (AST)
- Semantic Analysis: Validates references, resolves types
- Logical Planning: Creates initial logical plan tree
- Logical Optimization: Applies rule-based transformations
- Physical Planning: Generates physical execution strategies
- Cost-Based Selection: Evaluates costs, selects optimal plan
- Execution: Runs the chosen plan
Cost-Based Optimizer
The cost-based optimizer evaluates query plans using a sophisticated cost model.
Cost Model Components
-- View optimizer cost parameters
SHOW ALL LIKE '%cost%';
/*
Parameter            | Value  | Description
---------------------|--------|------------------------------------
seq_page_cost        | 1.0    | Cost to read one page sequentially
random_page_cost     | 4.0    | Cost to read one page randomly
cpu_tuple_cost       | 0.01   | Cost to process one tuple
cpu_operator_cost    | 0.0025 | Cost to execute one operator
cpu_index_tuple_cost | 0.005  | Cost to process one index entry
parallel_setup_cost  | 1000   | Cost to start parallel workers
parallel_tuple_cost  | 0.1    | Cost to transfer tuple to leader
*/
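These parameters are relative weights, so tuning them to match your storage often pays off. A minimal sketch, assuming the parameters can be changed per session with SET just as they are read with SHOW above: on SSD-backed nodes the gap between random and sequential reads is small, so lowering random_page_cost nudges the optimizer toward index scans.

-- Sketch: favor index scans on SSD storage by shrinking the
-- random-vs-sequential read penalty (the value is illustrative).
SET random_page_cost = 1.1;

-- Verify the change for the current session
SHOW random_page_cost;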
Cost Calculation Example
EXPLAIN (VERBOSE, COSTS)
SELECT * FROM orders WHERE customer_id = 123;
/*
Index Scan using idx_orders_customer on orders  (cost=0.43..8.45 rows=1 width=120)
Cost Breakdown:
  - Index lookup: 0.43 (startup cost)
  - Index entries processed: 1 x 0.005 = 0.005
  - Random page reads: 2 x 4.0 = 8.0 (one index page + one heap page)
  - Tuple processing: 1 x 0.01 = 0.01
  - Total: 0.43 + 0.005 + 8.0 + 0.01 = 8.45 (rounded)
*/
Plan Selection Process
The optimizer considers multiple strategies and selects the lowest-cost option:
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM users WHERE age > 25;
/*
Optimizer Decision: Scan Strategy
  CHOSEN: Sequential Scan (cost: 1000.00, rows: 50000)
    Reasoning: Predicate is not selective (50% of rows match).
    Sequential scan is more efficient than an index scan here.
  REJECTED: Index Scan on idx_users_age (cost: 2500.00)
    Reason: High number of matching rows (50%) causes excessive random I/O.
    Random page cost (4.0) makes the index scan 2.5x more expensive.
*/
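The decision flips once the predicate becomes selective enough that the index touches only a few pages. A rough illustration (the exact crossover depends on your data distribution and cost settings), assuming the idx_users_age index from the output above:

-- Hypothetical: only a small fraction of users are older than 80,
-- so the optimizer is likely to prefer the index scan instead.
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM users WHERE age > 80;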
Statistics Collection
Accurate statistics are essential for optimal query plans.
Automatic Statistics
-- HeliosDB automatically collects statistics
-- Configure auto-analyze threshold
SET autovacuum_analyze_threshold = 50;
SET autovacuum_analyze_scale_factor = 0.1;
-- Statistics collected:
--   - Row count per table
--   - Column value distributions (histograms)
--   - Most common values (MCV)
--   - Null fraction
--   - Average column width
--   - Correlation (physical vs logical ordering)
Manual Statistics Collection
-- Analyze single table
ANALYZE orders;
-- Analyze specific columns
ANALYZE orders (customer_id, order_date);
-- Analyze with sampling
ANALYZE (SAMPLE 10 PERCENT) large_table;
-- Verbose output
ANALYZE VERBOSE orders;
/*
INFO: analyzing "public.orders"
INFO: "orders": 1000000 rows sampled, 1000000 estimated total rows
INFO: "orders": 5 columns analyzed
*/
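When a skewed column's histogram is too coarse for good estimates, the statistics target for that column can be raised before re-analyzing. A sketch, assuming HeliosDB supports the PostgreSQL-style per-column statistics target (verify against your version):

-- Sketch: collect a finer-grained histogram for a skewed column,
-- then re-analyze so the new target takes effect.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders (customer_id);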
Extended Statistics
For correlated columns, create extended statistics:
-- Create dependency statistics
CREATE STATISTICS orders_stats (dependencies)
  ON customer_id, order_date FROM orders;
-- Create ndistinct statistics
CREATE STATISTICS product_region_stats (ndistinct)
  ON product_id, region FROM sales;
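If the combination of column values is itself skewed, multi-column most-common-value statistics can help further. This sketch assumes HeliosDB accepts the same mcv statistics kind that PostgreSQL does, alongside dependencies and ndistinct:

-- Sketch: multi-column MCV list for a skewed combination of columns
CREATE STATISTICS orders_customer_status_mcv (mcv)
  ON customer_id, status FROM orders;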
-- Analyze to populate
ANALYZE orders;
-- View statistics
SELECT * FROM pg_statistic_ext
WHERE stxname = 'orders_stats';
Viewing Statistics
-- Table-level statistics
SELECT relname,
       n_live_tup AS rows,
       n_dead_tup AS dead_rows,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- Column-level statistics
SELECT attname,
       null_frac,
       avg_width,
       n_distinct,
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'orders';
Join Strategies
The optimizer selects from multiple join algorithms based on data characteristics.
Hash Join
Best for: Large tables with equality conditions
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Hash Join  (cost=250.00..1500.00 rows=100000 width=150)
           (actual time=15.2..125.8 ms rows=100000 loops=1)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o  (cost=0.00..1000.00 rows=100000)
  ->  Hash  (cost=200.00..200.00 rows=10000)
        ->  Seq Scan on customers c  (cost=0.00..200.00 rows=10000)
Characteristics:
  - Builds hash table from smaller table (customers)
  - Probes hash table for each row from larger table
  - O(n + m) time complexity
  - Memory usage: work_mem for hash table
*/
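Because the hash table is bounded by work_mem, an undersized setting forces the join to spill into multiple batches on disk. A minimal sketch, assuming work_mem is adjustable per session like the cost parameters shown earlier:

-- Sketch: give the hash table enough memory for this session so the
-- build side (customers) fits without spilling to disk.
SET work_mem = '256MB';

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- With enough memory, the plan should report a single-batch Hash.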
Merge Join
Best for: Pre-sorted inputs or when results need ordering
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
ORDER BY o.customer_id;
/*
Merge Join  (cost=500.00..800.00 rows=100000 width=150)
            (actual time=45.2..95.8 ms rows=100000 loops=1)
  Merge Cond: (o.customer_id = c.id)
  ->  Index Scan using idx_orders_customer on orders o
  ->  Index Scan using customers_pkey on customers c
Characteristics:
  - Requires sorted inputs
  - Uses indexes or explicit sorts
  - O(n log n + m log m) with sorting
  - Excellent for ordered results
*/
Nested Loop Join
Best for: Small outer tables or indexed inner lookups
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.id = 12345;
/*
Nested Loop  (cost=0.57..16.60 rows=1 width=150)
             (actual time=0.05..0.06 ms rows=1 loops=1)
  ->  Index Scan using orders_pkey on orders o
        Index Cond: (id = 12345)
  ->  Index Scan using customers_pkey on customers c
        Index Cond: (id = o.customer_id)
Characteristics:
  - Iterates outer, looks up inner for each row
  - O(n * m) worst case
  - Excellent when outer is small or inner has index
  - No memory requirements for the join itself
*/
Join Strategy Selection
-- Force specific join strategy (for testing)
SET enable_hashjoin = false;
SET enable_mergejoin = false;
SET enable_nestloop = true;
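These switches are meant for diagnosis, not production; restore the defaults afterwards. A sketch, assuming HeliosDB supports standard RESET for session settings:

-- Restore the planner's freedom to pick any join strategy
RESET enable_hashjoin;
RESET enable_mergejoin;
RESET enable_nestloop;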
-- View optimizer decision
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM a JOIN b ON a.id = b.a_id;
/*
Optimizer Decision: Join Strategy
  CHOSEN: Hash Join (cost: 500.00)
    Reasoning: Medium-sized tables with equality condition.
    Hash table fits in work_mem (256MB).
  REJECTED: Nested Loop (cost: 5000.00, 10.0x more expensive)
    Reason: Outer table too large (10000 rows).
    Would require 10000 index lookups.
  REJECTED: Merge Join (cost: 800.00, 1.6x more expensive)
    Reason: Neither input is sorted on the join key.
    Sort overhead exceeds hash join cost.
*/
Index Selection
The optimizer automatically selects the most efficient index.
Index Types Supported
-- B-Tree (default): Range and equality queries
CREATE INDEX idx_orders_date ON orders(order_date);
-- Hash: Equality queries only
CREATE INDEX idx_users_email ON users USING HASH(email);
-- GiST: Spatial and full-text
CREATE INDEX idx_locations_geo ON locations USING GIST(coordinates);
-- GIN: Arrays and JSON
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- BRIN: Large sequential tables
CREATE INDEX idx_logs_time ON logs USING BRIN(timestamp);
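An index only pays off if the optimizer actually chooses it. A quick sanity check, assuming HeliosDB exposes the PostgreSQL-style pg_stat_user_indexes view alongside the pg_stat_user_tables view used earlier on this page:

-- Sketch: find indexes on a table that are rarely or never scanned
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan;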
Index Selection Criteria
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM orders
WHERE customer_id = 123 AND order_date > '2025-01-01';
/*
Index Selection Analysis:
  CHOSEN: idx_orders_customer_date (composite B-Tree)
    - Selectivity: 0.001 (0.1% of rows)
    - Estimated rows: 10
    - Cost: 8.45
  CONSIDERED: idx_orders_customer
    - Selectivity: 0.01 (1% of rows)
    - Would need filter on order_date
    - Cost: 150.00
  CONSIDERED: idx_orders_date
    - Selectivity: 0.3 (30% of rows)
    - Too many matching rows make an index scan expensive
    - Cost: 3000.00
  NOT CONSIDERED: Sequential Scan
    - Would scan all 1M rows
    - Cost: 10000.00
*/
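For reference, the composite index chosen above would put the equality column first and the range column second; a sketch of its definition (the column order is inferred from the analysis above, not shown in the catalog):

-- Sketch: composite index assumed by the analysis above; the equality
-- column (customer_id) leads, the range column (order_date) follows.
CREATE INDEX idx_orders_customer_date
  ON orders (customer_id, order_date);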
Covering Indexes
-- Create covering index
CREATE INDEX idx_orders_covering
  ON orders (customer_id, order_date)
  INCLUDE (amount, status);
EXPLAIN ANALYZE
SELECT customer_id, order_date, amount, status
FROM orders
WHERE customer_id = 123;
/*
Index Only Scan using idx_orders_covering  (cost=0.43..4.45 rows=10 width=50)
                                           (actual time=0.02..0.03 ms rows=10 loops=1)
  Heap Fetches: 0
Note: All columns satisfied from the index (no table access)
*/
Parallel Query Execution
HeliosDB automatically parallelizes queries across CPU cores.
Parallel Configuration
-- View parallel settings
SHOW max_parallel_workers;            -- 8 (cluster-wide)
SHOW max_parallel_workers_per_gather; -- 4 (per query)
SHOW min_parallel_table_scan_size;    -- 8MB
SHOW min_parallel_index_scan_size;    -- 512KB
SHOW parallel_setup_cost;             -- 1000
SHOW parallel_tuple_cost;             -- 0.1
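For a heavy analytic session you can raise the per-query worker limit without touching the cluster-wide cap. A sketch, assuming these settings are adjustable per session with SET like the cost parameters shown earlier:

-- Sketch: allow up to 8 workers for queries in this session
-- (still bounded by the cluster-wide max_parallel_workers limit).
SET max_parallel_workers_per_gather = 8;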
Parallel Operations
EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
/*
Finalize GroupAggregate  (cost=15000.00..15100.00 rows=10)
                         (actual time=150.2..150.5 ms rows=10 loops=1)
  Group Key: region
  ->  Gather Merge  (cost=15000.00..15050.00 rows=40)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial GroupAggregate  (cost=14000.00..14010.00 rows=10)
                                    (actual time=120.5..120.8 ms rows=10 loops=5)
              ->  Parallel Seq Scan on sales  (actual time=0.01..80.5 ms rows=200000 loops=5)
Performance: ~4x speedup with 4 workers
Total rows: 1,000,000
Each process (4 workers + leader, loops=5) handled: ~200,000 rows
*/
Parallel Join
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2025-01-01';
/*
Gather  (cost=5000.00..15000.00 rows=500000 width=150)
  Workers Planned: 4
  ->  Parallel Hash Join  (cost=4000.00..12000.00 rows=125000)
        Hash Cond: (o.customer_id = c.id)
        ->  Parallel Seq Scan on orders o
              Filter: (order_date > '2025-01-01')
        ->  Parallel Hash
              ->  Parallel Seq Scan on customers c
*/
Distributed Query Optimization
For sharded deployments, the optimizer minimizes data movement.
Shard-Aware Planning
-- Table sharded by customer_id
CREATE TABLE orders (
    id          BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) SHARD BY HASH(customer_id) SHARDS 8;
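The co-located join example further down assumes customers is sharded on the matching key with the same shard count. A possible definition (the column list is illustrative; only the sharding clause matters for co-location):

-- Sketch: customers sharded on its primary key so hash(c.id) lands on
-- the same shard as hash(o.customer_id) for the co-located join below.
CREATE TABLE customers (
    id   INTEGER,
    name TEXT
) SHARD BY HASH(id) SHARDS 8;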
-- Single-shard query (no data movement)
EXPLAIN DISTRIBUTED
SELECT * FROM orders WHERE customer_id = 123;
/*
Distributed Query Plan:
  ->  Route to Shard 3 (hash(123) mod 8 = 3)
        ->  Local Index Scan on orders
              Index Cond: (customer_id = 123)
Network: 0 bytes shuffled (single shard)
*/
Distributed Join Optimization
-- Both tables sharded by customer_id (co-located)
EXPLAIN DISTRIBUTED
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Distributed Query Plan:
  ->  Parallel execution on 8 shards
        Each shard executes locally:
          ->  Hash Join
                ->  Local Scan on orders partition
                ->  Local Scan on customers partition
  ->  Gather results at coordinator
Network: Minimal (only results transferred)
Optimization: Co-located join (both tables sharded the same way)
*/
Cross-Shard Aggregation
EXPLAIN DISTRIBUTED
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
/*
Distributed Query Plan:
  Phase 1: Partial Aggregation (on each shard)
    ->  Each shard computes partial SUM per region
    ->  Data shuffled: 8 shards x 10 regions x 16 bytes = 1.3KB
  Phase 2: Final Aggregation (at coordinator)
    ->  Merge partial results
    ->  Compute final SUM per region
Network: 1.3KB shuffled
Optimization: Pushdown of partial aggregation to shards
*/
Advanced Optimization Features
Adaptive Query Execution
-- Enable adaptive execution
SET adaptive_execution = ON;
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000;
/*
Adaptive Execution Report:
  Initial Plan: Hash Join (estimated 1000 rows)
  Runtime Adjustment: Switched to Nested Loop at row 50
  Reason: Actual selectivity (0.05%) much lower than estimated (1%)
Performance Impact:
  - Original plan would have taken: 150ms
  - Adaptive plan achieved: 5ms
  - Improvement: 30x
*/
Query Plan Caching
-- View cached plans
SELECT * FROM pg_prepared_statements;
-- Prepare statement (caches plan)
PREPARE get_orders AS
  SELECT * FROM orders WHERE customer_id = $1;
-- Execute with cached plan
EXECUTE get_orders(123);
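Prepared statements and their cached plans live for the session; when no longer needed, they can be dropped with standard SQL (assumed to behave the same in HeliosDB):

-- Drop the prepared statement and its cached plan
DEALLOCATE get_orders;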
-- View plan cache statistics
SELECT * FROM pg_stat_statements
WHERE query LIKE '%get_orders%';
Optimizer Hints
-- Force index usage
SELECT /*+ IndexScan(orders idx_orders_date) */ *
FROM orders WHERE order_date > '2025-01-01';
-- Force join order
SELECT /*+ Leading(c o) */ o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Force parallel execution
SELECT /*+ Parallel(orders 8) */ SUM(amount) FROM orders;
-- Disable specific optimization
SELECT /*+ NoHashJoin */ * FROM a JOIN b ON a.id = b.a_id;
Related Documentation
- Quick Start Guide - Getting started with query optimization
- User Guide - Comprehensive optimization techniques
- Troubleshooting Guide - Debugging slow queries
- EXPLAIN Plan Guide - Understanding execution plans
- Index Strategy Guide - Index design patterns
HeliosDB Query Optimizer - Enterprise-grade query optimization for any workload.