HeliosDB Query Optimizer
Version: 7.0 | Status: Production Ready | Last Updated: 2026-01-04
Overview
The HeliosDB Query Optimizer is a sophisticated cost-based optimizer that transforms SQL queries into efficient execution plans. It combines traditional optimization techniques with advanced features like neural network-based planning, quantum-inspired optimization, and machine learning-driven cardinality estimation.
Key Capabilities
- Cost-Based Optimization: Evaluates multiple execution strategies and selects the lowest-cost plan
- Statistics-Driven Planning: Uses table and column statistics for accurate cost estimation
- Adaptive Query Processing: Adjusts plans at runtime based on actual data characteristics
- Multi-Model Support: Optimizes across SQL, document, graph, and time-series workloads
- Distributed Query Planning: Optimizes queries across sharded and replicated data
Architecture
                   +------------------+
                   |    SQL Parser    |
                   +--------+---------+
                            |
                            v
                   +------------------+
                   |  Logical Planner |
                   +--------+---------+
                            |
                            v
+------------+     +------------------+     +---------------+
| Statistics |---->|  Query Optimizer |<----| Configuration |
|  Catalog   |     +--------+---------+     |  Parameters   |
+------------+              |               +---------------+
                            v
                   +------------------+
                   | Physical Planner |
                   +--------+---------+
                            |
                            v
                   +------------------+
                   | Execution Engine |
                   +------------------+

Optimization Phases
- Parsing: SQL text converted to Abstract Syntax Tree (AST)
- Semantic Analysis: Validates references, resolves types
- Logical Planning: Creates initial logical plan tree
- Logical Optimization: Applies rule-based transformations
- Physical Planning: Generates physical execution strategies
- Cost-Based Selection: Evaluates costs, selects optimal plan
- Execution: Runs the chosen plan
Cost-Based Optimizer
The cost-based optimizer evaluates query plans using a sophisticated cost model.
Cost Model Components
-- View optimizer cost parameters
SHOW ALL LIKE '%cost%';
/*
Parameter            | Value  | Description
---------------------|--------|------------------------------------
seq_page_cost        | 1.0    | Cost to read one page sequentially
random_page_cost     | 4.0    | Cost to read one page randomly
cpu_tuple_cost       | 0.01   | Cost to process one tuple
cpu_operator_cost    | 0.0025 | Cost to execute one operator
cpu_index_tuple_cost | 0.005  | Cost to process one index entry
parallel_setup_cost  | 1000   | Cost to start parallel workers
parallel_tuple_cost  | 0.1    | Cost to transfer tuple to leader
*/
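These parameters are relative weights, so tuning them to match your storage often pays off. A minimal sketch, assuming the parameters can be changed per session with SET just as they are read with SHOW above: on SSD-backed nodes the gap between random and sequential reads is small, so lowering random_page_cost nudges the optimizer toward index scans.

-- Sketch: favor index scans on SSD storage by shrinking the
-- random-vs-sequential read penalty (the value is illustrative).
SET random_page_cost = 1.1;

-- Verify the change for the current session
SHOW random_page_cost;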
Cost Calculation Example
EXPLAIN (VERBOSE, COSTS)
SELECT * FROM orders WHERE customer_id = 123;
/*
Index Scan using idx_orders_customer on orders  (cost=0.43..8.45 rows=1 width=120)
Cost Breakdown:
  - Index lookup: 0.43 (startup cost)
  - Index entries processed: 1 x 0.005 = 0.005
  - Random page reads: 2 x 4.0 = 8.0 (one index page + one heap page)
  - Tuple processing: 1 x 0.01 = 0.01
  - Total: 0.43 + 0.005 + 8.0 + 0.01 = 8.45 (rounded)
*/
Plan Selection Process
The optimizer considers multiple strategies and selects the lowest-cost option:
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM users WHERE age > 25;
/*
Optimizer Decision: Scan Strategy
  CHOSEN: Sequential Scan (cost: 1000.00, rows: 50000)
    Reasoning: Predicate is not selective (50% of rows match).
    Sequential scan is more efficient than an index scan here.
  REJECTED: Index Scan on idx_users_age (cost: 2500.00)
    Reason: High number of matching rows (50%) causes excessive random I/O.
    Random page cost (4.0) makes the index scan 2.5x more expensive.
*/
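The decision flips once the predicate becomes selective enough that the index touches only a few pages. A rough illustration (the exact crossover depends on your data distribution and cost settings), assuming the idx_users_age index from the output above:

-- Hypothetical: only a small fraction of users are older than 80,
-- so the optimizer is likely to prefer the index scan instead.
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM users WHERE age > 80;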
Statistics Collection
Accurate statistics are essential for optimal query plans.
Automatic Statistics
-- HeliosDB automatically collects statistics
-- Configure auto-analyze threshold
SET autovacuum_analyze_threshold = 50;
SET autovacuum_analyze_scale_factor = 0.1;
-- Statistics collected:
--   - Row count per table
--   - Column value distributions (histograms)
--   - Most common values (MCV)
--   - Null fraction
--   - Average column width
--   - Correlation (physical vs logical ordering)
Manual Statistics Collection
-- Analyze single table
ANALYZE orders;
-- Analyze specific columns
ANALYZE orders (customer_id, order_date);
-- Analyze with sampling
ANALYZE (SAMPLE 10 PERCENT) large_table;
-- Verbose output
ANALYZE VERBOSE orders;
/*
INFO: analyzing "public.orders"
INFO: "orders": 1000000 rows sampled, 1000000 estimated total rows
INFO: "orders": 5 columns analyzed
*/
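When a skewed column's histogram is too coarse for good estimates, the statistics target for that column can be raised before re-analyzing. A sketch, assuming HeliosDB supports the PostgreSQL-style per-column statistics target (verify against your version):

-- Sketch: collect a finer-grained histogram for a skewed column,
-- then re-analyze so the new target takes effect.
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
ANALYZE orders (customer_id);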
Extended Statistics
For correlated columns, create extended statistics:
-- Create dependency statistics
CREATE STATISTICS orders_stats (dependencies)
  ON customer_id, order_date FROM orders;
-- Create ndistinct statistics
CREATE STATISTICS product_region_stats (ndistinct)
  ON product_id, region FROM sales;
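If the combination of column values is itself skewed, multi-column most-common-value statistics can help further. This sketch assumes HeliosDB accepts the same mcv statistics kind that PostgreSQL does, alongside dependencies and ndistinct:

-- Sketch: multi-column MCV list for a skewed combination of columns
CREATE STATISTICS orders_customer_status_mcv (mcv)
  ON customer_id, status FROM orders;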
-- Analyze to populate
ANALYZE orders;
-- View statistics
SELECT * FROM pg_statistic_ext
WHERE stxname = 'orders_stats';
Viewing Statistics
-- Table-level statistics
SELECT relname,
       n_live_tup AS rows,
       n_dead_tup AS dead_rows,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'orders';
-- Column-level statistics
SELECT attname,
       null_frac,
       avg_width,
       n_distinct,
       most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'orders';
Join Strategies
The optimizer selects from multiple join algorithms based on data characteristics.
Hash Join
Best for: Large tables with equality conditions
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Hash Join  (cost=250.00..1500.00 rows=100000 width=150)
           (actual time=15.2..125.8 ms rows=100000 loops=1)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o  (cost=0.00..1000.00 rows=100000)
  ->  Hash  (cost=200.00..200.00 rows=10000)
        ->  Seq Scan on customers c  (cost=0.00..200.00 rows=10000)
Characteristics:
  - Builds hash table from smaller table (customers)
  - Probes hash table for each row from larger table
  - O(n + m) time complexity
  - Memory usage: work_mem for hash table
*/
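Because the hash table is bounded by work_mem, an undersized setting forces the join to spill into multiple batches on disk. A minimal sketch, assuming work_mem is adjustable per session like the cost parameters shown earlier:

-- Sketch: give the hash table enough memory for this session so the
-- build side (customers) fits without spilling to disk.
SET work_mem = '256MB';

EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- With enough memory, the plan should report a single-batch Hash.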
Merge Join
Best for: Pre-sorted inputs or when results need ordering
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
ORDER BY o.customer_id;
/*
Merge Join  (cost=500.00..800.00 rows=100000 width=150)
            (actual time=45.2..95.8 ms rows=100000 loops=1)
  Merge Cond: (o.customer_id = c.id)
  ->  Index Scan using idx_orders_customer on orders o
  ->  Index Scan using customers_pkey on customers c
Characteristics:
  - Requires sorted inputs
  - Uses indexes or explicit sorts
  - O(n log n + m log m) with sorting
  - Excellent for ordered results
*/
Nested Loop Join
Best for: Small outer tables or indexed inner lookups
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.id = 12345;
/*
Nested Loop  (cost=0.57..16.60 rows=1 width=150)
             (actual time=0.05..0.06 ms rows=1 loops=1)
  ->  Index Scan using orders_pkey on orders o
        Index Cond: (id = 12345)
  ->  Index Scan using customers_pkey on customers c
        Index Cond: (id = o.customer_id)
Characteristics:
  - Iterates outer, looks up inner for each row
  - O(n * m) worst case
  - Excellent when outer is small or inner has index
  - No memory requirements for the join itself
*/
Join Strategy Selection
-- Force specific join strategy (for testing)
SET enable_hashjoin = false;
SET enable_mergejoin = false;
SET enable_nestloop = true;
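These switches are meant for diagnosis, not production; restore the defaults afterwards. A sketch, assuming HeliosDB supports standard RESET for session settings:

-- Restore the planner's freedom to pick any join strategy
RESET enable_hashjoin;
RESET enable_mergejoin;
RESET enable_nestloop;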
-- View optimizer decision
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM a JOIN b ON a.id = b.a_id;
/*
Optimizer Decision: Join Strategy
  CHOSEN: Hash Join (cost: 500.00)
    Reasoning: Medium-sized tables with equality condition.
    Hash table fits in work_mem (256MB).
  REJECTED: Nested Loop (cost: 5000.00, 10.0x more expensive)
    Reason: Outer table too large (10000 rows).
    Would require 10000 index lookups.
  REJECTED: Merge Join (cost: 800.00, 1.6x more expensive)
    Reason: Neither input is sorted on the join key.
    Sort overhead exceeds hash join cost.
*/
Index Selection
The optimizer automatically selects the most efficient index.
Index Types Supported
-- B-Tree (default): Range and equality queries
CREATE INDEX idx_orders_date ON orders(order_date);
-- Hash: Equality queries only
CREATE INDEX idx_users_email ON users USING HASH(email);
-- GiST: Spatial and full-text
CREATE INDEX idx_locations_geo ON locations USING GIST(coordinates);
-- GIN: Arrays and JSON
CREATE INDEX idx_products_tags ON products USING GIN(tags);
-- BRIN: Large sequential tables
CREATE INDEX idx_logs_time ON logs USING BRIN(timestamp);
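An index only pays off if the optimizer actually chooses it. A quick sanity check, assuming HeliosDB exposes the PostgreSQL-style pg_stat_user_indexes view alongside the pg_stat_user_tables view used earlier on this page:

-- Sketch: find indexes on a table that are rarely or never scanned
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan;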
Index Selection Criteria
EXPLAIN ANALYZE (WHY_NOT)
SELECT * FROM orders
WHERE customer_id = 123 AND order_date > '2025-01-01';
/*
Index Selection Analysis:
  CHOSEN: idx_orders_customer_date (composite B-Tree)
    - Selectivity: 0.001 (0.1% of rows)
    - Estimated rows: 10
    - Cost: 8.45
  CONSIDERED: idx_orders_customer
    - Selectivity: 0.01 (1% of rows)
    - Would need filter on order_date
    - Cost: 150.00
  CONSIDERED: idx_orders_date
    - Selectivity: 0.3 (30% of rows)
    - Too many matching rows make an index scan expensive
    - Cost: 3000.00
  NOT CONSIDERED: Sequential Scan
    - Would scan all 1M rows
    - Cost: 10000.00
*/
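For reference, the composite index chosen above would put the equality column first and the range column second; a sketch of its definition (the column order is inferred from the analysis above, not shown in the catalog):

-- Sketch: composite index assumed by the analysis above; the equality
-- column (customer_id) leads, the range column (order_date) follows.
CREATE INDEX idx_orders_customer_date
  ON orders (customer_id, order_date);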
Covering Indexes
-- Create covering index
CREATE INDEX idx_orders_covering
  ON orders (customer_id, order_date)
  INCLUDE (amount, status);
EXPLAIN ANALYZE
SELECT customer_id, order_date, amount, status
FROM orders
WHERE customer_id = 123;
/*
Index Only Scan using idx_orders_covering  (cost=0.43..4.45 rows=10 width=50)
                                           (actual time=0.02..0.03 ms rows=10 loops=1)
  Heap Fetches: 0
Note: All columns satisfied from the index (no table access)
*/
Parallel Query Execution
HeliosDB automatically parallelizes queries across CPU cores.
Parallel Configuration
-- View parallel settings
SHOW max_parallel_workers;            -- 8 (cluster-wide)
SHOW max_parallel_workers_per_gather; -- 4 (per query)
SHOW min_parallel_table_scan_size;    -- 8MB
SHOW min_parallel_index_scan_size;    -- 512KB
SHOW parallel_setup_cost;             -- 1000
SHOW parallel_tuple_cost;             -- 0.1
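For a heavy analytic session you can raise the per-query worker limit without touching the cluster-wide cap. A sketch, assuming these settings are adjustable per session with SET like the cost parameters shown earlier:

-- Sketch: allow up to 8 workers for queries in this session
-- (still bounded by the cluster-wide max_parallel_workers limit).
SET max_parallel_workers_per_gather = 8;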
Parallel Operations
EXPLAIN ANALYZE
SELECT region, SUM(amount)
FROM sales
GROUP BY region;
/*
Finalize GroupAggregate  (cost=15000.00..15100.00 rows=10)
                         (actual time=150.2..150.5 ms rows=10 loops=1)
  Group Key: region
  ->  Gather Merge  (cost=15000.00..15050.00 rows=40)
        Workers Planned: 4
        Workers Launched: 4
        ->  Partial GroupAggregate  (cost=14000.00..14010.00 rows=10)
                                    (actual time=120.5..120.8 ms rows=10 loops=5)
              ->  Parallel Seq Scan on sales  (actual time=0.01..80.5 ms rows=200000 loops=5)
Performance: ~4x speedup with 4 workers
Total rows: 1,000,000
Each process (4 workers + leader, loops=5) handled: ~200,000 rows
*/
Parallel Join
EXPLAIN ANALYZE
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2025-01-01';
/*
Gather  (cost=5000.00..15000.00 rows=500000 width=150)
  Workers Planned: 4
  ->  Parallel Hash Join  (cost=4000.00..12000.00 rows=125000)
        Hash Cond: (o.customer_id = c.id)
        ->  Parallel Seq Scan on orders o
              Filter: (order_date > '2025-01-01')
        ->  Parallel Hash
              ->  Parallel Seq Scan on customers c
*/
Distributed Query Optimization
For sharded deployments, the optimizer minimizes data movement.
Shard-Aware Planning
-- Table sharded by customer_id
CREATE TABLE orders (
    id          BIGINT,
    customer_id INTEGER,
    amount      DECIMAL(10,2)
) SHARD BY HASH(customer_id) SHARDS 8;
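The co-located join example further down assumes customers is sharded on the matching key with the same shard count. A possible definition (the column list is illustrative; only the sharding clause matters for co-location):

-- Sketch: customers sharded on its primary key so hash(c.id) lands on
-- the same shard as hash(o.customer_id) for the co-located join below.
CREATE TABLE customers (
    id   INTEGER,
    name TEXT
) SHARD BY HASH(id) SHARDS 8;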
-- Single-shard query (no data movement)
EXPLAIN DISTRIBUTED
SELECT * FROM orders WHERE customer_id = 123;
/*
Distributed Query Plan:
  ->  Route to Shard 3 (hash(123) mod 8 = 3)
        ->  Local Index Scan on orders
              Index Cond: (customer_id = 123)
Network: 0 bytes shuffled (single shard)
*/
Distributed Join Optimization
-- Both tables sharded by customer_id (co-located)
EXPLAIN DISTRIBUTED
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
/*
Distributed Query Plan:
  ->  Parallel execution on 8 shards
        Each shard executes locally:
          ->  Hash Join
                ->  Local Scan on orders partition
                ->  Local Scan on customers partition
  ->  Gather results at coordinator
Network: Minimal (only results transferred)
Optimization: Co-located join (both tables sharded the same way)
*/
Cross-Shard Aggregation
EXPLAIN DISTRIBUTED
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region;
/*
Distributed Query Plan:
  Phase 1: Partial Aggregation (on each shard)
    ->  Each shard computes partial SUM per region
    ->  Data shuffled: 8 shards x 10 regions x 16 bytes = 1.3KB
  Phase 2: Final Aggregation (at coordinator)
    ->  Merge partial results
    ->  Compute final SUM per region
Network: 1.3KB shuffled
Optimization: Pushdown of partial aggregation to shards
*/
Advanced Optimization Features
Adaptive Query Execution
-- Enable adaptive execution
SET adaptive_execution = ON;
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.amount > 1000;
/*
Adaptive Execution Report:
  Initial Plan: Hash Join (estimated 1000 rows)
  Runtime Adjustment: Switched to Nested Loop at row 50
  Reason: Actual selectivity (0.05%) much lower than estimated (1%)
Performance Impact:
  - Original plan would have taken: 150ms
  - Adaptive plan achieved: 5ms
  - Improvement: 30x
*/
Query Plan Caching
-- View cached plans
SELECT * FROM pg_prepared_statements;
-- Prepare statement (caches plan)
PREPARE get_orders AS
  SELECT * FROM orders WHERE customer_id = $1;
-- Execute with cached plan
EXECUTE get_orders(123);
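Prepared statements and their cached plans live for the session; when no longer needed, they can be dropped with standard SQL (assumed to behave the same in HeliosDB):

-- Drop the prepared statement and its cached plan
DEALLOCATE get_orders;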
-- View plan cache statistics
SELECT * FROM pg_stat_statements
WHERE query LIKE '%get_orders%';
Optimizer Hints
-- Force index usage
SELECT /*+ IndexScan(orders idx_orders_date) */ *
FROM orders WHERE order_date > '2025-01-01';
-- Force join order
SELECT /*+ Leading(c o) */ o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
-- Force parallel execution
SELECT /*+ Parallel(orders 8) */ SUM(amount) FROM orders;
-- Disable specific optimization
SELECT /*+ NoHashJoin */ * FROM a JOIN b ON a.id = b.a_id;
Related Documentation
- Quick Start Guide - Getting started with query optimization
- User Guide - Comprehensive optimization techniques
- Troubleshooting Guide - Debugging slow queries
- EXPLAIN Plan Guide - Understanding execution plans
- Index Strategy Guide - Index design patterns
HeliosDB Query Optimizer - Enterprise-grade query optimization for any workload.