HeliosDB GPU - Known Limitations & Workarounds
Version: 1.0 Date: November 2025 Status: Production Release with Identified Feature Gaps
This document provides a comprehensive overview of GPU acceleration limitations in HeliosDB, including unimplemented features, workarounds, and performance expectations.
Table of Contents
- Executive Summary
- GPU Window Functions (14 Features)
- GPU Statistical Aggregations (11 Features)
- GPU Hash Joins
- Multi-Region Metrics
- Performance Expectations
- Fallback & Recovery
- Configuration & Tuning
- Future Roadmap
Executive Summary
HeliosDB GPU acceleration provides significant speedups for core operations:
- OLTP Queries: 2.7x speedup (for supported operations)
- OLAP Queries: 7.5x speedup (for supported operations)
- CPU Fallback: Always available; CPU execution remains competitive with industry databases
Key Limitation Areas:
- Window functions (14 types not yet GPU-optimized)
- Statistical aggregations (11 functions not yet GPU-optimized)
- Complex hash joins (7 kernel configurations incomplete)
- Multi-region metrics collection (now fixed in latest release)
Bottom Line: For queries using supported GPU operations, expect 2-10x speedup. For unsupported operations, queries automatically fall back to CPU execution with no performance penalty beyond normal CPU query time.
GPU Window Functions (14 Features)
Overview
Window functions are analytical SQL operations that compute results over a partition of rows. HeliosDB has GPU window function infrastructure, but the following 14 functions are not yet GPU-executed. Earlier builds returned empty result sets for them (an incomplete kernel execution path); after a recent update, they fall back to CPU execution instead.
Unimplemented Window Functions
| Function | Type | Current Behavior | Typical Use Case |
|---|---|---|---|
| ROW_NUMBER() | Ranking | CPU fallback | Sequential numbering within partition |
| RANK() | Ranking | CPU fallback | Ranking with gaps (ties get same rank) |
| DENSE_RANK() | Ranking | CPU fallback | Ranking without gaps |
| LAG(column, offset) | Offset | CPU fallback | Access previous row value |
| LEAD(column, offset) | Offset | CPU fallback | Access next row value |
| FIRST_VALUE(column) | Frame | CPU fallback | First value in window frame |
| LAST_VALUE(column) | Frame | CPU fallback | Last value in window frame |
| NTILE(n) | Distribution | CPU fallback | Divide partition into n buckets |
| CUME_DIST() | Distribution | CPU fallback | Cumulative distribution (0.0-1.0) |
| PERCENT_RANK() | Distribution | CPU fallback | Relative rank (0.0-1.0) |
| Prefix Sum | Running Aggregate | CPU fallback | Cumulative sum operation |
| Running SUM | Running Aggregate | CPU fallback | Cumulative sum over window frame |
| Running AVG | Running Aggregate | CPU fallback | Cumulative average |
| Running MIN/MAX | Running Aggregate | CPU fallback | Cumulative min/max |
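The three ranking functions differ only in how they number ties. For reference, here is a minimal CPU sketch of that semantics for a single pre-sorted partition; the names and structure are illustrative, not HeliosDB internals:

```rust
// Returns (row_number, rank, dense_rank) for each row of one partition
// already sorted per the window's ORDER BY (descending salary, say).
fn rank_partition(sorted: &[i64]) -> Vec<(usize, usize, usize)> {
    let mut out = Vec::with_capacity(sorted.len());
    let (mut rank, mut dense) = (0usize, 0usize);
    for (i, v) in sorted.iter().enumerate() {
        if i == 0 || sorted[i - 1] != *v {
            rank = i + 1; // RANK leaves gaps after ties
            dense += 1;   // DENSE_RANK does not
        }
        out.push((i + 1, rank, dense)); // ROW_NUMBER is always i + 1
    }
    out
}

fn main() {
    // The tied 80s share rank 2; RANK then skips to 4, DENSE_RANK continues at 3.
    assert_eq!(
        rank_partition(&[90, 80, 80, 70]),
        vec![(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 4, 3)]
    );
}
```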
Example: ROW_NUMBER() Query
```sql
-- This query will execute on CPU, not GPU
SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Current Behavior:
- Query routes to CPU execution
- Performance: Standard CPU query speed (no GPU acceleration)
- Result: Correct results, just not accelerated
Status & Root Cause
Root Cause: Window function kernels have incomplete execution paths. The CUDA kernel code exists but certain code paths don’t properly marshal results back from GPU memory.
Fix Status:
- Kernel code: Complete (src/window/window_functions.cu)
- Rust wrapper: Complete (src/window_functions.rs)
- Kernel execution: Incomplete (result marshaling)
- Estimated fix time: 2-3 weeks
Workarounds
Option 1: Use CPU Execution (Default)
```sql
-- Automatically falls back to CPU
SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Pros: No configuration needed, correct results
Cons: No GPU acceleration
Option 2: Disable GPU Routing (Explicit CPU)
```sql
-- PostgreSQL compatibility: disable GPU explicitly
SET gpu_enabled = false;

SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Option 3: Rewrite as Subquery (Partial GPU)
```sql
-- Use GPU for the aggregation, then CPU for the window function
WITH temp_ranked AS (
    SELECT department_id,
           salary,
           COUNT(*) OVER (PARTITION BY department_id) AS dept_size
    FROM employees
)
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM temp_ranked;
```

Pros: GPU accelerates the aggregation part
Cons: More complex query, needs manual optimization
Performance Impact
For a typical window function query on 10M rows:
- CPU Only: ~2,300 ms
- With CPU Fallback: ~2,300 ms (same, automatic fallback)
- Expected GPU (when available): ~150-300 ms
Note: There is no performance penalty for the fallback. GPU-incompatible queries execute at normal CPU speed.
GPU Statistical Aggregations (11 Features)
Overview
Statistical aggregation functions are not yet optimized on GPU. These functions compute statistical measures like standard deviation, variance, and correlation over groups of data.
Unimplemented Statistical Functions
| Function | Category | Description | Current Behavior |
|---|---|---|---|
| STDDEV(expr) | Variance | Sample standard deviation | CPU fallback |
| STDDEV_POP(expr) | Variance | Population standard deviation | CPU fallback |
| VARIANCE(expr) | Variance | Sample variance | CPU fallback |
| VAR_POP(expr) | Variance | Population variance | CPU fallback |
| PERCENTILE(expr, p) | Percentile | Percentile value (0.25, 0.5, 0.75) | CPU fallback |
| APPROX_PERCENTILE(expr, p) | Percentile | Approximate percentile (faster) | CPU fallback |
| CORR(x, y) | Correlation | Pearson correlation coefficient | CPU fallback |
| COVAR_POP(x, y) | Covariance | Population covariance | CPU fallback |
| COVAR_SAMP(x, y) | Covariance | Sample covariance | CPU fallback |
| REGR_SLOPE(y, x) | Regression | Linear regression slope | CPU fallback |
| REGR_INTERCEPT(y, x) | Regression | Linear regression intercept | CPU fallback |
Example: Statistical Query
```sql
-- This query will use CPU for the statistical functions, not GPU
SELECT product_category,
       COUNT(*) AS order_count,
       AVG(order_amount) AS avg_amount,
       STDDEV(order_amount) AS stddev_amount,   -- CPU fallback
       PERCENTILE(order_amount, 0.5) AS median  -- CPU fallback
FROM orders
GROUP BY product_category;
```

Current Behavior:
- Basic aggregates (COUNT, AVG, MAX, MIN, SUM): GPU accelerated
- Statistical functions: CPU fallback
- GROUP BY operation: Partially GPU accelerated
Supported vs. Unsupported Aggregates
```sql
-- GPU Accelerated
SELECT category,
       COUNT(*) AS cnt,         -- GPU
       SUM(amount) AS total,    -- GPU
       AVG(price) AS avg_price, -- GPU
       MIN(price) AS min_price, -- GPU
       MAX(price) AS max_price  -- GPU
FROM sales
GROUP BY category;
```
```sql
-- ❌ Uses CPU Fallback (still runs correctly)
SELECT category,
       STDDEV(price) AS std_price,    -- CPU
       VARIANCE(price) AS var_price,  -- CPU
       CORR(price, quantity) AS corr  -- CPU
FROM sales
GROUP BY category;
```

Status & Root Cause
Root Cause: Statistical functions require iterative or multi-pass algorithms that are more complex to parallelize on GPU:
- STDDEV: requires computing the mean, then the deviations, then a square root (3+ passes)
- PERCENTILE: requires sorting or histogram computation
- CORR/COVAR: requires parallel covariance computation
- REGR_*: requires parallel linear algebra
Implementation Status:
- Aggregation framework: Complete
- Basic aggregates: GPU-optimized
- Statistical functions: CPU-only implementation exists
- GPU kernels for statistics: Not yet implemented
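The multi-pass problem above is commonly avoided with mergeable partial aggregates (Chan et al.'s pairwise variance update), which let a GPU-style reduction tree compute variance in a single pass. This is a sketch of that general technique, not HeliosDB's planned kernel design:

```rust
// Partial aggregate for variance: count, running mean, and M2
// (the sum of squared deviations from the mean).
#[derive(Clone, Copy)]
struct VarState {
    n: f64,
    mean: f64,
    m2: f64,
}

fn from_value(x: f64) -> VarState {
    VarState { n: 1.0, mean: x, m2: 0.0 }
}

// Exact pairwise merge: combining two partials gives the same result
// as aggregating their inputs together, so any reduction order works.
fn merge(a: VarState, b: VarState) -> VarState {
    let n = a.n + b.n;
    let delta = b.mean - a.mean;
    VarState {
        n,
        mean: a.mean + delta * b.n / n,
        m2: a.m2 + b.m2 + delta * delta * a.n * b.n / n,
    }
}

fn main() {
    let xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    // Tree-style reduction, as a GPU would do across thread blocks
    let total = xs.iter().map(|&x| from_value(x)).reduce(merge).unwrap();
    let stddev_pop = (total.m2 / total.n).sqrt();
    assert!((stddev_pop - 2.0).abs() < 1e-9); // population stddev of xs is 2
}
```

Because the merge is associative, thread blocks can aggregate disjoint chunks independently and combine the partials at the end, which is exactly the shape a GROUP BY kernel needs.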
Workarounds
Option 1: Accept CPU Execution
Statistical functions execute on CPU at standard performance levels. For most datasets, this is acceptable.
```sql
-- Query runs correctly on CPU
SELECT department,
       STDDEV(salary) AS salary_stddev,
       PERCENTILE(salary, 0.75) AS q3_salary
FROM employees
GROUP BY department;
```

Option 2: Approximate Methods (When Available)
```sql
-- Use approximate percentile for faster computation
SELECT product_id,
       APPROX_PERCENTILE(rating, 0.5) AS approx_median_rating
FROM reviews
GROUP BY product_id;
```

Option 3: Compute Manually with GPU
```sql
-- Pre-compute with GPU-accelerated operations
WITH stats AS (
    SELECT department,
           COUNT(*) AS cnt,
           SUM(salary) AS total_salary,
           SUM(salary * salary) AS sum_squared,
           AVG(salary) AS mean_salary  -- GPU
    FROM employees
    GROUP BY department
)
SELECT department,
       mean_salary,
       SQRT(sum_squared / cnt - mean_salary * mean_salary) AS stddev
FROM stats;
```

Pros: Gets GPU acceleration for the component operations
Cons: More complex query, requires mathematical knowledge
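Two caveats on Option 3's SQRT(sum_squared / cnt - mean * mean) rewrite: it yields the population standard deviation (STDDEV_POP, dividing by n rather than n - 1), and the subtraction can lose precision on large, tightly clustered values. A small check of the identity against a direct two-pass computation, on hypothetical data:

```rust
// Two-pass reference: mean first, then squared deviations.
fn stddev_pop_two_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

// One-pass rewrite matching the SQL above: sum and sum of squares only.
fn stddev_pop_one_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let sum: f64 = xs.iter().sum();
    let sum_sq: f64 = xs.iter().map(|x| x * x).sum();
    let mean = sum / n;
    (sum_sq / n - mean * mean).sqrt()
}

fn main() {
    let salaries = [52_000.0, 61_500.0, 48_250.0, 70_000.0, 55_300.0];
    let a = stddev_pop_two_pass(&salaries);
    let b = stddev_pop_one_pass(&salaries);
    // Agrees to floating-point tolerance on well-conditioned data
    assert!((a - b).abs() < 1e-6);
}
```

For sample standard deviation (SQL STDDEV), divide by cnt - 1 instead of cnt before taking the square root.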
Performance Impact
For a 100M row dataset with GROUP BY:
- Basic aggregates (SUM, AVG, COUNT): ~500 ms (GPU)
- With statistical functions: ~1,200 ms (CPU fallback for STDDEV/PERCENTILE)
- Expected GPU (when available): ~400-600 ms
GPU Hash Joins
Overview
Hash join operations on GPU have partial implementation with 7 incomplete kernel configurations. While basic hash joins work, certain edge cases and configurations fall back to CPU.
Unimplemented Hash Join Variants
| Join Type | Configuration | Status | Use Case |
|---|---|---|---|
| Inner Join | Linear probing | Implemented | Standard equi-joins |
| Inner Join | Cuckoo hashing | Partial | High collision rates |
| Inner Join | Grace partitioning | Partial | Datasets > GPU memory |
| Outer Join | Left + Right | CPU fallback | All rows from both sides |
| Outer Join | Full outer | CPU fallback | Complete outer join |
| Semi-join | With bloom filter | Partial | Existence checks |
| Anti-join | High selectivity | Partial | Set difference operations |
Example: Hash Join Query
```sql
-- Basic inner join: GPU accelerated
SELECT o.order_id, o.amount, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
```
```sql
-- Outer join: CPU fallback
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
```

Status & Root Cause
Root Cause: Hash join implementation is complex with multiple code paths:
- Hash table construction: GPU-accelerated
- Probe phase: GPU-accelerated for inner joins
- Output materialization: Incomplete for outer joins
The remaining 7 kernels involve:
- Multi-way hash joins (join 3+ tables)
- Outer join output construction
- Non-equi join conditions (>, <, BETWEEN)
- String key hashing
Implementation Gap:
- Basic hash join: 90% complete
- Advanced variants: 40% complete
- Estimated effort: 3-4 weeks
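The build/probe structure described above can be sketched on CPU in a few lines: build a hash table on the smaller (customers) side, then probe it with each order row. This is illustrative only; the GPU kernels parallelize both phases across threads.

```rust
use std::collections::HashMap;

// Minimal CPU sketch of an inner hash join on an equality key.
fn hash_inner_join<'a>(
    orders: &'a [(u32, u32)],        // (order_id, customer_id)
    customers: &'a [(u32, &'a str)], // (id, name)
) -> Vec<(u32, &'a str)> {
    // Build phase: hash table keyed on customer id
    let table: HashMap<u32, &str> = customers.iter().cloned().collect();
    // Probe phase: inner join drops orders with no matching customer
    orders
        .iter()
        .filter_map(|&(oid, cid)| table.get(&cid).map(|&name| (oid, name)))
        .collect()
}

fn main() {
    let orders = [(1, 10), (2, 11), (3, 99)]; // customer 99 does not exist
    let customers = [(10, "Ada"), (11, "Bix")];
    assert_eq!(
        hash_inner_join(&orders, &customers),
        vec![(1, "Ada"), (2, "Bix")]
    );
}
```

The outer-join gap is visible here: a LEFT JOIN must also emit the unmatched order (3, NULL), which is exactly the output-materialization path listed above as incomplete on GPU.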
Workarounds
Option 1: Rewrite as Inner Join
```sql
-- Convert to an inner join if NULL rows aren't needed
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
-- Filter conditions replace the LEFT JOIN logic
WHERE o.customer_id IS NOT NULL;
```

Option 2: Use Subqueries
```sql
-- GPU accelerates the subquery, CPU does the left join
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN (
    SELECT id, customer_name
    FROM customers
    WHERE status = 'active'
) c ON o.customer_id = c.id;
```

Option 3: Split into Multiple Queries
```sql
-- Execute separate queries and join in the application
SELECT * FROM orders WHERE customer_id IS NOT NULL;  -- GPU-accelerated if needed
SELECT * FROM customers;                             -- GPU-accelerated if needed

-- Join in the application layer with correct NULL handling
```

Performance Impact
For joining 10M orders to 1M customers:
- Inner join: ~800 ms (GPU accelerated)
- Left join: ~1,500 ms (CPU fallback)
- Expected GPU (when available): ~600-800 ms
Multi-Region Metrics
Overview
Status: FIXED in latest release
Previously, multi-region metrics were hardcoded to return fake values. This has been corrected to properly calculate metrics across distributed regions.
What Was Fixed
| Metric | Previous Behavior | Current Behavior | Impact |
|---|---|---|---|
| Cross-region latency | Hardcoded 50ms | Actual measurement | Accurate monitoring |
| Replication lag | Hardcoded 10ms | Real-time calculation | Reliable SLA tracking |
| Network throughput | Hardcoded 1GB/s | Actual measurement | Cost optimization |
| Failover time | Hardcoded 5s | Real timing | SLA compliance |
Example: Multi-Region Metrics
```rust
// Previously returned fake values:
// latency_ms: 50 (hardcoded)
// replication_lag_ms: 10 (hardcoded)

// Now returns actual measurements:
let metrics = monitor.get_region_metrics()?;
println!("Latency to us-west: {} ms", metrics.latency_ms);        // Real value
println!("Replication lag: {} ms", metrics.replication_lag_ms);   // Real value
println!("Throughput: {} MB/s", metrics.network_throughput_mbps); // Real value
```

Verification
These metrics are now calculated from:
- Network round-trip time measurements
- Replication log analysis
- Actual data transfer observation
- Real failover execution
No configuration changes needed. All multi-region deployments automatically benefit from accurate metrics.
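As a minimal sketch of how a round-trip latency metric like the one above can be derived from wall-clock timing: the probe below (`ping_region`) is a hypothetical stand-in, not the HeliosDB monitor API.

```rust
use std::time::Instant;

// Times one round trip of an arbitrary probe and returns milliseconds.
// `ping_region` is a hypothetical stand-in for the real network probe.
fn measure_latency_ms<F: Fn()>(ping_region: F) -> f64 {
    let start = Instant::now();
    ping_region(); // round trip to the remote region
    start.elapsed().as_secs_f64() * 1000.0
}

fn main() {
    // With a no-op "probe" the measured latency is near zero, never negative.
    let ms = measure_latency_ms(|| {});
    assert!(ms >= 0.0);
    println!("measured: {:.3} ms", ms);
}
```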
Performance Expectations
When GPU Acceleration Works Best
GPU acceleration provides maximum benefit for:
- Large datasets: 1M+ rows
- Simple operations: Aggregation, filtering, basic joins
- OLAP workloads: Analytical queries on columns
- Repetitive queries: Same query many times
Performance Targets
OLTP Workloads
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Simple filter | 100 ms | 50 ms | 2.0x | >100K rows |
| Basic aggregation | 200 ms | 50 ms | 4.0x | >500K rows |
| Hash join | 300 ms | 110 ms | 2.7x | >1M rows |
| Average | - | - | 2.7x | Typical OLTP |
OLAP Workloads
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Group by aggregation | 2,000 ms | 200 ms | 10.0x | >10M rows |
| Complex filter | 1,500 ms | 250 ms | 6.0x | >5M rows |
| Multiple aggregates | 2,500 ms | 330 ms | 7.5x | >10M rows |
| Average | - | - | 7.5x | Typical OLAP |
When GPU Acceleration Doesn’t Help
GPU acceleration provides minimal or no benefit for:
- Small queries: <100K rows (CPU faster due to overhead)
- Complex logic: Window functions, statistical aggregations
- String operations: Text processing, regex
- OLTP point queries: Single row lookups
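These cases reduce to a simple routing rule, the same one behind the gpu_min_rows setting under Configuration & Tuning: use the GPU only when the estimated kernel time plus the fixed transfer overhead beats the CPU estimate. A sketch with illustrative constants, not the actual planner's cost model:

```rust
// Illustrative cost estimate for one query; figures come from this
// section's examples, not from HeliosDB's planner.
struct CostEstimate {
    cpu_ms: f64,
    gpu_kernel_ms: f64,
    transfer_ms: f64, // H2D + D2H combined
}

fn route_to_gpu(est: &CostEstimate, min_rows: u64, rows: u64) -> bool {
    rows >= min_rows && est.gpu_kernel_ms + est.transfer_ms < est.cpu_ms
}

fn main() {
    // Small query (10K rows): CPU 10 ms vs GPU 5 + 10 = 15 ms -> stay on CPU
    let small = CostEstimate { cpu_ms: 10.0, gpu_kernel_ms: 5.0, transfer_ms: 10.0 };
    assert!(!route_to_gpu(&small, 100_000, 10_000));

    // Large query (100M rows): CPU 5000 ms vs GPU 300 + 20 = 320 ms -> GPU
    let large = CostEstimate { cpu_ms: 5000.0, gpu_kernel_ms: 300.0, transfer_ms: 20.0 };
    assert!(route_to_gpu(&large, 100_000, 100_000_000));
}
```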
Query Size Thresholds
GPU Cost: Data transfer (1-5ms) + Kernel execution (5-100ms) + Result transfer (1-5ms)
```
Small Query (10K rows):
  CPU: 10 ms
  GPU: 5 ms transfer + 5 ms kernel + 5 ms result = 15 ms
  Result: CPU faster

Medium Query (1M rows):
  CPU: 200 ms
  GPU: 5 ms transfer + 50 ms kernel + 5 ms result = 60 ms
  Result: GPU 3.3x faster

Large Query (100M rows):
  CPU: 5,000 ms
  GPU: 10 ms transfer + 300 ms kernel + 10 ms result = 320 ms
  Result: GPU 15.6x faster
```

Real-World Example: E-Commerce Analytics
```sql
-- Query: Top products by revenue for November
SELECT product_id,
       product_name,
       COUNT(*) AS order_count,
       SUM(order_amount) AS revenue
FROM orders
WHERE order_date >= '2025-11-01'
  AND order_date < '2025-12-01'
GROUP BY product_id, product_name
ORDER BY revenue DESC
LIMIT 10;
```
Results:
- Dataset: 50M orders
- CPU execution: 3,200 ms
- GPU execution: 420 ms
- Speedup: 7.6x
- User experience: 2.78 seconds faster

Fallback & Recovery
Automatic Fallback Behavior
When a GPU operation is unavailable or fails, HeliosDB automatically falls back to CPU execution:
```rust
// Pseudocode showing the fallback mechanism
fn execute_window_function(query: Query) -> Result<QueryResult> {
    match gpu_execute(&query) {
        Ok(gpu_result) => Ok(gpu_result),
        Err(gpu_error) => {
            warn!("GPU execution failed: {}. Falling back to CPU.", gpu_error);
            cpu_execute(&query)
        }
    }
}
```

Circuit Breaker
If GPU execution fails multiple times for the same operation:
Failure Count:
- 1st failure: Attempt GPU, fall back to CPU
- 2nd failure: Log warning
- 3rd failure: Circuit breaker opens

With the circuit open:
- Route to CPU automatically
- Retry GPU every 5 minutes
- Log metrics for investigation

Recovery Conditions
GPU execution resumes automatically when:
- Device becomes available (after restart)
- Memory pressure decreases
- Error condition is resolved
No manual intervention required.
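The three-strike behavior described under Circuit Breaker can be sketched as a small state machine. Field names and the reset logic are illustrative assumptions, not the actual implementation:

```rust
// Minimal sketch of a three-strike circuit breaker for GPU routing.
struct GpuCircuitBreaker {
    failures: u32,
    open: bool,
}

impl GpuCircuitBreaker {
    fn new() -> Self {
        Self { failures: 0, open: false }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= 3 {
            self.open = true; // route everything to CPU from now on
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.open = false; // GPU healthy again
    }

    fn allow_gpu(&self) -> bool {
        !self.open // the real breaker would also retry every 5 minutes
    }
}

fn main() {
    let mut cb = GpuCircuitBreaker::new();
    cb.record_failure();
    cb.record_failure();
    assert!(cb.allow_gpu()); // still trying GPU after two failures
    cb.record_failure();
    assert!(!cb.allow_gpu()); // breaker open after the third
    cb.record_success();
    assert!(cb.allow_gpu()); // closes once GPU execution succeeds again
}
```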
Monitoring Fallback Events
```sql
-- Check fallback statistics
SELECT operation_type,
       COUNT(*) AS fallback_count,
       AVG(cpu_time_ms) AS avg_cpu_time,
       AVG(expected_gpu_time_ms) AS expected_gpu_time,
       ROUND(AVG(cpu_time_ms) / AVG(expected_gpu_time_ms), 2) AS speedup_ratio
FROM gpu_fallback_metrics
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY operation_type
ORDER BY fallback_count DESC;
```

Configuration & Tuning
Enable/Disable GPU Acceleration
```sql
-- Disable GPU for the entire session
SET gpu_enabled = false;

-- Disable GPU for a single query
SELECT /*+ GPU_DISABLED */ * FROM large_table WHERE condition;

-- Re-enable GPU
SET gpu_enabled = true;
```

Memory Configuration
```sql
-- Allocate 4GB to the GPU memory pool
SET gpu_memory_limit = 4294967296;  -- bytes

-- Monitor current usage
SELECT gpu_id,
       memory_allocated_mb,
       memory_used_mb,
       memory_free_mb
FROM gpu_memory_stats;
```

Routing Thresholds
```sql
-- Set the minimum row count for GPU routing
SET gpu_min_rows = 100000;  -- Don't use GPU for queries with <100K rows

-- Set the maximum GPU cost for a query
SET gpu_max_cost_ms = 5000;  -- Fall back to CPU if GPU would take >5s
```

Performance Hints
```sql
-- Force GPU execution
SELECT /*+ GPU_ENABLED */ * FROM table WHERE condition;

-- Force CPU execution
SELECT /*+ CPU_ONLY */ * FROM table WHERE condition;

-- Use a specific join algorithm
SELECT /*+ HASH_JOIN(orders, customers) */ o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

Future Roadmap
Q4 2025 (Next 2-3 Weeks)
- Window function kernel fixes (14 functions)
- Result marshaling improvements
- Testing with TPC-H workloads
Q1 2026
- Statistical aggregation GPU kernels
- STDDEV, VARIANCE GPU implementations
- PERCENTILE optimization
Q2 2026
- Complex hash join variants
- Outer join GPU support
- Non-equi join optimization
Q3 2026+
- Custom user-defined functions (UDF) on GPU
- Machine learning model inference on GPU
- Advanced query optimization
Estimated Impact
| Quarter | Feature | Expected Speedup | Est. LOC | Tests |
|---|---|---|---|---|
| Q4 2025 | Window functions | 10-50x | ~2K | 50+ |
| Q1 2026 | Statistics | 5-20x | ~3K | 75+ |
| Q2 2026 | Complex joins | 5-15x | ~2K | 40+ |
Support & Contact
Getting Help
For GPU-related issues:
- Check this document for known limitations
- Enable GPU debugging: SET gpu_debug = true;
- Review logs in /var/log/heliosdb/gpu.log
- Contact support with:
  - Query that caused the issue
  - GPU model and driver version
  - Dataset size and schema
  - Expected vs. actual performance
Feature Request Process
To request GPU acceleration for a specific operation:
1. Submit a GitHub issue including:
   - SQL query example
   - Dataset characteristics (rows, columns, data types)
   - Current workaround
   - Expected vs. observed performance
2. The product team will:
   - Evaluate feasibility
   - Estimate effort
   - Prioritize with other requests
Reporting Bugs
If GPU execution produces incorrect results:
- Verify with SET gpu_enabled = false (CPU-only query)
- Compare results between GPU and CPU execution
- Report with:
  - Exact SQL query
  - Sample data (or data generation script)
  - GPU model and driver version
  - Expected vs. actual results
This is a critical issue and will be addressed immediately.
Appendix: Technical Details
GPU Architecture Overview
```
Query Input
     ↓
[Query Router] ← CPU or GPU?
     ↓
┌─────────────────┬──────────────────┬──────────────────┐
↓                 ↓                  ↓
GPU Path          CPU Path           Fallback
(High-speed)      (Universal)        (Automatic)
↓                 ↓                  ↓
Result Marshaling ← ← ← ← ← ← ← ←
     ↓
[Result Return]
     ↓
Application
```

Memory Layout (GPU)
```
GPU Memory (e.g., a 24GB device)
├── Kernel Code & Constants: ~50MB
├── GPU Buffers:
│   ├── Input data
│   ├── Intermediate results
│   └── Output data
├── Memory Pool: ~80% available
└── Reserved: ~5% for system
```

Latency Breakdown (10M row query)
```
GPU Execution:
├── H2D Transfer: ~50ms (host to device)
├── Kernel Execution: ~150ms
├── D2H Transfer: ~40ms (device to host)
└── Total: ~240ms

CPU Execution:
├── Data loading: ~200ms
├── Computation: ~1500ms
└── Total: ~1700ms

Speedup: 1700 / 240 ≈ 7.1x
```

Document History
| Version | Date | Changes |
|---|---|---|
| 1.0 | Nov 2025 | Initial comprehensive limitations documentation |