HeliosDB GPU - Known Limitations & Workarounds
Version: 1.0 Date: November 2025 Status: Production Release with Identified Feature Gaps
This document provides a comprehensive overview of GPU acceleration limitations in HeliosDB, including unimplemented features, workarounds, and performance expectations.
Table of Contents
- Executive Summary
- GPU Window Functions (14 Features)
- GPU Statistical Aggregations (11 Features)
- GPU Hash Joins
- Multi-Region Metrics
- Performance Expectations
- Fallback & Recovery
- Configuration & Tuning
- Future Roadmap
Executive Summary
HeliosDB GPU acceleration provides significant speedups for core operations:
- OLTP Queries: 2.7x speedup (for supported operations)
- OLAP Queries: 7.5x speedup (for supported operations)
- CPU Fallback: Always available; CPU execution remains competitive with industry databases
Key Limitation Areas:
- Window functions (14 types not yet GPU-optimized)
- Statistical aggregations (11 functions not yet GPU-optimized)
- Complex hash joins (7 kernel configurations incomplete)
- Multi-region metrics collection (now fixed in latest release)
Bottom Line: For queries using supported GPU operations, expect 2-10x speedup. For unsupported operations, queries automatically fall back to CPU execution with no performance penalty beyond normal CPU query time.
GPU Window Functions (14 Features)
Overview
Window functions are analytical SQL operations that compute results over a partition of rows. HeliosDB has GPU window function infrastructure, but the following 14 functions are not yet GPU-executed. Earlier builds returned empty result sets for them (an incomplete kernel execution path); after a recent update, they fall back to CPU execution instead.
Unimplemented Window Functions
| Function | Type | Current Behavior | Typical Use Case |
|---|---|---|---|
| ROW_NUMBER() | Ranking | CPU fallback | Sequential numbering within partition |
| RANK() | Ranking | CPU fallback | Ranking with gaps (ties get same rank) |
| DENSE_RANK() | Ranking | CPU fallback | Ranking without gaps |
| LAG(column, offset) | Offset | CPU fallback | Access previous row value |
| LEAD(column, offset) | Offset | CPU fallback | Access next row value |
| FIRST_VALUE(column) | Frame | CPU fallback | First value in window frame |
| LAST_VALUE(column) | Frame | CPU fallback | Last value in window frame |
| NTILE(n) | Distribution | CPU fallback | Divide partition into n buckets |
| CUME_DIST() | Distribution | CPU fallback | Cumulative distribution (0.0-1.0) |
| PERCENT_RANK() | Distribution | CPU fallback | Relative rank (0.0-1.0) |
| Prefix Sum | Running Aggregate | CPU fallback | Cumulative sum operation |
| Running SUM | Running Aggregate | CPU fallback | Cumulative sum over window frame |
| Running AVG | Running Aggregate | CPU fallback | Cumulative average |
| Running MIN/MAX | Running Aggregate | CPU fallback | Cumulative min/max |
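The three ranking functions differ only in how they number ties. For reference, here is a minimal CPU sketch of that semantics for a single pre-sorted partition; the names and structure are illustrative, not HeliosDB internals:

```rust
// Returns (row_number, rank, dense_rank) for each row of one partition
// already sorted per the window's ORDER BY (descending salary, say).
fn rank_partition(sorted: &[i64]) -> Vec<(usize, usize, usize)> {
    let mut out = Vec::with_capacity(sorted.len());
    let (mut rank, mut dense) = (0usize, 0usize);
    for (i, v) in sorted.iter().enumerate() {
        if i == 0 || sorted[i - 1] != *v {
            rank = i + 1; // RANK leaves gaps after ties
            dense += 1;   // DENSE_RANK does not
        }
        out.push((i + 1, rank, dense)); // ROW_NUMBER is always i + 1
    }
    out
}

fn main() {
    // The tied 80s share rank 2; RANK then skips to 4, DENSE_RANK continues at 3.
    assert_eq!(
        rank_partition(&[90, 80, 80, 70]),
        vec![(1, 1, 1), (2, 2, 2), (3, 2, 2), (4, 4, 3)]
    );
}
```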
Example: ROW_NUMBER() Query
```sql
-- This query will execute on CPU, not GPU
SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Current Behavior:
- Query routes to CPU execution
- Performance: Standard CPU query speed (no GPU acceleration)
- Result: Correct results, just not accelerated
Status & Root Cause
Root Cause: Window function kernels have incomplete execution paths. The CUDA kernel code exists but certain code paths don’t properly marshal results back from GPU memory.
Fix Status:
- Kernel code: Complete (src/window/window_functions.cu)
- Rust wrapper: Complete (src/window_functions.rs)
- Kernel execution: Incomplete (result marshaling)
- Estimated fix time: 2-3 weeks
Workarounds
Option 1: Use CPU Execution (Default)
```sql
-- Automatically falls back to CPU
SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Pros: No configuration needed, correct results
Cons: No GPU acceleration
Option 2: Disable GPU Routing (Explicit CPU)
```sql
-- PostgreSQL compatibility: disable GPU explicitly
SET gpu_enabled = false;

SELECT employee_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM employees;
```

Option 3: Rewrite as Subquery (Partial GPU)
```sql
-- Use GPU for the aggregation, then CPU for the window function
WITH temp_ranked AS (
    SELECT department_id,
           salary,
           COUNT(*) OVER (PARTITION BY department_id) AS dept_size
    FROM employees
)
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank
FROM temp_ranked;
```

Pros: GPU accelerates the aggregation part
Cons: More complex query, needs manual optimization
Performance Impact
For a typical window function query on 10M rows:
- CPU Only: ~2,300 ms
- With CPU Fallback: ~2,300 ms (same, automatic fallback)
- Expected GPU (when available): ~150-300 ms
Note: There is no performance penalty for the fallback. GPU-incompatible queries execute at normal CPU speed.
GPU Statistical Aggregations (11 Features)
Overview
Statistical aggregation functions are not yet optimized on GPU. These functions compute statistical measures like standard deviation, variance, and correlation over groups of data.
Unimplemented Statistical Functions
| Function | Category | Description | Current Behavior |
|---|---|---|---|
| STDDEV(expr) | Variance | Sample standard deviation | CPU fallback |
| STDDEV_POP(expr) | Variance | Population standard deviation | CPU fallback |
| VARIANCE(expr) | Variance | Sample variance | CPU fallback |
| VAR_POP(expr) | Variance | Population variance | CPU fallback |
| PERCENTILE(expr, p) | Percentile | Percentile value (0.25, 0.5, 0.75) | CPU fallback |
| APPROX_PERCENTILE(expr, p) | Percentile | Approximate percentile (faster) | CPU fallback |
| CORR(x, y) | Correlation | Pearson correlation coefficient | CPU fallback |
| COVAR_POP(x, y) | Covariance | Population covariance | CPU fallback |
| COVAR_SAMP(x, y) | Covariance | Sample covariance | CPU fallback |
| REGR_SLOPE(y, x) | Regression | Linear regression slope | CPU fallback |
| REGR_INTERCEPT(y, x) | Regression | Linear regression intercept | CPU fallback |
Example: Statistical Query
```sql
-- This query will use CPU for the statistical functions, not GPU
SELECT product_category,
       COUNT(*) AS order_count,
       AVG(order_amount) AS avg_amount,
       STDDEV(order_amount) AS stddev_amount,   -- CPU fallback
       PERCENTILE(order_amount, 0.5) AS median  -- CPU fallback
FROM orders
GROUP BY product_category;
```

Current Behavior:
- Basic aggregates (COUNT, AVG, MAX, MIN, SUM): GPU accelerated
- Statistical functions: CPU fallback
- GROUP BY operation: Partially GPU accelerated
Supported vs. Unsupported Aggregates
```sql
-- GPU Accelerated
SELECT category,
       COUNT(*) AS cnt,         -- GPU
       SUM(amount) AS total,    -- GPU
       AVG(price) AS avg_price, -- GPU
       MIN(price) AS min_price, -- GPU
       MAX(price) AS max_price  -- GPU
FROM sales
GROUP BY category;
```
```sql
-- ❌ Uses CPU Fallback (still runs correctly)
SELECT category,
       STDDEV(price) AS std_price,    -- CPU
       VARIANCE(price) AS var_price,  -- CPU
       CORR(price, quantity) AS corr  -- CPU
FROM sales
GROUP BY category;
```

Status & Root Cause
Root Cause: Statistical functions require iterative or multi-pass algorithms that are more complex to parallelize on GPU:
- STDDEV: requires computing the mean, then the deviations, then a square root (3+ passes)
- PERCENTILE: requires sorting or histogram computation
- CORR/COVAR: requires parallel covariance computation
- REGR_*: requires parallel linear algebra
Implementation Status:
- Aggregation framework: Complete
- Basic aggregates: GPU-optimized
- Statistical functions: CPU-only implementation exists
- GPU kernels for statistics: Not yet implemented
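The multi-pass problem above is commonly avoided with mergeable partial aggregates (Chan et al.'s pairwise variance update), which let a GPU-style reduction tree compute variance in a single pass. This is a sketch of that general technique, not HeliosDB's planned kernel design:

```rust
// Partial aggregate for variance: count, running mean, and M2
// (the sum of squared deviations from the mean).
#[derive(Clone, Copy)]
struct VarState {
    n: f64,
    mean: f64,
    m2: f64,
}

fn from_value(x: f64) -> VarState {
    VarState { n: 1.0, mean: x, m2: 0.0 }
}

// Exact pairwise merge: combining two partials gives the same result
// as aggregating their inputs together, so any reduction order works.
fn merge(a: VarState, b: VarState) -> VarState {
    let n = a.n + b.n;
    let delta = b.mean - a.mean;
    VarState {
        n,
        mean: a.mean + delta * b.n / n,
        m2: a.m2 + b.m2 + delta * delta * a.n * b.n / n,
    }
}

fn main() {
    let xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0];
    // Tree-style reduction, as a GPU would do across thread blocks
    let total = xs.iter().map(|&x| from_value(x)).reduce(merge).unwrap();
    let stddev_pop = (total.m2 / total.n).sqrt();
    assert!((stddev_pop - 2.0).abs() < 1e-9); // population stddev of xs is 2
}
```

Because the merge is associative, thread blocks can aggregate disjoint chunks independently and combine the partials at the end, which is exactly the shape a GROUP BY kernel needs.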
Workarounds
Option 1: Accept CPU Execution
Statistical functions execute on CPU at standard performance levels. For most datasets, this is acceptable.
```sql
-- Query runs correctly on CPU
SELECT department,
       STDDEV(salary) AS salary_stddev,
       PERCENTILE(salary, 0.75) AS q3_salary
FROM employees
GROUP BY department;
```

Option 2: Approximate Methods (When Available)
```sql
-- Use approximate percentile for faster computation
SELECT product_id,
       APPROX_PERCENTILE(rating, 0.5) AS approx_median_rating
FROM reviews
GROUP BY product_id;
```

Option 3: Compute Manually with GPU
```sql
-- Pre-compute with GPU-accelerated operations
WITH stats AS (
    SELECT department,
           COUNT(*) AS cnt,
           SUM(salary) AS total_salary,
           SUM(salary * salary) AS sum_squared,
           AVG(salary) AS mean_salary  -- GPU
    FROM employees
    GROUP BY department
)
SELECT department,
       mean_salary,
       SQRT(sum_squared / cnt - mean_salary * mean_salary) AS stddev
FROM stats;
```

Pros: Gets GPU acceleration for the component operations
Cons: More complex query, requires mathematical knowledge
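Two caveats on Option 3's SQRT(sum_squared / cnt - mean * mean) rewrite: it yields the population standard deviation (STDDEV_POP, dividing by n rather than n - 1), and the subtraction can lose precision on large, tightly clustered values. A small check of the identity against a direct two-pass computation, on hypothetical data:

```rust
// Two-pass reference: mean first, then squared deviations.
fn stddev_pop_two_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    (xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n).sqrt()
}

// One-pass rewrite matching the SQL above: sum and sum of squares only.
fn stddev_pop_one_pass(xs: &[f64]) -> f64 {
    let n = xs.len() as f64;
    let sum: f64 = xs.iter().sum();
    let sum_sq: f64 = xs.iter().map(|x| x * x).sum();
    let mean = sum / n;
    (sum_sq / n - mean * mean).sqrt()
}

fn main() {
    let salaries = [52_000.0, 61_500.0, 48_250.0, 70_000.0, 55_300.0];
    let a = stddev_pop_two_pass(&salaries);
    let b = stddev_pop_one_pass(&salaries);
    // Agrees to floating-point tolerance on well-conditioned data
    assert!((a - b).abs() < 1e-6);
}
```

For sample standard deviation (SQL STDDEV), divide by cnt - 1 instead of cnt before taking the square root.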
Performance Impact
For a 100M row dataset with GROUP BY:
- Basic aggregates (SUM, AVG, COUNT): ~500 ms (GPU)
- With statistical functions: ~1,200 ms (CPU fallback for STDDEV/PERCENTILE)
- Expected GPU (when available): ~400-600 ms
GPU Hash Joins
Overview
Hash join operations on GPU have partial implementation with 7 incomplete kernel configurations. While basic hash joins work, certain edge cases and configurations fall back to CPU.
Unimplemented Hash Join Variants
| Join Type | Configuration | Status | Use Case |
|---|---|---|---|
| Inner Join | Linear probing | Implemented | Standard equi-joins |
| Inner Join | Cuckoo hashing | Partial | High collision rates |
| Inner Join | Grace partitioning | Partial | Datasets > GPU memory |
| Outer Join | Left + Right | CPU fallback | All rows from both sides |
| Outer Join | Full outer | CPU fallback | Complete outer join |
| Semi-join | With bloom filter | Partial | Existence checks |
| Anti-join | High selectivity | Partial | Set difference operations |
Example: Hash Join Query
```sql
-- Basic inner join: GPU accelerated
SELECT o.order_id, o.amount, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;
```
```sql
-- Outer join: CPU fallback
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id;
```

Status & Root Cause
Root Cause: Hash join implementation is complex with multiple code paths:
- Hash table construction: GPU-accelerated
- Probe phase: GPU-accelerated for inner joins
- Output materialization: Incomplete for outer joins
The remaining 7 kernels involve:
- Multi-way hash joins (join 3+ tables)
- Outer join output construction
- Non-equi join conditions (>, <, BETWEEN)
- String key hashing
Implementation Gap:
- Basic hash join: 90% complete
- Advanced variants: 40% complete
- Estimated effort: 3-4 weeks
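The build/probe structure described above can be sketched on CPU in a few lines: build a hash table on the smaller (customers) side, then probe it with each order row. This is illustrative only; the GPU kernels parallelize both phases across threads.

```rust
use std::collections::HashMap;

// Minimal CPU sketch of an inner hash join on an equality key.
fn hash_inner_join<'a>(
    orders: &'a [(u32, u32)],        // (order_id, customer_id)
    customers: &'a [(u32, &'a str)], // (id, name)
) -> Vec<(u32, &'a str)> {
    // Build phase: hash table keyed on customer id
    let table: HashMap<u32, &str> = customers.iter().cloned().collect();
    // Probe phase: inner join drops orders with no matching customer
    orders
        .iter()
        .filter_map(|&(oid, cid)| table.get(&cid).map(|&name| (oid, name)))
        .collect()
}

fn main() {
    let orders = [(1, 10), (2, 11), (3, 99)]; // customer 99 does not exist
    let customers = [(10, "Ada"), (11, "Bix")];
    assert_eq!(
        hash_inner_join(&orders, &customers),
        vec![(1, "Ada"), (2, "Bix")]
    );
}
```

The outer-join gap is visible here: a LEFT JOIN must also emit the unmatched order (3, NULL), which is exactly the output-materialization path listed above as incomplete on GPU.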
Workarounds
Option 1: Rewrite as Inner Join
```sql
-- Convert to an inner join if NULL rows aren't needed
SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
-- Filter conditions replace the LEFT JOIN logic
WHERE o.customer_id IS NOT NULL;
```

Option 2: Use Subqueries
```sql
-- GPU accelerates the subquery, CPU does the left join
SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN (
    SELECT id, customer_name
    FROM customers
    WHERE status = 'active'
) c ON o.customer_id = c.id;
```

Option 3: Split into Multiple Queries
```sql
-- Execute separate queries and join in the application
SELECT * FROM orders WHERE customer_id IS NOT NULL;  -- GPU-accelerated if needed
SELECT * FROM customers;                             -- GPU-accelerated if needed

-- Join in the application layer with correct NULL handling
```

Performance Impact
For joining 10M orders to 1M customers:
- Inner join: ~800 ms (GPU accelerated)
- Left join: ~1,500 ms (CPU fallback)
- Expected GPU (when available): ~600-800 ms
Multi-Region Metrics
Overview
Status: FIXED in latest release
Previously, multi-region metrics were hardcoded to return fake values. This has been corrected to properly calculate metrics across distributed regions.
What Was Fixed
| Metric | Previous Behavior | Current Behavior | Impact |
|---|---|---|---|
| Cross-region latency | Hardcoded 50ms | Actual measurement | Accurate monitoring |
| Replication lag | Hardcoded 10ms | Real-time calculation | Reliable SLA tracking |
| Network throughput | Hardcoded 1GB/s | Actual measurement | Cost optimization |
| Failover time | Hardcoded 5s | Real timing | SLA compliance |
Example: Multi-Region Metrics
```rust
// Previously returned fake values:
// latency_ms: 50 (hardcoded)
// replication_lag_ms: 10 (hardcoded)

// Now returns actual measurements:
let metrics = monitor.get_region_metrics()?;
println!("Latency to us-west: {} ms", metrics.latency_ms);        // Real value
println!("Replication lag: {} ms", metrics.replication_lag_ms);   // Real value
println!("Throughput: {} MB/s", metrics.network_throughput_mbps); // Real value
```

Verification
These metrics are now calculated from:
- Network round-trip time measurements
- Replication log analysis
- Actual data transfer observation
- Real failover execution
No configuration changes needed. All multi-region deployments automatically benefit from accurate metrics.
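As a minimal sketch of how a round-trip latency metric like the one above can be derived from wall-clock timing: the probe below (`ping_region`) is a hypothetical stand-in, not the HeliosDB monitor API.

```rust
use std::time::Instant;

// Times one round trip of an arbitrary probe and returns milliseconds.
// `ping_region` is a hypothetical stand-in for the real network probe.
fn measure_latency_ms<F: Fn()>(ping_region: F) -> f64 {
    let start = Instant::now();
    ping_region(); // round trip to the remote region
    start.elapsed().as_secs_f64() * 1000.0
}

fn main() {
    // With a no-op "probe" the measured latency is near zero, never negative.
    let ms = measure_latency_ms(|| {});
    assert!(ms >= 0.0);
    println!("measured: {:.3} ms", ms);
}
```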
Performance Expectations
When GPU Acceleration Works Best
GPU acceleration provides maximum benefit for:
- Large datasets: 1M+ rows
- Simple operations: Aggregation, filtering, basic joins
- OLAP workloads: Analytical queries on columns
- Repetitive queries: Same query many times
Performance Targets
OLTP Workloads
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Simple filter | 100 ms | 50 ms | 2.0x | >100K rows |
| Basic aggregation | 200 ms | 50 ms | 4.0x | >500K rows |
| Hash join | 300 ms | 110 ms | 2.7x | >1M rows |
| Average | - | - | 2.7x | Typical OLTP |
OLAP Workloads
| Operation | CPU Time | GPU Time | Speedup | Conditions |
|---|---|---|---|---|
| Group by aggregation | 2,000 ms | 200 ms | 10.0x | >10M rows |
| Complex filter | 1,500 ms | 250 ms | 6.0x | >5M rows |
| Multiple aggregates | 2,500 ms | 330 ms | 7.5x | >10M rows |
| Average | - | - | 7.5x | Typical OLAP |
When GPU Acceleration Doesn’t Help
GPU acceleration provides minimal or no benefit for:
- Small queries: <100K rows (CPU faster due to overhead)
- Complex logic: Window functions, statistical aggregations
- String operations: Text processing, regex
- OLTP point queries: Single row lookups
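These cases reduce to a simple routing rule, the same one behind the gpu_min_rows setting under Configuration & Tuning: use the GPU only when the estimated kernel time plus the fixed transfer overhead beats the CPU estimate. A sketch with illustrative constants, not the actual planner's cost model:

```rust
// Illustrative cost estimate for one query; figures come from this
// section's examples, not from HeliosDB's planner.
struct CostEstimate {
    cpu_ms: f64,
    gpu_kernel_ms: f64,
    transfer_ms: f64, // H2D + D2H combined
}

fn route_to_gpu(est: &CostEstimate, min_rows: u64, rows: u64) -> bool {
    rows >= min_rows && est.gpu_kernel_ms + est.transfer_ms < est.cpu_ms
}

fn main() {
    // Small query (10K rows): CPU 10 ms vs GPU 5 + 10 = 15 ms -> stay on CPU
    let small = CostEstimate { cpu_ms: 10.0, gpu_kernel_ms: 5.0, transfer_ms: 10.0 };
    assert!(!route_to_gpu(&small, 100_000, 10_000));

    // Large query (100M rows): CPU 5000 ms vs GPU 300 + 20 = 320 ms -> GPU
    let large = CostEstimate { cpu_ms: 5000.0, gpu_kernel_ms: 300.0, transfer_ms: 20.0 };
    assert!(route_to_gpu(&large, 100_000, 100_000_000));
}
```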
Query Size Thresholds
GPU Cost: Data transfer (1-5ms) + Kernel execution (5-100ms) + Result transfer (1-5ms)
```
Small Query (10K rows):
  CPU: 10 ms
  GPU: 5 ms transfer + 5 ms kernel + 5 ms result = 15 ms
  Result: CPU faster

Medium Query (1M rows):
  CPU: 200 ms
  GPU: 5 ms transfer + 50 ms kernel + 5 ms result = 60 ms
  Result: GPU 3.3x faster

Large Query (100M rows):
  CPU: 5,000 ms
  GPU: 10 ms transfer + 300 ms kernel + 10 ms result = 320 ms
  Result: GPU 15.6x faster
```

Real-World Example: E-Commerce Analytics
```sql
-- Query: Top products by revenue for November
SELECT product_id,
       product_name,
       COUNT(*) AS order_count,
       SUM(order_amount) AS revenue
FROM orders
WHERE order_date >= '2025-11-01'
  AND order_date < '2025-12-01'
GROUP BY product_id, product_name
ORDER BY revenue DESC
LIMIT 10;
```
Results:
- Dataset: 50M orders
- CPU execution: 3,200 ms
- GPU execution: 420 ms
- Speedup: 7.6x
- User experience: 2.78 seconds faster

Fallback & Recovery
Automatic Fallback Behavior
When a GPU operation is unavailable or fails, HeliosDB automatically falls back to CPU execution:
```rust
// Pseudocode showing the fallback mechanism
fn execute_window_function(query: Query) -> Result<QueryResult> {
    match gpu_execute(&query) {
        Ok(gpu_result) => Ok(gpu_result),
        Err(gpu_error) => {
            warn!("GPU execution failed: {}. Falling back to CPU.", gpu_error);
            cpu_execute(&query)
        }
    }
}
```

Circuit Breaker
If GPU execution fails multiple times for the same operation:
Failure Count:
- 1st failure: Attempt GPU, fall back to CPU
- 2nd failure: Log warning
- 3rd failure: Circuit breaker opens

With the circuit open:
- Route to CPU automatically
- Retry GPU every 5 minutes
- Log metrics for investigation

Recovery Conditions
GPU execution resumes automatically when:
- Device becomes available (after restart)
- Memory pressure decreases
- Error condition is resolved
No manual intervention required.
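The three-strike behavior described under Circuit Breaker can be sketched as a small state machine. Field names and the reset logic are illustrative assumptions, not the actual implementation:

```rust
// Minimal sketch of a three-strike circuit breaker for GPU routing.
struct GpuCircuitBreaker {
    failures: u32,
    open: bool,
}

impl GpuCircuitBreaker {
    fn new() -> Self {
        Self { failures: 0, open: false }
    }

    fn record_failure(&mut self) {
        self.failures += 1;
        if self.failures >= 3 {
            self.open = true; // route everything to CPU from now on
        }
    }

    fn record_success(&mut self) {
        self.failures = 0;
        self.open = false; // GPU healthy again
    }

    fn allow_gpu(&self) -> bool {
        !self.open // the real breaker would also retry every 5 minutes
    }
}

fn main() {
    let mut cb = GpuCircuitBreaker::new();
    cb.record_failure();
    cb.record_failure();
    assert!(cb.allow_gpu()); // still trying GPU after two failures
    cb.record_failure();
    assert!(!cb.allow_gpu()); // breaker open after the third
    cb.record_success();
    assert!(cb.allow_gpu()); // closes once GPU execution succeeds again
}
```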
Monitoring Fallback Events
```sql
-- Check fallback statistics
SELECT operation_type,
       COUNT(*) AS fallback_count,
       AVG(cpu_time_ms) AS avg_cpu_time,
       AVG(expected_gpu_time_ms) AS expected_gpu_time,
       ROUND(AVG(cpu_time_ms) / AVG(expected_gpu_time_ms), 2) AS speedup_ratio
FROM gpu_fallback_metrics
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY operation_type
ORDER BY fallback_count DESC;
```

Configuration & Tuning
Enable/Disable GPU Acceleration
```sql
-- Disable GPU for the entire session
SET gpu_enabled = false;

-- Disable GPU for a single query
SELECT /*+ GPU_DISABLED */ * FROM large_table WHERE condition;

-- Re-enable GPU
SET gpu_enabled = true;
```

Memory Configuration
```sql
-- Allocate 4GB to the GPU memory pool
SET gpu_memory_limit = 4294967296;  -- bytes

-- Monitor current usage
SELECT gpu_id,
       memory_allocated_mb,
       memory_used_mb,
       memory_free_mb
FROM gpu_memory_stats;
```

Routing Thresholds
```sql
-- Set the minimum row count for GPU routing
SET gpu_min_rows = 100000;  -- Don't use GPU for queries with <100K rows

-- Set the maximum GPU cost for a query
SET gpu_max_cost_ms = 5000;  -- Fall back to CPU if GPU would take >5s
```

Performance Hints
```sql
-- Force GPU execution
SELECT /*+ GPU_ENABLED */ * FROM table WHERE condition;

-- Force CPU execution
SELECT /*+ CPU_ONLY */ * FROM table WHERE condition;

-- Use a specific join algorithm
SELECT /*+ HASH_JOIN(orders, customers) */ o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

Future Roadmap
Q4 2025 (Next 2-3 Weeks)
- Window function kernel fixes (14 functions)
- Result marshaling improvements
- Testing with TPC-H workloads
Q1 2026
- Statistical aggregation GPU kernels
- STDDEV, VARIANCE GPU implementations
- PERCENTILE optimization
Q2 2026
- Complex hash join variants
- Outer join GPU support
- Non-equi join optimization
Q3 2026+
- Custom user-defined functions (UDF) on GPU
- Machine learning model inference on GPU
- Advanced query optimization
Estimated Impact
| Quarter | Feature | Expected Speedup | Est. LOC | Tests |
|---|---|---|---|---|
| Q4 2025 | Window functions | 10-50x | ~2K | 50+ |
| Q1 2026 | Statistics | 5-20x | ~3K | 75+ |
| Q2 2026 | Complex joins | 5-15x | ~2K | 40+ |
Support & Contact
Getting Help
For GPU-related issues:
- Check this document for known limitations
- Enable GPU debugging: SET gpu_debug = true;
- Review logs in /var/log/heliosdb/gpu.log
- Contact support with:
  - Query that caused the issue
  - GPU model and driver version
  - Dataset size and schema
  - Expected vs. actual performance
Feature Request Process
To request GPU acceleration for a specific operation:
1. Submit a GitHub issue including:
   - SQL query example
   - Dataset characteristics (rows, columns, data types)
   - Current workaround
   - Expected vs. observed performance
2. The product team will:
   - Evaluate feasibility
   - Estimate effort
   - Prioritize with other requests
Reporting Bugs
If GPU execution produces incorrect results:
- Verify with SET gpu_enabled = false (CPU-only query)
- Compare results between GPU and CPU execution
- Report with:
  - Exact SQL query
  - Sample data (or data generation script)
  - GPU model and driver version
  - Expected vs. actual results
This is a critical issue and will be addressed immediately.
Appendix: Technical Details
GPU Architecture Overview
```
Query Input
     ↓
[Query Router] ← CPU or GPU?
     ↓
┌─────────────────┬──────────────────┬──────────────────┐
↓                 ↓                  ↓
GPU Path          CPU Path           Fallback
(High-speed)      (Universal)        (Automatic)
↓                 ↓                  ↓
Result Marshaling ← ← ← ← ← ← ← ←
     ↓
[Result Return]
     ↓
Application
```

Memory Layout (GPU)
```
GPU Memory (e.g., a 24GB device)
├── Kernel Code & Constants: ~50MB
├── GPU Buffers:
│   ├── Input data
│   ├── Intermediate results
│   └── Output data
├── Memory Pool: ~80% available
└── Reserved: ~5% for system
```

Latency Breakdown (10M row query)
```
GPU Execution:
├── H2D Transfer: ~50ms (host to device)
├── Kernel Execution: ~150ms
├── D2H Transfer: ~40ms (device to host)
└── Total: ~240ms

CPU Execution:
├── Data loading: ~200ms
├── Computation: ~1500ms
└── Total: ~1700ms

Speedup: 1700 / 240 ≈ 7.1x
```

Document History
| Version | Date | Changes |
|---|---|---|
| 1.0 | Nov 2025 | Initial comprehensive limitations documentation |