# Automated ETL Quick Start Guide
**Time to Complete:** 5 minutes
**Prerequisites:** Rust 1.75+, HeliosDB installed
**Last Updated:** January 4, 2026
## Overview
This guide helps you get started with HeliosDB’s Automated ETL feature in just 5 minutes. You’ll learn how to infer schemas, build pipelines, and transform data automatically.
## Step 1: Add Dependency

Add the ETL crate to your `Cargo.toml`:
```toml
[dependencies]
heliosdb-etl = "4.0.0"
tokio = { version = "1.0", features = ["full"] }
```

## Step 2: Basic Schema Inference
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create engine with default configuration
    let config = SchemaInferenceConfig::default();
    let engine = AutomatedETLEngine::new(config).await?;

    // Sample data (simulating CSV rows)
    let data = vec![
        HashMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "Alice".to_string()),
            ("email".to_string(), "alice@example.com".to_string()),
        ]),
        HashMap::from([
            ("id".to_string(), "2".to_string()),
            ("name".to_string(), "Bob".to_string()),
            ("email".to_string(), "bob@example.com".to_string()),
        ]),
    ];

    // Infer schema automatically
    let schema = engine.infer_schema("users", &data).await?;

    println!("Inferred {} columns:", schema.columns.len());
    for col in &schema.columns {
        println!("  {} : {:?}", col.name, col.data_type);
    }
    Ok(())
}
```

## Step 3: Build and Execute Pipeline
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = AutomatedETLEngine::new(SchemaInferenceConfig::default()).await?;

    // Your source data
    let source_data = load_your_data();

    // 1. Infer source schema
    let source_schema = engine.infer_schema("source", &source_data).await?;

    // 2. Build transformation pipeline
    let pipeline = engine.build_pipeline(
        source_schema.clone(),
        source_schema, // Same schema for passthrough
    ).await?;

    // 3. Execute pipeline
    let result = pipeline.execute(source_data).await?;

    println!("Processed {} rows in {:?}", result.rows_processed, result.duration);
    println!("Throughput: {:.0} rows/second", result.throughput);
    Ok(())
}
```
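The example above calls a `load_your_data()` helper that is not defined in this guide. A minimal stand-in, matching the `Vec<HashMap<String, String>>` row shape used in Step 2 (the function name and the inline rows are illustrative; in practice this would read a CSV file or query a database):

```rust
use std::collections::HashMap;

// Illustrative loader: returns rows in the same shape Step 2 uses,
// one HashMap per row, keyed by column name.
fn load_your_data() -> Vec<HashMap<String, String>> {
    vec![
        HashMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "Alice".to_string()),
        ]),
        HashMap::from([
            ("id".to_string(), "2".to_string()),
            ("name".to_string(), "Bob".to_string()),
        ]),
    ]
}
```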
## Step 4: Enable Data Quality Checks

```rust
use heliosdb_etl::{PipelineConfig, DataQualityValidator};

// Enable quality validation in the pipeline
let pipeline_config = PipelineConfig {
    name: "my_pipeline".to_string(),
    source_schema,
    target_schema,
    quality_checks: true,     // Enable quality validation
    anomaly_detection: true,  // Enable anomaly detection
    batch_size: 10_000,
    parallelism: 8,
    ..Default::default()
};

// After pipeline execution, check quality
let validator = DataQualityValidator::new();
let metrics = validator.validate(&result.data, &schema).await?;
println!("Quality Metrics:");println!(" Completeness: {:.1}%", metrics.completeness * 100.0);println!(" Accuracy: {:.1}%", metrics.accuracy * 100.0);println!(" Overall Score: {:.1}%", metrics.overall_score * 100.0);Common Configuration Options
## Common Configuration Options

```rust
use heliosdb_etl::SchemaInferenceConfig;

let config = SchemaInferenceConfig {
    sample_size: 10_000,        // Rows to sample for inference
    confidence_threshold: 0.8,  // Min confidence for type detection
    infer_relationships: true,  // Detect foreign keys
    infer_constraints: true,    // Detect primary keys/unique
    max_rows: 1_000_000,        // Max rows to process
};
```
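To apply these options, pass the config to the engine constructor exactly as in Step 2. The sketch below also prints whatever constraints were inferred; the `constraints` field on the returned schema is an assumption here, so check the full API reference for the actual accessor:

```rust
// Build an engine with the tuned config and re-run inference.
let engine = AutomatedETLEngine::new(config).await?;
let schema = engine.infer_schema("users", &data).await?;

// Hypothetical accessor: primary-key / unique constraints detected
// when `infer_constraints` is enabled.
for constraint in &schema.constraints {
    println!("constraint: {:?}", constraint);
}
```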
## What's Next?

| Topic | Guide |
|---|---|
| Real-world examples | EXAMPLES.md |
| Full API reference | F5.2.4_AUTOMATED_ETL_USER_GUIDE.md |
| CDC integration | CDC Webhooks |
## Troubleshooting
**Issue:** Schema inference returns wrong types
```rust
// Solution: increase the sample size and raise the confidence threshold
let config = SchemaInferenceConfig {
    sample_size: 50_000,
    confidence_threshold: 0.9,
    ..Default::default()
};
```

**Issue:** Slow transformation performance
```rust
// Solution: increase parallelism and batch size.
// num_cpus is an external crate; add it to Cargo.toml if you use it here.
let config = PipelineConfig {
    batch_size: 50_000,
    parallelism: num_cpus::get(),
    ..Default::default()
};
```

**See Also:** README.md | EXAMPLES.md