# Automated ETL Quick Start Guide
**Time to Complete:** 5 minutes
**Prerequisites:** Rust 1.75+, HeliosDB installed
**Last Updated:** January 4, 2026
## Overview
This guide helps you get started with HeliosDB’s Automated ETL feature in just 5 minutes. You’ll learn how to infer schemas, build pipelines, and transform data automatically.
## Step 1: Add Dependency

Add the ETL crate to your `Cargo.toml`:
```toml
[dependencies]
heliosdb-etl = "4.0.0"
tokio = { version = "1.0", features = ["full"] }
```

## Step 2: Basic Schema Inference
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};
use std::collections::HashMap;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create engine with default configuration
    let config = SchemaInferenceConfig::default();
    let engine = AutomatedETLEngine::new(config).await?;

    // Sample data (simulating CSV rows)
    let data = vec![
        HashMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "Alice".to_string()),
            ("email".to_string(), "alice@example.com".to_string()),
        ]),
        HashMap::from([
            ("id".to_string(), "2".to_string()),
            ("name".to_string(), "Bob".to_string()),
            ("email".to_string(), "bob@example.com".to_string()),
        ]),
    ];

    // Infer schema automatically
    let schema = engine.infer_schema("users", &data).await?;

    println!("Inferred {} columns:", schema.columns.len());
    for col in &schema.columns {
        println!("  {} : {:?}", col.name, col.data_type);
    }
    Ok(())
}
```

## Step 3: Build and Execute Pipeline
```rust
use heliosdb_etl::{AutomatedETLEngine, SchemaInferenceConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let engine = AutomatedETLEngine::new(SchemaInferenceConfig::default()).await?;

    // Your source data
    let source_data = load_your_data();

    // 1. Infer source schema
    let source_schema = engine.infer_schema("source", &source_data).await?;

    // 2. Build transformation pipeline
    let pipeline = engine.build_pipeline(
        source_schema.clone(),
        source_schema, // Same schema for passthrough
    ).await?;

    // 3. Execute pipeline
    let result = pipeline.execute(source_data).await?;

    println!("Processed {} rows in {:?}", result.rows_processed, result.duration);
    println!("Throughput: {:.0} rows/second", result.throughput);
    Ok(())
}
```
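The example above calls a `load_your_data()` helper that is not defined in this guide. A minimal stand-in, matching the `Vec<HashMap<String, String>>` row shape used in Step 2 (the function name and the inline rows are illustrative; in practice this would read a CSV file or query a database):

```rust
use std::collections::HashMap;

// Illustrative loader: returns rows in the same shape Step 2 uses,
// one HashMap per row, keyed by column name.
fn load_your_data() -> Vec<HashMap<String, String>> {
    vec![
        HashMap::from([
            ("id".to_string(), "1".to_string()),
            ("name".to_string(), "Alice".to_string()),
        ]),
        HashMap::from([
            ("id".to_string(), "2".to_string()),
            ("name".to_string(), "Bob".to_string()),
        ]),
    ]
}
```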
## Step 4: Enable Data Quality Checks

```rust
use heliosdb_etl::{PipelineConfig, DataQualityValidator};

// Enable quality validation in the pipeline
let pipeline_config = PipelineConfig {
    name: "my_pipeline".to_string(),
    source_schema,
    target_schema,
    quality_checks: true,     // Enable quality validation
    anomaly_detection: true,  // Enable anomaly detection
    batch_size: 10_000,
    parallelism: 8,
    ..Default::default()
};

// After pipeline execution, check quality
let validator = DataQualityValidator::new();
let metrics = validator.validate(&result.data, &schema).await?;
println!("Quality Metrics:");println!(" Completeness: {:.1}%", metrics.completeness * 100.0);println!(" Accuracy: {:.1}%", metrics.accuracy * 100.0);println!(" Overall Score: {:.1}%", metrics.overall_score * 100.0);Common Configuration Options
## Common Configuration Options

```rust
use heliosdb_etl::SchemaInferenceConfig;

let config = SchemaInferenceConfig {
    sample_size: 10_000,        // Rows to sample for inference
    confidence_threshold: 0.8,  // Min confidence for type detection
    infer_relationships: true,  // Detect foreign keys
    infer_constraints: true,    // Detect primary keys/unique
    max_rows: 1_000_000,        // Max rows to process
};
```
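To apply these options, pass the config to the engine constructor exactly as in Step 2. The sketch below also prints whatever constraints were inferred; the `constraints` field on the returned schema is an assumption here, so check the full API reference for the actual accessor:

```rust
// Build an engine with the tuned config and re-run inference.
let engine = AutomatedETLEngine::new(config).await?;
let schema = engine.infer_schema("users", &data).await?;

// Hypothetical accessor: primary-key / unique constraints detected
// when `infer_constraints` is enabled.
for constraint in &schema.constraints {
    println!("constraint: {:?}", constraint);
}
```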
## What's Next?

| Topic | Guide |
|---|---|
| Real-world examples | EXAMPLES.md |
| Full API reference | F5.2.4_AUTOMATED_ETL_USER_GUIDE.md |
| CDC integration | CDC Webhooks |
## Troubleshooting
**Issue:** Schema inference returns wrong types
```rust
// Solution: increase the sample size and raise the confidence threshold
let config = SchemaInferenceConfig {
    sample_size: 50_000,
    confidence_threshold: 0.9,
    ..Default::default()
};
```

**Issue:** Slow transformation performance
```rust
// Solution: increase parallelism and batch size.
// num_cpus is an external crate; add it to Cargo.toml if you use it here.
let config = PipelineConfig {
    batch_size: 50_000,
    parallelism: num_cpus::get(),
    ..Default::default()
};
```

**See Also:** README.md | EXAMPLES.md