Introduction
- Purpose of the manual and bootcamp approach
- What is being homologated (mapped to Rust equivalents) and why
- What is kept and what is replaced
- Suggested roadmap for progressive conversion
- Work environment (compilers, toolchain, testing)
Part 1 – Pipeline Orchestration
Objective: Explain how to integrate Rust within traditional orchestrators or how to replace them.
Chapter 1 – Why Rust in the Modern Data Stack
- Why keep Airflow and not replace it yet?
- Integrating Rust tasks within Airflow (BashOperator, PythonOperator, CustomOperator)
- Example: DAG executing a Rust binary with dynamic parameters (the Rust side is sketched after this list)
- Logs, XCom, and state between mixed Python/Rust tasks
- Alternatives: Prefect 2.x, Dagster Pipes with Rust code
- Rust-first orchestrators: Windmill (structure, flows), Temporal (modeling, Rust SDK)
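To make the Airflow integration concrete, here is a minimal sketch of the Rust side of such a task: a binary that a BashOperator could invoke with dynamic parameters, whose exit code marks success or failure and whose last stdout line Airflow can push to XCom when `do_xcom_push=True`. The binary name, flags, and JSON payload are illustrative, not taken from any particular project.

```rust
// Minimal sketch: a Rust task binary an Airflow BashOperator could call,
// e.g. `extract_orders --date {{ ds }} --limit 500` (names are illustrative).
use std::env;
use std::process::ExitCode;

fn main() -> ExitCode {
    // Collect dynamic parameters passed by the orchestrator.
    let args: Vec<String> = env::args().skip(1).collect();
    if args.is_empty() {
        eprintln!("usage: extract_orders --date <YYYY-MM-DD> [--limit <n>]");
        return ExitCode::from(2); // non-zero exit code marks the Airflow task as failed
    }

    // ... do the actual extraction/transformation here; the row count below is a placeholder ...

    // Print a single JSON line on stdout; with BashOperator(do_xcom_push=True),
    // Airflow pushes the last stdout line to XCom for downstream Python tasks.
    println!(r#"{{"status":"ok","rows":1234,"args":{:?}}}"#, args);
    ExitCode::SUCCESS
}
```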
Part 2 – Transformations and ETL
Objective: Cover everything that would be done with pandas, PySpark, Dask, or ETL scripts in Python.
Chapter 2 – Transformations and ETL in Rust
- Introduction to Polars (in Rust and Python) – key differences from pandas
- Reading CSV, Parquet, JSON, and Feather in Rust with Polars
- Basic transformations: filters, groupby, joins, casting, date handling
- Advanced transformations: window functions, pivots, lazy evaluation
- Polars vs pandas vs PySpark – comparative table and benchmarks
- DataFusion: SQL on Parquet in Rust (embedding + CLI)
- Ballista: distributed processing (status, current scope, examples)
- Complete ETL pipeline in Rust: ingestion → transformation → output to Parquet (see the sketch after this list)
- Using crates: csv, arrow2, parquet, serde
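As a preview of the full ETL chapter, a minimal sketch of the ingestion → transformation → Parquet flow with Polars' lazy API. It assumes the polars crate with the lazy, csv, and parquet features enabled; exact constructor and method names (e.g. `group_by` vs the older `groupby`) vary between Polars versions, and the file and column names are illustrative.

```rust
// Minimal sketch of a Polars lazy ETL: scan CSV, filter, aggregate, write Parquet.
// API details differ slightly across Polars versions; adjust to the version you pin.
use polars::prelude::*;
use std::fs::File;

fn main() -> PolarsResult<()> {
    // Ingest: scan the CSV lazily so filters and aggregations can be pushed down.
    let mut df = LazyCsvReader::new("sales.csv") // hypothetical input file
        .finish()?
        .filter(col("amount").gt(lit(0)))
        .group_by([col("country")])
        .agg([col("amount").sum().alias("total_amount")])
        .collect()?; // executes the optimized plan

    // Output: write the aggregated result to Parquet.
    let file = File::create("sales_by_country.parquet")?;
    ParquetWriter::new(file).finish(&mut df)?;
    Ok(())
}
```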
Part 3 – Data Modeling (dbt, SQLMesh)
Objective: Cover the replacement and/or acceleration of dbt-style modeling.
Chapter 3 – Data Modeling
- dbt Core: possible coupling points for Rust (hooks, macros, external binaries)
- dbt Fusion: Rust as the new internal engine – practical implications
- SDF (Semantic Data Fabric): structure, commands, use as a dbt replacement
- Quary: CLI replacement for dbt – compatibility, speed, limitations
- Implementing dependency control between models in Rust (see the sketch after this list)
- Example: materializable SQL model executed from a Rust binary
- Considerations on metadata, compilation, graph versioning
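For the dependency-control item above, a minimal sketch using only the standard library: a Kahn-style topological sort over a model graph, the kind of run order dbt derives from `ref()` calls. The model names are illustrative.

```rust
use std::collections::{HashMap, VecDeque};

/// `deps` maps each model to its upstream models (what it `ref()`s).
/// Returns a valid run order, or an error if the graph contains a cycle.
fn execution_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Result<Vec<&'a str>, String> {
    let mut remaining: HashMap<&'a str, usize> = HashMap::new(); // unresolved upstreams per model
    let mut downstream: HashMap<&'a str, Vec<&'a str>> = HashMap::new();

    for (&model, ups) in deps {
        remaining.entry(model).or_insert(0);
        for &up in ups {
            *remaining.entry(model).or_insert(0) += 1; // one more unresolved upstream
            downstream.entry(up).or_default().push(model);
            remaining.entry(up).or_insert(0); // register upstreams with no entry of their own
        }
    }

    // Models with no unresolved upstreams (raw sources, seeds) can run first.
    let mut ready: VecDeque<&str> = remaining
        .iter()
        .filter(|&(_, &n)| n == 0)
        .map(|(&m, _)| m)
        .collect();

    let mut order = Vec::new();
    while let Some(model) = ready.pop_front() {
        order.push(model);
        for &down in downstream.get(model).into_iter().flatten() {
            let n = remaining.get_mut(down).expect("model was registered above");
            *n -= 1;
            if *n == 0 {
                ready.push_back(down); // all of its upstreams are now materialized
            }
        }
    }

    if order.len() == remaining.len() {
        Ok(order)
    } else {
        Err("cycle detected in the model graph".into())
    }
}

fn main() {
    // Illustrative graph: a mart depends on staging, staging depends on a raw source.
    let deps = HashMap::from([
        ("raw_orders", vec![]),
        ("stg_orders", vec!["raw_orders"]),
        ("fct_sales", vec!["stg_orders"]),
    ]);
    println!("{:?}", execution_order(&deps)); // -> Ok(["raw_orders", "stg_orders", "fct_sales"])
}
```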
Part 4 – Validation and Quality Assurance (Great Expectations)
Objective: See how to replace or complement tools like Great Expectations.
Chapter 4 – Data Validation and Quality
- GE: integration with Rust tasks (in Airflow or standalone)
- Validation in Polars: assertions, checks, slicing, and aggregate metrics
- Building a validation module in Rust: structure, JSON reporting (see the sketch after this list)
- Minimal declarative checks with YAML or embedded config
- Mixed use cases: GE defines expectations, Rust evaluates them
- Exporting results for dashboards or auditing
- Alternatives like SDF Checks and extended use in Rust modeling
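A minimal sketch of the validation-module idea referenced above: each check returns a serializable result, the run emits a JSON report, and a non-zero exit code signals failure to the orchestrator. It assumes the serde (with derive) and serde_json crates; the `Row` type, check names, and thresholds are illustrative.

```rust
// Minimal sketch of a validation module with JSON reporting, in the spirit of
// Great Expectations "expectations". Types and check names are illustrative.
use serde::Serialize;

#[derive(Serialize)]
struct CheckResult {
    check: String,
    passed: bool,
    observed: f64,
    threshold: f64,
}

struct Row { amount: f64, country: Option<String> }

/// "country must not be null" expressed over an in-memory batch.
fn check_not_null(rows: &[Row], max_null_ratio: f64) -> CheckResult {
    let nulls = rows.iter().filter(|r| r.country.is_none()).count();
    let ratio = nulls as f64 / rows.len().max(1) as f64;
    CheckResult {
        check: "country_not_null".into(),
        passed: ratio <= max_null_ratio,
        observed: ratio,
        threshold: max_null_ratio,
    }
}

fn check_non_negative(rows: &[Row]) -> CheckResult {
    let bad = rows.iter().filter(|r| r.amount < 0.0).count();
    CheckResult {
        check: "amount_non_negative".into(),
        passed: bad == 0,
        observed: bad as f64,
        threshold: 0.0,
    }
}

fn main() {
    let rows = vec![
        Row { amount: 10.0, country: Some("AR".into()) },
        Row { amount: -1.0, country: None },
    ];
    let report = vec![check_not_null(&rows, 0.1), check_non_negative(&rows)];
    // JSON report that a dashboard or audit job can consume downstream.
    println!("{}", serde_json::to_string_pretty(&report).unwrap());
    // Fail the task if any expectation failed (useful under Airflow or cron).
    if report.iter().any(|c| !c.passed) {
        std::process::exit(1);
    }
}
```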
Part 5 – CLI Pipelines and Scripting
Objective: Replace Python/Make/Luigi scripts with safe Rust executables.
Chapter 5 – CLI Pipelines and Rust Task Execution
- CLI in Rust: structure, using Clap, subcommands (see the sketch after this list)
- Local pipeline of chained steps (e.g., ingest → clean → export)
- Justfiles and cargo-make as reproducible task tools
- Structured output: logs, files, exit codes
- Running under cron or invoked from Airflow/Prefect
- Logging and metrics from CLI (with Prometheus or tracing)
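A minimal sketch of the Clap-based CLI referenced above, assuming clap v4 with the derive feature; the subcommand and flag names are illustrative.

```rust
// Minimal sketch of a pipeline CLI with clap subcommands (clap v4 "derive" feature assumed).
use clap::{Parser, Subcommand};

#[derive(Parser)]
#[command(name = "pipeline", about = "Local data pipeline runner")]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Download or read raw data into the working directory
    Ingest { #[arg(long)] source: String },
    /// Apply cleaning/transformation steps
    Clean { #[arg(long, default_value = "data/raw")] input: String },
    /// Write the final dataset (e.g. Parquet) to its destination
    Export { #[arg(long)] output: String },
}

fn main() {
    let cli = Cli::parse();
    // Each arm returns an exit code so cron or Airflow can detect failures.
    let code = match cli.command {
        Command::Ingest { source } => { println!("ingesting from {source}"); 0 }
        Command::Clean { input } => { println!("cleaning {input}"); 0 }
        Command::Export { output } => { println!("exporting to {output}"); 0 }
    };
    std::process::exit(code);
}
```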
Part 6 – Storage and Analytical Engines
Objective: Direct interaction with databases and query engines.
Chapter 6 – Storage and Analytical Engines
- Connecting to PostgreSQL, Redshift, Snowflake from Rust
- Using tokio-postgres, sqlx, Diesel (tokio-postgres sketched after this list)
- Bulk insert, batch insert, and data streaming
- Reading/writing Parquet, Feather, and Arrow
- Embedded DuckDB in Rust (CLI and library)
- Databend: use as an alternative warehouse
- Comparing ingestion throughput and RAM usage
- Pipeline: Rust connects to PG, transforms with Polars, exports Parquet
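A minimal sketch of the PostgreSQL connection referenced above, using tokio-postgres under tokio (macros and runtime features assumed). The connection string, query, and column types are illustrative; the `::float8` cast keeps the aggregate mappable to `f64`.

```rust
// Minimal sketch of querying PostgreSQL with tokio-postgres.
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=etl dbname=analytics", NoTls).await?;

    // The connection object drives the socket; run it on its own task.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    // Pull rows out of Postgres; from here they could feed Polars and end up in Parquet.
    let rows = client
        .query(
            "SELECT country, sum(amount)::float8 AS total FROM sales GROUP BY country",
            &[],
        )
        .await?;

    for row in rows {
        let country: &str = row.get("country");
        let total: f64 = row.get("total");
        println!("{country}: {total}");
    }
    Ok(())
}
```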
Part 7 – Arrow and Columnar Memory
Objective: Understand Arrow in Rust and its advantages in efficient pipelines.
Chapter 7 – Apache Arrow and Columnar Memory
- Apache Arrow in Rust: RecordBatch and Array structure (see the sketch after this list)
- How Arrow's in-memory representation differs from pandas/NumPy
- Arrow2: improvements and use cases
- Arrow Flight: data transfer between systems (RPC)
- Example: Rust app exposing a Flight endpoint
- Zero-copy between processes/languages using Arrow
- Conversion Polars → Arrow → Python (interoperability)
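A minimal sketch of building a RecordBatch with the arrow crate, as referenced above; the field names and values are illustrative.

```rust
// Minimal sketch of constructing an in-memory columnar batch with the `arrow` crate.
use std::sync::Arc;
use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Schema: column names + logical types, independent of any row representation.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("country", DataType::Utf8, false),
    ]));

    // Each column is one contiguous, typed buffer (the columnar layout Arrow defines).
    let ids = Int64Array::from(vec![1, 2, 3]);
    let countries = StringArray::from(vec!["AR", "BR", "UY"]);

    // A RecordBatch bundles columns with the schema; this is the unit that Flight,
    // Polars, DataFusion, and Python (via pyarrow) can exchange without copying.
    let batch = RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(countries)])?;
    println!("{} rows x {} columns", batch.num_rows(), batch.num_columns());
    Ok(())
}
```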
Part 8 – Streaming and Event-Driven Architecture
Objective: Real-time and event-driven processing.
Chapter 8 – Streaming and Event-Driven Architectures
- Kafka client in Rust (rdkafka, performance; consumer sketched after this list)
- Pulsar consumer in Rust
- Fluvio: Rust broker + embedded stream processing
- SmartModules with WebAssembly in Fluvio
- Materialize / RisingWave – SQL views on streams
- Vector: log ingestion as stream input
- Example: Real-time ETL with Kafka + Rust + ClickHouse
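A minimal sketch of the rdkafka consumer referenced above, using the async StreamConsumer (rdkafka's tokio support assumed); the broker address, group id, and topic are illustrative.

```rust
// Minimal sketch of consuming a Kafka topic with rdkafka's async StreamConsumer.
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::Message;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "rust-etl")
        .set("auto.offset.reset", "earliest")
        .create()?;

    consumer.subscribe(&["events"])?;

    loop {
        // recv() awaits the next message; payload bytes would typically be
        // deserialized (e.g. with serde_json) and written on to ClickHouse or similar.
        let msg = consumer.recv().await?;
        if let Some(payload) = msg.payload() {
            println!(
                "partition {} offset {}: {} bytes",
                msg.partition(),
                msg.offset(),
                payload.len()
            );
        }
    }
}
```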
Part 9 – Observability and Metrics
Objective: Logs, traces, and metrics from Rust tasks and services.
Chapter 9 – Observability and Metrics
- Prometheus crate: counter, histogram, HTTP exporter
- Tracing in Rust: spans, events, levels
- OpenTelemetry + Jaeger from Rust services
- Structured logging in JSON – integration with Loki or ELK (see the sketch after this list)
- Vector: universal collector written in Rust
- Example: DAG executing Rust with parsable metrics and logs
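A minimal sketch of spans plus JSON-structured logs with tracing, as referenced above. It assumes the tracing crate and tracing-subscriber with the json feature; the job and field names are illustrative.

```rust
// Minimal sketch of structured, JSON-formatted logs and spans with `tracing`,
// ready to be shipped by Vector to Loki or ELK.
use tracing::{info, instrument};

#[instrument] // creates a span named after the function, with its arguments as fields
fn load_partition(date: &str, rows: usize) {
    info!(rows, date, "partition loaded"); // structured fields, not string interpolation
}

fn main() {
    // One JSON object per log line on stdout: easy for Vector/Loki/ELK to parse.
    tracing_subscriber::fmt().json().init();

    info!(job = "daily_etl", "starting run");
    load_partition("2024-01-01", 10_000);
    info!(job = "daily_etl", "run finished");
}
```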
Part 10 – APIs and Serving Models/Pipelines
Objective: Serve results, pipelines, or models via HTTP.
Chapter 10 – Serving Pipelines and Models via API
- Actix-web vs Axum: structure, routes, middlewares
- Input/output serialization and validation with Serde
- Example: Rust service that receives JSON, transforms, and responds (see the sketch after this list)
- Serving models with ONNX Runtime (ort crate)
- Serving batch pipelines as endpoints (execution trigger)
- Comparison with FastAPI – latency, RAM, throughput
- Logs, metrics, and traces from the API
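A minimal sketch of the JSON-in/JSON-out service referenced above, written against an axum 0.7-style setup with tokio and serde; the route and payload shape are illustrative.

```rust
// Minimal sketch of an HTTP service that receives JSON, transforms it, and responds.
use axum::{routing::post, Json, Router};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct TransformRequest {
    values: Vec<f64>,
}

#[derive(Serialize)]
struct TransformResponse {
    count: usize,
    sum: f64,
    mean: f64,
}

// Serde validates and deserializes the body; the handler only sees typed data.
async fn transform(Json(req): Json<TransformRequest>) -> Json<TransformResponse> {
    let sum: f64 = req.values.iter().sum();
    let count = req.values.len();
    Json(TransformResponse {
        count,
        sum,
        mean: if count > 0 { sum / count as f64 } else { 0.0 },
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/transform", post(transform));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```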
Part 11 – Distribution and Deployment
Objective: Prepare production-ready binaries and containers.
Chapter 11 – Distribution and Deployment
- Compiling to static binaries – cross-compiling
- Using Docker for Rust tasks
- Deployment on Kubernetes / ECS
- Versioning and testing of Rust tasks (see the sketch after this list)
- CI/CD for tasks, pipelines, and services
- Integrated observability and health checks
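A minimal sketch for the versioning-and-testing item above: the crate version is embedded at compile time and surfaced by a small health function with a unit test. The function name and output format are illustrative, not a fixed convention.

```rust
// Minimal sketch of versioning a Rust task: embed the crate version at build time
// and expose it through the kind of output a /healthz check or --version flag would use.
const VERSION: &str = env!("CARGO_PKG_VERSION"); // injected by Cargo at compile time

fn health() -> (&'static str, &'static str) {
    // What a health/readiness check would report for this task.
    ("ok", VERSION)
}

fn main() {
    let (status, version) = health();
    println!("{{\"status\":\"{status}\",\"version\":\"{version}\"}}");
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn health_reports_ok_and_a_version() {
        let (status, version) = health();
        assert_eq!(status, "ok");
        assert!(!version.is_empty());
    }
}
```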
Part 12 – Appendices
Objective: Include useful references, reusable snippets, and templates.
Chapter 12 – Appendices, Snippets, and Reusable Assets
- Polars transformation cheatsheet (in Rust)
- Connection snippets for PG, S3, REST, Kafka
- Project templates: CLI, service, task
- Equivalence tables: pandas → Polars, dbt → SDF/Quary (one example row sketched after this list)
- Recommended crates by category
- Books, videos, key resources by module
- Legal considerations (licenses, redistribution, compliance)
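As one example row of the pandas → Polars equivalence table, the pandas expression df.groupby("country")["amount"].mean() written with the Polars Rust lazy API; it assumes the polars crate with the lazy feature (method names vary slightly by version), and the column names are illustrative.

```rust
// Minimal sketch: pandas df.groupby("country")["amount"].mean() in Polars (Rust, lazy API).
use polars::prelude::*;

fn mean_amount_by_country(df: DataFrame) -> PolarsResult<DataFrame> {
    df.lazy()
        .group_by([col("country")])
        .agg([col("amount").mean().alias("amount_mean")])
        .collect()
}
```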