Rust for the Modern Data Stack – Structured Index

Introduction

Purpose of the manual and bootcamp approach
What is being mapped to Rust equivalents and why
What is kept and what is replaced
Suggested roadmap for progressive conversion
Work environment (compilers, toolchain, testing)

⚙️ Part 1 – Pipeline Orchestration

Objective: Explain how to integrate Rust within traditional orchestrators or how to replace them.

Chapter 1 – Pipeline Orchestration with Rust

Why keep Airflow and not replace it yet?
Integrating Rust tasks within Airflow (BashOperator, PythonOperator, CustomOperator)
Example: DAG executing a Rust binary with dynamic parameters
Logs, XCom, and state between mixed Python/Rust tasks
Alternatives: Prefect 2.x, Dagster Pipes with Rust code
Rust-based and Rust-friendly orchestrators: Windmill (structure, flows), Temporal (modeling, Rust SDK)

🧮 Part 2 – Transformations and ETL

Objective: Cover everything that would be done with pandas, PySpark, Dask, or ETL scripts in Python.

Chapter 2 – Transformations and ETL in Rust

Introduction to Polars (in Rust and Python) – key differences from pandas
Reading CSV, Parquet, JSON, and Feather in Rust with Polars
Basic transformations: filters, group-by, joins, casting, date handling
Advanced transformations: window functions, pivots, lazy evaluation
Polars vs pandas vs PySpark – comparative table and benchmarks
DataFusion: SQL on Parquet in Rust (embedding + CLI)
Ballista: distributed processing (status, current scope, examples)
Complete ETL pipeline in Rust: ingestion → transformation → output to Parquet
Using crates: csv, arrow2, parquet, serde

🗃️ Part 3 – Data Modeling (dbt, SQLMesh)

Objective: Cover the replacement and/or acceleration of dbt-style modeling.

Chapter 3 – Data Modeling

dbt Core: possible coupling points for Rust (hooks, macros, external binaries)
dbt Fusion: Rust as the new internal engine – practical implications
SDF (Semantic Data Fabric): structure, commands, use as a dbt replacement
Quary: CLI replacement for dbt – compatibility, speed, limitations
Implementing dependency control between models in Rust
Example: materializable SQL model executed from a Rust binary
Considerations for metadata, compilation, and graph versioning

📋 Part 4 – Validation and Quality Assurance (Great Expectations)

Objective: Show how to replace or complement tools like Great Expectations.

Chapter 4 – Data Validation and Quality

Great Expectations (GE): integration with Rust tasks (in Airflow or standalone)
Validation in Polars: assertions, checks, slicing, and aggregate metrics
Building a validation module in Rust: structure, JSON reporting
Minimal declarative checks with YAML or embedded config
Mixed use cases: GE defines expectations, Rust evaluates them
Exporting results for dashboards or auditing
Alternatives such as SDF checks, and extending them to Rust-based modeling

🧑‍💻 Part 5 – CLI Pipelines and Scripting

Objective: Replace Python/Make/Luigi scripts with safe Rust executables.

Chapter 5 – CLI Pipelines and Rust Task Execution

CLIs in Rust: project structure, the clap crate, subcommands
Local pipeline of chained steps (e.g., ingest → clean → export)
Justfiles and cargo-make as reproducible task tools
Structured output: logs, files, exit codes
Execution via cron or invocation from Airflow/Prefect
Logging and metrics from the CLI (with tracing or Prometheus)

🛰️ Part 6 – Storage and Analytical Engines

Objective: Interact directly with databases and query engines.

Chapter 6 – Storage and Analytical Engines

Connecting to PostgreSQL, Redshift, Snowflake from Rust
Using tokio-postgres, sqlx, Diesel
Bulk insert, batch insert, and data streaming
Reading/writing Parquet, Feather, and Arrow
Embedded DuckDB in Rust (CLI and library)
Databend: use as an alternative warehouse
Ingestion throughput and RAM comparison
Example pipeline: Rust reads from PostgreSQL, transforms with Polars, exports Parquet

🧱 Part 7 – Arrow and Columnar Memory

Objective: Understand Arrow in Rust and its advantages in efficient pipelines.

Chapter 7 – Apache Arrow and Columnar Memory

Apache Arrow in Rust: RecordBatch and Array structure
Differences between Arrow and pandas/NumPy in memory representation
Arrow2: improvements, use cases, and maintenance status
Arrow Flight: data transfer between systems (RPC)
Example: Rust app exposing a Flight endpoint
Zero-copy between processes/languages using Arrow
Conversion Polars ↔ Arrow ↔ Python (interoperability)

🔁 Part 8 – Streaming and Event-Driven Architecture

Objective: Process data and events in real time.

Chapter 8 – Streaming and Event-Driven Architectures

Kafka client in Rust (rdkafka, performance)
Pulsar consumer in Rust
Fluvio: Rust broker + embedded stream processing
SmartModules with WebAssembly in Fluvio
Materialize / RisingWave – SQL views on streams
Vector: log ingestion as stream input
Example: Real-time ETL with Kafka + Rust + ClickHouse

📊 Part 9 – Observability and Metrics

Objective: Emit logs, traces, and metrics from Rust tasks and services.

Chapter 9 – Observability and Metrics

Prometheus crate: counter, histogram, HTTP exporter
Tracing in Rust: spans, events, levels
OpenTelemetry + Jaeger from Rust services
Structured logging in JSON – integration with Loki or ELK
Vector: universal collector written in Rust
Example: DAG executing Rust with parsable metrics and logs

🌐 Part 10 – APIs and Serving of Models/Pipelines

Objective: Serve results, pipelines, or models via HTTP.

Chapter 10 – Serving Pipelines and Models via API

Actix Web vs Axum: structure, routes, middleware
Input/output serialization and validation with Serde
Example: Rust service that receives JSON, transforms, and responds
Serving models with ONNX Runtime (ort crate)
Serving batch pipelines as endpoints (execution trigger)
Comparison with FastAPI – latency, RAM, throughput
Logs, metrics, and traces from the API

📦 Part 11 – Distribution and Deployment

Objective: Prepare production-ready binaries and containers.

Chapter 11 – Distribution and Deployment

Compiling static binaries and cross-compiling
Using Docker for Rust tasks
Deployment on Kubernetes / ECS
Versioning and testing of Rust tasks
CI/CD for tasks, pipelines, and services
Integrated observability and health checks

📚 Part 12 – Appendices

Objective: Include useful references, reusable snippets, and templates.

Chapter 12 – Appendices, Snippets, and Reusable Assets

Polars transformation cheatsheet (in Rust)
Connection snippets for PG, S3, REST, Kafka
Project templates: CLI, service, task
Equivalence tables: pandas → Polars, dbt → SDF/Quary
Recommended crates by category
Books, videos, and key resources by module
Legal considerations (licenses, redistribution, compliance)