📘

Rust for the Modern Data Stack – Structured Index

Introduction

  • Purpose of the manual and bootcamp approach
  • What is being homologated (mapped to Rust equivalents) and why
  • What is kept and what is replaced
  • Suggested roadmap for progressive conversion
  • Work environment (compilers, toolchain, testing)

⚙️ Part 1 – Pipeline Orchestration

Objective: Explain how to integrate Rust within traditional orchestrators or how to replace them.

Chapter 1 โ€“ Why Rust in the Modern Data Stack

  • Why keep Airflow and not replace it yet?
  • Integrating Rust tasks within Airflow (BashOperator, PythonOperator, CustomOperator)
  • Example: DAG executing a Rust binary with dynamic parameters
  • Logs, XCom, and state between mixed Python/Rust tasks
  • Alternatives: Prefect 2.x, Dagster Pipes with Rust code
  • Rust-first orchestrators: Windmill (structure, flows), Temporal (modeling, Rust SDK)

🧮 Part 2 – Transformations and ETL

Objective: Cover everything that would be done with pandas, PySpark, Dask, or ETL scripts in Python.

Chapter 2 โ€“ Transformations and ETL in Rust

  • Introduction to Polars (in Rust and Python) – key differences with pandas
  • Reading CSV, Parquet, JSON, and Feather in Rust with Polars
  • Basic transformations: filters, groupby, joins, casting, date handling
  • Advanced transformations: window functions, pivots, lazy evaluation
  • Polars vs pandas vs PySpark – comparative table and benchmarks
  • DataFusion: SQL on Parquet in Rust (embedding + CLI)
  • Ballista: distributed processing (status, current scope, examples)
  • Complete ETL pipeline in Rust: ingestion → transformation → output to Parquet
  • Using crates: csv, arrow2, parquet, serde

🗃️ Part 3 – Data Modeling (dbt, SQLMesh)

Objective: Cover the replacement and/or acceleration of dbt-style modeling.

Chapter 3 โ€“ Data Modeling

  • dbt Core: possible coupling points for Rust (hooks, macros, external binaries)
  • dbt Fusion: Rust as the new internal engine – practical implications
  • SDF (Semantic Data Fabric): structure, commands, use as a dbt replacement
  • Quary: CLI replacement for dbt – compatibility, speed, limitations
  • Implementing dependency control between models in Rust
  • Example: materializable SQL model executed from a Rust binary
  • Considerations on metadata, compilation, graph versioning
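
As a rough illustration of the dependency-control item above, the sketch below orders models so that each one runs after the models it depends on. It uses Kahn's topological sort with the standard library only. The function name `execution_order` and the model names in the note that follows are hypothetical; nothing here is taken from dbt, SDF, or Quary.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Returns model names in an order where every model comes after its
/// dependencies, or an error if the graph contains a cycle.
/// `deps` maps a model name to the models it depends on (hypothetical shape).
fn execution_order(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, String> {
    // Count unresolved dependencies per model and record reverse edges.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&model, parents) in deps {
        pending.entry(model).or_insert(0);
        for &parent in parents {
            // A parent that never appears as a key is treated as a source.
            pending.entry(parent).or_insert(0);
            *pending.get_mut(model).unwrap() += 1;
            dependents.entry(parent).or_default().push(model);
        }
    }

    // Start with the models that have nothing left to wait for.
    let mut ready: VecDeque<&str> = VecDeque::new();
    for (model, count) in &pending {
        if *count == 0 {
            ready.push_back(*model);
        }
    }

    // Emit a ready model, then release the models that were waiting on it.
    let mut order = Vec::new();
    while let Some(model) = ready.pop_front() {
        order.push(model.to_string());
        if let Some(children) = dependents.get(model) {
            for &child in children {
                let remaining = pending.get_mut(child).unwrap();
                *remaining -= 1;
                if *remaining == 0 {
                    ready.push_back(child);
                }
            }
        }
    }

    // Any model never emitted is part of (or downstream of) a cycle.
    if order.len() == pending.len() {
        Ok(order)
    } else {
        Err(format!(
            "cycle detected: {} model(s) could not be ordered",
            pending.len() - order.len()
        ))
    }
}
```

With a hypothetical graph where `stg_orders` depends on `raw_orders`, `stg_customers` on `raw_customers`, and `fct_sales` on both staging models, the raw models come first and `fct_sales` last. Using `BTreeMap` keeps the order deterministic between runs, which helps when the result is logged or compared in tests.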

📋 Part 4 – Validation and Quality Assurance (Great Expectations)

Objective: See how to replace or complement tools like Great Expectations.

Chapter 4 โ€“ Data Validation and Quality

  • GE: integration with Rust tasks (in Airflow or standalone)
  • Validation in Polars: assertions, checks, slicing, and aggregate metrics
  • Building a validation module in Rust: structure, JSON reporting
  • Minimal declarative checks with YAML or embedded config
  • Mixed use cases: GE defines expectations, Rust evaluates them
  • Exporting results for dashboards or auditing
  • Alternatives like SDF Checks and extended use in Rust modeling
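
To make the validation-module item above more concrete, here is a minimal sketch of two checks and a JSON report, written against plain slices with the standard library only. The names `CheckResult`, `check_not_null`, `check_between`, and `to_json_report` are hypothetical, and the report layout is an assumption rather than the Great Expectations result format. A real module would more likely operate on Polars columns and serialize with serde.

```rust
/// Outcome of a single check (hypothetical structure).
struct CheckResult {
    name: String,
    passed: bool,
    failed_rows: usize,
}

/// Check that no value in the column is missing.
fn check_not_null(name: &str, column: &[Option<f64>]) -> CheckResult {
    let failed_rows = column.iter().filter(|v| v.is_none()).count();
    CheckResult { name: name.to_string(), passed: failed_rows == 0, failed_rows }
}

/// Check that every present value lies in the closed range [min, max].
/// Missing values are ignored here; `check_not_null` covers them.
fn check_between(name: &str, column: &[Option<f64>], min: f64, max: f64) -> CheckResult {
    let failed_rows = column
        .iter()
        .flatten()
        .filter(|v| **v < min || **v > max)
        .count();
    CheckResult { name: name.to_string(), passed: failed_rows == 0, failed_rows }
}

/// Escape only backslashes and double quotes; control characters are not
/// handled, which is acceptable for this sketch but not for production use.
fn escape_json(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Render all results as one compact JSON object.
fn to_json_report(results: &[CheckResult]) -> String {
    let items: Vec<String> = results
        .iter()
        .map(|r| {
            format!(
                "{{\"check\":\"{}\",\"passed\":{},\"failed_rows\":{}}}",
                escape_json(&r.name),
                r.passed,
                r.failed_rows
            )
        })
        .collect();
    let success = results.iter().all(|r| r.passed);
    format!("{{\"success\":{},\"results\":[{}]}}", success, items.join(","))
}
```

Running both checks over a column such as `[Some(10.0), None, Some(250.0)]` with bounds 0 to 100 reports one failed row per check and an overall `"success":false`. The resulting string can be written to a file or stdout for a dashboard or audit step to pick up.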

🧑‍💻 Part 5 – CLI pipelines and scripting

Objective: Replace Python/Make/Luigi scripts with safe Rust executables.

Chapter 5 โ€“ CLI Pipelines and Rust Task Execution

  • CLI in Rust: structure, using Clap, subcommands
  • Local pipeline of chained steps (e.g., ingest → clean → export)
  • Justfiles and cargo-make as reproducible task tools
  • Structured output: logs, files, exit codes
  • Execution by cron or called from Airflow/Prefect
  • Logging and metrics from CLI (with Prometheus or tracing)
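
The chained-steps and exit-code items above can be sketched with plain functions that return `Result`, so that the first failing step stops the pipeline. This is a minimal illustration using the standard library only; the step bodies, the function names, and the 0/1 exit-code convention are assumptions, and argument parsing with Clap is left out.

```rust
/// Step 1 (hypothetical): split raw text into rows, rejecting empty input.
fn ingest(raw: &str) -> Result<Vec<String>, String> {
    if raw.is_empty() {
        return Err("ingest: empty input".to_string());
    }
    Ok(raw.lines().map(str::to_string).collect())
}

/// Step 2 (hypothetical): trim, lowercase, and drop blank rows.
fn clean(rows: Vec<String>) -> Result<Vec<String>, String> {
    let cleaned: Vec<String> = rows
        .iter()
        .map(|r| r.trim().to_lowercase())
        .filter(|r| !r.is_empty())
        .collect();
    if cleaned.is_empty() {
        Err("clean: no rows left after cleaning".to_string())
    } else {
        Ok(cleaned)
    }
}

/// Step 3 (hypothetical): render the rows as newline-separated text.
fn export(rows: Vec<String>) -> Result<String, String> {
    Ok(rows.join("\n"))
}

/// Chain the steps; `and_then` skips the remaining steps after an error.
fn run_pipeline(raw: &str) -> Result<String, String> {
    ingest(raw).and_then(clean).and_then(export)
}

/// Map the pipeline outcome to a process exit code: 0 on success, 1 on failure.
fn exit_status(result: &Result<String, String>) -> i32 {
    match result {
        Ok(_) => 0,
        Err(_) => 1,
    }
}
```

A `main` function would call `run_pipeline`, print the output to stdout or the error to stderr, and pass `exit_status` to `std::process::exit`. A non-zero exit code is what lets cron, Airflow, or Prefect detect that the task failed.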

🛰️ Part 6 – Storage and Analytical Engines

Objective: Direct interaction with databases and query engines.

Chapter 6 โ€“ Storage and Analytical Engines

  • Connecting to PostgreSQL, Redshift, Snowflake from Rust
  • Using tokio-postgres, sqlx, Diesel
  • Bulk insert, batch insert, and data streaming
  • Reading/writing Parquet, Feather, and Arrow
  • Embedded DuckDB in Rust (CLI and library)
  • Databend: use as an alternative warehouse
  • Throughput and RAM comparison in ingestion
  • Pipeline: Rust connects to PG, transforms with Polars, exports Parquet

🧱 Part 7 – Arrow and Columnar Memory

Objective: Understand Arrow in Rust and its advantages in efficient pipelines.

Chapter 7 โ€“ Apache Arrow and Columnar Memory

  • Apache Arrow in Rust: RecordBatch and Array structure
  • Differences between Arrow vs pandas/NumPy in memory representation
  • Arrow2: improvements and use cases
  • Arrow Flight: data transfer between systems (RPC)
  • Example: Rust app exposing a Flight endpoint
  • Zero-copy between processes/languages using Arrow
  • Conversion Polars ↔ Arrow ↔ Python (interoperability)

🔁 Part 8 – Streaming and Event-Driven Architecture

Objective: Real-time processing and events.

Chapter 8 โ€“ Streaming and Event-Driven Architectures

  • Kafka client in Rust (rdkafka, performance)
  • Pulsar consumer in Rust
  • Fluvio: Rust broker + embedded stream processing
  • SmartModules with WebAssembly in Fluvio
  • Materialize / RisingWave – SQL views on streams
  • Vector: log ingestion as stream input
  • Example: Real-time ETL with Kafka + Rust + ClickHouse

📊 Part 9 – Observability and Metrics

Objective: Logs, traces, and metrics from Rust tasks and services.

Chapter 9 โ€“ Observability and Metrics

  • Prometheus crate: counter, histogram, HTTP exporter
  • Tracing in Rust: spans, events, levels
  • OpenTelemetry + Jaeger from Rust services
  • Structured logging in JSON – integration with Loki or ELK
  • Vector: universal collector written in Rust
  • Example: DAG executing Rust with parsable metrics and logs

🌐 Part 10 – APIs and Serving of Models/Pipelines

Objective: Serve results, pipelines, or models via HTTP.

Chapter 10 โ€“ Serving Pipelines and Models via API

  • Actix-web vs Axum: structure, routes, middlewares
  • Input/output serialization and validation with Serde
  • Example: Rust service that receives JSON, transforms, and responds
  • Serving models with ONNX Runtime (ort crate)
  • Serving batch pipelines as endpoints (execution trigger)
  • Comparison with FastAPI – latency, RAM, throughput
  • Logs, metrics, and traces from API

📦 Part 11 – Distribution and Deployment

Objective: Prepare production-ready binaries and containers.

Chapter 11 โ€“ Distribution and Deployment

  • Compiling to static binaries – cross-compiling
  • Using Docker for Rust tasks
  • Deployment on Kubernetes / ECS
  • Versioning and testing of Rust tasks
  • CI/CD for tasks, pipelines, and services
  • Integrated observability and health checks

📚 Part 12 – Appendices

Objective: Include useful references, reusable snippets, and templates.

Chapter 12 โ€“ Appendices, Snippets, and Reusable Assets

  • Polars transformation cheatsheet (in Rust)
  • Connection snippets for PG, S3, REST, Kafka
  • Project templates: CLI, service, task
  • Equivalence tables: pandas → Polars, dbt → SDF/Quary
  • Recommended crates by category
  • Books, videos, key resources by module
  • Legal considerations (licenses, redistribution, compliance)