💾

Chapter 6: Storage and Analytical Engines

Coauthor: Segfault Ramírez
“Data doesn't live in the cloud. It lives in files. It lives in tables. And if you're serious about moving it, shaping it, making it useful — Rust gives you the power to touch it directly, without the noise.”

6.1 – Introduction: Rust as a Storage Pipeline Substrate

The modern data engineer is a stack-weaver — orchestrating pipelines that run across databases, lakes, streams, and serverless glue code. And yet, most of those pipelines are built on languages and platforms that were never designed for performance, determinism, or observability at scale.

In this chapter, we’re not just going to connect to warehouses. We’re going to dissect how Rust interacts with structured storage (PostgreSQL, Redshift, Snowflake), file-based formats (Parquet, Feather), embedded query engines (DuckDB), and cloud-native analytical runtimes (Databend). We’ll explore when to embed, when to call out, when to write rows, and when to stream them. Rust isn’t here to add sugar to your pipeline. It’s here to remove friction.

6.2 – From Scripts to Systems: The Role of Storage

In most companies today, storage layers are under pressure:

  • Relational DBs (PostgreSQL): source-of-truth for operational records, used as staging points in ETL jobs.
  • Analytical Warehouses (Snowflake, Redshift): consume curated data for reporting, BI, or ML features.
  • File-based Lakes (Parquet): flexible and cheap, but require smarter orchestration and format handling.
  • Embedded Engines (DuckDB): used for QA, local prototyping, or staging ETLs.
  • Cloud-native OLAP (Databend): the Rust-native future — serverless, composable, and fast.

What connects them all is the demand for low-latency, schema-aware, and interoperable ingestion and querying mechanisms. And that’s where Rust enters.

6.3 – PostgreSQL: From Async Pools to Streaming Exports

PostgreSQL remains the backbone of many pipelines — as a trusted source, a scratchpad, and a staging area. Connecting with `sqlx`:

use sqlx::postgres::PgPoolOptions;

let pool = PgPoolOptions::new()
    .max_connections(5)
    .connect("postgres://user:pass@localhost/db")
    .await?;

Rust allows us to treat database connections as first-class, pooled, and typed resources. With the `query!` macros (and a `DATABASE_URL` at build time, or metadata cached by `cargo sqlx prepare`), sqlx checks your SQL against the actual schema at compile time, so a malformed query fails the build instead of failing in production.

Pattern: Extract → Filter → Reload

let rows = sqlx::query!("SELECT id, email FROM users WHERE active = true")
    .fetch_all(&pool).await?;

for row in rows {
    if row.email.ends_with("@example.com") {
        sqlx::query!("INSERT INTO exports (id, email) VALUES ($1, $2)", row.id, row.email)
            .execute(&pool).await?;
    }
}

Pooled, typed, and predictable. For large exports, batch the inserts or use Postgres `COPY` rather than paying one round trip per row.

Streaming for Scale

For massive tables, use `.fetch()` (optionally with `query_as` for typed rows) to get a backpressure-aware `futures::Stream` instead of materializing everything with `fetch_all`. Combine this with Polars for chunked, typed transformation.
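A minimal streaming sketch, assuming the same `users` table as above (column types are illustrative):

use futures::TryStreamExt;
use sqlx::Row;

// Rows are pulled lazily from Postgres instead of being buffered all at once.
let mut rows = sqlx::query("SELECT id, email FROM users WHERE active = true")
    .fetch(&pool);

let mut emails: Vec<String> = Vec::new();
while let Some(row) = rows.try_next().await? {
    let email: String = row.try_get("email")?;
    emails.push(email);
    // Flush `emails` into a Polars Series or a Parquet chunk every N rows.
}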

6.4 – Redshift and Snowflake: Going Beyond ODBC Hell

Snowflake and Redshift are deeply entrenched. Python connectors are mature — but Rust is catching up.

Option 1: `odbc-api` crate

use odbc_api::{Environment, ConnectionOptions};

let env = Environment::new()?;
let conn = env.connect_with_connection_string(
    "Driver=SnowflakeDSIIDriver;Uid=user;Pwd=pass;Server=host.snowflakecomputing.com;",
    ConnectionOptions::default()
)?;

This works for both Snowflake and Redshift, and plays well with the arrow-odbc bridge if you want to push/pull `RecordBatch`es instead of rows.

Option 2: Snowflake REST or Arrow Flight

Snowflake has emerging support for ADBC and Flight SQL, both built around Arrow's columnar format (Flight SQL carries it over gRPC). In Rust today, the pragmatic path is the REST route: call the SQL API with `reqwest`, stage Parquet to S3, and trigger `COPY INTO` commands. For bulk transfers, Arrow-native transport typically beats JSON-based REST by an order of magnitude; if you're moving GBs, Arrow wins.
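A rough sketch of the REST path with `reqwest` (assuming Snowflake's SQL API v2, reqwest's `json` feature, and an OAuth or keypair-JWT token; the account, stage, and table names are placeholders, and keypair auth needs an extra token-type header in practice):

use reqwest::Client;
use serde_json::json;

// Sketch only: submit a COPY INTO statement through Snowflake's SQL API.
async fn trigger_copy_into(account: &str, token: &str) -> Result<(), reqwest::Error> {
    let url = format!("https://{account}.snowflakecomputing.com/api/v2/statements");
    let body = json!({
        // Stage and table names are illustrative.
        "statement": "COPY INTO analytics.events FROM @my_s3_stage FILE_FORMAT = (TYPE = PARQUET)",
        "timeout": 60
    });

    Client::new()
        .post(&url)
        .bearer_auth(token) // OAuth or keypair JWT
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    Ok(())
}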

6.5 – Parquet, Feather, and Arrow IPC: Files That Think in Columns

"CSV is a toy. Parquet is real. Feather is fast."

Writing Parquet with Arrow

use std::fs::File;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

// `schema` is an Arc<Schema> and `batch` a RecordBatch built elsewhere.
let file = File::create("output.parquet")?;
let props = WriterProperties::builder().build();
let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;

writer.write(&batch)?;
writer.close()?;

Arrow lets you express columnar memory with type safety. Parquet gives you compressed, schema-aware persistence.

Feather vs Parquet

| Format  | Best for       | Compression | Streaming | Python Interop |
|---------|----------------|-------------|-----------|----------------|
| Parquet | Storage, lakes | ✅          | ✅        | ✅             |
| Feather | IPC, pipelines | 🚫          | ✅        | ✅             |

Use Feather when you need fast hand-offs between Rust and Python; use Parquet when writing to disk or a lake.
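Writing Feather from Rust is essentially writing the Arrow IPC file format. A minimal sketch with the `arrow` crate, assuming the same `schema` (an `Arc<Schema>`) and `batch` as in the Parquet example above:

use std::fs::File;
use arrow::ipc::writer::FileWriter;

// Feather v2 is the Arrow IPC file format on disk, so the writer mirrors the Parquet one.
let file = File::create("output.feather")?;
let mut writer = FileWriter::try_new(file, &schema)?;
writer.write(&batch)?;
writer.finish()?;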

6.6 – Polars: Rust's Killer DataFrame

use polars::prelude::*;
use std::fs::File;

// Enable Polars' CSV, lazy, and Parquet features in Cargo.toml.
let df = CsvReader::from_path("users.csv")?
    .infer_schema(Some(100))
    .has_header(true)
    .finish()?;

// Filter lazily, then materialize the result
let mut filtered = df.lazy()
    .filter(col("active").eq(lit(true)))
    .collect()?;

// Write the filtered frame to Parquet
ParquetWriter::new(File::create("active_users.parquet")?).finish(&mut filtered)?;

Polars builds on Arrow's columnar memory model (the arrow2 lineage). It's multi-threaded, SIMD-accelerated, and lazy-evaluation aware. On large datasets, published benchmarks routinely show order-of-magnitude speedups over pandas, with the exact factor depending on the workload.

6.7 – DuckDB: In-Process OLAP Power

DuckDB lets you run SQL directly on CSV and Parquet files, with no server to manage.

Querying a Parquet File

let conn = duckdb::Connection::open_in_memory()?;
// query_row fetches the single COUNT(*) value instead of discarding the result
let count: i64 = conn.query_row("SELECT COUNT(*) FROM 'data.parquet'", [], |row| row.get(0))?;

Use Cases

  • Validating pre-ingestion datasets
  • Joining temp tables in ETL (sketched below)
  • Embedded dashboards
  • Microanalytics in local CLI tools
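As a sketch of the temp-table join case (the `users.parquet` file and its `region` column are hypothetical):

let conn = duckdb::Connection::open_in_memory()?;

// Build a small in-memory lookup table, then join the Parquet file against it.
conn.execute_batch(
    "CREATE TABLE regions (code VARCHAR, name VARCHAR);
     INSERT INTO regions VALUES ('us', 'United States'), ('de', 'Germany');",
)?;

let mut stmt = conn.prepare(
    "SELECT r.name, COUNT(*) AS users
     FROM 'users.parquet' u
     JOIN regions r ON u.region = r.code
     GROUP BY r.name",
)?;
let mut rows = stmt.query([])?;
while let Some(row) = rows.next()? {
    let name: String = row.get(0)?;
    let users: i64 = row.get(1)?;
    println!("{name}: {users} users");
}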

It’s the SQLite of OLAP, and it runs right next to your code.

6.8 – Databend: A Warehouse Written in Rust

Databend is columnar, serverless, and vectorized, and it speaks the MySQL wire protocol, Arrow Flight SQL, and more. Connecting with `databend-driver`:

use databend_driver::Client;

let client = Client::new("databend://user:pass@host:8000/db".to_string());
let conn = client.get_conn().await?;
// query_iter streams result rows back rather than buffering them all
let mut rows = conn.query_iter("SELECT id FROM users").await?;

Use it to replace Redshift/Snowflake for internal apps, load analytical data, or expose your own data API endpoints. It’s the future of Rust-native warehousing.

6.9 – Transport and I/O Tradeoffs

| Format       | Parse Speed          | Size on Disk | Streaming Capable | Interop      |
|--------------|----------------------|--------------|-------------------|--------------|
| CSV          | ❌ Slow              | ❌ Large     | ✅                | ✅ Universal |
| Parquet      | ✅ Fast              | ✅ Compact   | ✅ Row groups     | ✅           |
| Feather      | ✅ Blazing           | 🚫 Large-ish | ✅                | ✅           |
| Arrow IPC    | ✅ Blazing           | 🚫 Medium    | ✅                | ✅           |
| Arrow Flight | ✅ Blazing (network) | ✅ Compact   | ✅                | 🧪 Emerging  |

Rust pipelines should default to Arrow or Parquet. Use CSV only at the boundaries. Avoid JSON unless forced.

6.10 – Full Pipeline Example: PostgreSQL → Polars → Parquet → Snowflake

  1. Query raw data from PostgreSQL with `sqlx`.
  2. Transform it with Polars’ lazy API.
  3. Write a partitioned Parquet set.
  4. Upload to S3.
  5. Trigger COPY INTO Snowflake via REST.

🚀 All orchestrated from a single Rust binary.
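A structural sketch of that binary, with every stage stubbed out (the function names are placeholders; the real bodies would reuse the snippets from earlier in this chapter):

use anyhow::Result;
use polars::prelude::DataFrame;
use std::path::PathBuf;

async fn extract_from_postgres() -> Result<DataFrame> { todo!("sqlx query + build a DataFrame") }
fn transform(_raw: DataFrame) -> Result<DataFrame> { todo!("Polars lazy filters and joins") }
fn write_partitioned_parquet(_df: &mut DataFrame) -> Result<Vec<PathBuf>> { todo!("one ParquetWriter per partition") }
async fn upload_to_s3(_files: &[PathBuf]) -> Result<()> { todo!("e.g. aws-sdk-s3 put_object per file") }
async fn trigger_snowflake_copy() -> Result<()> { todo!("reqwest call to Snowflake's SQL API") }

#[tokio::main]
async fn main() -> Result<()> {
    let raw = extract_from_postgres().await?;               // 1. query PostgreSQL
    let mut curated = transform(raw)?;                       // 2. transform with Polars
    let files = write_partitioned_parquet(&mut curated)?;    // 3. write partitioned Parquet
    upload_to_s3(&files).await?;                             // 4. upload to S3
    trigger_snowflake_copy().await?;                         // 5. COPY INTO via REST
    Ok(())
}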

6.11 – Embedded ETL: DuckDB in Airflow Pipelines

Imagine this DAG step:

cargo run --bin etl-filter -- --input "s3://bucket/data.parquet" --sql "SELECT * FROM data WHERE valid = true"

The DuckDB query runs in-process, outputs CSV, and the next Airflow step loads it into a reporting table. Rust takes care of speed. Airflow takes care of dependencies and retries.
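A hypothetical sketch of that `etl-filter` binary, using `clap` (with its derive feature) for the flags and DuckDB for the query; reading `s3://` paths additionally requires DuckDB's httpfs extension and credentials, and the output path is illustrative:

use clap::Parser;
use duckdb::Connection;

#[derive(Parser)]
struct Args {
    /// Input file: a local path or (with httpfs) an s3:// Parquet path
    #[arg(long)]
    input: String,
    /// SQL to run against a view named `data`
    #[arg(long)]
    sql: String,
}

fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    let conn = Connection::open_in_memory()?;

    // Expose the input file as a view named `data`, then dump the filtered
    // result as CSV for the next Airflow task to pick up.
    conn.execute_batch(&format!(
        "CREATE VIEW data AS SELECT * FROM '{}';
         COPY ({}) TO 'filtered.csv' (HEADER, DELIMITER ',');",
        args.input, args.sql
    ))?;

    Ok(())
}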

6.12 – Benchmark: CSV vs Parquet vs Arrow

| Operation          | CSV + pandas | Parquet + Polars | Arrow Flight |
|--------------------|--------------|------------------|--------------|
| Read 1M rows       | 5 s          | 0.5 s            | 0.3 s        |
| Filter 10% of rows | 2 s          | 0.1 s            | 0.05 s       |
| Serialize to disk  | 3 s          | 0.3 s            | 0.2 s        |
| Peak memory usage  | 1.2 GB       | 450 MB           | 380 MB       |

In this benchmark, Rust + Parquet outperforms Python + CSV by 10–20x on latency and cuts peak memory to roughly a third.

6.13 – When Rust Replaces Airflow, Glue, dbt

| Use Case                         | Rust Wins?      |
|----------------------------------|-----------------|
| Real-time parsing at the edge    | ✅              |
| Complex DAGs with 20+ tasks      | ❌ Airflow wins |
| Massive ETL with SQL-only steps  | ❌ dbt wins     |
| Micro-ingestion CLI tools        | ✅              |
| Low-latency joins of local files | ✅ DuckDB       |
| Orchestration and retry logic    | ❌ Airflow wins |

The sweet spot? Rust inside Airflow. Rust replacing Glue. Rust under dbt.

6.14 – Deployment Patterns

  • Rust binary + CronJob
  • Lambda (custom runtime)
  • Docker + K8s CronJob
  • CLI tools invoked from DAGs
  • Sidecar ETL jobs with Polars + DuckDB

Small. Fast. Deterministic.

6.15 – Observability and Reliability

Rust enforces:

  • Typed errors (`Result<T, E>`)
  • Explicit retries (`retry` crate)
  • Memory safety (no use-after-free or data races in long-running ETLs)

Use the `tracing` crate for structured logs and spans. Production Rust pipelines are boring. And boring is good.
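A minimal `tracing` sketch (the span and field values are illustrative; assumes the tracing and tracing-subscriber crates):

use tracing::{info, instrument};

#[instrument] // opens a span named after the function, recording `partition` as a field
fn load_partition(partition: &str) {
    info!(rows = 10_000, "partition loaded");
}

fn main() {
    tracing_subscriber::fmt::init(); // plain-text subscriber; swap in JSON output for log aggregators
    load_partition("2024-01-01");
}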

6.16 – Glossary

| Tool            | Purpose                              | Notes                             |
|-----------------|--------------------------------------|-----------------------------------|
| sqlx            | Async SQL driver + compile checks    | Supports Postgres, MySQL          |
| odbc-api        | Access Redshift/Snowflake via ODBC   | Use block cursors for performance |
| arrow / arrow2  | Columnar memory structures           | Used by Polars, DuckDB, etc.      |
| parquet         | Compressed columnar storage format   | Predicate pushdown, schema-aware  |
| polars          | DataFrame engine in Rust             | Fast, lazy, parallel              |
| duckdb          | Embedded OLAP SQL engine             | In-process query power            |
| databend-driver | Client for the Rust-native warehouse | Supports REST + Flight SQL        |

6.17 – Final Words

Rust isn’t for writing pipelines faster. It’s for writing pipelines that stay fast, at scale, forever. You don’t babysit memory. You don’t crash after 10 million rows. You don’t debug null dereferences at 3am. You ship binaries that move data like they mean it. So, go build them. Sign your name. Make your system proud.