💾

Chapter 6: Storage and Analytical Engines

Coauthor: Segfault Ramírez
“Data doesn't live in the cloud. It lives in files. It lives in tables. And if you're serious about moving it, shaping it, making it useful — Rust gives you the power to touch it directly, without the noise.”

6.1 – Introduction: Rust as a Storage Pipeline Substrate

The modern data engineer is a stack-weaver — orchestrating pipelines that run across databases, lakes, streams, and serverless glue code. And yet, most of those pipelines are built on languages and platforms that were never designed for performance, determinism, or observability at scale.

In this chapter, we’re not just going to connect to warehouses. We’re going to dissect how Rust interacts with structured storage (PostgreSQL, Redshift, Snowflake), file-based formats (Parquet, Feather), embedded query engines (DuckDB), and cloud-native analytical runtimes (Databend). We’ll explore when to embed, when to call out, when to write rows, and when to stream them. Rust isn’t here to add sugar to your pipeline. It’s here to remove friction.

6.2 – From Scripts to Systems: The Role of Storage

In most companies today, storage layers are under pressure:

  • Relational DBs (PostgreSQL): source-of-truth for operational records, used as staging points in ETL jobs.
  • Analytical Warehouses (Snowflake, Redshift): consume curated data for reporting, BI, or ML features.
  • File-based Lakes (Parquet): flexible and cheap, but require smarter orchestration and format handling.
  • Embedded Engines (DuckDB): used for QA, local prototyping, or staging ETLs.
  • Cloud-native OLAP (Databend): the Rust-native future — serverless, composable, and fast.

What connects them all is the demand for low-latency, schema-aware, and interoperable ingestion and querying mechanisms. And that’s where Rust enters.

6.3 – PostgreSQL: From Async Pools to Streaming Exports

PostgreSQL remains the backbone of many pipelines — as a trusted source, a scratchpad, and a staging area. Connecting with `sqlx`:

use sqlx::postgres::PgPoolOptions;

let pool = PgPoolOptions::new()
    .max_connections(5)
    .connect("postgres://user:pass@localhost/db")
    .await?;

Rust allows us to treat database connections as first-class, pooled, and typed resources. With the `query!` macros (and a `DATABASE_URL` at build time, or metadata cached by `cargo sqlx prepare`), sqlx checks your SQL against the actual schema at compile time, so a malformed query fails the build instead of failing in production.

Pattern: Extract → Filter → Reload

let rows = sqlx::query!("SELECT id, email FROM users WHERE active = true")
    .fetch_all(&pool).await?;

for row in rows {
    if row.email.ends_with("@example.com") {
        sqlx::query!("INSERT INTO exports (id, email) VALUES ($1, $2)", row.id, row.email)
            .execute(&pool).await?;
    }
}

Pooled, typed, and predictable. For large exports, batch the inserts or use Postgres `COPY` rather than paying one round trip per row.

Streaming for Scale

For massive tables, use `.fetch()` (optionally with `query_as` for typed rows) to get a backpressure-aware `futures::Stream` instead of materializing everything with `fetch_all`. Combine this with Polars for chunked, typed transformation.
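A minimal streaming sketch, assuming the same `users` table as above (column types are illustrative):

use futures::TryStreamExt;
use sqlx::Row;

// Rows are pulled lazily from Postgres instead of being buffered all at once.
let mut rows = sqlx::query("SELECT id, email FROM users WHERE active = true")
    .fetch(&pool);

let mut emails: Vec<String> = Vec::new();
while let Some(row) = rows.try_next().await? {
    let email: String = row.try_get("email")?;
    emails.push(email);
    // Flush `emails` into a Polars Series or a Parquet chunk every N rows.
}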

6.4 – Redshift and Snowflake: Going Beyond ODBC Hell

Snowflake and Redshift are deeply entrenched. Python connectors are mature — but Rust is catching up.

Option 1: `odbc-api` crate

use odbc_api::{Environment, ConnectionOptions};

let env = Environment::new()?;
let conn = env.connect_with_connection_string(
    "Driver=SnowflakeDSIIDriver;Uid=user;Pwd=pass;Server=host.snowflakecomputing.com;",
    ConnectionOptions::default()
)?;

This works for both Snowflake and Redshift, and plays well with the arrow-odbc bridge if you want to push/pull `RecordBatch`es instead of rows.

Option 2: Snowflake REST or Arrow Flight

Snowflake has emerging support for ADBC and Flight SQL, both built around Arrow's columnar format (Flight SQL carries it over gRPC). In Rust today, the pragmatic path is the REST route: call the SQL API with `reqwest`, stage Parquet to S3, and trigger `COPY INTO` commands. For bulk transfers, Arrow-native transport typically beats JSON-based REST by an order of magnitude; if you're moving GBs, Arrow wins.
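A rough sketch of the REST path with `reqwest` (assuming Snowflake's SQL API v2, reqwest's `json` feature, and an OAuth or keypair-JWT token; the account, stage, and table names are placeholders, and keypair auth needs an extra token-type header in practice):

use reqwest::Client;
use serde_json::json;

// Sketch only: submit a COPY INTO statement through Snowflake's SQL API.
async fn trigger_copy_into(account: &str, token: &str) -> Result<(), reqwest::Error> {
    let url = format!("https://{account}.snowflakecomputing.com/api/v2/statements");
    let body = json!({
        // Stage and table names are illustrative.
        "statement": "COPY INTO analytics.events FROM @my_s3_stage FILE_FORMAT = (TYPE = PARQUET)",
        "timeout": 60
    });

    Client::new()
        .post(&url)
        .bearer_auth(token) // OAuth or keypair JWT
        .json(&body)
        .send()
        .await?
        .error_for_status()?;

    Ok(())
}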

6.5 – Parquet, Feather, and Arrow IPC: Files That Think in Columns

"CSV is a toy. Parquet is real. Feather is fast."

Writing Parquet with Arrow

use std::fs::File;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

// `schema` is an Arc<Schema> and `batch` a RecordBatch built elsewhere.
let file = File::create("output.parquet")?;
let props = WriterProperties::builder().build();
let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;

writer.write(&batch)?;
writer.close()?;

Arrow lets you express columnar memory with type safety. Parquet gives you compressed, schema-aware persistence.

Feather vs Parquet

| Format  | Best for       | Compression | Streaming | Python Interop |
|---------|----------------|-------------|-----------|----------------|
| Parquet | Storage, lakes | ✅          | ✅        | ✅             |
| Feather | IPC, pipelines | 🚫          | ✅        | ✅             |

Use Feather when you need fast hand-offs between Rust and Python; use Parquet when writing to disk or a lake.
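Writing Feather from Rust is essentially writing the Arrow IPC file format. A minimal sketch with the `arrow` crate, assuming the same `schema` (an `Arc<Schema>`) and `batch` as in the Parquet example above:

use std::fs::File;
use arrow::ipc::writer::FileWriter;

// Feather v2 is the Arrow IPC file format on disk, so the writer mirrors the Parquet one.
let file = File::create("output.feather")?;
let mut writer = FileWriter::try_new(file, &schema)?;
writer.write(&batch)?;
writer.finish()?;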

6.6 – Polars: Rust's Killer DataFrame

use polars::prelude::*;
use std::fs::File;

// Enable Polars' CSV, lazy, and Parquet features in Cargo.toml.
let df = CsvReader::from_path("users.csv")?
    .infer_schema(Some(100))
    .has_header(true)
    .finish()?;

// Filter lazily, then materialize the result
let mut filtered = df.lazy()
    .filter(col("active").eq(lit(true)))
    .collect()?;

// Write the filtered frame to Parquet
ParquetWriter::new(File::create("active_users.parquet")?).finish(&mut filtered)?;

Polars builds on Arrow's columnar memory model (the arrow2 lineage). It's multi-threaded, SIMD-accelerated, and lazy-evaluation aware. On large datasets, published benchmarks routinely show order-of-magnitude speedups over pandas, with the exact factor depending on the workload.

6.7 – DuckDB: In-Process OLAP Power

DuckDB lets you run SQL directly on CSV and Parquet files, with no server to manage.

Querying a Parquet File

let conn = duckdb::Connection::open_in_memory()?;
// query_row fetches the single COUNT(*) value instead of discarding the result
let count: i64 = conn.query_row("SELECT COUNT(*) FROM 'data.parquet'", [], |row| row.get(0))?;

Use Cases

  • Validating pre-ingestion datasets
  • Joining temp tables in ETL (sketched below)
  • Embedded dashboards
  • Microanalytics in local CLI tools
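As a sketch of the temp-table join case (the `users.parquet` file and its `region` column are hypothetical):

let conn = duckdb::Connection::open_in_memory()?;

// Build a small in-memory lookup table, then join the Parquet file against it.
conn.execute_batch(
    "CREATE TABLE regions (code VARCHAR, name VARCHAR);
     INSERT INTO regions VALUES ('us', 'United States'), ('de', 'Germany');",
)?;

let mut stmt = conn.prepare(
    "SELECT r.name, COUNT(*) AS users
     FROM 'users.parquet' u
     JOIN regions r ON u.region = r.code
     GROUP BY r.name",
)?;
let mut rows = stmt.query([])?;
while let Some(row) = rows.next()? {
    let name: String = row.get(0)?;
    let users: i64 = row.get(1)?;
    println!("{name}: {users} users");
}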

It’s the SQLite of OLAP, and it runs right next to your code.

6.8 – Databend: A Warehouse Written in Rust

Databend is columnar, serverless, and vectorized, and it speaks the MySQL wire protocol, Arrow Flight SQL, and more. Connecting with `databend-driver`:

use databend_driver::Client;

let client = Client::new("databend://user:pass@host:8000/db".to_string());
let conn = client.get_conn().await?;
// query_iter streams result rows back rather than buffering them all
let mut rows = conn.query_iter("SELECT id FROM users").await?;

Use it to replace Redshift/Snowflake for internal apps, load analytical data, or expose your own data API endpoints. It’s the future of Rust-native warehousing.

6.9 – Transport and I/O Tradeoffs

| Format       | Parse Speed          | Size on Disk | Streaming Capable | Interop      |
|--------------|----------------------|--------------|-------------------|--------------|
| CSV          | ❌ Slow              | ❌ Large     | ✅                | ✅ Universal |
| Parquet      | ✅ Fast              | ✅ Compact   | ✅ Row groups     | ✅           |
| Feather      | ✅ Blazing           | 🚫 Large-ish | ✅                | ✅           |
| Arrow IPC    | ✅ Blazing           | 🚫 Medium    | ✅                | ✅           |
| Arrow Flight | ✅ Blazing (network) | ✅ Compact   | ✅                | 🧪 Emerging  |

Rust pipelines should default to Arrow or Parquet. Use CSV only at the boundaries. Avoid JSON unless forced.

6.10 – Full Pipeline Example: PostgreSQL → Polars → Parquet → Snowflake

  1. Query raw data from PostgreSQL with `sqlx`.
  2. Transform it with Polars’ lazy API.
  3. Write a partitioned Parquet set.
  4. Upload to S3.
  5. Trigger COPY INTO Snowflake via REST.

🚀 All orchestrated from a single Rust binary.
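A structural sketch of that binary, with every stage stubbed out (the function names are placeholders; the real bodies would reuse the snippets from earlier in this chapter):

use anyhow::Result;
use polars::prelude::DataFrame;
use std::path::PathBuf;

async fn extract_from_postgres() -> Result<DataFrame> { todo!("sqlx query + build a DataFrame") }
fn transform(_raw: DataFrame) -> Result<DataFrame> { todo!("Polars lazy filters and joins") }
fn write_partitioned_parquet(_df: &mut DataFrame) -> Result<Vec<PathBuf>> { todo!("one ParquetWriter per partition") }
async fn upload_to_s3(_files: &[PathBuf]) -> Result<()> { todo!("e.g. aws-sdk-s3 put_object per file") }
async fn trigger_snowflake_copy() -> Result<()> { todo!("reqwest call to Snowflake's SQL API") }

#[tokio::main]
async fn main() -> Result<()> {
    let raw = extract_from_postgres().await?;               // 1. query PostgreSQL
    let mut curated = transform(raw)?;                       // 2. transform with Polars
    let files = write_partitioned_parquet(&mut curated)?;    // 3. write partitioned Parquet
    upload_to_s3(&files).await?;                             // 4. upload to S3
    trigger_snowflake_copy().await?;                         // 5. COPY INTO via REST
    Ok(())
}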

6.11 – Embedded ETL: DuckDB in Airflow Pipelines

Imagine this DAG step:

cargo run --bin etl-filter -- --input "s3://bucket/data.parquet" --sql "SELECT * FROM data WHERE valid = true"

The DuckDB query runs in-process, outputs CSV, and the next Airflow step loads it into a reporting table. Rust takes care of speed. Airflow takes care of dependencies and retries.
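A hypothetical sketch of that `etl-filter` binary, using `clap` (with its derive feature) for the flags and DuckDB for the query; reading `s3://` paths additionally requires DuckDB's httpfs extension and credentials, and the output path is illustrative:

use clap::Parser;
use duckdb::Connection;

#[derive(Parser)]
struct Args {
    /// Input file: a local path or (with httpfs) an s3:// Parquet path
    #[arg(long)]
    input: String,
    /// SQL to run against a view named `data`
    #[arg(long)]
    sql: String,
}

fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    let conn = Connection::open_in_memory()?;

    // Expose the input file as a view named `data`, then dump the filtered
    // result as CSV for the next Airflow task to pick up.
    conn.execute_batch(&format!(
        "CREATE VIEW data AS SELECT * FROM '{}';
         COPY ({}) TO 'filtered.csv' (HEADER, DELIMITER ',');",
        args.input, args.sql
    ))?;

    Ok(())
}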

6.12 – Benchmark: CSV vs Parquet vs Arrow

| Operation          | CSV + pandas | Parquet + Polars | Arrow Flight |
|--------------------|--------------|------------------|--------------|
| Read 1M rows       | 5 s          | 0.5 s            | 0.3 s        |
| Filter 10% of rows | 2 s          | 0.1 s            | 0.05 s       |
| Serialize to disk  | 3 s          | 0.3 s            | 0.2 s        |
| Peak memory usage  | 1.2 GB       | 450 MB           | 380 MB       |

In this benchmark, Rust + Parquet outperforms Python + CSV by 10–20x on latency and cuts peak memory to roughly a third.

6.13 – When Rust Replaces Airflow, Glue, dbt

| Use Case                         | Rust Wins?      |
|----------------------------------|-----------------|
| Real-time parsing at the edge    | ✅              |
| Complex DAGs with 20+ tasks      | ❌ Airflow wins |
| Massive ETL with SQL-only steps  | ❌ dbt wins     |
| Micro-ingestion CLI tools        | ✅              |
| Low-latency joins of local files | ✅ DuckDB       |
| Orchestration and retry logic    | ❌ Airflow wins |

The sweet spot? Rust inside Airflow. Rust replacing Glue. Rust under dbt.

6.14 – Deployment Patterns

  • Rust binary + CronJob
  • Lambda (custom runtime)
  • Docker + K8s CronJob
  • CLI tools invoked from DAGs
  • Sidecar ETL jobs with Polars + DuckDB

Small. Fast. Deterministic.

6.15 – Observability and Reliability

Rust enforces:

  • Typed errors (`Result<T, E>`)
  • Explicit retries (`retry` crate)
  • Memory safety (no use-after-free or data races in long-running ETLs)

Use the `tracing` crate for structured logs and spans. Production Rust pipelines are boring. And boring is good.
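A minimal `tracing` sketch (the span and field values are illustrative; assumes the tracing and tracing-subscriber crates):

use tracing::{info, instrument};

#[instrument] // opens a span named after the function, recording `partition` as a field
fn load_partition(partition: &str) {
    info!(rows = 10_000, "partition loaded");
}

fn main() {
    tracing_subscriber::fmt::init(); // plain-text subscriber; swap in JSON output for log aggregators
    load_partition("2024-01-01");
}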

6.16 – Glossary

| Tool            | Purpose                              | Notes                             |
|-----------------|--------------------------------------|-----------------------------------|
| sqlx            | Async SQL driver + compile checks    | Supports Postgres, MySQL          |
| odbc-api        | Access Redshift/Snowflake via ODBC   | Use block cursors for performance |
| arrow / arrow2  | Columnar memory structures           | Used by Polars, DuckDB, etc.      |
| parquet         | Compressed columnar storage format   | Predicate pushdown, schema-aware  |
| polars          | DataFrame engine in Rust             | Fast, lazy, parallel              |
| duckdb          | Embedded OLAP SQL engine             | In-process query power            |
| databend-driver | Client for the Rust-native warehouse | Supports REST + Flight SQL        |

6.17 – Final Words

Rust isn’t for writing pipelines faster. It’s for writing pipelines that stay fast, at scale, forever. You don’t babysit memory. You don’t crash after 10 million rows. You don’t debug null dereferences at 3am. You ship binaries that move data like they mean it. So, go build them. Sign your name. Make your system proud.