“Data doesn't live in the cloud. It lives in files. It lives in tables. And if you're serious about moving it, shaping it, making it useful — Rust gives you the power to touch it directly, without the noise.”
6.1 – Introduction: Rust as a Storage Pipeline Substrate
The modern data engineer is a stack-weaver — orchestrating pipelines that run across databases, lakes, streams, and serverless glue code. And yet, most of those pipelines are built on languages and platforms that were never designed for performance, determinism, or observability at scale.
In this chapter, we’re not just going to connect to warehouses. We’re going to dissect how Rust interacts with structured storage (PostgreSQL, Redshift, Snowflake), file-based formats (Parquet, Feather), embedded query engines (DuckDB), and cloud-native analytical runtimes (Databend). We’ll explore when to embed, when to call out, when to write rows, and when to stream them. Rust isn’t here to add sugar to your pipeline. It’s here to remove friction.
6.2 – From Scripts to Systems: The Role of Storage
In most companies today, storage layers are under pressure:
- Relational DBs (PostgreSQL): source-of-truth for operational records, used as staging points in ETL jobs.
- Analytical Warehouses (Snowflake, Redshift): consume curated data for reporting, BI, or ML features.
- File-based Lakes (Parquet): flexible and cheap, but require smarter orchestration and format handling.
- Embedded Engines (DuckDB): used for QA, local prototyping, or staging ETLs.
- Cloud-native OLAP (Databend): the Rust-native future — serverless, composable, and fast.
What connects them all is the demand for low-latency, schema-aware, and interoperable ingestion and querying mechanisms. And that’s where Rust enters.
6.3 – PostgreSQL: From Async Pools to Streaming Exports
PostgreSQL remains the backbone of many pipelines — as a trusted source, a scratchpad, and a staging area. Connecting with `sqlx`:
use sqlx::postgres::PgPoolOptions;

// Inside an async fn (e.g. under #[tokio::main]); `?` needs an error type compatible with sqlx::Error.
let pool = PgPoolOptions::new()
    .max_connections(5)
    .connect("postgres://user:pass@localhost/db")
    .await?;
Rust allows us to treat database connections as first-class, pooled, and typed resources. `sqlx` even checks your SQL against the live schema at compile time if you opt in, so a mistyped column or mismatched type fails the build instead of the deploy.
Pattern: Extract → Filter → Reload
let rows = sqlx::query!("SELECT id, email FROM users WHERE active = true")
    .fetch_all(&pool)
    .await?;

for row in rows {
    if row.email.ends_with("@example.com") {
        // insert each matching row into the exports table
        sqlx::query!("INSERT INTO exports (id, email) VALUES ($1, $2)", row.id, row.email)
            .execute(&pool)
            .await?;
    }
}
Pooled, typed, and predictable. For large exports, batch the inserts or use `COPY` rather than one round trip per row.
Streaming for Scale
For massive tables, use `.fetch()` or `.query_as()` with backpressure-aware streams (futures::Stream). Combine this with Polars for chunked, typed transformation.
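A minimal sketch of that pattern, assuming the same `users` table and `pool` as above:

use futures::TryStreamExt;

// Rows arrive one at a time with backpressure; memory stays bounded regardless of table size.
let mut stream = sqlx::query!("SELECT id, email FROM users WHERE active = true")
    .fetch(&pool);
while let Some(row) = stream.try_next().await? {
    // transform or forward `row` here
}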
6.4 – Redshift and Snowflake: Going Beyond ODBC Hell
Snowflake and Redshift are deeply entrenched. Python connectors are mature — but Rust is catching up.
Option 1: `odbc-api` crate
use odbc_api::{Environment, ConnectionOptions};

let env = Environment::new()?;
let conn = env.connect_with_connection_string(
    "Driver=SnowflakeDSIIDriver;Uid=user;Pwd=pass;Server=host.snowflakecomputing.com;",
    ConnectionOptions::default(),
)?;
This works for both Snowflake and Redshift, and plays well with the arrow-odbc bridge if you want to push/pull `RecordBatch`es instead of rows.
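A hedged sketch of that bridge, following the `arrow-odbc` reader builder (exact method names can shift between crate versions), reusing the `conn` from above:

use arrow_odbc::OdbcReaderBuilder;

// Each item yielded by the reader is an Arrow RecordBatch, not a row.
let cursor = conn
    .execute("SELECT id, email FROM users", ())?
    .expect("SELECT should produce a cursor");
let reader = OdbcReaderBuilder::new().build(cursor)?;
for batch in reader {
    let batch = batch?;
    // hand the RecordBatch to Parquet writers, Polars, or Flight
}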
Option 2: Snowflake REST or Arrow Flight
Snowflake has emerging support for ADBC and Flight SQL, both of which use the Arrow columnar protocol to transport data over gRPC. In Rust, you can use REST APIs with `reqwest`, stage Parquet to S3, and trigger `COPY INTO` commands. Reported benchmarks put the gap between JSON-over-REST and Arrow Flight at 10–30x for bulk transfers. If you're moving gigabytes, Arrow wins.
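A sketch of the REST half, assuming the Snowflake SQL API v2 statements endpoint and a pre-issued OAuth or key-pair token in `SNOWFLAKE_TOKEN`; the account URL, stage, and warehouse names are placeholders:

use serde_json::json;

let token = std::env::var("SNOWFLAKE_TOKEN")?;
// Submit COPY INTO as a statement; Snowflake pulls the staged Parquet from S3.
reqwest::Client::new()
    .post("https://<account>.snowflakecomputing.com/api/v2/statements")
    .bearer_auth(token)
    .json(&json!({
        "statement": "COPY INTO analytics.users FROM @my_s3_stage/users/ FILE_FORMAT = (TYPE = PARQUET)",
        "warehouse": "ETL_WH"
    }))
    .send()
    .await?
    .error_for_status()?;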
6.5 – Parquet, Feather, and Arrow IPC: Files That Think in Columns
"CSV is a toy. Parquet is real. Feather is fast."
Writing Parquet with Arrow
use std::fs::File;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

// `schema` is an Arc<Schema>; `batch` is a RecordBatch with matching columns.
let file = File::create("output.parquet")?;
let props = WriterProperties::builder().build();
let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
writer.write(&batch)?;
writer.close()?;
Arrow lets you express columnar memory with type safety. Parquet gives you compressed, schema-aware persistence.
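For reference, the `schema` and `batch` used above could be built like this with arrow-rs (the column names are illustrative):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

// A two-column schema: non-nullable id, nullable email.
let schema = Arc::new(Schema::new(vec![
    Field::new("id", DataType::Int64, false),
    Field::new("email", DataType::Utf8, true),
]));
let batch = RecordBatch::try_new(
    schema.clone(),
    vec![
        Arc::new(Int64Array::from(vec![1, 2, 3])) as ArrayRef,
        Arc::new(StringArray::from(vec!["a@example.com", "b@example.com", "c@example.com"])),
    ],
)?;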
Feather vs Parquet
Format | Best for | Compression | Streaming | Python Interop |
---|---|---|---|---|
Parquet | Storage, Lakes | ✅ | ✅ | ✅ |
Feather | IPC, pipelines | 🚫 | ✅ | ✅ |
Use Feather for fast handoffs between Rust and Python; use Parquet when writing to disk or a data lake.
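Writing Feather is just the Arrow IPC file writer; a sketch reusing the `schema` and `batch` from the Parquet example:

use std::fs::File;
use arrow::ipc::writer::FileWriter;

// Feather v2 is the Arrow IPC file format: no heavy compression, very fast to write and read.
let file = File::create("output.feather")?;
let mut writer = FileWriter::try_new(file, &schema)?;
writer.write(&batch)?;
writer.finish()?;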
6.6 – Polars: Rust's Killer DataFrame
use polars::prelude::*;
use std::fs::File;

// Eagerly read the CSV, inferring the schema from the first 100 rows.
let df = CsvReader::from_path("users.csv")?
    .infer_schema(Some(100))
    .has_header(true)
    .finish()?;

// Filter lazily, then materialize and export to Parquet.
let mut filtered = df
    .lazy()
    .filter(col("active").eq(lit(true)))
    .collect()?;
ParquetWriter::new(File::create("active_users.parquet")?).finish(&mut filtered)?;
Polars uses Arrow2 under the hood. It’s multi-threaded, SIMD-accelerated, and lazy-evaluation aware. Benchmarks show 10x–50x speedups over pandas for large datasets.
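A sketch of the lazy path over Parquet, assuming Polars is built with the `lazy` and `parquet` features; the filter and column selection are pushed down into the scan:

use polars::prelude::*;

let recent = LazyFrame::scan_parquet("active_users.parquet", ScanArgsParquet::default())?
    .filter(col("id").gt(lit(1000)))
    .select([col("id"), col("email")])
    .collect()?;
println!("{recent}");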
6.7 – DuckDB: In-Process OLAP Power
DuckDB lets you run SQL directly over CSV and Parquet files, with no server to stand up.
Querying a Parquet File
let conn = duckdb::Connection::open_in_memory()?;
// query_row reads the single result row; `execute` is for statements that return no rows.
let count: i64 = conn.query_row("SELECT COUNT(*) FROM 'data.parquet'", [], |row| row.get(0))?;
Use Cases
- Validating pre-ingestion datasets
- Joining temp tables in ETL
- Embedded dashboards
- Microanalytics in local CLI tools
It’s the SQLite of OLAP, and it runs right next to your code.
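For example, a cross-format join with the same `duckdb` crate (file names are illustrative):

let conn = duckdb::Connection::open_in_memory()?;
// Join a CSV file against a Parquet file in a single in-process query.
let mut stmt = conn.prepare(
    "SELECT u.id, o.total FROM 'users.csv' u JOIN 'orders.parquet' o ON u.id = o.user_id",
)?;
let mut rows = stmt.query([])?;
while let Some(row) = rows.next()? {
    let (id, total): (i64, f64) = (row.get(0)?, row.get(1)?);
    println!("{id}: {total}");
}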
6.8 – Databend: A Warehouse Written in Rust
Databend is columnar, serverless, vectorized, and speaks the MySQL wire protocol, Arrow Flight SQL, and more. Connecting with `databend-driver`:
use tokio_stream::StreamExt;
let client = databend_driver::Client::new("databend://user:pass@host:8000/db".to_string());
let conn = client.get_conn().await?;
let mut rows = conn.query_iter("SELECT id FROM users").await?; // streams result rows instead of buffering them
while let Some(row) = rows.next().await { /* handle each row */ }
Use it to replace Redshift/Snowflake for internal apps, load analytical data, or expose your own data API endpoints. It’s the future of Rust-native warehousing.
6.9 – Transport and I/O Tradeoffs
Format | Parse Speed | Size on Disk | Streaming Capable | Interop |
---|---|---|---|---|
CSV | ❌ Slow | ❌ Large | ❌ | ✅ |
Parquet | ✅ Fast | ✅ Compact | ✅ | ✅ |
Feather | ✅ Blazing | 🚫 Large-ish | ✅ | ✅ |
Arrow IPC | ✅ Blazing | 🚫 Medium | ✅ | ✅ |
Arrow Flight | ✅ Network | ✅ Compact | ✅ | 🧪 |
Rust pipelines should default to Arrow or Parquet. Use CSV only at the boundaries. Avoid JSON unless forced.
6.10 – Full Pipeline Example: PostgreSQL → Polars → Parquet → Snowflake
- Query raw data from PostgreSQL with `sqlx`.
- Transform it with Polars’ lazy API.
- Write a partitioned Parquet set.
- Upload to S3.
- Trigger COPY INTO Snowflake via REST.
🚀 All orchestrated from a single Rust binary.
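A skeleton of that binary; the step functions here are hypothetical placeholders for the techniques shown throughout this chapter, not a published API:

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // 1. Extract raw rows from PostgreSQL with sqlx (hypothetical helper).
    let rows = extract_active_users("postgres://user:pass@localhost/db").await?;
    // 2. Transform with Polars' lazy API (hypothetical helper).
    let df = transform_users(rows)?;
    // 3. Write a partitioned Parquet set locally (hypothetical helper).
    let files = write_partitioned_parquet(&df, "out/users/")?;
    // 4. Upload the partitions to S3 (hypothetical helper).
    upload_to_s3(&files, "s3://my-bucket/users/").await?;
    // 5. Trigger COPY INTO via the Snowflake REST sketch from section 6.4 (hypothetical helper).
    trigger_snowflake_copy("analytics.users", "@my_s3_stage/users/").await?;
    Ok(())
}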
6.11 – Embedded ETL: DuckDB in Airflow Pipelines
Imagine this DAG step:
cargo run --bin etl-filter -- --input "s3://bucket/data.parquet" --sql "SELECT * FROM data WHERE valid = true"
The DuckDB query runs in-process, outputs CSV, and the next Airflow step loads it into a reporting table. Rust takes care of speed. Airflow takes care of dependencies and retries.
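A sketch of such a binary with `clap` and `duckdb`; reading `s3://` paths assumes DuckDB's httpfs extension is installed, loaded, and given credentials:

use clap::Parser;

/// Hypothetical `etl-filter` binary behind the DAG step above.
#[derive(Parser)]
struct Args {
    /// Input file; for s3:// paths, install and load DuckDB's httpfs extension first.
    #[arg(long)]
    input: String,
    /// SQL to run against a view named `data`.
    #[arg(long)]
    sql: String,
}

fn main() -> anyhow::Result<()> {
    let args = Args::parse();
    let conn = duckdb::Connection::open_in_memory()?;
    // Expose the input under a stable name so the SQL can simply say `FROM data`.
    conn.execute(&format!("CREATE VIEW data AS SELECT * FROM '{}'", args.input), [])?;
    // Materialize the result as CSV for the next Airflow task to pick up.
    conn.execute(&format!("COPY ({}) TO 'filtered.csv' (FORMAT CSV, HEADER)", args.sql), [])?;
    Ok(())
}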
6.12 – Benchmark: CSV vs Parquet vs Arrow
Operation | CSV + pandas | Parquet + Polars | Arrow Flight |
---|---|---|---|
1M rows read | 5s | 0.5s | 0.3s |
Filter 10% rows | 2s | 0.1s | 0.05s |
Serialize to disk | 3s | 0.3s | 0.2s |
Memory usage peak | 1.2 GB | 450 MB | 380 MB |
In this benchmark, Rust + Parquet beats Python + CSV by roughly 10–20x on latency and uses about a third of the peak memory.
6.13 – When Rust Replaces Airflow, Glue, dbt
Use Case | Rust Wins? |
---|---|
Real-time parsing at edge | ✅ |
Complex DAGs with 20+ tasks | ❌ |
Massive ETL with SQL-only steps | ❌ dbt wins |
Micro-ingestion CLI tools | ✅ |
Low-latency joins of local files | ✅ DuckDB |
Orchestration and retry logic | ❌ Airflow |
The sweet spot? Rust inside Airflow. Rust replacing Glue. Rust under dbt.
6.14 – Deployment Patterns
- Rust binary + cron on a plain host
- Lambda (custom runtime)
- Docker + K8s CronJob
- CLI tools invoked from DAGs
- Sidecar ETL jobs with Polars + DuckDB
Small. Fast. Deterministic.
6.15 – Observability and Reliability
Rust gives you:
- Typed, exhaustive error handling (`Result<T, E>`)
- Explicit retry policies (e.g. the `retry` crate)
- Memory safety (no use-after-free or data races in long-running ETLs)
Add the `tracing` crate for structured logs and spans. Production Rust pipelines are boring. And boring is good.
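A minimal setup sketch, assuming `tracing-subscriber` with its `json` feature enabled:

use tracing::{info, info_span};

fn main() {
    // Structured JSON logs; spans give every pipeline stage a queryable context.
    tracing_subscriber::fmt().json().init();

    let span = info_span!("etl_run", job = "users_export");
    let _guard = span.enter();
    info!(rows = 10_000, "batch loaded");
}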
6.16 – Glossary
Tool | Purpose | Notes |
---|---|---|
sqlx | Async SQL driver + compile checks | Supports Postgres, MySQL |
odbc-api | Access Redshift/Snowflake | Use block cursors for performance |
arrow / arrow2 | Columnar memory structures | Used by Polars, DuckDB, etc. |
parquet | Compressed storage format | Predicate pushdown, schema-aware |
polars | DataFrame engine in Rust | Fast, lazy, parallel |
duckdb | Embedded OLAP SQL engine | In-process query power |
databend_driver | Client for Rust-native warehouse | Supports REST + FlightSQL |
6.17 – Final Words
Rust isn’t for writing pipelines faster. It’s for writing pipelines that stay fast, at scale, forever. You don’t babysit memory. You don’t crash after 10 million rows. You don’t debug null dereferences at 3am. You ship binaries that move data like they mean it. So, go build them. Sign your name. Make your system proud.