
Chapter 7 – Apache Arrow and the Memory Interface: How Rust Is Rewriting the Data Stack

Coauthor: Bruno Milgram
"What if the bottleneck isn’t in your compute — it’s in how you move bytes?"

1. Introduction — Stop Moving Data, Start Sharing Memory

If you’ve worked long enough in data, you’ve seen the performance tax. The cost of taking data from a database, converting it into JSON, sending it over HTTP, reading it into Python objects, transforming it in pandas, and writing it out again. That entire pipeline exists because the tools you use — SQL engines, ML libraries, APIs, storage — can’t agree on a single format to share data. So you parse. You serialize. You buffer. You waste cycles.

Apache Arrow is how we stop doing that. Arrow is not just a format. It’s a contract for memory. It defines how data is laid out in RAM — column by column — so that tools can pass it around without copying or parsing. It’s fast, portable, and designed for analytics.

And Rust? Rust is the only language that can give you full control over memory layout without losing your mind or breaking safety. Rust is where Arrow can actually live, not just be wrapped.

2. The Problem That Arrow Solves

In the old world — and most current pipelines — data is row-oriented. You work with objects or structs. You load JSON, CSV, records from a database. Each record has fields, and you process them one by one. But analytics doesn't need rows. It needs columns. You're not scanning millions of complete records — you're scanning a single column of sales, or filtering a region, or joining on user_id.

Columnar layout is faster. It’s aligned with:

  • CPU cache lines
  • SIMD instructions
  • GPU memory access
  • Compression efficiency

Arrow gives you that layout — in RAM. And it does so in a standardized, language-independent, shareable format. That’s the core idea: stop transforming data between tools. Shape it once in memory and reuse it everywhere.
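Here's a toy sketch in plain Rust (not Arrow itself; the struct and field names are made up) of why that matters: summing one field over an array-of-structs drags every other field through the cache, while a struct-of-arrays layout scans one dense buffer, which is exactly what Arrow does per column.

// Row-oriented: each record is an object; summing `amount` pulls user_id and
// region through the cache even though we never look at them.
struct SaleRow {
    user_id: i32,
    region: String,
    amount: f64,
}

fn total_row_oriented(rows: &[SaleRow]) -> f64 {
    rows.iter().map(|r| r.amount).sum()
}

// Column-oriented: each field is its own contiguous buffer; summing `amount`
// reads one dense array and nothing else.
struct SalesColumns {
    user_ids: Vec<i32>,
    regions: Vec<String>,
    amounts: Vec<f64>,
}

fn total_column_oriented(cols: &SalesColumns) -> f64 {
    cols.amounts.iter().sum()
}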

3. The Arrow Memory Model

Let’s look under the hood. An Arrow array is not a Vec<T> or Vec<Option<T>>. It’s a structured, typed set of buffers. For primitive types (like Int32Array):

  • Values buffer: [1, 2, 3]
  • Validity bitmap: [1, 1, 0] (1 = valid, 0 = null; rows 0 and 1 are valid, row 2 is null)

For variable-length types (like strings):

  • Offsets buffer: [0, 4, 4, 9]
  • Values buffer: b"JohnAlice"
  • Validity bitmap: [1, 0, 1]

The offsets indicate where each string starts and ends in the values buffer; the null slot at index 1 spans no bytes (offset 4 to 4). This layout allows:

  • Constant-time random access
  • Fast vectorized iteration
  • Compact null tracking
  • Zero-copy slicing

All of it is just memory. There are no objects. No structs. No dynamic allocation per element. Arrow is built to be mmap-friendly, GPU-friendly, cache-friendly, and network-transferable.
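Here's a minimal sketch of what those buffers look like from Rust, assuming `arrow2`'s offsets(), values(), and validity() accessors and the same John / null / Alice column as above:

use arrow2::array::Utf8Array;

fn main() {
    // The same column as above: "John", null, "Alice".
    let names = Utf8Array::<i32>::from([Some("John"), None, Some("Alice")]);

    // Offsets buffer: where each value starts and ends in the values buffer.
    println!("offsets:  {:?}", names.offsets()); // [0, 4, 4, 9]
    // Values buffer: one contiguous UTF-8 region, no per-string allocation.
    println!("values:   {:?}", std::str::from_utf8(names.values().as_slice()));
    // Validity bitmap: one bit per row; an unset bit marks a null.
    println!("validity: {:?}", names.validity());
}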

4. Why Rust Is the Best Host for Arrow

Other languages can use Arrow. Python has PyArrow. C++ has the reference implementation. Java has support too. But only Rust lets you write Arrow pipelines that are fast, safe, and low-level without footguns. In C++, you get speed but no safety. In Python, you get convenience but poor control. In Java, you get overhead.

Rust gives you:

  • Precise types
  • Ownership without GC
  • No undefined behavior
  • SIMD and alignment control

That’s why the most performant Arrow tools are written in Rust: arrow2, Polars, DataFusion, and Ballista. In Rust, Arrow isn’t wrapped — it’s native.

5. Creating Arrow Data in Rust

Here’s what it looks like to create a batch of columns in `arrow2`, where a record batch is a `Chunk` of equal-length arrays plus a separate `Schema`.

use arrow2::{
    array::{Array, Int32Array, Utf8Array},
    chunk::Chunk,
    datatypes::{DataType, Field, Schema},
    error::Result,
};

fn main() -> Result<()> {
    // The schema describes the columns: name, type, nullability.
    let schema = Schema::from(vec![
        Field::new("user_id", DataType::Int32, false),
        Field::new("name", DataType::Utf8, true),
    ]);

    // Each column is a contiguous Arrow array.
    let user_ids = Int32Array::from_slice([1, 2, 3]);
    let names = Utf8Array::<i32>::from([Some("John"), None, Some("Alice")]);

    // A Chunk is arrow2's record batch: equal-length columns, no copying.
    let batch = Chunk::try_new(vec![
        Box::new(user_ids) as Box<dyn Array>,
        Box::new(names),
    ])?;

    println!(
        "{} rows x {} columns: {:?}",
        batch.len(),
        batch.arrays().len(),
        schema.fields.iter().map(|f| f.name.as_str()).collect::<Vec<_>>()
    );
    Ok(())
}

This chunk holds 3 rows and 2 columns as Arrow arrays, which you can serialize, send over the network, or convert to a DataFrame with no object mapping.

6. Columnar Pipelines and Zero-Copy Thinking

Traditional data flows copy and interpret data at every step. A zero-copy flow with Arrow looks like this: Read Parquet into Arrow arrays → Filter columns using `arrow2::compute` → Write Arrow IPC stream or send via Flight. One format, no parsing, one set of buffers. Arrow becomes your data plane.
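Here's a small sketch of the middle step, assuming `arrow2` is built with its compute feature and using made-up column values: the filter kernel consumes whole columns and produces new Arrow buffers without ever materializing row objects.

use arrow2::{
    array::{Array, BooleanArray, Int32Array},
    compute::filter::filter,
    error::Result,
};

fn main() -> Result<()> {
    // A single column of sales amounts.
    let sales = Int32Array::from_slice([120, 45, 300, 10]);

    // A boolean mask; in a real pipeline this would come from a comparison
    // kernel applied to another column (e.g. region == "US").
    let mask = BooleanArray::from_slice([true, false, true, false]);

    // The kernel works on the whole column at once and returns fresh Arrow buffers.
    let filtered = filter(&sales, &mask)?;
    assert_eq!(filtered.len(), 2);
    Ok(())
}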

7. Arrow Flight and Arrow Flight SQL

Arrow Flight is a protocol for transferring Arrow data over gRPC, far more efficient than REST+JSON. It allows clients to request streams of `RecordBatch`es. Flight SQL extends this to allow SQL queries over Flight. You send a query string and get back a stream of Arrow batches, bypassing slow drivers like JDBC/ODBC.
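To get a feel for the client side, here is a rough sketch using the `arrow-flight` crate's tonic-generated client; the endpoint address, ticket contents, and error handling are placeholders, not a working service.

use arrow_flight::{flight_service_client::FlightServiceClient, Ticket};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a Flight server (placeholder address).
    let mut client = FlightServiceClient::connect("http://localhost:50051").await?;

    // A ticket identifies the stream to fetch; its contents are server-defined.
    let ticket = Ticket {
        ticket: b"SELECT user_id, name FROM users".to_vec().into(),
    };

    // do_get returns a gRPC stream of FlightData messages (schema + record batches).
    let mut stream = client.do_get(ticket).await?.into_inner();
    while let Some(flight_data) = stream.message().await? {
        println!("received {} bytes of Arrow data", flight_data.data_body.len());
    }
    Ok(())
}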

8. Dictionaries and Encoding Tricks

Arrow supports dictionary encoding for columns with many repeated values. Instead of storing ["US", "CA", "US", "US", "DE"], you store a dictionary of unique values ["US", "CA", "DE"] and an array of indices [0, 1, 0, 0, 2]. This saves space and speeds up joins and group-bys, as operations can work directly on the compact integer indices.
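The trick is easy to sketch in plain Rust; this shows the idea, not arrow2's `DictionaryArray` API: store each distinct value once and keep an integer index per row.

fn dictionary_encode(values: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut dictionary: Vec<String> = Vec::new();
    let mut indices: Vec<u32> = Vec::new();
    for v in values {
        // Reuse the existing dictionary entry if we have seen this value before.
        let idx = match dictionary.iter().position(|d| d.as_str() == *v) {
            Some(i) => i as u32,
            None => {
                dictionary.push((*v).to_string());
                (dictionary.len() - 1) as u32
            }
        };
        indices.push(idx);
    }
    (dictionary, indices)
}

fn main() {
    let (dict, idx) = dictionary_encode(&["US", "CA", "US", "US", "DE"]);
    assert_eq!(dict, ["US", "CA", "DE"]);
    assert_eq!(idx, [0, 1, 0, 0, 2]);
}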

9. Nested Data: Lists, Structs, and Beyond

Arrow isn’t just for flat data. It supports `ListArray`, `StructArray`, and `UnionArray` to represent complex, JSON-like structures without leaving the efficient columnar format. In Rust, `arrow2` provides safe builders and iterators to handle this nested data.

10. Arrow and Machine Learning Pipelines

Arrow is ideal for ML due to fast feature loading and no serialization between ETL and inference. You can build a Rust service that reads features from Parquet, preprocesses them with Arrow, and sends them to a model in Python or C++ without ever allocating intermediate objects.

11. DuckDB, Postgres, Polars, DataFusion

Arrow is the common format across modern data systems. You can build pipelines like: Parquet → Arrow → DataFusion SQL → Arrow Flight → Python. At every step, you keep the data in Arrow. No conversions, no schema mismatches.
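Here's a hedged sketch of the middle of that pipeline with the `datafusion` crate (the file path and query are placeholders): Parquet is read straight into Arrow batches, queried with SQL, and the results stay as Arrow record batches.

use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register a Parquet file as a table; DataFusion reads it into Arrow batches.
    ctx.register_parquet("sales", "data/sales.parquet", ParquetReadOptions::default())
        .await?;

    // Run SQL over the Arrow data; the result is a set of Arrow RecordBatches.
    let df = ctx
        .sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
        .await?;
    let batches = df.collect().await?;

    println!("got {} Arrow record batches", batches.len());
    Ok(())
}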

12. Benchmarking: Arrow vs. Everything Else

  Task                   | JSON   | CSV    | Arrow
  Parse 1M rows          | 5 s    | 3 s    | 0.3 s
  Transfer over network  | 300 MB | 150 MB | 80 MB
  CPU usage              | High   | Medium | Low
  GC pressure (Python)   | High   | High   | None

Arrow wins because it avoids strings, uses compact buffers, streams batches, and enables vectorized processing. In Rust, you get all that without a garbage collector.

13. Real Systems Using Arrow in Rust

This isn’t theory. Real systems are using Arrow in Rust:

  • InfluxDB uses Rust + Arrow for time-series analytics, storing data in Parquet and using Flight for client communication.
  • Estuary, a data integration platform, uses Arrow as its internal data format.
  • Hugging Face Datasets uses Arrow as a backend for storing massive datasets.
  • MotherDuck uses Arrow Flight SQL for high-throughput feature serving.

14. Performance Tuning in Rust + Arrow

To go fast, use SIMD-aware kernels, choose batch sizes that fit in CPU cache, reuse buffers, and profile with tools like `perf` or `flamegraph`. Avoid per-row iteration and unnecessary cloning. Use the optimized compute kernels in `arrow2`.

15. Anti-Patterns and Pitfalls

Common mistakes include treating Arrow like a row store, not setting null bitmaps correctly, and converting to intermediate structs. Avoid these by thinking in batches, using builders properly, and respecting Rust's ownership and borrowing rules.

16. Designing Architectures with Arrow at the Core

You can build an entire stack where Arrow is the backbone: ingest (decode into Arrow), transform (Arrow compute kernels), store (Parquet), serve (Arrow Flight). Arrow becomes your ABI for data.

17. Thinking Differently: Memory as API

This is the mindset shift. Not: “how do I process my data?” But: “how do I structure memory so that compute is trivial?” You move from rows to columns, parsing to mapping, and serializing to sharing. You start to see data as a physical shape that you design, reuse, and optimize.

18. Arrow Flight SQL: SQL Without Overhead

Flight SQL lets you send SQL queries over gRPC and get back Arrow, achieving upwards of 20 GB/s per core without the overhead of JDBC or ORMs. This isn't the future; it's live today in tools like DuckDB, Postgres (via pgArrow), and DataFusion.

19. Tooling, Observability, and DX

To build serious systems, use `cargo flamegraph` to see bottlenecks, `arrow-viewer` to inspect batches, and build your own CLI tools for common Arrow-related tasks. Rust's precise types make it easy to build reliable tooling.

20. Closing — The Memory Interface Is the Real API

In the end, Arrow isn’t a format. It’s a discipline. It’s a way of thinking about data where there’s one representation, it works across languages, it’s optimized for modern hardware, and it eliminates waste. Rust is the only language that makes this discipline safe, fast, and maintainable. If you’re building the data stack of the next 10 years, you’re not looking for a new framework. You’re looking for a new shape. That shape is columnar. That layout is Arrow. That implementation is Rust.

By Bruno Milgram