1. Introduction: Validation as a First-Class Citizen
In the world of data engineering, it’s tempting to think that if the pipeline runs, the data must be fine. It’s not.
Broken joins. Nulls in critical columns. Category values that don’t match any known domain. Duplicates in supposedly unique identifiers. These are not software bugs. They’re silent failures in data quality — and they’re more common than we admit.
What stops them? Validation.
Data validation is not a side task, or something left for dashboards or BI tools to catch. It should be part of the pipeline itself. A first-class citizen. The same way we write unit tests to guard logic, we must write validation rules to guard meaning.
This chapter explores how to do just that, using Rust. We will contrast traditional declarative frameworks like Great Expectations with the embedded, composable approach available in Rust, leveraging tools like Polars, Arrow, and DataFusion.
The goal is not just to learn how to check a column for nulls — it’s to understand how validation becomes infrastructure, how it changes the culture of data reliability, and how it can be treated with the same rigor as application logic.
2. The Landscape: Declarative Validation and Its Limits
Most data teams start with declarative validation frameworks. Great Expectations is the flagship example: you define rules (called “expectations”) as YAML or Python config, such as:
expect_column_values_to_be_between:
column: age
min_value: 0
max_value: 120
These expectation suites are then applied to batches of data, and validation results are recorded. GE can generate reports, alerts, and even interactive documentation called Data Docs. It integrates well with pandas, Spark, Airflow, and Jupyter, making it appealing for teams already in the Python ecosystem.
Benefits:
- Clear separation of expectations from code
- Easy to read and share with non-developers
- Built-in profiling tools to auto-generate rules
- Report generation and alerting integrations
But:
Declarative validation has limitations:
- Validation logic lives far from processing logic. The rules are separate, not embedded in the code that transforms the data.
- Limited expressiveness. Complex or context-dependent rules often require custom Python extensions.
- Execution detachment. You often define what to check, but not when or how to act on failures.
- Performance issues. Python-based validation struggles with large volumes unless you offload to Spark.
As data volumes grow and pipelines diversify, many teams seek an approach where validation is not external configuration — but internal code.
3. The Rust Philosophy: Validation as Executable Contract
Rust flips the model. Instead of writing a configuration file to describe what the data should look like, you write code that enforces those rules — directly in the pipeline. Validation becomes a function. It can be composed, reused, tested, and versioned like any other part of your system.
Example:
let invalid_age = df
    .clone()
    .lazy()
    .filter(col("age").lt(lit(0)).or(col("age").gt(lit(120))))
    .collect()?
    .height();
This snippet doesn’t describe an expectation. It checks it. And the pipeline can act on the result. The core idea: validation logic is embedded into the transformation logic.
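Because the rule is just an expression, it can live inside the same query plan that produces the clean data. A minimal sketch, assuming Polars with the lazy feature (the column name and the valid/invalid split are illustrative, not a fixed API):

use polars::prelude::*;

// Route rows that violate the rule away from the main flow, in the same
// lazy plan that yields the frame downstream steps will consume.
fn split_by_age_rule(df: DataFrame) -> PolarsResult<(DataFrame, DataFrame)> {
    let rule = col("age").lt(lit(0)).or(col("age").gt(lit(120)));
    let invalid = df.clone().lazy().filter(rule.clone()).collect()?;
    let valid = df.lazy().filter(rule.not()).collect()?;
    Ok((valid, invalid))
}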
Why this matters:
| Trait | Declarative (GE) | Embedded (Rust) |
|---|---|---|
| Expressiveness | Limited to expectation APIs | Full programming capabilities |
| Speed | Python-speed | Native compiled performance |
| Reusability | Partial (via suites) | High (via functions/crates) |
| Integration | External step | Inline with pipeline logic |
| Testing | Separate from unit tests | Included in testable modules |
When validation becomes executable code, it becomes observable, versioned, and automatically aligned with business logic.
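It also becomes testable in the ordinary sense: a rule can be a plain function sitting next to a unit test. A sketch, assuming Polars with the lazy feature (the function and column names are illustrative):

use polars::prelude::*;

/// Count rows where `column` falls outside the closed range [min, max].
fn count_out_of_range(df: &DataFrame, column: &str, min: i64, max: i64) -> PolarsResult<usize> {
    let out = df
        .clone()
        .lazy()
        .filter(col(column).lt(lit(min)).or(col(column).gt(lit(max))))
        .collect()?;
    Ok(out.height())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn flags_out_of_range_ages() {
        let df = df!("age" => &[30i64, -5, 999]).unwrap();
        assert_eq!(count_out_of_range(&df, "age", 0, 120).unwrap(), 2);
    }
}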
4. Tools for Validation in Rust
To write validation logic in Rust, we use the following building blocks:
Polars
A high-performance DataFrame library inspired by pandas, but written in Rust and built on Apache Arrow. Offers:
- Lazy and eager execution modes
- Fast filtering, aggregation, joins
- Expression-based transformations
- Strong typing and Arrow integration
Arrow
A language-independent columnar memory format. Used by Polars and DataFusion for zero-copy data sharing and efficient storage. Gives:
- Efficient memory usage
- Fast SIMD operations
- Cross-language data exchange (e.g., Python ↔ Rust)
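As a small taste of the format, here is a minimal sketch that builds a RecordBatch with the arrow crate (note that Polars ships its own Arrow implementation, so the exact crate and import paths depend on your stack):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// A single-column batch in Arrow's columnar layout; the same buffers can be
// queried by DataFusion or shared across languages without copying.
fn age_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("age", DataType::Int64, false)]));
    let ages = Int64Array::from(vec![30_i64, 27, -5, 999]);
    RecordBatch::try_new(schema, vec![Arc::new(ages) as ArrayRef])
}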
DataFusion
An embeddable SQL query engine. Enables:
- SQL-based validation inside Rust
- Execution of complex joins, filters, aggregates
- In-memory query plans over Arrow RecordBatches
Each of these layers allows you to implement validation at different levels of abstraction, depending on whether you prefer expressions, SQL, or low-level checks.
5. Example: Validating a User Dataset with Polars
Let’s say you’re ingesting user data:
id,name,email,age
1,Alice,alice@example.com,30
2,Bob,,27
3,Charlie,charlie.com,-5
4,Dave,dave@example.com,999
We want to validate:
- Email must not be null
- Email must match pattern
- Age must be between 0 and 120
- ID must be unique
Here’s how we do that with Polars:
use polars::prelude::*;
use regex::Regex;
fn validate_users(df: &DataFrame) -> Result<(), Box<dyn std::error::Error>> {
    let mut issues = vec![];

    // Null check
    if df.column("email")?.null_count() > 0 {
        issues.push("Email column contains nulls.");
    }

    // Format check
    // (`.utf8()` on older Polars versions; newer releases rename it to `.str()`)
    let email_re = Regex::new(r"^[^@]+@[^@]+\.[^@]+$")?;
    let email_col = df.column("email")?.utf8()?;
    let invalid_format = email_col
        .into_iter()
        .filter(|e| match e {
            Some(val) => !email_re.is_match(val),
            // Nulls are already reported by the null check above.
            None => false,
        })
        .count();
    if invalid_format > 0 {
        issues.push("Invalid email formats found.");
    }

    // Range check
    // (the eager DataFrame::filter expects a boolean mask, so we use the lazy API for expressions)
    let age_out_of_bounds = df
        .clone()
        .lazy()
        .filter(col("age").lt(lit(0)).or(col("age").gt(lit(120))))
        .collect()?
        .height();
    if age_out_of_bounds > 0 {
        issues.push("Age values outside range 0–120.");
    }

    // Uniqueness check
    let id_unique = df.column("id")?.unique()?.len();
    if id_unique != df.height() {
        issues.push("Duplicate IDs found.");
    }

    if issues.is_empty() {
        println!("All validations passed.");
        Ok(())
    } else {
        for msg in issues {
            println!("Validation error: {}", msg);
        }
        Err("Validation failed.".into())
    }
}
This is readable, fast, and fully embeddable in any data processing task. On large datasets it typically runs far faster than an equivalent Python script.
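A small driver ties it together: load the file, run the checks, and bubble the result up. A sketch assuming Polars' csv feature and the older CsvReader::from_path API (newer releases use a slightly different reader builder); `run` is an illustrative name:

use polars::prelude::*;

// Load the raw CSV into an eager DataFrame and apply the checks above.
fn run(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let df = CsvReader::from_path(path)?.finish()?;
    validate_users(&df)
}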
6. Diagram: Where Validation Fits
A typical Rust pipeline might look like this:
+----------------------+
|     Source File      |
+----------------------+
           |
           v
+-----------------------------+
| Load into Polars DataFrame  |
+-----------------------------+
           |
           v
   +--------------------+
   | validate_users(df) |
   +--------------------+
       |           |
     Pass         Fail
       |           |
       v           v
+--------------+   +----------------------+
|  Transform   |   | Report & Quarantine  |
+--------------+   +----------------------+
       |
       v
+------------------+
| Load to DWH/Sink |
+------------------+
Validation is not a side process. It is the gatekeeper to the rest of the pipeline.
7. Example: SQL-Based Validation with DataFusion
Suppose you want to validate:
- All order totals must be non-negative
- No order appears twice
With DataFusion:
use datafusion::error::Result;
use datafusion::prelude::*;

// Runs inside an async runtime such as tokio.
async fn validate_orders() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("orders", "data/orders.csv", CsvReadOptions::new()).await?;

    let total_check = ctx
        .sql("SELECT COUNT(*) FROM orders WHERE total < 0")
        .await?;
    let duplicate_check = ctx
        .sql("SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1")
        .await?;

    let negative_rows = total_check.collect().await?;
    let duplicates = duplicate_check.collect().await?;

    // Logic to handle results...
    Ok(())
}
This gives you the comfort of SQL while staying inside Rust’s runtime, with full parallel execution and memory efficiency.
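Handling the collected batches is plain Rust as well. One way to pull the scalar out of a SELECT COUNT(*) result is sketched below, using the arrow crate that DataFusion re-exports; `count_from_batches` is an illustrative helper, not part of DataFusion:

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::record_batch::RecordBatch;

// COUNT(*) comes back as a single Int64 column with one row in the first batch.
fn count_from_batches(batches: &[RecordBatch]) -> i64 {
    batches
        .first()
        .and_then(|batch| batch.column(0).as_any().downcast_ref::<Int64Array>())
        .filter(|arr| !arr.is_empty())
        .map(|arr| arr.value(0))
        .unwrap_or(0)
}

With that in place, the pipeline can fail the run when count_from_batches(&negative_rows) > 0, or when the duplicate query returns any rows at all.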
8. Exporting Validation Reports
To make validation results observable and auditable, export them in structured format.
Example (JSON):
{
"dataset": "users.csv",
"run_at": "2025-06-20T13:12:00Z",
"results": [
{ "rule": "email_not_null", "status": "fail", "count": 1 },
{ "rule": "email_format", "status": "fail", "count": 1 },
{ "rule": "age_range", "status": "fail", "count": 2 },
{ "rule": "id_unique", "status": "pass" }
]
}
This can be stored, versioned, pushed to monitoring tools, or compared over time for trend analysis.
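Producing the report in Rust is a small serialization step. A minimal sketch, assuming serde and serde_json as dependencies (the struct names are illustrative):

use serde::Serialize;

// Mirrors the report shape shown above.
#[derive(Serialize)]
struct RuleResult {
    rule: String,
    status: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    count: Option<usize>,
}

#[derive(Serialize)]
struct ValidationReport {
    dataset: String,
    run_at: String,
    results: Vec<RuleResult>,
}

fn report_to_json(report: &ValidationReport) -> serde_json::Result<String> {
    serde_json::to_string_pretty(report)
}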
9. Integrating into Pipelines
Validation modules in Rust can be deployed as:
- Command-line binaries (cargo build --release)
- Webhooks or microservices (using axum or actix)
- Library crates for other Rust services
- Streaming processors (via Kafka consumers)
They can also be wired into tools like Airflow:
$ ./validate-users data/users.csv
Exit code 0: valid. Exit code 1: invalid. This makes validation controllable and observable, just like any other CI step.
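A thin main function turns the validation result into those exit codes. A sketch, where `run` is the loader-plus-validation driver from the Section 5 sketch and the binary name and default path are illustrative:

use std::process::ExitCode;

fn main() -> ExitCode {
    // First CLI argument is the file to validate, e.g. ./validate-users data/users.csv
    let path = std::env::args().nth(1).unwrap_or_else(|| "data/users.csv".to_string());
    match run(&path) {
        Ok(()) => ExitCode::SUCCESS, // exit code 0: valid
        Err(e) => {
            eprintln!("Validation failed: {e}");
            ExitCode::FAILURE // exit code 1: invalid
        }
    }
}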
10. Advantages of Rust-Based Validation
- Performance: Rust processes millions of rows in seconds.
- Testability: Functions can be unit tested like application logic.
- Portability: Compile once, run anywhere — no interpreter required.
- Precision: Handle edge cases, conditional logic, and business rules without constraint.
- Ownership: Validation becomes a software concern, not a BI or config one.
11. Trade-offs
- Higher entry barrier: Engineers need to know Rust.
- More code: A YAML rule becomes 10 lines of logic.
- Fewer batteries included: You need to build your own reports, alerts, etc.
But these costs pay off in speed, reliability, and cultural clarity. Validation is no longer a bolt-on step. It’s built-in.
12. Cultural Shift: Validation as Engineering, Not Decoration
When validation is treated as configuration, it gets neglected. When it’s treated as code, it becomes part of the engineering lifecycle:
- It gets tested
- It gets reviewed
- It gets versioned
- It gets enforced
This changes the culture. Engineers don’t “hope” data is clean — they know it is, because the same code that transforms also asserts.
13. Glossary
Validation: The process of checking data for conformity to expected rules.
Expectation (GE): A declarative rule about data in Great Expectations.
DataFrame: A tabular data structure, like a table in memory.
Arrow: A memory format for columnar data, optimized for performance.
Polars: A Rust DataFrame library built on Arrow.
DataFusion: A Rust SQL engine for querying Arrow data.
Embedded Validation: Writing validation logic as part of the program’s code.
Declarative Validation: Describing data rules in configuration format (e.g. YAML).
Test Drift: When validation rules diverge from actual processing logic.
RecordBatch: A chunk of data in Arrow format, used by DataFusion and Polars.
14. Conclusion
Data validation is not decoration. It’s not an appendix. It’s not a side step. It’s a contract.
And Rust lets you write that contract as code — not as a wishlist, but as an executable, testable, enforced gate in your pipeline.
With tools like Polars, Arrow, and DataFusion, you can validate faster, safer, and more precisely than ever before. More than performance, it gives you ownership of your data logic.
And that’s the real gain: not just better data, but better systems.