1. Introduction: Validation as a First-Class Citizen
In the world of data engineering, it’s tempting to think that if the pipeline runs, the data must be fine. It’s not.
Broken joins. Nulls in critical columns. Category values that don’t match any known domain. Duplicates in supposedly unique identifiers. These are not software bugs. They’re silent failures in data quality — and they’re more common than we admit.
What stops them? Validation.
Data validation is not a side task, or something left for dashboards or BI tools to catch. It should be part of the pipeline itself. A first-class citizen. The same way we write unit tests to guard logic, we must write validation rules to guard meaning.
This chapter explores how to do just that, using Rust. We will contrast traditional declarative frameworks like Great Expectations with the embedded, composable approach available in Rust, leveraging tools like Polars, Arrow, and DataFusion.
The goal is not just to learn how to check a column for nulls — it’s to understand how validation becomes infrastructure, how it changes the culture of data reliability, and how it can be treated with the same rigor as application logic.
2. The Landscape: Declarative Validation and Its Limits
Most data teams start with declarative validation frameworks. Great Expectations is the flagship example: you define rules (called “expectations”) as YAML or Python config, such as:
expect_column_values_to_be_between:
column: age
min_value: 0
max_value: 120
These expectation suites are then applied to batches of data, and validation results are recorded. GE can generate reports, alerts, and even interactive documentation called Data Docs. It integrates well with pandas, Spark, Airflow, and Jupyter, making it appealing for teams already in the Python ecosystem.
Benefits:
- Clear separation of expectations from code
- Easy to read and share with non-developers
- Built-in profiling tools to auto-generate rules
- Report generation and alerting integrations
But:
Declarative validation has limitations:
- Validation logic lives far from processing logic. The rules are separate, not embedded in the code that transforms the data.
- Limited expressiveness. Complex or context-dependent rules often require custom Python extensions.
- Execution detachment. You often define what to check, but not when or how to act on failures.
- Performance issues. Python-based validation struggles with large volumes unless you offload to Spark.
As data volumes grow and pipelines diversify, many teams seek an approach where validation is not external configuration — but internal code.
3. The Rust Philosophy: Validation as Executable Contract
Rust flips the model. Instead of writing a configuration file to describe what the data should look like, you write code that enforces those rules — directly in the pipeline. Validation becomes a function. It can be composed, reused, tested, and versioned like any other part of your system.
Example:
let invalid_age = df
    .clone()
    .lazy()
    .filter(col("age").lt(lit(0)).or(col("age").gt(lit(120))))
    .collect()?
    .height();
This snippet doesn’t describe an expectation. It checks it. And the pipeline can act on the result. The core idea: validation logic is embedded into the transformation logic.
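Because the rule is just an expression, it can live inside the same query plan that produces the clean data. A minimal sketch, assuming Polars with the lazy feature (the column name and the valid/invalid split are illustrative, not a fixed API):

use polars::prelude::*;

// Route rows that violate the rule away from the main flow, in the same
// lazy plan that yields the frame downstream steps will consume.
fn split_by_age_rule(df: DataFrame) -> PolarsResult<(DataFrame, DataFrame)> {
    let rule = col("age").lt(lit(0)).or(col("age").gt(lit(120)));
    let invalid = df.clone().lazy().filter(rule.clone()).collect()?;
    let valid = df.lazy().filter(rule.not()).collect()?;
    Ok((valid, invalid))
}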
Why this matters:
| Trait | Declarative (GE) | Embedded (Rust) |
|---|---|---|
| Expressiveness | Limited to expectation APIs | Full programming capabilities |
| Speed | Python-speed | Native compiled performance |
| Reusability | Partial (via suites) | High (via functions/crates) |
| Integration | External step | Inline with pipeline logic |
| Testing | Separate from unit tests | Included in testable modules |
When validation becomes executable code, it becomes observable, versioned, and automatically aligned with business logic.
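It also becomes testable in the ordinary sense: a rule can be a plain function sitting next to a unit test. A sketch, assuming Polars with the lazy feature (the function and column names are illustrative):

use polars::prelude::*;

/// Count rows where `column` falls outside the closed range [min, max].
fn count_out_of_range(df: &DataFrame, column: &str, min: i64, max: i64) -> PolarsResult<usize> {
    let out = df
        .clone()
        .lazy()
        .filter(col(column).lt(lit(min)).or(col(column).gt(lit(max))))
        .collect()?;
    Ok(out.height())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn flags_out_of_range_ages() {
        let df = df!("age" => &[30i64, -5, 999]).unwrap();
        assert_eq!(count_out_of_range(&df, "age", 0, 120).unwrap(), 2);
    }
}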
4. Tools for Validation in Rust
To write validation logic in Rust, we use the following building blocks:
Polars
A high-performance DataFrame library inspired by pandas, but written in Rust and built on Apache Arrow. Offers:
- Lazy and eager execution modes
- Fast filtering, aggregation, joins
- Expression-based transformations
- Strong typing and Arrow integration
Arrow
A language-independent columnar memory format. Used by Polars and DataFusion for zero-copy data sharing and efficient storage. Gives:
- Efficient memory usage
- Fast SIMD operations
- Cross-language data exchange (e.g., Python ↔ Rust)
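As a small taste of the format, here is a minimal sketch that builds a RecordBatch with the arrow crate (note that Polars ships its own Arrow implementation, so the exact crate and import paths depend on your stack):

use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// A single-column batch in Arrow's columnar layout; the same buffers can be
// queried by DataFusion or shared across languages without copying.
fn age_batch() -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("age", DataType::Int64, false)]));
    let ages = Int64Array::from(vec![30_i64, 27, -5, 999]);
    RecordBatch::try_new(schema, vec![Arc::new(ages) as ArrayRef])
}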
DataFusion
An embeddable SQL query engine. Enables:
- SQL-based validation inside Rust
- Execution of complex joins, filters, aggregates
- In-memory query plans over Arrow RecordBatches
Each of these layers allows you to implement validation at different levels of abstraction, depending on whether you prefer expressions, SQL, or low-level checks.
5. Example: Validating a User Dataset with Polars
Let’s say you’re ingesting user data:
id,name,email,age
1,Alice,alice@example.com,30
2,Bob,,27
3,Charlie,charlie.com,-5
4,Dave,dave@example.com,999
We want to validate:
- Email must not be null
- Email must match pattern
- Age must be between 0 and 120
- ID must be unique
Here’s how we do that with Polars:
use polars::prelude::*;
use regex::Regex;
fn validate_users(df: &DataFrame) -> Result<(), Box<dyn std::error::Error>> {
    let mut issues = vec![];

    // Null check
    if df.column("email")?.null_count() > 0 {
        issues.push("Email column contains nulls.");
    }

    // Format check
    // (`.utf8()` on older Polars versions; newer releases rename it to `.str()`)
    let email_re = Regex::new(r"^[^@]+@[^@]+\.[^@]+$")?;
    let email_col = df.column("email")?.utf8()?;
    let invalid_format = email_col
        .into_iter()
        .filter(|e| match e {
            Some(val) => !email_re.is_match(val),
            // Nulls are already reported by the null check above.
            None => false,
        })
        .count();
    if invalid_format > 0 {
        issues.push("Invalid email formats found.");
    }

    // Range check
    // (the eager DataFrame::filter expects a boolean mask, so we use the lazy API for expressions)
    let age_out_of_bounds = df
        .clone()
        .lazy()
        .filter(col("age").lt(lit(0)).or(col("age").gt(lit(120))))
        .collect()?
        .height();
    if age_out_of_bounds > 0 {
        issues.push("Age values outside range 0–120.");
    }

    // Uniqueness check
    let id_unique = df.column("id")?.unique()?.len();
    if id_unique != df.height() {
        issues.push("Duplicate IDs found.");
    }

    if issues.is_empty() {
        println!("All validations passed.");
        Ok(())
    } else {
        for msg in issues {
            println!("Validation error: {}", msg);
        }
        Err("Validation failed.".into())
    }
}
This is readable, fast, and fully embeddable in any data processing task. On large datasets it typically runs far faster than an equivalent Python script.
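A small driver ties it together: load the file, run the checks, and bubble the result up. A sketch assuming Polars' csv feature and the older CsvReader::from_path API (newer releases use a slightly different reader builder); `run` is an illustrative name:

use polars::prelude::*;

// Load the raw CSV into an eager DataFrame and apply the checks above.
fn run(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let df = CsvReader::from_path(path)?.finish()?;
    validate_users(&df)
}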
6. Diagram: Where Validation Fits
A typical Rust pipeline might look like this:
+----------------------+
|     Source File      |
+----------------------+
           |
           v
+-----------------------------+
| Load into Polars DataFrame  |
+-----------------------------+
           |
           v
   +--------------------+
   | validate_users(df) |
   +--------------------+
       |           |
     Pass         Fail
       |           |
       v           v
+--------------+   +----------------------+
|  Transform   |   | Report & Quarantine  |
+--------------+   +----------------------+
       |
       v
+------------------+
| Load to DWH/Sink |
+------------------+
Validation is not a side process. It is the gatekeeper to the rest of the pipeline.
7. Example: SQL-Based Validation with DataFusion
Suppose you want to validate:
- All order totals must be non-negative
- No order appears twice
With DataFusion:
use datafusion::error::Result;
use datafusion::prelude::*;

// Runs inside an async runtime such as tokio.
async fn validate_orders() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_csv("orders", "data/orders.csv", CsvReadOptions::new()).await?;

    let total_check = ctx
        .sql("SELECT COUNT(*) FROM orders WHERE total < 0")
        .await?;
    let duplicate_check = ctx
        .sql("SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1")
        .await?;

    let negative_rows = total_check.collect().await?;
    let duplicates = duplicate_check.collect().await?;

    // Logic to handle results...
    Ok(())
}
This gives you the comfort of SQL while staying inside Rust’s runtime, with full parallel execution and memory efficiency.
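Handling the collected batches is plain Rust as well. One way to pull the scalar out of a SELECT COUNT(*) result is sketched below, using the arrow crate that DataFusion re-exports; `count_from_batches` is an illustrative helper, not part of DataFusion:

use datafusion::arrow::array::Int64Array;
use datafusion::arrow::record_batch::RecordBatch;

// COUNT(*) comes back as a single Int64 column with one row in the first batch.
fn count_from_batches(batches: &[RecordBatch]) -> i64 {
    batches
        .first()
        .and_then(|batch| batch.column(0).as_any().downcast_ref::<Int64Array>())
        .filter(|arr| !arr.is_empty())
        .map(|arr| arr.value(0))
        .unwrap_or(0)
}

With that in place, the pipeline can fail the run when count_from_batches(&negative_rows) > 0, or when the duplicate query returns any rows at all.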
8. Exporting Validation Reports
To make validation results observable and auditable, export them in structured format.
Example (JSON):
{
"dataset": "users.csv",
"run_at": "2025-06-20T13:12:00Z",
"results": [
{ "rule": "email_not_null", "status": "fail", "count": 1 },
{ "rule": "email_format", "status": "fail", "count": 1 },
{ "rule": "age_range", "status": "fail", "count": 2 },
{ "rule": "id_unique", "status": "pass" }
]
}
This can be stored, versioned, pushed to monitoring tools, or compared over time for trend analysis.
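Producing the report in Rust is a small serialization step. A minimal sketch, assuming serde and serde_json as dependencies (the struct names are illustrative):

use serde::Serialize;

// Mirrors the report shape shown above.
#[derive(Serialize)]
struct RuleResult {
    rule: String,
    status: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    count: Option<usize>,
}

#[derive(Serialize)]
struct ValidationReport {
    dataset: String,
    run_at: String,
    results: Vec<RuleResult>,
}

fn report_to_json(report: &ValidationReport) -> serde_json::Result<String> {
    serde_json::to_string_pretty(report)
}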
9. Integrating into Pipelines
Validation modules in Rust can be deployed as:
- Command-line binaries (cargo build --release)
- Webhooks or microservices (using axum or actix)
- Library crates for other Rust services
- Streaming processors (via Kafka consumers)
They can also be wired into tools like Airflow:
$ ./validate-users data/users.csv
Exit code 0: valid. Exit code 1: invalid. This makes validation controllable and observable, just like any other CI step.
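A thin main function turns the validation result into those exit codes. A sketch, where `run` is the loader-plus-validation driver from the Section 5 sketch and the binary name and default path are illustrative:

use std::process::ExitCode;

fn main() -> ExitCode {
    // First CLI argument is the file to validate, e.g. ./validate-users data/users.csv
    let path = std::env::args().nth(1).unwrap_or_else(|| "data/users.csv".to_string());
    match run(&path) {
        Ok(()) => ExitCode::SUCCESS, // exit code 0: valid
        Err(e) => {
            eprintln!("Validation failed: {e}");
            ExitCode::FAILURE // exit code 1: invalid
        }
    }
}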
10. Advantages of Rust-Based Validation
- Performance: Rust processes millions of rows in seconds.
- Testability: Functions can be unit tested like application logic.
- Portability: Compile once, run anywhere — no interpreter required.
- Precision: Handle edge cases, conditional logic, and business rules without constraint.
- Ownership: Validation becomes a software concern, not a BI or config one.
11. Trade-offs
- Higher entry barrier: Engineers need to know Rust.
- More code: A YAML rule becomes 10 lines of logic.
- Fewer batteries included: You need to build your own reports, alerts, etc.
But these costs pay off in speed, reliability, and cultural clarity. Validation is no longer a bolt-on step. It’s built-in.
12. Cultural Shift: Validation as Engineering, Not Decoration
When validation is treated as configuration, it gets neglected. When it’s treated as code, it becomes part of the engineering lifecycle:
- It gets tested
- It gets reviewed
- It gets versioned
- It gets enforced
This changes the culture. Engineers don’t “hope” data is clean — they know it is, because the same code that transforms also asserts.
13. Glossary
Validation: The process of checking data for conformity to expected rules.
Expectation (GE): A declarative rule about data in Great Expectations.
DataFrame: A tabular data structure, like a table in memory.
Arrow: A memory format for columnar data, optimized for performance.
Polars: A Rust DataFrame library built on Arrow.
DataFusion: A Rust SQL engine for querying Arrow data.
Embedded Validation: Writing validation logic as part of the program’s code.
Declarative Validation: Describing data rules in configuration format (e.g. YAML).
Test Drift: When validation rules diverge from actual processing logic.
RecordBatch: A chunk of data in Arrow format, used by DataFusion and Polars.
14. Conclusion
Data validation is not decoration. It’s not an appendix. It’s not a side step. It’s a contract.
And Rust lets you write that contract as code — not as a wishlist, but as an executable, testable, enforced gate in your pipeline.
With tools like Polars, Arrow, and DataFusion, you can validate faster, safer, and more precisely than ever before. More than performance, it gives you ownership of your data logic.
And that’s the real gain: not just better data, but better systems.