PART 1 — WHY AXUM INSTEAD OF ACTIX
Choosing structure over speed when your API stops being a toy.
You’re not writing code. You’re choosing your next failure. In production, every line of code is a liability. It’s not about what runs. It’s about what breaks — and how fast you understand it. That’s why the choice between Actix and Axum isn’t about syntax or speed. It’s about what happens when the system lies. Actix is fast. That’s a fact. Axum is honest. That’s a decision. And when you’re serving predictions, pipelines, or transformations that impact business logic, honesty scales better than performance.
The illusion of “fast enough”
Benchmarks tell you that Actix is faster. And it is — on empty routes. But the real world isn’t a benchmark. In the real world your handler does shape validation, you enrich requests with state, you wrap responses in logs and metrics, and you ship as a team, not as a lone wolf. The time you save with Actix in raw performance, you lose tenfold in onboarding, debugging, and post-incident analysis.
Axum makes you spell it out — and that’s a feature
In Axum, every handler receives typed input, middleware is layered explicitly, error responses are enforced by the compiler, and spans are built-in. You can test a handler in isolation without bootstrapping the whole app. Logs are contextual — not just “something failed.” That’s not overengineering. That’s making your system explain itself. And when something goes wrong in production — that’s the only thing that matters.
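To make that concrete, here is a minimal sketch of a typed handler. The names `AppState`, `EchoInput`, and `EchoOutput` are illustrative only, and `StatusCode` stands in for the custom error type built in Part 5:

use std::sync::Arc;
use axum::{extract::State, http::StatusCode, Json};
use serde::{Deserialize, Serialize};

// Hypothetical state and payload types, for illustration only.
struct AppState { scale: f32 }

#[derive(Deserialize)]
struct EchoInput { value: f32 }

#[derive(Serialize)]
struct EchoOutput { value: f32 }

// The signature declares everything the handler depends on: shared
// state is injected, the body is deserialized into a typed struct,
// and the error arm must itself convert into an HTTP response.
async fn echo_handler(
    State(state): State<Arc<AppState>>,
    Json(input): Json<EchoInput>,
) -> Result<Json<EchoOutput>, StatusCode> {
    if !input.value.is_finite() {
        return Err(StatusCode::BAD_REQUEST);
    }
    Ok(Json(EchoOutput { value: input.value * state.scale }))
}

Nothing here is implicit: if the body doesn’t deserialize, the client gets a 4xx before your code runs; if you forget the error arm, the compiler tells you.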
Honest table
| Capability | Actix | Axum |
| --- | --- | --- |
| Raw performance | Very fast | Fast enough |
| Middleware | Macro-based, ordering unclear | Explicit `.layer()` chain |
| State injection | Requires internal discipline | Built into handler signature |
| Error responses | Optional | Enforced via types |
| Logging & tracing | External, partial | Native `tracing` integration |
| Testing ergonomics | Hard to isolate | Modular and injectable |
| Team ramp-up | Slower without a mentor | Clear for Rust teams |
| Post-mortem clarity | Difficult | Transparent per route |
Final reflection
If your API ever returns 200 with a bad prediction, and the only answer the logs give you is “request succeeded” — then you didn’t pick the wrong framework. You picked a system that refuses to speak when it matters most. Axum isn’t perfect. But it gives you a fighting chance to understand your own code. And in production, that’s what you actually scale.
PART 2 — THE LIFECYCLE OF A REQUEST
Understanding how control flows is the first step to building systems that don't lie.
An HTTP request in Axum is not just a function call. It’s a contract execution. And contracts, to be trustworthy, need structure. Here’s the actual path every request travels:
Client HTTP Request
        ↓
Router (path + method match)
        ↓
Middleware (tracing, limits, timeouts)
        ↓
Handler (your logic)
        ↓
IntoResponse (serialized output)
Why is this important? Because it makes clear where failure can happen — and where it should be caught.
Router: the bouncer
If the method or path doesn’t match, it dies here. This is where 404 and 405 responses happen. Nothing inside your app is even touched. You want this to fail fast.
Middleware: the firewall
You attach tracing layers, rate limiting, timeouts, and authentication here. This is where you catch cross-cutting concerns. In Axum, middleware is explicit: `Router::new().route(...).layer(...)`.
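A sketch of such a chain, assuming the `tower-http` crate (with its `trace` and `timeout` features) provides the layers:

use std::time::Duration;
use axum::{routing::post, Router};
use tower_http::{timeout::TimeoutLayer, trace::TraceLayer};

fn app() -> Router {
    // Each .layer() wraps every route added above it, so requests
    // pass through tracing and the timeout before any handler runs.
    Router::new()
        .route("/predict", post(|| async { "ok" }))
        .layer(TraceLayer::new_for_http())
        .layer(TimeoutLayer::new(Duration::from_secs(10)))
}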
Handler: the executor
This is the only function you write that “does something.” It receives structured, already-validated input like `Json<PredictInput>`, plus shared state like `State<AppState>`, directly in its signature.
IntoResponse: the diplomat
Your handler returns a value that Axum automatically converts into an HTTP response. If you return a `Result` and have implemented `IntoResponse` for your error type, it will map cleanly to proper status codes.
PART 3 — BUILDING THE /PREDICT ENDPOINT
Serving models isn’t about inference. It’s about trust.
A real `/predict` endpoint needs shape validation, safe tensor construction, traceable failures, meaningful logs, and explainable output. Here’s the contract:
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct PredictInput {
    features: Vec<f32>,
}

#[derive(Serialize)]
struct PredictOutput {
    prediction: f32,
}
And the handler:
use std::sync::Arc;
use axum::{extract::State, Json};
use ndarray::Array;

async fn predict_handler(
    State(state): State<Arc<AppState>>,
    Json(input): Json<PredictInput>,
) -> Result<Json<PredictOutput>, AppError> {
    // Shape validation: reject before touching the model.
    if input.features.len() != 128 {
        return Err(AppError::BadRequest("Expected 128 features".into()));
    }
    // Safe tensor construction: a bad shape is a client error, not a panic.
    let input_array = Array::from_shape_vec((1, 128), input.features.clone())
        .map_err(|e| AppError::BadRequest(e.to_string()))?;
    // Inference failures get their own variant, so logs stay traceable.
    let output = state
        .model
        .run(vec![input_array])
        .map_err(|e| AppError::Inference(e.to_string()))?;
    // An empty output tensor is a server-side bug, surfaced as such.
    let prediction = *output[0].iter().next().ok_or_else(|| {
        AppError::Internal("Model returned empty output".into())
    })?;
    // Never return 200 with garbage: NaN is an error, not a prediction.
    if prediction.is_nan() {
        return Err(AppError::Internal("Prediction is NaN".into()));
    }
    Ok(Json(PredictOutput { prediction }))
}
This endpoint doesn’t trust anything. And that’s why you can trust it. The most dangerous bug in a serving system is not a panic. It’s a 200 with bad output.
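Wiring the handler into the app is then one line per route. A minimal sketch, assuming `AppState` holds the loaded model:

use std::sync::Arc;
use axum::{routing::post, Router};

// The same Arc<AppState> is shared with every handler via with_state.
fn app(state: Arc<AppState>) -> Router {
    Router::new()
        .route("/predict", post(predict_handler))
        .with_state(state)
}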
PART 4 — TRIGGERING LONG-RUNNING JOBS
If your system can't decouple work from response, it's not a system. It's a blockage.
When the business asks to "retrain the model with one click," they mean launch a background task, don't crash, don't wait, and report back. Here’s how you do that in Axum with Tokio:
use std::sync::{atomic::{AtomicU64, Ordering}, Arc};
use axum::{extract::State, http::StatusCode};
use tracing::{info_span, Instrument};

// Monotonic job IDs let every log line be correlated with one run.
static JOB_COUNTER: AtomicU64 = AtomicU64::new(1);

async fn trigger_handler(State(_): State<Arc<AppState>>) -> StatusCode {
    let job_id = JOB_COUNTER.fetch_add(1, Ordering::Relaxed);
    let span = info_span!("pipeline_job", job_id);
    // The spawned task owns the span, so it keeps logging even after
    // the client disconnects. do_some_work() is the pipeline itself.
    tokio::spawn(
        async move {
            tracing::info!("Job {job_id} started");
            do_some_work().await;
            tracing::info!("Job {job_id} completed");
        }
        .instrument(span),
    );
    StatusCode::ACCEPTED
}
This design launches tasks that survive client disconnects, logs everything with a job_id, and returns 202 immediately. If your job fails and nobody knows, it’s not failure. It’s negligence.
PART 5 — ERROR HANDLING AS DESIGN
Your system doesn’t need to be crash-proof. It needs to be accountable.
In Rust, error handling is a design principle. You define what can go wrong and how to respond. You define a custom error enum:
#[derive(Debug)]
pub enum AppError {
    BadRequest(String),
    Inference(String),
    Internal(String),
}
And then map it to HTTP responses:
use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

impl IntoResponse for AppError {
    fn into_response(self) -> Response {
        let (status, message) = match self {
            AppError::BadRequest(msg) => (StatusCode::BAD_REQUEST, msg),
            AppError::Inference(msg) => (StatusCode::UNPROCESSABLE_ENTITY, msg),
            // Internal details never leak to the client; they belong in logs.
            AppError::Internal(_) => {
                (StatusCode::INTERNAL_SERVER_ERROR, "Internal server error".into())
            }
        };
        (status, Json(json!({ "error": message }))).into_response()
    }
}
This matters because you can write tests asserting specific errors, your logs reflect intention, and the client knows why something failed. Your errors are part of your API surface.
PART 6 — OBSERVABILITY IS NOT OPTIONAL
Logs and metrics are not side-effects. They’re your only witnesses.
If your system goes down and all you have is a 500, you failed. In Axum, observability is first-class. Set up structured JSON logs with `tracing_subscriber` and expose a `/metrics` endpoint with a `PrometheusBuilder`. This is the difference between firefighting and debugging.
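A minimal sketch, assuming the `tracing-subscriber` crate (with its `json` feature) and the `metrics-exporter-prometheus` crate:

use axum::{routing::get, Router};
use metrics_exporter_prometheus::PrometheusBuilder;

fn observability_router() -> Router {
    // Structured JSON logs: every span field becomes a queryable key.
    tracing_subscriber::fmt().json().init();

    // Install a Prometheus recorder and keep a handle to render metrics.
    let handle = PrometheusBuilder::new()
        .install_recorder()
        .expect("failed to install Prometheus recorder");

    // Expose the scrape endpoint alongside the rest of the app.
    Router::new().route("/metrics", get(move || async move { handle.render() }))
}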
PART 7 — OPERATIONAL COMPARISON: AXUM VS FASTAPI
Why Python isn’t enough when uptime is your product.
| Capability | FastAPI | Axum |
| --- | --- | --- |
| Cold start latency | 1–2 seconds | <100 ms |
| Memory per instance | 100–300 MB | 10–30 MB |
| Error typing | Optional via Pydantic | Enforced at compile time |
| Observability | Requires plug-ins | Built-in via `tracing` |
| Multi-tenancy | Custom logic | `DashMap` or `ArcSwap` |
| Concurrency safety | GIL-limited | True multithreaded async |
| Testability | Easy mocks, few guarantees | Strong isolation, no magic |
FastAPI wins in speed of prototyping. Axum wins in everything that happens after the prototype. You don’t ship APIs. You ship confidence. And Axum helps you do that.
PART 8 — REAL EXTENSIONS: RELOADS, TENANTS, LIMITS
When your API becomes a system, you need to handle model reloads, multiple tenants, and concurrency limits. In Rust, if you plan it right, the transition can be seamless and cheap.
🔁 Hot Reloading Models Without Downtime
With `arc_swap`, you can atomically swap the model being served without restarting the server or dropping requests. The business win: model update lag drops from minutes to seconds.
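A sketch of the pattern, where `Model` and `load_model()` are placeholders for your own inference type and loader:

use std::sync::Arc;
use arc_swap::ArcSwap;

struct Model; // placeholder for your real inference type

fn load_model() -> Model { Model } // hypothetical loader

// Readers call current.load() on every request and always see a
// consistent model; the store below is atomic and never blocks them.
fn reload(current: &ArcSwap<Model>) {
    let fresh = Arc::new(load_model());
    current.store(fresh); // in-flight requests finish on the old Arc
}

fn main() {
    let current = ArcSwap::from_pointee(load_model());
    let _snapshot = current.load(); // per-request snapshot
    reload(&current);
}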
👥 Multi-Tenant Model Serving
With a concurrent hash map like `DashMap`, you can serve a different model for each tenant from a single service instance, efficiently and safely, without needing complex infrastructure like Kubernetes sidecars.
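A sketch, with `Model` again a placeholder, keyed by tenant ID:

use std::sync::Arc;
use dashmap::DashMap;

struct Model; // placeholder for your real inference type

fn main() {
    // One concurrent map shared across all request handlers.
    let models: DashMap<String, Arc<Model>> = DashMap::new();
    models.insert("tenant-a".to_string(), Arc::new(Model));

    // Lookups take a shard lock, not a global one, so tenants
    // don't contend with each other under load.
    if let Some(entry) = models.get("tenant-a") {
        let _model: Arc<Model> = entry.value().clone(); // cheap Arc clone per request
    }
}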
🌊 Concurrency Limits — Or: How Not to Die on Launch Day
Using a `Semaphore`, you can cap the number of concurrent requests. With a limit of, say, 32, the 33rd request gets a 503 Service Unavailable instead of freezing or crashing the server. The server doesn’t die; it breathes.
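A sketch with a hypothetical limit of 32 in-flight requests:

use std::sync::Arc;
use axum::http::StatusCode;
use tokio::sync::Semaphore;

// One permit per in-flight request; the 33rd caller is told to back off.
async fn limited_handler(semaphore: Arc<Semaphore>) -> StatusCode {
    match semaphore.try_acquire() {
        Ok(_permit) => {
            // ... run inference; the permit is released when _permit drops ...
            StatusCode::OK
        }
        // Shed load instead of queueing forever or crashing.
        Err(_) => StatusCode::SERVICE_UNAVAILABLE,
    }
}

fn main() {
    let _semaphore = Arc::new(Semaphore::new(32));
}

Load shedding is a design choice: a fast 503 is something the client can retry; a hung connection is not.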
PART 9 — TESTING AND DEPLOYMENT — OR: HOW TO AVOID REGRET AT 2AM
🧪 Testing in Rust: Painful, but Honest
In Rust, you write tests that exercise the real stack, and the compiler forces you to do it right. I’ve seen Rust teams with half the test coverage of their Python counterparts — but twice the reliability in production.
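A minimal sketch of that honesty: the error contract from Part 5 gets its own tests, with no server in the loop:

#[cfg(test)]
mod tests {
    use super::*;
    use axum::{http::StatusCode, response::IntoResponse};

    // The error type is part of the API surface, so it is tested
    // directly: a bad request must map to 400.
    #[test]
    fn bad_request_maps_to_400() {
        let response = AppError::BadRequest("Expected 128 features".into()).into_response();
        assert_eq!(response.status(), StatusCode::BAD_REQUEST);
    }

    // Internal details must never change the status semantics.
    #[test]
    fn internal_maps_to_500() {
        let response = AppError::Internal("model exploded".into()).into_response();
        assert_eq!(response.status(), StatusCode::INTERNAL_SERVER_ERROR);
    }
}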
🚢 Deployment — One Binary. No Sorcery.
A minimal multi-stage Dockerfile produces a single, self-contained binary. No Python, no conda, no virtualenv. You get cold starts under 100ms and zero chance of “dependency hell”.
# Stage 1: Build
FROM rust:1.72 AS builder
WORKDIR /app
COPY . .
RUN cargo build --release

# Stage 2: Deploy
# bookworm-slim matches the Debian release underneath rust:1.72,
# so the binary finds a compatible glibc at runtime.
FROM debian:bookworm-slim
# Assumes the crate's binary is named `api`.
COPY --from=builder /app/target/release/api /api
CMD ["/api"]