
Chapter 9 — Observability and Metrics in Rust

Coauthor: Bran Koenig

Prologue — When Systems Lie

The call came at 2:17 AM. Not because an alert fired — but because someone checked manually. The dashboard showed zero errors. Prometheus was scraping clean. Logs were quiet. Traces were sparse. And yet, the analytics pipeline hadn’t updated in hours. Someone restarted the job. Nothing changed. Someone redeployed the service. Still nothing. The ingestion nodes were running. The logs were flowing. But the numbers were off. And then you realized: You weren’t debugging code. You were debugging observability.

9.1 Metrics in Rust with Prometheus

Why Metrics Exist

Metrics are not decoration. They’re not for dashboards. They’re for reconstruction — the ability to rebuild what happened when you weren’t watching. In Rust, metrics don’t come by default. You have to build the observability you want to see. And when you do, you must do it intentionally.

Instrumenting Rust with `prometheus`

First, define your metrics using crates like `prometheus` and `lazy_static`:

use prometheus::{register_counter, register_histogram};
use lazy_static::lazy_static;

lazy_static! {
    // Monotonically increasing count of processed records.
    static ref RECORDS_PROCESSED: prometheus::Counter =
        register_counter!("records_processed_total", "Total records processed").unwrap();

    // Per-record processing latency in seconds (default buckets).
    static ref RECORD_LATENCY: prometheus::Histogram =
        register_histogram!("record_latency_seconds", "Latency of record processing").unwrap();
}
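
Once registered, the metrics are updated from the hot path. A minimal sketch of how the counter and histogram above might be used (the `process` function is a hypothetical stand-in for real work):

// Hypothetical processing step, standing in for real work.
fn process(_record: &[u8]) {}

fn handle_record(record: &[u8]) {
    // Time the processing; the duration is recorded into RECORD_LATENCY.
    let timer = RECORD_LATENCY.start_timer();
    process(record);
    timer.observe_duration();

    // Count every record that made it through.
    RECORDS_PROCESSED.inc();
}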

Exposing `/metrics`

Then, expose an HTTP endpoint (e.g., using `warp`) for Prometheus to scrape:

use warp::Filter;
use prometheus::{Encoder, TextEncoder};

// Handler that encodes every registered metric in the Prometheus text format.
// Note: this is a plain (non-async) function so it can be passed to `map`.
fn metrics_handler() -> impl warp::Reply {
    let encoder = TextEncoder::new();
    let metric_families = prometheus::gather();
    let mut buffer = Vec::new();
    encoder.encode(&metric_families, &mut buffer).unwrap();
    String::from_utf8(buffer).unwrap()
}

#[tokio::main]
async fn main() {
    // GET /metrics, scraped by Prometheus.
    let route = warp::path("metrics").map(metrics_handler);
    warp::serve(route).run(([0, 0, 0, 0], 9090)).await;
}

Labeling and Cardinality

Every unique combination of labels becomes a new time series. This is powerful but dangerous. Labeling by `user_id` or `request_id` can lead to a "cardinality explosion" that kills your monitoring system. Avoid unbounded values in labels and keep cardinality low.
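
When you do need labels, keep the value set small and known in advance. A sketch of a labeled counter with a bounded label set (the metric and label names here are illustrative):

use lazy_static::lazy_static;
use prometheus::{register_counter_vec, CounterVec};

lazy_static! {
    // One time series per (stage, outcome) pair: a handful, never millions.
    static ref RECORDS_BY_OUTCOME: CounterVec = register_counter_vec!(
        "records_by_outcome_total",
        "Records processed, by pipeline stage and outcome",
        &["stage", "outcome"] // bounded values like "ingest" / "ok", never a user_id
    )
    .unwrap();
}

fn record_outcome(stage: &str, ok: bool) {
    let outcome = if ok { "ok" } else { "error" };
    RECORDS_BY_OUTCOME.with_label_values(&[stage, outcome]).inc();
}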

9.2 Contextual Tracing with `tracing`

`println!` is not observability. It is a lie. It lies by omission. It knows nothing of context.

Real observability needs spans and events. The `tracing` crate provides this structure. Using the `#[instrument]` macro creates a span with associated metadata.

use tracing::{info, instrument};

// #[instrument] opens a span named `handle_request` with `user_id` recorded
// as a structured field; the event below is emitted inside that span.
#[instrument]
fn handle_request(user_id: u64) {
    info!("Handling user request");
}

For structured logging, configure a `tracing` subscriber to emit JSON. Every log line then carries a timestamp, level, message, and span context, and, once the pipeline is bridged to OpenTelemetry, a trace_id, making logs queryable rather than merely greppable.
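
A minimal sketch of such a subscriber, assuming the `tracing-subscriber` crate with its `json` and `env-filter` features enabled:

use tracing_subscriber::{fmt, EnvFilter};

fn init_logging() {
    fmt()
        .json()                                         // one JSON object per event
        .with_current_span(true)                        // include the active span's fields
        .with_env_filter(EnvFilter::from_default_env()) // e.g. RUST_LOG=info
        .init();
}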

9.3 Distributed Tracing with OpenTelemetry

Without Context Propagation, Your Trace Is a Lie.

When a request crosses service boundaries, you must propagate the tracing context (e.g., `traceparent` header). In Rust, this is often a manual process of extracting headers from the incoming request and setting the parent context on the current span. Forgetting this results in broken, disconnected traces where each service appears as a new root span.
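
A sketch of what that extraction can look like on the receiving side, assuming the `opentelemetry`, `opentelemetry-http`, `tracing-opentelemetry`, and `http` crates (exact APIs vary between versions, and `handle_incoming` is a hypothetical handler):

use opentelemetry::global;
use opentelemetry_http::HeaderExtractor;
use tracing::info_span;
use tracing_opentelemetry::OpenTelemetrySpanExt;

fn handle_incoming(headers: &http::HeaderMap) {
    // Pull the remote parent out of the traceparent / tracestate headers.
    let parent_ctx =
        global::get_text_map_propagator(|prop| prop.extract(&HeaderExtractor(headers)));

    // Attach the remote parent so this service's spans join the existing trace
    // instead of starting a new root.
    let span = info_span!("handle_incoming");
    span.set_parent(parent_ctx);
    let _guard = span.enter();
    // ... handle the request inside the span ...
}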

9.4 Structured Logging and Ingestion

Text Logs Are for Terminals, Not Systems.

Structured logs (e.g., JSON) are essential for modern systems. Tools like Fluentbit, Promtail, and Vector expect structure. Without it, logs are just entropy.

{
    "timestamp": "2025-06-26T05:14:00Z",
    "level": "INFO",
    "message": "start_ingestion",
    "span": "ingest::job_42",
    "trace_id": "abcd-1234"
}

9.5 Vector: Your Rust-Native Router

Vector is a high-throughput, low-overhead observability router written in Rust. It can scrape metrics, collect logs, and receive traces, then route them to different backends like Prometheus, Loki, and Jaeger. This lets services emit raw data and centralizes the routing logic.

9.6 Full Pipeline: A DAG That Talks

In a fully observable pipeline, each node (API, Ingest, Transform, Load) emits its own metrics, logs, and traces. All three signals are correlated by a `trace_id`. This allows you to debug issues like a silent failure in a transform step by querying all related signals for a specific job run.

9.7 Anti-Patterns and Production Failures

  1. Unbounded Labels: Labeling a metric with a user ID can create millions of time series and crash Prometheus.
  2. Histogram Without Buckets: Default histogram buckets are often a poor fit, rendering SLO queries meaningless (see the bucket sketch after this list).
  3. `println!` in Traced System: Using `println!` breaks context, as the logs won't have a `trace_id`.
  4. Missing `traceparent` Header: Forgetting to propagate trace headers breaks distributed traces.
  5. Exporter Crash, No Logs: If an OTLP exporter fails silently, traces can be dropped without warning.
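
As a sketch of fixing anti-pattern 2, define explicit buckets chosen around the latency SLO (the bucket boundaries below are illustrative):

use prometheus::{register_histogram, Histogram, HistogramOpts};

fn register_latency_histogram() -> Histogram {
    // Buckets chosen around the SLO (e.g. "99% of records under 250 ms"),
    // rather than relying on the crate's defaults.
    let opts = HistogramOpts::new("record_latency_seconds", "Latency of record processing")
        .buckets(vec![0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]);
    register_histogram!(opts).unwrap()
}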

9.8 Comparative Matrix: Rust vs Go vs Python

| Feature                      | Rust                               | Go                       | Python                     |
|------------------------------|------------------------------------|--------------------------|----------------------------|
| Default logging              | `log` / `tracing` (manual)         | stdlib `log` (basic)     | `logging` (flexible)       |
| Structured logging           | `tracing` JSON output              | `zap`, `logrus`          | `structlog`, `json-logger` |
| Metrics                      | `prometheus`, `prometheus-client`  | `client_golang`          | `prometheus_client`        |
| Label convention enforcement | Manual                             | Strict (Go style guides) | Weak                       |
| Trace propagation (OTel)     | Manual via headers                 | Auto (gRPC interceptors) | Auto (gRPC/FastAPI)        |
| Ecosystem maturity           | Growing                            | Stable                   | Mature                     |

Appendix — A Second Failure at 3:11 AM

A pipeline showed success, but the data was stale. Tracing revealed the root cause: a transform node was using a legacy library with `log::info!` and had no span context. The logs were text-only, histograms were misconfigured, and traces were broken. The fix involved replacing `log` with `tracing`, adding `#[instrument]` to all major functions, rewriting logs in JSON, and correlating everything via `trace_id` through Vector.