Metrics

metrics-crate counters, histograms, and gauges for your workflows.

Behind the metrics feature gate (features = ["metrics"]).

Setup

Enable the metrics feature flag in your Cargo.toml. You can also use features = ["all"] to enable everything (scheduler + tracing + recovery + metrics) at once.

[dependencies]
cano = { version = "0.13", features = ["metrics"] }


# The metrics crate is a facade — you also need a recorder/exporter:
metrics-exporter-prometheus = "0.16"  # for production
# or: metrics-util = "0.18"           # for testing / debugging snapshots

# Or enable everything (scheduler + tracing + recovery + metrics):
# cano = { version = "0.13", features = ["all"] }

Because metrics is a facade, Cano depends only on the shared interface, and your application picks the concrete recorder: for example, metrics_exporter_prometheus::PrometheusBuilder::new().install() for a Prometheus scrape endpoint, or metrics_util::debugging::DebuggingRecorder in tests. Call cano::metrics::describe() once, after installing your recorder, so exporters receive help text and units.
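To make the facade idea concrete, here is a toy model of the pattern using only the standard library. It is a sketch of the general technique, not the metrics crate's actual code: the Recorder trait, set_global_recorder, emit_counter, and CountingRecorder below are simplified, hypothetical stand-ins.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, OnceLock};

/// The shared interface. The library side depends on this trait and nothing else.
trait Recorder: Send + Sync {
    fn increment_counter(&self, name: &'static str, value: u64);
}

/// Process-wide slot for whichever recorder the application picks.
static RECORDER: OnceLock<Box<dyn Recorder>> = OnceLock::new();

/// Application side: install a concrete recorder once. A second call is rejected.
fn set_global_recorder(recorder: Box<dyn Recorder>) -> Result<(), &'static str> {
    RECORDER
        .set(recorder)
        .map_err(|_| "a recorder is already installed")
}

/// Library side: emit through the facade. Without a recorder this is a no-op.
fn emit_counter(name: &'static str) {
    if let Some(recorder) = RECORDER.get() {
        recorder.increment_counter(name, 1);
    }
}

/// One possible concrete recorder: it simply totals every increment it receives.
struct CountingRecorder {
    total: Arc<AtomicU64>,
}

impl Recorder for CountingRecorder {
    fn increment_counter(&self, _name: &'static str, value: u64) {
        self.total.fetch_add(value, Ordering::Relaxed);
    }
}
```

In this model, anything emitted before a recorder is installed is silently dropped, which is why the recorder should be installed before the first workflow runs.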

Two Surfaces

The metrics feature exposes instrumentation through two complementary surfaces, mirroring how Tracing pairs engine spans with the TracingObserver bridge.

`MetricsObserver` — opt-in lifecycle counters

MetricsObserver is a WorkflowObserver that re-emits the observer hooks as metrics-crate counters. Wire it up in one line:

use cano::prelude::*;
use std::sync::Arc;

Workflow::bare()
    .register(/* ... */)
    .add_exit_state(/* ... */)
    .with_observer(Arc::new(MetricsObserver::new()))

It emits these counters (each incremented on the corresponding observer hook):

cano_state_enters_total{state} — on on_state_enter
cano_observed_task_runs_total{task, outcome} — on on_task_success / on_task_failure (outcome ∈ completed|failed)
cano_task_retries_total{task} — on on_retry
cano_circuit_open_events_total{task} — on on_circuit_open
cano_checkpoints_observed_total — on on_checkpoint
cano_resumes_total — on on_resume

on_task_start is intentionally not counted — every dispatch already shows up in cano_observed_task_runs_total{outcome}, so a separate "start" counter would just be the sum of the completed and failed rows.
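If you do want a per-task dispatch count, it can be recovered by summing the outcome rows. A minimal sketch, assuming the counter rows have already been flattened out of a recorder snapshot into plain tuples; the Row alias and the task_starts helper are hypothetical names, not part of Cano:

```rust
use std::collections::BTreeMap;

/// Hypothetical flattened counter row: (metric name, task label, outcome label, value).
type Row<'a> = (&'a str, &'a str, &'a str, u64);

/// Sums `cano_observed_task_runs_total` across outcomes, per task.
/// Rows belonging to other metrics are ignored.
fn task_starts(rows: &[Row<'_>]) -> BTreeMap<String, u64> {
    let mut starts = BTreeMap::new();
    for &(name, task, _outcome, value) in rows {
        if name == "cano_observed_task_runs_total" {
            *starts.entry(task.to_string()).or_insert(0) += value;
        }
    }
    starts
}
```

For a task with 7 completed and 2 failed runs, the helper reports 9 dispatches.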

MetricsObserver is in the prelude behind the metrics feature, so no extra import is needed once you write use cano::prelude::*.

Always-on direct instrumentation — engine internals

This surface is compiled in whenever the metrics feature is on, regardless of whether a MetricsObserver is attached. It covers engine internals the observer hooks do not reach: workflow run duration, circuit-breaker state transitions, per-attempt retry-loop outcomes, poll/batch/step iteration counts, scheduler flow telemetry, and checkpoint store operations. See What Gets Measured for the full list.

What Gets Measured

Histograms record raw f64 seconds samples — bucketing and quantile computation are the exporter's responsibility. Metric names follow metrics-crate underscore conventions.
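To illustrate what "quantile computation is the exporter's responsibility" means in practice, the sketch below computes a nearest-rank quantile from raw duration samples, such as the ones a debugging snapshot hands back. It is one simple method chosen for illustration; real exporters typically use buckets or summaries instead, and the quantile helper is a hypothetical name.

```rust
/// Nearest-rank quantile over raw duration samples (seconds).
/// Returns `None` for an empty slice or a `q` outside `0.0..=1.0` (including NaN).
fn quantile(samples: &[f64], q: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: the smallest sample with at least q * n samples at or below it.
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.saturating_sub(1)])
}
```

With the five samples 0.005, 0.003, 0.010, 0.004, and 0.008, quantile(&samples, 0.5) returns Some(0.005) and quantile(&samples, 1.0) returns Some(0.010).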

Workflow

cano_workflow_runs_total{outcome} — counter; outcome ∈ completed|failed|timeout
cano_workflow_duration_seconds{outcome} — histogram (seconds)
cano_workflow_active — gauge; workflows currently executing

Task Dispatch

cano_task_duration_seconds{state, kind} — histogram (seconds); kind ∈ single|router|split|compensatable|stepped
cano_task_attempts_total{outcome} — counter; per-attempt inside the retry loop; outcome ∈ completed|failed
cano_circuit_rejections_total — counter; attempts short-circuited by an open breaker

Split / Join

cano_split_branch_results_total{result} — counter; result ∈ success|failure|cancelled

Circuit Breaker

cano_circuit_transitions_total{transition} — counter; transition ∈ closed_to_open|open_to_halfopen|halfopen_to_closed|halfopen_to_open
cano_circuit_acquires_total{result} — counter; result ∈ acquired|rejected
cano_circuit_outcomes_total{outcome} — counter; outcome ∈ success|failure

Processing Loops

cano_poll_iterations_total{outcome} — counter; outcome ∈ ready|pending
cano_batch_runs_total{outcome} — counter; outcome ∈ completed|failed
cano_batch_items_total{result} — counter; result ∈ ok|err
cano_step_iterations_total{outcome} — counter; outcome ∈ more|done

Recovery & Saga

cano_checkpoint_appends_total{result} — counter; result ∈ ok|err
cano_checkpoint_clears_total{result} — counter; result ∈ ok|err
cano_compensations_run_total{result} — counter; result ∈ ok|err
cano_compensation_drains_total{outcome} — counter; outcome ∈ clean|partial

Scheduler

Also requires the scheduler feature.

cano_scheduler_flow_runs_total{flow, outcome} — counter; outcome ∈ completed|failed
cano_scheduler_flow_duration_seconds{flow} — histogram (seconds)
cano_scheduler_flow_backoff_total{flow} — counter
cano_scheduler_flow_tripped_total{flow} — counter
cano_scheduler_active_flows — gauge; flows currently executing

Registering Descriptions

cano::metrics::describe() registers a human-readable description and unit for every metric Cano emits. Call it once, after installing your recorder, so exporters receive help text in their output (e.g. Prometheus # HELP / # TYPE lines).

// Install your recorder first, then describe:
metrics_exporter_prometheus::PrometheusBuilder::new()
    .install()
    .expect("install prometheus exporter");

cano::metrics::describe();

If you skip describe(), metrics still flow — only the help text and units are missing from the exporter output.

Cardinality

Labels are deliberately minimal to keep cardinality bounded:

state — format!("{:?}") of your FSM state enum; bounded by registered states.
task — Task::name(), which defaults to std::any::type_name; bounded by registered task types.
flow — the scheduler flow id string; bounded by registered flows.
All other label values (outcome, kind, result, transition) are fixed, bounded enum labels.
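The two derived labels can be previewed with the standard library alone. This sketch mirrors the derivations described in the list above rather than calling into Cano; the Shard variant and both helper functions are hypothetical and exist only for illustration.

```rust
use std::any::type_name;

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step {
    Fetch,
    Done,
    // Hypothetical payload-carrying variant, for illustration only.
    Shard(u32),
}

#[allow(dead_code)]
struct FetchTask;

/// Mirrors the documented derivation of the `state` label: Debug-format the state.
fn state_label(state: &Step) -> String {
    format!("{:?}", state)
}

/// Mirrors the documented default of the `task` label: the type name,
/// which includes the module path of the type.
fn default_task_label<T>() -> &'static str {
    type_name::<T>()
}
```

Note that a variant carrying data formats to a distinct string per payload (Shard(7) and Shard(8) are different label values), so the bound on the state label is the number of distinct state values you register, not the number of enum variants.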

The deepest hot-path metrics — per-attempt retry-loop counters (cano_task_attempts_total), circuit-breaker internals, and poll/batch/step iteration counters — carry no per-state label, keeping their cardinality constant regardless of how many states your workflow defines.

Cost

Compiling the metrics feature in adds a small, bounded cost per state transition (formatting the state as a string for the state label), even when no recorder is installed. If you are building a latency-critical service that does not collect metrics, leave the feature off. When a recorder is installed, the overhead is the same as any other metrics-crate emission: a hash-map lookup plus an atomic increment, comparable to a log line.
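As a rough mental model of "a hash-map lookup plus an atomic increment", the following toy counter registry shows the shape of that hot path. It is an illustrative sketch only, not the code of any real recorder; CounterRegistry and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Toy registry: one atomic counter per metric key.
#[derive(Default)]
struct CounterRegistry {
    counters: RwLock<HashMap<String, Arc<AtomicU64>>>,
}

impl CounterRegistry {
    /// Hot path: a read-locked map lookup followed by one atomic add.
    /// The write lock is taken only the first time a key is seen.
    fn increment(&self, key: &str) {
        {
            let map = self.counters.read().unwrap();
            if let Some(counter) = map.get(key) {
                counter.fetch_add(1, Ordering::Relaxed);
                return;
            }
        }
        let mut map = self.counters.write().unwrap();
        map.entry(key.to_owned())
            .or_default()
            .fetch_add(1, Ordering::Relaxed);
    }

    /// Current value of a counter, or 0 if the key was never incremented.
    fn get(&self, key: &str) -> u64 {
        let map = self.counters.read().unwrap();
        map.get(key).map_or(0, |c| c.load(Ordering::Relaxed))
    }
}
```

After the first emission for a given key, every later increment stays on the read-locked fast path.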

Correlating with Traces

When you also enable the tracing feature, metrics and tracing interoperate through tracing spans: the metrics-tracing-context crate makes any metric emitted inside a span inherit that span's fields as labels. This is wiring you do in your application — Cano never depends on metrics-tracing-context itself (the same posture it takes toward tracing-subscriber).

[dependencies]
cano = { version = "0.13", features = ["metrics", "tracing"] }

metrics-tracing-context = "0.18"
tracing-subscriber = "0.3"

use metrics_tracing_context::{MetricsLayer, TracingContextLayer};
use metrics_util::layers::Layer;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};

// 1. Wrap your metrics recorder so it reads the current span's fields.
let recorder = TracingContextLayer::all().layer(your_recorder);
metrics::set_global_recorder(recorder).expect("install metrics recorder");
cano::metrics::describe();

// 2. Add MetricsLayer to your tracing subscriber so spans expose their fields.
tracing_subscriber::registry()
    .with(MetricsLayer::new())
    .with(tracing_subscriber::fmt::layer())
    .init();

With both layers installed, a span you open around Workflow::orchestrate — e.g. info_span!("api_request", request_id = …) — tags every cano_* metric recorded during that run with request_id. Cano's own default workflow_orchestrate and workflow_resume spans carry a workflow_id field whenever one is set via with_workflow_id, so that becomes a metric label too. Span fields are merged with the explicit labels Cano already attaches (state, task, flow, outcome, …) — Cano deliberately does not name any span field after an existing metric label, so there is no merge ambiguity.

Cardinality

TracingContextLayer::all() promotes every field of every entered span as a label — including Cano-internal span fields such as max_attempts from the retry-loop spans. For production, prefer TracingContextLayer::new(filter) with a LabelFilter that allow-lists just the fields you author, to keep metric cardinality bounded.
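The allow-list decision itself is simple. The sketch below models it with plain strings so the effect is easy to see; it deliberately does not implement the real LabelFilter trait from metrics-tracing-context, and FieldAllowlist with its methods is a hypothetical stand-in. The field names are the ones mentioned in this section.

```rust
use std::collections::HashSet;

/// Toy stand-in for an allow-list label filter. It models only the decision
/// "should this span field become a metric label?" using plain strings.
struct FieldAllowlist {
    allowed: HashSet<&'static str>,
}

impl FieldAllowlist {
    fn new(fields: &[&'static str]) -> Self {
        Self {
            allowed: fields.iter().copied().collect(),
        }
    }

    /// True when the span field is on the allow-list.
    fn should_include(&self, field: &str) -> bool {
        self.allowed.contains(field)
    }

    /// Keeps only the allow-listed (field, value) pairs of a span, in order.
    fn promote<'a>(&self, span_fields: &[(&'a str, &'a str)]) -> Vec<(&'a str, &'a str)> {
        span_fields
            .iter()
            .copied()
            .filter(|(field, _)| self.should_include(field))
            .collect()
    }
}
```

With an allow-list of request_id and workflow_id, a span carrying request_id, max_attempts, and workflow_id contributes only the first and last as labels, so the internal max_attempts field never reaches the recorder.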

Runnable example: cargo run --example metrics_tracing_context --features "metrics tracing" — wires the two layers, runs a workflow under a workflow_id and another inside a user api_request span, and prints the captured metrics so you can see the workflow_id and request_id labels propagated purely from span context.

Known Limitation

The #[task::poll] and #[task::stepped] macros have two usage forms. The trait-impl form (impl PollTask<S> for T / impl SteppedTask<S> for T) inlines the loop body into the synthesised Task::run, so cano_poll_iterations_total and cano_step_iterations_total are not emitted for that form.

The inherent-impl form (#[task::poll(state = S)] impl T { async fn poll ... } / #[task::stepped(state = S)] impl T { async fn step ... }, the recommended form) and Workflow::register_stepped (engine-owned loop) both emit the iteration counters as expected.

Full Example

Install a DebuggingRecorder (useful for tests and self-contained demos), call cano::metrics::describe(), attach a MetricsObserver, run a workflow directly and then under the scheduler, then dump the captured snapshot. This mirrors the metrics_demo example shipped with the crate.

use cano::prelude::*;
use metrics_util::debugging::{DebugValue, DebuggingRecorder};
use std::sync::Arc;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step { Fetch, Process, Done }

struct FetchTask;

#[task]
impl Task<Step> for FetchTask {
    async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
        tokio::time::sleep(Duration::from_millis(5)).await;
        Ok(TaskResult::Single(Step::Process))
    }
}

struct ProcessTask;

#[task]
impl Task<Step> for ProcessTask {
    async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
        tokio::time::sleep(Duration::from_millis(3)).await;
        Ok(TaskResult::Single(Step::Done))
    }
}

fn workflow() -> Workflow<Step> {
    Workflow::bare()
        .with_observer(Arc::new(MetricsObserver::new()))
        .register(Step::Fetch, FetchTask)
        .register(Step::Process, ProcessTask)
        .add_exit_state(Step::Done)
}

#[tokio::main]
async fn main() {
    // 1. Install recorder and register descriptions.
    let recorder = DebuggingRecorder::new();
    let snapshotter = recorder.snapshotter();
    metrics::set_global_recorder(recorder).expect("install metrics recorder");
    cano::metrics::describe();

    // In production, use a real exporter instead:
    // metrics_exporter_prometheus::PrometheusBuilder::new()
    //     .install().expect("install prometheus exporter");

    // 2. Run the workflow a few times directly.
    for _ in 0..3 {
        workflow()
            .orchestrate(Step::Fetch)
            .await
            .expect("workflow run");
    }

    // 3. Run the same workflow under the scheduler for ~1.2s, firing every 1s.
    let mut scheduler = Scheduler::new();
    scheduler
        .every_seconds("demo_flow", workflow(), Step::Fetch, 1)
        .expect("register flow");
    let running = scheduler.start().await.expect("start scheduler");
    tokio::time::sleep(Duration::from_millis(1200)).await;
    running.stop().await.expect("stop scheduler");

    // 4. Dump every captured metric (sorted alphabetically).
    println!("\n=== Cano metrics ===");
    let mut rows = snapshotter.snapshot().into_vec();
    rows.sort_by(|a, b| a.0.key().name().cmp(b.0.key().name()));
    for (ck, _unit, _desc, value) in rows {
        let key = ck.key();
        let labels: Vec<String> = key
            .labels()
            .map(|l| format!("{}={}", l.key(), l.value()))
            .collect();
        let label_str = if labels.is_empty() {
            String::new()
        } else {
            format!("{{{}}}", labels.join(","))
        };
        match value {
            DebugValue::Counter(v) => println!("  {}{label_str} = {v}", key.name()),
            DebugValue::Gauge(v)   => println!("  {}{label_str} = {}", key.name(), v.into_inner()),
            DebugValue::Histogram(s) => {
                let n = s.len();
                let sum: f64 = s.iter().map(|x| x.into_inner()).sum();
                println!("  {}{label_str} = {{count={n}, sum={sum:.6}s}}", key.name());
            }
        }
    }
}

Runnable example: cargo run --example metrics_demo --features "metrics scheduler" — installs a DebuggingRecorder, attaches a MetricsObserver, runs a workflow and a scheduled flow, and prints every emitted metric. For a real Prometheus exporter, swap in metrics_exporter_prometheus::PrometheusBuilder::new().install() before cano::metrics::describe().
