Resilience

Recover from transient faults — retries, timeouts, circuit breakers, bulkheads, panic safety, scheduler backoff.

"Resilient" in Cano's tagline isn't a vibe — it's a set of concrete, composable primitives that sit on the FSM dispatch path. Every one of them is opt-in: a workflow that wires none of them up pays nothing, and the hot path stays allocation-light. This is the guide: the circuit breaker gets its full treatment here; retries, per-attempt timeouts, and bulkheads get an overview with a pointer to the API reference page that owns the builder methods.

The self-healing half of the tagline — checkpoint + resume, sagas, observers, health probes — lives in Recovery, Saga, and Observers.

Retries

A task's config() returns a TaskConfig whose retry_mode drives the workflow dispatcher's retry loop. The default is exponential backoff with jitter (3 retries, 100ms base, 2.0× multiplier, 30s cap, 0.1 jitter); RetryMode also has None and Fixed(count, delay). On exhaustion the loop returns CanoError::RetryExhausted wrapping the last error.

use cano::prelude::*;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step { Fetch, Done }

#[derive(Clone)]
struct FetchTask;

#[task(state = Step)]
impl FetchTask {
    fn config(&self) -> TaskConfig {
        // 5 attempts, 200ms apart; or `.with_exponential_retry(n)` / `TaskConfig::minimal()` for none.
        TaskConfig::new().with_fixed_retry(4, Duration::from_millis(200))
    }
    async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
        // ... call the flaky thing ...
        Ok(TaskResult::Single(Step::Done))
    }
}

See Tasks → Configuration & Retries for the full RetryMode / TaskConfig reference.

Per-Attempt Timeouts

TaskConfig::with_attempt_timeout(Duration) wraps each retry attempt in tokio::time::timeout. A blown deadline produces CanoError::Timeout, which the retry loop treats as a recoverable failure — the attempt is retried like any other, and only if every attempt times out does the final timeout get wrapped in CanoError::RetryExhausted. Combine it with retries to bound total wall-clock per state: 3 attempts × a 2s attempt timeout ≈ 6s worst case (plus backoff).

#[task(state = Step)]
impl CallTask {
    fn config(&self) -> TaskConfig {
        TaskConfig::new()
            .with_fixed_retry(2, Duration::from_millis(100))
            .with_attempt_timeout(Duration::from_secs(2)) // each attempt gets 2s
    }
    async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
        Ok(TaskResult::Single(Step::Done))
    }
}

Distinct from the workflow-wide timeout (Workflow::with_timeout), which bounds the whole orchestration, not one attempt. The full TaskConfig / RetryMode API — including how attempt timeouts compose with each retry mode — lives in Tasks → Configuration & Retries.

Circuit Breakers

Retries help with transient faults; a CircuitBreaker helps when a dependency is down — it stops hammering it. Once it has seen failure_threshold consecutive failures the breaker opens: every subsequent call short-circuits with CanoError::CircuitOpen without invoking the task body and without consuming a retry attempt, so the unhealthy dependency gets a break to recover.

A breaker is cheap to clone (it's an Arc inside) — share one Arc<CircuitBreaker> across every task that hits the same dependency so a trip from any caller protects every caller, including tasks running in parallel inside a split/join state. Internally it's a synchronous std::sync::Mutex with no awaits held across the critical section, so concurrent acquires from split tasks are safe.

The state machine

CircuitBreaker state machine

stateDiagram-v2 [*] --> Closed Closed --> Open: failure_threshold consecutive failures Open --> HalfOpen: reset_timeout elapsed (lazy, on next acquire) HalfOpen --> Closed: half_open_max_calls consecutive successes HalfOpen --> Open: any failure (fresh cool-down)

The state is Closed → Open { until } → HalfOpen → Closed:

Closed — calls flow through; consecutive failures count toward failure_threshold. A success resets the counter.
Open { until } — every try_acquire returns CanoError::CircuitOpen until the clock passes until (= when_it_opened + reset_timeout).
HalfOpen — entered lazily: there is no background task; the first try_acquire after until performs the Open → HalfOpen transition. It admits up to half_open_max_calls trial calls; that many consecutive successes close the breaker, and any failure re-opens it with a fresh cool-down.

`CircuitPolicy`

A breaker is constructed from a CircuitPolicy { failure_threshold, reset_timeout, half_open_max_calls }:

Field	Type	Meaning
`failure_threshold`	`u32`	Consecutive failures in `Closed` that trip the breaker to `Open`.
`reset_timeout`	`Duration`	How long the breaker stays `Open` before the next acquire is allowed to probe (`Open → HalfOpen`).
`half_open_max_calls`	`u32`	Does double duty: the cap on concurrent trial calls admitted in `HalfOpen`, and the number of consecutive successes needed to close. So `half_open_max_calls > 1` means "admit up to N concurrent probes, and close only after N of them all succeed."

CircuitBreaker::new panics on a misconfigured policy at construction: half_open_max_calls == 0 would deadlock the breaker permanently in HalfOpen (no probe could ever be admitted), and values approaching u32::MAX would either saturate the success counter before reaching the threshold or take effectively forever to close. Both are programmer errors, caught before any task runs.

Permits and the RAII API

The breaker's primitive operations are:

try_acquire() -> Result<Permit, CanoError> — Err(CanoError::CircuitOpen) when the breaker is Open (or when HalfOpen has already handed out its half_open_max_calls trial permits).
record_success(permit) — the call succeeded.
record_failure(permit) — the call failed.

A Permit that is dropped without being consumed counts as a failure — so an early return, a ? bail-out, or a panic doesn't leave the breaker believing the call succeeded. That's the panic-safety guarantee for the manual path.

Permits also carry an epoch tag — a counter the breaker bumps on every state transition (Closed → Open, Open → HalfOpen, HalfOpen → Open, HalfOpen → Closed). When a permit is consumed (success, failure, or RAII drop) the breaker compares the permit's epoch against the current one and silently discards stale outcomes. This means a slow caller whose call straddles a state-machine session — e.g. a request that started in Closed and only returned after the breaker tripped, cooled down, and entered a fresh HalfOpen probe — cannot accidentally close the breaker on the strength of a result that was never meant to count as a probe.

Wiring it in: `TaskConfig::with_circuit_breaker`

The common path — attach the breaker to a task's config and the workflow's retry loop does the rest. It consults the breaker before each attempt, records the outcome after the task body, and an open breaker short-circuits the whole retry loop: the CircuitOpen error is returned raw, never wrapped in RetryExhausted. A per-attempt timeout firing is recorded as a circuit failure too, so the breaker also guards against slow upstreams.

use cano::prelude::*;
use std::sync::Arc;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step { Call, Done }

#[derive(Clone)]
struct CallUpstream { breaker: Arc<CircuitBreaker> }

#[task(state = Step)]
impl CallUpstream {
    fn config(&self) -> TaskConfig {
        TaskConfig::new()
            .with_fixed_retry(2, Duration::from_millis(50))
            .with_circuit_breaker(Arc::clone(&self.breaker))
    }
    async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
        // ... call the dependency; an Err here counts against the breaker ...
        Ok(TaskResult::Single(Step::Done))
    }
}

let breaker = Arc::new(CircuitBreaker::new(CircuitPolicy {
    failure_threshold: 3,
    reset_timeout: Duration::from_secs(5),
    half_open_max_calls: 1,
}));
let workflow = Workflow::bare()
    .register(Step::Call, CallUpstream { breaker: Arc::clone(&breaker) })
    .add_exit_state(Step::Done);

Recovery needs a fresh call

A breaker that trips mid-loop ends that retry loop immediately — even when the remaining retry attempts (with their backoff) could outlast the breaker's reset_timeout. Recovery is the next dispatch of that state, after the cool-down; the loop will not silently re-probe the breaker on its own. Retrying against an open breaker would only pile load onto the dependency the breaker is already protecting.

Driving it by hand: `try_acquire` / `record_*`

When wiring it through TaskConfig is awkward — e.g. you guard several distinct calls in one task body, or the breaker is shared with non-task code — register the breaker as a resource (it implements Resource with no-op lifecycle), look it up by key inside the task, and drive the RAII API directly:

let permit = breaker.try_acquire()?;        // Err(CanoError::CircuitOpen) when open
match do_the_call().await {
    Ok(v)  => { breaker.record_success(permit); Ok(v) }
    Err(e) => { breaker.record_failure(permit); Err(e) }
}

The trade-off: this path bypasses both the retry-loop short-circuit (you decide whether CircuitOpen aborts or retries) and the before-call / after-call ordering guarantee the with_circuit_breaker integration gives you for free. So prefer TaskConfig::with_circuit_breaker whenever a task maps one-to-one to one guarded call; reach for the manual path only when it genuinely doesn't.

Different layer

The scheduler's flow-level Status::Tripped (a scheduled flow that stops firing after a streak of failed runs) is a separate mechanism from this task-level CanoError::CircuitOpen — they live at different layers and don't interact. See Scheduler → Backoff & Trip State.

Runnable demos: cargo run --example circuit_breaker (the TaskConfig::with_circuit_breaker integration path) and cargo run --example circuit_breaker_manual (the manual try_acquire / record_* path shown above — breaker opens after a failure streak, rejects calls while open, then closes via HalfOpen).

Bulkheads (split concurrency)

A split/join state runs its tasks in parallel — but "parallel" can mean "500 connections to a database that handles 50". JoinConfig::with_bulkhead(n) gates the task bodies on a shared Semaphore so at most n run concurrently; the rest queue. Tasks still all get spawned and the join strategy is unchanged — only their execution is rate-limited. Full details on the Split & Join page.

let join = JoinConfig::new(JoinStrategy::All, S::Join).with_bulkhead(8); // ≤ 8 at a time
let workflow = Workflow::bare()
    .register_split(S::Fan, (0..200).map(|_| W).collect(), join)
    .add_exit_state(S::Done);

Runnable example: cargo run --example split_bulkhead — 8 split tasks behind a with_bulkhead(2); the start/end timestamps make the rate-limiting visible.

Panic Safety

A panic! inside a task body — or a `.unwrap()` that fired — does not unwind through the workflow engine and abort the runtime worker. The dispatcher wraps each task in catch_unwind; a panic becomes CanoError::TaskExecution("panic: …"), which then flows through the normal retry / error / (with a saga) compensation path. Split tasks do the same per spawned task, so a panic preserves its task index. There's nothing to configure — it's always on.

Runnable example: cargo run --example panic_safety — a task that panic!s mid-workflow; the engine returns CanoError::TaskExecution("panic: …") and the preceding compensatable step's rollback still runs.

Scheduler Backoff & Trip

Behind the scheduler feature gate.

The primitives above act per dispatch. When a workflow runs on a timer there's a second, flow-level layer: a per-flow BackoffPolicy stretches the gap between failed runs and can trip a flow that keeps failing so it stops firing until RunningScheduler::reset_flow(id) clears it — surfaced as Status::Backoff { … } / Status::Tripped { … }. That's documented on its own page: Scheduler → Backoff & Trip State (runnable demo: cargo run --example scheduler_backoff --features scheduler).

Composing It All

These stack cleanly. A typical "talk to a flaky external service" state:

Circuit breaker (shared across every task hitting that service) — bail fast when it's down.
Per-attempt timeout — don't hang on a slow call.
Retries with backoff — ride out transients.
If the call mutated something, make it a compensatable task so a downstream failure can undo it.
Attach a checkpoint store so a crash mid-workflow doesn't lose progress.
Attach a WorkflowObserver (or just TracingObserver) to see retries, circuit-opens, and checkpoints as they happen.

The circuit_breaker, workflow_recovery, saga_payment, and workflow_observer / observer_metrics examples shipped with the crate each exercise one of these in isolation.

Resilience

#Retries

#Per-Attempt Timeouts

#Circuit Breakers

#The state machine

#CircuitPolicy

#Permits and the RAII API

#Wiring it in: TaskConfig::with_circuit_breaker

#Driving it by hand: try_acquire / record_*

#Bulkheads (split concurrency)

#Panic Safety

#Scheduler Backoff & Trip

#Composing It All