Resilience
Recover from transient faults — retries, timeouts, circuit breakers, bulkheads, panic safety, scheduler backoff.
"Resilient" in Cano's tagline isn't a vibe — it's a set of concrete, composable primitives that sit on the FSM dispatch path. Every one of them is opt-in: a workflow that wires none of them up pays nothing, and the hot path stays allocation-light. This is the guide: the circuit breaker gets its full treatment here; retries, per-attempt timeouts, and bulkheads get an overview with a pointer to the API reference page that owns the builder methods.
The self-healing half of the tagline — checkpoint + resume, sagas, observers, health probes — lives in Recovery, Saga, and Observers.
Retries
A task's config() returns a TaskConfig
whose retry_mode drives the workflow dispatcher's retry loop. The default is
exponential backoff with jitter (3 retries, 100ms base, 2.0× multiplier, 30s cap, 0.1 jitter);
RetryMode also has None and Fixed(count, delay). On exhaustion
the loop returns CanoError::RetryExhausted wrapping the last error.
use cano::prelude::*;
use std::time::Duration;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step { Fetch, Done }
#[derive(Clone)]
struct FetchTask;
#[task(state = Step)]
impl FetchTask {
fn config(&self) -> TaskConfig {
// 5 attempts, 200ms apart; or `.with_exponential_retry(n)` / `TaskConfig::minimal()` for none.
TaskConfig::new().with_fixed_retry(4, Duration::from_millis(200))
}
async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
// ... call the flaky thing ...
Ok(TaskResult::Single(Step::Done))
}
}
See Tasks → Configuration & Retries for the full
RetryMode / TaskConfig reference.
Per-Attempt Timeouts
TaskConfig::with_attempt_timeout(Duration) wraps each retry attempt in
tokio::time::timeout. A blown deadline produces CanoError::Timeout, which
the retry loop treats as a recoverable failure — the attempt is retried like any other, and only if
every attempt times out does the final timeout get wrapped in CanoError::RetryExhausted.
Combine it with retries to bound total wall-clock per state: 3 attempts × a 2s attempt timeout ≈ 6s
worst case (plus backoff).
#[task(state = Step)]
impl CallTask {
fn config(&self) -> TaskConfig {
TaskConfig::new()
.with_fixed_retry(2, Duration::from_millis(100))
.with_attempt_timeout(Duration::from_secs(2)) // each attempt gets 2s
}
async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
Ok(TaskResult::Single(Step::Done))
}
}
Distinct from the workflow-wide timeout (Workflow::with_timeout), which
bounds the whole orchestration, not one attempt. The full TaskConfig /
RetryMode API — including how attempt timeouts compose with each retry mode — lives in
Tasks → Configuration & Retries.
Circuit Breakers
Retries help with transient faults; a CircuitBreaker helps when a dependency
is down — it stops hammering it. Once it has seen failure_threshold consecutive
failures the breaker opens: every subsequent call short-circuits with
CanoError::CircuitOpen without invoking the task body and without consuming a
retry attempt, so the unhealthy dependency gets a break to recover.
A breaker is cheap to clone (it's an Arc inside) — share one
Arc<CircuitBreaker> across every task that hits the same dependency so a
trip from any caller protects every caller, including tasks running in parallel inside a
split/join state. Internally it's a synchronous
std::sync::Mutex with no awaits held across the critical section, so concurrent
acquires from split tasks are safe.
The state machine
CircuitBreaker state machine
The state is Closed → Open { until } → HalfOpen → Closed:
Closed— calls flow through; consecutive failures count towardfailure_threshold. A success resets the counter.Open { until }— everytry_acquirereturnsCanoError::CircuitOpenuntil the clock passesuntil(= when_it_opened + reset_timeout).HalfOpen— entered lazily: there is no background task; the firsttry_acquireafteruntilperforms theOpen → HalfOpentransition. It admits up tohalf_open_max_callstrial calls; that many consecutive successes close the breaker, and any failure re-opens it with a fresh cool-down.
CircuitPolicy
A breaker is constructed from a CircuitPolicy { failure_threshold, reset_timeout,
half_open_max_calls }:
| Field | Type | Meaning |
|---|---|---|
failure_threshold | u32 | Consecutive failures in Closed that trip the breaker to Open. |
reset_timeout | Duration | How long the breaker stays Open before the next acquire is allowed to probe (Open → HalfOpen). |
half_open_max_calls | u32 | Does double duty: the cap on concurrent trial calls admitted in HalfOpen, and the number of consecutive successes needed to close. So half_open_max_calls > 1 means "admit up to N concurrent probes, and close only after N of them all succeed." |
CircuitBreaker::new panics on a misconfigured policy at construction:
half_open_max_calls == 0 would deadlock the breaker permanently in HalfOpen
(no probe could ever be admitted), and values approaching u32::MAX would either
saturate the success counter before reaching the threshold or take effectively forever to close.
Both are programmer errors, caught before any task runs.
Permits and the RAII API
The breaker's primitive operations are:
try_acquire() -> Result<Permit, CanoError>—Err(CanoError::CircuitOpen)when the breaker isOpen(or whenHalfOpenhas already handed out itshalf_open_max_callstrial permits).record_success(permit)— the call succeeded.record_failure(permit)— the call failed.
A Permit that is dropped without being consumed counts as a
failure — so an early return, a ? bail-out, or a panic doesn't leave
the breaker believing the call succeeded. That's the panic-safety guarantee for the manual path.
Permits also carry an epoch tag — a counter the breaker bumps on every state transition
(Closed → Open, Open → HalfOpen, HalfOpen → Open,
HalfOpen → Closed). When a permit is consumed (success, failure, or RAII drop) the
breaker compares the permit's epoch against the current one and silently discards stale
outcomes. This means a slow caller whose call straddles a state-machine session — e.g. a
request that started in Closed and only returned after the breaker tripped, cooled
down, and entered a fresh HalfOpen probe — cannot accidentally close the breaker on the
strength of a result that was never meant to count as a probe.
Wiring it in: TaskConfig::with_circuit_breaker
The common path — attach the breaker to a task's config and the workflow's retry loop does the rest.
It consults the breaker before each attempt, records the outcome after the task body, and an open
breaker short-circuits the whole retry loop: the CircuitOpen error is returned
raw, never wrapped in RetryExhausted. A per-attempt
timeout firing is recorded as a circuit failure too, so the breaker also
guards against slow upstreams.
use cano::prelude::*;
use std::sync::Arc;
use std::time::Duration;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Step { Call, Done }
#[derive(Clone)]
struct CallUpstream { breaker: Arc<CircuitBreaker> }
#[task(state = Step)]
impl CallUpstream {
fn config(&self) -> TaskConfig {
TaskConfig::new()
.with_fixed_retry(2, Duration::from_millis(50))
.with_circuit_breaker(Arc::clone(&self.breaker))
}
async fn run_bare(&self) -> Result<TaskResult<Step>, CanoError> {
// ... call the dependency; an Err here counts against the breaker ...
Ok(TaskResult::Single(Step::Done))
}
}
let breaker = Arc::new(CircuitBreaker::new(CircuitPolicy {
failure_threshold: 3,
reset_timeout: Duration::from_secs(5),
half_open_max_calls: 1,
}));
let workflow = Workflow::bare()
.register(Step::Call, CallUpstream { breaker: Arc::clone(&breaker) })
.add_exit_state(Step::Done);
A breaker that trips mid-loop ends that retry loop immediately — even when the remaining
retry attempts (with their backoff) could outlast the breaker's reset_timeout. Recovery
is the next dispatch of that state, after the cool-down; the loop will not silently re-probe
the breaker on its own. Retrying against an open breaker would only pile load onto the dependency the
breaker is already protecting.
Driving it by hand: try_acquire / record_*
When wiring it through TaskConfig is awkward — e.g. you guard several distinct calls in
one task body, or the breaker is shared with non-task code — register the breaker as a
resource (it implements Resource with no-op lifecycle), look
it up by key inside the task, and drive the RAII API directly:
let permit = breaker.try_acquire()?; // Err(CanoError::CircuitOpen) when open
match do_the_call().await {
Ok(v) => { breaker.record_success(permit); Ok(v) }
Err(e) => { breaker.record_failure(permit); Err(e) }
}
The trade-off: this path bypasses both the retry-loop short-circuit (you decide
whether CircuitOpen aborts or retries) and the before-call /
after-call ordering guarantee the with_circuit_breaker integration gives you for free.
So prefer TaskConfig::with_circuit_breaker whenever a task
maps one-to-one to one guarded call; reach for the manual path only when it genuinely doesn't.
The scheduler's flow-level Status::Tripped
(a scheduled flow that stops firing after a streak of failed runs) is a separate mechanism
from this task-level CanoError::CircuitOpen — they live at different layers and don't
interact. See Scheduler → Backoff & Trip State.
Runnable demos: cargo run --example circuit_breaker (the
TaskConfig::with_circuit_breaker integration path) and
cargo run --example circuit_breaker_manual (the manual try_acquire /
record_* path shown above — breaker opens after a failure streak, rejects calls while
open, then closes via HalfOpen).
Bulkheads (split concurrency)
A split/join state runs its tasks in parallel — but "parallel" can mean
"500 connections to a database that handles 50". JoinConfig::with_bulkhead(n)
gates the task bodies on a shared Semaphore so at most n run concurrently;
the rest queue. Tasks still all get spawned and the join strategy is unchanged — only their
execution is rate-limited. Full details on the
Split & Join page.
let join = JoinConfig::new(JoinStrategy::All, S::Join).with_bulkhead(8); // ≤ 8 at a time
let workflow = Workflow::bare()
.register_split(S::Fan, (0..200).map(|_| W).collect(), join)
.add_exit_state(S::Done);
Runnable example: cargo run --example split_bulkhead — 8 split tasks behind a
with_bulkhead(2); the start/end timestamps make the rate-limiting visible.
Panic Safety
A panic! inside a task body — or a `.unwrap()` that fired — does not
unwind through the workflow engine and abort the runtime worker. The dispatcher wraps each task in
catch_unwind; a panic becomes CanoError::TaskExecution("panic: …"), which
then flows through the normal retry / error / (with a saga) compensation path. Split tasks do the
same per spawned task, so a panic preserves its task index. There's nothing to configure — it's
always on.
Runnable example: cargo run --example panic_safety — a task that panic!s
mid-workflow; the engine returns CanoError::TaskExecution("panic: …") and the preceding
compensatable step's rollback still runs.
Scheduler Backoff & Trip
Behind the scheduler feature gate.
The primitives above act per dispatch. When a workflow runs on a timer there's a second,
flow-level layer: a per-flow BackoffPolicy stretches the gap between failed
runs and can trip a flow that keeps failing so it stops firing until
RunningScheduler::reset_flow(id) clears it — surfaced as
Status::Backoff { … } / Status::Tripped { … }. That's documented on its
own page: Scheduler → Backoff & Trip State (runnable
demo: cargo run --example scheduler_backoff --features scheduler).
Composing It All
These stack cleanly. A typical "talk to a flaky external service" state:
- Circuit breaker (shared across every task hitting that service) — bail fast when it's down.
- Per-attempt timeout — don't hang on a slow call.
- Retries with backoff — ride out transients.
- If the call mutated something, make it a compensatable task so a downstream failure can undo it.
- Attach a checkpoint store so a crash mid-workflow doesn't lose progress.
- Attach a
WorkflowObserver(or justTracingObserver) to see retries, circuit-opens, and checkpoints as they happen.
The circuit_breaker, workflow_recovery, saga_payment, and
workflow_observer / observer_metrics examples shipped with the crate each
exercise one of these in isolation.