Incident response

audience: operators

Runbook entries for the common failure modes. One entry per named alert; each entry is a cause, a verification step, and a mitigation. Keep this page current; on-call engineers read it half-asleep.

General principle

Before any mitigation, run the slot walk (see Wiring the organisms together — the slot as foreign key). Pick the current slot, query each organism’s collection, and identify the last organism to have committed. That is where the pipeline stalled. Everything downstream of that point is paused, not broken.
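The slot walk can be sketched as follows. This is a minimal illustration, not the project's tooling: it assumes a linear pipeline whose order is inferred from the stop sequence in Pausing the lattice, and it assumes you have already queried each organism's collection and reduced the answer to a boolean. The names `PIPELINE` and `last_committed` are hypothetical.

```python
from typing import Mapping, Optional

# Assumed pipeline order, upstream first. Inferred from the reverse-order
# stop sequence in "Pausing the lattice"; adjust if your lattice differs.
PIPELINE = ["zipnet", "unseal", "offer", "atelier", "relay", "tally"]


def last_committed(commits: Mapping[str, bool]) -> Optional[str]:
    """Return the furthest-downstream organism that committed for the slot.

    `commits` maps organism name -> True if its collection holds a commit
    for the slot being walked. The walk stops at the first organism with
    no commit; everything after it is treated as paused.
    Returns None if even the first organism has not committed.
    """
    last = None
    for organism in PIPELINE:
        if not commits.get(organism, False):
            break
        last = organism
    return last
```

For example, `last_committed({"zipnet": True, "unseal": True, "offer": True})` returns `"offer"`, which points the investigation at the hand-off between offer and atelier.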

Missing upstream

Alert. <org>_upstream_peers{source="<upstream>"} = 0 for more than 60 seconds.

Cause. The organism’s driver cannot bond to any peer of its upstream organism. Common reasons: upstream committee is down; network partition; mismatched LatticeConfig (rare, but possible after a botched rotation).

Verify.

  1. On the upstream organism’s hosts, is the process up? Check the <upstream>_up metric.
  2. Is the lattice fingerprint consistent? Print lattice_id() on both organism hosts and compare.
  3. Is discovery healthy? Check discovery_peers_total{lattice=...}.

Mitigate.

  • Upstream is down: bring it back. Do not try to work around upstream absence by re-configuring downstream.
  • Fingerprint mismatch: emergency retire the lattice per Rotations and upgrades — Lattice retirement.
  • Network partition: wait out the mosaik discovery back-off; if not resolved in five minutes, escalate to the upstream’s operator.
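The verify-then-mitigate flow for this alert can be written down as a small decision function. This is a sketch under assumptions: the function name and its inputs are hypothetical, and it expects you to have already read the `<upstream>_up` metric and printed `lattice_id()` on both hosts by hand.

```python
def triage_missing_upstream(upstream_up: bool,
                            local_lattice_id: str,
                            upstream_lattice_id: str) -> str:
    """Map the verification results to the runbook's mitigation.

    Checks are ordered as in the Verify list: process liveness first,
    then fingerprint consistency; a partition is what remains.
    """
    if not upstream_up:
        return "bring upstream back; do not reconfigure downstream"
    if local_lattice_id != upstream_lattice_id:
        return "fingerprint mismatch: emergency-retire the lattice"
    return ("suspect network partition: wait out the discovery back-off, "
            "escalate to the upstream's operator after five minutes")
```

Ordering matters here: a down upstream also produces zero peers, so fingerprint and discovery checks are only informative once the upstream process is confirmed up.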

Committee size below quorum

Alert. <org>_committee_size < floor(n/2) + 1 on any member for more than 60 seconds.

Cause. Enough committee members have gone offline that Raft cannot commit.

Verify. <org>_member_up per member identifies which members are down.
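When counting members, it helps to know how far the committee is from losing the ability to commit. The sketch below assumes the standard Raft majority rule (a strict majority of the n voting members); the helper names are hypothetical, and your committee's configuration may differ.

```python
def raft_quorum(n: int) -> int:
    """Smallest number of members that forms a strict majority of n."""
    return n // 2 + 1


def can_commit(n: int, members_up: int) -> bool:
    """True if enough members are up for a majority to commit."""
    return members_up >= raft_quorum(n)


def failures_tolerated(n: int) -> int:
    """How many members can be down while a majority remains."""
    return n - raft_quorum(n)
```

Under this rule a five-member committee keeps committing with three members up and stalls at two.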

Mitigate.

TDX attestation failure

Alert. <org>_member_tdx_attested{member=...} = 0.

Cause. The member’s TDX image is not producing a valid quote. Possible reasons: the image booted without TDX; the hardware’s attestation service is reachable but returning errors; or the MR_TD does not match the pinned value.

Verify.

  • systemctl status builder@<org>-member for logs.
  • The TDX quote provider’s own logs on the host.
  • Compare the <ORG>_MRTD environment variable against the pinned LATTICE_CONFIG_HEX fingerprint.

Mitigate.

  • Transient quote failure: wait for the attestation service to recover, then restart the unit.
  • MR_TD mismatch: if you changed the image intentionally, this is a fingerprint change — see Rotations and upgrades — Rolling a TDX image with a new MR_TD. If you did not change the image, rebuild from a clean source and compare MR_TDs.

atelier simulation divergence

Alert. atelier_simulation_divergence_total > 0.

Cause. Atelier committee members disagree on a simulation output. Either the chain RPC differs across members (one host is pinning an old block; another is on the tip), an external input to the simulation has been compromised, or a committee member is malicious.

Verify.

  • Chain RPC head block per member — any laggards?
  • atelier_simulation_input_hash — committee members should have matching hashes for matching slots.
  • Cross-check the divergent member’s output against an independent node running the same chain client.

Mitigate.

  • Lagging RPC: rotate the member onto a healthier RPC endpoint.
  • Malicious member: rotate the member out. The committee loses one member and trust drops to n-1 until a retire-and-replace at the next scheduled window.
  • Repeated divergence: pause the lattice per Pausing the lattice.
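The first two verification checks amount to finding the odd member out. A minimal sketch, assuming you have collected each member's chain RPC head block and its `atelier_simulation_input_hash` for the slot into dictionaries by hand; the function names and the `max_lag` parameter are hypothetical.

```python
from collections import Counter
from typing import List, Mapping


def rpc_laggards(heads: Mapping[str, int], max_lag: int = 0) -> List[str]:
    """Members whose head block trails the highest reported head by more
    than `max_lag` blocks."""
    if not heads:
        return []
    tip = max(heads.values())
    return sorted(m for m, h in heads.items() if tip - h > max_lag)


def hash_outliers(hashes: Mapping[str, str]) -> List[str]:
    """Members whose input hash differs from the most common hash.

    The most common hash is only a heuristic for "correct": on a tie the
    choice is arbitrary, and a compromised majority would look normal.
    """
    if not hashes:
        return []
    majority_hash, _ = Counter(hashes.values()).most_common(1)[0]
    return sorted(m for m, h in hashes.items() if h != majority_hash)
```

A member that appears in both lists is most likely on a lagging RPC. A member that is on the tip but still an outlier deserves the independent-node cross-check.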

relay on-chain mismatch

Alert. relay_on_chain_mismatches_total > 0.

Cause. A header that relay committed as accepted does not match what landed on-chain. Either the relay’s committee is majority-malicious, the proposer rotated the block after ack (MEV-Boost allows this in some configurations), or the relay committed an ack that was itself forged.

Verify.

  • Cross-check the on-chain block’s builder against AcceptedHeaders[S].proposer.
  • tally_evidence_failures_total — does tally agree with relay?
  • Check each relay member’s log for the raw proposer ack payload.

Mitigate.

  • Proposer-side rotation: what landed on-chain is authoritative; tally will not issue an attestation for the mis-accepted block. The alert is noise if this is a one-off.
  • Majority-malicious relay: escalate immediately. Pause the lattice and coordinate with the atelier committee to determine whether to rotate relay or retire the lattice.

tally evidence failure

Alert. tally_evidence_failures_total > 0 sustained.

Cause. Tally cannot reconcile an upstream AcceptedHeaders commit with atelier, offer, or zipnet data. Either an upstream organism committed a fact that does not line up, or tally’s state machine has a bug.

Verify.

  • Walk the slot. Dump every organism’s commit for the slot tally is stumbling on.
  • Compare tally’s expected attribution against what the state machine ought to derive from the dumped commits.

Mitigate.

  • Upstream bug: escalate to the relevant organism’s on-call. Tally pauses attribution for that slot automatically; do not try to force it through.
  • Tally bug: file against tally. In the interim, attestations for that slot are missing, and integrators claim through the on-chain settlement contract’s dispute mechanism (if any).

Pausing the lattice

When the pipeline must stop (for example, a compromised committee or a serious integration bug), the lattice-wide kill switch is “stop every organism’s systemd units in reverse pipeline order”:

systemctl stop builder@tally-member
systemctl stop builder@relay-member
systemctl stop builder@atelier-member
systemctl stop builder@offer-member
systemctl stop builder@unseal-member
systemctl stop builder@zipnet-server builder@zipnet-aggregator

Reverse order so outputs drain before inputs stop. Once every organism is down, integrators see ConnectTimeout until you decide whether to restart, rotate, or retire.

There is no protocol-level “pause” primitive. The lattice is its processes.
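If you keep the unit list in one place, the stop sequence can be derived rather than retyped. The sketch below only builds the command strings and executes nothing. The unit names are copied from the stop sequence above; the `PIPELINE_UNITS` table and `stop_commands` helper are hypothetical.

```python
from typing import List, Tuple

# Units per organism in pipeline order (upstream first). Unit names are
# taken from the stop sequence in this section.
PIPELINE_UNITS: List[Tuple[str, List[str]]] = [
    ("zipnet", ["builder@zipnet-server", "builder@zipnet-aggregator"]),
    ("unseal", ["builder@unseal-member"]),
    ("offer", ["builder@offer-member"]),
    ("atelier", ["builder@atelier-member"]),
    ("relay", ["builder@relay-member"]),
    ("tally", ["builder@tally-member"]),
]


def stop_commands() -> List[str]:
    """Build one `systemctl stop` line per organism, downstream first."""
    return [
        "systemctl stop " + " ".join(units)
        for _, units in reversed(PIPELINE_UNITS)
    ]


if __name__ == "__main__":
    # Print for review; an operator pastes or pipes these deliberately.
    print("\n".join(stop_commands()))
```

Printing rather than executing keeps a human in the loop, which suits a kill switch that takes the whole lattice offline.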

Public communication

Every incident should be accompanied by a status-page update. Integrators rely on your public signals (see Monitoring). Be explicit about scope:

  • Which organism is affected?
  • Is submission still accepted (zipnet up) or not?
  • Is tally still paying (tally up) or not?

Do not post internal debugging detail publicly.

Cross-references