Incident response
audience: operators
Runbook entries for the common failure modes. One entry per named alert; each entry is a cause, a verification step, and a mitigation. Keep this page current; on-call engineers read it half-asleep.
General principle
Before any mitigation, run the slot walk (see Wiring the organisms together — the slot as foreign key). Pick the current slot, query each organism’s collection, and identify the last organism to have committed. That is where the pipeline stalled. Everything downstream of that point is paused, not broken.
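The slot walk above can be sketched as a small shell function. This is a minimal sketch, not the project's tooling: the pipeline order is inferred from the reverse-order stop list in Pausing the lattice, and PIPELINE, COMMITTED, has_commit, and slot_walk are hypothetical names. The has_commit stub reads a local list so the sketch runs offline; a real version would query each organism's collection.

```shell
#!/bin/sh
# Pipeline order, upstream first. ASSUMPTION: inferred from the reverse-order
# stop list in "Pausing the lattice"; adjust if your lattice is wired differently.
PIPELINE="zipnet unseal offer atelier relay tally"

# has_commit ORGANISM SLOT is a HYPOTHETICAL probe. It must exit 0 when the
# organism's collection holds a commit for SLOT. This stub ignores SLOT and
# checks a space-separated list in COMMITTED; replace it with a real query.
has_commit() {
    case " $COMMITTED " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# slot_walk SLOT prints the last organism, in pipeline order, that committed
# for SLOT, or "none" if even the first organism has no commit.
slot_walk() {
    last="none"
    for org in $PIPELINE; do
        if has_commit "$org" "$1"; then
            last="$org"
        else
            break   # first gap found: everything after this point is paused
        fi
    done
    echo "$last"
}
```

With COMMITTED set to "zipnet unseal offer", slot_walk prints offer, which points the investigation at atelier as the first organism that has not committed.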
Missing upstream
Alert. <org>_upstream_peers{source="<upstream>"} = 0 for
more than 60 seconds.
Cause. The organism’s driver cannot bond to any peer of
its upstream organism. Common reasons: upstream committee is
down; network partition; mismatched LatticeConfig (rare, but
possible after a botched rotation).
Verify.
- On the upstream organism’s hosts, is the process up? Check the <upstream>_up metric.
- Is the lattice fingerprint consistent? Print lattice_id() on both organism hosts and compare.
- Is discovery healthy? Check discovery_peers_total{lattice=...}.
Mitigate.
- Upstream is down: bring it back. Do not try to work around upstream absence by re-configuring downstream.
- Fingerprint mismatch: emergency retire the lattice per Rotations and upgrades — Lattice retirement.
- Network partition: wait out the mosaik discovery back-off; if the partition has not resolved within five minutes, escalate to the upstream’s operator.
Committee size below quorum
Alert. <org>_committee_size < ceil(n/2) + 1 on any
member for more than 60 seconds.
Cause. Enough committee members have gone offline that Raft cannot commit.
Verify. <org>_member_up per member identifies which
members are down.
Mitigate.
- Restart the downed members. If a host is unrecoverable, rotate in a replacement per Rotations and upgrades — Rotating a committee member’s peer identity.
- Do not shrink n to route around dead members; changing n changes the fingerprint.
TDX attestation failure
Alert. <org>_member_tdx_attested{member=...} = 0.
Cause. The member’s TDX image is not producing a valid quote. Possible reasons: the image has booted without TDX; the hardware’s attestation service is reachable but returning errors; or the MR_TD does not match the pinned value.
Verify.
- systemctl status builder@<org>-member for logs.
- The TDX quote provider’s own logs on the host.
- <ORG>_MRTD env vs pinned LATTICE_CONFIG_HEX fingerprint.
Mitigate.
- Transient quote failure: wait for the attestation service to recover, restart the unit.
- MR_TD mismatch: if you changed the image intentionally, this is a fingerprint change — see Rotations and upgrades — Rolling a TDX image with a new MR_TD. If you did not change the image, rebuild from a clean source and compare MR_TDs.
atelier simulation divergence
Alert. atelier_simulation_divergence_total > 0.
Cause. Atelier committee members disagree on a simulation output. Either the chain RPC differs across members (one host is pinning an old block; another is on the tip), an external input to the simulation has been compromised, or a committee member is malicious.
Verify.
- Chain RPC head block per member — any laggards?
- atelier_simulation_input_hash: committee members should have matching hashes for matching slots.
- Cross-check the divergent member’s output against an independent node running the same chain client.
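One way to spot the divergent member from the input hashes is to group them and flag whoever disagrees with the most common value. The sketch below assumes you have already collected one "member hash" line per committee member for a single slot; how you export atelier_simulation_input_hash into that shape is left open, and divergent_members is a hypothetical helper name.

```shell
#!/bin/sh
# divergent_members reads "member hash" lines for ONE slot on stdin and prints
# every member whose hash differs from the most common hash. If no single hash
# is most common (a tie), it prints every member, since nobody can be trusted.
# ASSUMPTION: the input format is illustrative, not an existing export format.
divergent_members() {
    awk '
        { hash[$1] = $2 ""; count[$2]++ }   # "" forces string comparison
        END {
            for (h in count) {
                if (count[h] > max) { max = count[h]; best = h; ties = 1 }
                else if (count[h] == max) ties++
            }
            for (m in hash)
                if (ties > 1 || hash[m] != best) print m
        }
    ' | sort
}
```

Agreeing with the majority is not proof of correctness, so the cross-check against an independent node still applies before rotating anyone out.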
Mitigate.
- Lagging RPC: rotate the member onto a healthier RPC endpoint.
- Malicious member: rotate them out (lose one committee member, reduce trust to n-1, retire-and-replace at the next scheduled window).
- Repeated divergence: pause the lattice per Pausing the lattice.
relay on-chain mismatch
Alert. relay_on_chain_mismatches_total > 0.
Cause. A header that relay committed as accepted does not match what landed on-chain. Either the relay’s committee is majority-malicious, the proposer rotated the block after ack (MEV-Boost allows this in some configurations), or the relay committed an ack that was itself forged.
Verify.
- Cross-check the on-chain block’s builder vs the AcceptedHeaders[S].proposer.
- tally_evidence_failures_total: does tally agree with relay?
- Check each relay member’s log for the raw proposer ack payload.
Mitigate.
- Proposer-side rotation: the chain’s reality is what happened; tally will not issue an attestation for the mis-accepted block. The alert is noise if this is a one-off.
- Majority-malicious relay: escalate immediately. Pause the lattice and coordinate with the atelier committee to determine whether to rotate relay or retire the lattice.
tally evidence failure
Alert. tally_evidence_failures_total > 0 sustained.
Cause. Tally cannot reconcile an upstream AcceptedHeaders
commit with atelier, offer, or zipnet data. Either an
upstream organism committed a fact that does not line up, or
tally’s state machine has a bug.
Verify.
- Walk the slot. Dump every organism’s commit for the slot tally is stumbling on.
- Compare tally’s expected attribution against what the state machine ought to derive from the dumped commits.
Mitigate.
- Upstream bug: escalate to the relevant organism’s on-call. Tally pauses attribution for that slot automatically; do not try to force it through.
- Tally bug: file against tally. In the interim, attestations for that slot are missing, and integrators claim through the on-chain settlement contract’s dispute mechanism (if any).
Pausing the lattice
When the pipeline must stop — compromised committee, serious integration bug — the lattice-wide kill switch is “stop every organism’s systemd units in reverse pipeline order”:
systemctl stop builder@tally-member
systemctl stop builder@relay-member
systemctl stop builder@atelier-member
systemctl stop builder@offer-member
systemctl stop builder@unseal-member
systemctl stop builder@zipnet-server builder@zipnet-aggregator
Reverse order so outputs drain before inputs stop. Once every
organism is down, integrators see ConnectTimeout until you
decide whether to restart, rotate, or retire.
There is no protocol-level “pause” primitive. The lattice is its processes.
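For a half-asleep operator, the stop sequence can be wrapped so that it halts on the first failure instead of carrying on to upstream units. A minimal sketch, assuming it runs on a host where all of these units exist (a multi-host lattice needs its own fan-out); PAUSE_ORDER, pause_lattice, and the DRY_RUN switch are illustrative names, not part of any shipped tool.

```shell
#!/bin/sh
# Units in reverse pipeline order, copied from the stop list above.
PAUSE_ORDER="builder@tally-member builder@relay-member builder@atelier-member builder@offer-member builder@unseal-member builder@zipnet-server builder@zipnet-aggregator"

# pause_lattice stops each unit in order and returns non-zero on the first
# failure, so an upstream unit is never stopped while a downstream unit is
# still running. DRY_RUN=1 (hypothetical switch) prints the commands only.
pause_lattice() {
    for unit in $PAUSE_ORDER; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "systemctl stop $unit"
        else
            systemctl stop "$unit" || {
                echo "failed to stop $unit; halting" >&2
                return 1
            }
        fi
    done
}
```

Setting DRY_RUN=1 before calling pause_lattice prints the seven stop commands in order, which is a cheap way to confirm the sequence before running it for real.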
Public communication
Every incident should be accompanied by a status-page update. Integrators rely on your public signals (see Monitoring). Be explicit about scope:
- Which organism is affected?
- Is submission still accepted (zipnet up) or not?
- Is tally still paying (tally up) or not?
Do not post internal debugging detail publicly.
Cross-references
- Monitoring and alerts — the signals that page on-call.
- Rotations and upgrades — rotation procedures the incident playbooks reference.
- Wiring the organisms together — slot-walk debugging technique.