Monitoring and alerts
audience: operators
Every organism exposes Prometheus metrics via mosaik’s built-in exporter. This page is the short list of what to watch, broken out per organism plus a lattice-wide section. The full metrics catalogue lives in Metrics reference.
Enable the exporter
Each organism binary accepts a PROMETHEUS_ADDR env var
(proposal). The reference systemd units expose it at
0.0.0.0:909x with the last digit picked per organism so one
host can run several side by side.
# /etc/builder/common.env
PROMETHEUS_ADDR=0.0.0.0:9090 # plus 9091 for unseal, 9092 offer, ...
Scrape with your Prometheus stack; the labels include
lattice, organism, and role so you can aggregate across
lattices.
Lattice-wide dashboards
Two dashboards every lattice operator should have.
End-to-end slot health
One row per slot in the past hour, one column per organism, green / yellow / red per cell. A cell is green if the organism committed for that slot, yellow if a commit is expected and overdue, and red if the deadline has passed without a commit. Walk the row left-to-right to spot where the pipeline stalled.
Data source: each organism’s <org>_commits_per_slot counter.
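The cell colouring can be sketched as a small pure function. This is an illustration only: the names classify_cell, expected_at, and deadline are hypothetical, and the "not yet expected" case (returning None) is an assumption, since the page only defines the three colours.

```python
from enum import Enum
from typing import Optional


class Cell(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify_cell(
    committed: bool, now: float, expected_at: float, deadline: float
) -> Optional[Cell]:
    """Colour one (slot, organism) cell of the slot-health dashboard.

    All times are seconds on the same clock. `expected_at` and `deadline`
    are hypothetical per-organism inputs; derive them from your slot cadence.
    """
    if committed:
        return Cell.GREEN   # organism committed for this slot
    if now > deadline:
        return Cell.RED     # deadline passed without a commit
    if now > expected_at:
        return Cell.YELLOW  # commit expected and overdue, deadline not yet hit
    return None             # assumption: nothing to show before a commit is expected
```

Rendering a row is then one call per organism column, left to right in pipeline order.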
Discovery health
Chart discovery_peers_total across all your hosts, filtered by
lattice. A sudden drop in peer count indicates a network
partition or a discovery subsystem failure; diagnose it before
any organism-level alert fires.
Data source: mosaik’s discovery metrics; see the mosaik metrics reference.
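One way to turn "a sudden drop" into something testable is to compare the newest sample against a short rolling baseline. The sketch below is an assumption, not mosaik behaviour: the function name, the 10-sample window, and the 50% drop threshold are all hypothetical and should be tuned per lattice.

```python
from statistics import median
from typing import Sequence


def sudden_peer_drop(
    samples: Sequence[int],
    baseline_window: int = 10,   # hypothetical: number of prior samples to compare against
    drop_fraction: float = 0.5,  # hypothetical: fraction of peers lost that counts as "sudden"
) -> bool:
    """Return True if the newest peer count fell sharply below the recent median.

    `samples` are successive discovery_peers_total readings for one lattice,
    oldest first.
    """
    if len(samples) < 2:
        return False  # no history to compare against
    history = samples[-(baseline_window + 1):-1]
    baseline = median(history)
    if baseline <= 0:
        return False  # no peers to lose
    return samples[-1] < baseline * (1 - drop_fraction)
```

Using the median rather than the mean keeps a single earlier outlier from shifting the baseline.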
Per-organism red lines
Conservative defaults; tune to your lattice’s slot cadence.
zipnet
- zipnet_server_up: 0 on any server beyond 1 minute = page.
- zipnet_round_commit_latency_seconds P95 > round_period = page.
- zipnet_broadcasts_appended_total rate drops to zero for three consecutive rounds = page.
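The "rate drops to zero for N consecutive rounds" condition can be checked directly on raw counter samples. The following is a minimal sketch, assuming one sample of the counter per round; stalled_for is a hypothetical helper, and the reset handling mirrors the usual Prometheus convention that a decreasing counter means the process restarted.

```python
from typing import Sequence


def stalled_for(counter_samples: Sequence[int], intervals: int) -> bool:
    """Return True if a counter made no progress over the last `intervals` intervals.

    `counter_samples` holds one reading per round (assumption), oldest first.
    """
    if len(counter_samples) < intervals + 1:
        return False  # not enough history to decide
    tail = counter_samples[-(intervals + 1):]
    for previous, current in zip(tail, tail[1:]):
        # A decrease means the counter was reset; the new value is the increment.
        increment = current - previous if current >= previous else current
        if increment > 0:
            return False
    return True
```

For the zipnet rule above, call stalled_for(samples, 3); the same helper covers other "zero for N slots" rules with a different N.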
unseal
- unseal_decrypt_latency_seconds P95 > slot period = page. Unseal is the first organism whose latency directly delays downstream organisms.
- unseal_member_tdx_attested = 0 for any member = page.
offer
- offer_auction_commit_latency_seconds P95 > auction window = investigate.
- offer_bids_received_total rate drops to zero beyond 10 consecutive slots = investigate (could be legitimate: quiet mempool).
atelier
- atelier_candidates_committed_total rate drops below slot rate = page. This is the lattice’s primary output.
- atelier_simulation_divergence_total any non-zero rate = investigate immediately; a divergence means the co-building committee disagrees, which should be a rarity.
- atelier_member_tdx_attested = 0 for any member = page.
relay
- relay_proposer_ack_latency_seconds P95 > slot deadline = page.
- relay_on_chain_mismatches_total any non-zero = page. An AcceptedHeaders that did not land on-chain is either a malicious relay member or a proposer that changed its mind; both require immediate inspection.
tally
- tally_attestation_latency_seconds P95 > 2 × block time = investigate.
- tally_evidence_failures_total any sustained rate = page. A sustained mismatch between upstream evidence and tally’s attribution is a cross-organism integration bug.
- tally_chain_rpc_lag_seconds > 30 seconds = investigate.
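Most of the latency red lines above have the same shape: a P95 compared against a period or deadline. As an illustration of that comparison, here is a nearest-rank P95 over raw latency samples. It is a sketch under assumptions: the function names are hypothetical, and a Prometheus stack would normally estimate the quantile from histogram buckets rather than from raw samples.

```python
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample set."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def breaches(samples: Sequence[float], limit_seconds: float) -> bool:
    """True if the P95 latency exceeds the limit (slot period, deadline, ...)."""
    return p95(samples) > limit_seconds
```

For the tally rule, limit_seconds would be twice the block time; for unseal, the slot period.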
Alerts you do not need
- Raft leader change alerts. Mosaik’s Raft variant churns leaders on normal network turbulence. Alerts for individual leader changes are noise; alert on “no leader for 30 seconds” instead.
- Per-commit-latency alerts at the fastest cadence you can measure. P95 over 1-minute windows is enough; sub-second jitter is not actionable.
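The "no leader for 30 seconds" rule can be sketched as a small state machine that ignores leader changes and only tracks how long the group has been leaderless. The class below is hypothetical; how the current leader is read from mosaik is not covered here and is left to the caller.

```python
from typing import Optional


class LeaderlessAlarm:
    """Fires only when no leader has been observed for `threshold_seconds`."""

    def __init__(self, threshold_seconds: float = 30.0) -> None:
        self.threshold_seconds = threshold_seconds
        self._leaderless_since: Optional[float] = None

    def observe(self, now: float, leader: Optional[str]) -> bool:
        """Record one observation; return True if the alarm should fire.

        `leader` is the current leader's identifier, or None if there is none.
        A change from one leader to another never fires.
        """
        if leader is not None:
            self._leaderless_since = None  # any leader clears the timer
            return False
        if self._leaderless_since is None:
            self._leaderless_since = now   # start of a leaderless stretch
        return now - self._leaderless_since >= self.threshold_seconds
```

Feeding it one observation per scrape interval is enough; churn between leaders resets nothing and pages nobody.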
Dashboards to hand integrators
A public status page:
- Per-organism “is up” indicator.
- Per-slot end-to-end pipeline health for the past hour.
- Lattice lattice_id() hex so integrators can eyeball their handshake.
Integrators should not need access to your full metrics stack; they need to know whether the lattice is up. Publish that rolled-up signal; keep the rest internal.
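A rolled-up signal of this kind might look like the sketch below. Everything about the payload is an assumption for illustration: the function name, the JSON field names, and the rule that the lattice counts as up only when every organism is up.

```python
import json
from typing import Mapping


def public_status(lattice_id_hex: str, organism_up: Mapping[str, bool]) -> str:
    """Build a minimal public status document as a JSON string.

    `organism_up` maps organism name to its "is up" indicator. Internal
    metrics are deliberately left out.
    """
    document = {
        "lattice_id": lattice_id_hex,
        "organisms": dict(organism_up),
        # Assumption: the lattice is "up" only if every organism is up.
        "lattice_up": bool(organism_up) and all(organism_up.values()),
    }
    return json.dumps(document, sort_keys=True)
```

The per-slot pipeline view for the past hour could be added as another field once its public shape is settled.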
Cross-references
- Appendix — Metrics reference
- Incident response — what to do when any of the red lines above trip.
- mosaik metrics