Monitoring and alerts

audience: operators

Every organism exposes Prometheus metrics via mosaik’s built-in exporter. This page is the short list of what to watch, broken out per organism plus a lattice-wide section. The full metrics catalogue lives in Metrics reference.

Enable the exporter

Each organism binary accepts a PROMETHEUS_ADDR env var (proposal). The reference systemd units bind it to 0.0.0.0:909x, with the last digit picked per organism, so one host can run several organisms side by side.

# /etc/builder/common.env
PROMETHEUS_ADDR=0.0.0.0:9090  # plus 9091 for unseal, 9092 offer, ...

Scrape with your Prometheus stack; the labels include lattice, organism, and role so you can aggregate across lattices.
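As a rough illustration of aggregating by label, the sketch below parses a few lines of Prometheus text exposition output and sums one metric per lattice. In a real stack this would normally be a PromQL `sum by (lattice) (...)` query; the label values, the `instance` label, and the helper names `parse_sample` and `sum_by` are assumptions made for the example, and the parser deliberately ignores timestamps and escaped quotes.

```python
import re
from collections import defaultdict

# One sample line of the text exposition format: name{k="v",...} value
# Simplified: no timestamps, no escaped quotes inside label values.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels, value), or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = SAMPLE_RE.match(line)
    if match is None:
        return None
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def sum_by(lines, metric, label):
    """Sum the values of `metric`, grouped by the value of one label."""
    totals = defaultdict(float)
    for line in lines:
        sample = parse_sample(line)
        if sample is None:
            continue
        name, labels, value = sample
        if name == metric and label in labels:
            totals[labels[label]] += value
    return dict(totals)


# Hypothetical scrape output: two lattices, three servers.
SCRAPE = [
    '# TYPE zipnet_server_up gauge',
    'zipnet_server_up{lattice="alpha",organism="zipnet",role="server",instance="host1:9090"} 1',
    'zipnet_server_up{lattice="alpha",organism="zipnet",role="server",instance="host2:9090"} 0',
    'zipnet_server_up{lattice="beta",organism="zipnet",role="server",instance="host3:9090"} 1',
]

servers_up_per_lattice = sum_by(SCRAPE, "zipnet_server_up", "lattice")
```

Grouping by `organism` or `role` instead only changes the last argument.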

Lattice-wide dashboards

Two dashboards every lattice operator should have.

End-to-end slot health

One row per slot in the past hour, one column per organism, and a green / yellow / red status per cell. A cell is green if the organism committed for that slot, yellow if a commit is expected but overdue, and red if the deadline has passed without a commit. Walk the row left to right to spot where the pipeline stalled.

Data source: each organism’s <org>_commits_per_slot counter.
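The colouring rule can be sketched as a small function. This is only an illustration: the `SlotCell` fields (`expected_at`, `deadline`, `committed_at`), the neutral "pending" state for cells that are not yet due, and the column order (taken from the order of sections on this page) are all assumptions rather than anything the metrics define.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed pipeline order, taken from the order of sections on this page.
PIPELINE = ["zipnet", "unseal", "offer", "atelier", "relay", "tally"]


@dataclass
class SlotCell:
    expected_at: float                    # hypothetical: when a commit is normally expected
    deadline: float                       # hypothetical: hard deadline for this slot
    committed_at: Optional[float] = None  # None while no commit has been observed


def cell_colour(cell: SlotCell, now: float) -> str:
    """Map one (slot, organism) cell to a dashboard colour."""
    if cell.committed_at is not None:
        return "green"
    if now > cell.deadline:
        return "red"
    if now > cell.expected_at:
        return "yellow"
    return "pending"  # assumption: not due yet, drawn as a neutral cell


def first_stall(row: Dict[str, SlotCell], now: float) -> Optional[str]:
    """Walk the row left to right and return the first yellow or red organism."""
    for organism in PIPELINE:
        if cell_colour(row[organism], now) in ("yellow", "red"):
            return organism
    return None
```

`first_stall` mirrors the manual left-to-right walk: the first overdue organism is usually where the pipeline stopped, and everything to its right is a consequence.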

Discovery health

discovery_peers_total across all your hosts, filtered by lattice. A sudden drop in peer count indicates a network partition or a discovery subsystem failure; diagnose it before any organism-level alert fires.

Data source: mosaik’s discovery metrics; see the mosaik metrics reference.
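One possible way to make "sudden drop" concrete is to compare the current peer count against a trailing baseline. The sketch below assumes you already have recent discovery_peers_total samples for one lattice; the median baseline and the 50% threshold are illustrative choices, not values taken from mosaik.

```python
from statistics import median


def peer_drop(history, current, max_drop_fraction=0.5):
    """Return True when `current` falls below (1 - max_drop_fraction) of the
    median of recent samples. `history` holds earlier peer counts."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = median(history)
    if baseline <= 0:
        return False  # a lattice with no peers so far cannot "drop"
    return current < baseline * (1 - max_drop_fraction)
```

A median baseline tolerates a single bad scrape in the history window better than a mean would.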

Per-organism red lines

Conservative defaults; tune to your lattice’s slot cadence.

zipnet

  • zipnet_server_up = 0 on any server for more than 1 minute = page.
  • zipnet_round_commit_latency_seconds P95 > round_period = page.
  • zipnet_broadcasts_appended_total rate at zero for three consecutive rounds = page.
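Several red lines on this page compare a P95 latency against a period. As a minimal sketch of that comparison, the code below computes a nearest-rank P95 over raw latency samples and checks it against a threshold. Prometheus itself would estimate the quantile from histogram buckets with `histogram_quantile`, which interpolates, so treat this as an illustration of the rule rather than a reproduction of the query; the function names and the 1.0-second round period are assumptions.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def latency_breach(samples, period_seconds):
    """True when the window has samples and their P95 exceeds the period."""
    return bool(samples) and p95(samples) > period_seconds


# Hypothetical one-minute windows against an assumed 1.0 s round period.
one_outlier = [0.2] * 19 + [3.0]     # a single slow commit out of 20
slow_tail = [0.2] * 18 + [3.0] * 2   # two slow commits out of 20
```

With 20 samples the nearest-rank P95 is the 19th smallest value, so a single outlier does not trip the rule while a slow tail of two does.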

unseal

  • unseal_decrypt_latency_seconds P95 > slot period = page. Unseal is the first organism whose latency directly delays downstream organisms.
  • unseal_member_tdx_attested = 0 for any member — page.

offer

  • offer_auction_commit_latency_seconds P95 > auction window = investigate.
  • offer_bids_received_total rate at zero for more than 10 consecutive slots = investigate (could be legitimate, for example a quiet mempool).
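The "zero for N consecutive slots" rules (here and for zipnet rounds) can be sketched as a trailing-streak count over a cumulative counter. The sketch assumes one counter reading per slot, oldest first; that sampling scheme and the name `zero_streak` are hypothetical. A decrease is treated as a counter reset, in the same spirit as Prometheus counters.

```python
def zero_streak(counter_by_slot):
    """Count the trailing slots in which a cumulative counter did not grow.

    `counter_by_slot` holds the counter value read at the end of each slot,
    oldest first. A value lower than its predecessor is treated as a reset,
    and the new value counts as that slot's increase.
    """
    streak = 0
    previous = reversed(counter_by_slot[:-1])
    current = reversed(counter_by_slot[1:])
    for prev, cur in zip(previous, current):
        increase = cur - prev if cur >= prev else cur
        if increase > 0:
            break
        streak += 1
    return streak


# Hypothetical readings of offer_bids_received_total at slot boundaries.
quiet = [40] * 12           # 11 slot-to-slot intervals with no new bids
busy = [40, 41, 45, 45]     # only the last interval was empty
```

For the offer rule the comparison would be `zero_streak(readings) > 10`; for the zipnet rule it would be `>= 3`.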

atelier

  • atelier_candidates_committed_total rate drops below the slot rate = page. This is the lattice’s primary output.
  • atelier_simulation_divergence_total any non-zero rate = investigate immediately. A divergence means the co-building committee disagrees, which should be rare.
  • atelier_member_tdx_attested = 0 for any member = page.

relay

  • relay_proposer_ack_latency_seconds P95 > slot deadline = page.
  • relay_on_chain_mismatches_total any non-zero = page. An AcceptedHeaders that did not land on-chain points to either a malicious relay member or a proposer that changed its mind; both require immediate inspection.

tally

  • tally_attestation_latency_seconds P95 > 2 × block time = investigate.
  • tally_evidence_failures_total any sustained rate = page. A sustained mismatch between upstream evidence and tally’s attribution is a cross-organism integration bug.
  • tally_chain_rpc_lag_seconds > 30 seconds = investigate.

Alerts you do not need

  • Raft leader change alerts. Mosaik’s Raft variant churns leaders on normal network turbulence. Alerts for individual leader changes are noise; alert on “no leader for 30 seconds” instead.
  • Per-commit-latency alerts at the fastest cadence you can measure. P95 over 1-minute windows is enough; sub-second jitter is not actionable.
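For the Raft case, the difference between "a leader changed" and "no leader for 30 seconds" can be sketched as follows. The input is a hypothetical series of (timestamp, has_leader) observations; this page does not name the mosaik metric that would supply them, so treat the data shape as an assumption.

```python
def leaderless_alert(samples, now, threshold_seconds=30.0):
    """Return True when no leader has been observed for at least
    `threshold_seconds` up to `now`.

    `samples` is a list of (timestamp, has_leader) pairs, oldest first.
    Short leaderless gaps between two leaders never trigger the alert.
    """
    leaderless_since = None
    for timestamp, has_leader in samples:
        if has_leader:
            leaderless_since = None          # any leader resets the clock
        elif leaderless_since is None:
            leaderless_since = timestamp     # start of the current gap
    if leaderless_since is None:
        return False
    return now - leaderless_since >= threshold_seconds


# Normal churn: two brief gaps, each closed by a new leader.
churn = [(0, True), (5, False), (6, True), (12, False), (13, True)]
# A real outage: the last leader was seen at t=15, nothing since t=20.
outage = [(0, True), (10, False), (15, True), (20, False), (40, False)]
```

The `churn` series never alerts no matter how many leader changes it contains, which is the behaviour the first bullet asks for.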

Dashboards to hand integrators

A public status page:

  • Per-organism “is up” indicator.
  • Per-slot end-to-end pipeline health for the past hour.
  • The lattice’s lattice_id() as hex, so integrators can eyeball their handshake.

Integrators should not need access to your full metrics stack; they need to know whether the lattice is up. Publish that rolled-up signal; keep the rest internal.
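A rolled-up signal of this kind might look like the sketch below. The field names, the rule that the lattice is up only when every organism is up, and the example id are assumptions for illustration; the per-slot pipeline view is left out to keep the example short.

```python
def public_status(organism_up, lattice_id_hex):
    """Build the small payload a public status page could serve.

    `organism_up` maps organism name to a truthy/falsy liveness value.
    Nothing else from the internal metrics stack is included.
    """
    return {
        "lattice_id": lattice_id_hex,
        "organisms": {name: bool(up) for name, up in sorted(organism_up.items())},
        # Assumption: the lattice is up only when every organism is up.
        "lattice_up": bool(organism_up) and all(organism_up.values()),
    }


# Hypothetical inputs; the id is a made-up placeholder, not a real lattice.
status = public_status({"zipnet": 1, "unseal": 1, "offer": 0}, "ab12cd34")
```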

Cross-references