Operational Alerting
Overview
The alerting system monitors operational signals (policy decisions, recovery events, budget usage) and generates alerts when configurable thresholds are exceeded. Alerts flow through a 3-tier delivery pipeline:
- EventBus -- `AlertEvent` published for real-time subscribers
- Audit log -- persisted to the database for historical queries
- CLI -- `lango alerts list` and `lango alerts summary` for operators
Architecture
PolicyDecisionEvent ─────────▶ Alerting Dispatcher ──▶ AlertEvent ──▶ EventBus
RecoveryDecisionEvent ───────▶ (sliding window)                          │
CircuitBreakerTrippedEvent ──▶ (deduplication)                           ├──▶ Audit Recorder ──▶ DB
                                                                         └──▶ Other subscribers

lango alerts list ──▶ GET /alerts ──▶ Query audit DB (action="alert")
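The fan-out from the EventBus to the audit recorder and other subscribers can be sketched as a small in-process publish/subscribe bus. This is only an illustration under stated assumptions, not the project's implementation: the `EventBus` class, its `subscribe`/`publish` methods, the `AlertEvent` fields, and the in-memory `audit_rows` list standing in for the database are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AlertEvent:
    # Hypothetical fields, loosely mirroring the alert JSON in the HTTP API section.
    type: str
    severity: str
    message: str


@dataclass
class EventBus:
    """Hypothetical in-process bus: every subscriber sees every published event."""
    subscribers: list[Callable[[AlertEvent], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[AlertEvent], None]) -> None:
        self.subscribers.append(handler)

    def publish(self, event: AlertEvent) -> None:
        for handler in self.subscribers:
            handler(event)


# Stand-in for the audit database; a real recorder would persist a row instead.
audit_rows: list[dict] = []


def audit_recorder(event: AlertEvent) -> None:
    # Alerts are stored with action="alert", the filter the diagram shows for GET /alerts.
    audit_rows.append(
        {"action": "alert", "type": event.type, "severity": event.severity}
    )


bus = EventBus()
bus.subscribe(audit_recorder)
bus.publish(AlertEvent("circuit_breaker", "critical", "circuit breaker tripped"))
```

Keeping the audit recorder as just another subscriber means real-time consumers and the historical log see the same events without the dispatcher knowing about either.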
Alert Conditions
| Condition | Type | Severity | Trigger | Status |
|---|---|---|---|---|
| Policy block rate | `policy_block_rate` | warning | Block count exceeds threshold in 5min window | Active |
| Recovery retries | `recovery_retries` | warning | Retry count exceeds threshold per session (sliding 5min window) | Active |
| Circuit breaker | `circuit_breaker` | critical | Circuit breaker tripped for an agent | Active |
| Config drift | `config_drift` | warning | Configuration or provenance drift detected | Planned |
Deduplication
The dispatcher deduplicates alerts by type within each 5-minute window. Only one alert per type per window is published. This prevents alert storms when a persistent condition repeatedly triggers the threshold.
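The combination of a sliding window and per-type deduplication might look like the following. This is a minimal sketch, not the dispatcher's actual code: the `Dispatcher` class and `record` method are hypothetical names, and it assumes the deduplication window is measured from the last published alert of the same type, which the text above does not specify.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # the 5-minute window described above


class Dispatcher:
    """Hypothetical sketch: counts events per type, publishes once per window."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        # Event timestamps per alert type, oldest first.
        self.events: dict[str, deque] = defaultdict(deque)
        # Time of the last published alert per type (assumed dedup anchor).
        self.last_alert: dict[str, float] = {}

    def record(self, alert_type: str, now: float) -> bool:
        """Record one event; return True if an alert should be published."""
        window = self.events[alert_type]
        window.append(now)
        # Slide the window: drop timestamps older than 5 minutes.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) <= self.threshold:
            return False  # threshold not exceeded
        last = self.last_alert.get(alert_type)
        if last is not None and now - last < WINDOW_SECONDS:
            return False  # this type already alerted within the window
        self.last_alert[alert_type] = now
        return True


d = Dispatcher(threshold=2)
times = (0, 10, 20, 30, 400, 410, 420)
results = [d.record("policy_block_rate", t) for t in times]
# results == [False, False, True, False, False, False, True]
```

The fourth event (t=30) still exceeds the threshold but is suppressed as a duplicate; a new alert is published only once the condition recurs after the window has passed.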
Configuration
alerting:
  enabled: true                  # Master switch (default: false)
  policyBlockRateThreshold: 10   # Max blocks per 5min window
  recoveryRetryThreshold: 5      # Max retries per session
All thresholds are configurable. The system is disabled by default and must be explicitly enabled.
Note: Alert channel routing (e.g., to Slack or Discord) is planned for a future release.
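To illustrate the "disabled unless explicitly enabled" behavior of the configuration above, here is a small sketch that reads an already-parsed config mapping. The `AlertingConfig` and `load_alerting` names are hypothetical, and the fallback threshold values (10 and 5) are simply the example values from this page; only the `enabled: false` default is documented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertingConfig:
    enabled: bool = False                  # documented default: disabled
    policy_block_rate_threshold: int = 10  # assumed fallback (example value)
    recovery_retry_threshold: int = 5      # assumed fallback (example value)


def load_alerting(config: dict) -> AlertingConfig:
    """Build an AlertingConfig from an already-parsed config mapping."""
    section = config.get("alerting") or {}
    cfg = AlertingConfig(
        enabled=bool(section.get("enabled", False)),
        policy_block_rate_threshold=int(section.get("policyBlockRateThreshold", 10)),
        recovery_retry_threshold=int(section.get("recoveryRetryThreshold", 5)),
    )
    if cfg.policy_block_rate_threshold <= 0 or cfg.recovery_retry_threshold <= 0:
        raise ValueError("alerting thresholds must be positive integers")
    return cfg


# No "alerting" section at all: alerting stays off.
disabled = load_alerting({})
# Explicitly enabled, overriding one threshold.
enabled = load_alerting({"alerting": {"enabled": True, "policyBlockRateThreshold": 20}})
```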
CLI Usage
# List recent alerts
lango alerts list --days=7
# Alert summary by type
lango alerts summary
# JSON output
lango alerts list --output json
HTTP API
GET /alerts?days=7
Returns:
{
  "alerts": [
    {
      "id": "uuid",
      "type": "policy_block_rate",
      "actor": "system",
      "details": {
        "severity": "warning",
        "message": "policy block rate exceeded threshold",
        "count": 12,
        "threshold": 10,
        "window": "5m0s"
      },
      "timestamp": "2026-04-01T12:00:00Z"
    }
  ],
  "total": 1,
  "days": 7
}
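A client might consume this response roughly as follows. The sketch parses the example body above and tallies alerts by type and severity. The `fetch_alerts` helper and its `base_url` parameter are hypothetical (the host and port are not documented here), and the code assumes every alert carries a `details.severity` field as in the example.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def fetch_alerts(base_url: str, days: int = 7) -> dict:
    """Hypothetical helper: GET {base_url}/alerts?days=N and decode the JSON body."""
    query = urllib.parse.urlencode({"days": days})
    with urllib.request.urlopen(f"{base_url}/alerts?{query}") as resp:
        return json.load(resp)


# Response body copied from the example above, so the sketch runs offline.
body = """
{
  "alerts": [
    {
      "id": "uuid",
      "type": "policy_block_rate",
      "actor": "system",
      "details": {
        "severity": "warning",
        "message": "policy block rate exceeded threshold",
        "count": 12,
        "threshold": 10,
        "window": "5m0s"
      },
      "timestamp": "2026-04-01T12:00:00Z"
    }
  ],
  "total": 1,
  "days": 7
}
"""

payload = json.loads(body)
# Tally alerts by type and by severity.
by_type = Counter(alert["type"] for alert in payload["alerts"])
by_severity = Counter(
    alert.get("details", {}).get("severity", "unknown") for alert in payload["alerts"]
)
```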