Observability
Experimental
The observability system is experimental. Metrics format and gateway endpoints may change in future releases.
Lango includes an observability subsystem for metrics collection, token usage tracking, health monitoring, and audit logging. All data is accessible through gateway HTTP endpoints when running lango serve.
Metrics Collector
The metrics collector provides a system-level snapshot including:
- Goroutine count, memory usage, and process uptime
- Per-session, per-agent, and per-tool breakdowns
- Request counts and latency distributions
Gateway endpoint: GET /metrics
Token Tracking
Token tracking records LLM provider token usage via the event bus (TokenUsageEvent). Usage data is stored in an Ent-backed persistent store with configurable retention.
- Subscribes to `token.usage` events from the event bus
- Tracks input, output, cache, and total tokens per session/agent/model
- Configurable retention period (default: 30 days)
- Supports historical queries by time range
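The per-session/agent/model aggregation described above could look roughly like the following in-memory sketch. The event field names, the `Tracker` type, and its methods are assumptions made for illustration. Persistence and retention are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// TokenUsageEvent is a hypothetical shape for the event; field names are assumptions.
type TokenUsageEvent struct {
	SessionKey, Agent, Model    string
	Input, Output, Cache, Total int64
}

type usageKey struct{ session, agent, model string }

// Usage holds running totals for one session/agent/model combination.
type Usage struct{ Input, Output, Cache, Total int64 }

// Tracker aggregates token usage in memory.
type Tracker struct {
	mu    sync.Mutex
	usage map[usageKey]Usage
}

func NewTracker() *Tracker { return &Tracker{usage: make(map[usageKey]Usage)} }

// Handle is the function that would be registered as the event-bus subscriber.
func (t *Tracker) Handle(ev TokenUsageEvent) {
	t.mu.Lock()
	defer t.mu.Unlock()
	k := usageKey{ev.SessionKey, ev.Agent, ev.Model}
	u := t.usage[k]
	u.Input += ev.Input
	u.Output += ev.Output
	u.Cache += ev.Cache
	u.Total += ev.Total
	t.usage[k] = u
}

// Get returns the totals recorded so far for one combination.
func (t *Tracker) Get(session, agent, model string) Usage {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.usage[usageKey{session, agent, model}]
}

func main() {
	tr := NewTracker()
	tr.Handle(TokenUsageEvent{SessionKey: "s1", Agent: "planner", Model: "m1", Input: 120, Output: 40, Total: 160})
	tr.Handle(TokenUsageEvent{SessionKey: "s1", Agent: "planner", Model: "m1", Input: 30, Output: 10, Cache: 20, Total: 40})
	fmt.Printf("%+v\n", tr.Get("s1", "planner", "m1"))
}
```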
Gateway endpoints:
| Endpoint | Description |
|---|---|
| `GET /metrics/sessions` | Per-session token usage |
| `GET /metrics/tools` | Per-tool metrics |
| `GET /metrics/agents` | Per-agent metrics |
| `GET /metrics/history` | Historical metrics (`?days=N` parameter) |
Health Checks
The health check system uses a registry-based architecture where components register their own health check functions.
- Built-in memory check (512 MB threshold)
- Configurable check interval
- Returns per-component status with details
Gateway endpoint: GET /health/detailed
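A registry-based design of this kind can be sketched as a map from component name to check function. Everything below (`Registry`, `CheckFunc`, `Status`, `MemoryCheck`) is a hypothetical illustration rather than Lango's actual types, and it assumes the 512 MB threshold applies to heap allocation.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// Status is the result of one component check (hypothetical shape).
type Status struct {
	Healthy bool
	Details string
}

// CheckFunc is a health check that a component registers for itself.
type CheckFunc func() Status

// Registry maps component names to their checks.
type Registry struct {
	mu     sync.RWMutex
	checks map[string]CheckFunc
}

func NewRegistry() *Registry { return &Registry{checks: make(map[string]CheckFunc)} }

// Register adds or replaces the check for a component.
func (r *Registry) Register(name string, fn CheckFunc) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.checks[name] = fn
}

// RunAll executes every registered check and returns per-component status.
func (r *Registry) RunAll() map[string]Status {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make(map[string]Status, len(r.checks))
	for name, fn := range r.checks {
		out[name] = fn()
	}
	return out
}

const memoryThreshold = 512 << 20 // 512 MB, assumed to be measured against heap allocation

// MemoryCheck reports unhealthy when heap allocation exceeds the threshold.
func MemoryCheck(threshold uint64) CheckFunc {
	return func() Status {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		return Status{
			Healthy: m.HeapAlloc <= threshold,
			Details: fmt.Sprintf("heap=%d threshold=%d", m.HeapAlloc, threshold),
		}
	}
}

func main() {
	reg := NewRegistry()
	reg.Register("memory", MemoryCheck(memoryThreshold))
	for name, st := range reg.RunAll() {
		fmt.Printf("%s healthy=%t (%s)\n", name, st.Healthy, st.Details)
	}
}
```

A periodic runner would call `RunAll` on the configured interval and cache the result for the endpoint.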
Policy Metrics
The metrics collector tracks exec policy decisions (block and observe verdicts) published via the event bus as PolicyDecisionEvent. Allow verdicts are not tracked.
Collected counters:
- Blocks -- Total commands blocked by exec policy
- Observes -- Total commands flagged for observation
- By Reason -- Per-reason breakdown (e.g., `catastrophic_pattern`, `destructive_command`)
The collector's RecordPolicyDecision(verdict, reason) method aggregates these counters in memory. They are included in the SystemSnapshot used by the gateway endpoint and CLI command.
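The in-memory aggregation can be pictured with a minimal sketch like the one below. Only the method name `RecordPolicyDecision(verdict, reason)` comes from the text above. The string parameter types, the `Collector` and `PolicyStats` types, and the `Snapshot` method are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// PolicyStats holds the counters listed above (hypothetical type).
type PolicyStats struct {
	Blocks   int64
	Observes int64
	ByReason map[string]int64
}

// Collector aggregates policy decisions in memory.
type Collector struct {
	mu    sync.Mutex
	stats PolicyStats
}

func NewCollector() *Collector {
	return &Collector{stats: PolicyStats{ByReason: make(map[string]int64)}}
}

// RecordPolicyDecision counts block and observe verdicts; anything else
// (including allow) is ignored.
func (c *Collector) RecordPolicyDecision(verdict, reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch verdict {
	case "block":
		c.stats.Blocks++
	case "observe":
		c.stats.Observes++
	default:
		return
	}
	if reason != "" {
		c.stats.ByReason[reason]++
	}
}

// Snapshot returns a copy so callers cannot mutate the live counters.
func (c *Collector) Snapshot() PolicyStats {
	c.mu.Lock()
	defer c.mu.Unlock()
	cp := PolicyStats{Blocks: c.stats.Blocks, Observes: c.stats.Observes, ByReason: make(map[string]int64, len(c.stats.ByReason))}
	for k, v := range c.stats.ByReason {
		cp.ByReason[k] = v
	}
	return cp
}

func main() {
	c := NewCollector()
	c.RecordPolicyDecision("block", "catastrophic_pattern")
	c.RecordPolicyDecision("observe", "suspicious_pipe")
	c.RecordPolicyDecision("allow", "")
	fmt.Printf("%+v\n", c.Snapshot())
}
```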
Gateway endpoint: GET /metrics/policy
Response format:
{
  "blocks": 3,
  "observes": 12,
  "byReason": {
    "catastrophic_pattern": 2,
    "destructive_command": 1,
    "network_exfiltration": 5,
    "suspicious_pipe": 7
  }
}
CLI command: lango metrics policy (see CLI Reference)
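On the client side, the response shown above can be decoded with a small struct. This is a sketch: `PolicyMetrics` and `ParsePolicyMetrics` are hypothetical names, and only the three JSON keys from the example are assumed to exist.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PolicyMetrics matches the keys in the example response.
type PolicyMetrics struct {
	Blocks   int            `json:"blocks"`
	Observes int            `json:"observes"`
	ByReason map[string]int `json:"byReason"`
}

// ParsePolicyMetrics decodes a GET /metrics/policy response body.
func ParsePolicyMetrics(body []byte) (PolicyMetrics, error) {
	var pm PolicyMetrics
	err := json.Unmarshal(body, &pm)
	return pm, err
}

func main() {
	body := []byte(`{"blocks":3,"observes":12,"byReason":{"catastrophic_pattern":2,"destructive_command":1,"network_exfiltration":5,"suspicious_pipe":7}}`)
	pm, err := ParsePolicyMetrics(body)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// In this example the per-reason counts add up to blocks + observes.
	total := 0
	for _, n := range pm.ByReason {
		total += n
	}
	fmt.Println(pm.Blocks, pm.Observes, total)
}
```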
Audit Logging
The audit recorder subscribes to event bus events and writes audit log entries to the database:
- Tool execution events -- Records tool name, duration, success/failure, and error details via `ToolExecutedEvent`
- Token usage events -- Records provider, model, and token counts via `TokenUsageEvent`
- Policy decision events -- Records exec policy block/observe verdicts via `PolicyDecisionEvent`
- Default retention: 90 days
Policy Decision Audit Logging
When the exec policy evaluator blocks or flags a command, it publishes a PolicyDecisionEvent on the event bus. The audit recorder subscribes to these events and writes a database entry with:
| Field | Source | Description |
|---|---|---|
| `action` | `policy_decision` | Audit log action type |
| `actor` | `PolicyDecisionEvent.AgentName` | Agent that attempted the command (or "system") |
| `target` | `PolicyDecisionEvent.Command` | The original command string |
| `details.verdict` | `PolicyDecisionEvent.Verdict` | "block" or "observe" |
| `details.reason` | `PolicyDecisionEvent.Reason` | Machine-readable reason code |
| `details.unwrapped` | `PolicyDecisionEvent.Unwrapped` | Command after shell wrapper unwrap |
| `details.message` | `PolicyDecisionEvent.Message` | Human-readable explanation (if present) |
This enables operators to query the audit log for all policy decisions in a session and correlate them with tool execution events.
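The field mapping in the table can be sketched as a plain conversion function. The event field names come from the table. The string types, the `AuditEntry` struct, and the rule that an empty agent name falls back to "system" are assumptions made for this example.

```go
package main

import "fmt"

// PolicyDecisionEvent carries the fields named in the table (types assumed).
type PolicyDecisionEvent struct {
	AgentName, Command, Verdict, Reason, Unwrapped, Message string
}

// AuditEntry is a hypothetical audit log row.
type AuditEntry struct {
	Action, Actor, Target string
	Details               map[string]string
}

// ToAuditEntry maps an event to an audit entry following the table above.
func ToAuditEntry(ev PolicyDecisionEvent) AuditEntry {
	actor := ev.AgentName
	if actor == "" {
		actor = "system" // assumed fallback when no agent is attached
	}
	details := map[string]string{
		"verdict":   ev.Verdict,
		"reason":    ev.Reason,
		"unwrapped": ev.Unwrapped,
	}
	if ev.Message != "" {
		details["message"] = ev.Message // only recorded if present
	}
	return AuditEntry{Action: "policy_decision", Actor: actor, Target: ev.Command, Details: details}
}

func main() {
	entry := ToAuditEntry(PolicyDecisionEvent{
		Command:   "sh -c 'rm -rf /'",
		Verdict:   "block",
		Reason:    "catastrophic_pattern",
		Unwrapped: "rm -rf /",
	})
	fmt.Printf("%+v\n", entry)
}
```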
Recovery Decision Events
When the coordinating executor handles an agent execution failure, it publishes a RecoveryDecisionEvent on the event bus with structured metadata for observability. This event is published for every recovery decision (retry, retry with hint, direct answer, escalate).
Event fields:
| Field | Type | Description |
|---|---|---|
| `CauseClass` | string | Error classification: rate_limit, transient, malformed_tool_call, timeout, or empty for non-agent errors |
| `Action` | string | Recovery decision: retry, retry_with_hint, direct_answer, escalate, or none |
| `Attempt` | int | Current retry attempt number (0-based) |
| `Backoff` | duration | Computed backoff duration before the next retry (zero for non-retry actions) |
| `SessionKey` | string | Session identifier |
Event name: agent.recovery.decision
Exponential Backoff
Retry actions use exponential backoff before the next attempt. The formula is:
backoff = min(baseDelay * 2^attempt, maxBackoff)
| Parameter | Value |
|---|---|
| Base delay | 1 second |
| Max backoff | 30 seconds |
Example progression: 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...
Backoff sleeps are context-aware and will abort immediately if the context is cancelled.
Per-Error-Class Retry Limits
In addition to the global maxRetries setting (default: 2), each error class has its own maximum retry count. When the per-class limit is reached for a specific error type, the recovery policy escalates even if the global limit has not been reached.
| Cause Class | Default Max Retries | Description |
|---|---|---|
| `rate_limit` | 5 | Provider rate-limiting (429 responses) |
| `transient` | 3 | Transient provider errors |
| `malformed_tool_call` | 1 | Invalid function call schema |
| `timeout` | 3 | Execution or idle timeout |
| (other) | Global `maxRetries` | Falls through to the global setting |
Per-class retry counts are tracked independently within a single run. The global maxRetries is configured via recovery.maxRetries in the config.
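One way to read the interaction between the two limits is that a retry is allowed only while both the global budget and the per-class budget have room. The sketch below assumes that reading, and also assumes the global counter is shared across classes within a run. `RetryBudget` and `Allow` are hypothetical names.

```go
package main

import "fmt"

// classLimits holds the per-class defaults from the table above.
var classLimits = map[string]int{
	"rate_limit":          5,
	"transient":           3,
	"malformed_tool_call": 1,
	"timeout":             3,
}

// RetryBudget tracks retries for a single run.
type RetryBudget struct {
	globalMax int
	total     int
	perClass  map[string]int
}

func NewRetryBudget(globalMax int) *RetryBudget {
	return &RetryBudget{globalMax: globalMax, perClass: make(map[string]int)}
}

// Allow reports whether another retry is permitted for the given cause class
// and, if so, records it. A false result means the caller should escalate.
func (b *RetryBudget) Allow(cause string) bool {
	limit, ok := classLimits[cause]
	if !ok {
		limit = b.globalMax // unknown classes fall through to the global setting
	}
	if b.total >= b.globalMax || b.perClass[cause] >= limit {
		return false
	}
	b.total++
	b.perClass[cause]++
	return true
}

func main() {
	b := NewRetryBudget(10) // a raised global limit so per-class limits are visible
	fmt.Println(b.Allow("malformed_tool_call")) // first retry
	fmt.Println(b.Allow("malformed_tool_call")) // per-class limit of 1 reached
}
```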
Gateway Endpoints
All observability endpoints are available when the gateway is running (lango serve):
| Endpoint | Description |
|---|---|
| `GET /metrics` | System metrics snapshot (goroutines, memory, uptime) |
| `GET /metrics/sessions` | Per-session token usage |
| `GET /metrics/tools` | Per-tool metrics |
| `GET /metrics/agents` | Per-agent metrics |
| `GET /metrics/policy` | Policy decision statistics (blocks, observes, by-reason) |
| `GET /metrics/history` | Historical metrics (`?days=N` parameter) |
| `GET /health/detailed` | Detailed health check results per component |
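A client for these endpoints might look like the sketch below, which builds the `/metrics/history` URL with its `days` parameter and decodes any JSON response generically. The gateway address `http://localhost:8080` is an assumption, as are the helper names `HistoryURL` and `FetchJSON`. Response schemas other than the policy example are not documented here, so the result is left untyped.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// HistoryURL builds the GET /metrics/history URL with its ?days=N parameter.
func HistoryURL(base string, days int) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimRight(u.Path, "/") + "/metrics/history"
	u.RawQuery = url.Values{"days": {strconv.Itoa(days)}}.Encode()
	return u.String(), nil
}

// FetchJSON issues a GET request and decodes the body into an untyped value.
func FetchJSON(ctx context.Context, client *http.Client, rawURL string) (any, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out any
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	u, err := HistoryURL("http://localhost:8080", 7) // gateway address is an assumption
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(u)
}
```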
Configuration
Settings: `lango settings` -> Observability
{
  "observability": {
    "enabled": true,
    "tokens": {
      "enabled": true,
      "persistHistory": true,
      "retentionDays": 30
    },
    "health": {
      "enabled": true,
      "interval": "30s"
    },
    "audit": {
      "enabled": true,
      "retentionDays": 90
    },
    "metrics": {
      "enabled": true,
      "format": "json"
    }
  }
}
| Key | Default | Description |
|---|---|---|
| `observability.enabled` | `false` | Activates the observability subsystem |
| `observability.tokens.enabled` | `true` | Activates token tracking (when observability is enabled) |
| `observability.tokens.persistHistory` | `false` | Enables DB-backed persistent storage |
| `observability.tokens.retentionDays` | `30` | Days to keep token usage records |
| `observability.health.enabled` | `true` | Activates health checks (when observability is enabled) |
| `observability.health.interval` | `30s` | Health check interval |
| `observability.audit.enabled` | `false` | Activates audit logging |
| `observability.audit.retentionDays` | `90` | Days to keep audit records |
| `observability.metrics.enabled` | `false` | Activates metrics export endpoint |
| `observability.metrics.format` | `"json"` | Metrics export format |
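To show how the defaults in the table combine with a partial config file, here is a minimal sketch that pre-fills a struct with the documented defaults and then decodes JSON over it. The `Observability` struct, `Defaults`, and `Load` are hypothetical. Whether Lango merges defaults this way is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Observability mirrors the documented config keys (hypothetical struct).
type Observability struct {
	Enabled bool `json:"enabled"`
	Tokens  struct {
		Enabled        bool `json:"enabled"`
		PersistHistory bool `json:"persistHistory"`
		RetentionDays  int  `json:"retentionDays"`
	} `json:"tokens"`
	Health struct {
		Enabled  bool   `json:"enabled"`
		Interval string `json:"interval"`
	} `json:"health"`
	Audit struct {
		Enabled       bool `json:"enabled"`
		RetentionDays int  `json:"retentionDays"`
	} `json:"audit"`
	Metrics struct {
		Enabled bool   `json:"enabled"`
		Format  string `json:"format"`
	} `json:"metrics"`
}

// Defaults returns the values from the defaults table; unset booleans stay false.
func Defaults() Observability {
	var c Observability
	c.Tokens.Enabled = true
	c.Tokens.RetentionDays = 30
	c.Health.Enabled = true
	c.Health.Interval = "30s"
	c.Audit.RetentionDays = 90
	c.Metrics.Format = "json"
	return c
}

// Load decodes a config document over the defaults, so keys that are
// absent from the JSON keep their default values.
func Load(data []byte) (Observability, error) {
	doc := struct {
		Observability Observability `json:"observability"`
	}{Observability: Defaults()}
	if err := json.Unmarshal(data, &doc); err != nil {
		return Observability{}, err
	}
	return doc.Observability, nil
}

func main() {
	cfg, err := Load([]byte(`{"observability": {"enabled": true, "audit": {"enabled": true}}}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```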
See the Metrics CLI Reference for command documentation.