Skip to content

Observability

Experimental

The observability system is experimental. Metrics format and gateway endpoints may change in future releases.

Lango includes an observability subsystem for metrics collection, token usage tracking, health monitoring, and audit logging. All data is accessible through gateway HTTP endpoints when running lango serve.

Metrics Collector

The metrics collector provides a system-level snapshot including:

  • Goroutine count, memory usage, and process uptime
  • Per-session, per-agent, and per-tool breakdowns
  • Request counts and latency distributions

Gateway endpoint: GET /metrics

Token Tracking

Token tracking records LLM provider token usage via the event bus (TokenUsageEvent). Usage data is stored in an Ent-backed persistent store with configurable retention.

  • Subscribes to token.usage events from the event bus
  • Tracks input, output, cache, and total tokens per session/agent/model
  • Configurable retention period (default: 30 days)
  • Supports historical queries by time range

Gateway endpoints:

Endpoint Description
GET /metrics/sessions Per-session token usage
GET /metrics/tools Per-tool metrics
GET /metrics/agents Per-agent metrics
GET /metrics/history Historical metrics (?days=N parameter)

Health Checks

The health check system uses a registry-based architecture where components register their own health check functions.

  • Built-in memory check (512 MB threshold)
  • Configurable check interval
  • Returns per-component status with details

Gateway endpoint: GET /health/detailed

Policy Metrics

The metrics collector tracks exec policy decisions (block and observe verdicts) published via the event bus as PolicyDecisionEvent. Allow verdicts are not tracked.

Collected counters:

  • Blocks -- Total commands blocked by exec policy
  • Observes -- Total commands flagged for observation
  • By Reason -- Per-reason breakdown (e.g., catastrophic_pattern, destructive_command)

The collector's RecordPolicyDecision(verdict, reason) method aggregates these counters in memory. They are included in the SystemSnapshot used by the gateway endpoint and CLI command.

Gateway endpoint: GET /metrics/policy

Response format:

{
  "blocks": 3,
  "observes": 12,
  "byReason": {
    "catastrophic_pattern": 2,
    "destructive_command": 1,
    "network_exfiltration": 5,
    "suspicious_pipe": 7
  }
}

CLI command: lango metrics policy (see CLI Reference)

Audit Logging

The audit recorder subscribes to event bus events and writes audit log entries to the database:

  • Tool execution events -- Records tool name, duration, success/failure, and error details via ToolExecutedEvent
  • Token usage events -- Records provider, model, and token counts via TokenUsageEvent
  • Policy decision events -- Records exec policy block/observe verdicts via PolicyDecisionEvent
  • Default retention: 90 days

Policy Decision Audit Logging

When the exec policy evaluator blocks or flags a command, it publishes a PolicyDecisionEvent on the event bus. The audit recorder subscribes to these events and writes a database entry with:

Field Source Description
action policy_decision Audit log action type
actor PolicyDecisionEvent.AgentName Agent that attempted the command (or "system")
target PolicyDecisionEvent.Command The original command string
details.verdict PolicyDecisionEvent.Verdict "block" or "observe"
details.reason PolicyDecisionEvent.Reason Machine-readable reason code
details.unwrapped PolicyDecisionEvent.Unwrapped Command after shell wrapper unwrap
details.message PolicyDecisionEvent.Message Human-readable explanation (if present)

This enables operators to query the audit log for all policy decisions in a session and correlate them with tool execution events.

Recovery Decision Events

When the coordinating executor handles an agent execution failure, it publishes a RecoveryDecisionEvent on the event bus with structured metadata for observability. This event is published for every recovery decision (retry, retry with hint, direct answer, escalate).

Event fields:

Field Type Description
CauseClass string Error classification: rate_limit, transient, malformed_tool_call, timeout, or empty for non-agent errors
Action string Recovery decision: retry, retry_with_hint, direct_answer, escalate, or none
Attempt int Current retry attempt number (0-based)
Backoff duration Computed backoff duration before the next retry (zero for non-retry actions)
SessionKey string Session identifier

Event name: agent.recovery.decision

Exponential Backoff

Retry actions use exponential backoff before the next attempt. The formula is:

backoff = min(baseDelay * 2^attempt, maxBackoff)
Parameter Value
Base delay 1 second
Max backoff 30 seconds

Example progression: 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...

Backoff sleeps are context-aware and will abort immediately if the context is cancelled.

Per-Error-Class Retry Limits

In addition to the global maxRetries setting (default: 2), each error class has its own maximum retry count. When the per-class limit is reached for a specific error type, the recovery policy escalates even if the global limit has not been reached.

Cause Class Default Max Retries Description
rate_limit 5 Provider rate-limiting (429 responses)
transient 3 Transient provider errors
malformed_tool_call 1 Invalid function call schema
timeout 3 Execution or idle timeout
(other) Global maxRetries Falls through to the global setting

Per-class retry counts are tracked independently within a single run. The global maxRetries is configured via recovery.maxRetries in the config.

Gateway Endpoints

All observability endpoints are available when the gateway is running (lango serve):

Endpoint Description
GET /metrics System metrics snapshot (goroutines, memory, uptime)
GET /metrics/sessions Per-session token usage
GET /metrics/tools Per-tool metrics
GET /metrics/agents Per-agent metrics
GET /metrics/policy Policy decision statistics (blocks, observes, by-reason)
GET /metrics/history Historical metrics (?days=N parameter)
GET /health/detailed Detailed health check results per component

Configuration

Settings: lango settings -> Observability

{
  "observability": {
    "enabled": true,
    "tokens": {
      "enabled": true,
      "persistHistory": true,
      "retentionDays": 30
    },
    "health": {
      "enabled": true,
      "interval": "30s"
    },
    "audit": {
      "enabled": true,
      "retentionDays": 90
    },
    "metrics": {
      "enabled": true,
      "format": "json"
    }
  }
}
Key Default Description
observability.enabled false Activates the observability subsystem
observability.tokens.enabled true Activates token tracking (when observability is enabled)
observability.tokens.persistHistory false Enables DB-backed persistent storage
observability.tokens.retentionDays 30 Days to keep token usage records
observability.health.enabled true Activates health checks (when observability is enabled)
observability.health.interval 30s Health check interval
observability.audit.enabled false Activates audit logging
observability.audit.retentionDays 90 Days to keep audit records
observability.metrics.enabled false Activates metrics export endpoint
observability.metrics.format "json" Metrics export format

See the Metrics CLI Reference for command documentation.