Operations & System Health

Monitor Gateway health, manage services, view resource usage, control agent processing, and diagnose issues.

The Operations section of the Dashboard is where operators manage a running Sophon deployment: system health dashboards, service control (pause/resume/restart), cache management, remote-access ticket issuance, and diagnostic tools. Access requires the Operator or Admin role.

System health

Operations → System.

Top section — service status:

  • Gateway — running / degraded / stopped
  • Database — connection pool utilization, slow queries, replication lag (Postgres)
  • Vector DB — index size, query latency, ingestion queue
  • Cache (Redis, if configured) — hit rate, memory usage, eviction rate
  • Message bus — queue depth, consumer lag
  • Background agent service — active tasks, queue depth, worker utilization

Middle section — resource usage:

  • CPU (Gateway process + total)
  • Memory (RSS + heap)
  • Disk (data directory, log directory, tmp)
  • Network I/O
  • Open file descriptors

Charts default to a rolling 1-hour window and can be switched to 24h / 7d / 30d.

Bottom section — recent errors:

  • Last 20 log entries at Error level or higher, grouped by exception type
  • Click an entry for full stack trace

Control

Operations → Control.

Processing

  • Pause processing — stop dequeuing agent tasks. In-flight tasks complete; queued tasks stay queued. New tasks queue but don't execute.
  • Resume processing — restart dequeue.
  • Drain and pause — wait for active tasks to complete, then pause.

Useful for maintenance windows or for isolating a misbehaving agent while you debug.

Service restarts

  • Restart Background Agent Service — stops the hosted service, flushes in-memory queues to DB, restarts. In-flight tasks are cancelled and marked for recovery.
  • Restart Scheduler (Quartz) — restart cron scheduling without restarting the Gateway.
  • Restart SignalR hubs — force all connected clients to reconnect. Active streams disconnect; clients reconnect with exponential backoff.

All restart operations are audited.
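
To illustrate what exponential backoff means for clients reconnecting after a SignalR hub restart, here is a minimal sketch of a reconnect schedule. The base delay, cap, and jitter values are hypothetical and are not Sophon or SignalR defaults; the real client policy may differ.

```python
import random


def backoff_delays(attempts, base=1.0, cap=30.0, jitter=0.0, rng=None):
    """Return the wait in seconds before each reconnect attempt.

    base, cap and jitter are illustrative values, not Sophon defaults.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Double the delay on every attempt, but never exceed the cap.
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter spreads clients out so they do not all
        # reconnect at the same instant after a restart.
        if jitter:
            delay += rng.uniform(0, jitter * delay)
        delays.append(delay)
    return delays
```

With the defaults above, six attempts wait 1, 2, 4, 8, 16, and 30 seconds. Adding jitter is a common choice when many clients disconnect at once, as they do after a hub restart.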

Caches

  • Flush tool registry cache — force re-enumeration of bundled + installed skills.
  • Flush model capability cache — re-check what each configured LLM provider supports (useful after adding a new model to a provider).
  • Flush context cache — per-session memory + agent prompt cache.
  • Flush insights cache — recompute metrics on next request.

Cache flushes are safe; they cost a little latency on the next request but don't affect correctness.

Diagnostics

  • Run self-test — end-to-end smoke test: database connect, vector DB connect, provider ping, credential vault read, queue enqueue/dequeue. Reports pass/fail per component.
  • Capture diagnostic bundle — zips last 24 h of logs, current config (secrets redacted), feature flags, recent audit entries, process stats. Downloadable for support tickets.
  • Force GC — triggers .NET garbage collection. Generally unnecessary — the runtime handles this — but useful when investigating memory issues.

Remote access

Operations → Remote Access. See Remote Access Onboarding for the full flow.

Summary: generate short-lived onboarding tickets with device configuration pre-filled. Share the URL or QR code with the remote user / device. Revoke if needed.

Queue depth alerts

Configure alerts when queue depth exceeds thresholds:

{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}

When thresholds are crossed, Sophon:

  • Fires a systemUpdates push to Admin-role users
  • Emits a queue.depth_exceeded event
  • Optionally escalates to PagerDuty / Opsgenie (configured via webhook)
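
The threshold logic can be sketched as follows. This is an illustration only: the function name is hypothetical, and reading "exceeds" as strictly greater than the configured value is an assumption, since the text does not say whether the boundary is inclusive.

```python
import json

# Same shape as the alert configuration shown above.
CONFIG = json.loads("""
{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}
""")


def classify_depth(queue_name, depth, config=CONFIG):
    """Map a queue depth to 'ok', 'warn' or 'critical' (hypothetical helper)."""
    thresholds = config["Sophon"]["Alerts"][queue_name]
    # Assumption: a threshold is crossed only when depth is strictly greater.
    if depth > thresholds["DepthCritical"]:
        return "critical"
    if depth > thresholds["DepthWarn"]:
        return "warn"
    return "ok"
```

Checking the critical threshold first means a depth above both limits is reported once, at the higher severity.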

Metrics and observability

OpenTelemetry

Sophon emits OpenTelemetry metrics and traces out of the box. Point at your OTel collector:

{
  "Sophon": {
    "Observability": {
      "Otlp": {
        "Endpoint": "https://otel-collector.example.com:4317",
        "Headers": { "Authorization": "Bearer {{vault:otel_token}}" }
      }
    }
  }
}

Metrics exported:

  • sophon.requests.total — HTTP requests by path + status
  • sophon.signalr.connections.active
  • sophon.agent.tasks.* — queue depth, duration, success / failure counts
  • sophon.llm.tokens.* — by provider, by agent
  • sophon.memory.entries.* — counts by scope + tenant
  • sophon.approvals.* — requested / approved / rejected
  • sophon.push.* — delivery success / failure by category
  • sophon.database.connections.*

Traces cover the full orchestration pipeline — every middleware emits a span, LLM calls emit spans, tool executions emit spans.

Prometheus

Alternatively, expose /metrics in Prometheus exposition format:

{
  "Sophon": {
    "Observability": {
      "Prometheus": { "Enabled": true, "Path": "/metrics" }
    }
  }
}

Scrape with your Prometheus server; pair with the Grafana dashboards in deploy/grafana/ (Sophon repo).
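
For a quick check without a Prometheus server, the scraped text can be read directly. The sketch below is a simplified parser for the text exposition format using only the standard library. The underscore metric names and the path and status labels in the sample are assumptions about how the dotted names listed earlier would appear on /metrics; verify them against real output.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; the names and labels are assumed, not confirmed.
SAMPLE = """\
# HELP sophon_requests_total HTTP requests by path and status
# TYPE sophon_requests_total counter
sophon_requests_total{path="/api/chat",status="200"} 1200
sophon_requests_total{path="/api/chat",status="500"} 3
sophon_signalr_connections_active 42
"""


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def total(samples, name, **label_filter):
    """Sum a metric over every sample whose labels match the filter."""
    return sum(
        value
        for sample_name, labels, value in samples
        if sample_name == name
        and all(labels.get(k) == v for k, v in label_filter.items())
    )
```

This parser does not handle label values that contain a closing brace, so treat it as a debugging aid rather than a replacement for a Prometheus client library.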

Structured logs (Serilog)

Default JSON-structured logs go to ~/.sophon/logs/. Sinks are configurable — ship to Datadog, Loki, Elasticsearch, or your logging stack of choice via Serilog configuration.
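
Because the logs are JSON, they are easy to summarize offline, for example to reproduce a "grouped by exception type" view. The sketch below assumes one JSON event per line in Serilog's compact format, where "@l" holds the level and "@x" holds the exception text. Whether Sophon's default sink writes that exact layout is an assumption, so check a real log file first.

```python
import json
from collections import Counter

ERROR_LEVELS = {"Error", "Fatal"}


def count_errors_by_exception(lines):
    """Count Error/Fatal events per exception type (assumed log layout)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON events
        if not isinstance(event, dict):
            continue
        # In the compact format, "@l" is absent for Information events.
        if event.get("@l") not in ERROR_LEVELS:
            continue
        # A .NET exception string starts with "Namespace.Type: message".
        exception_text = event.get("@x", "")
        exception_type = exception_text.split(":", 1)[0].strip() or "(no exception)"
        counts[exception_type] += 1
    return counts
```

Passing an open file object as lines works as well as a list, since the function only iterates over it.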

Running multiple Gateways

For high-availability deployments, run multiple Gateway instances behind a load balancer:

  • Shared Postgres + Qdrant + Redis
  • Shared vault backend
  • Sticky SignalR (or Redis backplane for pub/sub)
  • One task queue (Redis on Pro, RabbitMQ on Enterprise)

Each instance is stateless except for its SignalR connections. Blue/green deploys work out of the box: route traffic to the new pool, drain the old pool (cancelled active tasks recover on the new pool), then decommission it.

Maintenance windows

Planned maintenance pattern:

  1. Announce via systemUpdates push.
  2. Pause processing — queue fills but nothing executes.
  3. Drain in-flight tasks (usually < 2 minutes).
  4. Perform upgrade / migration / config change.
  5. Resume processing — queue drains quickly.

Total user-visible downtime: the time to drain + the time to resume. Typically < 5 minutes for a rolling upgrade.
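
The sequence above can be sketched as a small script. Everything here is hypothetical: the text describes these actions as Dashboard controls and does not document a scripting interface, so FakeOpsClient is an in-memory stand-in that exists only to make the ordering and the drain loop concrete.

```python
import time


class FakeOpsClient:
    """In-memory stand-in; every method name here is hypothetical."""

    def __init__(self, in_flight):
        self.in_flight = in_flight
        self.paused = False

    def announce(self, message):
        pass  # a real client would send a systemUpdates push here

    def pause_processing(self):
        self.paused = True

    def active_task_count(self):
        # Simulate one in-flight task finishing each time we poll.
        count = self.in_flight
        self.in_flight = max(0, self.in_flight - 1)
        return count

    def resume_processing(self):
        self.paused = False


def run_maintenance(client, change, poll_seconds=0.0, drain_timeout=120.0):
    """Announce, pause, drain, apply the change, then resume."""
    steps = []
    client.announce("Maintenance starting")
    steps.append("announce")
    client.pause_processing()
    steps.append("pause")
    # Wait for in-flight tasks to finish, giving up after drain_timeout.
    deadline = time.monotonic() + drain_timeout
    while client.active_task_count() > 0:
        if time.monotonic() > deadline:
            raise TimeoutError("in-flight tasks did not drain in time")
        time.sleep(poll_seconds)
    steps.append("drain")
    # If the change raises, processing stays paused for investigation.
    change()
    steps.append("change")
    client.resume_processing()
    steps.append("resume")
    return steps
```

Leaving processing paused when the change fails is a deliberate choice in this sketch: it avoids running queued tasks against a half-applied upgrade, at the cost of needing a manual resume.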

Troubleshooting shortcuts

  • Tool calls timing out → Check active tasks in Operations → System. If many tasks are in-flight and slow, check LLM provider latency; rotate API keys if one is cooling down.
  • Push notifications delayed → Check Admin → Push Notifications → Activity for rate-limit headroom.
  • Workflow runs not firing → Check Operations → System → Scheduler status. Restart the scheduler if it shows degraded.
  • Credential vault errors → Check vault backend connectivity. Local backend: file permissions on ~/.sophon/config/vault.enc. External backend: network reachability, token validity.
  • Qdrant degraded → Flush the vector cache and re-index documents. Run sophon documents reindex for a full rebuild.

Audit

All operations-panel actions are audited with category operations:

  • ops.pause, ops.resume, ops.drain
  • ops.service.restart
  • ops.cache.flush
  • ops.diagnostic.run

Filter in Admin → Audit → category: operations.
