Operations & System Health

Monitor Gateway health, manage services, view resource usage, control agent processing, and diagnose issues.

The Operations section of the Dashboard is where operators manage a running Sophon deployment — system health dashboards, service control (pause/resume/restart), cache management, remote-access ticket issuance, and diagnostic tools. Requires the Operator or Admin role.

System health

Operations → System.

Top section — service status:

Gateway — running / degraded / stopped
Database — connection pool utilization, slow queries, replication lag (Postgres)
Vector DB — index size, query latency, ingestion queue
Cache (Redis, if configured) — hit rate, memory usage, eviction rate
Message bus — queue depth, consumer lag
Background agent service — active tasks, queue depth, worker utilization

Middle section — resource usage:

CPU (Gateway process + total)
Memory (RSS + heap)
Disk (data directory, log directory, tmp)
Network I/O
Open file descriptors

Charts default to a rolling 1-hour window and can be switched to 24h / 7d / 30d.

Bottom section — recent errors:

Last 20 log entries at Error level or higher, grouped by exception type
Click an entry for full stack trace

Control

Operations → Control.

Processing

Pause processing — stop dequeuing agent tasks. In-flight tasks complete; queued tasks stay queued. New tasks queue but don't execute.
Resume processing — restart dequeue.
Drain and pause — wait for active tasks to complete, then pause.

Useful for maintenance windows or for isolating a misbehaving agent while you debug.
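The pause semantics above can be modeled in a few lines. This is a toy sketch, not Sophon's scheduler: the TaskQueue class and its method names are hypothetical, and it only illustrates that pausing stops dequeuing while new tasks are still accepted.

```python
from collections import deque


class TaskQueue:
    """Toy model of pause/resume semantics (hypothetical, not Sophon's implementation)."""

    def __init__(self):
        self._queued = deque()
        self.paused = False

    def enqueue(self, task):
        # New tasks are always accepted, even while paused.
        self._queued.append(task)

    def dequeue(self):
        # While paused, nothing is handed to workers; queued tasks stay queued.
        if self.paused or not self._queued:
            return None
        return self._queued.popleft()

    def depth(self):
        return len(self._queued)


queue = TaskQueue()
queue.enqueue("task-1")
queue.paused = True       # Pause processing
queue.enqueue("task-2")   # still accepted while paused
held = queue.dequeue()    # None: nothing executes while paused
queue.paused = False      # Resume processing
first = queue.dequeue()   # "task-1": queued work is picked up in FIFO order
```

Tasks that were already dequeued before the pause are outside this model; per the description above, they run to completion.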

Service restarts

Restart Background Agent Service — stops the hosted service, flushes in-memory queues to DB, restarts. In-flight tasks are cancelled and marked for recovery.
Restart Scheduler (Quartz) — restart cron scheduling without restarting the Gateway.
Restart SignalR hubs — force all connected clients to reconnect. Active streams disconnect; clients reconnect with exponential backoff.

All restart operations are audited.
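For a sense of what reconnecting with exponential backoff looks like after a hub restart, here is a minimal sketch. The base delay, cap, and jitter values are assumptions chosen for illustration; the actual client reconnect schedule is not specified here.

```python
import random


def backoff_delays(attempts, base=1.0, cap=30.0, jitter=0.0, rng=random.random):
    """Return reconnect delays in seconds: base * 2**n, capped, plus optional jitter."""
    delays = []
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        # Jitter (a fraction of the delay) spreads reconnects out so that
        # clients do not all return at the same instant.
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


schedule = backoff_delays(6)
```

With jitter left at 0 the schedule is deterministic: each delay doubles until it reaches the cap.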

Caches

Flush tool registry cache — force re-enumeration of bundled + installed skills.
Flush model capability cache — re-check what each configured LLM provider supports (useful after adding a new model to a provider).
Flush context cache — per-session memory + agent prompt cache.
Flush insights cache — recompute metrics on next request.

Cache flushes are safe; they cost a little latency on the next request but don't affect correctness.

Diagnostics

Run self-test — end-to-end smoke test: database connect, vector DB connect, provider ping, credential vault read, queue enqueue/dequeue. Reports pass/fail per component.
Capture diagnostic bundle — zips last 24 h of logs, current config (secrets redacted), feature flags, recent audit entries, process stats. Downloadable for support tickets.
Force GC — triggers .NET garbage collection. Generally unnecessary — the runtime handles this — but useful when investigating memory issues.
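The per-component pass/fail report produced by the self-test can be pictured with a small sketch. The function name, the check names, and the report shape are hypothetical; the point is only that one failing component does not stop the remaining checks from running.

```python
def run_self_test(checks):
    """Run each named check and report pass/fail per component.

    `checks` maps a component name to a zero-argument callable that
    raises an exception on failure.
    """
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "pass"
        except Exception as exc:
            # Record the failure and keep going with the other components.
            report[name] = f"fail: {exc}"
    return report


def failing_vault_read():
    raise ConnectionError("vault unreachable")


report = run_self_test({
    "database": lambda: None,
    "credential_vault": failing_vault_read,
    "queue": lambda: None,
})
```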

Remote access

Operations → Remote Access. See Remote Access Onboarding for the full flow.

Summary: generate short-lived onboarding tickets with device configuration pre-filled. Share the URL or QR code with the remote user / device. Revoke if needed.

Queue depth alerts

Configure alerts when queue depth exceeds thresholds:

{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}

When thresholds are crossed, Sophon:

Fires a systemUpdates push to Admin-role users
Emits a queue.depth_exceeded event
Optionally escalates to PagerDuty / Opsgenie (configured via webhook)
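As a rough illustration of how the thresholds in the configuration above map to alert levels, the sketch below classifies a queue depth as ok, warn, or critical. The function name is hypothetical, and treating "exceeds" as strictly greater than the threshold is an assumption.

```python
import json

CONFIG = json.loads("""
{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}
""")


def classify_depth(config, queue_name, depth):
    """Return "ok", "warn", or "critical" for a queue depth.

    Assumes a threshold is crossed only when depth is strictly greater than it.
    """
    thresholds = config["Sophon"]["Alerts"][queue_name]
    if depth > thresholds["DepthCritical"]:
        return "critical"
    if depth > thresholds["DepthWarn"]:
        return "warn"
    return "ok"


level = classify_depth(CONFIG, "AgentTaskQueue", 75)
```

Here a depth of 75 on AgentTaskQueue sits between DepthWarn (50) and DepthCritical (100), so it is classified as a warning.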

Metrics and observability

OpenTelemetry

Sophon emits OpenTelemetry metrics and traces out of the box. Point it at your OTel collector:

{
  "Sophon": {
    "Observability": {
      "Otlp": {
        "Endpoint": "https://otel-collector.example.com:4317",
        "Headers": { "Authorization": "Bearer {{vault:otel_token}}" }
      }
    }
  }
}

Metrics exported:

sophon.requests.total — HTTP requests by path + status
sophon.signalr.connections.active
sophon.agent.tasks.* — queue depth, duration, success / failure counts
sophon.llm.tokens.* — by provider, by agent
sophon.memory.entries.* — counts by scope + tenant
sophon.approvals.* — requested / approved / rejected
sophon.push.* — delivery success / failure by category
sophon.database.connections.*

Traces cover the full orchestration pipeline — every middleware emits a span, LLM calls emit spans, tool executions emit spans.

Prometheus

Alternatively, expose /metrics in Prometheus exposition format:

{
  "Sophon": {
    "Observability": {
      "Prometheus": { "Enabled": true, "Path": "/metrics" }
    }
  }
}

Scrape with your Prometheus server; pair with the Grafana dashboards in deploy/grafana/ (Sophon repo).
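To sanity-check the endpoint output before wiring up a scraper, a few lines can parse the exposition text. This is a minimal sketch that handles only simple sample lines. The underscored metric names in the sample are an assumption about how the dotted names listed earlier appear in Prometheus output, and the sample values are made up.

```python
import re

# name, optional {labels}, value, optional integer timestamp
SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+\d+)?$')


def parse_samples(text):
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


# Hypothetical scrape output; metric names and values are illustrative only.
EXPOSITION = """
# HELP sophon_agent_tasks_queue_depth Illustrative metric name
# TYPE sophon_agent_tasks_queue_depth gauge
sophon_agent_tasks_queue_depth 42
sophon_requests_total{path="/api/chat",status="200"} 1027
"""

samples = parse_samples(EXPOSITION)
```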

Structured logs (Serilog)

Default JSON-structured logs go to ~/.sophon/logs/. Sinks are configurable — ship to Datadog, Loki, Elasticsearch, or your logging stack of choice via Serilog configuration.

Running multiple Gateways

For high-availability deployments, run multiple Gateway instances behind a load balancer:

Shared Postgres + Qdrant + Redis
Shared vault backend
Sticky SignalR (or Redis backplane for pub/sub)
One task queue (Redis on Pro, RabbitMQ on Enterprise)

Each instance is stateless except for its SignalR connections. Blue/green deploys work out of the box: route traffic to the new pool, drain the old pool (cancelled active tasks recover on the new pool), then decommission it.

Maintenance windows

Planned maintenance pattern:

Announce via systemUpdates push.
Pause processing — queue fills but nothing executes.
Drain in-flight tasks (usually < 2 minutes).
Perform upgrade / migration / config change.
Resume processing — queue drains quickly.

Total user-visible downtime: the time to drain + the time to resume. Typically < 5 minutes for a rolling upgrade.
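The maintenance pattern above can be scripted along these lines. This is a sketch under stated assumptions: `ops` is a hypothetical client object exposing announce(), pause(), active_tasks(), and resume(), none of which are documented Sophon APIs, and the polling interval and drain timeout are illustrative defaults.

```python
import time


def run_maintenance(ops, upgrade, poll_seconds=5.0, drain_timeout=120.0, sleep=time.sleep):
    """Announce, pause, wait for in-flight tasks to finish, upgrade, then resume.

    `ops` is a hypothetical client; `upgrade` is any zero-argument callable
    that performs the upgrade, migration, or config change.
    """
    ops.announce("Maintenance starting")
    ops.pause()
    try:
        waited = 0.0
        while ops.active_tasks() > 0:
            if waited >= drain_timeout:
                raise TimeoutError("in-flight tasks did not drain in time")
            sleep(poll_seconds)
            waited += poll_seconds
        upgrade()
    finally:
        # Always resume so a failed step does not leave the queue paused.
        ops.resume()
```

Resuming in the finally block is a design choice for this sketch; if a failed upgrade leaves the system unsafe to process tasks, staying paused and escalating may be the better behavior.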

Troubleshooting shortcuts

Tool calls timing out → Check active tasks in Operations → System. If many tasks are in-flight and slow, check LLM provider latency; rotate API keys if one is cooling down.
Push notifications delayed → Admin → Push Notifications → Activity for rate-limit headroom.
Workflow runs not firing → Operations → System → Scheduler status. Restart scheduler if it shows degraded.
Credential vault errors → Check vault backend connectivity. Local backend: file permissions on ~/.sophon/config/vault.enc. External backend: network reachability, token validity.
Qdrant degraded → Flush vector cache, re-index documents. Run sophon documents reindex for a full rebuild.

Audit

All operations-panel actions are audited with category operations:

ops.pause, ops.resume, ops.drain
ops.service.restart
ops.cache.flush
ops.diagnostic.run

Filter in Admin → Audit → category: operations.

Where to go next

Self-Hosting → Backup & Upgrade — operator runbooks
Remote Access — the ticketing system for device onboarding
Audit Logging — ops activity is fully tracked
