Operations & System Health
Monitor Gateway health, manage services, view resource usage, control agent processing, and diagnose issues.
The Operations section of the Dashboard is where operators manage a running Sophon deployment — system health dashboards, service control (pause/resume/restart), cache management, remote-access ticket issuance, and diagnostic tools. Requires the Operator or Admin role.
System health
Operations → System.
Top section — service status:
- Gateway — running / degraded / stopped
- Database — connection pool utilization, slow queries, replication lag (Postgres)
- Vector DB — index size, query latency, ingestion queue
- Cache (Redis, if configured) — hit rate, memory usage, eviction rate
- Message bus — queue depth, consumer lag
- Background agent service — active tasks, queue depth, worker utilization
Middle section — resource usage:
- CPU (Gateway process + total)
- Memory (RSS + heap)
- Disk (data directory, log directory, tmp)
- Network I/O
- Open file descriptors
Charts default to a rolling 1-hour window and can be switched to 24h / 7d / 30d.
Bottom section — recent errors:
- Last 20 log entries at Error level or higher, grouped by exception type
- Click an entry for full stack trace
Control
Operations → Control.
Processing
- Pause processing — stop dequeuing agent tasks. In-flight tasks complete; queued tasks stay queued. New tasks queue but don't execute.
- Resume processing — start dequeuing queued tasks again.
- Drain and pause — wait for active tasks to complete, then pause.
Useful for maintenance windows or for isolating a misbehaving agent while you debug.
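The difference between the three controls can be sketched with a toy in-memory queue. This is an illustrative model only, not Sophon's implementation: the `TaskQueue` class and its method names are hypothetical, and it assumes that drain-and-pause stops dequeuing first so the set of active tasks cannot grow while it waits.

```python
from collections import deque


class TaskQueue:
    """Toy model of pause / resume / drain-and-pause (hypothetical, for illustration)."""

    def __init__(self):
        self.queued = deque()   # tasks waiting to be dequeued
        self.in_flight = []     # tasks currently executing
        self.paused = False

    def enqueue(self, task):
        # New tasks are always accepted, even while paused.
        self.queued.append(task)

    def dequeue(self):
        # While paused, nothing leaves the queue.
        if self.paused or not self.queued:
            return None
        task = self.queued.popleft()
        self.in_flight.append(task)
        return task

    def complete(self, task):
        # In-flight tasks may finish whether or not processing is paused.
        self.in_flight.remove(task)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def drain_and_pause(self, finish):
        # Assumption: stop dequeuing, then wait for every active task.
        # `finish` is a caller-supplied callback standing in for that wait.
        self.pause()
        for task in list(self.in_flight):
            finish(task)
            self.complete(task)


q = TaskQueue()
q.enqueue("a")
q.enqueue("b")
started = q.dequeue()    # "a" becomes in-flight
q.pause()
q.enqueue("c")           # accepted, but will not execute yet
blocked = q.dequeue()    # None, because processing is paused
q.complete(started)      # in-flight work still finishes
q.resume()
next_task = q.dequeue()  # "b" is picked up once processing resumes
```

In this model, plain pause returns immediately and leaves active tasks running, while drain-and-pause only returns once nothing is in flight, which is the state you want before touching the database or restarting a service.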
Service restarts
- Restart Background Agent Service — stops the hosted service, flushes in-memory queues to DB, restarts. In-flight tasks are cancelled and marked for recovery.
- Restart Scheduler (Quartz) — restart cron scheduling without restarting the Gateway.
- Restart SignalR hubs — force all connected clients to reconnect. Active streams disconnect; clients reconnect with exponential backoff.
All restart operations are audited.
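To give a feel for what exponential backoff means for clients after a SignalR hub restart, here is a minimal sketch of a reconnect schedule. The `base`, `cap`, and `jitter` values are illustrative assumptions, not the defaults of any Sophon client.

```python
import random


def reconnect_delays(attempts, base=1.0, cap=30.0, jitter=0.0, rng=random.random):
    """Return the wait in seconds before each reconnect attempt.

    base, cap and jitter are illustrative values, not Sophon client defaults.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))  # double each time, up to the cap
        delay += delay * jitter * rng()          # optional random spread
        delays.append(delay)
    return delays


print(reconnect_delays(6))
```

With jitter left at zero, the schedule doubles from 1 second and flattens at the cap. A non-zero jitter spreads reconnects out, which matters here because a hub restart disconnects every client at the same moment.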
Caches
- Flush tool registry cache — force re-enumeration of bundled + installed skills.
- Flush model capability cache — re-check what each configured LLM provider supports (useful after adding a new model to a provider).
- Flush context cache — per-session memory + agent prompt cache.
- Flush insights cache — recompute metrics on next request.
Cache flushes are safe; they cost a little latency on the next request but don't affect correctness.
Diagnostics
- Run self-test — end-to-end smoke test: database connect, vector DB connect, provider ping, credential vault read, queue enqueue/dequeue. Reports pass/fail per component.
- Capture diagnostic bundle — zips last 24 h of logs, current config (secrets redacted), feature flags, recent audit entries, process stats. Downloadable for support tickets.
- Force GC — triggers .NET garbage collection. Generally unnecessary — the runtime handles this — but useful when investigating memory issues.
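The per-component pass/fail report produced by the self-test can be pictured with a small sketch like the one below. The function name, the component keys, and the stand-in checks are hypothetical; the real self-test runs inside the Gateway and its output format may differ.

```python
def run_self_test(checks):
    """Run each named check and report pass/fail per component.

    `checks` maps a component name to a zero-argument callable that
    raises an exception on failure (a hypothetical shape for illustration).
    """
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "pass"
        except Exception as exc:
            # One failing component must not stop the remaining checks.
            report[name] = f"fail: {exc}"
    return report


def ok():
    pass


def unreachable():
    raise ConnectionError("connection refused")


report = run_self_test({
    "database": ok,
    "vector_db": unreachable,
    "provider_ping": ok,
})
print(report)
```

The point of the structure is that each component is checked independently, so a single failure still leaves you with a complete picture of the rest of the stack.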
Remote access
Operations → Remote Access. See Remote Access Onboarding for the full flow.
Summary: generate short-lived onboarding tickets with device configuration pre-filled. Share the URL or QR code with the remote user / device. Revoke if needed.
Queue depth alerts
Configure alerts when queue depth exceeds thresholds:
{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}
When thresholds are crossed, Sophon:
- Fires a systemUpdates push to Admin-role users
- Emits a queue.depth_exceeded event
- Optionally escalates to PagerDuty / Opsgenie (configured via webhook)
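As a rough illustration of how the two thresholds partition queue depth, the sketch below classifies a depth against the configuration shown above. The `classify_depth` helper is hypothetical, and it assumes "exceeds" means strictly greater than the threshold; Sophon's actual comparison may be inclusive.

```python
import json

CONFIG = json.loads("""
{
  "Sophon": {
    "Alerts": {
      "AgentTaskQueue": { "DepthWarn": 50, "DepthCritical": 100 },
      "WebhookDeliveryQueue": { "DepthWarn": 100, "DepthCritical": 500 },
      "PushDeliveryQueue": { "DepthWarn": 200, "DepthCritical": 1000 }
    }
  }
}
""")


def classify_depth(queue, depth, config=CONFIG):
    """Return 'ok', 'warn' or 'critical' for the given queue depth."""
    limits = config["Sophon"]["Alerts"][queue]
    # Assumption: "exceeds" means strictly greater than the threshold.
    if depth > limits["DepthCritical"]:
        return "critical"
    if depth > limits["DepthWarn"]:
        return "warn"
    return "ok"


print(classify_depth("AgentTaskQueue", 50))       # at the threshold, not above it
print(classify_depth("AgentTaskQueue", 75))
print(classify_depth("PushDeliveryQueue", 1500))
```

Keeping DepthWarn well below DepthCritical gives operators time to react, for example by pausing a misbehaving agent, before the critical escalation fires.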
Metrics and observability
OpenTelemetry
Sophon emits OpenTelemetry metrics and traces out of the box. Point it at your OTel collector:
{
  "Sophon": {
    "Observability": {
      "Otlp": {
        "Endpoint": "https://otel-collector.example.com:4317",
        "Headers": { "Authorization": "Bearer {{vault:otel_token}}" }
      }
    }
  }
}
Metrics exported:
- sophon.requests.total — HTTP requests by path + status
- sophon.signalr.connections.active
- sophon.agent.tasks.* — queue depth, duration, success / failure counts
- sophon.llm.tokens.* — by provider, by agent
- sophon.memory.entries.* — counts by scope + tenant
- sophon.approvals.* — requested / approved / rejected
- sophon.push.* — delivery success / failure by category
- sophon.database.connections.*
Traces cover the full orchestration pipeline — every middleware emits a span, LLM calls emit spans, tool executions emit spans.
Prometheus
Alternatively, expose /metrics in Prometheus exposition format:
{
  "Sophon": {
    "Observability": {
      "Prometheus": { "Enabled": true, "Path": "/metrics" }
    }
  }
}
Scrape with your Prometheus server; pair with the Grafana dashboards in deploy/grafana/ (Sophon repo).
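For a quick manual check of what the endpoint returns, a minimal parser for the Prometheus text exposition format might look like the sketch below. The metric names in `SAMPLE` are made up for illustration (including the assumption that dotted names appear with underscores), and the parser handles only simple lines; a Prometheus server or client library is the robust choice for anything beyond spot checks.

```python
def parse_metrics(text):
    """Parse simple Prometheus exposition lines into {(name, labels): value}.

    Handles `name value` and `name{k="v",...} value`, and ignores comments
    and timestamps. Label values containing commas or escaped quotes are
    not supported.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            name, rest = line.split("{", 1)
            label_part, value_part = rest.rsplit("}", 1)
            labels = tuple(sorted(
                (k.strip(), v.strip().strip('"'))
                for k, v in (pair.split("=", 1)
                             for pair in label_part.split(",") if pair)
            ))
        else:
            name, value_part = line.split(None, 1)
            labels = ()
        value = float(value_part.split()[0])
        samples[(name.strip(), labels)] = value
    return samples


# Hypothetical sample output; real metric names may differ.
SAMPLE = """\
# HELP sophon_agent_tasks_queue_depth Hypothetical metric name for illustration
# TYPE sophon_agent_tasks_queue_depth gauge
sophon_agent_tasks_queue_depth{queue="AgentTaskQueue"} 42
sophon_signalr_connections_active 7
"""

metrics = parse_metrics(SAMPLE)
print(metrics[("sophon_agent_tasks_queue_depth", (("queue", "AgentTaskQueue"),))])
print(metrics[("sophon_signalr_connections_active", ())])
```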
Structured logs (Serilog)
Default JSON-structured logs go to ~/.sophon/logs/. Sinks are configurable — ship to Datadog, Loki, Elasticsearch, or your logging stack of choice via Serilog configuration.
Running multiple Gateways
For high-availability deployments, run multiple Gateway instances behind a load balancer:
- Shared Postgres + Qdrant + Redis
- Shared vault backend
- Sticky SignalR (or Redis backplane for pub/sub)
- One task queue (Redis on Pro, RabbitMQ on Enterprise)
Each instance is stateless except for its SignalR connections. Blue/green deploys work out of the box: route traffic to the new pool, drain the old pool (cancelled active tasks recover on the new pool), then decommission it.
Maintenance windows
Planned maintenance pattern:
- Announce via systemUpdates push.
- Pause processing — queue fills but nothing executes.
- Drain in-flight tasks (usually < 2 minutes).
- Perform upgrade / migration / config change.
- Resume processing — queue drains quickly.
Total user-visible downtime: the time to drain + the time to resume. Typically < 5 minutes for a rolling upgrade.
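The ordering of the pattern above can be sketched as a small script. Everything here is hypothetical: the `ops` object and its `announce`, `pause`, `drain`, and `resume` methods stand in for the Dashboard actions, and this document does not describe a programmatic API for them.

```python
import time


def run_maintenance(ops, upgrade, clock=time.monotonic):
    """Run the maintenance pattern and return the seconds processing was paused.

    `ops` is a hypothetical object exposing announce(), pause(), drain()
    and resume(); `upgrade` is the caller's upgrade / migration step.
    """
    ops.announce("Maintenance starting")
    ops.pause()
    started = clock()
    ops.drain()
    # If the upgrade raises, processing deliberately stays paused so an
    # operator can inspect the system before anything executes again.
    upgrade()
    ops.resume()
    return clock() - started


class RecordingOps:
    """Stand-in that records the order of calls instead of doing real work."""

    def __init__(self):
        self.calls = []

    def announce(self, message):
        self.calls.append("announce")

    def pause(self):
        self.calls.append("pause")

    def drain(self):
        self.calls.append("drain")

    def resume(self):
        self.calls.append("resume")


ops = RecordingOps()
elapsed = run_maintenance(ops, upgrade=lambda: ops.calls.append("upgrade"))
print(ops.calls)
```

Leaving processing paused on failure is a design choice in this sketch: resuming automatically after a half-applied migration could let queued tasks run against an inconsistent schema.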
Troubleshooting shortcuts
- Tool calls timing out → Check active tasks in Operations → System. If many tasks are in-flight and slow, check LLM provider latency; rotate API keys if one is cooling down.
- Push notifications delayed → Check Admin → Push Notifications → Activity for rate-limit headroom.
- Workflow runs not firing → Check Operations → System → Scheduler status. Restart the scheduler if it shows degraded.
- Credential vault errors → Check vault backend connectivity. Local backend: file permissions on ~/.sophon/config/vault.enc. External backend: network reachability, token validity.
- Qdrant degraded → Flush vector cache, re-index documents. Run sophon documents reindex for a full rebuild.
Audit
All operations-panel actions are audited with category operations:
- ops.pause, ops.resume, ops.drain
- ops.service.restart
- ops.cache.flush
- ops.diagnostic.run
Filter in Admin → Audit → category: operations.
Where to go next
- Self-Hosting → Backup & Upgrade — operator runbooks
- Remote Access — the ticketing system for device onboarding
- Audit Logging — ops activity is fully tracked