Orchestration Pipeline

The 12-middleware pipeline that processes every message — budgets, routing, tool filtering, planning, and the agentic loop.

Every user message in Sophon runs through the orchestration pipeline — an ASP.NET-style middleware chain in the core engine. It replaced the original flat tool-calling loop with twelve composable middlewares that handle budgets, routing, context window management, tool filtering, planning, parallel execution, approvals, and the LLM call itself.

This page is the conceptual map — what each middleware does, in what order, and why the pipeline is split into two zones.

The two zones

User Message
    │
    ▼
┌─────────────────────────────────────────────┐
│  OUTER ZONE — runs once per message         │
│                                             │
│  1. SessionEvent       Record user message  │
│  2. Budget             Check spending       │
│  3. CapabilityRouting  Pick provider/model  │
│  4. ContextWindow      Token budget mgmt    │
│  5. ToolFilter         Select top-10 tools  │
│  6. PromptToolBridge   Non-FC tool inject   │
│  7. Planning           Multi-step planning  │
├─────────────────────────────────────────────┤
│  INNER LOOP — repeats for tool calls        │
│                                             │
│  8.  AgenticLoop                            │
│   ┌─────────────────────────────────────┐   │
│   │  9.  ParallelExecution              │   │
│   │  10. Approval                       │   │
│   │  11. ToolExecution                  │   │
│   │  12. LlmInvoker   ←── LLM API call  │   │
│   └─────────────────────────────────────┘   │
│       ↑                    │                │
│       └─── loop if tool ───┘                │
└─────────────────────────────────────────────┘
    │
    ▼
Final Response → Memory → Title Generation

The outer zone prepares the request; the inner loop repeats until the LLM produces a final response (no tool calls) or hits the iteration cap.
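To make the middleware-chain idea concrete, here is a minimal sketch of how such a chain can be composed, written in Python for brevity rather than in the engine's own language. The names `build_pipeline`, `trace`, and the `over_budget` flag are hypothetical and exist only for this illustration; only the middleware names come from the diagram above.

```python
from typing import Callable, List

# Hypothetical shapes: a handler maps a context dict to a response string,
# and a middleware receives the context plus the next handler in the chain.
Handler = Callable[[dict], str]
Middleware = Callable[[dict, Handler], str]

def build_pipeline(middlewares: List[Middleware], terminal: Handler) -> Handler:
    """Compose middlewares so the first one in the list runs outermost."""
    handler = terminal
    for mw in reversed(middlewares):
        # Default arguments bind the current mw/handler (avoids late binding).
        def wrapped(ctx: dict, mw: Middleware = mw, nxt: Handler = handler) -> str:
            return mw(ctx, nxt)
        handler = wrapped
    return handler

def trace(name: str) -> Middleware:
    """Illustrative middleware that records its name, then calls the next one."""
    def mw(ctx: dict, nxt: Handler) -> str:
        ctx.setdefault("trace", []).append(name)
        return nxt(ctx)
    return mw

def budget(ctx: dict, nxt: Handler) -> str:
    # Short-circuit: when over budget, the rest of the chain never runs.
    if ctx.get("over_budget"):
        return "budget_exceeded"
    return nxt(ctx)

def llm_invoker(ctx: dict) -> str:
    # Terminal middleware: there is no "next" to call.
    ctx.setdefault("trace", []).append("LlmInvoker")
    return "final response"

pipeline = build_pipeline(
    [trace("SessionEvent"), budget, trace("ToolFilter")], llm_invoker
)
```

Because each middleware decides whether to call the next one, a gate such as the budget check can end the request early without the later middlewares knowing anything about it.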

Middleware reference

1. Session event

Records a UserMessage event before the pipeline runs and an AgentMessage event after, so the full conversation is durable.

2. Budget

Checks per-provider spending limits (daily tokens, monthly cost). If exceeded, the pipeline short-circuits with a synthetic budget_exceeded response before any LLM call is made. Usage is recorded after each call.

3. Capability routing

Picks the LLM provider and model. If a plan step declared a ModelHint (e.g., "requires reasoning" or "requires code generation"), the router finds a provider that advertises those capabilities. Otherwise it uses the default.
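The routing rule can be sketched roughly as follows, in Python for illustration only. The `Provider` type and `route` function are hypothetical, and falling back to the default when no provider matches the hint is an assumption; this page does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

@dataclass(frozen=True)
class Provider:
    name: str
    capabilities: FrozenSet[str]  # e.g. {"reasoning", "code"}; labels are illustrative

def route(providers: List[Provider], default: Provider,
          model_hint: Optional[FrozenSet[str]]) -> Provider:
    """Pick the first provider advertising every capability in the hint."""
    if not model_hint:
        return default  # no hint declared: use the default provider
    for provider in providers:
        if model_hint <= provider.capabilities:  # subset test
            return provider
    # Assumption: nothing matches, so fall back to the default.
    return default

fast = Provider("fast", frozenset({"chat"}))
smart = Provider("smart", frozenset({"chat", "reasoning", "code"}))
```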

4. Context window

Prevents context overflow. If estimated tokens exceed 80% of the model's max context, the middleware asks an LLM to compact the oldest 60% of the conversation. If compaction times out (default 30 seconds) or errors, it hard-truncates to 70% of the token budget, always keeping the system prompt and the most recent messages.
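The thresholds above work out as in the following Python sketch. Two points are assumptions, since the page leaves them open: the "oldest 60%" is counted in messages rather than tokens, and the 70% fallback is measured against the model's max context. The function name `context_plan` is hypothetical.

```python
def context_plan(estimated_tokens: int, max_context: int, message_count: int) -> dict:
    """Decide what the context-window step would do for one request."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    compaction_threshold = max_context * 80 // 100
    if estimated_tokens <= compaction_threshold:
        return {"action": "none"}
    return {
        "action": "compact",
        # Assumption: "oldest 60%" counts messages, not tokens.
        "messages_to_compact": message_count * 60 // 100,
        # Assumption: the fallback budget is 70% of max context.
        "fallback_truncate_to_tokens": max_context * 70 // 100,
    }
```

For a 128,000-token model, compaction starts once the estimate passes 102,400 tokens, and the hard-truncation fallback targets 89,600 tokens.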

5. Tool filter

Selects the top 10 most relevant tools for the request. It keyword-scores every tool (+1.0 per tag match, +0.5 per description keyword match) and always includes datetime.now and memory.search. This keeps the tool manifest small and reduces LLM confusion on agents with hundreds of tools.
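A minimal Python sketch of this scoring scheme follows. The `Tool` type, the tokenizer, and the decision to count the two always-included tools toward the limit of 10 are assumptions made for illustration; only the weights and the pinned tool names come from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set

ALWAYS_INCLUDED = ("datetime.now", "memory.search")

@dataclass
class Tool:
    name: str
    description: str
    tags: Set[str] = field(default_factory=set)

def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric words; a deliberately simple stand-in tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(tool: Tool, query_words: Set[str]) -> float:
    tag_hits = len(query_words & {t.lower() for t in tool.tags})
    desc_hits = len(query_words & tokenize(tool.description))
    return 1.0 * tag_hits + 0.5 * desc_hits  # +1.0 per tag, +0.5 per keyword

def filter_tools(tools: List[Tool], request: str, limit: int = 10) -> List[Tool]:
    words = tokenize(request)
    pinned = [t for t in tools if t.name in ALWAYS_INCLUDED]
    rest = [t for t in tools if t.name not in ALWAYS_INCLUDED]
    rest.sort(key=lambda t: score(t, words), reverse=True)  # stable sort
    # Assumption: pinned tools count toward the limit.
    return pinned + rest[: max(0, limit - len(pinned))]

catalog = [
    Tool("datetime.now", "Current date and time", {"time"}),
    Tool("memory.search", "Search stored memories", {"memory"}),
    Tool("email.send", "Send an email message", {"email"}),
    Tool("weather.get", "Get the weather forecast for a city", {"weather", "forecast"}),
]
```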

6. Prompt-tool bridge

For providers that don't support native function calling, this middleware injects tool definitions into the system prompt and extracts <tool_call>{"name": "...", "arguments": {...}}</tool_call> blocks (XML-style tags wrapping a JSON payload) from the response. This lets you use tool calling against models that don't support it natively.
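The extraction half of the bridge could look something like this Python sketch. The regular expression and the choice to skip malformed blocks silently are assumptions; the page only specifies the tag name and the JSON shape.

```python
import json
import re
from typing import List

# Non-greedy match so several blocks in one response are captured separately.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def extract_tool_calls(response_text: str) -> List[dict]:
    """Return every well-formed tool call found in the model's text output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(response_text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # assumption: malformed blocks are ignored
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls

sample = (
    "Let me check.\n"
    '<tool_call>{"name": "weather.get", "arguments": {"city": "Paris"}}</tool_call>\n'
    "<tool_call>{not json}</tool_call>"
)
```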

7. Planning

Decomposes complex requests into a DAG of steps. If the request is complex (by heuristic or LLM vote) and any step is rated ≥ Medium risk, the plan is presented to the user for approval before execution. See Planning.

8. Agentic loop

The main tool-calling loop. It calls the inner pipeline; if the LLM returned tool calls, it executes them and loops again. It stops when the LLM produces a final response or hits MaxToolIterations (default: 100, hard cap: 500). Every 10 iterations, it emits a progress checkpoint.
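The control flow of the loop can be sketched in a few lines of Python. The `invoke_inner` callback stands in for middlewares 9 to 12 and returns a pair of (final text, tool results); that shape, the `on_checkpoint` hook, and the message returned at the cap are all hypothetical simplifications.

```python
from typing import Callable, List, Tuple

InnerPipeline = Callable[[List[str]], Tuple[str, List[str]]]

def agentic_loop(invoke_inner: InnerPipeline,
                 max_iterations: int = 100,
                 on_checkpoint: Callable[[int], object] = lambda i: None) -> str:
    history: List[str] = []
    for iteration in range(1, max_iterations + 1):
        text, tool_results = invoke_inner(history)
        if not tool_results:
            return text  # no tool calls: this is the final response
        history.extend(tool_results)  # feed results back for the next turn
        if iteration % 10 == 0:
            on_checkpoint(iteration)  # progress checkpoint every 10 iterations
    # Assumption: what the user sees at the cap is not specified on this page.
    return "iteration cap reached"
```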

Built into the loop: a tool loop detector that watches the last 30 tool calls for three patterns:

Pattern	Detection	Severity
`generic_repeat`	Same tool + args called N or more times within the window	N ≥ 30: circuit breaker. N ≥ 20: critical. N ≥ 10: warning
`poll_no_progress`	Same tool called N times consecutively with identical args	N ≥ 5: critical. N ≥ 3: warning
`ping_pong`	Alternating A → B → A → B in last 6 calls	Warning

A circuit breaker stops the loop immediately. Warnings inject a hint into the conversation that steers the LLM away from the repeated call.
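The three patterns in the table can be checked against a 30-call sliding window as in this Python sketch. The class name, the representation of a call as a (tool, serialized args) pair, and the rule that the most severe match wins when several patterns fire are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Call = Tuple[str, str]  # (tool name, canonical JSON of the arguments)

class ToolLoopDetector:
    def __init__(self, window: int = 30) -> None:
        self.calls: Deque[Call] = deque(maxlen=window)  # FIFO window

    def record(self, tool: str, args_json: str) -> Optional[Tuple[str, str]]:
        """Record one call; return (pattern, severity) or None."""
        self.calls.append((tool, args_json))
        calls = list(self.calls)
        last = calls[-1]

        repeats = calls.count(last)  # same tool + args anywhere in the window
        streak = 0                   # same tool + args at the end, back to back
        for call in reversed(calls):
            if call != last:
                break
            streak += 1

        # Assumption: report the most severe finding first.
        if repeats >= 30:
            return ("generic_repeat", "circuit_breaker")
        if repeats >= 20:
            return ("generic_repeat", "critical")
        if streak >= 5:
            return ("poll_no_progress", "critical")
        if repeats >= 10:
            return ("generic_repeat", "warning")
        if streak >= 3:
            return ("poll_no_progress", "warning")

        tail = calls[-6:]  # A, B, A, B, A, B with A != B
        if (len(tail) == 6 and tail[0] != tail[1]
                and all(tail[i] == tail[i % 2] for i in range(6))):
            return ("ping_pong", "warning")
        return None
```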

9. Parallel execution

If the LLM returned 2+ tool calls in one response, they run concurrently via Task.WhenAll. Results are truncated to 8000 chars (with JSON-aware truncation for arrays), then the pipeline re-invokes itself so the LLM sees all results at once.
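As a rough analogue of the Task.WhenAll fan-out, the Python sketch below runs several tool calls concurrently with `asyncio.gather` and caps each result at 8000 characters. The registry shape, the demo tools, and turning a tool failure into an error string are assumptions; the cut here is a plain slice, not the JSON-aware truncation.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

MAX_RESULT_CHARS = 8000
ToolFn = Callable[[dict], Awaitable[str]]

async def run_parallel(calls: List[dict], registry: Dict[str, ToolFn]) -> List[str]:
    async def run_one(call: dict) -> str:
        try:
            result = await registry[call["name"]](call["arguments"])
        except Exception as exc:
            # Assumption: a failing tool yields an error string, so one
            # failure does not discard the other results.
            result = f"error: {exc}"
        return result[:MAX_RESULT_CHARS]  # plain cut for brevity

    # gather preserves input order, so results line up with the calls.
    return list(await asyncio.gather(*(run_one(c) for c in calls)))

# Hypothetical demo tools.
async def _echo(args: dict) -> str:
    await asyncio.sleep(0.01)
    return args["text"]

async def _big(args: dict) -> str:
    return "x" * 9000

results = asyncio.run(run_parallel(
    [{"name": "echo", "arguments": {"text": "hi"}},
     {"name": "big", "arguments": {}},
     {"name": "missing", "arguments": {}}],
    {"echo": _echo, "big": _big},
))
```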

10. Approval

For every tool call, looks up the tool's risk level. If ≥ High, it sends an approval request via the approval gate (SignalR). The originating channel (WhatsApp, Telegram, etc.) is preserved, so approvals can come back through the same channel. Rejected or timed-out tools are replaced with error messages; approved tools proceed. See Approval Gates.
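The gating rule can be sketched in Python as below. Only the Medium and High level names appear on this page; the other levels, the `request_approval` callback (standing in for the SignalR approval gate), its return values, and treating unknown tools as low risk are assumptions for illustration.

```python
from enum import IntEnum
from typing import Callable, Dict, List

class Risk(IntEnum):
    LOW = 0       # hypothetical level
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3  # hypothetical level

def gate(calls: List[dict], risk_of: Dict[str, Risk],
         request_approval: Callable[[dict], str]) -> List[dict]:
    """Pass low-risk calls through; ask for approval at High risk or above."""
    resolved = []
    for call in calls:
        # Assumption: a tool with no registered risk level counts as LOW.
        if risk_of.get(call["name"], Risk.LOW) >= Risk.HIGH:
            decision = request_approval(call)  # "approved", "rejected", "timed out"
            if decision != "approved":
                # Rejected or timed-out calls become error messages.
                resolved.append({"name": call["name"],
                                 "error": f"tool call {decision}"})
                continue
        resolved.append(call)
    return resolved
```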

11. Tool execution

Executes tool calls sequentially (the fallback for single calls or when parallel doesn't apply). Approval is already resolved by middleware 10, so this middleware calls the registry with skipApproval: true.

12. LLM invoker (terminal)

The innermost middleware. Builds the LLM request (messages + filtered tools + max_tokens=4096 + tool_choice=auto), then either:

Streaming — if the caller requested streaming and the provider supports it, streams the completion and forwards chunks to the client in real time (typically over SignalR).
Non-streaming — runs through the provider failover policy for retry + fallback, or calls the provider directly otherwise.

Safety mechanisms

Tool loop detection

Covered under the agentic loop (middleware 8): three patterns, three severity levels, one 30-call FIFO window.

Tool result truncation

Every tool result is capped at 8000 chars. For JSON arrays it binary-searches for the max number of elements that fit and appends [...truncated, showing X/Y items]. For text it truncates at the last word boundary.
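The following Python sketch shows one way to implement this truncation. Whether the appended marker counts toward the 8000-character cap is not stated here; this version assumes it does. Re-serializing the kept elements with `json.dumps` is also an assumption about formatting.

```python
import json

MAX_CHARS = 8000

def truncate_result(text: str, limit: int = MAX_CHARS) -> str:
    if len(text) <= limit:
        return text
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, list):
        return _truncate_array(data, limit)
    # Plain text: cut at the last space before the limit.
    cut = text[:limit]
    boundary = cut.rfind(" ")
    return cut[:boundary] if boundary > 0 else cut

def _truncate_array(items: list, limit: int) -> str:
    total = len(items)
    full = json.dumps(items)
    if len(full) <= limit:
        return full  # re-serialized form already fits

    def render(k: int) -> str:
        marker = f" [...truncated, showing {k}/{total} items]"
        return json.dumps(items[:k]) + marker

    # Binary search for the largest k whose rendering fits; the rendered
    # length never shrinks as k grows, so the search is valid.
    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len(render(mid)) <= limit:
            lo = mid
        else:
            hi = mid - 1
    return render(lo)
```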

SSRF guard

Outbound HTTP calls are routed through the SSRF guard, which blocks 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 169.254.0.0/16, ::1, fc00::/7, and metadata.google.internal.
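A block list like this can be checked with Python's standard `ipaddress` module, as in the sketch below. It only covers literal IP addresses and the one named host; a production guard would also need to resolve hostnames and check every resolved address, which is outside this illustration. The function name `is_blocked` is hypothetical.

```python
import ipaddress

BLOCKED_NETWORKS = [ipaddress.ip_network(cidr) for cidr in (
    "127.0.0.0/8",     # loopback
    "10.0.0.0/8",      # private
    "172.16.0.0/12",   # private
    "192.168.0.0/16",  # private
    "169.254.0.0/16",  # link-local, includes cloud metadata endpoints
    "::1/128",         # IPv6 loopback
    "fc00::/7",        # IPv6 unique local
)]
BLOCKED_HOSTS = {"metadata.google.internal"}

def is_blocked(host: str) -> bool:
    """True if the host is a blocked name or a literal IP in a blocked range."""
    host = host.strip().lower().rstrip(".")
    if host in BLOCKED_HOSTS:
        return True
    try:
        addr = ipaddress.ip_address(host.strip("[]"))  # allow [::1] form
    except ValueError:
        # Not an IP literal. This sketch does not resolve DNS names.
        return False
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(addr in net for net in BLOCKED_NETWORKS)
```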

Budget enforcement

The budget middleware is a hard gate. If you're out of budget, the LLM doesn't get called — the user sees a budget_exceeded response explaining the limit.

Provider failover

When an LLM call fails with a retryable error (HTTP 429, 529, rate limit, overload), the provider failover policy retries with exponential backoff (250 ms base, 8 s cap, ±20% jitter). Before each backoff it tries to rotate API keys: it marks the current key as cooling down for 60 seconds, picks the next available key, and retries immediately.

If the primary provider exhausts its retries, the policy falls back through all other active providers in priority order. Each provider gets up to 3 retries before the policy moves on.

Streaming has the same failover semantics, but only for errors that happen before the first chunk flows. Once streaming starts, errors propagate to the caller.
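The retry schedule and provider fallback described in this section can be sketched in Python as follows. Doubling the delay on each attempt, applying jitter after the cap, and reading "up to 3 retries" as one initial attempt plus three retries are assumptions. API key rotation is left out, and `RetryableError` and `call_with_failover` are hypothetical names.

```python
import random
import time
from typing import Callable, List, Optional

BASE_MS, CAP_MS, JITTER = 250, 8000, 0.20

class RetryableError(Exception):
    """Stand-in for HTTP 429/529, rate-limit, and overload failures."""

def backoff_ms(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Delay before retry `attempt` (0-based): 250, 500, 1000, ... up to 8000 ms."""
    rng = rng if rng is not None else random.Random()
    delay = min(CAP_MS, BASE_MS * (2 ** attempt))  # assumption: doubling
    return delay * (1 + rng.uniform(-JITTER, JITTER))  # ±20% jitter

def call_with_failover(providers: List[str], send: Callable[[str], str],
                       max_retries: int = 3,
                       sleep: Callable[[float], object] = time.sleep) -> str:
    """Try each provider in priority order; retry each up to max_retries times."""
    for provider in providers:
        for attempt in range(max_retries + 1):  # 1 initial try + retries
            try:
                return send(provider)
            except RetryableError:
                if attempt < max_retries:
                    sleep(backoff_ms(attempt) / 1000)
    raise RuntimeError("all providers exhausted")
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.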

Background execution

The pipeline doesn't run on the SignalR hub thread. The chat hub enqueues an agent task and returns immediately. A background worker service dequeues tasks, calls the pipeline, and pushes results back over SignalR (TaskStarted, AgentStatus, StreamChunk, TaskCompleted, TaskFailed).

Tier	Queue backend
Personal	In-memory bounded queue (capacity 100)
Pro	Redis (roadmap)
Enterprise	RabbitMQ (roadmap)

On startup, the service recovers any SessionRun records stuck in Processing (from a crash mid-turn) and marks them Failed.
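A bounded in-memory queue and the startup recovery step might look like the Python sketch below. The `SessionRun` fields, the status names other than Processing and Failed, and the behavior when the queue is full (rejecting the task rather than blocking) are assumptions; this page does not specify them.

```python
import queue
from dataclasses import dataclass
from typing import List

QUEUE_CAPACITY = 100  # Personal tier capacity from the table above

@dataclass
class SessionRun:
    id: str
    status: str  # "Processing" and "Failed" come from the page; others are hypothetical

tasks: "queue.Queue[str]" = queue.Queue(maxsize=QUEUE_CAPACITY)

def enqueue(task_id: str) -> bool:
    """Hub side: enqueue without blocking. False means the queue is full."""
    try:
        tasks.put_nowait(task_id)
        return True
    except queue.Full:
        return False  # assumption: the caller decides how to report this

def recover_stuck_runs(runs: List[SessionRun]) -> int:
    """Startup: a run still marked Processing was cut off by a crash."""
    recovered = 0
    for run in runs:
        if run.status == "Processing":
            run.status = "Failed"
            recovered += 1
    return recovered
```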

Configuration

AgentExecutionOptions in appsettings.json:

{
  "Sophon": {
    "AgentExecution": {
      "MaxToolIterations": 100,
      "MaxToolIterationsHardLimit": 500,
      "PlanMaxSteps": 20,
      "MaxExecutionTime": "02:00:00",
      "MaxConcurrentTasks": 3
    }
  }
}

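MaxToolIterations: iteration cap for the agentic loop (default 100); agents can override it. MaxToolIterationsHardLimit is the absolute ceiling (500) that no override can exceed.
PlanMaxSteps: maximum number of steps in an auto-generated plan.
MaxExecutionTime: wall-clock timeout per agent task, in hh:mm:ss format.
MaxConcurrentTasks: number of background tasks that can run concurrently.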
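How the two iteration settings interact can be illustrated with a short Python sketch. The clamping rule is an assumption drawn from the "hard cap" wording on this page, and both function names are hypothetical. The timespan parser handles only the hh:mm:ss form used in the example config.

```python
from datetime import timedelta
from typing import Optional

def effective_max_iterations(agent_override: Optional[int],
                             default: int = 100,
                             hard_limit: int = 500) -> int:
    """Use the agent's override if present, but never exceed the hard limit."""
    requested = default if agent_override is None else agent_override
    # Assumption: values are clamped into the range 1..hard_limit.
    return max(1, min(requested, hard_limit))

def parse_timespan(value: str) -> timedelta:
    """Parse an hh:mm:ss string such as "02:00:00"."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```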

Where to go next

Planning — how complex requests get decomposed into DAGs
Approval Gates — risk classification and human-in-the-loop gating
Tiers — which features are available per tier
