Documents

Upload, extract, index, and query PDFs, DOCX files, spreadsheets, images, and audio.

Sophon's document pipeline lets you upload files and then talk to them. PDFs, Word docs, spreadsheets, images, Markdown, HTML, and audio all get normalized into searchable text, chunked, embedded (on Pro/Enterprise), and indexed. Agents can summarize, compare, extract, and answer questions — with citations back to the source.

Supported formats

Format              How it's processed
------------------  ------------------------------------------------------------------------
PDF                 PdfPig for native-text PDFs; Tesseract OCR fallback for image-only PDFs
DOCX                DocumentFormat.OpenXml: text, headings, and lists
XLSX / CSV          ClosedXML: sheets, ranges, formulas rendered as values
PNG / JPG / WEBP    OCR via Tesseract, plus optional LLM vision for structural understanding
HTML                Readability-style extraction with boilerplate removed
Markdown / TXT      Ingested directly
MP3 / WAV           Whisper transcription

Unsupported formats fail fast with a clear message. You can drop a new extractor into src/Sophon.Documents/Extractors/ — extractors are plugins registered with the document pipeline.
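To illustrate the plugin idea, here is a minimal sketch of a registry that routes files to extractors and fails fast on unknown formats. It is written in Python for brevity and is not Sophon's actual extractor interface: the names `register`, `extract`, and `UnsupportedFormatError` are hypothetical, and routing here uses only the file extension, whereas the real pipeline also considers MIME type and content.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase file extension to an extractor function.
Extractor = Callable[[bytes], str]
_EXTRACTORS: Dict[str, Extractor] = {}


class UnsupportedFormatError(ValueError):
    """Raised before any processing starts when no extractor matches."""


def register(*extensions: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers an extractor for one or more extensions."""
    def decorator(func: Extractor) -> Extractor:
        for ext in extensions:
            _EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".md")
def extract_plain_text(data: bytes) -> str:
    # Markdown and TXT are ingested directly, so decoding is the whole job.
    return data.decode("utf-8", errors="replace")


def extract(filename: str, data: bytes) -> str:
    """Route a file to its extractor, failing fast on unknown formats."""
    ext = Path(filename).suffix.lower()
    extractor = _EXTRACTORS.get(ext)
    if extractor is None:
        supported = ", ".join(sorted(_EXTRACTORS))
        raise UnsupportedFormatError(
            f"No extractor for '{ext or filename}'. Supported: {supported}"
        )
    return extractor(data)
```

In this sketch, adding a format means writing one decorated function; nothing else in the routing code changes.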

Uploading

From the Dashboard

Drag and drop onto Documents (/documents). You'll see a progress bar per file and a status chip (extracting → chunking → embedding → ready).

From chat

Attach a file to any chat message. The attachment upload goes through the same pipeline; by the time the agent sees your message, the document is indexed and referenceable.

From a channel

Send a file on Telegram, WhatsApp, Slack, etc. The channel adapter normalizes it into a SophonMessage attachment; the document pipeline processes it identically.

From the CLI

sophon documents upload report.pdf
sophon documents upload *.pdf --tag q3-review
sophon documents list
sophon documents summarize report.pdf
sophon documents delete <id>

The pipeline

Upload
  │
  ▼
Detect type (MIME + extension + content sniff)
  │
  ▼
Route to extractor
  │
  ▼
Extract raw text + metadata
  │
  ▼
Chunk (semantic-boundary-aware, overlap)
  │
  ├───────────────────────────────────┐
  ▼                                   ▼
Embed (Pro/Enterprise)                FTS5 index (all tiers)
  │                                   │
  ▼                                   ▼
Qdrant / pgvector / Milvus    SQLite FTS5 / Postgres FTS
  │                                   │
  └──────────────┬────────────────────┘
                 ▼
             Ready to query

All of this runs async. Small documents are ready in seconds; large PDFs with OCR can take minutes. You can start asking questions as soon as status flips to ready.
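The chunking step can be sketched as follows. This is an illustrative Python approximation, not the pipeline's actual chunker: it assumes paragraph breaks stand in for "semantic boundaries", and the `max_chars` and `overlap_paragraphs` parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap_paragraphs: int = 1) -> List[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    Paragraph breaks (blank lines) stand in for semantic boundaries.
    Each new chunk repeats the last `overlap_paragraphs` paragraphs of the
    previous one so that context spanning a boundary is not lost.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []

    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because paragraphs are never split, a chunk can exceed `max_chars` when a single paragraph is longer than the limit; a production chunker would likely fall back to sentence or token boundaries in that case.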

Storage layout

~/.sophon/documents/
├── raw/               # Original uploaded files (content-addressed)
│   └── <sha256>.pdf
├── extracted/         # Extracted text + metadata
│   └── <doc-id>.json
└── thumbnails/        # PDF page thumbnails
    └── <doc-id>/

The database tracks metadata (title, size, pages, tags, ownership). The vector store (if present) holds chunk embeddings with the document ID as metadata. The FTS index is in the same DB (SQLite FTS5 on Personal; Postgres tsvector on Pro/Enterprise).
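"Content-addressed" means the stored file name is derived from the file's bytes. A minimal Python sketch of that idea, assuming the `raw/<sha256>.<ext>` layout shown above (the `store_raw` helper is hypothetical, not part of Sophon):

```python
import hashlib
from pathlib import Path


def store_raw(data: bytes, extension: str, root: Path) -> Path:
    """Write `data` to <root>/raw/<sha256><extension> and return the path.

    Identical bytes always hash to the same name, so re-uploading the same
    file does not create a second copy.
    """
    digest = hashlib.sha256(data).hexdigest()
    raw_dir = root / "raw"
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"{digest}{extension}"
    if not target.exists():
        target.write_bytes(data)
    return target
```

For example, `store_raw(Path("report.pdf").read_bytes(), ".pdf", Path.home() / ".sophon" / "documents")` would place the file under `~/.sophon/documents/raw/`.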

Q&A

Ask questions in chat:

"What does the Q3 report say about revenue growth?"

The agent:

  1. Calls document.search (hybrid keyword + semantic on Pro/Enterprise; keyword only on Personal).
  2. Retrieves the top chunks.
  3. Answers with citations back to the document and page/section.
  4. Offers follow-up actions — compare with another document, extract key figures, add facts to memory.
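The source does not say how keyword and semantic results are combined in step 1. One common approach is reciprocal rank fusion (RRF), sketched below in Python as an assumption rather than a description of what document.search does; the function name and the chunk IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60, top_n: int = 5) -> List[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) per chunk, so chunks that rank well
    in both keyword and semantic results rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return ordered[:top_n]


keyword_hits = ["c1", "c2", "c3"]   # e.g. from a full-text index
semantic_hits = ["c2", "c4", "c1"]  # e.g. from a vector store
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

With a single input list, which is the situation on a keyword-only tier, the function returns that list's order unchanged.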

Scope the question to specific documents:

"Based only on q3-report.pdf and q4-plan.docx, how does the revenue target change?"

Multi-document analysis

Upload several files. Then:

"Compare the three competitor pitch decks. What's their shared messaging?"

The agent retrieves chunks across all three, synthesizes them, and answers. For large batches (10+), use the Batch operation button on the Documents page:

  • Summarize all — one summary per document
  • Extract — pull structured fields (dates, amounts, names) across the batch
  • Merge — combine into a single output document

Document tools

Agent-invocable tools in the document.* namespace:

  • document.upload — ingest a file from a URL or path
  • document.search — hybrid keyword + semantic search
  • document.get — fetch a document's extracted text
  • document.summarize — summarize by style (brief / detailed / executive)
  • document.extract — structured extraction with a JSON schema
  • document.compare — diff or side-by-side analysis of multiple docs
  • document.delete — remove a document (Medium risk — gates if part of a plan)
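As an illustration of schema-driven extraction, the Python sketch below defines a small JSON schema and checks an extraction result against it. The schema fields, the `validation_errors` helper, and the shape of the result are all hypothetical; the exact arguments document.extract accepts are not specified here.

```python
from typing import Any, Dict, List

# Hypothetical schema for invoices; field names are illustrative only.
INVOICE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},
    },
    "required": ["invoice_date", "total_amount"],
}

_JSON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validation_errors(result: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the result fits.

    Covers only flat objects with string/number/boolean fields, which is a
    small subset of JSON Schema.
    """
    errors: List[str] = []
    for name in schema.get("required", []):
        if name not in result:
            errors.append(f"missing required field: {name}")
    for name, value in result.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        expected = _JSON_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for number fields.
        is_bool_as_number = spec["type"] == "number" and isinstance(value, bool)
        if is_bool_as_number or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors
```

Validating the result before using it catches the common failure where a model returns an amount as a string or omits a required field.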

Dashboard

The Documents module has:

  • Library — all documents with thumbnails, search, sort, filter by type/tag/date/owner
  • Detail view — extracted text side-by-side with original, per-chunk preview, metadata, tags
  • Q&A pane — chat-like interface scoped to the open document
  • Storage manager — disk usage by type, retention policies, cleanup

Limits and gotchas

  • Single-file upload limit: 100 MB (configurable). Larger files return HTTP 413.
  • OCR quality depends on the source — low-contrast scans produce noisy extracted text.
  • Audio transcription requires a Whisper-compatible provider configured in Settings → Models.
  • Deleted documents are not recoverable. The raw file under raw/ is removed along with the DB + index entries.
  • Vector search is disabled on Personal; keyword search still works but returns fewer conceptually related hits.
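A client can check the size limit before uploading and avoid a wasted transfer. The Python sketch below assumes the default 100 MB limit is measured in binary megabytes (the source does not say), and the `check_upload_size` helper is hypothetical.

```python
from http import HTTPStatus

# Assumption: "100 MB" means 100 * 1024 * 1024 bytes; the limit is configurable.
DEFAULT_MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def check_upload_size(size_bytes: int, max_bytes: int = DEFAULT_MAX_UPLOAD_BYTES) -> HTTPStatus:
    """Return the status a size check would produce for an upload."""
    if size_bytes > max_bytes:
        return HTTPStatus.REQUEST_ENTITY_TOO_LARGE  # HTTP 413
    return HTTPStatus.OK
```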

Security and isolation

  • All documents are user-scoped. Cross-user reads are impossible.
  • Tenant-scoped in Enterprise (EF Core global filters).
  • Uploaded files are virus-scanned via the configured scanner (ClamAV in the reference deployment).
  • Vector metadata is filtered server-side, so semantic search can't return another user's chunks.
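The last point can be sketched in Python as filter-then-rank: chunks belonging to other users are dropped before any similarity scoring happens. This is an in-memory illustration with hypothetical names (`Chunk`, `search`); a real vector store would apply the equivalent filter inside the query.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    chunk_id: str
    owner_id: str
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks: List[Chunk], query: Sequence[float], owner_id: str, top_k: int = 3) -> List[str]:
    """Return the IDs of the caller's own chunks, most similar first."""
    # Filter first, then rank: chunks owned by someone else never enter scoring.
    visible = [c for c in chunks if c.owner_id == owner_id]
    ranked = sorted(visible, key=lambda c: cosine(c.embedding, query), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]
```

Filtering after ranking would also hide other users' chunks, but filtering first guarantees they cannot displace the caller's own results from the top-k window.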

Where to go next

  • Memory — add document-derived facts to memory
  • Workflows — wire a file-change trigger into a document pipeline
  • Skills — OCR, Web Scrape, and other extraction skills