Documents
Upload, extract, index, and query PDFs, DOCX, spreadsheets, images, and audio.
Sophon's document pipeline lets you upload files and then talk to them. PDFs, Word docs, spreadsheets, images, Markdown, HTML, and audio all get normalized into searchable text, chunked, embedded (on Pro/Enterprise), and indexed. Agents can summarize, compare, extract, and answer questions — with citations back to the source.
Supported formats
| Format | How it's processed |
|---|---|
| PDF | PdfPig for native-text PDFs; Tesseract OCR fallback for image-only PDFs |
| DOCX | DocumentFormat.OpenXml — text + headings + lists |
| XLSX / CSV | ClosedXML — sheets, ranges, formulas rendered as values |
| PNG / JPG / WEBP | OCR via Tesseract + optional LLM vision for structural understanding |
| HTML | Readability-style extraction — boilerplate removed |
| Markdown / TXT | Ingested directly |
| MP3 / WAV | Whisper transcription |
Unsupported formats fail fast with a clear message. You can drop a new extractor into src/Sophon.Documents/Extractors/ — extractors are plugins registered with the document pipeline.
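The plugin model and fail-fast behavior described above can be pictured as a registry keyed by file extension. The sketch below is in Python purely for illustration. `ExtractorRegistry`, `register`, `extract`, and `UnsupportedFormatError` are hypothetical names, not Sophon's extractor interface, and real detection also uses MIME type and content sniffing rather than the extension alone.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type: an extractor takes raw bytes and returns extracted text.
Extractor = Callable[[bytes], str]


class UnsupportedFormatError(ValueError):
    """Raised immediately when no extractor is registered for a file type."""


class ExtractorRegistry:
    def __init__(self) -> None:
        self._by_extension: Dict[str, Extractor] = {}

    def register(self, *extensions: str) -> Callable[[Extractor], Extractor]:
        """Decorator that maps one or more extensions to an extractor."""
        def decorator(func: Extractor) -> Extractor:
            for ext in extensions:
                self._by_extension[ext.lower()] = func
            return func
        return decorator

    def extract(self, filename: str, data: bytes) -> str:
        ext = Path(filename).suffix.lower()
        extractor = self._by_extension.get(ext)
        if extractor is None:
            # Fail fast with a clear message instead of guessing a format.
            raise UnsupportedFormatError(
                f"No extractor registered for '{ext}' ({filename})"
            )
        return extractor(data)


registry = ExtractorRegistry()


@registry.register(".md", ".txt")
def extract_plain_text(data: bytes) -> str:
    # Markdown and TXT are ingested directly, so decoding is all that is needed.
    return data.decode("utf-8")
```

With this shape, supporting a new format means registering one more function, and any file without a registered extractor is rejected before any processing starts.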
Uploading
From the Dashboard
Drag and drop onto Documents (/documents). You'll see a progress bar per file and a status chip (extracting → chunking → embedding → ready).
From chat
Attach a file to any chat message. The attachment upload goes through the same pipeline; by the time the agent sees your message, the document is indexed and referenceable.
From a channel
Send a file on Telegram, WhatsApp, Slack, etc. The channel adapter normalizes it into a SophonMessage attachment; the document pipeline processes it identically.
From the CLI
sophon documents upload report.pdf
sophon documents upload *.pdf --tag q3-review
sophon documents list
sophon documents summarize report.pdf
sophon documents delete <id>
The pipeline
Upload
│
▼
Detect type (MIME + extension + content sniff)
│
▼
Route to extractor
│
▼
Extract raw text + metadata
│
▼
Chunk (semantic-boundary-aware, overlap)
│
▼
Embed (Pro/Enterprise)        FTS5 index (all tiers)
│                             │
▼                             ▼
Qdrant / pgvector / Milvus    SQLite FTS5 / Postgres FTS
│                             │
└──────────────┬──────────────┘
               ▼
Ready to query
All of this runs async. Small documents are ready in seconds; large PDFs with OCR can take minutes. You can start asking questions as soon as status flips to ready.
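To make the chunking step concrete, here is a minimal sketch of boundary-aware chunking with overlap. It is an assumption about how such a step could work, not Sophon's chunker: it treats blank lines as the only semantic boundary, measures size in characters, and the function name `chunk_text` and its parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 1) -> List[str]:
    """Group paragraphs into chunks of roughly max_chars characters.

    The last `overlap` paragraphs of each chunk are repeated at the start
    of the next one. max_chars is a soft limit: a paragraph is never split,
    so an oversized paragraph yields an oversized chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at paragraph boundaries keeps sentences intact, and the repeated paragraph gives a retrieved chunk some of the context that preceded it.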
Storage layout
~/.sophon/documents/
├── raw/                # Original uploaded files (content-addressed)
│   └── <sha256>.pdf
├── extracted/          # Extracted text + metadata
│   └── <doc-id>.json
└── thumbnails/         # PDF page thumbnails
    └── <doc-id>/
The database tracks metadata (title, size, pages, tags, ownership). The vector store (if present) holds chunk embeddings with the document ID as metadata. The FTS index is in the same DB (SQLite FTS5 on Personal; Postgres tsvector on Pro/Enterprise).
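Content-addressed storage means the file name under raw/ is derived from the file's bytes rather than from its uploaded name. The sketch below shows the general idea using SHA-256, which matches the `<sha256>` placeholder in the layout. The helper `raw_path_for` is hypothetical, and keeping the lowercased original extension is an assumption.

```python
import hashlib
from pathlib import Path


def raw_path_for(data: bytes, original_name: str, root: Path) -> Path:
    """Return the content-addressed location for an uploaded file."""
    digest = hashlib.sha256(data).hexdigest()
    # Assumption: the original extension is kept, lowercased.
    extension = Path(original_name).suffix.lower()
    return root / "raw" / f"{digest}{extension}"
```

A useful property of this scheme is that two uploads with identical bytes resolve to the same path, regardless of what the files were called.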
Q&A
Ask questions in chat:
"What does the Q3 report say about revenue growth?"
The agent:
- Calls document.search (hybrid keyword + semantic on Pro/Enterprise; keyword only on Personal).
- Retrieves the top chunks.
- Answers with citations back to the document and page/section.
- Offers follow-up actions — compare with another document, extract key figures, add facts to memory.
Scope the question to specific documents:
"Based only on q3-report.pdf and q4-plan.docx, how does the revenue target change?"
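Hybrid search has to merge a keyword ranking and a semantic ranking into one list. This page does not say how Sophon combines them. Reciprocal rank fusion (RRF) is one common technique, shown here only as an assumption about what such a merge could look like. The function name and the constant `k = 60` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of chunk IDs; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["c1", "c2", "c3"]   # e.g. from the full-text index
semantic_hits = ["c2", "c4", "c1"]  # e.g. from the vector store
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# fused == ["c2", "c1", "c4", "c3"]
```

Chunks that appear in both lists rise to the top, which is why hybrid search tends to surface passages that match both the wording and the meaning of a question.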
Multi-document analysis
Upload several files. Then:
"Compare the three competitor pitch decks. What's their shared messaging?"
The agent retrieves chunks across all three, synthesizes them, and answers. For large batches (10+), use the Batch operation button on the Documents page:
- Summarize all — one summary per document
- Extract — pull structured fields (dates, amounts, names) across the batch
- Merge — combine into a single output document
Document tools
Agent-invocable tools in the document.* namespace:
- document.upload — ingest a file from a URL or path
- document.search — hybrid keyword + semantic search
- document.get — fetch a document's extracted text
- document.summarize — summarize by style (brief / detailed / executive)
- document.extract — structured extraction with a JSON schema
- document.compare — diff or side-by-side analysis of multiple docs
- document.delete — remove a document (Medium risk — gates if part of a plan)
Dashboard
The Documents module has:
- Library — all documents with thumbnails, search, sort, filter by type/tag/date/owner
- Detail view — extracted text side-by-side with original, per-chunk preview, metadata, tags
- Q&A pane — chat-like interface scoped to the open document
- Storage manager — disk usage by type, retention policies, cleanup
Limits and gotchas
- Single-file upload limit: 100 MB (configurable). Larger files return HTTP 413.
- OCR quality depends on the source — low-contrast scans produce noisy extracted text.
- Audio transcription requires an embedding/Whisper-compatible provider configured in Settings → Models.
- Deleted documents are not recoverable. The raw file under raw/ is removed along with the DB + index entries.
- Vector search is disabled on Personal; keyword search still works but returns fewer conceptually related hits.
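Because oversized uploads are rejected with HTTP 413, a script that uploads many files may want to check sizes locally first. The sketch below is a hypothetical client-side helper, not part of Sophon. It assumes the default 100 MB limit means 100 * 1024 * 1024 bytes, which this page does not specify.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
DEFAULT_MAX_UPLOAD_BYTES = 100 * 1024 * 1024


def exceeds_upload_limit(path: str, max_bytes: int = DEFAULT_MAX_UPLOAD_BYTES) -> bool:
    """Return True if the file at `path` is larger than the upload limit."""
    return os.path.getsize(path) > max_bytes
```

If the limit has been reconfigured on the server, pass the configured value as `max_bytes` instead of relying on the default.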
Security and isolation
- All documents are user-scoped. Cross-user reads are impossible.
- Tenant-scoped in Enterprise (EF Core global filters).
- Uploaded files are virus-scanned via the configured scanner (ClamAV in the reference deployment).
- Vector metadata is filtered server-side, so semantic search can't return another user's chunks.
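Server-side filtering of vector metadata can be illustrated with a small sketch: ownership is checked before ranking, so another user's chunks never enter the candidate set no matter how similar they are to the query. The `Chunk` type and `search_for_user` function are hypothetical, and a real vector store would apply the filter inside the query rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    owner_id: str
    score: float  # similarity to the query, higher is better


def search_for_user(candidates: List[Chunk], user_id: str, top_k: int = 3) -> List[Chunk]:
    """Filter by owner first, then rank, so foreign chunks are never returned."""
    owned = [c for c in candidates if c.owner_id == user_id]
    return sorted(owned, key=lambda c: c.score, reverse=True)[:top_k]
```

Filtering before ranking matters: if the top results were selected first and filtered afterwards, a user could receive fewer hits than requested, and the presence of other users' content could leak through result counts.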