Documents

Upload, extract, index, and query PDFs, DOCX, spreadsheets, images, and audio.

Sophon's document pipeline lets you upload files and then talk to them. PDFs, Word docs, spreadsheets, images, Markdown, HTML, and audio all get normalized into searchable text, chunked, embedded (on Pro/Enterprise), and indexed. Agents can summarize, compare, extract, and answer questions — with citations back to the source.

Supported formats

Format	How it's processed
PDF	`PdfPig` for native-text PDFs; Tesseract OCR fallback for image-only PDFs
DOCX	`DocumentFormat.OpenXml` — text + headings + lists
XLSX / CSV	`ClosedXML` — sheets, ranges, formulas rendered as values
PNG / JPG / WEBP	OCR via Tesseract + optional LLM vision for structural understanding
HTML	Readability-style extraction — boilerplate removed
Markdown / TXT	Ingested directly
MP3 / WAV	Whisper transcription

Unsupported formats fail fast with a clear message. You can drop a new extractor into src/Sophon.Documents/Extractors/ — extractors are plugins registered with the document pipeline.
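The routing and fail-fast behaviour described above can be sketched as a small registry. This is a language-neutral illustration in Python, not Sophon's actual extractor interface. The names `register`, `detect_format`, `extract`, and `UnsupportedFormatError` are hypothetical, and the magic-byte table is a small illustrative subset.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a format key to an extractor callable.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}


class UnsupportedFormatError(ValueError):
    """Raised immediately when no extractor is registered for a file."""


def register(*formats: str):
    """Decorator that registers an extractor for one or more format keys."""
    def wrap(fn):
        for fmt in formats:
            EXTRACTORS[fmt] = fn
        return fn
    return wrap


# Magic-byte prefixes used for content sniffing (illustrative subset).
MAGIC = {b"%PDF-": "pdf", b"\x89PNG\r\n\x1a\n": "png", b"PK\x03\x04": "zip"}


def detect_format(filename: str, data: bytes) -> str:
    """Prefer content sniffing, then fall back to the file extension."""
    for prefix, fmt in MAGIC.items():
        if data.startswith(prefix):
            if fmt == "zip":
                # DOCX and XLSX are both ZIP containers, so the
                # extension is needed to tell them apart.
                break
            return fmt
    return Path(filename).suffix.lower().lstrip(".")


@register("txt", "md")
def extract_plain(data: bytes) -> str:
    """Plain text and Markdown need no conversion, only decoding."""
    return data.decode("utf-8", errors="replace")


def extract(filename: str, data: bytes) -> str:
    """Route to the registered extractor, or fail fast with a clear message."""
    fmt = detect_format(filename, data)
    if fmt not in EXTRACTORS:
        raise UnsupportedFormatError(
            f"No extractor registered for '{fmt or 'unknown'}' ({filename})"
        )
    return EXTRACTORS[fmt](data)
```

Under this sketch, adding a format means writing one function and decorating it. Nothing else in the routing code changes.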

Uploading

From the Dashboard

Drag and drop files onto the Documents page (/documents). You'll see a progress bar per file and a status chip (extracting → chunking → embedding → ready).

From chat

Attach a file to any chat message. The attachment upload goes through the same pipeline; by the time the agent sees your message, the document is indexed and referenceable.

From a channel

Send a file on Telegram, WhatsApp, Slack, etc. The channel adapter normalizes it into a SophonMessage attachment; the document pipeline processes it identically.

From the CLI

sophon documents upload report.pdf
sophon documents upload *.pdf --tag q3-review
sophon documents list
sophon documents summarize report.pdf
sophon documents delete <id>

The pipeline

Upload
  │
  ▼
Detect type (MIME + extension + content sniff)
  │
  ▼
Route to extractor
  │
  ▼
Extract raw text + metadata
  │
  ▼
Chunk (semantic-boundary-aware, overlap)
  │
  ▼
Embed (Pro/Enterprise)                Full-text index (all tiers)
  │                                   │
  ▼                                   ▼
Qdrant / pgvector / Milvus    SQLite FTS5 / Postgres FTS
  │                                   │
  └──────────────┬────────────────────┘
                 ▼
             Ready to query

All of this runs asynchronously. Small documents are ready in seconds; large PDFs with OCR can take minutes. You can start asking questions as soon as the status flips to ready.
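To make the chunking step concrete, here is a minimal sketch of boundary-aware chunking with overlap. It assumes paragraphs (blank-line separated) are the semantic boundary and that overlap is counted in whole paragraphs. The function name `chunk_text` and both parameters are hypothetical, and Sophon's real chunker may use different boundaries and sizes.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 1) -> List[str]:
    """Pack whole paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of each chunk are repeated at the
    start of the next one, so context is not cut at a chunk boundary.
    max_chars is a soft limit: a single long paragraph is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting only at paragraph breaks keeps sentences intact, at the cost of uneven chunk sizes.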

Storage layout

~/.sophon/documents/
├── raw/               # Original uploaded files (content-addressed)
│   └── <sha256>.pdf
├── extracted/         # Extracted text + metadata
│   └── <doc-id>.json
└── thumbnails/        # PDF page thumbnails
    └── <doc-id>/

The database tracks metadata (title, size, pages, tags, ownership). The vector store (if present) holds chunk embeddings with the document ID as metadata. The FTS index is in the same DB (SQLite FTS5 on Personal; Postgres tsvector on Pro/Enterprise).
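The content-addressed naming under raw/ can be illustrated as follows. This is a sketch that assumes the file name is the hex SHA-256 digest of the file's bytes plus the original extension, as the layout suggests. The helper name `raw_path` is hypothetical.

```python
import hashlib
from pathlib import Path


def raw_path(root: Path, data: bytes, extension: str) -> Path:
    """Build the storage path for an uploaded file from its content hash."""
    digest = hashlib.sha256(data).hexdigest()
    return root / "raw" / f"{digest}.{extension.lstrip('.')}"


# Identical bytes always map to the same path, whatever the upload was named.
root = Path.home() / ".sophon" / "documents"
first = raw_path(root, b"same bytes", "pdf")
second = raw_path(root, b"same bytes", ".pdf")
```

Because the name depends only on the content, uploading the same file twice resolves to a single path on disk.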

Q&A

Ask questions in chat:

"What does the Q3 report say about revenue growth?"

The agent:

Calls document.search (hybrid keyword + semantic on Pro/Enterprise; keyword only on Personal).
Retrieves the top chunks.
Answers with citations back to the document and page/section.
Offers follow-up actions — compare with another document, extract key figures, add facts to memory.
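This page does not say how the keyword and semantic result lists are combined. One common technique for merging ranked lists is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the constant `k = 60`, and the chunk IDs are assumptions, not Sophon's documented behaviour.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    so chunks ranked well by both searches rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["c1", "c2", "c3"]   # hypothetical keyword ranking
semantic_hits = ["c2", "c4", "c1"]  # hypothetical semantic ranking
merged = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

With a single input list, as on a keyword-only tier, the fused ranking is the input ranking unchanged.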

Scope the question to specific documents:

"Based only on q3-report.pdf and q4-plan.docx, how does the revenue target change?"

Multi-document analysis

Upload several files, then ask:

"Compare the three competitor pitch decks. What's their shared messaging?"

The agent retrieves chunks across all three, synthesizes, and answers. For large batches (10+), use the Batch operation button on the Documents page:

Summarize all — one summary per document
Extract — pull structured fields (dates, amounts, names) across the batch
Merge — combine into a single output document

Document tools

Agent-invocable tools in the document.* namespace:

document.upload — ingest a file from a URL or path
document.search — hybrid keyword + semantic search
document.get — fetch a document's extracted text
document.summarize — summarize by style (brief / detailed / executive)
document.extract — structured extraction with a JSON schema
document.compare — diff or side-by-side analysis of multiple docs
document.delete — remove a document (Medium risk — gates if part of a plan)
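As an illustration of structured extraction with a JSON schema, the sketch below builds a schema for invoice fields and checks a result for missing required keys. The argument names (`document_id`, `schema`), the field names, and the helper `missing_required` are all hypothetical. The actual shape of a document.extract call is not specified on this page.

```python
import json
from typing import List

# A JSON Schema describing the fields to pull out of a document.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_date": {"type": "string"},
        "total_amount": {"type": "number"},
        "vendor_name": {"type": "string"},
    },
    "required": ["invoice_date", "total_amount"],
}

# Hypothetical tool-call payload; the argument names are assumptions.
tool_call = {
    "tool": "document.extract",
    "arguments": {"document_id": "<id>", "schema": invoice_schema},
}


def missing_required(result: dict, schema: dict) -> List[str]:
    """Return the required keys that an extraction result did not fill in."""
    return [key for key in schema.get("required", []) if key not in result]


payload = json.dumps(tool_call)  # the payload serializes to plain JSON
gaps = missing_required({"invoice_date": "2024-03-31"}, invoice_schema)
```

Checking required keys after extraction gives a cheap signal that a document lacked a field, before the result is used downstream.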

Dashboard

The Documents module has:

Library — all documents with thumbnails, search, sort, filter by type/tag/date/owner
Detail view — extracted text side-by-side with original, per-chunk preview, metadata, tags
Q&A pane — chat-like interface scoped to the open document
Storage manager — disk usage by type, retention policies, cleanup

Limits and gotchas

Single-file upload limit: 100 MB (configurable). Larger files return HTTP 413.
OCR quality depends on the source — low-contrast scans produce noisy extracted text.
Audio transcription requires a Whisper-compatible provider configured in Settings → Models.
Deleted documents are not recoverable. The raw file under raw/ is removed along with the DB + index entries.
Vector search is disabled on Personal; keyword search still works but returns fewer conceptually related hits.

Security and isolation

All documents are user-scoped. Cross-user reads are impossible.
In Enterprise, documents are also tenant-scoped, enforced by EF Core global query filters.
Uploaded files are virus-scanned via the configured scanner (ClamAV in the reference deployment).
Vector metadata is filtered server-side, so semantic search can't return another user's chunks.
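The server-side filtering idea can be shown with a toy in-memory search. This is a sketch under stated assumptions: the `Chunk` type, the `owner_id` field, and the `search` function are hypothetical, and a real deployment would push the filter into the vector store's own query rather than filter in application code.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    chunk_id: str
    owner_id: str  # hypothetical metadata field stored with each embedding
    vector: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks: List[Chunk], query: Sequence[float],
           owner_id: str, top_k: int = 3) -> List[str]:
    """Rank only the caller's chunks against the query vector."""
    # Filter first, so other users' chunks never enter the candidate set.
    visible = [c for c in chunks if c.owner_id == owner_id]
    ranked = sorted(visible, key=lambda c: cosine(c.vector, query), reverse=True)
    return [c.chunk_id for c in ranked[:top_k]]


store = [
    Chunk("a1", "alice", [1.0, 0.0]),
    Chunk("b1", "bob", [1.0, 0.0]),
    Chunk("a2", "alice", [0.0, 1.0]),
]
alice_hits = search(store, [1.0, 0.0], owner_id="alice")
```

Filtering before ranking, rather than after, means a perfect match owned by someone else cannot displace or leak into the caller's results.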

Where to go next

Memory — add document-derived facts to memory
Workflows — wire a file-change trigger into a document pipeline
Skills — OCR, Web Scrape, and other extraction skills
