Overview

When you upload a PDF or DOCX file, Raycaster Doc automatically parses it into structured Markdown in the background. This parsed content powers AI features like chat document reading, semantic search, and review analysis — without you having to do anything.

How Parsing Works

1. Upload triggers parsing
   When a file is uploaded or replaced, a SHA-256 hash of the source file is computed.

2. Cache check
   If parsed content already exists for this exact file hash, parsing is skipped entirely (cache hit).

3. Parse job queued
   For new files, a background parse job is queued and processed asynchronously.

4. Parsed content stored
   The output (per-page Markdown files and extracted media) is stored in a separate cache bucket.
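
The four steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not Raycaster Doc's actual implementation: the names `parsed_cache`, `pending`, `handle_upload`, and `process_queue` are hypothetical, and plain in-memory dictionaries stand in for the cache bucket and the background job queue.

```python
import hashlib

# Hypothetical stand-ins: hash -> per-page Markdown, and hash -> source bytes awaiting parsing.
parsed_cache: dict[str, list[str]] = {}
pending: dict[str, bytes] = {}  # dicts keep insertion order, so this acts as a FIFO queue


def file_hash(data: bytes) -> str:
    """Step 1: compute the SHA-256 hex digest of the source file."""
    return hashlib.sha256(data).hexdigest()


def handle_upload(data: bytes) -> str:
    """Steps 2 and 3: skip parsing on a cache hit, otherwise queue a parse job."""
    digest = file_hash(data)
    if digest in parsed_cache:
        return "cache_hit"
    pending.setdefault(digest, data)  # queue each distinct file only once
    return "queued"


def process_queue(parse_pages) -> None:
    """Step 4: run queued jobs and store the parsed pages under the file hash."""
    while pending:
        digest = next(iter(pending))
        data = pending.pop(digest)
        parsed_cache[digest] = parse_pages(data)
```

Here `parse_pages` is any callable that turns file bytes into a list of per-page Markdown strings. In a real system `process_queue` would run in a separate worker rather than in the upload path.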

Parsing by File Type

| Format | Parser | Output |
| --- | --- | --- |
| PDF | Mistral OCR | Per-page Markdown + extracted images |
| DOCX | Reducto | Per-page Markdown + media |
| DOC | Reducto (legacy) | Per-page Markdown |
| Markdown / Plaintext | Native (no parsing needed) | Indexed directly |
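
As a rough illustration of the routing in this table, the sketch below maps a file extension to a parser label. The labels (`"mistral-ocr"`, `"reducto"`, `"reducto-legacy"`) and the `.md` / `.txt` extensions are assumptions made for the example. They are not identifiers from Raycaster Doc or from either parser's API.

```python
from pathlib import Path
from typing import Optional

# Hypothetical labels mirroring the table; None means "no parsing needed".
PARSER_BY_EXTENSION = {
    ".pdf": "mistral-ocr",
    ".docx": "reducto",
    ".doc": "reducto-legacy",
    ".md": None,
    ".txt": None,
}


def select_parser(filename: str) -> Optional[str]:
    """Return the parser label for a file, or None if it is indexed directly."""
    extension = Path(filename).suffix.lower()
    if extension not in PARSER_BY_EXTENSION:
        raise ValueError(f"Unsupported file type: {extension or '(none)'}")
    return PARSER_BY_EXTENSION[extension]
```

For example, `select_parser("Report.PDF")` returns `"mistral-ocr"`, while `select_parser("notes.md")` returns `None`.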

Cache Properties

  • Hash-based — Identical files uploaded by different users reuse the same parsed output
  • Idempotent — Re-uploading the same file doesn’t trigger redundant parsing
  • Automatic cleanup — When an artifact is deleted or a project is removed, its cached content is cleaned up
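
One way these three properties could fit together is a cache keyed by file hash that also tracks which artifacts reference each entry. The sketch below is an assumption about how sharing and cleanup might be reconciled, not a description of the real system. The `ParsedCache` class and its methods are hypothetical, and the rule that a shared entry survives until its last artifact is deleted is inferred rather than stated above.

```python
from collections import defaultdict


class ParsedCache:
    """Hypothetical hash-keyed cache with per-artifact reference tracking."""

    def __init__(self):
        self._content = {}              # file hash -> per-page Markdown
        self._refs = defaultdict(set)   # file hash -> artifact ids using it

    def attach(self, artifact_id, digest, parse):
        """Link an artifact to a hash. Returns True only if parsing actually ran."""
        parsed = False
        if digest not in self._content:
            self._content[digest] = parse()
            parsed = True
        self._refs[digest].add(artifact_id)
        return parsed

    def detach(self, artifact_id, digest):
        """Unlink an artifact. Returns True if the cached content was removed."""
        self._refs[digest].discard(artifact_id)
        if not self._refs[digest]:
            self._refs.pop(digest, None)
            return self._content.pop(digest, None) is not None
        return False
```

With this design, a second `attach` for the same hash never calls `parse`, and the cached pages are dropped only when the last referencing artifact is detached.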

Parse Status

Each artifact tracks its parsing state internally:
| Status | Meaning |
| --- | --- |
| none | No parsed content exists yet |
| pending | Parse job is queued or in progress |
| ready | Parsed content is available for AI features |
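
The three states could be modeled as a small enumeration. This is a minimal sketch: the `ParseStatus` class and the `parsed_content_available` helper are hypothetical names, and only the string values `none`, `pending`, and `ready` come from the table.

```python
from enum import Enum


class ParseStatus(str, Enum):
    """Parsing state of an artifact, using the status values from the table."""

    NONE = "none"        # no parsed content exists yet
    PENDING = "pending"  # parse job is queued or in progress
    READY = "ready"      # parsed content is available for AI features


def parsed_content_available(status: ParseStatus) -> bool:
    """Only a 'ready' artifact has parsed Markdown that AI features can read."""
    return status is ParseStatus.READY
```

Because the enum subclasses `str`, a stored value such as `"pending"` converts back with `ParseStatus("pending")`.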

What Uses Parsed Content

Parsed Markdown is consumed by several AI features behind the scenes:
  • Chat view tool — When the agent reads a document in text mode, it uses parsed Markdown for PDFs and DOCX files
  • Semantic search — Parsed content is chunked, embedded, and indexed in the vector database
  • Review runs — The review agent reads parsed content to analyze documents

You never interact with parsed files directly. The document viewer always shows the original source file. Parsing is a backend optimization that makes AI features fast and accurate.
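
To make the "chunked" step of semantic search more concrete, here is a minimal sketch of a paragraph-based chunker. The chunking strategy and the `max_chars` default are assumptions chosen for illustration, since the actual chunk size and splitting rules are not documented here. Embedding and vector indexing are left out.

```python
def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for paragraph in (p.strip() for p in markdown.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate      # still fits, or starts a new chunk
        else:
            chunks.append(current)   # close the full chunk
            current = paragraph      # start the next one
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then be embedded and written to the vector index by whatever embedding model and database the deployment uses.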

Limitations

  • PDF parsing supports files up to 25 MB
  • Very large documents may take a few minutes to parse
  • Complex layouts (multi-column, heavy tables) may have reduced parsing accuracy — use the visual mode in chat for layout-sensitive analysis
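
If you upload PDFs programmatically, a client-side size check can catch oversized files before upload. The sketch below is an assumption-laden example: `MAX_PDF_BYTES` and `pdf_within_limit` are hypothetical names, and it interprets 25 MB as 25 × 1024 × 1024 bytes, which may differ from how the service measures the limit.

```python
import os

# Assumption: "25 MB" is read as 25 MiB; the service may count bytes differently.
MAX_PDF_BYTES = 25 * 1024 * 1024


def pdf_within_limit(size_bytes: int) -> bool:
    """Return True if a PDF of this size is within the documented parsing limit."""
    return size_bytes <= MAX_PDF_BYTES


def pdf_file_within_limit(path: str) -> bool:
    """Check a file on disk by its size in bytes."""
    return pdf_within_limit(os.path.getsize(path))
```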