Tutorial: Document Ingestion Deep-Dive
Understanding and Customizing the Document Pipeline
This tutorial explores EdgeQuake's document processing pipeline in depth, covering chunking strategies, entity extraction, and how to optimize for your use case.
Time: ~25 minutes
Level: Intermediate
Prerequisites: Completed First RAG App
┌─────────────────────────────────────────────────────────────────┐
│ DOCUMENT INGESTION PIPELINE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Document ─────────────────────────────────────────────────────▶
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 1. Parse │ Extract text from PDF, DOCX, TXT, HTML │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 2. Chunk │ Split into semantic units (1200 tokens default)│
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 3. Extract │ LLM extracts entities + relationships │
│ │ (per chunk)│ Runs in parallel │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 4. Normalize│ Deduplicate entities, merge descriptions │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 5. Embed │ Generate embeddings for chunks + entities │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ 6. Store │ Save to PostgreSQL (pgvector + AGE) │
│ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
EdgeQuake has two PDF extraction backends:
- vision (default) for scanned, image-heavy, or layout-complex PDFs
- edgeparse for fast CPU-only extraction of digital-native PDFs
You can choose the backend per upload, set a workspace default, or use
EDGEQUAKE_PDF_PARSER_BACKEND as a server fallback. This section provides a quick overview - see
the PDF Ingestion Tutorial for complete details.
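For example, the server-level fallback can be set with the environment variable before the EdgeQuake server starts (a minimal sketch; how the variable reaches the server process depends on your deployment, e.g. shell, systemd unit, or container environment):

# Server-wide fallback parser backend, used when neither the upload nor the workspace specifies one
export EDGEQUAKE_PDF_PARSER_BACKEND=edgeparse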
In the Web UI:
- Set the workspace default on the workspace configuration page (/workspace or /w/[slug]/workspace).
- Override it per file in the upload dialog with the Parser for this upload selector.
- Leave the upload selector on Workspace Default to inherit the workspace setting.
# Upload a PDF with default settings (vision backend)
curl -X POST "http://localhost:8080/api/v1/documents/upload" \
-F "file=@research_paper.pdf" \
-F "title=AI Research Paper"
# Upload a digital-native PDF with EdgeParse
curl -X POST "http://localhost:8080/api/v1/documents/upload" \
-F "file=@annual_report.pdf" \
-F "title=Annual Report" \
-F "pdf_parser_backend=edgeparse"What Gets Extracted:
- ✅ Text (with layout preservation)
- ✅ Tables (with structure detected)
- ✅ Metadata (pages, author, title)
- ✅ Multi-column layouts (academic papers)
Response:
{
"id": "doc-uuid",
"title": "AI Research Paper",
"status": "completed",
"chunk_count": 45,
"metadata": {
"pages": 12,
"tables_detected": 3
}
}

EdgeQuake supports three extraction modes:
Text Mode (default, fastest):
# Automatic text extraction from digital PDFs
curl -X POST http://localhost:8080/api/v1/documents/upload \
-F "file=@doc.pdf"- Use for: Good quality digital PDFs
- Processing: 2-5 seconds
- Cost: Free
Vision Backend (scanned documents):
# LLM-based OCR for scanned/image PDFs
curl -X POST http://localhost:8080/api/v1/documents/upload \
-F "file=@scanned_book.pdf" \
-F "pdf_parser_backend=vision"- Use for: Scanned documents, poor quality PDFs
- Processing: 20-50 seconds
- Cost: ~$0.001-0.01 per page
Workspace Default Override (scan-heavy corpus):
# Prefer vision for all uploads in this workspace
curl -X PUT http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID \
-H "Content-Type: application/json" \
-d '{"pdf_parser_backend":"vision"}'- Use for: Teams that mostly ingest scans or image-heavy PDFs
- Processing: Consistent Vision behavior across uploads
- Cost: All uploads use the Vision backend unless overridden per upload
EdgeParse is intentionally deterministic and does not auto-fallback to Vision. If a PDF is scanned or image-only, EdgeParse may produce thin markdown and the document lineage will include a warning telling you to retry with Vision.
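If you hit that warning, the usual fix is to re-upload the same file with the Vision backend, using the same upload endpoint shown above (filename illustrative):

# Retry a scanned or image-only PDF with the Vision backend
curl -X POST "http://localhost:8080/api/v1/documents/upload" \
  -F "file=@scanned_contract.pdf" \
  -F "pdf_parser_backend=vision"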
For complex tables (merged cells, nested structures):
curl -X POST http://localhost:8080/api/v1/documents/upload \
-F "file=@financial_report.pdf" \
-F 'config={"enhance_tables": true}'Before (raw extraction):
Column1 Header Column2 Header
Data1a Data1b Data2a
Data2b Data3a Data3b
After (enhanced):
| Column 1 Header | Column 2 Header |
| --------------- | --------------- |
| Data 1a | Data 1b |
| Data 2a | Data 2b |
| Data 3a | Data 3b |

Trade-off: 2x slower, ~$0.0001 per table, but significantly better accuracy.
When EdgeQuake processes PDFs, chunks are created based on document structure:
Text Content:
- Paragraphs → Individual chunks
- Sections → Detected via headings
- Reading order → Preserved with layout analysis
Tables:
- Entire table → Single chunk
- Preserves cell relationships
- Includes caption if present
Figures:
- Caption → Separate chunk
- Image description (if vision mode enabled)
Example (12-page research paper):
Page 1: Abstract → 1 chunk
Page 2-3: Introduction (4 paras) → 4 chunks
Page 4: Table 1 → 1 chunk
Page 5-7: Methods (6 paras) → 6 chunks
Page 8: Figure 2 caption → 1 chunk
Page 9-11: Results (8 paras + table) → 9 chunks
Page 12: Conclusion → 2 chunks
Total: 24 chunks from 12 pages
Tip: PDF chunks tend to be more structured than plain text chunks due to layout analysis.
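To see how a specific PDF was actually split, you can list its chunks with the per-document chunks endpoint (covered in more detail later in this tutorial). A minimal sketch, assuming jq is installed and doc-uuid is the ID returned at upload:

# Count chunks and preview the first few for a processed PDF
curl -s "http://localhost:8080/api/v1/documents/doc-uuid/chunks" \
  | jq '{total: .total_chunks, preview: [.chunks[:3][] | {position, token_count}]}'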
Entities extracted from PDFs include document-specific elements:
From Content:
- Authors, researchers, organizations
- Methods, concepts, metrics
- Locations, datasets
From Metadata:
- PDF title → Document entity
- Author field → Person entities
- Creation date → Temporal entity
Example (from PDF metadata):
Dr. Jane Smith (PERSON) → AuthorOf → "AI Safety Paper" (DOCUMENT)
"AI Safety Paper" (DOCUMENT) → PublishedBy → MIT (ORGANIZATION)
MIT (ORGANIZATION) → LocatedIn → Boston (LOCATION)
Relationship Graph:
Jane Smith ───AuthorOf──▶ Paper ───Cites──▶ Related Work
│ │
│ │
WorksAt AboutTopic
│ │
▼ ▼
MIT "Reinforcement Learning"
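You can inspect the extracted graph for a document via the per-document entities endpoint shown later in this tutorial. A minimal sketch (assumes jq) that prints relationships as source → type → target triples:

# Print extracted relationships as simple triples
curl -s "http://localhost:8080/api/v1/documents/doc-uuid/entities" \
  | jq -r '.relationships[] | "\(.source) -\(.type)-> \(.target)"'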
After PDF upload, check extraction metrics:
curl http://localhost:8080/api/v1/documents/doc-uuid

Response:
{
"id": "doc-uuid",
"metadata": {
"pages": 12,
"tables_detected": 3,
"pdf_extraction_method": "edgeparse"
},
"chunk_count": 24,
"entity_count": 18
}

Quality Indicators:
- ✅ chunk_count matches expected (roughly 2-3 chunks per page)
- ✅ tables_detected > 0 if PDF has tables
- ✅ entity_count > 0 indicates successful extraction

If chunk_count = 0:
- Retry with the Vision backend: {"pdf_parser_backend":"vision"}
- Check if PDF is encrypted/protected
- See PDF Troubleshooting
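These indicators can also be checked from the command line; a minimal sketch (assumes jq is installed) that flags an empty extraction:

# Fail fast if a PDF produced no chunks or no entities
DOC=$(curl -s "http://localhost:8080/api/v1/documents/doc-uuid")
CHUNKS=$(echo "$DOC" | jq '.chunk_count')
ENTITIES=$(echo "$DOC" | jq '.entity_count')
if [ "$CHUNKS" -eq 0 ] || [ "$ENTITIES" -eq 0 ]; then
  echo "Extraction looks empty (chunks=$CHUNKS, entities=$ENTITIES); consider retrying with the Vision backend."
fi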
Common configuration options:
{
"pdf_parser_backend": "edgeparse", // edgeparse | vision
"enhance_tables": false, // Enable LLM table refinement
"layout": {
"detect_columns": true, // Multi-column detection
"detect_tables": true, // Table detection
"column_gap_threshold": 20.0 // Column separation (points)
},
"max_pages": null, // Limit pages (null = all)
"normalize_spacing": true, // Fix concatenated words
"extract_figure_captions": true // Extract figure captions
}

Read this section if:
- First time with EdgeQuake
- Quick reference for PDF upload
Read PDF Ingestion Tutorial if:
- Complex PDFs (tables, scans, multi-column)
- Need detailed configuration guidance
- Troubleshooting extraction issues
- Understanding quality metrics
Read PDF Processing Deep Dive if:
- Understanding internal algorithms
- XY-Cut layout analysis details
- Table detection clustering logic
- Contributing to PDF crate
No text extracted:
- ✅ Try {"pdf_parser_backend":"vision"} for scanned PDFs
- ✅ Check PDF is not encrypted

Tables not detected:
- ✅ Enable {"enhance_tables": true}
- ✅ Verify tables have clear borders

Wrong text order:
- ✅ Enable {"layout": {"detect_columns": true}}
- ✅ Academic papers benefit from column detection
More details: See PDF Troubleshooting
Chunks are the atomic units of retrieval. Too small = missing context. Too large = noise in results.
EdgeQuake uses sliding window chunking by default:
- Chunk size: 1200 tokens (default)
- Overlap: 100 tokens (~8%)
- Strategy: Semantic boundaries (sentences, paragraphs)
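With these defaults, the chunk count for a plain-text document is roughly (total_tokens - overlap) / (chunk_size - overlap), rounded up. A quick sketch of that estimate (the real splitter respects semantic boundaries, so actual counts vary):

# Rough chunk-count estimate for a 5,000-token document with the default settings
TOTAL=5000; SIZE=1200; OVERLAP=100
STRIDE=$((SIZE - OVERLAP))
echo $(( (TOTAL - OVERLAP + STRIDE - 1) / STRIDE ))   # prints 5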
After uploading a document, view its chunks:
curl "http://localhost:8080/api/v1/documents/doc_xyz789/chunks"Response:
{
"chunks": [
{
"id": "chunk_001",
"content": "TechCorp Innovation Labs was founded in 2020 by Sarah Chen and Marcus Williams. The company is headquartered in San Francisco, with research offices in Boston and Seattle.",
"position": 0,
"token_count": 42,
"embedding_id": "emb_abc123"
},
{
"id": "chunk_002",
"content": "Sarah Chen serves as CEO and leads the company's AI research initiatives. She previously worked at Google DeepMind where she led the language model team.",
"position": 1,
"token_count": 38,
"embedding_id": "emb_def456"
}
],
"total_chunks": 8
}

Different document types benefit from different chunking approaches:
| Strategy | Best For | Chunk Size |
|---|---|---|
| Fixed | General text | 1200 tokens (default) |
| Semantic | Well-structured docs | Variable |
| Paragraph | Articles, blogs | 1 paragraph |
| Sentence | Q&A, definitions | 1-3 sentences |
curl -X POST "http://localhost:8080/api/v1/documents?workspace_id=$WORKSPACE_ID" \
-F "file=@large_document.pdf" \
-F "title=Technical Manual" \
-F "chunk_size=1024" \
-F "chunk_overlap=100"| Scenario | Recommendation |
|---|---|
| Long technical docs | Increase to 1024 tokens |
| Short FAQs | Decrease to 256 tokens |
| Legal contracts | Use paragraph chunking |
| Code documentation | Use semantic with code awareness |
The LLM extracts entities and relationships from each chunk.
EdgeQuake extracts these entity types by default:
- PERSON - Named individuals
- ORGANIZATION - Companies, institutions, teams
- LOCATION - Places, cities, countries
- EVENT - Meetings, launches, milestones
- CONCEPT - Abstract ideas, theories
- TECHNOLOGY - Technical tools, frameworks, protocols
- PRODUCT - Products, services, commercial offerings
curl "http://localhost:8080/api/v1/documents/doc_xyz789/entities"Response:
{
"entities": [
{
"name": "SARAH_CHEN",
"type": "PERSON",
"description": "CEO of TechCorp Innovation Labs",
"mentions": [
{ "chunk_id": "chunk_001", "context": "...founded by Sarah Chen..." },
{ "chunk_id": "chunk_002", "context": "...Sarah Chen serves as CEO..." }
]
}
],
"relationships": [
{
"source": "SARAH_CHEN",
"target": "TECHCORP_INNOVATION_LABS",
"type": "FOUNDED",
"description": "Co-founded the company in 2020",
"source_chunk": "chunk_001"
}
]
}

Configure workspace-specific entity types:
curl -X PATCH "http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID" \
-H "Content-Type: application/json" \
-d '{
"entity_types": [
"PERSON",
"COMPANY",
"DRUG",
"DISEASE",
"GENE",
"PROTEIN"
]
}'

This is useful for domain-specific applications (medical, legal, financial).
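After changing the schema, a quick way to confirm the new types are being used is to ingest a document into that workspace and tally its entities by type (a minimal sketch, assumes jq):

# Tally extracted entity types for a freshly ingested document
curl -s "http://localhost:8080/api/v1/documents/doc_xyz789/entities" \
  | jq -r '.entities[].type' | sort | uniq -c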
EdgeQuake automatically normalizes entity names to prevent duplicates.
Input → Normalized
─────────────────────────────────────
"Sarah Chen" → SARAH_CHEN
"Dr. Sarah Chen" → SARAH_CHEN
"Chen, Sarah" → SARAH_CHEN
"Ms. Sarah Chen, PhD" → SARAH_CHEN
"Sarah Chen's work" → SARAH_CHEN
When the same entity appears with different descriptions, EdgeQuake merges them:
Chunk 1: "Sarah Chen is the CEO of TechCorp"
Chunk 2: "Dr. Chen previously worked at Google DeepMind"
Result:
{
"name": "SARAH_CHEN",
"description": "CEO of TechCorp Innovation Labs. Previously led the language model team at Google DeepMind."
}
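You can read the merged description back directly; a minimal sketch (assumes jq) that pulls one entity's consolidated description:

# Show the merged description for a single entity
curl -s "http://localhost:8080/api/v1/documents/doc_xyz789/entities" \
  | jq -r '.entities[] | select(.name == "SARAH_CHEN") | .description'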
For complex documents, single-pass extraction may miss entities. Enable gleaning for thorough extraction:
curl -X POST "http://localhost:8080/api/v1/documents?workspace_id=$WORKSPACE_ID" \
-F "file=@complex_document.pdf" \
-F "title=Research Paper" \
-F "gleaning_iterations=2"┌─────────────────────────────────────────────────────────────────┐
│ GLEANING PROCESS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Pass 1: Initial Extraction │
│ ───────────────────────── │
│ LLM extracts: [SARAH_CHEN, TECHCORP, NEURALSEARCH] │
│ │
│ Pass 2: Glean (review for missed entities) │
│ ─────────────────────────────────────────── │
│ Prompt: "Review text for entities you may have missed" │
│ LLM extracts: [GOOGLE_DEEPMIND, VENTURE_PARTNERS_CAPITAL] │
│ │
│ Combined: 5 entities (vs 3 from single pass) │
│ Improvement: +67% recall │
│ │
└─────────────────────────────────────────────────────────────────┘
| Gleaning | LLM Calls | Entity Recall | Cost |
|---|---|---|---|
| 0 passes | 1 per chunk | Baseline | $ |
| 1 pass | 2 per chunk | +15-25% | $$ |
| 2 passes | 3 per chunk | +25-35% | $$$ |
Default: 1 gleaning iteration (good balance).
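The cost scaling is easy to estimate: extraction makes (1 + gleaning_iterations) LLM calls per chunk. A quick sketch using the 45-chunk document from the earlier example:

# Extraction LLM calls = chunks * (1 initial pass + gleaning iterations)
CHUNKS=45; GLEANING=2
echo $(( CHUNKS * (1 + GLEANING) ))   # 135 calls at 2 gleaning passes (vs 45 with gleaning off)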
# Get processing status
curl "http://localhost:8080/api/v1/documents/doc_xyz789"Response:
{
"id": "doc_xyz789",
"title": "Research Paper",
"status": "processing",
"progress": {
"phase": "extracting",
"chunks_total": 45,
"chunks_processed": 23,
"percent": 51
},
"metrics": {
"parse_time_ms": 234,
"chunk_time_ms": 156,
"extract_time_ms": 12400,
"tokens_used": 15600
}
}

| Phase | Description | Duration |
|---|---|---|
| parsing | Extract text from file | ~100ms |
| chunking | Split into chunks | ~50ms |
| extracting | LLM entity extraction | ~2-10s per chunk |
| normalizing | Deduplicate entities | ~100ms |
| embedding | Generate vectors | ~500ms |
| storing | Save to database | ~100ms |
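For scripted ingestion it is handy to wait for the pipeline to finish. A minimal polling sketch using the status endpoint above (assumes jq; the terminal status names completed and failed are taken from the responses shown in this tutorial, with failed assumed as the error state):

# Poll the document status until processing finishes
DOC_ID=doc_xyz789
while true; do
  STATUS=$(curl -s "http://localhost:8080/api/v1/documents/$DOC_ID" | jq -r '.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  sleep 5
done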
For large document sets, use batch upload:
# Create a batch
curl -X POST "http://localhost:8080/api/v1/batches?workspace_id=$WORKSPACE_ID" \
-H "Content-Type: application/json" \
-d '{
"name": "Q1 Reports Batch",
"documents": [
{"file": "report_jan.pdf", "title": "January Report"},
{"file": "report_feb.pdf", "title": "February Report"},
{"file": "report_mar.pdf", "title": "March Report"}
]
}'

Or upload a ZIP file:
curl -X POST "http://localhost:8080/api/v1/documents/bulk?workspace_id=$WORKSPACE_ID" \
-F "file=@all_reports.zip"If you change settings, reprocess existing documents:
# Reprocess with new entity types
curl -X POST "http://localhost:8080/api/v1/documents/doc_xyz789/reprocess" \
-H "Content-Type: application/json" \
-d '{
"chunk_size": 1024,
"gleaning_iterations": 2,
"entity_types": ["PERSON", "DRUG", "DISEASE"]
}'

| Setting Change | Recalculated |
|---|---|
| chunk_size | Chunks, entities, embeddings |
| entity_types | Entities, relationships |
| gleaning | Entities, relationships |
| LLM model | Entities, embeddings |
Analyze pipeline performance:
curl "http://localhost:8080/api/v1/workspaces/$WORKSPACE_ID/metrics"Response:
{
"workspace_id": "ws_abc123",
"documents": {
"total": 150,
"completed": 148,
"processing": 2,
"failed": 0
},
"chunks": {
"total": 4500,
"avg_size_tokens": 487
},
"entities": {
"total": 1250,
"by_type": {
"PERSON": 320,
"ORGANIZATION": 180,
"CONCEPT": 450,
"LOCATION": 150,
"EVENT": 100,
"PRODUCT": 50
}
},
"relationships": {
"total": 2100
},
"costs": {
"llm_tokens_used": 4500000,
"embedding_tokens_used": 2250000,
"estimated_cost_usd": 12.5
}
}

- Clean text - Remove headers, footers, page numbers if possible
- Consistent format - Use consistent naming for entities
- Quality over quantity - Better documents = better extraction
| Document Type | Recommended Size |
|---|---|
| General articles | 1200 tokens (default) |
| Technical docs | 1200 tokens |
| Short Q&A | 512 tokens |
| Legal contracts | Paragraph-based |
- Domain-specific types - Add custom types for your domain
- Enable gleaning - For research papers and complex docs
- Review extractions - Spot-check for quality
Problem: Few entities extracted from detailed document.
Solutions:
- Enable gleaning: gleaning_iterations=2
- Decrease chunk size for finer extraction
- Check LLM model supports extraction task
Problem: Same entity appears multiple times.
Solutions:
- Check entity normalization is working
- Review entity descriptions for merge eligibility
- Consider manual merge via API
Problem: Documents taking too long.
Solutions:
- Increase worker threads: WORKER_THREADS=8
- Use faster LLM model (gpt-5-nano)
- Reduce gleaning iterations
- Batch documents instead of sequential
✅ How the 6-stage pipeline works
✅ Chunking strategies and customization
✅ Entity extraction and normalization
✅ Gleaning for thorough extraction
✅ Monitoring processing status
✅ Batch and bulk upload
✅ Reprocessing documents
✅ Pipeline performance metrics
| Tutorial | Description |
|---|---|
| Query Optimization | Choosing and tuning query modes |
| Multi-Tenant Setup | Building a SaaS application |
| Custom Entity Types | Domain-specific extraction |
- LightRAG Algorithm - Algorithm deep-dive
- Entity Normalization - Deduplication details
- REST API - API reference