Examples
- API Request
- TypeScript
- Python (Sync)
Supported file formats
Complete Reference: For a comprehensive list of all supported file formats with detailed information, see our Supported File Formats documentation.
Document Processing Pipeline
When you upload a document, it goes through a processing pipeline designed to make your content searchable and retrievable:
1. Immediate Upload & Queue
- Your document is immediately accepted and stored securely
- It’s added to our processing queue for background processing
- You receive a confirmation response with a file_id for tracking
2. Processing Phase
Our system automatically handles:
- Content Extraction: Extracting text from various formats (PDF, DOCX, TXT, etc.)
- Document Parsing: Understanding document structure, headers, and formatting
- Text Cleaning: Removing formatting artifacts and normalizing content
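As a rough illustration of the text-cleaning step, the sketch below normalizes Unicode, drops control characters, and tidies whitespace. The function name clean_text and the specific rules are assumptions made for illustration; the platform's actual cleaning logic is not described in this document.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: normalize Unicode, drop control characters, tidy whitespace."""
    # NFKC folds compatibility characters such as ligatures and non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Remove control characters (category Cc) except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that hug a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Allow at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Extracted PDF text often contains ligatures and form-feed characters, which is why the sketch normalizes before it touches whitespace.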
3. Intelligent Chunking
- Documents are split into semantically meaningful chunks
- Chunk size is optimized for both context preservation and search accuracy
- Overlapping boundaries ensure no information is lost between chunks
- Metadata is preserved and associated with each chunk
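The overlap idea above can be sketched with fixed-size word windows. This is a simplification: the pipeline is described as splitting on semantic boundaries, while the example below splits on word counts only. The name chunk_words, the default sizes, and the chunk dictionary layout are all assumptions.

```python
from typing import Optional


def chunk_words(
    text: str,
    chunk_size: int = 200,
    overlap: int = 40,
    metadata: Optional[dict] = None,
) -> list:
    """Split text into word windows that share `overlap` words with their neighbor."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(
            {
                "index": len(chunks),
                "start_word": start,
                "text": " ".join(window),
                # Copy the document metadata onto every chunk.
                "metadata": dict(metadata or {}),
            }
        )
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because each window starts `chunk_size - overlap` words after the previous one, a sentence that straddles a boundary appears in both neighboring chunks.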
4. Embedding Generation
- Each chunk is converted into high-dimensional vector embeddings
- Embeddings capture semantic meaning and context
- Vectors are optimized for similarity search and retrieval
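To make "similarity search" concrete, here is a minimal sketch that ranks stored chunk vectors against a query vector by cosine similarity. The embedding model and the similarity metric used by the service are not stated in this document, so cosine similarity and the tiny hand-written vectors are assumptions.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list, vectors: dict, k: int = 3) -> list:
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A production vector database would use an approximate nearest-neighbor index instead of the linear scan shown here; the ranking principle is the same.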
5. Indexing & Database Updates
- Embeddings are stored in our vector database for fast similarity search
- Full-text search indexes are created for keyword-based queries
- Metadata is indexed for filtering and faceted search
- Cross-references are established for related documents
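The keyword and metadata indexes described above can be pictured with a toy in-memory structure. TinyIndex and its methods are hypothetical names, and the sketch says nothing about how the service stores its indexes; it only shows how a keyword lookup and a metadata filter combine.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index with exact-match metadata filtering."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)  # term -> set of chunk ids
        self._metadata = {}                # chunk id -> metadata dict

    def add(self, chunk_id: str, text: str, metadata: dict) -> None:
        self._metadata[chunk_id] = dict(metadata)
        for term in text.lower().split():
            self._postings[term].add(chunk_id)

    def search(self, term: str, **filters) -> list:
        """Chunk ids containing `term` whose metadata matches every filter."""
        hits = self._postings.get(term.lower(), set())
        return sorted(
            cid
            for cid in hits
            if all(self._metadata[cid].get(key) == value for key, value in filters.items())
        )
```

For example, `index.search("policy", department="hr")` would return only chunks that contain the word and carry that metadata value.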
6. Quality Assurance
- Automated quality checks ensure processing accuracy
- Content validation verifies extracted text completeness
- Embedding quality is assessed for optimal retrieval performance
Processing Time: Most documents are fully processed and searchable within 1-5 minutes. Larger documents (100+ pages) may take up to 15 minutes. You can check processing status using the document ID returned in the response.
Default Sub-Tenant Behavior: If you don’t specify a sub_tenant_id, the document will be uploaded to the default sub-tenant created when your tenant was set up. This is perfect for organization-wide documents that should be accessible across all departments.
File ID Management: The system uses a priority-based approach for file ID assignment:
- First Priority: If you provide a file_id as a direct body parameter, that specific ID will be used
- Second Priority: If no direct file_id is provided, the system checks for a file_id in the document_metadata object
- Auto-Generation: If neither source provides a file_id, the system will automatically generate a unique identifier
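The priority order above can be expressed as a small function. This is a client-side sketch of the documented rules, not the server's implementation; resolve_file_id is a hypothetical name, and the UUID used for auto-generation is an assumption, since the format of generated IDs is not specified here.

```python
import uuid


def resolve_file_id(body: dict) -> str:
    """Pick the file_id the way the priority rules describe."""
    # First priority: file_id passed directly in the request body.
    direct = body.get("file_id")
    if direct:
        return direct

    # Second priority: file_id nested inside document_metadata.
    metadata = body.get("document_metadata") or {}
    nested = metadata.get("file_id")
    if nested:
        return nested

    # Fallback: generate a unique identifier (UUID4 is an assumed format).
    return str(uuid.uuid4())
```

Running this locally before an upload is one way to know in advance which ID a document will be stored under when both fields are present.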
Duplicate File ID Behavior
When you upload a document with a file_id that already exists in your tenant:
- Overwrite Behavior: The existing document with the same file_id will be completely replaced with the new document
- Processing: The new document will go through the full processing pipeline (content extraction, chunking, embedding generation, indexing)
- Search Results: Previous search results and embeddings from the old document will be replaced with the new document’s content
- Idempotency: Uploading the same document with the same file_id multiple times is safe and will result in the same final state
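The overwrite and idempotency rules can be modeled with an in-memory store keyed by file_id. DocumentStore is a hypothetical stand-in for the service, and splitting on blank lines stands in for the real processing pipeline.

```python
class DocumentStore:
    """Toy model of overwrite-on-duplicate semantics."""

    def __init__(self) -> None:
        self._docs = {}  # file_id -> processed record

    def upload(self, file_id: str, content: str) -> bool:
        """Store the document; return True if an existing file_id was replaced."""
        replaced = file_id in self._docs
        # The whole record is rebuilt, so nothing from the old version survives.
        self._docs[file_id] = {
            "content": content,
            "chunks": [part for part in content.split("\n\n") if part],
        }
        return replaced

    def get(self, file_id: str) -> dict:
        return self._docs[file_id]
```

Uploading identical content twice leaves the store in the same state as uploading it once, which is the idempotency property described above.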
Processing Status & Monitoring
After uploading, you can monitor your document’s processing status:
Immediate Response
Upon successful upload, you’ll receive a confirmation response that includes the file_id for tracking.
Processing States
Your document will progress through these states:
- queued: Document is in the processing queue, waiting to be processed
- in_progress: Document is actively being processed (includes content extraction, chunking, embedding generation, and indexing)
- success: Document is fully processed and searchable
- errored: Processing encountered an error (rare occurrence)
In-Progress Details: While the status shows in_progress, the system is performing multiple steps: content extraction, document parsing, intelligent chunking, embedding generation, and database indexing. These happen sequentially but are all part of the single in_progress state.
When Your Document is Ready
Once processing is complete, your document will be:
- ✅ Searchable via semantic search and Q&A endpoints
- ✅ Retrievable through our retrieval APIs
- ✅ Available for AI-powered applications
- ✅ Indexed for fast query performance
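A client can wait for a document to become searchable by polling until it reaches one of the two terminal states, success or errored. The sketch below takes a get_status callable so it stays independent of any particular endpoint; that callable, the function name, and the polling interval are assumptions. The default timeout of 900 seconds mirrors the 15-minute upper bound quoted for large documents.

```python
import time
from typing import Callable

TERMINAL_STATES = {"success", "errored"}
KNOWN_STATES = {"queued", "in_progress"} | TERMINAL_STATES


def wait_until_processed(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns 'success' or 'errored'."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status not in KNOWN_STATES:
            raise ValueError(f"unexpected status: {status!r}")
        if status in TERMINAL_STATES:
            return status
        # Still queued or in_progress: give up only after the deadline.
        if clock() >= deadline:
            raise TimeoutError("document was not processed before the timeout")
        sleep(interval_s)
```

In real use, get_status would wrap whatever status lookup your integration uses for the file_id returned at upload time. The sleep and clock parameters exist so the loop can be tested without waiting.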
Best Practices
Document Preparation
- File Size: Documents up to 50MB are processed efficiently
- Content Quality: Clear, well-structured documents produce better embeddings
- Metadata: Include rich metadata for better filtering and organization
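These preparation guidelines can be checked before an upload is attempted. The helper below returns human-readable warnings and never blocks an upload. The function name, the reading of 50MB as 50 × 1024 × 1024 bytes, and the rule that empty metadata deserves a warning are all assumptions.

```python
MAX_EFFICIENT_BYTES = 50 * 1024 * 1024  # assumed interpretation of "50MB"


def upload_warnings(size_bytes: int, metadata: dict) -> list:
    """Return advisory warnings for a pending upload; an empty list means no concerns."""
    warnings = []
    if size_bytes > MAX_EFFICIENT_BYTES:
        size_mb = size_bytes / (1024 * 1024)
        warnings.append(
            f"file is {size_mb:.1f} MB; documents up to 50 MB are processed efficiently"
        )
    if not metadata:
        warnings.append("no metadata supplied; filtering and organization will be limited")
    return warnings
```

A typical call would pass `os.path.getsize(path)` as size_bytes along with the metadata dictionary you plan to send.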
Processing Optimization
- Batch Uploads: For multiple documents, consider using our batch upload endpoint
- Metadata Consistency: Use consistent metadata schemas across your organization
- File Naming: Descriptive filenames help with document identification
Troubleshooting
Document Not Appearing in Search?
- Wait 5-10 minutes for processing to complete
- Check if the document status is errored (rare occurrence)
- Verify your search query and filters
- Large documents (100+ pages) take longer to process
- Complex formatting may require additional processing time
- High system load may temporarily slow processing
- If status shows errored, ensure your document isn’t corrupted or password-protected
- Check that the file format is supported (see Supported File Formats section above)
- Verify your API key has sufficient permissions
- For unsupported formats, you’ll receive a 400 error with the message: "Unsupported file format: [filename]. Please check our supported file formats documentation."
Need Help? If a document fails to process or you’re experiencing issues, contact our support team with the file_id for assistance.
Error Responses
All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.
Authorizations
Body
multipart/form-data