Examples
- API Request
- TypeScript
- Python (Sync)
Batch Processing Pipeline
When you upload multiple documents, each document goes through our comprehensive processing pipeline in parallel:

1. Immediate Upload & Queue
- All documents are immediately accepted and stored securely
- Each document is added to our processing queue for background processing
- You receive a confirmation response with individual `file_id`s for tracking each file
2. Parallel Processing Phase
Each document is processed independently with:
- Content Extraction: Extracting text from various supported formats (see Supported File Formats section below)
- Document Parsing: Understanding document structure, headers, and formatting
- Text Cleaning: Removing formatting artifacts and normalizing content
3. Intelligent Chunking
- Each document is split into semantically meaningful chunks
- Chunk size is optimized for both context preservation and search accuracy
- Overlapping boundaries ensure no information is lost between chunks
- Metadata is preserved and associated with each chunk
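The overlap behavior described above can be sketched in a few lines of Python. This is an illustrative sketch only; the `chunk_size` and `overlap` values are assumptions, not Cortex's actual parameters:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks so no information is lost at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so chunks share a boundary region
    return chunks
```

Because each chunk repeats the last `overlap` characters of the previous one, a sentence that straddles a chunk boundary still appears intact in at least one chunk.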
4. Embedding Generation
- Each chunk is converted into high-dimensional vector embeddings
- Embeddings capture semantic meaning and context
- Vectors are optimized for similarity search and retrieval
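Similarity search over those vectors typically relies on a distance measure such as cosine similarity. A minimal sketch (the metric Cortex uses internally is not specified here):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```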
5. Indexing & Database Updates
- Embeddings are stored in our vector database for fast similarity search
- Full-text search indexes are created for keyword-based queries
- Metadata is indexed for filtering and faceted search
- Cross-references are established between related documents
6. Quality Assurance
- Automated quality checks ensure processing accuracy for each document
- Content validation verifies extracted text completeness
- Embedding quality is assessed for optimal retrieval performance
Processing Time: Batch uploads are processed in parallel. Most documents are fully processed and searchable within 2-5 minutes. Larger documents (100+ pages) may take up to 15 minutes. You can check processing status using the individual document IDs returned in the response.
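A status-polling loop for the returned document IDs might look like the sketch below. The status-check call and the status values `"processed"` and `"errored"` are placeholders; substitute your actual status endpoint and its response fields:

```python
import time

def wait_until_processed(file_ids, get_status, timeout=900, interval=15):
    """Poll each document's status until it finishes (or the timeout expires).

    `get_status(file_id)` stands in for your status-check API call; the
    endpoint path and response shape are not specified in this document.
    Returns the set of file IDs still pending when the loop exits.
    """
    deadline = time.time() + timeout
    pending = set(file_ids)
    while pending and time.time() < deadline:
        for fid in list(pending):
            if get_status(fid) in ("processed", "errored"):
                pending.discard(fid)  # terminal state reached
        if pending:
            time.sleep(interval)
    return pending  # empty set means every document reached a terminal state
```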
Default Sub-Tenant Behavior: If you don't specify a `sub_tenant_id`, all documents will be uploaded to the default sub-tenant created when your tenant was set up. This is perfect for organization-wide document batches that should be accessible across all departments.

Recommended: For optimal performance, limit each batch to a maximum of 20 sources per request. Send multiple batch requests with an interval of 1 second between each request.
File ID Management: The system uses a priority-based approach for file ID assignment:
- First Priority: If you provide a `file_id` as a direct body parameter, that specific ID will be used
- Second Priority: If no direct `file_id` is provided, the system checks for a `file_id` in the `document_metadata` object
- Auto-Generation: If neither source provides a `file_id`, the system will automatically generate a unique identifier
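The priority order above can be expressed as a small resolution function. This is a client-side illustration of the documented rules, not the server's actual implementation:

```python
import uuid

def resolve_file_id(body_file_id=None, document_metadata=None):
    """Mirror the documented priority order for file ID assignment."""
    if body_file_id:                                            # 1. direct body parameter
        return body_file_id
    if document_metadata and document_metadata.get("file_id"):  # 2. document_metadata object
        return document_metadata["file_id"]
    return str(uuid.uuid4())                                    # 3. auto-generated identifier
```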
Duplicate File ID Behavior
When you upload documents with `file_id`s that already exist in your tenant:
- Overwrite Behavior: Each existing document with a matching `file_id` will be completely replaced with the new document
- Processing: Each new document will go through the full processing pipeline independently
- Search Results: Previous search results and embeddings from old documents will be replaced with the new documents' content
- Idempotency: Uploading the same documents with the same `file_id`s multiple times is safe and will result in the same final state
Supported File Formats
Cortex supports a comprehensive range of file formats for document processing. Files are automatically parsed and their content extracted for indexing and search.

Complete Reference: For a comprehensive list of all supported file formats with detailed information, see our Supported File Formats documentation.
Best Practices
Document Preparation
- File Size: Documents up to 50MB are processed efficiently
- Content Quality: Clear, well-structured documents produce better embeddings
- Metadata: Include rich metadata for better filtering and organization
Processing Optimization
- Batch Size: Limit each batch to a maximum of 20 sources per request
- Request Intervals: Send multiple batch requests with an interval of 1 second between each request
- Metadata Consistency: Use consistent metadata schemas across your organization
- File Naming: Descriptive filenames help with document identification
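The batch-size and request-interval recommendations above can be combined into a simple upload loop. The `upload_batch` callable is a placeholder for your actual batch upload request:

```python
import time

MAX_BATCH_SIZE = 20     # recommended maximum sources per request
REQUEST_INTERVAL = 1.0  # recommended seconds between batch requests

def send_in_batches(sources, upload_batch, interval=REQUEST_INTERVAL):
    """Split sources into batches of at most 20 and pause between requests.

    `upload_batch(batch)` stands in for your actual upload call.
    """
    responses = []
    for i in range(0, len(sources), MAX_BATCH_SIZE):
        batch = sources[i:i + MAX_BATCH_SIZE]
        responses.append(upload_batch(batch))
        if i + MAX_BATCH_SIZE < len(sources):
            time.sleep(interval)  # throttle between consecutive batch requests
    return responses
```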
Troubleshooting
Documents Not Appearing in Search?
- Wait 5-10 minutes for processing to complete
- Check if any document status is `errored` (rare occurrence)
- Verify your search query and filters
- Large documents (100+ pages) take longer to process
- Complex formatting may require additional processing time
- High system load may temporarily slow processing
- If status shows `errored`, ensure your documents aren't corrupted or password-protected
- Check that the file format is supported (see Supported File Formats section above)
- Verify your API key has sufficient permissions
- For unsupported formats, you'll receive a `400` error with the message: "Unsupported file format: [filename]. Please check our supported file formats documentation."
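Catching the unsupported-format case explicitly lets you surface the documented message to users. A minimal sketch, assuming a parsed JSON response body with a `message` field (the exact response shape may differ):

```python
def handle_upload_response(status_code: int, body: dict) -> dict:
    """Distinguish the documented unsupported-format 400 from other failures."""
    message = body.get("message", "")
    if status_code == 400 and "Unsupported file format" in message:
        # Documented error: point the caller at the supported-formats list
        raise ValueError(message)
    if status_code >= 400:
        raise RuntimeError(f"Upload failed with status {status_code}: {body}")
    return body
```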
Need Help? If documents fail to process or you're experiencing issues, contact our support team with the `file_id`s for assistance.

Error Responses
All endpoints return consistent error responses following the standard format. For detailed error information, see our Error Responses documentation.

Authorizations
Body
multipart/form-data