Overview
Document extraction lets you automatically extract structured data from documents using AI-powered processing. This guide walks you through the complete workflow, from document creation to retrieving extracted results.
Extraction Workflow
The extraction process follows these steps:
- Create a document - Initialize a document with extraction configuration
- Upload the file - Upload your document file using the pre-signed URL
- Monitor processing - Poll the document status until extraction completes
- Retrieve results - Access the extracted data from the document response
Step 1: Create a Document
You can create a document in two ways: using a template or with ad-hoc configuration.
Option A: Using a Template
If you have a document template set up, reference it when creating the document. The template provides the extraction schema and fraud detection settings automatically.
Option B: Ad-Hoc Configuration
Provide the extraction configuration directly in the request. See the Schema reference for schema requirements.
Response
The API returns a document object with a pre-signed upload URL. The URL is valid for the period given by expires_in (typically 1 hour). Upload your file before it expires.
Step 2: Upload the File
Upload your document file to the pre-signed URL using a PUT request.
File Requirements
- Maximum file size: 50MB
- Supported formats: PDF, images (PNG, JPEG)
- Content-Type: Should match the file type (e.g., application/pdf for PDFs)
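The requirements above can be checked locally before the upload is attempted. The following is a minimal sketch rather than an official client: the helper names are hypothetical, the 50MB limit is assumed to mean 50 × 1024 × 1024 bytes, and upload_url stands for the pre-signed URL returned in Step 1.

```python
import urllib.request
from pathlib import Path

# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024

# Content types for the formats this guide lists as supported.
CONTENT_TYPES = {
    ".pdf": "application/pdf",
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
}


def content_type_for(filename: str) -> str:
    """Return the Content-Type for a supported file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported file format: {suffix or '(none)'}")
    return CONTENT_TYPES[suffix]


def build_upload_request(
    upload_url: str, filename: str, data: bytes
) -> urllib.request.Request:
    """Validate the file and build (but do not send) the PUT request."""
    if not data:
        raise ValueError("File is empty")
    if len(data) > MAX_FILE_SIZE_BYTES:
        raise ValueError("File exceeds the 50MB limit")
    return urllib.request.Request(
        upload_url,
        data=data,
        method="PUT",
        headers={"Content-Type": content_type_for(filename)},
    )
```

Passing the returned request to urllib.request.urlopen performs the upload. Keeping validation separate from sending makes it possible to reject a bad file before the pre-signed URL is used.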
Alternative: Using file_url
Instead of uploading, you can provide a file_url when creating the document to have Beltic download the file.
Step 3: Monitor Processing Status
After uploading, the document status changes to processing. Poll the Get Document endpoint to check the status.
Document Statuses
- pending_upload: Document created, waiting for file upload
- processing: File uploaded, extraction in progress
- completed: Extraction finished successfully
- failed: Processing failed (check error_code and error_message)
Polling Strategy
Recommended Polling
- Interval: Poll every 5-10 seconds for most documents
- Timeout: Set a maximum wait time (e.g., 5 minutes) based on your use case
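The interval and timeout recommendations can be combined into a single polling loop. This is a sketch under stated assumptions: fetch_document is a hypothetical callable that calls the Get Document endpoint and returns a dict with a top-level "status" key; adapt the lookup if the real response nests the status elsewhere.

```python
import time
from typing import Callable

# Statuses after which polling should stop, per the list above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_document(
    fetch_document: Callable[[], dict],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the document reaches a terminal status or the timeout hits.

    fetch_document is a hypothetical wrapper around the Get Document
    endpoint; it is assumed to return a dict containing "status".
    """
    deadline = clock() + timeout_seconds
    while True:
        document = fetch_document()
        if document.get("status") in TERMINAL_STATUSES:
            return document
        # Stop if the next poll would land after the deadline.
        if clock() + interval_seconds > deadline:
            raise TimeoutError(
                f"Document still not finished after {timeout_seconds} seconds"
            )
        sleep(interval_seconds)
```

The sleep and clock parameters exist only so the loop can be tested without waiting; in normal use the defaults are enough.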
Step 4: Retrieve Extracted Data
Once the document status is completed, the extracted data is available in the document response.
Response Structure
Extracted Data Format
The extracted_data object matches your extraction schema. Fields that couldn’t be extracted will be null.
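Because unextracted fields come back as null, it is worth listing them before the data is used. The sketch below walks an extracted_data dict and reports null fields by dotted path. The function name and the sample field names are hypothetical, not part of the API.

```python
def missing_fields(extracted_data: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of fields whose value is None (JSON null)."""
    missing = []
    for key, value in extracted_data.items():
        path = f"{prefix}{key}"
        if value is None:
            missing.append(path)
        elif isinstance(value, dict):
            # Recurse into nested objects defined by the schema.
            missing.extend(missing_fields(value, prefix=f"{path}."))
    return missing


# Hypothetical sample; real field names come from your extraction schema.
sample = {
    "invoice_number": "INV-1",
    "total": None,
    "vendor": {"name": "Acme", "tax_id": None},
}
print(missing_fields(sample))  # ['total', 'vendor.tax_id']
```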
Fraud Analysis
If fraud detection is enabled, the response includes fraud_result with:
- score: Risk assessment level - one of "NORMAL", "TRUSTED", "WARNING", or "HIGH_RISK"
- file_metadata: Extracted file metadata including:
  - producer: Software that created the PDF
  - creator: Application used to create the document
  - creation_date: When the file was created
  - mod_date: When the file was last modified
  - author: Document author
  - title, keywords, subject: Additional metadata fields
- indicators: Array of fraud detection indicators, each containing:
  - id: Unique indicator identifier
  - type: Indicator type - "RISK" (negative), "TRUST" (positive), or "INFO" (neutral)
  - category: Indicator category
  - title: Short indicator title
  - description: Detailed indicator description
  - origin: Source of indicator - "FRAUD" or "QUALITY"
- document_classification: Document type classification with:
  - id: Classification identifier
  - type: Document type
  - document_class_type: Document class category
  - detailed_type: Specific document subtype
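One way to consume fraud_result is to reduce it to a short summary for downstream review. The sketch below uses only the score and indicators fields described above. The function name is hypothetical, and the rule that "WARNING" and "HIGH_RISK" require manual review is an assumed policy, not something the API prescribes.

```python
# Assumed policy: these scores are routed to manual review.
REVIEW_SCORES = {"WARNING", "HIGH_RISK"}


def summarize_fraud_result(fraud_result: dict) -> dict:
    """Collect the score and the titles of all RISK-type indicators."""
    indicators = fraud_result.get("indicators") or []
    risk_titles = [
        indicator.get("title")
        for indicator in indicators
        if indicator.get("type") == "RISK"
    ]
    score = fraud_result.get("score")
    return {
        "score": score,
        "needs_review": score in REVIEW_SCORES,
        "risk_titles": risk_titles,
    }
```

TRUST and INFO indicators are deliberately ignored here; a fuller integration might surface them alongside the risk indicators.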
Error Handling
When document processing fails, the document status is set to failed and the processing_errors array contains one or more error objects. Each error follows the JSON:API error format.
Error Codes
All error codes are sanitized and safe to expose to API consumers. Vendor names and internal implementation details are removed from error messages.
PROCESSING_FAILED
General document processing failure. This error occurs when both extraction and fraud detection fail, or when an unexpected error occurs during processing.
Possible causes:
- Multiple processing steps failed simultaneously
- Internal service error
- Unhandled exception during processing
EXTRACTION_FAILED
The AI-powered data extraction process failed. This can occur due to document quality issues, unsupported formats, or extraction service errors.
Possible causes:
- Unsupported document format or structure
- Extraction service unavailable or timeout
- Document content doesn’t match the extraction schema
Resolution:
- Verify document format is supported (PDF, PNG, JPEG)
- Check that the extraction schema matches the document structure
- Retry with a different file if the issue persists
FRAUD_CHECK_FAILED
The fraud detection analysis failed. This occurs when the fraud detection service encounters an error while analyzing document authenticity.
Possible causes:
- Fraud detection service unavailable
- Document format incompatible with fraud analysis
- Internal fraud detection processing error
TIMEOUT
Document processing exceeded the maximum allowed time. Processing was terminated to prevent indefinite waiting.
Possible causes:
- Very large or complex documents
- Slow processing services
- Network latency issues
Resolution:
- Try processing the document again
- For large documents, consider splitting into smaller files
- If the issue persists, contact support.
FILE_TOO_LARGE
The uploaded file exceeds the maximum allowed size of 50MB.
Resolution:
- Split large documents into smaller files
- Use a file compression tool to reduce file size
- Consider using a different file format that results in a smaller file size
INVALID_FILE
The file could not be processed due to format issues or corruption.
Possible causes:
- Unsupported file format
- Corrupted or damaged file
- File is not a valid document (e.g., empty file, wrong MIME type)
- File structure is incompatible with processing
Resolution:
- Verify the file format is supported (PDF, PNG, JPEG)
- Ensure the file is not corrupted
- Try re-saving or re-exporting the document
- Check that the file is a valid document and not empty
DOCUMENT_NOT_FOUND
The file at the provided file_url could not be accessed or downloaded when creating a document.
Possible causes:
- Invalid or malformed URL in file_url
- File does not exist at the provided URL
- URL requires authentication that wasn’t provided
- Network error preventing file download
- URL is inaccessible (blocked, expired, or requires special permissions)
- Server hosting the file returned an error (404, 403, 500, etc.)
Resolution:
- Verify the file_url is correct and accessible
- Test the URL in a browser or with curl to ensure it’s reachable
- Ensure the URL doesn’t require authentication or special headers
- Check that the file server is not blocking requests from Beltic’s IP addresses
- For private files, use the pre-signed upload URL method instead of file_url
- Verify the URL hasn’t expired (for time-limited URLs)
Handling Errors in Code
Always check the document status field and handle the failed status appropriately.
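A failed document can be turned into actionable messages by mapping each error code to the resolution this guide gives for it. The following is a sketch under stated assumptions: the document is a dict with "status" and "processing_errors" at the top level (adjust if your response nests them under attributes), and each error object carries the "code" and "detail" members defined by the JSON:API error format. The function name is hypothetical.

```python
# Resolutions condensed from the Error Codes section of this guide.
SUGGESTED_ACTIONS = {
    "TIMEOUT": "Retry; for large documents, split into smaller files.",
    "FILE_TOO_LARGE": "Split or compress the file to stay under 50MB.",
    "INVALID_FILE": "Check that the file is a valid, non-empty PDF, PNG, or JPEG.",
    "EXTRACTION_FAILED": "Check the file format and that the schema matches the document.",
    "DOCUMENT_NOT_FOUND": "Verify that file_url is correct and reachable.",
}
DEFAULT_ACTION = "No specific resolution is documented for this code."


def describe_failure(document: dict) -> list[tuple[str, str, str]]:
    """Return (code, detail, suggested_action) for each processing error.

    Returns an empty list when the document has not failed.
    """
    if document.get("status") != "failed":
        return []
    results = []
    for error in document.get("processing_errors") or []:
        code = error.get("code", "UNKNOWN")
        detail = error.get("detail", "")
        results.append((code, detail, SUGGESTED_ACTIONS.get(code, DEFAULT_ACTION)))
    return results
```

Codes without a documented resolution, such as PROCESSING_FAILED, fall through to the default message rather than a guessed remedy.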
Error Response Format
When a document fails, the response includes a processing_errors array in the document attributes.
Best Practices
Schema Design
- Design schemas with realistic expectations - not all fields may be extractable
- Provide clear description fields to guide extraction
- Test schemas with sample documents before production use
File Quality
- Use high-quality scans or PDFs for best extraction results
- Ensure text is readable and not obscured
- Avoid heavily compressed or low-resolution images
- For multi-page documents, ensure all pages are included
Performance
- Use templates for repeated document types to reduce configuration overhead
- Cache template IDs to avoid repeated lookups
- Handle rate limits appropriately (429 status code)
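Handling a 429 response usually means waiting and retrying. The sketch below applies exponential backoff with jitter and honors a numeric Retry-After header when one is present. Whether this API sends Retry-After is not stated in this guide, so treat that branch as an assumption. The send callable is hypothetical: it performs one request and returns (status_code, headers, body).

```python
import random
import time
from typing import Any, Callable, Tuple


def call_with_rate_limit_retry(
    send: Callable[[], Tuple[int, dict, Any]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Retry a request while the server answers 429 Too Many Requests."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # No point sleeping after the final attempt.
        retry_after = headers.get("Retry-After")
        if retry_after is not None and str(retry_after).isdigit():
            # Assumption: the header, if sent, is a number of seconds.
            delay = float(retry_after)
        else:
            # Exponential backoff plus jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    raise RuntimeError(f"Rate limited: gave up after {max_attempts} attempts")
```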
Security
- Never expose API keys in client-side code
- Use secure file URLs when providing file_url (HTTPS only)
- Validate extracted data before using in business logic
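The HTTPS-only recommendation for file_url can be enforced with a small client-side check before the document is created. This is a minimal sketch with a hypothetical helper name. It checks only the URL's scheme and host, not whether the file is actually reachable.

```python
from urllib.parse import urlparse


def is_secure_file_url(file_url: str) -> bool:
    """Return True only for absolute URLs that use the https scheme."""
    parsed = urlparse(file_url)
    return parsed.scheme == "https" and bool(parsed.netloc)
```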