Overview

Document extraction allows you to automatically extract structured data from documents using AI-powered processing. This guide walks you through the complete workflow from document creation to retrieving extracted results.

Extraction Workflow

The extraction process follows these steps:
  1. Create a document - Initialize a document with extraction configuration
  2. Upload the file - Upload your document file using the pre-signed URL
  3. Monitor processing - Poll the document status until extraction completes
  4. Retrieve results - Access the extracted data from the document response

Step 1: Create a Document

You can create a document in two ways: using a template or with ad-hoc configuration.

Option A: Using a Template

If you have a document template set up, reference it when creating the document:
curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "document_template_id": "123e4567-e89b-12d3-a456-426614174000"
      }
    }
  }'
The template provides the extraction schema and fraud detection settings automatically.

Option B: Ad-Hoc Configuration

Provide extraction configuration directly in the request:
curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "extraction_config": {
          "enabled": true,
          "schema": {
            "type": "object",
            "properties": {
              "invoice_number": {
                "type": ["string", "null"],
                "description": "Invoice identifier"
              },
              "amount": {
                "type": ["number", "null"],
                "description": "Total invoice amount"
              },
              "date": {
                "type": ["string", "null"],
                "description": "Invoice date"
              },
              "vendor": {
                "type": ["string", "null"],
                "description": "Vendor name"
              }
            },
            "required": ["invoice_number", "amount", "date", "vendor"]
          },
          "extraction_rules": "Extract all invoice details accurately."
        },
        "fraud_config": {
          "enabled": true
        }
      }
    }
  }'
See the Schema reference for schema requirements.
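When several document types share the same request shape, the ad-hoc body can be assembled programmatically. The sketch below is one way to do that; `buildAdHocDocumentBody` is a hypothetical helper (not part of the API), and it assumes every field should accept null and be listed as required, as in the request above.

```javascript
// Hypothetical helper: builds the create-document body for ad-hoc extraction.
// Field names mirror the curl request above; the helper itself is an illustration.
function buildAdHocDocumentBody(fields, extractionRules, fraudEnabled = true) {
  const properties = {};
  for (const [name, { type, description }] of Object.entries(fields)) {
    // Pair each declared type with "null" so unextractable fields can come back as null.
    properties[name] = { type: [type, 'null'], description };
  }
  return {
    data: {
      type: 'document',
      meta: {
        extraction_config: {
          enabled: true,
          schema: { type: 'object', properties, required: Object.keys(fields) },
          extraction_rules: extractionRules
        },
        fraud_config: { enabled: fraudEnabled }
      }
    }
  };
}
```

Passing the result through `JSON.stringify` gives a payload equivalent to the `-d` argument shown above.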

Response

The API returns a document object with a pre-signed upload URL:
{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "pending_upload",
      "created_at": "2024-01-15T10:30:00Z"
    }
  },
  "meta": {
    "presigned_upload_url": "https://files.beltic.com/9941d601-0dd4-4a33-8043-4f158b480f0e/file?X-Amz-Signature=...",
    "expires_in": 3600
  }
}
Important: The pre-signed URL expires after the number of seconds given in expires_in (3600 in the example above, i.e. 1 hour, which is typical). Upload your file before it expires.
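If the upload does not happen immediately, it can help to record when the URL stops being valid. The following is a minimal sketch, assuming expires_in is expressed in seconds (consistent with 3600 corresponding to 1 hour); both function names and the 30-second safety margin are illustrative choices.

```javascript
// Assumption: expires_in is a duration in seconds counted from when the response was received.
function uploadDeadline(receivedAtMs, expiresInSeconds) {
  return new Date(receivedAtMs + expiresInSeconds * 1000);
}

// Treat the URL as expired slightly early so an upload is not started at the last moment.
function isUploadUrlExpired(deadline, nowMs = Date.now(), safetyMarginMs = 30000) {
  return nowMs + safetyMarginMs >= deadline.getTime();
}
```

If the URL has expired, creating a new document to obtain a fresh URL is the safe fallback; this guide does not describe a way to renew an existing one.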

Step 2: Upload the File

Upload your document file to the pre-signed URL using a PUT request:
curl -X PUT "{presigned_upload_url}" \
  -H "Content-Type: application/pdf" \
  --data-binary @invoice.pdf

File Requirements

  • Maximum file size: 50MB
  • Supported formats: PDF, images (PNG, JPEG)
  • Content-Type: Should match the file type (e.g., application/pdf for PDFs)
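Checking these requirements before uploading avoids a round trip that would end in a failed document. The sketch below is illustrative: `checkUpload` is a hypothetical helper, the extension-to-type mapping uses the standard MIME types for the listed formats, and "50MB" is read as 50 × 1024 × 1024 bytes, which the guide does not confirm.

```javascript
// Assumption: "50MB" means 50 * 1024 * 1024 bytes; the guide does not say which convention applies.
const MAX_FILE_BYTES = 50 * 1024 * 1024;

// Standard MIME types for the supported formats, keyed by file extension.
const CONTENT_TYPES = new Map([
  ['pdf', 'application/pdf'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg']
]);

// Hypothetical helper: returns the Content-Type to send, or a reason not to upload.
function checkUpload(fileName, sizeInBytes) {
  const extension = fileName.split('.').pop().toLowerCase();
  const contentType = CONTENT_TYPES.get(extension);
  if (!contentType) {
    return { ok: false, reason: `Unsupported format: ${extension}` };
  }
  if (sizeInBytes === 0) {
    return { ok: false, reason: 'File is empty' };
  }
  if (sizeInBytes > MAX_FILE_BYTES) {
    return { ok: false, reason: 'File exceeds the 50MB limit' };
  }
  return { ok: true, contentType };
}
```

The returned `contentType` can be used directly as the Content-Type header of the PUT request.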

Alternative: Using file_url

Instead of uploading, you can provide a file_url when creating the document to have Beltic download the file:
curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "document_template_id": "123e4567-e89b-12d3-a456-426614174000",
        "file_url": "https://example.com/documents/invoice.pdf"
      }
    }
  }'
This skips the upload step, and processing begins automatically once the file is downloaded.
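Because Beltic has to download the file itself, a malformed or non-HTTPS URL is better caught before the request is sent (the Best Practices section recommends HTTPS only). A minimal sketch, where `isUsableFileUrl` is a hypothetical name and the check covers only syntax and scheme, not reachability:

```javascript
// Returns true only for syntactically valid URLs that use the https: scheme.
// It does not verify that the file exists or is publicly reachable.
function isUsableFileUrl(fileUrl) {
  try {
    const parsed = new URL(fileUrl);
    return parsed.protocol === 'https:';
  } catch {
    // new URL() throws a TypeError for strings it cannot parse.
    return false;
  }
}
```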

Step 3: Monitor Processing Status

After uploading, the document status changes to processing. Poll the Get Document endpoint to check status:
curl -X GET https://api.beltic.com/v1/documents/{document_id} \
  -H "X-Api-Key: YOUR_API_KEY"

Document Statuses

  • pending_upload: Document created, waiting for file upload
  • processing: File uploaded, extraction in progress
  • completed: Extraction finished successfully
  • failed: Processing failed (check the processing_errors array for each error's code, title, and detail)

Polling Strategy
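One common approach is to poll with an increasing delay and a cap on the number of attempts, stopping as soon as the status is completed or failed. The sketch below illustrates that idea; the delays, the attempt limit, and the `pollDocument` name are illustrative assumptions, not values documented by the API. `fetchDocument` is any function that returns the parsed Get Document response.

```javascript
const TERMINAL_STATUSES = new Set(['completed', 'failed']);

function defaultSleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Polls until the document reaches a terminal status, doubling the delay up to maxDelayMs.
async function pollDocument(fetchDocument, options = {}) {
  const {
    initialDelayMs = 2000, // assumed starting interval
    maxDelayMs = 30000,    // assumed upper bound on the interval
    maxAttempts = 20,      // assumed limit; tune to your documents
    sleep = defaultSleep
  } = options;

  let delayMs = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = await fetchDocument();
    if (TERMINAL_STATUSES.has(doc.data.attributes.status)) {
      return doc;
    }
    if (attempt < maxAttempts) {
      await sleep(delayMs);
      delayMs = Math.min(delayMs * 2, maxDelayMs);
    }
  }
  throw new Error(`Document not finished after ${maxAttempts} polls`);
}
```

In practice `fetchDocument` would wrap the GET request shown above, for example `() => fetch(url, { headers }).then(r => r.json())`.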

Step 4: Retrieve Extracted Data

Once the document status is completed, the extracted data is available in the document response:
curl -X GET https://api.beltic.com/v1/documents/{document_id} \
  -H "X-Api-Key: YOUR_API_KEY"

Response Structure

{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "completed",
      "extracted_data": {
        "invoice_number": "INV-2024-001",
        "amount": 1250.50,
        "date": "2024-01-15",
        "vendor": "Acme Corporation"
      },
      "fraud_result": {
        "score": "NORMAL",
        "file_metadata": {
          "producer": "Adobe PDF Library",
          "creator": "Microsoft Word",
          "creation_date": "2024-01-15T08:00:00Z",
          "mod_date": "2024-01-15T09:30:00Z",
          "author": "John Doe",
          "title": "Invoice",
          "keywords": null,
          "subject": null
        },
        "indicators": [
          {
            "id": "indicator-001",
            "type": "TRUST",
            "category": "Document Quality",
            "title": "High Quality Scan",
            "description": "Document shows high quality scanning with clear text",
            "origin": "QUALITY"
          }
        ],
        "document_classification": {
          "id": "class-001",
          "type": "invoice",
          "document_class_type": "financial",
          "detailed_type": "commercial_invoice"
        }
      },
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:32:15Z"
    }
  }
}

Extracted Data Format

The extracted_data object matches your extraction schema. Fields that couldn’t be extracted will be null.
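Since unextracted fields come back as null rather than being omitted, a quick way to see what is missing is to list the null-valued keys. A minimal sketch (`missingFields` is a hypothetical helper name, and it inspects top-level fields only):

```javascript
// Returns the names of top-level fields whose value is null.
function missingFields(extractedData) {
  return Object.entries(extractedData)
    .filter(([, value]) => value === null)
    .map(([name]) => name);
}
```

An empty array means every top-level field in the schema was extracted.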

Fraud Analysis

If fraud detection is enabled, the response includes fraud_result with:
  • score: Risk assessment level - one of "NORMAL", "TRUSTED", "WARNING", or "HIGH_RISK"
  • file_metadata: Extracted file metadata including:
    • producer: Software that created the PDF
    • creator: Application used to create the document
    • creation_date: When the file was created
    • mod_date: When the file was last modified
    • author: Document author
    • title, keywords, subject: Additional metadata fields
  • indicators: Array of fraud detection indicators, each containing:
    • id: Unique indicator identifier
    • type: Indicator type - "RISK" (negative), "TRUST" (positive), or "INFO" (neutral)
    • category: Indicator category
    • title: Short indicator title
    • description: Detailed indicator description
    • origin: Source of indicator - "FRAUD" or "QUALITY"
  • document_classification: Document type classification with:
    • id: Classification identifier
    • type: Document type
    • document_class_type: Document class category
    • detailed_type: Specific document subtype
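To act on the fraud result, it is often enough to look at the score and count the indicators by type. The sketch below does that; treating "WARNING" and "HIGH_RISK" as requiring manual review is an illustrative policy, not something the API prescribes, and `summarizeFraudResult` is a hypothetical helper.

```javascript
// Summarizes a fraud_result object: score, indicator counts by type, and a review flag.
function summarizeFraudResult(fraudResult) {
  const counts = { RISK: 0, TRUST: 0, INFO: 0 };
  for (const indicator of fraudResult.indicators ?? []) {
    // Only count the three documented indicator types.
    if (Object.hasOwn(counts, indicator.type)) {
      counts[indicator.type] += 1;
    }
  }
  // Assumed policy: anything above NORMAL/TRUSTED goes to a human reviewer.
  const needsReview = fraudResult.score === 'WARNING' || fraudResult.score === 'HIGH_RISK';
  return { score: fraudResult.score, counts, needsReview };
}
```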

Error Handling

When document processing fails, the document status is set to failed and the processing_errors array contains one or more error objects. Each error follows the JSON:API error format with the following structure:
{
  "status": "500",
  "code": "EXTRACTION_FAILED",
  "title": "Data extraction failed",
  "detail": "Additional sanitized error details (may be null)",
  "meta": {
    "occurred_at": "2024-01-15T10:32:15Z"
  }
}

Error Codes

All error codes are sanitized and safe to expose to API consumers. Vendor names and internal implementation details are removed from error messages.
General document processing failure. This error occurs when both extraction and fraud detection fail, or when an unexpected error occurs during processing.
Possible causes:
  • Multiple processing steps failed simultaneously
  • Internal service error
  • Unhandled exception during processing
Resolution: Retry the request. If the issue persists, try with a different file or contact support.
EXTRACTION_FAILED: The AI-powered data extraction process failed. This can occur due to document quality issues, unsupported formats, or extraction service errors.
Possible causes:
  • Unsupported document format or structure
  • Extraction service unavailable or timeout
  • Document content doesn’t match the extraction schema
Resolution:
  • Verify document format is supported (PDF, PNG, JPEG)
  • Check that the extraction schema matches the document structure
  • Retry with a different file if the issue persists
The fraud detection analysis failed. This occurs when the fraud detection service encounters an error while analyzing document authenticity.
Possible causes:
  • Fraud detection service unavailable
  • Document format incompatible with fraud analysis
  • Internal fraud detection processing error
Resolution: Retry the request. If the issue persists, try with a different file or contact support.
TIMEOUT: Document processing exceeded the maximum allowed time. Processing was terminated to prevent indefinite waiting.
Possible causes:
  • Very large or complex documents
  • Slow processing services
  • Network latency issues
Resolution:
  • Try processing the document again
  • For large documents, consider splitting into smaller files
  • If the issue persists, contact support.
FILE_TOO_LARGE: The uploaded file exceeds the maximum allowed size of 50MB.
Resolution:
  • Split large documents into smaller files
  • Use a file compression tool to reduce file size
  • Consider using a different file format that results in a smaller file size
INVALID_FILE: The file could not be processed due to format issues or corruption.
Possible causes:
  • Unsupported file format
  • Corrupted or damaged file
  • File is not a valid document (e.g., empty file, wrong MIME type)
  • File structure is incompatible with processing
Resolution:
  • Verify the file format is supported (PDF, PNG, JPEG)
  • Ensure the file is not corrupted
  • Try re-saving or re-exporting the document
  • Check that the file is a valid document and not empty
The file at the provided file_url could not be accessed or downloaded when creating a document.
Possible causes:
  • Invalid or malformed URL in file_url
  • File does not exist at the provided URL
  • URL requires authentication that wasn’t provided
  • Network error preventing file download
  • URL is inaccessible (blocked, expired, or requires special permissions)
  • Server hosting the file returned an error (404, 403, 500, etc.)
Resolution:
  • Verify the file_url is correct and accessible
  • Test the URL in a browser or with curl to ensure it’s reachable
  • Ensure the URL doesn’t require authentication or special headers
  • Check that the file server is not blocking requests from Beltic’s IP addresses
  • For private files, use the pre-signed upload URL method instead of file_url
  • Verify the URL hasn’t expired (for time-limited URLs)
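The resolutions above fall into two broad groups: errors where resubmitting may succeed, and errors where the file itself has to change first. The sketch below encodes one reading of that split, using only the error codes that appear in this guide's code samples; the grouping and the `suggestNextAction` name are assumptions rather than documented API behavior.

```javascript
// Codes for which resubmitting the same file cannot succeed (per the resolutions above).
const NEEDS_NEW_FILE = new Set(['FILE_TOO_LARGE', 'INVALID_FILE']);

// Returns 'none' when there are no errors, 'replace_file' when any error
// requires a different file, and 'retry' otherwise.
function suggestNextAction(processingErrors) {
  if (!Array.isArray(processingErrors) || processingErrors.length === 0) {
    return 'none';
  }
  const mustChangeFile = processingErrors.some(error => NEEDS_NEW_FILE.has(error.code));
  return mustChangeFile ? 'replace_file' : 'retry';
}
```

Unknown codes fall into the 'retry' group here, matching the general advice to retry and contact support if the problem persists.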

Handling Errors in Code

Always check the document status field and handle failed status appropriately:
async function checkDocumentStatus(documentId, apiKey) {
  const response = await fetch(`https://api.beltic.com/v1/documents/${documentId}`, {
    headers: { 'X-Api-Key': apiKey }
  });
  
  const data = await response.json();
  const status = data.data.attributes.status;
  
  if (status === 'failed') {
    // Fall back to an empty object so a missing or empty processing_errors array cannot throw
    const error = data.data.attributes.processing_errors?.[0] ?? {};
    console.error(`Error: ${error.code} (${error.status})`);
    console.error(`Title: ${error.title}`);
    console.error(`Detail: ${error.detail || 'No additional details'}`);
    
    // Handle specific error codes
    switch (error.code) {
      case 'FILE_TOO_LARGE':
        console.error('File is too large. Please compress or split the document.');
        break;
      case 'INVALID_FILE':
        console.error('File format is invalid or corrupted. Please check the file.');
        break;
      case 'TIMEOUT':
        console.error('Processing timed out. Please try again.');
        break;
      default:
        console.error('Processing failed. Please retry or contact support.');
    }
  }
  
  return data;
}

Error Response Format

When a document fails, the response includes a processing_errors array in the document attributes:
{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "failed",
      "processing_errors": [
        {
          "status": "500",
          "code": "EXTRACTION_FAILED",
          "title": "Data extraction failed",
          "detail": "Unable to extract data from document",
          "meta": {
            "occurred_at": "2024-01-15T10:32:15Z"
          }
        }
      ]
    }
  }
}

Best Practices

  • Design schemas with realistic expectations - not all fields may be extractable
  • Provide clear description fields to guide extraction
  • Test schemas with sample documents before production use
  • Use high-quality scans or PDFs for best extraction results
  • Ensure text is readable and not obscured
  • Avoid heavily compressed or low-resolution images
  • For multi-page documents, ensure all pages are included
  • Use templates for repeated document types to reduce configuration overhead
  • Cache template IDs to avoid repeated lookups
  • Handle rate limits appropriately (429 status code)
  • Never expose API keys in client-side code
  • Use secure file URLs when providing file_url (HTTPS only)
  • Validate extracted data before using in business logic
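For the rate-limit item, one common pattern is to retry a 429 response after a delay. The sketch below assumes the API may send the standard HTTP Retry-After header in seconds; this guide does not confirm that, so the code falls back to exponential backoff when the header is absent or not a number. `withRateLimitRetry` and its defaults are illustrative.

```javascript
function defaultSleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Calls sendRequest() and retries while the response status is 429, up to maxRetries times.
async function withRateLimitRetry(sendRequest, options = {}) {
  const { maxRetries = 3, fallbackDelayMs = 1000, sleep = defaultSleep } = options;

  for (let attempt = 0; ; attempt++) {
    const response = await sendRequest();
    if (response.status !== 429 || attempt >= maxRetries) {
      return response;
    }
    // Assumption: Retry-After, if present, is a number of seconds.
    const retryAfterSeconds = Number(response.headers.get('Retry-After'));
    const delayMs = Number.isFinite(retryAfterSeconds) && retryAfterSeconds > 0
      ? retryAfterSeconds * 1000
      : fallbackDelayMs * 2 ** attempt;
    await sleep(delayMs);
  }
}
```

A call site would look like `withRateLimitRetry(() => fetch(url, { headers }))`; the last 429 response is returned unchanged once the retries are exhausted, so callers should still check the status.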

Complete Example

Here’s a complete example using a template:
async function runExtraction(apiKey, templateId, file) {
  // Step 1: Create document
  const createResponse = await fetch('https://api.beltic.com/v1/documents', {
    method: 'POST',
    headers: {
      'X-Api-Key': apiKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      data: {
        type: 'document',
        meta: {
          document_template_id: templateId
        }
      }
    })
  });
  
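  // Stop early if the create request was rejected (for example 401 or 429)
  if (!createResponse.ok) {
    throw new Error(`Document creation failed with HTTP ${createResponse.status}`);
  }

  const createData = await createResponse.json();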
  const documentId = createData.data.id;
  const uploadUrl = createData.meta.presigned_upload_url;
  
  console.log(`Created document: ${documentId}`);
  
  // Step 2: Upload file
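  const uploadResponse = await fetch(uploadUrl, {
    method: 'PUT',
    headers: {
      'Content-Type': 'application/pdf'
    },
    body: file
  });

  // A rejected upload (for example an expired pre-signed URL) would otherwise go unnoticed
  if (!uploadResponse.ok) {
    throw new Error(`Upload failed with HTTP ${uploadResponse.status}`);
  }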
  
  console.log('File uploaded, processing...');
  
  // Step 3: Poll for completion, giving up after about 5 minutes (60 polls, 5 seconds apart)
  const maxAttempts = 60;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const docResponse = await fetch(`https://api.beltic.com/v1/documents/${documentId}`, {
      headers: { 'X-Api-Key': apiKey }
    });
    
    const docData = await docResponse.json();
    const status = docData.data.attributes.status;
    
    if (status === 'completed') {
      console.log('Extraction completed!');
      console.log(docData.data.attributes.extracted_data);
      return docData;
    } else if (status === 'failed') {
      const error = docData.data.attributes.processing_errors?.[0];
      console.error('Extraction failed!', error?.title || 'Unknown error');
      throw new Error(error?.detail || 'Document processing failed');
    }
    
    await new Promise(resolve => setTimeout(resolve, 5000));
  }

  throw new Error(`Document ${documentId} did not finish processing in time`);
}

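// Usage (Node.js 18+, ES module): read the file server-side so the API key never reaches a browser
import { readFile } from 'node:fs/promises';

const file = await readFile('invoice.pdf');
await runExtraction('YOUR_API_KEY', '123e4567-e89b-12d3-a456-426614174000', file);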