Running an Extraction

Overview

Document extraction allows you to automatically extract structured data from documents using AI-powered processing. This guide walks you through the complete workflow from document creation to retrieving extracted results.

Extraction Workflow

The extraction process follows these steps:

Create a document - Initialize a document with extraction configuration
Upload the file - Upload your document file using the pre-signed URL
Monitor processing - Poll the document status until extraction completes
Retrieve results - Access the extracted data from the document response

Step 1: Create a Document

You can create a document in two ways: using a template or with ad-hoc configuration.

Option A: Using a Template

If you have a document template set up, reference it when creating the document:

curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "document_template_id": "123e4567-e89b-12d3-a456-426614174000"
      }
    }
  }'

The template provides the extraction schema and fraud detection settings automatically.

Option B: Ad-Hoc Configuration

Provide extraction configuration directly in the request:

curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "extraction_config": {
          "enabled": true,
          "schema": {
            "type": "object",
            "properties": {
              "invoice_number": {
                "type": ["string", "null"],
                "description": "Invoice identifier"
              },
              "amount": {
                "type": ["number", "null"],
                "description": "Total invoice amount"
              },
              "date": {
                "type": ["string", "null"],
                "description": "Invoice date"
              },
              "vendor": {
                "type": ["string", "null"],
                "description": "Vendor name"
              }
            },
            "required": ["invoice_number", "amount", "date", "vendor"]
          },
          "extraction_rules": "Extract all invoice details accurately."
        },
        "fraud_config": {
          "enabled": true
        }
      }
    }
  }'

See the Schema reference for schema requirements.
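If you define many ad-hoc schemas, the nullable-field pattern used above gets repetitive. The following is a minimal sketch of one way to generate the meta object in code. The helper names nullableField and buildExtractionMeta are hypothetical and not part of the Beltic API; only the resulting JSON shape is taken from the request above.

```javascript
// Hypothetical helpers (not part of the Beltic API) that build the "meta"
// object shown in the ad-hoc request above.

// Describe a field whose value may be null when it cannot be extracted.
function nullableField(type, description) {
  return { type: [type, 'null'], description };
}

// fields maps each field name to a [jsonType, description] pair.
function buildExtractionMeta(fields, extractionRules, fraudEnabled = true) {
  const properties = {};
  for (const [name, [type, description]] of Object.entries(fields)) {
    properties[name] = nullableField(type, description);
  }
  return {
    extraction_config: {
      enabled: true,
      schema: {
        type: 'object',
        properties,
        // Every field is listed as required, mirroring the example above.
        required: Object.keys(fields),
      },
      extraction_rules: extractionRules,
    },
    fraud_config: { enabled: fraudEnabled },
  };
}

const meta = buildExtractionMeta(
  {
    invoice_number: ['string', 'Invoice identifier'],
    amount: ['number', 'Total invoice amount'],
  },
  'Extract all invoice details accurately.'
);

// Request body for POST /v1/documents.
const requestBody = JSON.stringify({ data: { type: 'document', meta } });
```

Because every field is both required and nullable, a field that cannot be extracted still appears in the result with a null value rather than being omitted.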

Response

The API returns a document object with a pre-signed upload URL:

{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "pending_upload",
      "created_at": "2024-01-15T10:30:00Z"
    }
  },
  "meta": {
    "presigned_upload_url": "https://files.beltic.com/9941d601-0dd4-4a33-8043-4f158b480f0e/file?X-Amz-Signature=...",
    "expires_in": 3600
  }
}

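Important: The pre-signed URL expires after the time specified in expires_in (3600 in the example above, which corresponds to 1 hour in seconds). Upload your file before it expires; if the URL has expired, create a new document to get a fresh one.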
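If the upload does not happen immediately after the create call (for example, the file is queued), it can help to track when the URL stops being usable. The sketch below assumes expires_in is expressed in seconds and counts from the moment the create response was received. The function names and the 30-second safety margin are illustrative choices, not API behavior.

```javascript
// Assumption: expires_in is in seconds, measured from when the create
// response was received. The safety margin is an arbitrary illustration.
function uploadDeadline(createResponseBody, receivedAtMs, safetyMarginMs = 30000) {
  const expiresInSeconds = createResponseBody.meta.expires_in;
  return receivedAtMs + expiresInSeconds * 1000 - safetyMarginMs;
}

// True while the pre-signed URL should still be safe to use.
function canStillUpload(deadlineMs, nowMs = Date.now()) {
  return nowMs < deadlineMs;
}

// Example with the response shown above (expires_in: 3600).
const deadline = uploadDeadline({ meta: { expires_in: 3600 } }, Date.now());
console.log(canStillUpload(deadline) ? 'URL still valid' : 'Create a new document');
```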

Step 2: Upload the File

Upload your document file to the pre-signed URL using a PUT request:

curl -X PUT "{presigned_upload_url}" \
  -H "Content-Type: application/pdf" \
  --data-binary @invoice.pdf

File Requirements

Maximum file size: 50MB
Supported formats: PDF, images (PNG, JPEG)
Content-Type: Should match the file type (e.g., application/pdf for PDFs)
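Checking these requirements before uploading avoids a round trip that ends in FILE_TOO_LARGE or INVALID_FILE. The following is a rough client-side sketch. It assumes that 50MB means 50 × 1024 × 1024 bytes and that the file type can be inferred from the file extension; neither is confirmed by this guide, and checkFile is a hypothetical helper.

```javascript
// Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
const MAX_FILE_BYTES = 50 * 1024 * 1024;

// Content-Type values for the formats listed above.
const CONTENT_TYPES = new Map([
  ['pdf', 'application/pdf'],
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
]);

// Returns { ok: true, contentType } or { ok: false, reason }.
function checkFile(fileName, sizeInBytes) {
  const dot = fileName.lastIndexOf('.');
  const extension = dot === -1 ? '' : fileName.slice(dot + 1).toLowerCase();
  const contentType = CONTENT_TYPES.get(extension);
  if (!contentType) {
    return { ok: false, reason: `Unsupported format: "${extension}"` };
  }
  if (sizeInBytes > MAX_FILE_BYTES) {
    return { ok: false, reason: 'File exceeds the 50MB limit' };
  }
  return { ok: true, contentType };
}

console.log(checkFile('invoice.pdf', 120000));
```

The returned contentType can be passed as the Content-Type header of the PUT request so that it matches the file type.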

Alternative: Using file_url

Instead of uploading, you can provide a file_url when creating the document to have Beltic download the file:

curl -X POST https://api.beltic.com/v1/documents \
  -H "X-Api-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": {
      "type": "document",
      "meta": {
        "document_template_id": "123e4567-e89b-12d3-a456-426614174000",
        "file_url": "https://example.com/documents/invoice.pdf"
      }
    }
  }'

This skips the upload step; processing begins automatically once Beltic has downloaded the file.

Step 3: Monitor Processing Status

After uploading, the document status changes to processing. Poll the Get Document endpoint to check status:

curl -X GET https://api.beltic.com/v1/documents/{document_id} \
  -H "X-Api-Key: YOUR_API_KEY"

Document Statuses

pending_upload: Document created, waiting for file upload
processing: File uploaded, extraction in progress
completed: Extraction finished successfully
failed: Processing failed (check the processing_errors array for the error code, title, and detail)

Polling Strategy

Recommended Polling

Interval: Poll every 5-10 seconds for most documents
Timeout: Set a maximum wait time (e.g., 5 minutes) based on your use case
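The interval and timeout recommendations above can be combined into a small polling helper. This is a sketch under stated assumptions, not an official client: waitForDocument and makeGetDocument are hypothetical names, and the defaults (5 seconds, 5 minutes) are simply the values suggested above. The fetch call is passed in as a function so the loop can be reused and tested without network access.

```javascript
// Polls getDocument() until the status is terminal or the timeout elapses.
// getDocument is any async function returning the parsed Get Document body.
async function waitForDocument(getDocument, { intervalMs = 5000, timeoutMs = 300000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const doc = await getDocument();
    const status = doc.data.attributes.status;
    if (status === 'completed' || status === 'failed') {
      return doc;
    }
    // Stop if the next poll would land after the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Document still ${status} after ${timeoutMs} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Example wiring (defined here, not called): poll the Get Document endpoint.
function makeGetDocument(documentId, apiKey) {
  return async () => {
    const response = await fetch(`https://api.beltic.com/v1/documents/${documentId}`, {
      headers: { 'X-Api-Key': apiKey },
    });
    return response.json();
  };
}
```

A call such as waitForDocument(makeGetDocument(id, key)) resolves with the final document for both completed and failed, so the caller still has to inspect the status.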

Step 4: Retrieve Extracted Data

Once the document status is completed, the extracted data is available in the document response:

curl -X GET https://api.beltic.com/v1/documents/{document_id} \
  -H "X-Api-Key: YOUR_API_KEY"

Response Structure

{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "completed",
      "extracted_data": {
        "invoice_number": "INV-2024-001",
        "amount": 1250.50,
        "date": "2024-01-15",
        "vendor": "Acme Corporation"
      },
      "fraud_result": {
        "score": "NORMAL",
        "file_metadata": {
          "producer": "Adobe PDF Library",
          "creator": "Microsoft Word",
          "creation_date": "2024-01-15T08:00:00Z",
          "mod_date": "2024-01-15T09:30:00Z",
          "author": "John Doe",
          "title": "Invoice",
          "keywords": null,
          "subject": null
        },
        "indicators": [
          {
            "id": "indicator-001",
            "type": "TRUST",
            "category": "Document Quality",
            "title": "High Quality Scan",
            "description": "Document shows high quality scanning with clear text",
            "origin": "QUALITY"
          }
        ],
        "document_classification": {
          "id": "class-001",
          "type": "invoice",
          "document_class_type": "financial",
          "detailed_type": "commercial_invoice"
        }
      },
      "created_at": "2024-01-15T10:30:00Z",
      "updated_at": "2024-01-15T10:32:15Z"
    }
  }
}

Extracted Data Format

The extracted_data object matches your extraction schema. Fields that couldn’t be extracted will be null.
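Since missing values are reported as null rather than omitted, it is worth checking for them before using the result. A minimal sketch follows; missingFields is a hypothetical helper, and it only inspects top-level fields.

```javascript
// Returns the names of top-level fields that came back as null.
function missingFields(extractedData) {
  return Object.entries(extractedData)
    .filter(([, value]) => value === null)
    .map(([name]) => name);
}

// Illustrative data: "vendor" could not be extracted.
const extracted = {
  invoice_number: 'INV-2024-001',
  amount: 1250.5,
  date: '2024-01-15',
  vendor: null,
};

const missing = missingFields(extracted);
if (missing.length > 0) {
  console.log(`Fields needing manual review: ${missing.join(', ')}`);
}
```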

Fraud Analysis

If fraud detection is enabled, the response includes fraud_result with:

score: Risk assessment level - one of "NORMAL", "TRUSTED", "WARNING", or "HIGH_RISK"
file_metadata: Extracted file metadata including:
- producer: Software that created the PDF
- creator: Application used to create the document
- creation_date: When the file was created
- mod_date: When the file was last modified
- author: Document author
- title, keywords, subject: Additional metadata fields
indicators: Array of fraud detection indicators, each containing:
- id: Unique indicator identifier
- type: Indicator type - "RISK" (negative), "TRUST" (positive), or "INFO" (neutral)
- category: Indicator category
- title: Short indicator title
- description: Detailed indicator description
- origin: Source of indicator - "FRAUD" or "QUALITY"
document_classification: Document type classification with:
- id: Classification identifier
- type: Document type
- document_class_type: Document class category
- detailed_type: Specific document subtype
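The fields above can be condensed into a short summary for downstream decisions. The sketch below counts indicators by type and flags documents for review. Treating "WARNING" and "HIGH_RISK" as needing review is an assumed policy for illustration, not a rule defined by the API, and summarizeFraudResult is a hypothetical helper.

```javascript
// Summarizes a fraud_result object as described above.
function summarizeFraudResult(fraudResult) {
  const counts = { RISK: 0, TRUST: 0, INFO: 0 };
  for (const indicator of fraudResult.indicators ?? []) {
    if (Object.hasOwn(counts, indicator.type)) {
      counts[indicator.type] += 1;
    }
  }
  // Assumed policy: these two scores are routed to manual review.
  const needsReview = fraudResult.score === 'WARNING' || fraudResult.score === 'HIGH_RISK';
  return { score: fraudResult.score, counts, needsReview };
}

// Example using the indicator from the response above.
const summary = summarizeFraudResult({
  score: 'NORMAL',
  indicators: [{ id: 'indicator-001', type: 'TRUST', origin: 'QUALITY' }],
});
console.log(summary);
```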

Error Handling

When document processing fails, the document status is set to failed and the processing_errors array contains one or more error objects. Each error follows the JSON:API error format with the following structure:

{
  "status": "500",
  "code": "EXTRACTION_FAILED",
  "title": "Data extraction failed",
  "detail": "Additional sanitized error details (may be null)",
  "meta": {
    "occurred_at": "2024-01-15T10:32:15Z"
  }
}

Error Codes

All error codes are sanitized and safe to expose to API consumers. Vendor names and internal implementation details are removed from error messages.

PROCESSING_FAILED

General document processing failure. This error occurs when both extraction and fraud detection fail, or when an unexpected error occurs during processing.

Possible causes:

Multiple processing steps failed simultaneously
Internal service error
Unhandled exception during processing

Resolution: Retry the request. If the issue persists, try with a different file or contact support.

EXTRACTION_FAILED

The AI-powered data extraction process failed. This can occur due to document quality issues, unsupported formats, or extraction service errors.

Possible causes:

Unsupported document format or structure
Extraction service unavailable or timeout
Document content doesn’t match the extraction schema

Resolution:

Verify document format is supported (PDF, PNG, JPEG)
Check that the extraction schema matches the document structure
Retry with a different file if the issue persists

FRAUD_CHECK_FAILED

The fraud detection analysis failed. This occurs when the fraud detection service encounters an error while analyzing document authenticity.

Possible causes:

Fraud detection service unavailable
Document format incompatible with fraud analysis
Internal fraud detection processing error

Resolution: Retry the request. If the issue persists, try with a different file or contact support.

TIMEOUT

Document processing exceeded the maximum allowed time. Processing was terminated to prevent indefinite waiting.

Possible causes:

Very large or complex documents
Slow processing services
Network latency issues

Resolution:

Try processing the document again
For large documents, consider splitting into smaller files
If the issue persists, contact support.

FILE_TOO_LARGE

The uploaded file exceeds the maximum allowed size of 50MB.

Resolution:

Split large documents into smaller files
Use a file compression tool to reduce file size
Consider using a different file format that results in a smaller file size

INVALID_FILE

The file could not be processed due to format issues or corruption.

Possible causes:

Unsupported file format
Corrupted or damaged file
File is not a valid document (e.g., empty file, wrong MIME type)
File structure is incompatible with processing

Resolution:

Verify the file format is supported (PDF, PNG, JPEG)
Ensure the file is not corrupted
Try re-saving or re-exporting the document
Check that the file is a valid document and not empty

DOCUMENT_NOT_FOUND

The file at the provided file_url could not be accessed or downloaded when creating a document.

Possible causes:

Invalid or malformed URL in file_url
File does not exist at the provided URL
URL requires authentication that wasn’t provided
Network error preventing file download
URL is inaccessible (blocked, expired, or requires special permissions)
Server hosting the file returned an error (404, 403, 500, etc.)

Resolution:

Verify the file_url is correct and accessible
Test the URL in a browser or with curl to ensure it’s reachable
Ensure the URL doesn’t require authentication or special headers
Check that the file server is not blocking requests from Beltic’s IP addresses
For private files, use the pre-signed upload URL method instead of file_url
Verify the URL hasn’t expired (for time-limited URLs)

Handling Errors in Code

Always check the document status field and handle failed status appropriately:

async function checkDocumentStatus(documentId, apiKey) {
  const response = await fetch(`https://api.beltic.com/v1/documents/${documentId}`, {
    headers: { 'X-Api-Key': apiKey }
  });

  // Surface HTTP-level failures (e.g., 401, 404, 429) before reading the body
  if (!response.ok) {
    throw new Error(`Get Document request failed with HTTP ${response.status}`);
  }

  const data = await response.json();
  const status = data.data.attributes.status;

  if (status === 'failed') {
    // processing_errors can contain several entries; report the first one
    const error = data.data.attributes.processing_errors?.[0];
    console.error(`Error: ${error?.code ?? 'UNKNOWN'} (${error?.status ?? 'n/a'})`);
    console.error(`Title: ${error?.title ?? 'Unknown error'}`);
    console.error(`Detail: ${error?.detail || 'No additional details'}`);

    // Handle specific error codes
    switch (error?.code) {
      case 'FILE_TOO_LARGE':
        console.error('File is too large. Please compress or split the document.');
        break;
      case 'INVALID_FILE':
        console.error('File format is invalid or corrupted. Please check the file.');
        break;
      case 'TIMEOUT':
        console.error('Processing timed out. Please try again.');
        break;
      default:
        console.error('Processing failed. Please retry or contact support.');
    }
  }

  return data;
}

Error Response Format

When a document fails, the response includes a processing_errors array in the document attributes:

{
  "data": {
    "type": "document",
    "id": "9941d601-0dd4-4a33-8043-4f158b480f0e",
    "attributes": {
      "status": "failed",
      "processing_errors": [
        {
          "status": "500",
          "code": "EXTRACTION_FAILED",
          "title": "Data extraction failed",
          "detail": "Unable to extract data from document",
          "meta": {
            "occurred_at": "2024-01-15T10:32:15Z"
          }
        }
      ]
    }
  }
}
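The resolutions listed under Error Codes fall into two broad groups: errors where retrying the same request is suggested, and errors where the file, URL, or schema needs to change first. The sketch below encodes that reading as a lookup. The grouping is an interpretation of the guidance above (EXTRACTION_FAILED is placed in the second group because its resolution starts with checking the format and schema), and nextAction is a hypothetical helper.

```javascript
// Interpretation of the resolutions in the Error Codes section.
const RETRYABLE = new Set(['PROCESSING_FAILED', 'FRAUD_CHECK_FAILED', 'TIMEOUT']);
const NEEDS_INPUT_FIX = new Set([
  'EXTRACTION_FAILED',
  'FILE_TOO_LARGE',
  'INVALID_FILE',
  'DOCUMENT_NOT_FOUND',
]);

// Returns 'fix_input', 'retry', or 'contact_support' for a processing_errors array.
function nextAction(processingErrors) {
  const codes = (processingErrors ?? []).map((error) => error.code);
  // Any input problem has to be fixed before a retry can succeed.
  if (codes.some((code) => NEEDS_INPUT_FIX.has(code))) {
    return 'fix_input';
  }
  if (codes.length > 0 && codes.every((code) => RETRYABLE.has(code))) {
    return 'retry';
  }
  // Empty or unrecognized codes: escalate rather than guess.
  return 'contact_support';
}

console.log(nextAction([{ code: 'TIMEOUT' }]));
```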

Best Practices

Schema Design

Design schemas with realistic expectations - not all fields may be extractable
Provide clear description fields to guide extraction
Test schemas with sample documents before production use

File Quality

Use high-quality scans or PDFs for best extraction results
Ensure text is readable and not obscured
Avoid heavily compressed or low-resolution images
For multi-page documents, ensure all pages are included

Performance

Use templates for repeated document types to reduce configuration overhead
Cache template IDs to avoid repeated lookups
Handle rate limits appropriately (429 status code)
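One common way to handle a 429 response is to wait and retry with an increasing delay. The sketch below is a generic pattern, not documented Beltic behavior: it honors the standard HTTP Retry-After header if the server sends one in seconds (whether this API does so is an assumption) and otherwise falls back to exponential backoff. The function name and default values are illustrative.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// doFetch is any async function returning a fetch Response.
// Retries only on HTTP 429, up to maxAttempts requests in total.
async function fetchWithRateLimitRetry(doFetch, { maxAttempts = 4, baseDelayMs = 1000 } = {}) {
  for (let attempt = 1; ; attempt += 1) {
    const response = await doFetch();
    if (response.status !== 429 || attempt >= maxAttempts) {
      return response;
    }
    // Assumption: Retry-After, when present, is a number of seconds.
    const retryAfterSeconds = Number(response.headers.get('Retry-After'));
    const delayMs = Number.isFinite(retryAfterSeconds) && retryAfterSeconds > 0
      ? retryAfterSeconds * 1000
      : baseDelayMs * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
    await sleep(delayMs);
  }
}
```

For example, the status request could be wrapped as fetchWithRateLimitRetry(() => fetch(url, { headers })). If the last attempt is still rate limited, the 429 response is returned so the caller can decide what to do.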

Security

Never expose API keys in client-side code
Use secure file URLs when providing file_url (HTTPS only)
Validate extracted data before using in business logic
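As an illustration of the last point, extracted values can be checked against simple business rules before they are stored or acted on. The rules below (non-empty invoice number, non-negative finite amount, parseable date) are examples chosen for the invoice schema used in this guide, and validateInvoice is a hypothetical helper.

```javascript
// Returns a list of problems; an empty list means the data passed all checks.
function validateInvoice(extracted) {
  const problems = [];
  if (typeof extracted.invoice_number !== 'string' || extracted.invoice_number.trim() === '') {
    problems.push('invoice_number is missing or empty');
  }
  if (typeof extracted.amount !== 'number' || !Number.isFinite(extracted.amount) || extracted.amount < 0) {
    problems.push('amount is missing or not a non-negative number');
  }
  if (typeof extracted.date !== 'string' || Number.isNaN(Date.parse(extracted.date))) {
    problems.push('date is missing or not a parseable date');
  }
  return problems;
}

// Example using the extracted_data from the response shown earlier.
const problems = validateInvoice({
  invoice_number: 'INV-2024-001',
  amount: 1250.5,
  date: '2024-01-15',
  vendor: 'Acme Corporation',
});
console.log(problems.length === 0 ? 'Data looks valid' : problems);
```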

Complete Example

Here’s a complete example using a template:

async function runExtraction(apiKey, templateId, file) {
  // Step 1: Create document
  const createResponse = await fetch('https://api.beltic.com/v1/documents', {
    method: 'POST',
    headers: {
      'X-Api-Key': apiKey,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      data: {
        type: 'document',
        meta: {
          document_template_id: templateId
        }
      }
    })
  });
  
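  // Stop early if the create call was rejected (e.g., invalid key or template ID)
  if (!createResponse.ok) {
    throw new Error(`Create document failed with HTTP ${createResponse.status}`);
  }

  const createData = await createResponse.json();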
  const documentId = createData.data.id;
  const uploadUrl = createData.meta.presigned_upload_url;
  
  console.log(`Created document: ${documentId}`);
  
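  // Step 2: Upload file
  const uploadResponse = await fetch(uploadUrl, {
    method: 'PUT',
    headers: {
      // Must match the file type (see File Requirements)
      'Content-Type': 'application/pdf'
    },
    body: file
  });

  // Without this check, a rejected upload would leave the poll loop waiting on pending_upload
  if (!uploadResponse.ok) {
    throw new Error(`Upload failed with HTTP ${uploadResponse.status}`);
  }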
  
  console.log('File uploaded, processing...');
  
  // Step 3: Poll every 5 seconds, giving up after 5 minutes
  const deadline = Date.now() + 5 * 60 * 1000;
  while (Date.now() < deadline) {
    const docResponse = await fetch(`https://api.beltic.com/v1/documents/${documentId}`, {
      headers: { 'X-Api-Key': apiKey }
    });
    
    const docData = await docResponse.json();
    const status = docData.data.attributes.status;
    
    if (status === 'completed') {
      console.log('Extraction completed!');
      console.log(docData.data.attributes.extracted_data);
      return docData;
    } else if (status === 'failed') {
      const error = docData.data.attributes.processing_errors?.[0];
      console.error('Extraction failed!', error?.title || 'Unknown error');
      throw new Error(error?.detail || 'Document processing failed');
    }
    
    await new Promise(resolve => setTimeout(resolve, 5000));
  }

  throw new Error('Timed out waiting for document processing');
}

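// Usage (Node.js 18+, ES module): reading the file from disk keeps the API key server-side
import { readFile } from 'node:fs/promises';

const file = await readFile('invoice.pdf');
await runExtraction('YOUR_API_KEY', '123e4567-e89b-12d3-a456-426614174000', file);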
