New Open-Source Tool 'Lift' Extracts Structured JSON from PDFs with 90% Accuracy, Outperforming Azure and NuExtract3
Summary
A new open-source tool called 'lift' has launched on GitHub. It extracts structured JSON from PDFs and images with 90.2% field accuracy across 225 benchmark documents, outperforming Azure Content Understanding and NuExtract3. A managed Datalab API version raises accuracy to 95.9%.
Key Points
- A new open-source tool called 'lift' is now available on GitHub, enabling fast and accurate extraction of structured JSON data from PDFs and images using a 9B vision model with schema-constrained decoding.
- Benchmark tests across 225 documents show lift achieving 90.2% field accuracy, outperforming competitors like Azure Content Understanding and NuExtract3, while a managed Datalab API version reaches 95.9% accuracy with added features like citations and confidence scores.
- The tool supports easy installation via pip, offers CLI and Python API usage, includes a Schema Studio app for building and testing schemas, and provides a vLLM server option for production and batch processing deployments.
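To make the "field accuracy" figure above more concrete, here is a minimal Python sketch of how such a score could be computed for one document. It does not call lift and does not reproduce the benchmark's scoring rules, which are not described here. The schema, the field names, the sample values, and the exact-match rule are all assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical target schema in JSON Schema style (an assumption for this
# example, not taken from lift's documentation).
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total", "currency"],
}


def missing_required(record: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return the required field names that are absent from the record."""
    return [name for name in schema.get("required", []) if name not in record]


def field_accuracy(predicted: dict[str, Any], expected: dict[str, Any]) -> float:
    """Fraction of expected fields whose predicted value matches exactly.

    Exact match is an assumed scoring rule; a real benchmark may normalize
    values or allow partial credit.
    """
    if not expected:
        return 1.0
    correct = sum(
        1
        for key, value in expected.items()
        if key in predicted and predicted[key] == value
    )
    return correct / len(expected)


# Made-up ground truth and extraction output for a single document.
expected = {"invoice_number": "INV-001", "total": 125.5, "currency": "EUR"}
predicted = {"invoice_number": "INV-001", "total": 125.5, "currency": "USD"}

score = field_accuracy(predicted, expected)
print(f"missing required fields: {missing_required(predicted, INVOICE_SCHEMA)}")
print(f"field accuracy: {score:.3f}")
```

In this made-up case two of three fields match, so the score is about 0.667. A benchmark-level number would presumably aggregate such scores across all documents, though the aggregation method used for the 225-document benchmark is not stated.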