IBM, NVIDIA, and Red Hat Launch Open, AI-Native Document Format to Replace PDF with Up to 30x Lower Token Costs
Summary
IBM, NVIDIA, and Red Hat are leading the development of DocLang, an open, AI-native document format intended to replace PDFs in enterprise AI pipelines. Early benchmarks show up to 30x lower token costs, and the format is designed to preserve semantic structure and reduce hallucination risk.
Key Points
- A new working group under the LF AI & Data Foundation, founded by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, is developing DocLang, an open AI-native document format designed to replace formats like PDF, HTML, and Markdown for enterprise AI pipelines.
- DocLang uses a limited XML vocabulary optimized on a 1-to-1 basis for LLM tokenizers. It preserves the semantic structure, layout, and governance metadata that current formats lose during AI processing, which is intended to reduce hallucination risk and improve output accuracy.
- Early benchmarks show DocLang delivering 4x to more than 30x lower token costs than PDFs. In a real-world test on IBM's 2025 annual report, the new format produced fewer input tokens, lower latency, and better AI output quality.
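
To give a rough sense of what a 4x to 30x token reduction means for cost, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `token_savings`, the token counts, and the price per million tokens are hypothetical values chosen for the example, not figures from the DocLang benchmarks or the IBM annual report test.

```python
def token_savings(baseline_tokens: int, compact_tokens: int,
                  price_per_million: float) -> dict:
    """Compare the input-token cost of two encodings of the same document.

    baseline_tokens:   tokens when the document is fed in its original form
    compact_tokens:    tokens when the same document uses a more compact format
    price_per_million: assumed input price in dollars per one million tokens
    """
    if baseline_tokens <= 0 or compact_tokens <= 0:
        raise ValueError("token counts must be positive")

    # Cost scales linearly with the number of input tokens.
    baseline_cost = baseline_tokens / 1_000_000 * price_per_million
    compact_cost = compact_tokens / 1_000_000 * price_per_million

    return {
        "reduction_factor": baseline_tokens / compact_tokens,
        "baseline_cost": baseline_cost,
        "compact_cost": compact_cost,
        "saved": baseline_cost - compact_cost,
    }


# Hypothetical inputs: a 1.2M-token document reduced to 40k tokens,
# priced at an assumed $3.00 per million input tokens.
result = token_savings(baseline_tokens=1_200_000,
                       compact_tokens=40_000,
                       price_per_million=3.0)
print(f"{result['reduction_factor']:.0f}x fewer tokens, "
      f"${result['saved']:.2f} saved per pass")
```

Because input cost is linear in token count, the reduction factor applies directly to the per-request input bill. In a pipeline that reprocesses the same documents many times, the savings multiply by the number of passes.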