+
+
+
+
+
+
+
+[](https://github.com/opendatalab/MinerU)
+[](https://github.com/opendatalab/MinerU/issues)
+[](https://badge.fury.io/py/magic-pdf)
+[](https://pepy.tech/project/magic-pdf)
+
+[](https://opendatalab.com/OpenSourceTools/Extractor/PDF)
+[](https://huggingface.co/spaces/opendatalab/MinerU)
+[](https://www.modelscope.cn/studios/OpenDataLab/MinerU)
+[](https://colab.research.google.com/gist/myhloli/3b3a00a4a0a61577b6c30f989092d20d/mineru_demo.ipynb)
+[](https://arxiv.org/abs/2409.18839)
+
+

+
+
+
+[English](README.md) | [简体中文](README_zh-CN.md)
+
+
+
+
+PDF-Extract-Kit: High-Quality PDF Extraction Toolkit🔥🔥🔥
+
+
+
+
+
+ 👋 join us on Discord and WeChat
+
+
+
+
+
+ read more docs on Read The Docs
+
+
+
+
+# Changelog
+- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition.
+- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring that addresses numerous issues, improves performance, reduces hardware requirements, and enhances usability:
+  - Refactored the sorting module to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring high accuracy across various layouts.
+  - Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.
+  - Refactored the list and table-of-contents recognition functions, significantly improving the accuracy of list blocks and table-of-contents blocks, as well as the parsing of the corresponding text paragraphs.
+  - Refactored the matching logic for figures, tables, and descriptive text, greatly improving the accuracy of matching captions and footnotes to figures and tables and reducing the loss rate of descriptive text to near zero.
+  - Added multi-language support for OCR, covering detection and recognition of 84 languages. For the list of supported languages, see [OCR Language Support List](https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations).
+  - Added GPU memory recycling logic and other GPU memory optimizations, significantly reducing GPU memory usage. The GPU memory required to enable all acceleration features except table acceleration (layout/formula/OCR) has dropped from 16GB to 8GB, and the requirement for enabling all acceleration features has dropped from 24GB to 10GB.
+  - Optimized the feature switches in the configuration file, adding an independent formula detection switch that significantly improves speed and parsing results when formula detection is not needed.
+  - Integrated [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit):
+    - Added the self-developed `doclayout_yolo` model, which is more than 10 times faster than the original solution while maintaining similar parsing quality, and can be freely swapped with `layoutlmv3` via the configuration file.
+    - Upgraded formula parsing to `unimernet 0.2.1`, improving formula parsing accuracy while significantly reducing GPU memory usage.
+    - Because `PDF-Extract-Kit 1.0` moved to a new repository, you need to re-download the models. See [How to Download Models](docs/how_to_download_models_en.md) for detailed steps.
+- 2024/09/27: Version 0.8.1 released, fixing some bugs and providing a [localized deployment version](projects/web_demo/README.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) along with the [front-end interface](projects/web/README.md).
+- 2024/09/09: Version 0.8.0 released, adding support for fast deployment with a Dockerfile and launching demos on Hugging Face and ModelScope.
+- 2024/08/30: Version 0.7.1 released, adding the Paddle TableMaster table recognition option.
+- 2024/08/09: Version 0.7.0b1 released, simplifying the installation process and adding table recognition.
+- 2024/08/01: Version 0.6.2b1 released, resolving dependency conflicts and improving the installation documentation.
+- 2024/07/05: Initial open-source release
+
+
+
+
+
+
+
+
+
+
+
+
+

+
+
+
+[English](README.md) | [简体中文](README_zh-CN.md)
+
+
+
+
+PDF-Extract-Kit: High-Quality PDF Parsing Toolkit🔥🔥🔥
+
+
+
+
+
+ 👋 join us on Discord and WeChat
+
+
+
+
+ read more docs on Read The Docs
+
+
+
+
+
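+# Changelog
+
+- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition.
+- 2024/10/31 0.9.0 released. This is a brand-new version with extensive code refactoring that resolves numerous issues, improves performance, lowers hardware requirements, and offers better usability:
+  - Refactored the sorting module to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring very high accuracy across various layouts.
+  - Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table cases.
+  - Refactored the list and table-of-contents recognition functions, greatly improving the recognition accuracy of list blocks and table-of-contents blocks and the parsing of the corresponding text paragraphs.
+  - Refactored the matching logic for figures, tables, and descriptive text, substantially improving the accuracy of matching captions and footnotes to figures and tables and reducing the loss rate of descriptive text to near zero.
+  - Added multi-language support for OCR, covering detection and recognition of 84 languages. For the list of supported languages, see [OCR Language Support List](https://paddlepaddle.github.io/PaddleOCR/latest/ppocr/blog/multi_languages.html#5).
+  - Added GPU memory recycling logic and other GPU memory optimizations, substantially lowering GPU memory requirements. The GPU memory required to enable all acceleration features except table acceleration (layout/formula/OCR) has dropped from 16GB to 8GB, and the requirement for enabling all acceleration features has dropped from 24GB to 10GB.
+  - Optimized the feature switches in the configuration file, adding an independent formula detection switch that substantially improves speed and parsing results when formula detection is not needed.
+  - Integrated [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit):
+    - Added the self-developed `doclayout_yolo` model, which is more than 10 times faster than the original solution at similar parsing quality and can be freely swapped with `layoutlmv3` via the configuration file.
+    - Upgraded formula parsing to `unimernet 0.2.1`, improving formula parsing accuracy while substantially lowering GPU memory requirements.
+    - Because `PDF-Extract-Kit 1.0` moved to a new repository, the models must be re-downloaded. See [How to Download Models](docs/how_to_download_models_zh_cn.md) for the steps.
+- 2024/09/27 0.8.1 released. Fixed some bugs and provided a [localized deployment version](projects/web_demo/README_zh-CN.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) along with the [front-end interface](projects/web/README_zh-CN.md).
+- 2024/09/09 0.8.0 released. Added support for fast deployment with a Dockerfile and launched demos on Hugging Face and ModelScope.
+- 2024/08/30 0.7.1 released. Integrated the Paddle TableMaster table recognition feature.
+- 2024/08/09 0.7.0b1 released. Simplified the installation steps to improve usability and added table recognition.
+- 2024/08/01 0.6.2b1 released. Resolved dependency conflicts and improved the installation documentation.
+- 2024/07/05 Initial open-source release.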
+
+
+
+