Compare commits


76 Commits

Author SHA1 Message Date
Xiaomeng Zhao
d0ed731b9e Merge pull request #2199 from opendatalab/release-1.3.2
Release 1.3.2
2025-04-12 18:58:15 +08:00
Xiaomeng Zhao
d6c5700af2 Merge pull request #2198 from opendatalab/dev
Dev
2025-04-12 18:52:52 +08:00
Xiaomeng Zhao
196da7ea0e Merge pull request #2197 from myhloli/dev
docs(readme): update release notes for English and Chinese README files
2025-04-12 18:52:30 +08:00
myhloli
a69b97c9dd docs(readme): update release notes for English and Chinese README files
- Update version history in both English and Chinese README files
- Add note about model update required for fixing word concatenation issue
- Ensure consistency between English and Chinese versions
2025-04-12 18:51:39 +08:00
Xiaomeng Zhao
b951a6ccd5 Merge pull request #2196 from opendatalab/dev
Dev
2025-04-12 18:48:11 +08:00
Xiaomeng Zhao
15b9146e8d Merge pull request #2195 from myhloli/dev
docs(README): update version history and installation instructions
2025-04-12 18:47:28 +08:00
myhloli
437311f5bd docs(README): update version history and installation instructions
- Update version history in README.md and README_zh-CN.md
- Add details for 1.3.2 release and previous versions
- Update Windows CUDA acceleration installation instructions
- Refactor changelog entries for better readability and organization
2025-04-12 18:44:55 +08:00
Xiaomeng Zhao
983e8e6824 Merge pull request #2194 from myhloli/dev
feat(magic_pdf): add logging for batch image processing
2025-04-12 17:05:14 +08:00
myhloli
afe1b02c3d feat(magic_pdf): add logging for batch image processing
- Add batch processing logs to track the progress of image analysis
- Display the current batch number, total batches, and the number of processed pages
2025-04-12 16:59:06 +08:00
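A minimal sketch of the batch-progress logging this commit describes, using the standard `logging` module (the helper name and batch layout are illustrative assumptions, not the project's actual code):

```python
import logging
from typing import Iterable, List, Sequence, TypeVar

T = TypeVar("T")
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("batch_analyze")

def iter_batches(pages: Sequence[T], batch_size: int) -> Iterable[List[T]]:
    """Split pages into fixed-size batches, logging progress per batch."""
    total_batches = (len(pages) + batch_size - 1) // batch_size
    for i in range(total_batches):
        batch = list(pages[i * batch_size:(i + 1) * batch_size])
        # Log the current batch number, total batches, and pages processed so far
        logger.info("batch %d/%d: %d pages processed",
                    i + 1, total_batches,
                    min((i + 1) * batch_size, len(pages)))
        yield batch
```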
Xiaomeng Zhao
af030021d2 Merge pull request #2193 from myhloli/dev
build(setup): update package versions and constraints
2025-04-12 16:39:59 +08:00
myhloli
15467730cf Merge remote-tracking branch 'origin/dev' into dev 2025-04-12 16:38:49 +08:00
myhloli
1b611a2e55 build(setup): update package versions and constraints
- Update matplotlib version range to >=3.10, <4
- Add upper version limit for ultralytics: <9
- Remove redundant version ranges for full_old_linux
2025-04-12 16:38:33 +08:00
Xiaomeng Zhao
852b841ab1 Merge pull request #2189 from myhloli/dev
refactor(model): optimize batch processing and inference
2025-04-11 19:25:37 +08:00
myhloli
54ce594bf6 refactor(tools): improve code readability and maintainability
- Remove unnecessary line breaks and adjust indentation
- Update function call to use named arguments for better readability
- Modify _do_parse function call to use MakeMode.MM_MD instead of
2025-04-11 11:12:30 +08:00
myhloli
d2fc9dabf4 refactor(model): optimize batch processing and inference
- Update batch processing logic for improved efficiency
- Refactor image analysis and inference methods
- Optimize dataset handling and image retrieval
- Improve error handling and logging in batch processes
2025-04-11 10:59:38 +08:00
myhloli
930bc47fe4 build(dependencies): update torch version requirements
- Remove upper version limit for torch
- This change allows for greater flexibility in installing compatible torch versions
2025-04-11 10:29:37 +08:00
Xiaomeng Zhao
1c7f41dd7c Merge pull request #2178 from myhloli/dev
feat(gui): update language options and default settings
2025-04-10 17:53:11 +08:00
Xiaomeng Zhao
e32704f102 Merge branch 'opendatalab:dev' into dev 2025-04-10 17:52:07 +08:00
Xiaomeng Zhao
a881ee89f6 Merge pull request #2177 from icecraft/feat/iterator_inference
feat: inference with iter style
2025-04-10 17:51:46 +08:00
icecraft
43164533fa feat: inference with iter style 2025-04-10 17:45:19 +08:00
myhloli
786da939e5 feat(gui): update language options and default settings
- Remove unused 'layoutlmv3' model option
- Update language options to include new 'add_lang' list
- Set default language to 'ch' (Chinese)
- Comment out old 'all_lang' definition for future reference
2025-04-10 15:39:51 +08:00
Xiaomeng Zhao
ce212da14b Merge pull request #2174 from myhloli/dev
refactor(ocr): comment out det_count update and update OCR models
2025-04-09 23:59:35 +08:00
myhloli
f8323ae07c refactor(ocr): comment out det_count update and update OCR models
- Comment out the line that updates det_count in batch_analyze.py
- Add a new OCR model configuration for Chinese (ch_lite) in models_config.yml
- Update the Chinese OCR model configuration to use a different recognition model
2025-04-09 23:56:47 +08:00
Xiaomeng Zhao
e5b74ae724 Merge pull request #2173 from myhloli/dev
fix(dataset): correct variable for language detection
2025-04-09 22:34:12 +08:00
myhloli
814bd4ea50 fix(dataset): correct variable for language detection
- Change `bits` to `self._data_bits` for language detection
- This fixes the TypeError when opening PDF files
2025-04-09 22:32:31 +08:00
Xiaomeng Zhao
1db6f89dcd Merge pull request #2172 from myhloli/dev
fix(ocr): handle NaN values in recognition scores, feat(table): add orientation detection and rotation for portrait tables
2025-04-09 19:06:20 +08:00
myhloli
4afdba3626 perf(table): optimize aspect ratio calculation for text boxes
- Simplify aspect ratio calculation using direct coordinate subtraction
- Remove unnecessary list append operation
- Improve code readability and performance in table rotation detection
2025-04-09 19:05:05 +08:00
myhloli
ac893f325a feat(table): add orientation detection and rotation for portrait tables
- Implement table orientation detection to identify if a table is in portrait mode
- Add rotation logic to turn portrait tables 90 degrees clockwise before OCR
- Update OCR processing to work with potentially rotated images
- Improve text box analysis to determine if a table is rotated
2025-04-09 18:47:53 +08:00
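The portrait-table heuristic described in the two table commits above could be sketched as follows: count text boxes that are taller than wide (aspect ratio via direct coordinate subtraction), and rotate the table image 90° clockwise before OCR when they dominate. The 1.2 threshold and function names are assumptions for illustration:

```python
import numpy as np

def is_portrait_table(text_boxes) -> bool:
    """Heuristic: if most detected text boxes are taller than wide,
    the table is likely rotated into portrait orientation."""
    vertical = 0
    for x0, y0, x1, y1 in text_boxes:
        w, h = x1 - x0, y1 - y0  # aspect ratio via direct subtraction
        if h > w * 1.2:          # threshold is an assumed value
            vertical += 1
    return vertical > len(text_boxes) / 2

def rotate_if_portrait(img: np.ndarray, text_boxes) -> np.ndarray:
    """Rotate a portrait table image 90 degrees clockwise before OCR."""
    if is_portrait_table(text_boxes):
        return np.rot90(img, k=-1)  # k=-1 rotates clockwise
    return img
```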
myhloli
c97959e4f5 fix(ocr): handle NaN values in recognition scores
- Update predict_rec.py to check for NaN values in recognition results
- Replace NaN scores with 0.0 to ensure stability and consistency
2025-04-09 18:00:30 +08:00
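The NaN guard this fix describes amounts to checking each recognition score and substituting 0.0. A small sketch (the function name and result shape are assumptions):

```python
import math

def sanitize_scores(rec_results):
    """Replace NaN confidence scores with 0.0 so downstream
    filtering and averaging stay stable."""
    cleaned = []
    for text, score in rec_results:
        if math.isnan(score):
            score = 0.0
        cleaned.append((text, score))
    return cleaned
```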
Xiaomeng Zhao
8aa61b0e9f Merge pull request #2166 from icecraft/fix/doc_analyze
fix: support page range
2025-04-09 17:18:42 +08:00
Xiaomeng Zhao
8e8103a8ce Merge pull request #2170 from myhloli/dev
feat(model): improve table recognition by merging and filtering tables
2025-04-09 17:17:21 +08:00
myhloli
df7ae4042d feat(model): improve table recognition by merging and filtering tables
- Add functions to calculate IoU, check if tables are inside each other, and merge tables
- Implement table merging for high IoU tables
- Add filtering to remove nested tables that don't overlap but cover a large area
- Update table_res_list and layout_res to reflect these changes
2025-04-09 17:14:33 +08:00
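The IoU computation and box merging named in this commit follow standard bounding-box geometry; a self-contained sketch (boxes as `[x0, y0, x1, y1]`, function names assumed):

```python
def bbox_iou(a, b) -> float:
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def merge_boxes(a, b):
    """Union bounding box covering two overlapping table regions."""
    return [min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3])]
```

Two detected tables whose IoU exceeds a chosen threshold would be replaced by `merge_boxes(a, b)`; nested tables (one box fully inside another) can be detected by checking whether the intersection equals the smaller box's area.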
icecraft
29c42a1add fix: support page range 2025-04-09 15:04:07 +08:00
Xiaomeng Zhao
b60166a541 Merge pull request #2157 from opendatalab/release-1.3.1
Release 1.3.1
2025-04-08 18:16:33 +08:00
Xiaomeng Zhao
ccf2ea04cb Merge pull request #2156 from opendatalab/dev
Dev
2025-04-08 18:16:07 +08:00
Xiaomeng Zhao
564991512c Merge branch 'release-1.3.1' into dev 2025-04-08 18:16:01 +08:00
Xiaomeng Zhao
a1595f1912 Merge pull request #2155 from myhloli/dev
docs: update version number in README files
2025-04-08 18:15:17 +08:00
myhloli
bc0ff1acb0 docs: update version number in README files
- Correct version number from 1.3.2 to 1.3.1 in both README.md and README_zh-CN.md
- Update changelog entries for the latest release
2025-04-08 18:14:29 +08:00
Xiaomeng Zhao
cb9c2e7616 Merge pull request #2154 from opendatalab/release-1.3.2
Release 1.3.2
2025-04-08 18:11:26 +08:00
Xiaomeng Zhao
b3ac3ac148 Merge branch 'master' into release-1.3.2 2025-04-08 18:11:16 +08:00
Xiaomeng Zhao
2c7094ff3d Merge pull request #2153 from opendatalab/dev
Dev
2025-04-08 18:10:16 +08:00
Xiaomeng Zhao
0ed231cb8b Merge pull request #2152 from myhloli/dev
docs(README): update version number and changelog in README files
2025-04-08 18:09:53 +08:00
myhloli
bd4728aaeb docs(README): update version number and changelog in README files
- Update version number from 1.3.1 to 1.3.2
2025-04-08 18:09:05 +08:00
Xiaomeng Zhao
2813e59905 Merge pull request #2151 from myhloli/dev
refactor(ocr): improve OCR score precision to three decimal places
2025-04-08 18:06:31 +08:00
myhloli
ea730ae2e9 refactor(ocr): improve OCR score precision to three decimal places
- Update OCR score formatting in batch_analyze.py and pdf_parse_union_core_v2.py
- Change score rounding method to preserve three decimal places
- Enhance accuracy representation without significantly altering the score value
2025-04-08 18:02:03 +08:00
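The three-decimal formatting is a one-liner; a sketch with an illustrative helper name:

```python
def round_score(score: float) -> float:
    """Keep three decimal places of an OCR confidence score.

    round() preserves precision without materially altering the value.
    """
    return round(float(score), 3)
```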
myhloli
0ab29cdbee docs(README): update version number in release notes
- Update version from 1.3.1 to 1.3.2 in both English and Chinese README files
- Keep other content unchanged
2025-04-08 17:37:39 +08:00
Xiaomeng Zhao
44665d3966 Update python-package.yml 2025-04-08 17:35:39 +08:00
myhloli
79feb926b7 Update version.py with new version 2025-04-08 09:23:09 +00:00
Xiaomeng Zhao
a2cde43b57 Merge pull request #2146 from opendatalab/release-1.3.1
Release 1.3.1
2025-04-08 17:20:21 +08:00
Xiaomeng Zhao
b8856ca96a Merge pull request #2148 from opendatalab/dev
Dev
2025-04-08 17:03:27 +08:00
Xiaomeng Zhao
098cf1df60 Merge pull request #2147 from myhloli/dev
docs: update badges and project URLs- Update PyPI version badge to us…
2025-04-08 17:02:43 +08:00
myhloli
90f0e7370a docs: update badges and project URLs
- Update PyPI version badge to use shields.io
- Add project URLs in setup.py for better discoverability
- Make consistent changes across README.md and README_zh-CN.md
2025-04-08 17:01:41 +08:00
Xiaomeng Zhao
714504864e Update python-package.yml 2025-04-08 16:49:56 +08:00
Xiaomeng Zhao
87fd4c2806 Update bug_report.yml 2025-04-08 16:49:02 +08:00
Xiaomeng Zhao
3251c73250 Merge pull request #2145 from opendatalab/dev
fix(table): add model path for slanet-plus to resolve RapidTableError
2025-04-08 16:47:45 +08:00
Xiaomeng Zhao
697da27cf7 Merge pull request #2144 from myhloli/dev
fix(table): add model path for slanet-plus to resolve RapidTableError
2025-04-08 16:47:09 +08:00
myhloli
e327e9bad5 fix(table): add model path for slanet-plus to resolve RapidTableError
- Import os and pathlib modules to handle file paths
- Define the path to the slanet-plus model
- Update RapidTableInput initialization to include the model path
2025-04-08 16:46:01 +08:00
Xiaomeng Zhao
99d5c022c4 Merge pull request #2142 from myhloli/dev
update 1.3.1
2025-04-08 16:13:28 +08:00
myhloli
7b61b418a3 ci: update Python version support and installation process
- Add support for Python 3.11, 3.12, and 3.13
- Replace requirements.txt based installation with editable install
2025-04-08 16:10:07 +08:00
myhloli
4fd8d626c4 docs(install): update Python version requirements and simplify torch installation
- Update Python version requirements to >=3.10
- Simplify torch installation command
- Remove numpy version restriction
- Update CUDA compatibility information
- Adjust environment creation commands across multiple documentation files
2025-04-08 16:06:02 +08:00
myhloli
cf6fa12767 build(setup): remove rapid_table dependency
- Remove rapid_table from install_requires in setup.py
2025-04-08 14:24:15 +08:00
myhloli
de4bc5a32d ci: update issue template options for Python version and dependency version
- Add "3.13" option for Python version
- Remove "3.9" option for Python version
- Update dependency version options:
  - Remove "0.8.x", "0.9.x", "0.10.x"
  - Add "1.1.x", "1.2.x", "1.3.x"
2025-04-08 14:22:06 +08:00
myhloli
9b5d2796f8 build(deps): update dependencies and add support for old Linux systems
- Update transformers to exclude version 4.51.0 due to compatibility issues
- Expand rapid_table version range to >=1.0.5,<2.0.0
- Add separate 'full_old_linux' extras_require for better support of older Linux systems
- Update matplotlib version requirements for different platforms
- Remove platform-specific paddlepaddle versions
2025-04-08 14:18:49 +08:00
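A hypothetical `extras_require` layout matching the constraints described in these build commits; only the pins named in the commit log are taken from the source, the rest of the structure is assumed:

```python
# Sketch of a setup.py extras split between current and legacy Linux
# environments; lower bounds other than those in the commit log are assumed.
extras_require = {
    "full": [
        "matplotlib>=3.10,<4",
        "transformers!=4.51.0",   # 4.51.0 excluded for compatibility
        "rapid_table>=1.0.5,<2.0.0",
    ],
    "full_old_linux": [
        "rapid_table==1.0.3",     # newer versions pull in onnxruntime,
                                  # unsupported on pre-2019 Linux systems
    ],
}
```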
myhloli
0f0591cf8f build(old_linux): add rapid_table dependency for PDF conversion
- Add rapid_table==1.0.3 to old_linux specific dependencies
- This version is compatible with Linux systems from 2019 and earlier
- Newer versions of rapid_table depend on onnxruntime, which is not supported on older Linux systems
2025-04-08 11:58:38 +08:00
Xiaomeng Zhao
cf6ffc6b1e Merge pull request #2128 from myhloli/dev
fix(model): improve VRAM detection and handling
2025-04-07 18:18:09 +08:00
myhloli
d32a63cada fix(model): improve VRAM detection and handling
- Refactor VRAM detection logic for better readability and efficiency
- Add fallback mechanism for unknown VRAM sizes
- Improve device checking in get_vram function
2025-04-07 18:15:37 +08:00
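The VRAM-detection fallback could look roughly like this: query torch for CUDA devices and return `None` for anything unknown, so callers can fall back to a conservative default. The function name and GB rounding are assumptions:

```python
def get_vram_gb(device: str = "cuda") -> "int | None":
    """Return total VRAM in GB for a CUDA device, or None when unknown.

    None signals the caller to use a conservative default batch size.
    """
    if not device.startswith("cuda"):
        return None  # cpu / mps: no VRAM figure to report
    try:
        import torch  # local import keeps CPU-only environments working
        if not torch.cuda.is_available():
            return None
        props = torch.cuda.get_device_properties(device)
        return round(props.total_memory / (1024 ** 3))
    except Exception:
        return None  # fallback for any query failure
```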
Xiaomeng Zhao
dfb3cbfb17 Merge pull request #2126 from icecraft/fix/image_ds_add_lang
fix: image dataset add lang field
2025-04-07 16:57:49 +08:00
icecraft
e36a083dc3 fix: image dataset add lang field 2025-04-07 15:40:06 +08:00
Xiaomeng Zhao
980f5c8cd7 Merge pull request #2125 from opendatalab/dev
docs: update torchvision version in CUDA installation guide
2025-04-07 15:26:13 +08:00
Xiaomeng Zhao
f442adfc95 Merge pull request #2124 from myhloli/dev
docs: update torchvision version in CUDA installation guide
2025-04-07 14:54:30 +08:00
myhloli
d4cda0a8c2 docs: update torchvision version in CUDA installation guide
- Update torchvision version from 0.21.1 to 0.21.0 in Windows CUDA acceleration guides
- Update both English and Chinese versions of the documentation
2025-04-07 14:53:25 +08:00
Xiaomeng Zhao
60fdf851a4 Merge pull request #2115 from myhloli/dev
build: remove accelerate dependency
2025-04-06 22:25:01 +08:00
myhloli
a10b9aec74 build: remove accelerate dependency
- Remove accelerate package from requirements.txt
- This change ensures only necessary external dependencies are introduced
2025-04-06 22:24:23 +08:00
Xiaomeng Zhao
e3261b0eea Merge pull request #2114 from myhloli/dev
build(deps): add accelerate package and update requirements https://github.com/opendatalab/MinerU/issues/2112
2025-04-06 22:17:20 +08:00
myhloli
09632dddc1 build(deps): add accelerate package and update requirements
- Add accelerate package to support model training acceleration
- Update requirements.txt to include new dependency
2025-04-06 22:16:01 +08:00
Xiaomeng Zhao
c5329a0722 Merge pull request #2093 from opendatalab/master
master -> dev
2025-04-03 23:33:35 +08:00
30 changed files with 817 additions and 349 deletions


@@ -64,10 +64,10 @@ body:
# Need quotes around `3.10` otherwise it is treated as a number and shows as `3.1`.
options:
-
- "3.13"
- "3.12"
- "3.11"
- "3.10"
- "3.9"
validations:
required: true
@@ -78,10 +78,10 @@ body:
#multiple: false
options:
-
- "0.8.x"
- "0.9.x"
- "0.10.x"
- "1.0.x"
- "1.1.x"
- "1.2.x"
- "1.3.x"
validations:
required: true


@@ -54,13 +54,13 @@ jobs:
run: |
git push origin HEAD:master
build:
check-install:
needs: [ update-version ]
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.10"]
python-version: ["3.10", "3.11", "3.12", "3.13"]
steps:
- name: Checkout code
@@ -79,10 +79,26 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
- name: Install magic-pdf
run: |
python -m pip install --upgrade pip
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install -e .[full]
build:
needs: [ check-install ]
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [ "3.10"]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: master
fetch-depth: 0
- name: Install wheel
run: |

README.md

@@ -10,7 +10,8 @@
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
@@ -47,11 +48,20 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/04/03 Release of 1.3.0, in this version we made many optimizations and improvements:
- 2025/04/12 1.3.2 released
- Fixed the issue of incompatible dependency package versions when installing in Python 3.13 environment on Windows systems.
- Optimized memory usage during batch inference.
- Improved the parsing effect of tables rotated by 90 degrees.
- Enhanced the parsing accuracy for large tables in financial report samples.
- Fixed the occasional word concatenation issue in English text areas when OCR language is not specified.(The model needs to be updated)
- 2025/04/08 1.3.1 released, fixed some compatibility issues
- Supported Python 3.13
- Made the final adaptation for some outdated Linux systems (e.g., CentOS 7), and no further support will be guaranteed for subsequent versions. [Installation Instructions](https://github.com/opendatalab/MinerU/issues/1004)
- 2025/04/03 1.3.0 released, in this version we made many optimizations and improvements:
- Installation and compatibility optimization
- By removing the use of `layoutlmv3` in layout, resolved compatibility issues caused by `detectron2`.
- Torch version compatibility extended to 2.2~2.6 (excluding 2.5).
- CUDA compatibility supports 11.8/12.4/12.6 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
- CUDA compatibility supports 11.8/12.4/12.6/12.8 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
- Python compatible versions expanded to 3.10~3.12, solving the problem of automatic downgrade to 0.6.1 during installation in non-3.10 environments.
- Offline deployment process optimized; no internet connection required after successful deployment to download any model files.
- Performance optimization
@@ -64,59 +74,154 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
- Usability Optimization
- By using `paddleocr2torch`, completely replaced the use of the `paddle` framework and `paddleocr` in the project, resolving conflicts between `paddle` and `torch`, as well as thread safety issues caused by the `paddle` framework.
- Added a real-time progress bar during the parsing process to accurately track progress, making the wait less painful.
- 2025/03/03 1.2.1 released, fixed several bugs:
- Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers
- Fixed caption matching inaccuracies in certain scenarios
- Fixed formula span loss issues in certain scenarios
- 2025/02/24 1.2.0 released. This version includes several fixes and improvements to enhance parsing efficiency and accuracy:
- Performance Optimization
- Increased classification speed for PDF documents in auto mode.
- Parsing Optimization
- Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.
- Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.
- Bug Fixes
- Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.
- Resolved an issue where title blocks were empty in some cases.
- 2025/01/22 1.1.0 released. In this version we have focused on improving parsing accuracy and efficiency:
- Model capability upgrade (requires re-executing the [model download process](docs/how_to_download_models_en.md) to obtain incremental updates of model files)
- The layout recognition model has been upgraded to the latest `doclayout_yolo(2501)` model, improving layout recognition accuracy.
- The formula parsing model has been upgraded to the latest `unimernet(2501)` model, improving formula recognition accuracy.
- Performance optimization
- On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.
- Parsing effect optimization
- Added a new heading classification feature (testing version, enabled by default) to the online demo([mineru.net](https://mineru.net/OpenSourceTools/Extractor)/[huggingface](https://huggingface.co/spaces/opendatalab/MinerU)/[modelscope](https://www.modelscope.cn/studios/OpenDataLab/MinerU)), which supports hierarchical classification of headings, thereby enhancing document structuring.
- 2025/01/10 1.0.1 released. This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:
- New API Interface
- For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.
- For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
- Enhanced Compatibility
- By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.
- We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. [Ascend NPU Acceleration](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
- Automatic Language Identification
- By introducing a new language recognition model, setting the `lang` configuration to `auto` during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.
- 2024/11/22 0.10.0 released. Introducing hybrid OCR text extraction capabilities,
- Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.
- Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.
- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
- 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
- Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
- Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.
- Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.
- Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.
- Added multi-language support for OCR, supporting detection and recognition of 84 languages.For the list of supported languages, see [OCR Language Support List](https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations).
- Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.
- Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.
- Integrated [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit):
- Added the self-developed `doclayout_yolo` model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with `layoutlmv3` via the configuration file.
- Upgraded formula parsing to `unimernet 0.2.1`, improving formula parsing accuracy while significantly reducing memory usage.
- Due to the repository change for `PDF-Extract-Kit 1.0`, you need to re-download the model. Please refer to [How to Download Models](docs/how_to_download_models_en.md) for detailed steps.
- 2024/09/27 Version 0.8.1 released, Fixed some bugs, and providing a [localized deployment version](projects/web_demo/README.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) and the [front-end interface](projects/web/README.md).
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, add paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05: Initial open-source release
<details>
<summary>2025/03/03 1.2.1 released</summary>
<ul>
<li>Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers</li>
<li>Fixed caption matching inaccuracies in certain scenarios</li>
<li>Fixed formula span loss issues in certain scenarios</li>
</ul>
</details>
<details>
<summary>2025/02/24 1.2.0 released</summary>
<p>This version includes several fixes and improvements to enhance parsing efficiency and accuracy:</p>
<ul>
<li><strong>Performance Optimization</strong>
<ul>
<li>Increased classification speed for PDF documents in auto mode.</li>
</ul>
</li>
<li><strong>Parsing Optimization</strong>
<ul>
<li>Improved parsing logic for documents containing watermarks, significantly enhancing the parsing results for such documents.</li>
<li>Enhanced the matching logic for multiple images/tables and captions within a single page, improving the accuracy of image-text matching in complex layouts.</li>
</ul>
</li>
<li><strong>Bug Fixes</strong>
<ul>
<li>Fixed an issue where image/table spans were incorrectly filled into text blocks under certain conditions.</li>
<li>Resolved an issue where title blocks were empty in some cases.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/22 1.1.0 released</summary>
<p>In this version we have focused on improving parsing accuracy and efficiency:</p>
<ul>
<li><strong>Model capability upgrade</strong> (requires re-executing the <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">model download process</a> to obtain incremental updates of model files)
<ul>
<li>The layout recognition model has been upgraded to the latest <code>doclayout_yolo(2501)</code> model, improving layout recognition accuracy.</li>
<li>The formula parsing model has been upgraded to the latest <code>unimernet(2501)</code> model, improving formula recognition accuracy.</li>
</ul>
</li>
<li><strong>Performance optimization</strong>
<ul>
<li>On devices that meet certain configuration requirements (16GB+ VRAM), by optimizing resource usage and restructuring the processing pipeline, overall parsing speed has been increased by more than 50%.</li>
</ul>
</li>
<li><strong>Parsing effect optimization</strong>
<ul>
<li>Added a new heading classification feature (testing version, enabled by default) to the online demo (<a href="https://mineru.net/OpenSourceTools/Extractor">mineru.net</a>/<a href="https://huggingface.co/spaces/opendatalab/MinerU">huggingface</a>/<a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">modelscope</a>), which supports hierarchical classification of headings, thereby enhancing document structuring.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/10 1.0.1 released</summary>
<p>This is our first official release, where we have introduced a completely new API interface and enhanced compatibility through extensive refactoring, as well as a brand new automatic language identification feature:</p>
<ul>
<li><strong>New API Interface</strong>
<ul>
<li>For the data-side API, we have introduced the Dataset class, designed to provide a robust and flexible data processing framework. This framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx). It ensures effective support for data processing tasks ranging from simple to complex.</li>
<li>For the user-side API, we have meticulously designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, allowing users to define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.</li>
</ul>
</li>
<li><strong>Enhanced Compatibility</strong>
<ul>
<li>By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM architecture Linux systems.</li>
<li>We have deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities. This supports the localization and development of AI application platforms in China. <a href="https://github.com/opendatalab/MinerU/blob/master/docs/README_Ascend_NPU_Acceleration_zh_CN.md">Ascend NPU Acceleration</a></li>
</ul>
</li>
<li><strong>Automatic Language Identification</strong>
<ul>
<li>By introducing a new language recognition model, setting the <code>lang</code> configuration to <code>auto</code> during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/11/22 0.10.0 released</summary>
<p>Introducing hybrid OCR text extraction capabilities:</p>
<ul>
<li>Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images.</li>
<li>Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode.</li>
</ul>
</details>
<details>
<summary>2024/11/15 0.9.3 released</summary>
<p>Integrated <a href="https://github.com/RapidAI/RapidTable">RapidTable</a> for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.</p>
</details>
<details>
<summary>2024/11/06 0.9.2 released</summary>
<p>Integrated the <a href="https://huggingface.co/U4R/StructTable-InternVL2-1B">StructTable-InternVL2-1B</a> model for table recognition functionality.</p>
</details>
<details>
<summary>2024/10/31 0.9.0 released</summary>
<p>This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:</p>
<ul>
<li>Refactored the sorting module code to use <a href="https://github.com/ppaanngggg/layoutreader">layoutreader</a> for reading order sorting, ensuring high accuracy in various layouts.</li>
<li>Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios.</li>
<li>Refactored the list and table of contents recognition functions, significantly improving the accuracy of list blocks and table of contents blocks, as well as the parsing of corresponding text paragraphs.</li>
<li>Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero.</li>
<li>Added multi-language support for OCR, supporting detection and recognition of 84 languages. For the list of supported languages, see <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/ppocr/blog/multi_languages.html#5-support-languages-and-abbreviations">OCR Language Support List</a>.</li>
<li>Added memory recycling logic and other memory optimization measures, significantly reducing memory usage. The memory requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and the memory requirement for enabling all acceleration features has been reduced from 24GB to 10GB.</li>
<li>Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed.</li>
<li>Integrated <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit 1.0</a>:
<ul>
<li>Added the self-developed <code>doclayout_yolo</code> model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with <code>layoutlmv3</code> via the configuration file.</li>
<li>Upgraded formula parsing to <code>unimernet 0.2.1</code>, improving formula parsing accuracy while significantly reducing memory usage.</li>
<li>Due to the repository change for <code>PDF-Extract-Kit 1.0</code>, you need to re-download the model. Please refer to <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_en.md">How to Download Models</a> for detailed steps.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/09/27 Version 0.8.1 released</summary>
<p>Fixed some bugs, and provided a <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web_demo/README.md">localized deployment version</a> of the <a href="https://opendatalab.com/OpenSourceTools/Extractor/PDF/">online demo</a> and the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web/README.md">front-end interface</a>.</p>
</details>
<details>
<summary>2024/09/09 Version 0.8.0 released</summary>
<p>Added support for fast deployment with Dockerfile, and launched demos on Hugging Face and ModelScope.</p>
</details>
<details>
<summary>2024/08/30 Version 0.7.1 released</summary>
<p>Added the paddle TableMaster table recognition option.</p>
</details>
<details>
<summary>2024/08/09 Version 0.7.0b1 released</summary>
<p>Simplified the installation process and added table recognition functionality.</p>
</details>
<details>
<summary>2024/08/01 Version 0.6.2b1 released</summary>
<p>Resolved dependency conflicts and improved the installation documentation.</p>
</details>
<details>
<summary>2024/07/05 Initial open-source release</summary>
</details>
<!-- TABLE OF CONTENT -->
@@ -232,7 +337,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.12</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -242,8 +347,8 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -274,7 +379,7 @@ Synced with dev branch updates:
#### 1. Install magic-pdf
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"
```


@@ -10,7 +10,8 @@
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
@@ -46,11 +47,20 @@
</div>
# Changelog
- 2025/04/12 1.3.2 released
  - Fixed incompatible dependency package versions when installing in Python 3.13 environments on Windows
  - Optimized memory usage during batch inference
  - Improved parsing of tables rotated by 90 degrees
  - Improved parsing of oversized tables in financial report samples
  - Fixed the occasional word concatenation issue in English text areas when the OCR language is not specified (requires a model update)
- 2025/04/08 1.3.1 released, fixed some compatibility issues
  - Supported Python 3.13
  - Made a final adaptation for some outdated Linux systems (e.g. CentOS 7); subsequent versions are no longer guaranteed to support them. [Installation instructions](https://github.com/opendatalab/MinerU/issues/1004)
- 2025/04/03 1.3.0 released. In this version we made many optimizations and improvements:
  - Installation and compatibility optimization
    - Resolved compatibility issues caused by `detectron2` by removing the use of `layoutlmv3` in layout
    - Extended torch version compatibility to 2.2~2.6 (excluding 2.5)
    - CUDA compatibility supports 11.8/12.4/12.6 (the CUDA version is determined by torch), solving compatibility issues for some users with 50-series and H-series GPUs
    - CUDA compatibility supports 11.8/12.4/12.6/12.8 (the CUDA version is determined by torch), solving compatibility issues for some users with 50-series and H-series GPUs
    - Extended Python compatibility to 3.10~3.12, solving the problem of automatic downgrade to 0.6.1 when installing in non-3.10 environments
    - Optimized the offline deployment process; after a successful deployment, no model files need to be downloaded over the network
  - Performance optimization
@@ -63,60 +73,143 @@
  - Usability optimization
    - By using `paddleocr2torch`, completely replaced the `paddle` framework and `paddleocr` in the project, resolving conflicts between `paddle` and `torch` as well as thread-safety issues caused by the `paddle` framework
    - Added a real-time progress bar during parsing to accurately track progress, making the wait less painful
- 2025/03/03 1.2.1 released, fixed some issues:
  - Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers
  - Fixed inaccurate caption matching in certain scenarios
  - Fixed formula span loss in certain scenarios
- 2025/02/24 1.2.0 released. In this version we fixed some issues and improved parsing efficiency and accuracy:
  - Performance optimization
    - Improved classification speed of PDF documents in auto mode
    - Added high-performance plugin support in Huawei Ascend NPU acceleration mode, with end-to-end acceleration of up to 300% in common scenarios ([application link](https://aicarrier.feishu.cn/share/base/form/shrcnb10VaoNQB8kQPA8DEfZC6d))
  - Parsing optimization
    - Optimized the parsing logic for documents containing watermarks, significantly improving their parsing results
    - Improved the matching logic between multiple images/tables and captions on a single page, improving image-text matching accuracy in complex layouts
  - Bug fixes
    - Fixed an exception caused by image/table spans being filled into text blocks in certain scenarios
    - Fixed title blocks being empty in certain scenarios
- 2025/01/22 1.1.0 released. In this version we focused on improving parsing accuracy and efficiency:
  - Model capability upgrade (requires re-running the [model download process](docs/how_to_download_models_zh_cn.md) to obtain incremental model file updates)
    - Upgraded the layout recognition model to the latest `doclayout_yolo(2501)` model, improving layout recognition accuracy
    - Upgraded the formula parsing model to the latest `unimernet(2501)` model, improving formula recognition accuracy
  - Performance optimization
    - On devices meeting certain configuration requirements (16GB+ VRAM), overall parsing speed is improved by more than 50% through optimized resource usage and a refactored processing pipeline
  - Parsing effect optimization
    - Added a heading-level classification feature (beta, enabled by default) to the online demo ([mineru.net](https://mineru.net/OpenSourceTools/Extractor)/[huggingface](https://huggingface.co/spaces/opendatalab/MinerU)/[modelscope](https://www.modelscope.cn/studios/OpenDataLab/MinerU)), which classifies heading levels and improves document structuring
- 2025/01/10 1.0.1 released. This is our first official release. In this version we introduced a brand-new API through extensive refactoring, along with broader compatibility and a new automatic language identification feature:
  - New API
    - For the data-side API, we introduced the Dataset class, designed to provide a robust and flexible data processing framework. The framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx), ensuring effective support for data processing tasks from simple to complex.
    - For the user-side API, we designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, so users can define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.
  - Broader compatibility
    - By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM-architecture Linux systems.
    - Deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities and supporting the localization and development of AI application platforms in China. [NPU acceleration tutorial](docs/README_Ascend_NPU_Acceleration_zh_CN.md)
  - Automatic language identification
    - By introducing a new language recognition model, setting the `lang` configuration to `auto` during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.
- 2024/11/22 0.10.0 released, introducing hybrid OCR text extraction capabilities
  - Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images
  - Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode
- 2024/11/15 0.9.3 released, integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower VRAM usage
- 2024/11/06 0.9.2 released, integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition
- 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
  - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring high accuracy in various layouts
  - Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios
  - Refactored the list and table-of-contents recognition functions, significantly improving the accuracy of list blocks and table-of-contents blocks, as well as the parsing of corresponding text paragraphs
  - Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero
  - Added multi-language support for OCR, supporting detection and recognition of 84 languages; for the list of supported languages, see the [OCR Language Support List](https://paddlepaddle.github.io/PaddleOCR/latest/ppocr/blog/multi_languages.html#5)
  - Added VRAM recycling logic and other VRAM optimization measures, significantly reducing VRAM usage. The VRAM requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and for enabling all acceleration features from 24GB to 10GB
  - Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed
  - Integrated [PDF-Extract-Kit 1.0](https://github.com/opendatalab/PDF-Extract-Kit)
    - Added the self-developed `doclayout_yolo` model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with `layoutlmv3` via the configuration file
    - Upgraded formula parsing to `unimernet 0.2.1`, improving formula parsing accuracy while significantly reducing VRAM usage
    - Due to the repository change for `PDF-Extract-Kit 1.0`, you need to re-download the model; see [How to Download Models](docs/how_to_download_models_zh_cn.md) for detailed steps
- 2024/09/27 0.8.1 released, fixed some bugs, and provided the [localized deployment version](projects/web_demo/README_zh-CN.md) of the [online demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) and the [front-end interface](projects/web/README_zh-CN.md)
- 2024/09/09 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Hugging Face and ModelScope
- 2024/08/30 0.7.1 released, integrated the paddle TableMaster table recognition feature
- 2024/08/09 0.7.0b1 released, simplified the installation process and added table recognition functionality
- 2024/08/01 0.6.2b1 released, optimized dependency conflict issues and installation documentation
- 2024/07/05 Initial open-source release
<details>
<summary>2025/03/03 1.2.1 released, fixed some issues</summary>
<ul>
<li>Fixed the impact on punctuation marks during full-width to half-width conversion of letters and numbers</li>
<li>Fixed inaccurate caption matching in certain scenarios</li>
<li>Fixed formula span loss in certain scenarios</li>
</ul>
</details>
<details>
<summary>2025/02/24 1.2.0 released. In this version we fixed some issues and improved parsing efficiency and accuracy:</summary>
<ul>
<li>Performance optimization
<ul>
<li>Improved classification speed of PDF documents in auto mode</li>
</ul>
</li>
<li>Parsing optimization
<ul>
<li>Optimized the parsing logic for documents containing watermarks, significantly improving their parsing results</li>
<li>Improved the matching logic between multiple images/tables and captions on a single page, improving image-text matching accuracy in complex layouts</li>
</ul>
</li>
<li>Bug fixes
<ul>
<li>Fixed an exception caused by image/table spans being filled into text blocks in certain scenarios</li>
<li>Fixed title blocks being empty in certain scenarios</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/22 1.1.0 released. In this version we focused on improving parsing accuracy and efficiency:</summary>
<ul>
<li>Model capability upgrade (requires re-running the <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_zh_cn.md">model download process</a> to obtain incremental model file updates)
<ul>
<li>Upgraded the layout recognition model to the latest <code>doclayout_yolo(2501)</code> model, improving layout recognition accuracy</li>
<li>Upgraded the formula parsing model to the latest <code>unimernet(2501)</code> model, improving formula recognition accuracy</li>
</ul>
</li>
<li>Performance optimization
<ul>
<li>On devices meeting certain configuration requirements (16GB+ VRAM), overall parsing speed is improved by more than 50% through optimized resource usage and a refactored processing pipeline</li>
</ul>
</li>
<li>Parsing effect optimization
<ul>
<li>Added a heading-level classification feature (beta, enabled by default) to the online demo (<a href="https://mineru.net/OpenSourceTools/Extractor">mineru.net</a> / <a href="https://huggingface.co/spaces/opendatalab/MinerU">huggingface</a> / <a href="https://www.modelscope.cn/studios/OpenDataLab/MinerU">modelscope</a>), which classifies heading levels and improves document structuring</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/01/10 1.0.1 released. This is our first official release. In this version we introduced a brand-new API through extensive refactoring, along with broader compatibility and a new automatic language identification feature:</summary>
<ul>
<li>New API
<ul>
<li>For the data-side API, we introduced the Dataset class, designed to provide a robust and flexible data processing framework. The framework currently supports a variety of document formats, including images (.jpg and .png), PDFs, Word documents (.doc and .docx), and PowerPoint presentations (.ppt and .pptx), ensuring effective support for data processing tasks from simple to complex.</li>
<li>For the user-side API, we designed the MinerU processing workflow as a series of composable Stages. Each Stage represents a specific processing step, so users can define new Stages according to their needs and creatively combine these stages to customize their data processing workflows.</li>
</ul>
</li>
<li>Broader compatibility
<ul>
<li>By optimizing the dependency environment and configuration items, we ensure stable and efficient operation on ARM-architecture Linux systems.</li>
<li>Deeply integrated with Huawei Ascend NPU acceleration, providing autonomous and controllable high-performance computing capabilities and supporting the localization and development of AI application platforms in China. <a href="https://github.com/opendatalab/MinerU/blob/master/docs/README_Ascend_NPU_Acceleration_zh_CN.md">NPU acceleration tutorial</a></li>
</ul>
</li>
<li>Automatic language identification
<ul>
<li>By introducing a new language recognition model, setting the <code>lang</code> configuration to <code>auto</code> during document parsing will automatically select the appropriate OCR language model, improving the accuracy of scanned document parsing.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/11/22 0.10.0 released, introducing hybrid OCR text extraction capabilities</summary>
<ul>
<li>Significantly improved parsing performance in complex text distribution scenarios such as dense formulas, irregular span regions, and text represented by images</li>
<li>Combines the dual advantages of accurate content extraction and faster speed in text mode, and more precise span/line region recognition in OCR mode</li>
</ul>
</details>
<details>
<summary>2024/11/15 0.9.3 released, integrated <a href="https://github.com/RapidAI/RapidTable">RapidTable</a> for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower VRAM usage</summary>
</details>
<details>
<summary>2024/11/06 0.9.2 released, integrated the <a href="https://huggingface.co/U4R/StructTable-InternVL2-1B">StructTable-InternVL2-1B</a> model for table recognition</summary>
</details>
<details>
<summary>2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:</summary>
<ul>
<li>Refactored the sorting module code to use <a href="https://github.com/ppaanngggg/layoutreader">layoutreader</a> for reading-order sorting, ensuring high accuracy in various layouts</li>
<li>Refactored the paragraph concatenation module to achieve good results in cross-column, cross-page, cross-figure, and cross-table scenarios</li>
<li>Refactored the list and table-of-contents recognition functions, significantly improving the accuracy of list blocks and table-of-contents blocks, as well as the parsing of corresponding text paragraphs</li>
<li>Refactored the matching logic for figures, tables, and descriptive text, greatly enhancing the accuracy of matching captions and footnotes to figures and tables, and reducing the loss rate of descriptive text to near zero</li>
<li>Added multi-language support for OCR, supporting detection and recognition of 84 languages; for the list of supported languages, see the <a href="https://paddlepaddle.github.io/PaddleOCR/latest/ppocr/blog/multi_languages.html#5">OCR Language Support List</a></li>
<li>Added VRAM recycling logic and other VRAM optimization measures, significantly reducing VRAM usage. The VRAM requirement for enabling all acceleration features except table acceleration (layout/formula/OCR) has been reduced from 16GB to 8GB, and for enabling all acceleration features from 24GB to 10GB</li>
<li>Optimized configuration file feature switches, adding an independent formula detection switch to significantly improve speed and parsing results when formula detection is not needed</li>
<li>Integrated <a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit 1.0</a>
<ul>
<li>Added the self-developed <code>doclayout_yolo</code> model, which speeds up processing by more than 10 times compared to the original solution while maintaining similar parsing effects, and can be freely switched with <code>layoutlmv3</code> via the configuration file</li>
<li>Upgraded formula parsing to <code>unimernet 0.2.1</code>, improving formula parsing accuracy while significantly reducing VRAM usage</li>
<li>Due to the repository change for <code>PDF-Extract-Kit 1.0</code>, you need to re-download the model; see <a href="https://github.com/opendatalab/MinerU/blob/master/docs/how_to_download_models_zh_cn.md">How to Download Models</a> for detailed steps</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2024/09/27 0.8.1 released, fixed some bugs, and provided the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web_demo/README_zh-CN.md">localized deployment version</a> of the <a href="https://opendatalab.com/OpenSourceTools/Extractor/PDF/">online demo</a> and the <a href="https://github.com/opendatalab/MinerU/blob/master/projects/web/README_zh-CN.md">front-end interface</a></summary>
</details>
<details>
<summary>2024/09/09 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Hugging Face and ModelScope</summary>
</details>
<details>
<summary>2024/08/30 0.7.1 released, integrated the paddle TableMaster table recognition feature</summary>
</details>
<details>
<summary>2024/08/09 0.7.0b1 released, simplified the installation process and added table recognition functionality</summary>
</details>
<details>
<summary>2024/08/01 0.6.2b1 released, optimized dependency conflict issues and installation documentation</summary>
</details>
<details>
<summary>2024/07/05 Initial open-source release</summary>
</details>
<!-- TABLE OF CONTENT -->
@@ -233,7 +326,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">>=3.9,<=3.12</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -243,8 +336,8 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -279,7 +372,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
> Syncing of the latest version to domestic (China) mirror sources may be delayed; please be patient.
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
```


@@ -49,25 +49,3 @@ docker run -it -u root --name mineru-npu --privileged=true \
magic-pdf --help
```
## Known issues
- With the embedded onnx models, paddleocr recognizes Chinese and English at a reasonably fast speed only under the default language configuration
- When a custom lang parameter is set, paddleocr becomes noticeably slower
- The layout model crashes intermittently when layoutlmv3 is used; the default doclayout_yolo model is recommended
- Table parsing has only been adapted to the rapid_table model; other models may not work
## High-performance mode
- In specific hardware environments, a plugin can enable high-performance mode, improving overall speed by more than 300% over the default mode

| System requirement | Version/Model |
|--------------------|---------------|
| Chip type | Ascend 910B |
| CANN version | CANN 8.0.RC2 |
| Driver version | 24.1.rc2.1 |
| magic-pdf version | >= 1.2.0 |

- The high-performance plugin has certain hardware and qualification requirements; to apply for access, please fill out the [MinerU High-Performance Version Cooperation Application Form](https://aicarrier.feishu.cn/share/base/form/shrcnb10VaoNQB8kQPA8DEfZC6d)


@@ -54,7 +54,7 @@ In the final step, enter `yes`, close the terminal, and reopen it.
### 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -63,14 +63,13 @@ conda activate mineru
```sh
pip install -U magic-pdf[full]
```
> [!IMPORTANT]
> After installation, make sure to check the version of `magic-pdf` using the following command:
> [!TIP]
> After installation, you can check the version of `magic-pdf` using the following command:
>
> ```sh
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report the issue.
### 6. Download Models


@@ -54,7 +54,7 @@ bash Anaconda3-2024.06-1-Linux-x86_64.sh
## 4. Create an environment using conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -64,14 +64,13 @@ conda activate mineru
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After the download completes, be sure to verify the version of magic-pdf with the following command:
> [!TIP]
> After the download completes, you can check the version of `magic-pdf` with the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is lower than 1.3.0, please report it to us in an issue.
## 6. Download models


@@ -17,7 +17,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
### 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -28,13 +28,12 @@ pip install -U magic-pdf[full]
```
> [!IMPORTANT]
> After installation, verify the version of `magic-pdf`:
> After installation, you can check the version of `magic-pdf` using the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it in the issues section.
### 5. Download Models
@@ -64,7 +63,7 @@ If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-
1. **Overwrite the installation of torch and torchvision** supporting CUDA. (Select the appropriate index-url based on your CUDA version; for details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
```
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
```
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.


@@ -18,7 +18,7 @@ https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Window
## 3. Create an environment using conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -29,13 +29,12 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After the download completes, be sure to verify the version of magic-pdf with the following command:
> After the download completes, you can check the version of magic-pdf with the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is lower than 1.3.0, please report it to us in an issue.
## 5. Download models
@@ -65,7 +64,7 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
**1. Overwrite the installation of CUDA-enabled torch and torchvision** (select the appropriate index-url based on your CUDA version; see the [PyTorch official website](https://pytorch.org/get-started/locally/) for details)
```bash
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
```
**2. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**


@@ -18,15 +18,6 @@ The configuration file can be found in the user directory, with the filename `ma
# How to update models previously downloaded
## 1. Models downloaded via Git LFS
> [!IMPORTANT]
> Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended.
>
> For versions 0.9.x and later, due to the repository change and the addition of the layout sorting model in PDF-Extract-Kit 1.0, the models cannot be updated using the `git pull` command. Instead, a Python script must be used for one-click updates.
When magic-pdf <= 0.8.1, if you have previously downloaded the model files via git lfs, you can navigate to the previous download directory and update the models using the `git pull` command.
## 2. Models downloaded via Hugging Face or Model Scope
## 1. Models downloaded via Hugging Face or Model Scope
If you previously downloaded models via Hugging Face or Model Scope, you can rerun the Python script used for the initial download. This will automatically update the model directory to the latest version.


@@ -32,16 +32,6 @@ python脚本会自动下载模型文件并配置好配置文件中的模型目
# How to update previously downloaded models
## 1. Models downloaded via git lfs
> [!IMPORTANT]
> Because some users reported incomplete downloads and corrupted model files when using git lfs, this download method is no longer recommended.
>
> For 0.9.x and later, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout sorting model, the models cannot be updated with `git pull`; a Python script must be used for a one-click update instead.
When magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go to the original download directory and update the models with `git pull`.
## 2. Models downloaded via Hugging Face or Model Scope
## 1. Models downloaded via Hugging Face or Model Scope
If you previously downloaded models via Hugging Face or Model Scope, you can rerun the original Python download script; it will automatically update the model directory to the latest version.


@@ -103,54 +103,65 @@ def batch_build_dataset(pdf_paths, k, lang=None):
all_images : list
List of all processed images
"""
# Get page counts for each PDF
pdf_info = []
total_pages = 0
results = []
for pdf_path in pdf_paths:
try:
doc = fitz.open(pdf_path)
num_pages = len(doc)
pdf_info.append((pdf_path, num_pages))
total_pages += num_pages
doc.close()
except Exception as e:
print(f'Error opening {pdf_path}: {e}')
# Partition the jobs based on page count. Each job has 1 page.
partitions = partition_array_greedy(pdf_info, k)
# Process each partition in parallel
all_images_h = {}
with concurrent.futures.ProcessPoolExecutor(max_workers=k) as executor:
# Submit one task per partition
futures = []
for sn, partition in enumerate(partitions):
# Get the jobs for this partition
partition_jobs = [pdf_info[idx] for idx in partition]
# Submit the task
future = executor.submit(
process_pdf_batch,
partition_jobs,
sn
)
futures.append(future)
# Process results as they complete
for i, future in enumerate(concurrent.futures.as_completed(futures)):
try:
idx, images = future.result()
all_images_h[idx] = images
except Exception as e:
print(f'Error processing partition: {e}')
results = [None] * len(pdf_paths)
for i in range(len(partitions)):
partition = partitions[i]
for j in range(len(partition)):
with open(pdf_info[partition[j]][0], 'rb') as f:
pdf_bytes = f.read()
dataset = PymuDocDataset(pdf_bytes, lang=lang)
dataset.set_images(all_images_h[i][j])
results[partition[j]] = dataset
with open(pdf_path, 'rb') as f:
pdf_bytes = f.read()
dataset = PymuDocDataset(pdf_bytes, lang=lang)
results.append(dataset)
return results
#
# # Get page counts for each PDF
# pdf_info = []
# total_pages = 0
#
# for pdf_path in pdf_paths:
# try:
# doc = fitz.open(pdf_path)
# num_pages = len(doc)
# pdf_info.append((pdf_path, num_pages))
# total_pages += num_pages
# doc.close()
# except Exception as e:
# print(f'Error opening {pdf_path}: {e}')
#
# # Partition the jobs based on page count. Each job has 1 page.
# partitions = partition_array_greedy(pdf_info, k)
#
# # Process each partition in parallel
# all_images_h = {}
#
# with concurrent.futures.ProcessPoolExecutor(max_workers=k) as executor:
# # Submit one task per partition
# futures = []
# for sn, partition in enumerate(partitions):
# # Get the jobs for this partition
# partition_jobs = [pdf_info[idx] for idx in partition]
#
# # Submit the task
# future = executor.submit(
# process_pdf_batch,
# partition_jobs,
# sn
# )
# futures.append(future)
# # Process results as they complete
# for i, future in enumerate(concurrent.futures.as_completed(futures)):
# try:
# idx, images = future.result()
# all_images_h[idx] = images
# except Exception as e:
# print(f'Error processing partition: {e}')
# results = [None] * len(pdf_paths)
# for i in range(len(partitions)):
# partition = partitions[i]
# for j in range(len(partition)):
# with open(pdf_info[partition[j]][0], 'rb') as f:
# pdf_bytes = f.read()
# dataset = PymuDocDataset(pdf_bytes, lang=lang)
# dataset.set_images(all_images_h[i][j])
# results[partition[j]] = dataset
# return results
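The (now commented-out) implementation above distributes PDFs across `k` workers with `partition_array_greedy`, balancing partitions by total page count. A minimal standalone sketch of that balancing step, assuming a simple largest-first greedy heuristic (the real `partition_array_greedy` in magic_pdf may differ):

```python
def partition_array_greedy(pdf_info, k):
    """Greedily split (path, num_pages) jobs into k partitions balanced by page count.

    Returns a list of k partitions, each a list of indices into pdf_info.
    """
    # Visit jobs from largest to smallest so big PDFs are placed first
    order = sorted(range(len(pdf_info)), key=lambda i: pdf_info[i][1], reverse=True)
    partitions = [[] for _ in range(k)]
    loads = [0] * k
    for idx in order:
        # Assign each job to the currently lightest partition
        target = loads.index(min(loads))
        partitions[target].append(idx)
        loads[target] += pdf_info[idx][1]
    return partitions


pdf_info = [('a.pdf', 10), ('b.pdf', 3), ('c.pdf', 7), ('d.pdf', 2)]
parts = partition_array_greedy(pdf_info, 2)
```

Each partition then becomes one `process_pdf_batch` job, so the worker processes finish at roughly the same time.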


@@ -150,7 +150,7 @@ class PymuDocDataset(Dataset):
elif lang == 'auto':
from magic_pdf.model.sub_modules.language_detection.utils import \
auto_detect_lang
self._lang = auto_detect_lang(bits)
self._lang = auto_detect_lang(self._data_bits)
logger.info(f'lang: {lang}, detect_lang: {self._lang}')
else:
self._lang = lang
@@ -232,7 +232,7 @@ class PymuDocDataset(Dataset):
self._records[i].set_image(images[i])
class ImageDataset(Dataset):
def __init__(self, bits: bytes):
def __init__(self, bits: bytes, lang=None):
"""Initialize the dataset, which wraps the pymudoc documents.
Args:
@@ -244,6 +244,17 @@ class ImageDataset(Dataset):
self._raw_data = bits
self._data_bits = pdf_bytes
if lang == '':
self._lang = None
elif lang == 'auto':
from magic_pdf.model.sub_modules.language_detection.utils import \
auto_detect_lang
self._lang = auto_detect_lang(self._data_bits)
logger.info(f'lang: {lang}, detect_lang: {self._lang}')
else:
self._lang = lang
logger.info(f'lang: {lang}')
def __len__(self) -> int:
"""The length of the dataset."""
return len(self._records)
@@ -394,4 +405,4 @@ class Doc(PageableData):
fontsize (int): font size of the text
color (list[float] | None): three element tuple which describe the RGB of the board line, None will use the default font color!
"""
self._doc.insert_text(coord, content, fontsize=fontsize, color=color)
self._doc.insert_text(coord, content, fontsize=fontsize, color=color)
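The `lang` handling this hunk adds to `ImageDataset` mirrors the `PymuDocDataset` branch above it: an empty string means unset, `'auto'` runs detection on the raw bytes, and any other value passes through. A standalone sketch of that dispatch, with the detector injected as a parameter because the real `auto_detect_lang` needs downloaded model files:

```python
def resolve_lang(lang, data_bits, detect_fn):
    """Resolve the effective OCR language for a dataset.

    lang='' -> None (unset); lang='auto' -> detect from the raw bytes;
    any other value -> use as given. detect_fn stands in for magic_pdf's
    auto_detect_lang, which requires model files to be present.
    """
    if lang == '':
        return None
    elif lang == 'auto':
        return detect_fn(data_bits)
    else:
        return lang
```

Note the related one-line fix in `PymuDocDataset` above: detection now runs on `self._data_bits` (the converted PDF bytes) rather than the raw `bits` argument.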


@@ -1 +1 @@
__version__ = "1.3.0"
__version__ = "1.3.1"


@@ -30,8 +30,14 @@ class BatchAnalyze:
images_layout_res = []
layout_start_time = time.time()
_, fst_ocr, fst_lang = images_with_extra_info[0]
self.model = self.model_manager.get_model(fst_ocr, self.show_log, fst_lang, self.layout_model, self.formula_enable, self.table_enable)
self.model = self.model_manager.get_model(
    ocr=True,
    show_log=self.show_log,
    lang=None,
    layout_model=self.layout_model,
    formula_enable=self.formula_enable,
    table_enable=self.table_enable,
)
images = [image for image, _, _ in images_with_extra_info]
@@ -143,14 +149,14 @@ class BatchAnalyze:
if ocr_res:
ocr_result_list = get_ocr_result_list(ocr_res, useful_list, ocr_res_list_dict['ocr_enable'], new_image, _lang)
ocr_res_list_dict['layout_res'].extend(ocr_result_list)
det_count += len(ocr_res_list_dict['ocr_res_list'])
# det_count += len(ocr_res_list_dict['ocr_res_list'])
# logger.info(f'ocr-det time: {round(time.time()-det_start, 2)}, image num: {det_count}')
# 表格识别 table recognition
if self.model.apply_table:
table_start = time.time()
table_count = 0
# for table_res_list_dict in table_res_list_all_page:
for table_res_dict in tqdm(table_res_list_all_page, desc="Table Predict"):
_lang = table_res_dict['lang']
@@ -241,7 +247,7 @@ class BatchAnalyze:
for index, layout_res_item in enumerate(need_ocr_lists_by_lang[lang]):
ocr_text, ocr_score = ocr_res_list[index]
layout_res_item['text'] = ocr_text
layout_res_item['score'] = float(round(ocr_score, 2))
layout_res_item['score'] = float(f"{ocr_score:.3f}")
total_processed += len(img_crop_list)


@@ -146,10 +146,8 @@ def doc_analyze(
img_dict = page_data.get_image()
images.append(img_dict['img'])
page_wh_list.append((img_dict['width'], img_dict['height']))
if lang is None or lang == 'auto':
images_with_extra_info = [(images[index], ocr, dataset._lang) for index in range(len(dataset))]
else:
images_with_extra_info = [(images[index], ocr, lang) for index in range(len(dataset))]
images_with_extra_info = [(images[index], ocr, dataset._lang) for index in range(len(dataset))]
if len(images) >= MIN_BATCH_INFERENCE_SIZE:
batch_size = MIN_BATCH_INFERENCE_SIZE
@@ -158,8 +156,8 @@ def doc_analyze(
batch_images = [images_with_extra_info]
results = []
for sn, batch_image in enumerate(batch_images):
_, result = may_batch_image_analyze(batch_image, sn, ocr, show_log,layout_model, formula_enable, table_enable)
for batch_image in batch_images:
result = may_batch_image_analyze(batch_image, ocr, show_log,layout_model, formula_enable, table_enable)
results.extend(result)
model_json = []
@@ -181,7 +179,7 @@ def doc_analyze(
def batch_doc_analyze(
datasets: list[Dataset],
parse_method: str,
parse_method: str = 'auto',
show_log: bool = False,
lang=None,
layout_model=None,
@@ -190,30 +188,37 @@ def batch_doc_analyze(
):
MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 200))
batch_size = MIN_BATCH_INFERENCE_SIZE
images = []
page_wh_list = []
images_with_extra_info = []
for dataset in datasets:
for index in range(len(dataset)):
if lang is None or lang == 'auto':
_lang = dataset._lang
else:
_lang = lang
ocr = False
if parse_method == 'auto':
if dataset.classify() == SupportedPdfParseMethod.TXT:
ocr = False
elif dataset.classify() == SupportedPdfParseMethod.OCR:
ocr = True
elif parse_method == 'ocr':
ocr = True
elif parse_method == 'txt':
ocr = False
_lang = dataset._lang
for index in range(len(dataset)):
page_data = dataset.get_page(index)
img_dict = page_data.get_image()
images.append(img_dict['img'])
page_wh_list.append((img_dict['width'], img_dict['height']))
if parse_method == 'auto':
images_with_extra_info.append((images[-1], dataset.classify() == SupportedPdfParseMethod.OCR, _lang))
else:
images_with_extra_info.append((images[-1], parse_method == 'ocr', _lang))
images_with_extra_info.append((img_dict['img'], ocr, _lang))
batch_images = [images_with_extra_info[i:i+batch_size] for i in range(0, len(images_with_extra_info), batch_size)]
results = []
for sn, batch_image in enumerate(batch_images):
_, result = may_batch_image_analyze(batch_image, sn, True, show_log, layout_model, formula_enable, table_enable)
processed_images_count = 0
for index, batch_image in enumerate(batch_images):
processed_images_count += len(batch_image)
logger.info(f'Batch {index + 1}/{len(batch_images)}: {processed_images_count} pages/{len(images_with_extra_info)} pages')
result = may_batch_image_analyze(batch_image, True, show_log, layout_model, formula_enable, table_enable)
results.extend(result)
infer_results = []
@@ -233,7 +238,6 @@ def batch_doc_analyze(
def may_batch_image_analyze(
images_with_extra_info: list[tuple[np.ndarray, bool, str]],
idx: int,
ocr: bool,
show_log: bool = False,
layout_model=None,
@@ -255,8 +259,9 @@ def may_batch_image_analyze(
torch.npu.set_compile_mode(jit_compile=False)
if str(device).startswith('npu') or str(device).startswith('cuda'):
gpu_memory = int(os.getenv('VIRTUAL_VRAM_SIZE', round(get_vram(device))))
if gpu_memory is not None:
vram = get_vram(device)
if vram is not None:
gpu_memory = int(os.getenv('VIRTUAL_VRAM_SIZE', round(vram)))
if gpu_memory >= 16:
batch_ratio = 16
elif gpu_memory >= 12:
@@ -268,6 +273,10 @@ def may_batch_image_analyze(
else:
batch_ratio = 1
logger.info(f'gpu_memory: {gpu_memory} GB, batch_ratio: {batch_ratio}')
else:
# Default batch_ratio when VRAM can't be determined
batch_ratio = 1
logger.info(f'Could not determine GPU memory, using default batch_ratio: {batch_ratio}')
# doc_analyze_start = time.time()
@@ -286,4 +295,4 @@ def may_batch_image_analyze(
# f'doc analyze time: {round(time.time() - doc_analyze_start, 2)},'
# f' speed: {doc_analyze_speed} pages/second'
# )
return idx, results
return results
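The hunk above fixes an ordering bug: `get_vram(device)` can return `None`, and the old code passed that into `round()` before any `None` check, which would raise `TypeError`. The corrected flow can be sketched as a helper; the `VIRTUAL_VRAM_SIZE` override, the 16 GB tier, and the fallback of 1 come straight from the diff, while the function name is mine and the intermediate tiers are elided here because the hunk does not show them:

```python
import os

def pick_batch_ratio(vram_gb, env=None):
    """Map available VRAM (GB, or None when undetectable) to a batch ratio.

    VIRTUAL_VRAM_SIZE overrides the probed value; unknown VRAM falls
    back to batch_ratio = 1, matching the fixed code path in the diff.
    """
    if env is None:
        env = os.environ
    if vram_gb is not None:
        gpu_memory = int(env.get('VIRTUAL_VRAM_SIZE', round(vram_gb)))
        if gpu_memory >= 16:
            return 16
        # ... intermediate tiers elided in the hunk ...
        return 1
    # Default batch_ratio when VRAM can't be determined
    return 1
```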


@@ -29,22 +29,204 @@ def crop_img(input_res, input_np_img, crop_paste_x=0, crop_paste_y=0):
return return_image, return_list
# Select regions for OCR / formula regions / table regions
def get_res_list_from_layout_res(layout_res):
def get_coords_and_area(table):
"""Extract coordinates and area from a table."""
xmin, ymin = int(table['poly'][0]), int(table['poly'][1])
xmax, ymax = int(table['poly'][4]), int(table['poly'][5])
area = (xmax - xmin) * (ymax - ymin)
return xmin, ymin, xmax, ymax, area
def calculate_intersection(box1, box2):
"""Calculate intersection coordinates between two boxes."""
intersection_xmin = max(box1[0], box2[0])
intersection_ymin = max(box1[1], box2[1])
intersection_xmax = min(box1[2], box2[2])
intersection_ymax = min(box1[3], box2[3])
# Check if intersection is valid
if intersection_xmax <= intersection_xmin or intersection_ymax <= intersection_ymin:
return None
return intersection_xmin, intersection_ymin, intersection_xmax, intersection_ymax
def calculate_iou(box1, box2):
"""Calculate IoU between two boxes."""
intersection = calculate_intersection(box1[:4], box2[:4])
if not intersection:
return 0
intersection_xmin, intersection_ymin, intersection_xmax, intersection_ymax = intersection
intersection_area = (intersection_xmax - intersection_xmin) * (intersection_ymax - intersection_ymin)
area1, area2 = box1[4], box2[4]
union_area = area1 + area2 - intersection_area
return intersection_area / union_area if union_area > 0 else 0
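The helpers above carry boxes as `(xmin, ymin, xmax, ymax, area)` tuples. A compact, self-contained restatement of the IoU computation for a quick sanity check:

```python
def iou(box1, box2):
    """IoU of two (xmin, ymin, xmax, ymax) boxes, as in the helpers above."""
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    if ix2 <= ix1 or iy2 <= iy1:
        return 0.0  # no valid intersection
    inter = (ix2 - ix1) * (iy2 - iy1)
    a1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    a2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return inter / (a1 + a2 - inter)

# Two 10x10 boxes offset by 5 in x: intersection 50, union 150
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # → 0.3333333333333333
```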
def is_inside(small_box, big_box, overlap_threshold=0.8):
"""Check if small_box is inside big_box by at least overlap_threshold."""
intersection = calculate_intersection(small_box[:4], big_box[:4])
if not intersection:
return False
intersection_xmin, intersection_ymin, intersection_xmax, intersection_ymax = intersection
intersection_area = (intersection_xmax - intersection_xmin) * (intersection_ymax - intersection_ymin)
# Check if overlap exceeds threshold
return intersection_area >= overlap_threshold * small_box[4]
def do_overlap(box1, box2):
"""Check if two boxes overlap."""
return calculate_intersection(box1[:4], box2[:4]) is not None
def merge_high_iou_tables(table_res_list, layout_res, table_indices, iou_threshold=0.7):
"""Merge tables with IoU > threshold."""
if len(table_res_list) < 2:
return table_res_list, table_indices
table_info = [get_coords_and_area(table) for table in table_res_list]
merged = True
while merged:
merged = False
i = 0
while i < len(table_res_list) - 1:
j = i + 1
while j < len(table_res_list):
iou = calculate_iou(table_info[i], table_info[j])
if iou > iou_threshold:
# Merge tables by taking their union
x1_min, y1_min, x1_max, y1_max, _ = table_info[i]
x2_min, y2_min, x2_max, y2_max, _ = table_info[j]
union_xmin = min(x1_min, x2_min)
union_ymin = min(y1_min, y2_min)
union_xmax = max(x1_max, x2_max)
union_ymax = max(y1_max, y2_max)
# Create merged table
merged_table = table_res_list[i].copy()
merged_table['poly'][0] = union_xmin
merged_table['poly'][1] = union_ymin
merged_table['poly'][2] = union_xmax
merged_table['poly'][3] = union_ymin
merged_table['poly'][4] = union_xmax
merged_table['poly'][5] = union_ymax
merged_table['poly'][6] = union_xmin
merged_table['poly'][7] = union_ymax
# Update layout_res
to_remove = [table_indices[j], table_indices[i]]
for idx in sorted(to_remove, reverse=True):
del layout_res[idx]
layout_res.append(merged_table)
# Update tracking lists
table_indices = [k if k < min(to_remove) else
k - 1 if k < max(to_remove) else
k - 2 if k > max(to_remove) else
len(layout_res) - 1
for k in table_indices
if k not in to_remove]
table_indices.append(len(layout_res) - 1)
# Update table lists
table_res_list.pop(j)
table_res_list.pop(i)
table_res_list.append(merged_table)
# Update table_info
table_info = [get_coords_and_area(table) for table in table_res_list]
merged = True
break
j += 1
if merged:
break
i += 1
return table_res_list, table_indices
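The merge writes the union box back into the 8-value `poly` layout (corners clockwise from top-left), which reduces to this small sketch with hypothetical polys:

```python
def union_poly(poly_a, poly_b):
    """Union of two 8-value polys (x0,y0, x1,y0, x1,y1, x0,y1), as in the merge above."""
    xmin = min(poly_a[0], poly_b[0])
    ymin = min(poly_a[1], poly_b[1])
    xmax = max(poly_a[4], poly_b[4])
    ymax = max(poly_a[5], poly_b[5])
    return [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax]

print(union_poly([0, 0, 10, 0, 10, 10, 0, 10],
                 [5, 5, 20, 5, 20, 15, 5, 15]))
# → [0, 0, 20, 0, 20, 15, 0, 15]
```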
def filter_nested_tables(table_res_list, overlap_threshold=0.8, area_threshold=0.8):
"""Remove big tables containing multiple smaller tables within them."""
if len(table_res_list) < 3:
return table_res_list
table_info = [get_coords_and_area(table) for table in table_res_list]
big_tables_idx = []
for i in range(len(table_res_list)):
# Find tables inside this one
tables_inside = [j for j in range(len(table_res_list))
if i != j and is_inside(table_info[j], table_info[i], overlap_threshold)]
# Continue if there are at least 2 tables inside
if len(tables_inside) >= 2:
# Check if inside tables overlap with each other
tables_overlap = any(do_overlap(table_info[tables_inside[idx1]], table_info[tables_inside[idx2]])
for idx1 in range(len(tables_inside))
for idx2 in range(idx1 + 1, len(tables_inside)))
# If no overlaps, check area condition
if not tables_overlap:
total_inside_area = sum(table_info[j][4] for j in tables_inside)
big_table_area = table_info[i][4]
if total_inside_area > area_threshold * big_table_area:
big_tables_idx.append(i)
return [table for i, table in enumerate(table_res_list) if i not in big_tables_idx]
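A big table is dropped only when at least two tables lie inside it, those inner tables do not overlap each other, and together they cover more than `area_threshold` of its area. A toy re-implementation of that decision (helper names are illustrative):

```python
def inter_area(a, b):
    """Intersection area of two (xmin, ymin, xmax, ymax, ...) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0

def is_big_table_spurious(big, others, overlap_threshold=0.8, area_threshold=0.8):
    """True if `big` should be filtered because smaller tables tile it."""
    inside = [b for b in others if inter_area(b, big) >= overlap_threshold * b[4]]
    if len(inside) < 2:
        return False
    overlaps = any(inter_area(p, q) > 0
                   for i, p in enumerate(inside) for q in inside[i + 1:])
    return not overlaps and sum(b[4] for b in inside) > area_threshold * big[4]

# Boxes as (xmin, ymin, xmax, ymax, area): two halves exactly tile the big table
big = (0, 0, 100, 100, 10000)
halves = [(0, 0, 100, 50, 5000), (0, 50, 100, 100, 5000)]
print(is_big_table_spurious(big, halves))  # → True
```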
def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshold=0.8, area_threshold=0.8):
"""Extract OCR, table and other regions from layout results."""
ocr_res_list = []
table_res_list = []
table_indices = []
single_page_mfdetrec_res = []
for res in layout_res:
if int(res['category_id']) in [13, 14]:
# Categorize regions
for i, res in enumerate(layout_res):
category_id = int(res['category_id'])
if category_id in [13, 14]: # Formula regions
single_page_mfdetrec_res.append({
"bbox": [int(res['poly'][0]), int(res['poly'][1]),
int(res['poly'][4]), int(res['poly'][5])],
})
elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
elif category_id in [0, 1, 2, 4, 6, 7]: # OCR regions
ocr_res_list.append(res)
elif int(res['category_id']) in [5]:
elif category_id == 5: # Table regions
table_res_list.append(res)
return ocr_res_list, table_res_list, single_page_mfdetrec_res
table_indices.append(i)
# Process tables: merge high IoU tables first, then filter nested tables
table_res_list, table_indices = merge_high_iou_tables(
table_res_list, layout_res, table_indices, iou_threshold)
filtered_table_res_list = filter_nested_tables(
table_res_list, overlap_threshold, area_threshold)
# Remove filtered out tables from layout_res
if len(filtered_table_res_list) < len(table_res_list):
kept_tables = set(id(table) for table in filtered_table_res_list)
to_remove = [table_indices[i] for i, table in enumerate(table_res_list)
if id(table) not in kept_tables]
for idx in sorted(to_remove, reverse=True):
del layout_res[idx]
return ocr_res_list, filtered_table_res_list, single_page_mfdetrec_res
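The category split can be exercised with a hypothetical `layout_res` (13/14 are formula regions, 0/1/2/4/6/7 are OCR text regions, 5 is a table):

```python
layout_res = [
    {'category_id': 1,  'poly': [0, 0, 10, 0, 10, 10, 0, 10]},    # text
    {'category_id': 5,  'poly': [0, 20, 10, 20, 10, 30, 0, 30]},  # table
    {'category_id': 14, 'poly': [0, 40, 10, 40, 10, 50, 0, 50]},  # formula
]
ocr, tables, formulas = [], [], []
for i, res in enumerate(layout_res):
    cid = int(res['category_id'])
    if cid in [13, 14]:       # formula regions keep only the bbox corners
        formulas.append({'bbox': [res['poly'][0], res['poly'][1],
                                  res['poly'][4], res['poly'][5]]})
    elif cid in [0, 1, 2, 4, 6, 7]:
        ocr.append(res)
    elif cid == 5:
        tables.append(res)
print(len(ocr), len(tables), len(formulas))  # → 1 1 1
```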
def clean_vram(device, vram_threshold=8):
@@ -57,7 +239,7 @@ def clean_vram(device, vram_threshold=8):
def get_vram(device):
if torch.cuda.is_available() and device != 'cpu':
if torch.cuda.is_available() and str(device).startswith("cuda"):
total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
return total_memory
elif str(device).startswith("npu"):


@@ -1,8 +1,12 @@
lang:
ch:
ch_lite:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_infer.pth
dict: ppocr_keys_v1.txt
ch:
det: ch_PP-OCRv3_det_infer.pth
rec: ch_PP-OCRv4_rec_server_infer.pth
dict: ppocr_keys_v1.txt
en:
det: en_PP-OCRv3_det_infer.pth
rec: en_PP-OCRv4_rec_infer.pth


@@ -437,4 +437,10 @@ class TextRecognizer(BaseOCRV20):
index += 1
pbar.update(current_batch_size)
# Fix NaN values in recognition results
for i in range(len(rec_res)):
text, score = rec_res[i]
if isinstance(score, float) and math.isnan(score):
rec_res[i] = (text, 0.0)
return rec_res, elapse
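The guard matters because a NaN confidence propagates through later comparisons and serialization; a minimal reproduction of the fix:

```python
import math

rec_res = [('hello', 0.98), ('wor1d', float('nan')), ('!', 0.75)]
for i, (text, score) in enumerate(rec_res):
    if isinstance(score, float) and math.isnan(score):
        rec_res[i] = (text, 0.0)  # clamp NaN confidence to 0.0
print(rec_res)  # → [('hello', 0.98), ('wor1d', 0.0), ('!', 0.75)]
```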


@@ -1,3 +1,5 @@
import os
from pathlib import Path
import cv2
import numpy as np
import torch
@@ -17,7 +19,9 @@ class RapidTableModel(object):
if torch.cuda.is_available() and table_sub_model_name == "unitable":
input_args = RapidTableInput(model_type=table_sub_model_name, use_cuda=True, device=get_device())
else:
input_args = RapidTableInput(model_type=table_sub_model_name)
root_dir = Path(__file__).absolute().parent.parent.parent.parent.parent
slanet_plus_model_path = os.path.join(root_dir, 'resources', 'slanet_plus', 'slanet-plus.onnx')
input_args = RapidTableInput(model_type=table_sub_model_name, model_path=slanet_plus_model_path)
else:
raise ValueError(f"Invalid table_sub_model_name: {table_sub_model_name}. It must be one of {sub_model_list}")
@@ -31,26 +35,63 @@ class RapidTableModel(object):
# from rapidocr_onnxruntime import RapidOCR
# self.ocr_engine = RapidOCR()
self.ocr_model_name = "PaddleOCR"
# self.ocr_model_name = "PaddleOCR"
self.ocr_engine = ocr_engine
def predict(self, image):
bgr_image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
if self.ocr_model_name == "RapidOCR":
ocr_result, _ = self.ocr_engine(np.asarray(image))
elif self.ocr_model_name == "PaddleOCR":
bgr_image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
ocr_result = self.ocr_engine.ocr(bgr_image)[0]
if ocr_result:
ocr_result = [[item[0], item[1][0], item[1][1]] for item in ocr_result if
len(item) == 2 and isinstance(item[1], tuple)]
else:
ocr_result = None
# First check the overall image aspect ratio (height/width)
img_height, img_width = bgr_image.shape[:2]
img_aspect_ratio = img_height / img_width if img_width > 0 else 1.0
img_is_portrait = img_aspect_ratio > 1.2
if img_is_portrait:
det_res = self.ocr_engine.ocr(bgr_image, rec=False)[0]
# Check if table is rotated by analyzing text box aspect ratios
is_rotated = False
if det_res:
vertical_count = 0
for box_ocr_res in det_res:
p1, p2, p3, p4 = box_ocr_res
# Calculate width and height
width = p3[0] - p1[0]
height = p3[1] - p1[1]
aspect_ratio = width / height if height > 0 else 1.0
# Count vertical vs horizontal text boxes
if aspect_ratio < 0.8: # Taller than wide - vertical text
vertical_count += 1
# elif aspect_ratio > 1.2: # Wider than tall - horizontal text
# horizontal_count += 1
# If we have more vertical text boxes than horizontal ones,
# and vertical ones are significant, table might be rotated
if vertical_count >= len(det_res) * 0.3:
is_rotated = True
# logger.debug(f"Text orientation analysis: vertical={vertical_count}, det_res={len(det_res)}, rotated={is_rotated}")
# Rotate image if necessary
if is_rotated:
# logger.debug("Table appears to be in portrait orientation, rotating 90 degrees clockwise")
image = cv2.rotate(np.asarray(image), cv2.ROTATE_90_CLOCKWISE)
bgr_image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
# Continue with OCR on potentially rotated image
ocr_result = self.ocr_engine.ocr(bgr_image)[0]
if ocr_result:
ocr_result = [[item[0], item[1][0], item[1][1]] for item in ocr_result if
len(item) == 2 and isinstance(item[1], tuple)]
else:
logger.error("OCR model not supported")
ocr_result = None
if ocr_result:
table_results = self.table_model(np.asarray(image), ocr_result)
html_code = table_results.pred_html
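The rotation heuristic counts detection boxes that are taller than wide and flags the table when at least 30% qualify. A standalone sketch over hypothetical 4-point boxes (`p1..p4` clockwise from top-left; `looks_rotated` is an illustrative name):

```python
def looks_rotated(det_boxes, vertical_ratio=0.3, aspect_cutoff=0.8):
    """True when enough detected text boxes are taller than wide."""
    if not det_boxes:
        return False
    vertical = 0
    for p1, p2, p3, p4 in det_boxes:
        width = p3[0] - p1[0]
        height = p3[1] - p1[1]
        aspect = width / height if height > 0 else 1.0
        if aspect < aspect_cutoff:  # taller than wide -> vertical text
            vertical += 1
    return vertical >= len(det_boxes) * vertical_ratio

# Two tall boxes out of three -> treated as rotated
boxes = [
    [(0, 0), (10, 0), (10, 50), (0, 50)],
    [(20, 0), (30, 0), (30, 50), (20, 50)],
    [(0, 60), (100, 60), (100, 70), (0, 70)],
]
print(looks_rotated(boxes))  # → True
```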


@@ -997,7 +997,7 @@ def pdf_parse_union(
for index, span in enumerate(need_ocr_list):
ocr_text, ocr_score = ocr_res_list[index]
span['content'] = ocr_text
span['score'] = float(round(ocr_score, 2))
span['score'] = float(f"{ocr_score:.3f}")
# rec_time = time.time() - rec_start
# logger.info(f'ocr-dynamic-rec time: {round(rec_time, 2)}, total images processed: {len(img_crop_list)}')

Binary file not shown.


@@ -109,9 +109,7 @@ def _do_parse(
pdf_bytes = ds._raw_data
local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
local_md_dir
)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)
image_dir = str(os.path.basename(local_image_dir))
if len(model_list) == 0:
@@ -317,7 +315,26 @@ def batch_do_parse(
infer_results = batch_doc_analyze(dss, parse_method, lang=lang, layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
for idx, infer_result in enumerate(infer_results):
_do_parse(output_dir, pdf_file_names[idx], dss[idx], infer_result.get_infer_res(), parse_method, debug_able, f_draw_span_bbox=f_draw_span_bbox, f_draw_layout_bbox=f_draw_layout_bbox, f_dump_md=f_dump_md, f_dump_middle_json=f_dump_middle_json, f_dump_model_json=f_dump_model_json, f_dump_orig_pdf=f_dump_orig_pdf, f_dump_content_list=f_dump_content_list, f_make_md_mode=f_make_md_mode, f_draw_model_bbox=f_draw_model_bbox, f_draw_line_sort_bbox=f_draw_line_sort_bbox, f_draw_char_bbox=f_draw_char_bbox, lang=lang)
_do_parse(
output_dir = output_dir,
pdf_file_name = pdf_file_names[idx],
pdf_bytes_or_dataset = dss[idx],
model_list = infer_result.get_infer_res(),
parse_method = parse_method,
debug_able = debug_able,
f_draw_span_bbox = f_draw_span_bbox,
f_draw_layout_bbox = f_draw_layout_bbox,
f_dump_md=f_dump_md,
f_dump_middle_json=f_dump_middle_json,
f_dump_model_json=f_dump_model_json,
f_dump_orig_pdf=f_dump_orig_pdf,
f_dump_content_list=f_dump_content_list,
f_make_md_mode=MakeMode.MM_MD,
f_draw_model_bbox=f_draw_model_bbox,
f_draw_line_sort_bbox=f_draw_line_sort_bbox,
f_draw_char_bbox=f_draw_char_bbox,
lang=lang,
)
parse_pdf_methods = click.Choice(['ocr', 'txt', 'auto'])


@@ -80,7 +80,7 @@ Specify Python version 3.10.
.. code:: sh
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
5. Install Applications
@@ -90,16 +90,15 @@ Specify Python version 3.10.
pip install -U magic-pdf[full]
.. admonition:: Important
.. admonition:: TIP
:class: tip
After installation, make sure to check the version of ``magic-pdf`` using the following command:
After installation, you can check the version of ``magic-pdf`` using the following command:
.. code:: sh
magic-pdf --version
If the version number is less than 1.3.0, please report the issue.
6. Download Models
~~~~~~~~~~~~~~~~~~
@@ -178,7 +177,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
::
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
4. Install Applications
@@ -188,16 +187,15 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
pip install -U magic-pdf[full]
.. admonition:: Important
.. admonition:: Tip
:class: tip
❗️After installation, verify the version of ``magic-pdf``:
After installation, you can check the version of ``magic-pdf``:
.. code:: bash
magic-pdf --version
If the version number is less than 1.3.0, please report it in the issues section.
5. Download Models
~~~~~~~~~~~~~~~~~~
@@ -237,7 +235,7 @@ test CUDA-accelerated parsing performance.
.. code:: sh
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``


@@ -28,7 +28,7 @@ magic-pdf.json
"layoutreader-model-dir":"/tmp/layoutreader",
"device-mode":"cpu",
"layout-config": {
"model": "layoutlmv3"
"model": "doclayout_yolo"
},
"formula-config": {
"mfd_model": "yolo_v8_mfd",
@@ -37,7 +37,7 @@ magic-pdf.json
},
"table-config": {
"model": "rapid_table",
"enable": false,
"enable": true,
"max_time": 400
},
"config_version": "1.0.0"
@@ -88,10 +88,10 @@ layout-config
.. code:: json
{
"model": "layoutlmv3"
"model": "doclayout_yolo"
}
layout model cannot be disabled now, and we have only one kind of layout model currently.
layout model cannot be disabled now.
formula-config
@@ -132,14 +132,14 @@ table-config
{
"model": "rapid_table",
"enable": false,
"enable": true,
"max_time": 400
}
model
""""""""
Specify the table inference model, options are ['rapid_table', 'tablemaster', 'struct_eqtable']
Specify the table inference model, options are ['rapid_table']
max_time


@@ -29,18 +29,7 @@ filename ``magic-pdf.json``.
How to update models previously downloaded
-------------------------------------------
1. Models downloaded via Git LFS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Due to feedback from some users that downloading model files using
git lfs was incomplete or resulted in corrupted model files, this
method is no longer recommended.
If you previously downloaded model files via git lfs, you can navigate
to the previous download directory and use the ``git pull`` command to
update the model.
2. Models downloaded via Hugging Face or Model Scope
1. Models downloaded via Hugging Face or Model Scope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you previously downloaded models via Hugging Face or Model Scope, you


@@ -71,8 +71,8 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -97,7 +97,7 @@ Create an environment
.. code-block:: shell
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"


@@ -159,9 +159,12 @@ devanagari_lang = [
'sa', 'bgc'
]
other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']
all_lang = ['', 'auto']
all_lang.extend([*other_lang, *latin_lang, *arabic_lang, *cyrillic_lang, *devanagari_lang])
# all_lang = ['', 'auto']
all_lang = []
# all_lang.extend([*other_lang, *latin_lang, *arabic_lang, *cyrillic_lang, *devanagari_lang])
all_lang.extend([*other_lang, *add_lang])
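With the commented-out lines removed, the dropdown's language list reduces to the eight core languages plus the four script groups:

```python
other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']
all_lang = []
all_lang.extend([*other_lang, *add_lang])
print(all_lang)
# → ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka',
#    'latin', 'arabic', 'cyrillic', 'devanagari']
```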
def to_pdf(file_path):
@@ -192,8 +195,8 @@ if __name__ == '__main__':
file = gr.File(label='Please upload a PDF or image', file_types=['.pdf', '.png', '.jpeg', '.jpg'])
max_pages = gr.Slider(1, 20, 10, step=1, label='Max convert pages')
with gr.Row():
layout_mode = gr.Dropdown(['layoutlmv3', 'doclayout_yolo'], label='Layout model', value='doclayout_yolo')
language = gr.Dropdown(all_lang, label='Language', value='auto')
layout_mode = gr.Dropdown(['doclayout_yolo'], label='Layout model', value='doclayout_yolo')
language = gr.Dropdown(all_lang, label='Language', value='ch')
with gr.Row():
formula_enable = gr.Checkbox(label='Enable formula recognition', value=True)
is_ocr = gr.Checkbox(label='Force enable OCR', value=False)


@@ -7,9 +7,9 @@ numpy>=1.21.6
pydantic>=2.7.2,<2.11
PyMuPDF>=1.24.9,<1.25.0
scikit-learn>=1.0.2
torch>=2.2.2,!=2.5.0,!=2.5.1,<=2.6.0
torch>=2.2.2,!=2.5.0,!=2.5.1
torchvision
transformers>=4.49.0,<5.0.0
transformers>=4.49.0,!=4.51.0,<5.0.0
pdfminer.six==20231228
tqdm>=4.67.1
# The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.


@@ -26,6 +26,7 @@ if __name__ == '__main__':
setup(
name="magic_pdf", # project name
version=__version__, # version obtained automatically from the git tag
license="AGPL-3.0",
packages=find_packages() + ["magic_pdf.resources"] + ["magic_pdf.model.sub_modules.ocr.paddleocr2pytorch.pytorchocr.utils.resources"], # include all packages
package_data={
"magic_pdf.resources": ["**"], # include every file under the magic_pdf.resources directory
@@ -33,33 +34,54 @@ if __name__ == '__main__':
},
install_requires=parse_requirements('requirements.txt'), # third-party dependencies of the project
extras_require={
"lite": ["paddleocr==2.7.3",
"paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
],
"lite": [
"paddleocr==2.7.3",
"paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
],
"full": [
"matplotlib<=3.9.0;platform_system=='Windows'", # 3.9.1+ ships no prebuilt Windows wheels; capped to avoid install failures on Windows machines without a build toolchain
"matplotlib>=3.10;platform_system=='Linux' or platform_system=='Darwin'", # no upper bound on Linux/macOS so bug-fix updates stay available
"ultralytics>=8.3.48", # yolov8, formula detection
"matplotlib>=3.10,<4",
"ultralytics>=8.3.48,<9", # yolov8, formula detection
"doclayout_yolo==0.0.2b1", # doclayout_yolo
"dill>=0.3.9,<1", # doclayout_yolo
"rapid_table>=1.0.3,<2.0.0", # rapid_table
"rapid_table>=1.0.5,<2.0.0", # rapid_table
"PyYAML>=6.0.2,<7", # yaml
"ftfy>=6.3.1,<7", # unimernet_hf
"ftfy>=6.3.1,<7", # unimernet_hf
"openai>=1.70.0,<2", # openai SDK
"shapely>=2.0.7,<3", # imgaug-paddleocr2pytorch
"pyclipper>=1.3.0,<2", # paddleocr2pytorch
"omegaconf>=2.3.0,<3", # paddleocr2pytorch
],
"old_linux": [
"albumentations<=1.4.20", # simsimd, introduced in 1.4.21, does not support Linux systems from 2019 or earlier
],
"full_old_linux": [
"matplotlib>=3.10,<=3.10.1",
"ultralytics>=8.3.48,<=8.3.104", # yolov8, formula detection
"doclayout_yolo==0.0.2b1", # doclayout_yolo
"dill==0.3.9", # doclayout_yolo
"PyYAML==6.0.2", # yaml
"ftfy==6.3.1", # unimernet_hf
"openai==1.71.0", # openai SDK
"shapely==2.1.0", # imgaug-paddleocr2pytorch
"pyclipper==1.3.0.post6", # paddleocr2pytorch
"omegaconf==2.3.0", # paddleocr2pytorch
"albumentations==1.4.20", # simsimd, introduced in 1.4.21, does not support Linux systems from 2019 or earlier
"rapid_table==1.0.3", # newer rapid_table versions depend on an onnxruntime that does not support Linux systems from 2019 or earlier
],
},
description="A practical tool for converting PDF to Markdown", # short description
long_description=long_description, # long description
long_description_content_type="text/markdown", # the README is in Markdown format
url="https://github.com/opendatalab/MinerU",
python_requires=">=3.9", # required Python version
project_urls={
"Home": "https://mineru.net/",
"Repository": "https://github.com/opendatalab/MinerU",
},
keywords=["magic-pdf", "mineru", "MinerU", "convert", "pdf", "markdown"],
classifiers=[
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
],
python_requires=">=3.10,<4", # required Python version
entry_points={
"console_scripts": [
"magic-pdf = magic_pdf.tools.cli:cli",