Compare commits

...

41 Commits

Author SHA1 Message Date
Xiaomeng Zhao
ccf2ea04cb Merge pull request #2156 from opendatalab/dev
Dev
2025-04-08 18:16:07 +08:00
Xiaomeng Zhao
564991512c Merge branch 'release-1.3.1' into dev 2025-04-08 18:16:01 +08:00
Xiaomeng Zhao
a1595f1912 Merge pull request #2155 from myhloli/dev
docs: update version number in README files
2025-04-08 18:15:17 +08:00
myhloli
bc0ff1acb0 docs: update version number in README files
- Correct version number from 1.3.2 to 1.3.1 in both README.md and README_zh-CN.md
- Update changelog entries for the latest release
2025-04-08 18:14:29 +08:00
Xiaomeng Zhao
b3ac3ac148 Merge branch 'master' into release-1.3.2 2025-04-08 18:11:16 +08:00
Xiaomeng Zhao
2c7094ff3d Merge pull request #2153 from opendatalab/dev
Dev
2025-04-08 18:10:16 +08:00
Xiaomeng Zhao
0ed231cb8b Merge pull request #2152 from myhloli/dev
docs(README): update version number and changelog in README files
2025-04-08 18:09:53 +08:00
myhloli
bd4728aaeb docs(README): update version number and changelog in README files
- Update version number from 1.3.1 to 1.3.2
2025-04-08 18:09:05 +08:00
Xiaomeng Zhao
2813e59905 Merge pull request #2151 from myhloli/dev
refactor(ocr): improve OCR score precision to three decimal places
2025-04-08 18:06:31 +08:00
myhloli
ea730ae2e9 refactor(ocr): improve OCR score precision to three decimal places
- Update OCR score formatting in batch_analyze.py and pdf_parse_union_core_v2.py
- Change score rounding method to preserve three decimal places
- Enhance accuracy representation without significantly altering the score value
2025-04-08 18:02:03 +08:00
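The precision change above can be sketched outside the codebase. This is an illustrative comparison (the helper names are ours, not MinerU's) of the old `round(score, 2)` behavior against the new three-decimal formatting applied in batch_analyze.py and pdf_parse_union_core_v2.py:

```python
def old_score(score: float) -> float:
    # previous behavior: round to two decimal places
    return float(round(score, 2))

def new_score(score: float) -> float:
    # new behavior: keep three decimal places via string formatting
    return float(f"{score:.3f}")

print(old_score(0.98765))  # 0.99
print(new_score(0.98765))  # 0.988
```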
myhloli
0ab29cdbee docs(README): update version number in release notes
- Update version from 1.3.1 to 1.3.2 in both English and Chinese README files
- Keep other content unchanged
2025-04-08 17:37:39 +08:00
Xiaomeng Zhao
44665d3966 Update python-package.yml 2025-04-08 17:35:39 +08:00
myhloli
79feb926b7 Update version.py with new version 2025-04-08 09:23:09 +00:00
Xiaomeng Zhao
a2cde43b57 Merge pull request #2146 from opendatalab/release-1.3.1
Release 1.3.1
2025-04-08 17:20:21 +08:00
Xiaomeng Zhao
b8856ca96a Merge pull request #2148 from opendatalab/dev
Dev
2025-04-08 17:03:27 +08:00
Xiaomeng Zhao
098cf1df60 Merge pull request #2147 from myhloli/dev
docs: update badges and project URLs- Update PyPI version badge to us…
2025-04-08 17:02:43 +08:00
myhloli
90f0e7370a docs: update badges and project URLs
- Update PyPI version badge to use shields.io
- Add project URLs in setup.py for better discoverability
- Make consistent changes across README.md and README_zh-CN.md
2025-04-08 17:01:41 +08:00
Xiaomeng Zhao
714504864e Update python-package.yml 2025-04-08 16:49:56 +08:00
Xiaomeng Zhao
87fd4c2806 Update bug_report.yml 2025-04-08 16:49:02 +08:00
Xiaomeng Zhao
3251c73250 Merge pull request #2145 from opendatalab/dev
fix(table): add model path for slanet-plus to resolve RapidTableError
2025-04-08 16:47:45 +08:00
Xiaomeng Zhao
697da27cf7 Merge pull request #2144 from myhloli/dev
fix(table): add model path for slanet-plus to resolve RapidTableError
2025-04-08 16:47:09 +08:00
myhloli
e327e9bad5 fix(table): add model path for slanet-plus to resolve RapidTableError
- Import os and pathlib modules to handle file paths
- Define the path to the slanet-plus model
- Update RapidTableInput initialization to include the model path
2025-04-08 16:46:01 +08:00
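The fix can be sketched roughly as below; `slanet_plus_model_path` and `package_root` are illustrative stand-ins (the real module climbs five `.parent` levels from its own file), showing the idea of passing an explicit bundled-model path so RapidTable does not fail trying to locate the model:

```python
import os
from pathlib import Path

# Illustrative sketch: compute the bundled slanet-plus model path relative
# to a package root, to be passed explicitly to RapidTableInput.
# `package_root` is a stand-in for the five .parent hops in the real code.
def slanet_plus_model_path(package_root: Path) -> str:
    return os.path.join(package_root, 'resources', 'slanet_plus', 'slanet-plus.onnx')

print(slanet_plus_model_path(Path('/opt/magic_pdf')))
```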
Xiaomeng Zhao
99d5c022c4 Merge pull request #2142 from myhloli/dev
update 1.3.1
2025-04-08 16:13:28 +08:00
myhloli
7b61b418a3 ci: update Python version support and installation process
- Add support for Python 3.11, 3.12, and 3.13
- Replace requirements.txt based installation with editable install
2025-04-08 16:10:07 +08:00
myhloli
4fd8d626c4 docs(install): update Python version requirements and simplify torch installation
- Update Python version requirements to >=3.10
- Simplify torch installation command
- Remove numpy version restriction
- Update CUDA compatibility information
- Adjust environment creation commands across multiple documentation files
2025-04-08 16:06:02 +08:00
myhloli
cf6fa12767 build(setup): remove rapid_table dependency
- Remove rapid_table from install_requires in setup.py
2025-04-08 14:24:15 +08:00
myhloli
de4bc5a32d ci: update issue template options for Python version and dependency version
- Add "3.13" option for Python version
- Remove "3.9" option for Python version
- Update dependency version options:
  - Remove "0.8.x", "0.9.x", "0.10.x"
  - Add "1.1.x", "1.2.x", "1.3.x"
2025-04-08 14:22:06 +08:00
myhloli
9b5d2796f8 build(deps): update dependencies and add support for old Linux systems
- Update transformers to exclude version 4.51.0 due to compatibility issues
- Expand rapid_table version range to >=1.0.5,<2.0.0
- Add separate 'full_old_linux' extras_require for better support of older Linux systems
- Update matplotlib version requirements for different platforms
- Remove platform-specific paddlepaddle versions
2025-04-08 14:18:49 +08:00

myhloli
0f0591cf8f build(old_linux): add rapid_table dependency for PDF conversion
- Add rapid_table==1.0.3 to old_linux specific dependencies
- This version is compatible with Linux systems from 2019 and earlier
- Newer versions of rapid_table depend on onnxruntime, which is not supported on older Linux systems
2025-04-08 11:58:38 +08:00
Xiaomeng Zhao
cf6ffc6b1e Merge pull request #2128 from myhloli/dev
fix(model): improve VRAM detection and handling
2025-04-07 18:18:09 +08:00
myhloli
d32a63cada fix(model): improve VRAM detection and handling
- Refactor VRAM detection logic for better readability and efficiency
- Add fallback mechanism for unknown VRAM sizes
- Improve device checking in get_vram function
2025-04-07 18:15:37 +08:00
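A simplified sketch of the fallback described in this commit (function name is ours; tier boundaries other than 16 GB are elided): `get_vram` may now return `None` for unknown devices, so the batch-ratio ladder only runs when a VRAM figure exists and otherwise drops to the safe default.

```python
import os

# Sketch: choose a batch ratio from detected VRAM, with a default of 1
# when VRAM cannot be determined (get_vram returned None).
def choose_batch_ratio(vram_gb):
    if vram_gb is not None:
        gpu_memory = int(os.getenv('VIRTUAL_VRAM_SIZE', round(vram_gb)))
        if gpu_memory >= 16:
            return 16
        return 1  # smaller tiers elided in this sketch
    return 1  # default batch_ratio when VRAM can't be determined

print(choose_batch_ratio(24.0))  # 16
print(choose_batch_ratio(None))  # 1
```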
Xiaomeng Zhao
dfb3cbfb17 Merge pull request #2126 from icecraft/fix/image_ds_add_lang
fix: image dataset add lang field
2025-04-07 16:57:49 +08:00
icecraft
e36a083dc3 fix: image dataset add lang field 2025-04-07 15:40:06 +08:00
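A hedged sketch of the lang handling this fix adds to `ImageDataset` (the real code calls magic_pdf's `auto_detect_lang` on the image bytes; a stub detector stands in here):

```python
# Sketch of the lang-normalization logic: empty string means no hint,
# 'auto' triggers detection, anything else passes through unchanged.
def normalize_lang(lang, detect=lambda bits: 'en'):
    if lang == '':
        return None          # empty string: no language hint
    if lang == 'auto':
        return detect(b'')   # run language detection on the raw bytes
    return lang              # explicit language codes pass through

print(normalize_lang(''))      # None
print(normalize_lang('auto'))  # en
print(normalize_lang('ch'))    # ch
```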
Xiaomeng Zhao
980f5c8cd7 Merge pull request #2125 from opendatalab/dev
docs: update torchvision version in CUDA installation guide
2025-04-07 15:26:13 +08:00
Xiaomeng Zhao
f442adfc95 Merge pull request #2124 from myhloli/dev
docs: update torchvision version in CUDA installation guide
2025-04-07 14:54:30 +08:00
myhloli
d4cda0a8c2 docs: update torchvision version in CUDA installation guide
- Update torchvision version from 0.21.1 to 0.21.0 in Windows CUDA acceleration guides
- Update both English and Chinese versions of the documentation
2025-04-07 14:53:25 +08:00
Xiaomeng Zhao
60fdf851a4 Merge pull request #2115 from myhloli/dev
build: remove accelerate dependency
2025-04-06 22:25:01 +08:00
myhloli
a10b9aec74 build: remove accelerate dependency
- Remove accelerate package from requirements.txt
- This change ensures only necessary external dependencies are introduced
2025-04-06 22:24:23 +08:00
Xiaomeng Zhao
e3261b0eea Merge pull request #2114 from myhloli/dev
build(deps): add accelerate package and update requirements https://github.com/opendatalab/MinerU/issues/2112
2025-04-06 22:17:20 +08:00
myhloli
09632dddc1 build(deps): add accelerate package and update requirements
- Add accelerate package to support model training acceleration
- Update requirements.txt to include new dependency
2025-04-06 22:16:01 +08:00
Xiaomeng Zhao
c5329a0722 Merge pull request #2093 from opendatalab/master
master -> dev
2025-04-03 23:33:35 +08:00
25 changed files with 145 additions and 134 deletions


@@ -64,10 +64,10 @@ body:
# Need quotes around `3.10` otherwise it is treated as a number and shows as `3.1`.
options:
-
- "3.13"
- "3.12"
- "3.11"
- "3.10"
- "3.9"
validations:
required: true
@@ -78,10 +78,10 @@ body:
#multiple: false
options:
-
- "0.8.x"
- "0.9.x"
- "0.10.x"
- "1.0.x"
- "1.1.x"
- "1.2.x"
- "1.3.x"
validations:
required: true


@@ -54,13 +54,13 @@ jobs:
run: |
git push origin HEAD:master
build:
check-install:
needs: [ update-version ]
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.10"]
python-version: ["3.10", "3.11", "3.12", "3.13"]
steps:
- name: Checkout code
@@ -79,10 +79,26 @@ jobs:
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
- name: Install magic-pdf
run: |
python -m pip install --upgrade pip
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install -e .[full]
build:
needs: [ check-install ]
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [ "3.10"]
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
ref: master
fetch-depth: 0
- name: Install wheel
run: |


@@ -10,7 +10,8 @@
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
@@ -47,11 +48,15 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/04/03 Release of 1.3.0, in this version we made many optimizations and improvements:
- 2025/04/08 1.3.1 released, fixed some compatibility issues
- Supported Python 3.13
- Resolved errors caused by `transformers 4.51.0`
- Made the final adaptation for some outdated Linux systems (e.g., CentOS 7), and no further support will be guaranteed for subsequent versions. [Installation Instructions](https://github.com/opendatalab/MinerU/issues/1004)
- 2025/04/03 1.3.0 released, in this version we made many optimizations and improvements:
- Installation and compatibility optimization
- By removing the use of `layoutlmv3` in layout, resolved compatibility issues caused by `detectron2`.
- Torch version compatibility extended to 2.2~2.6 (excluding 2.5).
- CUDA compatibility supports 11.8/12.4/12.6 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
- CUDA compatibility supports 11.8/12.4/12.6/12.8 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs.
- Python compatible versions expanded to 3.10~3.12, solving the problem of automatic downgrade to 0.6.1 during installation in non-3.10 environments.
- Offline deployment process optimized; no internet connection required after successful deployment to download any model files.
- Performance optimization
@@ -232,7 +237,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.12</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -242,8 +247,8 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -274,7 +279,7 @@ Synced with dev branch updates:
#### 1. Install magic-pdf
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"
```


@@ -10,7 +10,8 @@
[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
[![PyPI version](https://img.shields.io/pypi/v/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/magic-pdf)](https://pypi.org/project/magic-pdf/)
[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
@@ -46,11 +47,16 @@
</div>
# Changelog
- 2025/04/08 1.3.1 released, fixed some compatibility issues
- Supported Python 3.13
- Resolved errors caused by `transformers 4.51.0`
- Made the final adaptation for some outdated Linux systems (e.g., CentOS 7), and no further support will be guaranteed for subsequent versions. [Installation Instructions](https://github.com/opendatalab/MinerU/issues/1004)
- 2025/04/03 1.3.0 released, in this version we made many optimizations and improvements:
- Installation and compatibility optimization
- By removing the use of `layoutlmv3` in layout, resolved compatibility issues caused by `detectron2`
- Torch version compatibility extended to 2.2~2.6 (excluding 2.5)
- CUDA compatibility supports 11.8/12.4/12.6 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs
- CUDA compatibility supports 11.8/12.4/12.6/12.8 (CUDA version determined by torch), resolving compatibility issues for some users with 50-series and H-series GPUs
- Python compatible versions expanded to 3.10~3.12, solving the problem of automatic downgrade to 0.6.1 during installation in non-3.10 environments
- Offline deployment process optimized; no internet connection required after successful deployment to download any model files
- Performance optimization
@@ -70,7 +76,6 @@
- 2025/02/24 1.2.0 released, in this version we fixed some issues and improved parsing efficiency and accuracy:
- Performance optimization
- Faster classification of PDF documents in auto mode
- Under Huawei Ascend NPU acceleration, added high-performance plugin support; end-to-end speedup of up to 300% in common scenarios [Application link](https://aicarrier.feishu.cn/share/base/form/shrcnb10VaoNQB8kQPA8DEfZC6d)
- Parsing optimization
- Optimized the parsing logic for documents containing watermarks, significantly improving results on such documents
- Improved the matching logic between multiple images/tables and their captions on a single page, increasing the accuracy of image-text matching in complex layouts
@@ -233,7 +238,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">python版本</td>
<td colspan="3">>=3.9,<=3.12</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver 版本</td>
@@ -243,8 +248,8 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">CUDA环境</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -279,7 +284,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
> Syncing of the latest version to domestic mirror sources may be delayed; please be patient
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
```


@@ -49,25 +49,3 @@ docker run -it -u root --name mineru-npu --privileged=true \
magic-pdf --help
```
## Known Issues
- paddleocr's embedded onnx model can recognize Chinese and English quickly only under the default language configuration
- When a custom lang parameter is specified, paddleocr's speed drops noticeably
- The layoutlmv3 layout model crashes intermittently; the doclayout_yolo model in the default configuration is recommended
- Table parsing is adapted only to the rapid_table model; other models may not work
## High-Performance Mode
- On specific hardware, a plugin can enable high-performance mode, improving overall speed by more than 300% over the default mode
| System requirement | Version/Model |
|----------------|--------------|
| Chip type | Ascend 910B |
| CANN version | CANN 8.0.RC2 |
| Driver version | 24.1.rc2.1 |
| magic-pdf version | >= 1.2.0 |
- The high-performance plugin has specific hardware and qualification requirements; to apply for access, fill out the [MinerU High-Performance Edition Cooperation Application Form](https://aicarrier.feishu.cn/share/base/form/shrcnb10VaoNQB8kQPA8DEfZC6d)


@@ -54,7 +54,7 @@ In the final step, enter `yes`, close the terminal, and reopen it.
### 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -63,14 +63,13 @@ conda activate mineru
```sh
pip install -U magic-pdf[full]
```
> [!IMPORTANT]
> After installation, make sure to check the version of `magic-pdf` using the following command:
> [!TIP]
> After installation, you can check the version of `magic-pdf` using the following command:
>
> ```sh
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report the issue.
### 6. Download Models


@@ -54,7 +54,7 @@ bash Anaconda3-2024.06-1-Linux-x86_64.sh
## 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -64,14 +64,13 @@ conda activate mineru
pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After installation, make sure to verify the version of magic-pdf using the following command:
> [!TIP]
> After installation, you can check the version of `magic-pdf` using the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it to us in the issues section.
## 6. Download Models


@@ -17,7 +17,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
### 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -28,13 +28,12 @@ pip install -U magic-pdf[full]
```
> [!IMPORTANT]
> After installation, verify the version of `magic-pdf`:
> After installation, you can check the version of `magic-pdf` using the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it in the issues section.
### 5. Download Models
@@ -64,7 +63,7 @@ If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-
1. **Overwrite the installation of torch and torchvision** supporting CUDA.(Please select the appropriate index-url based on your CUDA version. For more details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
```
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.


@@ -18,7 +18,7 @@ https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Window
## 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -29,13 +29,12 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
```
> [!IMPORTANT]
> After installation, make sure to verify the version of magic-pdf using the following command:
> After installation, you can check the version of magic-pdf using the following command:
>
> ```bash
> magic-pdf --version
> ```
>
> If the version number is less than 1.3.0, please report it to us in the issues section.
## 5. Download Models
@@ -65,7 +64,7 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
**1. Overwrite the installation of torch and torchvision with CUDA support** (select the appropriate index-url for your CUDA version; see the [PyTorch official website](https://pytorch.org/get-started/locally/) for details)
```bash
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
**2. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**


@@ -18,15 +18,6 @@ The configuration file can be found in the user directory, with the filename `ma
# How to update models previously downloaded
## 1. Models downloaded via Git LFS
> [!IMPORTANT]
> Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended.
>
> For versions 0.9.x and later, due to the repository change and the addition of the layout sorting model in PDF-Extract-Kit 1.0, the models cannot be updated using the `git pull` command. Instead, a Python script must be used for one-click updates.
When magic-pdf <= 0.8.1, if you have previously downloaded the model files via git lfs, you can navigate to the previous download directory and update the models using the `git pull` command.
## 2. Models downloaded via Hugging Face or Model Scope
## 1. Models downloaded via Hugging Face or Model Scope
If you previously downloaded models via Hugging Face or Model Scope, you can rerun the Python script used for the initial download. This will automatically update the model directory to the latest version.


@@ -32,16 +32,6 @@ The Python script automatically downloads the model files and configures the model directory in the configuration file
# How to update models previously downloaded
## 1. Models downloaded via Git LFS
> [!IMPORTANT]
> Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended.
>
> For versions 0.9.x and later, due to the repository change and the addition of the layout sorting model in PDF-Extract-Kit 1.0, the models cannot be updated using the `git pull` command. Instead, a Python script must be used for one-click updates.
When magic-pdf <= 0.8.1, if you have previously downloaded the model files via git lfs, you can navigate to the previous download directory and update the models using the `git pull` command.
## 2. Models downloaded via Hugging Face or Model Scope
## 1. Models downloaded via Hugging Face or Model Scope
If you previously downloaded models via Hugging Face or Model Scope, you can rerun the Python script used for the initial download. This will automatically update the model directory to the latest version.


@@ -232,7 +232,7 @@ class PymuDocDataset(Dataset):
self._records[i].set_image(images[i])
class ImageDataset(Dataset):
def __init__(self, bits: bytes):
def __init__(self, bits: bytes, lang=None):
"""Initialize the dataset, which wraps the pymudoc documents.
Args:
@@ -244,6 +244,17 @@ class ImageDataset(Dataset):
self._raw_data = bits
self._data_bits = pdf_bytes
if lang == '':
self._lang = None
elif lang == 'auto':
from magic_pdf.model.sub_modules.language_detection.utils import \
auto_detect_lang
self._lang = auto_detect_lang(bits)
logger.info(f'lang: {lang}, detect_lang: {self._lang}')
else:
self._lang = lang
logger.info(f'lang: {lang}')
def __len__(self) -> int:
"""The length of the dataset."""
return len(self._records)


@@ -1 +1 @@
__version__ = "1.3.0"
__version__ = "1.3.1"


@@ -241,7 +241,7 @@ class BatchAnalyze:
for index, layout_res_item in enumerate(need_ocr_lists_by_lang[lang]):
ocr_text, ocr_score = ocr_res_list[index]
layout_res_item['text'] = ocr_text
layout_res_item['score'] = float(round(ocr_score, 2))
layout_res_item['score'] = float(f"{ocr_score:.3f}")
total_processed += len(img_crop_list)


@@ -255,8 +255,9 @@ def may_batch_image_analyze(
torch.npu.set_compile_mode(jit_compile=False)
if str(device).startswith('npu') or str(device).startswith('cuda'):
gpu_memory = int(os.getenv('VIRTUAL_VRAM_SIZE', round(get_vram(device))))
if gpu_memory is not None:
vram = get_vram(device)
if vram is not None:
gpu_memory = int(os.getenv('VIRTUAL_VRAM_SIZE', round(vram)))
if gpu_memory >= 16:
batch_ratio = 16
elif gpu_memory >= 12:
@@ -268,6 +269,10 @@ def may_batch_image_analyze(
else:
batch_ratio = 1
logger.info(f'gpu_memory: {gpu_memory} GB, batch_ratio: {batch_ratio}')
else:
# Default batch_ratio when VRAM can't be determined
batch_ratio = 1
logger.info(f'Could not determine GPU memory, using default batch_ratio: {batch_ratio}')
# doc_analyze_start = time.time()


@@ -57,7 +57,7 @@ def clean_vram(device, vram_threshold=8):
def get_vram(device):
if torch.cuda.is_available() and device != 'cpu':
if torch.cuda.is_available() and str(device).startswith("cuda"):
total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
return total_memory
elif str(device).startswith("npu"):


@@ -1,3 +1,5 @@
import os
from pathlib import Path
import cv2
import numpy as np
import torch
@@ -17,7 +19,9 @@ class RapidTableModel(object):
if torch.cuda.is_available() and table_sub_model_name == "unitable":
input_args = RapidTableInput(model_type=table_sub_model_name, use_cuda=True, device=get_device())
else:
input_args = RapidTableInput(model_type=table_sub_model_name)
root_dir = Path(__file__).absolute().parent.parent.parent.parent.parent
slanet_plus_model_path = os.path.join(root_dir, 'resources', 'slanet_plus', 'slanet-plus.onnx')
input_args = RapidTableInput(model_type=table_sub_model_name, model_path=slanet_plus_model_path)
else:
raise ValueError(f"Invalid table_sub_model_name: {table_sub_model_name}. It must be one of {sub_model_list}")


@@ -997,7 +997,7 @@ def pdf_parse_union(
for index, span in enumerate(need_ocr_list):
ocr_text, ocr_score = ocr_res_list[index]
span['content'] = ocr_text
span['score'] = float(round(ocr_score, 2))
span['score'] = float(f"{ocr_score:.3f}")
# rec_time = time.time() - rec_start
# logger.info(f'ocr-dynamic-rec time: {round(rec_time, 2)}, total images processed: {len(img_crop_list)}')

Binary file not shown.


@@ -80,7 +80,7 @@ Specify Python version 3.10.
.. code:: sh
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
5. Install Applications
@@ -90,16 +90,15 @@ Specify Python version 3.10.
pip install -U magic-pdf[full]
.. admonition:: Important
.. admonition:: TIP
:class: tip
After installation, make sure to check the version of ``magic-pdf`` using the following command:
After installation, you can check the version of ``magic-pdf`` using the following command:
.. code:: sh
magic-pdf --version
If the version number is less than 1.3.0, please report the issue.
6. Download Models
~~~~~~~~~~~~~~~~~~
@@ -178,7 +177,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
::
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
4. Install Applications
@@ -188,16 +187,15 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
pip install -U magic-pdf[full]
.. admonition:: Important
.. admonition:: Tip
:class: tip
❗️After installation, verify the version of ``magic-pdf``:
After installation, you can check the version of ``magic-pdf``:
.. code:: bash
magic-pdf --version
If the version number is less than 1.3.0, please report it in the issues section.
5. Download Models
~~~~~~~~~~~~~~~~~~
@@ -237,7 +235,7 @@ test CUDA-accelerated parsing performance.
.. code:: sh
pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``


@@ -28,7 +28,7 @@ magic-pdf.json
"layoutreader-model-dir":"/tmp/layoutreader",
"device-mode":"cpu",
"layout-config": {
"model": "layoutlmv3"
"model": "doclayout_yolo"
},
"formula-config": {
"mfd_model": "yolo_v8_mfd",
@@ -37,7 +37,7 @@ magic-pdf.json
},
"table-config": {
"model": "rapid_table",
"enable": false,
"enable": true,
"max_time": 400
},
"config_version": "1.0.0"
@@ -88,10 +88,10 @@ layout-config
.. code:: json
{
"model": "layoutlmv3"
"model": "doclayout_yolo"
}
layout model can not be disabled now, And we have only kind of layout model currently.
layout model can not be disabled now.
formula-config
@@ -132,14 +132,14 @@ table-config
{
"model": "rapid_table",
"enable": false,
"enable": true,
"max_time": 400
}
model
""""""""
Specify the table inference model, options are ['rapid_table', 'tablemaster', 'struct_eqtable']
Specify the table inference model, options are ['rapid_table']
max_time


@@ -29,18 +29,7 @@ filename ``magic-pdf.json``.
How to update models previously downloaded
-----------------------------------------
1. Models downloaded via Git LFS
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Due to feedback from some users that downloading model files using
git lfs was incomplete or resulted in corrupted model files, this
method is no longer recommended.
If you previously downloaded model files via git lfs, you can navigate
to the previous download directory and use the ``git pull`` command to
update the model.
2. Models downloaded via Hugging Face or Model Scope
1. Models downloaded via Hugging Face or Model Scope
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you previously downloaded models via Hugging Face or Model Scope, you


@@ -71,8 +71,8 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6</td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -97,7 +97,7 @@ Create an environment
.. code-block:: shell
conda create -n mineru 'python<3.13' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"


@@ -9,7 +9,7 @@ PyMuPDF>=1.24.9,<1.25.0
scikit-learn>=1.0.2
torch>=2.2.2,!=2.5.0,!=2.5.1,<=2.6.0
torchvision
transformers>=4.49.0,<5.0.0
transformers>=4.49.0,!=4.51.0,<5.0.0
pdfminer.six==20231228
tqdm>=4.67.1
# The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.


@@ -26,6 +26,7 @@ if __name__ == '__main__':
setup(
name="magic_pdf", # 项目名
version=__version__, # 自动从tag中获取版本号
license="AGPL-3.0",
packages=find_packages() + ["magic_pdf.resources"] + ["magic_pdf.model.sub_modules.ocr.paddleocr2pytorch.pytorchocr.utils.resources"], # 包含所有的包
package_data={
"magic_pdf.resources": ["**"], # 包含magic_pdf.resources目录下的所有文件
@@ -33,33 +34,55 @@ if __name__ == '__main__':
},
install_requires=parse_requirements('requirements.txt'), # 项目依赖的第三方库
extras_require={
"lite": ["paddleocr==2.7.3",
"paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
],
"lite": [
"paddleocr==2.7.3",
"paddlepaddle==3.0.0b1;platform_system=='Linux'",
"paddlepaddle==2.6.1;platform_system=='Windows' or platform_system=='Darwin'",
],
"full": [
"matplotlib<=3.9.0;platform_system=='Windows'", # 3.9.1及之后不提供windows的预编译包避免一些没有编译环境的windows设备安装失败
"matplotlib>=3.10;platform_system=='Linux' or platform_system=='Darwin'", # linux 和 macos 不应限制matplotlib的最高版本以避免无法更新导致的一些bug
"ultralytics>=8.3.48", # yolov8,公式检测
"doclayout_yolo==0.0.2b1", # doclayout_yolo
"dill>=0.3.9,<1", # doclayout_yolo
"rapid_table>=1.0.3,<2.0.0", # rapid_table
"rapid_table>=1.0.5,<2.0.0", # rapid_table
"PyYAML>=6.0.2,<7", # yaml
"ftfy>=6.3.1,<7", # unimernet_hf
"openai>=1.70.0,<2", # openai SDK
"shapely>=2.0.7,<3", # imgaug-paddleocr2pytorch
"pyclipper>=1.3.0,<2", # paddleocr2pytorch
"omegaconf>=2.3.0,<3", # paddleocr2pytorch
],
"old_linux":[
"albumentations<=1.4.20", # 1.4.21引入的simsimd不支持2019年及更早的linux系统
],
"full_old_linux":[
"matplotlib>=3.10,<=3.10.1",
"ultralytics>=8.3.48,<=8.3.104", # yolov8,公式检测
"doclayout_yolo==0.0.2b1", # doclayout_yolo
"dill==0.3.9", # doclayout_yolo
"PyYAML==6.0.2", # yaml
"ftfy==6.3.1", # unimernet_hf
"openai==1.71.0", # openai SDK
"shapely==2.1.0", # imgaug-paddleocr2pytorch
"pyclipper==1.3.0.post6", # paddleocr2pytorch
"omegaconf==2.3.0", # paddleocr2pytorch
"albumentations==1.4.20", # 1.4.21引入的simsimd不支持2019年及更早的linux系统
"rapid_table==1.0.3", # rapid_table新版本依赖的onnxruntime不支持2019年及更早的linux系统
],
},
description="A practical tool for converting PDF to Markdown",  # short description
long_description=long_description,  # long description
long_description_content_type="text/markdown",  # README is in Markdown format
url="https://github.com/opendatalab/MinerU",
python_requires=">=3.9",  # required Python version
project_urls={
"Home": "https://mineru.net/",
"Repository": "https://github.com/opendatalab/MinerU",
},
keywords=["magic-pdf, mineru, MinerU, convert, pdf, markdown"],
classifiers=[
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
],
python_requires=">=3.10,<4", # 项目依赖的 Python 版本
entry_points={
"console_scripts": [
"magic-pdf = magic_pdf.tools.cli:cli",