Mirror of https://github.com/opendatalab/MinerU.git, synced 2026-03-27 11:08:32 +07:00
Compare commits
26 Commits
release-2.
...
release-2.
| Author | SHA1 | Date | |
|---|---|---|---|
| | b69191ba2b | | |
| | 0028514ced | | |
| | 8d8daf6851 | | |
| | 815280dd23 | | |
| | 7b52f92aea | | |
| | 33543b76c9 | | |
| | ea5f8e98dd | | |
| | 8996e06448 | | |
| | bfb304ef1f | | |
| | 17e6016b58 | | |
| | ba06cd14ef | | |
| | 0209ada8d0 | | |
| | e2140222bc | | |
| | d679d99192 | | |
| | 4bfcc0b808 | | |
| | ead29489ff | | |
| | c01e35b4c6 | | |
| | a89249069c | | |
| | 2fc395bcff | | |
| | 0ca244ad62 | | |
| | 8acc7dd326 | | |
| | 1cde3fe5ad | | |
| | 0a4c87fc22 | | |
| | 12d803079f | | |
| | 8c4b3ef3a2 | | |
| | ed6894c178 | | |
@@ -45,6 +45,11 @@

 # Changelog

+- 2026/01/06 2.7.1 Release
+  - fix bug: #4300
+  - Updated pdfminer.six dependency version to resolve [CVE-2025-64512](https://github.com/advisories/GHSA-wf5f-4jwr-ppcp)
+  - Support automatic correction of input image exif orientation to improve OCR recognition accuracy #4283
+
 - 2025/12/30 2.7.0 Release
   - Simplified installation process. No need to separately install `vlm` acceleration engine dependencies. Using `uv pip install mineru[all]` during installation will install all optional backend dependencies.
   - Added new `hybrid` backend, which combines the advantages of `pipeline` and `vlm` backends. Built on vlm, it integrates some capabilities of pipeline, adding extra extensibility on top of high accuracy:
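@@ -45,6 +45,11 @@

 # Changelog

+- 2026/01/06 2.7.1 Release
+  - fix bug: #4300
+  - Updated the pdfminer.six dependency version to resolve [CVE-2025-64512](https://github.com/advisories/GHSA-wf5f-4jwr-ppcp)
+  - Added automatic correction of the EXIF orientation of input images to improve OCR recognition #4283
+
 - 2025/12/30 2.7.0 Release
   - Simplified the installation process: the `vlm` acceleration engine dependencies no longer need to be installed separately, and `uv pip install mineru[all]` installs the dependencies of all optional backends.
   - Added a new `hybrid` backend that combines the strengths of the `pipeline` and `vlm` backends. Built on vlm, it incorporates some pipeline capabilities, adding extensibility on top of high accuracy: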
@@ -1,4 +1,4 @@
-# Quick Start (快速开始)
+# Getting Started (快速入门)

 If you run into any installation problems, check the [FAQ](../faq/index.md) first.

@@ -105,65 +105,50 @@ docker run -u root --name mineru_docker --privileged=true \
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td rowspan="4">命令行工具(mineru)</td>
|
||||
<td rowspan="3">命令行工具(mineru)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">fastapi服务(mineru-api)</td>
|
||||
<td rowspan="3">fastapi服务(mineru-api)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">gradio界面(mineru-gradio)</td>
|
||||
<td rowspan="3">gradio界面(mineru-gradio)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
|
||||
@@ -82,65 +82,50 @@ docker run --ipc host \
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td rowspan="4">命令行工具(mineru)</td>
|
||||
<td rowspan="3">命令行工具(mineru)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>🟡</td>
|
||||
<td>🟡</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">fastapi服务(mineru-api)</td>
|
||||
<td rowspan="3">fastapi服务(mineru-api)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>🟡</td>
|
||||
<td>🟡</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">gradio界面(mineru-gradio)</td>
|
||||
<td rowspan="3">gradio界面(mineru-gradio)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>🟡</td>
|
||||
<td>🟡</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
|
||||
@@ -73,65 +73,50 @@ docker run --privileged=true \
|
||||
</thead>
|
||||
<tbody>
|
||||
<tr>
|
||||
<td rowspan="4">命令行工具(mineru)</td>
|
||||
<td rowspan="3">命令行工具(mineru)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">fastapi服务(mineru-api)</td>
|
||||
<td rowspan="3">fastapi服务(mineru-api)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td rowspan="4">gradio界面(mineru-gradio)</td>
|
||||
<td rowspan="3">gradio界面(mineru-gradio)</td>
|
||||
<td>pipeline</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td><vlm/hybrid>-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-<engine_name>-engine</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td><vlm/hybrid>-http-client</td>
|
||||
<td>🟢</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
|
||||
@@ -14,21 +14,21 @@
 cpu: Hygon C86-4G
 gpu: VA16 / VA1L / VA10L
 torch: 2.8.0+cpu
-torch-vacc: 1.3.3.626
+torch-vacc: 1.3.3.777
 vllm: 0.11.1.dev0+gb8b302cde.d20251030.cpu
-vllm-vacc: 0.11.0.626
-driver: 00.25.12.02 d3_3_v2_9_a3_1 3ef7cf3 20251202
+vllm-vacc: 0.11.0.777
+driver: 00.25.12.30 d3_3_v2_9_a3_1 a76bf37 20251230
 docker: 28.1.1
 ```

 ## 3. Environment Setup

-- Pull the Docker image
+- Pull the vllm_vacc base image
 ```bash
-sudo docker pull harbor.vastaitech.com/ai_deliver/vllm_vacc:VVI-25.12.SP1
+sudo docker pull harbor.vastaitech.com/ai_deliver/vllm_vacc:VVI-25.12.SP2
 ```

-- Start the Docker container
+- Start the container
 ```bash
 sudo docker run -it \
 --privileged=true \
@@ -36,23 +36,7 @@
 --name vllm_service \
 --ipc=host \
 --network=host \
-harbor.vastaitech.com/ai_deliver/vllm_vacc:VVI-25.12.SP1 bash
-```
-
-
->[!TIP]
-> - The image already includes `torch/vllm` and related dependencies
-> - Similar to `CUDA_VISIBLE_DEVICES` on `NVIDIA` hardware, `VACC_VISIBLE_DEVICES` selects the visible compute card IDs on `VastAI` hardware, e.g. `-e VACC_VISIBLE_DEVICES=0,1,2,3`
-> - Set an appropriate `--shm-size` (shared memory size)
-
-## 4. MinerU Features
-
->[!NOTE]
-> - `VastAI` accelerator cards support accelerated `VLM` model inference only through `vlm-vllm-engine` and `vlm-http-client`
-
-- Enter the container
-```bash
-sudo docker exec -it vllm_service bash
+harbor.vastaitech.com/ai_deliver/vllm_vacc:VVI-25.12.SP2 bash
 ```

 - Install MinerU
@@ -60,6 +44,10 @@
 - Follow the official installation guide: [README_zh-CN.md#安装-mineru](https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md#安装-mineru)

 ```bash
+# Enter the container
+# sudo docker exec -it vllm_service bash
+
+# Optional PyPI mirrors
 # https://mirrors.163.com/pypi/simple/
 # https://mirrors.aliyun.com/pypi/simple/
 # https://pypi.mirrors.ustc.edu.cn/simple/
@@ -68,26 +56,42 @@

 # Install MinerU from source
 git clone https://github.com/opendatalab/MinerU.git
-git checkout eed479eb56bba93ee99c1a8c255d509bd2f837e5
+git checkout 8c4b3ef3a20b11ddac9903f25124d24ea82639b5
 pip install -e .[core] -i https://mirrors.aliyun.com/pypi/simple

-# Install MinerU with pip
-pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple
+# Or install MinerU with pip
+pip install -U "mineru[core]==2.7.0" -i https://mirrors.aliyun.com/pypi/simple
 ```

+> [!NOTE]
+> - The `vllm_vacc` base image already includes `torch/vllm` and related dependencies
+> - As of `2025/12/31`, `VastAI` supports `MinerU` up to the latest version `2.7.0` (`master` branch, commit `8c4b3ef3`)
+> - Similar to `CUDA_VISIBLE_DEVICES` on `NVIDIA` hardware, `VACC_VISIBLE_DEVICES` selects the visible compute card IDs on `VastAI` hardware, e.g. `-e VACC_VISIBLE_DEVICES=0,1,2,3`
+> - Set an appropriate `--shm-size` (shared memory size)
+
+## 4. MinerU Features
+
+> [!NOTE]
+> - `VastAI` accelerator cards support accelerated `VLM` model inference only through `vlm-auto-engine` and `vlm-http-client`
+
+- Enter the container
+```bash
+sudo docker exec -it vllm_service bash
+```
+
 - Use MinerU

 - Model preparation: see the official guide, [model_source.md](https://github.com/opendatalab/MinerU/blob/master/docs/zh/usage/model_source.md)

-- Option 1: `vlm-vllm-engine`
+- Option 1: `vlm-auto-engine`

 ```bash
 export MINERU_MODEL_SOURCE=modelscope

-# step1: start a MinerU parsing task with `vlm-vllm-engine`
-mineru -p /path/to/demo/pdfs/demo1.pdf \
+# step1: start a MinerU parsing task with `vlm-auto-engine`
+mineru -p image.png \
 -o ./output \
--b vlm-vllm-engine \
+-b vlm-auto-engine \
 --http-timeout 1200 \
 --tensor-parallel-size 2 \
 --enforce_eager \
@@ -108,7 +112,7 @@
 --served-model-name MinerU2.5-2509-1.2B

 # step2: start a MinerU parsing task with `vlm-http-client`
-mineru -p /path/to/demo/pdfs/demo1.pdf \
+mineru -p demo/pdfs/demo1.pdf \
 -o ./output \
 -b vlm-http-client \
 -u http://127.0.0.1:8090 \
@@ -116,8 +120,7 @@
 ```


->[!NOTE]
-> - As of `2025/12/23`, `VastAI` supports `MinerU` up to the latest version `2.6.8` (`master` branch, commit `eed479eb`)
+> [!NOTE]
 > - Append the `--enforce_eager` argument to any `vllm`-related command


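The two invocation modes in the hunks above differ only in the backend name and, for the client mode, the server URL. A minimal Python sketch of assembling those command lines follows. `build_mineru_argv` is a hypothetical helper introduced here for illustration and is not part of MinerU; it uses only flags that appear in the commands above and builds the argument list without running anything.

```python
from typing import List, Optional


def build_mineru_argv(
    input_path: str,
    output_dir: str,
    backend: str,
    server_url: Optional[str] = None,
    tensor_parallel_size: Optional[int] = None,
    enforce_eager: bool = False,
) -> List[str]:
    """Assemble a `mineru` command line (hypothetical helper, not MinerU API)."""
    argv = ["mineru", "-p", input_path, "-o", output_dir, "-b", backend]
    # Client backends talk to a remote server, so they need a URL.
    if backend.endswith("client"):
        if server_url is None:
            raise ValueError("client backends need a server URL (-u)")
        argv += ["-u", server_url]
    if tensor_parallel_size is not None:
        argv += ["--tensor-parallel-size", str(tensor_parallel_size)]
    # The VastAI note asks for this flag on vllm-related commands.
    if enforce_eager:
        argv.append("--enforce_eager")
    return argv


local_argv = build_mineru_argv(
    "image.png", "./output", "vlm-auto-engine",
    tensor_parallel_size=2, enforce_eager=True,
)
client_argv = build_mineru_argv(
    "demo/pdfs/demo1.pdf", "./output", "vlm-http-client",
    server_url="http://127.0.0.1:8090",
)
```

Either list could then be handed to `subprocess.run(argv, check=True)` inside the container, assuming `mineru` is on the `PATH`.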
@@ -140,17 +143,17 @@
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>hybrid-http-client</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-vllm-engine</td>
|
||||
<td>hybrid-auto-engine</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-lmdeploy-client</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
@@ -161,17 +164,17 @@
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>hybrid-http-client</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-vllm-engine</td>
|
||||
<td>hybrid-auto-engine</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-lmdeploy-engine</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
@@ -182,17 +185,17 @@
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-transformers</td>
|
||||
<td>hybrid-http-client</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-vllm-engine</td>
|
||||
<td>hybrid-auto-engine</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-auto-engine</td>
|
||||
<td>🟢</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-lmdeploy-engine</td>
|
||||
<td>🔴</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td>vlm-http-client</td>
|
||||
<td>🟢</td>
|
||||
@@ -202,18 +205,19 @@
 <td>🟢</td>
 </tr>
 <tr>
-<td colspan="2">Tensor parallelism (--tensor-parallel-size/--tp)</td>
-<td>🟢 tp1/tp2 only</td>
+<td colspan="2">Tensor parallelism (--tensor-parallel-size)</td>
+<td>🟢</td>
 </tr>
 <tr>
-<td colspan="2">Data parallelism (--data-parallel-size/--dp)</td>
+<td colspan="2">Data parallelism (--data-parallel-size)</td>
 <td>🔴</td>
 </tr>
 </tbody>
 </table>


-Notes:
-🟢: Supported; runs stably, with accuracy essentially matching Nvidia GPU
-🟡: Supported but less stable; may fail in some scenarios or show some accuracy differences
-🔴: Not supported; cannot run, or accuracy differs significantly
+> [!NOTE]
+> - 🟢: Supported; runs stably, with accuracy essentially matching NVIDIA GPU
+> - 🟡: Supported but less stable; may fail in some scenarios or show some accuracy differences
+> - 🔴: Not supported; cannot run, or accuracy differs significantly
+> - `vlm-auto-engine`: VastAI supports only the vLLM backend
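@@ -4,10 +4,10 @@

 ## Table of Contents
 - Local deployment
-    * [Quick Usage](./quick_usage.md) - Getting started and basic usage
+    * [Basic Usage](./quick_usage.md) - Getting started and basic usage
     * [Model Source Configuration](./model_source.md) - Detailed configuration of model sources
     * [CLI Tools](./cli_tools.md) - Detailed parameter reference for the command-line tools
-    * [Advanced Optimization Parameters](./advanced_cli_parameters.md) - Advanced parameters for the command-line tools
+    * [Advanced CLI Parameters](./advanced_cli_parameters.md) - Advanced parameters for the command-line tools
 - Other accelerator card adaptations (🚀 officially supported / ❤️ community contributed)
     * [Ascend (昇腾)](acceleration_cards/Ascend.md) 🚀
     * [T-Head (平头哥)](acceleration_cards/THead.md) 🚀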
@@ -17,8 +17,6 @@ from mineru.utils.pdf_image_tools import images_bytes_to_pdf_bytes
 from mineru.backend.vlm.vlm_middle_json_mkcontent import union_make as vlm_union_make
 from mineru.backend.vlm.vlm_analyze import doc_analyze as vlm_doc_analyze
 from mineru.backend.vlm.vlm_analyze import aio_doc_analyze as aio_vlm_doc_analyze
-from mineru.backend.hybrid.hybrid_analyze import doc_analyze as hybrid_doc_analyze
-from mineru.backend.hybrid.hybrid_analyze import aio_doc_analyze as aio_hybrid_doc_analyze
 from mineru.utils.pdf_page_id import get_end_page_id

 if os.getenv("MINERU_LMDEPLOY_DEVICE", "") == "maca":
@@ -326,6 +324,7 @@ def _process_hybrid(
     server_url=None,
     **kwargs,
 ):
+    from mineru.backend.hybrid.hybrid_analyze import doc_analyze as hybrid_doc_analyze
     """Synchronous processing logic for the hybrid backend"""
     if not backend.endswith("client"):
         server_url = None
@@ -378,8 +377,8 @@ async def _async_process_hybrid(
     server_url=None,
     **kwargs,
 ):
+    from mineru.backend.hybrid.hybrid_analyze import aio_doc_analyze as aio_hybrid_doc_analyze
     """Asynchronous processing logic for the hybrid backend"""
-
     if not backend.endswith("client"):
         server_url = None

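Both hybrid entry points in the hunks above discard `server_url` unless the backend name ends with `client`. A minimal sketch of that rule in isolation follows; `resolve_server_url` is a hypothetical name chosen for illustration, since the diff applies the check inline rather than through a helper.

```python
from typing import Optional


def resolve_server_url(backend: str, server_url: Optional[str]) -> Optional[str]:
    """Keep the URL only for backends whose name ends with "client".

    Hypothetical helper: it mirrors the inline check
    `if not backend.endswith("client"): server_url = None`.
    """
    if not backend.endswith("client"):
        return None
    return server_url


# A local engine ignores any URL that was passed in.
engine_url = resolve_server_url("hybrid-auto-engine", "http://127.0.0.1:8090")
# A client backend keeps it.
client_url = resolve_server_url("hybrid-http-client", "http://127.0.0.1:8090")
```

Keeping the rule in one function would avoid repeating it in the sync and async paths, though whether upstream wants that refactor is not something the diff shows.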
@@ -5,7 +5,7 @@ from io import BytesIO
 import numpy as np
 import pypdfium2 as pdfium
 from loguru import logger
-from PIL import Image
+from PIL import Image, ImageOps

 from mineru.data.data_reader_writer import FileBasedDataWriter
 from mineru.utils.check_sys_env import is_windows_environment
@@ -41,19 +41,23 @@ def pdf_page_to_image(page: pdfium.PdfPage, dpi=200, image_type=ImageType.PIL) -
     return image_dict


-def _load_images_from_pdf_worker(pdf_bytes, dpi, start_page_id, end_page_id, image_type):
+def _load_images_from_pdf_worker(
+    pdf_bytes, dpi, start_page_id, end_page_id, image_type
+):
     """Wrapper function for the process pool"""
-    return load_images_from_pdf_core(pdf_bytes, dpi, start_page_id, end_page_id, image_type)
+    return load_images_from_pdf_core(
+        pdf_bytes, dpi, start_page_id, end_page_id, image_type
+    )


 def load_images_from_pdf(
-        pdf_bytes: bytes,
-        dpi=200,
-        start_page_id=0,
-        end_page_id=None,
-        image_type=ImageType.PIL,
-        timeout=None,
-        threads=4,
+    pdf_bytes: bytes,
+    dpi=200,
+    start_page_id=0,
+    end_page_id=None,
+    image_type=ImageType.PIL,
+    timeout=None,
+    threads=4,
 ):
     """PDF-to-image conversion with timeout control; supports multi-process acceleration

@@ -77,7 +81,7 @@ def load_images_from_pdf(
             dpi,
             start_page_id,
             get_end_page_id(end_page_id, len(pdf_doc)),
-            image_type
+            image_type,
         ), pdf_doc
     else:
         if timeout is None:
@@ -116,7 +120,7 @@ def load_images_from_pdf(
                     dpi,
                     range_start,
                     range_end,
-                    image_type
+                    image_type,
                 )
                 futures.append((range_start, future))

@@ -163,7 +167,14 @@ def load_images_from_pdf_core(
     return images_list


-def cut_image(bbox: tuple, page_num: int, page_pil_img, return_path, image_writer: FileBasedDataWriter, scale=2):
+def cut_image(
+    bbox: tuple,
+    page_num: int,
+    page_pil_img,
+    return_path,
+    image_writer: FileBasedDataWriter,
+    scale=2,
+):
     """Crop a jpg image from page page_num according to bbox and return the image path. save_path must support both s3 and local storage;
     the image is stored under save_path with the file name:
     {page_num}_{bbox[0]}_{bbox[1]}_{bbox[2]}_{bbox[3]}.jpg , with the bbox numbers rounded to integers."""
@@ -197,7 +208,6 @@ def get_crop_img(bbox: tuple, pil_img, scale=2):


 def get_crop_np_img(bbox: tuple, input_img, scale=2):
-
     if isinstance(input_img, Image.Image):
         np_img = np.asarray(input_img)
     elif isinstance(input_img, np.ndarray):
@@ -212,17 +222,27 @@ def get_crop_np_img(bbox: tuple, input_img, scale=2):
         int(bbox[3] * scale),
     )

-    return np_img[scale_bbox[1]:scale_bbox[3], scale_bbox[0]:scale_bbox[2]]
+    return np_img[scale_bbox[1] : scale_bbox[3], scale_bbox[0] : scale_bbox[2]]


 def images_bytes_to_pdf_bytes(image_bytes):
     # In-memory buffer
     pdf_buffer = BytesIO()

     # Load the image and convert it to RGB mode
-    image = Image.open(BytesIO(image_bytes)).convert("RGB")
+    image = Image.open(BytesIO(image_bytes))
+    # Auto-rotate according to the EXIF data (handles phone photos that carry an Orientation tag)
+    image = ImageOps.exif_transpose(image) or image
+    # Convert only when necessary
+    if image.mode != "RGB":
+        image = image.convert("RGB")
+
     # Save the first image as PDF and append the rest
-    image.save(pdf_buffer, format="PDF", save_all=True)
+    image.save(
+        pdf_buffer,
+        format="PDF",
+        # save_all=True
+    )

     # Get the PDF bytes and reset the pointer (optional)
     pdf_bytes = pdf_buffer.getvalue()

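The hunk above delegates orientation handling to Pillow's `ImageOps.exif_transpose`. To make the effect concrete, here is a minimal pure-Python sketch of the correction each EXIF Orientation value (1 to 8) calls for, applied to a small grid of pixel values instead of a real image. The helper names are hypothetical and this is an illustration of the orientation table, not Pillow's implementation.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]


def flip_lr(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_tb(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]


def rot180(g: Grid) -> Grid:
    return [row[::-1] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    # Mirror across the main diagonal.
    return [list(col) for col in zip(*g)]


def rot90_cw(g: Grid) -> Grid:
    return [list(col) for col in zip(*g[::-1])]


def rot90_ccw(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)][::-1]


def transverse(g: Grid) -> Grid:
    # Mirror across the anti-diagonal.
    return [list(col)[::-1] for col in zip(*g)][::-1]


# Correction needed to display an image stored with each Orientation value.
CORRECTIONS: Dict[int, Callable[[Grid], Grid]] = {
    2: flip_lr,
    3: rot180,
    4: flip_tb,
    5: transpose,
    6: rot90_cw,
    7: transverse,
    8: rot90_ccw,
}


def apply_exif_orientation(grid: Grid, orientation: int) -> Grid:
    """Return an upright copy; orientation 1 or unknown values leave the grid as is."""
    fix = CORRECTIONS.get(orientation)
    return fix(grid) if fix else [row[:] for row in grid]


pixels = [[1, 2], [3, 4]]
upright = apply_exif_orientation(pixels, 6)
```

A typical phone photo taken in portrait mode is stored with Orientation 6, so without this step the OCR stage would see the page lying on its side.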
@@ -9,7 +9,14 @@ from mineru.utils.char_utils import full_to_half
 from mineru.utils.enum_class import BlockType, SplitFlag


-CONTINUATION_MARKERS = ["(续)", "(续表)", "(continued)", "(cont.)"]
+CONTINUATION_MARKERS = [
+    "(续)",
+    "(续表)",
+    "(续上表)",
+    "(continued)",
+    "(cont.)",
+    "(cont’d)",
+]


 def calculate_table_total_columns(soup):
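The diff only extends the marker list and does not show how it is consumed. As a rough sketch of one plausible use, the function below checks whether a table caption ends with one of the markers. `has_continuation_marker` is a hypothetical name, and the lowercasing and full-width parenthesis normalization are assumptions suggested by the `full_to_half` import, not confirmed behavior.

```python
CONTINUATION_MARKERS = [
    "(续)",
    "(续表)",
    "(续上表)",
    "(continued)",
    "(cont.)",
    "(cont’d)",
]


def has_continuation_marker(caption: str) -> bool:
    """Return True when the caption ends with a known continuation marker.

    Hypothetical helper: normalization rules here are assumptions.
    """
    normalized = (
        caption.strip()
        .lower()
        .replace("\uff08", "(")  # full-width left parenthesis
        .replace("\uff09", ")")  # full-width right parenthesis
    )
    return any(normalized.endswith(marker) for marker in CONTINUATION_MARKERS)


is_continued = has_continuation_marker("Table 2 (Continued)")
```

Note that the new `(cont’d)` entry uses a typographic apostrophe, so a caption written with a plain ASCII `'` would not match unless it is normalized first.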
@@ -1 +1 @@
-__version__ = "2.6.8"
+__version__ = "2.7.0"
19
mkdocs.yml
@@ -93,7 +93,18 @@ nav:
      - FAQ: faq/index.md
  - Demo:
      - Demo: demo/index.md
  - Quick Start:
      - Quick Start: quick_start/index.md
      - Extension Modules: quick_start/extension_modules.md
      - Docker Deployment: quick_start/docker_deployment.md
  - Usage:
      - Usage: usage/index.md
      - Quick Usage: usage/quick_usage.md
      - Model Source: usage/model_source.md
      - CLI Tools: usage/cli_tools.md
      - Advanced CLI Parameters: usage/advanced_cli_parameters.md
  - Reference:
      - Reference: reference/index.md
      - Output File Format: reference/output_files.md
      - Changelog: reference/changelog.md
  - FAQ:
@@ -119,13 +130,13 @@ plugins:
           build: true
           nav_translations:
             Home: 主页
-            Quick Start: 快速开始
+            Quick Start: 快速入门
             Extension Modules: 扩展模块安装
             Docker Deployment: Docker部署
-            Usage: 使用方法
-            Quick Usage: 快速使用
+            Usage: 使用指南
+            Quick Usage: 基础使用
             CLI Tools: 命令行工具
-            Model Source: 模型源
+            Model Source: 模型源配置
             Advanced CLI Parameters: 命令行进阶参数
             FAQ: 常见问题解答
             Reference: 参考资料
@@ -21,7 +21,7 @@ dependencies = [
     "click>=8.1.7",
     "loguru>=0.7.2",
     "numpy>=1.21.6",
-    "pdfminer.six==20250506",
+    "pdfminer.six>=20251230",
     "tqdm>=4.67.1",
     "requests",
     "httpx",
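The pin above moves from an exact version to a minimum, and pdfminer.six uses date-style version numbers, so the comparison is effectively numeric. A minimal sketch of checking an environment against that minimum follows. Both function names are hypothetical, and the parsing assumes the version starts with a plain integer segment such as `20251230`.

```python
from importlib import metadata
from typing import Optional


def date_version_at_least(installed: str, minimum: str) -> bool:
    """Compare date-style versions by their leading integer segment.

    Assumption: both strings start with digits, e.g. "20250506".
    """
    return int(installed.split(".")[0]) >= int(minimum.split(".")[0])


def installed_meets_minimum(dist_name: str, minimum: str) -> Optional[bool]:
    """Return None when the distribution is not installed at all."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return date_version_at_least(installed, minimum)


old_pin_ok = date_version_at_least("20250506", "20251230")
new_pin_ok = date_version_at_least("20251230", "20251230")
```

Calling `installed_meets_minimum("pdfminer.six", "20251230")` in a deployed environment would show whether the release that resolves the CVE mentioned in the changelog is actually present.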
@@ -94,10 +94,10 @@ core = [
    "mineru[pipeline]",
    "mineru[api]",
    "mineru[gradio]",
    "mineru[mlx] ; sys_platform == 'darwin'",
]
all = [
    "mineru[core]",
    "mineru[mlx] ; sys_platform == 'darwin'",
    "mineru[vllm] ; sys_platform == 'linux'",
    "mineru[lmdeploy] ; sys_platform == 'windows'",
]