Merge pull request #3424 from myhloli/dev

Dev
Xiaomeng Zhao
2025-09-05 19:01:48 +08:00
committed by GitHub
11 changed files with 324 additions and 137 deletions

153
README.md

@@ -43,48 +43,122 @@
</div>
# Changelog
- 2025/08/01 2.1.10 Released
- Fixed an issue in the `pipeline` backend where block overlap caused the parsing results to deviate from expectations #3232
- 2025/07/30 2.1.9 Released
- `transformers` 4.54.1 version adaptation
- 2025/07/28 2.1.8 Released
- `sglang` 0.4.9.post5 version adaptation
- 2025/07/27 2.1.7 Released
- `transformers` 4.54.0 version adaptation
- 2025/07/26 2.1.6 Released
- Fixed table parsing issues in handwritten documents when using `vlm` backend
- Fixed visualization box position drift issue when document is rotated #3175
- 2025/07/24 2.1.5 Released
- `sglang` 0.4.9 version adaptation, synchronously upgrading the dockerfile base image to sglang 0.4.9.post3
- 2025/07/23 2.1.4 Released
- Bug Fixes
- Fixed the issue of excessive memory consumption during the `MFR` step in the `pipeline` backend under certain scenarios #2771
- Fixed the inaccurate matching between `image`/`table` and `caption`/`footnote` under certain conditions #3129
- 2025/07/16 2.1.1 Released
- Bug fixes
- Fixed text block content loss issue that could occur in certain `pipeline` scenarios #3005
- Fixed issue where `sglang-client` required unnecessary packages like `torch` #2968
- Updated `dockerfile` to fix incomplete text content parsing due to missing fonts in Linux #2915
- Usability improvements
- Updated `compose.yaml` to facilitate direct startup of `sglang-server`, `mineru-api`, and `mineru-gradio` services
- Launched brand new [online documentation site](https://opendatalab.github.io/MinerU/), simplified readme, providing better documentation experience
- 2025/07/05 Version 2.1.0 Released
- This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows:
- **Performance Optimizations:**
- Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).
- Greatly enhanced post-processing speed when the `pipeline` backend handles batch processing of documents with fewer pages (<10 pages).
- Layout analysis speed of the `pipeline` backend has been increased by approximately 20%.
- **Experience Enhancements:**
- Built-in ready-to-use `fastapi service` and `gradio webui`. For detailed usage instructions, please refer to [Documentation](https://opendatalab.github.io/MinerU/usage/quick_usage/#advanced-usage-via-api-webui-sglang-clientserver).
- Adapted to `sglang` version `0.4.8`, significantly reducing the GPU memory requirements for the `vlm-sglang` backend. It can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer).
- Added transparent parameter passing for all commands related to `sglang`, allowing the `sglang-engine` backend to receive all `sglang` parameters consistently with the `sglang-server`.
- Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`. For detailed usage instructions, please refer to [Documentation](https://opendatalab.github.io/MinerU/usage/quick_usage/#extending-mineru-functionality-with-configuration-files).
- **New Features:**
- Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
- Introduced limited support for vertical text layout in the `pipeline` backend.
- 2025/09/05 2.2.0 Released
- Major Updates
- In this version, we focused on improving table parsing accuracy by introducing a new [wired table recognition model](https://github.com/RapidAI/TableStructureRec) and a brand-new hybrid table structure parsing algorithm, significantly enhancing the table recognition capabilities of the `pipeline` backend.
- We also added cross-page table merging in both the `pipeline` and `vlm` backends, further improving the completeness and accuracy of table parsing.
- Other Updates
- The `pipeline` backend can now parse tables rotated by 270 degrees, so table parsing is supported in 0/90/270-degree orientations
- The `pipeline` backend added OCR support for Thai and Greek and updated the English OCR model to the latest version. English recognition accuracy improved by 11%, Thai recognition model accuracy is 82.68%, and Greek recognition model accuracy is 89.28% (by PP-OCRv5)
- Added `bbox` field (mapped to 0-1000 range) in the output `content_list.json`, making it convenient for users to directly obtain position information for each content block
<details>
<summary>History Log</summary>
<details>
<summary>2025/08/01 2.1.10 Released</summary>
<ul>
<li>Fixed an issue in the <code>pipeline</code> backend where block overlap caused the parsing results to deviate from expectations #3232</li>
</ul>
</details>
<details>
<summary>2025/07/30 2.1.9 Released</summary>
<ul>
<li><code>transformers</code> 4.54.1 version adaptation</li>
</ul>
</details>
<details>
<summary>2025/07/28 2.1.8 Released</summary>
<ul>
<li><code>sglang</code> 0.4.9.post5 version adaptation</li>
</ul>
</details>
<details>
<summary>2025/07/27 2.1.7 Released</summary>
<ul>
<li><code>transformers</code> 4.54.0 version adaptation</li>
</ul>
</details>
<details>
<summary>2025/07/26 2.1.6 Released</summary>
<ul>
<li>Fixed table parsing issues in handwritten documents when using <code>vlm</code> backend</li>
<li>Fixed visualization box position drift issue when document is rotated #3175</li>
</ul>
</details>
<details>
<summary>2025/07/24 2.1.5 Released</summary>
<ul>
<li><code>sglang</code> 0.4.9 version adaptation, synchronously upgrading the dockerfile base image to sglang 0.4.9.post3</li>
</ul>
</details>
<details>
<summary>2025/07/23 2.1.4 Released</summary>
<ul>
<li><strong>Bug Fixes</strong>
<ul>
<li>Fixed the issue of excessive memory consumption during the <code>MFR</code> step in the <code>pipeline</code> backend under certain scenarios #2771</li>
<li>Fixed the inaccurate matching between <code>image</code>/<code>table</code> and <code>caption</code>/<code>footnote</code> under certain conditions #3129</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/07/16 2.1.1 Released</summary>
<ul>
<li><strong>Bug fixes</strong>
<ul>
<li>Fixed text block content loss issue that could occur in certain <code>pipeline</code> scenarios #3005</li>
<li>Fixed issue where <code>sglang-client</code> required unnecessary packages like <code>torch</code> #2968</li>
<li>Updated <code>dockerfile</code> to fix incomplete text content parsing due to missing fonts in Linux #2915</li>
</ul>
</li>
<li><strong>Usability improvements</strong>
<ul>
<li>Updated <code>compose.yaml</code> to facilitate direct startup of <code>sglang-server</code>, <code>mineru-api</code>, and <code>mineru-gradio</code> services</li>
<li>Launched brand new <a href="https://opendatalab.github.io/MinerU/">online documentation site</a>, simplified readme, providing better documentation experience</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/07/05 2.1.0 Released</summary>
<ul>
<li>This is the first major update of MinerU 2, which includes a large number of new features and improvements, covering significant performance optimizations, user experience enhancements, and bug fixes. The detailed update contents are as follows:</li>
<li><strong>Performance Optimizations:</strong>
<ul>
<li>Significantly improved preprocessing speed for documents with specific resolutions (around 2000 pixels on the long side).</li>
<li>Greatly enhanced post-processing speed when the <code>pipeline</code> backend handles batch processing of documents with fewer pages (&lt;10 pages).</li>
<li>Layout analysis speed of the <code>pipeline</code> backend has been increased by approximately 20%.</li>
</ul>
</li>
<li><strong>Experience Enhancements:</strong>
<ul>
<li>Built-in ready-to-use <code>fastapi service</code> and <code>gradio webui</code>. For detailed usage instructions, please refer to <a href="https://opendatalab.github.io/MinerU/usage/quick_usage/#advanced-usage-via-api-webui-sglang-clientserver">Documentation</a>.</li>
<li>Adapted to <code>sglang</code> version <code>0.4.8</code>, significantly reducing the GPU memory requirements for the <code>vlm-sglang</code> backend. It can now run on graphics cards with as little as <code>8GB GPU memory</code> (Turing architecture or newer).</li>
<li>Added transparent parameter passing for all commands related to <code>sglang</code>, allowing the <code>sglang-engine</code> backend to receive all <code>sglang</code> parameters consistently with the <code>sglang-server</code>.</li>
<li>Supports feature extensions based on configuration files, including <code>custom formula delimiters</code>, <code>enabling heading classification</code>, and <code>customizing local model directories</code>. For detailed usage instructions, please refer to <a href="https://opendatalab.github.io/MinerU/usage/quick_usage/#extending-mineru-functionality-with-configuration-files">Documentation</a>.</li>
</ul>
</li>
<li><strong>New Features:</strong>
<ul>
<li>Updated the <code>pipeline</code> backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. <a href="https://paddlepaddle.github.io/PaddleOCR/latest/en/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html">Details</a></li>
<li>Introduced limited support for vertical text layout in the <code>pipeline</code> backend.</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/06/20 2.0.6 Released</summary>
<ul>
@@ -596,6 +670,7 @@ Currently, some models in this project are trained based on YOLO. However, since
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [UniMERNet](https://github.com/opendatalab/UniMERNet)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [TableStructureRec](https://github.com/RapidAI/TableStructureRec)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)


@@ -43,48 +43,122 @@
</div>
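# Changelog
- 2025/08/01 2.1.10 Released
- Fixed an issue in the `pipeline` backend where block overlap caused parsing results to deviate from expectations #3232
- 2025/07/30 2.1.9 Released
- Adapted to `transformers` 4.54.1
- 2025/07/28 2.1.8 Released
- Adapted to `sglang` 0.4.9.post5
- 2025/07/27 2.1.7 Released
- Adapted to `transformers` 4.54.0
- 2025/07/26 2.1.6 Released
- Fixed abnormal table output when the `vlm` backend parses some handwritten documents
- Fixed visualization box position drift when the document is rotated #3175
- 2025/07/24 2.1.5 Released
- Adapted to `sglang` 0.4.9 and upgraded the dockerfile base image to sglang 0.4.9.post3 accordingly
- 2025/07/23 2.1.4 Released
- Bug fixes
- Fixed excessive GPU memory consumption in the `MFR` step of the `pipeline` backend in some cases #2771
- Fixed inaccurate matching between `image`/`table` and `caption`/`footnote` in some cases #3129
- 2025/07/16 2.1.1 Released
- Bug fixes
- Fixed text block content loss that could occur in the `pipeline` backend in some cases #3005
- Fixed `sglang-client` requiring unnecessary packages such as `torch` #2968
- Updated `dockerfile` to fix incomplete parsed text caused by missing fonts on Linux #2915
- Usability updates
- Updated `compose.yaml` so that users can directly start the `sglang-server`, `mineru-api`, and `mineru-gradio` services
- Launched a new [online documentation site](https://opendatalab.github.io/MinerU/zh/) and simplified the readme for a better documentation experience
- 2025/07/05 2.1.0 Released
- This is the first major update of MinerU 2. It includes a large number of new features and improvements, covering performance optimizations, experience improvements, and bug fixes. Details are as follows:
- Performance optimizations:
- Greatly improved preprocessing speed for documents with certain resolutions (around 2000 pixels on the long side)
- Greatly improved post-processing speed when the `pipeline` backend batch-processes many documents with few pages (<10)
- Layout analysis in the `pipeline` backend is about 20% faster
- Experience improvements:
- Built-in, ready-to-use `fastapi service` and `gradio webui`; for usage details, see the [documentation](https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#apiwebuisglang-clientserver)
- Adapted to `sglang` `0.4.8`, greatly reducing the GPU memory requirement of the `vlm-sglang` backend; it can now run on graphics cards with as little as `8GB GPU memory` (Turing architecture or newer)
- Added `sglang` parameter passthrough to all commands, so the `sglang-engine` backend accepts all `sglang` parameters in the same way as `sglang-server`
- Supports feature extensions based on configuration files, including `custom formula delimiters`, `enabling heading classification`, and `customizing local model directories`; for usage details, see the [documentation](https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#mineru_1)
- New features:
- Updated the `pipeline` backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. [Details](https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html)
- Added limited support for vertical text in the `pipeline` backend
- 2025/09/05 2.2.0 Released
- Major updates
- In this version we focused on improving table parsing accuracy. By introducing a new [wired table recognition model](https://github.com/RapidAI/TableStructureRec) and a brand-new hybrid table structure parsing algorithm, we significantly improved the table recognition capability of the `pipeline` backend.
- We also added cross-page table merging in both the `pipeline` and `vlm` backends, further improving the completeness and accuracy of table parsing.
- Other updates
- The `pipeline` backend can now parse tables rotated by 270 degrees, so table parsing is supported in 0/90/270-degree orientations
- The `pipeline` backend added OCR support for Thai and Greek and updated the English OCR model to the latest version. English recognition accuracy improved by 11%, Thai recognition model accuracy is 82.68%, and Greek recognition model accuracy is 89.28% (by PP-OCRv5)
- Added a `bbox` field (mapped to the 0-1000 range) to the output `content_list.json`, so users can directly obtain the position of each content block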
<details>
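<summary>History Log</summary>
<details>
<summary>2025/08/01 2.1.10 Released</summary>
<ul>
<li>Fixed an issue in the <code>pipeline</code> backend where block overlap caused parsing results to deviate from expectations #3232</li>
</ul>
</details>
<details>
<summary>2025/07/30 2.1.9 Released</summary>
<ul>
<li>Adapted to <code>transformers</code> 4.54.1</li>
</ul>
</details>
<details>
<summary>2025/07/28 2.1.8 Released</summary>
<ul>
<li>Adapted to <code>sglang</code> 0.4.9.post5</li>
</ul>
</details>
<details>
<summary>2025/07/27 2.1.7 Released</summary>
<ul>
<li>Adapted to <code>transformers</code> 4.54.0</li>
</ul>
</details>
<details>
<summary>2025/07/26 2.1.6 Released</summary>
<ul>
<li>Fixed abnormal table output when the <code>vlm</code> backend parses some handwritten documents</li>
<li>Fixed visualization box position drift when the document is rotated #3175</li>
</ul>
</details>
<details>
<summary>2025/07/24 2.1.5 Released</summary>
<ul>
<li>Adapted to <code>sglang</code> 0.4.9 and upgraded the dockerfile base image to sglang 0.4.9.post3 accordingly</li>
</ul>
</details>
<details>
<summary>2025/07/23 2.1.4 Released</summary>
<ul>
<li><strong>Bug fixes</strong>
<ul>
<li>Fixed excessive GPU memory consumption in the <code>MFR</code> step of the <code>pipeline</code> backend in some cases #2771</li>
<li>Fixed inaccurate matching between <code>image</code>/<code>table</code> and <code>caption</code>/<code>footnote</code> in some cases #3129</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/07/16 2.1.1 Released</summary>
<ul>
<li><strong>Bug fixes</strong>
<ul>
<li>Fixed text block content loss that could occur in the <code>pipeline</code> backend in some cases #3005</li>
<li>Fixed <code>sglang-client</code> requiring unnecessary packages such as <code>torch</code> #2968</li>
<li>Updated <code>dockerfile</code> to fix incomplete parsed text caused by missing fonts on Linux #2915</li>
</ul>
</li>
<li><strong>Usability updates</strong>
<ul>
<li>Updated <code>compose.yaml</code> so that users can directly start the <code>sglang-server</code>, <code>mineru-api</code>, and <code>mineru-gradio</code> services</li>
<li>Launched a new <a href="https://opendatalab.github.io/MinerU/zh/">online documentation site</a> and simplified the readme for a better documentation experience</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/07/05 2.1.0 Released</summary>
<p>This is the first major update of MinerU 2. It includes a large number of new features and improvements, covering performance optimizations, experience improvements, and bug fixes. Details are as follows:</p>
<ul>
<li><strong>Performance optimizations:</strong>
<ul>
<li>Greatly improved preprocessing speed for documents with certain resolutions (around 2000 pixels on the long side)</li>
<li>Greatly improved post-processing speed when the <code>pipeline</code> backend batch-processes many documents with few pages (&lt;10)</li>
<li>Layout analysis in the <code>pipeline</code> backend is about 20% faster</li>
</ul>
</li>
<li><strong>Experience improvements:</strong>
<ul>
<li>Built-in, ready-to-use <code>fastapi service</code> and <code>gradio webui</code>; for usage details, see the <a href="https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#apiwebuisglang-clientserver">documentation</a></li>
<li>Adapted to <code>sglang</code> <code>0.4.8</code>, greatly reducing the GPU memory requirement of the <code>vlm-sglang</code> backend; it can now run on graphics cards with as little as <code>8GB GPU memory</code> (Turing architecture or newer)</li>
<li>Added <code>sglang</code> parameter passthrough to all commands, so the <code>sglang-engine</code> backend accepts all <code>sglang</code> parameters in the same way as <code>sglang-server</code></li>
<li>Supports feature extensions based on configuration files, including <code>custom formula delimiters</code>, <code>enabling heading classification</code>, and <code>customizing local model directories</code>; for usage details, see the <a href="https://opendatalab.github.io/MinerU/zh/usage/quick_usage/#mineru_1">documentation</a></li>
</ul>
</li>
<li><strong>New features:</strong>
<ul>
<li>Updated the <code>pipeline</code> backend with the PP-OCRv5 multilingual text recognition model, supporting text recognition in 37 languages such as French, Spanish, Portuguese, Russian, and Korean, with an average accuracy improvement of over 30%. <a href="https://paddlepaddle.github.io/PaddleOCR/latest/version3.x/algorithm/PP-OCRv5/PP-OCRv5_multi_languages.html">Details</a></li>
<li>Added limited support for vertical text in the <code>pipeline</code> backend</li>
</ul>
</li>
</ul>
</details>
<details>
<summary>2025/06/20 2.0.6 Released</summary>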
<ul>
@@ -584,6 +658,7 @@ mineru -p <input_path> -o <output_path>
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [UniMERNet](https://github.com/opendatalab/UniMERNet)
- [RapidTable](https://github.com/RapidAI/RapidTable)
- [TableStructureRec](https://github.com/RapidAI/TableStructureRec)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [PaddleOCR2Pytorch](https://github.com/frotms/PaddleOCR2Pytorch)
- [layoutreader](https://github.com/ppaanngggg/layoutreader)


@@ -416,7 +416,8 @@ Text levels are distinguished through the `text_level` field:
#### Common Fields
All content blocks include a `page_idx` field indicating the page number (starting from 0).
- All content blocks include a `page_idx` field indicating the page number (starting from 0).
- All content blocks include a `bbox` field representing the bounding box coordinates of the content block `[x0, y0, x1, y1]`, mapped to a range of 0-1000.
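
As a rough illustration of how the normalized `bbox` might be consumed, the sketch below maps a 0-1000 box back to page units. The function name `denormalize_bbox` and the 600 x 800 page size are assumptions made for this example and are not part of MinerU. Because the stored values are integers, the recovered coordinates are only approximate.

```python
from typing import Sequence, Tuple


def denormalize_bbox(
    bbox: Sequence[int], page_width: float, page_height: float
) -> Tuple[float, float, float, float]:
    """Map an [x0, y0, x1, y1] box in the 0-1000 range back to page units."""
    x0, y0, x1, y1 = bbox
    return (
        x0 * page_width / 1000,
        y0 * page_height / 1000,
        x1 * page_width / 1000,
        y1 * page_height / 1000,
    )


# Hypothetical page of 600 x 800 units; the bbox values are taken from the sample data.
print(denormalize_bbox([62, 480, 946, 904], 600, 800))
```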
#### Sample Data
@@ -425,31 +426,15 @@ All content blocks include a `page_idx` field indicating the page number (starti
{
"type": "text",
"text": "The response of flow duration curves to afforestation ",
"text_level": 1,
"text_level": 1,
"bbox": [
62,
480,
946,
904
],
"page_idx": 0
},
{
"type": "text",
"text": "Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx": 0
},
{
"type": "text",
"text": "Abstract ",
"text_level": 2,
"page_idx": 0
},
{
"type": "text",
"text": "The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 1050th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. 
This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"page_idx": 0
},
{
"type": "text",
"text": "1. Introduction ",
"text_level": 2,
"page_idx": 1
},
{
"type": "image",
"img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
@@ -457,6 +442,12 @@ All content blocks include a `page_idx` field indicating the page number (starti
"Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 19892000. "
],
"img_footnote": [],
"bbox": [
62,
480,
946,
904
],
"page_idx": 1
},
{
@@ -464,6 +455,12 @@ All content blocks include a `page_idx` field indicating the page number (starti
"img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
"text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
"text_format": "latex",
"bbox": [
62,
480,
946,
904
],
"page_idx": 2
},
{
@@ -476,6 +473,12 @@ All content blocks include a `page_idx` field indicating the page number (starti
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
],
"table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
"bbox": [
62,
480,
946,
904
],
"page_idx": 5
}
]
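
A minimal sketch of reading a list shaped like the sample above, assuming it has already been parsed from `content_list.json` into Python dictionaries. The helper name `blocks_with_bbox` and the two inline entries are hypothetical. Using `dict.get` is a defensive assumption so that an entry without a `bbox` is skipped instead of raising `KeyError`.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

# Two made-up entries in the same shape as the sample data.
SAMPLE: List[Dict[str, Any]] = json.loads(
    """
    [
      {"type": "text", "text": "Title", "text_level": 1,
       "bbox": [62, 480, 946, 904], "page_idx": 0},
      {"type": "text", "text": "Entry without a box", "page_idx": 0}
    ]
    """
)


def blocks_with_bbox(
    content_list: List[Dict[str, Any]],
) -> Iterator[Tuple[int, str, List[int]]]:
    """Yield (page_idx, type, bbox) for every entry that carries a bbox."""
    for block in content_list:
        bbox = block.get("bbox")
        if bbox:
            yield block["page_idx"], block["type"], bbox


for page_idx, block_type, bbox in blocks_with_bbox(SAMPLE):
    print(page_idx, block_type, bbox)
```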


@@ -13,7 +13,7 @@ Options:
-m, --method [auto|txt|ocr] Parsing method: auto (default), txt, ocr (pipeline backend only)
-b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
Parsing backend (default: pipeline)
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari]
Specify document language (improves OCR accuracy, pipeline backend only)
-u, --url TEXT Service address when using sglang-client
-s, --start INTEGER Starting page number for parsing (0-based)
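
The new `th` and `el` codes can be exercised from a script. The sketch below only assembles an argument list from the `-p`, `-o`, and `-l` options and checks the language code against the choices listed above; the helper name `build_mineru_command` and the file names are hypothetical, and the command is not executed here.

```python
from typing import List

# Language codes copied from the `--lang` choices after this change.
SUPPORTED_LANGS = (
    "ch", "ch_server", "ch_lite", "en", "korean", "japan", "chinese_cht",
    "ta", "te", "ka", "th", "el", "latin", "arabic", "east_slavic",
    "cyrillic", "devanagari",
)


def build_mineru_command(input_path: str, output_path: str, lang: str) -> List[str]:
    """Return an argv list for the mineru CLI, rejecting unknown language codes."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return ["mineru", "-p", input_path, "-o", output_path, "-l", lang]


# Hypothetical input and output paths.
print(build_mineru_command("thai_doc.pdf", "out", "th"))
```

The resulting list could be handed to `subprocess.run` on a machine where `mineru` is installed.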


@@ -416,7 +416,8 @@ inference_result: list[PageInferenceResults] = []
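#### Common Fields
All content blocks include a `page_idx` field indicating the page number (starting from 0).
- All content blocks include a `page_idx` field indicating the page number (starting from 0).
- All content blocks include a `bbox` field representing the bounding box coordinates of the content block `[x0, y0, x1, y1]`, mapped to a range of 0-1000.
#### Sample Data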
@@ -425,31 +426,15 @@ inference_result: list[PageInferenceResults] = []
{
"type": "text",
"text": "The response of flow duration curves to afforestation ",
"text_level": 1,
"text_level": 1,
"bbox": [
62,
480,
946,
904
],
"page_idx": 0
},
{
"type": "text",
"text": "Received 1 October 2003; revised 22 December 2004; accepted 3 January 2005 ",
"page_idx": 0
},
{
"type": "text",
"text": "Abstract ",
"text_level": 2,
"page_idx": 0
},
{
"type": "text",
"text": "The hydrologic effect of replacing pasture or other short crops with trees is reasonably well understood on a mean annual basis. The impact on flow regime, as described by the annual flow duration curve (FDC) is less certain. A method to assess the impact of plantation establishment on FDCs was developed. The starting point for the analyses was the assumption that rainfall and vegetation age are the principal drivers of evapotranspiration. A key objective was to remove the variability in the rainfall signal, leaving changes in streamflow solely attributable to the evapotranspiration of the plantation. A method was developed to (1) fit a model to the observed annual time series of FDC percentiles; i.e. 10th percentile for each year of record with annual rainfall and plantation age as parameters, (2) replace the annual rainfall variation with the long term mean to obtain climate adjusted FDCs, and (3) quantify changes in FDC percentiles as plantations age. Data from 10 catchments from Australia, South Africa and New Zealand were used. The model was able to represent flow variation for the majority of percentiles at eight of the 10 catchments, particularly for the 1050th percentiles. The adjusted FDCs revealed variable patterns in flow reductions with two types of responses (groups) being identified. Group 1 catchments show a substantial increase in the number of zero flow days, with low flows being more affected than high flows. Group 2 catchments show a more uniform reduction in flows across all percentiles. The differences may be partly explained by storage characteristics. The modelled flow reductions were in accord with published results of paired catchment experiments. An additional analysis was performed to characterise the impact of afforestation on the number of zero flow days $( N _ { \\mathrm { z e r o } } )$ for the catchments in group 1. 
This model performed particularly well, and when adjusted for climate, indicated a significant increase in $N _ { \\mathrm { z e r o } }$ . The zero flow day method could be used to determine change in the occurrence of any given flow in response to afforestation. The methods used in this study proved satisfactory in removing the rainfall variability, and have added useful insight into the hydrologic impacts of plantation establishment. This approach provides a methodology for understanding catchment response to afforestation, where paired catchment data is not available. ",
"page_idx": 0
},
{
"type": "text",
"text": "1. Introduction ",
"text_level": 2,
"page_idx": 1
},
{
"type": "image",
"img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
@@ -457,6 +442,12 @@ inference_result: list[PageInferenceResults] = []
"Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 19892000. "
],
"img_footnote": [],
"bbox": [
62,
480,
946,
904
],
"page_idx": 1
},
{
@@ -464,6 +455,12 @@ inference_result: list[PageInferenceResults] = []
"img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
"text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
"text_format": "latex",
"bbox": [
62,
480,
946,
904
],
"page_idx": 2
},
{
@@ -476,6 +473,12 @@ inference_result: list[PageInferenceResults] = []
"indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
],
"table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
"bbox": [
62,
480,
946,
904
],
"page_idx": 5
}
]


@@ -13,7 +13,7 @@ Options:
-m, --method [auto|txt|ocr] Parsing method: auto (default), txt, ocr (pipeline backend only)
-b, --backend [pipeline|vlm-transformers|vlm-sglang-engine|vlm-sglang-client]
Parsing backend (default: pipeline)
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|latin|arabic|east_slavic|cyrillic|devanagari]
-l, --lang [ch|ch_server|ch_lite|en|korean|japan|chinese_cht|ta|te|ka|th|el|latin|arabic|east_slavic|cyrillic|devanagari]
Specify the document language (can improve OCR accuracy, pipeline backend only)
-u, --url TEXT Service address, required when using sglang-client
-s, --start INTEGER Starting page number for parsing (0-based)


@@ -188,7 +188,7 @@ def merge_para_with_text(para_block):
return para_text
def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
def make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size):
para_type = para_block['type']
para_content = {}
if para_type in [BlockType.TEXT, BlockType.LIST, BlockType.INDEX]:
@@ -245,6 +245,17 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
if block['type'] == BlockType.TABLE_FOOTNOTE:
para_content[BlockType.TABLE_FOOTNOTE].append(merge_para_with_text(block))
page_weight, page_height = page_size
para_bbox = para_block.get('bbox')
if para_bbox:
x0, y0, x1, y1 = para_bbox
para_content['bbox'] = [
int(x0 * 1000 / page_weight),
int(y0 * 1000 / page_height),
int(x1 * 1000 / page_weight),
int(y1 * 1000 / page_height),
]
para_content['page_idx'] = page_idx
return para_content
@@ -258,6 +269,7 @@ def union_make(pdf_info_dict: list,
for page_info in pdf_info_dict:
paras_of_layout = page_info.get('para_blocks')
page_idx = page_info.get('page_idx')
page_size = page_info.get('page_size')
if not paras_of_layout:
continue
if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
@@ -265,7 +277,7 @@ def union_make(pdf_info_dict: list,
output_content.extend(page_markdown)
elif make_mode == MakeMode.CONTENT_LIST:
for para_block in paras_of_layout:
para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx)
para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size)
if para_content:
output_content.append(para_content)
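
The normalization added in this file can be sketched in isolation as follows. The standalone function `normalize_bbox` is hypothetical, and it spells the width variable `page_width` where the diff uses `page_weight`. As in the diff, `int()` truncates toward zero rather than rounding.

```python
from typing import List, Sequence


def normalize_bbox(
    bbox: Sequence[float], page_width: float, page_height: float
) -> List[int]:
    """Scale an [x0, y0, x1, y1] box in page units to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(x0 * 1000 / page_width),
        int(y0 * 1000 / page_height),
        int(x1 * 1000 / page_width),
        int(y1 * 1000 / page_height),
    ]


# Hypothetical 500 x 1000 page.
print(normalize_bbox([50, 100, 300, 400], 500, 1000))  # -> [100, 100, 600, 400]
```

Note that in the diff `page_size` comes from `page_info.get('page_size')` and is unpacked before the `bbox` check, so a page without a `page_size` entry would presumably fail at the unpacking step; whether that can happen in practice is not visible from this hunk.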


@@ -4,7 +4,7 @@ from typing import Iterable, List, Optional, Union
import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoTokenizer, BitsAndBytesConfig
from transformers import AutoTokenizer, BitsAndBytesConfig, __version__
from ...model.vlm_hf_model import Mineru2QwenForCausalLM
from ...model.vlm_hf_model.image_processing_mineru2 import process_images
@@ -66,7 +66,11 @@ class HuggingfacePredictor(BasePredictor):
bnb_4bit_quant_type="nf4",
)
else:
kwargs["torch_dtype"] = torch_dtype
from packaging import version
if version.parse(__version__) >= version.parse("4.56.0"):
kwargs["dtype"] = torch_dtype
else:
kwargs["torch_dtype"] = torch_dtype
if use_flash_attn:
kwargs["attn_implementation"] = "flash_attention_2"
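
The version gate above can be illustrated without importing `transformers`. The sketch below is a simplified stand-in, not the code from this change: it compares only the numeric release segment with a tuple comparison instead of using `packaging.version`, so a string such as `4.56.0.dev0` is treated like `4.56.0`, whereas `packaging` would order the dev release before it. The helper names are hypothetical.

```python
import re
from typing import Tuple


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    parts = tuple(int(part) for part in match.group(0).split("."))
    return parts + (0,) * max(0, 3 - len(parts))


def dtype_kwarg_name(transformers_version: str) -> str:
    """Pick the keyword used to pass the dtype, mirroring the 4.56.0 threshold in the diff."""
    if release_tuple(transformers_version) >= (4, 56, 0):
        return "dtype"
    return "torch_dtype"


for candidate in ("4.54.1", "4.56.0", "4.57.1"):
    print(candidate, dtype_kwarg_name(candidate))
```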


@@ -1,8 +1,9 @@
import os
import time
from loguru import logger
import numpy as np
import cv2
from mineru.utils.config_reader import get_llm_aided_config
from mineru.utils.config_reader import get_llm_aided_config, get_table_enable
from mineru.utils.cut_image import cut_image_and_table
from mineru.utils.enum_class import ContentType
from mineru.utils.hash_utils import str_md5
@@ -94,7 +95,9 @@ def result_to_middle_json(token_list, images_list, pdf_doc, image_writer):
middle_json["pdf_info"].append(page_info)
"""Cross-page table merging"""
merge_table(middle_json["pdf_info"])
table_enable = get_table_enable(os.getenv('MINERU_VLM_TABLE_ENABLE', 'True').lower() == 'true')
if table_enable:
merge_table(middle_json["pdf_info"])
"""LLM-aided heading level optimization"""
if heading_level_import_success:
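
The environment-variable part of the table-merge gate can be sketched on its own. The helper `env_flag` is hypothetical and models only the `os.getenv(...).lower() == 'true'` expression; what `get_table_enable` does with the resulting value is not shown in this diff and is not modeled here.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Return True only if the variable, or its default, spells 'true' in any letter case."""
    return os.getenv(name, str(default)).lower() == "true"


# Unset: falls back to the default, which is 'True' in the diff.
os.environ.pop("MINERU_VLM_TABLE_ENABLE", None)
print(env_flag("MINERU_VLM_TABLE_ENABLE"))

# Any other spelling, including '1' or 'yes', counts as disabled.
os.environ["MINERU_VLM_TABLE_ENABLE"] = "1"
print(env_flag("MINERU_VLM_TABLE_ENABLE"))
```

With this comparison, only the string `true` enables the feature, so users who set the variable to `1` or `yes` would switch merging off.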


@@ -125,7 +125,7 @@ def mk_blocks_to_markdown(para_blocks, make_mode, formula_enable, table_enable,
def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
def make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size):
para_type = para_block['type']
para_content = {}
if para_type in [BlockType.TEXT, BlockType.LIST, BlockType.INDEX]:
@@ -179,6 +179,17 @@ def make_blocks_to_content_list(para_block, img_buket_path, page_idx):
if block['type'] == BlockType.TABLE_FOOTNOTE:
para_content[BlockType.TABLE_FOOTNOTE].append(merge_para_with_text(block))
page_weight, page_height = page_size
para_bbox = para_block.get('bbox')
if para_bbox:
x0, y0, x1, y1 = para_bbox
para_content['bbox'] = [
int(x0 * 1000 / page_weight),
int(y0 * 1000 / page_height),
int(x1 * 1000 / page_weight),
int(y1 * 1000 / page_height),
]
para_content['page_idx'] = page_idx
return para_content
@@ -195,6 +206,7 @@ def union_make(pdf_info_dict: list,
for page_info in pdf_info_dict:
paras_of_layout = page_info.get('para_blocks')
page_idx = page_info.get('page_idx')
page_size = page_info.get('page_size')
if not paras_of_layout:
continue
if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:
@@ -202,7 +214,7 @@ def union_make(pdf_info_dict: list,
output_content.extend(page_markdown)
elif make_mode == MakeMode.CONTENT_LIST:
for para_block in paras_of_layout:
para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx)
para_content = make_blocks_to_content_list(para_block, img_buket_path, page_idx, page_size)
output_content.append(para_content)
if make_mode in [MakeMode.MM_MD, MakeMode.NLP_MD]:


@@ -62,7 +62,7 @@ from .common import do_parse, read_fn, pdf_suffixes, image_suffixes
'-l',
'--lang',
'lang',
type=click.Choice(['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka',
type=click.Choice(['ch', 'ch_server', 'ch_lite', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'th', 'el',
'latin', 'arabic', 'east_slavic', 'cyrillic', 'devanagari']),
help="""
Input the languages in the pdf (if known) to improve OCR accuracy. Optional.