Merge pull request #4724 from opendatalab/dev

Dev
Merge pull request #4723 from myhloli/dev
2026-04-02 22:18:36 +07:00 · 2026-04-02 16:36:39 +08:00 · 2026-04-02 16:35:55 +08:00 · 2026-04-02 16:29:53 +08:00 · 2026-04-01 21:29:00 +08:00 · 2026-04-01 13:20:54 +00:00
6 changed files with 79 additions and 14 deletions
--- a/README.md
+++ b/README.md
@@ -43,6 +43,39 @@

 </div>

+
+<details>
+<summary>MinerU — High-accuracy document parsing engine for LLM · RAG · Agent workflows</summary>
+Converts PDF · Word · PPT · Images · Web pages into structured Markdown / JSON · VLM+OCR dual engine · 109 languages <br>
+MCP Server · LangChain / Dify / FastGPT native integration · 10+ domestic AI chip support
+
+**🔍 Core Parsing Capabilities**
+
+- Formulas → LaTeX · Tables → HTML, accurate layout reconstruction
+- Supports scanned docs, handwriting, multi-column layouts, cross-page table merging
+- Output follows human reading order with automatic header/footer removal
+- VLM + OCR dual engine, 109-language OCR recognition
+
+**🔌 Integration**
+
+| Use Case | Solution |
+|----------|----------|
+| AI Coding Tools | MCP Server — Cursor · Claude Desktop · Windsurf |
+| RAG Frameworks | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
+| Development | Python / Go / TypeScript SDK · CLI · REST API · Docker |
+| No-Code | mineru.net online · Gradio WebUI · Desktop client |
+
+**🖥️ Deployment (Private · Fully Offline)**
+
+| Inference Backend | Best For |
+|------------------|---------|
+| pipeline         | Fast & stable, no hallucination, runs on CPU or GPU |
+| vlm-engine       | High accuracy, supports vLLM / LMDeploy / mlx ecosystem |
+| hybrid-engine    | High accuracy, native text extraction, low hallucination |
+
+Domestic AI chips: Ascend · Cambricon · Enflame · MetaX · Moore Threads · Kunlunxin · Iluvatar · Hygon · Biren · T-Head
+</details>
+
 # Changelog

 - 2026/03/29 3.0.0 Released
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -43,6 +43,38 @@

 </div>

+<details>
+<summary>MinerU — 专为 LLM · RAG · Agent 场景构建的高精度文档解析引擎 </summary>
+将 PDF · Word · PPT · 图片 · 网页转为结构化 Markdown / JSON · VLM+OCR 双引擎 · 109 种语言 <br>
+MCP Server · LangChain / Dify / FastGPT 原生集成 · 10+ 国产算力适配 <br>
+
+**🔍 核心解析能力**
+- 公式 → LaTeX · 表格 → HTML，精准还原复杂版面
+- 支持扫描件、手写体、多栏布局、跨页表格合并
+- 输出符合人类阅读顺序，自动去除页眉页脚
+- VLM + OCR 双引擎，支持 109 种语言识别
+
+**🔌 接入方式**
+
+| 场景 | 方案 |
+|------|------|
+| AI 编程工具 | MCP Server — Cursor · Claude Desktop · Windsurf |
+| RAG 框架 | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
+| 开发集成 | Python / Go / TypeScript SDK · CLI · REST API · Docker |
+| 零代码 | mineru.net 在线版 · Gradio WebUI · 桌面客户端 |
+
+**🖥️ 部署生态（支持私有化 · 完全离线）**
+
+| 推理后端         | 适用场景                        |
+|--------------|-----------------------------|
+| pipeline     | 快速稳定，无幻觉，CPU / GPU 均可运行     |
+| vlm-engine   | 高精度，支持 vLLM / LMdeploy / mlx 生态 |
+| hybrid-engine| 高精度，原生文本提取，低幻觉              |
+
+国产算力：昇腾 · 寒武纪 · 燧原 · 沐曦 · 摩尔线程 · 昆仑芯 · 天数智芯 · 瀚博 · 太初元碁 · 海光 · 平头哥
+
+</details>
+
 # 更新记录

 - 2026/03/29 3.0.0 发布
--- a/docs/en/usage/quick_usage.md
+++ b/docs/en/usage/quick_usage.md
@@ -43,12 +43,12 @@ If you need to adjust parsing options through custom parameters, you can also ch
  >- API outputs are controlled by the server and written to `./output` by default
  >- Uploads currently support `PDF`, image, and `DOCX` files
  >
-  >`POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
-  >When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
-  >Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
-  >Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
-  >Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
-  >Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
+  >- `POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
+  >- When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
+  >- Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
+  >- Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
+  >- Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
+  >- Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
  >
  >Asynchronous task submission example:
  >```bash
--- a/docs/zh/usage/quick_usage.md
+++ b/docs/zh/usage/quick_usage.md
@@ -43,12 +43,12 @@ mineru -p <input_path> -o <output_path>
  >- API 输出目录由服务端固定控制，默认写入 `./output`
  >- 上传文件当前支持 `PDF`、图片与 `DOCX`
  >
-  >`POST /tasks` 会立即返回 `task_id`；`POST /file_parse` 会在内部提交到同一个任务管理器，等待任务完成后同步返回最终结果。
-  >当任务处于排队状态时，任务提交结果和状态查询结果中可能会返回 `queued_ahead` 字段，用于表示前方排队任务数。
-  >任务为单进程、进程内状态实现，服务重启、`--reload` 热重载或多进程部署后不保证仍可查询历史任务状态。
-  >默认任务完成或失败后保留 24 小时，随后自动清理任务状态和输出目录；清理后访问任务状态或结果会返回 `404`。
-  >可通过环境变量 `MINERU_API_TASK_RETENTION_SECONDS` 和 `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` 调整保留时长与清理轮询间隔。
-  >可通过 `--enable-vlm-preload true` 在服务启动阶段预热本地 VLM 模型，避免首次 VLM 或 hybrid 请求时再初始化。
+  >- `POST /tasks` 会立即返回 `task_id`；`POST /file_parse` 会在内部提交到同一个任务管理器，等待任务完成后同步返回最终结果。
+  >- 当任务处于排队状态时，任务提交结果和状态查询结果中可能会返回 `queued_ahead` 字段，用于表示前方排队任务数。
+  >- 任务为单进程、进程内状态实现，服务重启、`--reload` 热重载或多进程部署后不保证仍可查询历史任务状态。
+  >- 默认任务完成或失败后保留 24 小时，随后自动清理任务状态和输出目录；清理后访问任务状态或结果会返回 `404`。
+  >- 可通过环境变量 `MINERU_API_TASK_RETENTION_SECONDS` 和 `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` 调整保留时长与清理轮询间隔。
+  >- 可通过 `--enable-vlm-preload true` 在服务启动阶段预热本地 VLM 模型，避免首次 VLM 或 hybrid 请求时再初始化。
  >
  >异步任务提交示例：
  >```bash
--- a/mineru/backend/office/office_middle_json_mkcontent.py
+++ b/mineru/backend/office/office_middle_json_mkcontent.py
@@ -701,7 +701,7 @@ def mk_blocks_to_markdown(para_blocks, make_mode, img_buket_path='', page_idx=No
            continue
        else:
            # page_markdown.append(para_text.strip())
-            page_markdown.append(para_text)
+            page_markdown.append(para_text.strip('\r\n'))

    return page_markdown

--- a/mineru/version.py
+++ b/mineru/version.py
@@ -1 +1 @@
-__version__ = "3.0.5"
+__version__ = "3.0.7"
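
A note on the `mk_blocks_to_markdown` hunk above: `str.strip()` with no argument removes all leading and trailing whitespace, while `strip('\r\n')` removes only carriage returns and line feeds at the two ends. The snippet below is a minimal sketch of that difference; the helper name `trim_line_breaks` and the sample string are illustrative and not part of the MinerU code base.

```python
def trim_line_breaks(para_text: str) -> str:
    """Drop leading/trailing CR and LF only; spaces and tabs are kept."""
    return para_text.strip('\r\n')


# Hypothetical paragraph text with indentation and stray line breaks.
sample = "\r\n    indented line  \n\n"

print(repr(sample.strip()))            # 'indented line'
print(repr(trim_line_breaks(sample)))  # '    indented line  '

# Line breaks inside the text are untouched; only the ends are trimmed.
print(repr(trim_line_breaks("first\nsecond\n")))  # 'first\nsecond'
```

Keeping the surrounding spaces matters when the paragraph carries meaningful indentation; that is presumably why the commented-out `para_text.strip()` variant was not used, though the commit message itself only says it strips newline characters.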
Author	SHA1	Message	Date
Xiaomeng Zhao	d59a69236c	Merge pull request #4724 from opendatalab/dev Dev	2026-04-02 16:36:39 +08:00
Xiaomeng Zhao	904394497e	Merge pull request #4723 from myhloli/dev docs: add detailed description of MinerU capabilities and integration…	2026-04-02 16:35:55 +08:00
myhloli	a25798b43a	docs: add detailed description of MinerU capabilities and integration in README files	2026-04-02 16:29:53 +08:00
Xiaomeng Zhao	d7011f42e2	Merge pull request #4719 from opendatalab/master master->Dev	2026-04-01 21:29:00 +08:00
myhloli	ede8d95bf1	Update version.py with new version	2026-04-01 13:20:54 +00:00
Xiaomeng Zhao	54b68d4bf1	Merge pull request #4718 from opendatalab/dev 3.0.7	2026-04-01 21:14:24 +08:00
Xiaomeng Zhao	1b478c24cf	Merge pull request #4717 from myhloli/dev fix: strip newline characters from paragraph text in office_middle_json_mkcontent	2026-04-01 21:13:47 +08:00
myhloli	39b62cc76a	fix: strip newline characters from paragraph text in office_middle_json_mkcontent	2026-04-01 21:12:14 +08:00
Xiaomeng Zhao	13465ff43f	Merge pull request #4716 from opendatalab/master master->dev	2026-04-01 20:53:24 +08:00
myhloli	d18b7df766	Update version.py with new version	2026-04-01 12:52:13 +00:00
Xiaomeng Zhao	a97753c86f	Merge pull request #4715 from myhloli/dev fix: correct formatting of usage instructions in quick_usage.md	2026-04-01 20:51:04 +08:00
myhloli	a3b65470cf	fix: correct formatting of usage instructions in quick_usage.md	2026-04-01 20:50:07 +08:00
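
The `quick_usage.md` notes touched by these commits describe tasks that are tracked in-process, retained for 24 hours after completion or failure, and answered with `404` once cleaned up. The sketch below illustrates that lifecycle under stated assumptions: `TaskStore`, its method names, and the status strings are hypothetical and are not MinerU's implementation; only the environment variable name and the 24-hour default come from the documentation.

```python
import os
import time

# Documented variable name; 24 hours is the documented default.
RETENTION_SECONDS = int(
    os.environ.get("MINERU_API_TASK_RETENTION_SECONDS", 24 * 60 * 60)
)


class TaskStore:
    """Hypothetical in-process task registry (illustration only)."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        # task_id -> {"status": str, "finished_at": float or None}
        self._tasks = {}

    def submit(self, task_id):
        self._tasks[task_id] = {"status": "pending", "finished_at": None}

    def finish(self, task_id, ok=True, now=None):
        task = self._tasks[task_id]
        task["status"] = "completed" if ok else "failed"
        task["finished_at"] = time.time() if now is None else now

    def cleanup(self, now=None):
        """Remove finished tasks whose retention window has elapsed."""
        now = time.time() if now is None else now
        expired = [
            tid for tid, task in self._tasks.items()
            if task["finished_at"] is not None
            and now - task["finished_at"] >= self.retention_seconds
        ]
        for tid in expired:
            del self._tasks[tid]
        return expired

    def status_code(self, task_id):
        """HTTP-style answer: 200 while tracked, 404 once cleaned up or unknown."""
        return 200 if task_id in self._tasks else 404


store = TaskStore(retention_seconds=60)
store.submit("t1")
store.finish("t1", now=1000.0)
store.cleanup(now=1030.0)        # inside the window, nothing removed
print(store.status_code("t1"))   # 200
store.cleanup(now=1060.0)        # window elapsed, task removed
print(store.status_code("t1"))   # 404
```

Because the registry here is a plain dictionary in one process, its contents vanish on restart and are invisible to other worker processes, which mirrors the documented caveat about service restarts, `--reload`, and multi-process deployments.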