Compare commits

12 Commits

| Author | SHA1 | Message | Date |
|--------|------|---------|------|
| Xiaomeng Zhao | d59a69236c | Merge pull request #4724 from opendatalab/dev<br>Dev | 2026-04-02 16:36:39 +08:00 |
| Xiaomeng Zhao | 904394497e | Merge pull request #4723 from myhloli/dev<br>docs: add detailed description of MinerU capabilities and integration… | 2026-04-02 16:35:55 +08:00 |
| myhloli | a25798b43a | docs: add detailed description of MinerU capabilities and integration in README files | 2026-04-02 16:29:53 +08:00 |
| Xiaomeng Zhao | d7011f42e2 | Merge pull request #4719 from opendatalab/master<br>master->Dev | 2026-04-01 21:29:00 +08:00 |
| myhloli | ede8d95bf1 | Update version.py with new version | 2026-04-01 13:20:54 +00:00 |
| Xiaomeng Zhao | 54b68d4bf1 | Merge pull request #4718 from opendatalab/dev<br>3.0.7 | 2026-04-01 21:14:24 +08:00 |
| Xiaomeng Zhao | 1b478c24cf | Merge pull request #4717 from myhloli/dev<br>fix: strip newline characters from paragraph text in office_middle_json_mkcontent | 2026-04-01 21:13:47 +08:00 |
| myhloli | 39b62cc76a | fix: strip newline characters from paragraph text in office_middle_json_mkcontent | 2026-04-01 21:12:14 +08:00 |
| Xiaomeng Zhao | 13465ff43f | Merge pull request #4716 from opendatalab/master<br>master->dev | 2026-04-01 20:53:24 +08:00 |
| myhloli | d18b7df766 | Update version.py with new version | 2026-04-01 12:52:13 +00:00 |
| Xiaomeng Zhao | a97753c86f | Merge pull request #4715 from myhloli/dev<br>fix: correct formatting of usage instructions in quick_usage.md | 2026-04-01 20:51:04 +08:00 |
| myhloli | a3b65470cf | fix: correct formatting of usage instructions in quick_usage.md | 2026-04-01 20:50:07 +08:00 |
6 changed files with 79 additions and 14 deletions

@@ -43,6 +43,39 @@
</div>
<details>
<summary>MinerU — High-accuracy document parsing engine for LLM · RAG · Agent workflows</summary>
Converts PDF · Word · PPT · Images · Web pages into structured Markdown / JSON · VLM+OCR dual engine · 109 languages <br>
MCP Server · LangChain / Dify / FastGPT native integration · 10+ domestic AI chip support
**🔍 Core Parsing Capabilities**
- Formulas → LaTeX · Tables → HTML, accurate layout reconstruction
- Supports scanned docs, handwriting, multi-column layouts, cross-page table merging
- Output follows human reading order with automatic header/footer removal
- VLM + OCR dual engine, 109-language OCR recognition
**🔌 Integration**
| Use Case | Solution |
|----------|----------|
| AI Coding Tools | MCP Server — Cursor · Claude Desktop · Windsurf |
| RAG Frameworks | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
| Development | Python / Go / TypeScript SDK · CLI · REST API · Docker |
| No-Code | mineru.net online · Gradio WebUI · Desktop client |
**🖥️ Deployment (Private · Fully Offline)**
| Inference Backend | Best For |
|------------------|---------|
| pipeline | Fast & stable, no hallucination, runs on CPU or GPU |
| vlm-engine | High accuracy, supports vLLM / LMDeploy / mlx ecosystem |
| hybrid-engine | High accuracy, native text extraction, low hallucination |
Domestic AI chips: Ascend · Cambricon · Enflame · MetaX · Moore Threads · Kunlunxin · Iluvatar · Hygon · Biren · T-Head
</details>
# Changelog
- 2026/03/29 3.0.0 Released

@@ -43,6 +43,38 @@
</div>
<details>
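<summary>MinerU: a high-accuracy document parsing engine built for LLM · RAG · Agent scenarios</summary>
Converts PDF · Word · PPT · images · web pages into structured Markdown / JSON · VLM+OCR dual engine · 109 languages <br>
MCP Server · LangChain / Dify / FastGPT native integration · 10+ domestic AI chips supported <br>
**🔍 Core Parsing Capabilities**
- Formulas → LaTeX · Tables → HTML, with accurate reconstruction of complex layouts
- Supports scanned documents, handwriting, multi-column layouts, and cross-page table merging
- Output follows human reading order, with headers and footers removed automatically
- VLM + OCR dual engine, with recognition support for 109 languages
**🔌 Integration**
| Scenario | Solution |
|------|------|
| AI coding tools | MCP Server: Cursor · Claude Desktop · Windsurf |
| RAG frameworks | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
| Development integration | Python / Go / TypeScript SDK · CLI · REST API · Docker |
| No-code | mineru.net online version · Gradio WebUI · Desktop client |
**🖥️ Deployment Ecosystem (Private Deployment · Fully Offline)**
| Inference backend | Best for |
|--------------|-----------------------------|
| pipeline | Fast and stable, no hallucination, runs on CPU or GPU |
| vlm-engine | High accuracy, supports the vLLM / LMDeploy / mlx ecosystem |
| hybrid-engine| High accuracy, native text extraction, low hallucination |
Domestic AI chips: Ascend · Cambricon · Enflame · MetaX · Moore Threads · Kunlunxin · Iluvatar · Vastai · Tecorigin · Hygon · T-Head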
</details>
# Changelog
- 2026/03/29 3.0.0 Released

@@ -43,12 +43,12 @@ If you need to adjust parsing options through custom parameters, you can also ch
>- API outputs are controlled by the server and written to `./output` by default
>- Uploads currently support `PDF`, image, and `DOCX` files
>
->`POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
->When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
->Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
->Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
->Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
->Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
+>- `POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
+>- When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
+>- Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
+>- Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
+>- Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
+>- Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
>
>Asynchronous task submission example:
>```bash

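The task lifecycle described in this hunk (an immediate `task_id`, an optional `queued_ahead`, and `404` after cleanup) suggests a simple client-side polling loop. The following is a minimal sketch, not the project's own client: the `fetch_status` callable, the `status` field, and the `completed` / `failed` state names are assumptions made for illustration, since the diff does not show the status endpoint's path or response schema. A real client would implement `fetch_status` with an HTTP call and return `None` on `404`.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical terminal state names; the diff only says tasks end up
# "completed or failed" and does not show the actual field values.
TERMINAL_STATES = {"completed", "failed"}

StatusFetcher = Callable[[str], Optional[Dict[str, Any]]]


def wait_for_task(
    fetch_status: StatusFetcher,
    task_id: str,
    poll_interval: float = 1.0,
    timeout: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll a task until it reaches a terminal state.

    fetch_status is a hypothetical callable that returns the task-status
    payload as a dict, or None when the server answers 404 (unknown task,
    or a task whose state was already cleaned up).
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload is None:
            raise LookupError(f"task {task_id} not found (expired or never existed)")
        if payload.get("status") in TERMINAL_STATES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still running after {timeout}s")
        # queued_ahead may be absent; it is only reported while the task waits.
        queued_ahead = payload.get("queued_ahead")
        if queued_ahead is not None:
            print(f"{queued_ahead} task(s) ahead in the queue")
        sleep(poll_interval)


# Demo with a fake fetcher standing in for the HTTP call.
_fake_responses = iter([
    {"status": "pending", "queued_ahead": 2},
    {"status": "processing"},
    {"status": "completed"},
])
final = wait_for_task(lambda _id: next(_fake_responses), "demo-task", sleep=lambda _s: None)
print(final["status"])
```

Passing the fetcher and the sleep function as parameters keeps the loop testable without a running `mineru-api` instance. Because task state is in-process only, a `LookupError` here can also mean the service restarted, not just that the retention window elapsed.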
@@ -43,12 +43,12 @@ mineru -p <input_path> -o <output_path>
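>- The API output directory is fixed by the server and defaults to `./output`
>- Uploads currently support `PDF`, image, and `DOCX` files
>
->`POST /tasks` returns a `task_id` immediately; `POST /file_parse` submits to the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
->While a task is queued, both the submission response and the status-query response may include a `queued_ahead` field indicating the number of tasks queued ahead of it.
->Task state is kept in-process within a single process; after a service restart, a `--reload` hot reload, or in a multi-process deployment, historical task status is not guaranteed to remain queryable.
->By default, completed or failed tasks are retained for 24 hours, after which their task state and output directory are cleaned up automatically; after cleanup, requests for task status or results return `404`.
->Use the environment variables `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust the retention period and the cleanup polling interval.
->Use `--enable-vlm-preload true` to warm up the local VLM model at service startup, so it is not initialized on the first VLM or hybrid request.
+>- `POST /tasks` returns a `task_id` immediately; `POST /file_parse` submits to the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
+>- While a task is queued, both the submission response and the status-query response may include a `queued_ahead` field indicating the number of tasks queued ahead of it.
+>- Task state is kept in-process within a single process; after a service restart, a `--reload` hot reload, or in a multi-process deployment, historical task status is not guaranteed to remain queryable.
+>- By default, completed or failed tasks are retained for 24 hours, after which their task state and output directory are cleaned up automatically; after cleanup, requests for task status or results return `404`.
+>- Use the environment variables `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust the retention period and the cleanup polling interval.
+>- Use `--enable-vlm-preload true` to warm up the local VLM model at service startup, so it is not initialized on the first VLM or hybrid request.
>
>Asynchronous task submission example: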
>```bash

@@ -701,7 +701,7 @@ def mk_blocks_to_markdown(para_blocks, make_mode, img_buket_path='', page_idx=No
                continue
            else:
                # page_markdown.append(para_text.strip())
-                page_markdown.append(para_text)
+                page_markdown.append(para_text.strip('\r\n'))
    return page_markdown
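
The change in this hunk relies on how `str.strip` treats its argument. As a quick illustration (the sample string is made up, not taken from the project), `strip()` with no argument removes all leading and trailing whitespace, while `strip('\r\n')` removes only carriage returns and line feeds at the ends and leaves spaces, tabs, and interior newlines alone:

```python
# Made-up paragraph text: leading spaces, an interior newline, trailing CRLF.
para_text = "  first line\nsecond line  \r\n"

# No argument: every leading/trailing whitespace character is removed.
assert para_text.strip() == "first line\nsecond line"

# '\r\n' is treated as a set of characters, so any run of '\r' and '\n'
# at either end is removed; spaces and the interior newline survive.
assert para_text.strip('\r\n') == "  first line\nsecond line  "

# Order and repetition do not matter because the argument is a character set.
assert "\n\r\ntext\r\r\n".strip('\r\n') == "text"
```

This is consistent with the commit message ("strip newline characters from paragraph text") and with the commented-out `para_text.strip()` call left just above the changed line, which would also have removed spaces and tabs. The diff itself does not state why the narrower form was chosen.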

@@ -1 +1 @@
-__version__ = "3.0.5"
+__version__ = "3.0.7"