Mirror of https://github.com/opendatalab/MinerU.git, synced 2026-04-02 22:18:36 +07:00
Compare commits: mineru-3.0...master (12 commits)
| Author | SHA1 | Date |
|---|---|---|
| | d59a69236c | |
| | 904394497e | |
| | a25798b43a | |
| | d7011f42e2 | |
| | ede8d95bf1 | |
| | 54b68d4bf1 | |
| | 1b478c24cf | |
| | 39b62cc76a | |
| | 13465ff43f | |
| | d18b7df766 | |
| | a97753c86f | |
| | a3b65470cf | |
README.md (33 lines changed)
@@ -43,6 +43,39 @@
 
 </div>
 
+
+<details>
+<summary>MinerU — High-accuracy document parsing engine for LLM · RAG · Agent workflows</summary>
+Converts PDF · Word · PPT · Images · Web pages into structured Markdown / JSON · VLM+OCR dual engine · 109 languages <br>
+MCP Server · LangChain / Dify / FastGPT native integration · 10+ domestic AI chip support
+
+**🔍 Core Parsing Capabilities**
+
+- Formulas → LaTeX · Tables → HTML, accurate layout reconstruction
+- Supports scanned docs, handwriting, multi-column layouts, cross-page table merging
+- Output follows human reading order with automatic header/footer removal
+- VLM + OCR dual engine, 109-language OCR recognition
+
+**🔌 Integration**
+
+| Use Case | Solution |
+|----------|----------|
+| AI Coding Tools | MCP Server — Cursor · Claude Desktop · Windsurf |
+| RAG Frameworks | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
+| Development | Python / Go / TypeScript SDK · CLI · REST API · Docker |
+| No-Code | mineru.net online · Gradio WebUI · Desktop client |
+
+**🖥️ Deployment (Private · Fully Offline)**
+
+| Inference Backend | Best For |
+|------------------|---------|
+| pipeline | Fast & stable, no hallucination, runs on CPU or GPU |
+| vlm-engine | High accuracy, supports vLLM / LMDeploy / mlx ecosystem |
+| hybrid-engine | High accuracy, native text extraction, low hallucination |
+
+Domestic AI chips: Ascend · Cambricon · Enflame · MetaX · Moore Threads · Kunlunxin · Iluvatar · Hygon · Biren · T-Head
+</details>
+
 # Changelog
 
 - 2026/03/29 3.0.0 Released
@@ -43,6 +43,38 @@
 
 </div>
 
+<details>
+<summary>MinerU — High-accuracy document parsing engine built for LLM · RAG · Agent scenarios</summary>
+Converts PDF · Word · PPT · images · web pages into structured Markdown / JSON · VLM+OCR dual engine · 109 languages <br>
+MCP Server · LangChain / Dify / FastGPT native integration · 10+ domestic compute platforms supported <br>
+
+**🔍 Core Parsing Capabilities**
+- Formulas → LaTeX · Tables → HTML, accurately reconstructing complex layouts
+- Supports scanned documents, handwriting, multi-column layouts, and cross-page table merging
+- Output follows human reading order, with headers and footers removed automatically
+- VLM + OCR dual engine, with recognition for 109 languages
+
+**🔌 Integration Options**
+
+| Scenario | Solution |
+|------|------|
+| AI coding tools | MCP Server — Cursor · Claude Desktop · Windsurf |
+| RAG frameworks | LangChain · LlamaIndex · RAGFlow · RAG-Anything · Flowise · Dify · FastGPT |
+| Development integration | Python / Go / TypeScript SDK · CLI · REST API · Docker |
+| No-code | mineru.net online version · Gradio WebUI · Desktop client |
+
+**🖥️ Deployment Ecosystem (Private Deployment · Fully Offline)**
+
+| Inference Backend | Use Case |
+|--------------|-----------------------------|
+| pipeline | Fast and stable, no hallucination, runs on CPU or GPU |
+| vlm-engine | High accuracy, supports the vLLM / LMDeploy / mlx ecosystem |
+| hybrid-engine| High accuracy, native text extraction, low hallucination |
+
+Domestic compute platforms: Ascend · Cambricon · Enflame · MetaX · Moore Threads · Kunlunxin · Iluvatar CoreX · Vastai · Tecorigin · Hygon · T-Head
+
+</details>
+
 # Changelog
 
 - 2026/03/29 3.0.0 released
@@ -43,12 +43,12 @@ If you need to adjust parsing options through custom parameters, you can also ch
 >- API outputs are controlled by the server and written to `./output` by default
 >- Uploads currently support `PDF`, image, and `DOCX` files
 >
->`POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
->When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
->Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
->Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
->Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
->Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
+>- `POST /tasks` returns immediately with a `task_id`. `POST /file_parse` uses the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
+>- When a task is waiting in the queue, both the submission response and task-status response may include `queued_ahead` to indicate how many tasks are ahead of it.
+>- Tasks are tracked only in-process for a single `mineru-api` instance. Task status is not preserved across service restarts, `--reload`, or multi-process deployments.
+>- Completed or failed tasks are retained for 24 hours by default, then their task state and output directory are cleaned automatically. After cleanup, task status and result endpoints return `404`.
+>- Use `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust retention and cleanup polling intervals.
+>- Use `--enable-vlm-preload true` to warm up the local VLM model during service startup instead of waiting for the first VLM or hybrid request.
 >
 >Asynchronous task submission example:
 >```bash
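
The retention rule described in the hunk above can be sketched in Python. This is a minimal illustration under stated assumptions, not MinerU's implementation: the variable name `MINERU_API_TASK_RETENTION_SECONDS` and the 24-hour default come from the documented notes, while `retention_seconds`, `is_expired`, and the integer parsing are hypothetical.

```python
import os

# Default retention stated in the documentation notes: 24 hours.
DEFAULT_RETENTION_SECONDS = 24 * 60 * 60


def retention_seconds(env=None):
    """Read MINERU_API_TASK_RETENTION_SECONDS, falling back to 24 hours.

    How mineru-api parses this variable is not shown in the diff; the int()
    parsing and the fallback for empty values are assumptions.
    """
    env = os.environ if env is None else env
    raw = env.get("MINERU_API_TASK_RETENTION_SECONDS")
    if raw is None or not raw.strip():
        return DEFAULT_RETENTION_SECONDS
    return int(raw)


def is_expired(finished_at, now, retention):
    """Hypothetical helper: True once a finished task has outlived its retention window.

    All three arguments are plain numbers of seconds.
    """
    return now - finished_at >= retention
```

A client that stores a `task_id` for later use could apply the same arithmetic to decide whether a `404` from the status or result endpoints is expected rather than an error.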
@@ -43,12 +43,12 @@ mineru -p <input_path> -o <output_path>
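 >- The API output directory is fixed by the server and defaults to `./output`
 >- Uploads currently support `PDF`, images, and `DOCX`
 >
->`POST /tasks` returns a `task_id` immediately; `POST /file_parse` submits to the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
->While a task is queued, both the submission response and the status query response may include a `queued_ahead` field indicating the number of tasks ahead of it in the queue.
->Task state is kept in-process within a single process; after a service restart, a `--reload` hot reload, or a multi-process deployment, historical task status is not guaranteed to remain queryable.
->By default, tasks are retained for 24 hours after they complete or fail, after which the task state and output directory are cleaned up automatically; after cleanup, requests for task status or results return `404`.
->Use the environment variables `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust the retention period and the cleanup polling interval.
->Use `--enable-vlm-preload true` to warm up the local VLM model during service startup, so that it is not initialized on the first VLM or hybrid request.
+>- `POST /tasks` returns a `task_id` immediately; `POST /file_parse` submits to the same task manager internally, waits for the task to finish, and then returns the final result synchronously.
+>- While a task is queued, both the submission response and the status query response may include a `queued_ahead` field indicating the number of tasks ahead of it in the queue.
+>- Task state is kept in-process within a single process; after a service restart, a `--reload` hot reload, or a multi-process deployment, historical task status is not guaranteed to remain queryable.
+>- By default, tasks are retained for 24 hours after they complete or fail, after which the task state and output directory are cleaned up automatically; after cleanup, requests for task status or results return `404`.
+>- Use the environment variables `MINERU_API_TASK_RETENTION_SECONDS` and `MINERU_API_TASK_CLEANUP_INTERVAL_SECONDS` to adjust the retention period and the cleanup polling interval.
+>- Use `--enable-vlm-preload true` to warm up the local VLM model during service startup, so that it is not initialized on the first VLM or hybrid request.
 >
 >Asynchronous task submission example:
 >```bash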
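
Because `queued_ahead` is described as a field that may or may not be present, client code probably should not assume it exists. The sketch below shows one way to handle that. Only the field names `task_id` and `queued_ahead` come from the notes above; `describe_task` and the message wording are hypothetical, and the rest of the response schema is not shown in this diff.

```python
def describe_task(response):
    """Summarize a submission or status payload (hypothetical helper).

    `response` is the decoded JSON body as a dict. `queued_ahead` is optional,
    so a missing key is treated as "no queue position reported" rather than
    as zero.
    """
    task_id = response.get("task_id", "<unknown>")
    queued_ahead = response.get("queued_ahead")
    if queued_ahead is None:
        return f"task {task_id}: no queue position reported"
    return f"task {task_id}: {queued_ahead} task(s) ahead in the queue"
```

Keeping "missing" distinct from `0` avoids reporting that a task is next in line when the server simply omitted the field.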
@@ -701,7 +701,7 @@ def mk_blocks_to_markdown(para_blocks, make_mode, img_buket_path='', page_idx=No
             continue
         else:
             # page_markdown.append(para_text.strip())
-            page_markdown.append(para_text)
+            page_markdown.append(para_text.strip('\r\n'))
 
     return page_markdown
 
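
The hunk above switches from appending `para_text` unchanged to appending `para_text.strip('\r\n')`, while the commented-out line shows an earlier `para_text.strip()` variant. The difference between the three can be illustrated with a made-up sample string (the value of `para_text` here is an assumption chosen for the demonstration, not data from the repository):

```python
# Hypothetical paragraph text with leading spaces, an inner tab,
# trailing spaces, and a Windows-style line ending.
para_text = "  indented line\twith trailing spaces  \r\n"

# Behavior before the change: the text is appended as-is.
unchanged = para_text

# Commented-out variant: strip() with no argument removes all leading and
# trailing whitespace, including the spaces.
fully_stripped = para_text.strip()

# Behavior after the change: only carriage returns and newlines are removed
# from both ends; spaces and tabs are preserved.
newline_stripped = para_text.strip('\r\n')
```

With a character-set argument, `str.strip` removes any combination of the listed characters from both ends and stops at the first character not in the set, which is why the surrounding spaces survive.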
@@ -1 +1 @@
-__version__ = "3.0.5"
+__version__ = "3.0.7"