mirror of
https://github.com/opendatalab/MinerU.git
synced 2026-03-27 11:08:32 +07:00
feat: update README to clarify project submission status and improve layout
@@ -1,15 +1,7 @@
# Welcome to the MinerU Project List (Archived)

>[!NOTE]
> All projects in this repository are contributed by community developers. The official team does not provide maintenance or technical support. To consolidate resources, this repository has stopped accepting new project submissions and maintenance requests for existing projects.
> If you have an excellent project based on MinerU that you'd like to share, please submit the project link to the well-maintained community resource repository [awesome-mineru](https://github.com/opendatalab/awesome-mineru).
> Thank you for your support and contribution to the MinerU ecosystem!

## Project List

- Projects compatible with version 2.0:
  - [multi_gpu_v2](./multi_gpu_v2/README.md): Multi-GPU parallel processing based on LitServe
  - [mineru_tianshu](./mineru_tianshu/README.md): Asynchronous multi-GPU document parsing service based on MinerU

- Projects not yet compatible with version 2.0:
  - [mcp](./mcp/README.md): MCP server based on the official API
@@ -1,16 +1,7 @@
# Welcome to the MinerU Project List (Archived)

>[!NOTE]
> All projects in this repository are contributed by community developers; the official team does not provide maintenance or technical support.
> To consolidate resources, this repository has stopped accepting new project submissions and maintenance requests for existing projects.
> If you have an excellent project based on MinerU that you'd like to share, please submit the project link to the well-maintained community resource repository [awesome-mineru](https://github.com/opendatalab/awesome-mineru).
> Thank you for your support and contribution to the MinerU ecosystem!

## Project List

- Projects compatible with version 2.0:
  - [multi_gpu_v2](./multi_gpu_v2/README_zh.md): Multi-GPU parallel processing based on LitServe
  - [mineru_tianshu](./mineru_tianshu/README.md): Asynchronous multi-GPU document parsing service based on MinerU

- Projects not yet compatible with version 2.0:
  - [mcp](./mcp/README.md): MCP server based on the official API
@@ -1,5 +0,0 @@
MINERU_API_BASE = "https://mineru.net"
MINERU_API_KEY = "eyJ0eXB..."
OUTPUT_DIR=./downloads
USE_LOCAL_API=false
LOCAL_MINERU_API_BASE="http://localhost:8888"
projects/mcp/.gitignore (vendored)
@@ -1,12 +0,0 @@
downloads
.env
uv.lock
.venv
src/mineru/__pycache__
dist
.DS_Store
.cursor
build
*.lock
src/mineru_mcp.egg-info
test
@@ -1,164 +0,0 @@
# MinerU MCP-Server Docker Deployment Guide

## 1. Introduction

This document provides a detailed guide to deploying the MinerU MCP-Server with Docker. With Docker you can quickly start the MinerU MCP server in any environment that supports Docker, without dealing with complex environment configuration and dependency management.

Key advantages of Docker deployment:

- **Consistent runtime environment**: the same runtime environment on every platform
- **Simplified deployment**: one-command startup, no manual dependency installation
- **Easy scaling and migration**: straightforward to move and scale the service across environments
- **Resource isolation**: avoids conflicts with other services on the host

## 2. Prerequisites

Before you begin, make sure the following software is installed on your system:

- [Docker](https://www.docker.com/get-started) (19.03 or later)
- [Docker Compose](https://docs.docker.com/compose/install/) (1.27.0 or later)

You can verify the installations with:

```bash
docker --version
docker-compose --version
```

You will also need:

- An API key from the [MinerU website](https://mineru.net) (if you plan to use the remote API)
- Sufficient disk space for the converted files

## 3. Deploying with Docker Compose (Recommended)

Docker Compose is the simplest way to deploy, and is especially suitable for quick starts and development environments.

### 3.1 Prepare the Configuration

1. Clone the repository (if you have not already):

```bash
git clone <repository-url>
cd mineru-mcp
```

2. Create the environment variable file:

```bash
cp .env.example .env
```

3. Edit the `.env` file and set the required variables:

```
MINERU_API_BASE=https://mineru.net
MINERU_API_KEY=your-api-key
OUTPUT_DIR=./downloads
USE_LOCAL_API=false
LOCAL_MINERU_API_BASE=http://localhost:8080
```

If you plan to use the local API, set `USE_LOCAL_API` to `true` and make sure `LOCAL_MINERU_API_BASE` points to your local API service.

### 3.2 Start the Service

From the project root, run:

```bash
docker-compose up -d
```

This will:
- Build the Docker image (if it has not been built yet)
- Create and start the container
- Run the service in the background (the `-d` flag)

The service starts at `http://localhost:8001`. You can connect to this address from an MCP client.

### 3.3 View Logs

To view the service logs, run:

```bash
docker-compose logs -f
```

Press `Ctrl+C` to exit the log view.

### 3.4 Stop the Service

To stop the service, run:

```bash
docker-compose down
```

To also remove the built image, use:

```bash
docker-compose down --rmi local
```

## 4. Building and Running the Docker Image Manually

If you need more control or customization, you can build and run the Docker image manually.

### 4.1 Build the Image

From the project root, run:

```bash
docker build -t mineru-mcp:latest .
```

This builds a Docker image named `mineru-mcp` with the `latest` tag from the Dockerfile.

### 4.2 Run the Container

Run the container with an environment file:

```bash
docker run -p 8001:8001 --env-file .env mineru-mcp:latest
```

Or specify the environment variables directly:

```bash
docker run -p 8001:8001 \
  -e MINERU_API_BASE=https://mineru.net \
  -e MINERU_API_KEY=your-api-key \
  -e OUTPUT_DIR=/app/downloads \
  -v $(pwd)/downloads:/app/downloads \
  mineru-mcp:latest
```

### 4.3 Mount Volumes

To persist the converted files, mount a host directory to the container's output directory:

```bash
docker run -p 8001:8001 --env-file .env \
  -v $(pwd)/downloads:/app/downloads \
  mineru-mcp:latest
```

This mounts the `downloads` folder in the current working directory to `/app/downloads` inside the container.

## 5. Environment Variable Configuration

The Docker environment supports the same environment variables as a standard environment:

| Variable | Description | Default |
| --- | --- | --- |
| `MINERU_API_BASE` | Base URL of the remote MinerU API | `https://mineru.net` |
| `MINERU_API_KEY` | MinerU API key, obtained from the official website | - |
| `OUTPUT_DIR` | Directory where converted files are saved | `/app/downloads` |
| `USE_LOCAL_API` | Whether to use the local API for parsing (only applies to the `local_parse_pdf` tool) | `false` |
| `LOCAL_MINERU_API_BASE` | Base URL of the local API (effective when `USE_LOCAL_API=true`) | `http://localhost:8080` |

In a Docker environment you can:

- Pass an environment file with `--env-file`
- Set variables directly with `-e`
- Configure variables in the `environment` section of `docker-compose.yml`
@@ -1,35 +0,0 @@
FROM python:3.12-slim

# Set working directory
WORKDIR /app

# Configure pip to use the Alibaba Cloud mirror
RUN pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/

# Install dependencies
RUN pip install --no-cache-dir poetry

# Copy project files
COPY pyproject.toml .
COPY README.md .
COPY src/ ./src/

# Install the package
RUN poetry config virtualenvs.create false && \
    poetry install

# Create downloads directory
RUN mkdir -p /app/downloads

# Set environment variables
ENV OUTPUT_DIR=/app/downloads
# MINERU_API_KEY should be provided at runtime
ENV MINERU_API_BASE=https://mineru.net
ENV USE_LOCAL_API=false
ENV LOCAL_MINERU_API_BASE=""

# Expose the port that SSE will run on
EXPOSE 8001

# Set command to start the service with SSE transport
CMD ["mineru-mcp", "--transport", "sse", "--output-dir", "/app/downloads"]
@@ -1,346 +0,0 @@
# MinerU MCP-Server

## 1. Overview

This project provides a **MinerU MCP server** (`mineru-mcp`) built on the **FastMCP** framework. Its main purpose is to act as an interface to the **MinerU API** for converting documents to Markdown.

The server exposes the following tools over the MCP protocol:

1. `parse_documents`: a unified interface that handles both local files and URLs, automatically chooses the most suitable processing mode based on the configuration, and automatically reads the converted content
2. `get_ocr_languages`: returns the list of languages supported by OCR

This makes it easy for other applications or MCP clients to integrate MinerU's document-to-Markdown conversion.

## 2. Core Features

* **Document extraction**: accepts document inputs (single or multiple URLs, single or multiple local paths; doc, ppt, pdf, and image formats are supported), calls the MinerU API to extract content and convert formats, and produces Markdown files.
* **Batch processing**: processes multiple documents at once (pass a list of URLs or local file paths separated by spaces, commas, or newlines).
* **OCR support**: optionally enables OCR (off by default) for scanned or image-based documents.
* **Multi-language support**: recognizes many languages; the document language can be auto-detected or specified manually.
* **Automated workflow**: automatically handles the interaction with the MinerU API, including task submission, status polling, result download and extraction, and result file reading.
* **Local parsing**: can call a locally deployed MinerU model to parse documents directly, without the remote API; useful for privacy-sensitive or offline scenarios.
* **Smart path handling**: automatically distinguishes URLs from local file paths and picks the appropriate processing mode based on the USE_LOCAL_API setting.
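The automated workflow described above reduces to a submit-then-poll loop. A minimal sketch of such a poll loop, with a hypothetical `get_status` callable standing in for the real MinerU status endpoint (this is an illustration, not the server's actual code):

```python
import asyncio
from typing import Awaitable, Callable


async def poll_until_done(
    get_status: Callable[[], Awaitable[str]],
    max_retries: int = 180,
    retry_interval: float = 10.0,
) -> str:
    """Poll get_status() until it reports 'done' or 'failed', or retries run out."""
    for _ in range(max_retries):
        state = await get_status()
        if state in ("done", "failed"):
            return state
        await asyncio.sleep(retry_interval)
    raise TimeoutError("task did not finish within the polling budget")
```

With a real client, `get_status` would wrap the batch-status API call for a given `batch_id`.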
## 3. Installation

Before installing, make sure your system meets the basic requirements:
* Python >= 3.10

### 3.1 Installing with pip (Recommended)

If the package has been published to PyPI or another Python package index, you can install it directly with pip:

```bash
pip install mineru-mcp==1.0.0
```

Current version: 1.0.0

This approach suits regular users who do not need to modify the source code.

### 3.2 Installing from Source

If you need to modify the source code or do development work, install from source.

Clone the repository and enter the project directory:

```bash
git clone <repository-url>  # replace with your repository URL
cd mineru-mcp
```

Installing with `uv` or `pip` inside a virtual environment is recommended:

**Using uv (recommended):**

```bash
# Install uv (if not already installed)
# pip install uv

# Create and activate a virtual environment
uv venv

# Linux/macOS
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

# Install the project and its dependencies
uv pip install -e .
```

**Using pip:**

```bash
# Create and activate a virtual environment
python -m venv .venv

# Linux/macOS
source .venv/bin/activate
# Windows
# .venv\Scripts\activate

# Install the project and its dependencies
pip install -e .
```

## 4. Environment Variable Configuration

The project can be configured through environment variables. You can either set system environment variables directly or create a `.env` file in the project root (see the `.env.example` template).

### 4.1 Supported Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| `MINERU_API_BASE` | Base URL of the remote MinerU API | `https://mineru.net` |
| `MINERU_API_KEY` | MinerU API key, obtained from the [official website](https://mineru.net) | - |
| `OUTPUT_DIR` | Directory where converted files are saved | `./downloads` |
| `USE_LOCAL_API` | Whether to use the local API for parsing | `false` |
| `LOCAL_MINERU_API_BASE` | Base URL of the local API (effective when `USE_LOCAL_API=true`) | `http://localhost:8080` |
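For illustration, these variables can be read in Python roughly the way a config module would read them. This is only a sketch using `os.getenv` with the defaults from the table, not the project's actual config code:

```python
import os


def load_config() -> dict:
    """Read the MinerU MCP settings from the environment, with the documented defaults."""
    return {
        "api_base": os.getenv("MINERU_API_BASE", "https://mineru.net"),
        "api_key": os.getenv("MINERU_API_KEY", ""),
        "output_dir": os.getenv("OUTPUT_DIR", "./downloads"),
        # Environment variables are strings, so compare against "true" explicitly.
        "use_local_api": os.getenv("USE_LOCAL_API", "false").lower() == "true",
        "local_api_base": os.getenv("LOCAL_MINERU_API_BASE", "http://localhost:8080"),
    }
```

Note the boolean conversion: `USE_LOCAL_API` arrives as a string, so `"false"` would be truthy without an explicit comparison.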
### 4.2 Remote API vs. Local API

The project supports two API modes:

* **Remote API**: the default mode; documents are parsed by MinerU's official cloud service. No complex local model deployment is required, but a network connection and an API key are.
* **Local API**: documents are parsed by a locally deployed MinerU engine; suitable when data privacy is critical or offline use is required. Takes effect when `USE_LOCAL_API=true`.

### 4.3 Obtaining an API Key

To obtain a `MINERU_API_KEY`, visit the [MinerU website](https://mineru.net), register an account, and apply for an API key.

## 5. Usage

### 5.1 Tool Overview

The project exposes the following tools over the MCP protocol:

1. **parse_documents**: a unified interface for local files and URLs that automatically picks the processing mode based on the `USE_LOCAL_API` setting and automatically reads the converted file content
2. **get_ocr_languages**: returns the list of languages supported by OCR

### 5.2 Parameters

#### 5.2.1 parse_documents

| Parameter | Type | Description | Default | Mode |
| --- | --- | --- | --- | --- |
| `file_sources` | string | File paths or URLs; separate multiple entries with commas or newlines (supports pdf, ppt, pptx, doc, docx, and the image formats jpg, jpeg, png) | - | All |
| `enable_ocr` | boolean | Whether to enable OCR | `false` | All |
| `language` | string | Document language; defaults to "ch" (Chinese), with "en" (English) and others available | `ch` | All |
| `page_ranges` | string (optional) | Page range as a comma-separated string. For example, "2,4-6" selects page 2 and pages 4 through 6; "2--2" selects from page 2 through the second-to-last page. | `None` | Remote API |

> **Note**:
> - When `USE_LOCAL_API=true`, any URLs that are provided are filtered out and only local file paths are processed
> - When `USE_LOCAL_API=false`, both URLs and local file paths are processed

#### 5.2.2 get_ocr_languages

Takes no parameters.
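The `page_ranges` syntax can be made concrete with a small parser. This is only a sketch of the documented grammar (comma-separated entries, `a-b` ranges, and negative endpoints counted from the end, as in "2--2"); it is not the server's actual implementation, and `total_pages` is an assumed helper parameter:

```python
def expand_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Expand a spec like "2,4-6" into a list of 1-based page numbers.

    Negative endpoints count from the end of the document, so "2--2"
    on a 10-page document means pages 2 through 9.
    """
    def resolve(token: str) -> int:
        n = int(token)
        return n if n > 0 else total_pages + 1 + n  # -1 resolves to the last page

    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part[1:]:  # a range; skip index 0 so a leading minus is not a separator
            split_at = part.index("-", 1)
            start = resolve(part[:split_at])
            end = resolve(part[split_at + 1:])
            pages.extend(range(start, end + 1))
        else:
            pages.append(resolve(part))
    return pages
```

For example, `expand_page_ranges("2,4-6", 10)` yields pages 2, 4, 5, and 6.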
## 6. MCP Client Integration

You can use the MinerU MCP server from any client that supports the MCP protocol.

### 6.1 Using It with Claude

Configure the MinerU MCP server as a tool for Claude, and the document-to-Markdown functionality becomes available directly in Claude. For configuration details, refer to the MCP tool configuration documentation. Depending on how you installed the server, choose one of the following two configurations:

#### 6.1.1 Running from Source

If you installed and run MinerU MCP from source, use the configuration below. This suits scenarios where you need to modify the source or debug during development:

```json
{
  "mcpServers": {
    "mineru-mcp": {
      "command": "uv",
      "args": ["--directory", "/Users/adrianwang/Documents/minerU-mcp", "run", "-m", "mineru.cli"],
      "env": {
        "MINERU_API_BASE": "https://mineru.net",
        "MINERU_API_KEY": "ey...",
        "OUTPUT_DIR": "./downloads",
        "USE_LOCAL_API": "true",
        "LOCAL_MINERU_API_BASE": "http://localhost:8080"
      }
    }
  }
}
```

Characteristics of this configuration:

- Uses the `uv` command
- Points at the source directory via the `--directory` argument
- Runs the module with `-m mineru.cli`
- Suited to development, debugging, and customization

#### 6.1.2 Running the Installed Package

If you installed the mineru-mcp package with pip or uv, you can use the more concise configuration below. This suits production or day-to-day use:

```json
{
  "mcpServers": {
    "mineru-mcp": {
      "command": "uvx",
      "args": ["mineru-mcp"],
      "env": {
        "MINERU_API_BASE": "https://mineru.net",
        "MINERU_API_KEY": "ey...",
        "OUTPUT_DIR": "./downloads",
        "USE_LOCAL_API": "true",
        "LOCAL_MINERU_API_BASE": "http://localhost:8080"
      }
    }
  }
}
```

Characteristics of this configuration:

- Runs the installed package directly with the `uvx` command
- More concise configuration
- No source directory needed
- Suited to stable production use

### 6.2 Using the FastMCP Client

```python
from fastmcp import FastMCP

# Initialize the FastMCP client
client = FastMCP(server_url="http://localhost:8001")

# Process a single document with the parse_documents tool
result = await client.tool_call(
    tool_name="parse_documents",
    params={"file_sources": "/path/to/document.pdf"}
)

# Mix URLs and local files
result = await client.tool_call(
    tool_name="parse_documents",
    params={"file_sources": "/path/to/file.pdf, https://example.com/document.pdf"}
)

# Enable OCR
result = await client.tool_call(
    tool_name="parse_documents",
    params={"file_sources": "/path/to/file.pdf", "enable_ocr": True}
)
```

### 6.3 Running the Service Directly

You can start the MinerU MCP server by setting the environment variables and running the command directly; this is especially convenient for quick tests and development environments.

#### 6.3.1 Set the Environment Variables

First, make sure the required environment variables are set. You can create a `.env` file (see `.env.example`) or set them on the command line:

```bash
# Linux/macOS
export MINERU_API_BASE="https://mineru.net"
export MINERU_API_KEY="your-api-key"
export OUTPUT_DIR="./downloads"
export USE_LOCAL_API="true"  # optional, for local parsing
export LOCAL_MINERU_API_BASE="http://localhost:8080"  # optional, when the local API is enabled

# Windows
set MINERU_API_BASE=https://mineru.net
set MINERU_API_KEY=your-api-key
set OUTPUT_DIR=./downloads
set USE_LOCAL_API=true
set LOCAL_MINERU_API_BASE=http://localhost:8080
```

#### 6.3.2 Start the Service

Start the MinerU MCP server with one of the supported transport modes:

**SSE transport**:
```bash
uv run mineru-mcp --transport sse
```

**Streamable HTTP transport**:
```bash
uv run mineru-mcp --transport streamable-http
```

Or, with a global installation:

```bash
mineru-mcp --transport sse
# or
mineru-mcp --transport streamable-http
```

The service starts at `http://localhost:8001` by default; the transport protocol depends on the `--transport` argument you pass.

> **Note**: the transport modes use different route paths:
> - SSE mode: `/sse` (for example, `http://localhost:8001/sse`)
> - Streamable HTTP mode: `/mcp` (for example, `http://localhost:8001/mcp`)

## 7. Docker Deployment

The project can be deployed with Docker, so you can quickly start the MinerU MCP server in any environment that supports Docker.

### 7.1 Using Docker Compose

1. Make sure Docker and Docker Compose are installed
2. Copy the `.env.example` file in the project root to `.env` and adjust the environment variables to your needs
3. Start the service with:

```bash
docker-compose up -d
```

The service starts at `http://localhost:8001` by default.

### 7.2 Building the Docker Image Manually

To build the Docker image manually, run:

```bash
docker build -t mineru-mcp:latest .
```

Then start the container:

```bash
docker run -p 8001:8001 --env-file .env mineru-mcp:latest
```

For more Docker-related information, see `DOCKER_README.md`.

## 8. FAQ

### 8.1 API Key Problems

**Problem**: cannot connect to the MinerU API, or it returns a 401 error.
**Solution**: check that your API key is set correctly. Make sure the `MINERU_API_KEY` variable in the `.env` file contains a valid key.

### 8.2 Stopping the Service Gracefully

**Problem**: how do I stop the MinerU MCP service properly?
**Solution**: while the service is running, press `Ctrl+C` to exit gracefully. In-flight operations are handled automatically and all resources are released. If one `Ctrl+C` gets no response, press `Ctrl+C` again to force an exit.

### 8.3 File Path Problems

**Problem**: the `parse_documents` tool reports a file-not-found error for local files.
**Solution**: use an absolute path, or a relative path that is correct with respect to the server's working directory.

### 8.4 MCP Call Timeouts

**Problem**: calling the `parse_documents` tool fails with `Error calling tool 'parse_documents': MCP error -32001: Request timed out`.
**Solution**: this is common with large documents or an unstable network. In some MCP clients (such as Cursor), a timeout can leave the MCP service uncallable until the client is restarted. Recent versions of Cursor may show the MCP call as in progress even though it never actually succeeded. Suggestions:

1. **Wait for an official fix**: this is a known Cursor client issue
2. **Process small files**: prefer a few small files over large documents that can time out
3. **Batch in small chunks**: split many files across multiple requests, one or two files each
4. Increase the timeout setting (if the client supports it)
5. If the service becomes uncallable after a timeout, restart the MCP client
6. If timeouts recur, check your network connection or consider the local API mode
@@ -1,14 +0,0 @@
version: '3'

services:
  mineru-mcp:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8001:8001"
    environment:
      - MINERU_API_KEY=${MINERU_API_KEY}
    volumes:
      - ./downloads:/app/downloads
    restart: unless-stopped
@@ -1,39 +0,0 @@
[project]
name = "mineru-mcp"
version = "1.0.0"
description = "MinerU MCP Server for PDF to Markdown conversion"
authors = [
    {name = "minerU", email = "OpenDataLab@pjlab.org.cn"}
]
readme = "README.md"
license = {text = "MIT"}
requires-python = ">=3.10,<4.0"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = [
    "fastmcp>=2.5.2",
    "python-dotenv>=1.0.0",
    "requests>=2.31.0",
    "aiohttp>=3.9.0",
    "httpx>=0.24.0",
    "uvicorn>=0.20.0",
    "starlette>=0.27.0",
]

[project.scripts]
mineru-mcp = "mineru.cli:main"

[tool.poetry]
packages = [{include = "mineru", from = "src"}]

[[tool.poetry.source]]
name = "aliyun"
url = "https://mirrors.aliyun.com/pypi/simple/"
priority = "primary"

[build-system]
requires = ["setuptools>=42.0", "wheel"]
build-backend = "setuptools.build_meta"
@@ -1,729 +0,0 @@
"""API client for MinerU file-to-Markdown conversion."""

import asyncio
import os
import zipfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import aiohttp
import requests

from . import config


def singleton_func(cls):
    """Decorator that caches a single instance of the decorated class."""
    instance = {}

    def _singleton(*args, **kwargs):
        if cls not in instance:
            instance[cls] = cls(*args, **kwargs)
        return instance[cls]

    return _singleton
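The `singleton_func` decorator above returns the same cached instance on every construction. A quick standalone illustration (the `Counter` class is only for demonstration; the decorator is repeated here so the snippet runs on its own):

```python
def singleton_func(cls):
    # Same pattern as the decorator above, repeated so this snippet is self-contained.
    instance = {}

    def _singleton(*args, **kwargs):
        if cls not in instance:
            instance[cls] = cls(*args, **kwargs)
        return instance[cls]

    return _singleton


@singleton_func
class Counter:
    def __init__(self):
        self.value = 0


a = Counter()
a.value = 41
b = Counter()  # returns the cached instance; __init__ does not run again
```

Note that constructor arguments passed on later calls are silently ignored, which is why the client below reads its configuration on first construction.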
@singleton_func
|
||||
class MinerUClient:
|
||||
"""
|
||||
用于与 MinerU API 交互以将 File 转换为 Markdown 的客户端。
|
||||
"""
|
||||
|
||||
def __init__(self, api_base: Optional[str] = None, api_key: Optional[str] = None):
|
||||
"""
|
||||
初始化 MinerU API 客户端。
|
||||
|
||||
Args:
|
||||
api_base: MinerU API 的基础 URL (默认: 从环境变量获取)
|
||||
api_key: 用于向 MinerU 进行身份验证的 API 密钥 (默认: 从环境变量获取)
|
||||
"""
|
||||
self.api_base = api_base or config.MINERU_API_BASE
|
||||
self.api_key = api_key or config.MINERU_API_KEY
|
||||
|
||||
if not self.api_key:
|
||||
# 提供更友好的错误消息
|
||||
raise ValueError(
|
||||
"错误: MinerU API 密钥 (MINERU_API_KEY) 未设置或为空。\n"
|
||||
"请确保已设置 MINERU_API_KEY 环境变量,例如:\n"
|
||||
" export MINERU_API_KEY='your_actual_api_key'\n"
|
||||
"或者,在项目根目录的 `.env` 文件中定义该变量。"
|
||||
)
|
||||
|
||||
async def _request(self, method: str, endpoint: str, **kwargs) -> Dict[str, Any]:
|
||||
"""
|
||||
向 MinerU API 发出请求。
|
||||
|
||||
Args:
|
||||
method: HTTP 方法 (GET, POST 等)
|
||||
endpoint: API 端点路径 (不含基础 URL)
|
||||
**kwargs: 传递给 aiohttp 请求的其他参数
|
||||
|
||||
Returns:
|
||||
dict: API 响应 (JSON 格式)
|
||||
"""
|
||||
url = f"{self.api_base}{endpoint}"
|
||||
headers = {
|
||||
"Authorization": f"Bearer {self.api_key}",
|
||||
"Accept": "application/json",
|
||||
}
|
||||
|
||||
if "headers" in kwargs:
|
||||
kwargs["headers"].update(headers)
|
||||
else:
|
||||
kwargs["headers"] = headers
|
||||
|
||||
# 创建一个不包含授权信息的参数副本,用于日志记录
|
||||
log_kwargs = kwargs.copy()
|
||||
if "headers" in log_kwargs and "Authorization" in log_kwargs["headers"]:
|
||||
log_kwargs["headers"] = log_kwargs["headers"].copy()
|
||||
log_kwargs["headers"]["Authorization"] = "Bearer ****" # 隐藏API密钥
|
||||
|
||||
config.logger.debug(f"API请求: {method} {url}")
|
||||
config.logger.debug(f"请求参数: {log_kwargs}")
|
||||
|
||||
async with aiohttp.ClientSession() as session:
|
||||
async with session.request(method, url, **kwargs) as response:
|
||||
response.raise_for_status()
|
||||
response_json = await response.json()
|
||||
|
||||
config.logger.debug(f"API响应: {response_json}")
|
||||
|
||||
return response_json
|
||||
|
||||
async def submit_file_url_task(
|
||||
self,
|
||||
urls: Union[str, List[Union[str, Dict[str, Any]]], Dict[str, Any]],
|
||||
enable_ocr: bool = True,
|
||||
language: str = "ch",
|
||||
page_ranges: Optional[str] = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
提交 File URL 以转换为 Markdown。支持单个URL或多个URL批量处理。
|
||||
|
||||
Args:
|
||||
urls: 可以是以下形式之一:
|
||||
1. 单个URL字符串
|
||||
2. 多个URL的列表
|
||||
3. 包含URL配置的字典列表,每个字典包含:
|
||||
- url: File文件URL (必需)
|
||||
- is_ocr: 是否启用OCR (可选)
|
||||
- data_id: 文件数据ID (可选)
|
||||
- page_ranges: 页码范围 (可选)
|
||||
enable_ocr: 是否为转换启用 OCR(所有文件的默认值)
|
||||
language: 指定文档语言,默认 ch,中文
|
||||
page_ranges: 指定页码范围,格式为逗号分隔的字符串。例如:"2,4-6"表示选取第2页、第4页至第6页;"2--2"表示从第2页到倒数第2页。
|
||||
|
||||
Returns:
|
||||
dict: 任务信息,包括batch_id
|
||||
"""
|
||||
# 统计URL数量
|
||||
url_count = 1
|
||||
if isinstance(urls, list):
|
||||
url_count = len(urls)
|
||||
config.logger.debug(
|
||||
f"调用submit_file_url_task: {url_count}个URL, "
|
||||
+ f"ocr={enable_ocr}, "
|
||||
+ f"language={language}"
|
||||
)
|
||||
|
||||
# 处理输入,确保我们有一个URL配置列表
|
||||
urls_config = []
|
||||
|
||||
# 转换输入为标准格式
|
||||
if isinstance(urls, str):
|
||||
urls_config.append(
|
||||
{"url": urls, "is_ocr": enable_ocr, "page_ranges": page_ranges}
|
||||
)
|
||||
|
||||
elif isinstance(urls, list):
|
||||
# 处理URL列表或URL配置列表
|
||||
for i, url_item in enumerate(urls):
|
||||
if isinstance(url_item, str):
|
||||
# 简单的URL字符串
|
||||
urls_config.append(
|
||||
{
|
||||
"url": url_item,
|
||||
"is_ocr": enable_ocr,
|
||||
"page_ranges": page_ranges,
|
||||
}
|
||||
)
|
||||
|
||||
elif isinstance(url_item, dict):
|
||||
# 含有详细配置的URL字典
|
||||
if "url" not in url_item:
|
||||
raise ValueError(f"URL配置必须包含 'url' 字段: {url_item}")
|
||||
|
||||
url_is_ocr = url_item.get("is_ocr", enable_ocr)
|
||||
url_page_ranges = url_item.get("page_ranges", page_ranges)
|
||||
|
||||
url_config = {"url": url_item["url"], "is_ocr": url_is_ocr}
|
||||
if url_page_ranges is not None:
|
||||
url_config["page_ranges"] = url_page_ranges
|
||||
|
||||
urls_config.append(url_config)
|
||||
else:
|
||||
raise TypeError(f"不支持的URL配置类型: {type(url_item)}")
|
||||
elif isinstance(urls, dict):
|
||||
# 单个URL配置字典
|
||||
if "url" not in urls:
|
||||
raise ValueError(f"URL配置必须包含 'url' 字段: {urls}")
|
||||
|
||||
url_is_ocr = urls.get("is_ocr", enable_ocr)
|
||||
url_page_ranges = urls.get("page_ranges", page_ranges)
|
||||
|
||||
url_config = {"url": urls["url"], "is_ocr": url_is_ocr}
|
||||
if url_page_ranges is not None:
|
||||
url_config["page_ranges"] = url_page_ranges
|
||||
|
||||
urls_config.append(url_config)
|
||||
else:
|
||||
raise TypeError(f"urls 必须是字符串、列表或字典,而不是 {type(urls)}")
|
||||
|
||||
# 构建API请求payload
|
||||
files_payload = urls_config # 与submit_file_task不同,这里直接使用URLs配置
|
||||
|
||||
payload = {
|
||||
"language": language,
|
||||
"files": files_payload,
|
||||
}
|
||||
|
||||
# 调用批量API
|
||||
response = await self._request(
|
||||
"POST", "/api/v4/extract/task/batch", json=payload
|
||||
)
|
||||
|
||||
# 检查响应
|
||||
if "data" not in response or "batch_id" not in response["data"]:
|
||||
raise ValueError(f"提交批量URL任务失败: {response}")
|
||||
|
||||
batch_id = response["data"]["batch_id"]
|
||||
|
||||
config.logger.info(f"开始处理 {len(urls_config)} 个文件URL")
|
||||
config.logger.debug(f"批量URL任务提交成功,批次ID: {batch_id}")
|
||||
|
||||
# 返回包含batch_id的响应和URLs信息
|
||||
result = {
|
||||
"data": {
|
||||
"batch_id": batch_id,
|
||||
"uploaded_files": [url_config.get("url") for url_config in urls_config],
|
||||
}
|
||||
}
|
||||
|
||||
# 对于单个URL的情况,设置file_name以保持与原来返回格式的兼容性
|
||||
if len(urls_config) == 1:
|
||||
url = urls_config[0]["url"]
|
||||
# 从URL中提取文件名
|
||||
file_name = url.split("/")[-1]
|
||||
result["data"]["file_name"] = file_name
|
||||
|
||||
return result
|
||||
|
||||
async def submit_file_task(
|
||||
self,
|
||||
files: Union[str, List[Union[str, Dict[str, Any]]], Dict[str, Any]],
|
||||
enable_ocr: bool = True,
|
||||
language: str = "ch",
|
||||
page_ranges: Optional[str] = None,
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
提交本地 File 文件以转换为 Markdown。支持单个文件路径或多个文件配置。
|
||||
|
||||
Args:
|
||||
files: 可以是以下形式之一:
|
||||
1. 单个文件路径字符串
|
||||
2. 多个文件路径的列表
|
||||
3. 包含文件配置的字典列表,每个字典包含:
|
||||
- path/name: 文件路径或文件名
|
||||
- is_ocr: 是否启用OCR (可选)
|
||||
- data_id: 文件数据ID (可选)
|
||||
- page_ranges: 页码范围 (可选)
|
||||
enable_ocr: 是否为转换启用 OCR(所有文件的默认值)
|
||||
language: 指定文档语言,默认 ch,中文
|
||||
page_ranges: 指定页码范围,格式为逗号分隔的字符串。例如:"2,4-6"表示选取第2页、第4页至第6页;"2--2"表示从第2页到倒数第2页。
|
||||
|
||||
Returns:
|
||||
dict: 任务信息,包括batch_id
|
||||
"""
|
||||
# 统计文件数量
|
||||
file_count = 1
|
||||
if isinstance(files, list):
|
||||
file_count = len(files)
|
||||
config.logger.debug(
|
||||
f"调用submit_file_task: {file_count}个文件, "
|
||||
+ f"ocr={enable_ocr}, "
|
||||
+ f"language={language}"
|
||||
)
|
||||
|
||||
# 处理输入,确保我们有一个文件配置列表
|
||||
files_config = []
|
||||
|
||||
# 转换输入为标准格式
|
||||
if isinstance(files, str):
|
||||
# 单个文件路径
|
||||
file_path = Path(files)
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
|
||||
|
||||
files_config.append(
|
||||
{
|
||||
"path": file_path,
|
||||
"name": file_path.name,
|
||||
"is_ocr": enable_ocr,
|
||||
"page_ranges": page_ranges,
|
||||
}
|
||||
)
|
||||
|
||||
elif isinstance(files, list):
|
||||
# 处理文件路径列表或文件配置列表
|
||||
for i, file_item in enumerate(files):
|
||||
if isinstance(file_item, str):
|
||||
# 简单的文件路径
|
||||
file_path = Path(file_item)
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
|
||||
|
||||
files_config.append(
|
||||
{
|
||||
"path": file_path,
|
||||
"name": file_path.name,
|
||||
"is_ocr": enable_ocr,
|
||||
"page_ranges": page_ranges,
|
||||
}
|
||||
)
|
||||
|
||||
elif isinstance(file_item, dict):
|
||||
# 含有详细配置的文件字典
|
||||
if "path" not in file_item and "name" not in file_item:
|
||||
raise ValueError(
|
||||
f"文件配置必须包含 'path' 或 'name' 字段: {file_item}"
|
||||
)
|
||||
|
||||
if "path" in file_item:
|
||||
file_path = Path(file_item["path"])
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
|
||||
|
||||
file_name = file_path.name
|
||||
else:
|
||||
file_name = file_item["name"]
|
||||
file_path = None
|
||||
|
||||
file_is_ocr = file_item.get("is_ocr", enable_ocr)
|
||||
file_page_ranges = file_item.get("page_ranges", page_ranges)
|
||||
|
||||
file_config = {
|
||||
"path": file_path,
|
||||
"name": file_name,
|
||||
"is_ocr": file_is_ocr,
|
||||
}
|
||||
if file_page_ranges is not None:
|
||||
file_config["page_ranges"] = file_page_ranges
|
||||
|
||||
files_config.append(file_config)
|
||||
else:
|
||||
raise TypeError(f"不支持的文件配置类型: {type(file_item)}")
|
||||
elif isinstance(files, dict):
|
||||
# 单个文件配置字典
|
||||
if "path" not in files and "name" not in files:
|
||||
raise ValueError(f"文件配置必须包含 'path' 或 'name' 字段: {files}")
|
||||
|
||||
if "path" in files:
|
||||
file_path = Path(files["path"])
|
||||
if not file_path.exists():
|
||||
raise FileNotFoundError(f"未找到 File 文件: {file_path}")
|
||||
|
||||
file_name = file_path.name
|
||||
else:
|
||||
file_name = files["name"]
|
||||
file_path = None
|
||||
|
||||
file_is_ocr = files.get("is_ocr", enable_ocr)
|
||||
file_page_ranges = files.get("page_ranges", page_ranges)
|
||||
|
||||
file_config = {
|
||||
"path": file_path,
|
||||
"name": file_name,
|
||||
"is_ocr": file_is_ocr,
|
||||
}
|
||||
if file_page_ranges is not None:
|
||||
file_config["page_ranges"] = file_page_ranges
|
||||
|
||||
files_config.append(file_config)
|
||||
else:
|
||||
raise TypeError(f"files 必须是字符串、列表或字典,而不是 {type(files)}")
|
||||
|
||||
# 步骤1: 构建API请求payload
|
||||
files_payload = []
|
||||
for file_config in files_config:
|
||||
file_payload = {
|
||||
"name": file_config["name"],
|
||||
"is_ocr": file_config["is_ocr"],
|
||||
}
|
||||
if "page_ranges" in file_config and file_config["page_ranges"] is not None:
|
||||
file_payload["page_ranges"] = file_config["page_ranges"]
|
||||
files_payload.append(file_payload)
|
||||
|
||||
payload = {
|
||||
"language": language,
|
||||
"files": files_payload,
|
||||
}
|
||||
|
||||
# 步骤2: 获取文件上传URL
|
||||
response = await self._request("POST", "/api/v4/file-urls/batch", json=payload)
|
||||
|
||||
# 检查响应
|
||||
if (
|
||||
"data" not in response
|
||||
or "batch_id" not in response["data"]
|
||||
or "file_urls" not in response["data"]
|
||||
):
|
||||
raise ValueError(f"获取上传URL失败: {response}")
|
||||
|
||||
batch_id = response["data"]["batch_id"]
|
||||
file_urls = response["data"]["file_urls"]
|
||||
|
||||
if len(file_urls) != len(files_config):
|
||||
raise ValueError(
|
||||
f"上传URL数量 ({len(file_urls)}) 与文件数量 ({len(files_config)}) 不匹配"
|
||||
)
|
||||
|
||||
config.logger.info(f"开始上传 {len(file_urls)} 个本地文件")
|
||||
config.logger.debug(f"获取上传URL成功,批次ID: {batch_id}")
|
||||
|
||||
# 步骤3: 上传所有文件
|
||||
uploaded_files = []
|
||||
|
||||
for i, (file_config, upload_url) in enumerate(zip(files_config, file_urls)):
|
||||
file_path = file_config["path"]
|
||||
if file_path is None:
|
||||
raise ValueError(f"文件 {file_config['name']} 没有有效的路径")
|
||||
|
||||
try:
|
||||
with open(file_path, "rb") as f:
|
||||
# 重要:不设置Content-Type,让OSS自动处理
|
||||
response = requests.put(upload_url, data=f)
|
||||
|
||||
if response.status_code != 200:
|
||||
raise ValueError(
|
||||
f"文件上传失败,状态码: {response.status_code}, 响应: {response.text}"
|
||||
)
|
||||
|
||||
config.logger.debug(f"文件 {file_path.name} 上传成功")
|
||||
uploaded_files.append(file_path.name)
|
||||
except Exception as e:
|
||||
raise ValueError(f"文件 {file_path.name} 上传失败: {str(e)}")
|
||||
|
||||
config.logger.info(f"文件上传完成,共 {len(uploaded_files)} 个文件")
|
||||
|
||||
# 返回包含batch_id的响应和已上传的文件信息
|
||||
result = {"data": {"batch_id": batch_id, "uploaded_files": uploaded_files}}
|
||||
|
||||
# 对于单个文件的情况,保持与原来返回格式的兼容性
|
||||
if len(uploaded_files) == 1:
|
||||
result["data"]["file_name"] = uploaded_files[0]
|
||||
|
||||
return result
|
||||
|
||||
    async def get_batch_task_status(self, batch_id: str) -> Dict[str, Any]:
        """
        Get the status of a batch conversion task.

        Args:
            batch_id: ID of the batch task

        Returns:
            dict: Batch task status information
        """
        response = await self._request(
            "GET", f"/api/v4/extract-results/batch/{batch_id}"
        )

        return response

    async def process_file_to_markdown(
        self,
        task_fn,
        task_arg: Union[str, List[Dict[str, Any]], Dict[str, Any]],
        enable_ocr: bool = True,
        output_dir: Optional[str] = None,
        max_retries: int = 180,
        retry_interval: int = 10,
    ) -> Union[str, Dict[str, Any]]:
        """
        Handle the file-to-Markdown conversion from start to finish.

        Args:
            task_fn: Task submission function (submit_file_url_task or submit_file_task)
            task_arg: Argument for the task function; one of:
                - a URL string
                - a file path string
                - a dict with a single file config
                - a list of dicts with multiple file configs
            enable_ocr: Whether to enable OCR
            output_dir: Output directory for the results
            max_retries: Maximum number of status-check retries
            retry_interval: Time between status checks (seconds)

        Returns:
            Union[str, Dict[str, Any]]:
                - single file: path of the directory containing the extracted Markdown file
                - multiple files: {
                    "results": [
                        {
                            "filename": str,
                            "status": str,
                            "content": str,
                            "error_message": str,
                        }
                    ],
                    "extract_dir": str
                }
        """
        try:
            # Submit the task - called with positional arguments, not keyword arguments
            task_info = await task_fn(task_arg, enable_ocr)

            # Batch task handling
            batch_id = task_info["data"]["batch_id"]

            # Get the names of all uploaded files
            uploaded_files = task_info["data"].get("uploaded_files", [])
            if not uploaded_files and "file_name" in task_info["data"]:
                uploaded_files = [task_info["data"]["file_name"]]

            if not uploaded_files:
                raise ValueError("Unable to get uploaded file information")

            config.logger.debug(f"Batch task submitted successfully. Batch ID: {batch_id}")

            # Track the processing status of every file
            files_status = {}  # keyed by file_name
            files_download_urls = {}
            failed_files = {}  # failed files and their error messages

            # Prepare the output path
            output_path = config.ensure_output_dir(output_dir)

            # Poll for task completion
            for i in range(max_retries):
                status_info = await self.get_batch_task_status(batch_id)

                config.logger.debug(f"Polling result: {status_info}")

                if (
                    "data" not in status_info
                    or "extract_result" not in status_info["data"]
                ):
                    config.logger.error(f"Failed to get batch task status: {status_info}")
                    await asyncio.sleep(retry_interval)
                    continue

                # Check the status of every file
                all_done = True
                has_progress = False

                for result in status_info["data"]["extract_result"]:
                    file_name = result.get("file_name")

                    if not file_name:
                        continue

                    # Initialize the status if not recorded yet
                    if file_name not in files_status:
                        files_status[file_name] = "pending"

                    state = result.get("state")
                    files_status[file_name] = state

                    if state == "done":
                        # Save the download link
                        full_zip_url = result.get("full_zip_url")
                        if full_zip_url:
                            files_download_urls[file_name] = full_zip_url
                            config.logger.info(f"File {file_name} processed successfully")
                        else:
                            config.logger.debug(
                                f"File {file_name} is marked done but has no download link"
                            )
                            all_done = False
                    elif state in ["failed", "error"]:
                        err_msg = result.get("err_msg", "unknown error")
                        failed_files[file_name] = err_msg
                        config.logger.warning(f"File {file_name} failed: {err_msg}")
                        # Do not raise; continue processing the other files
                    else:
                        all_done = False
                        # Show progress information
                        if state == "running" and "extract_progress" in result:
                            has_progress = True
                            progress = result["extract_progress"]
                            extracted = progress.get("extracted_pages", 0)
                            total = progress.get("total_pages", 0)
                            if total > 0:
                                percent = (extracted / total) * 100
                                config.logger.info(
                                    f"Progress: {file_name} "
                                    + f"{extracted}/{total} pages "
                                    + f"({percent:.1f}%)"
                                )

                # Check whether all files have been processed
                expected_file_count = len(uploaded_files)
                processed_file_count = len(files_status)
                completed_file_count = len(files_download_urls) + len(failed_files)

                # Log the current status
                config.logger.debug(
                    f"File processing status: all_done={all_done}, "
                    + f"files_status count={processed_file_count}, "
                    + f"uploaded file count={expected_file_count}, "
                    + f"download link count={len(files_download_urls)}, "
                    + f"failed file count={len(failed_files)}"
                )

                # Determine whether every file is finished (successes and failures alike)
                if (
                    processed_file_count > 0
                    and processed_file_count >= expected_file_count
                    and completed_file_count >= processed_file_count
                ):
                    if files_download_urls or failed_files:
                        config.logger.info("File processing finished")
                        if failed_files:
                            config.logger.warning(
                                f"{len(failed_files)} files failed to process"
                            )
                        break
                    else:
                        # This should not happen, but just in case
                        all_done = False

                # Without progress info, just show a simple waiting message
                if not has_progress:
                    config.logger.info(f"Waiting for files to finish... ({i+1}/{max_retries})")

                await asyncio.sleep(retry_interval)
            else:
                # Max retries exceeded; check whether any files completed at all
                if not files_download_urls and not failed_files:
                    raise TimeoutError(f"Batch task {batch_id} did not complete within the allowed time")
                else:
                    config.logger.warning(
                        "Warning: some files did not complete in time; "
                        "continuing with the completed files"
                    )

            # Create the main extraction directory
            extract_dir = output_path / batch_id
            extract_dir.mkdir(exist_ok=True)

            # Prepare the results list
            results = []

            # Download and extract the results of every successful file
            for file_name, download_url in files_download_urls.items():
                try:
                    config.logger.debug(f"Downloading processing result for: {file_name}")

                    # Use the zip file name from the download URL as the subdirectory name
                    zip_file_name = download_url.split("/")[-1]
                    # Strip the .zip extension
                    zip_dir_name = os.path.splitext(zip_file_name)[0]

                    file_extract_dir = extract_dir / zip_dir_name
                    file_extract_dir.mkdir(exist_ok=True)

                    # Download the ZIP file
                    zip_path = output_path / f"{batch_id}_{zip_file_name}"

                    async with aiohttp.ClientSession() as session:
                        async with session.get(
                            download_url,
                            headers={"Authorization": f"Bearer {self.api_key}"},
                        ) as response:
                            response.raise_for_status()
                            with open(zip_path, "wb") as f:
                                f.write(await response.read())

                    # Extract into the subdirectory
                    with zipfile.ZipFile(zip_path, "r") as zip_ref:
                        zip_ref.extractall(file_extract_dir)

                    # Delete the ZIP file after extraction
                    zip_path.unlink()

                    # Try to read the Markdown content
                    markdown_content = ""
                    markdown_files = list(file_extract_dir.glob("*.md"))
                    if markdown_files:
                        with open(markdown_files[0], "r", encoding="utf-8") as f:
                            markdown_content = f.read()

                    # Append a success result
                    results.append(
                        {
                            "filename": file_name,
                            "status": "success",
                            "content": markdown_content,
                            "extract_path": str(file_extract_dir),
                        }
                    )

                    config.logger.debug(
                        f"Results for {file_name} extracted to: {file_extract_dir}"
                    )

                except Exception as e:
                    # Download failed; append an error result
                    error_msg = f"Failed to download results: {str(e)}"
                    config.logger.error(f"File {file_name} {error_msg}")
                    results.append(
                        {
                            "filename": file_name,
                            "status": "error",
                            "error_message": error_msg,
                        }
                    )

            # Add files that failed processing to the results
            for file_name, error_msg in failed_files.items():
                results.append(
                    {
                        "filename": file_name,
                        "status": "error",
                        "error_message": f"Processing failed: {error_msg}",
                    }
                )

            # Log processing statistics
            success_count = len(files_download_urls)
            fail_count = len(failed_files)
            total_count = success_count + fail_count

            config.logger.info("\n=== File processing statistics ===")
            config.logger.info(f"Total files: {total_count}")
            config.logger.info(f"Succeeded: {success_count}")
            config.logger.info(f"Failed: {fail_count}")

            if failed_files:
                config.logger.info("\nFailed file details:")
                for file_name, error_msg in failed_files.items():
                    config.logger.info(f"  - {file_name}: {error_msg}")

            if success_count > 0:
                config.logger.info(f"\nResults saved to: {extract_dir}")
            else:
                config.logger.info(f"\nOutput directory: {extract_dir}")

            # Return the detailed results
            return {
                "results": results,
                "extract_dir": str(extract_dir),
                "success_count": success_count,
                "fail_count": fail_count,
                "total_count": total_count,
            }

        except Exception as e:
            config.logger.error(f"File-to-Markdown processing failed: {str(e)}")
            raise
@@ -1,73 +0,0 @@
"""Command-line interface for the MinerU file-to-Markdown service."""

import sys
import argparse

from . import config
from . import server


def main():
    """Entry point for the command-line interface."""
    parser = argparse.ArgumentParser(description="MinerU file-to-Markdown conversion service")

    parser.add_argument(
        "--output-dir", "-o", type=str, help="Directory for converted files (default: ./downloads)"
    )

    parser.add_argument(
        "--transport",
        "-t",
        type=str,
        default="stdio",
        help="Transport type (default: stdio; options: sse, streamable-http)",
    )

    parser.add_argument(
        "--port",
        "-p",
        type=int,
        default=8001,
        help="Server port (default: 8001, only used with HTTP transports)",
    )

    parser.add_argument(
        "--host",
        type=str,
        default="127.0.0.1",
        help="Server host address (default: 127.0.0.1, only used with HTTP transports)",
    )

    args = parser.parse_args()

    # Validate argument combinations
    if args.transport == "stdio" and (args.host != "127.0.0.1" or args.port != 8001):
        print("Warning: --host and --port are ignored in STDIO mode", file=sys.stderr)

    # Validate the API key here so that --help etc. can run without one
    if not config.MINERU_API_KEY:
        print(
            "Error: the MINERU_API_KEY environment variable is required to start the service."
            "\nPlease check that it is set, e.g.:"
            "\n  export MINERU_API_KEY='your_actual_api_key'"
            "\nAlternatively, make sure it is defined in the `.env` file at the project root."
            "\n\nUse --help to see the available command-line options.",
            file=sys.stderr,  # send the error message to stderr
        )
        sys.exit(1)

    # Apply the output directory if one was provided
    if args.output_dir:
        server.set_output_dir(args.output_dir)

    # Print configuration info
    print("MinerU file-to-Markdown conversion service starting...")
    if args.transport in ["sse", "streamable-http"]:
        print(f"Server address: {args.host}:{args.port}")
    print("Press Ctrl+C to stop the service")

    server.run_server(mode=args.transport, port=args.port, host=args.host)


if __name__ == "__main__":
    main()
@@ -1,91 +0,0 @@
"""Configuration utilities for the MinerU file-to-Markdown conversion service."""

import os
import logging
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()

# API configuration
MINERU_API_BASE = os.getenv("MINERU_API_BASE", "https://mineru.net")
MINERU_API_KEY = os.getenv("MINERU_API_KEY", "")

# Local API configuration
USE_LOCAL_API = os.getenv("USE_LOCAL_API", "").lower() in ["true", "1", "yes"]
LOCAL_MINERU_API_BASE = os.getenv("LOCAL_MINERU_API_BASE", "http://localhost:8080")

# Default output directory for converted files
DEFAULT_OUTPUT_DIR = os.getenv("OUTPUT_DIR", "./downloads")


# Set up the logging system
def setup_logging():
    """
    Set up the logging system, taking the log level from environment variables.

    Returns:
        logging.Logger: The configured logger.
    """
    # Read the log level settings from the environment
    log_level = os.getenv("MINERU_LOG_LEVEL", "INFO").upper()
    debug_mode = os.getenv("MINERU_DEBUG", "").lower() in ["true", "1", "yes"]

    # debug_mode overrides log_level when set
    if debug_mode:
        log_level = "DEBUG"

    # Make sure log_level is valid
    valid_levels = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]
    if log_level not in valid_levels:
        log_level = "INFO"

    # Log format
    log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

    # Configure logging
    logging.basicConfig(level=getattr(logging, log_level), format=log_format)

    logger = logging.getLogger("mineru")
    logger.setLevel(getattr(logging, log_level))

    # Report the effective log level
    logger.info(f"Log level set to: {log_level}")

    return logger


# Create the default logger
logger = setup_logging()


# Create the output directory if it does not exist
def ensure_output_dir(output_dir=None):
    """
    Ensure that the output directory exists.

    Args:
        output_dir: Optional path of the output directory. Defaults to DEFAULT_OUTPUT_DIR when None.

    Returns:
        A Path object for the output directory.
    """
    output_path = Path(output_dir or DEFAULT_OUTPUT_DIR)
    output_path.mkdir(parents=True, exist_ok=True)
    return output_path


# Validate the API configuration
def validate_api_config():
    """
    Verify that the required API configuration has been set.

    Returns:
        dict: Configuration status.
    """
    return {
        "api_base": MINERU_API_BASE,
        "api_key_set": bool(MINERU_API_KEY),
        "output_dir": DEFAULT_OUTPUT_DIR,
    }
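As a standalone sanity-check sketch of the directory helper above (re-declared here with a `default` parameter in place of the module-level `DEFAULT_OUTPUT_DIR`, so the snippet runs on its own):

```python
import tempfile
from pathlib import Path

def ensure_output_dir(output_dir=None, default="./downloads"):
    # Mirror of the helper above: create the directory (and parents) if missing.
    output_path = Path(output_dir or default)
    output_path.mkdir(parents=True, exist_ok=True)
    return output_path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "a" / "b"
    created = ensure_output_dir(target)
    assert created.is_dir()                       # nested directories were created
    assert ensure_output_dir(target) == created   # idempotent on re-run
```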
@@ -1,76 +0,0 @@
"""Examples demonstrating the MinerU file-to-Markdown client."""

import os
import asyncio
from mcp.client import MCPClient


async def convert_file_url_example():
    """Example: convert a file from a URL."""
    client = MCPClient("http://localhost:8000")

    # Convert a single file URL
    result = await client.call(
        "convert_file_url", url="https://example.com/sample.pdf", enable_ocr=True
    )
    print(f"Conversion result: {result}")

    # Convert multiple file URLs
    urls = """
    https://example.com/doc1.pdf
    https://example.com/doc2.pdf
    """
    result = await client.call("convert_file_url", url=urls, enable_ocr=True)
    print(f"Multiple conversion results: {result}")


async def convert_file_file_example():
    """Example: convert a local file."""
    client = MCPClient("http://localhost:8000")

    # Absolute path of the test file
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.dirname(os.path.dirname(os.path.dirname(script_dir)))
    test_file_path = os.path.join(project_root, "test_files", "test.pdf")

    # Convert a single local file
    result = await client.call(
        "convert_file_file", file_path=test_file_path, enable_ocr=True
    )
    print(f"File conversion result: {result}")


async def get_api_status_example():
    """Example: query the API status."""
    client = MCPClient("http://localhost:8000")

    # Get the API status
    status = await client.get_resource("status://api")
    print(f"API status: {status}")

    # Get the usage help
    help_text = await client.get_resource("help://usage")
    print(f"Usage help: {help_text[:100]}...")  # show the first 100 characters


async def main():
    """Run all examples."""
    print("Running file-to-Markdown conversion examples...")

    # Check whether the API key is set
    if not os.environ.get("MINERU_API_KEY"):
        print("Warning: the MINERU_API_KEY environment variable is not set.")
        print("Set it with: export MINERU_API_KEY=your_api_key")
        print("Skipping examples that require API access...")

        # Only fetch the API status
        await get_api_status_example()
    else:
        # Run all examples
        await convert_file_url_example()
        await convert_file_file_example()
        await get_api_status_example()


if __name__ == "__main__":
    asyncio.run(main())
@@ -1,106 +0,0 @@
"""Languages supported by MinerU."""

from typing import Dict, List

# Supported languages
LANGUAGES: List[Dict[str, str]] = [
    {"name": "中文", "description": "Chinese & English", "code": "ch"},
    {"name": "英文", "description": "English", "code": "en"},
    {"name": "法文", "description": "French", "code": "fr"},
    {"name": "德文", "description": "German", "code": "german"},
    {"name": "日文", "description": "Japanese", "code": "japan"},
    {"name": "韩文", "description": "Korean", "code": "korean"},
    {"name": "中文繁体", "description": "Chinese Traditional", "code": "chinese_cht"},
    {"name": "意大利文", "description": "Italian", "code": "it"},
    {"name": "西班牙文", "description": "Spanish", "code": "es"},
    {"name": "葡萄牙文", "description": "Portuguese", "code": "pt"},
    {"name": "俄罗斯文", "description": "Russian", "code": "ru"},
    {"name": "阿拉伯文", "description": "Arabic", "code": "ar"},
    {"name": "印地文", "description": "Hindi", "code": "hi"},
    {"name": "维吾尔", "description": "Uyghur", "code": "ug"},
    {"name": "波斯文", "description": "Persian", "code": "fa"},
    {"name": "乌尔都文", "description": "Urdu", "code": "ur"},
    {"name": "塞尔维亚文(latin)", "description": "Serbian(latin)", "code": "rs_latin"},
    {"name": "欧西坦文", "description": "Occitan", "code": "oc"},
    {"name": "马拉地文", "description": "Marathi", "code": "mr"},
    {"name": "尼泊尔文", "description": "Nepali", "code": "ne"},
    {
        "name": "塞尔维亚文(cyrillic)",
        "description": "Serbian(cyrillic)",
        "code": "rs_cyrillic",
    },
    {"name": "毛利文", "description": "Maori", "code": "mi"},
    {"name": "马来文", "description": "Malay", "code": "ms"},
    {"name": "马耳他文", "description": "Maltese", "code": "mt"},
    {"name": "荷兰文", "description": "Dutch", "code": "nl"},
    {"name": "挪威文", "description": "Norwegian", "code": "no"},
    {"name": "波兰文", "description": "Polish", "code": "pl"},
    {"name": "罗马尼亚文", "description": "Romanian", "code": "ro"},
    {"name": "斯洛伐克文", "description": "Slovak", "code": "sk"},
    {"name": "斯洛文尼亚文", "description": "Slovenian", "code": "sl"},
    {"name": "阿尔巴尼亚文", "description": "Albanian", "code": "sq"},
    {"name": "瑞典文", "description": "Swedish", "code": "sv"},
    {"name": "西瓦希里文", "description": "Swahili", "code": "sw"},
    {"name": "塔加洛文", "description": "Tagalog", "code": "tl"},
    {"name": "土耳其文", "description": "Turkish", "code": "tr"},
    {"name": "乌兹别克文", "description": "Uzbek", "code": "uz"},
    {"name": "越南文", "description": "Vietnamese", "code": "vi"},
    {"name": "蒙古文", "description": "Mongolian", "code": "mn"},
    {"name": "车臣文", "description": "Chechen", "code": "che"},
    {"name": "哈里亚纳语", "description": "Haryanvi", "code": "bgc"},
    {"name": "保加利亚文", "description": "Bulgarian", "code": "bg"},
    {"name": "乌克兰文", "description": "Ukrainian", "code": "uk"},
    {"name": "白俄罗斯文", "description": "Belarusian", "code": "be"},
    {"name": "泰卢固文", "description": "Telugu", "code": "te"},
    {"name": "阿巴扎文", "description": "Abaza", "code": "abq"},
    {"name": "泰米尔文", "description": "Tamil", "code": "ta"},
    {"name": "南非荷兰文", "description": "Afrikaans", "code": "af"},
    {"name": "阿塞拜疆文", "description": "Azerbaijani", "code": "az"},
    {"name": "波斯尼亚文", "description": "Bosnian", "code": "bs"},
    {"name": "捷克文", "description": "Czech", "code": "cs"},
    {"name": "威尔士文", "description": "Welsh", "code": "cy"},
    {"name": "丹麦文", "description": "Danish", "code": "da"},
    {"name": "爱沙尼亚文", "description": "Estonian", "code": "et"},
    {"name": "爱尔兰文", "description": "Irish", "code": "ga"},
    {"name": "克罗地亚文", "description": "Croatian", "code": "hr"},
    {"name": "匈牙利文", "description": "Hungarian", "code": "hu"},
    {"name": "印尼文", "description": "Indonesian", "code": "id"},
    {"name": "冰岛文", "description": "Icelandic", "code": "is"},
    {"name": "库尔德文", "description": "Kurdish", "code": "ku"},
    {"name": "立陶宛文", "description": "Lithuanian", "code": "lt"},
    {"name": "拉脱维亚文", "description": "Latvian", "code": "lv"},
    {"name": "达尔瓦文", "description": "Dargwa", "code": "dar"},
    {"name": "因古什文", "description": "Ingush", "code": "inh"},
    {"name": "拉克文", "description": "Lak", "code": "lbe"},
    {"name": "莱兹甘文", "description": "Lezghian", "code": "lez"},
    {"name": "塔巴萨兰文", "description": "Tabassaran", "code": "tab"},
    {"name": "比尔哈文", "description": "Bihari", "code": "bh"},
    {"name": "迈蒂利文", "description": "Maithili", "code": "mai"},
    {"name": "昂加文", "description": "Angika", "code": "ang"},
    {"name": "博杰普尔文", "description": "Bhojpuri", "code": "bho"},
    {"name": "摩揭陀文", "description": "Magahi", "code": "mah"},
    {"name": "那格浦尔文", "description": "Nagpur", "code": "sck"},
    {"name": "尼瓦尔文", "description": "Newari", "code": "new"},
    {"name": "果阿孔卡尼文", "description": "Goan Konkani", "code": "gom"},
    {"name": "梵文", "description": "Sanskrit", "code": "sa"},
    {"name": "阿瓦尔文", "description": "Avar", "code": "ava"},
    {"name": "阿迪赫文", "description": "Adyghe", "code": "ady"},
    {"name": "巴利文", "description": "Pali", "code": "pi"},
    {"name": "拉丁文", "description": "Latin", "code": "la"},
]

# Map language code -> language info for fast lookup
LANGUAGES_DICT: Dict[str, Dict[str, str]] = {lang["code"]: lang for lang in LANGUAGES}


def get_language_list() -> List[Dict[str, str]]:
    """Return the list of all supported languages."""
    return LANGUAGES


def get_language_by_code(code: str) -> Dict[str, str]:
    """Look up language info by language code."""
    return LANGUAGES_DICT.get(
        code, {"name": "未知", "description": "Unknown", "code": code}
    )
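A minimal usage sketch of the lookup helper above, with the table trimmed to two entries so the snippet stands alone:

```python
from typing import Dict, List

LANGUAGES: List[Dict[str, str]] = [
    {"name": "中文", "description": "Chinese & English", "code": "ch"},
    {"name": "英文", "description": "English", "code": "en"},
]
LANGUAGES_DICT: Dict[str, Dict[str, str]] = {lang["code"]: lang for lang in LANGUAGES}

def get_language_by_code(code: str) -> Dict[str, str]:
    # Unknown codes fall back to a placeholder entry instead of raising.
    return LANGUAGES_DICT.get(
        code, {"name": "未知", "description": "Unknown", "code": code}
    )

assert get_language_by_code("en")["description"] == "English"
assert get_language_by_code("xx") == {"name": "未知", "description": "Unknown", "code": "xx"}
```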
File diff suppressed because it is too large
@@ -1,565 +0,0 @@
# MinerU Tianshu (天枢)

> Tianshu - an enterprise-grade multi-GPU document parsing service
> combining a SQLite task queue with LitServe GPU load balancing

## 🌟 Core Features

### High-Performance Architecture
- ✅ **Workers pull actively** - 0.5 s pickup, no scheduler trigger required
- ✅ **Concurrency safe** - atomic operations prevent duplicate task pickup; multiple workers run concurrently
- ✅ **GPU load balancing** - LitServe schedules automatically and avoids VRAM conflicts
- ✅ **Multi-GPU isolation** - each process only uses its assigned GPU, eliminating multi-card contention

### Enterprise Features
- ✅ **Asynchronous processing** - clients get an immediate response (~100 ms) without waiting for parsing
- ✅ **Task persistence** - tasks are stored in SQLite and survive service restarts
- ✅ **Priority queue** - important tasks are processed first
- ✅ **Automatic cleanup** - old result files are purged periodically while database records are kept

### Smart Parsing
- ✅ **Dual parsers** - MinerU (GPU-accelerated) for PDFs/images, MarkItDown (fast) for Office/HTML and other formats
- ✅ **Inline content** - the API returns Markdown content automatically, with optional image upload to MinIO
- ✅ **RESTful API** - usable from any programming language
- ✅ **Live status** - query task progress and state at any time
## 🏗️ System Architecture

```
Client request → FastAPI Server (returns task_id immediately)
                        ↓
           SQLite task queue (concurrency safe)
                        ↓
LitServe worker pool (active pulling + automatic GPU load balancing)
                        ↓
             MinerU / MarkItDown parsing
                        ↓
      Task Scheduler (optional monitoring component)
```

**Architecture highlights**:
- ✅ **Active worker mode**: workers pull tasks in a continuous loop; no scheduler trigger is needed
- ✅ **Concurrency safety**: SQLite uses atomic operations to prevent duplicate task processing
- ✅ **Automatic load balancing**: LitServe assigns tasks to idle GPUs
- ✅ **Smart parsing**: MinerU for PDFs/images, MarkItDown for everything else
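The concurrency-safety point can be sketched as a compare-and-swap on the task row (a standalone toy, not the project's actual `task_db.py` schema): a worker flips a row from `pending` to `processing` only if it is still `pending`, so two workers can never claim the same task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, worker TEXT)")
conn.execute("INSERT INTO tasks VALUES ('t1', 'pending', NULL)")
conn.commit()

def claim_next_task(conn, worker_id):
    # Pick a candidate, then claim it with a guarded UPDATE: the WHERE clause
    # re-checks status = 'pending', so a concurrent claimer makes rowcount 0.
    cur = conn.execute(
        "SELECT task_id FROM tasks WHERE status = 'pending' ORDER BY task_id LIMIT 1"
    )
    row = cur.fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE tasks SET status = 'processing', worker = ? "
        "WHERE task_id = ? AND status = 'pending'",
        (worker_id, row[0]),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None

assert claim_next_task(conn, "w1") == "t1"   # first worker gets the task
assert claim_next_task(conn, "w2") is None   # second worker finds nothing pending
```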
## 🚀 Quick Start

### 1. Install Dependencies

```bash
cd projects/mineru_tianshu
pip install -r requirements.txt
```

> **Supported file formats**:
> - 📄 **PDFs and images** (.pdf, .png, .jpg, .jpeg, .bmp, .tiff, .webp) - parsed with MinerU (GPU-accelerated)
> - 📊 **All other formats** (Office, HTML, text, etc.) - parsed with MarkItDown (fast)
>   - Office: .docx, .doc, .xlsx, .xls, .pptx, .ppt
>   - Web pages: .html, .htm
>   - Text: .txt, .md, .csv, .json, .xml, etc.

### 2. Start the Services

```bash
# Start everything with one command (recommended)
python start_all.py

# Or with a custom configuration
python start_all.py --workers-per-device 2 --devices 0,1
```

> **Note for Windows users**: the project is already tuned for Windows multiprocessing and can be run directly.

### 3. Use the API

**Option A: open the API docs in a browser**
```
http://localhost:8000/docs
```

**Option B: Python client**
```bash
python client_example.py
```

**Option C: cURL**
```bash
# Submit a task
curl -X POST http://localhost:8000/api/v1/tasks/submit \
  -F "file=@document.pdf" \
  -F "lang=ch"

# Query status (parsed content is returned automatically once the task completes)
curl http://localhost:8000/api/v1/tasks/{task_id}

# Query status and upload images to MinIO
curl http://localhost:8000/api/v1/tasks/{task_id}?upload_images=true
```
## 📁 Project Structure

```
mineru_tianshu/
├── task_db.py           # Database management (concurrency safe, supports cleanup)
├── api_server.py        # API server (returns content automatically)
├── litserve_worker.py   # Worker pool (active pulling + dual parsers)
├── task_scheduler.py    # Task scheduler (optional monitoring)
├── start_all.py         # Startup script
├── client_example.py    # Client examples
└── requirements.txt     # Dependencies
```

**Core components**:
- `task_db.py`: uses atomic operations for concurrency safety; supports cleaning up old tasks
- `api_server.py`: the query endpoint returns Markdown content automatically and supports MinIO image upload
- `litserve_worker.py`: workers pull tasks in an active loop, parsing with both MinerU and MarkItDown
- `task_scheduler.py`: optional component used only for monitoring and health checks (defaults: 5-minute monitoring, 15-minute health check)
## 📚 Usage Examples

### Example 1: Submit a Task and Wait for the Result (v2.0 - content returned automatically)

```python
import requests
import time

# Submit the document
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/tasks/submit',
        files={'file': f},
        data={'lang': 'ch', 'priority': 0}
    )
task_id = response.json()['task_id']
print(f"✅ Task submitted: {task_id}")

# Poll until completion
while True:
    response = requests.get(f'http://localhost:8000/api/v1/tasks/{task_id}')
    result = response.json()

    if result['status'] == 'completed':
        # New in v2.0: parsed content is returned automatically once the task completes
        if result.get('data'):
            content = result['data']['content']
            print(f"✅ Parsing finished, content length: {len(content)} characters")
            print(f"   Parser: {result['data'].get('parser', 'Unknown')}")

            # Save the result
            with open('output.md', 'w', encoding='utf-8') as f:
                f.write(content)
        else:
            # The result files have been cleaned up
            print(f"⚠️ Task completed but result files were cleaned up: {result.get('message', '')}")
        break
    elif result['status'] == 'failed':
        print(f"❌ Failed: {result['error_message']}")
        break

    print(f"⏳ Processing... status: {result['status']}")
    time.sleep(2)
```

### Example 2: Upload Images to MinIO (optional)

```python
import requests

task_id = "your-task-id"

# v2.0: the query returns content automatically and can optionally upload images to MinIO
response = requests.get(
    f'http://localhost:8000/api/v1/tasks/{task_id}',
    params={'upload_images': True}  # enable image upload
)

result = response.json()
if result['status'] == 'completed' and result.get('data'):
    # Images have been replaced with MinIO URLs (HTML img tags)
    content = result['data']['content']
    images_uploaded = result['data']['images_uploaded']

    print(f"✅ Images uploaded to MinIO: {images_uploaded}")
    print(f"   Content length: {len(content)} characters")

    # Save the Markdown with MinIO image links
    with open('output_with_cloud_images.md', 'w', encoding='utf-8') as f:
        f.write(content)
```

### Example 3: Batch Processing

```python
import requests
import concurrent.futures

files = ['doc1.pdf', 'report.docx', 'data.xlsx']

def process_file(file_path):
    # Submit a task
    with open(file_path, 'rb') as f:
        response = requests.post(
            'http://localhost:8000/api/v1/tasks/submit',
            files={'file': f}
        )
    return response.json()['task_id']

# Submit concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    task_ids = list(executor.map(process_file, files))
    print(f"✅ Submitted {len(task_ids)} tasks")
```

### Example 4: Using the Built-in Client

```bash
# Run the full set of examples
python client_example.py

# Run a specific example
python client_example.py single    # single task
python client_example.py batch     # batch tasks
python client_example.py priority  # priority queue
```
## ⚙️ Configuration

### Startup Options

```bash
python start_all.py [options]

Options:
  --output-dir PATH            Output directory (default: /tmp/mineru_tianshu_output)
  --api-port PORT              API port (default: 8000)
  --worker-port PORT           Worker port (default: 9000)
  --accelerator TYPE           Accelerator type: auto/cuda/cpu/mps (default: auto)
  --workers-per-device N       Workers per GPU (default: 1)
  --devices DEVICES            GPUs to use (default: auto, all GPUs)
  --poll-interval SECONDS      Worker task-poll interval (default: 0.5 s)
  --enable-scheduler           Enable the optional task scheduler (default: off)
  --monitor-interval SECONDS   Scheduler monitoring interval (default: 300 s = 5 min)
  --cleanup-old-files-days N   Delete result files older than N days (default: 7; 0 = disabled)
```

**Notes on the newer options**:
- `--poll-interval`: how often an idle worker polls for tasks; the 0.5 s default responds almost instantly
- `--enable-scheduler`: whether to start the (optional) scheduler, used only for monitoring and health checks
- `--monitor-interval`: how often the scheduler logs; 5-10 minutes is recommended to avoid log spam
- `--cleanup-old-files-days`: automatically removes old result files while keeping the database records

### Configuration Examples

```bash
# Basic startup (recommended)
python start_all.py

# CPU mode (no GPU, or for testing)
python start_all.py --accelerator cpu

# GPU mode: 24 GB cards, 2 workers per card
python start_all.py --accelerator cuda --workers-per-device 2

# Specific GPUs: only use GPU 0 and 1
python start_all.py --accelerator cuda --devices 0,1

# Enable the monitoring scheduler (optional)
python start_all.py --enable-scheduler --monitor-interval 300

# Lower the worker poll frequency (high-load scenarios)
python start_all.py --poll-interval 1.0

# Disable old-file cleanup (keep all results)
python start_all.py --cleanup-old-files-days 0

# Full configuration example
python start_all.py \
  --accelerator cuda \
  --devices 0,1 \
  --workers-per-device 2 \
  --poll-interval 0.5 \
  --enable-scheduler \
  --monitor-interval 300 \
  --cleanup-old-files-days 7

# Apple Silicon Macs
python start_all.py --accelerator mps
```

### MinIO Configuration (optional)

To enable uploading images to MinIO:

```bash
export MINIO_ENDPOINT="your-endpoint.com"
export MINIO_ACCESS_KEY="your-access-key"
export MINIO_SECRET_KEY="your-secret-key"
export MINIO_BUCKET="your-bucket"
```

### Hardware Requirements

| Backend | VRAM | Recommended GPU |
|------|---------|---------|
| pipeline | 6GB+ | RTX 2060 or better |
| vlm-transformers | 8GB+ | RTX 3060 or better |
| vlm-vllm-engine | 8GB+ | RTX 4070 or better |
|
||||
|
||||
> 完整文档: http://localhost:8000/docs
|
||||
|
||||
### 1. 提交任务
|
||||
```http
|
||||
POST /api/v1/tasks/submit
|
||||
|
||||
参数:
|
||||
file: 文件 (必需)
|
||||
backend: pipeline | vlm-transformers | vlm-vllm-engine (默认: pipeline)
|
||||
lang: ch | en | korean | japan | ... (默认: ch)
|
||||
priority: 0-100 (数字越大越优先,默认: 0)
|
||||
```
|
||||
|
||||
### 2. 查询任务
|
||||
```http
|
||||
GET /api/v1/tasks/{task_id}?upload_images=false
|
||||
|
||||
参数:
|
||||
upload_images: 是否上传图片到 MinIO (默认: false)
|
||||
|
||||
返回:
|
||||
- status: pending | processing | completed | failed
|
||||
- data: 任务完成后**自动返回** Markdown 内容
|
||||
- markdown_file: 文件名
|
||||
- content: 完整的 Markdown 内容
|
||||
- images_uploaded: 是否已上传图片
|
||||
- has_images: 是否包含图片
|
||||
- message: 如果结果文件已清理会提示
|
||||
|
||||
注意:
|
||||
- v2.0 新特性: 完成的任务会自动返回内容,无需额外请求
|
||||
- 如果结果文件已被清理(超过保留期),data 为 null 但任务记录仍可查询
|
||||
```
|
||||
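As a client-side sketch for interpreting this response shape (field names are taken from the documentation above; `summarize_task` is our own hypothetical helper, not part of the service):

```python
def summarize_task(result: dict) -> str:
    """Turn a /api/v1/tasks/{task_id} response dict into a one-line summary."""
    status = result.get("status")
    if status == "completed":
        data = result.get("data")
        if data:  # content is returned inline once the task completes
            return f"completed: {len(data['content'])} chars of Markdown"
        # data is null once the result files have been cleaned up
        return f"completed, but results were cleaned up: {result.get('message', '')}"
    if status == "failed":
        return f"failed: {result.get('error_message', 'unknown error')}"
    return f"still {status}"

assert summarize_task({"status": "pending"}) == "still pending"
assert summarize_task(
    {"status": "completed", "data": {"markdown_file": "a.md", "content": "# Hi"}}
) == "completed: 4 chars of Markdown"
```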

### 3. Queue Statistics
```http
GET /api/v1/queue/stats

Returns: task counts per status
```

### 4. Cancel a Task
```http
DELETE /api/v1/tasks/{task_id}

Only tasks in the pending state can be cancelled
```

### 5. Admin Endpoints

**Reset stale tasks**
```http
POST /api/v1/admin/reset-stale?timeout_minutes=60

Resets timed-out processing tasks back to pending
```

**Clean up old tasks**
```http
POST /api/v1/admin/cleanup?days=7

Manual cleanup trigger only (automatic cleanup runs every 24 hours)
```
## 🔧 故障排查
|
||||
|
||||
### 问题1: Worker 无法启动
|
||||
|
||||
**检查GPU**
|
||||
```bash
|
||||
nvidia-smi # 应显示GPU信息
|
||||
```
|
||||
|
||||
**检查依赖**
|
||||
```bash
|
||||
pip list | grep -E "(mineru|litserve|torch)"
|
||||
```
|
||||
|
||||
### 问题2: 任务一直 pending
|
||||
|
||||
> ⚠️ **重要**: Worker 现在是主动拉取模式,不需要调度器触发!
|
||||
|
||||
**检查 Worker 是否运行**
|
||||
```bash
|
||||
# Windows
|
||||
tasklist | findstr python
|
||||
|
||||
# Linux/Mac
|
||||
ps aux | grep litserve_worker
|
||||
```
|
||||
|
||||
**检查 Worker 健康状态**
|
||||
```bash
|
||||
curl -X POST http://localhost:9000/predict \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"action":"health"}'
|
||||
```
|
||||
|
||||
**查看数据库状态**
|
||||
```bash
|
||||
python -c "from task_db import TaskDB; db = TaskDB(); print(db.get_queue_stats())"
|
||||
```
|
||||
|
||||
### 问题3: 显存不足或多卡占用
|
||||
|
||||
**减少worker数量**
|
||||
```bash
|
||||
python start_all.py --workers-per-device 1
|
||||
```
|
||||
|
||||
**设置显存限制**
|
||||
```bash
|
||||
export MINERU_VIRTUAL_VRAM_SIZE=6
|
||||
python start_all.py
|
||||
```
|
||||
|
||||
**指定特定GPU**
|
||||
```bash
|
||||
# 只使用GPU 0
|
||||
python start_all.py --devices 0
|
||||
```
|
||||
|
||||
> 💡 **提示**: 新版本已修复多卡显存占用问题,通过设置 `CUDA_VISIBLE_DEVICES` 确保每个进程只使用分配的GPU
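
The isolation the tip describes boils down to pinning the environment variable before any CUDA library initializes in the worker process. A minimal sketch (hypothetical helper, not the project's actual startup code):

```python
import os


def isolate_gpu(device_id: int) -> None:
    """Pin this process to one physical GPU.

    Must be called before torch (or any CUDA library) initializes;
    afterwards the process sees a single device, addressed as cuda:0.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_id)


# e.g. worker N gets physical GPU N
isolate_gpu(0)
```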

### Issue 4: Port already in use

**Find the occupying process**

```bash
# Windows
netstat -ano | findstr :8000

# Linux/Mac
lsof -i :8000
```

**Use different ports**

```bash
python start_all.py --api-port 8080 --worker-port 9090
```

### Issue 5: Result files missing

**Query the task status**

```bash
curl http://localhost:8000/api/v1/tasks/{task_id}
```

**Explanation**: if the response says `result files have been cleaned up`, the result files were removed by the retention policy (7 days by default)

**Solutions**:

```bash
# Extend retention to 30 days
python start_all.py --cleanup-old-files-days 30

# Or disable automatic cleanup
python start_all.py --cleanup-old-files-days 0
```

### Issue 6: A task processed more than once

**Symptom**: the same task is picked up by multiple workers

**Cause**: this should not happen; the database uses atomic operations to prevent duplicate claims

**How to check**:

```bash
# Check whether multiple TaskDB instances point at different database files
# Make sure every component uses the same mineru_tianshu.db
```

## 🛠️ Tech Stack

- **Web**: FastAPI + Uvicorn
- **Parsers**: MinerU (PDF/images) + MarkItDown (Office/text/HTML/...)
- **GPU scheduling**: LitServe (automatic load balancing)
- **Storage**: SQLite (concurrency-safe) + MinIO (optional)
- **Logging**: Loguru
- **Concurrency model**: worker-initiated pulls + atomic operations

## 🆕 Release Notes

### Major improvements in v2.0

**1. Worker pull mode**
- ✅ Workers poll for tasks in a continuous loop; no scheduler trigger needed
- ✅ 0.5 s default poll interval for near-instant pickup
- ✅ Workers sleep while idle, consuming no CPU
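
The pull loop can be sketched as follows (illustrative only — `claim_next_task` and `process` stand in for the real TaskDB claim and parsing calls, and `max_idle_polls` exists only so the sketch can terminate):

```python
import time


def worker_loop(claim_next_task, process, poll_interval=0.5, max_idle_polls=None):
    """Continuously pull tasks; sleep briefly when the queue is empty.

    claim_next_task() atomically takes one pending task (or returns None);
    process(task) runs the actual parsing work.
    """
    idle = 0
    while True:
        task = claim_next_task()
        if task is None:
            idle += 1
            if max_idle_polls is not None and idle >= max_idle_polls:
                return
            time.sleep(poll_interval)  # idle: back off instead of busy-waiting
            continue
        idle = 0
        process(task)
```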

**2. Stronger database concurrency safety**
- ✅ Uses `BEGIN IMMEDIATE` and atomic operations
- ✅ Prevents duplicate task processing
- ✅ Supports concurrent pulls from multiple workers
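
A minimal sketch of the atomic claim this relies on (schema simplified — the real task table has more columns): with the connection in autocommit mode, `BEGIN IMMEDIATE` takes SQLite's write lock before the read, so two workers can never flip the same pending row.

```python
import sqlite3


def connect(path: str) -> sqlite3.Connection:
    # isolation_level=None → autocommit mode, so the explicit
    # BEGIN IMMEDIATE below is the only transaction in flight.
    return sqlite3.connect(path, isolation_level=None)


def claim_next_task(conn: sqlite3.Connection, worker_id: str):
    """Atomically flip one pending task to processing and return its id."""
    # The write lock is acquired up front, so the SELECT + UPDATE
    # pair runs as one unit across concurrent workers.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT task_id FROM tasks WHERE status = 'pending' "
            "ORDER BY priority DESC, created_at LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE tasks SET status = 'processing', worker_id = ? "
            "WHERE task_id = ? AND status = 'pending'",
            (worker_id, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None
    finally:
        conn.execute("COMMIT")
```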

**3. Scheduler now optional**
- ✅ No longer a required component; workers run on their own
- ✅ Used only for system monitoring and health checks
- ✅ Not started by default, reducing system overhead

**4. Result file cleanup**
- ✅ Old result files are cleaned up automatically (7 days by default)
- ✅ Database records are kept for querying
- ✅ Retention period is configurable, and cleanup can be disabled

**5. API returns content automatically**
- ✅ The status endpoint returns the Markdown content directly
- ✅ No extra request needed to fetch results
- ✅ Optional image upload to MinIO

**6. Multi-GPU VRAM optimization**
- ✅ Fixes the multi-GPU VRAM occupancy issue
- ✅ Each process uses only its assigned GPU
- ✅ Isolation via `CUDA_VISIBLE_DEVICES`

### Migration Guide (v1.x → v2.0)

**No code changes required**; just note:
1. The scheduler is now optional; everything works without starting it
2. Result files are cleaned up after 7 days by default; pass `--cleanup-old-files-days 0` to keep them
3. The status endpoint now returns a `data` field containing the full content

### Performance Gains

| Metric | v1.x | v2.0 | Improvement |
|-----|------|------|-----|
| Task pickup latency<sup>※</sup> | 5-10 s (scheduler polling) | 0.5 s (worker pull) | **10-20×** |
| Concurrency safety | basic locking | atomic operations + status checks | **more reliable** |
| Multi-GPU efficiency | occasional VRAM conflicts | full isolation, no conflicts | **more stable** |
| System overhead | scheduler always running | optional monitoring (every 5 min) | **resources saved** |

※ Task pickup latency is the interval between a task being added and a worker starting to process it. In v1.x it was dominated by the scheduler's polling interval; this is not an end-to-end measurement. Actual end-to-end response time also depends on factors such as task type and system load.

## 📝 Core Dependencies

```txt
mineru[core]>=2.5.0   # MinerU core
fastapi>=0.115.0      # web framework
litserve>=0.2.0       # GPU load balancing
markitdown>=0.1.3     # Office document parsing
minio>=7.2.0          # MinIO object storage
```

## 🤝 Contributing

Issues and pull requests are welcome!

## 📄 License

Follows the MinerU main project license

---

**Tianshu (天枢)** - enterprise-grade multi-GPU document parsing service ⚡️

*The first star of the Big Dipper, symbolizing core scheduling capability*

@@ -1,751 +0,0 @@
"""
MinerU Tianshu - API Server

Provides a RESTful API for task submission, querying, and management
"""
from fastapi import FastAPI, UploadFile, File, Form, HTTPException, Query
from fastapi.responses import JSONResponse
from fastapi.middleware.cors import CORSMiddleware
import tempfile
from pathlib import Path
from loguru import logger
import uvicorn
from typing import Optional
from datetime import datetime
import os
import re
import uuid
import json
from minio import Minio

from task_db import TaskDB

# Initialize the FastAPI app
app = FastAPI(
    title="MinerU Tianshu API",
    description="Tianshu - enterprise-grade multi-GPU document parsing service",
    version="1.0.0"
)

# Add CORS middleware
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize the database
db = TaskDB()

# Output directory
OUTPUT_DIR = Path('/tmp/mineru_tianshu_output')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# MinIO configuration
MINIO_CONFIG = {
    'endpoint': os.getenv('MINIO_ENDPOINT', ''),
    'access_key': os.getenv('MINIO_ACCESS_KEY', ''),
    'secret_key': os.getenv('MINIO_SECRET_KEY', ''),
    'secure': True,
    'bucket_name': os.getenv('MINIO_BUCKET', '')
}


def get_minio_client():
    """Return a MinIO client instance"""
    return Minio(
        endpoint=MINIO_CONFIG['endpoint'],
        access_key=MINIO_CONFIG['access_key'],
        secret_key=MINIO_CONFIG['secret_key'],
        secure=MINIO_CONFIG['secure']
    )


def process_markdown_images(md_content: str, image_dir: Path, upload_images: bool = False):
    """
    Process image references in Markdown content

    Args:
        md_content: Markdown content
        image_dir: directory containing the images
        upload_images: whether to upload images to MinIO and rewrite links

    Returns:
        The processed Markdown content
    """
    if not upload_images:
        return md_content

    try:
        minio_client = get_minio_client()
        bucket_name = MINIO_CONFIG['bucket_name']
        minio_endpoint = MINIO_CONFIG['endpoint']

        # Match all markdown-style image references
        img_pattern = r'!\[([^\]]*)\]\(([^)]+)\)'

        def replace_image(match):
            alt_text = match.group(1)
            image_path = match.group(2)

            # Build the full local image path
            full_image_path = image_dir / Path(image_path).name

            if full_image_path.exists():
                # Use a UUID as the new file name, keeping the extension
                file_extension = full_image_path.suffix
                new_filename = f"{uuid.uuid4()}{file_extension}"

                try:
                    # Upload to MinIO
                    object_name = f"images/{new_filename}"
                    minio_client.fput_object(bucket_name=bucket_name, object_name=object_name, file_path=str(full_image_path))

                    # Build the MinIO access URL
                    scheme = 'https' if MINIO_CONFIG['secure'] else 'http'
                    minio_url = f"{scheme}://{minio_endpoint}/{bucket_name}/{object_name}"

                    # Return an HTML img tag
                    return f'<img src="{minio_url}" alt="{alt_text}">'
                except Exception as e:
                    logger.error(f"Failed to upload image to MinIO: {e}")
                    return match.group(0)  # upload failed; keep the original reference

            return match.group(0)

        # Replace all image references
        new_content = re.sub(img_pattern, replace_image, md_content)
        return new_content

    except Exception as e:
        logger.error(f"Error processing markdown images: {e}")
        return md_content  # on error, return the original content


def read_json_file(file_path: Path):
    """
    Read a JSON file

    Args:
        file_path: path to the JSON file

    Returns:
        The parsed JSON data, or None on failure
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except Exception as e:
        logger.error(f"Failed to read JSON file {file_path}: {e}")
        return None


def get_file_metadata(file_path: Path):
    """
    Get file metadata

    Args:
        file_path: file path

    Returns:
        A dict of file metadata
    """
    if not file_path.exists():
        return None

    stat = file_path.stat()
    return {
        'size': stat.st_size,
        'created_at': datetime.fromtimestamp(stat.st_ctime).isoformat(),
        'modified_at': datetime.fromtimestamp(stat.st_mtime).isoformat()
    }


def get_images_info(image_dir: Path, upload_to_minio: bool = False):
    """
    Get information about an image directory

    Args:
        image_dir: image directory path
        upload_to_minio: whether to upload the images to MinIO

    Returns:
        A dict describing the images
    """
    if not image_dir.exists() or not image_dir.is_dir():
        return {
            'count': 0,
            'list': [],
            'uploaded_to_minio': False
        }

    # Supported image formats
    image_extensions = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp', '.svg'}
    image_files = [f for f in image_dir.iterdir() if f.is_file() and f.suffix.lower() in image_extensions]

    images_list = []

    for img_file in sorted(image_files):
        img_info = {
            'name': img_file.name,
            'size': img_file.stat().st_size,
            'path': str(img_file.relative_to(image_dir.parent))
        }

        # Upload to MinIO if requested
        if upload_to_minio:
            try:
                minio_client = get_minio_client()
                bucket_name = MINIO_CONFIG['bucket_name']
                minio_endpoint = MINIO_CONFIG['endpoint']

                # Use a UUID as the new file name, keeping the extension
                file_extension = img_file.suffix
                new_filename = f"{uuid.uuid4()}{file_extension}"
                object_name = f"images/{new_filename}"

                # Upload to MinIO
                minio_client.fput_object(bucket_name=bucket_name, object_name=object_name, file_path=str(img_file))

                # Build the access URL
                scheme = 'https' if MINIO_CONFIG['secure'] else 'http'
                img_info['url'] = f"{scheme}://{minio_endpoint}/{bucket_name}/{object_name}"

            except Exception as e:
                logger.error(f"Failed to upload image {img_file.name} to MinIO: {e}")
                img_info['url'] = None

        images_list.append(img_info)

    return {
        'count': len(images_list),
        'list': images_list,
        'uploaded_to_minio': upload_to_minio
    }


@app.get("/")
async def root():
    """API root"""
    return {
        "service": "MinerU Tianshu",
        "version": "1.0.0",
        "description": "Tianshu - enterprise-grade multi-GPU document parsing service",
        "docs": "/docs"
    }


@app.post("/api/v1/tasks/submit")
async def submit_task(
    file: UploadFile = File(..., description="Document file: PDF/images (parsed by MinerU) or Office/HTML/text etc. (parsed by MarkItDown)"),
    backend: str = Form('pipeline', description="Processing backend: pipeline/vlm-transformers/vlm-vllm-engine"),
    lang: str = Form('ch', description="Language: ch/en/korean/japan/..."),
    method: str = Form('auto', description="Parsing method: auto/txt/ocr"),
    formula_enable: bool = Form(True, description="Enable formula recognition"),
    table_enable: bool = Form(True, description="Enable table recognition"),
    priority: int = Form(0, description="Priority; higher values are processed first"),
):
    """
    Submit a document parsing task

    Returns a task_id immediately; the task is processed asynchronously in the background
    """
    try:
        # Save the uploaded file to a temporary location
        temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=Path(file.filename).suffix)

        # Stream the file to disk in chunks to keep memory usage low
        while True:
            chunk = await file.read(1 << 23)  # 8MB chunks
            if not chunk:
                break
            temp_file.write(chunk)

        temp_file.close()

        # Create the task
        task_id = db.create_task(
            file_name=file.filename,
            file_path=temp_file.name,
            backend=backend,
            options={
                'lang': lang,
                'method': method,
                'formula_enable': formula_enable,
                'table_enable': table_enable,
            },
            priority=priority
        )

        logger.info(f"✅ Task submitted: {task_id} - {file.filename} (priority: {priority})")

        return {
            'success': True,
            'task_id': task_id,
            'status': 'pending',
            'message': 'Task submitted successfully',
            'file_name': file.filename,
            'created_at': datetime.now().isoformat()
        }

    except Exception as e:
        logger.error(f"❌ Failed to submit task: {e}")
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/api/v1/tasks/{task_id}/data")
async def get_task_data(
    task_id: str,
    include_fields: str = Query(
        "md,content_list,middle_json,model_output,images",
        description="Comma-separated fields to return: md,content_list,middle_json,model_output,images,layout_pdf,span_pdf,origin_pdf"
    ),
    upload_images: bool = Query(False, description="Upload images to MinIO and return URLs"),
    include_metadata: bool = Query(True, description="Include file metadata")
):
    """
    Fetch a task's parsed data on demand

    Supports flexible access to MinerU output, including:
    - Markdown content
    - Content List JSON (structured content list)
    - Middle JSON (intermediate results)
    - Model Output JSON (raw model output)
    - Image list
    - Auxiliary files (layout PDF, span PDF, origin PDF)

    Use the include_fields parameter to select which fields to return
    """
    # Look up the task
    task = db.get_task(task_id)

    if not task:
        raise HTTPException(status_code=404, detail="Task not found")

    # Build the base response
    response = {
        'success': True,
        'task_id': task_id,
        'status': task['status'],
        'file_name': task['file_name'],
        'backend': task['backend'],
        'created_at': task['created_at'],
        'completed_at': task['completed_at']
    }

    # If the task is not finished, return the status as-is
    if task['status'] != 'completed':
        response['message'] = f"Task is in {task['status']} status, data not available yet"
        return response

    # Check the result path
    if not task['result_path']:
        response['message'] = 'Task completed but result files have been cleaned up (older than retention period)'
        return response

    result_dir = Path(task['result_path'])
    if not result_dir.exists():
        response['message'] = 'Result directory does not exist'
        return response

    # Parse the requested fields
    fields = [f.strip() for f in include_fields.split(',')]

    # Initialize the data field
    response['data'] = {}  # type: ignore

    logger.info(f"📦 Getting complete data for task {task_id}, fields: {fields}")

    # Find files (recursive search; MinerU output layout: task_id/filename/auto/*.md)
    try:
        # 1. Markdown file
        if 'md' in fields:
            md_files = list(result_dir.rglob('*.md'))
            # Exclude md files with special suffixes
            md_files = [f for f in md_files if not any(f.stem.endswith(suffix) for suffix in ['_layout', '_span', '_origin'])]

            if md_files:
                md_file = md_files[0]
                logger.info(f"📄 Reading markdown file: {md_file}")

                with open(md_file, 'r', encoding='utf-8') as f:
                    md_content = f.read()

                # Process images (upload if requested)
                image_dir = md_file.parent / 'images'
                if upload_images and image_dir.exists():
                    md_content = process_markdown_images(md_content, image_dir, upload_images)

                response['data']['markdown'] = {
                    'content': md_content,
                    'file_name': md_file.name
                }

                if include_metadata:
                    metadata = get_file_metadata(md_file)
                    if metadata:
                        response['data']['markdown']['metadata'] = metadata

        # 2. Content List JSON
        if 'content_list' in fields:
            content_list_files = list(result_dir.rglob('*_content_list.json'))
            if content_list_files:
                content_list_file = content_list_files[0]
                logger.info(f"📄 Reading content list file: {content_list_file}")

                content_data = read_json_file(content_list_file)
                if content_data is not None:
                    response['data']['content_list'] = {
                        'content': content_data,
                        'file_name': content_list_file.name
                    }

                    if include_metadata:
                        metadata = get_file_metadata(content_list_file)
                        if metadata:
                            response['data']['content_list']['metadata'] = metadata

        # 3. Middle JSON
        if 'middle_json' in fields:
            middle_json_files = list(result_dir.rglob('*_middle.json'))
            if middle_json_files:
                middle_json_file = middle_json_files[0]
                logger.info(f"📄 Reading middle json file: {middle_json_file}")

                middle_data = read_json_file(middle_json_file)
                if middle_data is not None:
                    response['data']['middle_json'] = {
                        'content': middle_data,
                        'file_name': middle_json_file.name
                    }

                    if include_metadata:
                        metadata = get_file_metadata(middle_json_file)
                        if metadata:
                            response['data']['middle_json']['metadata'] = metadata

        # 4. Model Output JSON
        if 'model_output' in fields:
            model_output_files = list(result_dir.rglob('*_model.json'))
            if model_output_files:
                model_output_file = model_output_files[0]
                logger.info(f"📄 Reading model output file: {model_output_file}")

                model_data = read_json_file(model_output_file)
                if model_data is not None:
                    response['data']['model_output'] = {
                        'content': model_data,
                        'file_name': model_output_file.name
                    }

                    if include_metadata:
                        metadata = get_file_metadata(model_output_file)
                        if metadata:
                            response['data']['model_output']['metadata'] = metadata

        # 5. Images
        if 'images' in fields:
            image_dirs = list(result_dir.rglob('images'))
            if image_dirs:
                image_dir = image_dirs[0]
                logger.info(f"🖼️ Getting images info from: {image_dir}")

                images_info = get_images_info(image_dir, upload_images)
                response['data']['images'] = images_info

        # 6. Layout PDF
        if 'layout_pdf' in fields:
            layout_pdf_files = list(result_dir.rglob('*_layout.pdf'))
            if layout_pdf_files:
                layout_pdf_file = layout_pdf_files[0]
                response['data']['layout_pdf'] = {
                    'file_name': layout_pdf_file.name,
                    'path': str(layout_pdf_file.relative_to(result_dir))
                }

                if include_metadata:
                    metadata = get_file_metadata(layout_pdf_file)
                    if metadata:
                        response['data']['layout_pdf']['metadata'] = metadata

        # 7. Span PDF
        if 'span_pdf' in fields:
            span_pdf_files = list(result_dir.rglob('*_span.pdf'))
            if span_pdf_files:
                span_pdf_file = span_pdf_files[0]
                response['data']['span_pdf'] = {
                    'file_name': span_pdf_file.name,
                    'path': str(span_pdf_file.relative_to(result_dir))
                }

                if include_metadata:
                    metadata = get_file_metadata(span_pdf_file)
                    if metadata:
                        response['data']['span_pdf']['metadata'] = metadata

        # 8. Origin PDF
        if 'origin_pdf' in fields:
            origin_pdf_files = list(result_dir.rglob('*_origin.pdf'))
            if origin_pdf_files:
                origin_pdf_file = origin_pdf_files[0]
                response['data']['origin_pdf'] = {
                    'file_name': origin_pdf_file.name,
                    'path': str(origin_pdf_file.relative_to(result_dir))
                }

                if include_metadata:
                    metadata = get_file_metadata(origin_pdf_file)
                    if metadata:
                        response['data']['origin_pdf']['metadata'] = metadata

        logger.info(f"✅ Complete data retrieved successfully for task {task_id}")

    except Exception as e:
        logger.error(f"❌ Failed to get complete data for task {task_id}: {e}")
        logger.exception(e)
        raise HTTPException(status_code=500, detail=f"Internal server error: {e}")

    return response


@app.get("/api/v1/tasks/{task_id}")
async def get_task_status(
    task_id: str,
    upload_images: bool = Query(False, description="Upload images to MinIO and rewrite links (only effective when the task is completed)")
):
    """
    Query task status and details

    When the task is completed, the parsed Markdown content is returned automatically (data field).
    Optionally uploads images to MinIO and rewrites the links to URLs.
    """
    task = db.get_task(task_id)

    if not task:
        raise HTTPException(status_code=404, detail="Task not found")

    response = {
        'success': True,
        'task_id': task_id,
        'status': task['status'],
        'file_name': task['file_name'],
        'backend': task['backend'],
        'priority': task['priority'],
        'error_message': task['error_message'],
        'created_at': task['created_at'],
        'started_at': task['started_at'],
        'completed_at': task['completed_at'],
        'worker_id': task['worker_id'],
        'retry_count': task['retry_count']
    }
    logger.info(f"✅ Task status: {task['status']} - (result_path: {task['result_path']})")

    # If the task is completed, try to return the parsed content
    if task['status'] == 'completed':
        if not task['result_path']:
            # Result files have been cleaned up
            response['data'] = None
            response['message'] = 'Task completed but result files have been cleaned up (older than retention period)'
            return response

        result_dir = Path(task['result_path'])
        logger.info(f"📂 Checking result directory: {result_dir}")

        if result_dir.exists():
            logger.info(f"✅ Result directory exists")
            # Recursively find Markdown files (MinerU output layout: task_id/filename/auto/*.md)
            md_files = list(result_dir.rglob('*.md'))
            logger.info(f"📄 Found {len(md_files)} markdown files: {[f.relative_to(result_dir) for f in md_files]}")

            if md_files:
                try:
                    # Read the Markdown content
                    md_file = md_files[0]
                    logger.info(f"📖 Reading markdown file: {md_file}")
                    with open(md_file, 'r', encoding='utf-8') as f:
                        md_content = f.read()

                    logger.info(f"✅ Markdown content loaded, length: {len(md_content)} characters")

                    # Find the image directory (sibling of the markdown file)
                    image_dir = md_file.parent / 'images'

                    # Process images if requested
                    if upload_images and image_dir.exists():
                        logger.info(f"🖼️ Processing images for task {task_id}, upload_images={upload_images}")
                        md_content = process_markdown_images(md_content, image_dir, upload_images)

                    # Attach the data field
                    response['data'] = {
                        'markdown_file': md_file.name,
                        'content': md_content,
                        'images_uploaded': upload_images,
                        'has_images': image_dir.exists() if not upload_images else None
                    }
                    logger.info(f"✅ Response data field added successfully")

                except Exception as e:
                    logger.error(f"❌ Failed to read markdown content: {e}")
                    logger.exception(e)
                    # A read failure should not break the status query; just omit data
                    response['data'] = None
            else:
                logger.warning(f"⚠️ No markdown files found in {result_dir}")
        else:
            logger.error(f"❌ Result directory does not exist: {result_dir}")
    else:
        logger.info(f"ℹ️ Task status is {task['status']}, skipping content loading")

    return response


@app.delete("/api/v1/tasks/{task_id}")
async def cancel_task(task_id: str):
    """
    Cancel a task (pending status only)
    """
    task = db.get_task(task_id)

    if not task:
        raise HTTPException(status_code=404, detail="Task not found")

    if task['status'] == 'pending':
        db.update_task_status(task_id, 'cancelled')

        # Remove the temporary file
        file_path = Path(task['file_path'])
        if file_path.exists():
            file_path.unlink()

        logger.info(f"⏹️ Task cancelled: {task_id}")
        return {
            'success': True,
            'message': 'Task cancelled successfully'
        }
    else:
        raise HTTPException(
            status_code=400,
            detail=f"Cannot cancel task in {task['status']} status"
        )


@app.get("/api/v1/queue/stats")
async def get_queue_stats():
    """
    Get queue statistics
    """
    stats = db.get_queue_stats()

    return {
        'success': True,
        'stats': stats,
        'total': sum(stats.values()),
        'timestamp': datetime.now().isoformat()
    }


@app.get("/api/v1/queue/tasks")
async def list_tasks(
    status: Optional[str] = Query(None, description="Filter by status: pending/processing/completed/failed"),
    limit: int = Query(100, description="Maximum number of results", le=1000)
):
    """
    List tasks
    """
    if status:
        tasks = db.get_tasks_by_status(status, limit)
    else:
        # Return all tasks (requires the cursor helper on TaskDB)
        with db.get_cursor() as cursor:
            cursor.execute('''
                SELECT * FROM tasks
                ORDER BY created_at DESC
                LIMIT ?
            ''', (limit,))
            tasks = [dict(row) for row in cursor.fetchall()]

    return {
        'success': True,
        'count': len(tasks),
        'tasks': tasks
    }


@app.post("/api/v1/admin/cleanup")
async def cleanup_old_tasks(days: int = Query(7, description="Delete tasks older than N days")):
    """
    Clean up old task records (admin endpoint)
    """
    deleted_count = db.cleanup_old_tasks(days)

    logger.info(f"🧹 Cleaned up {deleted_count} old tasks")

    return {
        'success': True,
        'deleted_count': deleted_count,
        'message': f'Cleaned up tasks older than {days} days'
    }


@app.post("/api/v1/admin/reset-stale")
async def reset_stale_tasks(timeout_minutes: int = Query(60, description="Timeout in minutes")):
    """
    Reset timed-out processing tasks (admin endpoint)
    """
    reset_count = db.reset_stale_tasks(timeout_minutes)

    logger.info(f"🔄 Reset {reset_count} stale tasks")

    return {
        'success': True,
        'reset_count': reset_count,
        'message': f'Reset tasks processing for more than {timeout_minutes} minutes'
    }


@app.get("/api/v1/health")
async def health_check():
    """
    Health check endpoint
    """
    try:
        # Check the database connection
        stats = db.get_queue_stats()

        return {
            'status': 'healthy',
            'timestamp': datetime.now().isoformat(),
            'database': 'connected',
            'queue_stats': stats
        }
    except Exception as e:
        logger.error(f"Health check failed: {e}")
        return JSONResponse(
            status_code=503,
            content={
                'status': 'unhealthy',
                'error': str(e)
            }
        )


if __name__ == '__main__':
    # Read the port from the environment; default 8000
    api_port = int(os.getenv('API_PORT', '8000'))

    logger.info("🚀 Starting MinerU Tianshu API Server...")
    logger.info(f"📖 API Documentation: http://localhost:{api_port}/docs")

    uvicorn.run(
        app,
        host='0.0.0.0',
        port=api_port,
        log_level='info'
    )

@@ -1,318 +0,0 @@
"""
MinerU Tianshu - Client Example

Demonstrates how to submit tasks and query their status from a Python client
"""
import asyncio
import aiohttp
from pathlib import Path
from loguru import logger
import time
from typing import Dict


class TianshuClient:
    """Tianshu client"""

    def __init__(self, api_url='http://localhost:8000'):
        self.api_url = api_url
        self.base_url = f"{api_url}/api/v1"

    async def submit_task(
        self,
        session: aiohttp.ClientSession,
        file_path: str,
        backend: str = 'pipeline',
        lang: str = 'ch',
        method: str = 'auto',
        formula_enable: bool = True,
        table_enable: bool = True,
        priority: int = 0
    ) -> Dict:
        """
        Submit a task

        Args:
            session: aiohttp session
            file_path: path to the file
            backend: processing backend
            lang: language
            method: parsing method
            formula_enable: enable formula recognition
            table_enable: enable table recognition
            priority: priority

        Returns:
            Response dict containing the task_id
        """
        with open(file_path, 'rb') as f:
            data = aiohttp.FormData()
            data.add_field('file', f, filename=Path(file_path).name)
            data.add_field('backend', backend)
            data.add_field('lang', lang)
            data.add_field('method', method)
            data.add_field('formula_enable', str(formula_enable).lower())
            data.add_field('table_enable', str(table_enable).lower())
            data.add_field('priority', str(priority))

            async with session.post(f'{self.base_url}/tasks/submit', data=data) as resp:
                if resp.status == 200:
                    result = await resp.json()
                    logger.info(f"✅ Submitted: {file_path} -> Task ID: {result['task_id']}")
                    return result
                else:
                    error = await resp.text()
                    logger.error(f"❌ Failed to submit {file_path}: {error}")
                    return {'success': False, 'error': error}

    async def get_task_status(self, session: aiohttp.ClientSession, task_id: str) -> Dict:
        """
        Query task status

        Args:
            session: aiohttp session
            task_id: task ID

        Returns:
            Task status dict
        """
        async with session.get(f'{self.base_url}/tasks/{task_id}') as resp:
            if resp.status == 200:
                return await resp.json()
            else:
                return {'success': False, 'error': 'Task not found'}

    async def wait_for_task(
        self,
        session: aiohttp.ClientSession,
        task_id: str,
        timeout: int = 600,
        poll_interval: int = 2
    ) -> Dict:
        """
        Wait for a task to finish

        Args:
            session: aiohttp session
            task_id: task ID
            timeout: timeout in seconds
            poll_interval: polling interval in seconds

        Returns:
            Final task status
        """
        start_time = time.time()

        while True:
            status = await self.get_task_status(session, task_id)

            if not status.get('success'):
                logger.error(f"❌ Failed to get status for task {task_id}")
                return status

            task_status = status.get('status')

            if task_status == 'completed':
                logger.info(f"✅ Task {task_id} completed!")
                logger.info(f"   Output: {status.get('result_path')}")
                return status

            elif task_status == 'failed':
                logger.error(f"❌ Task {task_id} failed!")
                logger.error(f"   Error: {status.get('error_message')}")
                return status

            elif task_status == 'cancelled':
                logger.warning(f"⚠️ Task {task_id} was cancelled")
                return status

            # Check the timeout
            if time.time() - start_time > timeout:
                logger.error(f"⏱️ Task {task_id} timeout after {timeout}s")
                return {'success': False, 'error': 'timeout'}

            # Sleep, then poll again
            await asyncio.sleep(poll_interval)

    async def get_queue_stats(self, session: aiohttp.ClientSession) -> Dict:
        """Get queue statistics"""
        async with session.get(f'{self.base_url}/queue/stats') as resp:
            return await resp.json()

    async def cancel_task(self, session: aiohttp.ClientSession, task_id: str) -> Dict:
        """Cancel a task"""
        async with session.delete(f'{self.base_url}/tasks/{task_id}') as resp:
            return await resp.json()


async def example_single_task():
    """Example 1: submit a single task and wait for completion"""
    logger.info("=" * 60)
    logger.info("Example 1: submit a single task")
    logger.info("=" * 60)

    client = TianshuClient()

    async with aiohttp.ClientSession() as session:
        # Submit the task
        result = await client.submit_task(
            session,
            file_path='../../demo/pdfs/demo1.pdf',
            backend='pipeline',
            lang='ch',
            formula_enable=True,
            table_enable=True
        )

        if result.get('success'):
            task_id = result['task_id']

            # Wait for completion
            logger.info(f"⏳ Waiting for task {task_id} to complete...")
            final_status = await client.wait_for_task(session, task_id)

            return final_status


async def example_batch_tasks():
    """Example 2: submit multiple tasks and wait for them concurrently"""
    logger.info("=" * 60)
    logger.info("Example 2: batch-submit multiple tasks")
    logger.info("=" * 60)

    client = TianshuClient()

    # Prepare the task list
    files = [
        '../../demo/pdfs/demo1.pdf',
        '../../demo/pdfs/demo2.pdf',
        '../../demo/pdfs/demo3.pdf',
    ]

    async with aiohttp.ClientSession() as session:
        # Submit all tasks concurrently
        logger.info(f"📤 Submitting {len(files)} tasks...")
        submit_tasks = [
            client.submit_task(session, file)
            for file in files
        ]
        results = await asyncio.gather(*submit_tasks)

        # Collect the task_ids
        task_ids = [r['task_id'] for r in results if r.get('success')]
        logger.info(f"✅ Submitted {len(task_ids)} tasks successfully")

        # Wait for all tasks concurrently
        logger.info(f"⏳ Waiting for all tasks to complete...")
        wait_tasks = [
            client.wait_for_task(session, task_id)
            for task_id in task_ids
        ]
        final_results = await asyncio.gather(*wait_tasks)

        # Tally the results
        completed = sum(1 for r in final_results if r.get('status') == 'completed')
        failed = sum(1 for r in final_results if r.get('status') == 'failed')

        logger.info("=" * 60)
        logger.info(f"📊 Results: {completed} completed, {failed} failed")
        logger.info("=" * 60)

        return final_results


async def example_priority_tasks():
    """Example 3: using the priority queue"""
    logger.info("=" * 60)
    logger.info("Example 3: priority queue")
    logger.info("=" * 60)

    client = TianshuClient()

    async with aiohttp.ClientSession() as session:
        # Submit a low-priority task
        low_priority = await client.submit_task(
            session,
            file_path='../../demo/pdfs/demo1.pdf',
            priority=0
        )
        logger.info(f"📝 Low priority task: {low_priority['task_id']}")

        # Submit a high-priority task
        high_priority = await client.submit_task(
            session,
            file_path='../../demo/pdfs/demo2.pdf',
            priority=10
        )
        logger.info(f"🔥 High priority task: {high_priority['task_id']}")

        # The high-priority task is picked up first
        logger.info("⏳ The high-priority task will be processed first...")


async def example_queue_monitoring():
    """Example 4: monitor the queue"""
    logger.info("=" * 60)
    logger.info("Example 4: monitor the queue")
    logger.info("=" * 60)

    client = TianshuClient()
||||
async with aiohttp.ClientSession() as session:
|
||||
# 获取队列统计
|
||||
stats = await client.get_queue_stats(session)
|
||||
|
||||
logger.info("📊 Queue Statistics:")
|
||||
logger.info(f" Total: {stats.get('total', 0)}")
|
||||
for status, count in stats.get('stats', {}).items():
|
||||
logger.info(f" {status:12s}: {count}")
|
||||
|
||||
|
||||
async def main():
|
||||
"""主函数"""
|
||||
import sys
|
||||
|
||||
if len(sys.argv) > 1:
|
||||
example = sys.argv[1]
|
||||
else:
|
||||
example = 'all'
|
||||
|
||||
try:
|
||||
if example == 'single' or example == 'all':
|
||||
await example_single_task()
|
||||
print()
|
||||
|
||||
if example == 'batch' or example == 'all':
|
||||
await example_batch_tasks()
|
||||
print()
|
||||
|
||||
if example == 'priority' or example == 'all':
|
||||
await example_priority_tasks()
|
||||
print()
|
||||
|
||||
if example == 'monitor' or example == 'all':
|
||||
await example_queue_monitoring()
|
||||
print()
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Example failed: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
"""
|
||||
使用方法:
|
||||
|
||||
# 运行所有示例
|
||||
python client_example.py
|
||||
|
||||
# 运行特定示例
|
||||
python client_example.py single
|
||||
python client_example.py batch
|
||||
python client_example.py priority
|
||||
python client_example.py monitor
|
||||
"""
|
||||
asyncio.run(main())
|
||||
|
||||
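The polling loop in `wait_for_task` above generalizes to a small helper: poll a check function at a fixed interval until it yields a result or a deadline passes. This is a hedged, self-contained sketch of that pattern; `wait_until` and `demo` are illustrative names, not part of the Tianshu client.

```python
import asyncio
import time


async def wait_until(check, timeout=5.0, poll_interval=0.1):
    """Poll `check` until it returns a truthy result or `timeout` elapses,
    mirroring the fixed-interval loop in wait_for_task."""
    start = time.time()
    while True:
        result = await check()
        if result:
            return result
        if time.time() - start > timeout:
            return {'success': False, 'error': 'timeout'}
        await asyncio.sleep(poll_interval)


async def demo():
    # Simulated status endpoint: completes on the third poll
    state = {'n': 0}

    async def check():
        state['n'] += 1
        return {'status': 'completed'} if state['n'] >= 3 else None

    return await wait_until(check)


print(asyncio.run(demo()))  # {'status': 'completed'}
```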
@@ -1,546 +0,0 @@
"""
MinerU Tianshu - LitServe Worker

Uses LitServe for automatic GPU load balancing.
Each worker actively polls for tasks in a loop and processes them.
"""
import os
import json
import sys
import time
import threading
import signal
import atexit
from pathlib import Path
import litserve as ls
from loguru import logger

# Add the parent directory to the path so MinerU can be imported
sys.path.insert(0, str(Path(__file__).parent.parent.parent))

from task_db import TaskDB
from mineru.cli.common import do_parse, read_fn
from mineru.utils.config_reader import get_device
from mineru.utils.model_utils import get_vram, clean_memory

# Try to import markitdown
try:
    from markitdown import MarkItDown
    MARKITDOWN_AVAILABLE = True
except ImportError:
    MARKITDOWN_AVAILABLE = False
    logger.warning("⚠️ markitdown not available, Office format parsing will be disabled")


class MinerUWorkerAPI(ls.LitAPI):
    """
    LitServe API Worker

    Workers actively poll for tasks in a loop, relying on LitServe's
    automatic GPU load balancing. Two parsing paths are supported:
    - PDF/images -> MinerU (GPU accelerated)
    - every other format -> MarkItDown (fast path)

    New mode: each worker keeps polling after startup, picking up the next
    task as soon as the current one finishes.
    """

    # Supported file formats
    # Formats handled by MinerU: PDF and images
    PDF_IMAGE_FORMATS = {'.pdf', '.png', '.jpg', '.jpeg', '.bmp', '.tiff', '.tif', '.webp'}
    # Every other format is parsed with MarkItDown

    def __init__(self, output_dir='/tmp/mineru_tianshu_output', worker_id_prefix='tianshu',
                 poll_interval=0.5, enable_worker_loop=True):
        super().__init__()
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.worker_id_prefix = worker_id_prefix
        self.poll_interval = poll_interval  # interval between task polls (seconds)
        self.enable_worker_loop = enable_worker_loop  # whether to run the polling loop
        self.db = TaskDB()
        self.worker_id = None
        self.markitdown = None
        self.running = False  # worker run state
        self.worker_thread = None  # worker thread

    def setup(self, device):
        """
        Initialize the environment (called once per worker process).

        Key fix: use CUDA_VISIBLE_DEVICES so each process only sees its
        assigned GPU.

        Args:
            device: device assigned by LitServe (cuda:0, cuda:1, etc.)
        """
        # Generate a unique worker_id
        import socket
        hostname = socket.gethostname()
        pid = os.getpid()
        self.worker_id = f"{self.worker_id_prefix}-{hostname}-{device}-{pid}"

        logger.info(f"⚙️ Worker {self.worker_id} setting up on device: {device}")

        # Key fix: set CUDA_VISIBLE_DEVICES to restrict the process to its assigned GPU.
        # This prevents one process from allocating VRAM on multiple cards.
        if device != 'auto' and device != 'cpu' and ':' in str(device):
            # Extract device ID '0' from 'cuda:0'
            device_id = str(device).split(':')[-1]
            os.environ['CUDA_VISIBLE_DEVICES'] = device_id
            # Use cuda:0, because the process now sees only one card (logical ID 0)
            os.environ['MINERU_DEVICE_MODE'] = 'cuda:0'
            device_mode = os.environ['MINERU_DEVICE_MODE']
            logger.info(f"🔒 CUDA_VISIBLE_DEVICES={device_id} (Physical GPU {device_id} → Logical GPU 0)")
        else:
            # Configure the MinerU environment
            if os.getenv('MINERU_DEVICE_MODE', None) is None:
                os.environ['MINERU_DEVICE_MODE'] = device if device != 'auto' else get_device()
            device_mode = os.environ['MINERU_DEVICE_MODE']

        # Configure VRAM
        if os.getenv('MINERU_VIRTUAL_VRAM_SIZE', None) is None:
            if device_mode.startswith("cuda") or device_mode.startswith("npu"):
                try:
                    vram = get_vram(device_mode)
                    os.environ['MINERU_VIRTUAL_VRAM_SIZE'] = str(vram)
                except Exception:
                    os.environ['MINERU_VIRTUAL_VRAM_SIZE'] = '8'  # fallback default
            else:
                os.environ['MINERU_VIRTUAL_VRAM_SIZE'] = '1'

        # Initialize MarkItDown (if available)
        if MARKITDOWN_AVAILABLE:
            self.markitdown = MarkItDown()
            logger.info("✅ MarkItDown initialized for Office format parsing")

        logger.info(f"✅ Worker {self.worker_id} ready")
        logger.info(f"   Device: {device_mode}")
        logger.info(f"   VRAM: {os.environ['MINERU_VIRTUAL_VRAM_SIZE']}GB")

        # Start the task-polling loop in a dedicated thread
        if self.enable_worker_loop:
            self.running = True
            self.worker_thread = threading.Thread(
                target=self._worker_loop,
                daemon=True,
                name=f"Worker-{self.worker_id}"
            )
            self.worker_thread.start()
            logger.info(f"🔄 Worker loop started (poll_interval={self.poll_interval}s)")

    def teardown(self):
        """
        Shut the worker down gracefully.

        Sets the running flag to False and waits for the worker thread to
        finish its current task before exiting. This avoids the incomplete
        task processing or inconsistent database writes that a daemon
        thread could otherwise cause.
        """
        if self.enable_worker_loop and self.worker_thread and self.worker_thread.is_alive():
            logger.info(f"🛑 Shutting down worker {self.worker_id}...")
            self.running = False

            # Wait for the thread to finish its current task (at most poll_interval * 2 seconds)
            timeout = self.poll_interval * 2
            self.worker_thread.join(timeout=timeout)

            if self.worker_thread.is_alive():
                logger.warning(f"⚠️ Worker thread did not stop within {timeout}s, forcing exit")
            else:
                logger.info(f"✅ Worker {self.worker_id} shut down gracefully")

    def _worker_loop(self):
        """
        Main worker loop: continuously pull and process tasks.

        Runs in a dedicated thread so each worker actively polls for work
        instead of passively waiting for a scheduler trigger.
        """
        logger.info(f"🔁 {self.worker_id} started task polling loop")

        idle_count = 0
        while self.running:
            try:
                # Fetch a task from the database
                task = self.db.get_next_task(self.worker_id)

                if task:
                    idle_count = 0  # reset the idle counter

                    # Process the task
                    task_id = task['task_id']
                    logger.info(f"🔄 {self.worker_id} picked up task {task_id}")

                    try:
                        self._process_task(task)
                    except Exception as e:
                        logger.error(f"❌ {self.worker_id} failed to process task {task_id}: {e}")
                        success = self.db.update_task_status(
                            task_id, 'failed',
                            error_message=str(e),
                            worker_id=self.worker_id
                        )
                        if not success:
                            logger.warning(f"⚠️ Task {task_id} was modified by another process during failure update")

                else:
                    # No task available; bump the idle counter
                    idle_count += 1

                    # Log only on the first idle iteration to avoid log spam
                    if idle_count == 1:
                        logger.debug(f"💤 {self.worker_id} is idle, waiting for tasks...")

                    # Sleep before polling again
                    time.sleep(self.poll_interval)

            except Exception as e:
                logger.error(f"❌ {self.worker_id} loop error: {e}")
                time.sleep(self.poll_interval)

        logger.info(f"⏹️ {self.worker_id} stopped task polling loop")

    def _process_task(self, task: dict):
        """
        Process a single task.

        Args:
            task: task dict
        """
        task_id = task['task_id']
        file_path = task['file_path']
        file_name = task['file_name']
        backend = task['backend']
        options = json.loads(task['options'])

        logger.info(f"🔄 Processing task {task_id}: {file_name}")

        try:
            # Prepare the output directory
            output_path = self.output_dir / task_id
            output_path.mkdir(parents=True, exist_ok=True)

            # Choose a parser based on the file type
            file_type = self._get_file_type(file_path)

            if file_type == 'pdf_image':
                # Parse PDFs and images with MinerU
                self._parse_with_mineru(
                    file_path=Path(file_path),
                    file_name=file_name,
                    task_id=task_id,
                    backend=backend,
                    options=options,
                    output_path=output_path
                )
                parse_method = 'MinerU'

            else:  # file_type == 'markitdown'
                # Parse every other format with MarkItDown
                self._parse_with_markitdown(
                    file_path=Path(file_path),
                    file_name=file_name,
                    output_path=output_path
                )
                parse_method = 'MarkItDown'

            # Mark the task as completed
            success = self.db.update_task_status(
                task_id, 'completed',
                result_path=str(output_path),
                worker_id=self.worker_id
            )

            if success:
                logger.info(f"✅ Task {task_id} completed by {self.worker_id}")
                logger.info(f"   Parser: {parse_method}")
                logger.info(f"   Output: {output_path}")
            else:
                logger.warning(
                    f"⚠️ Task {task_id} was modified by another process. "
                    f"Worker {self.worker_id} completed the work but status update was rejected."
                )

        finally:
            # Clean up the temporary file
            try:
                if Path(file_path).exists():
                    Path(file_path).unlink()
            except Exception as e:
                logger.warning(f"Failed to clean up temp file {file_path}: {e}")

    def decode_request(self, request):
        """
        Decode a request.

        Now mainly used for health checks and manual triggering
        (kept for backward compatibility).
        """
        return request.get('action', 'poll')

    def _get_file_type(self, file_path: str) -> str:
        """
        Determine the file type.

        Args:
            file_path: file path

        Returns:
            'pdf_image': PDF or image format, parsed with MinerU
            'markitdown': every other format, parsed with MarkItDown
        """
        suffix = Path(file_path).suffix.lower()

        if suffix in self.PDF_IMAGE_FORMATS:
            return 'pdf_image'
        else:
            # Everything that is not a PDF/image goes through MarkItDown
            return 'markitdown'

    def _parse_with_mineru(self, file_path: Path, file_name: str, task_id: str,
                           backend: str, options: dict, output_path: Path):
        """
        Parse PDF and image formats with MinerU.

        Args:
            file_path: file path
            file_name: file name
            task_id: task ID
            backend: backend type
            options: parse options
            output_path: output path
        """
        logger.info(f"📄 Using MinerU to parse: {file_name}")

        try:
            # Read the file
            pdf_bytes = read_fn(file_path)

            # Run the parse (MinerU's ModelSingleton reuses loaded models)
            do_parse(
                output_dir=str(output_path),
                pdf_file_names=[Path(file_name).stem],
                pdf_bytes_list=[pdf_bytes],
                p_lang_list=[options.get('lang', 'ch')],
                backend=backend,
                parse_method=options.get('method', 'auto'),
                formula_enable=options.get('formula_enable', True),
                table_enable=options.get('table_enable', True),
            )
        finally:
            # Use MinerU's built-in memory cleanup.
            # It only frees intermediate inference buffers and does not unload models.
            try:
                clean_memory()
            except Exception as e:
                logger.debug(f"Memory cleanup failed for task {task_id}: {e}")

    def _parse_with_markitdown(self, file_path: Path, file_name: str,
                               output_path: Path):
        """
        Parse documents with MarkItDown (supports Office, HTML, text, and more).

        Args:
            file_path: file path
            file_name: file name
            output_path: output path
        """
        if not MARKITDOWN_AVAILABLE or self.markitdown is None:
            raise RuntimeError("markitdown is not available. Please install it: pip install markitdown")

        logger.info(f"📊 Using MarkItDown to parse: {file_name}")

        # Convert the document with markitdown
        result = self.markitdown.convert(str(file_path))

        # Save the result as a markdown file
        output_file = output_path / f"{Path(file_name).stem}.md"
        output_file.write_text(result.text_content, encoding='utf-8')

        logger.info(f"📝 Markdown saved to: {output_file}")

    def predict(self, action):
        """
        HTTP endpoint (mainly for health checks and monitoring).

        Tasks are now pulled and processed automatically by the worker loop,
        so this endpoint is used for:
        1. health checks
        2. worker status
        3. the legacy manual-trigger mode (when enable_worker_loop=False)
        """
        if action == 'health':
            # Health check
            stats = self.db.get_queue_stats()
            return {
                'status': 'healthy',
                'worker_id': self.worker_id,
                'worker_loop_enabled': self.enable_worker_loop,
                'worker_running': self.running,
                'queue_stats': stats
            }

        elif action == 'poll':
            if not self.enable_worker_loop:
                # Compatibility mode: trigger a task pull manually
                task = self.db.get_next_task(self.worker_id)

                if not task:
                    return {
                        'status': 'idle',
                        'message': 'No pending tasks in queue',
                        'worker_id': self.worker_id
                    }

                try:
                    self._process_task(task)
                    return {
                        'status': 'completed',
                        'task_id': task['task_id'],
                        'worker_id': self.worker_id
                    }
                except Exception as e:
                    return {
                        'status': 'failed',
                        'task_id': task['task_id'],
                        'error': str(e),
                        'worker_id': self.worker_id
                    }
            else:
                # Worker-loop mode: report status only
                return {
                    'status': 'auto_mode',
                    'message': 'Worker is running in auto-loop mode, tasks are processed automatically',
                    'worker_id': self.worker_id,
                    'worker_running': self.running
                }

        else:
            return {
                'status': 'error',
                'message': f'Invalid action: {action}. Use "health" or "poll".',
                'worker_id': self.worker_id
            }

    def encode_response(self, response):
        """Encode a response."""
        return response


def start_litserve_workers(
    output_dir='/tmp/mineru_tianshu_output',
    accelerator='auto',
    devices='auto',
    workers_per_device=1,
    port=9000,
    poll_interval=0.5,
    enable_worker_loop=True
):
    """
    Start the LitServe worker pool.

    Args:
        output_dir: output directory
        accelerator: accelerator type (auto/cuda/cpu/mps)
        devices: devices to use (auto/[0,1,2])
        workers_per_device: number of workers per GPU
        port: server port
        poll_interval: interval between task polls (seconds)
        enable_worker_loop: whether workers poll for tasks automatically
    """
    logger.info("=" * 60)
    logger.info("🚀 Starting MinerU Tianshu LitServe Worker Pool")
    logger.info("=" * 60)
    logger.info(f"📂 Output Directory: {output_dir}")
    logger.info(f"🎮 Accelerator: {accelerator}")
    logger.info(f"💾 Devices: {devices}")
    logger.info(f"👷 Workers per Device: {workers_per_device}")
    logger.info(f"🔌 Port: {port}")
    logger.info(f"🔄 Worker Loop: {'Enabled' if enable_worker_loop else 'Disabled'}")
    if enable_worker_loop:
        logger.info(f"⏱️ Poll Interval: {poll_interval}s")
    logger.info("=" * 60)

    # Create the LitServe server
    api = MinerUWorkerAPI(
        output_dir=output_dir,
        poll_interval=poll_interval,
        enable_worker_loop=enable_worker_loop
    )
    server = ls.LitServer(
        api,
        accelerator=accelerator,
        devices=devices,
        workers_per_device=workers_per_device,
        timeout=False,  # no request timeout
    )

    # Register graceful-shutdown handlers
    def graceful_shutdown(signum=None, frame=None):
        """Handle shutdown signals and stop workers gracefully."""
        logger.info("🛑 Received shutdown signal, gracefully stopping workers...")
        # Note: LitServe creates several worker instances per device.
        # `api` here is only the template; the actual worker instances are
        # managed by LitServe, and teardown runs in each worker process.
        if hasattr(api, 'teardown'):
            api.teardown()
        sys.exit(0)

    # Register signal handlers (Ctrl+C, etc.)
    signal.signal(signal.SIGINT, graceful_shutdown)
    signal.signal(signal.SIGTERM, graceful_shutdown)

    # Register an atexit handler (runs on normal exit)
    atexit.register(lambda: api.teardown() if hasattr(api, 'teardown') else None)

    logger.info("✅ LitServe worker pool initialized")
    logger.info(f"📡 Listening on: http://0.0.0.0:{port}/predict")
    if enable_worker_loop:
        logger.info("🔁 Workers will continuously poll and process tasks")
    else:
        logger.info("🔄 Workers will wait for scheduler triggers")
    logger.info("=" * 60)

    # Start the server
    server.run(port=port, generate_client_file=False)


if __name__ == '__main__':
    import argparse

    parser = argparse.ArgumentParser(description='MinerU Tianshu LitServe Worker Pool')
    parser.add_argument('--output-dir', type=str, default='/tmp/mineru_tianshu_output',
                        help='Output directory for processed files')
    parser.add_argument('--accelerator', type=str, default='auto',
                        choices=['auto', 'cuda', 'cpu', 'mps'],
                        help='Accelerator type')
    parser.add_argument('--devices', type=str, default='auto',
                        help='Devices to use (auto or comma-separated list like 0,1,2)')
    parser.add_argument('--workers-per-device', type=int, default=1,
                        help='Number of workers per device')
    parser.add_argument('--port', type=int, default=9000,
                        help='Server port')
    parser.add_argument('--poll-interval', type=float, default=0.5,
                        help='Worker poll interval in seconds (default: 0.5)')
    parser.add_argument('--disable-worker-loop', action='store_true',
                        help='Disable worker auto-loop mode (use scheduler-driven mode)')

    args = parser.parse_args()

    # Parse the devices argument
    devices = args.devices
    if devices != 'auto':
        try:
            devices = [int(d) for d in devices.split(',')]
        except ValueError:
            logger.warning(f"Invalid devices format: {devices}, using 'auto'")
            devices = 'auto'

    start_litserve_workers(
        output_dir=args.output_dir,
        accelerator=args.accelerator,
        devices=devices,
        workers_per_device=args.workers_per_device,
        port=args.port,
        poll_interval=args.poll_interval,
        enable_worker_loop=not args.disable_worker_loop
    )
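The GPU-pinning rule in `setup()` can be isolated as a pure function: given a LitServe device string like `cuda:3`, export `CUDA_VISIBLE_DEVICES=3` and address the card as logical `cuda:0`. This is a minimal sketch for illustration only; `pin_to_device` is a hypothetical helper, not part of the worker.

```python
import os


def pin_to_device(device: str) -> str:
    """Restrict the current process to its assigned GPU, mirroring the
    setup() logic: 'cuda:3' sets CUDA_VISIBLE_DEVICES=3, after which the
    process addresses the card as logical 'cuda:0'."""
    if device not in ('auto', 'cpu') and ':' in device:
        device_id = device.split(':')[-1]
        os.environ['CUDA_VISIBLE_DEVICES'] = device_id
        return 'cuda:0'
    return device


print(pin_to_device('cuda:3'))  # cuda:0 (with CUDA_VISIBLE_DEVICES=3)
print(pin_to_device('cpu'))     # cpu
```

Pinning via the environment rather than per-call device arguments matters here because model code deep inside MinerU may allocate on any visible card; hiding the other cards is the only restriction that applies process-wide.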
@@ -1,32 +0,0 @@
# MinerU Tianshu Requirements
# Tianshu project dependencies

# Core MinerU
mineru[core]>=2.5.0

# Image Augmentation (version pinned for compatibility)
albumentations>=1.4.4,<2.0.0
albucore>=0.0.13,<0.0.20

# Web Framework
fastapi>=0.115.0
uvicorn[standard]>=0.32.0

# LitServe for GPU Load Balancing
litserve>=0.2.0

# Async HTTP Client
aiohttp>=3.11.0

# Logging
loguru>=0.7.0

# Office Document Parsing
markitdown[all]>=0.1.3

# MinIO Object Storage
minio>=7.2.0

# Optional: for better performance
# ujson>=5.10.0
@@ -1,256 +0,0 @@
"""
MinerU Tianshu - Unified Startup Script

Starts every service with one command: API Server + LitServe Workers + Task Scheduler
"""
import subprocess
import signal
import sys
import time
import os
from loguru import logger
from pathlib import Path
import argparse


class TianshuLauncher:
    """Tianshu service launcher."""

    def __init__(
        self,
        output_dir='/tmp/mineru_tianshu_output',
        api_port=8000,
        worker_port=9000,
        workers_per_device=1,
        devices='auto',
        accelerator='auto'
    ):
        self.output_dir = output_dir
        self.api_port = api_port
        self.worker_port = worker_port
        self.workers_per_device = workers_per_device
        self.devices = devices
        self.accelerator = accelerator
        self.processes = []

    def start_services(self):
        """Start all services."""
        logger.info("=" * 70)
        logger.info("🚀 MinerU Tianshu - Starting All Services")
        logger.info("=" * 70)
        logger.info("Tianshu - enterprise-grade multi-GPU document parsing service")
        logger.info("")

        try:
            # 1. Start the API Server
            logger.info("📡 [1/3] Starting API Server...")
            env = os.environ.copy()
            env['API_PORT'] = str(self.api_port)
            api_proc = subprocess.Popen(
                [sys.executable, 'api_server.py'],
                cwd=Path(__file__).parent,
                env=env
            )
            self.processes.append(('API Server', api_proc))
            time.sleep(3)

            if api_proc.poll() is not None:
                logger.error("❌ API Server failed to start!")
                return False

            logger.info(f"   ✅ API Server started (PID: {api_proc.pid})")
            logger.info(f"   📖 API Docs: http://localhost:{self.api_port}/docs")
            logger.info("")

            # 2. Start the LitServe Worker Pool
            logger.info("⚙️ [2/3] Starting LitServe Worker Pool...")
            worker_cmd = [
                sys.executable, 'litserve_worker.py',
                '--output-dir', self.output_dir,
                '--accelerator', self.accelerator,
                '--workers-per-device', str(self.workers_per_device),
                '--port', str(self.worker_port),
                '--devices', str(self.devices) if isinstance(self.devices, str) else ','.join(map(str, self.devices))
            ]

            worker_proc = subprocess.Popen(
                worker_cmd,
                cwd=Path(__file__).parent
            )
            self.processes.append(('LitServe Workers', worker_proc))
            time.sleep(5)

            if worker_proc.poll() is not None:
                logger.error("❌ LitServe Workers failed to start!")
                return False

            logger.info(f"   ✅ LitServe Workers started (PID: {worker_proc.pid})")
            logger.info(f"   🔌 Worker Port: {self.worker_port}")
            logger.info(f"   👷 Workers per Device: {self.workers_per_device}")
            logger.info("")

            # 3. Start the Task Scheduler
            logger.info("🔄 [3/3] Starting Task Scheduler...")
            scheduler_cmd = [
                sys.executable, 'task_scheduler.py',
                '--litserve-url', f'http://localhost:{self.worker_port}/predict',
                '--wait-for-workers'
            ]

            scheduler_proc = subprocess.Popen(
                scheduler_cmd,
                cwd=Path(__file__).parent
            )
            self.processes.append(('Task Scheduler', scheduler_proc))
            time.sleep(3)

            if scheduler_proc.poll() is not None:
                logger.error("❌ Task Scheduler failed to start!")
                return False

            logger.info(f"   ✅ Task Scheduler started (PID: {scheduler_proc.pid})")
            logger.info("")

            # Startup succeeded
            logger.info("=" * 70)
            logger.info("✅ All Services Started Successfully!")
            logger.info("=" * 70)
            logger.info("")
            logger.info("📚 Quick Start:")
            logger.info(f"  • API Documentation: http://localhost:{self.api_port}/docs")
            logger.info(f"  • Submit Task: POST http://localhost:{self.api_port}/api/v1/tasks/submit")
            logger.info(f"  • Query Status: GET http://localhost:{self.api_port}/api/v1/tasks/{{task_id}}")
            logger.info(f"  • Queue Stats: GET http://localhost:{self.api_port}/api/v1/queue/stats")
            logger.info("")
            logger.info("🔧 Service Details:")
            for name, proc in self.processes:
                logger.info(f"  • {name:20s} PID: {proc.pid}")
            logger.info("")
            logger.info("⚠️ Press Ctrl+C to stop all services")
            logger.info("=" * 70)

            return True

        except Exception as e:
            logger.error(f"❌ Failed to start services: {e}")
            self.stop_services()
            return False

    def stop_services(self, signum=None, frame=None):
        """Stop all services."""
        logger.info("")
        logger.info("=" * 70)
        logger.info("⏹️ Stopping All Services...")
        logger.info("=" * 70)

        for name, proc in self.processes:
            if proc.poll() is None:  # process still running
                logger.info(f"  Stopping {name} (PID: {proc.pid})...")
                proc.terminate()

        # Wait for every process to exit
        for name, proc in self.processes:
            try:
                proc.wait(timeout=10)
                logger.info(f"  ✅ {name} stopped")
            except subprocess.TimeoutExpired:
                logger.warning(f"  ⚠️ {name} did not stop gracefully, forcing...")
                proc.kill()
                proc.wait()

        logger.info("=" * 70)
        logger.info("✅ All Services Stopped")
        logger.info("=" * 70)
        sys.exit(0)

    def wait(self):
        """Wait on all services."""
        try:
            while True:
                time.sleep(1)

                # Check process health
                for name, proc in self.processes:
                    if proc.poll() is not None:
                        logger.error(f"❌ {name} unexpectedly stopped!")
                        self.stop_services()
                        return

        except KeyboardInterrupt:
            self.stop_services()


def main():
    """Entry point."""
    parser = argparse.ArgumentParser(
        description='MinerU Tianshu - unified startup script',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Start with the default configuration (auto-detect GPUs)
  python start_all.py

  # CPU mode
  python start_all.py --accelerator cpu

  # Custom output directory and API port
  python start_all.py --output-dir /data/output --api-port 8080

  # Two workers per GPU
  python start_all.py --accelerator cuda --workers-per-device 2

  # Use specific GPUs only
  python start_all.py --accelerator cuda --devices 0,1
"""
    )

    parser.add_argument('--output-dir', type=str, default='/tmp/mineru_tianshu_output',
                        help='Output directory (default: /tmp/mineru_tianshu_output)')
    parser.add_argument('--api-port', type=int, default=8000,
                        help='API server port (default: 8000)')
    parser.add_argument('--worker-port', type=int, default=9000,
                        help='Worker server port (default: 9000)')
    parser.add_argument('--accelerator', type=str, default='auto',
                        choices=['auto', 'cuda', 'cpu', 'mps'],
                        help='Accelerator type (default: auto, auto-detect)')
    parser.add_argument('--workers-per-device', type=int, default=1,
                        help='Number of workers per GPU (default: 1)')
    parser.add_argument('--devices', type=str, default='auto',
                        help='GPU devices to use, comma-separated (default: auto, all GPUs)')

    args = parser.parse_args()

    # Parse the devices argument
    devices = args.devices
    if devices != 'auto':
        try:
            devices = [int(d) for d in devices.split(',')]
        except ValueError:
            logger.warning(f"Invalid devices format: {devices}, using 'auto'")
            devices = 'auto'

    # Create the launcher
    launcher = TianshuLauncher(
        output_dir=args.output_dir,
        api_port=args.api_port,
        worker_port=args.worker_port,
        workers_per_device=args.workers_per_device,
        devices=devices,
        accelerator=args.accelerator
    )

    # Install signal handlers
    signal.signal(signal.SIGINT, launcher.stop_services)
    signal.signal(signal.SIGTERM, launcher.stop_services)

    # Start the services
    if launcher.start_services():
        launcher.wait()
    else:
        sys.exit(1)


if __name__ == '__main__':
    main()
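The supervision loop above relies on one `subprocess` detail: `Popen.poll()` returns `None` while the child is alive and its exit code once it has stopped. A tiny self-contained sketch of that check (the short-lived child here is just a stand-in for the real services):

```python
import subprocess
import sys
import time

# Spawn a child that exits on its own after 0.2s
proc = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(0.2)'])

# poll() is None while the child runs; once it exits, poll() (and
# proc.returncode) hold the exit code
while proc.poll() is None:
    time.sleep(0.05)

print('exit code:', proc.returncode)  # exit code: 0
```

This is why `wait()` in the launcher can detect a crashed service without blocking: it polls each child once per second instead of calling the blocking `Popen.wait()`.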
@@ -1,436 +0,0 @@
|
||||
"""
|
||||
MinerU Tianshu - SQLite Task Database Manager
|
||||
天枢任务数据库管理器
|
||||
|
||||
负责任务的持久化存储、状态管理和原子性操作
|
||||
"""
|
||||
import sqlite3
|
||||
import json
|
||||
import uuid
|
||||
from contextlib import contextmanager
|
||||
from typing import Optional, List, Dict
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class TaskDB:
|
||||
"""任务数据库管理类"""
|
||||
|
||||
def __init__(self, db_path='mineru_tianshu.db'):
|
||||
self.db_path = db_path
        self._init_db()

    def _get_conn(self):
        """Get a database connection (a new one per call, to avoid pickle issues).

        Concurrency-safety notes:
        - check_same_thread=False is safe here because:
          1. Every call creates a fresh connection; none is shared across threads.
          2. Connections are closed as soon as they are used (in the get_cursor context manager).
          3. No connection pool is used, so a single connection is never shared between threads.
        - timeout=30.0 guards against deadlocks: waiting more than 30 seconds for a lock raises an exception.
        """
        conn = sqlite3.connect(
            self.db_path,
            check_same_thread=False,
            timeout=30.0
        )
        conn.row_factory = sqlite3.Row
        return conn

    @contextmanager
    def get_cursor(self):
        """Context manager with automatic commit and error handling."""
        conn = self._get_conn()
        cursor = conn.cursor()
        try:
            yield cursor
            conn.commit()
        except Exception:
            conn.rollback()
            raise
        finally:
            conn.close()  # Always close the connection

    def _init_db(self):
        """Initialize the database schema."""
        with self.get_cursor() as cursor:
            cursor.execute('''
                CREATE TABLE IF NOT EXISTS tasks (
                    task_id TEXT PRIMARY KEY,
                    file_name TEXT NOT NULL,
                    file_path TEXT,
                    status TEXT DEFAULT 'pending',
                    priority INTEGER DEFAULT 0,
                    backend TEXT DEFAULT 'pipeline',
                    options TEXT,
                    result_path TEXT,
                    error_message TEXT,
                    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                    started_at TIMESTAMP,
                    completed_at TIMESTAMP,
                    worker_id TEXT,
                    retry_count INTEGER DEFAULT 0
                )
            ''')

            # Indexes to speed up the common queries
            cursor.execute('CREATE INDEX IF NOT EXISTS idx_status ON tasks(status)')
            cursor.execute('CREATE INDEX IF NOT EXISTS idx_priority ON tasks(priority DESC)')
            cursor.execute('CREATE INDEX IF NOT EXISTS idx_created_at ON tasks(created_at)')
            cursor.execute('CREATE INDEX IF NOT EXISTS idx_worker_id ON tasks(worker_id)')

    def create_task(self, file_name: str, file_path: str,
                    backend: str = 'pipeline', options: dict = None,
                    priority: int = 0) -> str:
        """
        Create a new task.

        Args:
            file_name: file name
            file_path: file path
            backend: processing backend (pipeline/vlm-transformers/vlm-vllm-engine)
            options: processing options (dict)
            priority: priority; higher numbers are scheduled first

        Returns:
            task_id: the new task's ID
        """
        task_id = str(uuid.uuid4())
        with self.get_cursor() as cursor:
            cursor.execute('''
                INSERT INTO tasks (task_id, file_name, file_path, backend, options, priority)
                VALUES (?, ?, ?, ?, ?, ?)
            ''', (task_id, file_name, file_path, backend, json.dumps(options or {}), priority))
            return task_id

    def get_next_task(self, worker_id: str, max_retries: int = 3) -> Optional[Dict]:
        """
        Fetch the next pending task (atomically, to avoid races between workers).

        Args:
            worker_id: worker ID
            max_retries: how many times to retry when another worker grabs the task first (default 3)

        Returns:
            task: the task as a dict, or None if the queue is empty

        Concurrency-safety notes:
        1. BEGIN IMMEDIATE takes the write lock up front.
        2. The UPDATE re-checks status = 'pending', so a task is never claimed twice.
        3. rowcount is checked to confirm the update actually happened.
        4. If the task was grabbed by another worker, retry immediately instead of
           returning None (the queue may still hold other pending tasks).
        """
        for attempt in range(max_retries):
            with self.get_cursor() as cursor:
                # Use a transaction to make the claim atomic
                cursor.execute('BEGIN IMMEDIATE')

                # Pick the next task by priority, then creation time
                cursor.execute('''
                    SELECT * FROM tasks
                    WHERE status = 'pending'
                    ORDER BY priority DESC, created_at ASC
                    LIMIT 1
                ''')

                task = cursor.fetchone()
                if task:
                    # Mark it processing right away, re-checking that it is still pending
                    cursor.execute('''
                        UPDATE tasks
                        SET status = 'processing',
                            started_at = CURRENT_TIMESTAMP,
                            worker_id = ?
                        WHERE task_id = ? AND status = 'pending'
                    ''', (worker_id, task['task_id']))

                    # Verify the update succeeded (another worker may have won the race)
                    if cursor.rowcount == 0:
                        # The task was grabbed by another process; retry immediately,
                        # since the queue may still contain other pending tasks.
                        continue

                    return dict(task)
                else:
                    # No pending tasks in the queue
                    return None

        # Retries exhausted without claiming a task (high-concurrency scenario)
        return None

    def _build_update_clauses(self, status: str, result_path: str = None,
                              error_message: str = None, worker_id: str = None,
                              task_id: str = None):
        """
        Helper that builds the UPDATE and WHERE clauses.

        Args:
            status: new status
            result_path: result path (optional)
            error_message: error message (optional)
            worker_id: worker ID (optional)
            task_id: task ID (optional)

        Returns:
            tuple: (update_clauses, update_params, where_clauses, where_params)
        """
        update_clauses = ['status = ?']
        update_params = [status]
        where_clauses = []
        where_params = []

        # Filter by task_id when provided
        if task_id:
            where_clauses.append('task_id = ?')
            where_params.append(task_id)

        # Transition to 'completed'
        if status == 'completed':
            update_clauses.append('completed_at = CURRENT_TIMESTAMP')
            if result_path:
                update_clauses.append('result_path = ?')
                update_params.append(result_path)
            # Only update tasks that are currently processing
            where_clauses.append("status = 'processing'")
            if worker_id:
                where_clauses.append('worker_id = ?')
                where_params.append(worker_id)

        # Transition to 'failed'
        elif status == 'failed':
            update_clauses.append('completed_at = CURRENT_TIMESTAMP')
            if error_message:
                update_clauses.append('error_message = ?')
                update_params.append(error_message)
            # Only update tasks that are currently processing
            where_clauses.append("status = 'processing'")
            if worker_id:
                where_clauses.append('worker_id = ?')
                where_params.append(worker_id)

        return update_clauses, update_params, where_clauses, where_params

    def update_task_status(self, task_id: str, status: str,
                           result_path: str = None, error_message: str = None,
                           worker_id: str = None):
        """
        Update a task's status.

        Args:
            task_id: task ID
            status: new status (pending/processing/completed/failed/cancelled)
            result_path: result path (optional)
            error_message: error message (optional)
            worker_id: worker ID (optional, used for the concurrency check)

        Returns:
            bool: whether the update succeeded

        Concurrency-safety notes:
        1. Updating to completed/failed requires the current status to be 'processing'.
        2. If worker_id is provided, the task must belong to that worker.
        3. A False return means the task was modified by another process.
        """
        with self.get_cursor() as cursor:
            # Build the UPDATE and WHERE clauses with the helper
            update_clauses, update_params, where_clauses, where_params = \
                self._build_update_clauses(status, result_path, error_message, worker_id, task_id)

            # Parameter order: UPDATE part first, then WHERE part
            all_params = update_params + where_params

            sql = f'''
                UPDATE tasks
                SET {', '.join(update_clauses)}
                WHERE {' AND '.join(where_clauses)}
            '''

            cursor.execute(sql, all_params)

            # Check whether the update took effect
            success = cursor.rowcount > 0

            # Debug logging (only on failure)
            if not success and status in ['completed', 'failed']:
                from loguru import logger
                logger.debug(
                    f"Status update failed: task_id={task_id}, status={status}, "
                    f"worker_id={worker_id}, SQL: {sql}, params: {all_params}"
                )

            return success

    def get_task(self, task_id: str) -> Optional[Dict]:
        """
        Look up a task's details.

        Args:
            task_id: task ID

        Returns:
            task: the task as a dict, or None if it does not exist
        """
        with self.get_cursor() as cursor:
            cursor.execute('SELECT * FROM tasks WHERE task_id = ?', (task_id,))
            task = cursor.fetchone()
            return dict(task) if task else None

    def get_queue_stats(self) -> Dict[str, int]:
        """
        Get queue statistics.

        Returns:
            stats: task counts per status
        """
        with self.get_cursor() as cursor:
            cursor.execute('''
                SELECT status, COUNT(*) as count
                FROM tasks
                GROUP BY status
            ''')
            stats = {row['status']: row['count'] for row in cursor.fetchall()}
            return stats

    def get_tasks_by_status(self, status: str, limit: int = 100) -> List[Dict]:
        """
        List tasks by status.

        Args:
            status: task status
            limit: maximum number of rows to return

        Returns:
            tasks: list of task dicts
        """
        with self.get_cursor() as cursor:
            cursor.execute('''
                SELECT * FROM tasks
                WHERE status = ?
                ORDER BY created_at DESC
                LIMIT ?
            ''', (status, limit))
            return [dict(row) for row in cursor.fetchall()]

    def cleanup_old_task_files(self, days: int = 7):
        """
        Delete result files of old tasks (keeping the database records).

        Args:
            days: delete result files of tasks older than this many days

        Returns:
            int: number of result directories removed

        Notes:
        - Only result files are deleted; database records are kept.
        - The result_path column is cleared for cleaned tasks.
        - Task status and history remain queryable.
        """
        from pathlib import Path
        import shutil

        with self.get_cursor() as cursor:
            # Find tasks whose files should be cleaned up
            cursor.execute('''
                SELECT task_id, result_path FROM tasks
                WHERE completed_at < datetime('now', '-' || ? || ' days')
                AND status IN ('completed', 'failed')
                AND result_path IS NOT NULL
            ''', (days,))

            old_tasks = cursor.fetchall()
            file_count = 0

            # Delete the result directories
            for task in old_tasks:
                if task['result_path']:
                    result_path = Path(task['result_path'])
                    if result_path.exists() and result_path.is_dir():
                        try:
                            shutil.rmtree(result_path)
                            file_count += 1

                            # Clear result_path in the DB to mark the files as cleaned
                            cursor.execute('''
                                UPDATE tasks
                                SET result_path = NULL
                                WHERE task_id = ?
                            ''', (task['task_id'],))

                        except Exception as e:
                            from loguru import logger
                            logger.warning(f"Failed to delete result files for task {task['task_id']}: {e}")

            return file_count

    def cleanup_old_task_records(self, days: int = 30):
        """
        Permanently delete very old task records (optional).

        Args:
            days: delete task records older than this many days

        Returns:
            int: number of records deleted

        Notes:
        - This permanently removes database records.
        - A long retention period (e.g. 30-90 days) is recommended.
        - Normally this method does not need to be called.
        """
        with self.get_cursor() as cursor:
            cursor.execute('''
                DELETE FROM tasks
                WHERE completed_at < datetime('now', '-' || ? || ' days')
                AND status IN ('completed', 'failed')
            ''', (days,))

            deleted_count = cursor.rowcount
            return deleted_count

    def reset_stale_tasks(self, timeout_minutes: int = 60):
        """
        Reset timed-out 'processing' tasks back to 'pending'.

        Args:
            timeout_minutes: timeout in minutes
        """
        with self.get_cursor() as cursor:
            cursor.execute('''
                UPDATE tasks
                SET status = 'pending',
                    worker_id = NULL,
                    retry_count = retry_count + 1
                WHERE status = 'processing'
                AND started_at < datetime('now', '-' || ? || ' minutes')
            ''', (timeout_minutes,))
            reset_count = cursor.rowcount
            return reset_count


if __name__ == '__main__':
    from pathlib import Path

    # Smoke test
    db = TaskDB('test_tianshu.db')

    # Create a test task
    task_id = db.create_task(
        file_name='test.pdf',
        file_path='/tmp/test.pdf',
        backend='pipeline',
        options={'lang': 'ch', 'formula_enable': True},
        priority=1
    )
    print(f"Created task: {task_id}")

    # Look it up
    task = db.get_task(task_id)
    print(f"Task details: {task}")

    # Queue statistics
    stats = db.get_queue_stats()
    print(f"Queue stats: {stats}")

    # Remove the test database
    Path('test_tianshu.db').unlink(missing_ok=True)
    print("Test completed!")
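The atomic claim in `get_next_task` is the heart of the queue: `BEGIN IMMEDIATE` takes the write lock up front, the `UPDATE` re-checks `status = 'pending'`, and `rowcount` reveals whether another worker won the race. A minimal, self-contained sketch of that pattern, using a throwaway in-memory table instead of the real `tasks` schema:

```python
import sqlite3
import uuid

# Throwaway in-memory table standing in for the real 'tasks' schema.
conn = sqlite3.connect(':memory:')
conn.isolation_level = None  # autocommit; we manage transactions explicitly
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, "
             "status TEXT DEFAULT 'pending', worker_id TEXT)")
task_id = str(uuid.uuid4())
conn.execute("INSERT INTO tasks (task_id) VALUES (?)", (task_id,))

def claim(conn, worker_id):
    """Claim one pending task, or return None if there is nothing to claim."""
    cur = conn.cursor()
    cur.execute('BEGIN IMMEDIATE')  # take the write lock up front
    row = cur.execute("SELECT * FROM tasks WHERE status = 'pending' LIMIT 1").fetchone()
    if row is None:
        cur.execute('COMMIT')
        return None
    cur.execute("UPDATE tasks SET status = 'processing', worker_id = ? "
                "WHERE task_id = ? AND status = 'pending'",
                (worker_id, row['task_id']))
    won = cur.rowcount == 1         # 0 means another worker got there first
    cur.execute('COMMIT')
    return row['task_id'] if won else None

first = claim(conn, 'worker-1')   # claims the only pending task
second = claim(conn, 'worker-1')  # queue is empty now, returns None
```

The second `claim` call returning `None` is the same signal `get_next_task` gives its caller when the queue is drained.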
@@ -1,270 +0,0 @@
"""
MinerU Tianshu - Task Scheduler (Optional)

In worker auto-loop mode, the scheduler is mainly used for:
1. Monitoring queue status (every 5 minutes by default)
2. Health checks (every 15 minutes by default)
3. Collecting statistics
4. Failure recovery (resetting stale tasks)

Notes:
- If the workers run in auto-loop mode (the default), the scheduler is not needed
  to trigger task processing.
- Workers already pull work on their own; the scheduler only checks system state occasionally.
- Long intervals minimize overhead while keeping the necessary monitoring.
- A 5-minute monitor and a 15-minute health check are timely enough for a self-running system.
"""
import asyncio
import signal

import aiohttp
from loguru import logger

from task_db import TaskDB


class TaskScheduler:
    """
    Task scheduler (optional).

    Responsibilities (worker auto-loop mode):
    1. Monitor the SQLite task queue
    2. Health-check the workers
    3. Failure recovery (reset stale tasks)
    4. Collect and report statistics

    Responsibilities (legacy mode):
    1. Trigger workers to pull tasks
    """

    def __init__(
        self,
        litserve_url='http://localhost:9000/predict',
        monitor_interval=300,
        health_check_interval=900,
        stale_task_timeout=60,
        cleanup_old_files_days=7,
        cleanup_old_records_days=0,
        worker_auto_mode=True
    ):
        """
        Initialize the scheduler.

        Args:
            litserve_url: URL of the LitServe worker
            monitor_interval: monitoring interval in seconds (default 300 s = 5 minutes)
            health_check_interval: health-check interval in seconds (default 900 s = 15 minutes)
            stale_task_timeout: timeout in minutes before a stale task is reset
            cleanup_old_files_days: delete result files older than this many days (0 = disabled, default 7)
            cleanup_old_records_days: delete DB records older than this many days (0 = disabled; deletion is not recommended)
            worker_auto_mode: whether workers run in auto-loop mode
        """
        self.litserve_url = litserve_url
        self.monitor_interval = monitor_interval
        self.health_check_interval = health_check_interval
        self.stale_task_timeout = stale_task_timeout
        self.cleanup_old_files_days = cleanup_old_files_days
        self.cleanup_old_records_days = cleanup_old_records_days
        self.worker_auto_mode = worker_auto_mode
        self.db = TaskDB()
        self.running = True

    async def check_worker_health(self, session: aiohttp.ClientSession):
        """Check worker health."""
        try:
            async with session.post(
                self.litserve_url,
                json={'action': 'health'},
                timeout=aiohttp.ClientTimeout(total=10)
            ) as resp:
                if resp.status == 200:
                    result = await resp.json()
                    return result
                else:
                    logger.error(f"Health check failed with status {resp.status}")
                    return None

        except asyncio.TimeoutError:
            logger.warning("Health check timeout")
            return None
        except Exception as e:
            logger.error(f"Health check error: {e}")
            return None

    async def schedule_loop(self):
        """Main monitoring loop."""
        logger.info("🔄 Task scheduler started")
        logger.info(f"  LitServe URL: {self.litserve_url}")
        logger.info(f"  Worker Mode: {'Auto-Loop' if self.worker_auto_mode else 'Scheduler-Driven'}")
        logger.info(f"  Monitor Interval: {self.monitor_interval}s")
        logger.info(f"  Health Check Interval: {self.health_check_interval}s")
        logger.info(f"  Stale Task Timeout: {self.stale_task_timeout}m")
        if self.cleanup_old_files_days > 0:
            logger.info(f"  Cleanup Old Files: {self.cleanup_old_files_days} days")
        else:
            logger.info("  Cleanup Old Files: Disabled")
        if self.cleanup_old_records_days > 0:
            logger.info(f"  Cleanup Old Records: {self.cleanup_old_records_days} days (Not Recommended)")
        else:
            logger.info("  Cleanup Old Records: Disabled (Keep Forever)")

        health_check_counter = 0
        stale_task_counter = 0
        cleanup_counter = 0

        async with aiohttp.ClientSession() as session:
            while self.running:
                try:
                    # 1. Monitor queue status
                    stats = self.db.get_queue_stats()
                    pending_count = stats.get('pending', 0)
                    processing_count = stats.get('processing', 0)
                    completed_count = stats.get('completed', 0)
                    failed_count = stats.get('failed', 0)

                    if pending_count > 0 or processing_count > 0:
                        logger.info(
                            f"📊 Queue: {pending_count} pending, {processing_count} processing, "
                            f"{completed_count} completed, {failed_count} failed"
                        )

                    # 2. Periodic health check
                    health_check_counter += 1
                    if health_check_counter * self.monitor_interval >= self.health_check_interval:
                        health_check_counter = 0
                        logger.info("🏥 Performing health check...")
                        health_result = await self.check_worker_health(session)
                        if health_result:
                            logger.info(f"✅ Workers healthy: {health_result}")
                        else:
                            logger.warning("⚠️ Workers health check failed")

                    # 3. Periodically reset stale tasks
                    stale_task_counter += 1
                    if stale_task_counter * self.monitor_interval >= self.stale_task_timeout * 60:
                        stale_task_counter = 0
                        reset_count = self.db.reset_stale_tasks(self.stale_task_timeout)
                        if reset_count > 0:
                            logger.warning(f"⚠️ Reset {reset_count} stale tasks (timeout: {self.stale_task_timeout}m)")

                    # 4. Periodically clean up old task files and records
                    cleanup_counter += 1
                    # Run cleanup every 24 hours (measured in monitoring cycles)
                    cleanup_interval_cycles = (24 * 3600) / self.monitor_interval
                    if cleanup_counter >= cleanup_interval_cycles:
                        cleanup_counter = 0

                        # Delete old result files (DB records are kept)
                        if self.cleanup_old_files_days > 0:
                            logger.info(f"🧹 Cleaning up result files older than {self.cleanup_old_files_days} days...")
                            file_count = self.db.cleanup_old_task_files(days=self.cleanup_old_files_days)
                            if file_count > 0:
                                logger.info(f"✅ Cleaned up {file_count} result directories (DB records kept)")

                        # Delete very old DB records (optional; disabled by default)
                        if self.cleanup_old_records_days > 0:
                            logger.warning(
                                f"🗑️ Cleaning up database records older than {self.cleanup_old_records_days} days..."
                            )
                            record_count = self.db.cleanup_old_task_records(days=self.cleanup_old_records_days)
                            if record_count > 0:
                                logger.warning(f"⚠️ Deleted {record_count} task records permanently")

                    # Wait for the next monitoring cycle
                    await asyncio.sleep(self.monitor_interval)

                except Exception as e:
                    logger.error(f"Scheduler loop error: {e}")
                    await asyncio.sleep(self.monitor_interval)

        logger.info("⏹️ Task scheduler stopped")

    def start(self):
        """Start the scheduler."""
        logger.info("🚀 Starting MinerU Tianshu Task Scheduler...")

        # Install signal handlers
        def signal_handler(sig, frame):
            logger.info("\n🛑 Received stop signal, shutting down...")
            self.running = False

        signal.signal(signal.SIGINT, signal_handler)
        signal.signal(signal.SIGTERM, signal_handler)

        # Run the scheduling loop
        asyncio.run(self.schedule_loop())

    def stop(self):
        """Stop the scheduler."""
        self.running = False


async def health_check(litserve_url: str) -> bool:
    """Health check: verify that the LitServe worker is reachable."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(
                litserve_url.replace('/predict', '/health'),
                timeout=aiohttp.ClientTimeout(total=5)
            ) as resp:
                return resp.status == 200
    except Exception:
        return False


if __name__ == '__main__':
    import argparse
    import time

    parser = argparse.ArgumentParser(description='MinerU Tianshu Task Scheduler (Optional)')
    parser.add_argument('--litserve-url', type=str, default='http://localhost:9000/predict',
                        help='LitServe worker URL')
    parser.add_argument('--monitor-interval', type=int, default=300,
                        help='Monitor interval in seconds (default: 300s = 5 minutes)')
    parser.add_argument('--health-check-interval', type=int, default=900,
                        help='Health check interval in seconds (default: 900s = 15 minutes)')
    parser.add_argument('--stale-task-timeout', type=int, default=60,
                        help='Timeout for stale tasks in minutes (default: 60)')
    parser.add_argument('--cleanup-old-files-days', type=int, default=7,
                        help='Delete result files older than N days (0=disable, default: 7)')
    parser.add_argument('--cleanup-old-records-days', type=int, default=0,
                        help='Delete DB records older than N days (0=disable, NOT recommended)')
    parser.add_argument('--wait-for-workers', action='store_true',
                        help='Wait for workers to be ready before starting')
    parser.add_argument('--no-worker-auto-mode', action='store_true',
                        help='Disable worker auto-loop mode assumption')

    args = parser.parse_args()

    # Optionally wait for workers to come up
    if args.wait_for_workers:
        logger.info("⏳ Waiting for LitServe workers to be ready...")
        max_retries = 30
        for i in range(max_retries):
            if asyncio.run(health_check(args.litserve_url)):
                logger.info("✅ LitServe workers are ready!")
                break
            time.sleep(2)
            if i == max_retries - 1:
                logger.error("❌ LitServe workers not responding, starting anyway...")

    # Create and start the scheduler
    scheduler = TaskScheduler(
        litserve_url=args.litserve_url,
        monitor_interval=args.monitor_interval,
        health_check_interval=args.health_check_interval,
        stale_task_timeout=args.stale_task_timeout,
        cleanup_old_files_days=args.cleanup_old_files_days,
        cleanup_old_records_days=args.cleanup_old_records_days,
        worker_auto_mode=not args.no_worker_auto_mode
    )

    try:
        scheduler.start()
    except KeyboardInterrupt:
        logger.info("👋 Scheduler interrupted by user")
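The scheduler's cadence logic (counter × monitor interval compared against a target interval) can be checked in isolation. With the defaults — a 300 s monitor interval and a 900 s health-check interval — a health check fires on every third monitoring cycle:

```python
# Sketch of the counter-based cadence used in schedule_loop().
monitor_interval = 300        # seconds between monitoring cycles
health_check_interval = 900   # desired seconds between health checks

fired_on = []
counter = 0
for cycle in range(1, 10):    # simulate 9 monitoring cycles
    counter += 1
    if counter * monitor_interval >= health_check_interval:
        counter = 0           # reset, exactly as schedule_loop() does
        fired_on.append(cycle)

print(fired_on)  # → [3, 6, 9]
```

The same arithmetic drives the stale-task reset (every `stale_task_timeout * 60` seconds) and the 24-hour cleanup cycle.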
@@ -1,85 +0,0 @@
# MinerU v2.0 Multi-GPU Server

[简体中文](README_zh.md)

A streamlined multi-GPU server implementation.

## Quick Start

### 1. Install MinerU

```bash
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
uv pip install litserve aiohttp loguru
```

### 2. Start the Server

```bash
python server.py
```

### 3. Start the Client

```bash
python client.py
```

Now the PDF files under the [demo](../../demo/) folder will be processed in parallel. Assuming you have 2 GPUs, if you change `workers_per_device` to `2`, 4 PDF files will be processed at the same time!

## Customize

### Server

Example showing how to start the server with custom settings:

```python
server = ls.LitServer(
    MinerUAPI(output_dir='/tmp/mineru_output'),
    accelerator='auto',     # You can specify 'cuda'
    devices='auto',         # "auto" uses all available GPUs
    workers_per_device=1,   # One worker instance per GPU
    timeout=False           # Disable timeout for long processing
)
server.run(port=8000, generate_client_file=False)
```

### Client

The client supports both synchronous and asynchronous processing:

```python
import asyncio
import aiohttp
from client import mineru_parse_async


async def process_documents():
    async with aiohttp.ClientSession() as session:
        # Basic usage
        result = await mineru_parse_async(session, 'document.pdf')

        # With custom options
        result = await mineru_parse_async(
            session,
            'document.pdf',
            backend='pipeline',
            lang='ch',
            formula_enable=True,
            table_enable=True
        )

# Run async processing
asyncio.run(process_documents())
```

### Concurrent Processing

Process multiple files simultaneously:

```python
async def process_multiple_files():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    async with aiohttp.ClientSession() as session:
        tasks = [mineru_parse_async(session, file) for file in files]
        results = await asyncio.gather(*tasks)

    return results
```
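### Payload Format

If you want to call the server from another tool or language, note that `/predict` simply takes a JSON body with a base64-encoded file plus an options dict — this mirrors what `mineru_parse_async` sends, with field names taken from `decode_request` in `server.py`. A minimal sketch of building that payload:

```python
import base64
import json

def build_payload(pdf_bytes, **options):
    # Same shape as the payload sent by mineru_parse_async:
    # {'file': '<base64 string>', 'options': {...}}
    return {
        'file': base64.b64encode(pdf_bytes).decode('utf-8'),
        'options': options,
    }

payload = build_payload(b'%PDF-1.7 ...', backend='pipeline', lang='ch')
print(json.dumps(payload['options']))
```

POST this JSON to `http://127.0.0.1:8000/predict`; on success the response contains the `output_dir` of the parsed result.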
@@ -1,87 +0,0 @@
# MinerU v2.0 Multi-GPU Server (README_zh)

[English](README.md)

A streamlined multi-GPU server implementation.

## Quick Start

### 1. Install MinerU

```bash
pip install --upgrade pip
pip install uv
uv pip install -U "mineru[core]"
uv pip install litserve aiohttp loguru
```

### 2. Start the Server

```bash
python server.py
```

### 3. Start the Client

```bash
python client.py
```

Now the PDF files under the [demo](../../demo/) folder will be processed in parallel. Assuming you have 2 GPUs, if you change `workers_per_device` to `2`, 4 PDF files will be processed at the same time!

## Customize

### Server

The following example shows how to start the server with custom settings:

```python
server = ls.LitServer(
    MinerUAPI(output_dir='/tmp/mineru_output'),  # Custom output folder
    accelerator='auto',     # You can specify 'cuda'
    devices='auto',         # "auto" uses all available GPUs
    workers_per_device=1,   # One worker instance per GPU
    timeout=False           # Disable timeout for long processing
)
server.run(port=8000, generate_client_file=False)
```

### Client

The client supports both synchronous and asynchronous processing:

```python
import asyncio
import aiohttp
from client import mineru_parse_async


async def process_documents():
    async with aiohttp.ClientSession() as session:
        # Basic usage
        result = await mineru_parse_async(session, 'document.pdf')

        # With custom options
        result = await mineru_parse_async(
            session,
            'document.pdf',
            backend='pipeline',
            lang='ch',
            formula_enable=True,
            table_enable=True
        )

# Run async processing
asyncio.run(process_documents())
```

### Concurrent Processing

Process multiple files simultaneously:

```python
async def process_multiple_files():
    files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf']

    async with aiohttp.ClientSession() as session:
        tasks = [mineru_parse_async(session, file) for file in files]
        results = await asyncio.gather(*tasks)

    return results
```
@@ -1,60 +0,0 @@
import logging
import os

import requests

logging.basicConfig(level=logging.INFO)

# Timeout (seconds) for the connectivity checks below
TIMEOUT = 3


def config_endpoint():
    """
    Checks for connectivity to Hugging Face and sets the model source accordingly.
    If the Hugging Face endpoint is reachable, it sets MINERU_MODEL_SOURCE to 'huggingface'.
    Otherwise, it falls back to 'modelscope'.
    """
    os.environ.setdefault('MINERU_MODEL_SOURCE', 'huggingface')
    model_list_url = "https://huggingface.co/models"
    modelscope_url = "https://modelscope.cn/models"

    # Use a specific check for the Hugging Face source
    if os.environ['MINERU_MODEL_SOURCE'] == 'huggingface':
        try:
            response = requests.head(model_list_url, timeout=TIMEOUT)

            # Check for any successful status code (2xx)
            if response.ok:
                logging.info("Successfully connected to Hugging Face. Using 'huggingface' as model source.")
                return True
            else:
                logging.warning(f"Hugging Face endpoint returned a non-2xx status code: {response.status_code}")

        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to connect to Hugging Face at {model_list_url}: {e}")

        # If any of the above checks fail, switch to modelscope
        logging.info("Falling back to 'modelscope' as model source.")
        os.environ['MINERU_MODEL_SOURCE'] = 'modelscope'

    elif os.environ['MINERU_MODEL_SOURCE'] == 'modelscope':
        try:
            response = requests.head(modelscope_url, timeout=TIMEOUT)
            if response.ok:
                logging.info("Successfully connected to ModelScope. Using 'modelscope' as model source.")
                return True
        except requests.exceptions.RequestException as e:
            logging.error(f"Failed to connect to ModelScope at {modelscope_url}: {e}")

    elif os.environ['MINERU_MODEL_SOURCE'] == 'local':
        logging.info("Using 'local' as model source.")
        return True

    else:
        logging.info(f"Using custom model source: {os.environ['MINERU_MODEL_SOURCE']}")
        return True

    return False


if __name__ == '__main__':
    print(config_endpoint())
@@ -1,82 +0,0 @@
import asyncio
import base64
import os

import aiohttp
from loguru import logger


async def mineru_parse_async(session, file_path, url='http://127.0.0.1:8000/predict', **options):
    """
    Asynchronous version of the parse function.
    """
    try:
        # Read and base64-encode the file (a plain blocking read; fine for small files)
        with open(file_path, 'rb') as f:
            file_b64 = base64.b64encode(f.read()).decode('utf-8')

        payload = {
            'file': file_b64,
            'options': options
        }

        # Use the aiohttp session to send the request
        async with session.post(url, json=payload) as response:
            if response.status == 200:
                result = await response.json()
                logger.info(f"✅ Processed: {file_path} -> {result.get('output_dir', 'N/A')}")
                return result
            else:
                error_text = await response.text()
                logger.error(f"❌ Server error for {file_path}: {error_text}")
                return {'error': error_text}

    except Exception as e:
        logger.error(f"❌ Failed to process {file_path}: {e}")
        return {'error': str(e)}


async def main():
    """
    Main function to run all parsing tasks concurrently.
    """
    test_files = [
        '../../demo/pdfs/demo1.pdf',
        '../../demo/pdfs/demo2.pdf',
        '../../demo/pdfs/demo3.pdf',
        '../../demo/pdfs/small_ocr.pdf',
    ]

    test_files = [os.path.join(os.path.dirname(__file__), f) for f in test_files]

    existing_files = [f for f in test_files if os.path.exists(f)]
    if not existing_files:
        logger.warning("No test files found.")
        return

    # Create an aiohttp session to be reused across requests
    async with aiohttp.ClientSession() as session:
        # === Basic Processing ===
        basic_tasks = [mineru_parse_async(session, file_path) for file_path in existing_files[:2]]

        # === Custom Options ===
        custom_options = {
            'backend': 'pipeline', 'lang': 'ch', 'method': 'auto',
            'formula_enable': True, 'table_enable': True,
            # Example for a remote vlm server (vllm/sglang/lmdeploy...):
            # 'backend': 'vlm-http-client', 'server_url': 'http://127.0.0.1:30000',
        }

        custom_tasks = [mineru_parse_async(session, file_path, **custom_options) for file_path in existing_files[2:]]

        # Run all tasks concurrently
        all_tasks = basic_tasks + custom_tasks
        all_results = await asyncio.gather(*all_tasks)

        logger.info(f"All Results: {all_results}")

    logger.info("🎉 All processing completed!")


if __name__ == '__main__':
    # Run the async main function
    asyncio.run(main())
@@ -1,108 +0,0 @@
import base64
import os
import tempfile
from pathlib import Path

import litserve as ls
from fastapi import HTTPException
from loguru import logger

from mineru.cli.common import do_parse, read_fn
from mineru.utils.config_reader import get_device
from mineru.utils.model_utils import get_vram
from _config_endpoint import config_endpoint


class MinerUAPI(ls.LitAPI):
    def __init__(self, output_dir='/tmp'):
        super().__init__()
        self.output_dir = output_dir

    def setup(self, device):
        """Set up environment variables exactly like the MinerU CLI does."""
        logger.info(f"Setting up on device: {device}")

        if os.getenv('MINERU_DEVICE_MODE') is None:
            os.environ['MINERU_DEVICE_MODE'] = device if device != 'auto' else get_device()

        device_mode = os.environ['MINERU_DEVICE_MODE']
        if os.getenv('MINERU_VIRTUAL_VRAM_SIZE') is None:
            if device_mode.startswith("cuda") or device_mode.startswith("npu"):
                vram = get_vram(device_mode)
                os.environ['MINERU_VIRTUAL_VRAM_SIZE'] = str(vram)
            else:
                os.environ['MINERU_VIRTUAL_VRAM_SIZE'] = '1'
        logger.info(f"MINERU_VIRTUAL_VRAM_SIZE: {os.environ['MINERU_VIRTUAL_VRAM_SIZE']}")

        if os.getenv('MINERU_MODEL_SOURCE') in ['huggingface', None]:
            config_endpoint()
        logger.info(f"MINERU_MODEL_SOURCE: {os.environ['MINERU_MODEL_SOURCE']}")

    def decode_request(self, request):
        """Decode the file and options from the request."""
        file_b64 = request['file']
        options = request.get('options', {})

        file_bytes = base64.b64decode(file_b64)
        with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as temp:
            temp.write(file_bytes)
            temp_file = Path(temp.name)
        return {
            'input_path': str(temp_file),
            'backend': options.get('backend', 'pipeline'),
            'method': options.get('method', 'auto'),
            'lang': options.get('lang', 'ch'),
            'formula_enable': options.get('formula_enable', True),
            'table_enable': options.get('table_enable', True),
            'start_page_id': options.get('start_page_id', 0),
            'end_page_id': options.get('end_page_id', None),
            'server_url': options.get('server_url', None),
        }

    def predict(self, inputs):
        """Call MinerU's do_parse - same as the CLI."""
        input_path = inputs['input_path']
        output_dir = Path(self.output_dir)

        try:
            os.makedirs(output_dir, exist_ok=True)

            file_name = Path(input_path).stem
            pdf_bytes = read_fn(Path(input_path))

            do_parse(
                output_dir=str(output_dir),
                pdf_file_names=[file_name],
                pdf_bytes_list=[pdf_bytes],
                p_lang_list=[inputs['lang']],
                backend=inputs['backend'],
                parse_method=inputs['method'],
                formula_enable=inputs['formula_enable'],
                table_enable=inputs['table_enable'],
                server_url=inputs['server_url'],
                start_page_id=inputs['start_page_id'],
                end_page_id=inputs['end_page_id']
            )

            return str(output_dir / file_name)

        except Exception as e:
            logger.error(f"Processing failed: {e}")
            raise HTTPException(status_code=500, detail=str(e))
        finally:
            # Clean up the temp file
            if Path(input_path).exists():
                Path(input_path).unlink()

    def encode_response(self, response):
        return {'output_dir': response}


if __name__ == '__main__':
    server = ls.LitServer(
        MinerUAPI(output_dir='/tmp/mineru_output'),
        accelerator='auto',
        devices='auto',
        workers_per_device=1,
        timeout=False
    )
    logger.info("Starting MinerU server on port 8000")
    server.run(port=8000, generate_client_file=False)