Compare commits

1 Commits

Author SHA1 Message Date
Xiaomeng Zhao
cebfa5f47e Merge pull request #2387 from opendatalab/master
update version
2025-04-27 18:29:26 +08:00
22 changed files with 62 additions and 119 deletions

View File

@@ -48,10 +48,9 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/04/29 1.3.10 Released
- Custom formula delimiters are now supported by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
- 2025/04/27 1.3.9 Released
- Optimized the formula parsing function to improve the success rate of formula rendering
- Optimized the formula parsing function to improve the success rate of formula rendering
- Updated `pdfminer.six` to the latest version, fixing some abnormal PDF parsing issues
- 2025/04/23 1.3.8 Released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` is trained on a mix of more Chinese document data and PP-OCR training data, enhancing recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It supports over 15,000 recognizable characters, improving text recognition in documents while also boosting general text recognition.
@@ -353,7 +352,7 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.13</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -363,7 +362,8 @@ There are three different ways to experience MinerU:
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -394,7 +394,7 @@ Synced with dev branch updates:
#### 1. Install magic-pdf
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"
```

View File

@@ -47,10 +47,9 @@
</div>
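# Changelog
- 2025/04/29 1.3.10 Released
- Custom formula delimiters are now supported by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
- 2025/04/27 1.3.9 Released
- Optimized the formula parsing function to improve the success rate of formula rendering
- Updated `pdfminer.six` to the latest version, fixing some abnormal PDF parsing issues
- 2025/04/23 1.3.8 Released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` builds on `PP-OCRv4_server_rec` and is trained on a mix of more Chinese document data and PP-OCR training data, adding recognition of some traditional Chinese characters, Japanese, and special characters. It supports more than 15,000 recognizable characters and improves both document text recognition and general text recognition.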
@@ -342,7 +341,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.13</td>
<td colspan="3">>=3.10</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -352,7 +351,8 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -387,7 +387,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
> Syncing the latest version to the domestic mirror may be delayed; please be patient
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
```

View File

@@ -45,7 +45,7 @@ RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/m
pip3 install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple"
# Download models and update the configuration file
RUN /bin/bash -c "pip3 install modelscope -i https://mirrors.aliyun.com/pypi/simple && \
RUN /bin/bash -c "pip3 install modelscope && \
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py && \
python3 download_models.py && \
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"

View File

@@ -54,7 +54,7 @@ In the final step, enter `yes`, close the terminal, and reopen it.
### 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```

View File

@@ -54,7 +54,7 @@ bash Anaconda3-2024.06-1-Linux-x86_64.sh
## 4. Create an Environment Using Conda
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```

View File

@@ -2,12 +2,11 @@
### 1. Install CUDA and cuDNN
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
### 2. Install Anaconda
@@ -18,7 +17,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
### 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -64,7 +63,7 @@ If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-
1. **Overwrite the installation of torch and torchvision** with CUDA-enabled builds. (Select the appropriate index-url based on your CUDA version. For more details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
```
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
```
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.

View File

@@ -1,13 +1,12 @@
# Windows10/11
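## 1. Install the CUDA environment
## 1. Install CUDA and cuDNN
You need to install a CUDA version that meets torch's requirements. For details, see the [torch official website](https://pytorch.org/get-started/locally/)
You need to install a CUDA version that meets torch's requirements. torch currently supports 11.8/12.4/12.6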
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
## 2. Install Anaconda
@@ -19,7 +18,7 @@ https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Window
## 3. Create an Environment Using Conda
```bash
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
```
@@ -65,7 +64,7 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
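**1. Overwrite the installation of torch and torchvision with CUDA-enabled builds** (select the appropriate index-url for your CUDA version; for details, see the [torch official website](https://pytorch.org/get-started/locally/))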
```bash
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
```
**2. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**

View File

@@ -20,16 +20,6 @@
"enable": true,
"max_time": 400
},
"latex-delimiter-config": {
"display": {
"left": "$$",
"right": "$$"
},
"inline": {
"left": "$",
"right": "$"
}
},
"llm-aided-config": {
"formula_aided": {
"api_key": "your_api_key",
@@ -50,5 +40,5 @@
"enable": false
}
},
"config_version": "1.2.1"
"config_version": "1.2.0"
}

View File

@@ -5,7 +5,6 @@ from loguru import logger
from magic_pdf.config.make_content_config import DropMode, MakeMode
from magic_pdf.config.ocr_content_type import BlockType, ContentType
from magic_pdf.libs.commons import join_path
from magic_pdf.libs.config_reader import get_latex_delimiter_config
from magic_pdf.libs.language import detect_lang
from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
from magic_pdf.post_proc.para_split_v3 import ListLineTag
@@ -146,19 +145,6 @@ def full_to_half(text: str) -> str:
result.append(char)
return ''.join(result)
latex_delimiters_config = get_latex_delimiter_config()
default_delimiters = {
'display': {'left': '$$', 'right': '$$'},
'inline': {'left': '$', 'right': '$'}
}
delimiters = latex_delimiters_config if latex_delimiters_config else default_delimiters
display_left_delimiter = delimiters['display']['left']
display_right_delimiter = delimiters['display']['right']
inline_left_delimiter = delimiters['inline']['left']
inline_right_delimiter = delimiters['inline']['right']
def merge_para_with_text(para_block):
block_text = ''
@@ -182,9 +168,9 @@ def merge_para_with_text(para_block):
if span_type == ContentType.Text:
content = ocr_escape_special_markdown_char(span['content'])
elif span_type == ContentType.InlineEquation:
content = f"{inline_left_delimiter}{span['content']}{inline_right_delimiter}"
content = f"${span['content']}$"
elif span_type == ContentType.InterlineEquation:
content = f"\n{display_left_delimiter}\n{span['content']}\n{display_right_delimiter}\n"
content = f"\n$$\n{span['content']}\n$$\n"
content = content.strip()
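
Note on this hunk: one side of the diff wraps equations with delimiters taken from `latex-delimiter-config`, while the other hard-codes `$` and `$$`. The following is a minimal sketch of the configurable variant, assuming a config dict shaped like the `latex-delimiter-config` block in the `magic-pdf.json` template. The helper names `resolve_delimiters`, `wrap_inline`, and `wrap_display` are hypothetical and are not part of magic-pdf.

```python
# Defaults mirror the '$' / '$$' delimiters shown in the diff.
DEFAULT_DELIMITERS = {
    'display': {'left': '$$', 'right': '$$'},
    'inline': {'left': '$', 'right': '$'},
}


def resolve_delimiters(config):
    """Return the configured delimiters, or the defaults when config is None or empty."""
    return config if config else DEFAULT_DELIMITERS


def wrap_inline(latex, delimiters):
    """Wrap an inline equation with the inline delimiter pair (hypothetical helper)."""
    d = delimiters['inline']
    return f"{d['left']}{latex}{d['right']}"


def wrap_display(latex, delimiters):
    """Wrap a display equation on its own lines, as the f-string in the diff does."""
    d = delimiters['display']
    return f"\n{d['left']}\n{latex}\n{d['right']}\n"
```

Passing `None` yields the same output as the hard-coded side of the diff, so the two variants agree whenever no custom delimiters are configured.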

View File

@@ -125,15 +125,6 @@ def get_llm_aided_config():
else:
return llm_aided_config
def get_latex_delimiter_config():
config = read_config()
latex_delimiter_config = config.get('latex-delimiter-config')
if latex_delimiter_config is None:
logger.warning(f"'latex-delimiter-config' not found in {CONFIG_FILE_NAME}, use 'None' as default")
return None
else:
return latex_delimiter_config
if __name__ == '__main__':
ak, sk, endpoint = get_s3_config('llm-raw')
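
Note on this hunk: `get_latex_delimiter_config` returns the `latex-delimiter-config` entry, or `None` when the key is absent. The body of `read_config()` is not shown in the diff, so the sketch below parses JSON text directly. The function name `parse_latex_delimiter_config` is hypothetical and only illustrates the lookup-with-`None`-fallback behavior.

```python
import json


def parse_latex_delimiter_config(config_text):
    """Parse JSON config text and return the 'latex-delimiter-config' entry, or None."""
    config = json.loads(config_text)
    # dict.get returns None for a missing key, matching the fallback in the diff.
    return config.get('latex-delimiter-config')
```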

View File

@@ -1 +1 @@
__version__ = "1.3.10"
__version__ = "1.3.9"

View File

@@ -156,10 +156,7 @@ def doc_analyze(
batch_images = [images_with_extra_info]
results = []
processed_images_count = 0
for index, batch_image in enumerate(batch_images):
processed_images_count += len(batch_image)
logger.info(f'Batch {index + 1}/{len(batch_images)}: {processed_images_count} pages/{len(images_with_extra_info)} pages')
for batch_image in batch_images:
result = may_batch_image_analyze(batch_image, ocr, show_log,layout_model, formula_enable, table_enable)
results.extend(result)
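
Note on this hunk: one side of the diff logs cumulative progress per batch, and the other iterates without logging. The progress arithmetic can be sketched in isolation as follows. `batch_progress` is a hypothetical generator, and it assumes the total page count equals the sum of the batch sizes, whereas the diff uses `len(images_with_extra_info)`.

```python
def batch_progress(batches):
    """Yield (message, batch) pairs with a running count of processed pages."""
    total = sum(len(batch) for batch in batches)  # assumed total, see lead-in
    processed = 0
    for index, batch in enumerate(batches):
        processed += len(batch)
        message = f'Batch {index + 1}/{len(batches)}: {processed} pages/{total} pages'
        yield message, batch
```

A caller could pass each message to `logger.info` before analyzing the corresponding batch.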

View File

@@ -66,9 +66,9 @@ LEFT_RIGHT_REMOVE_PATTERN = re.compile(r'\\left\.?|\\right\.?')
def fix_latex_left_right(s):
"""
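Fix \\left and \\right commands in LaTeX
Fix \left and \right commands in LaTeX
1. Ensure they are followed by a valid delimiter
2. Balance the number of \\left and \\right commands
2. Balance the number of \left and \right commands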
"""
# Whitelist of delimiters
valid_delims_list = [r'(', r')', r'[', r']', r'{', r'}', r'/', r'|',
@@ -106,7 +106,7 @@ def fix_latex_left_right(s):
def fix_left_right_pairs(latex_formula):
"""
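Detect and fix cases where \\left and \\right are not in the same group in a LaTeX formula
Detect and fix cases where \left and \right are not in the same group in a LaTeX formula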
Args:
latex_formula (str): the input LaTeX formula
@@ -308,9 +308,9 @@ ENV_FORMAT_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}\{([^}]*)\}') fo
def fix_latex_environments(s):
"""
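Check whether \\begin and \\end match for LaTeX environments (such as array)
1. If a \\begin tag is missing, add it at the beginning
2. If an \\end tag is missing, add it at the end
Check whether \begin and \end match for LaTeX environments (such as array)
1. If a \begin tag is missing, add it at the beginning
2. If an \end tag is missing, add it at the end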
"""
for env in ENV_TYPES:
begin_count = len(ENV_BEGIN_PATTERNS[env].findall(s))
@@ -334,7 +334,7 @@ def fix_latex_environments(s):
UP_PATTERN = re.compile(r'\\up([a-zA-Z]+)')
COMMANDS_TO_REMOVE_PATTERN = re.compile(
r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph|protect|null)')
r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph)')
REPLACEMENTS_PATTERNS = {
re.compile(r'\\underbar'): r'\\underline',
re.compile(r'\\Bar'): r'\\hat',
@@ -342,13 +342,7 @@ REPLACEMENTS_PATTERNS = {
re.compile(r'\\Tilde'): r'\\tilde',
re.compile(r'\\slash'): r'/',
re.compile(r'\\textperthousand'): r'',
re.compile(r'\\sun'): r'',
re.compile(r'\\textunderscore'): r'\\_',
re.compile(r'\\fint'): r'',
re.compile(r'\\up '): r'\\ ',
re.compile(r'\\vline = '): r'\\models ',
re.compile(r'\\vDash '): r'\\models ',
re.compile(r'\\sq \\sqcup '): r'\\square ',
re.compile(r'\\sun'): r''
}
QQUAD_PATTERN = re.compile(r'\\qquad(?!\s)')
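
Note on this hunk: the diff changes which commands are stripped and which are rewritten, but it does not show how the patterns are applied. The sketch below assumes a plain `re.sub` pass over each pattern, using a subset of the entries from the diff. The function name `clean_latex` is hypothetical.

```python
import re

# Subset of the patterns in the diff, kept short for illustration.
COMMANDS_TO_REMOVE_PATTERN = re.compile(
    r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph)')
REPLACEMENTS_PATTERNS = {
    re.compile(r'\\underbar'): r'\\underline',
    re.compile(r'\\slash'): r'/',
    re.compile(r'\\textunderscore'): r'\\_',
}


def clean_latex(s):
    """Strip unsupported commands, then apply each replacement (assumed order)."""
    s = COMMANDS_TO_REMOVE_PATTERN.sub('', s)
    for pattern, replacement in REPLACEMENTS_PATTERNS.items():
        # In a replacement string, '\\\\' (raw r'\\') emits one literal backslash.
        s = pattern.sub(replacement, s)
    return s
```

The doubled backslash in the replacement strings is required because `re.sub` processes escapes in the replacement; a single backslash before a letter such as `u` would be rejected as a bad escape.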

View File

@@ -172,8 +172,8 @@ def filter_nested_tables(table_res_list, overlap_threshold=0.8, area_threshold=0
tables_inside = [j for j in range(len(table_res_list))
if i != j and is_inside(table_info[j], table_info[i], overlap_threshold)]
# Continue if there are at least 3 tables inside
if len(tables_inside) >= 3:
# Continue if there are at least 2 tables inside
if len(tables_inside) >= 2:
# Check if inside tables overlap with each other
tables_overlap = any(do_overlap(table_info[tables_inside[idx1]], table_info[tables_inside[idx2]])
for idx1 in range(len(tables_inside))

View File

@@ -76,11 +76,11 @@ In the final step, enter ``yes``, close the terminal, and reopen it.
4. Create an Environment Using Conda
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Specify Python version 3.10~3.13.
Specify Python version 3.10.
.. code:: sh
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
5. Install Applications
@@ -155,15 +155,14 @@ to test CUDA acceleration:
Windows 10/11
--------------
1. Install CUDA
1. Install CUDA and cuDNN
~~~~~~~~~~~~~~~~~~~~~~~~~
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
2. Install Anaconda
@@ -178,7 +177,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
::
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
4. Install Applications

View File

@@ -61,7 +61,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">Python Version</td>
<td colspan="3">3.10~3.13</td>
<td colspan="3">3.10~3.12</td>
</tr>
<tr>
<td colspan="3">Nvidia Driver Version</td>
@@ -71,7 +71,8 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
</tr>
<tr>
<td colspan="3">CUDA Environment</td>
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
<td>11.8/12.4/12.6/12.8</td>
<td>11.8/12.4/12.6/12.8</td>
<td>None</td>
</tr>
<tr>
@@ -96,7 +97,7 @@ Create an environment
.. code-block:: shell
conda create -n mineru 'python=3.12' -y
conda create -n mineru 'python>=3.10' -y
conda activate mineru
pip install -U "magic-pdf[full]"

View File

@@ -117,12 +117,8 @@ def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table
return md_content, txt_content, archive_zip_path, new_pdf_path
latex_delimiters = [
{'left': '$$', 'right': '$$', 'display': True},
{'left': '$', 'right': '$', 'display': False},
{'left': '\\(', 'right': '\\)', 'display': False},
{'left': '\\[', 'right': '\\]', 'display': True},
]
latex_delimiters = [{'left': '$$', 'right': '$$', 'display': True},
{'left': '$', 'right': '$', 'display': False}]
def init_model():
@@ -222,8 +218,7 @@ if __name__ == '__main__':
with gr.Tabs():
with gr.Tab('Markdown rendering'):
md = gr.Markdown(label='Markdown rendering', height=1100, show_copy_button=True,
latex_delimiters=latex_delimiters,
line_breaks=True)
latex_delimiters=latex_delimiters, line_breaks=True)
with gr.Tab('Markdown text'):
md_text = gr.TextArea(lines=45, show_copy_button=True)
file.change(fn=to_pdf, inputs=file, outputs=pdf_show)

View File

@@ -4,7 +4,9 @@
## 环境配置
请使用以下命令配置所需的环境:
```bash
pip install -U magic-pdf[full] litserve python-multipart filetype
pip install -U litserve python-multipart filetype
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118
```
## Quick Start

View File

@@ -21,7 +21,6 @@ from magic_pdf.libs.config_reader import get_bucket_name, get_s3_config
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.operators.models import InferenceResult
from magic_pdf.operators.pipes import PipeResult
from fastapi import Form
model_config.__use_inside_model__ = True
@@ -103,7 +102,6 @@ def init_writers(
# Process the uploaded file
file_bytes = file.file.read()
file_extension = os.path.splitext(file.filename)[1]
writer = FileBasedDataWriter(output_path)
image_writer = FileBasedDataWriter(output_image_path)
os.makedirs(output_image_path, exist_ok=True)
@@ -178,14 +176,14 @@ def encode_image(image_path: str) -> str:
)
async def file_parse(
file: UploadFile = None,
file_path: str = Form(None),
parse_method: str = Form("auto"),
is_json_md_dump: bool = Form(False),
output_dir: str = Form("output"),
return_layout: bool = Form(False),
return_info: bool = Form(False),
return_content_list: bool = Form(False),
return_images: bool = Form(False),
file_path: str = None,
parse_method: str = "auto",
is_json_md_dump: bool = False,
output_dir: str = "output",
return_layout: bool = False,
return_info: bool = False,
return_content_list: bool = False,
return_images: bool = False,
):
"""
Execute the process of converting PDF to JSON and MD, outputting MD and JSON files

View File

@@ -7,9 +7,9 @@ numpy>=1.21.6
pydantic>=2.7.2,<2.11
PyMuPDF>=1.24.9,<1.25.0
scikit-learn>=1.0.2
torch>=2.2.2,!=2.5.0,!=2.5.1,<3
torch>=2.2.2,!=2.5.0,!=2.5.1
torchvision
transformers>=4.49.0,!=4.51.0,<5.0.0
pdfminer.six==20250506
pdfminer.six>=20250416
tqdm>=4.67.1
# The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.

View File

@@ -81,7 +81,7 @@ if __name__ == '__main__':
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
],
python_requires=">=3.10,<3.14", # Python versions required by the project
python_requires=">=3.10,<4", # Python versions required by the project
entry_points={
"console_scripts": [
"magic-pdf = magic_pdf.tools.cli:cli",

View File

@@ -255,14 +255,6 @@
"created_at": "2025-04-25T02:54:20Z",
"repoId": 765083837,
"pullRequestNo": 2367
},
{
"name": "CharlesKeeling65",
"id": 94165417,
"comment_id": 2841356871,
"created_at": "2025-04-30T09:25:31Z",
"repoId": 765083837,
"pullRequestNo": 2411
}
]
}