mirror of
https://github.com/opendatalab/MinerU.git
synced 2026-03-27 11:08:32 +07:00
Compare commits
67 Commits
release-1.
...
magic_pdf-
| Author | SHA1 | Date | |
|---|---|---|---|
| | ea619281ef | | |
| | 212cfcf24a | | |
| | cda85d6262 | | |
| | 51ceb48014 | | |
| | 0b8c614280 | | |
| | c1b387abe6 | | |
| | 1ab54ac2e3 | | |
| | 78a0208425 | | |
| | cd785f6af8 | | |
| | a8f752f753 | | |
| | 65f332ffae | | |
| | c4b04ae642 | | |
| | 3858d918dd | | |
| | 70696165c7 | | |
| | b799d302c2 | | |
| | 9351d64a41 | | |
| | 3230793b55 | | |
| | 9f0d45bb58 | | |
| | 6c9645aa0c | | |
| | 96fb646a86 | | |
| | 71a429a32e | | |
| | 201e338b3a | | |
| | 2a28f604c6 | | |
| | 38d0a622d9 | | |
| | a8ca183094 | | |
| | 11bf98d0aa | | |
| | 50700646e4 | | |
| | 862891e294 | | |
| | f0b66d3aab | | |
| | b29b73af21 | | |
| | 5e8656c74f | | |
| | 2aaf2310f2 | | |
| | 8802687934 | | |
| | 2c2fcbe832 | | |
| | 9c37d65fab | | |
| | 49a8f8be0a | | |
| | 5e15d9b664 | | |
| | 81daf298b5 | | |
| | 2d4e9e544e | | |
| | dfd13fa2ab | | |
| | 2cf55ce1d1 | | |
| | 100e9c17a5 | | |
| | cf33cb882d | | |
| | 98dd179053 | | |
| | 7d77d614ec | | |
| | c060413b19 | | |
| | 1e715d026d | | |
| | 0d5762e57a | | |
| | d68fe15bde | | |
| | 9bdc254456 | | |
| | ebb7df984e | | |
| | e54f8fd31e | | |
| | 9f892a5e9d | | |
| | 623537dd9c | | |
| | c1fbf01c43 | | |
| | 0807e971fe | | |
| | ef854b23aa | | |
| | 2d1a0f2ca6 | | |
| | c8747cffb4 | | |
| | 0299dea199 | | |
| | 2e91fb3f52 | | |
| | 6c1511517a | | |
| | b864062a4f | | |
| | c1558af3ef | | |
| | 2a9ac8939f | | |
| | bfb80cb2e5 | | |
| | 80a80482f3 | | |
13
README.md
@@ -48,6 +48,10 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>
 
 # Changelog
+- 2025/04/29 1.3.10 Released
+  - Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
+- 2025/04/27 1.3.9 Released
+  - Optimized the formula parsing function to improve the success rate of formula rendering
 - 2025/04/23 1.3.8 Released
   - The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
     - `PP-OCRv4_server_rec_doc` is trained on a mix of more Chinese document data and PP-OCR training data, enhancing recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It supports over 15,000 recognizable characters, improving text recognition in documents while also boosting general text recognition.
@@ -349,7 +353,7 @@ There are three different ways to experience MinerU:
     </tr>
     <tr>
         <td colspan="3">Python Version</td>
-        <td colspan="3">>=3.10</td>
+        <td colspan="3">3.10~3.13</td>
     </tr>
     <tr>
         <td colspan="3">Nvidia Driver Version</td>
@@ -359,8 +363,7 @@ There are three different ways to experience MinerU:
     </tr>
     <tr>
         <td colspan="3">CUDA Environment</td>
-        <td>11.8/12.4/12.6/12.8</td>
-        <td>11.8/12.4/12.6/12.8</td>
+        <td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
         <td>None</td>
     </tr>
     <tr>
@@ -374,7 +377,7 @@ There are three different ways to experience MinerU:
         <td colspan="2">GPU VRAM 6GB or more</td>
         <td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
         More than 6GB VRAM </td>
-        <td rowspan="2">apple slicon</td>
+        <td rowspan="2">Apple silicon</td>
     </tr>
 </table>
 
@@ -391,7 +394,7 @@ Synced with dev branch updates:
 #### 1. Install magic-pdf
 
 ```bash
-conda create -n mineru 'python>=3.10' -y
+conda create -n mineru 'python=3.12' -y
 conda activate mineru
 pip install -U "magic-pdf[full]"
 ```
@@ -47,6 +47,10 @@
 </div>
 
 # Changelog
+- 2025/04/29 1.3.10 released
+  - Custom formula delimiters are supported; configure them via the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
+- 2025/04/27 1.3.9 released
+  - Optimized formula parsing to improve the success rate of formula rendering
 - 2025/04/23 1.3.8 released
   - The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
     - `PP-OCRv4_server_rec_doc` builds on `PP-OCRv4_server_rec` and is trained on a mix of more Chinese document data and PP-OCR training data, adding recognition of some traditional Chinese characters, Japanese, and special characters. It supports 15,000+ recognizable characters and improves general text recognition as well as document text recognition.
@@ -338,7 +342,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
     </tr>
     <tr>
         <td colspan="3">Python version</td>
-        <td colspan="3">>=3.10</td>
+        <td colspan="3">3.10~3.13</td>
     </tr>
     <tr>
         <td colspan="3">Nvidia Driver version</td>
@@ -348,8 +352,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
     </tr>
     <tr>
         <td colspan="3">CUDA environment</td>
-        <td>11.8/12.4/12.6/12.8</td>
-        <td>11.8/12.4/12.6/12.8</td>
+        <td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
         <td>None</td>
     </tr>
     <tr>
@@ -364,7 +367,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
         <td colspan="2">
         All GPUs with Tensor Cores produced from Volta (2017) onwards <br>
         6 GB of VRAM or more</td>
-        <td rowspan="2">apple slicon</td>
+        <td rowspan="2">Apple silicon</td>
     </tr>
 </table>
 
@@ -384,7 +387,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 > Syncing of the latest version to mirrors in China may be delayed; please be patient
 
 ```bash
-conda create -n mineru 'python>=3.10' -y
+conda create -n mineru 'python=3.12' -y
 conda activate mineru
 pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
 ```
@@ -45,7 +45,7 @@ RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/m
|
||||
pip3 install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple"
|
||||
|
||||
# Download models and update the configuration file
|
||||
RUN /bin/bash -c "pip3 install modelscope && \
|
||||
RUN /bin/bash -c "pip3 install modelscope -i https://mirrors.aliyun.com/pypi/simple && \
|
||||
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py && \
|
||||
python3 download_models.py && \
|
||||
sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
|
||||
|
||||
@@ -54,7 +54,7 @@ In the final step, enter `yes`, close the terminal, and reopen it.
|
||||
### 4. Create an Environment Using Conda
|
||||
|
||||
```bash
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
```
|
||||
|
||||
|
||||
@@ -54,7 +54,7 @@ bash Anaconda3-2024.06-1-Linux-x86_64.sh
|
||||
## 4. 使用conda 创建环境
|
||||
|
||||
```bash
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
```
|
||||
|
||||
|
||||
@@ -2,11 +2,12 @@
|
||||
|
||||
### 1. Install CUDA and cuDNN
|
||||
|
||||
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
|
||||
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
|
||||
|
||||
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
|
||||
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
|
||||
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
|
||||
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
|
||||
|
||||
### 2. Install Anaconda
|
||||
|
||||
@@ -17,7 +18,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
|
||||
### 3. Create an Environment Using Conda
|
||||
|
||||
```bash
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
```
|
||||
|
||||
@@ -63,7 +64,7 @@ If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-
|
||||
1. **Overwrite the installation of torch and torchvision** supporting CUDA.(Please select the appropriate index-url based on your CUDA version. For more details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
|
||||
|
||||
```
|
||||
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
|
||||
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
|
||||
```
|
||||
|
||||
2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.
|
||||
|
||||
@@ -1,12 +1,13 @@
|
||||
# Windows10/11
|
||||
|
||||
## 1. 安装cuda和cuDNN
|
||||
## 1. 安装cuda环境
|
||||
|
||||
需要安装符合torch要求的cuda版本,torch目前支持11.8/12.4/12.6
|
||||
需要安装符合torch要求的cuda版本,具体可参考[torch官网](https://pytorch.org/get-started/locally/)
|
||||
|
||||
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
|
||||
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
|
||||
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
|
||||
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
|
||||
|
||||
## 2. 安装anaconda
|
||||
|
||||
@@ -18,7 +19,7 @@ https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Window
|
||||
## 3. 使用conda 创建环境
|
||||
|
||||
```bash
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
```
|
||||
|
||||
@@ -64,7 +65,7 @@ pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
|
||||
**1.覆盖安装支持cuda的torch和torchvision**(请根据cuda版本选择合适的index-url,具体可参考[torch官网](https://pytorch.org/get-started/locally/))
|
||||
|
||||
```bash
|
||||
pip install --force-reinstall torch torchvision "numpy<=2.1.1" --index-url https://download.pytorch.org/whl/cu124
|
||||
pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
|
||||
```
|
||||
|
||||
**2.修改【用户目录】中配置文件magic-pdf.json中"device-mode"的值**
|
||||
|
||||
@@ -20,6 +20,16 @@
         "enable": true,
         "max_time": 400
     },
+    "latex-delimiter-config": {
+        "display": {
+            "left": "$$",
+            "right": "$$"
+        },
+        "inline": {
+            "left": "$",
+            "right": "$"
+        }
+    },
     "llm-aided-config": {
         "formula_aided": {
             "api_key": "your_api_key",
@@ -40,5 +50,5 @@
             "enable": false
         }
     },
-    "config_version": "1.2.0"
+    "config_version": "1.2.1"
 }
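A user who overrides `latex-delimiter-config` may want to check the block's shape before relying on it. The following is a minimal sketch under stated assumptions: the helper `validate_delimiter_config` and the `\[`/`\(` override values are illustrative and are not part of MinerU; only the key layout comes from the template hunk above.

```python
import json

# Hypothetical user override; the key layout follows the template hunk above.
# In JSON, "\\[" decodes to the two-character string \[ .
USER_CONFIG = r'''
{
    "latex-delimiter-config": {
        "display": {"left": "\\[", "right": "\\]"},
        "inline": {"left": "\\(", "right": "\\)"}
    }
}
'''


def validate_delimiter_config(config: dict) -> bool:
    """Return True if both modes define non-empty string delimiters.

    Illustrative helper (an assumption), not a MinerU function.
    """
    section = config.get("latex-delimiter-config")
    if not isinstance(section, dict):
        return False
    for mode in ("display", "inline"):
        pair = section.get(mode)
        if not isinstance(pair, dict):
            return False
        for side in ("left", "right"):
            value = pair.get(side)
            if not isinstance(value, str) or not value:
                return False
    return True


config = json.loads(USER_CONFIG)
delims = config["latex-delimiter-config"]
print(validate_delimiter_config(config), delims["display"]["left"])
```

A check like this would reject a block that defines only one side of a pair, which would otherwise surface later as a `KeyError` when the delimiters are looked up.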
@@ -5,6 +5,7 @@ from loguru import logger
 from magic_pdf.config.make_content_config import DropMode, MakeMode
 from magic_pdf.config.ocr_content_type import BlockType, ContentType
 from magic_pdf.libs.commons import join_path
+from magic_pdf.libs.config_reader import get_latex_delimiter_config
 from magic_pdf.libs.language import detect_lang
 from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
 from magic_pdf.post_proc.para_split_v3 import ListLineTag
@@ -145,6 +146,19 @@ def full_to_half(text: str) -> str:
             result.append(char)
     return ''.join(result)
 
+latex_delimiters_config = get_latex_delimiter_config()
+
+default_delimiters = {
+    'display': {'left': '$$', 'right': '$$'},
+    'inline': {'left': '$', 'right': '$'}
+}
+
+delimiters = latex_delimiters_config if latex_delimiters_config else default_delimiters
+
+display_left_delimiter = delimiters['display']['left']
+display_right_delimiter = delimiters['display']['right']
+inline_left_delimiter = delimiters['inline']['left']
+inline_right_delimiter = delimiters['inline']['right']
 
 def merge_para_with_text(para_block):
     block_text = ''
@@ -168,9 +182,9 @@ def merge_para_with_text(para_block):
             if span_type == ContentType.Text:
                 content = ocr_escape_special_markdown_char(span['content'])
             elif span_type == ContentType.InlineEquation:
-                content = f"${span['content']}$"
+                content = f"{inline_left_delimiter}{span['content']}{inline_right_delimiter}"
             elif span_type == ContentType.InterlineEquation:
-                content = f"\n$$\n{span['content']}\n$$\n"
+                content = f"\n{display_left_delimiter}\n{span['content']}\n{display_right_delimiter}\n"
 
             content = content.strip()
 
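The fallback-then-wrap behaviour in the hunk above can be exercised in isolation. This is a sketch, not the project's code: `render_equation` is a hypothetical stand-in for the two f-string branches, and its `configured` argument stands in for whatever `get_latex_delimiter_config()` returned (`None` when the key is absent).

```python
DEFAULT_DELIMITERS = {
    "display": {"left": "$$", "right": "$$"},
    "inline": {"left": "$", "right": "$"},
}


def render_equation(content: str, kind: str, configured=None) -> str:
    """Wrap an equation body in the configured delimiters.

    `kind` is "inline" or "display". Hypothetical helper for illustration.
    """
    # Fall back to the defaults when no configuration was supplied.
    delimiters = configured if configured else DEFAULT_DELIMITERS
    left = delimiters[kind]["left"]
    right = delimiters[kind]["right"]
    if kind == "display":
        # Display equations get their own lines; strip the outer newlines
        # the same way the hunk strips `content` afterwards.
        return f"\n{left}\n{content}\n{right}\n".strip()
    return f"{left}{content}{right}"


# Assumed custom configuration using \( \) and \[ \].
custom = {
    "display": {"left": "\\[", "right": "\\]"},
    "inline": {"left": "\\(", "right": "\\)"},
}
print(render_equation("x^2", "inline"))
print(render_equation("x^2", "inline", custom))
```

Note that in the hunk the delimiters are resolved once at module import, so a change to `magic-pdf.json` would presumably take effect only after the process restarts.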
@@ -125,6 +125,15 @@ def get_llm_aided_config():
     else:
         return llm_aided_config
 
+def get_latex_delimiter_config():
+    config = read_config()
+    latex_delimiter_config = config.get('latex-delimiter-config')
+    if latex_delimiter_config is None:
+        logger.warning(f"'latex-delimiter-config' not found in {CONFIG_FILE_NAME}, use 'None' as default")
+        return None
+    else:
+        return latex_delimiter_config
+
 
 if __name__ == '__main__':
     ak, sk, endpoint = get_s3_config('llm-raw')
@@ -1 +1 @@
-__version__ = "1.3.8"
+__version__ = "1.3.10"
@@ -156,7 +156,10 @@ def doc_analyze(
         batch_images = [images_with_extra_info]
 
     results = []
-    for batch_image in batch_images:
+    processed_images_count = 0
+    for index, batch_image in enumerate(batch_images):
+        processed_images_count += len(batch_image)
+        logger.info(f'Batch {index + 1}/{len(batch_images)}: {processed_images_count} pages/{len(images_with_extra_info)} pages')
         result = may_batch_image_analyze(batch_image, ocr, show_log,layout_model, formula_enable, table_enable)
         results.extend(result)
 
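The progress message added in the hunk above reports a running page total per batch. A small stand-alone sketch of the same bookkeeping follows; the generator `batch_progress`, the ten placeholder pages, and the batch size of 4 are assumptions made for illustration.

```python
def batch_progress(batches):
    """Yield (batch_number, cumulative_items, total_items) for each batch."""
    total = sum(len(batch) for batch in batches)
    processed = 0
    for index, batch in enumerate(batches):
        processed += len(batch)
        yield index + 1, processed, total


pages = list(range(10))  # stand-in for images_with_extra_info
# Batch size 4 is arbitrary; the last batch is allowed to be shorter.
batches = [pages[i:i + 4] for i in range(0, len(pages), 4)]
messages = [
    f"Batch {number}/{len(batches)}: {done} pages/{total} pages"
    for number, done, total in batch_progress(batches)
]
for message in messages:
    print(message)
```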
@@ -5,6 +5,7 @@ from typing import Optional
 
 import torch
 from ftfy import fix_text
+from loguru import logger
 
 from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer, PretrainedConfig, PreTrainedModel
 from transformers import VisionEncoderDecoderConfig, VisionEncoderDecoderModel
@@ -57,22 +58,322 @@ class TokenizerWrapper:
         return toks
 
 
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
+LEFT_PATTERN = re.compile(r'(\\left)(\S*)')
+RIGHT_PATTERN = re.compile(r'(\\right)(\S*)')
+LEFT_COUNT_PATTERN = re.compile(r'\\left(?![a-zA-Z])')
+RIGHT_COUNT_PATTERN = re.compile(r'\\right(?![a-zA-Z])')
+LEFT_RIGHT_REMOVE_PATTERN = re.compile(r'\\left\.?|\\right\.?')
+
+def fix_latex_left_right(s):
     """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = r'[a-zA-Z]'
-    noletter = r'[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda _: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break
+    Fix \\left and \\right commands in LaTeX
+    1. Ensure each is followed by a valid delimiter
+    2. Balance the number of \\left and \\right commands
+    """
+    # Whitelist of delimiters
+    valid_delims_list = [r'(', r')', r'[', r']', r'{', r'}', r'/', r'|',
+                         r'\{', r'\}', r'\lceil', r'\rceil', r'\lfloor',
+                         r'\rfloor', r'\backslash', r'\uparrow', r'\downarrow',
+                         r'\Uparrow', r'\Downarrow', r'\|', r'\.']
+
+    # Add a dot when \left is not followed by a valid delimiter
+    def fix_delim(match, is_left=True):
+        cmd = match.group(1)  # \left or \right
+        rest = match.group(2) if len(match.groups()) > 1 else ""
+        if not rest or rest not in valid_delims_list:
+            return cmd + "."
+        return match.group(0)
+
+    # Match \left and \right more precisely,
+    # ensuring they are standalone commands rather than part of another command,
+    # using precompiled regexes and a shared callback
+    s = LEFT_PATTERN.sub(lambda m: fix_delim(m, True), s)
+    s = RIGHT_PATTERN.sub(lambda m: fix_delim(m, False), s)
+
+    # Count \left and \right more precisely
+    left_count = len(LEFT_COUNT_PATTERN.findall(s))  # does not match \lefteqn etc.
+    right_count = len(RIGHT_COUNT_PATTERN.findall(s))  # does not match \rightarrow etc.
+
+    if left_count == right_count:
+        # If the counts match, check whether each pair sits in the same group
+        return fix_left_right_pairs(s)
+    else:
+        # If the counts differ, remove every \left and \right
+        # logger.debug(f"latex:{s}")
+        # logger.warning(f"left_count: {left_count}, right_count: {right_count}")
+        return LEFT_RIGHT_REMOVE_PATTERN.sub('', s)
+
+
+def fix_left_right_pairs(latex_formula):
+    """
+    Detect and fix cases where \\left and \\right are not in the same group of a LaTeX formula
+
+    Args:
+        latex_formula (str): input LaTeX formula
+
+    Returns:
+        str: the fixed LaTeX formula
+    """
+    # Tracks brace nesting depth
+    brace_stack = []
+    # Stores \left info: (position, depth, delimiter)
+    left_stack = []
+    # Stores \right commands that need adjusting: (start, end, target position)
+    adjustments = []
+
+    i = 0
+    while i < len(latex_formula):
+        # Check whether the character is escaped
+        if i > 0 and latex_formula[i - 1] == '\\':
+            backslash_count = 0
+            j = i - 1
+            while j >= 0 and latex_formula[j] == '\\':
+                backslash_count += 1
+                j -= 1
+
+            if backslash_count % 2 == 1:
+                i += 1
+                continue
+
+        # Detect a \left command
+        if i + 5 < len(latex_formula) and latex_formula[i:i + 5] == "\\left" and i + 5 < len(latex_formula):
+            delimiter = latex_formula[i + 5]
+            left_stack.append((i, len(brace_stack), delimiter))
+            i += 6  # skip \left and the delimiter
+            continue
+
+        # Detect a \right command
+        elif i + 6 < len(latex_formula) and latex_formula[i:i + 6] == "\\right" and i + 6 < len(latex_formula):
+            delimiter = latex_formula[i + 6]
+
+            if left_stack:
+                left_pos, left_depth, left_delim = left_stack.pop()
+
+                # If \left and \right are at different brace depths
+                if left_depth != len(brace_stack):
+                    # Find the end of the brace group that contains \left
+                    target_pos = find_group_end(latex_formula, left_pos, left_depth)
+                    if target_pos != -1:
+                        # Record the \right that needs to move
+                        adjustments.append((i, i + 7, target_pos))
+
+            i += 7  # skip \right and the delimiter
+            continue
+
+        # Handle braces
+        if latex_formula[i] == '{':
+            brace_stack.append(i)
+        elif latex_formula[i] == '}':
+            if brace_stack:
+                brace_stack.pop()
+
+        i += 1
+
+    # Apply adjustments from back to front to avoid index shifts
+    if not adjustments:
+        return latex_formula
+
+    result = list(latex_formula)
+    adjustments.sort(reverse=True, key=lambda x: x[0])
+
+    for start, end, target in adjustments:
+        # Extract the \right part
+        right_part = result[start:end]
+        # Remove it from its original position
+        del result[start:end]
+        # Insert it at the target position
+        result.insert(target, ''.join(right_part))
+
+    return ''.join(result)
+
+
+def find_group_end(text, pos, depth):
+    """Find the end position of the brace group at a given depth"""
+    current_depth = depth
+    i = pos
+
+    while i < len(text):
+        if text[i] == '{' and (i == 0 or not is_escaped(text, i)):
+            current_depth += 1
+        elif text[i] == '}' and (i == 0 or not is_escaped(text, i)):
+            current_depth -= 1
+            if current_depth < depth:
+                return i
+        i += 1
+
+    return -1  # no matching end position found
+
+
+def is_escaped(text, pos):
+    """Check whether a character is escaped"""
+    backslash_count = 0
+    j = pos - 1
+    while j >= 0 and text[j] == '\\':
+        backslash_count += 1
+        j -= 1
+
+    return backslash_count % 2 == 1
+
+
+def fix_unbalanced_braces(latex_formula):
+    """
+    Check whether the braces in a LaTeX formula are closed, and delete braces that cannot be paired
+
+    Args:
+        latex_formula (str): input LaTeX formula
+
+    Returns:
+        str: the LaTeX formula with unpaired braces removed
+    """
+    stack = []  # indices of opening braces
+    unmatched = set()  # indices of unmatched braces
+    i = 0
+
+    while i < len(latex_formula):
+        # Check whether this is an escaped brace
+        if latex_formula[i] in ['{', '}']:
+            # Count the consecutive backslashes before it
+            backslash_count = 0
+            j = i - 1
+            while j >= 0 and latex_formula[j] == '\\':
+                backslash_count += 1
+                j -= 1
+
+            # An odd number of preceding backslashes means the brace is escaped and takes no part in matching
+            if backslash_count % 2 == 1:
+                i += 1
+                continue
+
+            # Otherwise the brace takes part in matching
+            if latex_formula[i] == '{':
+                stack.append(i)
+            else:  # latex_formula[i] == '}'
+                if stack:  # there is a matching opening brace
+                    stack.pop()
+                else:  # there is no matching opening brace
+                    unmatched.add(i)
+
+        i += 1
+
+    # All unmatched opening braces
+    unmatched.update(stack)
+
+    # Build the new string, dropping the unmatched braces
+    return ''.join(char for i, char in enumerate(latex_formula) if i not in unmatched)
+
+
+def process_latex(input_string):
+    """
+    Handle backslashes in a LaTeX formula:
+    1. If \ is followed by a special character (#$%&~_^\\{}) or whitespace, leave it unchanged
+    2. If \ is followed by two lowercase letters, leave it unchanged
+    3. Otherwise, add a space after \
+
+    Args:
+        input_string (str): input LaTeX formula
+
+    Returns:
+        str: the processed LaTeX formula
+    """
+
+    def replace_func(match):
+        # Get the character that follows \
+        next_char = match.group(1)
+
+        # Special characters and whitespace are left unchanged
+        if next_char in "#$%&~_^|\\{} \t\n\r\v\f":
+            return match.group(0)
+
+        # For a letter, check the next character
+        if 'a' <= next_char <= 'z' or 'A' <= next_char <= 'Z':
+            pos = match.start() + 2  # position after \x
+            if pos < len(input_string) and ('a' <= input_string[pos] <= 'z' or 'A' <= input_string[pos] <= 'Z'):
+                # The next character is also a letter, so leave it unchanged
+                return match.group(0)
+
+        # Otherwise, add a space after \
+        return '\\' + ' ' + next_char
+
+    # Match \ followed by a single character
+    pattern = r'\\(.)'
+
+    return re.sub(pattern, replace_func, input_string)
+
+# Math environments commonly available in KaTeX/MathJax
+ENV_TYPES = ['array', 'matrix', 'pmatrix', 'bmatrix', 'vmatrix',
+             'Bmatrix', 'Vmatrix', 'cases', 'aligned', 'gathered']
+ENV_BEGIN_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}') for env in ENV_TYPES}
+ENV_END_PATTERNS = {env: re.compile(r'\\end\{' + env + r'\}') for env in ENV_TYPES}
+ENV_FORMAT_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}\{([^}]*)\}') for env in ENV_TYPES}
+
+def fix_latex_environments(s):
+    """
+    Check whether \\begin and \\end of LaTeX environments (such as array) match
+    1. If a \\begin tag is missing, add it at the start
+    2. If an \\end tag is missing, add it at the end
+    """
+    for env in ENV_TYPES:
+        begin_count = len(ENV_BEGIN_PATTERNS[env].findall(s))
+        end_count = len(ENV_END_PATTERNS[env].findall(s))
+
+        if begin_count != end_count:
+            if end_count > begin_count:
+                format_match = ENV_FORMAT_PATTERNS[env].search(s)
+                default_format = '{c}' if env == 'array' else ''
+                format_str = '{' + format_match.group(1) + '}' if format_match else default_format
+
+                missing_count = end_count - begin_count
+                begin_command = '\\begin{' + env + '}' + format_str + ' '
+                s = begin_command * missing_count + s
+            else:
+                missing_count = begin_count - end_count
+                s = s + (' \\end{' + env + '}') * missing_count
+
+    return s
+
+
+UP_PATTERN = re.compile(r'\\up([a-zA-Z]+)')
+COMMANDS_TO_REMOVE_PATTERN = re.compile(
+    r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph|protect|null)')
+REPLACEMENTS_PATTERNS = {
+    re.compile(r'\\underbar'): r'\\underline',
+    re.compile(r'\\Bar'): r'\\hat',
+    re.compile(r'\\Hat'): r'\\hat',
+    re.compile(r'\\Tilde'): r'\\tilde',
+    re.compile(r'\\slash'): r'/',
+    re.compile(r'\\textperthousand'): r'‰',
+    re.compile(r'\\sun'): r'☉',
+    re.compile(r'\\textunderscore'): r'\\_',
+    re.compile(r'\\fint'): r'⨏',
+    re.compile(r'\\up '): r'\\ ',
+    re.compile(r'\\vline = '): r'\\models ',
+    re.compile(r'\\vDash '): r'\\models ',
+    re.compile(r'\\sq \\sqcup '): r'\\square ',
+}
+QQUAD_PATTERN = re.compile(r'\\qquad(?!\s)')
+
+def latex_rm_whitespace(s: str):
+    """Remove unnecessary whitespace from LaTeX code."""
+    s = fix_unbalanced_braces(s)
+    s = fix_latex_left_right(s)
+    s = fix_latex_environments(s)
+
+    # Use the precompiled regular expressions
+    s = UP_PATTERN.sub(
+        lambda m: m.group(0) if m.group(1) in ["arrow", "downarrow", "lus", "silon"] else f"\\{m.group(1)}", s
+    )
+    s = COMMANDS_TO_REMOVE_PATTERN.sub('', s)
+
+    # Apply all replacements
+    for pattern, replacement in REPLACEMENTS_PATTERNS.items():
+        s = pattern.sub(replacement, s)
+
+    # Handle backslashes and spaces in the LaTeX
+    s = process_latex(s)
+
+    # Add a space after \qquad
+    s = QQUAD_PATTERN.sub(r'\\qquad ', s)
+
     return s
 
 
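The brace-repair step in the hunk above depends on counting backslashes to decide whether a brace is escaped. The following is a condensed, stand-alone restatement of that logic for experimentation; the name `drop_unpaired_braces` is an assumption, and this is a sketch of the technique rather than the project's function.

```python
def is_escaped(text: str, pos: int) -> bool:
    """True if the character at `pos` is preceded by an odd number of backslashes."""
    count = 0
    j = pos - 1
    while j >= 0 and text[j] == "\\":
        count += 1
        j -= 1
    return count % 2 == 1


def drop_unpaired_braces(formula: str) -> str:
    """Remove every unescaped brace that has no partner (illustrative name)."""
    stack = []         # indices of opening braces still waiting for a partner
    unmatched = set()  # indices of braces to delete
    for i, char in enumerate(formula):
        if char not in "{}" or is_escaped(formula, i):
            continue   # ordinary characters and \{ \} are left alone
        if char == "{":
            stack.append(i)
        elif stack:
            stack.pop()
        else:
            unmatched.add(i)  # closing brace with nothing open
    unmatched.update(stack)   # opening braces that were never closed
    return "".join(c for i, c in enumerate(formula) if i not in unmatched)


print(drop_unpaired_braces(r"\frac{a}{b}}"))
print(drop_unpaired_braces(r"{x \} y"))
```

In the second call the `\}` is escaped, so the leading `{` has no partner and is the brace that gets removed.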
@@ -172,8 +172,8 @@ def filter_nested_tables(table_res_list, overlap_threshold=0.8, area_threshold=0
|
||||
tables_inside = [j for j in range(len(table_res_list))
|
||||
if i != j and is_inside(table_info[j], table_info[i], overlap_threshold)]
|
||||
|
||||
# Continue if there are at least 2 tables inside
|
||||
if len(tables_inside) >= 2:
|
||||
# Continue if there are at least 3 tables inside
|
||||
if len(tables_inside) >= 3:
|
||||
# Check if inside tables overlap with each other
|
||||
tables_overlap = any(do_overlap(table_info[tables_inside[idx1]], table_info[tables_inside[idx2]])
|
||||
for idx1 in range(len(tables_inside))
|
||||
|
||||
@@ -76,11 +76,11 @@ In the final step, enter ``yes``, close the terminal, and reopen it.
|
||||
4. Create an Environment Using Conda
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
Specify Python version 3.10.
|
||||
Specify Python version 3.10~3.13.
|
||||
|
||||
.. code:: sh
|
||||
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
|
||||
5. Install Applications
|
||||
@@ -155,14 +155,15 @@ to test CUDA acceleration:
|
||||
Windows 10/11
|
||||
--------------
|
||||
|
||||
1. Install CUDA and cuDNN
|
||||
1. Install CUDA
|
||||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||
|
||||
You need to install a CUDA version that is compatible with torch's requirements. Currently, torch supports CUDA 11.8/12.4/12.6.
|
||||
You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
|
||||
|
||||
- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
|
||||
- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
|
||||
- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
|
||||
- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
|
||||
|
||||
|
||||
2. Install Anaconda
|
||||
@@ -177,7 +178,7 @@ Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86
|
||||
|
||||
::
|
||||
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
|
||||
4. Install Applications
|
||||
|
||||
@@ -61,7 +61,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="3">Python Version</td>
|
||||
<td colspan="3">3.10~3.12</td>
|
||||
<td colspan="3">3.10~3.13</td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="3">Nvidia Driver Version</td>
|
||||
@@ -71,8 +71,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
|
||||
</tr>
|
||||
<tr>
|
||||
<td colspan="3">CUDA Environment</td>
|
||||
<td>11.8/12.4/12.6/12.8</td>
|
||||
<td>11.8/12.4/12.6/12.8</td>
|
||||
<td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
|
||||
<td>None</td>
|
||||
</tr>
|
||||
<tr>
|
||||
@@ -86,7 +85,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
|
||||
<td colspan="2">GPU VRAM 6GB or more</td>
|
||||
<td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
|
||||
More than 6GB VRAM </td>
|
||||
<td rowspan="2">apple slicon</td>
|
||||
<td rowspan="2">Apple silicon</td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
@@ -97,7 +96,7 @@ Create an environment
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
conda create -n mineru 'python>=3.10' -y
|
||||
conda create -n mineru 'python=3.12' -y
|
||||
conda activate mineru
|
||||
pip install -U "magic-pdf[full]"
|
||||
|
||||
|
||||
@@ -117,8 +117,12 @@ def to_markdown(file_path, end_pages, is_ocr, layout_mode, formula_enable, table
|
||||
return md_content, txt_content, archive_zip_path, new_pdf_path
|
||||
|
||||
|
||||
latex_delimiters = [{'left': '$$', 'right': '$$', 'display': True},
|
||||
{'left': '$', 'right': '$', 'display': False}]
|
||||
latex_delimiters = [
|
||||
{'left': '$$', 'right': '$$', 'display': True},
|
||||
{'left': '$', 'right': '$', 'display': False},
|
||||
{'left': '\\(', 'right': '\\)', 'display': False},
|
||||
{'left': '\\[', 'right': '\\]', 'display': True},
|
||||
]
|
||||
|
||||
|
||||
def init_model():
|
||||
@@ -218,7 +222,8 @@ if __name__ == '__main__':
|
||||
with gr.Tabs():
|
||||
with gr.Tab('Markdown rendering'):
|
||||
md = gr.Markdown(label='Markdown rendering', height=1100, show_copy_button=True,
|
||||
latex_delimiters=latex_delimiters, line_breaks=True)
|
||||
latex_delimiters=latex_delimiters,
|
||||
line_breaks=True)
|
||||
with gr.Tab('Markdown text'):
|
||||
md_text = gr.TextArea(lines=45, show_copy_button=True)
|
||||
file.change(fn=to_pdf, inputs=file, outputs=pdf_show)
|
||||
|
||||
@@ -4,9 +4,7 @@
 ## Environment Setup
 Use the following command to set up the required environment:
 ```bash
-pip install -U litserve python-multipart filetype
-pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
-pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118
+pip install -U magic-pdf[full] litserve python-multipart filetype
 ```
 
 ## Quick Start
@@ -21,6 +21,7 @@ from magic_pdf.libs.config_reader import get_bucket_name, get_s3_config
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
 from magic_pdf.operators.models import InferenceResult
 from magic_pdf.operators.pipes import PipeResult
+from fastapi import Form
 
 model_config.__use_inside_model__ = True
 
@@ -102,6 +103,7 @@ def init_writers(
         # Process the uploaded file
         file_bytes = file.file.read()
         file_extension = os.path.splitext(file.filename)[1]
+
         writer = FileBasedDataWriter(output_path)
         image_writer = FileBasedDataWriter(output_image_path)
         os.makedirs(output_image_path, exist_ok=True)
@@ -176,14 +178,14 @@ def encode_image(image_path: str) -> str:
 )
 async def file_parse(
     file: UploadFile = None,
-    file_path: str = None,
-    parse_method: str = "auto",
-    is_json_md_dump: bool = False,
-    output_dir: str = "output",
-    return_layout: bool = False,
-    return_info: bool = False,
-    return_content_list: bool = False,
-    return_images: bool = False,
+    file_path: str = Form(None),
+    parse_method: str = Form("auto"),
+    is_json_md_dump: bool = Form(False),
+    output_dir: str = Form("output"),
+    return_layout: bool = Form(False),
+    return_info: bool = Form(False),
+    return_content_list: bool = Form(False),
+    return_images: bool = Form(False),
 ):
     """
     Execute the process of converting PDF to JSON and MD, outputting MD and JSON files
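Declaring the parameters with `Form(...)` means FastAPI reads them from the form-encoded request body instead of the query string, so a client sends them next to the uploaded file. A minimal sketch of assembling those fields as text follows; `build_form_fields` is a hypothetical client-side helper, and its defaults simply mirror the signature in the hunk above.

```python
def build_form_fields(
    parse_method="auto",
    is_json_md_dump=False,
    output_dir="output",
    return_layout=False,
    return_info=False,
    return_content_list=False,
    return_images=False,
    file_path=None,
):
    """Build the text fields of a form request (hypothetical client helper)."""
    fields = {
        "parse_method": parse_method,
        "is_json_md_dump": is_json_md_dump,
        "output_dir": output_dir,
        "return_layout": return_layout,
        "return_info": return_info,
        "return_content_list": return_content_list,
        "return_images": return_images,
    }
    if file_path is not None:
        # Only sent when parsing a server-side path instead of an upload.
        fields["file_path"] = file_path
    # Form values travel as text, so booleans become "true"/"false".
    return {
        key: (str(value).lower() if isinstance(value, bool) else str(value))
        for key, value in fields.items()
    }


fields = build_form_fields(return_content_list=True)
print(fields)
```

A dictionary like this could be passed as the form data of a multipart POST, alongside the file part, by whichever HTTP client is in use.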
@@ -7,9 +7,9 @@ numpy>=1.21.6
 pydantic>=2.7.2,<2.11
 PyMuPDF>=1.24.9,<1.25.0
 scikit-learn>=1.0.2
-torch>=2.2.2,!=2.5.0,!=2.5.1
+torch>=2.2.2,!=2.5.0,!=2.5.1,<3
 torchvision
 transformers>=4.49.0,!=4.51.0,<5.0.0
-pdfminer.six==20231228
+pdfminer.six==20250506
 tqdm>=4.67.1
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
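To see what the tightened `torch` specifier admits, the constraint can be restated as a small check. This is a simplified sketch using only the standard library: it assumes plain three-component versions (an optional `+local` suffix is dropped) and does not implement full PEP 440 semantics; both function names are illustrative.

```python
def parse_version(text: str) -> tuple:
    """Parse 'X.Y.Z', optionally followed by '+local', into a tuple of ints."""
    public = text.split("+", 1)[0]  # drop a local tag such as +cu124
    return tuple(int(part) for part in public.split("."))


def torch_version_allowed(version: str) -> bool:
    """Check a version against torch>=2.2.2,!=2.5.0,!=2.5.1,<3 (simplified)."""
    v = parse_version(version)
    excluded = {(2, 5, 0), (2, 5, 1)}
    # Tuples compare element by element, so (3, 0, 0) is not below (3,).
    return (2, 2, 2) <= v < (3,) and v not in excluded


for candidate in ("2.2.1", "2.5.1", "2.6.0+cu124", "3.0.0"):
    print(candidate, torch_version_allowed(candidate))
```

For real dependency resolution a PEP 440 aware parser is the safer choice, since pre-release tags and two-component versions are outside what this sketch handles.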
2
setup.py
@@ -81,7 +81,7 @@ if __name__ == '__main__':
             "Programming Language :: Python :: 3.12",
             "Programming Language :: Python :: 3.13",
         ],
-        python_requires=">=3.10,<4",  # Python versions the project depends on
+        python_requires=">=3.10,<3.14",  # Python versions the project depends on
         entry_points={
             "console_scripts": [
                 "magic-pdf = magic_pdf.tools.cli:cli",
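The narrowed `python_requires` range means pip refuses to install the package on Python 3.14 or later. As a sketch, the same bound can be checked at runtime before creating an environment; `python_supported` and the two constants are assumptions introduced here for illustration.

```python
import sys

SUPPORTED_MIN = (3, 10)  # inclusive, from python_requires ">=3.10"
SUPPORTED_MAX = (3, 14)  # exclusive, from python_requires "<3.14"


def python_supported(version_info=None) -> bool:
    """Return True if the interpreter falls inside the declared range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return SUPPORTED_MIN <= major_minor < SUPPORTED_MAX


print(python_supported((3, 12, 1)), python_supported((3, 14, 0)))
```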
@@ -247,6 +247,22 @@
|
||||
"created_at": "2025-04-17T03:54:59Z",
|
||||
"repoId": 765083837,
|
||||
"pullRequestNo": 2267
|
||||
},
|
||||
{
|
||||
"name": "kowyo",
|
||||
"id": 110339237,
|
||||
"comment_id": 2829263082,
|
||||
"created_at": "2025-04-25T02:54:20Z",
|
||||
"repoId": 765083837,
|
||||
"pullRequestNo": 2367
|
||||
},
|
||||
{
|
||||
"name": "CharlesKeeling65",
|
||||
"id": 94165417,
|
||||
"comment_id": 2841356871,
|
||||
"created_at": "2025-04-30T09:25:31Z",
|
||||
"repoId": 765083837,
|
||||
"pullRequestNo": 2411
|
||||
}
|
||||
]
|
||||
}
|
||||