mirror of https://github.com/opendatalab/MinerU.git
synced 2026-03-27 19:18:34 +07:00

Compare commits: 43 commits (release-1. ... release-1.)
README.md
@@ -48,6 +48,18 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>

# Changelog
- 2025/04/27 1.3.9 Released
  - Optimized the formula parsing function to improve the success rate of formula rendering
  - Updated `pdfminer.six` to the latest version, fixing some abnormal PDF parsing issues
- 2025/04/23 1.3.8 Released
  - The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
    - `PP-OCRv4_server_rec_doc` is trained on a mix of more Chinese document data and PP-OCR training data, enhancing recognition capabilities for some traditional Chinese characters, Japanese, and special characters. It supports over 15,000 recognizable characters, improving text recognition in documents while also boosting general text recognition.
    - [Performance comparison between PP-OCRv4_server_rec_doc, PP-OCRv4_server_rec, and PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/en/module_usage/tutorials/ocr_modules/text_recognition.html#ii-supported-model-list)
    - Verified results show that the `PP-OCRv4_server_rec_doc` model significantly improves accuracy in both single-language (`Chinese`, `English`, `Japanese`, `Traditional Chinese`) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for most use cases.
    - In a small number of pure English scenarios, the `PP-OCRv4_server_rec_doc` model may encounter word concatenation issues, whereas `PP-OCRv4_server_rec` performs better in such cases. Therefore, we have retained the `PP-OCRv4_server_rec` model, which users can invoke by passing the parameter `lang='ch_server'` (python api) or `--lang ch_server` (cli).
- 2025/04/22 1.3.7 Released
  - Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization.
  - Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode.
- 2025/04/16 1.3.4 Released
  - Slightly improved the speed of OCR detection by removing some unused blocks.
  - Fixed page-level sorting errors caused by footnotes in certain cases.
@@ -365,7 +377,7 @@ There are three different ways to experience MinerU:
 <td colspan="2">GPU VRAM 6GB or more</td>
 <td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
 More than 6GB VRAM </td>
-<td rowspan="2">apple slicon</td>
+<td rowspan="2">Apple silicon</td>
 </tr>
 </table>
@@ -47,6 +47,18 @@
</div>

# Changelog
- 2025/04/27 1.3.9 released
  - Optimized the formula parsing function to improve the success rate of formula rendering
  - Updated `pdfminer.six` to the latest version, fixing some abnormal PDF parsing issues
- 2025/04/23 1.3.8 released
  - The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
    - `PP-OCRv4_server_rec_doc` is trained on top of `PP-OCRv4_server_rec` using a mix of more Chinese document data and PP-OCR training data, adding recognition of some traditional Chinese characters, Japanese, and special characters. It supports 15,000+ recognizable characters, improving both document-oriented and general text recognition.
    - [Performance comparison of PP-OCRv4_server_rec_doc / PP-OCRv4_server_rec / PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/module_usage/tutorials/ocr_modules/text_recognition.html#_3)
    - Verified results show that the `PP-OCRv4_server_rec_doc` model clearly improves accuracy in both single-language (Chinese, English, Japanese, Traditional Chinese) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for the vast majority of use cases.
    - In a small number of pure English scenarios, `PP-OCRv4_server_rec_doc` may concatenate words, whereas `PP-OCRv4_server_rec` performs better there, so the `PP-OCRv4_server_rec` model is retained; users can invoke it by passing `lang='ch_server'` (python api) or `--lang ch_server` (cli).
- 2025/04/22 1.3.7 released
  - Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization
  - Fixed the significant slowdown of OCR and table parsing speed in `cpu` mode
- 2025/04/16 1.3.4 released
  - Slightly improved the speed of OCR detection by removing some unused blocks
  - Fixed page-level sorting errors caused by footnotes in certain cases

@@ -355,7 +367,7 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
 <td colspan="2">
 Volta(2017) and all later GPUs with Tensor Cores <br>
 6GB VRAM or more</td>
-<td rowspan="2">apple slicon</td>
+<td rowspan="2">Apple silicon</td>
 </tr>
 </table>
@@ -1 +1 @@
-__version__ = "1.3.5"
+__version__ = "1.3.9"
@@ -161,20 +161,13 @@ class BatchAnalyze:
for table_res_dict in tqdm(table_res_list_all_page, desc="Table Predict"):
    _lang = table_res_dict['lang']
    atom_model_manager = AtomModelSingleton()
    ocr_engine = atom_model_manager.get_atom_model(
        atom_model_name='ocr',
        ocr_show_log=False,
        det_db_box_thresh=0.5,
        det_db_unclip_ratio=1.6,
        lang=_lang
    )
    table_model = atom_model_manager.get_atom_model(
        atom_model_name='table',
        table_model_name='rapid_table',
        table_model_path='',
        table_max_time=400,
        device='cpu',
        ocr_engine=ocr_engine,
        lang=_lang,
        table_sub_model_name='slanet_plus'
    )
    html_code, table_cell_bboxes, logic_points, elapse = table_model.predict(table_res_dict['table_img'])
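The loop above re-requests the OCR engine and table model through `AtomModelSingleton` for every table, relying on the manager to hand back one cached instance per model/language pair instead of rebuilding models. A minimal sketch of that caching pattern (the names here are illustrative, not MinerU's actual implementation):

```python
class ModelCache:
    """Sketch of an AtomModelSingleton-style registry: one model
    instance per (name, lang) key, built lazily on first request."""
    _instances = {}

    @classmethod
    def get(cls, name, lang, factory):
        key = (name, lang)
        if key not in cls._instances:
            # First request for this key: build and remember the model.
            cls._instances[key] = factory()
        return cls._instances[key]
```

Repeated calls with the same key return the same object, so per-table lookups inside a hot loop stay cheap.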
@@ -5,6 +5,7 @@ from typing import Optional

 import torch
 from ftfy import fix_text
 from loguru import logger

 from transformers import AutoConfig, AutoModel, AutoModelForCausalLM, AutoTokenizer, PretrainedConfig, PreTrainedModel
 from transformers import VisionEncoderDecoderConfig, VisionEncoderDecoderModel

@@ -57,22 +58,316 @@ class TokenizerWrapper:
 return toks
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code."""
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = r'[a-zA-Z]'
-    noletter = r'[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda _: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break

LEFT_PATTERN = re.compile(r'(\\left)(\S*)')
RIGHT_PATTERN = re.compile(r'(\\right)(\S*)')
LEFT_COUNT_PATTERN = re.compile(r'\\left(?![a-zA-Z])')
RIGHT_COUNT_PATTERN = re.compile(r'\\right(?![a-zA-Z])')
LEFT_RIGHT_REMOVE_PATTERN = re.compile(r'\\left\.?|\\right\.?')


def fix_latex_left_right(s):
    """
    Fix \left and \right commands in LaTeX:
    1. Make sure each is followed by a valid delimiter
    2. Balance the number of \left and \right commands
    """
    # Whitelist of valid delimiters
    valid_delims_list = [r'(', r')', r'[', r']', r'{', r'}', r'/', r'|',
                         r'\{', r'\}', r'\lceil', r'\rceil', r'\lfloor',
                         r'\rfloor', r'\backslash', r'\uparrow', r'\downarrow',
                         r'\Uparrow', r'\Downarrow', r'\|', r'\.']

    # Add a dot when \left/\right is missing a valid delimiter
    def fix_delim(match, is_left=True):
        cmd = match.group(1)  # \left or \right
        rest = match.group(2) if len(match.groups()) > 1 else ""
        if not rest or rest not in valid_delims_list:
            return cmd + "."
        return match.group(0)

    # Match \left and \right precisely, as standalone commands rather than
    # prefixes of other commands, using precompiled patterns and one callback
    s = LEFT_PATTERN.sub(lambda m: fix_delim(m, True), s)
    s = RIGHT_PATTERN.sub(lambda m: fix_delim(m, False), s)

    # Count \left and \right precisely
    left_count = len(LEFT_COUNT_PATTERN.findall(s))    # does not match \lefteqn etc.
    right_count = len(RIGHT_COUNT_PATTERN.findall(s))  # does not match \rightarrow etc.

    if left_count == right_count:
        # Counts match: check that each pair sits in the same group
        return fix_left_right_pairs(s)
    else:
        # Counts differ: remove every \left and \right
        # logger.debug(f"latex:{s}")
        # logger.warning(f"left_count: {left_count}, right_count: {right_count}")
        return LEFT_RIGHT_REMOVE_PATTERN.sub('', s)
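The count-and-strip fallback in the `else` branch can be exercised on its own; this standalone sketch mirrors it:

```python
import re

def strip_unbalanced_left_right(s: str) -> str:
    """If \\left and \\right counts differ, drop them all (keeping the
    delimiters themselves); otherwise leave the formula untouched."""
    # Lookahead excludes longer commands such as \lefteqn and \rightarrow.
    lefts = len(re.findall(r'\\left(?![a-zA-Z])', s))
    rights = len(re.findall(r'\\right(?![a-zA-Z])', s))
    if lefts != rights:
        s = re.sub(r'\\left\.?|\\right\.?', '', s)
    return s
```

A truncated `\left( x + y` loses its dangling `\left` but keeps the parenthesis, while a balanced formula passes through unchanged.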
def fix_left_right_pairs(latex_formula):
    """
    Detect and fix cases where \left and \right are not in the same group.

    Args:
        latex_formula (str): input LaTeX formula

    Returns:
        str: fixed LaTeX formula
    """
    # Track brace nesting depth
    brace_stack = []
    # Store \left info: (position, depth, delimiter)
    left_stack = []
    # Store \right moves to apply: (start, end, target position)
    adjustments = []

    i = 0
    while i < len(latex_formula):
        # Skip escaped characters (odd number of preceding backslashes)
        if i > 0 and latex_formula[i - 1] == '\\':
            backslash_count = 0
            j = i - 1
            while j >= 0 and latex_formula[j] == '\\':
                backslash_count += 1
                j -= 1

            if backslash_count % 2 == 1:
                i += 1
                continue

        # Detect a \left command
        if i + 5 < len(latex_formula) and latex_formula[i:i + 5] == "\\left":
            delimiter = latex_formula[i + 5]
            left_stack.append((i, len(brace_stack), delimiter))
            i += 6  # skip \left and its delimiter
            continue

        # Detect a \right command
        elif i + 6 < len(latex_formula) and latex_formula[i:i + 6] == "\\right":
            delimiter = latex_formula[i + 6]

            if left_stack:
                left_pos, left_depth, left_delim = left_stack.pop()

                # \left and \right sit at different brace depths
                if left_depth != len(brace_stack):
                    # Find the end of the brace group containing \left
                    target_pos = find_group_end(latex_formula, left_pos, left_depth)
                    if target_pos != -1:
                        # Record the \right that has to move
                        adjustments.append((i, i + 7, target_pos))

            i += 7  # skip \right and its delimiter
            continue

        # Track braces
        if latex_formula[i] == '{':
            brace_stack.append(i)
        elif latex_formula[i] == '}':
            if brace_stack:
                brace_stack.pop()

        i += 1

    # Apply adjustments back to front to keep indices valid
    if not adjustments:
        return latex_formula

    result = list(latex_formula)
    adjustments.sort(reverse=True, key=lambda x: x[0])

    for start, end, target in adjustments:
        # Extract the \right part
        right_part = result[start:end]
        # Delete it from its original position
        del result[start:end]
        # Insert it at the target position
        result.insert(target, ''.join(right_part))

    return ''.join(result)
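The delete-then-reinsert bookkeeping at the end of `fix_left_right_pairs` is easiest to see in isolation. A small sketch of the same move-application idea, on hypothetical inputs:

```python
def apply_moves(s, moves):
    """Apply (start, end, target) moves: cut s[start:end] and reinsert the
    cut text at index target. Processing in descending start order keeps
    the indices of the not-yet-applied moves valid."""
    chars = list(s)
    for start, end, target in sorted(moves, key=lambda m: m[0], reverse=True):
        piece = ''.join(chars[start:end])
        del chars[start:end]
        chars.insert(target, piece)
    return ''.join(chars)
```

Sorting in reverse is the same trick the original uses: edits near the end of the string never disturb the offsets of edits closer to the front.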
def find_group_end(text, pos, depth):
    """Find the end position of the brace group at the given depth."""
    current_depth = depth
    i = pos

    while i < len(text):
        if text[i] == '{' and (i == 0 or not is_escaped(text, i)):
            current_depth += 1
        elif text[i] == '}' and (i == 0 or not is_escaped(text, i)):
            current_depth -= 1
            if current_depth < depth:
                return i
        i += 1

    return -1  # no matching end position found
def is_escaped(text, pos):
    """Check whether the character at pos is escaped."""
    backslash_count = 0
    j = pos - 1
    while j >= 0 and text[j] == '\\':
        backslash_count += 1
        j -= 1

    return backslash_count % 2 == 1
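The backslash-parity rule that `is_escaped` implements is used by several of these helpers and can be checked standalone:

```python
def is_escaped(text: str, pos: int) -> bool:
    """A character is escaped iff an odd number of backslashes
    immediately precede it."""
    count = 0
    j = pos - 1
    while j >= 0 and text[j] == '\\':
        count += 1
        j -= 1
    return count % 2 == 1
```

`\{` escapes the brace; `\\{` does not, because the second backslash is itself escaped by the first.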
def fix_unbalanced_braces(latex_formula):
    """
    Detect unclosed braces in a LaTeX formula and delete the braces
    that cannot be paired.

    Args:
        latex_formula (str): input LaTeX formula

    Returns:
        str: the formula with unpairable braces removed
    """
    stack = []         # indices of open braces
    unmatched = set()  # indices of unmatched braces
    i = 0

    while i < len(latex_formula):
        # Check for escaped braces
        if latex_formula[i] in ['{', '}']:
            # Count the consecutive backslashes before this brace
            backslash_count = 0
            j = i - 1
            while j >= 0 and latex_formula[j] == '\\':
                backslash_count += 1
                j -= 1

            # An odd number of preceding backslashes means the brace is
            # escaped and takes no part in matching
            if backslash_count % 2 == 1:
                i += 1
                continue

            # Otherwise the brace participates in matching
            if latex_formula[i] == '{':
                stack.append(i)
            else:  # latex_formula[i] == '}'
                if stack:  # there is a matching open brace
                    stack.pop()
                else:      # no matching open brace
                    unmatched.add(i)

        i += 1

    # Any open braces left on the stack are unmatched too
    unmatched.update(stack)

    # Rebuild the string without the unmatched braces
    return ''.join(char for i, char in enumerate(latex_formula) if i not in unmatched)
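Running the routine above over a truncated formula shows the repair. This condensed, self-contained version keeps the same stack-plus-unmatched-set logic:

```python
def drop_unmatched_braces(s: str) -> str:
    """Condensed form of fix_unbalanced_braces: escaped braces are ignored,
    unpaired braces are deleted, everything else is kept."""
    stack, unmatched = [], set()
    for i, ch in enumerate(s):
        if ch in '{}':
            # Parity of the preceding backslash run decides escaping.
            n, j = 0, i - 1
            while j >= 0 and s[j] == '\\':
                n += 1
                j -= 1
            if n % 2 == 1:
                continue  # escaped brace: not part of matching
            if ch == '{':
                stack.append(i)
            elif stack:
                stack.pop()
            else:
                unmatched.add(i)  # '}' with no partner
    unmatched.update(stack)       # '{' left open at end of string
    return ''.join(c for i, c in enumerate(s) if i not in unmatched)
```

A truncated `\frac{a}{b` loses only the dangling `{`, while escaped braces such as `\{x\}` survive untouched.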
def process_latex(input_string):
    """
    Handle backslashes in a LaTeX formula:
    1. If '\' is followed by a special character (#$%&~_^\\{}) or a space, keep it unchanged
    2. If '\' is followed by two letters, keep it unchanged
    3. Otherwise, insert a space after '\'

    Args:
        input_string (str): input LaTeX formula

    Returns:
        str: processed LaTeX formula
    """

    def replace_func(match):
        # The character right after '\'
        next_char = match.group(1)

        # Special character or whitespace: keep unchanged
        if next_char in "#$%&~_^|\\{} \t\n\r\v\f":
            return match.group(0)

        # A letter: look at the character after it
        if 'a' <= next_char <= 'z' or 'A' <= next_char <= 'Z':
            pos = match.start() + 2  # position after \x
            if pos < len(input_string) and ('a' <= input_string[pos] <= 'z' or 'A' <= input_string[pos] <= 'Z'):
                # The next character is also a letter: keep unchanged
                return match.group(0)

        # Otherwise, insert a space after '\'
        return '\\' + ' ' + next_char

    # Match '\' followed by a single character
    pattern = r'\\(.)'

    return re.sub(pattern, replace_func, input_string)
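The three backslash rules above amount to: protect escapes, protect multi-letter commands, and space out everything else. A standalone sketch of the same decision:

```python
import re

def space_after_backslash(s: str) -> str:
    """Insert a space after '\\' unless it escapes a special character
    or starts a command of two or more letters."""
    def repl(m):
        nxt = m.group(1)
        if nxt in "#$%&~_^|\\{} \t\n\r\v\f":
            return m.group(0)          # escaped special char: untouched
        if nxt.isalpha():
            pos = m.start() + 2
            if pos < len(s) and s[pos].isalpha():
                return m.group(0)      # multi-letter command: untouched
        return '\\ ' + nxt             # lone '\x': separate with a space
    return re.sub(r'\\(.)', repl, s)
```

So `\alpha` and `\{` pass through, while a stray `\3` becomes `\ 3`.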
# Math environments commonly available in KaTeX/MathJax
ENV_TYPES = ['array', 'matrix', 'pmatrix', 'bmatrix', 'vmatrix',
             'Bmatrix', 'Vmatrix', 'cases', 'aligned', 'gathered']
ENV_BEGIN_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}') for env in ENV_TYPES}
ENV_END_PATTERNS = {env: re.compile(r'\\end\{' + env + r'\}') for env in ENV_TYPES}
ENV_FORMAT_PATTERNS = {env: re.compile(r'\\begin\{' + env + r'\}\{([^}]*)\}') for env in ENV_TYPES}


def fix_latex_environments(s):
    """
    Check whether the \begin and \end tags of LaTeX environments (such as array) match:
    1. If a \begin tag is missing, prepend it at the start
    2. If an \end tag is missing, append it at the end
    """
    for env in ENV_TYPES:
        begin_count = len(ENV_BEGIN_PATTERNS[env].findall(s))
        end_count = len(ENV_END_PATTERNS[env].findall(s))

        if begin_count != end_count:
            if end_count > begin_count:
                format_match = ENV_FORMAT_PATTERNS[env].search(s)
                default_format = '{c}' if env == 'array' else ''
                format_str = '{' + format_match.group(1) + '}' if format_match else default_format

                missing_count = end_count - begin_count
                begin_command = '\\begin{' + env + '}' + format_str + ' '
                s = begin_command * missing_count + s
            else:
                missing_count = begin_count - end_count
                s = s + (' \\end{' + env + '}') * missing_count

    return s
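`fix_latex_environments` is count-based rather than position-based: it only compares how many `\begin` and `\end` tags exist and pads the missing side. A minimal single-environment version makes that visible (a sketch, assuming the same plain prepend/append repair):

```python
import re

def balance_env(s: str, env: str) -> str:
    """Append missing \\end{env} tags, or prepend missing \\begin{env}
    tags, until the two counts agree."""
    begins = len(re.findall(r'\\begin\{' + env + r'\}', s))
    ends = len(re.findall(r'\\end\{' + env + r'\}', s))
    if begins > ends:
        s = s + (' \\end{' + env + '}') * (begins - ends)
    elif ends > begins:
        s = ('\\begin{' + env + '} ') * (ends - begins) + s
    return s
```

A formula cut off mid-matrix gets its closing tag appended at the very end, which renders even if the tag is not at the ideal position.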
UP_PATTERN = re.compile(r'\\up([a-zA-Z]+)')
COMMANDS_TO_REMOVE_PATTERN = re.compile(
    r'\\(?:lefteqn|boldmath|ensuremath|centering|textsubscript|sides|textsl|textcent|emph)')
REPLACEMENTS_PATTERNS = {
    re.compile(r'\\underbar'): r'\\underline',
    re.compile(r'\\Bar'): r'\\hat',
    re.compile(r'\\Hat'): r'\\hat',
    re.compile(r'\\Tilde'): r'\\tilde',
    re.compile(r'\\slash'): r'/',
    re.compile(r'\\textperthousand'): r'‰',
    re.compile(r'\\sun'): r'☉'
}
QQUAD_PATTERN = re.compile(r'\\qquad(?!\s)')


def latex_rm_whitespace(s: str):
    """Remove unnecessary whitespace from LaTeX code."""
    s = fix_unbalanced_braces(s)
    s = fix_latex_left_right(s)
    s = fix_latex_environments(s)

    # Use the precompiled regular expressions
    s = UP_PATTERN.sub(
        lambda m: m.group(0) if m.group(1) in ["arrow", "downarrow", "lus", "silon"] else f"\\{m.group(1)}", s
    )
    s = COMMANDS_TO_REMOVE_PATTERN.sub('', s)

    # Apply all replacements
    for pattern, replacement in REPLACEMENTS_PATTERNS.items():
        s = pattern.sub(replacement, s)

    # Handle backslashes and spaces in the LaTeX
    s = process_latex(s)

    # Pad a space after \qquad
    s = QQUAD_PATTERN.sub(r'\\qquad ', s)

    return s
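The final `\qquad` fix relies on a negative lookahead so that the substitution never doubles an existing space. In isolation:

```python
import re

# Match \qquad only when it is NOT already followed by whitespace.
QQUAD = re.compile(r'\\qquad(?!\s)')

def pad_qquad(s: str) -> str:
    return QQUAD.sub(r'\\qquad ', s)
```

`a\qquad{b}` gains a space before the brace; `a\qquad b` is left alone because the lookahead fails.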
@@ -53,6 +53,12 @@ class PytorchPaddleOCR(TextSystem):
 args = parser.parse_args(args)

 self.lang = kwargs.get('lang', 'ch')

 device = get_device()
 if device == 'cpu' and self.lang in ['ch', 'ch_server']:
     logger.warning("The current device in use is CPU. To ensure the speed of parsing, the language is automatically switched to ch_lite.")
     self.lang = 'ch_lite'

 if self.lang in latin_lang:
     self.lang = 'latin'
 elif self.lang in arabic_lang:

@@ -74,7 +80,7 @@ class PytorchPaddleOCR(TextSystem):
 kwargs['rec_char_dict_path'] = os.path.join(root_dir, 'pytorchocr', 'utils', 'resources', 'dict', dict_file)
 # kwargs['rec_batch_num'] = 8

-kwargs['device'] = get_device()
+kwargs['device'] = device

 default_args = vars(args)
 default_args.update(kwargs)
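The CPU fallback added above is a pure function of the device and requested language, and is easy to state separately (a sketch, not the class's actual method):

```python
def effective_lang(lang: str, device: str) -> str:
    """On CPU the heavy 'ch'/'ch_server' server models are too slow, so the
    recognizer falls back to the lightweight 'ch_lite' variant."""
    if device == 'cpu' and lang in ('ch', 'ch_server'):
        return 'ch_lite'
    return lang
```

Other languages, and any language on GPU, are passed through unchanged.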
@@ -171,6 +171,31 @@ ch_PP-OCRv4_rec_server_infer:
 nrtr_dim: 384
 max_text_length: 25

 ch_PP-OCRv4_rec_server_doc_infer:
   model_type: rec
   algorithm: SVTR_HGNet
   Transform:
   Backbone:
     name: PPHGNet_small
   Head:
     name: MultiHead
     out_channels_list:
       CTCLabelDecode: 15631
     head_list:
       - CTCHead:
           Neck:
             name: svtr
             dims: 120
             depth: 2
             hidden_dims: 120
             kernel_size: [ 1, 3 ]
             use_guide: True
           Head:
             fc_decay: 0.00001
       - NRTRHead:
           nrtr_dim: 384
           max_text_length: 25

 chinese_cht_PP-OCRv3_rec_infer:
   model_type: rec
   algorithm: SVTR

File diff suppressed because it is too large
@@ -3,10 +3,14 @@ lang:
   det: ch_PP-OCRv3_det_infer.pth
   rec: ch_PP-OCRv4_rec_infer.pth
   dict: ppocr_keys_v1.txt
-ch:
+ch_server:
   det: ch_PP-OCRv3_det_infer.pth
   rec: ch_PP-OCRv4_rec_server_infer.pth
   dict: ppocr_keys_v1.txt
+ch:
+  det: ch_PP-OCRv3_det_infer.pth
+  rec: ch_PP-OCRv4_rec_server_doc_infer.pth
+  dict: ppocrv4_doc_dict.txt
 en:
   det: en_PP-OCRv3_det_infer.pth
   rec: en_PP-OCRv4_rec_infer.pth
@@ -86,7 +86,7 @@ Also you can try `online demo <https://www.modelscope.cn/studios/OpenDataLab/Min
 <td colspan="2">GPU VRAM 6GB or more</td>
 <td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
 More than 6GB VRAM </td>
-<td rowspan="2">apple slicon</td>
+<td rowspan="2">Apple silicon</td>
 </tr>
 </table>
@@ -158,7 +158,7 @@ devanagari_lang = [
 'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom',  # noqa: E126
 'sa', 'bgc'
 ]
-other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
+other_lang = ['ch', 'ch_lite', 'ch_server', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
 add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']

 # all_lang = ['', 'auto']
Binary file not shown.
@@ -10,6 +10,6 @@ scikit-learn>=1.0.2
 torch>=2.2.2,!=2.5.0,!=2.5.1
 torchvision
 transformers>=4.49.0,!=4.51.0,<5.0.0
-pdfminer.six==20231228
+pdfminer.six>=20250416
 tqdm>=4.67.1
 # The requirements.txt must ensure that only necessary external dependencies are introduced. If there are new dependencies to add, please contact the project administrator.
@@ -239,6 +239,22 @@
   "created_at": "2025-04-14T10:40:54Z",
   "repoId": 765083837,
   "pullRequestNo": 2226
 },
 {
   "name": "vloum",
   "id": 75369577,
   "comment_id": 2811669681,
   "created_at": "2025-04-17T03:54:59Z",
   "repoId": 765083837,
   "pullRequestNo": 2267
 },
 {
   "name": "kowyo",
   "id": 110339237,
   "comment_id": 2829263082,
   "created_at": "2025-04-25T02:54:20Z",
   "repoId": 765083837,
   "pullRequestNo": 2367
 }
 ]
}
@@ -41,7 +41,7 @@ class TestppTableModel(unittest.TestCase):
 # Check the first data row
 first_row = tree.xpath('//table/tr[2]/td')
 assert len(first_row) == 5, "First row should have 5 cells"
-assert first_row[0].text and first_row[0].text.strip() == "SegLink[26]", "First cell should be 'SegLink[26]'"
+assert first_row[0].text and 'SegLink' in first_row[0].text.strip(), "First cell should be 'SegLink [26]'"
 assert first_row[1].text and first_row[1].text.strip() == "70.0", "Second cell should be '70.0'"
 assert first_row[2].text and first_row[2].text.strip() == "86.0", "Third cell should be '86.0'"
 assert first_row[3].text and first_row[3].text.strip() == "77.0", "Fourth cell should be '77.0'"