Merge pull request #2298 from opendatalab/release-1.3.6

Release 1.3.6
Merge pull request #2297 from myhloli/dev
22 changed files with 307 additions and 120 deletions
--- a/.github/ISSUE_TEMPLATE/bug_report.yml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -1,4 +1,4 @@
-name: Bug Report | 反馈 Bug
+name: 🐛 Bug Report
 description: Create a bug report for MinerU | MinerU 的 Bug 反馈
 labels: bug

@@ -6,14 +6,32 @@ labels: bug
 # empty string, Github seems to reject this .yml file.

 body:
+  - type: markdown
+    attributes:
+      value: |
+        Thank you for submitting a MinerU 🐛 Bug Report! | 感谢您提交 MinerU 🐛 Bug 反馈！
+
+  - type: checkboxes
+    attributes:
+      label: 🔎 Search before asking | 提交之前请先搜索
+      description: >
+        Please search the MinerU [Readme](https://github.com/opendatalab/MinerU), [Issues](https://github.com/opendatalab/MinerU/issues) and [Discussions](https://github.com/opendatalab/MinerU/discussions) to see if a similar bug report already exists.
+      options:
+        - label: I have searched the MinerU [Readme](https://github.com/opendatalab/MinerU) and found no similar bug report.
+          required: true
+        - label: I have searched the MinerU [Issues](https://github.com/opendatalab/MinerU/issues) and found no similar bug report.
+          required: true
+        - label: I have searched the MinerU [Discussions](https://github.com/opendatalab/MinerU/discussions) and found no similar bug report.
+          required: true

  - type: textarea
    id: description
    attributes:
      label: Description of the bug | 错误描述
      description: |
-        A clear and concise description of the bug. | 简单描述遇到的问题  
-        
+        Provide console output with error messages and/or screenshots of the bug. | 请提供详细报错信息或者截图
+      placeholder: |
+        💡 ProTip! Include as much information as possible (screenshots, logs, tracebacks etc.) to receive the most helpful response.
    validations:
      required: true
  
@@ -24,11 +42,12 @@ body:
      
      # Should not word-wrap this description here.
      description: |
-        * Explain the steps required to reproduce the bug. | 说明复现此错误所需的步骤。
-        * Include required code snippets, example files, etc. | 包含必要的代码片段、示例文件等。
-        * Describe what you expected to happen (if not obvious). | 描述你期望发生的情况。
-        * If applicable, add screenshots to help explain the problem. | 添加截图以帮助解释问题。
-        * Include any other information that could be relevant, for example information about the Python environment. | 包括任何其他可能相关的信息。
+        If you have questions about the parsing results or encounter errors during execution: | 如对解析结果有疑问或在运行中出现报错等异常:
+        * Provide a minimal reproducible example. | 请提供一个最小可复现的demo。 
+        * The demo should include the complete steps, code, and the PDF file to be parsed. | demo需要包含完整的操作步骤，代码，以及需要解析的PDF文件。
+        * When reporting parsing result anomalies and runtime errors, reproducible PDF files are essential. If the document is too large or confidential, you can print the problematic page(s) via the browser and submit the corresponding example file.
+        * 在反馈解析结果异常和运行时报错时，可复现的PDF文件是必不可少的，如文档过大或涉密，您可通过浏览器打印出出现问题的某一页或某几页再提交相应的示例文件。
+        
        
        For problems when building or installing MinerU: | 在构建或安装 MinerU 时遇到的问题:
        * Give the **exact** build/install commands that were run. | 提供**确切**的构建/安装命令。
@@ -44,9 +63,9 @@ body:


  - type: dropdown
-    id: os_name
+    id: os_mode
    attributes:
-      label: Operating system | 操作系统
+      label: Operating System Mode | 操作系统类型
      #multiple: true
      options:
        -
@@ -56,6 +75,22 @@ body:
    validations:
      required: true

+  - type: textarea
+    id: os_name_version
+    attributes:
+      label: Operating System Version| 操作系统版本
+      #multiple: true
+      description: |
+        * 如果您使用的是Linux系统，请提供Linux系统的**发行版名称**和**版本号**来帮助开发人员排查问题。 
+        * If you are using a Linux system, please provide the Linux distribution and version number to help developers troubleshoot the issue.
+        * 如果您使用的是Windows或MacOS系统，请提供操作系统的**版本号**来帮助开发人员排查问题。
+        * If you are using a Windows or MacOS system, please provide the version number of the operating system to help developers troubleshoot the issue.
+        * 例如：Ubuntu 22.04, CentOS 7.9, MacOS 15.1, Windows 11
+        * For example: Ubuntu 22.04, CentOS 7.9, MacOS 15.1, Windows 11.
+
+    validations:
+      required: true
+
  - type: dropdown
    id: python_version
    attributes:
@@ -94,6 +129,7 @@ body:
        -
        - cpu
        - cuda
+        - mps
        - npu
    validations:
      required: true
--- a/.github/ISSUE_TEMPLATE/config.yml
+++ b/.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,11 @@
+blank_issues_enabled: false
+contact_links:
+  - name: 🙏 Q&A
+    url: https://github.com/opendatalab/MinerU/discussions/categories/q-a
+    about: Ask the community for help
+  - name: 💡 Feature requests and ideas
+    url: https://github.com/opendatalab/MinerU/discussions/categories/ideas
+    about: Share ideas for new features
+  - name: 🙌 Show and tell
+    url: https://github.com/opendatalab/MinerU/discussions/categories/show-and-tell
+    about: Show off something you've made
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -1,28 +0,0 @@
---
-name: Feature request | 功能需求
-about: Suggest an idea for this project | 提出一个有价值的idea
-title: ''
-labels: enhancement
-assignees: ''
-
---
-
-**Is your feature request related to a problem? Please describe.**
-**您的特性请求是否与某个问题相关？请描述。**
-A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
-对存在的问题进行清晰且简洁的描述。例如：我一直很困扰的是 [...]
-
-**Describe the solution you'd like**
-**描述您期望的解决方案**
-A clear and concise description of what you want to happen.
-清晰且简洁地描述您希望实现的内容。
-
-**Describe alternatives you've considered**
-**描述您已考虑的替代方案**
-A clear and concise description of any alternative solutions or features you've considered.
-清晰且简洁地描述您已经考虑过的任何替代解决方案。
-
-**Additional context**
-**提供更多细节**
-Add any other context or screenshots about the feature request here.
-请附上任何相关截图、链接或文件，以帮助我们更好地理解您的请求。
--- a/README.md
+++ b/README.md
@@ -48,6 +48,9 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>

 # Changelog
+- 2025/04/16 1.3.4 Released
+  - Slightly improved the speed of OCR detection by removing some unused blocks.
+  - Fixed page-level sorting errors caused by footnotes in certain cases.
 - 2025/04/12 1.3.2 released
  - Fixed the issue of incompatible dependency package versions when installing in Python 3.13 environment on Windows systems.
  - Optimized memory usage during batch inference.
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -47,6 +47,9 @@
 </div>

 # 更新记录
+- 2025/04/16 1.3.4 发布
+  - 通过移除一些无用的块，小幅提升了ocr-det的速度
+  - 修复部分情况下由footnote导致的页面内排序错误
 - 2025/04/12 1.3.2 发布
  - 修复了windows系统下，在python3.13环境安装时一些依赖包版本不兼容的问题
  - 优化批量推理时的内存占用
--- a/docker/ascend_npu/Dockerfile
+++ b/docker/ascend_npu/Dockerfile
@@ -35,7 +35,8 @@ RUN /bin/bash -c "wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/m
    cp magic-pdf.template.json /root/magic-pdf.json && \
    source /opt/mineru_venv/bin/activate && \
    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple && \
-    pip3 install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple && \
+    pip3 install torch==2.3.1 torchvision==0.18.1 -i https://mirrors.aliyun.com/pypi/simple && \
+    pip3 install -U magic-pdf[full] 'numpy<2' decorator attrs absl-py cloudpickle ml-dtypes tornado einops -i https://mirrors.aliyun.com/pypi/simple && \
    wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc2-pytorch2.3.1/torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl && \
    pip3 install torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl"

--- a/docker/china/Dockerfile
+++ b/docker/china/Dockerfile
@@ -18,7 +18,17 @@ RUN apt-get update && \
        wget \
        git \
        libgl1 \
+        libreoffice \
+        fonts-noto-cjk \
+        fonts-wqy-zenhei \
+        fonts-wqy-microhei \
+        ttf-mscorefonts-installer \
+        fontconfig \
        libglib2.0-0 \
+        libxrender1 \
+        libsm6 \
+        libxext6 \
+        poppler-utils \
        && rm -rf /var/lib/apt/lists/*

 # Set Python 3.10 as the default python3
--- a/docker/global/Dockerfile
+++ b/docker/global/Dockerfile
@@ -18,7 +18,17 @@ RUN apt-get update && \
        wget \
        git \
        libgl1 \
+        libreoffice \
+        fonts-noto-cjk \
+        fonts-wqy-zenhei \
+        fonts-wqy-microhei \
+        ttf-mscorefonts-installer \
+        fontconfig \
        libglib2.0-0 \
+        libxrender1 \
+        libsm6 \
+        libxext6 \
+        poppler-utils \
        && rm -rf /var/lib/apt/lists/*

 # Set Python 3.10 as the default python3
--- a/magic_pdf/data/read_api.py
+++ b/magic_pdf/data/read_api.py
@@ -116,7 +116,7 @@ def read_local_office(path: str) -> list[PymuDocDataset]:
    shutil.rmtree(temp_dir)
    return ret

-def read_local_images(path: str, suffixes: list[str]=['.png', '.jpg']) -> list[ImageDataset]:
+def read_local_images(path: str, suffixes: list[str]=['.png', '.jpg', '.jpeg']) -> list[ImageDataset]:
    """Read images from path or directory.

    Args:
--- a/magic_pdf/libs/version.py
+++ b/magic_pdf/libs/version.py
@@ -1 +1 @@
-__version__ = "1.3.1"
+__version__ = "1.3.5"
--- a/magic_pdf/model/doc_analyze_by_custom_model.py
+++ b/magic_pdf/model/doc_analyze_by_custom_model.py
@@ -147,7 +147,7 @@ def doc_analyze(
            images.append(img_dict['img'])
            page_wh_list.append((img_dict['width'], img_dict['height']))

-    images_with_extra_info = [(images[index], ocr, dataset._lang) for index in range(len(dataset))]
+    images_with_extra_info = [(images[index], ocr, dataset._lang) for index in range(len(images))]

    if len(images) >= MIN_BATCH_INFERENCE_SIZE:
        batch_size = MIN_BATCH_INFERENCE_SIZE
--- a/magic_pdf/model/sub_modules/model_utils.py
+++ b/magic_pdf/model/sub_modules/model_utils.py
@@ -2,6 +2,8 @@ import time
 import torch
 from loguru import logger
 import numpy as np
+
+from magic_pdf.libs.boxbase import get_minbox_if_overlap_by_ratio
 from magic_pdf.libs.clean_memory import clean_memory


@@ -188,9 +190,46 @@ def filter_nested_tables(table_res_list, overlap_threshold=0.8, area_threshold=0
    return [table for i, table in enumerate(table_res_list) if i not in big_tables_idx]


+def remove_overlaps_min_blocks(res_list):
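+    #  Overlapping blocks: the smaller one cannot simply be deleted; it must be merged with the larger one into a bigger block.
+    #  Remove the smaller of the overlapping blocks.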
+    need_remove = []
+    for res1 in res_list:
+        for res2 in res_list:
+            if res1 != res2:
+                overlap_box = get_minbox_if_overlap_by_ratio(
+                    res1['bbox'], res2['bbox'], 0.8
+                )
+                if overlap_box is not None:
+                    res_to_remove = next(
+                        (res for res in res_list if res['bbox'] == overlap_box),
+                        None,
+                    )
+                    if (
+                        res_to_remove is not None
+                        and res_to_remove not in need_remove
+                    ):
+                        large_res = res1 if res1 != res_to_remove else res2
+                        x1, y1, x2, y2 = large_res['bbox']
+                        sx1, sy1, sx2, sy2 = res_to_remove['bbox']
+                        x1 = min(x1, sx1)
+                        y1 = min(y1, sy1)
+                        x2 = max(x2, sx2)
+                        y2 = max(y2, sy2)
+                        large_res['bbox'] = [x1, y1, x2, y2]
+                        need_remove.append(res_to_remove)
+
+    if len(need_remove) > 0:
+        for res in need_remove:
+            res_list.remove(res)
+
+    return res_list, need_remove
+
+
 def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshold=0.8, area_threshold=0.8):
    """Extract OCR, table and other regions from layout results."""
    ocr_res_list = []
+    text_res_list = []
    table_res_list = []
    table_indices = []
    single_page_mfdetrec_res = []
@@ -204,11 +243,14 @@ def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshol
                "bbox": [int(res['poly'][0]), int(res['poly'][1]),
                         int(res['poly'][4]), int(res['poly'][5])],
            })
-        elif category_id in [0, 1, 2, 4, 6, 7]:  # OCR regions
+        elif category_id in [0, 2, 4, 6, 7]:  # OCR regions
            ocr_res_list.append(res)
        elif category_id == 5:  # Table regions
            table_res_list.append(res)
            table_indices.append(i)
+        elif category_id in [1]:  # Text regions
+            res['bbox'] = [int(res['poly'][0]), int(res['poly'][1]), int(res['poly'][4]), int(res['poly'][5])]
+            text_res_list.append(res)

    # Process tables: merge high IoU tables first, then filter nested tables
    table_res_list, table_indices = merge_high_iou_tables(
@@ -226,6 +268,22 @@ def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshol
        for idx in sorted(to_remove, reverse=True):
            del layout_res[idx]

+    # Remove overlaps in OCR and text regions
+    text_res_list, need_remove = remove_overlaps_min_blocks(text_res_list)
+    for res in text_res_list:
+        # Rebuild res's poly from its bbox
+        res['poly'] = [res['bbox'][0], res['bbox'][1], res['bbox'][2], res['bbox'][1],
+                       res['bbox'][2], res['bbox'][3], res['bbox'][0], res['bbox'][3]]
+        # Delete the bbox from res
+        del res['bbox']
+
+    ocr_res_list.extend(text_res_list)
+
+    if len(need_remove) > 0:
+        for res in need_remove:
+            del res['bbox']
+            layout_res.remove(res)
+
    return ocr_res_list, filtered_table_res_list, single_page_mfdetrec_res
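
Reviewer note: the merge step added in `remove_overlaps_min_blocks` can be illustrated with a small standalone sketch. This is not the code from the diff. It identifies boxes by index instead of by bbox equality, and `overlap_ratio_of_smaller` is a hypothetical stand-in for `get_minbox_if_overlap_by_ratio`, whose exact behavior is assumed here (overlap area divided by the smaller box's area, compared against 0.8).

```python
from typing import List

Box = List[int]  # [x0, y0, x1, y1]


def area(box: Box) -> int:
    """Area of an axis-aligned box; degenerate boxes count as zero."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap_ratio_of_smaller(a: Box, b: Box) -> float:
    """Hypothetical helper: intersection area divided by the smaller box's area."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    smaller = min(area(a), area(b))
    return inter / smaller if smaller > 0 else 0.0


def merge_overlapping(boxes: List[Box], ratio: float = 0.8) -> List[Box]:
    """Absorb each mostly-covered small box into the union with its larger partner."""
    boxes = [list(b) for b in boxes]  # work on copies, leave the input untouched
    removed = set()
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if i in removed or j in removed:
                continue
            if overlap_ratio_of_smaller(boxes[i], boxes[j]) <= ratio:
                continue
            small, large = (i, j) if area(boxes[i]) <= area(boxes[j]) else (j, i)
            s, l = boxes[small], boxes[large]
            # Grow the larger box so it also covers the smaller one.
            boxes[large] = [min(l[0], s[0]), min(l[1], s[1]),
                            max(l[2], s[2]), max(l[3], s[3])]
            removed.add(small)
    return [b for k, b in enumerate(boxes) if k not in removed]
```

With `[[0, 0, 100, 100], [10, 10, 50, 110], [200, 200, 300, 300]]` the second box is 90% covered by the first, so the first grows to `[0, 0, 100, 110]` and the second is dropped; the third box is untouched.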


--- a/magic_pdf/pdf_parse_union_core_v2.py
+++ b/magic_pdf/pdf_parse_union_core_v2.py
@@ -490,7 +490,7 @@ def insert_lines_into_block(block_bbox, line_height, page_w, page_h):
        return [[x0, y0, x1, y1]]


-def sort_lines_by_model(fix_blocks, page_w, page_h, line_height):
+def sort_lines_by_model(fix_blocks, page_w, page_h, line_height, footnote_blocks):
    page_line_list = []

    def add_lines_to_block(b):
@@ -519,6 +519,10 @@ def sort_lines_by_model(fix_blocks, page_w, page_h, line_height):
            block['real_lines'] = copy.deepcopy(block['lines'])
            add_lines_to_block(block)

+    for block in footnote_blocks:
+        footnote_block = {'bbox': block[:4]}
+        add_lines_to_block(footnote_block)
+
    if len(page_line_list) > 200:  # layoutreader supports at most 512 lines
        return None

@@ -779,7 +783,7 @@ def parse_page_core(
    # The interline_equation_blocks parameter is not accurate enough; switch to interline_equations later
    interline_equation_blocks = []
    if len(interline_equation_blocks) > 0:
-        all_bboxes, all_discarded_blocks = ocr_prepare_bboxes_for_layout_split_v2(
+        all_bboxes, all_discarded_blocks, footnote_blocks = ocr_prepare_bboxes_for_layout_split_v2(
            img_body_blocks, img_caption_blocks, img_footnote_blocks,
            table_body_blocks, table_caption_blocks, table_footnote_blocks,
            discarded_blocks,
@@ -790,7 +794,7 @@ def parse_page_core(
            page_h,
        )
    else:
-        all_bboxes, all_discarded_blocks = ocr_prepare_bboxes_for_layout_split_v2(
+        all_bboxes, all_discarded_blocks, footnote_blocks = ocr_prepare_bboxes_for_layout_split_v2(
            img_body_blocks, img_caption_blocks, img_footnote_blocks,
            table_body_blocks, table_caption_blocks, table_footnote_blocks,
            discarded_blocks,
@@ -866,7 +870,7 @@ def parse_page_core(
    line_height = get_line_height(fix_blocks)

    """Collect all lines and sort them"""
-    sorted_bboxes = sort_lines_by_model(fix_blocks, page_w, page_h, line_height)
+    sorted_bboxes = sort_lines_by_model(fix_blocks, page_w, page_h, line_height, footnote_blocks)

    """Compute the block order from the median of its lines"""
    fix_blocks = cal_block_index(fix_blocks, sorted_bboxes)
--- a/magic_pdf/pre_proc/ocr_detect_all_bboxes.py
+++ b/magic_pdf/pre_proc/ocr_detect_all_bboxes.py
@@ -99,11 +99,11 @@ def ocr_prepare_bboxes_for_layout_split_v2(
    all_discarded_blocks = []
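    add_bboxes(discarded_blocks, BlockType.Discarded, all_discarded_blocks)

-    """Footnote detection: wider than 1/3 of the page width, taller than 10, and located in the bottom 50% of the page"""
+    """Footnote detection: wider than 1/3 of the page width, taller than 10, and located in the bottom 30% of the page"""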
    footnote_blocks = []
    for discarded in discarded_blocks:
        x0, y0, x1, y1 = discarded['bbox']
-        if (x1 - x0) > (page_w / 3) and (y1 - y0) > 10 and y0 > (page_h / 2):
+        if (x1 - x0) > (page_w / 3) and (y1 - y0) > 10 and y0 > (page_h * 0.7):
            footnote_blocks.append([x0, y0, x1, y1])

    """Remove any box that lies below a footnote"""
@@ -119,7 +119,7 @@ def ocr_prepare_bboxes_for_layout_split_v2(
    """Separate the remaining bboxes to avoid errors when splitting the layout later"""
    # all_bboxes, drop_reasons = remove_overlap_between_bbox_for_block(all_bboxes)
    all_bboxes.sort(key=lambda x: x[0]+x[1])
-    return all_bboxes, all_discarded_blocks
+    return all_bboxes, all_discarded_blocks, footnote_blocks


 def find_blocks_under_footnote(all_bboxes, footnote_blocks):
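
Reviewer note: the tightened footnote rule above (width over one third of the page, height over 10, top edge below 70% of the page height instead of 50%) can be checked in isolation with a sketch like the following. The function name `is_footnote_candidate` and the `min_top_fraction` parameter are illustrative assumptions, not names from the repository.

```python
from typing import Sequence


def is_footnote_candidate(
    bbox: Sequence[float],
    page_w: float,
    page_h: float,
    min_top_fraction: float = 0.7,  # 0.5 reproduces the rule before this change
) -> bool:
    """Return True when a discarded block looks like a footnote.

    Mirrors the three conditions in the diff: wide enough, tall enough,
    and starting low enough on the page.
    """
    x0, y0, x1, y1 = bbox
    wide_enough = (x1 - x0) > (page_w / 3)
    tall_enough = (y1 - y0) > 10
    low_enough = y0 > (page_h * min_top_fraction)
    return wide_enough and tall_enough and low_enough
```

On a 600 x 800 page, a block at `[50, 450, 400, 470]` passed the old 50% rule (450 > 400) but fails the new one (450 is not greater than 560), which is the kind of mid-page block the change is meant to stop treating as a footnote.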
--- a/magic_pdf/utils/office_to_pdf.py
+++ b/magic_pdf/utils/office_to_pdf.py
@@ -1,6 +1,10 @@
 import os
 import subprocess
+import platform
 from pathlib import Path
+import shutil
+
+from loguru import logger


 class ConvertToPdfError(Exception):
@@ -9,21 +13,103 @@ class ConvertToPdfError(Exception):
        super().__init__(self.msg)


+def check_fonts_installed():
+    """Check if required Chinese fonts are installed."""
+    system_type = platform.system()
+
+    if system_type in ['Windows', 'Darwin']:
+        pass
+    else:
+        # Linux: use fc-list
+        try:
+            output = subprocess.check_output(['fc-list', ':lang=zh'], encoding='utf-8')
+            if output.strip():  # any non-empty output counts as a match
+                return True
+            else:
+                logger.warning(
+                    f"No Chinese fonts were detected, the converted document may not display Chinese content properly."
+                )
+        except Exception:
+            pass
+
+
+def get_soffice_command():
+    """Return the path to LibreOffice's soffice executable depending on the platform."""
+    system_type = platform.system()
+
+    # First check if soffice is in PATH
+    soffice_path = shutil.which('soffice')
+    if soffice_path:
+        return soffice_path
+
+    if system_type == 'Windows':
+        # Check common installation paths
+        possible_paths = [
+            Path(os.environ.get('PROGRAMFILES', 'C:/Program Files')) / 'LibreOffice/program/soffice.exe',
+            Path(os.environ.get('PROGRAMFILES(X86)', 'C:/Program Files (x86)')) / 'LibreOffice/program/soffice.exe',
+            Path('C:/Program Files/LibreOffice/program/soffice.exe'),
+            Path('C:/Program Files (x86)/LibreOffice/program/soffice.exe')
+        ]
+
+        # Check other drives for windows
+        for drive in ['C:', 'D:', 'E:', 'F:', 'G:', 'H:']:
+            possible_paths.append(Path(f"{drive}/LibreOffice/program/soffice.exe"))
+
+        for path in possible_paths:
+            if path.exists():
+                return str(path)
+
+        raise ConvertToPdfError(
+            "LibreOffice not found. Please install LibreOffice from https://www.libreoffice.org/ "
+            "or ensure soffice.exe is in your PATH environment variable."
+        )
+    else:
+        # For Linux/macOS, provide installation instructions if not found
+        try:
+            # Try to find soffice in standard locations
+            possible_paths = [
+                '/usr/bin/soffice',
+                '/usr/local/bin/soffice',
+                '/opt/libreoffice/program/soffice',
+                '/Applications/LibreOffice.app/Contents/MacOS/soffice'
+            ]
+            for path in possible_paths:
+                if os.path.exists(path):
+                    return path
+
+            raise ConvertToPdfError(
+                "LibreOffice not found. Please install it:\n"
+                "  - Ubuntu/Debian: sudo apt-get install libreoffice\n"
+                "  - CentOS/RHEL: sudo yum install libreoffice\n"
+                "  - macOS: brew install libreoffice or download from https://www.libreoffice.org/\n"
+                "  - Or ensure soffice is in your PATH environment variable."
+            )
+        except Exception as e:
+            raise ConvertToPdfError(f"Error locating LibreOffice: {str(e)}")
+
+
 def convert_file_to_pdf(input_path, output_dir):
+    """Convert a single document (ppt, doc, etc.) to PDF."""
    if not os.path.isfile(input_path):
        raise FileNotFoundError(f"The input file {input_path} does not exist.")

    os.makedirs(output_dir, exist_ok=True)
-    
+
+    check_fonts_installed()
+
+    soffice_cmd = get_soffice_command()
+
    cmd = [
-        'soffice',
+        soffice_cmd,
        '--headless',
+        '--norestore',
+        '--invisible',
        '--convert-to', 'pdf',
        '--outdir', str(output_dir),
        str(input_path)
    ]
-    
+
    process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-    
+
    if process.returncode != 0:
-        raise ConvertToPdfError(process.stderr.decode())
+        raise ConvertToPdfError(f"LibreOffice convert failed: {process.stderr.decode()}")
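
Reviewer note: the lookup order in `get_soffice_command` (PATH first, then a list of well-known install locations) is a common pattern and can be exercised without LibreOffice installed. The sketch below is a generic version under that assumption; `find_executable` is a hypothetical helper, not part of `magic_pdf`, and it returns `None` instead of raising `ConvertToPdfError`.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_executable(name: str, fallbacks: Iterable[str]) -> Optional[str]:
    """Locate an executable: PATH first, then the given fallback paths in order."""
    found = shutil.which(name)
    if found:
        return found
    for candidate in fallbacks:
        # Only accept fallbacks that exist as regular files.
        if Path(candidate).is_file():
            return candidate
    return None
```

A caller would pass `'soffice'` together with the platform-specific candidate paths listed in the diff and raise its own error when the result is `None`.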
--- a/projects/README.md
+++ b/projects/README.md
@@ -4,6 +4,6 @@

 - [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
 - [gradio_app](./gradio_app/README.md): Build a web app based on gradio
- [web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version
+- ~~[web_demo](./web_demo/README.md): MinerU online [demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/) localized deployment version~~(Deprecated)
 - [web_api](./web_api/README.md): Web API Based on FastAPI
 - [multi_gpu](./multi_gpu/README.md): Multi-GPU parallel processing based on LitServe
--- a/projects/README_zh-CN.md
+++ b/projects/README_zh-CN.md
@@ -4,6 +4,6 @@

 - [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统
 - [gradio_app](./gradio_app/README_zh-CN.md): 基于 Gradio 的 Web 应用
- [web_demo](./web_demo/README_zh-CN.md): MinerU在线[demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)本地化部署版本
+- ~~[web_demo](./web_demo/README_zh-CN.md): MinerU在线[demo](https://opendatalab.com/OpenSourceTools/Extractor/PDF/)本地化部署版本~~(已过时)
 - [web_api](./web_api/README.md): 基于 FastAPI 的 Web API
 - [multi_gpu](./multi_gpu/README.md): 基于 LitServe 的多 GPU 并行处理
--- a/projects/web_api/app.py
+++ b/projects/web_api/app.py
@@ -28,7 +28,7 @@ app = FastAPI()

 pdf_extensions = [".pdf"]
 office_extensions = [".ppt", ".pptx", ".doc", ".docx"]
-image_extensions = [".png", ".jpg"]
+image_extensions = [".png", ".jpg", ".jpeg"]

 class MemoryDataWriter(DataWriter):
    def __init__(self):
@@ -128,7 +128,7 @@ def process_file(
        Tuple[InferenceResult, PipeResult]: Returns inference result and pipeline result
    """

-    ds = Union[PymuDocDataset, ImageDataset]
+    ds: Union[PymuDocDataset, ImageDataset] = None
    if file_extension in pdf_extensions:
        ds = PymuDocDataset(file_bytes)
    elif file_extension in office_extensions:
--- a/setup.py
+++ b/setup.py
@@ -43,7 +43,7 @@ if __name__ == '__main__':
                     "matplotlib>=3.10,<4",
                     "ultralytics>=8.3.48,<9",  # yolov8,公式检测
                     "doclayout_yolo==0.0.2b1",  # doclayout_yolo
-                     "dill>=0.3.9,<1",  # doclayout_yolo
+                     "dill>=0.3.8,<1",  # doclayout_yolo
                     "rapid_table>=1.0.5,<2.0.0",  # rapid_table
                     "PyYAML>=6.0.2,<7",  # yaml
                     "ftfy>=6.3.1,<7",  # unimernet_hf
@@ -56,7 +56,7 @@ if __name__ == '__main__':
                    "matplotlib>=3.10,<=3.10.1",
                    "ultralytics>=8.3.48,<=8.3.104",  # yolov8,公式检测
                    "doclayout_yolo==0.0.2b1",  # doclayout_yolo
-                    "dill==0.3.9",  # doclayout_yolo
+                    "dill==0.3.8",  # doclayout_yolo
                    "PyYAML==6.0.2",  # yaml
                    "ftfy==6.3.1",  # unimernet_hf
                    "openai==1.71.0",  # openai SDK
--- a/signatures/version1/cla.json
+++ b/signatures/version1/cla.json
@@ -223,6 +223,30 @@
      "created_at": "2025-03-24T12:58:56Z",
      "repoId": 765083837,
      "pullRequestNo": 1982
+    },
+    {
+      "name": "zjx20",
+      "id": 2639200,
+      "comment_id": 2800714918,
+      "created_at": "2025-04-14T07:25:26Z",
+      "repoId": 765083837,
+      "pullRequestNo": 2215
+    },
+    {
+      "name": "Doge2077",
+      "id": 91442300,
+      "comment_id": 2801283257,
+      "created_at": "2025-04-14T10:40:54Z",
+      "repoId": 765083837,
+      "pullRequestNo": 2226
+    },
+    {
+      "name": "vloum",
+      "id": 75369577,
+      "comment_id": 2811669681,
+      "created_at": "2025-04-17T03:54:59Z",
+      "repoId": 765083837,
+      "pullRequestNo": 2267
    }
  ]
 }
--- a/tests/test_cli/test_cli_sdk.py
+++ b/tests/test_cli/test_cli_sdk.py
@@ -323,44 +323,6 @@ class TestCli:
        logging.info(cmd)
        os.system(cmd)
    
-
-    @pytest.mark.P1
-    def test_local_magic_pdf_open_st_table(self):
-        """magic pdf cli open st table."""
-        time.sleep(2)
-        #pre_cmd = "cp ~/magic_pdf_st.json ~/magic-pdf.json"
-        value = {
-        "model": "struct_eqtable",
-        "enable": True,
-        "max_time": 400
-        }   
-        common.update_config_file(magic_pdf_config, "table-config", value)
-        pdf_path = os.path.join(pdf_dev_path, "pdf", "test_rearch_report.pdf")
-        common.delete_file(pdf_res_path)
-        cli_cmd = "magic-pdf -p %s -o %s" % (pdf_path, pdf_res_path)
-        os.system(cli_cmd)
-        res = common.check_html_table_exists(os.path.join(pdf_res_path, "test_rearch_report", "auto", "test_rearch_report.md"))
-        assert res is True
-  
-    @pytest.mark.P1
-    def test_local_magic_pdf_open_tablemaster_cuda(self):
-        """magic pdf cli open table master html table cuda mode."""
-        time.sleep(2)
-        #pre_cmd = "cp ~/magic_pdf_html.json ~/magic-pdf.json"
-        #os.system(pre_cmd)
-        value = {
-        "model": "tablemaster",
-        "enable": True,
-        "max_time": 400
-        }   
-        common.update_config_file(magic_pdf_config, "table-config", value)
-        pdf_path = os.path.join(pdf_dev_path, "pdf", "test_rearch_report.pdf")
-        common.delete_file(pdf_res_path)
-        cli_cmd = "magic-pdf -p %s -o %s" % (pdf_path, pdf_res_path)
-        os.system(cli_cmd)
-        res = common.check_html_table_exists(os.path.join(pdf_res_path, "test_rearch_report", "auto", "test_rearch_report.md"))
-        assert res is True
-    
    @pytest.mark.P1
    def test_local_magic_pdf_open_rapidai_table(self):
        """magic pdf cli open rapid ai table."""
@@ -370,6 +332,7 @@ class TestCli:
        value = {
        "model": "rapid_table",
        "enable": True,
+        "sub_model": "slanet_plus",
        "max_time": 400
        }   
        common.update_config_file(magic_pdf_config, "table-config", value)
@@ -397,6 +360,7 @@ class TestCli:
        os.system(cli_cmd)
        common.cli_count_folders_and_check_contents(os.path.join(pdf_res_path, "test_rearch_report", "auto"))

+    @pytest.mark.skip(reason="layoutlmv3 is deprecated")
    @pytest.mark.P1
    def test_local_magic_pdf_layoutlmv3_yolo(self):
        """magic pdf cli open layoutlmv3."""
@@ -419,8 +383,9 @@ class TestCli:
        #pre_cmd = "cp ~/magic_pdf_html_table_cpu.json ~/magic-pdf.json"
        #os.system(pre_cmd)
        value = {
-        "model": "tablemaster",
-        "enable": False,
+        "model": "rapid_table",
+        "enable": True,
+        "sub_model": "slanet_plus",
        "max_time": 400
        }   
        common.update_config_file(magic_pdf_config, "table-config", value)
@@ -439,8 +404,9 @@ class TestCli:
        #pre_cmd = "cp ~/magic_pdf_close_table.json ~/magic-pdf.json"
        #os.system(pre_cmd)
        value = {
-        "model": "tablemaster",
+        "model": "rapid_table",
        "enable": False,
+        "sub_model": "slanet_plus",
        "max_time": 400
        }   
        common.update_config_file(magic_pdf_config, "table-config", value)
--- a/tests/unittest/test_table/test_tablemaster.py
+++ b/tests/unittest/test_table/test_tablemaster.py
@@ -1,32 +1,36 @@
 import unittest
+import os
 from PIL import Image
 from lxml import etree

-from magic_pdf.model.sub_modules.table.tablemaster.tablemaster_paddle import TableMasterPaddleModel
+from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
+from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel


 class TestppTableModel(unittest.TestCase):
    def test_image2html(self):
-        img = Image.open("tests/unittest/test_table/assets/table.jpg")
-        # 修改table模型路径
-        config = {"device": "cuda",
-                  "model_dir": "/home/quyuan/.cache/modelscope/hub/opendatalab/PDF-Extract-Kit/models/TabRec/TableMaster"}
-        table_model = TableMasterPaddleModel(config)
-        res = table_model.img2html(img)
+        img = Image.open(os.path.join(os.path.dirname(__file__), "assets/table.jpg"))
+        atom_model_manager = AtomModelSingleton()
+        ocr_engine = atom_model_manager.get_atom_model(
+            atom_model_name='ocr',
+            ocr_show_log=False,
+            det_db_box_thresh=0.5,
+            det_db_unclip_ratio=1.6,
+            lang='ch'
+        )
+        table_model = RapidTableModel(ocr_engine, 'slanet_plus')
+        html_code, table_cell_bboxes, logic_points, elapse = table_model.predict(img)
        # Verify that the generated HTML matches expectations
        parser = etree.HTMLParser()
-        tree = etree.fromstring(res, parser)
+        tree = etree.fromstring(html_code, parser)

        # Check the HTML structure
        assert tree.find('.//table') is not None, "HTML should contain a <table> element"
-        assert tree.find('.//thead') is not None, "HTML should contain a <thead> element"
-        assert tree.find('.//tbody') is not None, "HTML should contain a <tbody> element"
        assert tree.find('.//tr') is not None, "HTML should contain a <tr> element"
        assert tree.find('.//td') is not None, "HTML should contain a <td> element"

        # Check the specific table contents
-        headers = tree.xpath('//thead/tr/td/b')
-        print(headers)  # Print headers for debugging
+        headers = tree.xpath('//table/tr[1]/td')
        assert len(headers) == 5, "Thead should have 5 columns"
        assert headers[0].text and headers[0].text.strip() == "Methods", "First header should be 'Methods'"
        assert headers[1].text and headers[1].text.strip() == "R", "Second header should be 'R'"
@@ -35,7 +39,7 @@ class TestppTableModel(unittest.TestCase):
        assert headers[4].text and headers[4].text.strip() == "FPS", "Fifth header should be 'FPS'"

        # Check the first data row
-        first_row = tree.xpath('//tbody/tr[1]/td')
+        first_row = tree.xpath('//table/tr[2]/td')
        assert len(first_row) == 5, "First row should have 5 cells"
        assert first_row[0].text and first_row[0].text.strip() == "SegLink[26]", "First cell should be 'SegLink[26]'"
        assert first_row[1].text and first_row[1].text.strip() == "70.0", "Second cell should be '70.0'"
@@ -44,14 +48,13 @@ class TestppTableModel(unittest.TestCase):
        assert first_row[4].text and first_row[4].text.strip() == "8.9", "Fifth cell should be '8.9'"

        # Check the second-to-last data row
-        second_last_row = tree.xpath('//tbody/tr[position()=last()-1]/td')
+        second_last_row = tree.xpath('//table/tr[position()=last()-1]/td')
        assert len(second_last_row) == 5, "second_last_row should have 5 cells"
-        assert second_last_row[0].text and second_last_row[
-            0].text.strip() == "Ours (SynText)", "First cell should be 'Ours (SynText)'"
+        assert second_last_row[0].text and second_last_row[0].text.strip() == "Ours (SynText)", "First cell should be 'Ours (SynText)'"
        assert second_last_row[1].text and second_last_row[1].text.strip() == "80.68", "Second cell should be '80.68'"
        assert second_last_row[2].text and second_last_row[2].text.strip() == "85.40", "Third cell should be '85.40'"
-        assert second_last_row[3].text and second_last_row[3].text.strip() == "82.97", "Fourth cell should be '82.97'"
-        assert second_last_row[3].text and second_last_row[4].text.strip() == "12.68", "Fifth cell should be '12.68'"
+        # assert second_last_row[3].text and second_last_row[3].text.strip() == "82.97", "Fourth cell should be '82.97'"
+        # assert second_last_row[3].text and second_last_row[4].text.strip() == "12.68", "Fifth cell should be '12.68'"


 if __name__ == "__main__":
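
The test change above replaces `//thead/...` and `//tbody/...` lookups with positional `//table/tr[n]/td` paths, which fits a table that has no `<thead>`/`<tbody>` wrappers. The following is a minimal sketch of that idea, not the project's test: it uses the standard-library `xml.etree.ElementTree` instead of `lxml`, and the table contents and the `cell_texts` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not real model output):
# rows sit directly under <table>, with no <thead> or <tbody>.
SAMPLE_HTML = (
    "<html><body><table>"
    "<tr><td>Name</td><td>Score</td></tr>"
    "<tr><td>alpha</td><td>1.0</td></tr>"
    "<tr><td>beta</td><td>2.0</td></tr>"
    "<tr><td>gamma</td><td>3.0</td></tr>"
    "</table></body></html>"
)


def cell_texts(tree, path):
    """Return the stripped text of every element matched by `path`."""
    return [(cell.text or "").strip() for cell in tree.findall(path)]


tree = ET.fromstring(SAMPLE_HTML)

# Without <thead>/<tbody>, the header is simply the first <tr> of the table.
headers = cell_texts(tree, ".//table/tr[1]/td")

# The first data row is therefore the second <tr>.
first_row = cell_texts(tree, ".//table/tr[2]/td")

# ElementTree's XPath subset accepts a position relative to the last one.
second_last_row = cell_texts(tree, ".//table/tr[last()-1]/td")

has_thead = tree.find(".//thead") is not None
```

ElementTree implements only a subset of XPath and requires well-formed input, so paths here start with `.//` and the predicate is written as `tr[last()-1]`. With `lxml`, as in the test itself, the fuller `position()=last()-1` form and lenient HTML parsing are available.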
Author	SHA1	Message	Date
Xiaomeng Zhao	601b44bfe0	Merge pull request #2298 from opendatalab/release-1.3.6 Release 1.3.6	2025-04-21 14:37:23 +08:00
Xiaomeng Zhao	012327badb	Merge pull request #2297 from myhloli/dev feat: add support for JPEG images and update documentation	2025-04-21 14:26:35 +08:00
myhloli	fcb5660f6a	feat: add support for JPEG images and update documentation - Add '.jpeg' to the list of supported image extensions in app.py and read_api.py - Update projects READMEs to indicate that web_demo is deprecated	2025-04-21 14:22:23 +08:00
myhloli	d105d87cf5	Merge remote-tracking branch 'origin/dev' into dev	2025-04-18 10:56:36 +08:00
myhloli	619b3b6d32	docs(README): update bug report template to reference Readme instead of Docs - Update the bug report template to direct users to search the MinerU Readme instead of Docs - This change ensures users check the most relevant and up-to-date information source before reporting issues	2025-04-18 10:56:26 +08:00
Xiaomeng Zhao	6fbbe3e6f0	Merge pull request #2274 from opendatalab/dev docs: update issue templates and disable blank issues	2025-04-17 18:46:05 +08:00
Xiaomeng Zhao	a47b17cd88	Merge pull request #2273 from myhloli/dev docs: update issue templates and disable blank issues	2025-04-17 18:45:26 +08:00
myhloli	737d7d6eb9	docs: update issue templates and disable blank issues - Update bug report template with more detailed instructions and sections - Add operating system version field to bug report - Include support for MPS in device options - Disable blank issues and provide alternative contact links - Remove feature request template	2025-04-17 18:44:20 +08:00
Xiaomeng Zhao	3492744ce1	Merge pull request #2269 from dt-yy/dev update test case	2025-04-17 15:23:19 +08:00
dt-yy	a1fe370270	update test case	2025-04-17 15:21:41 +08:00
dt-yy	fea756fd3e	update test case	2025-04-17 14:34:54 +08:00
dt-yy	e98988920e	update test case	2025-04-17 14:24:58 +08:00
github-actions[bot]	19fd2cfa37	@vloum has signed the CLA in opendatalab/MinerU#2267	2025-04-17 03:55:12 +00:00
Xiaomeng Zhao	74f9978e02	Merge pull request #2266 from opendatalab/master master->dev	2025-04-17 11:42:23 +08:00
myhloli	0c9572c871	Update version.py with new version	2025-04-17 03:34:11 +00:00
Xiaomeng Zhao	8fb6794b95	Merge pull request #2265 from opendatalab/release-1.3.5 Release 1.3.5	2025-04-17 11:31:24 +08:00
Xiaomeng Zhao	af53a46311	Merge pull request #2264 from myhloli/dev refactor(office_to_pdf): simplify font checking and add logging	2025-04-17 11:29:20 +08:00
myhloli	2e5e55cfe2	refactor(office_to_pdf): simplify font checking and add logging - Remove specific Chinese font list and detailed font checking - Add logging warning if no Chinese fonts are detected - Make font checking more robust and less platform-specific	2025-04-17 10:52:08 +08:00
myhloli	658e6bc768	refactor(utils): comment out Chinese font check on Windows - Temporarily disable Chinese font check for Windows systems - This change allows bypassing the font check when the required fonts are not present	2025-04-17 00:54:28 +08:00
myhloli	4641264e12	build(docker): update magic-pdf installation and add dependencies - Update magic-pdf installation to include specific version with full dependencies - Add numpy, decorator, attrs, absl-py, cloudpickle, ml-dtypes, tornado, and einops as separate packages - Specify numpy version to be less than 2	2025-04-17 00:16:20 +08:00
Xiaomeng Zhao	4bd3381c92	Merge pull request #2256 from myhloli/dev fix(test_table): update image path to use relative path	2025-04-16 18:24:37 +08:00
myhloli	f5a56bf157	fix(test_table): update image path to use relative path - Replace hardcoded image path with dynamic path generation - Use os.path.join to create platform-independent file paths - Improve code maintainability and portability across different environments	2025-04-16 18:23:13 +08:00
Xiaomeng Zhao	78d11172e3	Merge pull request #2255 from opendatalab/master master->dev	2025-04-16 18:12:06 +08:00
myhloli	a2b07bfde4	Update version.py with new version	2025-04-16 10:02:13 +00:00
Xiaomeng Zhao	1b35f04453	Merge pull request #2252 from opendatalab/release-1.3.4 Release 1.3.4	2025-04-16 18:00:29 +08:00
Xiaomeng Zhao	0222293f64	Merge pull request #2254 from opendatalab/dev Dev	2025-04-16 17:59:02 +08:00
Xiaomeng Zhao	16f176ea65	Merge pull request #2253 from myhloli/dev docs(README): update changelog for v1.3.4 release	2025-04-16 17:58:29 +08:00
myhloli	1705958f65	docs(README): update changelog for v1.3.4 release - Update README.md and README_zh-CN.md with the latest changes - Add new release notes for version 1.3.4 - Include improvements in OCR detection speed and page-level sorting	2025-04-16 17:57:17 +08:00
Xiaomeng Zhao	2de5a79f52	Merge pull request #2251 from myhloli/dev feat(pdf_parse): add footnote block handling in layout split	2025-04-16 17:49:52 +08:00
myhloli	058d318491	feat(pdf_parse): add footnote block handling in layout split - Modify `ocr_detect_all_bboxes.py` to return footnote blocks - Update `pdf_parse_union_core_v2.py` to handle footnote blocks in line sorting and layout splitting - This change improves the accuracy of layout analysis by considering footnote blocks separately	2025-04-16 17:48:30 +08:00
Xiaomeng Zhao	cfa90743b5	Merge pull request #2250 from myhloli/dev test(table): update unit test to use RapidTable model	2025-04-16 17:07:23 +08:00
myhloli	b36b469a1c	test(table): update unit test to use RapidTable model - Rename test file from test_tablemaster.py to test_rapidtable.py - Replace TableMasterPaddleModel with RapidTableModel - Update test case to use new model and adjust assertions accordingly - Remove some outdated assertions and comments	2025-04-16 16:54:27 +08:00
Xiaomeng Zhao	40bfd7acce	Merge pull request #2240 from myhloli/dev feat(model): add text region handling and improve overlap resolution	2025-04-15 19:31:05 +08:00
Xiaomeng Zhao	b7ff7ded64	Merge pull request #9 from myhloli/refactor-pipeline feat(model): add text region handling and improve overlap resolution	2025-04-15 19:30:06 +08:00
myhloli	07edefaa7d	feat(model): add text region handling and improve overlap resolution - Add text region handling in get_res_list_from_layout_res function - Implement remove_overlaps_min_blocks function to handle overlapping blocks - Update OCR region handling to include text regions - Improve overlap resolution for all regions in layout results	2025-04-15 19:28:29 +08:00
Xiaomeng Zhao	24b7e7ca36	Merge pull request #2226 from Doge2077/master fix:Chinese Character Garbling in PPTX/DOCX Conversion by Adding Font Check and Installation	2025-04-15 11:09:32 +08:00
Doge2077	87440ba43c	fix:remove duplicate code	2025-04-15 10:33:54 +08:00
Xiaomeng Zhao	ff35c75531	Merge pull request #2234 from myhloli/dev build(docker): add torch and torchvision installation	2025-04-15 10:26:25 +08:00
Xiaomeng Zhao	8f3c178003	build(docker): add torch and torchvision installation build(docker): add torch and torchvision installation	2025-04-15 09:57:25 +08:00
Xiaomeng Zhao	27883619f5	Merge pull request #2231 from myhloli/dev build(docker): add torch and torchvision installation	2025-04-15 09:56:47 +08:00
myhloli	5ddd6799aa	build(docker): add torch and torchvision installation - Add pip install command for torch and torchvision - Specify version2.3.1 for both packages - Use Aliyun mirror for faster download	2025-04-15 09:55:57 +08:00
Doge2077	039f8cbfde	feat:add advice on LibreOffice installing	2025-04-14 20:21:37 +08:00
Xiaomeng Zhao	73ccfbbfbe	Merge pull request #8 from myhloli/dev Dev	2025-04-14 19:19:21 +08:00
Xiaomeng Zhao	410d0afc81	Merge pull request #2227 from opendatalab/master master->dev	2025-04-14 19:03:22 +08:00
github-actions[bot]	c774a4dde1	@Doge2077 has signed the CLA in opendatalab/MinerU#2226	2025-04-14 10:41:06 +00:00
myhloli	29b47466ff	Update version.py with new version	2025-04-14 10:34:29 +00:00
Xiaomeng Zhao	a1df670e34	Merge pull request #2225 from opendatalab/release-1.3.3 Release 1.3.3	2025-04-14 18:33:07 +08:00
Xiaomeng Zhao	a67de492b1	Merge pull request #2224 from opendatalab/dev build(deps): downgrade dill to 0.3.8 for doclayout_yolo compatibility	2025-04-14 18:31:49 +08:00
Xiaomeng Zhao	222af4f2f5	Merge pull request #2223 from myhloli/dev build(deps): downgrade dill to 0.3.8 for doclayout_yolo compatibility	2025-04-14 18:31:04 +08:00
myhloli	b9eed5d865	build(deps): downgrade dill to 0.3.8 for doclayout_yolo compatibility - Change dill dependency from >=0.3.9,<1 to >=0.3.8,<1 - Update dill version in both general and specific requirements	2025-04-14 18:29:47 +08:00
Doge2077	82a4376d8a	bugfix:While converting file to pdf, Chinese font will be ignored.	2025-04-14 17:51:56 +08:00
Xiaomeng Zhao	99ab04f588	Merge pull request #2220 from myhloli/refactor-pipeline fix(magic_pdf): correct range for images in document analysis	2025-04-14 17:30:45 +08:00
myhloli	67b31a78d0	fix(magic_pdf): correct range for images in document analysis - Update the range used to generate images_with_extra_info to match the number of images - This fixes a potential IndexError when the number of images differs from the dataset length	2025-04-14 17:24:58 +08:00
Xiaomeng Zhao	4f129a64aa	Merge pull request #7 from myhloli/dev refactor(footnote_detection): adjust footnote detection threshold	2025-04-14 16:30:32 +08:00
github-actions[bot]	47d287a2a0	@zjx20 has signed the CLA in opendatalab/MinerU#2215	2025-04-14 07:25:39 +00:00
Xiaomeng Zhao	bc51f9f75e	Merge pull request #2214 from myhloli/dev refactor(footnote_detection): adjust footnote detection threshold	2025-04-14 15:23:31 +08:00
myhloli	8caf59f7cb	refactor(footnote_detection): adjust footnote detection threshold - Change footnote detection threshold from 50% of page height to 30% - Improve accuracy of footnote identification in PDF processing	2025-04-14 15:16:33 +08:00
Xiaomeng Zhao	4df8523a31	Merge pull request #2208 from opendatalab/master master->dev	2025-04-13 21:53:37 +08:00
Xiaomeng Zhao	c7a609fa7a	Merge pull request #2207 from opendatalab/release-1.3.2 build(docker): remove requirements.txt and update package installation	2025-04-13 21:52:44 +08:00
myhloli	5957cb65f9	Update version.py with new version	2025-04-12 11:04:26 +00:00
Xiaomeng Zhao	d0ed731b9e	Merge pull request #2199 from opendatalab/release-1.3.2 Release 1.3.2	2025-04-12 18:58:15 +08:00
Xiaomeng Zhao	b60166a541	Merge pull request #2157 from opendatalab/release-1.3.1 Release 1.3.1	2025-04-08 18:16:33 +08:00
Xiaomeng Zhao	ccf2ea04cb	Merge pull request #2156 from opendatalab/dev Dev	2025-04-08 18:16:07 +08:00
Xiaomeng Zhao	cb9c2e7616	Merge pull request #2154 from opendatalab/release-1.3.2 Release 1.3.2	2025-04-08 18:11:26 +08:00