Compare commits


17 Commits

Author SHA1 Message Date
Xiaomeng Zhao
dde90293f1 Merge pull request #2355 from opendatalab/master
update version
2025-04-23 18:50:01 +08:00
myhloli
a24b9ed8fd Merge remote-tracking branch 'origin/master' 2025-04-23 18:48:46 +08:00
myhloli
e0dc6c8473 docs(README): update changelog for version 1.3.8 release 2025-04-23 18:48:32 +08:00
myhloli
801d3ade19 Update version.py with new version 2025-04-23 10:41:07 +00:00
Xiaomeng Zhao
6b7a861e8f Merge pull request #2354 from opendatalab/release-1.3.8
Release 1.3.8
2025-04-23 18:38:42 +08:00
Xiaomeng Zhao
9fbaee9e89 Merge pull request #2353 from myhloli/dev
test(table): update test_rapidtable.py to handle SegLink text variations
2025-04-23 18:27:20 +08:00
myhloli
61fa95d4e0 test(table): update test_rapidtable.py to handle SegLink text variations
- Modify assertion for first cell text to check for 'SegLink' instead of exact match
- This change accommodates variations in SegLink text format
2025-04-23 18:26:19 +08:00
Xiaomeng Zhao
5c232f0587 Merge pull request #2352 from myhloli/dev
feat(ocr): add new Chinese OCR model and update language support
2025-04-23 18:15:25 +08:00
myhloli
45f5082613 refactor(ocr): update device parameter handling in paddleocr2pytorch
- Replace get_device() function call with direct 'device' variable usage
- Simplify device configuration in OCR model initialization
2025-04-23 18:13:58 +08:00
myhloli
4f88fcaa51 feat(ocr): add new Chinese OCR model and update language support
- Add new Chinese OCR model (ch_PP-OCRv4_rec_server_doc_infer) for server-side use
- Update language support in app.py to include new Chinese model
- Modify models_config.yml to add new model configuration
2025-04-23 18:06:12 +08:00
Xiaomeng Zhao
3cf1ea1f5b Merge pull request #2316 from opendatalab/master
master->dev
2025-04-22 19:28:21 +08:00
myhloli
d874563e38 Update version.py with new version 2025-04-22 11:27:25 +00:00
Xiaomeng Zhao
55fcb7387f Merge pull request #2315 from opendatalab/release-1.3.7
Release 1.3.7
2025-04-22 19:26:03 +08:00
myhloli
4d5fd0ee55 Update version.py with new version 2025-04-21 06:45:36 +00:00
Xiaomeng Zhao
601b44bfe0 Merge pull request #2298 from opendatalab/release-1.3.6
Release 1.3.6
2025-04-21 14:37:23 +08:00
Xiaomeng Zhao
6fbbe3e6f0 Merge pull request #2274 from opendatalab/dev
docs: update issue templates and disable blank issues
2025-04-17 18:46:05 +08:00
github-actions[bot]
19fd2cfa37 @vloum has signed the CLA in opendatalab/MinerU#2267 2025-04-17 03:55:12 +00:00
11 changed files with 15685 additions and 6 deletions

View File

@@ -48,6 +48,12 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
</div>
# Changelog
- 2025/04/23 1.3.8 Released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` is trained on a mix of additional Chinese document data and the PP-OCR training data, enhancing recognition of some traditional Chinese characters, Japanese, and special characters. It supports over 15,000 recognizable characters, improving document text recognition while also boosting general-purpose text recognition.
- [Performance comparison between PP-OCRv4_server_rec_doc, PP-OCRv4_server_rec, and PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/en/module_usage/tutorials/ocr_modules/text_recognition.html#ii-supported-model-list)
- Verified results show that the `PP-OCRv4_server_rec_doc` model significantly improves accuracy in both single-language (`Chinese`, `English`, `Japanese`, `Traditional Chinese`) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for most use cases.
- In a small number of pure-English scenarios, the `PP-OCRv4_server_rec_doc` model may concatenate adjacent words, whereas `PP-OCRv4_server_rec` performs better in such cases. We have therefore retained the `PP-OCRv4_server_rec` model, which users can invoke by passing `lang='ch_server'` (Python API) or `--lang ch_server` (CLI).
- 2025/04/22 1.3.7 Released
- Fixed the issue where the `lang` parameter was ineffective during table parsing model initialization.
- Fixed the significant slowdown in OCR and table parsing speed in `cpu` mode.
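The model-selection guidance in the 1.3.8 notes can be condensed into a small helper. This is a hypothetical sketch, not part of MinerU's API; only the `ch`/`ch_server` lang codes themselves come from the release notes:

```python
def choose_ocr_lang(pure_english_document: bool) -> str:
    """Hypothetical helper reflecting the 1.3.8 guidance above."""
    if pure_english_document:
        # The retained PP-OCRv4_server_rec model avoids the word-
        # concatenation issue the new doc model can show on pure English.
        return 'ch_server'
    # Default: PP-OCRv4_server_rec_doc, stronger on documents and on
    # mixed Chinese/English/Japanese/Traditional Chinese content.
    return 'ch'
```

On the command line the same choice corresponds to passing `--lang ch_server` instead of relying on the default `ch`.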

View File

@@ -47,6 +47,12 @@
</div>
# Changelog
- 2025/04/23 1.3.8 released
- The default `ocr` model (`ch`) has been updated to `PP-OCRv4_server_rec_doc` (model update required)
- `PP-OCRv4_server_rec_doc` is trained on top of `PP-OCRv4_server_rec` using a mix of more Chinese document data and the PP-OCR training data. It adds recognition of some traditional Chinese characters, Japanese, and special characters, supports 15,000+ recognizable characters, and improves both document-related and general text recognition.
- [Performance comparison of PP-OCRv4_server_rec_doc / PP-OCRv4_server_rec / PP-OCRv4_mobile_rec](https://paddlepaddle.github.io/PaddleX/latest/module_usage/tutorials/ocr_modules/text_recognition.html#_3)
- Verified results show that `PP-OCRv4_server_rec_doc` delivers a clear accuracy improvement in both single-language (Chinese, English, Japanese, Traditional Chinese) and mixed-language scenarios, with speed comparable to `PP-OCRv4_server_rec`, making it suitable for the vast majority of use cases.
- In a small number of pure-English scenarios, `PP-OCRv4_server_rec_doc` may concatenate adjacent words, whereas `PP-OCRv4_server_rec` performs better there; we have therefore retained the `PP-OCRv4_server_rec` model, which can be invoked with `lang='ch_server'` (Python API) or `--lang ch_server` (CLI).
- 2025/04/22 1.3.7 released
- Fixed the `lang` parameter being ineffective during table-parsing model initialization
- Fixed the significant slowdown of OCR and table parsing in `cpu` mode

View File

@@ -1 +1 @@
-__version__ = "1.3.5"
+__version__ = "1.3.8"

View File

@@ -55,7 +55,8 @@ class PytorchPaddleOCR(TextSystem):
         self.lang = kwargs.get('lang', 'ch')
         device = get_device()
-        if device == 'cpu' and self.lang == 'ch':
+        if device == 'cpu' and self.lang in ['ch', 'ch_server']:
             logger.warning("The current device in use is CPU. To ensure the speed of parsing, the language is automatically switched to ch_lite.")
             self.lang = 'ch_lite'
         if self.lang in latin_lang:
@@ -79,7 +80,7 @@ class PytorchPaddleOCR(TextSystem):
         kwargs['rec_char_dict_path'] = os.path.join(root_dir, 'pytorchocr', 'utils', 'resources', 'dict', dict_file)
         # kwargs['rec_batch_num'] = 8
-        kwargs['device'] = get_device()
+        kwargs['device'] = device
         default_args = vars(args)
         default_args.update(kwargs)
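The two hunks above make one behavioral change: `get_device()` is called once and its result reused, and on CPU the new `ch_server` code falls back to the lighter model just as `ch` does. A standalone sketch of that logic (simplified; the real `PytorchPaddleOCR` also loads models and dictionaries):

```python
def resolve_ocr_settings(device: str, lang: str) -> tuple:
    """Sketch of the patched fallback: on CPU, both 'ch' and
    'ch_server' are switched to the lighter 'ch_lite' model so
    parsing speed stays acceptable."""
    if device == 'cpu' and lang in ['ch', 'ch_server']:
        lang = 'ch_lite'
    # The resolved device is reused (kwargs['device'] = device),
    # rather than calling get_device() a second time.
    return device, lang
```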

View File

@@ -171,6 +171,31 @@ ch_PP-OCRv4_rec_server_infer:
          nrtr_dim: 384
          max_text_length: 25
+ch_PP-OCRv4_rec_server_doc_infer:
+  model_type: rec
+  algorithm: SVTR_HGNet
+  Transform:
+  Backbone:
+    name: PPHGNet_small
+  Head:
+    name: MultiHead
+    out_channels_list:
+      CTCLabelDecode: 15631
+    head_list:
+      - CTCHead:
+          Neck:
+            name: svtr
+            dims: 120
+            depth: 2
+            hidden_dims: 120
+            kernel_size: [ 1, 3 ]
+            use_guide: True
+          Head:
+            fc_decay: 0.00001
+      - NRTRHead:
+          nrtr_dim: 384
+          max_text_length: 25
 chinese_cht_PP-OCRv3_rec_infer:
   model_type: rec
   algorithm: SVTR

View File

@@ -3,10 +3,14 @@ lang:
     det: ch_PP-OCRv3_det_infer.pth
     rec: ch_PP-OCRv4_rec_infer.pth
     dict: ppocr_keys_v1.txt
-  ch:
+  ch_server:
     det: ch_PP-OCRv3_det_infer.pth
     rec: ch_PP-OCRv4_rec_server_infer.pth
     dict: ppocr_keys_v1.txt
+  ch:
+    det: ch_PP-OCRv3_det_infer.pth
+    rec: ch_PP-OCRv4_rec_server_doc_infer.pth
+    dict: ppocrv4_doc_dict.txt
   en:
     det: en_PP-OCRv3_det_infer.pth
     rec: en_PP-OCRv4_rec_infer.pth
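The hunk above renames the old server mapping to `ch_server` and repoints the default `ch` key at the new doc model. A sketch of the resulting lookup, with the mapping reproduced as a plain dict (the real code loads it from this YAML file):

```python
# Mirrors the lang section of the config shown above (illustrative only).
LANG_MODELS = {
    'ch_server': {
        'det': 'ch_PP-OCRv3_det_infer.pth',
        'rec': 'ch_PP-OCRv4_rec_server_infer.pth',
        'dict': 'ppocr_keys_v1.txt',
    },
    'ch': {
        'det': 'ch_PP-OCRv3_det_infer.pth',
        'rec': 'ch_PP-OCRv4_rec_server_doc_infer.pth',
        'dict': 'ppocrv4_doc_dict.txt',
    },
}

def rec_model_for(lang: str) -> str:
    # Hypothetical helper: resolve the recognition weights for a lang code.
    return LANG_MODELS[lang]['rec']
```

Note that the new `ch` entry also switches to a new character dictionary (`ppocrv4_doc_dict.txt`), matching the doc model's enlarged character set.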

View File

@@ -158,7 +158,7 @@ devanagari_lang = [
'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom', # noqa: E126
'sa', 'bgc'
]
-other_lang = ['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
+other_lang = ['ch', 'ch_lite', 'ch_server', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka']
add_lang = ['latin', 'arabic', 'cyrillic', 'devanagari']
# all_lang = ['', 'auto']
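These lists route a lang code to its dictionary group; the hunk simply registers the new `ch_lite`/`ch_server` codes in `other_lang`. A minimal sketch of that routing, reproducing only the lists visible in this hunk (the `lang_group` helper is hypothetical):

```python
devanagari_lang = [
    'hi', 'mr', 'ne', 'bh', 'mai', 'ang', 'bho', 'mah', 'sck', 'new', 'gom',
    'sa', 'bgc'
]
other_lang = ['ch', 'ch_lite', 'ch_server', 'en', 'korean', 'japan',
              'chinese_cht', 'ta', 'te', 'ka']

def lang_group(lang: str) -> str:
    # Route a lang code to its group; codes missing from every list
    # would previously fail lookup, which is why ch_server is added here.
    if lang in devanagari_lang:
        return 'devanagari'
    if lang in other_lang:
        return 'other'
    return 'unknown'
```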

View File

@@ -239,6 +239,14 @@
"created_at": "2025-04-14T10:40:54Z",
"repoId": 765083837,
"pullRequestNo": 2226
},
+    {
+      "name": "vloum",
+      "id": 75369577,
+      "comment_id": 2811669681,
+      "created_at": "2025-04-17T03:54:59Z",
+      "repoId": 765083837,
+      "pullRequestNo": 2267
+    }
]
}

View File

@@ -41,7 +41,7 @@ class TestppTableModel(unittest.TestCase):
# 检查第一行数据
first_row = tree.xpath('//table/tr[2]/td')
assert len(first_row) == 5, "First row should have 5 cells"
-assert first_row[0].text and first_row[0].text.strip() == "SegLink[26]", "First cell should be 'SegLink[26]'"
+assert first_row[0].text and 'SegLink' in first_row[0].text.strip(), "First cell should contain 'SegLink'"
assert first_row[1].text and first_row[1].text.strip() == "70.0", "Second cell should be '70.0'"
assert first_row[2].text and first_row[2].text.strip() == "86.0", "Third cell should be '86.0'"
assert first_row[3].text and first_row[3].text.strip() == "77.0", "Fourth cell should be '77.0'"
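The relaxed first-cell assertion checks for a substring rather than an exact match, so spacing variants like `SegLink [26]` vs `SegLink[26]` no longer fail the test. A self-contained illustration of the pattern, using stdlib `xml.etree.ElementTree` in place of the test's lxml parser:

```python
import xml.etree.ElementTree as ET

# Minimal table resembling the one the test parses (illustrative only).
html = ("<table><tr><td>Method</td><td>P</td></tr>"
        "<tr><td>SegLink [26]</td><td>70.0</td></tr></table>")
tree = ET.fromstring(html)
first_row = tree.findall('./tr')[1].findall('td')

# An exact comparison against "SegLink[26]" would fail on the spaced
# variant; the substring check tolerates both renderings.
assert first_row[0].text and 'SegLink' in first_row[0].text.strip()
# Numeric cells keep the strict equality check.
assert first_row[1].text and first_row[1].text.strip() == '70.0'
```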