Mirror of https://github.com/opendatalab/MinerU.git (synced 2026-03-27 11:08:32 +07:00)
Compare commits: release-1. ... magic_pdf- (22 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | a989444e2f | | |
| | e3a4295527 | | |
| | 73f0530d16 | | |
| | e92b5b698e | | |
| | 1e01ffcf78 | | |
| | 04b81dc1ab | | |
| | 90585b67a9 | | |
| | 4949dd0c18 | | |
| | a2b848136b | | |
| | 04a712f940 | | |
| | 27cad566fa | | |
| | ea3003f6ef | | |
| | 93ad41edce | | |
| | 8f8b8c4c1f | | |
| | 048f6af406 | | |
| | b122b86e8a | | |
| | 002333a8d7 | | |
| | e3f22e84ab | | |
| | 40851b1c61 | | |
| | ea619281ef | | |
| | 0b8c614280 | | |
| | 50700646e4 | | |
14
README.md
@@ -48,6 +48,20 @@ Easier to use: Just grab MinerU Desktop. No coding, no login, just a simple inte
 </div>

 # Changelog
+- 2025/05/24 1.3.12 Released
+  - Added support for ppocrv5 model, updated `ch_server` model to `PP-OCRv5_rec_server` and `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
+    - In testing, we found that ppocrv5(server) shows some improvement for handwritten documents, but slightly lower accuracy than v4_server_doc for other document types. Therefore, the default ch model remains unchanged as `PP-OCRv4_server_rec_doc`.
+    - Since ppocrv5 enhances recognition capabilities for handwritten text and special characters, you can manually select ppocrv5 models for Japanese, traditional Chinese mixed scenarios and handwritten document scenarios
+    - You can select the appropriate model through the lang parameter `lang='ch_server'` (python api) or `--lang ch_server` (command line):
+      - `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
+      - `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
+      - `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
+  - Added support for handwritten documents by optimizing layout recognition of handwritten text areas
+    - This feature is supported by default, no additional configuration needed
+    - You can refer to the instructions above to manually select ppocrv5 model for better handwritten document parsing
+  - The demos on `huggingface` and `modelscope` have been updated to support handwriting recognition and ppocrv5 models, which you can experience online
 - 2025/04/29 1.3.10 Released
   - Support for custom formula delimiters can be achieved by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
 - 2025/04/27 1.3.9 Released
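The `lang` values listed in the changelog hunk above can be read as a lookup table. The sketch below is only an illustration of that mapping: `REC_MODEL_BY_LANG` and `resolve_rec_model` are hypothetical names, not part of MinerU's API, and the model names are copied from the changelog list as written.

```python
# Hypothetical lookup table; keys and values are copied from the changelog list.
REC_MODEL_BY_LANG = {
    "ch": "PP-OCRv4_rec_server_doc",      # default
    "ch_server": "PP-OCRv5_rec_server",   # adds handwriting, 18k dictionary
    "ch_lite": "PP-OCRv5_rec_mobile",     # adds handwriting, 18k dictionary
    "ch_server_v4": "PP-OCRv4_rec_server",
    "ch_lite_v4": "PP-OCRv4_rec_mobile",
}


def resolve_rec_model(lang: str = "ch") -> str:
    """Return the recognition model name for a `lang` value (KeyError if unknown)."""
    return REC_MODEL_BY_LANG[lang]
```

With this table, `resolve_rec_model("ch_server")` picks the ppocrv5 server model, while calling it with no argument keeps the v4 default described in the changelog.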
@@ -47,6 +47,20 @@
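 </div>

 # Changelog
+- 2025/05/24 1.3.12 Released
+  - Added support for the ppocrv5 model; the `ch_server` model is updated to `PP-OCRv5_rec_server` and the `ch_lite` model to `PP-OCRv5_rec_mobile` (model update required)
+    - In testing, ppocrv5(server) showed some improvement on handwritten documents, but slightly lower accuracy than v4_server_doc on other document types, so the default ch model remains unchanged as `PP-OCRv4_server_rec_doc`.
+    - Because ppocrv5 strengthens recognition of handwriting and special characters, you can manually select the ppocrv5 models for mixed Japanese and Traditional Chinese scenarios and for handwritten documents
+    - You can select the corresponding model through the lang parameter `lang='ch_server'` (python api) or `--lang ch_server` (command line):
+      - `ch`: `PP-OCRv4_rec_server_doc` (default) (Chinese, English, Japanese, Traditional Chinese mixed/15k dictionary)
+      - `ch_server`: `PP-OCRv5_rec_server` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_lite`: `PP-OCRv5_rec_mobile` (Chinese, English, Japanese, Traditional Chinese mixed + handwriting/18k dictionary)
+      - `ch_server_v4`: `PP-OCRv4_rec_server` (Chinese, English mixed/6k dictionary)
+      - `ch_lite_v4`: `PP-OCRv4_rec_mobile` (Chinese, English mixed/6k dictionary)
+  - Added support for handwritten documents: by optimizing layout recognition of handwritten text areas, handwritten documents can now be parsed
+    - This feature is supported by default, no additional configuration needed
+    - You can refer to the instructions above and manually select a ppocrv5 model for better handwritten document parsing
+  - The demos on `huggingface` and `modelscope` have been updated to versions that support handwriting recognition and the ppocrv5 models, which you can try online
 - 2025/04/29 1.3.10 Released
   - Custom formula delimiters are supported by modifying the `latex-delimiter-config` item in the `magic-pdf.json` file under the user directory.
 - 2025/04/27 1.3.9 Released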
@@ -10,22 +10,22 @@ from loguru import logger



-def fitz_doc_to_image(doc, dpi=200) -> dict:
+def fitz_doc_to_image(page, dpi=200) -> dict:
     """Convert fitz.Document to image, Then convert the image to numpy array.

     Args:
-        doc (_type_): pymudoc page
+        page (_type_): pymudoc page
         dpi (int, optional): reset the dpi of dpi. Defaults to 200.

     Returns:
         dict: {'img': numpy array, 'width': width, 'height': height }
     """
     mat = fitz.Matrix(dpi / 72, dpi / 72)
-    pm = doc.get_pixmap(matrix=mat, alpha=False)
+    pm = page.get_pixmap(matrix=mat, alpha=False)

     # If the width or height exceeds 4500 after scaling, do not scale further.
     if pm.width > 4500 or pm.height > 4500:
-        pm = doc.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
+        pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

     # Convert pixmap samples directly to numpy array
     img = np.frombuffer(pm.samples, dtype=np.uint8).reshape(pm.height, pm.width, 3)
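The scaling rule in `fitz_doc_to_image` (render at `dpi / 72`, fall back to a 1:1 matrix when either side exceeds 4500 pixels) can be illustrated without PyMuPDF. This is a minimal sketch: `render_size` is a hypothetical helper, and it truncates to integers, which may differ by a pixel from the size PyMuPDF actually reports.

```python
def render_size(width_pt: float, height_pt: float, dpi: int = 200, max_side: int = 4500):
    """Approximate the pixel size chosen by the DPI rule in the hunk above.

    Returns (width_px, height_px, scale). Hypothetical helper, not MinerU code.
    """
    scale = dpi / 72  # PDF page units are 1/72 inch
    width_px, height_px = int(width_pt * scale), int(height_pt * scale)
    if width_px > max_side or height_px > max_side:
        # Too large after scaling: fall back to the unscaled page size.
        scale = 1.0
        width_px, height_px = int(width_pt), int(height_pt)
    return width_px, height_px, scale
```

For an A4 page (595 x 842 points) the default 200 dpi is kept, while a 2000 x 1000 point page would exceed the limit after scaling and is rendered 1:1 instead.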
@@ -70,19 +70,34 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
             if mode == 'nlp':
                 continue
             elif mode == 'mm':
-                for block in para_block['blocks']:  # 1st: append image_body
-                    if block['type'] == BlockType.ImageBody:
-                        for line in block['lines']:
-                            for span in line['spans']:
-                                if span['type'] == ContentType.Image:
-                                    if span.get('image_path', ''):
-                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
-                for block in para_block['blocks']:  # 2nd: append image_caption
-                    if block['type'] == BlockType.ImageCaption:
-                        para_text += merge_para_with_text(block) + ' \n'
-                for block in para_block['blocks']:  # 3rd: append image_footnote
-                    if block['type'] == BlockType.ImageFootnote:
-                        para_text += merge_para_with_text(block) + ' \n'
+                # Check whether an image footnote exists
+                has_image_footnote = any(block['type'] == BlockType.ImageFootnote for block in para_block['blocks'])
+                # If an image footnote exists, append it after the image body
+                if has_image_footnote:
+                    for block in para_block['blocks']:  # 1st: append image_caption
+                        if block['type'] == BlockType.ImageCaption:
+                            para_text += merge_para_with_text(block) + ' \n'
+                    for block in para_block['blocks']:  # 2nd: append image_body
+                        if block['type'] == BlockType.ImageBody:
+                            for line in block['lines']:
+                                for span in line['spans']:
+                                    if span['type'] == ContentType.Image:
+                                        if span.get('image_path', ''):
+                                            para_text += f"![]({img_buket_path}/{span['image_path']})"
+                    for block in para_block['blocks']:  # 3rd: append image_footnote
+                        if block['type'] == BlockType.ImageFootnote:
+                            para_text += ' \n' + merge_para_with_text(block)
+                else:
+                    for block in para_block['blocks']:  # 1st: append image_body
+                        if block['type'] == BlockType.ImageBody:
+                            for line in block['lines']:
+                                for span in line['spans']:
+                                    if span['type'] == ContentType.Image:
+                                        if span.get('image_path', ''):
+                                            para_text += f"![]({img_buket_path}/{span['image_path']})"
+                    for block in para_block['blocks']:  # 2nd: append image_caption
+                        if block['type'] == BlockType.ImageCaption:
+                            para_text += ' \n' + merge_para_with_text(block)
         elif para_type == BlockType.Table:
             if mode == 'nlp':
                 continue
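The reordering in the hunk above (caption, image, footnote when a footnote exists; image, then caption otherwise) can be sketched on simplified data. Everything here is an assumption made for illustration: `assemble_image_markdown`, the flat block dictionaries, the `![](dir/path)` link format, and the two-space Markdown line break are not taken from MinerU.

```python
BREAK = "  \n"  # assumed Markdown hard line break (two spaces + newline)


def assemble_image_markdown(blocks, img_dir):
    """Join caption, body and footnote blocks in the order used by the new code.

    `blocks` is a simplified, hypothetical structure: dicts with a 'type' of
    'caption', 'body' or 'footnote', plus 'text' or 'image_path'.
    """
    captions = [b["text"] for b in blocks if b["type"] == "caption"]
    bodies = [f"![]({img_dir}/{b['image_path']})"
              for b in blocks if b["type"] == "body" and b.get("image_path")]
    footnotes = [b["text"] for b in blocks if b["type"] == "footnote"]

    if footnotes:
        # Footnote present: caption first, then the image, then the footnote.
        return ("".join(c + BREAK for c in captions)
                + "".join(bodies)
                + "".join(BREAK + f for f in footnotes))
    # No footnote: image first, caption underneath.
    return "".join(bodies) + "".join(BREAK + c for c in captions)
```

The point of the sketch is the ordering only; the real function walks nested `lines` and `spans` and delegates text merging to `merge_para_with_text`.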
@@ -96,20 +111,19 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
                             for span in line['spans']:
                                 if span['type'] == ContentType.Table:
                                     # if processed by table model
-                                    if span.get('latex', ''):
-                                        para_text += f"\n\n$\n {span['latex']}\n$\n\n"
-                                    elif span.get('html', ''):
-                                        para_text += f"\n\n{span['html']}\n\n"
+                                    if span.get('html', ''):
+                                        para_text += f"\n{span['html']}\n"
                                     elif span.get('image_path', ''):
-                                        para_text += f"\n![]({join_path(img_buket_path, span['image_path'])}) \n"
+                                        para_text += f"![]({img_buket_path}/{span['image_path']})"
                 for block in para_block['blocks']:  # 3rd: append table_footnote
                     if block['type'] == BlockType.TableFootnote:
-                        para_text += merge_para_with_text(block) + ' \n'
+                        para_text += '\n' + merge_para_with_text(block) + ' '

         if para_text.strip() == '':
             continue
         else:
-            page_markdown.append(para_text.strip() + ' ')
+            # page_markdown.append(para_text.strip() + ' ')
+            page_markdown.append(para_text.strip())

     return page_markdown

@@ -257,9 +271,9 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx, drop_reason
                         if span['type'] == ContentType.Table:

                             if span.get('latex', ''):
-                                para_content['table_body'] = f"\n\n$\n {span['latex']}\n$\n\n"
+                                para_content['table_body'] = f"{span['latex']}"
                             elif span.get('html', ''):
-                                para_content['table_body'] = f"\n\n{span['html']}\n\n"
+                                para_content['table_body'] = f"{span['html']}"

                             if span.get('image_path', ''):
                                 para_content['img_path'] = join_path(img_buket_path, span['image_path'])
@@ -1 +1 @@
-__version__ = "1.3.10"
+__version__ = "1.3.11"
@@ -6,7 +6,7 @@ from tqdm import tqdm
 from magic_pdf.config.constants import MODEL_NAME
 from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
 from magic_pdf.model.sub_modules.model_utils import (
-    clean_vram, crop_img, get_res_list_from_layout_res)
+    clean_vram, crop_img, get_res_list_from_layout_res, get_coords_and_area)
 from magic_pdf.model.sub_modules.ocr.paddleocr2pytorch.ocr_utils import (
     get_adjusted_mfdetrec_res, get_ocr_result_list)

@@ -148,6 +148,19 @@ class BatchAnalyze:
                     # Integration results
                     if ocr_res:
                         ocr_result_list = get_ocr_result_list(ocr_res, useful_list, ocr_res_list_dict['ocr_enable'], new_image, _lang)
+
+                        if res["category_id"] == 3:
+                            # Sum of the areas of all bboxes in ocr_result_list
+                            ocr_res_area = sum(get_coords_and_area(ocr_res_item)[4] for ocr_res_item in ocr_result_list if 'poly' in ocr_res_item)
+                            # Ratio of ocr_res_area to the area of res
+                            res_area = get_coords_and_area(res)[4]
+                            if res_area > 0:
+                                ratio = ocr_res_area / res_area
+                                if ratio > 0.25:
+                                    res["category_id"] = 1
+                                else:
+                                    continue
+
                         ocr_res_list_dict['layout_res'].extend(ocr_result_list)

                 # det_count += len(ocr_res_list_dict['ocr_res_list'])
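The added check compares the summed area of OCR boxes against the area of the enclosing region and relabels the region when text covers more than a quarter of it. The sketch below reproduces that arithmetic on plain dictionaries. `reclassify_by_text_coverage` is a hypothetical helper; the category ids 3 and 1 are taken from the hunk as-is, and reading them as "image" and "text" regions is an assumption.

```python
def get_coords_and_area(block_with_poly):
    """Same arithmetic as the helper in the diff: corners 0/1 and 4/5 of `poly`."""
    poly = block_with_poly["poly"]
    xmin, ymin = int(poly[0]), int(poly[1])
    xmax, ymax = int(poly[4]), int(poly[5])
    return xmin, ymin, xmax, ymax, (xmax - xmin) * (ymax - ymin)


def reclassify_by_text_coverage(res, ocr_items, threshold=0.25):
    """Relabel `res` to category 1 if OCR boxes cover more than `threshold` of it.

    Returns True when the label was changed. Hypothetical sketch only.
    """
    if res["category_id"] != 3:
        return False
    ocr_area = sum(get_coords_and_area(item)[4] for item in ocr_items if "poly" in item)
    res_area = get_coords_and_area(res)[4]
    if res_area > 0 and ocr_area / res_area > threshold:
        res["category_id"] = 1
        return True
    return False
```

In the real loop the low-coverage case hits `continue`, so the OCR results for that region are not appended to `layout_res`; the sketch only models the relabelling decision.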
@@ -189,7 +189,7 @@ def batch_doc_analyze(
     formula_enable=None,
     table_enable=None,
 ):
-    MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 200))
+    MIN_BATCH_INFERENCE_SIZE = int(os.environ.get('MINERU_MIN_BATCH_INFERENCE_SIZE', 100))
     batch_size = MIN_BATCH_INFERENCE_SIZE
     page_wh_list = []

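The default for `MINERU_MIN_BATCH_INFERENCE_SIZE` drops from 200 to 100 in this hunk, and the environment variable still overrides it. A minimal sketch of that lookup, with `min_batch_inference_size` as a hypothetical wrapper name:

```python
import os


def min_batch_inference_size(default: int = 100) -> int:
    """Read the batch size from the environment, falling back to `default`."""
    return int(os.environ.get("MINERU_MIN_BATCH_INFERENCE_SIZE", default))
```

Setting `MINERU_MIN_BATCH_INFERENCE_SIZE=200` in the environment before starting the process would restore the previous value under this reading.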
@@ -31,10 +31,10 @@ def crop_img(input_res, input_np_img, crop_paste_x=0, crop_paste_y=0):
     return return_image, return_list


-def get_coords_and_area(table):
+def get_coords_and_area(block_with_poly):
     """Extract coordinates and area from a table."""
-    xmin, ymin = int(table['poly'][0]), int(table['poly'][1])
-    xmax, ymax = int(table['poly'][4]), int(table['poly'][5])
+    xmin, ymin = int(block_with_poly['poly'][0]), int(block_with_poly['poly'][1])
+    xmax, ymax = int(block_with_poly['poly'][4]), int(block_with_poly['poly'][5])
     area = (xmax - xmin) * (ymax - ymin)
     return xmin, ymin, xmax, ymax, area

@@ -243,7 +243,7 @@ def get_res_list_from_layout_res(layout_res, iou_threshold=0.7, overlap_threshol
                     "bbox": [int(res['poly'][0]), int(res['poly'][1]),
                              int(res['poly'][4]), int(res['poly'][5])],
                 })
-            elif category_id in [0, 2, 4, 6, 7]:  # OCR regions
+            elif category_id in [0, 2, 4, 6, 7, 3]:  # OCR regions
                 ocr_res_list.append(res)
             elif category_id == 5:  # Table regions
                 table_res_list.append(res)
@@ -35,7 +35,7 @@ def build_backbone(config, model_type):
         from .rec_mobilenet_v3 import MobileNetV3
         from .rec_svtrnet import SVTRNet
         from .rec_mv1_enhance import MobileNetV1Enhance
-
+        from .rec_pphgnetv2 import PPHGNetV2_B4
         support_dict = [
             "MobileNetV1Enhance",
             "MobileNetV3",
@@ -48,6 +48,7 @@ def build_backbone(config, model_type):
             "DenseNet",
             "PPLCNetV3",
             "PPHGNet_small",
+            "PPHGNetV2_B4",
         ]
     else:
         raise NotImplementedError
@@ -0,0 +1,810 @@
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+class AdaptiveAvgPool2D(nn.AdaptiveAvgPool2d):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        if isinstance(self.output_size, int) and self.output_size == 1:
+            self._gap = True
+        elif (
+            isinstance(self.output_size, tuple)
+            and self.output_size[0] == 1
+            and self.output_size[1] == 1
+        ):
+            self._gap = True
+        else:
+            self._gap = False
+
+    def forward(self, x):
+        if self._gap:
+            # Global Average Pooling
+            N, C, _, _ = x.shape
+            x_mean = torch.mean(x, dim=[2, 3])
+            x_mean = torch.reshape(x_mean, [N, C, 1, 1])
+            return x_mean
+        else:
+            return F.adaptive_avg_pool2d(
+                x,
+                output_size=self.output_size
+            )
+
+class LearnableAffineBlock(nn.Module):
+    """
+    Create a learnable affine block module. This module can significantly improve accuracy on smaller models.
+
+    Args:
+        scale_value (float): The initial value of the scale parameter, default is 1.0.
+        bias_value (float): The initial value of the bias parameter, default is 0.0.
+        lr_mult (float): The learning rate multiplier, default is 1.0.
+        lab_lr (float): The learning rate, default is 0.01.
+    """
+
+    def __init__(self, scale_value=1.0, bias_value=0.0, lr_mult=1.0, lab_lr=0.01):
+        super().__init__()
+        self.scale = nn.Parameter(torch.Tensor([scale_value]))
+        self.bias = nn.Parameter(torch.Tensor([bias_value]))
+
+    def forward(self, x):
+        return self.scale * x + self.bias
+
+
+class ConvBNAct(nn.Module):
+    """
+    ConvBNAct is a combination of convolution and batchnorm layers.
+
+    Args:
+        in_channels (int): Number of input channels.
+        out_channels (int): Number of output channels.
+        kernel_size (int): Size of the convolution kernel. Defaults to 3.
+        stride (int): Stride of the convolution. Defaults to 1.
+        padding (int/str): Padding or padding type for the convolution. Defaults to 1.
+        groups (int): Number of groups for the convolution. Defaults to 1.
+        use_act: (bool): Whether to use activation function. Defaults to True.
+        use_lab (bool): Whether to use the LAB operation. Defaults to False.
+        lr_mult (float): Learning rate multiplier for the layer. Defaults to 1.0.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        kernel_size=3,
+        stride=1,
+        padding=1,
+        groups=1,
+        use_act=True,
+        use_lab=False,
+        lr_mult=1.0,
+    ):
+        super().__init__()
+        self.use_act = use_act
+        self.use_lab = use_lab
+
+        self.conv = nn.Conv2d(
+            in_channels,
+            out_channels,
+            kernel_size,
+            stride,
+            padding=padding if isinstance(padding, str) else (kernel_size - 1) // 2,
+            # padding=(kernel_size - 1) // 2,
+            groups=groups,
+            bias=False,
+        )
+        self.bn = nn.BatchNorm2d(
+            out_channels,
+        )
+        if self.use_act:
+            self.act = nn.ReLU()
+            if self.use_lab:
+                self.lab = LearnableAffineBlock(lr_mult=lr_mult)
+
+    def forward(self, x):
+        x = self.conv(x)
+        x = self.bn(x)
+        if self.use_act:
+            x = self.act(x)
+            if self.use_lab:
+                x = self.lab(x)
+        return x
+
+
+class LightConvBNAct(nn.Module):
+    """
+    LightConvBNAct is a combination of pw and dw layers.
+
+    Args:
+        in_channels (int): Number of input channels.
+        out_channels (int): Number of output channels.
+        kernel_size (int): Size of the depth-wise convolution kernel.
+        use_lab (bool): Whether to use the LAB operation. Defaults to False.
+        lr_mult (float): Learning rate multiplier for the layer. Defaults to 1.0.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        out_channels,
+        kernel_size,
+        use_lab=False,
+        lr_mult=1.0,
+        **kwargs,
+    ):
+        super().__init__()
+        self.conv1 = ConvBNAct(
+            in_channels=in_channels,
+            out_channels=out_channels,
+            kernel_size=1,
+            use_act=False,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.conv2 = ConvBNAct(
+            in_channels=out_channels,
+            out_channels=out_channels,
+            kernel_size=kernel_size,
+            groups=out_channels,
+            use_act=True,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+
+    def forward(self, x):
+        x = self.conv1(x)
+        x = self.conv2(x)
+        return x
+
+
+class CustomMaxPool2d(nn.Module):
+    def __init__(
+        self,
+        kernel_size,
+        stride=None,
+        padding=0,
+        dilation=1,
+        return_indices=False,
+        ceil_mode=False,
+        data_format="NCHW",
+    ):
+        super(CustomMaxPool2d, self).__init__()
+        self.kernel_size = kernel_size if isinstance(kernel_size, (tuple, list)) else (kernel_size, kernel_size)
+        self.stride = stride if stride is not None else self.kernel_size
+        self.stride = self.stride if isinstance(self.stride, (tuple, list)) else (self.stride, self.stride)
+        self.dilation = dilation if isinstance(dilation, (tuple, list)) else (dilation, dilation)
+        self.return_indices = return_indices
+        self.ceil_mode = ceil_mode
+        self.padding_mode = padding
+
+        # Use the standard MaxPool2d when padding is not "same"
+        if padding != "same":
+            self.padding = padding if isinstance(padding, (tuple, list)) else (padding, padding)
+            self.pool = nn.MaxPool2d(
+                kernel_size=self.kernel_size,
+                stride=self.stride,
+                padding=self.padding,
+                dilation=self.dilation,
+                return_indices=self.return_indices,
+                ceil_mode=self.ceil_mode
+            )
+
+    def forward(self, x):
+        # Handle "same" padding
+        if self.padding_mode == "same":
+            input_height, input_width = x.size(2), x.size(3)
+
+            # Compute the expected output size
+            out_height = math.ceil(input_height / self.stride[0])
+            out_width = math.ceil(input_width / self.stride[1])
+
+            # Compute the required padding
+            pad_height = max((out_height - 1) * self.stride[0] + self.kernel_size[0] - input_height, 0)
+            pad_width = max((out_width - 1) * self.stride[1] + self.kernel_size[1] - input_width, 0)
+
+            # Split the padding between the two sides
+            pad_top = pad_height // 2
+            pad_bottom = pad_height - pad_top
+            pad_left = pad_width // 2
+            pad_right = pad_width - pad_left
+
+            # Apply the padding
+            x = F.pad(x, (pad_left, pad_right, pad_top, pad_bottom))
+
+            # Use the standard max_pool2d function
+            if self.return_indices:
+                return F.max_pool2d_with_indices(
+                    x,
+                    kernel_size=self.kernel_size,
+                    stride=self.stride,
+                    padding=0,  # already padded manually
+                    dilation=self.dilation,
+                    ceil_mode=self.ceil_mode
+                )
+            else:
+                return F.max_pool2d(
+                    x,
+                    kernel_size=self.kernel_size,
+                    stride=self.stride,
+                    padding=0,  # already padded manually
+                    dilation=self.dilation,
+                    ceil_mode=self.ceil_mode
+                )
+        else:
+            # Use the predefined MaxPool2d
+            return self.pool(x)
+
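The "same" padding arithmetic in `CustomMaxPool2d.forward` works per axis: pick the output length `ceil(size / stride)`, compute how much padding that output needs, and put the extra element (if any) on the trailing side. A one-dimensional sketch of just that calculation, with `same_padding_1d` as a hypothetical helper name:

```python
import math


def same_padding_1d(size: int, kernel: int, stride: int):
    """Return (pad_before, pad_after) for one axis, as computed in forward()."""
    out = math.ceil(size / stride)                       # expected output length
    total = max((out - 1) * stride + kernel - size, 0)   # padding needed overall
    before = total // 2                                  # smaller half goes first
    return before, total - before
```

For the stem's pooling layer (`kernel_size=2`, `stride=1`) this gives one padded row or column on the trailing side only, which keeps the pooled feature map the same size as its input so it can be concatenated with the `stem2b` output.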
+class StemBlock(nn.Module):
+    """
+    StemBlock for PP-HGNetV2.
+
+    Args:
+        in_channels (int): Number of input channels.
+        mid_channels (int): Number of middle channels.
+        out_channels (int): Number of output channels.
+        use_lab (bool): Whether to use the LAB operation. Defaults to False.
+        lr_mult (float): Learning rate multiplier for the layer. Defaults to 1.0.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        mid_channels,
+        out_channels,
+        use_lab=False,
+        lr_mult=1.0,
+        text_rec=False,
+    ):
+        super().__init__()
+        self.stem1 = ConvBNAct(
+            in_channels=in_channels,
+            out_channels=mid_channels,
+            kernel_size=3,
+            stride=2,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.stem2a = ConvBNAct(
+            in_channels=mid_channels,
+            out_channels=mid_channels // 2,
+            kernel_size=2,
+            stride=1,
+            padding="same",
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.stem2b = ConvBNAct(
+            in_channels=mid_channels // 2,
+            out_channels=mid_channels,
+            kernel_size=2,
+            stride=1,
+            padding="same",
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.stem3 = ConvBNAct(
+            in_channels=mid_channels * 2,
+            out_channels=mid_channels,
+            kernel_size=3,
+            stride=1 if text_rec else 2,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.stem4 = ConvBNAct(
+            in_channels=mid_channels,
+            out_channels=out_channels,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.pool = CustomMaxPool2d(
+            kernel_size=2, stride=1, ceil_mode=True, padding="same"
+        )
+        # self.pool = nn.MaxPool2d(
+        #     kernel_size=2, stride=1, ceil_mode=True, padding=1
+        # )
+
+    def forward(self, x):
+        x = self.stem1(x)
+        x2 = self.stem2a(x)
+        x2 = self.stem2b(x2)
+        x1 = self.pool(x)
+
+        # if x1.shape[2:] != x2.shape[2:]:
+        #     x1 = F.interpolate(x1, size=x2.shape[2:], mode='bilinear', align_corners=False)
+
+        x = torch.cat([x1, x2], 1)
+        x = self.stem3(x)
+        x = self.stem4(x)
+
+        return x
+
+
+class HGV2_Block(nn.Module):
+    """
+    HGV2_Block, the basic unit that constitutes the HGV2_Stage.
+
+    Args:
+        in_channels (int): Number of input channels.
+        mid_channels (int): Number of middle channels.
+        out_channels (int): Number of output channels.
+        kernel_size (int): Size of the convolution kernel. Defaults to 3.
+        layer_num (int): Number of layers in the HGV2 block. Defaults to 6.
+        stride (int): Stride of the convolution. Defaults to 1.
+        padding (int/str): Padding or padding type for the convolution. Defaults to 1.
+        groups (int): Number of groups for the convolution. Defaults to 1.
+        use_act (bool): Whether to use activation function. Defaults to True.
+        use_lab (bool): Whether to use the LAB operation. Defaults to False.
+        lr_mult (float): Learning rate multiplier for the layer. Defaults to 1.0.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        mid_channels,
+        out_channels,
+        kernel_size=3,
+        layer_num=6,
+        identity=False,
+        light_block=True,
+        use_lab=False,
+        lr_mult=1.0,
+    ):
+        super().__init__()
+        self.identity = identity
+
+        self.layers = nn.ModuleList()
+        block_type = "LightConvBNAct" if light_block else "ConvBNAct"
+        for i in range(layer_num):
+            self.layers.append(
+                eval(block_type)(
+                    in_channels=in_channels if i == 0 else mid_channels,
+                    out_channels=mid_channels,
+                    stride=1,
+                    kernel_size=kernel_size,
+                    use_lab=use_lab,
+                    lr_mult=lr_mult,
+                )
+            )
+        # feature aggregation
+        total_channels = in_channels + layer_num * mid_channels
+        self.aggregation_squeeze_conv = ConvBNAct(
+            in_channels=total_channels,
+            out_channels=out_channels // 2,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+        self.aggregation_excitation_conv = ConvBNAct(
+            in_channels=out_channels // 2,
+            out_channels=out_channels,
+            kernel_size=1,
+            stride=1,
+            use_lab=use_lab,
+            lr_mult=lr_mult,
+        )
+
+    def forward(self, x):
+        identity = x
+        output = []
+        output.append(x)
+        for layer in self.layers:
+            x = layer(x)
+            output.append(x)
+        x = torch.cat(output, dim=1)
+        x = self.aggregation_squeeze_conv(x)
+        x = self.aggregation_excitation_conv(x)
+        if self.identity:
+            x += identity
+        return x
+
+
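`HGV2_Block.forward` concatenates the block input with the output of every layer, then squeezes to half the output width and expands back. The channel bookkeeping can be checked with plain arithmetic. `hgv2_block_channels` below is a hypothetical helper written for this note, not part of the file.

```python
def hgv2_block_channels(in_channels: int, mid_channels: int, out_channels: int, layer_num: int):
    """Channel counts seen along HGV2_Block.forward (illustrative only)."""
    # Input plus one mid_channels-wide tensor per layer, concatenated on dim=1.
    concat = in_channels + layer_num * mid_channels
    return {
        "concat": concat,                 # input to aggregation_squeeze_conv
        "squeeze": out_channels // 2,     # output of aggregation_squeeze_conv
        "excitation": out_channels,       # output of aggregation_excitation_conv
    }
```

Using the `stage1` values listed for `PPHGNetV2_B0` further down (16 input, 16 middle, 64 output channels, 3 layers), the concatenated tensor has 16 + 3 * 16 = 64 channels. The residual `x += identity` only type-checks when the block's input and output widths match, which is why `HGV2_Stage` enables `identity` only from the second block on.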
+class HGV2_Stage(nn.Module):
+    """
+    HGV2_Stage, the basic unit that constitutes the PPHGNetV2.
+
+    Args:
+        in_channels (int): Number of input channels.
+        mid_channels (int): Number of middle channels.
+        out_channels (int): Number of output channels.
+        block_num (int): Number of blocks in the HGV2 stage.
+        layer_num (int): Number of layers in the HGV2 block. Defaults to 6.
+        is_downsample (bool): Whether to use downsampling operation. Defaults to False.
+        light_block (bool): Whether to use light block. Defaults to True.
+        kernel_size (int): Size of the convolution kernel. Defaults to 3.
+        use_lab (bool, optional): Whether to use the LAB operation. Defaults to False.
+        lr_mult (float, optional): Learning rate multiplier for the layer. Defaults to 1.0.
+    """
+
+    def __init__(
+        self,
+        in_channels,
+        mid_channels,
+        out_channels,
+        block_num,
+        layer_num=6,
+        is_downsample=True,
+        light_block=True,
+        kernel_size=3,
+        use_lab=False,
+        stride=2,
+        lr_mult=1.0,
+    ):
+
+        super().__init__()
+        self.is_downsample = is_downsample
+        if self.is_downsample:
+            self.downsample = ConvBNAct(
+                in_channels=in_channels,
+                out_channels=in_channels,
+                kernel_size=3,
+                stride=stride,
+                groups=in_channels,
+                use_act=False,
+                use_lab=use_lab,
+                lr_mult=lr_mult,
+            )
+
+        blocks_list = []
+        for i in range(block_num):
+            blocks_list.append(
+                HGV2_Block(
+                    in_channels=in_channels if i == 0 else out_channels,
+                    mid_channels=mid_channels,
+                    out_channels=out_channels,
+                    kernel_size=kernel_size,
+                    layer_num=layer_num,
+                    identity=False if i == 0 else True,
+                    light_block=light_block,
+                    use_lab=use_lab,
+                    lr_mult=lr_mult,
+                )
+            )
+        self.blocks = nn.Sequential(*blocks_list)
+
+    def forward(self, x):
+        if self.is_downsample:
+            x = self.downsample(x)
+        x = self.blocks(x)
+        return x
+
+
+class DropoutInferDownscale(nn.Module):
+    """
+    Dropout equivalent to Paddle's mode="downscale_in_infer".
+    Training mode: out = input * mask (the mask is applied directly, without upscaling)
+    Inference mode: out = input * (1.0 - p) (scaled down by the dropout probability at inference)
+    """
+
+    def __init__(self, p=0.5):
+        super().__init__()
+        self.p = p
+
+    def forward(self, x):
+        if self.training:
+            # Training: apply a random mask without upscaling
+            return F.dropout(x, self.p, training=True) * (1.0 - self.p)
+        else:
+            # Inference: scale the output down by the dropout probability
+            return x * (1.0 - self.p)
+
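The `downscale_in_infer` convention differs from PyTorch's default dropout: nothing is rescaled during training, and the `(1 - p)` factor is applied at inference instead. In `DropoutInferDownscale.forward`, multiplying `F.dropout(...)` by `(1.0 - self.p)` cancels the `1 / (1 - p)` upscaling that PyTorch applies in training mode. A framework-free sketch of the same behaviour on Python lists, where `dropout_downscale_in_infer` is a hypothetical function used only for illustration:

```python
import random


def dropout_downscale_in_infer(values, p, training, rng=None):
    """List-based model of the two modes described in the class docstring."""
    if training:
        rng = rng or random.Random()
        # Keep each element with probability (1 - p); kept values are NOT upscaled.
        return [v if rng.random() >= p else 0.0 for v in values]
    # Inference: every element is scaled by the keep probability.
    return [v * (1.0 - p) for v in values]
```

Both modes have the same expected value per element, `v * (1 - p)`, which is what keeps training and inference activations on the same scale.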
+class PPHGNetV2(nn.Module):
+    """
+    PPHGNetV2
+
+    Args:
+        stage_config (dict): Config for PPHGNetV2 stages. such as the number of channels, stride, etc.
+        stem_channels: (list): Number of channels of the stem of the PPHGNetV2.
+        use_lab (bool): Whether to use the LAB operation. Defaults to False.
+        use_last_conv (bool): Whether to use the last conv layer as the output channel. Defaults to True.
+        class_expand (int): Number of channels for the last 1x1 convolutional layer.
+        drop_prob (float): Dropout probability for the last 1x1 convolutional layer. Defaults to 0.0.
+        class_num (int): The number of classes for the classification layer. Defaults to 1000.
+        lr_mult_list (list): Learning rate multiplier for the stages. Defaults to [1.0, 1.0, 1.0, 1.0, 1.0].
+    Returns:
+        model: nn.Layer. Specific PPHGNetV2 model depends on args.
+    """
+
+    def __init__(
+        self,
+        stage_config,
+        stem_channels=[3, 32, 64],
+        use_lab=False,
+        use_last_conv=True,
+        class_expand=2048,
+        dropout_prob=0.0,
+        class_num=1000,
+        lr_mult_list=[1.0, 1.0, 1.0, 1.0, 1.0],
+        det=False,
+        text_rec=False,
+        out_indices=None,
+        **kwargs,
+    ):
+        super().__init__()
+        self.det = det
+        self.text_rec = text_rec
+        self.use_lab = use_lab
+        self.use_last_conv = use_last_conv
+        self.class_expand = class_expand
+        self.class_num = class_num
+        self.out_indices = out_indices if out_indices is not None else [0, 1, 2, 3]
+        self.out_channels = []
+
+        # stem
+        self.stem = StemBlock(
+            in_channels=stem_channels[0],
+            mid_channels=stem_channels[1],
+            out_channels=stem_channels[2],
+            use_lab=use_lab,
+            lr_mult=lr_mult_list[0],
+            text_rec=text_rec,
+        )
+
+        # stages
+        self.stages = nn.ModuleList()
+        for i, k in enumerate(stage_config):
+            (
+                in_channels,
+                mid_channels,
+                out_channels,
+                block_num,
+                is_downsample,
+                light_block,
+                kernel_size,
+                layer_num,
+                stride,
+            ) = stage_config[k]
+            self.stages.append(
+                HGV2_Stage(
+                    in_channels,
+                    mid_channels,
+                    out_channels,
+                    block_num,
+                    layer_num,
+                    is_downsample,
+                    light_block,
+                    kernel_size,
+                    use_lab,
+                    stride,
+                    lr_mult=lr_mult_list[i + 1],
+                )
+            )
+            if i in self.out_indices:
+                self.out_channels.append(out_channels)
+        if not self.det:
+            self.out_channels = stage_config["stage4"][2]
+
+        self.avg_pool = AdaptiveAvgPool2D(1)
+
+        if self.use_last_conv:
+            self.last_conv = nn.Conv2d(
+                in_channels=out_channels,
+                out_channels=self.class_expand,
+                kernel_size=1,
+                stride=1,
+                padding=0,
+                bias=False,
+            )
+            self.act = nn.ReLU()
+            if self.use_lab:
+                self.lab = LearnableAffineBlock()
+            self.dropout = DropoutInferDownscale(p=dropout_prob)
+
+        self.flatten = nn.Flatten(start_dim=1, end_dim=-1)
+        if not self.det:
+            self.fc = nn.Linear(
+                self.class_expand if self.use_last_conv else out_channels,
+                self.class_num,
+            )
+
+        self._init_weights()
+
+    def _init_weights(self):
+        for m in self.modules():
+            if isinstance(m, nn.Conv2d):
+                nn.init.kaiming_normal_(m.weight)
+            elif isinstance(m, nn.BatchNorm2d):
+                nn.init.ones_(m.weight)
+                nn.init.zeros_(m.bias)
+            elif isinstance(m, nn.Linear):
+                nn.init.zeros_(m.bias)
+
+    def forward(self, x):
+        x = self.stem(x)
+        out = []
+        for i, stage in enumerate(self.stages):
+            x = stage(x)
+            if self.det and i in self.out_indices:
+                out.append(x)
+        if self.det:
+            return out
+
+        if self.text_rec:
+            if self.training:
+                x = F.adaptive_avg_pool2d(x, [1, 40])
+            else:
+                x = F.avg_pool2d(x, [3, 2])
+        return x
+
+
|
||||
def PPHGNetV2_B0(pretrained=False, use_ssld=False, **kwargs):
|
||||
"""
|
||||
PPHGNetV2_B0
|
||||
Args:
|
||||
pretrained (bool/str): If `True` load pretrained parameters, `False` otherwise.
|
||||
If str, means the path of the pretrained model.
|
||||
use_ssld (bool) Whether using ssld pretrained model when pretrained is True.
|
||||
Returns:
|
||||
model: nn.Layer. Specific `PPHGNetV2_B0` model depends on args.
|
||||
"""
|
||||
stage_config = {
|
||||
# in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
|
||||
"stage1": [16, 16, 64, 1, False, False, 3, 3],
|
||||
"stage2": [64, 32, 256, 1, True, False, 3, 3],
|
||||
"stage3": [256, 64, 512, 2, True, True, 5, 3],
|
||||
"stage4": [512, 128, 1024, 1, True, True, 5, 3],
|
||||
}
|
||||
|
||||
model = PPHGNetV2(
|
||||
stem_channels=[3, 16, 16], stage_config=stage_config, use_lab=True, **kwargs
|
||||
)
|
||||
return model


def PPHGNetV2_B1(pretrained=False, use_ssld=False, **kwargs):
    """
    PPHGNetV2_B1
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B1` model depends on args.
    """
    stage_config = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
        "stage1": [32, 32, 64, 1, False, False, 3, 3],
        "stage2": [64, 48, 256, 1, True, False, 3, 3],
        "stage3": [256, 96, 512, 2, True, True, 5, 3],
        "stage4": [512, 192, 1024, 1, True, True, 5, 3],
    }

    model = PPHGNetV2(
        stem_channels=[3, 24, 32], stage_config=stage_config, use_lab=True, **kwargs
    )
    return model


def PPHGNetV2_B2(pretrained=False, use_ssld=False, **kwargs):
    """
    PPHGNetV2_B2
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B2` model depends on args.
    """
    stage_config = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
        "stage1": [32, 32, 96, 1, False, False, 3, 4],
        "stage2": [96, 64, 384, 1, True, False, 3, 4],
        "stage3": [384, 128, 768, 3, True, True, 5, 4],
        "stage4": [768, 256, 1536, 1, True, True, 5, 4],
    }

    model = PPHGNetV2(
        stem_channels=[3, 24, 32], stage_config=stage_config, use_lab=True, **kwargs
    )
    return model


def PPHGNetV2_B3(pretrained=False, use_ssld=False, **kwargs):
    """
    PPHGNetV2_B3
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B3` model depends on args.
    """
    stage_config = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
        "stage1": [32, 32, 128, 1, False, False, 3, 5],
        "stage2": [128, 64, 512, 1, True, False, 3, 5],
        "stage3": [512, 128, 1024, 3, True, True, 5, 5],
        "stage4": [1024, 256, 2048, 1, True, True, 5, 5],
    }

    model = PPHGNetV2(
        stem_channels=[3, 24, 32], stage_config=stage_config, use_lab=True, **kwargs
    )
    return model


def PPHGNetV2_B4(pretrained=False, use_ssld=False, det=False, text_rec=False, **kwargs):
    """
    PPHGNetV2_B4
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
        det (bool): If True, build the model from `stage_config_det`; otherwise from `stage_config_rec`.
        text_rec (bool): Forwarded to `PPHGNetV2`; enables the text-recognition pooling in `forward`.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B4` model depends on args.
    """
    stage_config_rec = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num, stride
        "stage1": [48, 48, 128, 1, True, False, 3, 6, [2, 1]],
        "stage2": [128, 96, 512, 1, True, False, 3, 6, [1, 2]],
        "stage3": [512, 192, 1024, 3, True, True, 5, 6, [2, 1]],
        "stage4": [1024, 384, 2048, 1, True, True, 5, 6, [2, 1]],
    }

    stage_config_det = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num, stride
        "stage1": [48, 48, 128, 1, False, False, 3, 6, 2],
        "stage2": [128, 96, 512, 1, True, False, 3, 6, 2],
        "stage3": [512, 192, 1024, 3, True, True, 5, 6, 2],
        "stage4": [1024, 384, 2048, 1, True, True, 5, 6, 2],
    }
    model = PPHGNetV2(
        stem_channels=[3, 32, 48],
        stage_config=stage_config_det if det else stage_config_rec,
        use_lab=False,
        det=det,
        text_rec=text_rec,
        **kwargs,
    )
    return model
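Each stage entry is a positional list, so it is easy to misread a field or break the channel chain between stages. The sketch below names the nine fields using the order given in the `stage_config_rec` comment and checks that every stage's `in_channels` equals the previous stage's `out_channels`. The names `StageCfg` and `check_channel_chain` are hypothetical helpers, and treating the last entry of `stem_channels` as the stem's output width is an assumption.

```python
from collections import namedtuple

# Field order copied from the comment above `stage_config_rec`.
StageCfg = namedtuple(
    "StageCfg",
    "in_channels mid_channels out_channels num_blocks "
    "is_downsample light_block kernel_size layer_num stride",
)

# Values copied from PPHGNetV2_B4's recognition configuration.
stage_config_rec = {
    "stage1": [48, 48, 128, 1, True, False, 3, 6, [2, 1]],
    "stage2": [128, 96, 512, 1, True, False, 3, 6, [1, 2]],
    "stage3": [512, 192, 1024, 3, True, True, 5, 6, [2, 1]],
    "stage4": [1024, 384, 2048, 1, True, True, 5, 6, [2, 1]],
}


def check_channel_chain(stage_config, stem_out_channels):
    """Return named stage configs after checking that channel widths line up."""
    stages = [StageCfg(*stage_config[key]) for key in sorted(stage_config)]
    expected_in = stem_out_channels
    for index, stage in enumerate(stages, start=1):
        if stage.in_channels != expected_in:
            raise ValueError(
                f"stage{index} expects {stage.in_channels} input channels, "
                f"but the previous block produces {expected_in}"
            )
        expected_in = stage.out_channels
    return stages


# Assumption: stem_channels=[3, 32, 48] means the stem outputs 48 channels.
stages = check_channel_chain(stage_config_rec, stem_out_channels=48)
print([stage.out_channels for stage in stages])
```

A check like this can run once at construction time and costs nothing at inference.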


def PPHGNetV2_B5(pretrained=False, use_ssld=False, **kwargs):
    """
    PPHGNetV2_B5
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B5` model depends on args.
    """
    stage_config = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
        "stage1": [64, 64, 128, 1, False, False, 3, 6],
        "stage2": [128, 128, 512, 2, True, False, 3, 6],
        "stage3": [512, 256, 1024, 5, True, True, 5, 6],
        "stage4": [1024, 512, 2048, 2, True, True, 5, 6],
    }

    model = PPHGNetV2(
        stem_channels=[3, 32, 64], stage_config=stage_config, use_lab=False, **kwargs
    )
    return model


def PPHGNetV2_B6(pretrained=False, use_ssld=False, **kwargs):
    """
    PPHGNetV2_B6
    Args:
        pretrained (bool/str): If `True`, load pretrained parameters; `False` otherwise.
            If a str, it is the path to the pretrained model.
        use_ssld (bool): Whether to use the SSLD pretrained model when `pretrained` is True.
    Returns:
        model: nn.Module. The specific `PPHGNetV2_B6` model depends on args.
    """
    stage_config = {
        # in_channels, mid_channels, out_channels, num_blocks, is_downsample, light_block, kernel_size, layer_num
        "stage1": [96, 96, 192, 2, False, False, 3, 6],
        "stage2": [192, 192, 512, 3, True, False, 3, 6],
        "stage3": [512, 384, 1024, 6, True, True, 5, 6],
        "stage4": [1024, 768, 2048, 3, True, True, 5, 6],
    }

    model = PPHGNetV2(
        stem_channels=[3, 48, 96], stage_config=stage_config, use_lab=False, **kwargs
    )
    return model
@@ -9,14 +9,27 @@ class Im2Seq(nn.Module):
         super().__init__()
         self.out_channels = in_channels
 
+    # def forward(self, x):
+    #     B, C, H, W = x.shape
+    #     # assert H == 1
+    #     x = x.squeeze(dim=2)
+    #     # x = x.transpose([0, 2, 1])  # paddle (NTC)(batch, width, channels)
+    #     x = x.permute(0, 2, 1)
+    #     return x
+
     def forward(self, x):
         B, C, H, W = x.shape
-        # assert H == 1
-        x = x.squeeze(dim=2)
-        # x = x.transpose([0, 2, 1])  # paddle (NTC)(batch, width, channels)
-        x = x.permute(0, 2, 1)
-        return x
+        # Handle a 4-D tensor by flattening the spatial dimensions into a sequence
+        if H == 1:
+            # Original logic, for the H == 1 case
+            x = x.squeeze(dim=2)
+            x = x.permute(0, 2, 1)  # (B, W, C)
+        else:
+            # Handle the case where H != 1
+            x = x.permute(0, 2, 3, 1)  # (B, H, W, C)
+            x = x.reshape(B, H * W, C)  # (B, H*W, C)
+
+        return x
 
 
 class EncoderWithRNN_(nn.Module):
     def __init__(self, in_channels, hidden_size):
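The `Im2Seq.forward` change in this hunk turns a `(B, C, H, W)` feature map into a `(B, sequence, C)` tensor whether or not `H` is 1. The shape bookkeeping can be sketched without PyTorch as below; `im2seq_output_shape` is a hypothetical helper written for illustration, and it tracks shapes only, not tensor values.

```python
def im2seq_output_shape(shape):
    """Shape produced by the updated Im2Seq.forward for an input (B, C, H, W)."""
    b, c, h, w = shape
    if h == 1:
        # squeeze(dim=2) gives (B, C, W); permute(0, 2, 1) gives (B, W, C).
        return (b, w, c)
    # permute(0, 2, 3, 1) gives (B, H, W, C); reshape gives (B, H*W, C).
    return (b, h * w, c)


# Illustrative sizes only: 120 channels, width 40, heights 1 and 2.
print(im2seq_output_shape((2, 120, 1, 40)))
print(im2seq_output_shape((2, 120, 2, 40)))
```

For `H == 1` the two branches yield the same shape, since `H * W` equals `W`; the separate branch keeps the original code path unchanged for that case.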
@@ -104,6 +104,22 @@ ch_PP-OCRv4_det_infer:
    name: DBHead
    k: 50

ch_PP-OCRv5_det_infer:
  model_type: det
  algorithm: DB
  Transform: null
  Backbone:
    name: PPLCNetV3
    scale: 0.75
    det: True
  Neck:
    name: RSEFPN
    out_channels: 96
    shortcut: True
  Head:
    name: DBHead
    k: 50

ch_PP-OCRv4_det_server_infer:
  model_type: det
  algorithm: DB
@@ -196,6 +212,58 @@ ch_PP-OCRv4_rec_server_doc_infer:
          nrtr_dim: 384
          max_text_length: 25

ch_PP-OCRv5_rec_server_infer:
  model_type: rec
  algorithm: SVTR_HGNet
  Transform:
  Backbone:
    name: PPHGNetV2_B4
    text_rec: True
  Head:
    name: MultiHead
    out_channels_list:
      CTCLabelDecode: 18385
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [ 1, 3 ]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: 25

ch_PP-OCRv5_rec_infer:
  model_type: rec
  algorithm: SVTR_HGNet
  Transform:
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    out_channels_list:
      CTCLabelDecode: 18385
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 120
            depth: 2
            hidden_dims: 120
            kernel_size: [ 1, 3 ]
            use_guide: True
          Head:
            fc_decay: 0.00001
      - NRTRHead:
          nrtr_dim: 384
          max_text_length: 25

chinese_cht_PP-OCRv3_rec_infer:
  model_type: rec
  algorithm: SVTR

File diff suppressed because it is too large
@@ -1,9 +1,17 @@
lang:
  ch_lite:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv5_rec_infer.pth
    dict: ppocrv5_dict.txt
  ch_lite_v4:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv4_rec_infer.pth
    dict: ppocr_keys_v1.txt
  ch_server:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv5_rec_server_infer.pth
    dict: ppocrv5_dict.txt
  ch_server_v4:
    det: ch_PP-OCRv3_det_infer.pth
    rec: ch_PP-OCRv4_rec_server_infer.pth
    dict: ppocr_keys_v1.txt
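The `lang` table in this hunk maps each language key to a detection checkpoint, a recognition checkpoint, and a character dictionary. A rough sketch of how such a table could be consulted is shown below. `OCR_MODELS` and `resolve_ocr_models` are hypothetical names, not the project's loader; only the file names are copied from the hunk.

```python
# File names copied from the `lang` mapping; the dict layout is illustrative.
OCR_MODELS = {
    "ch_lite": {
        "det": "ch_PP-OCRv3_det_infer.pth",
        "rec": "ch_PP-OCRv5_rec_infer.pth",
        "dict": "ppocrv5_dict.txt",
    },
    "ch_lite_v4": {
        "det": "ch_PP-OCRv3_det_infer.pth",
        "rec": "ch_PP-OCRv4_rec_infer.pth",
        "dict": "ppocr_keys_v1.txt",
    },
    "ch_server": {
        "det": "ch_PP-OCRv3_det_infer.pth",
        "rec": "ch_PP-OCRv5_rec_server_infer.pth",
        "dict": "ppocrv5_dict.txt",
    },
    "ch_server_v4": {
        "det": "ch_PP-OCRv3_det_infer.pth",
        "rec": "ch_PP-OCRv4_rec_server_infer.pth",
        "dict": "ppocr_keys_v1.txt",
    },
}


def resolve_ocr_models(lang):
    """Return the det/rec/dict file names for a language key (hypothetical helper)."""
    try:
        return OCR_MODELS[lang]
    except KeyError:
        supported = ", ".join(sorted(OCR_MODELS))
        raise ValueError(f"unsupported lang {lang!r}; expected one of: {supported}") from None


print(resolve_ocr_models("ch_server")["rec"])
```

In this table the PP-OCRv5 recognizers are paired with `ppocrv5_dict.txt` and the v4 recognizers with `ppocr_keys_v1.txt`, so a recognizer and its dictionary should be switched together.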
@@ -255,6 +255,14 @@
      "created_at": "2025-04-25T02:54:20Z",
      "repoId": 765083837,
      "pullRequestNo": 2367
    },
    {
      "name": "CharlesKeeling65",
      "id": 94165417,
      "comment_id": 2841356871,
      "created_at": "2025-04-30T09:25:31Z",
      "repoId": 765083837,
      "pullRequestNo": 2411
    }
  ]
}