Commit Graph

1584 Commits

Author SHA1 Message Date
Xiaomeng Zhao
1402479316 Merge branch 'master' into release-0.9.0 2024-11-01 18:46:26 +08:00
Xiaomeng Zhao
935e17e6bb Merge pull request #835 from opendatalab/dev
fix(pdf_parse): improve span filtering
2024-11-01 15:27:26 +08:00
Xiaomeng Zhao
099f19f277 Merge pull request #834 from myhloli/dev
feat(pdf_parse): improve span filtering and add new block types
2024-11-01 15:23:22 +08:00
myhloli
149132d608 feat(pdf_parse): improve span filtering and add new block types
- Refactor remove_outside_spans function to filter spans more accurately
- Add image_footnote, index, and list block types to output file documentation
- Update draw_span_bbox to use preproc_blocks instead of para_blocks
- Bump version to 0.9.0
2024-11-01 15:21:16 +08:00
Xiaomeng Zhao
11bd94321d Merge pull request #831 from opendatalab/dev
fix(pdf_parse): improve span removal logic for all content types
2024-11-01 11:17:22 +08:00
Xiaomeng Zhao
73afb7d6e1 Merge pull request #830 from myhloli/dev
fix(pdf_parse): improve span removal logic for all content types
2024-11-01 11:03:31 +08:00
myhloli
ad0d06b6a0 fix(pdf_parse): improve span removal logic for all content types
- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy
2024-11-01 11:01:57 +08:00
myhloli
509128d505 fix(pdf_parse): improve span removal logic for all content types
- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy
2024-11-01 10:48:40 +08:00
myhloli
eeda90af31 fix(pdf_parse): improve span removal logic for all content types
- Update remove_outside_spans function to handle all content types
- Add processing for text and equation spans
- Improve overlap calculation for better accuracy
2024-11-01 10:40:42 +08:00
Xiaomeng Zhao
4e6855248a Merge pull request #825 from opendatalab/dev
fix(pdf_parse): optimize span processing by removing outside spans
2024-10-31 17:52:34 +08:00
Xiaomeng Zhao
b2cd4aa029 Merge pull request #824 from myhloli/dev
fix(pdf_parse): optimize span processing by removing outside spans
2024-10-31 17:50:22 +08:00
myhloli
6b9f816f9e fix(pdf_parse): optimize span processing by removing outside spans
- Add new function `remove_outside_spans` to filter spans based on image and table blocks
- Reorder span processing steps to improve efficiency
- Update imports to include `calculate_overlap_area_in_bbox1_area_ratio`
2024-10-31 17:49:17 +08:00
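The span filtering described in 6b9f816f9e can be illustrated roughly as follows. This is a minimal sketch, not the repository's code: the helper names `overlap_ratio_in_bbox1` and `drop_outside_spans`, the `(x0, y0, x1, y1)` bbox layout, the dict shapes, and the 0.5 threshold are all assumptions made for illustration.

```python
def overlap_ratio_in_bbox1(bbox1, bbox2):
    """Return the share of bbox1's area that is covered by bbox2 (0.0 to 1.0).

    Boxes are assumed to be (x0, y0, x1, y1) tuples.
    """
    # Corners of the intersection rectangle.
    x0 = max(bbox1[0], bbox2[0])
    y0 = max(bbox1[1], bbox2[1])
    x1 = min(bbox1[2], bbox2[2])
    y1 = min(bbox1[3], bbox2[3])
    if x1 <= x0 or y1 <= y0:
        return 0.0  # no overlap
    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    if area1 <= 0:
        return 0.0  # degenerate box
    return (x1 - x0) * (y1 - y0) / area1


def drop_outside_spans(spans, blocks, threshold=0.5):
    """Keep only spans that lie mostly inside at least one block.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return [
        span
        for span in spans
        if any(
            overlap_ratio_in_bbox1(span["bbox"], block["bbox"]) >= threshold
            for block in blocks
        )
    ]
```

Dividing by the span's own area rather than by the union means a small span fully contained in a large block scores 1.0, which is the behaviour a containment filter needs.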
Xiaomeng Zhao
d691298e47 Merge pull request #820 from papayalove/dev-table-model-update 2024-10-30 20:08:18 +08:00
liukaiwen
90547d11b7 perf: table model update with PP OCRv4 2024-10-30 20:05:29 +08:00
Xiaomeng Zhao
650dd6d4bf Merge pull request #818 from opendatalab/dev
fix(magic_pdf): handle missing image_path in spans
2024-10-30 19:01:42 +08:00
Xiaomeng Zhao
92cf9d496d Merge pull request #817 from myhloli/dev
fix(magic_pdf): handle missing image_path in spans
2024-10-30 19:00:38 +08:00
myhloli
76031a6d48 Merge remote-tracking branch 'origin/dev' into dev
# Conflicts:
#	magic_pdf/dict2md/ocr_mkcontent.py
2024-10-30 18:55:20 +08:00
myhloli
faf8c286fb fix(magic_pdf): handle missing image_path in spans
- Add check for 'image_path' in spans to avoid errors when it's missing
- Update image handling in both paragraph text and content dictionary
- Improve error handling and make the code more robust
2024-10-30 18:53:41 +08:00
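The kind of guard described in faf8c286fb might look like the sketch below. It is an illustration under assumptions, not the actual change: the function name `image_markdown`, the span dict shape, and the `img_dir` parameter are hypothetical.

```python
def image_markdown(span, img_dir="images"):
    """Render an image span as Markdown, or return '' if it has no image_path.

    Using dict.get avoids a KeyError when the key is missing.
    """
    path = span.get("image_path")
    if not path:
        return ""  # skip spans without a usable path
    return f"![]({img_dir}/{path})"
```

Returning an empty string lets the caller concatenate results without a separate existence check.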
myhloli
b7e9d454e9 fix(ocr): improve image and table content extraction
- Update image content extraction to iterate through all spans in a block
- Add support for extracting table content from spans within a block
- Handle multiple content types within table spans (latex, html, image)
- Refactor code to be more modular and easier to maintain
2024-10-30 18:21:38 +08:00
Xiaomeng Zhao
b31e1ffac8 Merge pull request #810 from opendatalab/dev
(docs&build): switch to Aliyun PyPI mirror
2024-10-29 18:16:28 +08:00
Xiaomeng Zhao
bcedd61863 Merge pull request #809 from myhloli/dev
(docs&build): switch to Aliyun PyPI mirror
2024-10-29 18:15:00 +08:00
myhloli
4c412b2878 (docs&build): switch to Aliyun PyPI mirror
- Update PyPI mirror from Tsinghua to Aliyun in multiple Dockerfiles and installation scripts
- This change may improve package download speed and reliability for users in China
2024-10-29 18:14:00 +08:00
Xiaomeng Zhao
6575adeafe Merge pull request #808 from opendatalab/dev
Dev->0.9 release
2024-10-29 11:37:20 +08:00
Xiaomeng Zhao
37dd55c4e1 Merge pull request #806 from myhloli/dev
docs(README): update model download instructions for PDF-Extract-Kit 1.0
2024-10-28 18:28:05 +08:00
myhloli
a9b6eb0093 docs(README): update model download instructions for PDF-Extract-Kit 1.0
- Update README.md and README_zh-CN.md to include new model download instructions
- Provide detailed steps on how to download models after PDF-Extract-Kit 1.0 repository change
- Emphasize the need to re-download models due to repository change
2024-10-28 18:27:11 +08:00
myhloli
247576c18e docs(README): update model download instructions for PDF-Extract-Kit 1.0
- Update README.md and README_zh-CN.md to include new model download instructions
- Provide detailed steps on how to download models after PDF-Extract-Kit 1.0 repository change
- Emphasize the need to re-download models due to repository change
2024-10-28 18:25:57 +08:00
Xiaomeng Zhao
03dd0cb9e6 Merge pull request #805 from myhloli/dev
refactor(table): disable StructEqTable support and add TableMaster support
2024-10-28 18:14:53 +08:00
myhloli
377b09cf8c refactor(table): disable StructEqTable support and add TableMaster support
- Remove import and usage of StructTableModel
- Add support for TableMaster model
- Update table model initialization logic to support TableMaster
- Log error and exit if StructEqTable is selected, as it's under upgrade
- Update README files to reflect changes in table parsing capabilities
2024-10-28 18:11:35 +08:00
Xiaomeng Zhao
3879bf8d09 Merge pull request #804 from icecraft/fix/match_figure_caption
fix: add priority match rule
2024-10-28 17:24:38 +08:00
icecraft
34a13a898b fix: add priority match rule 2024-10-28 17:21:50 +08:00
Xiaomeng Zhao
3a166bf13b Merge pull request #802 from papayalove/dev-table-model-update
perf: table model update with PP OCRv4
2024-10-28 17:19:45 +08:00
liukaiwen
4949408c9d perf: table model update with PP OCRv4 2024-10-28 17:09:46 +08:00
liukaiwen
7d2dfc8091 Merge branch 'dev' into dev-table-model-update 2024-10-28 16:36:32 +08:00
liukaiwen
a0eff3be5c feat: table model update with paddle recognition v4 2024-10-28 16:34:16 +08:00
Kaiwen Liu
6d571e2e2c Merge pull request #7 from opendatalab/dev
Dev
2024-10-28 16:32:05 +08:00
Xiaomeng Zhao
37c335ae38 Update README.md 2024-10-28 16:13:10 +08:00
Xiaomeng Zhao
889c8a33b5 Update README.md 2024-10-28 16:12:11 +08:00
Xiaomeng Zhao
739134844b Merge pull request #800 from myhloli/dev
docs: update documentation path in README files
2024-10-28 15:28:18 +08:00
myhloli
75b4375dbd docs: update documentation path in README files
- Update image path in README.md and README_zh-CN.md
- Update chemical formula recognition link in README.md and README_zh-CN.md
2024-10-28 15:27:26 +08:00
Xiaomeng Zhao
fc287b4b14 Merge pull request #799 from myhloli/update-docs
docs: update logo path in README files
2024-10-28 15:17:29 +08:00
myhloli
094f926494 docs: update logo path in README files
- Change the logo path from 'docs/images/MinerU-logo.png' to 'old_docs/images/MinerU-logo.png' in both README.md and README_zh-CN.md
- This update ensures that the correct logo is displayed in the project's README files
2024-10-28 15:16:17 +08:00
Xiaomeng Zhao
92efd9a192 Merge pull request #798 from myhloli/update-docs
docs(README): update for v0.9.0 release
2024-10-28 15:05:50 +08:00
myhloli
05c3c95e53 docs(README): remove empty line in table-config example
- Delete unnecessary empty line in the table-config JSON example
- Improve readability and formatting consistency in the configuration example
2024-10-28 15:05:15 +08:00
myhloli
c3c91dc469 docs(README): update for v0.9.0 release
- Add changelog for v0.9.0 release with major refactoring and improvements
- Update key features list to include new functionalities
- Modify system requirements and hardware support information
- Add section for deploying derived projects
- Update known issues and TODO list
2024-10-28 15:02:19 +08:00
Xiaomeng Zhao
bedefd8d78 Merge pull request #797 from icecraft/feat/new_table_caption_match
Feat/new table caption match
2024-10-28 14:20:20 +08:00
icecraft
f09148b9bb fix: pattern match algorithm 2024-10-28 13:47:39 +08:00
Xiaomeng Zhao
d68b3d90b7 Merge pull request #794 from myhloli/update-docs
docs: update model download instructions and simplify demo scripts
2024-10-27 18:05:27 +08:00
myhloli
4cf7e9a224 refactor(pdf_parse): adjust block splitting logic for wide blocks
- Modify the logic for splitting wide blocks exceeding 0.4 page width
- Remove the specific case for blocks exceeding 0.25 page width
- Add comments to explain the reasoning behind different splitting strategies
2024-10-27 18:02:42 +08:00
myhloli
acab8de50f docs: update model download instructions and simplify demo scripts
- Update model download instructions for versions 0.9.x and later
- Simplify demo scripts by removing unnecessary model configuration
- Add visualization function to draw bounding boxes
- Update CLI help message with new URL
2024-10-27 12:12:56 +08:00
Xiaomeng Zhao
efb6b688db Merge pull request #793 from randydl/feat-multi-gpu
Add multi_gpu process project
2024-10-27 02:08:02 +08:00