Commit Graph

1900 Commits

Author SHA1 Message Date
Xiaomeng Zhao
9726403c69 Merge pull request #1152 from myhloli/dev
fix(mkcontent): optimize paragraph text merging and language detection
2024-11-30 02:45:02 +08:00
myhloli
b3127233f0 refactor: modify bbox processing for layout separation
- Remove overlap between bboxes for block separation
- Sort bboxes by combined x and y coordinates for better layout handling
- Comment out previous overlap removal function
2024-11-30 02:33:26 +08:00
myhloli
b80befe9cf refactor(mkcontent): optimize paragraph text merging and language detection
- Extract language detection to block level instead of line level
- Improve logic for handling Chinese, Japanese, and Korean languages
- Refactor code for better readability and performance
- Optimize handling of hyphenated words at line ends
2024-11-30 02:16:38 +08:00
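The line-merging behaviour described in b80befe9cf can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function name `merge_lines`, the `is_cjk` flag, and the exact hyphen rule (drop the trailing hyphen and glue the two halves) are assumptions made for the example.

```python
import re

# Hypothetical rule: a line "ends with a hyphenated word" when letters are
# followed by a hyphen at the very end of the line.
_HYPHEN_AT_END = re.compile(r"[A-Za-z]+-\s*$")


def merge_lines(lines, is_cjk):
    """Join the lines of one block into paragraph text.

    `is_cjk` is assumed to be decided once per block, not once per line.
    """
    merged = ""
    for i, line in enumerate(lines):
        text = line.strip()
        is_last = i == len(lines) - 1
        if is_last or is_cjk:
            # Chinese/Japanese/Korean text has no inter-word spaces,
            # and the last line never needs a separator.
            merged += text
        elif _HYPHEN_AT_END.search(text):
            # Assumed rule: drop the hyphen and glue the word halves together.
            merged += text[:-1]
        else:
            # Western text: separate consecutive lines with one space.
            merged += text + " "
    return merged
```

Deciding the language once per block keeps the spacing rule consistent within a paragraph, which is the motivation the commit message gives for moving detection from line level to block level.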
myhloli
ea35fa6b60 Merge remote-tracking branch 'origin/dev' into dev 2024-11-30 01:14:26 +08:00
myhloli
c8cabb3cf6 feat(ocr_mkcontent): add language detection for line spacing
- Introduce language detection to determine line spacing based on language context
- Implement different spacing rules for Chinese/Japanese/Korean and Western texts
- Adjust span content handling based on detected language and span type
2024-11-30 01:14:12 +08:00
Xiaomeng Zhao
78c9014073 Merge pull request #1147 from opendatalab/master
master->dev
2024-11-29 16:44:47 +08:00
myhloli
d19911f113 Update version.py with new version 2024-11-29 08:03:01 +00:00
Xiaomeng Zhao
b3fbedf055 Merge pull request #1143 from opendatalab/release-0.10.3
Release 0.10.3
magic_pdf-0.10.3-released
2024-11-29 16:01:36 +08:00
Xiaomeng Zhao
66bd0f8b69 Merge pull request #1141 from myhloli/dev
refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment
2024-11-29 12:03:48 +08:00
myhloli
7f2f2c0f28 refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment 2024-11-29 12:02:48 +08:00
Xiaomeng Zhao
68c455309d Merge pull request #1140 from myhloli/dev
refactor(pdf_parse): adjust character-axis alignment algorithm
2024-11-29 12:00:36 +08:00
myhloli
d4345b6e39 refactor(pdf_parse): adjust character-axis alignment algorithm
- Introduce `span_height_radio` parameter to calculate_char_in_span function
- Replace fixed ratio with dynamic ratio for character and span axis alignment
- Improve flexibility and accuracy of character placement within spans
2024-11-29 11:59:52 +08:00
Xiaomeng Zhao
086b48b7ae Merge pull request #1139 from myhloli/dev
fix(ocr_mkcontent): handle empty paragraphs on pages
2024-11-29 11:59:03 +08:00
myhloli
782e6571bc fix(ocr_mkcontent): handle empty paragraphs on pages
- Add empty paragraph handling for pages with no content
- Append an empty markdown object when a page has no paragraphs
- Increment page number even if no content is present
2024-11-29 11:58:34 +08:00
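A minimal sketch of the empty-page handling described in 782e6571bc, under stated assumptions: the function name `pages_to_markdown`, the `paragraphs` key, and the shape of the output entries are hypothetical and only illustrate the idea of emitting an entry for every page.

```python
def pages_to_markdown(pages):
    """Return one markdown entry per page, including pages with no content.

    `pages` is a hypothetical list of dicts, each with a "paragraphs" list.
    """
    output = []
    page_no = 0
    for page in pages:
        paragraphs = page.get("paragraphs") or []
        if not paragraphs:
            # Empty page: still append an (empty) markdown object.
            output.append({"page_no": page_no, "md": ""})
        else:
            output.append({"page_no": page_no, "md": "\n\n".join(paragraphs)})
        # The page counter advances even when the page had no content.
        page_no += 1
    return output
```

Emitting an entry for empty pages keeps the output list aligned with the page indices of the source PDF.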
Xiaomeng Zhao
4adabc37ac Merge pull request #1138 from myhloli/dev
feat(pdf_parse): add line start flag detection and optimize line stop flag logic
2024-11-28 23:13:47 +08:00
myhloli
949d0867fb feat(pdf_parse): add line start flag detection and optimize line stop flag logic
- Add LINE_START_FLAG tuple to identify starting flags of a line
- Modify calculate_char_in_span function to handle both line start and stop flags
- Remove redundant char_is_line_stop_flag variable and simplify logic
- Improve line flag detection to enhance text extraction accuracy
2024-11-28 23:12:37 +08:00
Xiaomeng Zhao
a1cff28c74 Merge pull request #1137 from myhloli/dev
refactor(pdf_check): improve character detection using PyMuPDF
2024-11-28 22:36:30 +08:00
myhloli
ac88815620 refactor(pdf_check): improve character detection using PyMuPDF
- Replace pdfminer with PyMuPDF for character detection
- Implement new method detect_invalid_chars_by_pymupdf
- Update check_invalid_chars in pdf_meta_scan.py to use new method
- Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters
- Remove unused imports and update requirements.txt
2024-11-28 22:34:23 +08:00
Xiaomeng Zhao
b4dfa0f92f Merge pull request #1136 from myhloli/dev
refactor(ocr): improve text processing and span handling
2024-11-28 19:39:28 +08:00
myhloli
88c0854a65 refactor(ocr): improve text processing and span handling
- Remove unused language detection code
- Simplify text content processing logic
- Update span sorting and text extraction in pdf_parse_union_core_v2.py
2024-11-28 19:38:30 +08:00
Xiaomeng Zhao
c295587b9e Merge pull request #1135 from myhloli/dev
feat(pdf_parse): filter out skewed text lines
2024-11-28 18:53:06 +08:00
myhloli
37da8c44c4 feat(pdf_parse): filter out skewed text lines
- Add direction filtering to ignore highly skewed text lines
- Improve text extraction accuracy by focusing on non-skewed content
2024-11-28 18:52:18 +08:00
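The direction filtering mentioned in 37da8c44c4 could look something like the sketch below. It assumes each line carries a `dir` entry holding a (cos, sin) writing-direction vector, as in the line dictionaries returned by PyMuPDF's dict text extraction. The helper names and the 5-degree tolerance are assumptions for illustration; the commit does not say which rule or threshold it uses.

```python
import math


def is_skewed(direction, max_angle_deg=5.0):
    """Return True when a writing direction is far from horizontal or vertical.

    direction: (cos, sin) unit vector of the line's writing direction.
    """
    cos_a, sin_a = direction
    # Fold the angle into [0, 90) so all four axis directions map near 0.
    angle = math.degrees(math.atan2(sin_a, cos_a)) % 90.0
    deviation = min(angle, 90.0 - angle)
    return deviation > max_angle_deg


def drop_skewed_lines(lines, max_angle_deg=5.0):
    """Keep only lines whose direction is close to an axis."""
    return [ln for ln in lines if not is_skewed(ln["dir"], max_angle_deg)]
```

Rotated watermarks and diagonal stamps are typical examples of lines that such a filter would discard, while ordinary horizontal and vertical text passes through.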
Xiaomeng Zhao
5ecafbfa7d Merge pull request #1134 from myhloli/dev
refactor(para): improve language detection and block splitting
2024-11-28 18:07:23 +08:00
myhloli
f674b8d413 refactor(para): improve language detection and block splitting
- Add language detection for each block of text
- Implement language-specific logic for right margin alignment
- Introduce logging for debugging purposes
2024-11-28 18:06:17 +08:00
Xiaomeng Zhao
e22fa18b46 Merge pull request #1132 from myhloli/dev
fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text
2024-11-28 15:34:00 +08:00
myhloli
08392d63a0 fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text 2024-11-28 15:29:42 +08:00
Xiaomeng Zhao
f09c1cd284 Merge pull request #1130 from myhloli/dev
fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10
2024-11-28 15:27:52 +08:00
myhloli
9b4d77dcd4 fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10 2024-11-28 15:06:54 +08:00
Xiaomeng Zhao
89c7bd0419 Merge pull request #1121 from opendatalab/master
master -> dev
2024-11-27 18:33:05 +08:00
myhloli
52ef1bc782 Update version.py with new version 2024-11-27 10:31:09 +00:00
Xiaomeng Zhao
8afff9aee8 Merge pull request #1120 from opendatalab/release-0.10.2
Release 0.10.2
magic_pdf-0.10.2-released
2024-11-27 18:16:02 +08:00
Xiaomeng Zhao
7fdbb6e592 Merge pull request #1119 from myhloli/dev
refactor(pdf_parse_union_core_v2): optimize page processing time logging
2024-11-27 18:11:24 +08:00
myhloli
1d2eb70aa0 refactor(pdf_parse_union_core_v2): optimize page processing time logging 2024-11-27 18:08:27 +08:00
Xiaomeng Zhao
132c20899e Merge pull request #1117 from icecraft/feat/add_s3_read_write_example
Feat/add s3 read write example
2024-11-27 16:51:52 +08:00
xu rui
8152931756 fix: table format 2024-11-27 16:47:43 +08:00
Xiaomeng Zhao
b8fdab11d6 Merge pull request #1116 from myhloli/dev
docs(README): remove code examples and redirect to documentation
2024-11-27 16:38:05 +08:00
myhloli
6ae50fead8 docs(README): remove code examples and redirect to documentation
- Remove command line and API code examples from README files
- Add links to online documentation for command line and API usage
- Update content to point users to the new locations for detailed information
2024-11-27 16:36:06 +08:00
icecraft
a4b29f891b feat: add s3 example 2024-11-27 16:14:34 +08:00
Xiaomeng Zhao
2f0e5b2aa2 Merge pull request #1113 from myhloli/dev
refactor(ocr): remove unused functions and optimize OCR processing loop
2024-11-27 15:21:48 +08:00
myhloli
5f4410b469 refactor(ocr): remove unused functions and optimize OCR processing loop
- Remove unused function `calculate_angle_degrees`
- Refactor `calculate_is_angle` to use directly in OCR processing
- Eliminate unnecessary loop index `idx` in OCR processing loops
2024-11-27 15:20:09 +08:00
myhloli
a46b12e967 refactor(pre_proc): clean up OCR processing code
- Remove commented-out code in ocr_dict_merge.py
- Improve imports and code organization in ocr_detect_all_bboxes.py
- Delete unnecessary empty lines and improve code readability
2024-11-27 15:09:32 +08:00
Xiaomeng Zhao
a65d6b53bd Merge pull request #1112 from myhloli/dev
refactor(libs): remove unused imports and functions
2024-11-27 14:52:01 +08:00
myhloli
2db3c26374 refactor(libs): remove unused imports and functions
- Remove unused imports from commons.py
- Delete unused functions related to AWS and S3 operations
- Update import statements in other modules to reflect changes in commons.py
- Remove redundant code and improve code readability
2024-11-27 14:51:30 +08:00
Xiaomeng Zhao
b69311715f Merge pull request #1110 from myhloli/dev
test: json minify
2024-11-27 11:20:28 +08:00
myhloli
e937e011f8 test: json minify 2024-11-27 11:08:03 +08:00
Xiaomeng Zhao
65a9eedd3c Merge pull request #1104 from icecraft/fix/test_tools_ut
fix: test_tools unittest
2024-11-27 10:50:10 +08:00
Xiaomeng Zhao
b53409ea16 Merge pull request #1106 from myhloli/dev
perf(image_processing): reduce maximum image size for analysis
2024-11-26 22:36:19 +08:00
myhloli
b3644157e7 perf(image_processing): reduce maximum image size for analysis
- Decrease the maximum image size threshold from 9000 to 4500 pixels
- This change aims to improve performance and reduce memory usage
- Affects the custom model document analysis process
2024-11-26 22:35:35 +08:00
Xiaomeng Zhao
eb6d5dc87c Merge pull request #1105 from icecraft/fix/test_rag
fix: test_rag
2024-11-26 22:08:18 +08:00
icecraft
843d13829b fix: test_rag 2024-11-26 19:35:59 +08:00