MinerU

mirror of https://github.com/opendatalab/MinerU.git synced 2026-03-27 11:08:32 +07:00

Author	SHA1	Message	Date
Xiaomeng Zhao	9726403c69	Merge pull request #1152 from myhloli/dev fix(mkcontent): optimize paragraph text merging and language detection	2024-11-30 02:45:02 +08:00
myhloli	b3127233f0	refactor: modify bbox processing for layout separation - Remove overlap between bboxes for block separation - Sort bboxes by combined x and y coordinates for better layout handling - Comment out previous overlap removal function	2024-11-30 02:33:26 +08:00
myhloli	b80befe9cf	refactor(mkcontent): optimize paragraph text merging and language detection - Extract language detection to block level instead of line level - Improve logic for handling Chinese, Japanese, and Korean languages - Refactor code for better readability and performance - Optimize handling of hyphenated words at line ends	2024-11-30 02:16:38 +08:00
myhloli	ea35fa6b60	Merge remote-tracking branch 'origin/dev' into dev	2024-11-30 01:14:26 +08:00
myhloli	c8cabb3cf6	feat(ocr_mkcontent): add language detection for line spacing - Introduce language detection to determine line spacing based on language context - Implement different spacing rules for Chinese/Japanese/Korean and Western texts - Adjust span content handling based on detected language and span type	2024-11-30 01:14:12 +08:00
Xiaomeng Zhao	78c9014073	Merge pull request #1147 from opendatalab/master master->dev	2024-11-29 16:44:47 +08:00
myhloli	d19911f113	Update version.py with new version	2024-11-29 08:03:01 +00:00
Xiaomeng Zhao	b3fbedf055	Merge pull request #1143 from opendatalab/release-0.10.3 Release 0.10.3 magic_pdf-0.10.3-released	2024-11-29 16:01:36 +08:00
Xiaomeng Zhao	66bd0f8b69	Merge pull request #1141 from myhloli/dev refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment	2024-11-29 12:03:48 +08:00
myhloli	7f2f2c0f28	refactor(ocr): Fix the error of paddleocr failing to initialize in a multi-threaded environment	2024-11-29 12:02:48 +08:00
Xiaomeng Zhao	68c455309d	Merge pull request #1140 from myhloli/dev refactor(pdf_parse): adjust character-axis alignment algorithm	2024-11-29 12:00:36 +08:00
myhloli	d4345b6e39	refactor(pdf_parse): adjust character-axis alignment algorithm - Introduce `span_height_radio` parameter to calculate_char_in_span function - Replace fixed ratio with dynamic ratio for character and span axis alignment - Improve flexibility and accuracy of character placement within spans	2024-11-29 11:59:52 +08:00
Xiaomeng Zhao	086b48b7ae	Merge pull request #1139 from myhloli/dev fix(ocr_mkcontent): handle empty paragraphs on pages	2024-11-29 11:59:03 +08:00
myhloli	782e6571bc	fix(ocr_mkcontent): handle empty paragraphs on pages - Add empty paragraph handling for pages with no content - Append an empty markdown object when a page has no paragraphs - Increment page number even if no content is present	2024-11-29 11:58:34 +08:00
Xiaomeng Zhao	4adabc37ac	Merge pull request #1138 from myhloli/dev feat(pdf_parse): add line start flag detection and optimize line stop flag logic	2024-11-28 23:13:47 +08:00
myhloli	949d0867fb	feat(pdf_parse): add line start flag detection and optimize line stop flag logic - Add LINE_START_FLAG tuple to identify starting flags of a line - Modify calculate_char_in_span function to handle both line start and stop flags - Remove redundant char_is_line_stop_flag variable and simplify logic - Improve line flag detection to enhance text extraction accuracy	2024-11-28 23:12:37 +08:00
Xiaomeng Zhao	a1cff28c74	Merge pull request #1137 from myhloli/dev refactor(pdf_check): improve character detection using PyMuPDF	2024-11-28 22:36:30 +08:00
myhloli	ac88815620	refactor(pdf_check): improve character detection using PyMuPDF - Replace pdfminer with PyMuPDF for character detection - Implement new method detect_invalid_chars_by_pymupdf - Update check_invalid_chars in pdf_meta_scan.py to use new method - Add __replace_0xfffd function in pdf_parse_union_core_v2.py to handle special characters - Remove unused imports and update requirements.txt	2024-11-28 22:34:23 +08:00
Xiaomeng Zhao	b4dfa0f92f	Merge pull request #1136 from myhloli/dev refactor(ocr): improve text processing and span handling	2024-11-28 19:39:28 +08:00
myhloli	88c0854a65	refactor(ocr): improve text processing and span handling - Remove unused language detection code - Simplify text content processing logic - Update span sorting and text extraction in pdf_parse_union_core_v2.py	2024-11-28 19:38:30 +08:00
Xiaomeng Zhao	c295587b9e	Merge pull request #1135 from myhloli/dev feat(pdf_parse): filter out skewed text lines	2024-11-28 18:53:06 +08:00
myhloli	37da8c44c4	feat(pdf_parse): filter out skewed text lines - Add direction filtering to ignore highly skewed text lines - Improve text extraction accuracy by focusing on non-skewed content	2024-11-28 18:52:18 +08:00
Xiaomeng Zhao	5ecafbfa7d	Merge pull request #1134 from myhloli/dev refactor(para): improve language detection and block splitting	2024-11-28 18:07:23 +08:00
myhloli	f674b8d413	refactor(para): improve language detection and block splitting - Add language detection for each block of text - Implement language-specific logic for right margin alignment - Introduce logging for debugging purposes	2024-11-28 18:06:17 +08:00
Xiaomeng Zhao	e22fa18b46	Merge pull request #1132 from myhloli/dev fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text	2024-11-28 15:34:00 +08:00
myhloli	08392d63a0	fix(Hybrid OCR): Enable Hybrid OCR for Empty Spans That Contain a Certain Number of Placeholders but No Actual Text	2024-11-28 15:29:42 +08:00
Xiaomeng Zhao	f09c1cd284	Merge pull request #1130 from myhloli/dev fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10	2024-11-28 15:27:52 +08:00
myhloli	9b4d77dcd4	fix(lite_model): Adapt lite Mode to the Hybrid OCR Mode in Version 0.10	2024-11-28 15:06:54 +08:00
Xiaomeng Zhao	89c7bd0419	Merge pull request #1121 from opendatalab/master master -> dev	2024-11-27 18:33:05 +08:00
myhloli	52ef1bc782	Update version.py with new version	2024-11-27 10:31:09 +00:00
Xiaomeng Zhao	8afff9aee8	Merge pull request #1120 from opendatalab/release-0.10.2 Release 0.10.2 magic_pdf-0.10.2-released	2024-11-27 18:16:02 +08:00
Xiaomeng Zhao	7fdbb6e592	Merge pull request #1119 from myhloli/dev refactor(pdf_parse_union_core_v2): optimize page processing time logging	2024-11-27 18:11:24 +08:00
myhloli	1d2eb70aa0	refactor(pdf_parse_union_core_v2): optimize page processing time logging	2024-11-27 18:08:27 +08:00
Xiaomeng Zhao	132c20899e	Merge pull request #1117 from icecraft/feat/add_s3_read_write_example Feat/add s3 read write example	2024-11-27 16:51:52 +08:00
xu rui	8152931756	fix: table format	2024-11-27 16:47:43 +08:00
Xiaomeng Zhao	b8fdab11d6	Merge pull request #1116 from myhloli/dev docs(README): remove code examples and redirect to documentation	2024-11-27 16:38:05 +08:00
myhloli	6ae50fead8	docs(README): remove code examples and redirect to documentation - Remove command line and API code examples from README files - Add links to online documentation for command line and API usage - Update content to point users to the new locations for detailed information	2024-11-27 16:36:06 +08:00
icecraft	a4b29f891b	feat: add s3 example	2024-11-27 16:14:34 +08:00
Xiaomeng Zhao	2f0e5b2aa2	Merge pull request #1113 from myhloli/dev refactor(ocr): remove unused functions and optimize OCR processing loop	2024-11-27 15:21:48 +08:00
myhloli	5f4410b469	refactor(ocr): remove unused functions and optimize OCR processing loop - Remove unused function `calculate_angle_degrees` - Refactor `calculate_is_angle` to use directly in OCR processing - Eliminate unnecessary loop index `idx` in OCR processing loops	2024-11-27 15:20:09 +08:00
myhloli	a46b12e967	refactor(pre_proc): clean up OCR processing code - Remove commented-out code in ocr_dict_merge.py - Improve imports and code organization in ocr_detect_all_bboxes.py - Delete unnecessary empty lines and improve code readability	2024-11-27 15:09:32 +08:00
Xiaomeng Zhao	a65d6b53bd	Merge pull request #1112 from myhloli/dev refactor(libs): remove unused imports and functions	2024-11-27 14:52:01 +08:00
myhloli	2db3c26374	refactor(libs): remove unused imports and functions - Remove unused imports from commons.py - Delete unused functions related to AWS and S3 operations - Update import statements in other modules to reflect changes in commons.py - Remove redundant code and improve code readability	2024-11-27 14:51:30 +08:00
Xiaomeng Zhao	b69311715f	Merge pull request #1110 from myhloli/dev test: json minify	2024-11-27 11:20:28 +08:00
myhloli	e937e011f8	test: json minify	2024-11-27 11:08:03 +08:00
Xiaomeng Zhao	65a9eedd3c	Merge pull request #1104 from icecraft/fix/test_tools_ut fix: test_tools unittest	2024-11-27 10:50:10 +08:00
Xiaomeng Zhao	b53409ea16	Merge pull request #1106 from myhloli/dev perf(image_processing): reduce maximum image size for analysis	2024-11-26 22:36:19 +08:00
myhloli	b3644157e7	perf(image_processing): reduce maximum image size for analysis - Decrease the maximum image size threshold from 9000 to 4500 pixels - This change aims to improve performance and reduce memory usage - Affects the custom model document analysis process	2024-11-26 22:35:35 +08:00
Xiaomeng Zhao	eb6d5dc87c	Merge pull request #1105 from icecraft/fix/test_rag fix: test_rag	2024-11-26 22:08:18 +08:00
icecraft	843d13829b	fix: test_rag	2024-11-26 19:35:59 +08:00
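
Several of the commits above (b80befe9cf, c8cabb3cf6) describe merging the lines of a paragraph with language-dependent spacing: detect the language once per block, join Chinese/Japanese/Korean lines without a space, join Western lines with a space, and rejoin words hyphenated at a line end. The sketch below illustrates that idea only. The function names `is_cjk_block` and `merge_lines`, the Unicode-range heuristic, and the 0.5 threshold are all assumptions made for illustration; they are not taken from the MinerU code, which may use a dedicated language-detection step and different rules.

```python
import re

# Hypothetical heuristic: Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables.
# This stands in for a real language detector and is an assumption of this sketch.
_CJK_RE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def is_cjk_block(lines):
    """Decide once per block (not per line) whether the text is mostly CJK."""
    chars = [ch for ch in "".join(lines) if not ch.isspace()]
    if not chars:
        return False
    cjk_count = sum(1 for ch in chars if _CJK_RE.match(ch))
    return cjk_count / len(chars) > 0.5  # threshold chosen arbitrarily


def merge_lines(lines):
    """Merge the lines of one text block into a single paragraph string."""
    cjk = is_cjk_block(lines)
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        if not out:
            out = line
        elif cjk:
            # CJK text: no space between consecutive lines.
            out += line
        elif (out.endswith("-") and len(out) >= 2
              and out[-2].isalpha() and line[0].islower()):
            # Western text: a word hyphenated at the line end is rejoined.
            out = out[:-1] + line
        else:
            # Western text: lines are separated by a single space.
            out += " " + line
    return out
```

For example, `merge_lines(["This is a hyphen-", "ated word at the", "line end."])` yields one sentence with "hyphenated" rejoined, while `merge_lines(["你好", "世界"])` concatenates the two lines directly. Note that this simple rule also removes hyphens that belong to genuine compound words split across lines, which is one reason a real implementation may need more context.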


1900 Commits