diff --git a/README.md b/README.md
index c0006136..d80193ef 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@
 # MinerU
+
 ## Introduction
 MinerU is a one-stop, open-source data extraction tool, primarily includes the following features:
@@ -24,8 +25,10 @@ MinerU is a one-stop, open-source data extraction tool, primarily includes the f
 - [Magic-PDF](#Magic-PDF) PDF Document Extraction
 - [Magic-Doc](#Magic-Doc) Webpage & E-book Extraction
+
 # Magic-PDF
+
 ## Introduction
 Magic-PDF is a tool designed to convert PDF documents into Markdown format, capable of processing files stored locally or on object storage supporting S3 protocol.
@@ -51,6 +54,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 ![Project Panorama](docs/images/project_panorama_en.png)
+
 ## Flowchart
 ![Flowchart](docs/images/flowchart_en.png)
@@ -62,6 +66,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark) - An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios
+
 ## Getting Started
 ### Requirements
@@ -119,18 +124,21 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 Demo can be referred to [demo.py](demo/demo.py)
+
 ## All Thanks To Our Contributors
-
+
+
 ## License Information
 [LICENSE.md](LICENSE.md)
 The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
+
 ## Acknowledgments
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
@@ -139,6 +147,7 @@ The project currently leverages PyMuPDF to deliver advanced functionalities; how
 # Magic-Doc
+
 ## Introduction
 Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
@@ -166,6 +175,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
+
 ## Project Repository
 - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 333d3d4e..3143143a 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -17,6 +17,7 @@
 # MinerU
+
 ## 简介
 MinerU 是一款一站式开源数据提取工具,主要包含以下功能:
@@ -26,6 +27,7 @@ MinerU 是一款一站式开源数据提取工具,主要包含以下功能:
 # Magic-PDF
+
 ## 简介
 Magic-PDF 是一款将 PDF 转化为 markdown 格式的工具。支持转换本地文档或者位于支持S3协议对象存储上的文件。
@@ -121,12 +123,20 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 详细实现可参考 [demo.py](demo/demo.py)
+## 感谢我们的贡献者
+
+
+
+
+
+
 ## 版权说明
 [LICENSE.md](LICENSE.md)
 本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
+
 ## 鸣谢
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
@@ -134,6 +144,7 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 # Magic-Doc
+
 ## 简介
 Magic-Doc 是一款支持将网页或多格式电子书转换为 markdown 格式的工具。
@@ -161,6 +172,7 @@ https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d7
+
 ## 项目仓库
 - [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
diff --git a/magic_pdf/libs/language.py b/magic_pdf/libs/language.py
index c49b5e4c..29cdc9ea 100644
--- a/magic_pdf/libs/language.py
+++ b/magic_pdf/libs/language.py
@@ -1,13 +1,6 @@
-import regex
 import unicodedata
 from fast_langdetect import detect_langs
-RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")
-
-
-def remove_bad_chars(text):
-    return RE_BAD_CHARS.sub("", text)
-
 def detect_lang(text: str) -> str:
     if len(text) == 0: