diff --git a/README.md b/README.md
index 149514c4..0c0edfa4 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
-[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
-[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
-[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
-[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
-[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
+[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
+[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
+[![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
+[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
+[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
 [English](README.md) | [简体中文](README_zh-CN.md)
@@ -15,6 +15,15 @@
+# MinerU
+
+## Introduction
+
+MinerU is a one-stop, open-source data extraction tool, primarily includes the following features:
+
+- PDF Document Extraction [Magic-PDF](#Magic-PDF)
+- Webpage & E-book Extraction [Magic-Doc](#Magic-Doc)
+
 # Magic-PDF
 
 ## Introduction
@@ -49,17 +58,20 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 ### Submodule Repositories
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+  A Comprehensive Toolkit for High-Quality PDF Content Extraction
 - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
+  An end-to-end PDF document comprehension evaluation suite designed for large-scale model data scenarios
 
 ## Getting Started
 
 ### Requirements
 
-- Python 3.9 or newer
+- Python >= 3.9
 
 ### Usage Instructions
 
 #### 1. Install Magic-PDF
+
 ```bash
 pip install magic-pdf
 ```
@@ -67,11 +79,14 @@ pip install magic-pdf
 #### 2. Usage via Command Line
 
 ###### simple
+
 ```bash
 cp magic-pdf.template.json to ~/magic-pdf.json
 magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
 ```
+
 ###### more
+
 ```bash
 magic-pdf --help
 ```
@@ -112,9 +127,46 @@ Demo can be referred to [demo.py](demo/demo.py)
 
 ## License Information
 
-See [LICENSE.md](LICENSE.md) for details.
+[LICENSE.md](LICENSE.md)
+
+The project currently leverages PyMuPDF to deliver advanced functionalities; however, its adherence to the AGPL license may impose limitations on certain use cases. In upcoming iterations, we intend to explore and transition to a more permissively licensed PDF processing library to enhance user-friendliness and flexibility.
 
 ## Acknowledgments
 
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
+
+
+# Magic-Doc
+
+## Introduction
+
+Magic-Doc is a tool designed to convert web pages or multi-format e-books into markdown format.
+
+Key Features Include:
+
+- Web Page Extraction
+  - Cross-modal precise parsing of text, images, tables, and formula information.
+
+- E-Book Document Extraction
+  - Supports various document formats including epub, mobi, with full adaptation for text and images.
+
+- Language Type Identification
+  - Accurate recognition of 176 languages.
+
+https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
+
+
+
+https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
+
+
+
+https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
+
+
+
+## Project Repository
+
+- [Magic-Doc](https://github.com/magicpdf/Magic-Doc)
+  Outstanding Webpage and E-book Extraction Tool
diff --git a/README_zh-CN.md b/README_zh-CN.md
index d58a4247..db5d9aaf 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -1,11 +1,11 @@
-[![stars](https://img.shields.io/github/stars/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
-[![forks](https://img.shields.io/github/forks/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF)
-[![license](https://img.shields.io/github/license/magicpdf/Magic-PDF.svg)](https://github.com/magicpdf/Magic-PDF/tree/main/LICENSE)
-[![issue resolution](https://img.shields.io/github/issues-closed-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
-[![open issues](https://img.shields.io/github/issues-raw/magicpdf/Magic-PDF)](https://github.com/magicpdf/Magic-PDF/issues)
+[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
+[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
+[![license](https://img.shields.io/github/license/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU/tree/main/LICENSE)
+[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
+[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
 [English](README.md) | [简体中文](README_zh-CN.md)
@@ -21,8 +21,8 @@
 
 MinerU 是一款一站式开源数据提取工具,主要包含以下功能:
 
-- PDF文档提取 (Magic-PDF)
-- 网页与电子书提取 (Magic-Doc)
+- PDF文档提取 [Magic-PDF](#Magic-PDF)
+- 网页与电子书提取 [Magic-Doc](#Magic-Doc)
 
 # Magic-PDF
 
@@ -58,7 +58,7 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 ### 子模块仓库
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
-  领先的文档分析模型
+  高质量的PDF内容提取工具包
 - [Miner-PDF-Benchmark](https://github.com/opendatalab/Miner-PDF-Benchmark)
   端到端的PDF文档理解评估套件,专为大规模模型数据场景而设计
 
@@ -67,11 +67,12 @@ https://github.com/magicpdf/Magic-PDF/assets/11393164/618937cb-dc6a-4646-b433-e3
 
 ### 配置要求
 
-python 3.9+
+python >= 3.9
 
 ### 使用说明
 
 #### 1. 安装Magic-PDF
+
 ```bash
 pip install magic-pdf
 ```
@@ -79,11 +80,14 @@ pip install magic-pdf
 #### 2. 通过命令行使用
 
 ###### 直接使用
+
 ```bash
 cp magic-pdf.template.json to ~/magic-pdf.json
 magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
 ```
+
 ###### 更多用法
+
 ```bash
 magic-pdf --help
 ```
@@ -121,10 +125,13 @@ md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
 
 [LICENSE.md](LICENSE.md)
 
+本项目目前采用PyMuPDF以实现高级功能,但因其遵循AGPL协议,可能对某些使用场景构成限制。未来版本迭代中,我们计划探索并替换为许可条款更为宽松的PDF处理库,以提升用户友好度及灵活性。
+
 ## 鸣谢
 
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
+
 # Magic-Doc
 
 ## 简介