
43 Commits

Author SHA1 Message Date
drunkpig
01a31a322d Merge branch 'dev' into realese-0.8.0 2024-09-10 19:13:30 +08:00
drunkpig
c54f20460e Merge branch 'master' into realese-0.8.0 2024-09-10 19:08:31 +08:00
sfk
4c64d3f7b0 Update README_zh-CN.md
update changelog
2024-09-09 21:00:46 +08:00
drunkpig
0c13af72c1 Update README_zh-CN.md
docs: remove RAG related release notes
2024-09-09 20:53:32 +08:00
drunkpig
7fe1aabd62 Update README.md
docs: remove RAG related release notes
2024-09-09 20:52:25 +08:00
drunkpig
c40b90b694 Update README_zh-CN.md
update rag api image
2024-09-09 20:37:28 +08:00
drunkpig
25299df8b1 add rag data api 2024-09-09 20:36:23 +08:00
drunkpig
72a1819ff4 Update README_zh-CN.md 2024-09-09 20:34:57 +08:00
sfk
f75562cd89 Update README.md 2024-09-09 20:25:31 +08:00
sfk
f97744b1fc Update README.md 2024-09-09 20:25:15 +08:00
sfk
8c2150e036 Update README.md 2024-09-09 20:24:56 +08:00
sfk
4997927214 Update README_zh-CN.md 2024-09-09 20:24:32 +08:00
Kaiwen Liu
3cce152e52 fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#573)
* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

2024-09-09 14:24:14 +08:00
Xiaomeng Zhao
b6633cd6b3 Update FAQ_zh_cn.md 2024-09-06 17:07:31 +08:00
Xiaomeng Zhao
f3cd18ae37 Update FAQ_en_us.md 2024-09-06 17:07:13 +08:00
Xiaomeng Zhao
e890642af3 Update FAQ_zh_cn.md 2024-09-06 17:05:35 +08:00
Xiaomeng Zhao
9aeb59c1d4 Update README_zh-CN.md 2024-09-06 16:51:35 +08:00
Xiaomeng Zhao
3228067bbf Update README.md 2024-09-06 16:51:14 +08:00
Xiaomeng Zhao
0f3ae90922 Update README_zh-CN.md 2024-09-06 16:23:07 +08:00
sfk
610fde1b22 Update README_zh-CN.md 2024-09-05 17:35:44 +08:00
sfk
66c945c13a Update README.md 2024-09-05 17:34:54 +08:00
Xiaomeng Zhao
cc543669e8 Create download_models_hf.py 2024-09-04 20:44:53 +08:00
Xiaomeng Zhao
732b9cd845 Update README_zh-CN.md 2024-09-04 20:28:36 +08:00
Xiaomeng Zhao
a1c0f7dedb Update README.md 2024-09-04 20:28:00 +08:00
sfk
dfc9e5aa35 Update README.md 2024-09-04 19:04:00 +08:00
sfk
ea202eea32 Update README.md 2024-09-04 19:00:30 +08:00
sfk
7d88d8a959 Create README.md 2024-09-04 18:59:36 +08:00
sfk
595e216d83 Update README_zh-CN.md 2024-09-04 18:52:02 +08:00
sfk
478418fc48 Rename README.md to README_zh-CN.md 2024-09-04 18:51:40 +08:00
sfk
7940c0fd3b Rename readme.md to README.md 2024-09-04 18:48:25 +08:00
sfk
0dd6f17c20 Create readme.md 2024-09-04 18:47:54 +08:00
sfk
4bb8fc9c68 Rename README.md to README_zh-CN.md 2024-09-04 18:47:29 +08:00
sfk
cb81e28c3b Update README_zh-CN.md 2024-09-04 17:32:57 +08:00
sfk
6a088c77bc Update README.md 2024-09-04 17:31:21 +08:00
sfk
f4f4422813 Update README.md 2024-09-04 17:21:36 +08:00
sfk
6b06218159 Update README.md 2024-09-04 17:20:13 +08:00
sfk
25af50698e Merge pull request #544 from opendatalab/realease-0.8.0-readme
Update README_zh-CN.md
2024-09-04 16:28:09 +08:00
sfk
44ce716245 Update README.md 2024-09-04 16:27:30 +08:00
sfk
835ae119cb Update README_zh-CN.md
add HF, modelscope, colab url
2024-09-04 16:23:28 +08:00
Xiaomeng Zhao
88d847ab62 Update README_zh-CN.md 2024-09-04 15:14:42 +08:00
Xiaomeng Zhao
a31b7636b6 Update README.md 2024-09-04 15:12:58 +08:00
sfk
7ee2f4b77b Hotfix readme 0.7.1 (#528)
* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md
2024-09-02 20:47:53 +08:00
yyy
1dc915a4a9 release: release 0.7.1 version (#526)
* Update README_zh-CN.md (#404) (#409)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#493)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)

Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------

Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#508)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)

Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------

Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

* feat<table model>: add tablemaster with paddleocr to detect and recognize table (#511)

* Update cla.yml

* Update bug_report.yml

* Update README_zh-CN.md (#404)

correct FAQ url

* Update README_zh-CN.md (#404) (#409) (#410)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* Update FAQ_zh_cn.md

add new issue

* Update FAQ_en_us.md

* Update README_Windows_CUDA_Acceleration_zh_CN.md

* Update README_zh-CN.md

* @Thepathakarpit has signed the CLA in opendatalab/MinerU#418

* Update cla.yml

* feat: add tablemaster_paddle (#463)

* Update README_zh-CN.md (#404) (#409)

correct FAQ url

Co-authored-by: sfk <18810651050@163.com>

* add dockerfile (#189)

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>

* Update cla.yml

* Update cla.yml

---------

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>

* <fix>(para_split_v2): index out of range issue of span_text first char (#396)

Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>

* @Matthijz98 has signed the CLA in opendatalab/MinerU#467

* Create download_models.py

* Create requirements-docker.txt

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* @strongerfly has signed the CLA in opendatalab/MinerU#487

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* Update cla.yml

* Delete .github/workflows/gpu-ci.yml

* Update Huggingface and ModelScope links to organization account

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

* feat<table model>: add tablemaster with paddleocr to detect and recognize table

---------

Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: yyy <102640628+dt-yy@users.noreply.github.com>
Co-authored-by: wangbinDL <wangbin_research@163.com>

---------

Co-authored-by: drunkpig <60862764+drunkpig@users.noreply.github.com>
Co-authored-by: sfk <18810651050@163.com>
Co-authored-by: Aoyang Fang <222010547@link.cuhk.edu.cn>
Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: Kaiwen Liu <lkw_buaa@163.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: liukaiwen <liukaiwen@pjlab.org.cn>
Co-authored-by: wangbinDL <wangbin_research@163.com>
2024-09-02 20:23:46 +08:00
10 changed files with 108 additions and 40 deletions

View File

@@ -40,6 +40,7 @@
</div>
# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with Dockerfile, and launching demos on Huggingface and Modelscope.
- 2024/08/30: Version 0.7.1 released, added paddle tablemaster table recognition option
- 2024/08/09: Version 0.7.0b1 released, simplified installation process, added table recognition functionality
- 2024/08/01: Version 0.6.2b1 released, optimized dependency conflict issues and installation documentation
@@ -353,7 +354,6 @@ TODO
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
# FAQ
[FAQ in Chinese](docs/FAQ_zh_cn.md)

View File

@@ -40,6 +40,7 @@
</div>
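# Changelog
- 2024/09/09: Version 0.8.0 released, supporting fast deployment with a Dockerfile and launching demos on huggingface and modelscope
- 2024/08/30: Version 0.7.1 released, integrating paddle tablemaster table recognition
- 2024/08/09: Version 0.7.0b1 released, simplifying the installation steps for better usability and adding table recognition
- 2024/08/01: Version 0.6.2b1 released, optimizing dependency conflict issues and installation documentation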
@@ -356,8 +357,8 @@ TODO
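- Forcing OCR on gives better results on some formula-dense PDFs
- If you are processing PDFs with a large number of formulas, it is strongly recommended to enable OCR. When pymuPDF is used to extract text, text lines can overlap, which makes formula insertion positions inaccurate.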
# FAQ
# FAQ
[FAQ](docs/FAQ_zh_cn.md)

View File

@@ -44,3 +44,11 @@ pip uninstall fairscale
pip install fairscale
```
Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA version used by Paddle needs to be upgraded:
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558

View File

@@ -41,3 +41,11 @@ pip uninstall fairscale
pip install fairscale
```
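Reference: https://github.com/opendatalab/MinerU/issues/411
### 6. On some newer devices such as the H100, the text parsed during OCR with CUDA acceleration is garbled.
CUDA 11 has poor compatibility with newer graphics cards, so the CUDA version used by paddle needs to be upgraded: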
```bash
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
```
Reference: https://github.com/opendatalab/MinerU/issues/558

View File

@@ -0,0 +1,3 @@
from huggingface_hub import snapshot_download
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
print(f"model dir is: {model_dir}/models")
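The three-line script above downloads the `opendatalab/PDF-Extract-Kit` snapshot and prints its `models` subdirectory. A slightly more reusable variant could look like the sketch below. The helper names `models_path` and `download_models` are hypothetical and not part of this commit; only the `snapshot_download` call with a repo id is taken from the script.

```python
import os


def models_path(snapshot_dir: str) -> str:
    """Return the `models` subdirectory of a downloaded snapshot directory."""
    return os.path.join(snapshot_dir, "models")


def download_models(repo_id: str = "opendatalab/PDF-Extract-Kit") -> str:
    """Download the snapshot (needs network access) and return its models path."""
    # Imported lazily so models_path() works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return models_path(snapshot_download(repo_id))
```

Calling `download_models()` needs network access and the `huggingface_hub` package; `models_path` on its own is a pure path helper.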

View File

@@ -230,6 +230,7 @@ class CustomPEKModel:
)
# initialize OCR
if self.apply_ocr:
# self.ocr_model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=0.3)
self.ocr_model = atom_model_manager.get_atom_model(
atom_model_name=AtomicModel.OCR,
@@ -249,6 +250,7 @@ class CustomPEKModel:
table_max_time=self.table_max_time,
device=self.device
)
logger.info('DocAnalysis init done!')
def __call__(self, image):
@@ -389,6 +391,7 @@ class CustomPEKModel:
latex_code = self.table_model.image2latex(new_image)[0]
else:
html_code = self.table_model.img2html(new_image)
run_time = time.time() - single_table_start_time
logger.info(f"------------table recognition processing ends within {run_time}s-----")
if run_time > self.table_max_time:
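The hunk above measures how long table recognition took and compares it with `table_max_time`. The same post-hoc timing pattern can be sketched in isolation as follows. `timed_call` is a hypothetical helper, not part of `CustomPEKModel`, and since the hunk does not show what the original does when the limit is exceeded, the sketch only reports the fact.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], max_seconds: float, *args: Any) -> Tuple[Any, float, bool]:
    """Run fn(*args); return (result, elapsed seconds, whether the budget was exceeded)."""
    start = time.time()
    result = fn(*args)
    run_time = time.time() - start
    # Post-hoc check: the call is never interrupted, only measured afterwards.
    return result, run_time, run_time > max_seconds
```

Because the check runs only after the call returns, a slow call still runs to completion; the flag just lets the caller decide what to do with the result.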

View File

@@ -5,3 +5,4 @@
- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README.md): Build a web app based on gradio

View File

@@ -4,3 +4,4 @@
- [llama_index_rag](./llama_index_rag/README_zh-CN.md): Build a lightweight RAG system based on llama_index
- [gradio_app](./gradio_app/README_zh-CN.md): Web app based on Gradio

View File

@@ -1,20 +1,65 @@
<details open="open">
<summary><h2 style="display: inline-block">Table of Contents</h2></summary>
<li><a href="#介绍">Introduction</a></li>
<li><a href="#安装">Installation</a></li>
<li><a href="#示例">Example</a></li>
<li><a href="#开发">Development</a></li>
</ol>
</details>
## Introduction
`MinerU` provides a data `API` that lets users import data into a `RAG` system. This project uses `Tongyi Qianwen` to show how to build a lightweight `RAG` system.
<p align="center">
<img src="rag_data_api.png" width="300px" style="vertical-align:middle;">
</p>
## Installation
MinerU
```bash
git clone https://github.com/opendatalab/MinerU.git
cd MinerU
conda create -n MinerU python=3.10
conda activate MinerU
pip install .[full] --extra-index-url https://wheels.myhloli.com
```
Environment requirements
```text
NVIDIA A100 80GB,
Centos 7 3.10.0-957.el7.x86_64
Client: Docker Engine - Community
Version: 24.0.5
API version: 1.43
Go version: go1.20.6
Git commit: ced0996
Built: Fri Jul 21 20:39:02 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 24.0.5
API version: 1.43 (minimum version 1.12)
Go version: go1.20.6
Git commit: a61e2b4
Built: Fri Jul 21 20:38:05 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.25
GitCommit: d8f198a4ed8892c764191ef7b3b06d8a2eeb5c7f
runc:
Version: 1.1.10
GitCommit: v1.1.10-0-g18a0cb0
docker-init:
Version: 0.19.0
GitCommit: de40ad0
```
Please refer to the [documentation](../../README_zh-CN.md) to install MinerU
Third-party software
```bash
# install
pip install modelscope==1.14.0
pip install llama-index-vector-stores-elasticsearch==0.2.0
pip install llama-index-embeddings-dashscope==0.2.0
pip install llama-index-core==0.10.68
@@ -26,39 +71,13 @@ pip install accelerate==0.33.0
pip uninstall transformer-engine
```
## Environment configuration
```
export DASHSCOPE_API_KEY={some_key}
export ES_USER={some_es_user}
export ES_PASSWORD={some_es_password}
export ES_URL=http://{es_url}:9200
```
To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
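A script that depends on these four variables can check them up front instead of failing later with a less obvious error. The following is a minimal sketch of such a check; `REQUIRED_VARS` and `load_settings` are hypothetical names, and nothing here is taken from the actual `data_ingestion.py` or `query.py`.

```python
import os
from typing import Dict, Mapping

# The four variables exported in the environment configuration block.
REQUIRED_VARS = ("DASHSCOPE_API_KEY", "ES_USER", "ES_PASSWORD", "ES_URL")


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing the mapping as a parameter keeps the check easy to exercise with a plain dictionary instead of the real process environment.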
## Usage
### Import data
```bash
python data_ingestion.py -p some.pdf  # load data from a single PDF
or
python data_ingestion.py -p /opt/data/some_pdf_directory/  # load data from all PDFs under {some_pdf_directory}
```
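The `-p` option accepts either a single PDF or a directory of PDFs. One way to resolve such an argument into a list of files is sketched below. `collect_pdfs` is a hypothetical helper, and the non-recursive directory scan is an assumption; the real `data_ingestion.py` is not shown in this diff and may behave differently.

```python
from pathlib import Path
from typing import List


def collect_pdfs(path: str) -> List[Path]:
    """Resolve a file-or-directory argument into a sorted list of PDF files."""
    p = Path(path)
    if p.is_dir():
        # Assumption: only the top level of the directory is scanned.
        return sorted(f for f in p.iterdir() if f.is_file() and f.suffix.lower() == ".pdf")
    if p.is_file() and p.suffix.lower() == ".pdf":
        return [p]
    # Nonexistent paths and non-PDF files yield nothing.
    return []
```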
### Query
```bash
python query.py --question '{the_question_you_want_to_ask}'
```
## Example
````bash
# start the ES service
cd projects/llama_index_rag
docker compose up -d
or
@@ -67,17 +86,41 @@ docker-compose up -d
# configure environment variables
export ES_USER=elastic
export ES_PASSWORD=llama_index
export ES_URL=http://127.0.0.1:9200
export DASHSCOPE_API_KEY={some_key}
To obtain a DASHSCOPE_API_KEY, see the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
# Query before importing any data; this returns Tongyi Qianwen's default answer
python query.py -q 'how about the rights of men'
## outputs
question: how about the rights of men
answer: The topic of men's rights often refers to discussions around legal, social, and political issues that affect men specifically or differently from women. Movements related to men's rights advocate for addressing areas where men face discrimination or unique challenges, such as:
Child Custody: Ensuring that men have equal opportunities for custody of their children following divorce or separation.
Domestic Violence: Recognizing that men can also be victims of domestic abuse and ensuring they have access to support services.
Mental Health and Suicide Rates: Addressing the higher rates of suicide among men and providing mental health resources.
Military Conscription: In some countries, only men are required to register for military service, which is seen as a gender-based obligation.
Workplace Safety: Historically, more men than women have been employed in high-risk occupations, leading to higher workplace injury and death rates.
Parental Leave: Advocating for paternity leave policies that allow men to take time off work for family care.
Men's rights activism often intersects with broader discussions on gender equality and aims to promote fairness and equity across genders. It's important to note that while advocating for these issues, it should be done in a way that does not detract from or oppose the goals of gender equality and the rights of other groups. The focus should be on creating a fair society where everyone has equal opportunities and protections under the law.
# import data
python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
python data_ingestion.py -p example/data/
or
python data_ingestion.py -p example/data/declaration_of_the_rights_of_man_1789.pdf
# query
# Query after importing data; the Tongyi Qianwen model answers from the RAG system's retrieval results combined with the context.
python query.py -q 'how about the rights of men'
## outputs

Binary file not shown. Size: 9.3 KiB