---
title: Creating Knowledge Base
---

To create a knowledge base and upload documents, follow these steps:

1. Create a knowledge base by uploading local files, importing online data, or creating an empty knowledge base.

<Card title="Import Text Data" icon="link" href="/en/guides/knowledge-base/create-knowledge-and-upload-documents/import-content-data">
Choose to create a knowledge base by uploading local files or importing online data
</Card>

2. Choose a chunking mode. This stage involves content preprocessing and data structuring, where long texts are divided into multiple content chunks. You can preview the text chunking results at this stage.

<Card title="Chunking and Cleaning Text" icon="link" href="/en/guides/knowledge-base/create-knowledge-and-upload-documents/chunking-and-cleaning-text">
Learn how to set up text chunking and cleaning rules
</Card>

3. Set up the indexing method and retrieval settings. After receiving a user query, the knowledge base searches the existing documents using the preset retrieval method and extracts the most relevant chunks, which the language model then uses to generate an answer.

<Card title="Setting up Indexing Methods" icon="link" href="/en/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods">
Learn how to configure retrieval methods and related settings
</Card>

4. Wait for the chunks to be embedded.
5. Complete the upload, then integrate the knowledge base into your application. See [Integrating Knowledge Base in Applications](/en/guides/knowledge-base/integrate-knowledge-within-application) to build an LLM application that answers questions based on the knowledge base. To modify or manage the knowledge base, see [Managing Knowledge Base and Documents](/en/guides/knowledge-base/knowledge-and-documents-maintenance).

![Complete the creation of knowledge base](https://assets-docs.dify.ai/2024/12/a3362a1cd384cb2b539c9858de555518.png)
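
To make the chunking step more concrete, here is a minimal Python sketch of delimiter-based chunking. It is an illustration only, not Dify's implementation: the function name `chunk_text`, its `delimiter` and `max_length` parameters, and the fixed-size fallback for over-long segments are all assumptions made for this example.

```python
import re


def chunk_text(text: str, delimiter: str = r"\n", max_length: int = 500) -> list[str]:
    """Split text on a regex delimiter, then cap each chunk at max_length characters.

    Hypothetical helper for illustration; not part of Dify.
    """
    chunks = []
    for segment in re.split(delimiter, text):
        segment = segment.strip()
        if not segment:
            continue  # skip empty segments left by consecutive delimiters
        # Break over-long segments into fixed-size pieces.
        for start in range(0, len(segment), max_length):
            chunks.append(segment[start:start + max_length])
    return chunks


sample = "First paragraph.\nSecond paragraph is a little longer.\n\nThird."
chunks = chunk_text(sample, max_length=20)
for chunk in chunks:
    print(repr(chunk))
```

Previewing chunks this way mirrors what the chunk preview in step 2 is for: checking that the delimiter and length limit split the text at sensible boundaries before the chunks are embedded.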

### Reference Reading

#### ETL

Production-grade RAG applications need to preprocess and clean data from multiple sources, a process known as ETL (_extract, transform, load_), to achieve better recall. To strengthen preprocessing of unstructured and semi-structured data, Dify supports two ETL solutions: **Dify ETL** and [**Unstructured ETL**](https://unstructured.io/). Unstructured extracts and transforms your data into clean data for subsequent steps. The available ETL solution depends on the Dify version:

* SaaS version: The ETL solution cannot be changed; Unstructured ETL is used.
* Community version: The ETL solution is configurable; Dify ETL is used by default, and Unstructured ETL can be enabled through [environment variables](/en/getting-started/install-self-hosted/environments#knowledge-base-configuration).

Differences in supported file parsing formats:

| Dify ETL | Unstructured ETL |
| --- | --- |
| txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv | txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub |

The ETL solutions also differ in their file extraction results. To learn more about how Unstructured ETL processes data, see the [official documentation](https://docs.unstructured.io/open-source/core-functionality/partitioning).
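
The format table above can be expressed as a small lookup, which may help when deciding whether a file needs Unstructured ETL. This is a sketch built only from the table: the set names and the `supported_etl_options` function are hypothetical and do not correspond to any Dify API.

```python
from pathlib import Path

# File extensions copied from the comparison table above.
DIFY_ETL_FORMATS = {
    "txt", "markdown", "md", "pdf", "html", "htm", "xlsx", "xls", "docx", "csv",
}
UNSTRUCTURED_ETL_FORMATS = DIFY_ETL_FORMATS | {
    "eml", "msg", "pptx", "ppt", "xml", "epub",
}


def supported_etl_options(filename: str) -> list[str]:
    """Return the ETL solutions whose format list includes this file's extension.

    Hypothetical helper for illustration; not part of Dify.
    """
    extension = Path(filename).suffix.lstrip(".").lower()
    options = []
    if extension in DIFY_ETL_FORMATS:
        options.append("Dify ETL")
    if extension in UNSTRUCTURED_ETL_FORMATS:
        options.append("Unstructured ETL")
    return options


for name in ("report.PDF", "slides.pptx", "archive.zip"):
    print(name, supported_etl_options(name))
```

An empty result means the extension appears in neither column of the table.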

#### Embedding

**Embedding** is a technique for converting discrete variables (such as words, sentences, or entire documents) into continuous vector representations. It can map high-dimensional data (such as words, phrases, or images) to low-dimensional spaces, providing a compact and efficient representation. This representation not only reduces data dimensionality but also preserves important semantic information, making subsequent content retrieval more efficient.

**Embedding models** are models designed specifically for text vectorization. They convert text into dense numerical vectors that capture its semantic information.
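
As a rough illustration of why vector representations make retrieval efficient, the Python sketch below ranks chunks by cosine similarity to a query vector. The three-dimensional vectors are hand-written stand-ins, not output from a real embedding model, and the ranking logic is a generic technique rather than Dify's retrieval implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embedding-model output (assumed values).
chunk_vectors = {
    "How to reset a password": [0.9, 0.1, 0.0],
    "Pricing and billing plans": [0.1, 0.9, 0.2],
    "Resetting account credentials": [0.8, 0.2, 0.1],
}
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of a password-related query

# Rank chunk texts from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda text: cosine_similarity(query_vector, chunk_vectors[text]),
    reverse=True,
)
for text in ranked:
    print(text)
```

With a real embedding model, semantically related texts end up with nearby vectors, so the same ranking step surfaces the chunks most relevant to the query.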

> To learn more, please refer to: ["Dify: Embedding Technology and Dify Knowledge Base Design/Planning"](https://mp.weixin.qq.com/s/vmY_CUmETo2IpEBf1nEGLQ).

#### Metadata

To manage knowledge bases with metadata, see [Metadata](/en/guides/knowledge-base/metadata).
