diff --git a/docs/features/rag/index.md b/docs/features/rag/index.md
index e3e838e6..b01e5cfc 100644
--- a/docs/features/rag/index.md
+++ b/docs/features/rag/index.md
@@ -39,6 +39,33 @@ Web pages often contain extraneous information such as navigation and footer. Fo
 
 Customize the RAG template from the `Admin Panel` > `Settings` > `Documents` menu.
 
+## Markdown Header Splitting
+
+When enabled, documents are first split by markdown headers (H1-H6). This preserves document structure and keeps sections under the same header together when possible. The resulting chunks are then further processed by the standard character or token splitter.
+
+:::tip
+
+Use the **Chunk Min Size Target** setting (found in **Admin Panel > Settings > Documents**) to merge small sections after markdown splitting, improving retrieval coherence and reducing the total number of vectors in your database.
+
+:::
+
+## Chunking Configuration
+
+Open WebUI allows you to fine-tune how documents are split into chunks for embedding. These settings directly affect retrieval quality.
+
+- **Chunk Size**: Sets the maximum number of characters (or tokens) per chunk.
+- **Chunk Overlap**: Specifies how much content is shared between adjacent chunks to maintain context.
+- **Chunk Min Size Target**: Sets the size below which a chunk is merged with its neighbors when possible. [Markdown Header Splitting](#markdown-header-splitting) preserves structure well, but it often creates tiny, fragmented chunks (e.g., a standalone sub-header, a table of contents entry, a single-sentence paragraph, or a short list item) that lack enough semantic context for high-quality embedding. Setting a **Chunk Min Size Target** merges these small pieces with their neighbors.
+
+### Why use a Chunk Min Size Target?
+
+Merging small sections after markdown splitting provides several key advantages:
+
+- **Improves RAG Quality**: Eliminates tiny, meaningless fragments, ensuring better semantic coherence in each retrieved chunk.
+- **Reduces Vector Database Size**: Fewer chunks mean fewer vectors to store, reducing storage costs and memory usage.
+- **Speeds Up Retrieval & Embedding**: A smaller index is faster to search, and fewer chunks require fewer embedding API calls (or less local compute). This significantly accelerates document processing when uploading files to chats or knowledge bases, as there is less data to vectorize.
+- **Efficiency & Impact**: Testing has shown that a well-configured threshold (e.g., 1000 for a chunk size of 2000) can reduce chunk counts by over 90% while **improving accuracy**, increasing embedding speed, and enhancing overall retrieval quality by maintaining semantic context.
+
 ## RAG Embedding Support
 
 Change the RAG embedding model directly in the `Admin Panel` > `Settings` > `Documents` menu. This feature supports Ollama and OpenAI models, enabling you to enhance document processing according to your requirements.
diff --git a/docs/getting-started/env-configuration.mdx b/docs/getting-started/env-configuration.mdx
index 7c20b096..049c0902 100644
--- a/docs/getting-started/env-configuration.mdx
+++ b/docs/getting-started/env-configuration.mdx
@@ -2565,6 +2565,13 @@ Provide a clear and direct response to the user's query, including inline citati
 - Description: Specifies how much overlap there should be between chunks.
 - Persistence: This environment variable is a `PersistentConfig` variable.
 
+#### `CHUNK_MIN_SIZE_TARGET`
+
+- Type: `int`
+- Default: `0`
+- Description: Chunks smaller than this threshold are merged with neighboring chunks when possible. This helps prevent tiny, low-quality fragments that can hurt retrieval performance and waste embedding resources. This feature only works when `ENABLE_MARKDOWN_HEADER_TEXT_SPLITTER` is enabled. Set to `0` to disable merging. For more information on the benefits and configuration, see the [RAG guide](/features/rag#chunking-configuration).
+- Persistence: This environment variable is a `PersistentConfig` variable.
+
 #### `RAG_TEXT_SPLITTER`
 
 - Type: `str`
diff --git a/docs/troubleshooting/rag.mdx b/docs/troubleshooting/rag.mdx
index c5229362..1a98d053 100644
--- a/docs/troubleshooting/rag.mdx
+++ b/docs/troubleshooting/rag.mdx
@@ -155,6 +155,18 @@ By separating these limits, administrators can better manage resource usage acro
 
 ---
 
+### 7. Fragmented or Tiny Chunks 🧩
+
+When using the **Markdown Header Splitter**, documents can sometimes be split into very small fragments (e.g., just a table of contents entry or a short sub-header). These tiny chunks often lack enough semantic context for the embedding model to represent them accurately, leading to poor RAG results and unnecessary overhead.
+
+✅ Solution:
+
+- Go to **Admin Panel > Settings > Documents**.
+- Increase the **Chunk Min Size Target**.
+- Setting this to roughly 50-60% of your `CHUNK_SIZE` (for example, `1000` for a chunk size of `2000`) makes the system merge small fragments with neighboring chunks when possible, resulting in better semantic coherence and fewer total chunks.
+
+---
+
 | Problem | Fix |
 |--------|------|
@@ -163,6 +175,7 @@ By separating these limits, administrators can better manage resource usage acro
 | ⏱ Limited by 2048 token cap | Increase model context length (Admin Panel > Models > Settings > Advanced Parameters for Ollama) or use large-context LLM |
 | 📉 Inaccurate retrieval | Switch to a better embedding model, then reindex |
 | ❌ Upload limits bypass | Use Folder uploads (with `FOLDER_MAX_FILE_COUNT`) but note that Knowledge Base limits are separate |
+| 🧩 Fragmented/Tiny Chunks | Increase **Chunk Min Size Target** to merge small sections |
 | Still confused? | Test with GPT-4o and compare outputs |
 
 ---
diff --git a/docs/tutorials/tips/performance.md b/docs/tutorials/tips/performance.md
index 6a9d6e9d..fba1b8e9 100644
--- a/docs/tutorials/tips/performance.md
+++ b/docs/tutorials/tips/performance.md
@@ -95,6 +95,15 @@ For multi-user setups, the choice of Vector DB matters.
 * `ENABLE_MILVUS_MULTITENANCY_MODE=True`
 * `ENABLE_QDRANT_MULTITENANCY_MODE=True`
 
+### Optimizing Document Chunking
+
+The way your documents are chunked directly impacts both storage efficiency and retrieval quality.
+
+- **Use Markdown Header Splitting**: This preserves the semantic structure of your documents.
+- **Set a Chunk Min Size Target**: The markdown header splitter can create tiny chunks (e.g., just a single sub-header). These are inefficient to store and poor for retrieval.
+  - **Env Var**: `CHUNK_MIN_SIZE_TARGET=1000` (Example value)
+  - **Benefit**: Merges small chunks with neighbors, significantly reducing the total vector count and improving RAG performance.
+
 ---
 
 ## 📈 Scaling Infrastructure (Multi-Tenancy & Kubernetes)
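
To make the merging behaviour that `CHUNK_MIN_SIZE_TARGET` describes more concrete, the following is a minimal Python sketch of one way small chunks could be folded into their neighbors. It is an illustration only, not Open WebUI's actual implementation: the function name `merge_small_chunks`, its parameters, and the greedy forward-merge strategy are all assumptions made for this example.

```python
from typing import List


def merge_small_chunks(
    chunks: List[str],
    min_size_target: int,
    max_size: int,
    sep: str = "\n\n",
) -> List[str]:
    """Hypothetical helper: greedily merge undersized chunks with neighbors.

    A chunk shorter than `min_size_target` absorbs the next chunk as long as
    the combined length stays within `max_size`. A value of 0 disables merging.
    """
    if min_size_target <= 0:
        return list(chunks)

    merged: List[str] = []
    for chunk in chunks:
        can_merge = (
            bool(merged)
            and len(merged[-1]) < min_size_target
            and len(merged[-1]) + len(sep) + len(chunk) <= max_size
        )
        if can_merge:
            # The previous chunk is still too small: extend it with this one.
            merged[-1] = merged[-1] + sep + chunk
        else:
            merged.append(chunk)

    # A small trailing chunk has no successor, so try folding it backwards.
    if (
        len(merged) >= 2
        and len(merged[-1]) < min_size_target
        and len(merged[-2]) + len(sep) + len(merged[-1]) <= max_size
    ):
        tail = merged.pop()
        merged[-1] = merged[-1] + sep + tail

    return merged


# Example: header-only fragments are merged into the text that follows them.
sections = ["# Title", "Intro " * 30, "## TOC", "Body " * 100]
result = merge_small_chunks(sections, min_size_target=100, max_size=600)
print([len(c) for c in result])  # prints [189, 508]
```

In this sketch a fragment is only merged "when possible", meaning when the combined chunk still fits within the maximum size; a fragment sitting between two large chunks would be left as-is.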