Merge pull request #949 from Classic298/dev

2026-03-27 13:28:37 +07:00 · 2026-01-03 21:17:07 +01:00
parent 0462e65458 888802a47f
commit 160cbe9ee0
1 changed files with 68 additions and 0 deletions
--- a/docs/features/rag/index.md
+++ b/docs/features/rag/index.md
@@ -66,6 +66,74 @@ Intelligently merging small sections after markdown splitting provides several k
 - **Speeds Up Retrieval & Embedding**: A smaller index is faster to search, and fewer chunks require fewer embedding API calls (or less local compute). This significantly accelerates document processing when uploading files to chats or knowledge bases, as there is less data to vectorize.
 - **Efficiency & Impact**: Testing has shown that a well-configured threshold (e.g., 1000 for a chunk size of 2000) can reduce chunk counts by over 90% while **improving accuracy**, increasing embedding speed, and enhancing overall retrieval quality by maintaining semantic context.

+<details>
+<summary>How the merging algorithm works (technical details)</summary>
+
+For most users, the explanation above is all you need: small chunks get merged with their neighbors, which improves retrieval while using fewer vectors and reducing compute, cost, and storage overhead. If you're curious about the exact logic and design rationale, here's how it works under the hood.
+
+### Why header-based splitting needs merging
+
+Markdown header splitting is one of the better structural approaches to chunking because headers are explicit semantic boundaries placed by the document author. You're leveraging human judgment about where one topic ends and another begins, which usually produces more coherent chunks than fixed-size windowing that might cut mid-paragraph or mid-thought.
+
+However, real documents often have structural quirks: tables of contents, short introductory sections, single-sentence paragraphs under their own headers, or deeply nested subheadings with minimal content. These produce tiny chunks that cause problems:
+
+- They lack sufficient context to be useful when retrieved in isolation
+- They can produce noisy retrieval results (matching on limited signal but contributing nothing useful)
+- Very short texts sometimes embed less reliably
+- They waste vector storage and slow down retrieval
+- Embedding many small chunks takes longer than embedding fewer, larger chunks with the same total content
+- More embedding operations mean more API calls (cost) or more local compute
+
+The merging algorithm addresses this by intelligently combining undersized chunks while respecting document structure and size limits.
+
+### The algorithm: a single forward pass
+
+The merging logic is deliberately simple—a single forward pass through all chunks:
+
+1. Start with the first chunk as the "current" accumulator.
+2. For each **subsequent** chunk, check if it can be absorbed into the current chunk.
+3. A chunk can be absorbed if **all three conditions** are met:
+   - The current accumulated content is still below `CHUNK_MIN_SIZE_TARGET`
+   - Merging wouldn't exceed `CHUNK_SIZE` (the maximum)
+   - Both chunks belong to the same source document
+4. If absorption is possible, merge them (with `\n\n` separation to preserve visual structure) and continue checking the next chunk.
+5. If absorption isn't possible, finalize the current chunk and start fresh with the next one as the new accumulator.
+6. Repeat until all chunks are processed.
+
+**Key point**: The size check is on the *accumulated* content, not individual chunks. This means multiple consecutive tiny chunks (like a table of contents with many small entries) will keep folding together until the combined size reaches the threshold or until merging the next chunk would exceed the maximum.
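+
+The steps above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the `Chunk` class and the `merge_small_chunks` function are hypothetical names, sizes are assumed to be measured in characters, the `\n\n` separator is assumed to count toward `CHUNK_SIZE`, and the source document is assumed to be stored under a `source` metadata key.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Chunk:
+    # Hypothetical container; the real implementation may use a different type.
+    text: str
+    metadata: dict = field(default_factory=dict)
+
+
+def merge_small_chunks(chunks, min_size_target, max_size):
+    """Single forward pass that folds undersized chunks into their successors."""
+    if not chunks:
+        return []
+
+    merged = []
+    # Step 1: the first chunk becomes the "current" accumulator.
+    current = Chunk(chunks[0].text, dict(chunks[0].metadata))
+
+    for nxt in chunks[1:]:
+        combined = current.text + "\n\n" + nxt.text
+        can_absorb = (
+            len(current.text) < min_size_target  # accumulator is still undersized
+            and len(combined) <= max_size  # merge stays within the maximum
+            and current.metadata.get("source") == nxt.metadata.get("source")
+        )
+        if can_absorb:
+            # Keep the accumulator's metadata (first chunk in the merge sequence).
+            current = Chunk(combined, current.metadata)
+        else:
+            # Finalize the accumulator and start fresh with the next chunk.
+            merged.append(current)
+            current = Chunk(nxt.text, dict(nxt.metadata))
+
+    merged.append(current)  # the last accumulator is always finalized
+    return merged
+```
+
+Under these assumptions, five 7-character chunks from one file, merged with a minimum target of 20 and a maximum of 40, would fold into two chunks: the first three entries reach 25 characters and stop absorbing, and the last two merge into a 16-character tail that stays undersized.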
+
+### Design decisions and why they matter
+
+**Forward-only merging**: Small chunks always merge into the *next* chunk, never backward. This keeps the logic simple and predictable, and preserves the natural "this section introduces what follows" relationship common in documents. A brief intro section merging forward into the content it introduces makes semantic sense.
+
+**Why not backward merging?** Beyond added code complexity, backward merging would frequently fail anyway. By the time a chunk gets finalized within a document, it's in one of two states: either it grew to meet or exceed `CHUNK_MIN_SIZE_TARGET` through absorption (so it's already "satisfied" with limited headroom), or it couldn't absorb the next chunk because the combined size would exceed `CHUNK_SIZE` (so merging that same chunk backward would fail the identical size check). Either way, a backward merge attempt would often fail the size check, meaning you'd add branching logic and state tracking for something that rarely succeeds.
+
+**No cross-document merging**: Chunks from different source files are never combined, even if both are small. This preserves clear document boundaries for citation, source attribution, and retrieval context.
+
+**Respects maximum size**: If merging two chunks would exceed `CHUNK_SIZE`, both are kept separate. Content is never discarded to force a merge.
+
+**Metadata inheritance**: Merged chunks inherit metadata from the *first* chunk in the merge sequence. This is consistent with forward-merge semantics—source and header information reflects where the merged section "started," which is typically the right choice for retrieval and citation purposes.
+
+**The `\n\n` separator**: When chunks merge, they're joined with double newlines rather than concatenated directly. This preserves visual and structural separation in the combined text, which can matter for both embedding quality and human readability if you inspect your chunks.
+
+### Edge cases
+
+**Consecutive tiny chunks**: Handled naturally. They keep accumulating into a single chunk until the threshold is met or max size would be exceeded.
+
+**Small chunk followed by large chunk**: If a small chunk is followed by a chunk large enough that merging would exceed `CHUNK_SIZE`, the small chunk gets finalized as-is, still undersized. This is unavoidable without backward merging or content splitting, but it's also rare in practice. It typically occurs at natural semantic boundaries (a brief transition before a dense section), and the small chunk being standalone at that boundary is arguably correct anyway.
+
+**Last chunk in document**: If the final chunk is undersized, it stays undersized since there's nothing to merge forward into. Again, unavoidable and usually fine—document endings are natural boundaries.
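+
+The edge cases above can be traced with a small size-only simulation. This is a sketch under stated assumptions: `simulate_merge` is a hypothetical helper, chunks are reduced to `(source, size)` pairs, sizes are in characters, and the two-character `\n\n` separator is assumed to count toward `CHUNK_SIZE`.
+
+```python
+def simulate_merge(chunks, min_size_target, max_size, sep_len=2):
+    """Forward-pass merge over (source, size) pairs; returns the merged pairs."""
+    out = []
+    for source, size in chunks:
+        if out:
+            prev_source, prev_size = out[-1]  # out[-1] acts as the accumulator
+            if (
+                prev_size < min_size_target
+                and prev_size + sep_len + size <= max_size
+                and prev_source == source
+            ):
+                out[-1] = (prev_source, prev_size + sep_len + size)
+                continue
+        out.append((source, size))
+    return out
+
+
+MIN_TARGET, MAX_SIZE = 1000, 2000
+
+# Consecutive tiny chunks keep accumulating: 300 -> 602 -> 904 -> 1206.
+print(simulate_merge([("a.md", 300)] * 4, MIN_TARGET, MAX_SIZE))
+# [('a.md', 1206)]
+
+# Small chunk followed by a large one: 150 + 2 + 1900 exceeds 2000, so both stay.
+print(simulate_merge([("a.md", 150), ("a.md", 1900)], MIN_TARGET, MAX_SIZE))
+# [('a.md', 150), ('a.md', 1900)]
+
+# Undersized last chunk: the previous chunk already met the target.
+print(simulate_merge([("a.md", 1200), ("a.md", 80)], MIN_TARGET, MAX_SIZE))
+# [('a.md', 1200), ('a.md', 80)]
+
+# Different source documents are never combined.
+print(simulate_merge([("a.md", 100), ("b.md", 100)], MIN_TARGET, MAX_SIZE))
+# [('a.md', 100), ('b.md', 100)]
+```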
+
+### Performance characteristics
+
+The algorithm is O(n) in the number of chunks—a single pass with no lookahead or backtracking. This makes it fast even for large document collections.
+
+The gains from merging go beyond raw speed. Retrieval over 45 vectors instead of 588 isn't just ~13x less raw compute: you're also getting much cleaner top-k results because you've eliminated the noise of near-empty chunks that might score well on partial keyword matches but contribute nothing useful to the LLM. The quality improvement often matters more than the speed improvement.
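+
+As a quick check of the figures quoted above, the ratio and the percentage reduction for 588 versus 45 chunks can be computed directly. The variable names are illustrative only.
+
+```python
+chunks_before = 588
+chunks_after = 45
+
+ratio = chunks_before / chunks_after  # how many times fewer vectors
+reduction = 1 - chunks_after / chunks_before  # fraction of chunks removed
+
+print(f"{ratio:.1f}x fewer vectors, {reduction:.1%} reduction")
+# 13.1x fewer vectors, 92.3% reduction
+```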
+
+Testing has shown that a well-configured threshold (e.g., 1000 for a chunk size of 2000) can reduce chunk counts by over 90% while improving retrieval accuracy, because each remaining chunk carries meaningful semantic context rather than being a fragment that confuses both the embedding model and the retrieval ranking. As positive side effects, it also uses less storage space in the vector database and requires fewer embedding operations, which can be a significant cost saving if outsourcing to an embedding service.
+
+</details>
+
 ## RAG Embedding Support

 Change the RAG embedding model directly in the `Admin Panel` > `Settings` > `Documents` menu. This feature supports Ollama and OpenAI models, enabling you to enhance document processing according to your requirements.