Add data source plugin dev docs (#474)

This commit is contained in:
Riskey
2025-09-30 21:07:13 +08:00
committed by GitHub
parent fe4d97fae3
commit de587fb570
11 changed files with 1770 additions and 357 deletions

View File

@@ -303,7 +303,6 @@
"en/plugins/quick-start/develop-plugins/README",
"en/plugins/quick-start/develop-plugins/initialize-development-tools",
"en/plugins/quick-start/develop-plugins/tool-plugin",
"en/plugins/quick-start/develop-plugins/tool-oauth",
{
"group": "Model Plugin",
"pages": [
@@ -512,7 +511,9 @@
"plugin-dev-en/0222-creating-new-model-provider-extra",
"plugin-dev-en/0222-creating-new-model-provider",
"plugin-dev-en/0222-debugging-logs",
"plugin-dev-en/0222-tool-plugin"
"plugin-dev-en/0222-tool-plugin",
"plugin-dev-en/0222-tool-oauth",
"plugin-dev-en/0222-datasource-plugin"
]
}
]
@@ -984,7 +985,6 @@
"zh-hans/plugins/quick-start/develop-plugins/README",
"zh-hans/plugins/quick-start/develop-plugins/initialize-development-tools",
"zh-hans/plugins/quick-start/develop-plugins/tool-plugin",
"zh-hans/plugins/quick-start/develop-plugins/tool-oauth",
{
"group": "Model 插件",
"pages": [
@@ -1199,7 +1199,9 @@
"plugin-dev-zh/0222-creating-new-model-provider-extra",
"plugin-dev-zh/0222-creating-new-model-provider",
"plugin-dev-zh/0222-debugging-logs",
"plugin-dev-zh/0222-tool-plugin"
"plugin-dev-zh/0222-tool-plugin",
"plugin-dev-zh/0222-tool-oauth",
"plugin-dev-zh/0222-datasource-plugin"
]
}
]
@@ -1853,7 +1855,8 @@
"plugin-dev-ja/0222-creating-new-model-provider-extra",
"plugin-dev-ja/0222-creating-new-model-provider",
"plugin-dev-ja/0222-debugging-logs",
"plugin-dev-ja/0222-tool-plugin"
"plugin-dev-ja/0222-tool-plugin",
"plugin-dev-ja/0222-datasource-plugin"
]
}
]

View File

@@ -9,6 +9,7 @@ In this section, you'll learn about the knowledge pipeline process, understand d
### Interface Status
When entering the knowledge pipeline orchestration canvas, you'll see:
- **Tab Status**: Documents, Retrieval Test, and Settings tabs will be grayed out and unavailable at the moment
- **Essential Steps**: You must complete knowledge pipeline orchestration and publishing before uploading files
@@ -27,13 +28,13 @@ Before we get started, let's break down the knowledge pipeline process to unders
The knowledge pipeline includes these key steps:
<Tip>
Data Source → Data Processing (Extractor + Chunker) → Knowledge Base Node (Chunk Structure + Retrieval Setting) → User Input Field → Test & Publish
</Tip>
1. **Data Source**: Content from various data sources (local files, Notion, web pages, etc.)
2. **Data Processing**: Process and transform data content
- Extractor: Parse and structure document content
- Chunker: Split structured content into manageable segments
3. **Knowledge Base**: Set up chunk structure and retrieval settings
4. **User Input Field**: Define parameters that pipeline users need to input for data processing
5. **Test & Publish**: Validate and officially activate the knowledge base
@@ -50,54 +51,64 @@ Visit the [Dify Marketplace](https://marketplace.dify.ai) for more data sources.
Upload local files through drag-and-drop or file selection.
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![](/images/knowledge-base/knowledge-pipeline-orchestration-1.PNG)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**Configuration Options**
| Item | Description |
| ------------- | ------------------------------------------------------------------------------------------------- |
| File Format | Support PDF, XLSX, DOCX, etc. Users can customize their selection |
| Upload Method | Upload local files or folders through drag-and-drop or file selection. Batch upload is supported. |
**Limitations**
| Item | Description |
| ------------- | ------------------------------------------------------------------------------------------------- |
| File Quantity | Maximum 50 files per upload |
| File Size | Each file must not exceed 15MB |
| Storage | Limits on total document uploads and storage space may vary for different SaaS subscription plans |
**Output Variables**
| Output Variable | Format |
| --------------- | --------------- |
| `{x} Document` | Single document |
</div>
</div>
---
### Online Document
#### Notion
Integrate with your Notion workspace to seamlessly import pages and databases, keeping your knowledge base automatically up to date.
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Notion](/images/knowledge-base/knowledge-pipeline-orchestration-2.PNG)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**Configuration Options**
| Item | Option | Output Variable | Description |
| --------- | -------- | --------------- | ------------------------------------ |
| Extractor | Enabled | `{x} Content` | Structured and processed information |
| | Disabled | `{x} Document` | Original text |
</div>
</div>
---
### Web Crawler
@@ -108,45 +119,53 @@ Transform web content into formats that can be easily read by large language mod
An open-source web parsing tool providing simple and easy-to-use API services, suitable for fast crawling and processing web content.
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Jina Reader](/images/knowledge-base/knowledge-pipeline-orchestration-3.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**Parameter Configuration**
| Parameter | Type | Description |
| ---------------- | -------- | ------------------------------------ |
| URL | Required | Target webpage address |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Use sitemap | Optional | Crawl by using website sitemap |
| Limit | Required | Set maximum number of pages to crawl |
| Enable Extractor | Optional | Choose data extraction method |
</div>
</div>
#### Firecrawl
An open-source web parsing tool that provides more refined crawling control options and API services. It supports deep crawling of complex website structures and is recommended for batch processing and precise control.
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![](/images/knowledge-base/knowledge-pipeline-orchestration-4.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**Parameter Configuration**
| Parameter | Type | Description |
| ------------------------- | -------- | -------------------------------------------------------------------------- |
| URL | Required | Target webpage address |
| Limit | Required | Set maximum number of pages to crawl |
| Crawl sub-page | Optional | Whether to crawl linked pages |
| Max depth | Optional | How many levels deep the crawler will traverse from the starting URL |
| Exclude paths | Optional | Specify URL patterns that should not be crawled |
| Include only paths | Optional | Crawl specified paths only |
| Extractor | Optional | Choose data processing method |
| Extract Only Main Content | Optional | Isolate and retrieve the primary, meaningful text and media from a webpage |
</div>
</div>
---
@@ -156,7 +175,7 @@ An open-source web parsing tool that provides more refined crawling control opti
Connect your online cloud storage services (e.g., Google Drive, Dropbox, OneDrive) and let Dify automatically retrieve your files. Simply select and import the documents you need for processing, without manually downloading and re-uploading files.
<Tip>
Need help with authorization? Please check [Authorize Data Source](/en/guides/knowledge-base/knowledge-pipeline/authorize-data-source) for detailed guidance on authorizing different data sources.
</Tip>
---
@@ -178,7 +197,7 @@ You can choose Dify's Doc Extractor to process files, or select tools based on y
As an information processing center, the document extractor node identifies and reads files from input variables, extracts information, and converts it into a format that works with the next node.
<Tip>
For more information, please refer to the [Document Extractor](/en/guides/workflow/node/doc-extractor).
</Tip>
#### Dify Extractor
@@ -189,43 +208,47 @@ Dify Extractor is a built-in document parser presented by Dify. It supports mult
#### Unstructured
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Unstructured](/images/knowledge-base/knowledge-pipeline-orchestration-7.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
[Unstructured](https://marketplace-staging.dify.dev/plugins/langgenius/unstructured) transforms documents into structured, machine-readable formats with highly customizable processing strategies. It offers multiple extraction strategies (auto, hi_res, fast, OCR-only) and chunking methods (by_title, by_page, by_similarity) to handle diverse document types, and provides detailed element-level metadata including coordinates, confidence scores, and layout information. It's recommended for enterprise document workflows, processing of mixed file types, and cases that require precise control over document processing parameters.
</div>
</div>
<Tip>
Explore more tools in the [Dify Marketplace](https://marketplace.dify.ai).
</Tip>
---
### Chunker
Much like humans with a limited attention span, large language models cannot process huge amounts of information at once. Therefore, after information extraction, the chunker splits large document content into smaller, manageable segments (called "chunks").
Different documents require different chunking strategies. A product manual works best when split by product features, while research papers should be divided by logical sections. Dify offers 3 types of chunkers for various document types and use cases.
#### Overview of Different Chunkers
| Chunker Type | Highlights | Best for |
| -------------------- | ----------------------------------------------------- | ----------------------------------------------------- |
| General Chunker | Fixed-size chunks with customizable delimiters | Simple documents with basic structure |
| Parent-child Chunker | Dual-layer structure: precise matching + rich context | Complex documents requiring rich context preservation |
| Q&A Processor | Processes question-answer pairs from spreadsheets | Structured Q&A data from CSV/Excel files |
#### Common Text Pre-processing Rules
All chunkers support these text cleaning options:
| Preprocessing Option | Description |
| --------------------------------------------- | ---------------------------------------------------------------------------------- |
| Replace consecutive spaces, newlines and tabs | Clean up formatting by replacing multiple whitespace characters with single spaces |
| Remove all URLs and email addresses | Automatically detect and remove web links and email addresses from text |
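The two cleaning rules above can be sketched with standard regular expressions. This is an illustrative approximation only; the function name `preprocess` and the exact patterns are assumptions, and Dify's built-in implementation may differ.

```python
import re

def preprocess(text: str, collapse_whitespace: bool = True,
               strip_urls_emails: bool = True) -> str:
    """Hypothetical sketch of the two text pre-processing rules."""
    if strip_urls_emails:
        # Remove web links and email addresses from the text.
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
        text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "", text)
    if collapse_whitespace:
        # Replace runs of spaces, newlines, and tabs with a single space.
        text = re.sub(r"[ \t\r\n]+", " ", text)
    return text.strip()

preprocess("Contact  us:\tinfo@example.com\nhttps://example.com now")
# → "Contact us: now"
```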
#### General Chunker
@@ -233,18 +256,18 @@ Basic document chunking processing, suitable for documents with relatively simpl
**Input and Output Variable**
| Type | Variable | Description |
| --------------- | ------------------ | --------------------------------------------------------------------------- |
| Input Variable | `{x} Content` | Complete document content that the chunker will split into smaller segments |
| Output Variable | `{x} Array[Chunk]` | Array of chunked content, each segment optimized for retrieval and analysis |
**Chunk Settings**
| Configuration Item | Description |
| -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Delimiter | Default value is `\n` (line breaks for paragraph segmentation). You can customize chunking rules following regex. The system will automatically execute segmentation when the delimiter appears in text. |
| Maximum Chunk Length | Specifies the maximum character limit within a segment. When this length is exceeded, forced segmentation will occur. |
| Chunk Overlap | When segmenting data, there is some overlap between segments. This overlap helps improve information retention and analysis accuracy, enhancing recall effectiveness. |
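The three settings above (delimiter, maximum chunk length, overlap) combine into a simple split-then-pack algorithm. The sketch below is an illustrative approximation under those assumptions, not Dify's actual General Chunker, which also supports regex delimiters:

```python
def general_chunk(text: str, delimiter: str = "\n",
                  max_len: int = 500, overlap: int = 50) -> list[str]:
    """Hypothetical sketch: split on a delimiter, then pack pieces into
    chunks no longer than max_len, carrying `overlap` trailing characters
    from one chunk into the next to preserve context."""
    chunks, current = [], ""
    for piece in text.split(delimiter):
        # Force-split any single piece that alone exceeds max_len.
        while len(piece) > max_len:
            chunks.append(piece[:max_len])
            piece = piece[max_len - overlap:]
        if current and len(current) + len(piece) + 1 > max_len:
            chunks.append(current)
            current = current[-overlap:] if overlap else ""
        current = (current + " " + piece).strip() if current else piece
    if current:
        chunks.append(current)
    return chunks

general_chunk("one\ntwo\nthree", max_len=8, overlap=0)
# → ["one two", "three"]
```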
#### Parent-child Chunker
@@ -256,20 +279,20 @@ Child Chunks for query matching: Small, precise information segments (usually si
Parent Chunks provide rich context: Larger content blocks (paragraphs, sections, or entire documents) that contain the matching child chunks, giving the large language model (LLM) comprehensive background information.
| Type | Variable | Description |
| --------------- | ------------------------ | --------------------------------------------------------------------------- |
| Input Variable | `{x} Content` | Complete document content that the chunker will split into smaller segments |
| Output Variable | `{x} Array[ParentChunk]` | Array of parent chunks |
**Chunk Settings**
| Configuration Item | Description |
| --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| Parent Delimiter | Set delimiter for parent chunk splitting |
| Parent Maximum Chunk Length | Set maximum character count for parent chunks |
| Child Delimiter | Set delimiter for child chunk splitting |
| Child Maximum Chunk Length | Set maximum character count for child chunks |
| Parent Mode | Choose between Paragraph (split text into paragraphs) or "Full Document" (use entire document as parent chunk) for direct retrieval |
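The dual-layer structure can be sketched as follows: each parent chunk keeps the list of child chunks derived from it, so a query can match a small child while the enclosing parent supplies context. This is an assumed, simplified data model (the name `parent_child_chunk` and the dict shape are made up), not Dify's internal representation:

```python
def parent_child_chunk(text: str,
                       parent_delim: str = "\n\n", parent_max: int = 1000,
                       child_delim: str = "\n", child_max: int = 200) -> list[dict]:
    """Hypothetical sketch of parent-child chunking: children are matched
    against the query; the enclosing parent is returned as context."""
    parents = []
    for parent in text.split(parent_delim):
        parent = parent.strip()[:parent_max]
        if not parent:
            continue
        children = [c.strip()[:child_max]
                    for c in parent.split(child_delim) if c.strip()]
        parents.append({"parent": parent, "children": children})
    return parents

parent_child_chunk("A.\nB.\n\nC.")
# → [{'parent': 'A.\nB.', 'children': ['A.', 'B.']}, {'parent': 'C.', 'children': ['C.']}]
```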
#### Q&A Processor
@@ -277,17 +300,17 @@ Combining extraction and chunking in one node, Q&A Processor is specifically des
**Input and Output Variable**
| Type | Variable | Description |
| --------------- | -------------------- | ------------- |
| Input Variable | `{x} Document` | A single file |
| Output Variable | `{x} Array[QAChunk]` | QA chunk |
**Variable Configuration**
| Configuration Item | Description |
| -------------------------- | ---------------------------------------- |
| Column Number for Question | Select the column to use as the question |
| Column Number for Answer | Select the column to use as the answer |
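Conceptually, the Q&A Processor maps two configured column numbers onto each spreadsheet row. A minimal sketch, assuming CSV input and a made-up `qa_chunks` helper (Dify's processor also handles Excel and other formats):

```python
import csv
import io

def qa_chunks(csv_text: str, question_col: int = 0, answer_col: int = 1) -> list[dict]:
    """Hypothetical sketch: turn each row into one QA chunk, using the
    configured column numbers for question and answer."""
    rows = csv.reader(io.StringIO(csv_text))
    return [{"question": row[question_col], "answer": row[answer_col]}
            for row in rows if len(row) > max(question_col, answer_col)]

qa_chunks("What is Dify?,An LLM app platform\n")
# → [{'question': 'What is Dify?', 'answer': 'An LLM app platform'}]
```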
---
@@ -306,7 +329,7 @@ Chunk structure determines how the knowledge base organizes and indexes your doc
The knowledge base supports three chunk modes: **General Mode, Parent-child Mode, and Q&A Mode**. If you're creating a knowledge base for the first time, we recommend choosing Parent-child Mode.
<Warning>
**Important Reminder**: Chunk structure cannot be modified once saved and published. Please choose carefully.
</Warning>
#### General Mode
@@ -332,9 +355,10 @@ Q&A Mode supports HQ (High Quality) mode only.
Input variables receive processing results from data processing nodes as the data source for the knowledge base. You need to connect the chunker's output to the knowledge base as input.
The node supports different types of standard inputs based on the selected chunk structure:
- **General Mode**: x Array[Chunk] - General chunk array
- **Parent-child Mode**: x Array[ParentChunk] - Parent chunk array
- **Q&A Mode**: x Array[QAChunk] - Q&A chunk array
### Index Method & Retrieval Setting
@@ -347,25 +371,25 @@ High quality mode uses embedding models to convert segmented text blocks into nu
In Economy mode, each chunk uses 10 keywords for retrieval without calling embedding models, so it incurs no cost.
<Tip>
Please refer to [Select the Indexing Method and Retrieval Setting](/en/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods) for more details.
</Tip>
#### Index Methods and Retrieval Settings
| Index Method | Available Retrieval Settings | Description |
| ------------ | ---------------------------- | ----------------------------------------------------------------------- |
| High Quality | Vector Retrieval | Understand deeper meaning of queries based on semantic similarity |
| | Full-text Retrieval | Keyword-based retrieval providing comprehensive search capabilities |
| | Hybrid Retrieval | Combine both semantic and keywords |
| Economy | Inverted Index | Common search engine retrieval method, matches queries with key content |
You can also refer to the table below for information on configuring chunk structure, indexing methods, parameters, and retrieval settings.
| Chunk Structure | Index Methods | Parameters | Retrieval Settings |
| ----------------- | ----------------------------------------- | ------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| General mode | High Quality <br /> <br /> <br /> Economy | Embedding Model <br /> <br /> <br /> Number of Keywords | Vector Retrieval <br /> Full-text Retrieval <br /> Hybrid Retrieval <br /> Inverted Index |
| Parent-child Mode | High Quality Only | Embedding Model | Vector Retrieval <br /> Full-text Retrieval <br /> Hybrid Retrieval |
| Q&A Mode | High Quality Only | Embedding Model | Vector Retrieval <br /> Full-text Retrieval <br /> Hybrid Retrieval |
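Hybrid Retrieval blends the two signals from the table: semantic similarity between embeddings and keyword matching. The sketch below is a toy illustration of that idea (the `hybrid_score` function and the raw keyword-overlap score are assumptions; production systems typically use BM25 and score normalization or reranking):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Semantic similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_score(query_vec: list[float], chunk_vec: list[float],
                 query_terms: list[str], chunk_terms: list[str],
                 semantic_weight: float = 0.7) -> float:
    """Hypothetical sketch: weighted blend of vector retrieval (semantic)
    and full-text retrieval (keyword overlap)."""
    keyword = len(set(query_terms) & set(chunk_terms)) / max(len(set(query_terms)), 1)
    return semantic_weight * cosine(query_vec, chunk_vec) + (1 - semantic_weight) * keyword

hybrid_score([1.0, 0.0], [1.0, 0.0], ["a", "b"], ["a"])
# → 0.85
```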
---
@@ -379,13 +403,11 @@ This way, you can create specialized input forms for different use scenarios, im
There are two ways to create user input fields:
1. **Pipeline Orchestration Interface**\
Click on the **Input field** to start creating and configuring input forms.\
<img src="/images/knowledge-base/knowledge-pipeline-orchestration-9.png" alt="" />
2. **Node Parameter Panel**\
Select a node. Then, in parameter input on the right-side panel, click + Create user input for new input items. New input items will also be collected in the Input Field. <img src="/images/knowledge-base/knowledge-pipeline-orchestration-10.png" alt="" />
### Add User Input Fields
@@ -395,8 +417,7 @@ There're two ways to create user input field:
These inputs are specific to each data source and its downstream nodes. Users only need to fill out these fields when selecting the corresponding data source, such as different URLs for different data sources.
**How to create**: Click the `+` button on the right side of a data source to add fields for that specific data source. These fields can only be referenced by that data source and its subsequently connected nodes. <img src="/images/knowledge-base/knowledge-pipeline-orchestration-12.png" alt="" />
#### Global Inputs for All Entrances
@@ -410,40 +431,44 @@ Global shared inputs can be referenced by all nodes. These inputs are suitable f
The knowledge pipeline supports seven types of input variables:
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![](/images/knowledge-base/knowledge-pipeline-orchestration-14.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
| Field Type | Description |
| ---------- | --------------------------------------------------------------------------------------------------- |
| Text | Short text input by knowledge base users, maximum length 256 characters |
| Paragraph | Long text input for longer character strings |
| Select | Fixed options preset by the orchestrator for users to choose from, users cannot add custom content |
| Boolean | Only true/false values |
| Number | Only accepts numerical input |
| Single | Upload a single file, supports multiple file types (documents, images, audio, and other file types) |
| File List | Batch file upload, supports multiple file types (documents, images, audio, and other file types) |
</div>
</div>
<Tip>
For more information about supported field types, please refer to the [Input Fields documentation](/en/guides/workflow/node/start#input-field).
</Tip>
### Field Configuration Options
All input field types include required settings, optional settings, and additional settings. You can mark a field as required by checking the appropriate option.
| Setting | Name | Description | Example |
| ------------------------- | ------------- | ----------------------------------------------------------------------- | -------------------------------------------------------- |
| Required Settings | Variable Name | Internal system identifier, usually named using English and underscores | `user_email` |
| | Display Name | Interface display name, usually concise and readable text | User Email |
| Type-specific Settings | | Special requirements for different field types | Text field max length 100 characters |
| Additional Settings | Default Value | Default value when user hasn't provided input | Number field defaults to 0, text field defaults to empty |
| | Placeholder | Hint text displayed when input box is empty | "Please enter your email" |
| | Tooltip | Explanatory text to guide user input, usually displayed on mouse hover | "Please enter a valid email address" |
| Special Optional Settings | | Additional setting options based on different field types | Validation of email format |
After completing configuration, click the preview button in the upper right corner to browse the form preview interface. You can drag and adjust field groupings. If an exclamation mark appears, it indicates that the reference is invalid after moving.
By default, the knowledge base name will be "Untitled + number", permissions are set to "Only me", and the icon will be an orange book. If you import it from a DSL file, it will use the saved icon.
Edit the knowledge base information by clicking **Settings** in the left panel and fill in the information below:
- **Name & Icon**\
Pick a name for your knowledge base.\
Choose an emoji, upload an image, or paste an image URL as the icon of this knowledge base.
- **Knowledge Description** Provide a brief description of your knowledge base. This helps the AI better understand and retrieve your data. If left empty, Dify will apply the default retrieval strategy.
- **Permissions**\
Select the appropriate access permissions from the dropdown menu.
---
## Step 6: Testing
You're almost there! This is the final step of the knowledge pipeline orchestration.
After completing the orchestration, you need to validate all the configuration first. Then, do some running tests and confirm all the settings. Finally, publish the knowledge pipeline.
1. **Start Test**: Click the "Test Run" button in the upper right corner
2. **Import Test File**: Import files in the data source window that pops up on the right
<Warning>
**Important Note**: For better debugging and observation, only one file upload is allowed per test run.
</Warning>
3. **Fill Parameters**: After successful import, fill in corresponding parameters according to the user input form you configured earlier
4. **Start Test Run**: Click next step to start testing the entire pipeline

---
title: "Data Source Plugin"
---
Data source plugins are a new type of plugin introduced in Dify 1.9.0. In a knowledge pipeline, they serve as the document data source and the starting point for the entire pipeline.
This article describes how to develop a data source plugin, covering plugin architecture, code examples, and debugging methods, to help you quickly develop and launch your data source plugin.
## Prerequisites
Before reading on, ensure you have a basic understanding of the knowledge pipeline and some knowledge of plugin development. You can find relevant information here:
- [Step 2: Knowledge Pipeline Orchestration](/en/guides/knowledge-base/knowledge-pipeline/knowledge-pipeline-orchestration)
- [Dify Plugin Development: Hello World Guide](/plugin-dev-en/0211-getting-started-dify-tool)
## Data Source Plugin Types
Dify supports three types of data source plugins: web crawler, online document, and online drive. When implementing the plugin code, the class that provides the plugin's functionality must inherit from a specific data source class. Each of the three plugin types corresponds to a different parent class.
<Info>
To learn how to inherit from a parent class to implement plugin functionality, see [Dify Plugin Development: Hello World Guide - 4.4 Implementing Tool Logic](/plugin-dev-en/0211-getting-started-dify-tool#4-4-implementing-tool-logic).
</Info>
Each data source plugin type supports multiple data sources. For example:
- **Web Crawler**: Jina Reader, FireCrawl
- **Online Document**: Notion, Confluence, GitHub
- **Online Drive**: OneDrive, Google Drive, Box, AWS S3, Tencent COS
The relationship between data source types and data source plugin types is illustrated below.
![](/images/data_source_type.png)
## Develop a Data Source Plugin
### Create a Data Source Plugin
You can use the scaffolding command-line tool to create a data source plugin by selecting the `datasource` type. After completing the setup, the command-line tool will automatically generate the plugin project code.
```powershell
dify plugin init
```
![](/images/datasource_plugin_init.png)
<Info>
Typically, a data source plugin does not need to use other features of the Dify platform, so no additional permissions are required.
</Info>
#### Data Source Plugin Structure
A data source plugin consists of three main components:
- The `manifest.yaml` file: Describes the basic information about the plugin.
- The `provider` directory: Contains the plugin provider's description and authentication implementation code.
- The `datasources` directory: Contains the description and core logic for fetching data from the data source.
```
├── _assets
│   └── icon.svg
├── datasources
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── main.py
├── manifest.yaml
├── PRIVACY.md
├── provider
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── README.md
└── requirements.txt
```
#### Set the Correct Version and Tag
- In the `manifest.yaml` file, set the minimum supported Dify version as follows:
```yaml
minimum_dify_version: 1.9.0
```
- In the `manifest.yaml` file, add the following tag to display the plugin under the data source category in the Dify Marketplace:
```yaml
tags:
- rag
```
- In the `requirements.txt` file, set the plugin SDK version used for data source plugin development as follows:
```yaml
dify-plugin>=0.5.0,<0.6.0
```
### Add the Data Source Provider
#### Create the Provider YAML File
The content of a provider YAML file is essentially the same as that for tool plugins, with only the following two differences:
```yaml
# Specify the provider type for the data source plugin: online_drive, online_document, or website_crawl
provider_type: online_drive # online_document, website_crawl
# Specify data sources
datasources:
- datasources/PluginName.yaml
```
<Info>
For more about creating a provider YAML file, see [Dify Plugin Development: Hello World Guide - 4.3 Configuring Provider Credentials](/plugin-dev-en/0211-getting-started-dify-tool#4-3-configuring-provider-credentials).
</Info>
<Info>
Data source plugins support authentication via OAuth 2.0 or API Key.
To configure OAuth, see [Add OAuth Support to Your Tool Plugin](/plugin-dev-en/0222-tool-oauth).
</Info>
#### Create the Provider Code File
- When using API Key authentication mode, the provider code file for data source plugins is identical to that for tool plugins. You only need to change the parent class inherited by the provider class to `DatasourceProvider`.
```python
class YourDatasourceProvider(DatasourceProvider):
def _validate_credentials(self, credentials: Mapping[str, Any]) -> None:
try:
"""
IMPLEMENT YOUR VALIDATION HERE
"""
except Exception as e:
raise ToolProviderCredentialValidationError(str(e))
```
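To illustrate what the validation body might contain, the pure pre-check below verifies that an `api_key` credential is present and non-blank before any network call is made. This is a hedged sketch: the `api_key` field name and the `CredentialValidationError` stand-in are illustrative assumptions, not part of the SDK — in a real provider you would raise `ToolProviderCredentialValidationError` and then verify the key against your service's API.

```python
from collections.abc import Mapping
from typing import Any


class CredentialValidationError(Exception):
    """Stand-in for the SDK's ToolProviderCredentialValidationError in this sketch."""


def extract_api_key(credentials: Mapping[str, Any]) -> str:
    """Pre-check run before calling the remote API: the key must be present and non-blank."""
    api_key = str(credentials.get("api_key") or "").strip()
    if not api_key:
        raise CredentialValidationError("api_key is required")
    return api_key
```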
- When using OAuth authentication mode, data source plugins differ slightly from tool plugins. When obtaining access permissions via OAuth, data source plugins can simultaneously return the username and avatar to be displayed on the frontend. Therefore, `_oauth_get_credentials` and `_oauth_refresh_credentials` need to return a `DatasourceOAuthCredentials` type that contains `name`, `avatar_url`, `expires_at`, and `credentials`.
The `DatasourceOAuthCredentials` class is defined as follows and must be set to the corresponding type when returned:
```python
class DatasourceOAuthCredentials(BaseModel):
name: str | None = Field(None, description="The name of the OAuth credential")
avatar_url: str | None = Field(None, description="The avatar url of the OAuth")
credentials: Mapping[str, Any] = Field(..., description="The credentials of the OAuth")
expires_at: int | None = Field(
default=-1,
description="""The expiration timestamp (in seconds since Unix epoch, UTC) of the credentials.
Set to -1 or None if the credentials do not expire.""",
)
```
The function signatures for `_oauth_get_authorization_url`, `_oauth_get_credentials`, and `_oauth_refresh_credentials` are as follows:
<Tabs>
<Tab title="_oauth_get_authorization_url">
```python
def _oauth_get_authorization_url(self, redirect_uri: str, system_credentials: Mapping[str, Any]) -> str:
"""
Generate the authorization URL for {{ .PluginName }} OAuth.
"""
try:
"""
IMPLEMENT YOUR AUTHORIZATION URL GENERATION HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return ""
```
</Tab>
<Tab title="_oauth_get_credentials">
```python
def _oauth_get_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], request: Request
) -> DatasourceOAuthCredentials:
"""
Exchange code for access_token.
"""
try:
"""
IMPLEMENT YOUR CREDENTIALS EXCHANGE HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
<Tab title="_oauth_refresh_credentials">
```python
def _oauth_refresh_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], credentials: Mapping[str, Any]
) -> DatasourceOAuthCredentials:
"""
Refresh the credentials
"""
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
</Tabs>
### Add the Data Source
The YAML file format and data source code format vary across the three types of data sources.
#### Web Crawler
In the provider YAML file for a web crawler data source plugin, `output_schema` must always return four parameters: `source_url`, `content`, `title`, and `description`.
```yaml
output_schema:
type: object
properties:
source_url:
type: string
description: the source url of the website
content:
type: string
description: the content from the website
title:
type: string
description: the title of the website
"description":
type: string
description: the description of the website
```
In the main logic code for a web crawler plugin, the class must inherit from `WebsiteCrawlDatasource` and implement the `_get_website_crawl` method. You then need to use the `create_crawl_message` method to return the web crawl message.
To crawl multiple web pages and return them in batches, you can set `WebSiteInfo.status` to `processing` and use the `create_crawl_message` method to return each batch of crawled pages. After all pages have been crawled, set `WebSiteInfo.status` to `completed`.
```python
class YourDataSource(WebsiteCrawlDatasource):
def _get_website_crawl(
self, datasource_parameters: dict[str, Any]
) -> Generator[ToolInvokeMessage, None, None]:
crawl_res = WebSiteInfo(web_info_list=[], status="", total=0, completed=0)
crawl_res.status = "processing"
yield self.create_crawl_message(crawl_res)
### your crawl logic
...
crawl_res.status = "completed"
crawl_res.web_info_list = [
WebSiteInfoDetail(
title="",
source_url="",
description="",
content="",
)
]
crawl_res.total = 1
crawl_res.completed = 1
yield self.create_crawl_message(crawl_res)
```
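The batched-return pattern above reduces to plain list chunking once the SDK types are set aside. A minimal sketch, independent of the Dify SDK (`chunk` is a hypothetical helper name, not part of the plugin API):

```python
from collections.abc import Iterator


def chunk(items: list, size: int) -> Iterator[list]:
    """Yield successive batches of at most `size` items, e.g. one crawl message per batch."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each yielded batch would become one `create_crawl_message` call with `status="processing"`, followed by a final `completed` message.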
#### Online Document
The return value for an online document data source plugin must include at least a `content` field to represent the document's content. For example:
```yaml
output_schema:
type: object
properties:
workspace_id:
type: string
description: workspace id
page_id:
type: string
description: page id
content:
type: string
description: page content
```
In the main logic code for an online document plugin, the class must inherit from `OnlineDocumentDatasource` and implement two methods: `_get_pages` and `_get_content`.
When a user runs the plugin, it first calls the `_get_pages` method to retrieve a list of documents. After the user selects a document from the list, it then calls the `_get_content` method to fetch the document's content.
<Tabs>
<Tab title="_get_pages">
```python
def _get_pages(self, datasource_parameters: dict[str, Any]) -> DatasourceGetPagesResponse:
# your get pages logic
response = requests.get(url, headers=headers, params=params, timeout=30)
pages = []
for item in response.json().get("results", []):
page = OnlineDocumentPage(
page_name=item.get("title", ""),
page_id=item.get("id", ""),
type="page",
last_edited_time=item.get("version", {}).get("createdAt", ""),
parent_id=item.get("parentId", ""),
page_icon=None,
)
pages.append(page)
online_document_info = OnlineDocumentInfo(
workspace_name=workspace_name,
workspace_icon=workspace_icon,
workspace_id=workspace_id,
pages=pages,
total=len(pages),
)
return DatasourceGetPagesResponse(result=[online_document_info])
```
</Tab>
<Tab title="_get_content">
```python
def _get_content(self, page: GetOnlineDocumentPageContentRequest) -> Generator[DatasourceMessage, None, None]:
# your fetch content logic, example
response = requests.get(url, headers=headers, params=params, timeout=30)
...
yield self.create_variable_message("content", "")
yield self.create_variable_message("page_id", "")
yield self.create_variable_message("workspace_id", "")
```
</Tab>
</Tabs>
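The translation from a raw API item to the fields of an `OnlineDocumentPage` is the part most worth isolating and unit-testing. A sketch as a plain dict mapping — the input field names (`title`, `id`, `parentId`, `version.createdAt`) mirror the example above and are assumptions about your service's API:

```python
def to_page_fields(item: dict) -> dict:
    """Map one raw API result item to the keyword arguments used to build an OnlineDocumentPage."""
    return {
        "page_name": item.get("title", ""),
        "page_id": item.get("id", ""),
        "type": "page",
        "last_edited_time": item.get("version", {}).get("createdAt", ""),
        "parent_id": item.get("parentId", ""),
        "page_icon": None,
    }
```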
#### Online Drive
An online drive data source plugin returns a file, so it must adhere to the following specification:
```yaml
output_schema:
type: object
properties:
file:
$ref: "https://dify.ai/schemas/v1/file.json"
```
In the main logic code for an online drive plugin, the class must inherit from `OnlineDriveDatasource` and implement two methods: `_browse_files` and `_download_file`.
When a user runs the plugin, it first calls `_browse_files` to get a file list. At this point, `prefix` is empty, indicating a request for the root directory's file list. The file list contains both folder and file type variables. If the user opens a folder, the `_browse_files` method is called again. At this point, the `prefix` in `OnlineDriveBrowseFilesRequest` will be the folder ID used to retrieve the file list within that folder.
After a user selects a file, the plugin uses the `_download_file` method and the file ID to get the file's content. You can use the `_get_mime_type_from_filename` method to get the file's MIME type, allowing the pipeline to handle different file types appropriately.
When the file list contains multiple files, you can set `OnlineDriveFileBucket.is_truncated` to `True` and set `OnlineDriveFileBucket.next_page_parameters` to the parameters needed to fetch the next page of the file list, such as the next page's request ID or URL, depending on the service provider.
<Tabs>
<Tab title="_browse_files">
```python
def _browse_files(
self, request: OnlineDriveBrowseFilesRequest
) -> OnlineDriveBrowseFilesResponse:
credentials = self.runtime.credentials
bucket_name = request.bucket
prefix = request.prefix or "" # Allow empty prefix for root folder; When you browse the folder, the prefix is the folder id
max_keys = request.max_keys or 10
next_page_parameters = request.next_page_parameters or {}
files = []
files.append(OnlineDriveFile(
id="",
name="",
size=0,
type="folder" # or "file"
))
return OnlineDriveBrowseFilesResponse(result=[
OnlineDriveFileBucket(
bucket="",
files=files,
is_truncated=False,
next_page_parameters={}
)
])
```
</Tab>
<Tab title="_download_file">
```python
def _download_file(self, request: OnlineDriveDownloadFileRequest) -> Generator[DatasourceMessage, None, None]:
credentials = self.runtime.credentials
file_id = request.id
file_content = bytes()
file_name = ""
mime_type = self._get_mime_type_from_filename(file_name)
yield self.create_blob_message(file_content, meta={
"file_name": file_name,
"mime_type": mime_type
})
def _get_mime_type_from_filename(self, filename: str) -> str:
"""Determine MIME type from file extension."""
import mimetypes
mime_type, _ = mimetypes.guess_type(filename)
return mime_type or "application/octet-stream"
```
</Tab>
</Tabs>
For storage services like AWS S3, the `prefix`, `bucket`, and `id` variables have special uses and can be applied flexibly as needed during development:
- `prefix`: Represents the file path prefix. For example, `prefix=container1/folder1/` retrieves the files or file list from the `folder1` folder in the `container1` bucket.
- `bucket`: Represents the file bucket. For example, `bucket=container1` retrieves the files or file list in the `container1` bucket. This field can be left blank for non-standard S3 protocol drives.
- `id`: Since the `_download_file` method does not use the `prefix` variable, the full file path must be included in the `id`. For example, `id=container1/folder1/file1.txt` indicates retrieving the `file1.txt` file from the `folder1` folder in the `container1` bucket.
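The `id` convention above can be implemented with a small helper that splits the bucket from the object key. A sketch for S3-style ids (the function name is illustrative, not part of the SDK):

```python
def split_file_id(file_id: str) -> tuple[str, str]:
    """Split an id like 'container1/folder1/file1.txt' into (bucket, key).

    Returns an empty bucket when the id has no '/' separator, e.g. for
    non-standard drives where the bucket field is left blank.
    """
    bucket, sep, key = file_id.partition("/")
    if not sep:
        return "", bucket
    return bucket, key
```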
<Tip>
You can refer to the specific implementations of the [official Google Drive plugin](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/google_cloud_storage/datasources/google_cloud_storage.py) and the [official AWS S3 plugin](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/aws_s3_storage/datasources/aws_s3_storage.py).
</Tip>
## Debug the Plugin
Data source plugins support two debugging methods: remote debugging or installing as a local plugin for debugging. Note the following:
- If the plugin uses OAuth authentication, the `redirect_uri` for remote debugging differs from that of a local plugin. Update the relevant configuration in your service provider's OAuth App accordingly.
- While data source plugins support single-step debugging, we still recommend testing them in a complete knowledge pipeline to ensure full functionality.
## Final Checks
Before packaging and publishing, make sure you've completed all of the following:
- Set the minimum supported Dify version to `1.9.0`.
- Set the SDK version to `dify-plugin>=0.5.0,<0.6.0`.
- Write the `README.md` and `PRIVACY.md` files.
- Include only English content in the code files.
- Replace the default icon with the data source provider's logo.
## Package and Publish
In the plugin directory, run the following command to generate a `.difypkg` plugin package:
```
dify plugin package . -o your_datasource.difypkg
```
Next, you can:
- Import and use the plugin in your Dify environment.
- Publish the plugin to Dify Marketplace by submitting a pull request.
<Info>
For the plugin publishing process, see [Publishing Plugins](/plugin-dev-en/0321-release-overview).
</Info>
{/*
Contributing Section
DO NOT edit this section!
It will be automatically generated by the script.
*/}
---
[Edit this page](https://github.com/langgenius/dify-docs/edit/main/plugin-dev-en/0222-datasource-plugin.mdx) | [Report an issue](https://github.com/langgenius/dify-docs/issues/new?template=docs.yml)

---
title: "Data Source Plugin"
---
Data source plugins are a new plugin type introduced in Dify 1.9.0. In a knowledge pipeline, they serve as the source of document data and the starting point of the entire pipeline.
This article describes how to develop a data source plugin (covering plugin structure, code examples, and debugging methods) to help you quickly develop and publish your plugin.
## Prerequisites
Before reading this article, make sure you have a basic understanding of the knowledge pipeline process and some knowledge of plugin development. You can find the relevant content in the following documents:
- [Step 2: Knowledge Pipeline Orchestration](/ja-jp/guides/knowledge-base/knowledge-pipeline/knowledge-pipeline-orchestration)
- [Dify Plugin Development: Hello World Guide](/plugin-dev-ja/0211-getting-started-dify-tool)
## Data Source Plugin Types
Dify supports three types of data source plugins: web crawler, online document, and online drive. When implementing the plugin code, the class that implements the plugin's functionality must inherit from a specific data source class; the three plugin types correspond to three parent classes.
<Info>
To learn how to inherit from a parent class to implement plugin functionality, see [Dify Plugin Development: Hello World Guide - 4.4 Implementing Tool Logic](/plugin-dev-ja/0211-getting-started-dify-tool#4-4-ツール-tool-ロジックの実装).
</Info>
Each data source plugin type supports multiple data sources. For example:
- Web Crawler: Jina Reader, FireCrawl
- Online Document: Notion, Confluence, GitHub
- Online Drive: OneDrive, Google Drive, Box, AWS S3, Tencent COS
The relationship between data source types and data source plugin types is illustrated below:
![](/images/data_source_type.png)
## Develop the Plugin
### Create a Data Source Plugin
You can use the scaffolding command-line tool to create a data source plugin by selecting the `datasource` type. Once the setup is complete, the command-line tool automatically generates the plugin project code.
```powershell
dify plugin init
```
![](/images/datasource_plugin_init.png)
<Info>
Typically, a data source plugin does not use other features of the Dify platform, so no additional permissions need to be configured.
</Info>
#### Data Source Plugin Structure
A data source plugin consists of three main components:
- The `manifest.yaml` file: describes the basic information about the plugin
- The `provider` directory: contains the plugin provider's description and the code implementing authentication
- The `datasources` directory: contains the description and core logic for fetching data from the data source
```
├── _assets
│   └── icon.svg
├── datasources
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── main.py
├── manifest.yaml
├── PRIVACY.md
├── provider
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── README.md
└── requirements.txt
```
#### Set the Correct Version and Tag
- In the `manifest.yaml` file, set the minimum supported Dify version as follows:
```yaml
minimum_dify_version: 1.9.0
```
- In the `manifest.yaml` file, add the following data source tag so that the plugin is categorized and displayed under the data source category in the Dify Marketplace:
```yaml
tags:
- rag
```
- In the `requirements.txt` file, set the plugin SDK version used for plugin development as follows:
```yaml
dify-plugin>=0.5.0,<0.6.0
```
### Add the Provider
#### Create the Provider YAML File
The definition and structure of a provider YAML file are essentially the same as for tool plugins, with only the following two differences:
```yaml
# Specify the provider type for the data source plugin: online_drive, online_document, or website_crawl
provider_type: online_drive # online_document, website_crawl
# Specify data sources
datasources:
- datasources/PluginName.yaml
```
<Info>
For more about creating a provider YAML file, see [Dify Plugin Development: Hello World Guide - 4.3 Configuring Provider Credentials](/plugin-dev-ja/0211-getting-started-dify-tool#4-3-プロバイダー認証情報の設定).
</Info>
<Info>
Data source plugins support authentication via OAuth 2.0 or API Key. To configure OAuth, see [Add OAuth Support to Your Tool Plugin](/plugin-dev-en/0222-tool-oauth).
</Info>
#### Create the Provider Code File
- **When using API Key authentication mode**, the provider code file for a data source plugin is identical to that of a tool plugin; you only need to change the parent class inherited by the provider class to `DatasourceProvider`.
```python
class YourDatasourceProvider(DatasourceProvider):
def _validate_credentials(self, credentials: Mapping[str, Any]) -> None:
try:
"""
IMPLEMENT YOUR VALIDATION HERE
"""
except Exception as e:
raise ToolProviderCredentialValidationError(str(e))
```
- **When using OAuth authentication mode**, data source plugins differ slightly from tool plugins. When obtaining access permissions via OAuth, a data source plugin can also return the username and avatar to be displayed on the frontend. Therefore, `_oauth_get_credentials` and `_oauth_refresh_credentials` must return a `DatasourceOAuthCredentials` type containing `name`, `avatar_url`, `expires_at`, and `credentials`.
- The `DatasourceOAuthCredentials` class is defined as follows and must be set to the corresponding type when returned:
```python
class DatasourceOAuthCredentials(BaseModel):
name: str | None = Field(None, description="The name of the OAuth credential")
avatar_url: str | None = Field(None, description="The avatar url of the OAuth")
credentials: Mapping[str, Any] = Field(..., description="The credentials of the OAuth")
expires_at: int | None = Field(
default=-1,
description="""The expiration timestamp (in seconds since Unix epoch, UTC) of the credentials.
Set to -1 or None if the credentials do not expire.""",
)
```
- The function signatures for `_oauth_get_authorization_url`, `_oauth_get_credentials`, and `_oauth_refresh_credentials` are as follows:
<Tabs>
<Tab title="_oauth_get_authorization_url">
```python
def _oauth_get_authorization_url(self, redirect_uri: str, system_credentials: Mapping[str, Any]) -> str:
"""
Generate the authorization URL for {{ .PluginName }} OAuth.
"""
try:
"""
IMPLEMENT YOUR AUTHORIZATION URL GENERATION HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return ""
```
</Tab>
<Tab title="_oauth_get_credentials">
```python
def _oauth_get_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], request: Request
) -> DatasourceOAuthCredentials:
"""
Exchange code for access_token.
"""
try:
"""
IMPLEMENT YOUR CREDENTIALS EXCHANGE HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
<Tab title="_oauth_refresh_credentials">
```python
def _oauth_refresh_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], credentials: Mapping[str, Any]
) -> DatasourceOAuthCredentials:
"""
Refresh the credentials
"""
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
</Tabs>
### Add the Data Source
The YAML file format and data source code format differ across the three types of data source plugins, as described below.
#### Web Crawler
In the provider YAML file for a web crawler plugin, `output_schema` must always return four parameters: `source_url`, `content`, `title`, and `description`.
```yaml
output_schema:
type: object
properties:
source_url:
type: string
description: the source url of the website
content:
type: string
description: the content from the website
title:
type: string
description: the title of the website
"description":
type: string
description: the description of the website
```
In the main logic code for a web crawler plugin, the class must inherit from `WebsiteCrawlDatasource`, implement the `_get_website_crawl` method, and use the `create_crawl_message` method to return web crawl messages.
To crawl multiple web pages and return them in batches, you can set `WebSiteInfo.status` to `processing` and use the `create_crawl_message` method to return each batch of crawled pages. After all pages have been crawled, set `WebSiteInfo.status` to `completed`.
```python
class YourDataSource(WebsiteCrawlDatasource):
def _get_website_crawl(
self, datasource_parameters: dict[str, Any]
) -> Generator[ToolInvokeMessage, None, None]:
crawl_res = WebSiteInfo(web_info_list=[], status="", total=0, completed=0)
crawl_res.status = "processing"
yield self.create_crawl_message(crawl_res)
### your crawl logic
...
crawl_res.status = "completed"
crawl_res.web_info_list = [
WebSiteInfoDetail(
title="",
source_url="",
description="",
content="",
)
]
crawl_res.total = 1
crawl_res.completed = 1
yield self.create_crawl_message(crawl_res)
```
#### Online Document
The return value of an online document plugin must include at least a `content` field representing the document's content. For example:
```yaml
output_schema:
type: object
properties:
workspace_id:
type: string
description: workspace id
page_id:
type: string
description: page id
content:
type: string
description: page content
```
In the main logic code for an online document plugin, the class must inherit from `OnlineDocumentDatasource` and implement two methods: `_get_pages` and `_get_content`. When a user runs the plugin, it first calls the `_get_pages` method to retrieve the document list. After the user selects a document from the list, it calls the `_get_content` method to fetch the document's content.
<Tabs>
<Tab title="_get_pages">
```python
def _get_pages(self, datasource_parameters: dict[str, Any]) -> DatasourceGetPagesResponse:
# your get pages logic
response = requests.get(url, headers=headers, params=params, timeout=30)
pages = []
for item in response.json().get("results", []):
page = OnlineDocumentPage(
page_name=item.get("title", ""),
page_id=item.get("id", ""),
type="page",
last_edited_time=item.get("version", {}).get("createdAt", ""),
parent_id=item.get("parentId", ""),
page_icon=None,
)
pages.append(page)
online_document_info = OnlineDocumentInfo(
workspace_name=workspace_name,
workspace_icon=workspace_icon,
workspace_id=workspace_id,
pages=pages,
total=len(pages),
)
return DatasourceGetPagesResponse(result=[online_document_info])
```
</Tab>
<Tab title="_get_content">
```python
def _get_content(self, page: GetOnlineDocumentPageContentRequest) -> Generator[DatasourceMessage, None, None]:
# your fetch content logic, example
response = requests.get(url, headers=headers, params=params, timeout=30)
...
yield self.create_variable_message("content", "")
yield self.create_variable_message("page_id", "")
yield self.create_variable_message("workspace_id", "")
```
</Tab>
</Tabs>
#### Online Drive
An online drive plugin returns a file, so it must adhere to the following specification:
```yaml
output_schema:
type: object
properties:
file:
$ref: "https://dify.ai/schemas/v1/file.json"
```
In the main logic code for an online drive plugin, the class must inherit from `OnlineDriveDatasource` and implement two methods: `_browse_files` and `_download_file`.
When a user runs the plugin, it first calls the `_browse_files` method to get a file list. At this point, `prefix` is empty, indicating a request for the file list directly under the root directory. The file list contains both folder and file type variables. If the user opens a folder, the `_browse_files` method is called again; this time, the `prefix` in `OnlineDriveBrowseFilesRequest` is the folder ID used to retrieve the file list within that folder.
After a user selects a file, the plugin uses the `_download_file` method and the file ID to fetch the file's content. You can use the `_get_mime_type_from_filename` method to get the file's MIME type, allowing the pipeline to handle different file types appropriately.
When the file list contains multiple files, you can set `OnlineDriveFileBucket.is_truncated` to `True` and set `OnlineDriveFileBucket.next_page_parameters` to the parameters needed to fetch the next page of the file list (such as the next page's request ID or URL, depending on the service provider).
<Tabs>
<Tab title="_browse_files">
```python
def _browse_files(
self, request: OnlineDriveBrowseFilesRequest
) -> OnlineDriveBrowseFilesResponse:
credentials = self.runtime.credentials
bucket_name = request.bucket
prefix = request.prefix or "" # Allow empty prefix for root folder; When you browse the folder, the prefix is the folder id
max_keys = request.max_keys or 10
next_page_parameters = request.next_page_parameters or {}
files = []
files.append(OnlineDriveFile(
id="",
name="",
size=0,
type="folder" # or "file"
))
return OnlineDriveBrowseFilesResponse(result=[
OnlineDriveFileBucket(
bucket="",
files=files,
is_truncated=False,
next_page_parameters={}
)
])
```
</Tab>
<Tab title="_download_file">
```python
def _download_file(self, request: OnlineDriveDownloadFileRequest) -> Generator[DatasourceMessage, None, None]:
credentials = self.runtime.credentials
file_id = request.id
file_content = bytes()
file_name = ""
mime_type = self._get_mime_type_from_filename(file_name)
yield self.create_blob_message(file_content, meta={
"file_name": file_name,
"mime_type": mime_type
})
def _get_mime_type_from_filename(self, filename: str) -> str:
"""Determine MIME type from file extension."""
import mimetypes
mime_type, _ = mimetypes.guess_type(filename)
return mime_type or "application/octet-stream"
```
</Tab>
</Tabs>
AWS S3などのストレージサービスプロバイダーでは、`prefix`、`bucket`、`id` の各変数に特別な使用方法があり、実際の開発では必要に応じて柔軟に適用できます。
- `prefix`: ファイルパスのプレフィックスを表します。例えば、`prefix=container1/folder1/` は、`container1` バケット内の `folder1` フォルダにあるファイルまたはファイルリストを取得することを示します。
- `bucket`: ファイルバケットを表します。例えば、`bucket=container1` は、`container1` バケット直下のファイルまたはファイルリストを取得することを示します非標準のS3プロトコルのオンラインストレージの場合、このフィールドは空にすることができます
- `id`: `_download_file` メソッドは prefix 変数を使用しないため、ファイルパスを id に連結する必要があります。例えば、`id=container1/folder1/file1.txt` は、`container1` バケット内の folder1 フォルダにある `file1.txt` ファイルを取得することを示します。
<Tip>
[公式 Google Cloud Storage プラグイン](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/google_cloud_storage/datasources/google_cloud_storage.py) および [公式 AWS S3 プラグイン](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/aws_s3_storage/datasources/aws_s3_storage.py)の具体的な実装を参考にしてください。
</Tip>
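`id` にバケット名とファイルパスを連結する規約を採る場合、ダウンロード時にはそれを分解する必要があります。以下は上記の規約を前提とした最小のスケッチです(`split_drive_file_id` は本文にはない仮のヘルパー名です):

```python
def split_drive_file_id(file_id: str) -> tuple[str, str]:
    """'container1/folder1/file1.txt' のような id を (bucket, key) に分解する。"""
    bucket, _, key = file_id.partition("/")
    return bucket, key


bucket, key = split_drive_file_id("container1/folder1/file1.txt")
print(bucket)  # → container1
print(key)     # → folder1/file1.txt
```

非標準 S3 プロトコルのサービスで `bucket` を使わない場合は、`key` 側だけを利用するなど、サービスプロバイダーに応じて柔軟に調整してください。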
## プラグインのデバッグ
データソースプラグインは、リモートデバッグとローカルプラグインとしてのインストールの2つのデバッグ方法をサポートしています。注意点
- プラグインがOAuth認証モードを使用している場合、リモートデバッグ時の `redirect_uri` はローカルプラグインの設定と一致しないため、サービスプロバイダーのOAuth App関連設定を変更する必要があります。
- データソースプラグインはステップ実行デバッグをサポートしていますが、機能の正確性を保証するため、完全なナレッジパイプラインでテストすることをお勧めします。
## 最終チェック
パッケージ化して公開する前に、以下の項目がすべて完了していることを確認してください:
- サポートする最低Difyバージョンを `1.9.0` に設定する
- SDKバージョンを `dify-plugin>=0.5.0,<0.6.0` に設定する
- `README.md` と `PRIVACY.md` ファイルを作成する
- コードファイルには英語のコンテンツのみが含まれていることを確認する
- デフォルトのアイコンをデータソースプロバイダーのロゴに置き換える
## パッケージ化と公開
プラグインディレクトリで以下のコマンドを実行すると、`.difypkg` プラグインパッケージが生成されます:
```bash
dify plugin package . -o your_datasource.difypkg
```
次に、以下のいずれかを行うことができます:
- あなたのDify環境にプラグインをインポートして使用する
- プルリクエストを送信して、Dify Marketplaceにプラグインを公開する
<Info>
プラグインの公開プロセスについては、[プラグインのリリース](/plugin-dev-ja/0321-release-overview)をご覧ください。
</Info>
{/*
Contributing Section
DO NOT edit this section!
It will be automatically generated by the script.
*/}
---
[このページを編集する](https://github.com/langgenius/dify-docs/edit/main/plugin-dev-ja/0222-datasource-plugin.mdx) | [問題を報告する](https://github.com/langgenius/dify-docs/issues/new?template=docs.yml)

@@ -0,0 +1,451 @@
---
title: "数据源插件"
---
数据源Data Source插件是 Dify 1.9.0 新引入的一种插件类型。在知识流水线Knowledge Pipeline其作为文档数据的来源并充当整个流水线的起始点。
本文介绍如何开发数据源插件(包括插件结构、代码示例、调试方法等),帮助你快速完成插件开发与上线。
## 前置准备
在阅读本文之前,请确保你对知识流水线的流程有基本的了解,并且具有一定的插件开发知识。你可以在以下文档找到相关内容:
- [步骤二:知识流水线编排](/zh-hans/guides/knowledge-base/knowledge-pipeline/knowledge-pipeline-orchestration)
- [Dify 插件开发Hello World 指南](/plugin-dev-zh/0211-getting-started-dify-tool)
## 数据源插件类型
Dify 支持三种数据源插件:网页爬虫、在线文档和在线网盘。在具体实现插件代码时,实现插件功能的类需要继承不同的数据源类,三种插件类型对应三种父类。
<Info>
了解如何继承父类以实现插件功能,请阅读 [Dify 插件开发Hello World 指南-4.4 实现工具Tool逻辑](/plugin-dev-zh/0211-getting-started-dify-tool#4-4-实现工具-tool-逻辑)。
</Info>
每个数据源插件类型支持配置多种数据源,例如:
- 网页爬虫Jina ReaderFirecrawl
- 在线文档NotionConfluenceGitHub
- 在线网盘OneDriveGoogle DriveBoxAWS S3Tencent COS
数据源类型与数据源插件类型的关系如下图所示:
![](/images/data_source_type.png)
## 开发插件
### 创建数据源插件
你可以使用脚手架命令行工具来创建数据源插件,并选择 `datasource` 类型。完成设置后,命令行工具将自动生成插件项目代码。
```powershell
dify plugin init
```
![](/images/datasource_plugin_init.png)
<Info>
一般情况下,数据源插件不使用 Dify 平台的其他功能,因此无需为其设置额外权限。
</Info>
#### 数据源插件结构
数据源插件包含三个主要部分:
- `manifest.yaml` 文件:描述插件的基本信息
- `provider` 目录:包含插件供应商的描述与实现鉴权的代码
- `datasources` 目录:包含实现获取数据源核心逻辑的描述与代码
```
├── _assets
│   └── icon.svg
├── datasources
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── main.py
├── manifest.yaml
├── PRIVACY.md
├── provider
│   ├── your_datasource.py
│   └── your_datasource.yaml
├── README.md
└── requirements.txt
```
#### 设置正确的版本及标签
- 在 `manifest.yaml` 文件中,插件支持的最低 Dify 版本需设置如下:
```yaml
minimum_dify_version: 1.9.0
```
- 在 `manifest.yaml` 文件中,需为插件添加如下数据源标签,使插件在 Dify Marketplace 中以数据源分类展示:
```yaml
tags:
- rag
```
- 在 `requirements.txt` 文件中,插件开发使用的插件 SDK 版本需设置如下:
```text
dify-plugin>=0.5.0,<0.6.0
```
### 添加供应商
#### 创建供应商 YAML 文件
供应商 YAML 文件的定义和编写与工具插件基本相同,仅有以下两点差异:
```yaml
# 指定数据源的 provider 类型,可设置为 online_driveonline_document 或 website_crawl
provider_type: online_drive # online_document, website_crawl
# 指定数据源
datasources:
- datasources/PluginName.yaml
```
<Info>
了解更多创建供应商 YAML 文件的信息,请阅读 [Dify 插件开发Hello World 指南-4.3 配置 Provider 凭证](/plugin-dev-zh/0211-getting-started-dify-tool#4-3-配置-provider-凭证)。
</Info>
<Info>
数据源插件支持以 OAuth 2.0 或 API Key 两种方式进行认证。了解如何配置 OAuth请阅读 [为工具插件添加 OAuth 支持](/plugin-dev-zh/0222-tool-oauth)。
</Info>
#### 创建供应商代码文件
- **若使用 API Key 认证模式**,数据源插件的供应商代码文件与工具插件完全相同,仅需将供应商类继承的父类修改为 `DatasourceProvider` 即可。
```python
class YourDatasourceProvider(DatasourceProvider):
def _validate_credentials(self, credentials: Mapping[str, Any]) -> None:
try:
"""
IMPLEMENT YOUR VALIDATION HERE
"""
except Exception as e:
raise ToolProviderCredentialValidationError(str(e))
```
- **若使用 OAuth 认证模式**,数据源插件则与工具插件略有不同。使用 OAuth 获取访问权限时,数据源插件可同时返回用户名和头像并显示在前端。因此,`_oauth_get_credentials` 和 `_oauth_refresh_credentials` 需要返回包含 `name`、`avatar_url`、`expires_at` 和 `credentials` 的 `DatasourceOAuthCredentials` 类型。
- `DatasourceOAuthCredentials` 类的定义如下,返回时需设置为对应的类型:
```python
class DatasourceOAuthCredentials(BaseModel):
name: str | None = Field(None, description="The name of the OAuth credential")
avatar_url: str | None = Field(None, description="The avatar url of the OAuth")
credentials: Mapping[str, Any] = Field(..., description="The credentials of the OAuth")
expires_at: int | None = Field(
default=-1,
description="""The expiration timestamp (in seconds since Unix epoch, UTC) of the credentials.
Set to -1 or None if the credentials do not expire.""",
)
```
- `_oauth_get_authorization_url``_oauth_get_credentials` 和 `_oauth_refresh_credentials` 的函数签名如下:
<Tabs>
<Tab title="_oauth_get_authorization_url">
```python
def _oauth_get_authorization_url(self, redirect_uri: str, system_credentials: Mapping[str, Any]) -> str:
"""
Generate the authorization URL for {{ .PluginName }} OAuth.
"""
try:
"""
IMPLEMENT YOUR AUTHORIZATION URL GENERATION HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return ""
```
</Tab>
<Tab title="_oauth_get_credentials">
```python
def _oauth_get_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], request: Request
) -> DatasourceOAuthCredentials:
"""
Exchange code for access_token.
"""
try:
"""
IMPLEMENT YOUR CREDENTIALS EXCHANGE HERE
"""
except Exception as e:
raise DatasourceOAuthError(str(e))
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
<Tab title="_oauth_refresh_credentials">
```python
def _oauth_refresh_credentials(
self, redirect_uri: str, system_credentials: Mapping[str, Any], credentials: Mapping[str, Any]
) -> DatasourceOAuthCredentials:
"""
Refresh the credentials
"""
return DatasourceOAuthCredentials(
name="",
avatar_url="",
expires_at=-1,
credentials={},
)
```
</Tab>
</Tabs>
### 添加数据源
三种数据源插件需要创建的 YAML 文件格式与数据源代码格式有所不同,下面将分别介绍。
#### 网页爬虫Web Crawler
在网页爬虫类插件的数据源 YAML 文件中,`output_schema` 需固定返回四个参数:`source_url`、`content`、`title` 和 `description`。
```yaml
output_schema:
type: object
properties:
source_url:
type: string
description: the source url of the website
content:
type: string
description: the content from the website
title:
type: string
description: the title of the website
"description":
type: string
description: the description of the website
```
在网页爬虫类插件的主逻辑代码中,需继承 `WebsiteCrawlDatasource` 类并实现 `_get_website_crawl` 方法,然后使用 `create_crawl_message` 方法返回网页爬虫消息。
如需爬取多个网页并分批返回,可将 `WebSiteInfo.status` 设置为 `processing`,然后使用 `create_crawl_message` 方法返回每一批网页爬虫消息。当所有网页均爬取完成后,再将 `WebSiteInfo.status` 设置为 `completed`。
```python
class YourDataSource(WebsiteCrawlDatasource):
def _get_website_crawl(
self, datasource_parameters: dict[str, Any]
) -> Generator[ToolInvokeMessage, None, None]:
crawl_res = WebSiteInfo(web_info_list=[], status="", total=0, completed=0)
crawl_res.status = "processing"
yield self.create_crawl_message(crawl_res)
### your crawl logic
...
crawl_res.status = "completed"
crawl_res.web_info_list = [
WebSiteInfoDetail(
title="",
source_url="",
description="",
content="",
)
]
crawl_res.total = 1
crawl_res.completed = 1
yield self.create_crawl_message(crawl_res)
```
#### 在线文档Online Document
在线文档类插件的返回值至少需包含 `content` 字段用于表示文档内容,示例如下:
```yaml
output_schema:
type: object
properties:
workspace_id:
type: string
description: workspace id
page_id:
type: string
description: page id
content:
type: string
description: page content
```
在线文档类插件的主逻辑代码中,需继承 `OnlineDocumentDatasource` 类并实现 `_get_pages` 和 `_get_content` 两个方法。当用户运行插件时,首先通过 `_get_pages` 方法获取文档列表;当用户从列表中选择某个文档后,再通过 `_get_content` 方法获取文档内容。
<Tabs>
<Tab title="_get_pages">
```python
def _get_pages(self, datasource_parameters: dict[str, Any]) -> DatasourceGetPagesResponse:
# your get pages logic
response = requests.get(url, headers=headers, params=params, timeout=30)
pages = []
for item in response.json().get("results", []):
page = OnlineDocumentPage(
page_name=item.get("title", ""),
page_id=item.get("id", ""),
type="page",
last_edited_time=item.get("version", {}).get("createdAt", ""),
parent_id=item.get("parentId", ""),
page_icon=None,
)
pages.append(page)
online_document_info = OnlineDocumentInfo(
workspace_name=workspace_name,
workspace_icon=workspace_icon,
workspace_id=workspace_id,
        pages=pages,
        total=len(pages),
)
return DatasourceGetPagesResponse(result=[online_document_info])
```
</Tab>
<Tab title="_get_content">
```python
def _get_content(self, page: GetOnlineDocumentPageContentRequest) -> Generator[DatasourceMessage, None, None]:
# your fetch content logic, example
response = requests.get(url, headers=headers, params=params, timeout=30)
...
yield self.create_variable_message("content", "")
yield self.create_variable_message("page_id", "")
yield self.create_variable_message("workspace_id", "")
```
</Tab>
</Tabs>
#### 在线网盘Online Drive
在线网盘类插件的返回值类型为一个文件,需遵循以下规范:
```yaml
output_schema:
type: object
properties:
file:
$ref: "https://dify.ai/schemas/v1/file.json"
```
在线网盘类插件的主逻辑代码中,需继承 `OnlineDriveDatasource` 类并实现 `_browse_files` 和 `_download_file` 两个方法。
当用户运行插件时,首先通过 `_browse_files` 方法获取文件列表。此时 `prefix` 为空,表示获取根目录下的文件列表。文件列表中包含文件夹与文件两种类型的变量。当用户继续打开文件夹时,将再次运行 `_browse_files` 方法。此时, `OnlineDriveBrowseFilesRequest` 中的 `prefix` 为文件夹 ID用于获取该文件夹内的文件列表。
当用户选择某个文件后,插件通过 `_download_file` 方法和文件 ID 获取文件内容。你可以使用 `_get_mime_type_from_filename` 方法获取文件的 MIME 类型,以便在流水线中对不同的文件类型进行不同的处理。
当文件列表包含多个文件时,可将 `OnlineDriveFileBucket.is_truncated` 设置为 `True`,并将`OnlineDriveFileBucket.next_page_parameters` 设置为继续获取文件列表的参数,如下一页的请求 ID 或 URL具体取决于不同的服务商。
<Tabs>
<Tab title="_browse_files">
```python
def _browse_files(
self, request: OnlineDriveBrowseFilesRequest
) -> OnlineDriveBrowseFilesResponse:
credentials = self.runtime.credentials
bucket_name = request.bucket
    prefix = request.prefix or ""  # allow empty prefix for the root folder; when browsing a folder, the prefix is the folder id
max_keys = request.max_keys or 10
next_page_parameters = request.next_page_parameters or {}
files = []
files.append(OnlineDriveFile(
id="",
name="",
size=0,
type="folder" # or "file"
))
return OnlineDriveBrowseFilesResponse(result=[
OnlineDriveFileBucket(
bucket="",
files=files,
is_truncated=False,
next_page_parameters={}
)
])
```
</Tab>
<Tab title="_download_file">
```python
def _download_file(self, request: OnlineDriveDownloadFileRequest) -> Generator[DatasourceMessage, None, None]:
credentials = self.runtime.credentials
file_id = request.id
file_content = bytes()
file_name = ""
mime_type = self._get_mime_type_from_filename(file_name)
yield self.create_blob_message(file_content, meta={
"file_name": file_name,
"mime_type": mime_type
})
def _get_mime_type_from_filename(self, filename: str) -> str:
"""Determine MIME type from file extension."""
import mimetypes
mime_type, _ = mimetypes.guess_type(filename)
return mime_type or "application/octet-stream"
```
</Tab>
</Tabs>
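上文 `_download_file` 示例中的 `_get_mime_type_from_filename` 辅助方法仅依赖标准库 `mimetypes`,其行为可用如下最小示例验证(函数名为配合示例代码的假设命名):

```python
import mimetypes


def get_mime_type_from_filename(filename: str) -> str:
    """根据文件扩展名推断 MIME 类型,未知时返回通用二进制类型。"""
    mime_type, _ = mimetypes.guess_type(filename)
    return mime_type or "application/octet-stream"


print(get_mime_type_from_filename("report.pdf"))    # → application/pdf
print(get_mime_type_from_filename("unknown_file"))  # → application/octet-stream
```

提前标注 MIME 类型后,流水线中的后续节点即可按文件类型分别处理。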
对于 AWS S3 等存储服务商,`prefix``bucket` 和 `id` 变量有特殊的使用方法,可在实际开发中按需灵活应用:
- `prefix`:表示文件路径前缀,例如 `prefix=container1/folder1/` 表示获取 `container1` 桶下的 `folder1` 文件夹中的文件或文件列表。
- `bucket`:表示文件桶,例如 `bucket=container1` 表示获取 `container1` 桶下的文件或文件列表(若是非标准 S3 协议网盘,该字段可为空)。
- `id`:由于`_download_file` 方法不使用 `prefix` 变量,需将文件路径拼接到 `id` 中,例如 `id=container1/folder1/file1.txt` 表示获取 `container1` 桶下的 `folder1` 文件夹中的 `file1.txt` 文件。
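由于 `id` 中拼接了桶名与文件路径,下载时通常需要先将其拆开。下面是基于上述约定的一个最小示意(其中 `split_drive_file_id` 为本文之外的假设辅助函数):

```python
def split_drive_file_id(file_id: str) -> tuple[str, str]:
    """将形如 'container1/folder1/file1.txt' 的 id 拆分为 (bucket, key)。"""
    bucket, _, key = file_id.partition("/")
    return bucket, key


bucket, key = split_drive_file_id("container1/folder1/file1.txt")
print(bucket)  # → container1
print(key)     # → folder1/file1.txt
```

若服务商不使用标准 S3 协议的桶概念,可只取 `key` 部分,按实际服务商的约定灵活调整。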
<Tip>
你可以参考 [官方 Google Cloud Storage 插件](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/google_cloud_storage/datasources/google_cloud_storage.py) 和 [官方 AWS S3 插件](https://github.com/langgenius/dify-official-plugins/blob/main/datasources/aws_s3_storage/datasources/aws_s3_storage.py) 中的具体实现。
</Tip>
## 调试插件
数据源插件支持两种调试方式:远程调试或安装为本地插件进行调试。需注意:
- 若插件使用 OAuth 认证模式,其远程调试时的 `redirect_uri` 与本地插件的设置并不一致,需要修改服务商 OAuth App 的相关配置。
- 数据源插件支持单步调试,但为了保证功能的正确性,我们仍推荐你在完整的知识流水线中进行测试。
## 最终检查
在打包与发布前,确保已完成以下事项:
- 设置支持的最低 Dify 版本为 `1.9.0`
- 设置 SDK 版本为 `dify-plugin>=0.5.0,<0.6.0`
- 编写 `README.md` 和 `PRIVACY.md` 文件
- 确保代码文件中仅包含英文内容
- 将默认图标替换为数据源供应商 Logo
## 打包与发布
在插件目录执行以下命令,即可生成 `.difypkg` 插件包:
```bash
dify plugin package . -o your_datasource.difypkg
```
接下来,你可以:
- 在你的 Dify 环境中导入并使用该插件
- 通过提交 Pull Request 的方式将插件发布到 Dify Marketplace
<Info>
了解插件发布流程,请阅读 [发布插件](/plugin-dev-zh/0321-release-overview)。
</Info>
{/*
Contributing Section
DO NOT edit this section!
It will be automatically generated by the script.
*/}
---
[编辑此页面](https://github.com/langgenius/dify-docs/edit/main/plugin-dev-zh/0222-datasource-plugin.mdx) | [提交问题](https://github.com/langgenius/dify-docs/issues/new?template=docs.yml)

@@ -376,5 +376,5 @@ It will be automatically generated by the script.
---
[编辑此页面](https://github.com/langgenius/dify-docs/edit/main/zh-hans/plugins/quick-start/develop-plugins/tool-oauth.mdx) | [提交问题](https://github.com/langgenius/dify-docs/issues/new?template=docs.yml)
[编辑此页面](https://github.com/langgenius/dify-docs/edit/main/plugin-dev-zh/0222-tool-oauth.mdx) | [提交问题](https://github.com/langgenius/dify-docs/issues/new?template=docs.yml)

@@ -6,8 +6,8 @@ dimensions:
level: intermediate
standard_title: Tool Plugin
language: zh
title: Tool 插件
description: 本文档详细介绍如何开发Dify的工具插件以Google Search为例实现了一个完整的工具插件开发流程。内容包括插件初始化、模板选择、模板选择、工具供应商配置文件定义、添加第三方服务凭证、工具功能代码实现、调试和打包发布等完整环节。
title: 工具插件
description: 本文档介绍如何开发 Dify 的工具插件,以 Google Search 为例实现了一个完整的工具插件开发流程。内容包括插件初始化、模板选择、模板选择、工具供应商配置文件定义、添加第三方服务凭证、工具功能代码实现、调试和打包发布等完整环节。
---
工具指的是能够被 Chatflow / Workflow / Agent 类型应用所调用的第三方服务,提供完整的 API 实现能力,用于增强 Dify 应用的能力。例如为应用添加在线搜索、图片生成等额外功能。

@@ -7,7 +7,9 @@ title: "步骤二:知识流水线编排"
在这个章节,你将了解知识流水线的过程,理解不同节点的含义和配置,如何自定义构建数据处理流程,从而高效地管理和优化知识库。
### 界面状态
进入知识流水线编排界面时,你会看到:
- **标签页状态**Documents文档、Retrieval Test召回测试和 Settings设置标签页将显示为置灰且不可用状态
- **必要步骤**:您必须完成知识流水线的配置、调试和发布后,才能上传文件或使用其他功能
@@ -24,7 +26,7 @@ title: "步骤二:知识流水线编排"
在开始之前,我们先拆解知识流水线的处理流程,你可以更好地理解数据是如何一步步转化为可用的知识库。
<Tip>
**数据源配置 → 数据处理节点(文档提取器 + 分块器)→ 知识库节点(分块结构+索引配置) → 配置用户输入表单 → 测试发布**
**数据源配置 → 数据处理节点(文档提取器 + 分块器)→ 知识库节点(分块结构+索引配置) → 配置用户输入表单 → 测试发布**
</Tip>
1. **数据源配置**来自各种数据源的原始内容本地文件、Notion、网页等
@@ -37,7 +39,7 @@ title: "步骤二:知识流水线编排"
## 步骤一:数据源配置
在一个知识库里你可以选择单一或多个数据源。每个数据源可以被多次选中并包含不同配置。目前Dify 支持 5 种数据源:文件上传、在线网盘、在线数据和 Web Crawler
在一个知识库里你可以选择单一或多个数据源。每个数据源可以被多次选中并包含不同配置。目前Dify 支持 4 种数据源:文件上传、在线网盘、在线文档和网页爬虫
你也可以前往 [Dify Marketplace](https://marketplace.dify.ai),获得更多数据源。
@@ -45,53 +47,62 @@ title: "步骤二:知识流水线编排"
用户可以直接选择本地文件进行上传,以下是配置选项和限制。
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![](/images/knowledge-base/knowledge-pipeline-orchestration-1.PNG)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
**配置选择**
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![](/images/knowledge-base/knowledge-pipeline-orchestration-1.PNG)
| 配置项 | 说明 |
|--------|------|
| 文件格式 | 支持 pdf, xlsx, docs 等,用户可自定义选择 |
| 上传方式 | 通过拖拽或选择文件或文件夹上传本地文件,支持批量上传 |
**限制**
| 限制项 | 说明 |
|--------|------|
| 文件数量 | 每次最多上传 50 个文件 |
| 文件大小 | 每个文件大小不超过 15MB |
| 储存限制 | 不同 SaaS 版本的订阅计划对文档上传总数和向量存储空间有所限制 |
**输出变量**
| 输出变量 | 变量格式 |
|----------|----------|
| `{x} Document` | 单个文档 |
</div>
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**配置选择**
| 配置项 | 说明 |
| ---- | ----------------------------- |
| 文件格式 | 支持 pdf, xlsx, docs 等,用户可自定义选择 |
| 上传方式 | 通过拖拽或选择文件或文件夹上传本地文件,支持批量上传 |
**限制**
| 限制项 | 说明 |
| ---- | --------------------------------- |
| 文件数量 | 每次最多上传 50 个文件 |
| 文件大小 | 每个文件大小不超过 15MB |
| 储存限制 | 不同 SaaS 版本的订阅计划对文档上传总数和向量存储空间有所限制 |
**输出变量**
| 输出变量 | 变量格式 |
| -------------- | ---- |
| `{x} Document` | 单个文档 |
</div>
</div>
---
### 在线数据
### 在线文档
#### Notion
将知识库连接 Notion 工作区,可直接导入 Notion 页面和数据库内容,支持后续的数据自动同步。
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![Notion](/images/knowledge-base/knowledge-pipeline-orchestration-2.PNG)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
**配置选项说明**
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Notion](/images/knowledge-base/knowledge-pipeline-orchestration-2.PNG)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**配置选项说明**
| 配置项 | 选项 | 输出变量 | 说明 |
| --------- | --- | -------------- | ------------ |
| Extractor | 开启 | `{x} Content` | 输出结构化处理的页面信息 |
| | 关闭 | `{x} Document` | 输出页面的原始文本信息 |
</div>
| 配置项 | 选项 | 输出变量 | 说明 |
|--------|------|----------|------|
| Extractor | 开启 | `{x} Content` | 输出结构化处理的页面信息 |
| | 关闭 | `{x} Document` | 输出页面的原始文本信息 |
</div>
</div>
### 网页爬虫
@@ -102,53 +113,61 @@ title: "步骤二:知识流水线编排"
开源网页解析工具,提供简洁易用的 API 服务,适合快速抓取和处理网页内容。
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![Jina Reader](/images/knowledge-base/knowledge-pipeline-orchestration-3.png)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
**参数配置和说明**
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Jina Reader](/images/knowledge-base/knowledge-pipeline-orchestration-3.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
**参数配置和说明**
| 参数 | 类型 | 说明 |
| -------------------------- | --- | ---------- |
| URL | 必填 | 目标网页地址 |
| 爬取子页面 (Crawl sub-page) | 可选 | 是否抓取链接页面 |
| 使用站点地图 (Use sitemap) | 可选 | 利用网站地图进行爬取 |
| 爬取页数限制 (Limit) | 必填 | 设置最大抓取页面数量 |
| 启用内容提取器 (Enable Extractor) | 可选 | 选择数据提取方式 |
</div>
| 参数 | 类型 | 说明 |
|------|------|------|
| URL | 必填 | 目标网页地址 |
| 爬取子页面 (Crawl sub-page) | 可选 | 是否抓取链接页面 |
| 使用站点地图 (Use sitemap) | 可选 | 利用网站地图进行爬取 |
| 爬取页数限制 (Limit) | 必填 | 设置最大抓取页面数量 |
| 启用内容提取器 (Enable Extractor) | 可选 | 选择数据提取方式 |
</div>
</div>
#### Firecrawl
开源网页解析工具,提供更精细的爬取控制选项和 API 服务,支持复杂网站结构的深度爬取,适合需要批量处理和精确控制的场景。
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![Firecrawl](/images/knowledge-base/knowledge-pipeline-orchestration-4.png)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
**参数配置和说明**
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Firecrawl](/images/knowledge-base/knowledge-pipeline-orchestration-4.png)
| 参数 | 类型 | 说明 |
|------|------|------|
| URL | 必填 | 目标网页地址 |
| 爬取页数限制 (Limit) | 必填 | 设置最大抓取页面数量 |
| 爬取子页面 (Crawl sub-page) | 可选 | 是否抓取链接页面 |
| 最大爬取深度 (Max depth) | 可选 | 控制爬取层级深度 |
| 排除路径 (Exclude paths) | 可选 | 设置不爬取的页面路径 |
| 仅包含路径 (Include only paths) | 可选 | 限制只爬取指定路径 |
| 启用内容提取器 (Enable Extractor) | 可选 | 选择数据处理方式 |
| 只提取主要内容 | 可选 | 过滤页面辅助信息 |
</div>
</div>
### 在线网盘 (Online Drive)
<div style={{ flex:2,minWidth:"300px" }}>
**参数配置和说明**
| 参数 | 类型 | 说明 |
| -------------------------- | --- | ---------- |
| URL | 必填 | 目标网页地址 |
| 爬取页数限制 (Limit) | 必填 | 设置最大抓取页面数量 |
| 爬取子页面 (Crawl sub-page) | 可选 | 是否抓取链接页面 |
| 最大爬取深度 (Max depth) | 可选 | 控制爬取层级深度 |
| 排除路径 (Exclude paths) | 可选 | 设置不爬取的页面路径 |
| 仅包含路径 (Include only paths) | 可选 | 限制只爬取指定路径 |
| 启用内容提取器 (Enable Extractor) | 可选 | 选择数据处理方式 |
| 只提取主要内容 | 可选 | 过滤页面辅助信息 |
</div>
</div>
### 在线网盘
连接你的在线云储存服务(例如 Google Drive、Dropbox、OneDriveDify 将自动检索云储存中的文件,你可以勾选并导入相应文档进行下一步处理,无需手动下载文件再进行上传。
<Tip>
关于第三方数据源授权,请前往[数据源授权](/zh-hans/guides/knowledge-base/knowledge-pipeline/authorize-data-source)。
关于第三方数据源授权,请前往[数据源授权](/zh-hans/guides/knowledge-base/knowledge-pipeline/authorize-data-source)。
</Tip>
---
@@ -168,7 +187,7 @@ title: "步骤二:知识流水线编排"
文档提取器节点可以理解为一个信息处理中心,通过识别并读取输入变量中的文件,提取信息后转化为下一个节点可使用的格式。
<Tip>
关于文档提取器的详细功能和配置方法,请参考[文档提取器](/zh-hans/guides/workflow/node/doc-extractor)。
关于文档提取器的详细功能和配置方法,请参考[文档提取器](/zh-hans/guides/workflow/node/doc-extractor)。
</Tip>
#### Dify 提取器 (Dify Extractor)
@@ -179,14 +198,18 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
#### Unstructured
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![Unstructured](/images/knowledge-base/knowledge-pipeline-orchestration-7.png)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
Unstructured 将文档转换为结构化的机器可读格式具有高度可定制的处理策略。它提供多种提取策略auto、hi_res、fast、OCR-only和分块方法by_title、by_page、by_similarity来处理各种文档类型提供详细的元素级元数据包括坐标、置信度分数和布局信息。推荐用于企业文档工作流、混合文件类型处理以及需要精确控制文档处理参数的场景。
</div>
</div>
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![Unstructured](/images/knowledge-base/knowledge-pipeline-orchestration-7.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
Unstructured 将文档转换为结构化的机器可读格式具有高度可定制的处理策略。它提供多种提取策略auto、hi_res、fast、OCR-only和分块方法by_title、by_page、by_similarity来处理各种文档类型提供详细的元素级元数据包括坐标、置信度分数和布局信息。推荐用于企业文档工作流、混合文件类型处理以及需要精确控制文档处理参数的场景。
</div>
</div>
你可前往 [Dify Marketplace](https://marketplace.dify.ai) 探索更多工具。
@@ -200,18 +223,18 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
#### 分块器类型概述
| 类型 | 特点 | 使用场景 |
|------|------|----------|
| 通用分块器 | 固定大小分块,支持自定义分隔符 | 结构简单的基础文档 |
| 父子分块器 | 双层分段结构,平衡匹配精准度和上下文 | 需要较多上下文信息的复杂文档结构 |
| 问答处理器 | 处理表格中的问答组合 | CSV 和 Excel 的结构化问答数据 |
| 类型 | 特点 | 使用场景 |
| ----- | ------------------ | -------------------- |
| 通用分块器 | 固定大小分块,支持自定义分隔符 | 结构简单的基础文档 |
| 父子分块器 | 双层分段结构,平衡匹配精准度和上下文 | 需要较多上下文信息的复杂文档结构 |
| 问答处理器 | 处理表格中的问答组合 | CSV 和 Excel 的结构化问答数据 |
#### 通用文本预处理规则
| 处理选项 | 说明 |
|----------|------|
| 处理选项 | 说明 |
| -------------- | ------------------------ |
| 替换连续空格、换行符和制表符 | 将文档中的连续空格、换行符和制表符替换为单个空格 |
| 移除所有 URL 和邮箱地址 | 自动识别并移除文本中的网址链接和邮箱地址 |
| 移除所有 URL 和邮箱地址 | 自动识别并移除文本中的网址链接和邮箱地址 |
#### 通用分块器 (General Chunker)
@@ -219,18 +242,18 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
**输入输出变量**
| 类型 | 变量 | 说明 |
|------|------|------|
| 输入变量 | `{x} Content` | 完整的文档内容块,通用分块器将其拆分为若干小段 |
| 输出变量 | `{x} Array[Chunk]` | 分块后的内容数组,每个片段适合进行检索和分析 |
| 类型 | 变量 | 说明 |
| ---- | ------------------ | ----------------------- |
| 输入变量 | `{x} Content` | 完整的文档内容块,通用分块器将其拆分为若干小段 |
| 输出变量 | `{x} Array[Chunk]` | 分块后的内容数组,每个片段适合进行检索和分析 |
**分块设置 (Chunk Settings)**
| 配置项 | 说明 |
|--------|------|
| 分段标识符 (Delimiter) | 默认值为 `\n`,即按照文本段落分段。你可以遵循正则表达式语法自定义分块规则,系统将在文本出现分段标识符时自动执行分段。 |
| 分段最大长度 (Maximum Chunk Length) | 指定分段内的文本字符数最大上限,超出该长度时将强制分段。 |
| 分段重叠长度 (Chunk Overlap) | 对数据进行分段时,段与段之间存在一定的重叠部分。这种重叠可以帮助提高信息的保留和分析的准确性,提升召回效果。 |
| 配置项 | 说明 |
| ----------------------------- | ------------------------------------------------------------- |
| 分段标识符 (Delimiter) | 默认值为 `\n`,即按照文本段落分段。你可以遵循正则表达式语法自定义分块规则,系统将在文本出现分段标识符时自动执行分段。 |
| 分段最大长度 (Maximum Chunk Length) | 指定分段内的文本字符数最大上限,超出该长度时将强制分段。 |
| 分段重叠长度 (Chunk Overlap) | 对数据进行分段时,段与段之间存在一定的重叠部分。这种重叠可以帮助提高信息的保留和分析的准确性,提升召回效果。 |
#### 父子分块器 (Parent-child Chunker)
@@ -243,20 +266,20 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
**输入输出变量**
| 类型 | 变量 | 说明 |
|------|------|------|
| 输入变量 | `{x} Content` | 完整的文档内容块,通用分块器将其拆分为若干小段 |
| 输出变量 | `{x} Array[ParentChunk]` | 父分块数组 |
| 类型 | 变量 | 说明 |
| ---- | ------------------------ | ----------------------- |
| 输入变量 | `{x} Content` | 完整的文档内容块,通用分块器将其拆分为若干小段 |
| 输出变量 | `{x} Array[ParentChunk]` | 父分块数组 |
**分块设置 (Chunk Settings)**
| 配置项 | 说明 |
|--------|------|
| 父分块分隔符 (Parent Delimiter) | 设置父分块的分割标识符 |
| 父分块最大长度 (Parent Maximum Chunk Length) | 控制父分块的最大字符数 |
| 子分块分隔符 (Child Delimiter) | 设置子分块的分割标识符 |
| 子分块最大长度 (Child Maximum Chunk Length) | 控制子分块的最大字符数 |
| 父块模式 (Parent Mode) | 选择"段落"(将文本分割为段落)或"完整文档"(使用整个文档作为父分块)进行直接检索 |
| 配置项 | 说明 |
| ------------------------------------- | ------------------------------------------ |
| 父分块分隔符 (Parent Delimiter) | 设置父分块的分割标识符 |
| 父分块最大长度 (Parent Maximum Chunk Length) | 控制父分块的最大字符数 |
| 子分块分隔符 (Child Delimiter) | 设置子分块的分割标识符 |
| 子分块最大长度 (Child Maximum Chunk Length) | 控制子分块的最大字符数 |
| 父块模式 (Parent Mode) | 选择"段落"(将文本分割为段落)或"完整文档"(使用整个文档作为父分块)进行直接检索 |
#### 问答处理器 Q&A Processor (Extractor+Chunker)
@@ -264,15 +287,15 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
**输入输出变量**
| 类型 | 变量 | 说明 |
|------|------|------|
| 输入变量 | `{x} Document` | 单个文档 |
| 类型 | 变量 | 说明 |
| ---- | -------------------- | ------ |
| 输入变量 | `{x} Document` | 单个文档 |
| 输出变量 | `{x} Array[QAChunk]` | 问答分块数组 |
**变量配置**
| 名称 | 说明 |
|------|------|
| 名称 | 说明 |
| ------ | ------------ |
| 问题所在的列 | 将内容所在的列设置为问题 |
| 答案所在的列 | 将内容所在的列设置为答案 |
@@ -291,13 +314,12 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
知识库支持两种分段模式:通用模式与父子模式。如果你是首次创建知识库,建议选择父子模式。
<Warning>
**重要提醒**:分段结构一旦保存发布后无法修改,请根据实际需求进行选择。
**重要提醒**:分段结构一旦保存发布后无法修改,请根据实际需求进行选择。
</Warning>
#### 通用模式
适用于大多数标准文档处理场景。
通用模式提供灵活的索引选项,你可以根据对质量和成本的不同要求选择合适的索引方法。通用模式支持高质量和经济的索引方式,以及多种检索设置。
适用于大多数标准文档处理场景。 通用模式提供灵活的索引选项,你可以根据对质量和成本的不同要求选择合适的索引方法。通用模式支持高质量和经济的索引方式,以及多种检索设置。
#### 父子模式
@@ -312,6 +334,7 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
输入变量用于接收来自数据处理节点的处理结果,用作知识库构建的数据源。你需要将前面配置的分块器节点的输出,连接到知识库节点并作为输入。
该节点根据所选的分段结构,支持不同类型的标准输入:
- **通用模式**`{x} Array[Chunk]` - 通用分块数组
- **父子模式**`{x} Array[ParentChunk]` - 父分块数组
- **问答模式**`{x} Array[QAChunk]` - 问答分块数组
@@ -320,29 +343,28 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
索引方式决定了知识库如何建立内容索引,检索设置则基于所选的索引方式提供相应的检索策略。你可以这么理解,索引方式决定了整理文档的方式,而检索设置告知使用者可以用什么方法来查找文档。
知识库提供了两种索引方式:高质量和经济,分别提供不同的检索设置选项。
在高质量模式下,使用 Embedding 嵌入模型将已分段的文本块转换为数字向量,帮助更加有效地压缩与存储大量文本信息。这使得即使用户的问题用词与文档不完全相同,系统也能找到语义相关的准确答案。
知识库提供了两种索引方式:高质量和经济,分别提供不同的检索设置选项。 在高质量模式下,使用 Embedding 嵌入模型将已分段的文本块转换为数字向量,帮助更加有效地压缩与存储大量文本信息。这使得即使用户的问题用词与文档不完全相同,系统也能找到语义相关的准确答案。
<Tip>
请查看[设定索引方法与检索设置](/zh-hans/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods),了解更多详情。
请查看[设定索引方法与检索设置](/zh-hans/guides/knowledge-base/create-knowledge-and-upload-documents/setting-indexing-methods),了解更多详情。
</Tip>
#### 索引方式和检索设置
| 索引方式 | 可用检索设置 | 说明 |
|----------|--------------|------|
| 高质量 | 向量搜索 | 基于语义相似度,理解查询深层含义。 |
| | 全文检索 | 基于关键词匹配的检索方式,提供全面的检索能力。 |
| | 混合检索 | 结合语义和关键词 |
| 经济 | 倒排索引 | 搜索引擎常用的检索方法,匹配问题与关键内容。 |
| 索引方式 | 可用检索设置 | 说明 |
| ---- | ------ | ----------------------- |
| 高质量 | 向量搜索 | 基于语义相似度,理解查询深层含义。 |
| | 全文检索 | 基于关键词匹配的检索方式,提供全面的检索能力。 |
| | 混合检索 | 结合语义和关键词 |
| 经济 | 倒排索引 | 搜索引擎常用的检索方法,匹配问题与关键内容。 |
关于配置分段结构、索引方法、配置参数和检索设置,你也可以参考下方表格。
| 分段结构 | 可选索引方式 | 可配置参数 | 可用检索设置 |
|----------|--------------|------------|--------------|
| 分段结构 | 可选索引方式 | 可配置参数 | 可用检索设置 |
| ---- | --------------------------- | ----------------------------------------- | ---------------------------------------- |
| 通用模式 | 高质量 <br /> <br /> <br /> 经济 | Embedding 嵌入模型 <br /> <br /> <br /> 关键词数量 | 向量检索 <br /> 全文检索 <br /> 混合检索 <br /> 倒排索引 |
| 父子模式 | 高质量(仅支持) | Embedding Model 嵌入模型 | 向量检索 <br /> 全文检索 <br /> 混合检索 |
| 问答模式 | 高质量(仅支持) | Embedding Model 嵌入模型 | 向量检索 <br /> 全文检索 <br /> 混合检索 |
| 父子模式 | 高质量(仅支持) | Embedding Model 嵌入模型 | 向量检索 <br /> 全文检索 <br /> 混合检索 |
| 问答模式 | 高质量(仅支持) | Embedding Model 嵌入模型 | 向量检索 <br /> 全文检索 <br /> 混合检索 |
## 步骤四:配置用户输入表单
@@ -354,14 +376,11 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
你可以通过下面两种方式,创建用户输入表单。
1. **知识流水线编排界面**
点击输入字段Input Field开始创建和配置输入表单。
![输入表单](/images/knowledge-base/knowledge-pipeline-orchestration-9.png)
1. **知识流水线编排界面** 点击输入字段Input Field开始创建和配置输入表单。
![输入表单](/images/knowledge-base/knowledge-pipeline-orchestration-9.png)
2. **节点参数面板** 选中节点,在右侧的面板需要填写的参数内,点击最下方的`+ 创建用户输入字段`+ Create user input)来创建新的输入项。新增的输入项将会汇总到输入字段Input Field的表单内。
2. **节点参数面板**
选中节点,在右侧的面板需要填写的参数内,点击最下方的`+ 创建用户输入字段`+ Create user input)来创建新的输入项。新增的输入项将会汇总到输入字段Input Field的表单内。
![节点参数](/images/knowledge-base/knowledge-pipeline-orchestration-10.png)
### 输入类型
@@ -387,40 +406,45 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
### 支持字段类型和填写说明
知识流水线支持以下七种类型的输入变量。
<div style={{display: 'flex', flexWrap: 'wrap', gap: '30px'}}>
<div style={{flex: 1, minWidth: '200px'}}>
![字段类型](/images/knowledge-base/knowledge-pipeline-orchestration-14.png)
</div>
<div style={{flex: 2, minWidth: '300px'}}>
| 字段类型 | 说明 |
|----------|------|
| 文本 | 短文本,由知识库使用者自行填写,最大长度为 256 字符 |
| 段落 | 长文本,知识库使用者可以输入较长字符 |
| 下拉选项 | 由编排者预设的固定选项供使用者选择,使用者无法自行填写内容 |
| 布尔值 | 只有真/假两个取值 |
| 数字 | 只能输入数字 |
| 单文件 | 上传单个文件,支持多种文件类型(文档、图片、音频、视频和其他文件类型) |
| 文件列表 | 批量上传文件,支持多种文件类型(文档、图片、音频、视频和其他文件类型) |
</div>
<div style={{ display:"flex",flexWrap:"wrap",gap:"30px" }}>
<div style={{ flex:1,minWidth:"200px" }}>
![字段类型](/images/knowledge-base/knowledge-pipeline-orchestration-14.png)
</div>
<div style={{ flex:2,minWidth:"300px" }}>
| 字段类型 | 说明 |
| ---- | ----------------------------------- |
| 文本 | 短文本,由知识库使用者自行填写,最大长度为 256 字符 |
| 段落 | 长文本,知识库使用者可以输入较长字符 |
| 下拉选项 | 由编排者预设的固定选项供使用者选择,使用者无法自行填写内容 |
| 布尔值 | 只有真/假两个取值 |
| 数字 | 只能输入数字 |
| 单文件 | 上传单个文件,支持多种文件类型(文档、图片、音频、视频和其他文件类型) |
| 文件列表 | 批量上传文件,支持多种文件类型(文档、图片、音频、视频和其他文件类型) |
</div>
</div>
<Tip>
请前往[输入字段](/zh-hans/guides/workflow/node/start#%E8%BE%93%E5%85%A5%E5%AD%97%E6%AE%B5),了解关于支持字段的更多说明。
请前往[输入字段](/zh-hans/guides/workflow/node/start#%E8%BE%93%E5%85%A5%E5%AD%97%E6%AE%B5),了解关于支持字段的更多说明。
</Tip>
所有类型的输入项包含:必填项、非必填项和更多设置,可以通过勾选设置为是否为必填。
| 名称 | 说明 | 示例 |
|------|------|------|
| **必填项** | | |
| 变量名称 Variable Name | 系统内部标识名称,通常使用英文和下划线进行命名 | `user_email` |
| 显示名称 Display Name | 界面展示的名称,通常是简洁易读的文字 | 用户邮箱 |
| **类型特定设置** | 不同字段类型的特殊要求 | 文本的最大长度为 100 字符 |
| **更多设置** | | |
| 默认值 Default Value | 用户未输入时的默认值 | 数字字段默认为0 文本字段默认为空 |
| 占位符 Placeholder | 输入框空白时的提示文字 | 请输入您的邮箱 |
| 提示 Tooltip | 解释或指引用户进行填写的文字,通常在用户鼠标悬停时显示 | 请输入有效的邮箱地址 |
| **特殊非必填信息** | 根据不同字段类型的额外设置选项 | 邮箱格式验证 |
| 名称 | 说明 | 示例 |
| ------------------ | --------------------------- | ------------------ |
| **必填项** | | |
| 变量名称 Variable Name | 系统内部标识名称,通常使用英文和下划线进行命名 | `user_email` |
| 显示名称 Display Name | 界面展示的名称,通常是简洁易读的文字 | 用户邮箱 |
| **类型特定设置** | 不同字段类型的特殊要求 | 文本的最大长度为 100 字符 |
| **更多设置** | | |
| 默认值 Default Value | 用户未输入时的默认值 | 数字字段默认为0 文本字段默认为空 |
| 占位符 Placeholder | 输入框空白时的提示文字 | 请输入您的邮箱 |
| 提示 Tooltip | 解释或指引用户进行填写的文字,通常在用户鼠标悬停时显示 | 请输入有效的邮箱地址 |
| **特殊非必填信息** | 根据不同字段类型的额外设置选项 | 邮箱格式验证 |
配置完成后,点击右上角的预览按钮,你可以在弹出的表单预览界面中浏览。你可以拖拽调整字段的分组,如果出现感叹号,则表明移动后引用失效。
@@ -433,12 +457,10 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
默认情况下,知识库名称为"Untitled + 序号”,权限设置为"仅自己可见”,图表为橙色书本。如果你使用 DSL文件导入则将使用其保存的图标。
点击左侧面板中的设置并填写以下信息:
- **名称和图标**
为你的知识库命名。你还可以选择一个 emoji、上传图片或粘贴图片 URL 作为知识库的图标。
- **知识库描述**
简要描述您的知识库。这有助于 AI 更好地理解和检索您的数据。如果留空Dify 将应用默认的检索策略
- **权限**
从下拉菜单中选择适当的访问权限。
- **名称和图标** 为你的知识库命名。你还可以选择一个 emoji、上传图片或粘贴图片 URL 作为知识库的图标。
- **知识库描述** 简要描述您的知识库。这有助于 AI 更好地理解和检索您的数据。如果留空Dify 将应用默认的检索策略。
- **权限** 从下拉菜单中选择适当的访问权限
## 步骤六:测试
@@ -460,8 +482,9 @@ Dify Extractor 是 Dify 开发的一款内置文档解析器。它支持多种
1. **开始测试**点击右上角的测试运行Test Run)按钮。
2. **导入测试文件**:在右侧弹出的数据源窗口中,导入文件。
<Warning>
**重要提醒**:为了便于调试和观测,在测试运行状态下,每次仅允许上传一个文件。
**重要提醒**:为了便于调试和观测,在测试运行状态下,每次仅允许上传一个文件。
</Warning>
3. **填写参数**:导入成功后,根据你之前配置的用户输入表单填写对应参数
4. **开始试运行**:点击下一步,开始测试整个流水线。