---
title: Build Tool Plugins for Multimodal Data Processing in Knowledge Pipelines
---

In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.

When developing a tool plugin for multimodal data processing, complete the following configuration so that the plugin's multimodal output (text, images, audio, video, and so on) can be correctly recognized and embedded by the Knowledge Base node:

- **In the tool code file**, call the tool session interface to upload files and construct the `files` object.
- **In the tool provider YAML file**, declare the `output_schema` as either `multimodal-Parent-Child` or `multimodal-General`.
## Upload Files and Construct File Objects

When processing multimodal data (such as images), first upload the file through Dify's tool session interface to obtain the file metadata.

The following example uses the official Dify plugin **Dify Extractor** to demonstrate how to upload a file and construct a `files` object.

```python
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,  # file name
    file_blob,  # file binary data
    mime_type,  # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![{file_res.name}]({file_res.preview_url})"
```

The upload interface returns an `UploadFileResponse` object containing the file information. Its structure is as follows:

```python
from enum import Enum

from pydantic import BaseModel


class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
```
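
Since the `type` field on the response is optional, you can derive it from the MIME type you passed to the upload call via `Type.from_mime_type`. Below is a standalone illustration of that mapping; the class is re-declared here without pydantic so the snippet runs on its own.

```python
from enum import Enum


# Re-declaration of UploadFileResponse.Type (without pydantic),
# so the MIME-type mapping can be tried in isolation.
class Type(str, Enum):
    DOCUMENT = "document"
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"

    @classmethod
    def from_mime_type(cls, mime_type: str):
        if mime_type.startswith("image/"):
            return cls.IMAGE
        if mime_type.startswith("video/"):
            return cls.VIDEO
        if mime_type.startswith("audio/"):
            return cls.AUDIO
        return cls.DOCUMENT


print(Type.from_mime_type("image/png").value)        # image
print(Type.from_mime_type("application/pdf").value)  # document
```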

You can map the file information (such as `name`, `size`, `extension`, and `mime_type`) to the `files` field in the multimodal output structure.

<CodeGroup>
```json multimodal_parent_child_structure highlight={22-62} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "object",
  "title": "Multimodal Parent-Child Structure",
  "description": "Schema for multimodal parent-child structure (v1)",
  "properties": {
    "parent_mode": {
      "type": "string",
      "description": "The mode of parent-child relationship"
    },
    "parent_child_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "parent_content": {
            "type": "string",
            "description": "The parent content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
            },
            "description": "List of files"
          },
          "child_contents": {
            "type": "array",
            "items": {
              "type": "string"
            },
            "description": "List of child contents"
          }
        },
        "required": ["parent_content", "child_contents"]
      },
      "description": "List of parent-child chunk pairs"
    }
  },
  "required": ["parent_mode", "parent_child_chunks"]
}
```

```json multimodal_general_structure highlight={18-56} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_general_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "array",
  "title": "Multimodal General Structure",
  "description": "Schema for multimodal general structure (v1) - array of objects",
  "properties": {
    "general_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "content": {
            "type": "string",
            "description": "The content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "description": "List of files"
            }
          }
        },
        "required": ["content"]
      },
      "description": "List of content and files"
    }
  }
}
```

</CodeGroup>

## Declare Multimodal Output Structure

The structure of multimodal data is defined by Dify's official JSON schemas.

To enable the Knowledge Base node to recognize the plugin's multimodal output type, you need to point the `result` field under `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.

```yaml
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"

      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
```

Taking `multimodal-Parent-Child` as an example, a complete YAML configuration is as follows:

```yaml expandable
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures

parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm

output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"

extra:
  python:
    source: tools/parent_child_chunk.py
```
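
Before shipping the plugin, it can help to sanity-check the tool's output shape against the schema's required fields. The following stdlib-only helper is illustrative (it is not part of the Dify SDK) and only checks the parent-child structure's required keys:

```python
# Illustrative helper (not part of the Dify SDK): verify a payload carries the
# fields the multimodal-Parent-Child schema marks as required.
REQUIRED_FILE_KEYS = {
    "name", "size", "extension", "type",
    "mime_type", "transfer_method", "url", "related_id",
}


def check_parent_child_payload(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the basic shape is valid."""
    problems = []
    if "parent_mode" not in payload:
        problems.append("missing parent_mode")
    for i, chunk in enumerate(payload.get("parent_child_chunks", [])):
        for key in ("parent_content", "child_contents"):
            if key not in chunk:
                problems.append(f"chunk {i}: missing {key}")
        for j, file_obj in enumerate(chunk.get("files", [])):
            missing = REQUIRED_FILE_KEYS - file_obj.keys()
            if missing:
                problems.append(f"chunk {i} file {j}: missing {sorted(missing)}")
    return problems
```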