---
title: Build Tool Plugins for Multimodal Data Processing in Knowledge Pipelines
---
In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.
When you develop a tool plugin for multimodal data processing, complete the following configuration so that the plugin's multimodal output (text, images, audio, video, etc.) can be correctly recognized and embedded by the Knowledge Base node:
- **In the tool code file**, call the tool session interface to upload files and construct the `files` object.
- **In the tool provider YAML file**, declare the `output_schema` as either `multimodal-Parent-Child` or `multimodal-General`.
## Upload Files and Construct File Objects
When processing multimodal data (such as images), you first need to upload the file through Dify's tool session interface to obtain the file metadata.
The following example uses the official Dify plugin, **Dify Extractor**, to demonstrate how to upload a file and construct a `files` object.
```python
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,  # filename
    file_blob,  # file binary data
    mime_type,  # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![image]({file_res.preview_url})"
```
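The upload call above expects an explicit MIME type. If your plugin only has a filename at hand, a small standard-library helper can derive one; this snippet is illustrative and not part of the Dify SDK:

```python
import mimetypes


def guess_mime_type(file_name: str, default: str = "application/octet-stream") -> str:
    """Best-effort MIME type detection from a filename's extension.

    Illustrative helper (not part of the Dify SDK); pass the result as the
    `mime_type` argument of the upload call shown above.
    """
    mime_type, _ = mimetypes.guess_type(file_name)
    return mime_type or default
```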
The upload interface returns an `UploadFileResponse` object containing the file information. Its structure is as follows:
```python
from enum import Enum

from pydantic import BaseModel


class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
```
You can map the returned file information (`name`, `size`, `extension`, `mime_type`, etc.) to the `files` field in the multimodal output structure.
<CodeGroup>
```json multimodal_parent_child_structure highlight={22-62} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "object",
  "title": "Multimodal Parent-Child Structure",
  "description": "Schema for multimodal parent-child structure (v1)",
  "properties": {
    "parent_mode": {
      "type": "string",
      "description": "The mode of parent-child relationship"
    },
    "parent_child_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "parent_content": {
            "type": "string",
            "description": "The parent content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
            },
            "description": "List of files"
          },
          "child_contents": {
            "type": "array",
            "items": {
              "type": "string"
            },
            "description": "List of child contents"
          }
        },
        "required": ["parent_content", "child_contents"]
      },
      "description": "List of parent-child chunk pairs"
    }
  },
  "required": ["parent_mode", "parent_child_chunks"]
}
```
```json multimodal_general_structure highlight={18-56} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_general_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "array",
  "title": "Multimodal General Structure",
  "description": "Schema for multimodal general structure (v1) - array of objects",
  "properties": {
    "general_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "content": {
            "type": "string",
            "description": "The content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "description": "List of files"
            }
          }
        },
        "required": ["content"]
      },
      "description": "List of content and files"
    }
  }
}
```
</CodeGroup>
## Declare Multimodal Output Structure
The structure of multimodal data is defined by Dify's official JSON schema.
To enable the Knowledge Base node to recognize the plugin's multimodal output type, you need to point the `result` field under `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.
```yaml
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
```
Taking `multimodal-Parent-Child` as an example, a complete YAML configuration is as follows:
```yaml expandable
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures
parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm
output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
extra:
  python:
    source: tools/parent_child_chunk.py
```