---
title: Build Tool Plugins That Process Multimodal Data in Knowledge Pipelines
---

<Note> ⚠️ This document was automatically translated by AI. If you find any inaccuracies, refer to the [original English version](/en/develop-plugin/dev-guides-and-walkthroughs/develop-multimodal-data-processing-tool).</Note>

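In a knowledge pipeline, the Knowledge Base node accepts input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.

When you develop a tool plugin for multimodal data processing, the Knowledge Base node can correctly recognize and embed the plugin's multimodal output (text, images, audio, video, and so on) only if you complete the following configuration:

- **In the tool code**, call the upload API and construct the file objects (`files`).

- **In the tool provider YAML file**, declare `output_schema` as `multimodal-Parent-Child` or `multimodal-General`.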

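## Upload and Construct File Objects

When processing multimodal data such as images, first upload the file through Dify's tool session API to obtain its metadata.

The following example, taken from the official Dify plugin Dify Extractor, shows how to upload a file and construct a file object.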

```python
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,   # filename
    file_blob,   # file binary data
    mime_type,   # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![image]({file_res.preview_url})"
```
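The upload call above needs a MIME type for each file. As a minimal sketch, assuming the plugin only knows the file name, the type can be guessed with Python's standard `mimetypes` module. The helper name `guess_mime_type` and the fallback value are illustrative choices, not part of the Dify SDK.

```python
import mimetypes


def guess_mime_type(file_name: str) -> str:
    """Guess a MIME type from the file name (hypothetical helper).

    Falls back to a generic binary type when the extension is unknown.
    """
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type or "application/octet-stream"


# Example: derive the third argument of the upload call from the file name.
mime_type = guess_mime_type("figure.png")
```

Guessing from the extension is only a heuristic. If the extraction step already knows the real content type, passing that value directly is the safer choice.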

The upload API returns an `UploadFileResponse` object containing basic information about the file:

```python
from enum import Enum
from pydantic import BaseModel

class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
```

Based on this structure, you can map the file information (`name`, `size`, `extension`, `mime_type`, and so on) to the `files` field of the multimodal output structure.

<CodeGroup>
    ```json multimodal_parent_child_structure highlight={22-62} expandable
    {
        "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
        "$schema": "http://json-schema.org/draft-07/schema#",
        "version": "1.0.0",
        "type": "object",
        "title": "Multimodal Parent-Child Structure",
        "description": "Schema for multimodal parent-child structure (v1)",
        "properties": {
            "parent_mode": {
            "type": "string",
            "description": "The mode of parent-child relationship"
            },
            "parent_child_chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                "parent_content": {
                    "type": "string",
                    "description": "The parent content"
                },
                "files": {
                    "type": "array",
                    "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                        "type": "string",
                        "description": "file name"
                        },
                        "size": {
                        "type": "number",
                        "description": "file size"
                        },
                        "extension": {
                        "type": "string",
                        "description": "file extension"
                        },
                        "type": {
                        "type": "string",
                        "description": "file type"
                        },
                        "mime_type": {
                        "type": "string",
                        "description": "file mime type"
                        },
                        "transfer_method": {
                        "type": "string",
                        "description": "file transfer method"
                        },
                        "url": {
                        "type": "string",
                        "description": "file url"
                        },
                        "related_id": {
                        "type": "string",
                        "description": "file related id"
                        }
                    },
                    "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
                    },
                    "description": "List of files"
                },
                "child_contents": {
                    "type": "array",
                    "items": {
                    "type": "string"
                    },
                    "description": "List of child contents"
                }
                },
                "required": ["parent_content", "child_contents"]
            },
            "description": "List of parent-child chunk pairs"
            }
        },
        "required": ["parent_mode", "parent_child_chunks"]
    }
    ```

    ```json multimodal_general_structure highlight={18-56} expandable
    {
        "$id": "https://dify.ai/schemas/v1/multimodal_general_structure.json",
        "$schema": "http://json-schema.org/draft-07/schema#",
        "version": "1.0.0",
        "type": "array",
        "title": "Multimodal General Structure",
        "description": "Schema for multimodal general structure (v1) - array of objects",
        "properties": {
            "general_chunks": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                "content": {
                    "type": "string",
                    "description": "The content"
                },
                "files": {
                    "type": "array",
                    "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                        "type": "string",
                        "description": "file name"
                        },
                        "size": {
                        "type": "number",
                        "description": "file size"
                        },
                        "extension": {
                        "type": "string",
                        "description": "file extension"
                        },
                        "type": {
                        "type": "string",
                        "description": "file type"
                        },
                        "mime_type": {
                        "type": "string",
                        "description": "file mime type"
                        },
                        "transfer_method": {
                        "type": "string",
                        "description": "file transfer method"
                        },
                        "url": {
                        "type": "string",
                        "description": "file url"
                        },
                        "related_id": {
                        "type": "string",
                        "description": "file related id"
                        }
                    },
                    "description": "List of files"
                }
                }
                },
                "required": ["content"]
            },
            "description": "List of content and files"
            }
        }
    }
    ```
</CodeGroup>
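The mapping from an upload response to one item of the `files` array can be sketched as follows. This is an illustration under stated assumptions, not the Dify Extractor's actual code: `UploadedFile` is a plain stand-in for `UploadFileResponse`, `to_file_entry` is a hypothetical helper, and the choices for `transfer_method`, `url`, and `related_id` are guesses that should be checked against the English original or the plugin source.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class UploadedFile:
    """Stand-in for the UploadFileResponse fields used below."""
    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Optional[str] = None
    preview_url: Optional[str] = None


def to_file_entry(uploaded: UploadedFile,
                  transfer_method: str = "tool_file") -> dict[str, Any]:
    """Map an upload response onto one item of the schema's `files` array."""
    return {
        "name": uploaded.name,
        "size": uploaded.size,
        "extension": uploaded.extension,
        # Assumption: fall back to "document" when no type was returned.
        "type": uploaded.type or "document",
        "mime_type": uploaded.mime_type,
        # Assumption: the value "tool_file" is illustrative, not confirmed.
        "transfer_method": transfer_method,
        # Assumption: the preview URL serves as the file URL.
        "url": uploaded.preview_url or "",
        # Assumption: the upload id serves as the related id.
        "related_id": uploaded.id,
    }


entry = to_file_entry(
    UploadedFile(
        id="f-1",
        name="figure.png",
        size=2048,
        extension=".png",
        mime_type="image/png",
        type="image",
        preview_url="https://example.com/preview/f-1",
    )
)
```

The resulting dictionary carries all eight keys that the parent-child schema lists as required for a file item.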

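## Declare the Multimodal Output Structure

The structure of multimodal data is defined by official JSON Schemas provided by Dify.

For the Knowledge Base node to recognize the plugin's multimodal output type, point the `result` field of `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.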

```yaml
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"

      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
```

Taking `multimodal-Parent-Child` as an example, a complete YAML configuration looks like this:

```yaml expandable
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures

parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm

output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
extra:
  python:
    source: tools/parent_child_chunk.py
```
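A tool declared with the parent-child schema is expected to produce a `result` value shaped like that schema. The sketch below assembles such a value and checks the keys the schema marks as required. The helper names `make_chunk` and `make_result` are hypothetical, the `parent_mode` value `"paragraph"` is a placeholder assumption, and how the tool hands the value back to Dify is outside the scope of this sketch.

```python
from typing import Any, Optional

# Keys the parent-child schema lists as required for each file item.
REQUIRED_FILE_KEYS = {
    "name", "size", "extension", "type",
    "mime_type", "transfer_method", "url", "related_id",
}


def make_chunk(parent_content: str,
               child_contents: list[str],
               files: Optional[list[dict[str, Any]]] = None) -> dict[str, Any]:
    """Build one item of `parent_child_chunks`; `files` is optional in the schema."""
    chunk: dict[str, Any] = {
        "parent_content": parent_content,
        "child_contents": child_contents,
    }
    if files:
        for entry in files:
            missing = REQUIRED_FILE_KEYS - entry.keys()
            if missing:
                raise ValueError(f"file entry is missing keys: {sorted(missing)}")
        chunk["files"] = files
    return chunk


def make_result(parent_mode: str, chunks: list[dict[str, Any]]) -> dict[str, Any]:
    """Build the top-level object with the schema's two required fields."""
    return {"parent_mode": parent_mode, "parent_child_chunks": chunks}


# Assumption: "paragraph" is only a placeholder value for `parent_mode`.
result = make_result(
    "paragraph",
    [make_chunk("Section 1 overview", ["First sentence.", "Second sentence."])],
)
```

Checking the required keys before returning surfaces a malformed file entry inside the plugin, where it is easier to diagnose than a failure in the Knowledge Base node.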