---
title: Build Tool Plugins for Multimodal Data Processing in Knowledge Pipelines
---
In knowledge pipelines, the Knowledge Base node supports input in two multimodal data formats: `multimodal-Parent-Child` and `multimodal-General`.
When you develop a tool plugin for multimodal data processing, complete the following configuration so that the plugin's multimodal output (text, images, audio, video, etc.) can be correctly recognized and embedded by the Knowledge Base node:
- **In the tool code file**, call the tool session interface to upload files and construct the `files` object.
- **In the tool provider YAML file**, declare the `output_schema` as either `multimodal-Parent-Child` or `multimodal-General`.
## Upload Files and Construct File Objects
When processing multimodal data (such as images), you first need to upload the file through Dify's tool session interface to obtain the file metadata.
The following example uses the official Dify plugin, **Dify Extractor**, to demonstrate how to upload a file and construct a `files` object.
```python
# Upload the file using the tool session
file_res = self._tool.session.file.upload(
    file_name,  # filename
    file_blob,  # file binary data
    mime_type,  # MIME type, e.g., "image/png"
)

# Generate a Markdown image reference using the file preview URL
image_url = f"![image]({file_res.preview_url})"
```
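The upload call above expects an explicit MIME type. If your plugin only has a filename at hand, a small standard-library helper can derive one; this snippet is illustrative and not part of the Dify SDK:

```python
import mimetypes


def guess_mime_type(file_name: str, default: str = "application/octet-stream") -> str:
    """Best-effort MIME type detection from a filename's extension.

    Illustrative helper (not part of the Dify SDK); pass the result as the
    `mime_type` argument of the upload call shown above.
    """
    mime_type, _ = mimetypes.guess_type(file_name)
    return mime_type or default
```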
The upload interface returns an `UploadFileResponse` object containing the file information. Its structure is as follows:
```python
from enum import Enum

from pydantic import BaseModel


class UploadFileResponse(BaseModel):
    class Type(str, Enum):
        DOCUMENT = "document"
        IMAGE = "image"
        VIDEO = "video"
        AUDIO = "audio"

        @classmethod
        def from_mime_type(cls, mime_type: str):
            if mime_type.startswith("image/"):
                return cls.IMAGE
            if mime_type.startswith("video/"):
                return cls.VIDEO
            if mime_type.startswith("audio/"):
                return cls.AUDIO
            return cls.DOCUMENT

    id: str
    name: str
    size: int
    extension: str
    mime_type: str
    type: Type | None = None
    preview_url: str | None = None
```
You can map the returned file information (`name`, `size`, `extension`, `mime_type`, etc.) to the `files` field in the multimodal output structure.
<CodeGroup>
```json multimodal_parent_child_structure highlight={22-62} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "object",
  "title": "Multimodal Parent-Child Structure",
  "description": "Schema for multimodal parent-child structure (v1)",
  "properties": {
    "parent_mode": {
      "type": "string",
      "description": "The mode of parent-child relationship"
    },
    "parent_child_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "parent_content": {
            "type": "string",
            "description": "The parent content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "required": ["name", "size", "extension", "type", "mime_type", "transfer_method", "url", "related_id"]
            },
            "description": "List of files"
          },
          "child_contents": {
            "type": "array",
            "items": {
              "type": "string"
            },
            "description": "List of child contents"
          }
        },
        "required": ["parent_content", "child_contents"]
      },
      "description": "List of parent-child chunk pairs"
    }
  },
  "required": ["parent_mode", "parent_child_chunks"]
}
```
```json multimodal_general_structure highlight={18-56} expandable
{
  "$id": "https://dify.ai/schemas/v1/multimodal_general_structure.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "version": "1.0.0",
  "type": "array",
  "title": "Multimodal General Structure",
  "description": "Schema for multimodal general structure (v1) - array of objects",
  "properties": {
    "general_chunks": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "content": {
            "type": "string",
            "description": "The content"
          },
          "files": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "name": {
                  "type": "string",
                  "description": "file name"
                },
                "size": {
                  "type": "number",
                  "description": "file size"
                },
                "extension": {
                  "type": "string",
                  "description": "file extension"
                },
                "type": {
                  "type": "string",
                  "description": "file type"
                },
                "mime_type": {
                  "type": "string",
                  "description": "file mime type"
                },
                "transfer_method": {
                  "type": "string",
                  "description": "file transfer method"
                },
                "url": {
                  "type": "string",
                  "description": "file url"
                },
                "related_id": {
                  "type": "string",
                  "description": "file related id"
                }
              },
              "description": "List of files"
            }
          }
        },
        "required": ["content"]
      },
      "description": "List of content and files"
    }
  }
}
```
</CodeGroup>
## Declare Multimodal Output Structure
The structure of multimodal data is defined by Dify's official JSON schema.
To enable the Knowledge Base node to recognize the plugin's multimodal output type, you need to point the `result` field under `output_schema` in the plugin's provider YAML file to the corresponding official schema URL.
```yaml
output_schema:
  type: object
  properties:
    result:
      # multimodal-Parent-Child
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
      # multimodal-General
      # $ref: "https://dify.ai/schemas/v1/multimodal_general_structure.json"
```
Taking `multimodal-Parent-Child` as an example, a complete YAML configuration is as follows:
```yaml expandable
identity:
  name: multimodal_tool
  author: langgenius
  label:
    en_US: multimodal tool
    zh_Hans: 多模态提取器
    pt_BR: multimodal tool
description:
  human:
    en_US: Process documents into multimodal-Parent-Child chunk structures
    zh_Hans: 将文档处理为多模态父子分块结构
    pt_BR: Processar documentos em estruturas de divisão pai-filho
  llm: Processes documents into hierarchical multimodal-Parent-Child chunk structures
parameters:
  - name: input_text
    human_description:
      en_US: The text you want to chunk.
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    label:
      en_US: Input Content
      zh_Hans: 输入文本
      pt_BR: Conteúdo de Entrada
    llm_description: The text you want to chunk.
    required: true
    type: string
    form: llm
output_schema:
  type: object
  properties:
    result:
      $ref: "https://dify.ai/schemas/v1/multimodal_parent_child_structure.json"
extra:
  python:
    source: tools/parent_child_chunk.py
```