Commit 6828ce0f8a by RiskeyL on 2026-03-19 16:44:34 +08:00 (parent caab54d1dd)
4 changed files with 177 additions and 2 deletions


@@ -84,7 +84,8 @@
"en/use-dify/build/predefined-error-handling-logic",
"en/use-dify/build/mcp",
"en/use-dify/build/version-control",
"en/use-dify/build/additional-features"
"en/use-dify/build/additional-features",
"en/use-dify/build/snippet"
]
},
{
@@ -94,7 +95,8 @@
"en/use-dify/debug/step-run",
"en/use-dify/debug/variable-inspect",
"en/use-dify/debug/history-and-logs",
"en/use-dify/debug/error-type"
"en/use-dify/debug/error-type",
"en/use-dify/debug/evaluate-workflow"
]
},
{
@@ -197,6 +199,7 @@
]
},
"en/use-dify/knowledge/test-retrieval",
"en/use-dify/knowledge/evaluate-knowledge",
"en/use-dify/knowledge/integrate-knowledge-within-application",
"en/use-dify/knowledge/knowledge-request-rate-limit"
]


@@ -0,0 +1,39 @@
---
title: Snippets
sidebarTitle: Snippets
description: Save and reuse groups of nodes across workflows
icon: puzzle-piece
---
Useful node patterns have a way of showing up in more than one workflow. Without a way to save and reuse them, you end up rebuilding the same logic every time: a particular chain of LLM and tool nodes, a data processing sequence, or a retrieval-and-summarization pipeline.
Snippets let you save a group of nodes as a reusable unit. Build the logic once, reuse it across your workflows, and share it with your team to save others the effort of building the same thing from scratch.
## Create a Snippet
There are two ways to create a snippet:
<Tabs>
<Tab title="From a Workflow">
Select the nodes you want to reuse, right-click, and choose **Add to Snippet**.
</Tab>
<Tab title="From Studio">
In the **Snippets** tab, click **Create from Blank** to start with an empty canvas, or import an existing snippet from a DSL file.
</Tab>
</Tabs>
## Edit and Manage Snippets
Each snippet has its own editor, similar to the workflow canvas.
Since snippets are designed to be inserted into existing workflows, they don't include a Start node. Instead, you define **Input Fields** that specify what data the snippet expects to receive, such as a variable from an upstream node. When someone adds your snippet to a workflow, these input fields become the connection points.
To share a snippet in your workspace, publish it. While unpublished, the snippet is marked as a draft and is only visible to you. You can also export a snippet as a DSL file to share it outside your workspace.
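An exported snippet DSL file is YAML. The exact schema depends on your Dify version, so the structure below is only an illustrative assumption of the kind of content such a file carries (declared input fields plus the node graph), not a documented format:

```yaml
# Hypothetical sketch of an exported snippet DSL file.
# Field names are assumptions; check an actual export for the real schema.
name: retrieval-and-summarize
input_fields:
  - variable: query          # connection point filled by an upstream node
    type: string
graph:
  nodes:
    - id: knowledge_retrieval_1
      type: knowledge-retrieval
    - id: llm_1
      type: llm
  edges:
    - source: knowledge_retrieval_1
      target: llm_1
```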
## Add a Snippet to a Workflow
Right-click on the orchestration canvas, choose **Add Node** > **Snippets**, and select one to insert it into your workflow. The inserted nodes become an independent copy: any changes you make in the workflow won't affect the original snippet.
## Evaluate Before Sharing
To ensure a snippet works reliably when reused across different workflows, you can configure and run [evaluation](/en/use-dify/debug/evaluate-workflow) for it before publishing.


@@ -0,0 +1,103 @@
---
title: Evaluate Workflows
sidebarTitle: Evaluation
description: Measure workflow quality with built-in and custom metrics
icon: chart-column
---
Without a way to measure output quality, every change to your app—tweaking a prompt, swapping a model, tuning retrieval settings—is based on gut feeling.
You might test a few cases manually and decide it looks good, but there's no solid data to back that decision. This becomes especially costly in production environments where reliability matters.
**Evaluation replaces guesswork with measurement**. Select metrics like factual grounding and retrieval relevance, choose an LLM to judge outputs against them, and set pass/fail thresholds.
Before the app goes live, you can batch test with mock data to verify it meets the defined criteria. Once published, every run is automatically evaluated as real-world inputs flow in, with results viewable in the **Logs** page.
## Configure
### Judge Model
Select an LLM from your configured model providers to serve as the judge. This model scores the outputs of evaluated nodes against the built-in metrics you select.
### Metrics
#### Built-in Metrics
Built-in metrics use the judge model to score the outputs of specific nodes in your workflow. After adding a metric, select which nodes it applies to.
<Tabs>
<Tab title="LLM Metrics">
> For LLM and Agent nodes
| Metric | Description |
| --- | --- |
| Faithfulness | Is the response making things up? <br></br><br></br>Checks whether every claim in the LLM's output can be traced back to the retrieved context. Higher scores indicate less hallucinated content. |
| Answer Relevancy | Is the response actually answering the question? <br></br><br></br>Measures how well the output addresses the user's question. Higher scores indicate more on-topic responses; lower scores suggest irrelevant content or a failure to address what was asked. |
| Answer Correctness | Is the response factually correct and complete? <br></br><br></br>Compares the output against a ground-truth reference, evaluating both meaning and key-fact coverage. |
| Semantic Similarity | Does the response convey the same meaning as the reference answer? <br></br><br></br>Measures how closely the two texts align in meaning, independent of factual correctness. |
</Tab>
<Tab title="Retrieval Metrics">
> For Knowledge Retrieval nodes
| Metric | Description |
| --- | --- |
| Context Precision | Is the knowledge base returning relevant results, or noise? <br></br><br></br>Higher scores indicate more of the retrieved content is useful for answering the question. |
| Context Recall | Is the knowledge base returning enough information? <br></br><br></br>Higher scores indicate the retrieved content covers more of the key information needed to answer the question. |
| Context Relevance | How relevant is each individual chunk? <br></br><br></br>Scores each retrieved chunk independently against the query, offering a more granular view than Context Precision's overall assessment. |
</Tab>
<Tab title="Agent Metrics">
> For Agent nodes
| Metric | Description |
| --- | --- |
| Tool Correctness | Is the agent calling the right tools with the right inputs? <br></br><br></br>Higher scores indicate the agent's tool usage more closely matches expected behavior. |
| Task Completion | Did the agent accomplish the user's goal? <br></br><br></br>Evaluates the full reasoning chain, intermediate steps, and final output. Higher scores indicate more complete task accomplishment. |
</Tab>
</Tabs>
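At a high level, the precision- and recall-style retrieval metrics above reduce to ratios like the following (a simplified sketch of the idea, not Dify's scoring code — in practice the relevance and coverage judgments come from the judge model, not from simple predicate functions):

```python
def context_precision(retrieved_chunks, is_relevant):
    """Fraction of retrieved chunks judged relevant to the query.

    Higher means less noise in the retrieval results.
    """
    if not retrieved_chunks:
        return 0.0
    relevant = sum(1 for chunk in retrieved_chunks if is_relevant(chunk))
    return relevant / len(retrieved_chunks)


def context_recall(key_facts, is_covered):
    """Fraction of the key facts that the retrieved content covers.

    Higher means the retrieval surfaces more of what is needed to answer.
    """
    if not key_facts:
        return 1.0
    covered = sum(1 for fact in key_facts if is_covered(fact))
    return covered / len(key_facts)
```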
#### Custom Evaluators
When built-in metrics don't cover what you need, you can create a custom evaluator—a workflow that scores outputs using your own logic.
**Create an Evaluation Workflow**
1. Build a workflow with your evaluation logic.
In the **User Input** node, define input variables for the data the evaluator needs to score. For example, an `llm_response` variable to receive the LLM's output from the evaluated workflow.
2. Use an **Output** node to output the evaluation results. Each output variable becomes a custom metric.
For example, a `word_count_pass` variable that returns whether the response meets your length requirements.
3. When publishing, choose **Publish as an Evaluation Workflow**.
<Info>
An evaluation workflow can be converted back to a standard workflow, but any evaluation configurations that reference it as a custom evaluator will stop working.
</Info>
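As an illustration, the `word_count_pass` logic from the example above might look like this if written as plain code (a hypothetical sketch — in Dify you would build the equivalent from workflow nodes, and the thresholds here are made up):

```python
def word_count_pass(llm_response: str, min_words: int = 50, max_words: int = 300) -> bool:
    """Check whether the evaluated response meets the length requirements.

    The 50-300 word range is an illustrative assumption, not a Dify default.
    """
    word_count = len(llm_response.split())
    return min_words <= word_count <= max_words
```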
**Configure a Custom Evaluator**
Select a published evaluation workflow as your custom evaluator, then map its input variables to the corresponding variables in the evaluated workflow. For example, map the evaluator's `llm_response` input to the actual LLM node's output.
The evaluator's output variables automatically appear as available metrics.
<Info>The judge model only applies to built-in metrics. Custom evaluators run their own logic and produce results independently.</Info>
### Judgement Conditions
Define conditions based on your selected metrics to determine whether a workflow run passes or fails. For example, you might require faithfulness to score above 0.8 and context precision above 0.7.
Each run is marked as either **pass** or **fail** in the results.
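The pass/fail decision from the example above amounts to checking every configured condition against the run's metric scores (a minimal sketch; whether Dify treats thresholds as strict or inclusive is an assumption here — this sketch follows the "above" wording and uses a strict comparison):

```python
def run_passes(scores: dict[str, float], conditions: dict[str, float]) -> bool:
    """A run passes only if every configured metric exceeds its threshold.

    A metric missing from `scores` is treated as 0.0, i.e. a failure.
    """
    return all(scores.get(metric, 0.0) > minimum
               for metric, minimum in conditions.items())


# The example conditions from the text: faithfulness above 0.8,
# context precision above 0.7.
conditions = {"faithfulness": 0.8, "context_precision": 0.7}
```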
## Batch Test
Before publishing or after making changes to a workflow, you can evaluate it against multiple test cases at once to verify quality.
Once evaluation is configured, download the pre-generated template, fill in your test cases, and click **Upload & Run Test**.
Switch to **Test History** to track progress and review past tests. Each entry records when the test was run, who ran it, and which workflow version was tested. You can download the results from any completed test.


@@ -0,0 +1,30 @@
---
title: Evaluate Knowledge Bases
sidebarTitle: Evaluation
description: Measure retrieval quality with automated testing
icon: chart-column
---
Tuning a knowledge base involves a lot of trial and adjustment, but without concrete data it's often hard to tell whether a change actually helped. **Evaluation makes this measurable**—scoring every retrieval against specific metrics, so each adjustment is backed by data.
Upload test queries with expected results, choose an LLM as judge, and see exactly how your knowledge base performs against retrieval metrics like precision, recall, and relevance.
## Judge Model
Select an LLM from your configured model providers to serve as the judge. This model scores retrieval results against the metrics you select.
## Metrics
| Metric | Description |
| --- | --- |
| Context Precision | Is the knowledge base returning relevant results, or noise? <br></br><br></br>Higher scores indicate more of the retrieved content is useful for answering the question. |
| Context Recall | Is the knowledge base returning enough information? <br></br><br></br>Higher scores indicate the retrieved content covers more of the key information needed to answer the question. |
| Context Relevance | How relevant is each individual chunk? <br></br><br></br>Scores each retrieved chunk independently against the query, offering a more granular view than Context Precision's overall assessment. |
After selecting a metric, set a **Pass if ≥** threshold to define the minimum acceptable score. Each test query will be marked as pass or fail based on these thresholds.
## Batch Test
Download the pre-generated Excel template, fill in your test queries and expected results, and click **Upload & Run Test**.
Results appear in the **Test Details** panel, showing each query's expected result, actual retrieved content, and individual metric scores. You can export the full results for further analysis.
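A filled-in template carries one test query per row alongside its expected result. The column headers below are an illustrative assumption — use the headers from the actual downloaded template:

```csv
query,expected_result
"What is the refund window?","Refunds are accepted within 30 days of purchase."
"How do I reset my password?","Use the Forgot Password link on the sign-in page."
```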