📝 feat: File Context (OCR) or Upload Files as Text (#262)

* 🌍 docs: Add OCR configuration documentation and object structure
* 🌍 docs: Enhance OCR capabilities documentation and add new feature details
* 🌍 docs: Add OCR configuration details and update links in documentation
* 🌍 docs: Update OCR configuration details and enhance documentation for new features
* 🌍 docs: Add OCR capability details and update changelog for new text extraction features
* 🌍 docs: Clarify OCR processing details in agent context and update documentation for text extraction
* 🌍 docs: Update OCR documentation title and enhance configuration details for Mistral model
* fix: example OCR `mistralModel` and clarifying comment on configuration
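
For orientation, here is a minimal sketch of the `librechat.yaml` settings this commit documents, combining the new `ocr` block with the new `ocr` agent capability. All values are placeholders. The `${OCR_API_KEY}` reference assumes that LibreChat's usual `${VAR}` syntax is what "environment variable parsing" means here; the diff itself only states that parsing is supported, so treat that form as an assumption.

```yaml
# Sketch only: placeholder values, not recommended settings.
ocr:
  strategy: "mistral_ocr"              # documented default strategy
  mistralModel: "mistral-ocr-latest"   # documented default model
  apiKey: "${OCR_API_KEY}"             # assumption: ${VAR} env reference syntax
endpoints:
  agents:
    # Listing capabilities explicitly; "ocr" is the one added by this commit.
    capabilities:
      - "execute_code"
      - "file_search"
      - "actions"
      - "tools"
      - "ocr"
```

Per the docs added below, the whole `ocr` block can be omitted when using Mistral OCR, as long as `OCR_API_KEY` is set in the environment.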
2026-03-27 02:38:32 +07:00 · 2025-03-10 17:23:59 -04:00
parent 0955c987e9
commit 214d8bb281
9 changed files with 307 additions and 7 deletions
--- a/components/changelog/content/config_v1.2.2.mdx
+++ b/components/changelog/content/config_v1.2.2.mdx
@@ -0,0 +1,14 @@
+- Added [OCR Configuration](/docs/configuration/librechat_yaml/object_structure/ocr)
+  - Enables Optical Character Recognition (OCR) for extracting text from images and documents
+  - Supports two strategies: `mistral_ocr` (default) and `custom_ocr` (planned for a future release)
+  - Configurable parameters include `mistralModel`, `apiKey`, and `baseURL`
+  - Environment variable parsing supported for all parameters
+  - Available as an agent capability for processing images and documents
+  - Accessible via "Upload as Text" in chat and "File Context" in agent settings
+- Added `ocr` to [Agent Capabilities](/docs/configuration/librechat_yaml/object_structure/agents#capabilities)
+  - New capability allows agents to extract text from images and documents
+  - Integrates with the OCR configuration for text extraction
+  - Adds "File Context (OCR)" as a new file upload category for agents
+  - This setting also affects the "Upload as Text" feature in chat (uploading as a message attachment)
+- Added `titleModel` to [Shared Endpoint Settings](), making this configurable for all endpoints
+    - Note: this doesn't include the `assistants` or `all` configurations
--- a/pages/changelog/config_v1.2.2.mdx
+++ b/pages/changelog/config_v1.2.2.mdx
@@ -0,0 +1,13 @@
+---
+date: 2025/3/10
+title: ⚙️ Config v1.2.2
+---
+
+import { ChangelogHeader } from '@/components/changelog/ChangelogHeader'
+import Content from '@/components/changelog/content/config_v1.2.2.mdx'
+
+<ChangelogHeader />
+
+---
+
+<Content />
--- a/pages/docs/configuration/librechat_yaml/object_structure/agents.mdx
+++ b/pages/docs/configuration/librechat_yaml/object_structure/agents.mdx
@@ -10,7 +10,7 @@ endpoints:
    recursionLimit: 50
    disableBuilder: false
    # (optional) Agent Capabilities available to all users. Omit the ones you wish to exclude. Defaults to list below.
-    # capabilities: ["execute_code", "file_search", "actions", "tools"]
+    # capabilities: ["execute_code", "file_search", "actions", "tools", "ocr"]
 ```
 > This configuration enables the builder interface for agents.

@@ -52,7 +52,7 @@ disableBuilder: false
  ]}
 />

-**Default:** `["execute_code", "file_search", "actions", "tools"]`
+**Default:** `["execute_code", "file_search", "actions", "tools", "ocr"]`

 **Example:**
 ```yaml filename="endpoints / agents / capabilities"
@@ -61,6 +61,7 @@ capabilities:
  - "file_search"
  - "actions"
  - "tools"
+  - "ocr"
 ```
 **Note:** This field is optional. If omitted, the default behavior is to include all the capabilities listed in the default.

@@ -72,6 +73,7 @@ The `capabilities` field allows you to enable or disable specific functionalitie
 - **file_search**: Enables the agent to search and interact with files.
 - **actions**: Permits the agent to perform predefined actions.
 - **tools**: Grants the agent access to various tools.
+- **ocr**: Enables uploading files as additional context, leveraging Optical Character Recognition for extracting text from images and documents.

 By specifying the capabilities, you can control the features available to users when interacting with agents.

@@ -86,9 +88,10 @@ endpoints:
    capabilities:
      - "execute_code"
      - "actions"
+      - "ocr"
 ```

-In this example, the builder interface for agents is disabled, and only the `execute_code` and `actions` capabilities are enabled.
+In this example, the builder interface for agents is disabled, and only the `execute_code`, `actions`, and `ocr` capabilities are enabled.

 ## Notes

--- a/pages/docs/configuration/librechat_yaml/object_structure/config.mdx
+++ b/pages/docs/configuration/librechat_yaml/object_structure/config.mdx
@@ -78,6 +78,27 @@
  ]}
 />

+## ocr
+
+**Key:**
+<OptionTable
+  options={[
+    ['ocr', 'Object', 'Configures Optical Character Recognition (OCR) settings for extracting text from images.', ''],
+  ]}
+/>
+
+**Subkeys:**
+<OptionTable
+  options={[
+    ['apiKey', 'String', 'The API key for the OCR service.', ''],
+    ['baseURL', 'String', 'The base URL for the OCR service API.', ''],
+    ['strategy', 'String', 'The OCR strategy to use. Options are "mistral_ocr" or "custom_ocr".', ''],
+    ['mistralModel', 'String', 'The Mistral model to use for OCR processing.', ''],
+  ]}
+/>
+
+see: [OCR Config Object Structure](/docs/configuration/librechat_yaml/object_structure/ocr)
+
 ## fileConfig

 **Key:**
@@ -359,3 +380,4 @@ see: [MCP Servers Object Structure](/docs/configuration/librechat_yaml/object_st
 - [Azure OpenAI Endpoint Object Structure](/docs/configuration/librechat_yaml/object_structure/azure_openai)
 - [Assistants Endpoint Object Structure](/docs/configuration/librechat_yaml/object_structure/assistants_endpoint)
 - [Agents](/docs/configuration/librechat_yaml/object_structure/agents)
+- [OCR Config Object Structure](/docs/configuration/librechat_yaml/object_structure/ocr)
--- a/pages/docs/configuration/librechat_yaml/object_structure/ocr.mdx
+++ b/pages/docs/configuration/librechat_yaml/object_structure/ocr.mdx
@@ -0,0 +1,98 @@
+# OCR Config Object Structure
+
+## Overview
+
+The `ocr` object allows you to configure Optical Character Recognition (OCR) settings for the application, enabling the extraction of text from images. This section provides a detailed breakdown of the `ocr` object structure.
+
+There are 4 main fields under `ocr`:
+
+  - `mistralModel`
+  - `apiKey`
+  - `baseURL`
+  - `strategy`
+
+**Notes:**
+
+- If using the Mistral OCR API, you don't need to edit your `librechat.yaml` file.
+    - You only need the following environment variables to get started: `OCR_API_KEY` and `OCR_BASEURL`.
+- OCR functionality allows the application to extract text from images, which can then be processed by AI models.
+- The default strategy is `mistral_ocr`, which uses Mistral's OCR capabilities.
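+- The `custom_ocr` strategy is intended for pointing LibreChat at a custom OCR service. Note that the [OCR feature docs](/docs/features/ocr) list it as planned for a future release, with `mistral_ocr` currently the only available option.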
+- If using the default Mistral OCR, you may optionally specify which Mistral model to use.
+- Environment variable parsing is supported for `apiKey`, `baseURL`, and `mistralModel` parameters.
+- A `user_provided` strategy option is planned for future releases but is not yet implemented.
+
+## Example
+
+```yaml filename="ocr"
+ocr:
+  mistralModel: "mistral-ocr-latest"
+  apiKey: "your-mistral-api-key"
+  strategy: "mistral_ocr"
+```
+
+Example with custom OCR:
+
+```yaml filename="ocr with custom OCR"
+ocr:
+  apiKey: "your-custom-ocr-api-key"
+  baseURL: "https://your-custom-ocr-service.com/api"
+  strategy: "custom_ocr"
+```
+
+## mistralModel
+
+<OptionTable
+  options={[
+    ['mistralModel', 'String', 'The Mistral model to use for OCR processing.', 'Optional. Specifies which Mistral model should be used when the strategy is set to mistral_ocr.'],
+  ]}
+/>
+
+```yaml filename="ocr / mistralModel"
+ocr:
+  mistralModel: "mistral-ocr-latest"
+```
+
+## apiKey
+
+<OptionTable
+  options={[
+    ['apiKey', 'String', 'The API key for the OCR service.', 'Optional. Defaults to the environment variable OCR_API_KEY if not specified.'],
+  ]}
+/>
+
+```yaml filename="ocr / apiKey"
+ocr:
+  apiKey: "your-ocr-api-key"
+```
+
+## baseURL
+
+<OptionTable
+  options={[
+    ['baseURL', 'String', 'The base URL for the OCR service API.', 'Optional. Defaults to the environment variable OCR_BASEURL if not specified.'],
+  ]}
+/>
+
+```yaml filename="ocr / baseURL"
+ocr:
+  baseURL: "https://your-ocr-service.com/api"
+```
+
+## strategy
+
+<OptionTable
+  options={[
+    ['strategy', 'String', 'The OCR strategy to use.', 'Determines which OCR service to use. Options are "mistral_ocr" or "custom_ocr". Defaults to "mistral_ocr".'],
+  ]}
+/>
+
+```yaml filename="ocr / strategy"
+ocr:
+  strategy: "custom_ocr"
+```
+
+**Available Strategies:**
+
+- `mistral_ocr`: Uses Mistral's OCR capabilities.
+- `custom_ocr`: Uses a custom OCR service specified by `baseURL`.
--- a/pages/docs/features/agents.mdx
+++ b/pages/docs/features/agents.mdx
@@ -58,6 +58,17 @@ The File Search capability enables:
 - Context-aware responses based on file contents
 - File attachment support at both agent and chat thread levels

+### File Context (using Optical Character Recognition)
+
+The File Context (OCR) capability allows your agent to extract and process text from images and documents:
+
+- Extract text while maintaining document structure and formatting
+- Process complex layouts including multi-column text and mixed content
+- Handle tables, equations, and other specialized content
+- Work with multilingual content
+- [More info about OCR](/docs/features/ocr)
+  - **Currently uses Mistral OCR API which may incur costs**
+
 ### Tools

 Agents can also be enhanced with various built-in tools:
@@ -136,14 +147,19 @@ LibreChat is at the forefront of implementing flexible, scalable MCP server inte

 ## File Management

-Agents support three distinct file upload categories:
+Agents support four distinct file upload categories:

 1. **Image Upload**: For visual content processing
 2. **File Search Upload**: Documents for RAG capabilities
 3. **Code Interpreter Upload**: Files for code processing
+4. **File Context (OCR)**: Documents processed with OCR and added to the agent's instructions

 Files can be attached directly to the agent configuration or within individual chat threads.

+![File Context using OCR for agents](/images/ocr/file_context_ocr.png)
+
+Files uploaded as "File Context" are processed using OCR to extract text, which is then added to the Agent's instructions. This is ideal for documents, images with text, or PDFs where you need the full text content of a file to be available to the agent. Note that OCR is performed at upload time, and the extracted text is not stored as a separate file but purely as text in the database.
+
 ## Sharing and Permissions

 ### Administrator Controls
@@ -182,7 +198,7 @@ LibreChat allows admins to configure the use of agents via the [`librechat.yaml`

 - Provide clear, specific instructions for your agent
 - Carefully consider which tools are necessary for your use case
-- Organize files appropriately across the three upload categories
+- Organize files appropriately across the four upload categories
 - Review permission settings before sharing agents
 - Test your agent thoroughly before deploying to other users

@@ -191,7 +207,7 @@ LibreChat allows admins to configure the use of agents via the [`librechat.yaml`
 1. Select "Agents" from the endpoint dropdown menu
 2. Open the Agent Builder panel
 3. Fill out the required agent details
-4. Configure desired capabilities (Code Interpreter, File Search)
+4. Configure desired capabilities (Code Interpreter, File Search, File Context via OCR)
 5. Add necessary tools and files
 6. Set sharing permissions if desired
 7. Create and start using your agent
@@ -218,4 +234,4 @@ AI Agents in LibreChat provide a powerful way to create specialized assistants w

 ---

-#LibreChat #AIAssistants #NoCode #OpenSource
+#LibreChat #AIAssistants #NoCode #OpenSource
--- a/pages/docs/features/ocr.mdx
+++ b/pages/docs/features/ocr.mdx
@@ -0,0 +1,134 @@
+---
+title: File Context (OCR)
+description: Learn how to use LibreChat's OCR capability to extract text from images and documents for AI processing.
+---
+
+# File Context via Optical Character Recognition (OCR)
+
+LibreChat's OCR (Optical Character Recognition) feature enables AI agents to extract and process text from images and documents. This capability enhances the AI's ability to work with visual content, making it possible to analyze, understand, and respond to information contained in images.
+
+## Overview
+
+OCR functionality in LibreChat allows agents to:
+
+- Extract text from images and documents
+- Maintain document structure and formatting
+- Process complex layouts including multi-column text
+- Handle tables, equations, and other specialized content
+- Work with multilingual content
+
+## Availability
+
+Currently, OCR is **only available as an agent capability**. This means you must use an agent via the Agents endpoint to leverage OCR functionality.
+
+## Configuration
+
+OCR can be enabled in the LibreChat configuration file (`librechat.yaml`). The OCR configuration supports two strategies:
+
+1. **Mistral OCR** (Default and currently the only available option)
+2. **Custom OCR** (Planned for future releases)
+
+### Basic Configuration Example
+
+If using the Mistral OCR API, you only need the following environment variables to get started:
+
+```.env
+# `.env`
+OCR_API_KEY=your-mistral-api-key
+# OCR_BASEURL=https://api.mistral.ai/v1 # this is the default value
+```
+
+For additional, detailed configuration options, see the [OCR Config Object Structure](/docs/configuration/librechat_yaml/object_structure/ocr).
+
+```yaml
+# `librechat.yaml`
+ocr:
+  mistralModel: "mistral-ocr-latest"  # Optional: Specify Mistral model, defaults to "mistral-ocr-latest"
+  apiKey: "your-mistral-api-key"        # Optional: Defaults to OCR_API_KEY env variable
+  baseURL: "https://api.mistral.ai/v1"  # Optional: Defaults to OCR_BASEURL env variable, or Mistral's API if no variable set
+  strategy: "mistral_ocr"               # Optional: Defaults to "mistral_ocr" (only option currently available)
+```
+
+## Mistral OCR
+
+Currently, LibreChat uses Mistral's OCR API as the default and only available OCR provider. Mistral OCR offers state-of-the-art document understanding capabilities.
+
+### Key Features of Mistral OCR
+
+- **Document Structure Preservation**: Maintains formatting like headers, paragraphs, lists, and tables
+- **Multilingual Support**: Processes text in multiple languages and scripts
+- **Complex Layout Handling**: Handles multi-column text and mixed content
+- **Mathematical Expression Recognition**: Accurately processes equations and formulas
+- **High-Speed Processing**: Processes up to 2000 pages per minute
+
+### Important Considerations
+
+- **Cost**: Using Mistral OCR may incur costs as it's a paid API service (though free trials may be available)
+- **Data Privacy**: Data processed through Mistral OCR is subject to Mistral's cloud environment and their terms of service
+- **Document Limitations**:
+  - Maximum file size: 50 MB
+  - Maximum document length: 1,000 pages
+
+### Future Plans
+
+- Mistral plans to make its OCR API available through cloud partners such as GCP and AWS, and to offer enterprise self-hosting for organizations with stringent data privacy requirements ([source](https://mistral.ai/fr/news/mistral-ocr)).
+- LibreChat will continue to support Mistral OCR and explore additional OCR providers, including open-source solutions, for enhanced functionality.
+- LibreChat does not currently include the parsed image content from the OCR process in its responses, even though services such as [Mistral's OCR API](https://docs.mistral.ai/api/#tag/ocr) may return it in their results. This may be supported in future updates.
+
+## Using File Context (OCR) in LibreChat
+
+LibreChat provides two main ways to use OCR functionality:
+
+### 1. Upload as Text in Chat
+
+In any chat conversation, you can use OCR to extract text from images or documents:
+
+1. Click the attachment icon in the chat input
+2. Select "Upload as Text" from the menu
+3. Choose an image or document file
+4. The OCR system will process the file and insert the extracted text into your message
+
+![Upload as Text option in the attachment menu](/images/ocr/upload_as_text.png)
+
+### 2. File Context for Agents
+
+When working with agents, you can add documents as context using OCR:
+
+1. Open the Agent Builder panel or edit an existing agent
+2. In the File Context section, click "Upload File Context"
+3. Select a document or image file
+4. The OCR system will extract text from the file and add it to the agent's instructions
+
+![File Context using OCR for agents](/images/ocr/file_context_ocr.png)
+
+Files uploaded as "File Context" are processed using OCR to extract text, which is then added to the Agent's instructions. This is ideal for documents, images with text, or PDFs where you need the full text content of a file to be available to the agent.
+
+**Note:** OCR is performed at upload time. The extracted text is not stored as a separate file but purely as text in the database.
+
+## Example Use Cases
+
+- **Document Analysis**: Extract and analyze text from scanned documents, PDFs, or images
+- **Data Extraction**: Pull specific information from forms, receipts, or invoices
+- **Research Assistance**: Process academic papers, articles, or books
+- **Language Translation**: Extract text from foreign language documents for translation
+- **Content Digitization**: Convert printed materials into digital, searchable text
+
+## Limitations
+
+- OCR accuracy may vary depending on image quality, document complexity, and text clarity
+- Some specialized formatting or unusual layouts might not be perfectly preserved
+- Very large documents may be truncated due to token limitations of the underlying AI models
+
+## Future Enhancements
+
+LibreChat plans to expand OCR capabilities in future releases:
+
+- Support for custom OCR providers
+- A `user_provided` strategy option that will allow users to choose their preferred OCR service
+- Integration with open-source OCR solutions
+- Enhanced document processing options
+- More granular control over OCR settings
+
+---
+
+For more information on configuring OCR, see the [OCR Config Object Structure](/docs/configuration/librechat_yaml/object_structure/ocr).
--- a/public/images/ocr/file_context_ocr.png
+++ b/public/images/ocr/file_context_ocr.png
--- a/public/images/ocr/upload_as_text.png
+++ b/public/images/ocr/upload_as_text.png