---
sidebar_position: 4000
title: "🪶 Apache Tika Extraction"
---

:::warning
This tutorial is a community contribution and is not supported by the Open WebUI team. It serves only as a demonstration of how to customize Open WebUI for your specific use case. Want to contribute? Check out the contributing tutorial.
:::

## 🪶 Apache Tika Extraction

This documentation provides a step-by-step guide to integrating Apache Tika with Open WebUI. Apache Tika is a content analysis toolkit that can be used to detect and extract metadata and text content from over a thousand different file types. All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.

Prerequisites
------------

* Open WebUI instance
* Docker installed on your system
* Docker network set up for Open WebUI

Integration Steps
----------------

### Step 1: Create a Docker Compose File or Run the Docker Command for Apache Tika

You have two options to run Apache Tika:

**Option 1: Using Docker Compose**

Create a new file named `docker-compose.yml` in the same directory as your Open WebUI instance. Add the following configuration to the file:

```yml
services:
  tika:
    image: apache/tika:latest-full
    container_name: tika
    ports:
      - "9998:9998"
    restart: unless-stopped
```

Run the Docker Compose file using the following command:

```bash
docker-compose up -d
```

**Option 2: Using Docker Run Command**

Alternatively, you can run Apache Tika using the following Docker command:

```bash
docker run -d --name tika \
  -p 9998:9998 \
  --restart unless-stopped \
  apache/tika:latest-full
```

Note that if you choose to use the Docker run command, you'll need to specify the `--network` flag if you want to run the container in the same network as your Open WebUI instance.
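As a minimal sketch of where the `--network` flag fits, the Python helper below assembles the same `docker run` arguments as Option 2 and adds the flag when a network name is given. The function name `tika_run_command` and the network name `open-webui-net` are hypothetical placeholders. Substitute the network your Open WebUI container actually uses (you can list networks with `docker network ls`).

```python
import shlex


def tika_run_command(network=None, image="apache/tika:latest-full"):
    """Build the `docker run` argument list from Option 2.

    When `network` is given, `--network <name>` is added so the Tika
    container joins the same Docker network as Open WebUI.
    """
    args = [
        "docker", "run", "-d",
        "--name", "tika",
        "-p", "9998:9998",
        "--restart", "unless-stopped",
    ]
    if network:
        args += ["--network", network]
    args.append(image)
    return args


# "open-webui-net" is a hypothetical network name used for illustration.
print(shlex.join(tika_run_command(network="open-webui-net")))
```

The printed line can be pasted into a terminal, or the list can be passed to `subprocess.run` if you prefer to launch the container from Python.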

### Step 2: Configure Open WebUI to Use Apache Tika

To use Apache Tika as the content extraction engine in Open WebUI, follow these steps:

* Log in to your Open WebUI instance.
* Navigate to the `Admin Panel` settings menu.
* Click on `Settings`.
* Click on the `Documents` tab.
* Change the `Default` content extraction engine dropdown to `Tika`.
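* Update the content extraction engine URL to `http://tika:9998`. The hostname `tika` resolves only when Open WebUI and Tika share a user-defined Docker network (Docker Compose creates one automatically).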
* Save the changes.

Verifying Apache Tika in Docker
-------------------------------

To verify that Apache Tika is working correctly in a Docker environment, you can follow these steps:

### 1. Start the Apache Tika Docker Container

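First, ensure that the Apache Tika Docker container is running. If you already started it in Step 1, skip this step, because a second container cannot bind to port 9998 while the first one is running. Otherwise, start it using the following command: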

```bash
docker run -p 9998:9998 apache/tika
```

This command starts the Apache Tika container and maps port 9998 from the container to port 9998 on your local machine.

### 2. Verify the Server is Running

You can verify that the Apache Tika server is running by sending a GET request:

```bash
curl -X GET http://localhost:9998/tika
```

This command should return a response similar to the following (depending on the Tika release, the message may also include the version number):

```
This is Tika Server. Please PUT
```
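The same check can be sketched in Python. The example below uses only the standard library, so it does not require `requests`. The function names are hypothetical, and the match on the greeting text is an assumption based on the sample response above: it tolerates extra text, such as a version number, between the opening and closing words.

```python
import urllib.request


def looks_like_tika_greeting(body: str) -> bool:
    """Return True if the body resembles the Tika Server greeting.

    Assumption: the greeting starts with "This is Tika Server" and ends
    with "Please PUT", as in the sample response shown above.
    """
    text = body.strip()
    return text.startswith("This is Tika Server") and text.endswith("Please PUT")


def fetch_greeting(base_url: str = "http://localhost:9998", timeout: float = 5.0) -> str:
    """Send a GET request to the /tika endpoint and return the response body."""
    with urllib.request.urlopen(f"{base_url}/tika", timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With the container running, `looks_like_tika_greeting(fetch_greeting())` should evaluate to `True`. A connection error raised by `fetch_greeting` usually means the container is not running or the port is not published.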

### 3. Verify the Integration

You can also test the integration by sending a file for analysis using the `curl` command:

```bash
curl -T test.txt http://localhost:9998/tika
```

Replace `test.txt` with the path to a text file on your local machine.

Apache Tika will respond with the content it extracted from the file. To retrieve the file's metadata instead, send the same request to the `/meta` endpoint.
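To look at metadata specifically, one option is Tika Server's `/meta` endpoint. The sketch below is a minimal illustration using only the Python standard library. The helper names `build_meta_request` and `fetch_metadata` are hypothetical, and it assumes the server is reachable at `http://localhost:9998` and returns JSON when asked via the `Accept` header.

```python
import json
import urllib.request


def build_meta_request(file_path, base_url="http://localhost:9998"):
    """Build a PUT request that sends the raw file bytes to Tika's /meta endpoint."""
    with open(file_path, "rb") as f:
        payload = f.read()
    return urllib.request.Request(
        f"{base_url}/meta",
        data=payload,
        method="PUT",
        headers={
            # Ask Tika to return the metadata as JSON.
            "Accept": "application/json",
            # Set a generic type explicitly; urllib would otherwise
            # default to a form-encoded Content-Type for request bodies.
            "Content-Type": "application/octet-stream",
        },
    )


def fetch_metadata(file_path, base_url="http://localhost:9998", timeout=60.0):
    """Send the request and return the metadata as a dictionary."""
    request = build_meta_request(file_path, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_metadata("test.txt")` against a running container should return a dictionary of metadata fields. The exact keys depend on the file type and the Tika version.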

### Using a Script to Verify Apache Tika

If you want to automate the verification process, the following script sends a file to Apache Tika and checks the HTTP status code of the response. If the status is 200, the script prints a success message along with the content Tika extracted; otherwise, it prints the status code and the response body.

```python
import requests

def verify_tika(file_path, tika_url):
    try:
        # Tika's /tika endpoint expects the raw file bytes as the PUT body,
        # not a multipart form, so stream the open file with `data=`.
        with open(file_path, "rb") as f:
            response = requests.put(
                tika_url,
                data=f,
                headers={"Accept": "text/plain"},  # ask for plain-text output
                timeout=60,
            )

        if response.status_code == 200:
            print("Apache Tika successfully analyzed the file.")
            print("Response from Apache Tika:")
            print(response.text)
        else:
            print("Error analyzing the file:")
            print(f"Status code: {response.status_code}")
            print(f"Response from Apache Tika: {response.text}")
    except (OSError, requests.RequestException) as e:
        # Covers a missing file as well as connection and timeout errors
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    file_path = "test.txt"  # Replace with the path to your file
    tika_url = "http://localhost:9998/tika"

    verify_tika(file_path, tika_url)
```

Instructions to run the script:

### Prerequisites

* Python 3.x must be installed on your system
* `requests` library must be installed (you can install it using pip: `pip install requests`)
* Apache Tika Docker container must be running (use `docker run -p 9998:9998 apache/tika` command)
* Replace `"test.txt"` with the path to the file you want to send to Apache Tika

### Running the Script

1. Save the script as `verify_tika.py` (e.g., using a text editor like Notepad or Sublime Text)
2. Open a terminal or command prompt
3. Navigate to the directory where you saved the script (using the `cd` command)
4. Run the script using the following command: `python verify_tika.py`
5. The script will output a message indicating whether Apache Tika is working correctly

Note: If you encounter any issues, ensure that the Apache Tika container is running correctly and that the file is being sent to the correct URL.

### Conclusion

By following these steps, you can verify that Apache Tika is working correctly in a Docker environment: check that the server is running with a GET request, send a file for analysis, or use a script to automate the process.

Troubleshooting
--------------

* Make sure the Apache Tika service is running and accessible from the Open WebUI instance.
* Check the Docker logs for any errors or issues related to the Apache Tika service.
* Verify that the content extraction engine URL is correctly configured in Open WebUI.
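The checks above can be partly automated. The following sketch, which uses only the Python standard library, requests a Tika URL and maps common failures to a likely cause. The function names and the wording of the messages are illustrative assumptions, not part of Open WebUI or Tika.

```python
import socket
import urllib.error
import urllib.request


def explain(error: Exception) -> str:
    """Map a urllib or socket error to a likely cause."""
    if isinstance(error, urllib.error.HTTPError):
        return f"server reachable, but it returned HTTP {error.code}"
    # URLError wraps the underlying socket error in its `reason` attribute.
    reason = getattr(error, "reason", error)
    if isinstance(reason, socket.gaierror):
        return "hostname did not resolve; check that both containers share a Docker network"
    if isinstance(reason, ConnectionRefusedError):
        return "connection refused; check that the container is running and the port is published"
    if isinstance(reason, socket.timeout):
        return "timed out; check firewalls and the host address"
    return f"unreachable: {reason}"


def diagnose(url: str, timeout: float = 5.0) -> str:
    """Request the URL and return a short, human-readable diagnosis."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok: HTTP {resp.status}"
    except OSError as error:  # URLError and socket errors are OSError subclasses
        return explain(error)
```

Run `diagnose("http://localhost:9998/tika")` from the host, or `diagnose("http://tika:9998/tika")` from a Python shell in a container on the same Docker network, to see which URL is reachable from where.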

Benefits of Integration
----------------------

Integrating Apache Tika with Open WebUI provides several benefits, including:

* **Improved Metadata Extraction**: Apache Tika's advanced metadata extraction capabilities can help you extract accurate and relevant data from your files.
* **Support for Multiple File Formats**: Apache Tika supports a wide range of file formats, making it an ideal solution for organizations that work with diverse file types.
* **Enhanced Content Analysis**: Apache Tika's advanced content analysis capabilities can help you extract valuable insights from your files.

Conclusion
----------

Integrating Apache Tika with Open WebUI is a straightforward process that can improve the metadata extraction capabilities of your Open WebUI instance. By following the steps outlined in this documentation, you can easily set up Apache Tika as a content extraction engine for Open WebUI.