mirror of https://github.com/ollama/ollama.git synced 2026-03-27 02:58:43 +07:00


easonysliu 810d4f9c22 runner: fix swallowed error in allocModel graph reservation

In allocModel(), the first call to reserveWorstCaseGraph(true) had its
error silently discarded — `return nil` was used instead of `return err`.

This meant that if the prompt-sized graph reservation failed (e.g. due
to insufficient memory), the error was swallowed, allocModel reported
success, and the model appeared to load correctly. Subsequent inference
would then fail in unexpected ways because the worst-case graph was
never properly reserved.

Fix: return the actual error so the caller can handle the failure
(retry with reduced parallelism, report OOM, etc.).

Co-Authored-By: Claude (claude-opus-4-6) <noreply@anthropic.com>

2026-03-16 15:48:45 -07:00

common

server: add logprobs and top_logprobs support to Ollama's API (#12899)

2025-11-11 08:49:50 -08:00

llamarunner

flash attn: add auto mode for llama engine (#13052)

2025-12-12 13:27:19 -08:00

ollamarunner

runner: fix swallowed error in allocModel graph reservation

2026-03-16 15:48:45 -07:00

README.md

…

runner.go

Add MLX runner with GLM4-MoE-Lite model support (#14185)

2026-02-10 14:57:57 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding