runner
Note: this is a work in progress.

A minimal runner for loading a model and running inference via an HTTP web server.
./runner -model <model binary>
Completion
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion
Embeddings
curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding