mirror of https://github.com/ollama/ollama.git synced 2026-03-27 02:58:43 +07:00

Files

Jeffrey Morgan 1044b0419a model: add MLA absorption for glm4moelite (#13810)

* model: add MLA absorption for glm4moelite

Split the combined KV_B tensor into separate K_B and V_B tensors
during conversion, enabling MLA (Multi-head Latent Attention)
absorption which compresses the KV cache for improved efficiency.

* ggml: enable MLA flash attention for GLM-4.7-flash

Add support for gqa_ratio 4 in MLA flash attention kernels. GLM-4.7-flash
uses head size 576 with gqa_ratio 4, which was previously only supported
for gqa_ratio 16 (DeepSeek).

Metal changes:
- Enable head size 576 for flash attention
- Increase simdgroups to 8 for large heads (>=512)
- Add case 8 kernel dispatch for 8 simdgroups

CUDA changes:
- Add gqa_ratio 4 support for head 576/512
- Add tile configs for (576, 512, 4) and (576, 512, 8)
- Add MMA config cases for ncols 4
- Add template instances for ncols2=4

* model: add compatibility validation for glm4moelite architecture

2026-01-23 14:47:42 -08:00

llama.cpp

GGML update to ec98e2002 (#13451)

2025-12-17 13:13:55 -08:00

patches

model: add MLA absorption for glm4moelite (#13810)

2026-01-23 14:47:42 -08:00

.gitignore

Re-introduce the llama package (#5034)

2024-10-08 08:53:54 -07:00

build-info.cpp

GGML update to ec98e2002 (#13451)

2025-12-17 13:13:55 -08:00

build-info.cpp.in

chore: update gitattributes (#8860)

2025-02-05 16:37:18 -08:00

llama_test.go

llama: test case typo and readability improvements (#13078)

2025-11-15 18:54:27 -08:00

llama.go

GGML update to ec98e2002 (#13451)

2025-12-17 13:13:55 -08:00

README.md

docs: improve syntax highlighting in code blocks (#8854)

2025-02-07 09:55:07 -08:00

sampling_ext.cpp

GGML update to ec98e2002 (#13451)

2025-12-17 13:13:55 -08:00

sampling_ext.h

api: remove unused sampling parameters (#10581)

2025-05-08 08:31:08 -07:00

README.md

`llama`

This package provides Go bindings to llama.cpp.

Vendoring

Ollama vendors llama.cpp and ggml. While we generally strive to contribute changes back upstream to avoid drift, we carry a small set of patches that are applied on top of the tracking commit.

If you update the vendored code, start by running the following command to establish the tracking llama.cpp repo in the ./vendor/ directory.

make -f Makefile.sync apply-patches

Updating Base Commit

Pin to new base commit

To change the base commit, update FETCH_HEAD in Makefile.sync.
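As a minimal sketch of that edit, the helper below rewrites a single-line FETCH_HEAD assignment in a file. The function name set_fetch_head, the assumed one-line FETCH_HEAD=<commit> format, and the commit value abc1234 are illustrative assumptions, not taken from Makefile.sync; check the actual line in the file before using anything like this.

```shell
#!/bin/sh
set -eu

# set_fetch_head FILE COMMIT (hypothetical helper)
# Replaces a line of the form "FETCH_HEAD=<value>" (also ":=" or "?=")
# with "FETCH_HEAD=COMMIT". All other lines are left unchanged.
set_fetch_head() {
  file=$1
  commit=$2
  tmp=$(mktemp)
  sed "s/^FETCH_HEAD[[:space:]]*[?:]*=.*/FETCH_HEAD=$commit/" "$file" > "$tmp"
  # Write back through the original path so its permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demonstration on a scratch file rather than the real Makefile.sync.
demo=$(mktemp)
printf 'FETCH_HEAD=0000000\nOTHER=1\n' > "$demo"
set_fetch_head "$demo" abc1234
cat "$demo"
rm -f "$demo"
```

Against the real file, the call would look like set_fetch_head Makefile.sync <new-commit>, assuming the assignment really is on one line.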

When updating to a newer base commit, the existing patches may not apply cleanly and may require manual merge resolution.

Start by applying the patches. If any patch conflicts, git am stops at the first failure.

make -f Makefile.sync apply-patches

If there are conflicts, you will see an error message. Resolve the conflicts in ./vendor/, continue the patch series with git am --continue, then rerun make -f Makefile.sync apply-patches. Repeat until all patches apply successfully.
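The stop-and-continue behavior of git am can be tried safely in a throwaway repository before touching ./vendor/. The sketch below is an illustration only: the function name demo_am_conflict, the file names, and the commit subjects are made up, and it passes --3way so the conflict shows up as merge markers, which may or may not match how Makefile.sync invokes git am.

```shell
#!/bin/sh
set -eu

# demo_am_conflict (hypothetical): reproduce a git am conflict in a sandbox,
# resolve it, and finish the series with git am --continue.
# The body runs in a subshell, so cd and the trap stay local to it.
demo_am_conflict() (
  work=$(mktemp -d)
  trap 'rm -rf "$work"' EXIT
  cd "$work"

  # Isolate from user config and supply an identity for commits.
  export GIT_CONFIG_GLOBAL=/dev/null GIT_CONFIG_NOSYSTEM=1
  export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
  export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

  git init -q repo
  cd repo
  printf 'one\n' > file.txt
  git add file.txt
  git commit -q -m 'base'
  base=$(git rev-parse HEAD)

  # A local change, exported as a patch file.
  printf 'patched\n' > file.txt
  git commit -q -am 'local patch'
  git format-patch -q -1 -o "$work/patches"

  # Simulate a newer base commit that touches the same line.
  git reset -q --hard "$base"
  printf 'upstream\n' > file.txt
  git commit -q -am 'new upstream base'

  if git am --3way "$work"/patches/*.patch >/dev/null 2>&1; then
    echo "applied cleanly"
  else
    echo "stopped on conflict"
    # Resolve by hand (here: take the patched content), stage, continue.
    printf 'patched\n' > file.txt
    git add file.txt
    git am --continue >/dev/null 2>&1
    echo "resolved"
  fi

  # Subject of the newest commit: the re-applied patch.
  git log -1 --format=%s
)

demo_am_conflict
```

In the real workflow the same resolve, git add, git am --continue steps happen inside ./vendor/, followed by rerunning make -f Makefile.sync apply-patches.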

Once all patches are applied, commit the changes to the tracking repository.

make -f Makefile.sync format-patches sync

Generating Patches

When working on new fixes or features that impact vendored code, use the following workflow. First get a clean tracking repo with all current patches applied:

make -f Makefile.sync clean apply-patches

Iterate until you're ready to submit PRs. Once your code is ready, commit a change in the ./vendor/ directory, then generate the patches for Ollama with:

make -f Makefile.sync format-patches

In your ./vendor/ directory, create a branch, and cherry-pick the new commit to that branch, then submit a PR upstream to llama.cpp.

Commit the changes in the ollama repo and submit a PR to Ollama, which will include the vendored code update with your change, along with the patches.

After your upstream PR is merged, follow the Updating Base Commit instructions above, but first remove your patch before running apply-patches, since the new base commit already contains your change.