What's Changed

Command A and North family models now run on Apple Silicon with the MLX engine
Updated the underlying llama.cpp engine to build 9672
Fixed build artifacts for MLX

Full Changelog: v0.30.9...v0.30.10

What's Changed

Support for Cohere2Moe architecture
Fixed the LFM2 parser/renderer for cases where thinking was not emitted
Fixed an issue where ollama launch claude and other coding agent or assistant use cases would output only one token
Ollama will now return an error if a single message is larger than the current context window

Full Changelog: v0.30.8...v0.30.9-rc1

What's Changed

Fixed ollama launch selecting the wrong provider in some cases
Improved prompt caching by decoupling it from context shift for better KV cache reuse
More stable MLX inference with hardened linear and embedding layers
MLX runner now creates snapshots during prompt processing and speculative decoding for improved reliability
Improved recurrent model support with per-boundary states from the gated-delta kernels
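
To illustrate the KV cache reuse idea mentioned above, here is a minimal sketch of prefix matching between a cached token sequence and a new prompt. It is only an illustration of the general technique, not Ollama's implementation; the function name and the token IDs are made up for the example.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[int], prompt: Sequence[int]) -> int:
    """Return how many leading tokens of `prompt` match `cached`.

    Hypothetical helper: tokens in the shared prefix could reuse existing
    KV cache entries, and only the remainder needs prompt processing.
    """
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


# Made-up token IDs for illustration.
cached_tokens = [1, 15, 27, 42, 99]
new_prompt = [1, 15, 27, 42, 7, 8]

keep = reusable_prefix_len(cached_tokens, new_prompt)
to_process = new_prompt[keep:]
print(keep, to_process)  # 4 [7, 8]
```

The longer the shared prefix, the less of the prompt has to be processed again.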

Full Changelog: v0.30.7...v0.30.8

Ollama Launch now supports Hermes Desktop, a native desktop interface for the Hermes agent. Run it alongside your Hermes agent to get a visual interface for managing conversations, integrations, and messaging apps.

ollama launch hermes-desktop

What's Changed

Hermes Desktop is now available via ollama launch hermes-desktop with native Windows configuration path support
OpenAI-compatible API models list now aligns with available model tags
Added documentation describing the llama.cpp update process
Updated Zod schema examples to use the native toJSONSchema helper
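
As a rough sketch of checking the OpenAI-compatible models list, the snippet below extracts model IDs from a list response. It assumes the standard OpenAI list shape (`{"object": "list", "data": [{"id": ...}]}`) and a local server on the default port 11434; the helper names and the sample payload are illustrative, not taken from the release.

```python
import json
import urllib.request
from typing import Any, Dict, List


def model_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:11434") -> List[str]:
    """Query the models list; assumes a server is listening at `base_url`."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return model_ids(json.load(resp))


# Illustrative payload only; a real response depends on the installed models.
sample = {
    "object": "list",
    "data": [
        {"id": "gemma4:12b", "object": "model"},
        {"id": "gemma4:e4b-it-qat", "object": "model"},
    ],
}
print(model_ids(sample))  # ['gemma4:12b', 'gemma4:e4b-it-qat']
```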

Full Changelog: v0.30.6...v0.30.7

New models

Gemma 4 QAT weights: the Gemma 4 family is now optimized with Quantization-Aware Training (QAT) to dramatically reduce memory requirements and maximize on-device performance. Look for the tags ending in -qat:
- gemma4:e2b-it-qat
- gemma4:e4b-it-qat
- gemma4:12b-it-qat
- gemma4:26b-a4b-it-qat
- gemma4:31b-it-qat
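
If you keep a list of tags in a script, the QAT variants can be picked out by their suffix. A small sketch, using the tags listed above plus gemma4:12b for contrast; the helper name is hypothetical.

```python
from typing import Iterable, List

GEMMA4_TAGS = [
    "gemma4:12b",
    "gemma4:e2b-it-qat",
    "gemma4:e4b-it-qat",
    "gemma4:12b-it-qat",
    "gemma4:26b-a4b-it-qat",
    "gemma4:31b-it-qat",
]


def qat_tags(tags: Iterable[str]) -> List[str]:
    """Keep only the tags that end in the -qat suffix."""
    return [tag for tag in tags if tag.endswith("-qat")]


print(len(qat_tags(GEMMA4_TAGS)))  # 5
```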

What's Changed

ollama launch omp now integrates with Oh My Pi, an AI coding agent with IDE integration
MLX embedding layers now use NVFP4 global scale for improved quantization on Apple Silicon

Full Changelog: v0.30.5...v0.30.6

What's Changed

Fixed the gemma4:12b floating point exception crash on x86, CUDA, Linux, and Windows systems.
ollama launch hermes-desktop now launches Hermes Desktop and can skip rebuilding when a packaged desktop app is already installed.
ollama launch hermes now supports native Windows installs through the Hermes PowerShell installer.
Added Cline CLI integration docs.

Full Changelog: v0.30.4...v0.30.5

New models

Nemotron-3-Ultra: NVIDIA Nemotron 3 Ultra is built for high-throughput reasoning and long-running agent workflows.

What's Changed

Fixed multimodal models not using the GPU on the llama.cpp backend. They can now use Metal GPU offload on Apple Silicon, improving multimodal performance on supported Macs.
ollama create --experimental now respects REQUIRES in Modelfiles for MLX-based models.
ollama launch codex now cleans up old conflicting Codex profile config before launching.
ollama launch pi now migrates users from the legacy Pi package to the official package and preserves the correct npm install prefix.
Pi web search setup now updates only when a newer package is available.
Windows cleanup now terminates the llama.cpp backend more reliably.
Updated the llama.cpp backend.

Known Issues

gemma4:12b crashes with a floating point exception.

Full Changelog: v0.30.3...v0.30.4

New models

Gemma 4 12B: high-performance multimodal intelligence that runs directly on laptops, combining efficiency with advanced reasoning.

What's Changed

Added support for gemma4:12b.

Full Changelog: v0.30.2...v0.30.3

What's Changed

ollama launch now supports Qwen Code and can guide users through installing the Cline CLI when it is missing.
ollama launch codex now uses an isolated launch configuration, avoiding conflicts with a user's existing Codex settings.
Added llama.cpp backend compatibility support for Poolside's Laguna architecture.
The llama.cpp backend now includes cached prompt tokens in token accounting, improving usage reporting for requests with prompt cache hits.
The llama.cpp backend now ignores SSE ping comments, improving streaming compatibility with newer backend behavior.
The llama.cpp backend now detects load stalls from server output so failed model loads surface more reliably instead of hanging.
Radeon 8060S integrated GPUs are now allowed by default.
Template details are included in logs to make troubleshooting model prompts easier.
Added Hermes Desktop configuration docs.
Fixed a build issue in the Laguna compatibility patch, restoring Laguna support in release builds.
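
For context on the SSE ping change above: in a Server-Sent Events stream, a line starting with a colon is a comment, and servers often send such lines as keep-alive pings. The sketch below shows one way a client could skip them. It is a simplified illustration, not the code used in Ollama; the function name and the JSON payloads are made up, and each `data:` line is yielded on its own rather than joined into multi-line events.

```python
from typing import Iterable, Iterator


def sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, skipping SSE comments."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ":" are comments.
        if not line or line.startswith(":"):
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            yield value[1:] if value.startswith(" ") else value


# Hypothetical stream with ping comments between data events.
stream = [
    ": ping\n",
    'data: {"token": "Hel"}\n',
    "\n",
    ": ping\n",
    'data: {"token": "lo"}\n',
    "\n",
]
print(list(sse_data(stream)))  # ['{"token": "Hel"}', '{"token": "lo"}']
```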

Full Changelog: v0.30.0...v0.30.2

Ollama 0.30 is now available, with improved compatibility and performance using llama.cpp. This augments the MLX engine on Apple Silicon, bringing support to a wider range of hardware.

This release brings support for a wider range of models, including GGUF-based models from Hugging Face and your own fine-tuned models, along with faster performance on NVIDIA hardware.

Known issues:

laguna-xs.2 is not yet supported on Windows/Linux.
llama3.2-vision is not yet supported.
nomic-embed-text now converts inputs to lowercase, per the model card. Prior Ollama versions incorrectly preserved mixed case.
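
One practical consequence of the nomic-embed-text change is that inputs differing only in case are now treated the same, so a client-side embedding cache can be keyed on the lowercased text. A minimal sketch, assuming Python's `str.lower()` is a close enough stand-in for the model's lowercasing (it may differ for some non-ASCII text); `cached_embedder` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedding call.

```python
from typing import Callable, Dict, List


def cached_embedder(
    embed: Callable[[str], List[float]],
) -> Callable[[str], List[float]]:
    """Wrap `embed` with a cache keyed on the lowercased input text."""
    cache: Dict[str, List[float]] = {}

    def lookup(text: str) -> List[float]:
        key = text.lower()  # assumption: approximates the model's lowercasing
        if key not in cache:
            cache[key] = embed(text)
        return cache[key]

    return lookup


calls: List[str] = []


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding request; records each call."""
    calls.append(text)
    return [float(len(text))]


lookup = cached_embedder(fake_embed)
lookup("Hello World")
lookup("hello world")
print(len(calls))  # 1
```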

Releases: ollama/ollama
