Added:
- New release: **v0.23.0 Pre-release** (12 Jun 2025) with 408 commits from 200 contributors (63 new).
- **Highlights**:
  - **DeepSeek-V4**: Further optimised across backends – decoupled sparse MLA metadata, TRTLLM-gen attention kernel, EPLB for Mega-MoE, selective prefix-cache retention, index-share for DSA MTP, XPU decode path, and more.
  - **Model Runner V2 (MRv2)**: Now default for Llama and Mistral dense models. Gained FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, kernel block-size support for hybrid models, and Gemma 4 MTP.
  - **Rust frontend (experimental)**: Added streaming generate, dynamic LoRA endpoints, `/version` and `/server_info` endpoints, server-router extension, request-ID headers, and new tool parsers.
  - **Gemma 4**: Encoder-free Unified support and MTP added; numerous accuracy and startup fixes.
  - **Transformers v5 compatibility**: vLLM now targets Transformers v5 (v4 deprecated), with vendored MiniCPM-V/O processors and compatibility fixes.
  - **Multi-tier KV cache offloading**: Object-store secondary tier, HMA default for capable connectors, per-request offloading policy.
  - **Unified parser**: Reasoning and tool-call parsing unified behind a single `Parser.parse()` interface.
- **Model Support**: New models – MiMo-V2.5, Step-3.7-Flash, Cosmos3 Reasoner, Gemma 4 Unified encoder-free, JetBrains Mellum v2, Granite Speech Plus, Cohere Mini Code. Plus extensive fixes and enhancements for Qwen3-VL, GLM, Bailing-MoE, MiniCPM, and others.
- **Engine Core**: MRv2 defaults, FlashInfer sampler, breakable CUDA graphs, pipeline-parallel bubble elimination, speculative decoding improvements (DFlash, lookahead caching), attention/hybrid Mamba refactors, pluggable KVCacheSpec, garbage collector freeze, sparse NCCL weight transfer.
- **Large Scale Serving & Distributed**: KV cache offloading enhancements, PP-aware handshake, multiple connector fixes (Nixl, Mooncake, LMCache), async EPLB default, DP placement groups, SSL for DP supervisor.
- **Hardware & Performance**: FP8 FlashInfer attention for ViT, Triton MoE backend on Hopper default, CUTLASS FP8 scaled-mm padding bypass, MoE-permute pre-allocation, TurboQuant. ROCm 7.2.3, AITER v0.1.13.post1, native W4A16 kernels for RDNA3, attention-sink support. Intel XPU updates. CPU zentorch W8A8/W4A16, TPU upgrades. Continued libtorch stable ABI migration.
- **Quantization**: ModelOpt LM-head quantization and MXFP8 non-gated MoE; compressed-tensors WNA8O8Int linears.

Removed:
- **v0.18.0** release (20 Mar 2025) was removed from the releases list.