Releases: vllm-project/vllm
v0.23.0
vLLM v0.23.0 Release Notes
Please note that Minimax M3 is not yet supported in this version. Please follow vLLM recipe for usage guides for M3.
Highlights
This release features 408 commits from 200 contributors (63 new)!
- DeepSeek-V4 matures across backends: Following its introduction in v0.22.0, DeepSeek-V4 received another large hardening and optimization pass. Its sparse MLA metadata is now decoupled from DeepSeek-V3.2 (#44699), it gained a TRTLLM-gen attention kernel (#43827), EPLB support for the Mega-MoE (#43339), selective prefix-cache retention for sliding-window KV cache (#43447), and an index-share feature for DSA MTP (#44420). The model was also detached from
torch.compile(#43746, #43891), its attention and RoPE paths were refactored (#44569, #44262, #43926), and an XPU attention decode path was added (#42953). - Model Runner V2 expands to more dense models: MRv2 is now selected by default for Llama and Mistral dense models (#43458) in addition to Qwen3. It gained a FlashInfer sampler (#42472), breakable CUDA graphs (#44050), pipeline-parallel bubble elimination (#42187), kernel block-size support for hybrid models (#38831), and Gemma 4 MTP (#43241).
- Rust frontend grows up: The experimental Rust frontend added a streaming
generateendpoint (#43779), dynamic LoRA endpoints (#43778),/version(#43854) and/server_info(#43942) endpoints, a server-router extension hook (#43774), request-ID headers (#43883), and many new tool parsers (InternLM2 #43481, hy_v3 #43872, Phi-4-mini #44213, Gemma4 #43850). - Gemma 4: Added encoder-free Gemma 4 Unified support (#44429) and Gemma 4 MTP (#43241), plus numerous accuracy and startup fixes.
- Transformers v5 compatibility: vLLM now targets Transformers v5, deprecating v4 support (#40389), with vendored MiniCPM-V/O processors (#44282) and compatibility fixes for Sarvam (#38804) and Voxtral (#44559).
- Multi-tier KV cache offloading: The offloading framework gained an object-store secondary tier (#41968), HMA enabled by default for capable connectors (#41847), tiering support for HMA models (#44287), and a per-request offloading policy via the
on_new_requestlifecycle hook (#43205). - Unified parser: Reasoning and tool-call parsing are now unified behind a single
Parser.parse()interface (#44267), with the Responses parser migrated to it (#42977).
Model Support
- New models: MiMo-V2.5 (#40967), Step-3.7-Flash (#43859), Cosmos3 Reasoner (#43356), Gemma 4 Unified encoder-free (#44429), JetBrains Mellum v2 (#43992), Granite Speech Plus (#43519), Cohere Mini Code (#44707).
- Gemma 4: Encoder-free Unified support (#44429), MTP (#43241), native ViT linear layers (#43798), vision-embedder excluded from quantization (#44571), and fixes for MTP under TP>1 (#43909), block-table mismatch under concurrency (#43982), transformers-processor startup crash (#44232), and CPU init (#44615).
- Transformers v5: Deprecate v4 support (#40389), vendor MiniCPM-V/O processors (#44282), Sarvam compat (#38804), Voxtral
fetch_audiofor transformers≥5.10 (#44559). - Model fixes & enhancements: Qwen3-VL/Qwen3-omni-thinker deepstack accuracy under
torch.compile(#43617), EVS for Qwen3-VL (#44205), GLM-5.1 PP loading (#42944), GLM-4.1V processor logits (#43575), GLM-4.6V video loader (#44417), OlmoHybrid init (#43846), HyperCLOVAX remote-code removal (#43860), Bailing-MoE rotary factor (#43770), Step3 PP residual KeyError (#37622), MiniCPM-V-4.6 video (#44509), MiniCPM-O audio unpadding (#38053), MiniCPM-V batched preprocessing (#44609), FunASR-Nano init (#44215), Cohere routing method (#44021), Kimi-K2.5 FlashInfer ViT metadata (#44493). - Multimodal: Auto-select registered video loader for VLMs (#44126), O(log n) multimodal item handling per step (#44212), local image encoding in benchmarks (#43843), interleaved custom image benchmark datasets (#43636).
- Pooling/Classification: Proper exceptions for pooling UX (#44593),
extra_repr()for pooler classes (#44805), LoRA-adapter-name pooling fix (#44410), resettled generative scoring entrypoint (#44153), expanded pooler unit tests (#43818, #44471). - Refactor: AutoWeightsLoader for InternLM2 (#38278).
Engine Core
- Model Runner V2: Default for Llama and Mistral dense models (#43458), FlashInfer sampler (#42472), breakable CUDA graphs (#44050), removed Eagle's dedicated CUDA graph pool (#44078), pipeline-parallel bubble elimination (#42187), kernel block size for hybrid models (#38831), zeroing of freshly allocated KV blocks for hybrid + FP8 KV cache (#43990), actual batch
max_seq_lenfor attention metadata (#43991), rejection-sampling acceptance-rate fix (#40651), KVConnector + PP cleanup (#43732), speculator-prefill warmup/capture (#44253). - Speculative decoding (DFlash): Causal DFlash (#43445), proper lookahead-slot allocation (#43733), prefix-cache corruption fix (#42971); independent drafter attention-backend selection (#39930), attention-group split by
num_heads_qfor drafts (#43543), EAGLE/MTP lookahead caching in the SWA prefix-cache mask (#44082). - Attention & hybrid/Mamba: FlexAttention/FlashAttention num-blocks-first layouts (#42095), OOT MLA prefill backend registration (#43325), FlashAttention upstream sync (#44065), Mamba LINEAR attention-module refactor (#43556), corrupted MLA + linear attention fix (#43961), KDA conv-state unification (#44539) and gate/cumsum fusion (#43667), Mamba SSD
do_not_specialize(#43803), Qwen3.5 mixed prefill+decode split routing (#44700), MiniMax-M2 gate kernel (#38445). - KV cache & scheduler: Pluggable
KVCacheSpec(#37505),scheduler_block_sizethreaded into KVCacheManager/Coordinator (#44165),max_concurrent_batchesmoved toVllmConfig(#44274), config validation rejecting 0/negative knobs (#43794, #44057, #44207), KV-cache scale boilerplate removed from weight loading (#43167). - Core: Freeze the garbage collector in workers after model init (#44363), sparse NCCL weight transfer for in-place updates (#40096), graceful spinloop ext-load failure handling (#43659), scheduled-function deprecations (#43358).
Large Scale Serving & Distributed
- KV cache offloading: Object-store secondary tier (#41968), HMA on by default for capable connectors (#41847) and tiering (#44287), per-request offloading policy (
on_new_request) (#43205) andon_schedule_end()hook (#44206), token-offset selective offload (#39983), skip decode-phase blocks in CPU offload (#43797), page-size block alignment (#43689), Triton fast-path for small CPU→GPUswap_blocks_batch(#42212), stale sliding-window block fix (#42959). - KV connectors / disaggregated serving: PP-aware handshake aggregation and intermediate-PP output plumbing (#43720), multiple-async-KV-load deadlock fix (#44560), Nixl Mamba prefix-caching mode (#42554), NixlConnector
kv_bothrole deprecation cycle (#43874), Mooncake fixes (#43742, #44103, #42694), LMCacheLMCacheMPConnector(#42865), EC connector shutdown API (#42423) and non-blocking lookup (#41627), KV-transfer tokens excluded fromiteration_tokens_total(#43346). - EPLB: Async EPLB by default (#43219), EPLB for DeepSeek-V4 Mega-MoE (#43339), Nixl zero-copy EPLB transfers (#41633).
- Data parallel: DP Ray placement groups on specific nodes (#44669) and grouped-node allocation fix (#43998), SSL for the DP supervisor (#43688), DP-coordinator startup timeout raised to 120s (#42343), per-GPU-worker RDMA NIC selection (#42083).
Hardware & Performance
- NVIDIA / kernels: FP8 FlashInfer attention for ViT (#38065), Triton MoE backend on Hopper by default (#44220), CUTLASS FP8 scaled-mm padding bypass (+20%) (#43706), MoE-permute buffer pre-allocation (+9–14%) (#43014),
Fp8BlockScaledMMnew_empty()optimization (#43677), TurboQuant shared dequant buffers (#40941), tunedselective_state_updatefor H200/RTX PRO (#44251), Inductor fast-path fallback for vLLM/AITER custom ops (#42129), Gemma RMS all-reduce fusion (#42646), NUMA auto-binding on DGX B300 (#43270). - AMD ROCm: ROCm 7.2.3 (#43136), AITER v0.1.13.post1 (#44265), native W4A16 (#41394) and fused-MoE W4A16 HIP (#44075) kernels for RDNA3 (gfx1100), AITER top-k/top-p sampler by default (#43331), attention-sink support in AITER FA (#43817), AITER hipBLASLt GEMM online tuning (#40426),
permute_colsfor ROCm (#44674), blocks-first KV layout for AMD (#43660), N=5 wvSplitK for spec decode (#40687), MoRI connector improvements (#43303, #41751, #40344). - Intel XPU: vllm-xpu-kernel v0.1.7 (#41019),
block_fp8_moe(#42139), block-scaled W8A8 FP8 path (#39968), WNA16 oracle for GPTQ sym-int4 (#41426), rms_norm/act quant fusions (#43963), GDN-attention MTP (#43565), Triton selective-scan op (#43421), transparent sleep mode (#37149), CPU/tiering offloading on XPU (#36423), DeepSeek-V4 attention decode path (#42953). - CPU & other architectures: zentorch-accelerated W8A8/W4A16 on AMD Zen CPUs (#41813), CPU top-k/top-p Triton sampling (#43633), non-divisible GQA decode in mixed batches (#43032),
cpu_awqfolded intoawq_marlin(#43841), RISC-V RVV WNA16 helpers (#42730), fused GDN gated-delta-rule kernels (#43534), PowerPC SHM communicator (#43754), arm64 CI image (#41303). - TPU: tpu-inference upgraded to v0.20.0 (#43394) then v0.21.0 (#44621).
- torch stable ABI: Continued migration of kernels to the libtorch stable ABI — merge_attn_states/mamba/sampler [8/n] (#43361), attention/cache kernels [9/n] (#43717), header files (#44013), cuda_view/silu_and_mul [10/n] (#44334), custom all-reduce/DeepSeek-V4 fused MLA/MXFP8 MoE [10b/n] (#44365); ROCm fallback to regular ABI (#44648),
_has_moduletrial-import verification (#44035).
Quantization
v0.22.1
Highlights
This release features 8 commits from 6 contributors (1 new)!
v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.
Model Support
- New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (#43992).
- DeepSeek-V4: resolve a CUTLASS
fmincompatibility issue that broke initialization (0decac0). - Fix
OlmoHybridForCausalLMfailing to initialise after the checkpoint changedrope_parametersfromNoneto{"rope_type": None}(#43846). - Fix HyperCLOVAX loading after the upstream HuggingFace repo removed its remote code (now native in
transformers >= 5.9.0): register thehyperclovaxmodel_type so vLLM uses its vendored config instead of the staleauto_map(#43860).
Hardware & Performance
- AMD Zen CPUs: route W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference through zentorch kernels, registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU (#41813).
Large Scale Serving
- Fix a deterministic hang in multi-node Ray data-parallel serving with
num_api_servers > 1by excluding the Ray DP backend from the deferred (kernel-assigned) port allocation introduced in #42585 (#43864).
Build & CI
- Docker: stop installing
flashinfer-jit-cachevia--extra-index-urlwhile it is quarantined on PyPI, fixing image builds (#44366). - Normalize NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major is kept, fixing
ImportError: libcudart.so.12when importingnixl_epon CUDA 13 images (#44266).
Contributors
@khluu, @vadiklyutiy, @aadwived, @shadeMe, @alec-flowers, @hmellor
New Contributors
v0.22.0
Highlights
This release features 459 commits from 230 contributors (63 new)!
- DeepSeek V4 maturity: DeepSeek V4 received a major hardening pass this cycle — the model was reorganized into a dedicated
vllm/models/deepseek_v4/package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE,mhc, Q-norm, indexer, sparse MLA) and ROCm parity fixes landed alongside accuracy fixes (#42810, #43710). - Model Runner V2 advances toward default: MRv2 is now default for Qwen3 dense models. vLLM will fall back to MRv1 for features that aren't yet supported in MRv2 (#39337). sleep-mode weight reload (#42673),
update_config(#42783), and shared KV-cache layers (#35045), plus many correctness fixes. - Experimental Rust frontend: A new Rust front-end integration landed (#40848), with the implementation moved into the tree (#43283) and a DP Supervisor for data-parallel serving (#40841).
- Batch invariance, faster: Batch-invariant inference gained Cutlass FP8 support for a 28.9% end-to-end latency improvement (#40408), compile-mode support on SM80 (#42456), and an NVFP4 Cutlass linear path (#39912).
- Multi-tier KV cache offloading: A new multi-tier KV cache offloading framework (#40020) with a Python filesystem secondary tier (#41735), DSv4 support (#43142), and Mooncake disk offloading (#42689) extends offloading beyond CPU memory.
Model Support
- New architectures: MiniCPM-V 4.6 (#41254), InternS2 Preview (#42705), OpenVLA (#42654), MolmoWeb
hf_overridesdocs (#42163); EXAONE-4.5 aligned with Transformers update (#42246). - Speculative decoding: custom callable proposer backend (#39487), post-norm EAGLE-3 speculators (#42764), peagle speculators (#41826), hybrid-attention models in
extract_hidden_states(#39949), non-MTP speculation for NemotronH (#43130), shared MTP weights in MRv2 (#42538). - DeepSeek V4: NVFP4 MoE (#42209), CUDA graph full/piecewise (#42604), MTP (#43385), model package refactor (#43004, #43039, #43073, #43077), sparse MLA + compressor refactor (#43149, #43710), MegaMoE input-prep kernel move (#43632).
- Qwen3.5/3.6: GDN output-projection flatten (#42311), GatedDeltaNet Marlin TP≥2 fix (#36329), ViT full CUDA graph (#42151), runai-streamer weight loading for Qwen3.5/MTP/Qwen3-VL (#42521, #42716), KDA chunk-prefill exp2 semantics (#43195).
- Gemma3/Gemma4: mixed-resolution image co-batching crash fix (#42217), MoE routing closure fix (#42250), tool-parser float-corruption fix (#42128), batched vision encoder for image/video (#43169), multi-GPU fix (#42630).
- Kimi-K2.5: skip vision-tower dtype conversion under quantization (#42869),
mm_projectordtype fix (#42081). - Cohere: enable Cohere MoE (#43143), pipeline parallelism for Cohere vision (#42819).
- Tool calling: Apertus tool parser (#41154), Qwen3Coder
anyOf/oneOf/$refresolution re-land (#37831), sharedcoerce_to_schema_typeacross MiniMax-M2 / DeepSeek-V3.2 / Seed-OSS parsers (#43006, #43019, #43140). - ViT CUDA graph: Qwen2-VL (#41736), Step3-VL encoder (#42224), Qwen3.5 (#42151), FlashInfer metadata for Qwen2.5-VL vision attention (#42787).
Engine Core
- Model Runner V2: Qwen3-dense-by-default oracle (#39337), sleep-mode reload weights (#42673),
update_config(#42783), shared KV-cache layers (#35045), FP32 gumbel sampling (#41775), auto-fallback to MRv1 with connectors (#42955),logprob_token_idscorrectness (#43125, #41761), prompt-logprobs size fix (#42778). - KV offloading: multi-tier framework (#40020), Python filesystem secondary tier (#41735), DSv4 support (#43142), tier-offload follow-up (#42529), prefer HND layout (#41928),
reset_cache()(#41956), per-request tracking (#42507), store-deferral fix (#41945). - MoE refactor:
ExpertMapManager(#41046), experts moved toexperts/(#42334),RoutedExpertsalias for FusedMoE (#40735), EPLB refactoring for FusedMoE (#41055). - Mamba: attention module refactor (#41126), Mamba2 SSD kernel warmup (#39822), bf16 SSM cache (#41680), GPU-side state postprocessing fused kernel (#40172), run single-token extends as decodes (#42430).
- KV events: emit KV cache metadata (#40984).
- Allocator: manual cumem allocator enable (#33648), stream-aware free callback (#43020).
- elastic-EP: stage/commit MoE quant method on reconfigure (#40881).
Hardware & Performance
- NVIDIA Blackwell / SM12x: FlashInfer b12x MoE + FP4 GEMM for SM120/121 (#40082), per-tensor FP8 CUTLASS on SM12.1 (#41215),
head_dim=512for FlashInfer TRTLLM attention (#38822), FlashInfer Blackwell GDN prefill (#40717), GDN prefill kernel for SM100 (#43273). - Performance: batch-invariant Cutlass FP8 (+28.9% E2E) (#40408), CutlassFP8 padding pre-processing (+13.5% TTFT) (#42651), padded NVFP4 quant kernel (+2.4–5.7% E2E) (#42774), GPU<->CPU sync elimination 1/n (#41429) and 4/n (#42347), fused RoPE+KVCache+q_concat for MLA (#40392), MLA
compute_prefill_context/_v_up_projoptimizations (#42460, #42561), penalties Triton kernel (#40657),do_not_specializein fused FP8 RoPE (#42849), FULL CUDA graph capture for TRITON_MLA decode (#42885). - AMD ROCm: DSV4 functionality + accuracy fixes (#42810, #43679 Tilelang MHC), flash sparse MLA Triton kernels (#41812), gluon paged MQA logits on gfx950/MI355X (#42062), RMSNorm+Quant fusion for gfx950 (#41825), AITER FA backend cleanup (#41942), XGMI backend for MoRI connector (#41753), QuickReduce min-size override (#41675), DSV4 MTP (#43385).
- CPU / RISC-V: RVV-optimized attention kernels for RISC-V Vector Extension (#40119) with VLEN=256 (#42943), fused GDN for AMX CPU (#42707), MXFP4 W4A16 MoE (#41922), experimental Triton + MRv2 on CPU (#43225), improved CPU thread utilization (#42666),
--cpu-distributed-timeout-seconds(#42968). - Intel XPU: GPTQ int4 support (#37844), mxfp8 MoE (#41918), FP8 block-scaled quantization (#42952), custom-op collective behavior (#41354), multiple sparse-attention kernels (#37888), MoE topk routing + MXFP4 fallback (#42951), CT W4A4 MXFP4 path (#38896), reduced XPU MoE host overhead (#42915).
- Kernel ABI: continued migration to libtorch stable ABI — 5/n (#42339), 6/n (#42663), 7/n (#43209).
- Experimental: breakable CUDA graph (#42304).
Large Scale Serving
- Disaggregated serving (NIXL): lease-renewal TTL for KV blocks on P (#41383), handshake-failure policy honoring (#40364), GDN support for PD with NIXL (#41869), multi-node TP>8 fix (#39907), side-channel host-selection fix (#41806).
- Mooncake: disk offloading in MooncakeStoreConnector (#42689), HMA support for DSV4 (#42828), operation metrics (#43392), load-failure propagation (#42788), block-aligned full hits (#43494), finish-after-preemption handling (#43281).
- Data parallel: DP Supervisor (#40841), publish request counts at engine-step start (#41626), forward
X-data-parallel-rankheader (#42330). - EPLB: change default EPLB communicator (#43110), VLM-wrapper init fix (#39805), remove dead
torch.accelerator.synchronize()(#40733). - LoRA: one-shot Triton kernel for MoE LoRA (#42290), simultaneous 2D & 3D MoE LoRA adapters (#42242), reduced 2D-weight memory under EP (#42737), MoE LoRA align-kernel grid fix (#40131).
Quantization
- MXFP4: linear layers + compressed-tensors integration (#41664), CPU W4A16 MoE (#41922), XPU mxfp8 MoE (#41918).
- NVFP4: DeepSeek V4 fused MoE (#42209), ModelOpt W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566), batch-invariant NVFP4 Cutlass linear (#39912), FlashInfer TRTLLM NvFP4 monolithic MoE routing fix (#43223), TRTLLM NVFP4 MoE chunking fix (#43599).
- Quark: load Quark NVFP4 checkpoints (#35859), W8A8 INT8 garbage-output fix on Step-3.5-Flash (#41892), W4A4 oracle refactor (#41436).
- AutoRound: W4A16 support (#39778).
- ModelOpt: Qwen3.5/3.6 VLM quantized prefix mapping (#42546).
- Framework: rework
quantization_configto useQuantKeywith activation override (#41566), MoE W4A8 CT migrated to oracle (#42680), AWQ Marlin MoE onto modular WNA16 oracle (#42483), GPTQ consolidation (gptq_marlin→auto_gptq) (#38288).
API & Frontend
- Rust frontend: integration (#40848), in-tree code move (#43283), utility call-ID newtype (#43405), simplified
AuthenticationMiddlewarepath extraction (#43426). - Responses API:
chat_template_kwargssupport (#42272), message-merging fix (#42189), empty channel/recipient harmony fix (#35540). - Completions:
thinking_token_budgetsupport (#42116) with inverted-condition fix (#41674); mapreasoning_efforttoenable_thinking(#43401). - Frontend: truncation side for OpenAI endpoints (#43260), normalize
reasoning_content→reasoning(#42664), reworked fastokens integration (#43168), consolidated Speech-to-Text entrypoints (#42370, #42274), beam-search consolidation viaBeamSearchMixin(#42946), score/rerank chat-template instructions (#42412). - Auth: API-key authorization for
/v2endpoints (#42594). - Offline API: pooling offline API split into
PoolingOfflineMixin(#42267), split offline inference APIs/utils (#43553).
Build & Dependencies
- CUDA 12.9 wheel builds switched to PyTorch
manylinux_2_28base (#41668). - FlashInfer bumped to v0.6.11.post2 (#41711);
nvidia-cutlass-dslto 4.5.2 (#42991, #43230, #43745); llguidance to 1.7 (#42150);triton_kernelsdowngraded to v3.5.1 for gpt-oss (#43135). - Rust frontend build:
setuptools-rustdependency (#43287, #43377), pinnedprotocin rust-build stages (#43292). - Docker: non-root
vllm-openaitarget (#40275), buildmooncake-transfer-enginefrom source (#42114), AINIC & Thor NIC support (#40453); Python-only installation made optional (#42293). - vllm-tpu: disable build isolation for CUDA deps (#43038), tpu-inference docker build fix (#43360).
hummingMoE backend dependency adde...
v0.21.0
Highlights
This release features 367 commits from 202 contributors (49 new)!
- Transformers v4 deprecated: This release formally deprecates
transformersv4 support (#40389). Users should migrate totransformersv5. - C++20 build requirement: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a breaking build change.
- KV Offload + Hybrid Memory Allocator (HMA): The KV offloading subsystem now integrates with the Hybrid Memory Allocator, including scheduler-side sliding window group support and full HMA enablement (#41228, #41445, #39571).
- Speculative decoding with thinking budget: Speculative decoding now respects reasoning/thinking budgets, enabling correct spec decode for reasoning models (#34668).
- TOKENSPEED_MLA backend on Blackwell: A new TOKENSPEED_MLA attention backend is available for DeepSeek-R1/Kimi-K25 prefill + decode on Blackwell GPUs (#41778).
Model Support
- New architectures: MiMo-V2.5 (#40967), Laguna XS.2 (#41129, #41880), Moondream3 (#32325), Qianfan-OCR (#40136), Cohere MoE (#40817), Cohere Eagle (#42078).
- Speculative decoding: EAGLE for Mistral (#41024), Gemma4 MTP (#41745), MTP for MiMo-V2.5 (#41905), Cohere Eagle (#42078).
- DeepSeek V4: AMD/ROCm support (#40871), pipeline parallelism (#41694),
maxreasoning effort (#40982), disaggregated serving fixes (#41957). - Tool calling: Cohere reasoning and tool parsers (#40422), LFM2/2.5 tool parser (#39243).
- Gemma3/Gemma4:
hidden_actvariant support (#40588), pipeline parallelism fix (#40786), MoE fixes (#41206, #41574, #41401), tool parser crash fix (#41991, #42188). - Model Runner V2: Qwen3.5/Mamba hybrid model support (#35520),
logprob_token_idssupport (#40559). - CUDA graph: ViT CUDA graph support for Qwen2.5-VL (#40830).
- Compatibility: Vendor HCXVisionConfig for Transformers v5 (#38447), legacy
rope_typecheckpoint support (#41734).
Engine Core
- KV offloading + HMA: Scheduler-side sliding window groups (#41228), full HMA enablement (#41445), multi-connector HMA (#39571), per-job store completion (#39186), DCP/PCP support in OffloadingConnector (#41549), MooncakeStoreConnector for distributed KV offloading (#40900).
- Speculative decoding: Thinking budget support (#34668), independent drafter attention backend selection (#39930), multimodal model support with warning (#41752), per-step allocation elimination (#41043).
- Model Runner V2: Rejection sampling acceptance rate fix (#40651), skip metadata rebuild before draft prefill (#40410), rebuild metadata between draft decode steps (#41162), Qwen3.5/Mamba hybrid support (#35520).
- Routing: Replace routing replay with device cache and async D2H pipeline (#39917).
- Ray: RayExecutorV2 enabled by default (#41421), actor name collision fix for DP > 1 (#40398).
- Stability: Two-phase pause to prevent scheduler deadlock (#39366), thread-safe HF tokenizer wrappers (#41181), OOM prevention via
max_split_size_mbduring model loading (#41268). - IndexCache support for DSA models (#37735).
Hardware & Performance
- NVIDIA Blackwell: TOKENSPEED_MLA backend for DSR1/Kimi-K25 (#41778), faster per-token FP8 group quant packed kernel (#41326), FP8 on NVIDIA Thor/SM110 (#39712), CUTLASS scaled mm for non-compatible sizes (#41868).
- Performance: FlashInfer top-k/top-p sampler enabled by default (#40376), FP8 FlashInfer attention for ViT (#38065), TurboQuant shared dequant buffers (#40941),
AllPool.forward51% faster (#41163), GPU<->CPU sync elimination in pooling (#41433) and attention (#41434), numpy zero-copy embedding serialization (#41681), multimodal processor skip for text-only (#41246), FlashInfer FP8 async TP fusion (#39505), NVFP4 all-gather GEMM fusion for AsyncTP (#41882), re-enable allreduce+RMS fusion for DP/PP (#41458), DeepSeek bf16→fp32 viatorch.mm(#41300), persistent MLA for sparse backend (#41990), configurable safetensors checkpoint prefetch (#41499), fused mhc_post_pre kernel (#41536), 2D-grid W8W8 group quant kernel (#42153), relaxed memory ordering for KV cache swaps (#39306). - AMD ROCm: ROCm 7.2.2 (#41386), DBO (Dynamic Batch Optimization) (#34726), AITER Fused Allreduce+RMSNorm (#37646), Fused Shared Expert (FSE) for Qwen3-Next (#39280), DeepSeek V3.2 TP4 AITER MLA (#41835), GDN linear attention fusion (#40711), eliminate redundant MoE buffer copies in AITER (#41713), CPU offloading support (#40549), DeepEP API update (#39721), cap Triton paged attention block size to fix shared memory OOM (#38502).
- CPU: FP8 attention for AMX/AVX-512 (#39445), FP8 W8A16 linear (#41186), FP8 W8A16 MoE (#41314), DNNL AVX2 W8A8 Int8 (#41318), Gated DeltaNet Attention for Qwen 3.5/3.6 (#41025), RISC-V OMP thread auto-binding (#40569).
- Intel XPU: Top-k/top-p sample kernel (#39285), out-of-place all-reduce (#41808), LoRA support (#38206).
- IBM Power: VSX attention backend (#40451).
- FlexAttention: Re-enabled for batch invariant mode (#40842).
- MLA: Abstracted MLA prefill backends, eliminated cuDNN dependency (#32623).
Large Scale Serving
- Disaggregated serving: Bi-directional KV cache transfers between P and D (#32553), NIXL transfer redesign (#40731), EPLB memory overhead optimization (#40013), NIXL connector bumped to 1.x (#42364), Mooncake KVConnectorStats for transfer observability (#40414), NIXL P-node pre-admission rejection notification (#41269), KV block release for skipped P-ranks (#40449).
- DCP: Pack output and LSE in DCP A2A (#41160).
- MoE: PluggableLayer interface for out-of-tree MoE runners (#35178).
- LoRA: Initial expert parallel (EP) support (#40867), Qwen3.5 LoRA fusion fix (#37912).
Quantization
- NVFP4: KV cache support (#40177), Triton dequant/QDQ emulation kernels for Hopper and AMD (#40033), GELU on TRT-LLM NvFP4 fused MoE for Gemma4 (#41050), ModelOpt NVFP4 W4A16 (#41769), NVFP4 all-gather GEMM fusion for AsyncTP (#41882), GLM4-MoE NVFP4 loading fix (#41755).
- MXFP4: Humming MXFP4 MoE backend (#41083), FlashInfer CUTLASS MXFP4-MXFP8 MoE fix (#42089).
- TurboQuant: Hybrid model and uniform quantization support (#39931).
- Compressed tensors: Allow configs with non-explicit ignores (#41965).
- FP8: Bias loading fix (#41424), FlashInfer autotune temporarily disabled for correctness (#41524).
- DSV4: Improved fused Indexer Q quant kernel (#41428).
API & Frontend
- Responses API: Streaming tool/function calling with
required(#40700) and named tool/function choice (#41110), resubmitting output items with missing fields (#41355). - OpenAI compatibility:
system_fingerprintfield in responses (#40537),prompt_embedscontent part support (#40720),defer_loadingandtool_referencesupport (#40190), rendered prompt text in chat completion response (#42052), tolerate empty content in forced tool choice (#40148). - Tool calling: XGrammar 0.2.0 with structural tags for strict tool calling + reasoning (#40894), Cohere reasoning/tool parsers (#40422), LFM2/2.5 tool parser (#39243).
- Tokenizer: Fastokens support (#41741).
- RLHF: Explicit
/start_weight_updateand/finish_weight_updateAPIs (#39212). - ASR: Engine request abort on cancellation (#41266).
- Configuration:
VLLM_SKIP_MODEL_NAME_VALIDATIONenv var (#34676), configurable model weights loading tracking (#41086), Triton JIT compilation monitor (#40137).
Build & Dependencies
- Breaking: C++20 required for PyTorch compatibility (#40380).
- Breaking: Transformers v4 deprecated (#40389).
- Docker image size reduced by ~2.5 GB via deferred FlashInfer cubin download (#41134).
- CUDA 13.0 wheels switched to PyTorch manylinux_2_28 base (#41416).
- DeepGEMM bundled wheel built per-Python for CPython compatibility (#41516).
- Container image provenance metadata embedded (#40653).
- tpu-inference upgraded to v0.19.0 (#41844).
- NIXL connector bumped to 1.x (#42364).
- ROCm 7.2.2 (#41386).
Contributors
@AndreasKaratzas, @haosdent, @khluu, @yewentao256, @stecasta, @mgoin, @Isotr0py, @hmellor, @chaunceyjiang, @jeejeelee, @noooop, @MatthewBonanni, @njhill, @zyongye, @yzong-rh, @ronensc, @NickLucche, @chaojun-zhang, @dzhengAP, @chfeng-cs, @TheEpicDolphin, @esmeetu, @wzhao18, @ZJY0516, @juliendenize, @kylesayrs, @fadara01, @Etelis, @tianmu-li, @arpera, @ekagra-ranjan, @orozery, @wxsIcey, @jikunshang, @izhuhaoran, @rasmith, @russellb, @Lucaskabela, @Harry-Chen, @alec-flowers, @pmaybank, @Terrencezzj, @hickeyma, @Baekpica, @itej89, @fxmarty-amd, @WoosukKwon, @juhi10071998, @sychen52, @baonudesifeizhai, @vllmellm, @johncalesp, @the-david-oy, @lucianommartins, @bittoby, @Dao007forever, @lyd1992, @yuwenzho, @lesj0610, @sfeng33, @micah-wil, @akii96, @yma11, @SoluMilken, @mmangkad, @SiluPanda, @ojhaanshika, @zhandaz, @bhoomit, @simon-mo, @msanft, @angelayi, @anthonsu, @artem-spector, @zhangxin81, @benoittgt, @joerowell, @yangrz7, @chelnnexy, @liangel-02, @walterbm, @rishitdholakia13, @SKRohit, @BugenZhao, @JaredforReal, @amd-lalithnc, @frgossen, @h-avsha, @DarkLight1337, @danisereb, @laithsakka, @Bortlesboat, @wangluochao902, @Rohan138, @hao-aaron, @puririshi98, @roikoren755, @heachary, @UranusSeven, @dsingal0, @ChenxiQ, @snadampal, @ilmarkov, @wendyliu235, @lequytra, @JisoLya, @LuisRobaina, @sniper35, @eicherseiji, @Yuyi-Ao, @raviguptaamd, @sungsooha, @ganyi1996ppo, @andylolu2, @FredericOdermatt, @ProExpertProg, @rbrugaro-amd, @mcsantiago, @hnt2601, @jinzhen-lin, @taneem-ibrahim, @tomeras91, @alex-jw-brooks, @Aktsvigun, @HanFa, @netanel-haber, @JasonKeyiL, @gshtras, @joa-stdn, @Seven-Streams, @JartX, @xuechendi, @BowenBao, @Akashcodes732, @jeffreywang-anyscale, @czhu-cohere, @zhewenl, @marvinzh, @Lidang-Jiang, @gcanlin, @whx-sjtu, @S1ro1, @liulanze, @Dhruvilbhatt, @laviier, @wi-adam, @aaab8b, @yuankaichen-amd, @ZhanqiuHu, @QwertyJack, @viktorpusTT, @divakar-amd, @starkwj, @benchislett, @jcyang43, @JLiu4Coding, @xy3xy3, @hongxiaya...
v0.20.2
vLLM v0.20.2
Highlights
This release features 6 commits from 6 contributors (0 new)!
This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL
Bug Fixes
- DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of
max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605). - DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
- gpt-oss MXFP4 + torch.compile: Plumbed
hidden_dim_unpaddedthrough themoe_forwardfake op so MXFP4 works undertorch.compileon v0.20.x (#42002, backport of #41646). - Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).
Contributors
v0.20.1
vLLM v0.20.1
This is a patch release on top of v0.20.0 primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes.
DeepSeek V4
- Base model support (#41006).
- Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default
VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD(#41526). - BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960).
- PTX
cvtinstruction for faster FP32->FP4 conversion (#41015). - Integrated tile kernels (
head_compute_mix_kernel) for optimized head computation (#41255). - Guard megamoe flag with Pure TP (#41522).
- Fixed persistent topk cooperative deadlock at TopK=1024 (#41189) and inter-CTA init race on RadixRowState (#41444), with temporary disable of persistent topk as a workaround (#41442).
- Fixed import error due to AOT compile cache loading (#41090).
- Fixed torch inductor error (#41135).
- Fixed repeated RoPE cache initialization (#41148).
- Fixed missing type conversion for non-streaming tool calls in DSV3.2/V4 (#41198).
Bug Fixes
- Fixed
max_num_batched_tokennot being captured in CUDA graph (#40734). - Fixed
num_gpu_blocks_overridenot accounted for inmax_model_lenchecks (#41069). - Auto-disable
expandable_segmentsaround cumem memory pool (#40812). - Fixed BailingMoE linear layer (#40859) and MLA RoPE rotation for BailingMoE V2.5 (#41185).
- Fixed reasoning parser kwargs not being passed to structured output (#41199).
- [ROCm] Fixed
input_idsandexpert_mapargs for Quark W4A8 GPT-OSS (#41165).
List of contributors
@BugenZhao, @chaunceyjiang, @gau-nernst, @ghphotoframe, @Isotr0py, @jeejeelee, @khluu, @njhill, @Rohan138, @wzhao18, @youkaichao, @ywang96, @ZJY0516, @zixi-qi, @zyongye
v0.20.0
vLLM v0.20.0
Highlights
This release features 752 commits from 320 contributors (123 new)!
- DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950).
- CUDA 13.0 default: Default CUDA wheel on PyPI and
vllm/vllm-openai:v0.20.0image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (#40669). As a general rule of thumb, our CUDA version policy follows PyTorch's. We highly recommend to install vLLM withuvand use--torch-backend=cu129if you are on CUDA 12.9. - PyTorch 2.11 upgrade (#34644): vLLM ships on torch 2.11 for CUDA, and XPU is now also on torch 2.11 (#37947) — XPU is no longer pinned to 2.10. This is a breaking change for environment dependency.
- Python 3.14: Added to the supported Python version list (#34770).
- Transformers v5: vLLM now runs on HuggingFace
transformers>=5(#30566), with vision-encoder torch.compile bypass (#30518) and continued v4/v5 compat fixes including PaddleOCR-VL image processormax_pixels(#38629), Mistral YaRN warning (#37292), and Jina ColBERT rotary inv_freq recompute (#39176). - New large models: Hunyuan v3 (Hy3) preview (#40681) with HYV3 reasoning parser (#40713); Granite 4.1 Vision as a built-in multimodal model (#40282).
- FlashAttention 4 as default MLA prefill: FA4 re-enabled as the default MLA prefill backend (#38819) with head-dim 512 and paged-KV support on SM90+ (#38835), plus an upstream FA4 sync (#38690).
- TurboQuant 2-bit KV cache: New attention backend delivering 2-bit KV cache compression with 4× capacity (#38479), now with FA3/FA4 prefill support (#40092).
- Online quantization frontend: New end-to-end online quantization frontend (#38138), with docs (#39736); experts_int8 consolidated into the FP8 online path (#38463); MXFP8 online quant moved to the new frontend (#40152).
- vLLM IR: Initial IR skeleton with rms_norm op (#33825), OOT-platform kernel imports (#38807), gemma_rms_norm reworked on IR (#39014), and IR op testing/benchmarking infra added (#40167) — foundation for future kernel work.
- Model Runner V2 advances: Eagle prefill full-CUDA-graph (#37588), auto-resolve cudagraph mode/sizes from attention backend (#32936), fused probabilistic rejection sample kernels (#38496), config validation for unsupported features (#38758), piecewise-fallback disabled for eagle draft decodes (#39773), multiple prompt-logprobs support (#39937), prefill warmup coverage (#40746), and a fix for accuracy regression caused by stale sampled/draft tokens (#39833).
- MoE refactor series: Unquantized migrated to Full Oracle Flow (#36286), CT W8A8 to Oracle (#39187), SharedExperts class (#35153),
SharedFusedMoEremoved (#35782), DefaultMoERunner split (#35326) and later combined back intoMoERunnerBase(#40560), shared/fused expert output sum moved intoMoERunnerBase(#35949), ZeroExpertFusedMoE in new framework (#35549),compressed_tensors_moe.pysplit (#38960),GPTQMarlinMoEMethodreworked with MK (#37990), XPU & CUTLASS MoE relocated tofused_moe/experts/(#40568, #40574),make_expert_params_mappingrenamed (#40671), MoE LoRA refactor (#40338), and MoE DP chunking removed (#39107). - Performance: Optimize batch invariant with fused rms norm — 2.1% E2E latency improvement (#40413); avoid
seq_lens_cpuGPU→CPU sync (#40654); cacheInductorPass.hash_source(#39328); skip FX-graph deserialization on loading for faster warm compile (#40151); CUDAGraph memory profiling enabled by default for clearer startup memory accounting (#38284).
Model Support
- New architectures: DeepSeek V4 (#40860), Hunyuan v3 preview (#40681), Granite 4.1 Vision (#40282), EXAONE-4.5 (#39388), BharatGen Param2MoE (#38000), Phi-4-reasoning-vision-15B (#38306), Cheers multimodal (#38788), telechat3 (#38510), FireRedLID (#39290), jina-reranker-v3 (#38800), Jina Embeddings v5 (#39575), Nemotron-v3 VL Nano/Super (#39747).
- Gemma4 series: fast prefill (#38879), quantized MoE (#39045), Eagle3 (#39450), block-local attention + YaRN for Gemma3 (#39823), bidirectional vision attention for sliding layers (#40534), token-repetition fix via dynamic BOS (#39842), multimodal embedder norm-order fix (#40411), plus a string of streaming/tool-call fixes (#38844, #38909, #38992, #39114, #39679, #39027).
- Quantization formats: GGUF support for MiniMax-M2.1 (#36965), non-standard GGUF quant types with prefix such as UD-IQ1_S (#39471).
- Speculative decoding: Eagle3 for MiniMax-M2 (#37512), Eagle3 for Gemma4 (#39450).
- LoRA: Qwen3ASRForConditionalGeneration (#37247), Gemma4ForConditionalGeneration (#39291, #38844), DeepSeek V3.2 (#35077), Qwen3.5 / Step3.x expert base_layer extension (#37114), MoE LoRA refactor (#40338), dual-CUDA-streams linear layer (#35721).
- Multimodal MRoPE refresh: mm_features-based MRoPE for Ernie-4.5 VL (#39753), Keye-VL / Keye-1.5-VL (#39869), PaddleOCR-VL (#39888).
- Other: Nano-Nemotron-VL static image inputs fix (#40724); Qwen3 MoE no longer calls gate twice (#40664); DeepSeek V2-Lite accuracy drop fix (#40673); Parakeet UX / perf enhancements (#39423); ColModernVBERT updated for latest HF checkpoint (#39307); NemotronH default
mamba_ssm_cache_dtype=float32with NemotronHNanoVLV2 auto-hook (#39032); new TP plan styles for the Transformers backend (#40467); GLM-5.1 fix on ROCm (#40763).
Engine Core
- Model Runner V2: Full CUDA graph for eagle prefill (#37588), auto cudagraph mode/sizes based on attention backend (#32936), fused probabilistic rejection-sample kernels (#38496), config validation (#38758), eagle-draft piecewise fallback disabled (#39773), multiple prompt logprobs (#39937), prefill warmup coverage (#40746), stale sampled/draft tokens accuracy fix (#39833).
- vLLM IR: IR skeleton + rms_norm (#33825), OOT kernel import hooks (#38807), gemma_rms_norm on IR (#39014), IR op testing/benchmarking infra (#40167).
- torch.compile: Opaque Objects on torch 2.11 (#39286), AOT compile with batch-invariance mode (#39201), Inductor cache nested under AOT dir (#39718), split FX graph via codegen (#38657), Inductor pre-grad passes re-enabled for torch≥2.12 (#38944), strings in custom ops without compile regressions (#38123), MLA + group FP8 fusion (#38877), SiluMul activation+quant fusion refactor (#39684),
donate_graph_module=Trueforstandalone_compile(#39733), skip FX graph deserialization on loading (#40151), include Inductor & functorch configs in compile-cache key (#40627), respectTORCH_COMPILE_DISABLEat vLLM config level (#40715), disable Sequence Parallelism for piecewise compilation (#38373). - Attention: FA4 as default MLA prefill (#38819), head-dim 512 + paged-KV on sm90+FA4 (#38835), FA4 upstream sync (#38690), full CUDA graph for FlexAttention (#36298), FlexAttention non-causal support (#40394), unified 2D/3D triton_unified_attention (#40631), TRTLLM minimax_allreduce_rms ported (#37045),
concat_mla_qhalf-types only (#37892), batch-invariance-aware backend auto-selection (#40193), avoidseq_lens_cpuGPU→CPU sync (#40654). - Helion kernels: torch.compile support for Helion kernels (#38592).
- HMA / KV offload: GPU-side KV events for HMA (#37688), group block hashes/IDs tracked (#37109), unified memory layout for offloading workers (#37206),
shutdown()on OffloadingConnector (#39182), request context passed through KV offload (#39185), sliding-window lookup (#36645), multi-group worker transfer (#38453), multi-KV-group lookup/load/store (#39401, #39402, #39403). - Features: NUMA binding for GPU workers (#38635), opt-in
VLLM_MEDIA_CACHEmedia URL caching (#37123), safe request abort when FSM fails to advance (#38663), KV connector prioritized over internal registry (#38301), CUDAGraph memory profiling on by default (#38284), shared-expert overlap restored (#39222),CONFIG_REGISTRYconfig-class lookup fix when on-disk model_type differs (#39554), workspace-resize GPU memory leak fix (#39226), SWA/chunked-local runtime admission capped to startup pool-sizing bound (#40946). - Pluggable layers: Applied to llm_head / vocab embedding (#33465) and MoE layers (#33556).
- Mamba: Stochastic rounding (#35753), different Conv state layouts (#37416), FlashInfer
selective_state_update(#36162). - Metrics & scheduling: Labeled waiting-breakdown (capacity/deferred) metric (#38435), API server handshake simplified (#39364), mm-scheduler
get_num_embedoverhead reduced (#40143),request_idonFinishedRequestStats(#39710). - Executor: RayExecutorV2 introduced (#36836); unified engine process monitoring with Ray backend (#35862).
Hardware & Performance
- NVIDIA: swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325), MXFP4 W4A4 CUTLASS MoE for SM100 (#37463), TRTLLM GEN NVFP4 MoE with non-512-aligned hidden dims via weight padding (#39510), TRTLLM FP8 MoE with shuffled weights + BlockMajorK layout (#38993), fused qknorm+rope kernel on SM9.0 (#37376), tuned fused_moe config for RTX PRO 6000 Blackwell (#39183), ViT full CUDA graph for Qwen3-VL video (#38061),
--enable-vit-cuda-graphfor VLM examples (#40580), defaultmax_frames_per_batchauto-infer for ViT CG video (#40445), fused FP8 output quantization intomerge_attn_states(#36518), batched KV-cache swap viacuMemcpyBatchAsync(#38460), sm_110 (Jetson Thor) added to CUDA 13.0 build targets (#39233). - AMD ROCm: ZenCPU / AMD Zen CPU backend via zentorch (#39967), RDNA 3.5/4 device IDs (gfx1150/1151/1201) (#38455), gfx1102/gfx1103 added (#40037), MORI EP for unquantized MoE with AITER (#37529), MoRI build with AMD AINIC stack (#38371), MoRI-IO message format aligned with P2pNcclConnector and vllm-router (#39565), MORI prefill/decode API correction (#39835), AITER gemm w8a8 ptpc integration (#33773), TritonW4...
v0.19.1
This is a patch release on top of v0.19.0 with Transformers v5.5.3 upgrade and bug fixes for Gemma4:
- Update to transformers v5 (#30566)
- [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992)
- [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909)
- [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114)
- [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027)
- [Gemma4] Support quantized MoE (#39045)
- Add Gemma4 Eagle3 support (#39450)
- [Gemma4][Bugfix]: Enable Gemma4ForCasualLM to load lora adapters correctly (#38844)
- [Bugfix] Fix Gemma4 tool parser converting bare null to string "null" (#39679)
- [Model] Fix Gemma 4 token repetition by dynamic BOS injection for PT models (#39842)
- fix(kimi_k25): resolve media_placeholder_token_id from tokenizer (#39344)
v0.19.0
vLLM v0.19.0
Highlights
This release features 448 commits from 197 contributors (54 new)!
- Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires
transformers>=5.5.0. We recommend using pre-built docker imagevllm/vllm-openai:gemma4for out of box usage. - Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
- Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
- ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
- General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
- DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
- NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
- Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
Model Support
- New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
- Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
- LoRA expansion: H2OVL tower/connector LoRA (#31696),
--lora-target-modulesto restrict LoRA to specific modules (#34984),language_model_onlyrespected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181). - Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
- Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
- Performance: GLM-4.xv ViT optimization (#37779).
Engine Core
- Zero-bubble async scheduling + speculative decoding (#32951).
- Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
- ViT Full CUDA Graph capture (#35963).
- General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
- DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
- Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
- FlexAttention: Custom mask modification support (#37692).
- Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
- Scheduling: Schedule requests based on full input sequence length (#37307).
- Spec decode: Per-draft-model MoE backend via
--speculative-config(#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111). - Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
- Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
- Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
Hardware & Performance
- NVIDIA:
- B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
- Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
- FlashInfer sparse MLA as default for FP8 KV cache (#37252).
- Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
- GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
- NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
- Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
- AMD ROCm:
- ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
- DeepEP as all2all backend (#34692).
- Persistent MLA kernel from AITER (#36574), FP8xFP8 attention in AITER (#36927).
- AWQ Marlin support (#36505), wvSplitK skinny GEMM for RDNA4/gfx1x (#34709).
- Nightly Docker image and wheel releases (#37283).
- Bugfixes: Sleep mode memory leak (#37533), hybrid model stride (#37228), qwen3_next crash (#36795).
- Intel XPU: MLA model support (#37143), CompressedTensor W4A8 (#37207), auto-detect XPU build platform (#37634).
- TPU: Async scheduling interface (#36924), Qwen3.5 FP8 weight loading fix (#37348).
- CPU: Enable tcmalloc by default (#37607), graceful degradation without tcmalloc/libiomp (#37561), 48.9% throughput improvement for pooling models (#38139), OpenMP thread fix for torch.compile (#37538), structured output crash fix (#37706), KV cache block zeroing crash fix (#37550), slot mapping kernel (#37987), W4A16 compressed tensors (#38219).
- Performance fixes: FP8 DeepGEMM batch invariance (#37718), Triton autotuning for Qwen3.5 (#37338), TRTLLM NVFP4 routing precision (#36725).
Large Scale Serving
- Disaggregated serving: PD kv_transfer_params for Anthropic Messages (#37535) and Responses API (#37424), Mooncake heterogeneous TP (#36869), Mamba N-1 prefill for P/D (#37310).
- EPLB: MRV2 support (#37488), improved responsiveness (#36271), EP weight filter fix (#37322).
- Elastic EP: Fix repeated scale up/down cycles (#37131), fix stateless group port races (#36330).
- DBO: Generalized to work with all models (#37926).
- Multi-node: Fix allreduce fusion (#38136).
- KV connector: Plugin-overridable metadata build (#37336).
- Constraints: Cap API servers to 1 with Elastic EP (#37466).
Quantization
- Online MXFP8 quantization for MoE and dense models (#35448).
- FP8: WoQ kernel abstraction (#32929), Marlin FP8 for compressed tensors fix (#38092).
- NVFP4: Rescale weight scales to fix BF16 dequant underflow (#34577), fix Marlin NaN/Inf with float16 (#33972).
- QeRL: Online quantization composed with quantized reloading for RLHF (#38032).
- CPU: W4A16 compressed tensors (#38219).
- XPU: CompressedTensor W4A8 (#37207).
- ROCm: AWQ Marlin support (#36505).
- MXFP8 + DeepGEMM: Fix crash when both are active (#37358).
- Removals: Per-tensor-per-channel FP8 removed (#32700), Sparse24 integration and kernels removed (#36799).
API & Frontend
- New endpoints:
/v1/chat/completions/batchfor batched chat completions (#38011). - Features: Limit thinking tokens (hard limit) (#20859), multiple embedding types in single call (#35829), numpy array embeddings for multimodal (#38119),
--lora-target-modules(#34984),-scshorthand for--speculative-config(#38380). - Tool parsing: GigaChat 3.1 parser (#36664), Kimi-K2.5 reasoning/tool parser (#37438), Gemma 4 tool parser (#38847), tools passed to parser constructor (#38029), fix Mistral parser (#37209), fix DeepSeek v3.2 streaming (#36056), fix GLM-4.7 parsing (#37386), fix Hermes streaming (#38168), fix OpenAI tool parser IndexError (#37958), fix Anthropic streaming (#37510).
- Responses API: Fix crash with tool_choice=required exceeding ma...
v0.18.1
This is a patch release on top of v0.18.0 to address a few issues:
- Change default SM100 MLA prefill backend back to TRT-LLM (#38562)
- Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158)
- Disable monolithic TRTLLM MoE for Renormalize routing #37605
- Pre-download missing FlashInfer headers in Docker build #38391
- Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)