Not yet seconds ago
False
Not yet seconds ago
Triggered text Ignored text Blocked text
2 hours ago
Skip to content
Navigation Menu
Toggle navigation
Sign in
Appearance settings
* Platform
+ AI CODE CREATION
o GitHub Copilot Write better code with AI
o GitHub Copilot app Direct agents from issue to merge
o MCP Registry New Integrate external tools
+ DEVELOPER WORKFLOWS
o Actions Automate any workflow
o Codespaces Instant dev environments
o Issues Plan and track work
o Code Review Manage code changes
+ APPLICATION SECURITY
o GitHub Advanced Security Find and fix vulnerabilities
o Code security Secure your code as you build
o Secret protection Stop leaks before they start
+ EXPLORE
o Why GitHub
o Documentation
o Blog
o Changelog
o Marketplace
View all features
* Solutions
+ BY COMPANY SIZE
o Enterprises
o Small and medium teams
o Startups
o Nonprofits
+ BY USE CASE
o App Modernization
o DevSecOps
o DevOps
o CI/CD
o View all use cases
+ BY INDUSTRY
o Healthcare
o Financial services
o Manufacturing
o Government
o View all industries
View all solutions
* Resources
+ EXPLORE BY TOPIC
o AI
o Software Development
o DevOps
o Security
o View all topics
+ EXPLORE BY TYPE
o Customer stories
o Events & webinars
o Ebooks & reports
o Business insights
o GitHub Skills
+ SUPPORT & SERVICES
o Documentation
o Customer support
o Community forum
o Trust center
o Partners
View all resources
* Open Source
+ COMMUNITY
o GitHub Sponsors Fund open source developers
+ PROGRAMS
o Security Lab
o Maintainer Community
o Accelerator
o GitHub Stars
o Archive Program
+ REPOSITORIES
o Topics
o Trending
o Collections
* Enterprise
+ ENTERPRISE SOLUTIONS
o Enterprise platform AI-powered developer platform
+ AVAILABLE ADD-ONS
o GitHub Advanced Security Enterprise-grade security features
o Copilot for Business Enterprise-grade AI features
o Premium Support Enterprise-grade 24/7 support
* Pricing
Search or jump to...
Search code, repositories, users, issues, pull requests...
Search
Clear
Search syntax tips
Provide feedback
We read every piece of feedback, and take your input very seriously.
Include my email address so I can be contacted
Cancel Submit feedback
Saved searches
Use saved searches to filter your results more quickly
Name
Query
To see all available qualifiers, see our documentation.
Cancel Create saved search
Sign in
Sign up
Appearance settings
Resetting focus
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session. Dismiss alert
vllm-project / vllm Public
* Uh oh!
There was an error while loading. Please reload this page.
* Notifications You must be signed in to change notification settings
* Fork 17.9k
* Star 82.5k
* Code
* Issues 2k
* Pull requests 3.3k
* Discussions
* Actions
* Projects
* Security and quality 44
* Insights
Additional navigation options
* Code
* Issues
* Pull requests
* Discussions
* Actions
* Projects
* Security and quality
* Insights
Releases: vllm-project/vllm
Releases Tags
Releases Β· vllm-project/vllm
v0.22.1
05 Jun 10:10
khluu
v0.22.1
0decac0
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.22.1 Latest
Latest
Highlights
This release features 8 commits from 6 contributors (1 new)!
v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.
Model Support
* New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (#43992).
* DeepSeek-V4: resolve a CUTLASS fmin compatibility issue that broke initialization (0decac0).
* Fix OlmoHybridForCausalLM failing to initialise after the checkpoint changed rope_parameters from None to {"rope_type": None} (#43846).
* Fix HyperCLOVAX loading after the upstream HuggingFace repo removed its remote code (now native in transformers >= 5.9.0): register the hyperclovax model_type so vLLM uses its vendored config instead of the stale auto_map (#43860).
Hardware & Performance
* AMD Zen CPUs: route W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference through zentorch kernels, registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU (#41813).
Large Scale Serving
* Fix a deterministic hang in multi-node Ray data-parallel serving with num_api_servers > 1 by excluding the Ray DP backend from the deferred (kernel-assigned) port allocation introduced in #42585 (#43864).
Build & CI
* Docker: stop installing flashinfer-jit-cache via --extra-index-url while it is quarantined on PyPI, fixing image builds (#44366).
* Normalize NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major is kept, fixing ImportError: libcudart.so.12 when importing nixl_ep on CUDA 13 images (#44266).
Contributors
@khluu, @vadiklyutiy, @aadwived, @shadeMe, @alec-flowers, @hmellor
New Contributors
* @aadwived made their first contribution in #41813
Contributors
*
*
*
*
*
*
shadeMe, hmellor, and 4 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 9 marquisburg, gaby, ivanbaldo, rainprob, abid-jawad, Roysky, rasmussenjustin02-dotcom, zhouxinyi87-ux, and PJC20020805 reacted with thumbs up emoji β€οΈ 1 abid-jawad reacted with heart emoji π 5 yankeexe, gau-nernst, ivanbaldo, wedobetter, and Vetin reacted with rocket emoji
All reactions
* π 9 reactions
* β€οΈ 1 reaction
* π 5 reactions
13 people reacted
v0.22.0
29 May 10:28
khluu
v0.22.0
0b3ba88
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.22.0
Highlights
This release features 459 commits from 230 contributors (63 new)!
* DeepSeek V4 maturity: DeepSeek V4 received a major hardening pass this cycle β the model was reorganized into a dedicated vllm/models/deepseek_v4/ package (#43004, #43039, #43073, #43077, #43149), gained NVFP4 fused MoE support (#42209), full + piecewise CUDA graph (#42604), and MTP speculative decoding (#43385). A large set of fused kernels (MegaMoE, mhc, Q-norm, indexer, sparse MLA) and ROCm parity fixes landed alongside accuracy fixes (#42810, #43710).
* Model Runner V2 advances toward default: MRv2 is now default for Qwen3 dense models. vLLM will fall back to MRv1 for features that aren't yet supported in MRv2 (#39337). sleep-mode weight reload (#42673), update_config (#42783), and shared KV-cache layers (#35045), plus many correctness fixes.
* Experimental Rust frontend: A new Rust front-end integration landed (#40848), with the implementation moved into the tree (#43283) and a DP Supervisor for data-parallel serving (#40841).
* Batch invariance, faster: Batch-invariant inference gained Cutlass FP8 support for a 28.9% end-to-end latency improvement (#40408), compile-mode support on SM80 (#42456), and an NVFP4 Cutlass linear path (#39912).
* Multi-tier KV cache offloading: A new multi-tier KV cache offloading framework (#40020) with a Python filesystem secondary tier (#41735), DSv4 support (#43142), and Mooncake disk offloading (#42689) extends offloading beyond CPU memory.
Model Support
* New architectures: MiniCPM-V 4.6 (#41254), InternS2 Preview (#42705), OpenVLA (#42654), MolmoWeb hf_overrides docs (#42163); EXAONE-4.5 aligned with Transformers update (#42246).
* Speculative decoding: custom callable proposer backend (#39487), post-norm EAGLE-3 speculators (#42764), peagle speculators (#41826), hybrid-attention models in extract_hidden_states (#39949), non-MTP speculation for NemotronH (#43130), shared MTP weights in MRv2 (#42538).
* DeepSeek V4: NVFP4 MoE (#42209), CUDA graph full/piecewise (#42604), MTP (#43385), model package refactor (#43004, #43039, #43073, #43077), sparse MLA + compressor refactor (#43149, #43710), MegaMoE input-prep kernel move (#43632).
* Qwen3.5/3.6: GDN output-projection flatten (#42311), GatedDeltaNet Marlin TPβ₯2 fix (#36329), ViT full CUDA graph (#42151), runai-streamer weight loading for Qwen3.5/MTP/Qwen3-VL (#42521, #42716), KDA chunk-prefill exp2 semantics (#43195).
* Gemma3/Gemma4: mixed-resolution image co-batching crash fix (#42217), MoE routing closure fix (#42250), tool-parser float-corruption fix (#42128), batched vision encoder for image/video (#43169), multi-GPU fix (#42630).
* Kimi-K2.5: skip vision-tower dtype conversion under quantization (#42869), mm_projector dtype fix (#42081).
* Cohere: enable Cohere MoE (#43143), pipeline parallelism for Cohere vision (#42819).
* Tool calling: Apertus tool parser (#41154), Qwen3Coder anyOf/oneOf/$ref resolution re-land (#37831), shared coerce_to_schema_type across MiniMax-M2 / DeepSeek-V3.2 / Seed-OSS parsers (#43006, #43019, #43140).
* ViT CUDA graph: Qwen2-VL (#41736), Step3-VL encoder (#42224), Qwen3.5 (#42151), FlashInfer metadata for Qwen2.5-VL vision attention (#42787).
Engine Core
* Model Runner V2: Qwen3-dense-by-default oracle (#39337), sleep-mode reload weights (#42673), update_config (#42783), shared KV-cache layers (#35045), FP32 gumbel sampling (#41775), auto-fallback to MRv1 with connectors (#42955), logprob_token_ids correctness (#43125, #41761), prompt-logprobs size fix (#42778).
* KV offloading: multi-tier framework (#40020), Python filesystem secondary tier (#41735), DSv4 support (#43142), tier-offload follow-up (#42529), prefer HND layout (#41928), reset_cache() (#41956), per-request tracking (#42507), store-deferral fix (#41945).
* MoE refactor: ExpertMapManager (#41046), experts moved to experts/ (#42334), RoutedExperts alias for FusedMoE (#40735), EPLB refactoring for FusedMoE (#41055).
* Mamba: attention module refactor (#41126), Mamba2 SSD kernel warmup (#39822), bf16 SSM cache (#41680), GPU-side state postprocessing fused kernel (#40172), run single-token extends as decodes (#42430).
* KV events: emit KV cache metadata (#40984).
* Allocator: manual cumem allocator enable (#33648), stream-aware free callback (#43020).
* elastic-EP: stage/commit MoE quant method on reconfigure (#40881).
Hardware & Performance
* NVIDIA Blackwell / SM12x: FlashInfer b12x MoE + FP4 GEMM for SM120/121 (#40082), per-tensor FP8 CUTLASS on SM12.1 (#41215), head_dim=512 for FlashInfer TRTLLM attention (#38822), FlashInfer Blackwell GDN prefill (#40717), GDN prefill kernel for SM100 (#43273).
* Performance: batch-invariant Cutlass FP8 (+28.9% E2E) (#40408), CutlassFP8 padding pre-processing (+13.5% TTFT) (#42651), padded NVFP4 quant kernel (+2.4β5.7% E2E) (#42774), GPU<->CPU sync elimination 1/n (#41429) and 4/n (#42347), fused RoPE+KVCache+q_concat for MLA (#40392), MLA compute_prefill_context / _v_up_proj optimizations (#42460, #42561), penalties Triton kernel (#40657), do_not_specialize in fused FP8 RoPE (#42849), FULL CUDA graph capture for TRITON_MLA decode (#42885).
* AMD ROCm: DSV4 functionality + accuracy fixes (#42810, #43679 Tilelang MHC), flash sparse MLA Triton kernels (#41812), gluon paged MQA logits on gfx950/MI355X (#42062), RMSNorm+Quant fusion for gfx950 (#41825), AITER FA backend cleanup (#41942), XGMI backend for MoRI connector (#41753), QuickReduce min-size override (#41675), DSV4 MTP (#43385).
* CPU / RISC-V: RVV-optimized attention kernels for RISC-V Vector Extension (#40119) with VLEN=256 (#42943), fused GDN for AMX CPU (#42707), MXFP4 W4A16 MoE (#41922), experimental Triton + MRv2 on CPU (#43225), improved CPU thread utilization (#42666), --cpu-distributed-timeout-seconds (#42968).
* Intel XPU: GPTQ int4 support (#37844), mxfp8 MoE (#41918), FP8 block-scaled quantization (#42952), custom-op collective behavior (#41354), multiple sparse-attention kernels (#37888), MoE topk routing + MXFP4 fallback (#42951), CT W4A4 MXFP4 path (#38896), reduced XPU MoE host overhead (#42915).
* Kernel ABI: continued migration to libtorch stable ABI β 5/n (#42339), 6/n (#42663), 7/n (#43209).
* Experimental: breakable CUDA graph (#42304).
Large Scale Serving
* Disaggregated serving (NIXL): lease-renewal TTL for KV blocks on P (#41383), handshake-failure policy honoring (#40364), GDN support for PD with NIXL (#41869), multi-node TP>8 fix (#39907), side-channel host-selection fix (#41806).
* Mooncake: disk offloading in MooncakeStoreConnector (#42689), HMA support for DSV4 (#42828), operation metrics (#43392), load-failure propagation (#42788), block-aligned full hits (#43494), finish-after-preemption handling (#43281).
* Data parallel: DP Supervisor (#40841), publish request counts at engine-step start (#41626), forward X-data-parallel-rank header (#42330).
* EPLB: change default EPLB communicator (#43110), VLM-wrapper init fix (#39805), remove dead torch.accelerator.synchronize() (#40733).
* LoRA: one-shot Triton kernel for MoE LoRA (#42290), simultaneous 2D & 3D MoE LoRA adapters (#42242), reduced 2D-weight memory under EP (#42737), MoE LoRA align-kernel grid fix (#40131).
Quantization
* MXFP4: linear layers + compressed-tensors integration (#41664), CPU W4A16 MoE (#41922), XPU mxfp8 MoE (#41918).
* NVFP4: DeepSeek V4 fused MoE (#42209), ModelOpt W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566), batch-invariant NVFP4 Cutlass linear (#39912), FlashInfer TRTLLM NvFP4 monolithic MoE routing fix (#43223), TRTLLM NVFP4 MoE chunking fix (#43599).
* Quark: load Quark NVFP4 checkpoints (#35859), W8A8 INT8 garbage-output fix on Step-3.5-Flash (#41892), W4A4 oracle refactor (#41436).
* AutoRound: W4A16 support (#39778).
* ModelOpt: Qwen3.5/3.6 VLM quantized prefix mapping (#42546).
* Framework: rework quantization_config to use QuantKey with activation override (#41566), MoE W4A8 CT migrated to oracle (#42680), AWQ Marlin MoE onto modular WNA16 oracle (#42483), GPTQ consolidation (gptq_marlin β auto_gptq) (#38288).
API & Frontend
* Rust frontend: integration (#40848), in-tree code move (#43283), utility call-ID newtype (#43405), simplified AuthenticationMiddleware path extraction (#43426).
* Responses API: chat_template_kwargs support (#42272), message-merging fix (#42189), empty channel/recipient harmony fix (#35540).
* Completions: thinking_token_budget support (#42116) with inverted-condition fix (#41674); map reasoning_effort to enable_thinking (#43401).
* Frontend: truncation side for OpenAI endpoints (#43260), normalize reasoning_content β reasoning (#42664), reworked fastokens integration (#43168), consolidated Speech-to-Text entrypoints (#42370, #42274), beam-search consolidation via BeamSearchMixin (#42946), score/rerank chat-template instructions (#42412).
* Auth: API-key authorization for /v2 endpoints (#42594).
* Offline API: pooling offline API split into PoolingOfflineMixin (#42267), split offline inference APIs/utils (#43553).
Build & Dependencies
* CUDA 12.9 wheel builds switched to PyTorch manylinux_2_28 base (#41668).
* FlashInfer bumped to v0.6.11.post2 (#41711); nvidia-cutlass-dsl to 4.5.2 (#42991, #43230, #43745); llguidance to 1.7 (#42150); triton_kernels downgraded to v3.5.1 for gpt-oss (#43135).
* Rust frontend build: setuptools-rust dependency (#43287, #43377), pinned protoc in rust-build stages (#43292).
* Docker: non-root vllm-openai target (#40275), build mooncake-transfer-engine from source (#42114), AINIC & Thor NIC support (#40453); Python-only installation made optional (#42293).
* vllm-tpu: disable build isolation for CUDA deps (#43038), tpu-inference docker build fix (#43360).
* humming MoE backend dependency adde...
Read more
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
bbrowning, rasmith, and 198 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 12 Huang-ei, naveline67, adamw7, serdarildercaglar, jankogasic, aldubert, variant-star, Aollo, nawhji, zhouxinyi87-ux, and 2 more reacted with thumbs up emoji π 15 aleksandaryanakiev, QingZhou-YangHY, gaby, gau-nernst, Miocio-nora, sphinxkkkbc, Shabonasar, prravda, bewestphal, jankogasic, and 5 more reacted with hooray emoji β€οΈ 1 ivanbaldo reacted with heart emoji π 12 piro4you, xq25478, gaby, ojus1, gau-nernst, reneleonhardt, wedobetter, varungup90, CHNtentes, jankogasic, and 2 more reacted with rocket emoji π 4 12210122, gaby, ojus1, and ivanbaldo reacted with eyes emoji
All reactions
* π 12 reactions
* π 15 reactions
* β€οΈ 1 reaction
* π 12 reactions
* π 4 reactions
31 people reacted
v0.21.0
15 May 08:44
khluu
v0.21.0
ad7125a
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.21.0
Highlights
This release features 367 commits from 202 contributors (49 new)!
* Transformers v4 deprecated: This release formally deprecates transformers v4 support (#40389). Users should migrate to transformers v5.
* C++20 build requirement: vLLM now requires a C++20-compatible compiler for compatibility with PyTorch (#40380). This is a breaking build change.
* KV Offload + Hybrid Memory Allocator (HMA): The KV offloading subsystem now integrates with the Hybrid Memory Allocator, including scheduler-side sliding window group support and full HMA enablement (#41228, #41445, #39571).
* Speculative decoding with thinking budget: Speculative decoding now respects reasoning/thinking budgets, enabling correct spec decode for reasoning models (#34668).
* TOKENSPEED_MLA backend on Blackwell: A new TOKENSPEED_MLA attention backend is available for DeepSeek-R1/Kimi-K25 prefill + decode on Blackwell GPUs (#41778).
Model Support
* New architectures: MiMo-V2.5 (#40967), Laguna XS.2 (#41129, #41880), Moondream3 (#32325), Qianfan-OCR (#40136), Cohere MoE (#40817), Cohere Eagle (#42078).
* Speculative decoding: EAGLE for Mistral (#41024), Gemma4 MTP (#41745), MTP for MiMo-V2.5 (#41905), Cohere Eagle (#42078).
* DeepSeek V4: AMD/ROCm support (#40871), pipeline parallelism (#41694), max reasoning effort (#40982), disaggregated serving fixes (#41957).
* Tool calling: Cohere reasoning and tool parsers (#40422), LFM2/2.5 tool parser (#39243).
* Gemma3/Gemma4: hidden_act variant support (#40588), pipeline parallelism fix (#40786), MoE fixes (#41206, #41574, #41401), tool parser crash fix (#41991, #42188).
* Model Runner V2: Qwen3.5/Mamba hybrid model support (#35520), logprob_token_ids support (#40559).
* CUDA graph: ViT CUDA graph support for Qwen2.5-VL (#40830).
* Compatibility: Vendor HCXVisionConfig for Transformers v5 (#38447), legacy rope_type checkpoint support (#41734).
Engine Core
* KV offloading + HMA: Scheduler-side sliding window groups (#41228), full HMA enablement (#41445), multi-connector HMA (#39571), per-job store completion (#39186), DCP/PCP support in OffloadingConnector (#41549), MooncakeStoreConnector for distributed KV offloading (#40900).
* Speculative decoding: Thinking budget support (#34668), independent drafter attention backend selection (#39930), multimodal model support with warning (#41752), per-step allocation elimination (#41043).
* Model Runner V2: Rejection sampling acceptance rate fix (#40651), skip metadata rebuild before draft prefill (#40410), rebuild metadata between draft decode steps (#41162), Qwen3.5/Mamba hybrid support (#35520).
* Routing: Replace routing replay with device cache and async D2H pipeline (#39917).
* Ray: RayExecutorV2 enabled by default (#41421), actor name collision fix for DP > 1 (#40398).
* Stability: Two-phase pause to prevent scheduler deadlock (#39366), thread-safe HF tokenizer wrappers (#41181), OOM prevention via max_split_size_mb during model loading (#41268).
* IndexCache support for DSA models (#37735).
Hardware & Performance
* NVIDIA Blackwell: TOKENSPEED_MLA backend for DSR1/Kimi-K25 (#41778), faster per-token FP8 group quant packed kernel (#41326), FP8 on NVIDIA Thor/SM110 (#39712), CUTLASS scaled mm for non-compatible sizes (#41868).
* Performance: FlashInfer top-k/top-p sampler enabled by default (#40376), FP8 FlashInfer attention for ViT (#38065), TurboQuant shared dequant buffers (#40941), AllPool.forward 51% faster (#41163), GPU<->CPU sync elimination in pooling (#41433) and attention (#41434), numpy zero-copy embedding serialization (#41681), multimodal processor skip for text-only (#41246), FlashInfer FP8 async TP fusion (#39505), NVFP4 all-gather GEMM fusion for AsyncTP (#41882), re-enable allreduce+RMS fusion for DP/PP (#41458), DeepSeek bf16βfp32 via torch.mm (#41300), persistent MLA for sparse backend (#41990), configurable safetensors checkpoint prefetch (#41499), fused mhc_post_pre kernel (#41536), 2D-grid W8W8 group quant kernel (#42153), relaxed memory ordering for KV cache swaps (#39306).
* AMD ROCm: ROCm 7.2.2 (#41386), DBO (Dynamic Batch Optimization) (#34726), AITER Fused Allreduce+RMSNorm (#37646), Fused Shared Expert (FSE) for Qwen3-Next (#39280), DeepSeek V3.2 TP4 AITER MLA (#41835), GDN linear attention fusion (#40711), eliminate redundant MoE buffer copies in AITER (#41713), CPU offloading support (#40549), DeepEP API update (#39721), cap Triton paged attention block size to fix shared memory OOM (#38502).
* CPU: FP8 attention for AMX/AVX-512 (#39445), FP8 W8A16 linear (#41186), FP8 W8A16 MoE (#41314), DNNL AVX2 W8A8 Int8 (#41318), Gated DeltaNet Attention for Qwen 3.5/3.6 (#41025), RISC-V OMP thread auto-binding (#40569).
* Intel XPU: Top-k/top-p sample kernel (#39285), out-of-place all-reduce (#41808), LoRA support (#38206).
* IBM Power: VSX attention backend (#40451).
* FlexAttention: Re-enabled for batch invariant mode (#40842).
* MLA: Abstracted MLA prefill backends, eliminated cuDNN dependency (#32623).
Large Scale Serving
* Disaggregated serving: Bi-directional KV cache transfers between P and D (#32553), NIXL transfer redesign (#40731), EPLB memory overhead optimization (#40013), NIXL connector bumped to 1.x (#42364), Mooncake KVConnectorStats for transfer observability (#40414), NIXL P-node pre-admission rejection notification (#41269), KV block release for skipped P-ranks (#40449).
* DCP: Pack output and LSE in DCP A2A (#41160).
* MoE: PluggableLayer interface for out-of-tree MoE runners (#35178).
* LoRA: Initial expert parallel (EP) support (#40867), Qwen3.5 LoRA fusion fix (#37912).
Quantization
* NVFP4: KV cache support (#40177), Triton dequant/QDQ emulation kernels for Hopper and AMD (#40033), GELU on TRT-LLM NvFP4 fused MoE for Gemma4 (#41050), ModelOpt NVFP4 W4A16 (#41769), NVFP4 all-gather GEMM fusion for AsyncTP (#41882), GLM4-MoE NVFP4 loading fix (#41755).
* MXFP4: Humming MXFP4 MoE backend (#41083), FlashInfer CUTLASS MXFP4-MXFP8 MoE fix (#42089).
* TurboQuant: Hybrid model and uniform quantization support (#39931).
* Compressed tensors: Allow configs with non-explicit ignores (#41965).
* FP8: Bias loading fix (#41424), FlashInfer autotune temporarily disabled for correctness (#41524).
* DSV4: Improved fused Indexer Q quant kernel (#41428).
API & Frontend
* Responses API: Streaming tool/function calling with required (#40700) and named tool/function choice (#41110), resubmitting output items with missing fields (#41355).
* OpenAI compatibility: system_fingerprint field in responses (#40537), prompt_embeds content part support (#40720), defer_loading and tool_reference support (#40190), rendered prompt text in chat completion response (#42052), tolerate empty content in forced tool choice (#40148).
* Tool calling: XGrammar 0.2.0 with structural tags for strict tool calling + reasoning (#40894), Cohere reasoning/tool parsers (#40422), LFM2/2.5 tool parser (#39243).
* Tokenizer: Fastokens support (#41741).
* RLHF: Explicit /start_weight_update and /finish_weight_update APIs (#39212).
* ASR: Engine request abort on cancellation (#41266).
* Configuration: VLLM_SKIP_MODEL_NAME_VALIDATION env var (#34676), configurable model weights loading tracking (#41086), Triton JIT compilation monitor (#40137).
Build & Dependencies
* Breaking: C++20 required for PyTorch compatibility (#40380).
* Breaking: Transformers v4 deprecated (#40389).
* Docker image size reduced by ~2.5 GB via deferred FlashInfer cubin download (#41134).
* CUDA 13.0 wheels switched to PyTorch manylinux_2_28 base (#41416).
* DeepGEMM bundled wheel built per-Python for CPython compatibility (#41516).
* Container image provenance metadata embedded (#40653).
* tpu-inference upgraded to v0.19.0 (#41844).
* NIXL connector bumped to 1.x (#42364).
* ROCm 7.2.2 (#41386).
Contributors
@AndreasKaratzas, @haosdent, @khluu, @yewentao256, @stecasta, @mgoin, @Isotr0py, @hmellor, @chaunceyjiang, @jeejeelee, @noooop, @MatthewBonanni, @njhill, @zyongye, @yzong-rh, @ronensc, @NickLucche, @chaojun-zhang, @dzhengAP, @chfeng-cs, @TheEpicDolphin, @esmeetu, @wzhao18, @ZJY0516, @juliendenize, @kylesayrs, @fadara01, @Etelis, @tianmu-li, @arpera, @ekagra-ranjan, @orozery, @wxsIcey, @jikunshang, @izhuhaoran, @rasmith, @russellb, @Lucaskabela, @Harry-Chen, @alec-flowers, @pmaybank, @Terrencezzj, @hickeyma, @Baekpica, @itej89, @fxmarty-amd, @WoosukKwon, @juhi10071998, @sychen52, @baonudesifeizhai, @vllmellm, @johncalesp, @the-david-oy, @lucianommartins, @bittoby, @Dao007forever, @lyd1992, @yuwenzho, @lesj0610, @sfeng33, @micah-wil, @akii96, @yma11, @SoluMilken, @mmangkad, @SiluPanda, @ojhaanshika, @zhandaz, @bhoomit, @simon-mo, @msanft, @angelayi, @anthonsu, @artem-spector, @zhangxin81, @benoittgt, @joerowell, @yangrz7, @chelnnexy, @liangel-02, @walterbm, @rishitdholakia13, @SKRohit, @BugenZhao, @JaredforReal, @amd-lalithnc, @frgossen, @h-avsha, @DarkLight1337, @danisereb, @laithsakka, @Bortlesboat, @wangluochao902, @Rohan138, @hao-aaron, @puririshi98, @roikoren755, @heachary, @UranusSeven, @dsingal0, @ChenxiQ, @snadampal, @ilmarkov, @wendyliu235, @lequytra, @JisoLya, @LuisRobaina, @sniper35, @eicherseiji, @Yuyi-Ao, @raviguptaamd, @sungsooha, @ganyi1996ppo, @andylolu2, @FredericOdermatt, @ProExpertProg, @rbrugaro-amd, @mcsantiago, @hnt2601, @jinzhen-lin, @taneem-ibrahim, @tomeras91, @alex-jw-brooks, @Aktsvigun, @HanFa, @netanel-haber, @JasonKeyiL, @gshtras, @joa-stdn, @Seven-Streams, @JartX, @xuechendi, @BowenBao, @Akashcodes732, @jeffreywang-anyscale, @czhu-cohere, @zhewenl, @marvinzh, @Lidang-Jiang, @gcanlin, @whx-sjtu, @S1ro1, @liulanze, @Dhruvilbhatt, @laviier, @wi-adam, @aaab8b, @yuankaichen-amd, @ZhanqiuHu, @QwertyJack, @viktorpusTT, @divakar-amd, @starkwj, @benchislett, @jcyang43, @JLiu4Coding, @xy3xy3, @hongxiaya...
Read more
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
rasmith, russellb, and 198 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 12 adamw7, dododream, LuisRobaina, dimens66, akalongman, varungup90, yewentao256, ivanbaldo, khotsopitso, MateenahJAHAN, and 2 more reacted with thumbs up emoji π 11 Prostagma1, gau-nernst, XeonBloomfield, hejiapeng2, Baekpica, Bartket, LuisRobaina, j-lojek, yewentao256, ivanbaldo, and khotsopitso reacted with hooray emoji π 11 subnet-dev, Baekpica, Dobrynia100, erickstryck, LuisRobaina, j-lojek, jasonline1, yewentao256, ivanbaldo, khotsopitso, and ucyang reacted with rocket emoji
All reactions
* π 12 reactions
* π 11 reactions
* π 11 reactions
24 people reacted
v0.20.2
10 May 07:37
khluu
v0.20.2
bc150f5
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.20.2
vLLM v0.20.2
Highlights
This release features 6 commits from 6 contributors (0 new)!
This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL
Bug Fixes
* DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605).
* DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
* gpt-oss MXFP4 + torch.compile: Plumbed hidden_dim_unpadded through the moe_forward fake op so MXFP4 works under torch.compile on v0.20.x (#42002, backport of #41646).
* Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).
Contributors
@ywang96, @zyongye, @stecasta, @wzhao18, @Isotr0py, @khluu
Contributors
*
*
*
*
*
*
Isotr0py, zyongye, and 4 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 15 adamw7, qiuuuuu622, iammeizu, aaarkai, preference-kim, bewestphal, iafarhan, Liibon, hwb96, TianchouYin, and 5 more reacted with thumbs up emoji π 5 adonig, gau-nernst, zhaoyan272898-cyber, SurealCereal, and yien-tsai-appier reacted with hooray emoji β€οΈ 4 Emilien-Etadam, gau-nernst, erickstryck, and yien-tsai-appier reacted with heart emoji π 9 cjackal, crazyguitar, webspeller, gau-nernst, rocfatcat, zhenhuan-yang, yien-tsai-appier, varungup90, and khotsopitso reacted with rocket emoji
All reactions
* π 15 reactions
* π 5 reactions
* β€οΈ 4 reactions
* π 9 reactions
28 people reacted
v0.20.1
04 May 10:36
khluu
v0.20.1
132765e
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.20.1
vLLM v0.20.1
This is a patch release on top of v0.20.0 primarily focused on DeepSeek V4 stabilization and performance improvements, along with several important bug fixes.
DeepSeek V4
* Base model support (#41006).
* Multi-stream pre-attention GEMM (#41061), configurable pre-attn GEMM knob (#41443), and tuned default VLLM_MULTI_STREAM_GEMM_TOKEN_THRESHOLD (#41526).
* BF16 and MXFP8 all-to-all support for FlashInfer one-sided communication (#40960).
* PTX cvt instruction for faster FP32->FP4 conversion (#41015).
* Integrated tile kernels (head_compute_mix_kernel) for optimized head computation (#41255).
* Guard megamoe flag with Pure TP (#41522).
* Fixed persistent topk cooperative deadlock at TopK=1024 (#41189) and inter-CTA init race on RadixRowState (#41444), with temporary disable of persistent topk as a workaround (#41442).
* Fixed import error due to AOT compile cache loading (#41090).
* Fixed torch inductor error (#41135).
* Fixed repeated RoPE cache initialization (#41148).
* Fixed missing type conversion for non-streaming tool calls in DSV3.2/V4 (#41198).
Bug Fixes
* Fixed max_num_batched_token not being captured in CUDA graph (#40734).
* Fixed num_gpu_blocks_override not accounted for in max_model_len checks (#41069).
* Auto-disable expandable_segments around cumem memory pool (#40812).
* Fixed BailingMoE linear layer (#40859) and MLA RoPE rotation for BailingMoE V2.5 (#41185).
* Fixed reasoning parser kwargs not being passed to structured output (#41199).
* [ROCm] Fixed input_ids and expert_map args for Quark W4A8 GPT-OSS (#41165).
List of contributors
@BugenZhao, @chaunceyjiang, @gau-nernst, @ghphotoframe, @Isotr0py, @jeejeelee, @khluu, @njhill, @Rohan138, @wzhao18, @youkaichao, @ywang96, @ZJY0516, @zixi-qi, @zyongye
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
njhill, chaunceyjiang, and 13 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 25 kustomzone, QuanyiLi, wskr00, ywang96, staryxchen, gau-nernst, zhangj1an, slfan1989, jayakumarpujar, BenWongCityuCS, and 15 more reacted with thumbs up emoji
All reactions
* π 25 reactions
25 people reacted
v0.20.0
27 Apr 21:20
khluu
v0.20.0
88d34c6
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.20.0
vLLM v0.20.0
Highlights
This release features 752 commits from 320 contributors (123 new)!
* DeepSeek V4: Initial DeepSeek V4 support landed (#40860), with DSML token-leakage fix in DSV4/3.2 (#40806), DSA + MTP IMA fix (#40772), and a silu clamp limit on the shared expert (#40950).
* CUDA 13.0 default: Default CUDA wheel on PyPI and vllm/vllm-openai:v0.20.0 image switched to CUDA 13.0; architecture lists and build-args cleaned up (#39878), and CUDA bumped to 13.0.2 to match PyTorch 2.11.0 (#40669). As a general rule of thumb, our CUDA version policy follows PyTorch's. We highly recommend to install vLLM with uv and use --torch-backend=cu129 if you are on CUDA 12.9.
* PyTorch 2.11 upgrade (#34644): vLLM ships on torch 2.11 for CUDA, and XPU is now also on torch 2.11 (#37947) β XPU is no longer pinned to 2.10. This is a breaking change for environment dependency.
* Python 3.14: Added to the supported Python version list (#34770).
* Transformers v5: vLLM now runs on HuggingFace transformers>=5 (#30566), with vision-encoder torch.compile bypass (#30518) and continued v4/v5 compat fixes including PaddleOCR-VL image processor max_pixels (#38629), Mistral YaRN warning (#37292), and Jina ColBERT rotary inv_freq recompute (#39176).
* New large models: Hunyuan v3 (Hy3) preview (#40681) with HYV3 reasoning parser (#40713); Granite 4.1 Vision as a built-in multimodal model (#40282).
* FlashAttention 4 as default MLA prefill: FA4 re-enabled as the default MLA prefill backend (#38819) with head-dim 512 and paged-KV support on SM90+ (#38835), plus an upstream FA4 sync (#38690).
* TurboQuant 2-bit KV cache: New attention backend delivering 2-bit KV cache compression with 4Γ capacity (#38479), now with FA3/FA4 prefill support (#40092).
* Online quantization frontend: New end-to-end online quantization frontend (#38138), with docs (#39736); experts_int8 consolidated into the FP8 online path (#38463); MXFP8 online quant moved to the new frontend (#40152).
* vLLM IR: Initial IR skeleton with rms_norm op (#33825), OOT-platform kernel imports (#38807), gemma_rms_norm reworked on IR (#39014), and IR op testing/benchmarking infra added (#40167) β foundation for future kernel work.
* Model Runner V2 advances: Eagle prefill full-CUDA-graph (#37588), auto-resolve cudagraph mode/sizes from attention backend (#32936), fused probabilistic rejection sample kernels (#38496), config validation for unsupported features (#38758), piecewise-fallback disabled for eagle draft decodes (#39773), multiple prompt-logprobs support (#39937), prefill warmup coverage (#40746), and a fix for accuracy regression caused by stale sampled/draft tokens (#39833).
* MoE refactor series: Unquantized migrated to Full Oracle Flow (#36286), CT W8A8 to Oracle (#39187), SharedExperts class (#35153), SharedFusedMoE removed (#35782), DefaultMoERunner split (#35326) and later combined back into MoERunnerBase (#40560), shared/fused expert output sum moved into MoERunnerBase (#35949), ZeroExpertFusedMoE in new framework (#35549), compressed_tensors_moe.py split (#38960), GPTQMarlinMoEMethod reworked with MK (#37990), XPU & CUTLASS MoE relocated to fused_moe/experts/ (#40568, #40574), make_expert_params_mapping renamed (#40671), MoE LoRA refactor (#40338), and MoE DP chunking removed (#39107).
* Performance: Optimize batch invariant with fused rms norm β 2.1% E2E latency improvement (#40413); avoid seq_lens_cpu GPUβCPU sync (#40654); cache InductorPass.hash_source (#39328); skip FX-graph deserialization on loading for faster warm compile (#40151); CUDAGraph memory profiling enabled by default for clearer startup memory accounting (#38284).
Model Support
* New architectures: DeepSeek V4 (#40860), Hunyuan v3 preview (#40681), Granite 4.1 Vision (#40282), EXAONE-4.5 (#39388), BharatGen Param2MoE (#38000), Phi-4-reasoning-vision-15B (#38306), Cheers multimodal (#38788), telechat3 (#38510), FireRedLID (#39290), jina-reranker-v3 (#38800), Jina Embeddings v5 (#39575), Nemotron-v3 VL Nano/Super (#39747).
* Gemma4 series: fast prefill (#38879), quantized MoE (#39045), Eagle3 (#39450), block-local attention + YaRN for Gemma3 (#39823), bidirectional vision attention for sliding layers (#40534), token-repetition fix via dynamic BOS (#39842), multimodal embedder norm-order fix (#40411), plus a string of streaming/tool-call fixes (#38844, #38909, #38992, #39114, #39679, #39027).
* Quantization formats: GGUF support for MiniMax-M2.1 (#36965), non-standard GGUF quant types with prefix such as UD-IQ1_S (#39471).
* Speculative decoding: Eagle3 for MiniMax-M2 (#37512), Eagle3 for Gemma4 (#39450).
* LoRA: Qwen3ASRForConditionalGeneration (#37247), Gemma4ForConditionalGeneration (#39291, #38844), DeepSeek V3.2 (#35077), Qwen3.5 / Step3.x expert base_layer extension (#37114), MoE LoRA refactor (#40338), dual-CUDA-streams linear layer (#35721).
* Multimodal MRoPE refresh: mm_features-based MRoPE for Ernie-4.5 VL (#39753), Keye-VL / Keye-1.5-VL (#39869), PaddleOCR-VL (#39888).
* Other: Nano-Nemotron-VL static image inputs fix (#40724); Qwen3 MoE no longer calls gate twice (#40664); DeepSeek V2-Lite accuracy drop fix (#40673); Parakeet UX / perf enhancements (#39423); ColModernVBERT updated for latest HF checkpoint (#39307); NemotronH default mamba_ssm_cache_dtype=float32 with NemotronHNanoVLV2 auto-hook (#39032); new TP plan styles for the Transformers backend (#40467); GLM-5.1 fix on ROCm (#40763).
Engine Core
* Model Runner V2: Full CUDA graph for eagle prefill (#37588), auto cudagraph mode/sizes based on attention backend (#32936), fused probabilistic rejection-sample kernels (#38496), config validation (#38758), eagle-draft piecewise fallback disabled (#39773), multiple prompt logprobs (#39937), prefill warmup coverage (#40746), stale sampled/draft tokens accuracy fix (#39833).
* vLLM IR: IR skeleton + rms_norm (#33825), OOT kernel import hooks (#38807), gemma_rms_norm on IR (#39014), IR op testing/benchmarking infra (#40167).
* torch.compile: Opaque Objects on torch 2.11 (#39286), AOT compile with batch-invariance mode (#39201), Inductor cache nested under AOT dir (#39718), split FX graph via codegen (#38657), Inductor pre-grad passes re-enabled for torchβ₯2.12 (#38944), strings in custom ops without compile regressions (#38123), MLA + group FP8 fusion (#38877), SiluMul activation+quant fusion refactor (#39684), donate_graph_module=True for standalone_compile (#39733), skip FX graph deserialization on loading (#40151), include Inductor & functorch configs in compile-cache key (#40627), respect TORCH_COMPILE_DISABLE at vLLM config level (#40715), disable Sequence Parallelism for piecewise compilation (#38373).
* Attention: FA4 as default MLA prefill (#38819), head-dim 512 + paged-KV on sm90+FA4 (#38835), FA4 upstream sync (#38690), full CUDA graph for FlexAttention (#36298), FlexAttention non-causal support (#40394), unified 2D/3D triton_unified_attention (#40631), TRTLLM minimax_allreduce_rms ported (#37045), concat_mla_q half-types only (#37892), batch-invariance-aware backend auto-selection (#40193), avoid seq_lens_cpu GPUβCPU sync (#40654).
* Helion kernels: torch.compile support for Helion kernels (#38592).
* HMA / KV offload: GPU-side KV events for HMA (#37688), group block hashes/IDs tracked (#37109), unified memory layout for offloading workers (#37206), shutdown() on OffloadingConnector (#39182), request context passed through KV offload (#39185), sliding-window lookup (#36645), multi-group worker transfer (#38453), multi-KV-group lookup/load/store (#39401, #39402, #39403).
* Features: NUMA binding for GPU workers (#38635), opt-in VLLM_MEDIA_CACHE media URL caching (#37123), safe request abort when FSM fails to advance (#38663), KV connector prioritized over internal registry (#38301), CUDAGraph memory profiling on by default (#38284), shared-expert overlap restored (#39222), CONFIG_REGISTRY config-class lookup fix when on-disk model_type differs (#39554), workspace-resize GPU memory leak fix (#39226), SWA/chunked-local runtime admission capped to startup pool-sizing bound (#40946).
* Pluggable layers: Applied to llm_head / vocab embedding (#33465) and MoE layers (#33556).
* Mamba: Stochastic rounding (#35753), different Conv state layouts (#37416), FlashInfer selective_state_update (#36162).
* Metrics & scheduling: Labeled waiting-breakdown (capacity/deferred) metric (#38435), API server handshake simplified (#39364), mm-scheduler get_num_embed overhead reduced (#40143), request_id on FinishedRequestStats (#39710).
* Executor: RayExecutorV2 introduced (#36836); unified engine process monitoring with Ray backend (#35862).
Hardware & Performance
* NVIDIA: swapAB support for SM120 CUTLASS blockwise FP8 GEMM (#38325), MXFP4 W4A4 CUTLASS MoE for SM100 (#37463), TRTLLM GEN NVFP4 MoE with non-512-aligned hidden dims via weight padding (#39510), TRTLLM FP8 MoE with shuffled weights + BlockMajorK layout (#38993), fused qknorm+rope kernel on SM9.0 (#37376), tuned fused_moe config for RTX PRO 6000 Blackwell (#39183), ViT full CUDA graph for Qwen3-VL video (#38061), --enable-vit-cuda-graph for VLM examples (#40580), default max_frames_per_batch auto-infer for ViT CG video (#40445), fused FP8 output quantization into merge_attn_states (#36518), batched KV-cache swap via cuMemcpyBatchAsync (#38460), sm_110 (Jetson Thor) added to CUDA 13.0 build targets (#39233).
* AMD ROCm: ZenCPU / AMD Zen CPU backend via zentorch (#39967), RDNA 3.5/4 device IDs (gfx1150/1151/1201) (#38455), gfx1102/gfx1103 added (#40037), MORI EP for unquantized MoE with AITER (#37529), MoRI build with AMD AINIC stack (#38371), MoRI-IO message format aligned with P2pNcclConnector and vllm-router (#39565), MORI prefill/decode API correction (#39835), AITER gemm w8a8 ptpc integration (#33773), TritonW4...
Read more
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
jefp, bai, and 121 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 60 adamw7, also-good, DJGoo, MateKristof, Cameron977, novelsk, hualin-wu-2000, azampatti, ShuoleiWang, cryptokeenz, and 50 more reacted with thumbs up emoji π 3 mike1858, zhijiangnaiweiguo2568, and SushantGautam reacted with laugh emoji π 15 klarkc, novelsk, NilsHellwig, bewestphal, duongck, prravda, mike1858, parsa-rahbari-82, z1ying, SushantGautam, and 5 more reacted with hooray emoji β€οΈ 10 mgoin, SparkShiStardust, Brensom, kibitzing, mike1858, jawnsy, thomasbergersen, Ligator, SushantGautam, and ivanbaldo reacted with heart emoji π 50 josericardo-fo, etatros, 1zilc, alfred-liu96, CycloneBoy, yfyang007, guenhter, echonjdavid, miguelcaldas0103, MateKristof, and 40 more reacted with rocket emoji π 2 mike1858 and ivanbaldo reacted with eyes emoji
All reactions
* π 60 reactions
* π 3 reactions
* π 15 reactions
* β€οΈ 10 reactions
* π 50 reactions
* π 2 reactions
103 people reacted
v0.19.1
18 Apr 05:44
khluu
v0.19.1
b1388b1
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.19.1
This is a patch release on top of v0.19.0 with Transformers v5.5.3 upgrade and bug fixes for Gemma4:
* Update to transformers v5 (#30566)
* [Bugfix] Fix invalid JSON in Gemma 4 streaming tool calls by stripping partial delimiters (#38992)
* [Bugfix][Frontend] Fix Gemma4 streaming HTML duplication after tool calls (#38909)
* [Bugfix] Fix Gemma4 streaming tool call corruption for split boolean/number values (#39114)
* [Tool] adjust_request to reasoning parser, and Gemma4 fixes (#39027)
* [Gemma4] Support quantized MoE (#39045)
* Add Gemma4 Eagle3 support (#39450)
* [Gemma4][Bugfix]: Enable Gemma4ForCasualLM to load lora adapters correctly (#38844)
* [Bugfix] Fix Gemma4 tool parser converting bare null to string "null" (#39679)
* [Model] Fix Gemma 4 token repetition by dynamic BOS injection for PT models (#39842)
* fix(kimi_k25): resolve media_placeholder_token_id from tokenizer (#39344)
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 36 allanchan339, ikaadil, adamw7, dergunovs, saattrupdan, Philipp-Sc, zlxi02, lizable, jhrystrom, adarshmadrecha, and 26 more reacted with thumbs up emoji π 4 kinbod, bakkerme, bobbysy, and ivanbaldo reacted with hooray emoji π 14 jiosephlee, jeejeelee, ikaadil, saattrupdan, eaplatanios, zlxi02, kinbod, jhrystrom, AadamHaq, bakkerme, and 4 more reacted with rocket emoji
All reactions
* π 36 reactions
* π 4 reactions
* π 14 reactions
45 people reacted
v0.19.0
03 Apr 02:19
khluu
v0.19.0
2a69949
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.19.0
vLLM v0.19.0
Highlights
This release features 448 commits from 197 contributors (54 new)!
* Gemma 4 support: Full Google Gemma 4 architecture support including MoE, multimodal, reasoning, and tool-use capabilities (#38826, #38847). Requires transformers>=5.5.0. We recommend using pre-built docker image vllm/vllm-openai:gemma4 for out of box usage.
* Zero-bubble async scheduling + speculative decoding: Async scheduling now supports speculative decoding with zero-bubble overlap, significantly improving throughput (#32951).
* Model Runner V2 maturation: MRV2 gains piecewise CUDA graphs for pipeline parallelism (#35162), spec decode rejection sampler with greedy/logprobs support (#37238, #37237), multi-modal embeddings for spec decode (#36097), streaming inputs (#37028), and EPLB support (#37488).
* ViT Full CUDA Graphs: Vision encoders (ViT) now support full CUDA graph capture for reduced overhead (#35963).
* General CPU KV cache offloading: A simple yet general CPU KV cache offloading mechanism for V1, with pluggable cache policy and block-level preemption handling (#37160, #37874, #34805, #36642, #37853).
* DBO (Dual-Batch Overlap) generalization: The microbatch optimization (DBO) now works with general models, not just specific architectures (#37926).
* NVIDIA B300/GB300 (SM 10.3) support: Allreduce fusion enabled by default with tuned all-reduce communicator (#37755, #37756).
* Transformers v5 compatibility: Broad compatibility fixes across many models for HuggingFace Transformers v5 (#37681, #38127, #38090, #38247, #38410).
Model Support
* New architectures: Gemma 4 (#38826), Cohere ASR (#35809), Cohere Transcribe (#38120), ColQwen3.5 4.5B (#36887), LFM2-ColBERT-350M (#37528), Granite 4.0 1B Speech (#38019), Qwen3-ForcedAligner (#35367).
* Speculative decoding: Eagle3 for Pixtral (#37182), EagleMistralLarge3 fix (#37232).
* LoRA expansion: H2OVL tower/connector LoRA (#31696), --lora-target-modules to restrict LoRA to specific modules (#34984), language_model_only respected (#37375), Mistral3 fix (#36928), Qwen3.5 fix (#36976), out-of-tree ops replacement (#37181).
* Model fixes: NemotronH MTP + Chunked Prefill (#35447), Qwen3-VL video timestamps (#37439), Qwen3.5 GDN quantized models (#37448), Qwen3Next A_log FP32 (#37810), JAIS ALiBi (#37820), RoBERTa CUDA graph position IDs (#37873), AudioFlamingo3/MusicFlamingo (#37643), Music Flamingo loading (#35535), bge-m3 task selection (#37632), Nemotron Parse loading (#37407), GLM OCR patch merger (#37962), PaddleOCR checkpoint compat (#38232), DeepSeek v3.2 params (#33703), MiniMax NVFP4 weight loading (#37214), gated model HF token (#37920), Parakeet OOM on long audio (#36671).
* Features: Temporal compression for Nemotron-3-VL videos (#36808), NemotronH Puzzle + MTP (#37803), torch.compile for InternVL vision encoder (#38049), multiple embedding types in single call (#35829).
* Performance: GLM-4.xv ViT optimization (#37779).
Engine Core
* Zero-bubble async scheduling + speculative decoding (#32951).
* Model Runner V2: PP CUDA graphs (#35162), spec decode rejection sampler greedy (#37238) + logprobs (#37237), multimodal embeddings for spec decode (#36097), streaming inputs (#37028), configurable acceptance rate (#38045), FP32 draft logits (#37526), FP64 Gumbel noise (#37798), warmup with spec decode (#37812).
* ViT Full CUDA Graph capture (#35963).
* General CPU KV cache offloading with pluggable CachePolicy (#37160, #37874), block-level preemption (#34805), multiple KV groups (#36642), hybrid model support (#37853).
* DBO for general models: Microbatch optimization generalized beyond specific architectures (#37926).
* Compilation: Mega AOT artifact for torch 2.12+ (#37198), lazy graph module to defer recompile (#37609), remove model tag requirement for compile cache (#37345), Triton autotuning disk cache enabled by default (#37188), inductor runtime asserts disabled by default (#37485).
* FlexAttention: Custom mask modification support (#37692).
* Attention: Distinguish short extends vs decodes (#37303), allow qk_nope_head_dim=192 in FlashInfer MLA (#37475), skip sliding window attention layers with FP8 KV cache (#33695).
* Scheduling: Schedule requests based on full input sequence length (#37307).
* Spec decode: Per-draft-model MoE backend via --speculative-config (#37880), Eagle3 drafter quant_config propagation (#37280), Eagle3 norm_before_fc propagation (#38111).
* Extensibility: PluggableLayer for CustomQwen2Decoder (#37293), tensor IPC transfer for multimodal data (#32104).
* Performance: Optimize top-k in Triton sampler (#37225), optimize token_embed for pooling models with 1% improvement (#37347), fix slow hasattr in CUDAGraphWrapper (#37425), NFS prefetch auto-enabled with RAM guard (#37673), pybase64 replacement (#37290), optimize swap_states for hybrid models (#34733).
* Bugfixes: Fix gibberish from FP8 MLA KV scale inconsistency (#37054), Mamba state corruption (#37728), deadlock with pause/resume (#37024), FlashInfer MNNVL socket collisions (#36674), multimodal prefix cache key collisions (#36708), DP coordinator ZMQ TOCTOU (#37452), CUDA graph memory double-counting (#37426), pooling non-determinism (#37775), AllReduce Fusion shutdown crash (#36955), FlashInfer allreduce workspace (#37461), async spec decoding with hybrid models (#38556), MLA sparse indexer prefill chunking (#36178), KV offloading + MLA (#37536), async scheduling extra CUDA context (#37449), DP MTP dummy run (#35243), offloading+prefetch for GLM-4.7-FP8 (#37178), max memory for multiple KV-cache groups (#36030).
Hardware & Performance
* NVIDIA:
+ B300/GB300 (SM 10.3): Allreduce fusion enabled by default (#37755), tuned all-reduce communicator (#37756).
+ Blackwell: Optimized SM120 CUTLASS blockwise FP8 GEMM (#37970), fix NVFP4 NaN on desktop Blackwell (#37725), fix DeepGEMM E8M0 accuracy for Qwen3.5 FP8 (#38083), restore FP8 FlashMLA CUDA graph persistent buffers (#35175), DGX Spark fix (#38126).
+ FlashInfer sparse MLA as default for FP8 KV cache (#37252).
+ Tuned prefill configs for FP8 FA3 (#36265), tuned Triton MoE config for Qwen3.5 on H200 with 9.9% E2E improvement (#37340), H800 MoE configs (#31201).
+ GPT-OSS: Router GEMM kernel (#37205), eliminate padding with FlashInfer MXFP4/MXFP8 MoE (#30647), reduce redundant SparseMatrix creation (#37683).
+ NVFP4 CUTLASS MoE non-gated support (#37320), fuse pack topk in TRTLLM MoE via torch.compile (#37695).
+ Non-contiguous KV cache in TRTLLM FP8 dequant kernel (#36867), Qwen3 dual stream input projection (#36795).
* AMD ROCm:
+ ROCm 7.2.1, torch 2.10, triton 3.6 (#38252).
+ DeepEP as all2all backend (#34692).
+ Persistent MLA kernel from AITER (#36574), FP8xFP8 attention in AITER (#36927).
+ AWQ Marlin support (#36505), wvSplitK skinny GEMM for RDNA4/gfx1x (#34709).
+ Nightly Docker image and wheel releases (#37283).
+ Bugfixes: Sleep mode memory leak (#37533), hybrid model stride (#37228), qwen3_next crash (#36795).
* Intel XPU: MLA model support (#37143), CompressedTensor W4A8 (#37207), auto-detect XPU build platform (#37634).
* TPU: Async scheduling interface (#36924), Qwen3.5 FP8 weight loading fix (#37348).
* CPU: Enable tcmalloc by default (#37607), graceful degradation without tcmalloc/libiomp (#37561), 48.9% throughput improvement for pooling models (#38139), OpenMP thread fix for torch.compile (#37538), structured output crash fix (#37706), KV cache block zeroing crash fix (#37550), slot mapping kernel (#37987), W4A16 compressed tensors (#38219).
* Performance fixes: FP8 DeepGEMM batch invariance (#37718), Triton autotuning for Qwen3.5 (#37338), TRTLLM NVFP4 routing precision (#36725).
Large Scale Serving
* Disaggregated serving: PD kv_transfer_params for Anthropic Messages (#37535) and Responses API (#37424), Mooncake heterogeneous TP (#36869), Mamba N-1 prefill for P/D (#37310).
* EPLB: MRV2 support (#37488), improved responsiveness (#36271), EP weight filter fix (#37322).
* Elastic EP: Fix repeated scale up/down cycles (#37131), fix stateless group port races (#36330).
* DBO: Generalized to work with all models (#37926).
* Multi-node: Fix allreduce fusion (#38136).
* KV connector: Plugin-overridable metadata build (#37336).
* Constraints: Cap API servers to 1 with Elastic EP (#37466).
Quantization
* Online MXFP8 quantization for MoE and dense models (#35448).
* FP8: WoQ kernel abstraction (#32929), Marlin FP8 for compressed tensors fix (#38092).
* NVFP4: Rescale weight scales to fix BF16 dequant underflow (#34577), fix Marlin NaN/Inf with float16 (#33972).
* QeRL: Online quantization composed with quantized reloading for RLHF (#38032).
* CPU: W4A16 compressed tensors (#38219).
* XPU: CompressedTensor W4A8 (#37207).
* ROCm: AWQ Marlin support (#36505).
* MXFP8 + DeepGEMM: Fix crash when both are active (#37358).
* Removals: Per-tensor-per-channel FP8 removed (#32700), Sparse24 integration and kernels removed (#36799).
API & Frontend
* New endpoints: /v1/chat/completions/batch for batched chat completions (#38011).
* Features: Limit thinking tokens (hard limit) (#20859), multiple embedding types in single call (#35829), numpy array embeddings for multimodal (#38119), --lora-target-modules (#34984), -sc shorthand for --speculative-config (#38380).
* Tool parsing: GigaChat 3.1 parser (#36664), Kimi-K2.5 reasoning/tool parser (#37438), Gemma 4 tool parser (#38847), tools passed to parser constructor (#38029), fix Mistral parser (#37209), fix DeepSeek v3.2 streaming (#36056), fix GLM-4.7 parsing (#37386), fix Hermes streaming (#38168), fix OpenAI tool parser IndexError (#37958), fix Anthropic streaming (#37510).
* Responses API: Fix crash with tool_choice=required exceeding ma...
Read more
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
laudney, brandonpelfrey, and 52 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 15 adamw7, yassermessahli, MatejRojec, jaxhend, jhempenius, BenWongCityuCS, llocretsa, mkMoSs, igor-susic1, mayshin10, and 5 more reacted with thumbs up emoji π 29 crazyguitar, nanbogong, LiAnQing279, 1zilc, MengqingCao, noooop, 5aharsh, tristan-renaud, Paxtiny, scyyh11, and 19 more reacted with hooray emoji π 4 wedobetter, jiosephlee, LuisRobaina, and ivanbaldo reacted with rocket emoji
All reactions
* π 15 reactions
* π 29 reactions
* π 4 reactions
40 people reacted
v0.18.1
31 Mar 00:53
khluu
v0.18.1
a26e8dc
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.18.1
This is a patch release on top of v0.18.0 to address a few issues:
* Change default SM100 MLA prefill backend back to TRT-LLM (#38562)
* Fix mock.patch resolution failure for standalone_compile.FakeTensorMode on Python <= 3.10 (#37158)
* Disable monolithic TRTLLM MoE for Renormalize routing #37605
* Pre-download missing FlashInfer headers in Docker build #38391
* Fix DeepGemm E8M0 accuracy degradation for Qwen3.5 FP8 on Blackwell (#38083)
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 14 kyoungbinkim, snomile, Mr-Penguin, adamw7, maxx4144136-ship-it, ivanbaldo, zuhaira631, sejohnst, dangoldbj, ucyang, and 4 more reacted with thumbs up emoji
All reactions
* π 14 reactions
14 people reacted
v0.18.0
20 Mar 21:31
khluu
v0.18.0
bcf2be9
Compare
Choose a tag to compare
Sorry, something went wrong.
Filter
Loading
Sorry, something went wrong.
Uh oh!
There was an error while loading. Please reload this page.
No results found
View all tags
v0.18.0
vLLM v0.18.0
Known issues
* Degraded accuracy when serving Qwen3.5 with FP8 KV cache on B200 (#37618)
* If you previously ran into CUBLAS_STATUS_INVALID_VALUE and had to use a workaround in v0.17.0, you can reinstall torch 2.10.0. PyTorch published an updated wheel that addresses this bug.
Highlights
This release features 445 commits from 213 contributors (61 new)!
* gRPC Serving Support: vLLM now supports gRPC serving via the new --grpc flag (#36169), enabling high-performance RPC-based serving alongside the existing HTTP/REST interface.
* GPU-less Render Serving: New vllm launch render command (#36166, #34551) enables GPU-less preprocessing and rendering, allowing separation of multimodal preprocessing from GPU inference.
* NGram GPU Speculative Decoding: NGram speculative decoding now runs on GPU and is compatible with the async scheduler (#29184), significantly reducing spec decode overhead.
* KV Cache Offloading Improvements: Smart CPU offloading that stores only frequently-reused blocks (#35342), plus FlexKV as a new offloading backend (#34328) and support for multiple KV groups in offloading spec (#36610).
* Elastic Expert Parallelism Milestone 2: NIXL-EP integration (#35627) enables dynamic GPU scaling for MoE experts, with new --enable-ep-weight-filter CLI option (#37351) for faster EP model loading.
* FlashInfer 0.6.6: Updated FlashInfer dependency (#36768) with numerous performance and correctness improvements.
* Responses API Streaming Tool Calls: The OpenAI Responses API now supports tool/function calling with streaming (#29947).
* Online Beam Search for ASR: Beam search support for encoder/decoder models both offline (#36153) and online transcriptions (#36160).
* Ray No Longer a Default Dependency: Ray has been removed as a default dependency (#36170) β install it explicitly if needed.
Model Support
* New architectures: Sarvam MoE (#33942), OLMo Hybrid (#32550), HyperCLOVAX-SEED-Think-32B VLM (#31471), HyperCLOVAX-SEED-Think-14B (#37107), Kimi-Audio-7B-Instruct (#36127), ColPali late-interaction retrieval (#36818), ERNIE pooling models (#36385).
* Speculative decoding: Eagle3 for Qwen3.5 (#36658), Eagle3 for Kimi K2.5 MLA (#36361), Eagle for Mistral Large 3 with dense layers (#36163).
* LoRA: Whisper LoRA (#29856), FP8 LoRA dense kernel (#35242).
* Multimodal: Online use_audio_in_video (#36319), audio extraction from MP4 for Nemotron Nano VL (#35539), audio transcription for MP4/M4A/WebM (#35109), expose media_io_kwargs at runtime (#34778), fast media preprocessing for Nano Nemotron VL (#35657).
* Compatibility: Gemma/Gemma2 inputs_embeds (#36787), SigLIP/CLIP Transformers v5 (#37200), fused expert weights in Transformers backend (#36997).
* Performance: Qwen3 Next fused GDN kernel (#35777), LFM2 tuned H100 MoE configs (#36699).
* Fixes: DeepSeek-V3.2 tokenizer space stripping (#37004), Qwen3.5 tool calling (#36774), Qwen3-VL timestamp mismatch (#36136), Qwen3-Next TP>1 weight sharding (#36242), Qwen3-ASR torch.compile (#35869), MiniCPM-V audio inference (#36751), MiniCPM-O 4.5 ViT attention (#34127), routed experts for hybrid models (#35744), Qwen2.5-Omni/Qwen3-Omni multi-video audio_in_video (#37147), DeepSeek-OCR empty images crash (#36670).
Engine Core
* Model Runner V2: Probabilistic rejection sampling for spec decode (#35461), pooling models (#36019), extensible CUDA graph dispatch (#35959), WhisperModelState (#35790), XD-RoPE (#36817), model_state CUDA graph capture (#36544).
* KV cache offloading: Reuse-frequency-gated CPU stores (#35342), FlexKV offloading backend (#34328), multiple KV groups (#36610), async scheduling fix (#33881).
* Speculative decoding: NGram GPU implementation with async scheduler (#29184), fused EAGLE step slot mapping (#33503).
* Performance: Remove busy loop from idle buffer readers (#28053), 2.7% E2E throughput for pooling via worker-side maxsim (#36159), 3.2% via batched maxsim (#36710), CUDA graph memory accounting during profiling (#30515), checkpoint prefetch to OS page cache (#36012), InstantTensor weight loader (#36139), sporadic stall fix via pin_memory removal (#37006).
* Stability: VLM concurrent throughput degradation fix (#36557), DP deadlock fix (#35194), DeepSeek V3.2 OOM during CG profiling (#36691), Ray DP startup crash (#36665), NCCL rank calculation fix (#36940), zero-init MLA output buffers for NaN prevention (#37442), CUDA OOM fix (#35594).
* Defaults: Cascade attention disabled by default (#36318).
* Extensibility: OOT linear method registration (#35981), custom collective ops registration for non-CUDA platforms (#34760).
Kernel
* FA4 for MLA prefill (#34732).
* FlashInfer Sparse MLA: FP8 KV cache support (#35891), CUDA graphs on ROCm (#35719), MTP lens > 1 on ROCm (#36681).
* TRTLLM FP8 MoE modular kernel (#36307).
* FP8 KV cache for Triton MLA decode (#34597).
* FlashInfer MoE A2A kernel (#36022).
* Remove chunking from FusedMoE for full batch processing (#34086).
* CustomOp FusedRMSNormGated for torch.compile compatibility (#35877).
* Mamba2 SSD prefill Triton kernel optimization (#35397).
* DeepSeek-V3.2: Vectorized MLA query concat kernel (#34917), optimized FP8 KV cache gather for context parallel (#35290).
* 320-dimension MLA head size support (#36161).
* Packed recurrent fast path for decode (#36596).
* EP scatter race condition fix (#34991).
Hardware & Performance
* NVIDIA: FA4 for MLA prefill (#34732), DeepSeek-V3.2 MLA kernel optimizations (#34917, #35290).
* AMD ROCm: Sparse MLA CUDA graphs (#35719), MTP lens > 1 in Sparse MLA (#36681), MLA with nhead<16 + FP8 KV for TP=8 (#35850), RoPE+KV cache fusion for AITER FA (#35786), AITER MLA CPU sync avoidance (#35765), Quark W4A8 MXFP4/FP8 (#35316), gfx1152/gfx1153 Krackan support (#36499), fused_topk_bias AITER optimization (#36253), skinny GEMM improvements (#34304), DeepEP in ROCm Dockerfile (#36086), startup OOM fix (#36720).
* Intel XPU: Model Runner V2 enabled (#36078), MLA Sparse backend for DeepSeek V3.2 (#33230), LoRA via torch.compile (#36962), block FP8 MoE fallback (#36458), deepseek_scaling_rope fused kernel (#36612).
* CPU: aarch64 int8 matmul via OneDNN upgrade (#36147), AMD Zen CPU backend via zentorch (#35970).
* RISC-V: CPU backend support (#36578).
* Performance: 5% E2E improvement for PD disaggregation scheduling (#35781), packed recurrent decode fast path (#36596), pooling model maxsim 2.7%+3.2% throughput (#36159, #36710).
* torch.compile: FakeTensors instead of real GPU tensors for single-size compilation (#36093), non-contiguous fused RMSNorm + group quant (#36551), stop lazy compiling (#35472).
Large Scale Serving
* Elastic EP Milestone 2: NIXL-EP integration (#35627), --enable-ep-weight-filter for faster EP loading (#37351).
* PD Disaggregation: ~5% scheduler overhead reduction (#35781), KV transfer fix with spec decode (#35158), P/D for hybrid SSM-FA models via NIXL (#36687), PP for multimodal models on Transformers backend (#37057).
* KV Connectors: HMA + NIXL connector (#35758), FlexKV offloading (#34328), workerβscheduler metadata (#31964), All-to-All DCP backend (#34883).
* LMCache: Fault tolerance mechanism (#36586), memory leak fix (#35931), race condition fix (#35831), TP size for MLA multi-reader locking (#36129).
* EP loading: Skip non-local expert weights (#37136).
Quantization
* ModelOpt MXFP8 MoE support (#35986).
* MXFP4 MoE routing simulation override for accuracy (#33595).
* FP8 LoRA dense kernel (#35242).
* ROCm: Quark W4A8 MXFP4/FP8 for LinearLayer (#35316), compressed-tensors fix for DeepSeek-R1 on MI300x (#36247).
* Fixes: MLA crash with AWQ/GPTQ quantized models (#34695), score layer quantization for reranker models (#35849), GLM-4.1V non-default quantization (#36321), FP8 k_scale/v_scale loading for Qwen3-MoE (#35656).
API & Frontend
* gRPC: New --grpc flag for gRPC serving (#36169).
* GPU-less serving: vllm launch render for preprocessing-only serving (#36166), vllm launch for GPU-less preprocessing (#34551).
* Responses API: Streaming tool/function calling (#29947), reasoning item fixes (#34499, #36516).
* Anthropic API: Accept redacted thinking blocks (#36992).
* ASR: Online beam search transcriptions (#36160), offline beam search (#36153), audio transcription for MP4/M4A/WebM (#35109), realtime endpoint metrics (#35500).
* Tool calling: Granite4 tool parser (#36827), Qwen3Coder anyOf double encoding fix (#36032).
* New options: --distributed-timeout-seconds (#36047), --attention-backend auto (#35738), reasoning_effort=none (#36238), PyTorch profiler schedule (#35240).
* Cohere Embed v2 API support (#37074).
* Azure Blob Storage support for RunAI Model Streamer (#34614).
* Graceful shutdown timeout for in-flight requests (#36666).
* Fixes: tool_choice=required exceeding max_tokens crash (#36841), negative max_tokens with long prompts (#36789), concurrent classify/token_classify race (#36614), Anthropic billing header prefix cache miss (#36829), render endpoint crash for multimodal requests (#35684), xgrammar dtype mismatch on macOS CPU (#32384), minimax_m2 tool parser with stream interval > 1 (#35895).
Security
* Respect user trust_remote_code setting in NemotronVL and KimiK25 (#36192).
* Upgrade xgrammar for security fix (#36168).
* Guard RLHF weight sync deserialization behind insecure serialization flag (#35928).
Dependencies
* FlashInfer 0.6.6 (#36768).
* Ray removed from default dependencies (#36170).
* kaldi_native_fbank made optional (#35996).
* OpenAI dependency bounded to 2.24.0 (#36471).
* Deprecated items from v0.18 removed (#36470, #36006).
* Mistral common v10 (#36971).
Breaki...
Read more
Contributors
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
shubhra, sungsooha, and 59 other contributors
Assets 9
Loading
Uh oh!
There was an error while loading. Please reload this page.
π 18 1zilc, Brensom, amukho, DonliFly, gravesprite, harveyff, bigbear07, slfan1989, kyoungbinkim, ucyang, and 8 more reacted with thumbs up emoji π 12 zhewenl, DonliFly, Billy-Davies-2, mertalev, manhld0206, Richardyu114, Bambuuai, wedobetter, LittleExian, subnet-dev, and 2 more reacted with rocket emoji
All reactions
* π 18 reactions
* π 12 reactions
28 people reacted
Previous 1 2 3 4 5 β¦ 9 10 Next
Previous Next
Footer
Β© 2026 GitHub, Inc.
Footer navigation
* Terms
* Privacy
* Security
* Status
* Community
* Docs
* Contact
* Manage cookies
* Do not share my personal information
You canβt perform that action at this time.
For now, Differences are performed on text, not graphically, only the latest screenshot is available.
Screenshot requires a Content Fetcher ( Sockpuppetbrowser, selenium, etc ) that supports screenshots.