Releases: huggingface/transformers
Release v5.12.0
New Model additions
MiniMax-M3-VL
MiniMax-M3-VL is the vision-language member of the MiniMax-M3 family. It pairs a CLIP-style vision tower that uses 3D rotary position embeddings with the MiniMax-M3 text backbone. It uses a mixed dense/sparse Mixture-of-Experts decoder with SwiGLU-OAI gated experts and a lightning indexer for block-sparse attention. The model processes images through a Conv3d patch embedding system and includes specialized components for efficient multimodal understanding and generation.
Links: Documentation
- Add minimax m3vl (#46600) by @ArthurZucker in #46600
PP-OCRv6: update documentation and slow tests (#46576)
The official weights for PP-OCRv6 are out: PP-OCRv6 is a lightweight OCR system that combines architectural innovation with data-centric optimization. It redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge.
- PP-OCRv6: update documentation and slow tests (#46576) by @zhang-prog
Add Parakeet-RNNT (#46331)
ParakeetForRNNT: a Fast Conformer Encoder + an RNN-T (RNN Transducer) decoder
- RNN-T Decoder: Standard neural transducer:
- LSTM prediction network maintains language context across token predictions.
- Joint network combines encoder and decoder outputs.
- Greedy transducer decoding for inference: a blank emission advances the encoder frame by one, a non-blank emission stays on the same frame.
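The greedy decoding rule above (blank advances the frame, non-blank stays on it) can be sketched in a few lines. This is an illustrative sketch only, not the `ParakeetForRNNT` implementation: the `joint_argmax` callable, the blank id of 0, and the `max_symbols_per_frame` safety cap are all assumptions made for the example, and the token history passed to `joint_argmax` stands in for the LSTM prediction network state.

```python
from typing import Callable, List, Sequence

BLANK = 0  # assumed blank token id for this sketch


def greedy_transducer_decode(
    encoder_frames: Sequence[int],
    joint_argmax: Callable[[int, List[int]], int],
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy RNN-T loop: blank -> next frame, non-blank -> stay on the frame."""
    tokens: List[int] = []
    t = 0
    emitted_on_frame = 0
    while t < len(encoder_frames):
        # The joint network scores the current frame against the token history.
        token = joint_argmax(encoder_frames[t], tokens)
        if token != BLANK:
            tokens.append(token)
            emitted_on_frame += 1
        # Advance on blank, or when the (hypothetical) per-frame cap is reached.
        if token == BLANK or emitted_on_frame >= max_symbols_per_frame:
            t += 1
            emitted_on_frame = 0
    return tokens


def toy_joint(frame: int, history: List[int]) -> int:
    """Toy stand-in: emit the frame's label once, then emit blank."""
    if frame != BLANK and (not history or history[-1] != frame):
        return frame
    return BLANK


decoded = greedy_transducer_decode([3, 3, 0, 5], toy_joint)
```

The cap on symbols per frame is a common guard against a joint network that never emits blank; whether the released model uses one is not stated in these notes.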
Bugfixes and improvements
- [CI] don't export OTELs within the tests (#46602) by @tarekziade in [#46602]
- [CI] capture checkers output in OTEL (#46601) by @tarekziade in [#46601]
- Lfm2: thread seq_idx through ShortConv for packed/varlen inputs (#46588) by @ChangyiYang in [#46588]
- put output_hidden_states into filter_output_hidden_states (#46422) by @molbap in [#46422]
- a11 for checkers (#46599) by @tarekziade in [#46599]
- Fix stop string matching for byte-fragment tokens (#46530) by @Incheonkirin in [#46530]
- [DiffusionGemma] better docs and links (#46569) by @gante in [#46569]
- Require trust_remote_code to run a local-directory custom_generate (#46483) by @LinZiyuu in [#46483]
- Fix torchaudio version not tied to torch version in docker file (#46594) by @ydshieh in [#46594]
- [CI] Enable PR CI for all fork PRs via security gate (#46591) by @ydshieh in [#46591]
- [CB] [Minor] Add parameter to tune default compile level (#46533) by @remi-or in [#46533]
- Make DiffusionGemma trainable (#46568) by @kashif in [#46568]
- docs: 🌐 add Turkish translation for README file (#46312) by @onuralpszr in [#46312]
- fix-trainer-tests (#46541) by @SunMarc in [#46541]
- Remove unnecessary expand_as in get_placeholder_mask across VLMs (#44907) by @syncdoth in [#44907]
- [CI] Catch all shell/process execution issues in security gate via Bandit JSON report (#46560) by @ydshieh in [#46560]
- Honor a concrete dtype in AutoModel for composite checkpoints (#46514) by @qflen in [#46514]
- [CI] Implement real security check in PR CI security gate (#46557) by @ydshieh in [#46557]
- [CI] Add 60s delay in security gate for flow observation (#46555) by @ydshieh in [#46555]
- [TBC] [CI] Auto-approve PR CI for fork PRs via security gate (#46553) by @ydshieh in [#46553]
- [CI] fix and make less flaky (#46543) by @zucchini-nlp in [#46543]
- Fix hf_hub_download not placing file in current dir for url_to_local_path (#46545) by @ydshieh in [#46545]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ArthurZucker
- Add minimax m3vl (#46600)
- @eustlb
- Add Parakeet-RNNT (#46331)
Release v5.11.0
New Model additions
DiffusionGemma
DiffusionGemma is engineered to reduce the sequential bottlenecks of standard causal language models by employing an encoder-decoder architecture specifically optimized for inference speed. During inference, DiffusionGemma leverages multi-canvas sampling, where rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler. This block-autoregressive approach facilitates text generation at higher speeds compared to traditional sequential generation methods.
Links: Documentation
DeepSeek-V3.2
DeepSeek-V3.2-Exp is an experimental model from DeepSeek-AI that introduces DeepSeek Sparse Attention (DSA), a trainable, fine-grained sparse attention mechanism designed to improve training and inference efficiency in long-context scenarios. Built on top of DeepSeek-V3.1-Terminus with a 685B-parameter Mixture-of-Experts backbone, it reduces the quadratic cost of attention over long sequences by attending only to a selected subset of past tokens while maintaining virtually identical benchmark performance. The work was extended in DeepSeek-V3.2 which pairs DSA with scalable reinforcement learning and achieves gold-medal level results on competition math and competitive programming benchmarks.
Links: Documentation | Paper
- Add deepseek 3.2 exp (#41251) by @ArthurZucker in #41251
Kernels
The KernelConfig API was extended to support n-to-1 module fusion and parameter transformation, simplifying how custom kernels are integrated with Transformers modules. Additional fixes include resolving a dtype mismatch in the Mamba2 CUDA kernel path for NemotronH/Zamba2, adding fine-grained fp8/fp4 Triton kernel support, and correcting the FalconMamba fast-path warning to recommend pip install kernels instead of mamba-ssm.
- Extended & simplified n-to-1 kernel fusion via KernelConfig (#46339) by @michaelbenayoun in [#46339]
- Triton finegrained fp8/fp4 (#46407) by @IlyasMoutawwakil in [#46407]
- Fix dtype mismatch in NemotronH/Zamba2 Mamba2 CUDA-kernel path (out_proj) (#46487) by @yuekaizhang in [#46487]
- fix(falcon_mamba): recommend pip install kernels in fast-path warning (#46343) by @Anai-Guo in [#46343]
Parallelization
Fixed model parallel beam search bugs in the Qwen2-VL, Qwen2.5-VL, and Qwen3-VL MoE model families, and added documentation for tensor parallelism support with continuous batching.
- [docs] tp for continuous batching (#46019) by @stevhliu in [#46019]
- revisit history parallel beam search tests to avoid unnecessary fix (#46495) by @kaixuanliu in [#46495]
- fix qwen series VL model's model parallel bug (#46316) by @kaixuanliu in [#46316]
Bugfixes and improvements
- Fix the offsets in processing (#46525) by @zucchini-nlp in [#46525]
- Fix buggy action sha pin (#46534) by @ydshieh in [#46534]
- Fix trailing comma bug in DataCollatorForLanguageModeling example (#46527) by @JemmaUZH in [#46527]
- Fix missing Gemma4Processor._compute_audio_num_tokens (#46416) by @csantosbh in [#46416]
- Fix InternVL models (#46524) by @hmellor in [#46524]
- fix(afmoe): reduce tokens in test_compile_static_cache to avoid flaky bfloat16 drift (#46521) by @ydshieh in [#46521]
- [CB] Add a "max_requests_per_batch" parameter (#46434) by @remi-or in [#46434]
- revamp cv docs and fix rf-detr (#46219) by @merveenoyan in [#46219]
- Update hub metadata (#46379) by @zucchini-nlp in [#46379]
- extend DeepseekV4FlashIntegrationTest to non-cuda device (#46517) by @sywangyi in [#46517]
- [docs] deepgemm (#46361) by @stevhliu in [#46361]
- [fix] regression introduced by #45534 (#46456) by @eustlb in [#46456]
- Use torchvision's native LANCZOS interpolation instead of PIL fallback (#46496) by @NicolasHug in [#46496]
- Add debugging info in pr-ci-caller.yml (#46505) by @ydshieh in [#46505]
- Fix tests: 'Cohere2MoeModel' object has no attribute 'hf_device_map' (#46337) by @kaixuanliu in [#46337]
- Bump the actions group across 1 directory with 19 updates (#46414) by @dependabot[bot] in [#46414]
- Log some information in .github/workflows/pr-ci-post-dashboard-link.yml (#46499) by @ydshieh in [#46499]
- feat(quantizers): support non-weight param names in TorchAo safetensors loading (#46325) by @agesf in [#46325]
- docs: fix typo in make_list_of_images docstring (#46469) by @ramkumar27072006 in [#46469]
- add XPU expectation for deepseek_ocr2 model tests (#46492) by @kaixuanliu in [#46492]
- Fix sapiens2 tests: add XPU device expectations (#46488) by @kaixuanliu in [#46488]
- Add vLLM smoke test to CI (#46383) by @hmellor in [#46383]
- extend deepseek v4 test to xpu (#46366) by @sywangyi in [#46366]
- Added cosmos3 model (#46146) by @MaciejBalaNV in [#46146]
- fbgemm_fp8:Keep the current device aligned with the input tensor (#46403) by @kaixuanliu in [#46403]
- [Modular] Add no_inherit_decorators and fixup wrong RoPE related inheritances (#46440) by @Bissmella in [#46440]
- skip deepgemm test except cuda (#46090) by @jiqing-feng in [#46090]
- Fix/video classification pipeline video processor (#46256) by @J3r3myPerera in [#46256]
- ci: less flaky test_assisted_decoding_matches_greedy_search_1_same (#46445) by @ydshieh in [#46445]
- Fix flip_back graph break (#46344) by @guarin in [#46344]
- Add the other processors to auto-mappings (#46046) by @zucchini-nlp in [#46046]
- fix: compatibility with torch<=2.7 (#46393) by @andylin-hao in [#46393]
- fix: remove dynamic per-actor Slack ID lookup in ssh-runner workflow (#46327) by @ydshieh in [#46327]
- [docs] Romanian translation of pipeline_tutorial.md, pipeline_gradio.md, pipeline_webserver.md and add_new_pipeline.md. (#46388) by @filipinescu in [#46388]
- [docs] gemma4 typos (#46351) by @stevhliu in [#46351]
- [docs] padding-free training (#46333) by @stevhliu in [#46333]
- fix[vLLM x v5]: Default untied embeddings in AudioFlamingo3 and VibeVoice (#46400) by @harshaljanjani in [#46400]
- Fix deepspeed docker (#46108) by @SunMarc in [#46108]
- Fix conversion for clip models (#46406) by @zucchini-nlp in [#46406]
- ci: mention code quality failure in CI dashboard comment (#46415) by @ydshieh in [#46415]
- Fix noisy logging from image_processing module aliases issue - 46298 (#46350) by @skshmjn in [#46350]
- Raise tqdm minimum to 4.60 to match tqdm.contrib.logging import (#46397) by @n0gu-furiosa in [#46397]
- fix(gemma4_unified): conversion script and config bugs (#46398) by @douglas-reid in [#46398]
- [docs] remove sparsity from compressed-tensors (#46387) by @stevhliu in [#46387]
- [CB] Fix crashes when fork is not possible (#46251) by @remi-or in [#46251]
- Improve CI dashboard comment: rename and deduplicate (#46412) by @ydshieh in [#46412]
- Fix missing f-string prefixes in error messages (#46354) by @joaopedroassad in [#46354]
- Add workflow to post CI Grafana dashboard link to PR (#46410) by @ydshieh in [#46410]
- [docs] Romanian translation of fast_tokenizers.md, custom_tokenizers.md, tokenizer_summary.md, image_processors.md and video_processors.md. (#46356) by @filipinescu in [#46356]
- Clean up new models after release (#46092) by @zucchini-nlp in [#46092]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @ArthurZucker
- Add deepseek 3.2 exp (#41251)
- @gante
- GPU go brr (#46540)
- @merveenoyan
- revamp cv docs and fix rf-detr (#46219)
- @sgerrard
- Quantization for small models (#46449)
- @MaciejBalaNV
- Added cosmos3 model (#46146)
- @J3r3myPerera
- Fix/video classification pipeline video processor (#46256)
- @filipinescu
Patch release v5.10.2
There was a serious bug in the conversion of CLIP-related models, which affected models such as sam3. Please make sure to update 🙏
- Fix conversion for clip models by @zucchini-nlp (#46406)
Full Changelog: v5.10.1...v5.10.2
Release v5.10.1
v5.10.0 was yanked because we published from a corrupted branch. Sorry everyone, this happens when we rush a release!!!
New Model additions
Gemma4 unified + Gemma4 MTP
Gemma 4 12B Unified is an encoder-free multimodal model with pretrained and instruction-tuned variants. Unlike standard Gemma 4, which uses dedicated encoder towers, Gemma 4 12B Unified projects raw inputs directly into the language model's embedding space through lightweight linear pipelines. This results in a simpler architecture while maintaining strong multimodal performance.
Key differences from standard Gemma 4:
- No Vision Tower: Raw pixel patches are projected directly into LM space via a Dense + LayerNorm pipeline with factorized 2D positional embeddings, replacing the vision encoder.
- No Audio Tower: Raw 16 kHz waveform samples are chunked into fixed-length frames and projected through a simple RMSNorm → Linear pipeline, replacing the mel spectrogram + Conformer encoder.
- Shared Multimodal Pipeline: Both vision and audio use the same Gemma4UnifiedMultimodalEmbedder (RMSNorm → Linear) for the final projection to text hidden space.
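To make the RMSNorm → Linear pipeline in the list above concrete, here is a minimal pure-Python sketch of normalising one frame and projecting it to a wider hidden size. It is not the library's code: the function names, the `eps` value, the absence of a learnable norm scale, and the tiny weight matrix are all assumptions chosen for illustration.

```python
import math
from typing import List


def rms_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x so that its root-mean-square is (approximately) 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def linear(x: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free projection; weight has one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def embed_frame(frame: List[float], weight: List[List[float]]) -> List[float]:
    """RMSNorm followed by Linear, applied to one raw input frame."""
    return linear(rms_norm(frame), weight)


# Hypothetical 2-sample frame projected to a 3-dimensional "hidden" space.
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = embed_frame([3.0, 4.0], toy_weight)
```

The point of the sketch is only that no encoder stack sits between the raw frame and the projected vector.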
You can find the original Gemma 4 12B Unified checkpoints under the Gemma 4 release.
- who needs encoders? (#46385) by @douglas-reid @sgerrard @vasqu @molbap
Sapiens2
Sapiens2 is a family of high-resolution vision transformers pretrained on ~1 billion curated human images, designed for human-centric computer vision tasks including pose estimation, body-part segmentation, surface normal estimation, and pointmap estimation. The models scale from 0.4B to 5B parameters and train at native 1K resolution, with hierarchical 4K variants for extended spatial reasoning. Sapiens2 achieves substantial improvements over its predecessor with +4 mAP in pose estimation, +24.3 mIoU in body-part segmentation, and 45.6% error reduction in normal estimation.
Links: Documentation | Paper
DeepSeek-OCR-2
DeepSeek-OCR-2 is an OCR-specialized vision-language model built on a distinctive architecture that combines a SAM ViT-B vision encoder with a Qwen2 hybrid attention encoder, connected through an MLP projector to a DeepSeek-V2 Mixture-of-Experts (MoE) language model. The model features a hybrid attention mechanism that applies bidirectional attention over image tokens and causal attention over query tokens, enabling efficient and accurate document understanding. It supports both plain OCR tasks and grounding capabilities with coordinate-aware output for document conversion to markdown format.
Links: Documentation
- Add Deepseek-OCR-2 model (#45075) by @thisisiron in #45075
Mellum
Mellum is a code-focused Mixture-of-Experts language model developed by JetBrains. It is derived from the Qwen3-MoE architecture with per-layer-type RoPE and interleaved sliding window attention. The model has 12B total parameters with 2.5B active parameters per token, using 64 routed experts with 8 activated per token across 28 layers.
Links: Documentation
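As a rough illustration of what "64 routed experts with 8 activated per token" means, the sketch below picks the top 8 router logits for one token and turns them into mixing weights. It is a generic top-k routing sketch, not Mellum's code: renormalising with a softmax over only the selected experts is an assumption, and the logits are made up.

```python
import math
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    # Softmax over the selected logits only (assumed normalisation scheme).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Invented logits for one token over 64 experts; 8 experts are activated.
toy_logits = [0.1 * i for i in range(64)]
routed = top_k_route(toy_logits, k=8)
```

Only the 8 selected experts run for that token, which is why the active parameter count is much lower than the total.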
Breaking changes
The Gemma4 vision pooler now casts inputs to float32 before scaling to prevent float16 overflow (inf saturation) with large checkpoints, which may cause minor numerical differences in outputs for users running Gemma-4 vision models in float16.
- 🚨 Fix float16 overflow in Gemma4 vision pooler (#46277) by @Bluear7878
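The overflow described above is easy to reproduce without any tensor library. The sketch below rounds values to float16 through the standard `struct` module's `e` format and saturates to infinity on overflow, as tensor arithmetic does; the activation values and the scale factor of 64 are invented for the example and are not taken from the Gemma4 code.

```python
import math
import struct
from typing import List


def to_float16(x: float) -> float:
    """Round x to float16, saturating to +/-inf when it does not fit."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def to_float32(x: float) -> float:
    """Round x to float32 (the values used here are far from its limit)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def scale_in_fp16(values: List[float], scale: float) -> List[float]:
    # Operands and product are all rounded to float16.
    return [to_float16(to_float16(v) * to_float16(scale)) for v in values]


def scale_in_fp32(values: List[float], scale: float) -> List[float]:
    # Cast up first, then scale: the product stays finite.
    return [to_float32(to_float32(v) * to_float32(scale)) for v in values]


activations = [1200.0, -900.0]  # hypothetical pooled activations
half = scale_in_fp16(activations, 64.0)
single = scale_in_fp32(activations, 64.0)
```

Here 1200 × 64 = 76800 exceeds the largest finite float16 value (65504), so the float16 path saturates while the float32 path does not.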
Audio Language Models (ALMs) now have a dedicated base model class without a language modeling head, aligning them with the design of Vision Language Models (VLMs); users relying on the previous model class structure should update their code to use the new base model class where appropriate.
Parallelization
This release includes numerous bug fixes for model parallelism across multiple models (Gemma4, AltCLIP, ChineseClip, Blip-2, Whisper, Ovis2, Moshi) and parallel execution strategies, including fixes for tensor parallelism (TP), expert parallelism (EP), beam search under model parallel settings, and loss over-counting under TP/EP configurations. The continuous batching manager was also reworked for clearer control flow and improved TP race condition handling, and FSDP initialization via from_pretrained was introduced.
- Fix dsv4 dequant + tp/ep (#46378) by @IlyasMoutawwakil in [#46378]
- [CB] [Major] Rework manager to have clearer control flow + handle TP (#46070) by @remi-or in [#46070]
- fix series of bugs for model parallel beam search (#46280) by @kaixuanliu in [#46280]
- Fix model parallel issue for altclip model and ChineseClip model (#45487) by @kaixuanliu in [#45487]
- Model parallel fix (#46230) by @kaixuanliu in [#46230]
- [Revert] FSDP+Dtensor refactor related changes (#46246) by @vasqu in [#46246]
- Fix model parallel bugs for Gemma4 (#45817) by @kaixuanliu in [#45817]
- init FSDP through from_pretrained (#46102) by @3outeille in [#46102]
- fix model parallel device mismatch issue in create_bidirectional_mask (#46221) by @kaixuanliu in [#46221]
- Trainer.compute_loss: fix loss over-counting under TP and EP-as-TP (#45994) by @AmineDiro in [#45994]
- Fix caching allocator warmup byte estimation for EP model loading (#46149) by @sywangyi in [#46149]
Cache
Fixed a regression in encoder-decoder cache initialization where the decoder config was incorrectly applied to the cross-attention cache, and resolved a RuntimeError caused by buffer size limits when warming up the cache on MPS devices. Additional test infrastructure improvements were made to support read-only cache environments used in CI.
- fix: cache warmup RuntimeError on mps (#46239) by @McPatate in [#46239]
- Make more tests work with read-only cache (#46299) by @ydshieh in [#46299]
- Update a test to avoid writing to the default xet cache (#46250) by @ydshieh in [#46250]
- Fix a regression in encoder-decoder generation cache initialization (#46111) by @kaixuanliu in [#46111]
Quantization
Added support for DeepGEMM BF16, mixed FP8/FP4, and MegaMoE quantization via a grouped linear refactor, while fixing two bugs: an FP8 MoE reverse substring issue affecting DSv4 initialization, and a BitsAndBytes 4-bit/8-bit quantization bug that silently dropped chunked tensors from one-to-many weight converters.
- DeepGEMM BF16 + mixed FP8/FP4 + MegaMoE + refactor (#45634) by @IlyasMoutawwakil in [#45634]
- Fix fp8 moe reverse substring (#46265) by @ArthurZucker in [#46265]
- Fix bnb 4bit/8bit quantization drop chunked tensors bug (#46210) by @kaixuanliu in [#46210]
Bugfixes and improvements
- Fix wrong changes produced by style/repo. check bot (#46371) by @ydshieh in [#46371]
- Fix path traversal when saving Bark voice preset embeddings (#46237) by @LinZiyuu in [#46237]
- Pass library_name/version to Hub calls via a shared HfApi (#46318) by @Wauplin in [#46318]
- docs: update ACL Anthology URL in CITATION.cff (#46352) by @irfaan101 in [#46352]
- [docs] contributing (#45465) by @stevhliu in [#45465]
- [docs] Romanian translation of contributing.md, modular_transformers.md, multimodal_processing.md, add_vision_processing_components.md, add_audio_processing_components.md, modeling_rules.md, model_output_tracing.md, auto_docstring.md, testing.md, pr_checks.md and add_new_model.md. (#46345) by @filipinescu in [#46345]
- [docs] xpu continuous batching (#46334) by @stevhliu in [#46334]
- Fix incorrect attribute mapping relationships in GLM MoE DSA Config (#46338) by @Dovis01 in [#46338]
- Fix grammar typos in Whisper documentation (#46336) by @calliec-1223 in [#46336]
- [docs] update num_items_in_batch for causal LMs (#46335) by @stevhliu in [#46335]
- Update compressed tensors minimum version (#46342) by @SunMarc in [#46342]
- Fix _is_package_available reporting available without a version (#46125) by @blipbyte in [#46125]
- remove sec (#46346) by @ydshieh in [#46346]
- fix: include transitive relative imports when loading from local directory (#46022) by @trducng in [#46022]
- perf(feature_extraction_sequence): skip re-splitting already-batched numpy arrays in pad() (#46329) by @Anai-Guo in [#46329]
- [Zamba] Support attn_implementation dispatch (#46317) by @YangKai0616 in [#46317]
- Fix TestAppRoutes test failures caused by deprecated asyncio.get_event_loop() on Python 3.10+ (#46340) by @ydshieh in [#46340]
- [Qwen3VL] Fix video token placeholder: use self.video_token instead of hardcoded "<|placeholder|>" (#46296) by @kpal002 in [#46296]
- chore(linter): fixes for rule 16 (#46023) by @tarekziade in [#46023]
- [docs] Romanian translation of weightconverter.md, models.md, custom_models.md, monkey_patching.md, fusion_mapping.md, how_to_hack_models.md, model_sharing.md and serialization.md. (#46309) by @filipinescu in [#46309]
- Normalize CUDA OOM errors when comparing commit failures in check_bad_commit (#46322) by @ydshieh in [#46322]
- Fix unhandled exception noise from background safetensors conversion thread (#45752) by @dhruv7477 in [#45752]
- Add Expectations for pipeline token classification tests (#46151) by @ka...
Release v5.9.0
New Model additions
Cohere2Moe
Command A+ is a Mixture-of-Experts (MoE) language model from Cohere that features a hybrid attention pattern combining sliding window and full attention layers. The model incorporates both shared and routed experts and supports a very large context window for processing extensive text sequences.
Links: Documentation
- Add new cohere2_moe model (#46115) by @Cyrilvallez in #46115
Parakeet tdt (#44171)
HRM-Text
HRM-Text is an improved autoregressive language-modeling variant of the Hierarchical Reasoning Model (HRM) that uses a hierarchical recurrent forward pass with two transformer stacks - one for slow, abstract planning (H) and one for fast, detailed computation (L) - reused inside a nested recurrence. It features PrefixLM attention where instruction tokens attend bidirectionally while response tokens attend causally, per-head sigmoid output gates, and parameterless RMSNorm. The model is designed as a base language model without instruction tuning or chat templates.
Links: Documentation | Paper
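The PrefixLM attention pattern described above (instruction tokens attend bidirectionally, response tokens attend causally) can be written as a boolean mask. This is a minimal sketch under the assumption that the instruction occupies the first `prefix_len` positions; the function name is hypothetical and this is not the HRM-Text implementation.

```python
from typing import List


def prefix_lm_mask(seq_len: int, prefix_len: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [
            # Causal rule for everyone, plus full visibility inside the prefix.
            k <= q or (q < prefix_len and k < prefix_len)
            for k in range(seq_len)
        ]
        for q in range(seq_len)
    ]


# Two instruction tokens followed by two response tokens.
mask = prefix_lm_mask(seq_len=4, prefix_len=2)
```

With `prefix_len=0` the mask reduces to an ordinary causal (lower-triangular) mask.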
Breaking changes
The text_embeds input for SAM3, EdgeTAM, and SAM3-Lite-Text models now expects full text embeddings instead of just pooler outputs, aligning with other models in the library — users must update their inputs accordingly.
- 🚨Fix memory leaks caused by lru decorators in vision models (#45922) by @yonigozlan
Audio
Audio support was expanded with the addition of AudioFlamingoNext model checkpoints and improved compilability of audio/vision encoders via standalone pure functions. Additional improvements include better error messaging when loading audio from video files and new documentation for audio/video processors.
- user friendly error when loading audio from video (#45221) by @eustlb in [#45221]
- [docs] adding audio/video processors (#45795) by @stevhliu in [#45795]
- Support Audio Flamingo Next checkpoints (#44830) by @lashahub in [#44830]
- Extract dynamic vision/audio tensors into standalone pure functions (#45396) by @IlyasMoutawwakil in [#45396]
Generation
Fixed generation issues including inputs_embeds and per_layer_inputs handling for Gemma4, an AttributeError in RAG's generate() caused by missing config fields, and flaky VLM generation tests by blocking special image tokens during sampling.
- Fix Gemma4 generation from inputs_embeds and per_layer_inputs (#46049) by @Cyrilvallez in [#46049]
- Fix AttributeError in RAG generate() for missing config fields (#46035) by @Sriniketh24 in [#46035]
- Block image_start/end_token_id in generation test sampling (#45914) by @Rocketknight1 in [#45914]
Bugfixes and improvements
- Remove mask visualization tool from masking_utils.py (#46066) by @Cyrilvallez in [#46066]
- fix: owned_by field in GET /v1/models returns list instead of string (#46006) by @nileshpatil6 in [#46006]
- [CB] Remove OpenTelemetry (#45984) by @remi-or in [#45984]
- docs(readme): use canonical huggingface.co domain in prose links (#46042) by @kiwigitops in [#46042]
- Fix remaining RAG doc examples that crash on current transformers (#46044) by @Sriniketh24 in [#46044]
- Init the actual tensor, not a copy (#46030) by @Rocketknight1 in [#46030]
- docs: sync legacy ACL anthology URLs and update metrics across i18n READMEs (#46027) by @irfaan101 in [#46027]
- [MultimodalLM] add language_model to the get/set_input_embeddings logic (#46029) by @eustlb in [#46029]
- [HRM Text] Add integration tests (#46033) by @vasqu in [#46033]
- hy_v3: add XPU expectations (#45858) by @kaixuanliu in [#45858]
- exaone4_5: add XPU expectations (#45890) by @kaixuanliu in [#45890]
- hyperclovax: add XPU Expectations for CI test (#45926) by @kaixuanliu in [#45926]
- chore(ci): remove dead env vars from circleci-failure-summary-comment.yml (#45972) by @XciD in [#45972]
- [CB] [Major] Add tensor paralellism (#45821) by @remi-or in [#45821]
- docs: update models architecture count and sync ACL anthology URLs (#46001) by @irfaan101 in [#46001]
- bugfix(ci): avoid E2BIG in pr_slow_ci_suggestion (#45983) by @tarekziade in [#45983]
- RFDetr - use correct Roboflow org for release (#45946) by @sbucaille in [#45946]
- docs: Fix formatting issues in weightconverter.md (#45988) by @ArjunSrivastava1 in [#45988]
- Fix colqwen2 test (#45981) by @IlyasMoutawwakil in [#45981]
- Fix M-RoPE device mismatch in Qwen3VL family under FSDP2 CPU offload (#45861) by @jamesbraza in [#45861]
- [docs] chat template prefill (#45947) by @stevhliu in [#45947]
- [docs] decode fast path (#45899) by @stevhliu in [#45899]
- fix: restore _attn_implementation and fix request offset in generate_batch() (#45943) by @sergiopaniego in [#45943]
- Expose per_layer_inputs for every Gemma4 variants (#45927) by @Cyrilvallez in [#45927]
- chore: update benchmark_v2.yml (#45966) by @hf-security-analysis[bot] in [#45966]
- fix(ci): set persist-credentials: false on actions/checkout and close remaining template injection findings (#45964) by @XciD in [#45964]
- chore(ci): set default workflow permissions to contents: read (#45961) by @XciD in [#45961]
- fix(ci): remove template injection on pull_request_target workflows (#45956) by @XciD in [#45956]
- chore(ci): pin all GitHub Actions and reusable workflows by SHA (#45955) by @XciD in [#45955]
- [docs] ALMModelTest (#45900) by @stevhliu in [#45900]
- Enhance apply_chat_template to support custom field prefilling (reasoning_content, thinking, etc.) (#45896) by @Mamiglia in [#45896]
- BUGFIX: Support hubert models that don't have conv_pos_batch_norm configured (#45921) by @igordertigor in [#45921]
- Revert 45777 (#45942) by @Rocketknight1 in [#45942]
- pass the otel secrets (#45933) by @tarekziade in [#45933]
- Add initial torch_tpu backend support (#45918) by @tengomucho in [#45918]
- [CB] Hide activation footprint by using the CUDA graph pool (#45911) by @remi-or in [#45911]
- Require input_ids for repetition penalty (#45389) by @ruben-aghayan in [#45389]
- Fix undefined 'input' variable (#45895) by @fullyz in [#45895]
- Fix post processing RF-DETR (#46041) by @yonigozlan (direct commit on v5.9.0)
- [loading] Free up tensors faster inside ConversionOps (#46110) by @Cyrilvallez (direct commit on v5.9.0)
- Add new cohere2_moe model (#46115) by @Cyrilvallez (direct commit on v5.9.0)
- Fix cohere2 tp_plan for release by @Cyrilvallez (direct commit on v5.9.0)
- Release v5.9.0 by @Cyrilvallez (direct commit on v5.9.0)
Significant community contributions
The following contributors have made significant changes to the library over the last release:
Patch release v5.8.1
This release is mainly to fix the Deepseek V4 integration!!!
- [fix] Add fatal_error to ContinuousBatchingManager so the serving... by @qgallouedec, @remi-or
- Fix WeightConverter regex incorrectly matching shared_experts as experts by @silencelamb, @claude
- Fix deepseek v4 by @ArthurZucker (#45892)
- Deepseek v4 csa mask collapse by @ArthurZucker, @Sawyer117 (#45928)
Release v5.8.0
New Model additions
DeepSeek-V4
DeepSeek-V4 is the next-generation MoE (Mixture of Experts) language model from DeepSeek that introduces several architectural innovations over DeepSeek-V3. The architecture replaces Multi-head Latent Attention (MLA) with a hybrid local + long-range attention design, swaps residual connections for Manifold-Constrained Hyper-Connections (mHC), and bootstraps the first few MoE layers with a static token-id → expert-id hash table. This implementation covers DeepSeek-V4-Flash, DeepSeek-V4-Pro, and their -Base pretrained variants, which share the same architecture but differ in width, depth, expert count and weights.
Links: Documentation | Paper
- Add DeepSeek V4 (#45643) by @ArthurZucker in #45643
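The idea of routing by a static token-id → expert-id table, with no learned router involved, can be sketched as a plain lookup. How DeepSeek-V4 actually builds its table is not described in these notes, so the seeded random construction, the function names, and the sizes below are all hypothetical.

```python
import random
from typing import List


def build_static_table(vocab_size: int, num_experts: int, seed: int = 0) -> List[int]:
    """Fixed token-id -> expert-id table (construction is an assumption)."""
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]


def route_by_token_id(token_ids: List[int], table: List[int]) -> List[int]:
    """Look up each token's expert; the same token id always maps to the same expert."""
    return [table[t] for t in token_ids]


table = build_static_table(vocab_size=100, num_experts=8)
experts = route_by_token_id([5, 42, 5, 99], table)
```

Because the table never changes, the routing decision depends only on the token id and not on the hidden state.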
Gemma 4 Assistant
Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and associated candidate generator. The model shares the same Gemma4TextModel backbone as other Gemma 4 models but uses KV sharing throughout the entire model, allowing it to reuse the KV cache populated by the target model and skip the pre-fill phase entirely. This architecture includes cross-attention to make the most of the target model's context, allowing the assistant to accurately predict more drafted tokens per drafting round.
Links: Documentation
- First model (#45788) by @SindhuRaghuram97 in #45788
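For readers new to speculative decoding, the draft-then-verify loop that an assistant model plugs into can be sketched as follows. This is a generic greedy sketch with toy next-token functions, not the Gemma 4 candidate generator: it ignores KV sharing and cross-attention, and it calls the target once per token for clarity, whereas real implementations verify all drafted tokens in a single forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft_next: NextToken, target_next: NextToken, num_draft: int
) -> List[int]:
    """One round: the draft proposes num_draft tokens, the target verifies them."""
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(draft_next(prefix + drafted))

    accepted: List[int] = []
    for token in drafted:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and stop this round.
            accepted.append(expected)
            return accepted
        accepted.append(token)
    # Every draft was accepted, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except right after a 3.
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


new_tokens = speculative_step([1], toy_draft, toy_target, num_draft=4)
```

The output always matches what greedy decoding with the target alone would produce; the gain comes from accepting several tokens per target pass when the draft is accurate.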
GraniteSpeechPlus
Granite Speech Plus is a variant of Granite Speech that enhances the projector by consuming the concatenation of the encoder's final hidden states with an arbitrary subset of its intermediate hidden states along the feature dimension. It is a multimodal speech-to-text model that can transcribe audio, provide speaker annotation and word level timestamps by responding to text prompts. The model inherits the same architecture components as Granite Speech including the speech encoder, query transformer projector, language model, and optional LoRA adapter.
Links: Documentation
Granite4Vision
Granite Vision 4.1 is a vision-language model from IBM Research designed for enterprise-grade document data extraction. It specializes in chart extraction (Chart2CSV, Chart2Summary, Chart2Code), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction. The model builds on LLaVA-NeXT with architectural innovations including SigLIP2 Vision Encoder, Window Q-Former Projectors, and DeepStack Feature Injection with 8 vision-to-LLM injection points.
Links: Documentation
- Add Granite 4.1 Vision (granite4_vision) (#45597) by @artem-spector in #45597
EXAONE-4.5
EXAONE 4.5 is the first open-weight vision language model developed by LG AI Research, integrating a dedicated visual encoder into the existing EXAONE 4.0 framework to expand multimodal capabilities. The model features 33 billion parameters in total, including 1.2 billion parameters from the vision encoder, and achieves competitive performance in general benchmarks while outperforming similar-sized models in document understanding and Korean contextual reasoning. It builds on EXAONE 4.0 with key enhancements including an expanded vocabulary of 153,600 tokens, support for up to 256K token context windows, and a Multi-Token Prediction (MTP) mechanism.
Links: Documentation | Paper | Blog Post
PP-FormulaNet
PP-FormulaNet-L and PP-FormulaNet_plus-L are lightweight models designed for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The models are part of the SLANet series and can be used for image-to-text tasks, specifically for detecting and processing mathematical formulas and table structures from images.
Links: Documentation
- [Model] Add PP-FormulaNet Model Support (#45626) by @zhang-prog in #45626
Breaking changes
Apex integration has been removed from the library (including RMSNorm usage in T5 and related models), so users relying on Apex for mixed precision or fused ops should migrate to PyTorch's native equivalents instead.
- 🚨 Get rid of most Apex references (#45723) by @Rocketknight1
Tokenization
Fixed tokenizer mapping issues for DeepSeek R1 distilled (Qwen2) and DeepSeek OCR models, and resolved a significant performance regression in PreTrainedTokenizer.convert_ids_to_tokens where skip_special_tokens=True was rebuilding the special token set on every iteration, resulting in a ~300x speedup for that code path.
- deepseek r1 distilled tokenizer fix for qwen2 mapping (#45741) by @itazap in [#45741]
- DeepSeek OCR specifies an incorrect tokenizer class on the Hub (#45739) by @hmellor in [#45739]
- PythonBackend slow tokenizer convert_ids_to_tokens fix (#45728) by @i3hz in [#45728]
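The regression described above follows a common pattern: a loop-invariant set is constructed inside the loop. The sketch below illustrates that pattern and its fix with hypothetical helpers (`convert_slow`, `convert_fast`) and a toy vocabulary. It is not the library's implementation, only a minimal model of the same mistake.

```python
def convert_slow(ids, vocab, special_ids):
    """Anti-pattern: the set of special ids is rebuilt for every token."""
    tokens = []
    for i in ids:
        if i in set(special_ids):  # O(len(special_ids)) work on each iteration
            continue
        tokens.append(vocab[i])
    return tokens

def convert_fast(ids, vocab, special_ids):
    """Fix: build the set once, then do O(1) membership checks."""
    special = set(special_ids)
    return [vocab[i] for i in ids if i not in special]

# Toy data, made up for illustration.
vocab = {0: "<s>", 1: "</s>", 2: "hello", 3: "world"}
ids = [0, 2, 3, 1]
special_ids = [0, 1]

print(convert_slow(ids, vocab, special_ids))
print(convert_fast(ids, vocab, special_ids))
```

Both functions return the same tokens. Only the amount of repeated work differs, and the gap grows with the number of special tokens and the length of the input.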
Bugfixes and improvements
- fix: correct spelling in continuous_api docstring (#45749) by @Dhruv908615 in [#45749]
- Fix link to modular transformers documentation (#45746) by @SangbumChoi in [#45746]
- Gemma4: fix failed test cases (#45568) by @kaixuanliu in [#45568]
- Fix CI: Allow more artifacts to be download in CI (#45785) by @ydshieh in [#45785]
- Add `concurrency` to `PR CI` workflow file (`pr-ci-caller.yml`) (#45786) by @ydshieh in [#45786]
- Reorder decorators for autodoc and dataclass (#45702) by @zucchini-nlp in [#45702]
- Unwrap `text_config` in `AutoModelFor*.from_config` (#45770) by @jamesbraza in [#45770]
- fix: Added Mps support in float fallback backends list (#45687) by @rigen1048 in [#45687]
- Github Actions PR CI (caller) (#45476) by @ydshieh in [#45476]
- make sure we call check_auto in CI (#45775) by @tarekziade in [#45775]
- Fix auto mapping script (#45774) by @Cyrilvallez in [#45774]
- [MINISTRAL3] Fix conversion script yarn's apply_scale support. (#45744) by @juliendenize in [#45744]
- [nemotron_h] respect _no_reinit flag on dt_bias and out_proj.weight (#45591) by @vai-minzhou in [#45591]
- fix(utils): Resolve backbone utils test regressions (#45594) by @harshaljanjani in [#45594]
- [CB] Better overall script and decode bucketting (#45653) by @remi-or in [#45653]
- [docs] model testing (#45152) by @stevhliu in [#45152]
- update dev (#45726) by @vasqu in [#45726]
- Doc translate to Persian(farsi) (#45664) by @zeoses in [#45664]
- [`OAI Privacy Filter`] Add integration test (#45725) by @vasqu in [#45725]
- Speedup Qwen2VLImageProcessor (#45719) by @lgeiger in [#45719]
- Remove dead beam-search dummies from dummy_pt_objects.py (#45722) by @jw9603 in [#45722]
- chore(typing): add ty type checking for 10 utility files (#45703) by @moonbogi in [#45703]
- Llama3 video fix (#45040) by @sywangyi in [#45040]
- Fix custom-module copies inheriting read-only permissions (#45686) by @nurpax in [#45686]
- Python code in model docs (#45608) by @zucchini-nlp in [#45608]
- fix failed test cases for blt model (#45596) by @kaixuanliu in [#45596]
- chore(typing): add ty type checking for 3 pipeline files (#45667) by @moonbogi in [#45667]
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @artem-spector
- Add Granite 4.1 Vision (granite4_vision) (#45597)
- @SindhuRaghuram97
- First model (#45788)
- @nuxlear
- Add EXAONE 4.5 implementations (#45471)
- @ArthurZucker
- Add DeepSeek V4 (#45643)
- @remi-or
- [CB] Better overall script and decode bucketting (#45653)
- @zhang-prog
- [Model] Add PP-FormulaNet Model Support (#45626)
- @zvik
- Support for a new Granite-Speech-Plus model (#45695)
Release v5.7.0
New Model additions
Laguna
Laguna is Poolside's mixture-of-experts language model family that extends standard SwiGLU MoE transformers with two key innovations. It features per-layer head counts allowing different decoder layers to have different query-head counts while sharing the same KV cache shape, and implements a sigmoid MoE router with auxiliary-loss-free load balancing that uses element-wise sigmoid of gate logits plus learned per-expert bias for router scoring.
Links: Documentation
- Laguna XS.2 implementation (#45673) by @joerowell in #45673
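A sigmoid router with a per-expert bias can be sketched in a few lines. The example below is a minimal single-token illustration, not Laguna's actual code. The function names, the `top_k` value, the bias values, and the choice to compute mixing weights from the unbiased scores (renormalized over the selected experts) are all assumptions made for the sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_token(gate_logits, expert_bias, top_k):
    """Pick top_k experts for one token and return (expert_index, weight) pairs."""
    # Element-wise sigmoid: each expert is scored independently (no softmax competition).
    scores = [sigmoid(z) for z in gate_logits]
    # The per-expert bias shifts the scores used for selection.
    biased = [s + b for s, b in zip(scores, expert_bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # Assumption: mixing weights use the unbiased scores, renormalized over chosen experts.
    total = sum(scores[e] for e in chosen)
    return [(e, scores[e] / total) for e in chosen]

logits = [2.0, 1.0, 0.5, -1.0]
no_bias = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route_token(logits, [0.0, 0.0, 0.0, 1.0], top_k=2)

print([e for e, _ in no_bias])    # experts chosen from the raw scores
print([e for e, _ in with_bias])  # a positive bias pulls expert 3 into the selection
```

The point of the bias is that load can be rebalanced by nudging selection toward underused experts, without adding an auxiliary loss term to the training objective.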
DEIMv2
DEIMv2 (DETR with Improved Matching v2) is a real-time object detection model that extends DEIM with DINOv3 features and spans eight model sizes from X to Atto for diverse deployment scenarios. It uses a Spatial Tuning Adapter (STA) for larger variants to convert DINOv3's single-scale output into multi-scale features, while ultra-lightweight models employ pruned HGNetv2 backbones. The unified design achieves superior performance-cost trade-offs, with DEIMv2-X reaching 57.8 AP with only 50.3M parameters and DEIMv2-S being the first sub-10M model to exceed 50 AP on COCO.
Links: Documentation | Paper
- model: Add DEIMv2 to Transformers (#44339) by @harshaljanjani in #44339
Attention
Several attention-related bugs were fixed across multiple models, including a cross-attention cache type error in T5Gemma2 for long inputs, incorrect cached forward behavior in Qwen3.5's gated-delta-net linear attention, and a crash in GraniteMoeHybrid when no Mamba layers are present. Attention function dispatch was also updated to align with the latest model implementations.
- Fix cross-attention cache layer type for T5Gemma2 long inputs (#45540) by @Beichen-Ma in [#45540]
- [Qwen3.5] Fix GDN linear attention multi-token cached forward (#45513) by @kashif in [#45513]
- Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models (#45514) by @tianhaocui in [#45514]
- Align latest model attention function dispatch (#45598) by @Cyrilvallez in [#45598]
Tokenizers
There was a bug in AutoTokenizer that caused the wrong tokenizer class to be initialized. This caused regressions in models like DeepSeek R1.
Generation
Continuous batching generation received several fixes and improvements, including correcting KV deduplication and memory estimation for long sequences (16K+), and removing misleading warnings about num_return_sequences and other unsupported features that were incorrectly firing even when functionality worked correctly. Documentation for per-request sampling parameters was also added.
- generate: drop stale num_return_sequences warning on continuous batching path (#45582) by @joaquinhuigomez in [#45582]
- Remove unnecessary generate warnings (#45619) by @Cyrilvallez in [#45619]
- [CB] Changes for long generation (#45530) by @remi-or in [#45530]
- [docs] per-request sampling params (#45553) by @stevhliu in [#45553]
Kernels
Improved kernel support by fixing configuration reading and error handling for FP8 checkpoints (e.g., Qwen3.5-35B-A3B-FP8), enabling custom expert kernels registered from the HF Hub to be properly loaded, and resolving an incompatibility that prevented Gemma3n and Gemma4 from using the rotary kernel.
- Fix configuration reading and error handling for kernels (#45610) by @hmellor in [#45610]
- Allow for registered experts from kernels hub (#45577) by @winglian in [#45577]
- Gemma3n and Gemma4 cannot use rotary kernel (#45564) by @Cyrilvallez in [#45564]
Bugfixes and improvements
- fixing more typos (#45689) by @vasqu in [#45689]
- [docs] cb memory management (#45587) by @stevhliu in [#45587]
- [docs] cpu offloading (#45660) by @stevhliu in [#45660]
- docs(README_zh-hans): clarify conditions for not using Transformers (#45688) by @GuaiZai233 in [#45688]
- fix padding side issue for fast_vlm tests (#45592) by @kaixuanliu in [#45592]
- Fix `x_clip`: 8 failed test cases (#45394) by @kaixuanliu in [#45394]
- zero_shot_object_detection ValueError fix for python 3.13 (#45669) by @AnkitAhlawat7742 in [#45669]
- Fix pageable H2D copies in Gated DeltaNet PyTorch fallback (#45665) by @ruixiang63 in [#45665]
- Fix UnboundLocalError in shard_and_distribute_module for replicated parameters (#45675) by @Abdennacer-Badaoui in [#45675]
- [MistralCommonBackend] Soften validation mode and apply_chat_template arguments check (#45628) by @juliendenize in [#45628]
- Fix `NameError: PeftConfigLike` triggered by `PreTrainedModel.__init_subclass__` (#45658) by @qgallouedec in [#45658]
- chore(typing): added modeling_utils to ty (#45425) by @tarekziade in [#45425]
- [gemma4] infer from config instead of hardcoding (#45606) by @eustlb in [#45606]
- Update quants tests (#45480) by @SunMarc in [#45480]
- 🔴🔴🔴 fix: skip `clean_up_tokenization` for BPE tokenizers in `PreTrainedTokenizerFast` (#44915) by @maxsloef-goodfire in [#44915]
- Fix colmodernvbert tests (#45652) by @Cyrilvallez in [#45652]
- [CB] [Major] Add CPU request offloading (#45184) by @remi-or in [#45184]
- Fix peft constructors (#45622) by @Cyrilvallez in [#45622]
- chore: speedup modular converter (~30%) (#45046) by @tarekziade in [#45046]
- Fix whisper return language (#42227) by @FredHaa in [#42227]
- Add `supports_gradient_checkpointing` to `NemotronHPreTrainedModel` (#45625) by @sergiopaniego in [#45625]
- Raise clear error for `problem_type="single_label_classification"` with `num_labels=1` (#45611) by @gaurav0107 in [#45611]
- CircleCI with torch 2.11 (#45633) by @ydshieh in [#45633]
- chore: bump doc-builder SHA for main doc build workflow (#45631) by @rtrompier in [#45631]
- Allow more artifacts to be download in CI (#45629) by @ydshieh in [#45629]
- chore(qa): split pipeline and add type checking (#45432) by @tarekziade in [#45432]
- Skip failing offloading tests (#45624) by @Cyrilvallez in [#45624]
- fix: compute auxiliary losses when denoising is disabled in D-FINE (#45601) by @Abineshabee in [#45601]
- qa: bumped mlinter and allow local override (#45585) by @tarekziade in [#45585]
- Processing Utils: continue when content is a string (#45605) by @RyanMullins in [#45605]
- SonicMoe (#45433) by @IlyasMoutawwakil in [#45433]
- fix transformers + torchao nvfp4 serialization (#45573) by @vkuzo in [#45573]
- [AMD CI] Fix expectations for Gemma3n (#45602) by @Abdennacer-Badaoui in [#45602]
- [docs] multi-turn tool calling (#45554) by @stevhliu in [#45554]
- Fix `AttributeError` on `s_aux=None` in `flash_attention_forward` (#45589) by @jamesbraza in [#45589]
- do not index past decoded chars with special tokens (#45435) by @itazap in [#45435]
- Update dev version (#45583) by @vasqu in [#45583]
- Update torchao usage for XPU and CPU (#45560) by @jiqing-feng in [#45560]
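One entry in the list above (#44915) skips `clean_up_tokenization` for BPE tokenizers. The sketch below shows why such a cleanup step can matter, presumably because BPE decoding already reproduces whitespace exactly, so any further rewriting changes the text. The replacement list is an assumed, simplified one chosen for illustration and may differ from what the library applies.

```python
# Assumed, simplified replacement list: removes the space before common punctuation.
REPLACEMENTS = [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ",")]

def clean_up_tokenization(text):
    """Illustrative cleanup pass that rewrites decoded text in place."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

decoded = "values = [1 , 2 , 3] ."   # text whose spacing was intentional
cleaned = clean_up_tokenization(decoded)

print(repr(decoded))
print(repr(cleaned))
print(decoded == cleaned)  # False: the cleanup step altered the decoded text
```

For word-level tokenizers that insert spaces around punctuation, this kind of cleanup restores natural text. For text that already has exact spacing, it silently breaks the encode/decode round trip.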
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @vasqu
- @joerowell
- Laguna XS.2 implementation (#45673)
- @tarekziade
- @harshaljanjani
- model: Add DEIMv2 to Transformers (#44339)
- @remi-or
Patch release v5.6.2
Qwen 3.5 and 3.6 MoE (text-only) were broken when used with FP8. They should now work again with this release 🫡
Full Changelog: v5.6.1...v5.6.2
Patch release v5.6.1
Flash attention path was broken! Sorry everyone for this one 🤗
- Fix AttributeError on s_aux=None in flash_attention_forward (#45589) by @jamesbraza