SAMBA is a hybrid neural architecture that effectively processes very long sequences by combining Sliding Window Attention (SWA) with Mamba, a state space model (SSM). SAMBA achieves speed and memory efficiency by fusing the exact recall capabilities of attention with the linear-time recurrent dynamics of Mamba. SAMBA surpasses Transformers and pure SSMs on important benchmarks like MMLU and GSM8K after being trained on 3.2 trillion tokens with up to 3.8 billion parameters.

Microsoft’s SAMBA Model Redefines Long-Context Learning for AI

2025/10/28 17:13

:::info Authors:

(1) Liliang Ren∗, Microsoft and University of Illinois at Urbana-Champaign (liliangren@microsoft.com);

(2) Yang Liu†, Microsoft (yaliu10@microsoft.com);

(3) Yadong Lu†, Microsoft (yadonglu@microsoft.com);

(4) Yelong Shen, Microsoft (yelong.shen@microsoft.com);

(5) Chen Liang, Microsoft (chenliang1@microsoft.com);

(6) Weizhu Chen, Microsoft (wzchen@microsoft.com).

:::

Abstract and 1. Introduction

  2. Methodology

  3. Experiments and Results

    3.1 Language Modeling on Textbook Quality Data

    3.2 Exploration on Attention and Linear Recurrence

    3.3 Efficient Length Extrapolation

    3.4 Long-Context Understanding

  4. Analysis

  5. Conclusion, Acknowledgement, and References

A. Implementation Details

B. Additional Experiment Results

C. Details of Entropy Measurement

D. Limitations


Abstract

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either quadratic computation complexity or limited extrapolation ability on length generalization. In this work, we present SAMBA, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). SAMBA selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale SAMBA up to 3.8B parameters with 3.2T training tokens and show that SAMBA substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K-length sequences, SAMBA can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token prediction up to 1M context length. As a linear-time sequence model, SAMBA enjoys 3.73× higher throughput than Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64× speedup when generating 64K tokens with unlimited streaming. A sample implementation of SAMBA is publicly available at https://github.com/microsoft/Samba.

1 Introduction

Attention-based models [VSP+17, BCB14] have dominated the neural architectures of Large Language Models (LLMs) [RWC+19, BMR+20, Ope23, BCE+23] due to their ability to capture complex long-term dependencies and their efficient parallelization for large-scale training [DFE+22]. Recently, State Space Models (SSMs) [GGR21, SWL23, GGGR22, GD23] have emerged as a promising alternative, offering linear computation complexity and the potential for better extrapolation to sequences longer than those seen during training. Specifically, Mamba [GD23], a variant of SSMs equipped with selective state spaces, has demonstrated notable promise through strong empirical performance and an efficient hardware-aware implementation. Recent work also shows that Transformers have poorer modeling capacity than input-dependent SSMs on state-tracking problems [MPS24]. However, SSMs struggle with memory recall due to their Markovian nature [AET+23], and experimental results on information-retrieval-related tasks [FDS+23, WDL24, AEZ+24] have further shown that SSMs are not as competitive as their attention-based counterparts.

Previous works [ZLJ+22, FDS+23, MZK+23, RLW+23] have explored different approaches to hybridizing SSMs and the attention mechanism, but none of them achieve unlimited-length extrapolation with linear-time complexity. The existing length generalization techniques [HWX+23, XTC+23, JHY+24] developed for the attention mechanism suffer from quadratic computation complexity or limited context extrapolation ability. In this paper, we introduce SAMBA, a simple neural architecture that harmonizes the strengths of both SSM- and attention-based models while achieving unlimited sequence-length extrapolation with linear time complexity. SAMBA combines SSMs with attention through layer-wise interleaving of Mamba [GD23], SwiGLU [Sha20], and Sliding Window Attention (SWA) [BPC20]. Mamba layers capture the time-dependent semantics and provide a backbone for efficient decoding, while SWA fills the gap by modeling complex, non-Markovian dependencies.

Figure 1: SAMBA shows improved prediction up to 1M tokens on the Proof-Pile test set while achieving 3.64× faster decoding throughput than the Llama-3 architecture [Met24] (a state-of-the-art Transformer [VSP+17] with Grouped-Query Attention [ALTdJ+23]) at 64K generation length. We also include an SE-Llama-3 1.6B baseline, which applies the SelfExtend [JHY+24] approach for zero-shot length extrapolation. Throughput is measured on a single A100 80GB GPU. All models are trained on the Phi-2 [LBE+23] dataset with a 4K sequence length.

We scale SAMBA to 421M, 1.3B, 1.7B, and 3.8B parameters. In particular, the largest 3.8B base model pre-trained with 3.2T tokens achieves a score of 71.2 on MMLU [HBB+21], 54.9 on HumanEval [CTJ+21], and 69.6 on GSM8K [CKB+21], substantially outperforming strong open-source language models of up to 8B parameters, as detailed in Table 1. Despite being pre-trained with a 4K sequence length, SAMBA can be extrapolated to 1M context length zero-shot with improved perplexity on Proof-Pile [ZAP22], while still maintaining linear decoding time complexity with unlimited token streaming, as shown in Figure 1. We show that when instruction-tuned with a 4K context length for only 500 steps, SAMBA can be extrapolated to a 256K context length with perfect memory recall in Passkey Retrieval [MJ23]. In contrast, the fine-tuned SWA-based model simply cannot recall memories beyond the 4K length. We further demonstrate that the instruction-tuned SAMBA 3.8B model achieves significantly better performance than the SWA-based models on downstream long-context summarization tasks, while still keeping its impressive performance on short-context benchmarks. Finally, we conduct rigorous and comprehensive analyses and ablation studies, encompassing models of up to 1.7 billion parameters, to validate the architectural design of SAMBA. These meticulous investigations not only justify our architectural designs but also elucidate the potential mechanisms underpinning the remarkable effectiveness of this simple hybrid approach.

2 Methodology

We explore different hybridization strategies consisting of Mamba, Sliding Window Attention (SWA), and Multi-Layer Perceptron (MLP) [Sha20, DFAG16] layers. We conceptualize the functionality of Mamba as the capture of recurrent sequence structures, SWA as the precise retrieval of memory, and MLP as the recall of factual knowledge. We also explore other linear recurrent layers, including Multi-Scale Retention [SDH+23] and GLA [YWS+23], as potential substitutes for Mamba in Section 3.2. Our goal in hybridization is to harmonize these distinct functional blocks and find an efficient architecture for language modeling with unlimited-length extrapolation ability.

2.1 Architecture

As illustrated in Figure 2, we explore three layer-wise hybridization strategies at the 1.7B scale: Samba, Mamba-SWA-MLP, and Mamba-MLP. We also explore other hybridization approaches with full self-attention at smaller scales in Section 4. The number of layers N is set to 48 for Samba, Mamba-MLP, and Mamba, while Mamba-SWA-MLP has 54 layers, so each model has approximately 1.7B parameters. We only modify the layer-level arrangement for each of the models and keep every other configuration the same for apples-to-apples comparisons. More details on the configuration of each layer are explained in the following subsections.

Figure 2: From left to right: Samba, Mamba-SWA-MLP, Mamba-MLP, and Mamba. The illustrations depict the layer-wise integration of Mamba with various configurations of Multi-Layer Perceptrons (MLPs) and Sliding Window Attention (SWA). We assume the total number of intermediate layers to be N, and omit the embedding layers and output projections for simplicity. Pre-Norm [XYH+20, ZS19] and skip connections [HZRS16] are applied for each of the intermediate layers.
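To make the wrapping described in the caption concrete, here is a hypothetical sketch of the Pre-Norm/skip-connection wrapper and a layer-wise interleaved stack. The per-block ordering (Mamba, MLP, SWA, MLP for Samba), the choice of LayerNorm, and the constructor interface are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Pre-Norm [XYH+20, ZS19] plus skip connection [HZRS16] around one intermediate layer."""
    def __init__(self, d_model: int, layer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # assumption: the paper's exact normalization may differ
        self.layer = layer

    def forward(self, x):
        return x + self.layer(self.norm(x))

def build_samba_stack(d_model, n_blocks, make_mamba, make_swa, make_mlp):
    """Interleave Mamba, MLP, SWA, MLP per block (ordering assumed from Figure 2).
    make_* are caller-supplied constructors returning nn.Module instances."""
    layers = []
    for _ in range(n_blocks):
        for make in (make_mamba, make_mlp, make_swa, make_mlp):
            layers.append(PreNormResidual(d_model, make()))
    return nn.Sequential(*layers)
```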

2.1.1 Mamba Layer
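As a rough orientation for this layer, the following is a minimal sketch of a selective state-space recurrence in the spirit of Mamba [GD23]. It uses a naive Python-level sequential scan and a single step size shared across channels per token, not the paper's exact parameterization or its hardware-aware kernel; every dimension and initialization below is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMambaLayer(nn.Module):
    """Simplified selective SSM block: gated input, short causal conv, data-dependent (dt, B, C)."""
    def __init__(self, d_model: int, d_state: int = 16, expand: int = 2, d_conv: int = 4):
        super().__init__()
        d_inner = expand * d_model
        self.in_proj = nn.Linear(d_model, 2 * d_inner)            # produces x and gate z
        self.conv1d = nn.Conv1d(d_inner, d_inner, d_conv, groups=d_inner, padding=d_conv - 1)
        self.x_proj = nn.Linear(d_inner, 2 * d_state + 1)         # data-dependent dt, B, C
        self.A_log = nn.Parameter(torch.log(torch.arange(1, d_state + 1).float()).repeat(d_inner, 1))
        self.out_proj = nn.Linear(d_inner, d_model)
        self.d_inner, self.d_state = d_inner, d_state

    def forward(self, u):                                          # u: (batch, seq_len, d_model)
        b, L, _ = u.shape
        x, z = self.in_proj(u).chunk(2, dim=-1)                    # (b, L, d_inner) each
        x = self.conv1d(x.transpose(1, 2))[..., :L].transpose(1, 2)  # causal short convolution
        x = F.silu(x)
        dt, B, C = torch.split(self.x_proj(x), [1, self.d_state, self.d_state], dim=-1)
        dt = F.softplus(dt)                                        # (b, L, 1), shared across channels (simplification)
        A = -torch.exp(self.A_log)                                 # (d_inner, d_state), negative for stability
        h = x.new_zeros(b, self.d_inner, self.d_state)
        ys = []
        for t in range(L):                                         # naive sequential scan
            dA = torch.exp(dt[:, t].unsqueeze(-1) * A)             # input-dependent decay
            dB = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)     # input-dependent write
            h = dA * h + dB * x[:, t].unsqueeze(-1)                # selective recurrence
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))          # read out with data-dependent C
        y = torch.stack(ys, dim=1) * F.silu(z)                     # gated output
        return self.out_proj(y)
```

During decoding, only the fixed-size state h and the short convolution buffer need to be carried forward, which is what lets the Mamba layers serve as a linear-time, constant-memory backbone for generation.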

2.1.2 Sliding Window Attention (SWA) Layer

The Sliding Window Attention [BPC20] layer is designed to address the limitations of the Mamba layer in capturing non-Markovian dependencies in sequences. Our SWA layer operates on a window of size w = 2048 that slides over the input sequence, ensuring that the computational complexity remains linear with respect to the sequence length. The RoPE [SLP+21] relative positions are applied within the sliding window. By directly accessing the contents of the context window through attention, the SWA layer can retrieve high-definition signals from the middle- to short-term history that cannot be clearly captured by the recurrent states of Mamba. We use FlashAttention 2 [Dao23] for the efficient implementation of self-attention throughout this work. We choose the 2048 sliding window size for efficiency considerations as well: FlashAttention 2 matches the training speed of Mamba’s selective parallel scan at a sequence length of 2048, based on the measurements in [GD23].
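To make the windowed attention pattern concrete, here is a minimal reference sketch of causal attention restricted to the last w = 2048 positions, written with PyTorch's scaled_dot_product_attention and an explicit boolean mask. It omits RoPE and is not the FlashAttention 2 kernel used in the paper; the function name and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 2048):
    """Naive masked reference for causal attention over the last `window` positions.
    q, k, v: (batch, num_heads, seq_len, head_dim)."""
    L = q.size(-2)
    pos = torch.arange(L, device=q.device)
    # token i may attend to token j iff i - window < j <= i
    mask = (pos[None, :] <= pos[:, None]) & (pos[None, :] > pos[:, None] - window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Example: a window smaller than the sequence keeps per-token compute and memory bounded.
q = k = v = torch.randn(1, 4, 4096, 64)
out = sliding_window_attention(q, k, v, window=2048)   # (1, 4, 4096, 64)
```

Because each query attends to at most w keys, compute and memory per token stay constant as the sequence grows, which is what keeps the overall complexity linear in the sequence length.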

2.1.3 Multi-Layer Perceptron (MLP) Layer

The MLP layers in SAMBA serve as the architecture’s primary mechanism for nonlinear transformation and recall of factual knowledge [DDH+22]. We use SwiGLU [Sha20] for all the models trained in this paper and denote its intermediate hidden size as dp. As shown in Figure 2, Samba applies separate MLPs for the different types of information captured by the Mamba and SWA layers.
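As a concrete reference for the SwiGLU [Sha20] block, here is a minimal sketch using the intermediate hidden size dp named in the text; the weight names, bias-free linear layers, and example sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU(W1 x) gates (W3 x), then W2 projects back to d_model."""
    def __init__(self, d_model: int, d_p: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_p, bias=False)   # gating branch
        self.w3 = nn.Linear(d_model, d_p, bias=False)   # value branch
        self.w2 = nn.Linear(d_p, d_model, bias=False)   # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

mlp = SwiGLU(d_model=2048, d_p=8192)                    # sizes are illustrative only
y = mlp(torch.randn(2, 16, 2048))                       # (2, 16, 2048)
```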


:::info This paper is available on arXiv under the CC BY 4.0 license.

:::

∗Work partially done during internship at Microsoft.

†Equal second-author contribution.
