SAMBA presents a hybrid architecture that combines state-space and attention methods to provide remarkable reasoning, efficiency, and long-context comprehension. SAMBA outperforms top models such as LLaMA-3, Mistral, and Mamba in benchmark performance across scales up to 3.8B parameters. It exhibits robust length extrapolation up to 1M tokens, higher throughput, and stronger arithmetic reasoning. SAMBA sets a new benchmark for scalable and effective AI models by achieving state-of-the-art performance on both short- and long-context tasks while preserving linear computational scaling through effective memory recall and instruction tuning.

SAMBA Proves Hybrid Design Is the Future of Long-Context Modeling


Abstract and 1. Introduction

  2. Methodology

  3. Experiments and Results

    3.1 Language Modeling on Textbook Quality Data

    3.2 Exploration on Attention and Linear Recurrence

    3.3 Efficient Length Extrapolation

    3.4 Long-Context Understanding

  4. Analysis

  5. Conclusion, Acknowledgement, and References

A. Implementation Details

B. Additional Experiment Results

C. Details of Entropy Measurement

D. Limitations


3 Experiments and Results

We pre-train four SAMBA models with different parameter sizes, 421M, 1.3B, 1.7B and 3.8B, to investigate their performance across different scales. The details of the hyperparameters for the training and architecture designs are shown in Table 10 of Appendix A. We also train the other hybrid architectures mentioned in Section 2.1, along with the baseline Mamba, Llama-3, and Mistral architectures, at a scale of around 1.7B parameters, with detailed hyperparameters in Table 9 of Appendix A. We conduct comprehensive downstream evaluations on a wide range of benchmarks, focusing on four main capabilities of the models: commonsense reasoning (ARC [CCE+18], PIQA [BZB+20], WinoGrande [SBBC21], SIQA [SRC+19]), language understanding (HellaSwag [ZHB+19], BoolQ [CLC+19], OpenbookQA [MCKS18], SQuAD [RZLL16], MMLU [HBB+21]), truthfulness (TruthfulQA [LHE22]), and math and coding (GSM8K [CKB+21], MBPP [AON+21], HumanEval [CTJ+21]).

3.1 Language Modeling on Textbook Quality Data

We first present results from our largest 3.8B SAMBA model, trained on the same dataset used by Phi-3 [AJA+24] with 3.2T tokens. For a fair comparison, we follow the same multi-phase pretraining strategy as Phi-3-mini, and we also report the performance of the Transformer++ (TFM++ in Table 1) model, which uses the same architecture and training recipe as Phi-3-mini.

In Table 1, we conduct comprehensive evaluations on a diverse subset of the benchmarks, covering all the domains mentioned above, to thoroughly examine SAMBA's capabilities. The details of the generation configurations are included in Appendix A.

Table 1: Downstream performance comparison of SAMBA 3.8B with other pretrained base language models without instruction tuning. ARC-C and HellaSwag are measured with character-normalized accuracy. MMLU and GSM8K are measured in 5-shot, while others are in zero-shot. We report the MC2 score for TruthfulQA, maj@1 for GSM8K, and pass@1 for HumanEval. ∗ Measured by us.
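For reference, a minimal sketch of how character-normalized multiple-choice accuracy is typically computed (our reading of the metric, not the authors' evaluation code): each candidate completion is scored by its total log-likelihood divided by its character length, and the highest-scoring option is chosen.

```python
# Hedged sketch of character-length-normalized multiple-choice scoring.
# The option texts and summed log-probabilities below are illustrative only.
def pick_option(option_texts, option_logprobs):
    # Normalize each option's log-likelihood by its character count so that
    # longer completions are not penalized purely for their length.
    scores = [lp / len(text) for text, lp in zip(option_texts, option_logprobs)]
    return max(range(len(scores)), key=scores.__getitem__)

# Example with two hypothetical answer candidates:
print(pick_option(["a small dog", "an extremely large elephant"], [-5.0, -15.0]))  # -> 0
```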

We compare with several strong baselines, including Llama 2 [TMS+23], Mistral [JSM+23], Mamba [GD23], Gemma [Tea24], Recurrent-Gemma (R-Gemma) [BDS+24], Llama 3 [Met24] and TFM++. As shown in Table 1, SAMBA achieves the highest average score across all benchmarks, demonstrating its superior performance in handling various language comprehension tasks. Notably, SAMBA excels on the GSM8K benchmark, achieving an absolute 18.1% higher accuracy than TFM++ trained on the same dataset. This shows the surprising complementary effect of combining SSMs with the attention mechanism. We conjecture that when combined with attention, Mamba, as an input-dependent SSM, can focus more on performing arithmetic operations through its recurrent states than on the retrieval operations that can be easily learned by the sliding window attention.

Table 2: Downstream evaluation of the architectures trained on 230B tokens of the Phi2 dataset. We report the unnormalized accuracy for multiple choice tasks. GSM8K is evaluated with 5-shot examples while other tasks are in zero-shot. Best results are in bold, second best underlined.

To examine the different hybridization strategies mentioned in Section 2.1, we train 6 models with around 1.7B parameters on the Phi2 [LBE+23] dataset with 230B tokens and evaluate them on the full suite of 15 downstream benchmarks for a holistic assessment of hybrid and purebred architectures. As shown in Table 2, SAMBA demonstrates superior performance on a diverse set of tasks, including commonsense reasoning (ARC-Challenge), language understanding (MMLU, SQuAD), TruthfulQA and code generation (HumanEval, MBPP). It outperforms both the pure attention-based and SSM-based models on most tasks and achieves the best average performance. We observe that replacing Mamba blocks with MLPs does not harm commonsense reasoning ability, but performance on language understanding and complex reasoning tasks, such as coding and mathematical reasoning, degrades significantly. We can also see that pure Mamba models fall short on retrieval-intensive tasks such as SQuAD due to their lack of precise memory retrieval ability. The best results are achieved through the combination of the attention and Mamba modules, as shown with our Samba architecture. We also notice that Mamba-SWA-MLP has significantly better performance on GSM8K, potentially resulting from a closer collaboration between the Mamba and SWA layers. The distinct downstream performances of different hybridization strategies pose interesting future work for developing task-adaptive dynamic architectures.
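As a structural aid, the sketch below spells out the per-block layer orderings we assume for these variants, following our reading of Section 2.1; it shows only the layer pattern, not the actual modules.

```python
# Minimal structural sketch (not the released implementation) of the
# hybridization strategies compared in Table 2. Each block repeats a short
# pattern of Mamba, sliding-window attention (SWA), and MLP layers; the
# orderings below reflect our reading of Section 2.1.
def layer_stack(strategy: str, n_blocks: int) -> list[str]:
    patterns = {
        "Samba": ["Mamba", "MLP", "SWA", "MLP"],       # SSM and attention interleaved
        "Mamba-SWA-MLP": ["Mamba", "SWA", "MLP"],      # tighter Mamba/SWA coupling
        "Mamba-MLP": ["Mamba", "MLP"],                 # recurrence only, no attention
        "Llama-2-SWA": ["SWA", "MLP"],                 # attention only
    }
    return patterns[strategy] * n_blocks

print(layer_stack("Samba", 2))
# ['Mamba', 'MLP', 'SWA', 'MLP', 'Mamba', 'MLP', 'SWA', 'MLP']
```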

3.2 Exploration on Attention and Linear Recurrence

Since SSMs belong to a broader realm of linear recurrent models [OSG+23, QYZ23, YWS+23, Kat23, QYS+24], there exist multiple alternatives other than Mamba when combining attention-based layers with recurrent neural networks. In addition to Mamba and Samba, we conduct a comparative analysis of the following architectures:

• Llama-2 [TMS+23] is an attention-based Transformer architecture that utilizes full self-attention across the entire sequence.

• Llama-2-SWA is an attention-based architecture that replaces all full attention layers in Llama-2 with sliding window attention.

• Sliding RetNet replaces Mamba layers in the Samba architecture with Multi-Scale Retention [SDH+23] layers. RetNet is a linear attention model with a fixed, input-independent decay applied to the recurrent hidden states.

• Sliding GLA replaces Mamba layers in the Samba architecture with Gated Linear Attention (GLA) [YWS+23]. GLA is a more expressive variant of linear attention with input-dependent gating; a toy sketch contrasting the two recurrences follows this list.
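To make the distinction concrete, here is a toy sketch of the two recurrences (our simplification, not the papers' implementations): Multi-Scale Retention decays its matrix-valued state with a fixed constant, while GLA replaces that constant with a gate computed from the current input.

```python
# Toy contrast between RetNet-style fixed decay and GLA-style input-dependent
# gating on a single matrix-valued recurrent state. Shapes and the gate
# parameterization are illustrative assumptions, not the original designs.
import torch

d_k, d_v = 4, 4
gamma = 0.9  # RetNet: fixed, input-independent decay

def retnet_step(S, k, v):
    # S_t = gamma * S_{t-1} + k_t v_t^T
    return gamma * S + torch.outer(k, v)

def gla_step(S, k, v, g):
    # S_t = diag(g_t) S_{t-1} + k_t v_t^T, where g_t depends on the input x_t
    return g.unsqueeze(-1) * S + torch.outer(k, v)

x = torch.randn(8)                         # hypothetical input at step t
W_k, W_v, W_g = (torch.randn(8, 4) for _ in range(3))
k, v = x @ W_k, x @ W_v
g = torch.sigmoid(x @ W_g)                 # input-dependent gate in (0, 1)

S = torch.zeros(d_k, d_v)
S_retnet, S_gla = retnet_step(S, k, v), gla_step(S, k, v, g)
# In both cases the output at step t is read out from the state via the query q_t.
```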

Table 3: Perplexity on the validation set of SlimPajama for different attention and linear recurrent model architectures trained at 4,096 context length. We use a window size of 2,048 for Sliding Window Attention (SWA). The perplexity results fluctuate by around ±0.3%.

We pre-train all models on the same SlimPajama [SAKM+23] dataset at both the roughly 438M and 1.3B scales, and evaluate them by calculating perplexity on the validation set at context lengths of 4,096, 8,192, and 16,384 tokens to investigate their zero-shot length extrapolation ability. Peak training throughput is also measured as an efficiency metric. The details of the hyperparameter settings are included in Appendix A. As shown in Table 3, SAMBA consistently outperforms all other models across context lengths and model sizes. The training speed of SAMBA is competitive with pure Transformer-based models at the 1.3B scale. Mamba has significantly worse training throughput because Mamba layers train more slowly than MLP layers, and purebred Mamba models need more layers than other models at the same parameter count. We can notice that the full attention-based model cannot extrapolate beyond its context length without specific length extrapolation techniques, which motivates us to use SWA for Samba. In Section 4, we further show that hybridizing with even one full attention layer still leads to exploding perplexity at 16K sequence length. We also find that while RetNet extrapolates well at the 438M scale, its perplexity increases at 16K length at the 1.4B scale, which may indicate that its input-independent decay needs scale-specific tuning to work well.
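For clarity, here is a hedged sketch of the zero-shot extrapolation measurement described above: the validation text is split into non-overlapping chunks at each target context length (including lengths beyond the 4,096-token training context) and perplexity is computed without any fine-tuning. The `model` interface and tensor names are illustrative assumptions, not the authors' evaluation code.

```python
# Hedged sketch of zero-shot length-extrapolation evaluation: compute
# perplexity over non-overlapping chunks of a given context length.
# `model` (returning HF-style .logits) and `token_ids` are assumed to exist.
import math
import torch

@torch.no_grad()
def perplexity_at_length(model, token_ids: torch.Tensor, context_len: int) -> float:
    total_nll, total_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - context_len, context_len):
        chunk = token_ids[start:start + context_len].unsqueeze(0)   # (1, L)
        logits = model(chunk).logits                                 # (1, L, vocab)
        nll = torch.nn.functional.cross_entropy(
            logits[0, :-1], chunk[0, 1:], reduction="sum"
        )
        total_nll += nll.item()
        total_tokens += context_len - 1
    return math.exp(total_nll / total_tokens)

# e.g. perplexity_at_length(model, val_tokens, 16384) probes extrapolation
# beyond the 4,096-token training context.
```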

3.3 Efficient Length Extrapolation

We use the test split of the Proof-Pile [ZAP22] dataset to evaluate the length extrapolation ability of our models at a scale of around 1.7B parameters. We follow Position Interpolation [CWCT23] for data pre-processing. The sliding window approach [PSL21] is used for the perplexity evaluation with a window size of 4,096. Besides the decoding throughput in Figure 1 as a generation efficiency metric, we also measure the prompt processing speed in Figure 3 for the models SAMBA 1.7B, Mistral 1.6B, Mamba 1.8B, Llama-3 1.6B, and its Self-Extended [JHY+24] version SE-Llama-3 1.6B, with the prompt length sweeping from 1K to 128K. We set the group size to 4 and the neighborhood window to 1,024 for self-extension. We fix the total number of processed tokens per measurement to 128K and vary the batch size accordingly. The throughput is measured on a single A100 GPU in bfloat16 precision. We repeat the measurements 10 times and report the averaged results. Samba achieves 3.73× higher throughput in prompt processing than Llama-3 1.6B at the 128K prompt length, and its processing time remains linear with respect to the sequence length. We can also observe that the existing zero-shot length extrapolation technique introduces significant inference latency overhead on the full-attention counterpart, while it still cannot extrapolate infinitely with perplexity comparable to that of Samba. In Figure 1, we can also see that Mamba's perplexity increases slowly but steadily up to a 1M sequence length, which indicates that linear recurrent models still cannot extrapolate infinitely when the context length is extremely large.
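A hedged sketch of the prompt-processing throughput protocol described above (an illustrative reconstruction, not the authors' benchmarking code): the total number of processed tokens per measurement is fixed at 128K, so the batch size shrinks as the prompt length grows, and each timing runs in bfloat16 and is averaged over repeats.

```python
# Hedged sketch of prompt-processing throughput measurement with a fixed
# 128K-token budget per measurement. `model` is assumed to be a decoder
# already placed on the GPU; names are illustrative.
import time
import torch

@torch.no_grad()
def prompt_throughput(model, prompt_len: int, vocab_size: int,
                      total_tokens: int = 128 * 1024, repeats: int = 10) -> float:
    batch_size = total_tokens // prompt_len        # batch shrinks as prompts grow
    tokens_per_sec = []
    for _ in range(repeats):
        batch = torch.randint(vocab_size, (batch_size, prompt_len), device="cuda")
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.autocast("cuda", dtype=torch.bfloat16):
            model(batch)                           # prefill / prompt processing only
        torch.cuda.synchronize()
        tokens_per_sec.append(batch_size * prompt_len / (time.perf_counter() - start))
    return sum(tokens_per_sec) / len(tokens_per_sec)
```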

Figure 3: Prompt processing throughput of different models with around 1.7B parameters.

Beyond its efficiency in processing long context, Samba can also extrapolate its memory recall ability to 256K context length through supervised fine-tuning, while keeping its linear computation complexity. We fine-tune Samba 1.7B on Passkey Retrieval with a 4K training sequence length for only 500 steps. As presented in Figure 4, SAMBA 1.7B demonstrates a remarkable ability to recall information from significantly longer contexts compared to Mistral 1.6B, a model based solely on Sliding Window Attention (SWA). This capability is particularly evident in the heatmap, where SAMBA maintains perfect retrieval performance across a wider range of passkey positions in a long document of up to 256K length. We also plot the training loss curve and the overall passkey retrieval accuracy over the course of fine-tuning in Figure 6 and Figure 7 of Appendix B. We find that although both architectures reach near-zero training loss in fewer than 250 steps, Samba achieves near-perfect retrieval as early as 150 training steps, while the Mistral architecture struggles at around 30% accuracy throughout the training process. This shows that Samba has better long-range retrieval ability than SWA due to the input selection mechanism introduced by the Mamba layers.
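For readers unfamiliar with the task, a minimal sketch of how a passkey retrieval example is typically constructed (an assumption about the setup, not the authors' data pipeline): a random passkey is buried at a chosen depth inside filler text, and the model is asked to repeat it.

```python
# Hedged sketch of passkey-retrieval example construction: a random passkey is
# inserted at a given depth inside filler text and the model must reproduce it.
# The filler sentence and prompt wording are illustrative assumptions.
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def make_passkey_example(n_filler: int, depth: float, seed: int = 0) -> tuple[str, str]:
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    needle = f" The pass key is {passkey}. Remember it. "
    insert_at = int(n_filler * depth)                  # depth in [0, 1] positions the needle
    context = FILLER * insert_at + needle + FILLER * (n_filler - insert_at)
    prompt = context + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, answer = make_passkey_example(n_filler=200, depth=0.5)
```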

3.4 Long-Context Understanding

The impressive results on the synthetic passkey retrieval task encourage us to perform full-cycle instruction tuning of the Samba-3.8B model. We follow the same post-training recipe used for the Phi-3-mini series and evaluate the downstream performance of the instruction-tuned Samba-3.8B-IT (preview) on both long-context summarization tasks (GovReport [HCP+21], SQuALITY [WPC+22]) and the main short-context benchmarks (MMLU, GSM8K, HumanEval), as shown in Table 4. Samba has substantially better performance than Phi-3-mini-4k-instruct on both the short-context (MMLU, GSM8K, HumanEval) and long-context (GovReport) tasks, while still using the 2048 window size of its SWA layers and maintaining linear complexity for efficient processing of long documents.
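The summarization tasks in Table 4 are scored with ROUGE-L; as a reference point, here is a minimal LCS-based ROUGE-L sketch (an illustrative simplification, not the evaluation harness used in the paper).

```python
# Minimal ROUGE-L sketch: longest-common-subsequence recall, precision, and
# F-measure over whitespace tokens. Real evaluations typically use an official
# ROUGE package with stemming; the beta default is a common convention.
def rouge_l(reference: str, hypothesis: str, beta: float = 1.2) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

print(rouge_l("the report covers budget changes", "the report covers changes"))
```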

Figure 4: Passkey retrieval performance up to 256K context length for SAMBA 1.7B (left) vs. Mistral 1.6B (right), instruction tuned on a 4K sequence length with 500 steps.

Table 4: Downstream performance comparison between instruction-tuned Samba 3.8B and Phi-3-mini-4k on both long-context and short-context tasks. We report 5-shot accuracy (averaged by category) for MMLU, 8-shot CoT [WWS+22] for GSM8K, 0-shot pass@1 for HumanEval, and ROUGE-L for both GovReport and SQuALITY. † Results from the Phi-3 technical report [AJA+24].


:::info Authors:

(1) Liliang Ren, Microsoft and University of Illinois at Urbana-Champaign (liliangren@microsoft.com);

(2) Yang Liu†, Microsoft (yaliu10@microsoft.com);

(3) Yadong Lu†, Microsoft (yadonglu@microsoft.com);

(4) Yelong Shen, Microsoft (yelong.shen@microsoft.com);

(5) Chen Liang, Microsoft (chenliang1@microsoft.com);

(6) Weizhu Chen, Microsoft (wzchen@microsoft.com).

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

