A neural vocoder is the final model in the text-to-speech (TTS) pipeline. It turns a mel‑spectrogram into the sound you can actually hear. WaveNet, WaveGlow, HiFi‑GAN, and FastDiff are the four contenders.

Inside the Neural Vocoder Zoo: WaveNet to Diffusion in Four Audio Clips

2025/09/09 02:33

Hey everyone, I’m Oleh Datskiv, Lead AI Engineer at the R&D Data Unit of N-iX. Lately, I’ve been working on text-to-speech systems and, more specifically, on the unsung hero behind them: the neural vocoder.

Let me introduce you to this final step of the TTS pipeline — the part that turns abstract spectrograms into the natural-sounding speech we hear.

Introduction

If you’ve worked with text‑to‑speech in the past few years, you’ve used a vocoder - even if you didn’t notice it. The neural vocoder is the final model in the Text to Speech (TTS) pipeline; it turns a mel‑spectrogram into the sound you can actually hear.

Since the release of WaveNet in 2016, neural vocoders have evolved rapidly. They have become faster, lighter, and more natural-sounding. From flow-based models to GANs to diffusion, each new approach has pushed the field closer to real-time, high-fidelity speech.

2024 felt like a definitive turning point: diffusion-based vocoders like FastDiff were finally fast enough to be considered for real-time usage, not just batch synthesis as before. That opened up a range of new possibilities, most notably smarter dubbing pipelines, higher-quality virtual voices, and more expressive assistants, even without a high-end GPU cluster.

But with so many options now available, the questions remain:

  • How do these models sound side-by-side?
  • Which ones keep latency low enough for live or interactive use?
  • What is the best choice of a vocoder for you?

This post will examine four key vocoders: WaveNet, WaveGlow, HiFi‑GAN, and FastDiff. We’ll explain how each model works and what sets it apart. Most importantly, we’ll let you hear their output so you can decide which one you like better. We will also share the custom benchmarks from our own evaluation.

What Is a Neural Vocoder?

At a high level, every modern TTS system still follows the same basic path: a text encoder, an acoustic model, and a neural vocoder.

Let’s quickly go over what each of these blocks does and why we are focusing on the vocoder today:

  1. Text encoder: It changes raw text or phonemes into detailed linguistic embeddings.
  2. Acoustic model: This stage predicts how the speech should sound over time. It turns linguistic embeddings into mel spectrograms that show timing, melody, and expression. It has two critical sub-components:
     • Alignment & duration predictor: This component determines how long each phoneme should last, ensuring the rhythm of speech feels natural and human.
     • Variance/prosody adaptor: At this stage, the adaptor injects pitch, energy, and style, shaping the melody, emphasis, and emotional contour of the sentence.
  3. Neural vocoder: Finally, this model converts the prosody-rich mel spectrogram into actual sound, the waveform we can hear.
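To make the data flow concrete, here is a minimal sketch of the three blocks as toy Python functions. Nothing in it is a real model: the function names, the 80 mel bins, the hop length of 256 samples per frame, and the fixed duration of 5 frames per symbol are all assumptions chosen for illustration. The only point is how the shapes change from text to embeddings to mel frames to waveform samples.

```python
# Toy pipeline: every function is a hypothetical stand-in, not a trained model.
N_MELS = 80        # assumed number of mel bins per frame
HOP_LENGTH = 256   # assumed number of waveform samples per mel frame


def text_encoder(text):
    """Map each character to a tiny one-dimensional 'embedding'."""
    return [[float(ord(ch))] for ch in text]


def acoustic_model(embeddings, frames_per_symbol=5):
    """Expand each symbol to a fixed number of mel frames.

    A real duration predictor would choose a different length per phoneme,
    and a prosody adaptor would shape pitch and energy.
    """
    mel = []
    for _ in embeddings:
        for _ in range(frames_per_symbol):
            mel.append([0.0] * N_MELS)
    return mel


def vocoder(mel):
    """Turn each mel frame into HOP_LENGTH waveform samples (silence here)."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(text):
    return vocoder(acoustic_model(text_encoder(text)))


waveform = synthesize("hi")
print(len(waveform))  # 2 symbols * 5 frames * 256 samples
```

Even in this toy form, the vocoder produces hundreds of output values for every input frame, which is why it tends to dominate both the cost and the perceived quality of the pipeline.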

The vocoder is where good pipelines live or die. Map mels to waveforms perfectly, and the result is a studio-grade actor. Get it wrong, and even with the best acoustic model, you will get metallic buzz in the generated audio. That’s why choosing the right vocoder matters - because they’re not all built the same. Some optimize for speed, others for quality. The best models balance naturalness, speed, and clarity.

The Vocoder Lineup

Now, let's meet our four contenders. Each represents a different generation of neural speech synthesis, with its own approach to balancing the trade-offs between audio quality, speed, and model size. The numbers below are drawn from the original papers, so actual performance will vary depending on your hardware and batch size. We will share our benchmark numbers later in the article for a real‑world check.

  1. WaveNet (2016): The original fidelity benchmark

Google's WaveNet was a landmark that redefined audio quality for TTS. As an autoregressive model, it generates audio one sample at a time, with each new sample conditioned on all previous ones. This process resulted in unprecedented naturalness at the time (MOS=4.21), setting a "gold standard" that researchers still benchmark against today. However, this sample-by-sample approach also makes WaveNet painfully slow, restricting its use to offline studio work rather than live applications.
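The sample-by-sample bottleneck is easy to see in a toy loop. The sketch below is not WaveNet: `toy_next_sample` is a hypothetical stand-in for the network, and the receptive field of 4 samples is an arbitrary assumption. What it does share with an autoregressive vocoder is the structure: each new sample needs the previous ones, so the calls cannot run in parallel.

```python
import math


def toy_next_sample(history, receptive_field=4):
    """Hypothetical stand-in for the network: predict one sample from recent ones."""
    context = history[-receptive_field:]
    if not context:
        return 0.1  # arbitrary starting value when there is no history yet
    return math.tanh(0.9 * sum(context) / len(context) + 0.05)


def generate_autoregressive(num_samples):
    """Generate audio one sample at a time and count the sequential model calls."""
    audio = []
    calls = 0
    for _ in range(num_samples):
        audio.append(toy_next_sample(audio))  # depends on everything generated so far
        calls += 1
    return audio, calls


audio, calls = generate_autoregressive(100)
print(calls)  # one model call per output sample
```

At a sample rate of 22,050 Hz, one second of speech would need 22,050 such sequential calls, each one a full pass through a deep network.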

  2. WaveGlow (2019): Leap to parallel synthesis

To solve WaveNet's critical speed problem, NVIDIA's WaveGlow introduced a flow-based, non-autoregressive architecture. Generating the entire waveform in a single forward pass drastically reduced inference time to approximately 0.04 RTF, making it much faster than real time. While the quality is excellent (MOS≈3.961), it was considered a slight step down from WaveNet's fidelity. Its primary limitations are a larger memory footprint and a tendency to produce a subtle high-frequency hiss, especially with noisy training data.
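The building block that makes flow-based models work is an invertible transform. Below is a minimal sketch of one affine coupling layer in plain Python. The `toy_net` function is a hypothetical stand-in for the network that predicts the scale and shift (in WaveGlow that network is also conditioned on the mel spectrogram, which we leave out here). Half of the input passes through untouched, and that half alone determines how the other half is scaled and shifted, so the step can be undone exactly.

```python
import math


def toy_net(xa):
    """Hypothetical stand-in for the network that predicts log-scale and shift."""
    mean = sum(xa) / len(xa)
    log_s = [0.1 * mean for _ in xa]
    t = [0.5 * v for v in xa]
    return log_s, t


def coupling_forward(x):
    """Affine coupling: keep the first half, scale and shift the second half."""
    half = len(x) // 2  # assumes an even-length input
    xa, xb = x[:half], x[half:]
    log_s, t = toy_net(xa)
    yb = [b * math.exp(ls) + sh for b, ls, sh in zip(xb, log_s, t)]
    return xa + yb


def coupling_inverse(y):
    """Exact inverse: the untouched half lets us recompute the same scale and shift."""
    half = len(y) // 2
    ya, yb = y[:half], y[half:]
    log_s, t = toy_net(ya)
    xb = [(b - sh) * math.exp(-ls) for b, ls, sh in zip(yb, log_s, t)]
    return ya + xb


x = [0.3, -0.2, 0.5, 0.1]
y = coupling_forward(x)
x_back = coupling_inverse(y)
print(max(abs(a - b) for a, b in zip(x, x_back)))  # reconstruction error
```

Neither direction contains a loop over time that depends on earlier outputs, so every element can be computed at once. That is what lets a flow-based vocoder map noise to a whole waveform in a single pass.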

  3. HiFi-GAN (2020): Champion of efficiency

HiFi-GAN marked a breakthrough in efficiency using a Generative Adversarial Network (GAN) with a clever multi-period discriminator. This architecture allows it to produce extremely high-fidelity audio (MOS=4.36), competitive with WaveNet, from a remarkably small and fast model (13.92M parameters). It's ultra-fast on a GPU (<0.006×RTF) and can even achieve real-time performance on a CPU, which is why HiFi-GAN quickly became the default choice for production systems like chatbots, game engines, and virtual assistants.
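The core trick of the multi-period discriminator is a reshape. Each sub-discriminator folds the 1D waveform into a 2D grid whose width is a period p, so that samples spaced p apart line up in the same column. The sketch below shows only that folding step in plain Python. The periods are the ones reported in the HiFi-GAN paper; the zero padding is a simplification we assume here, and the convolutional layers that would then run over the grid are omitted.

```python
PERIODS = [2, 3, 5, 7, 11]  # periods reported in the HiFi-GAN paper


def reshape_by_period(audio, period):
    """Fold a 1D signal into rows of length `period`.

    Column j then holds samples j, j + period, j + 2 * period, ...
    Zero padding is an assumption made for this sketch.
    """
    padded = list(audio)
    remainder = len(padded) % period
    if remainder:
        padded += [0.0] * (period - remainder)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


signal = list(range(10))
grid = reshape_by_period(signal, 3)
first_column = [row[0] for row in grid]
print(first_column)  # every third sample, starting at index 0
```

Because the periods are prime, the sub-discriminators look at mostly non-overlapping periodic patterns in the signal.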

  4. FastDiff (2022): Diffusion quality at real-time speed

Proving that diffusion models don't have to be slow, FastDiff represents the current state-of-the-art in balancing quality and speed. By pruning the reverse diffusion process to as few as four steps, it achieves top-tier audio quality (MOS=4.28) while staying fast enough for interactive use (~0.02×RTF on a GPU). This combination makes it one of the first diffusion-based vocoders viable for high-quality, real-time speech synthesis, opening the door for more expressive and responsive applications.
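To show what "four reverse steps" means in practice, here is a sketch of a generic DDPM-style sampling loop in plain Python. It is not FastDiff's actual sampler or noise schedule: the `BETAS` values are hypothetical, and `oracle_noise` cheats by already knowing the clean target, where a real vocoder would use a neural network conditioned on the mel spectrogram. The sketch only illustrates the shape of the loop: start from Gaussian noise, then call the noise predictor once per step.

```python
import math
import random

BETAS = [0.05, 0.2, 0.5, 0.8]  # hypothetical 4-step noise schedule


def alpha_bars(betas):
    """Cumulative products of (1 - beta), one per step."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def oracle_noise(x_t, target, abar):
    """Stand-in noise predictor that already knows the clean signal."""
    return [(x - math.sqrt(abar) * c) / math.sqrt(1.0 - abar)
            for x, c in zip(x_t, target)]


def reverse_diffusion(target, betas=BETAS, seed=0):
    rng = random.Random(seed)
    abars = alpha_bars(betas)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    steps = 0
    for t in reversed(range(len(betas))):
        beta, abar = betas[t], abars[t]
        eps = oracle_noise(x, target, abar)
        # Remove the predicted noise to get the mean of the previous step.
        mean = [(xi - beta / math.sqrt(1.0 - abar) * e) / math.sqrt(1.0 - beta)
                for xi, e in zip(x, eps)]
        if t > 0:
            x = [m + math.sqrt(beta) * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no fresh noise is added on the final step
        steps += 1
    return x, steps


clean = [0.5, -0.25, 0.1]
audio, steps = reverse_diffusion(clean)
print(steps)  # number of noise-predictor calls
```

The cost scales with the number of steps, since each one is a full network pass. Cutting a schedule from hundreds of steps to a handful is what brings diffusion into real-time territory.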

Each of these models reflects a significant shift in vocoder design. Now that we've seen how they work on paper, it's time to put them to the test with our own benchmarks and audio comparisons.

Let’s Hear It - A/B Audio Gallery

Nothing beats your ears!

We will use the following sentences from the LJ Speech Dataset to test our vocoders. Later in the article, you can also listen to the original audio recording and compare it with the generated one.

Sentences:

  1. “A medical practitioner charged with doing to death persons who relied upon his professional skill.”
  2. “Nothing more was heard of the affair, although the lady declared that she had never instructed Fauntleroy to sell.”
  3. “Under the new rule, visitors were not allowed to pass into the interior of the prison, but were detained between the grating.”

The metrics we will use to evaluate the model’s results are listed below. These include both objective and subjective metrics:

  • Naturalness (MOS): How human-like it sounds, rated by real people on a scale from 1 to 5. The higher, the better.
  • Clarity (PESQ / STOI): Objective scores that help measure intelligibility and noise/artifacts. The higher, the better.
  • Speed (RTF): The real-time factor is synthesis time divided by audio duration. An RTF of 1 means it takes 1 second to generate 1 second of audio. For anything interactive, you’ll want this at 1 or below.
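Two of these metrics are simple enough to compute by hand. The sketch below uses only the Python standard library; the function names are our own, and the 95% confidence interval relies on a normal approximation, which is an assumption that works best with many raters. PESQ and STOI need dedicated implementations, so they are left out.

```python
import statistics


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return synthesis_seconds / audio_seconds


def mean_opinion_score(ratings):
    """Return the mean rating and the half-width of a 95% confidence interval."""
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / len(ratings) ** 0.5
    return mean, half_width


# Example: 0.5 s of compute for a 10 s clip.
print(real_time_factor(0.5, 10.0))

# Example: five listeners rate one clip on the 1-to-5 scale.
mos, ci = mean_opinion_score([4, 5, 3, 4, 4])
print(round(mos, 2), round(ci, 2))
```

An RTF of 0.05 means the model runs 20 times faster than real time. When reporting MOS, it helps to include the interval so readers can judge whether a small gap between two models is meaningful.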

Audio Players

(Grab headphones and tap the buttons to hear each model.)

| Sentence | Ground truth | WaveNet | WaveGlow | HiFi‑GAN | FastDiff |
|----|:---:|:---:|:---:|:---:|:---:|
| S1 | ▶️ | ▶️ | ▶️ | ▶️ | ▶️ |
| S2 | ▶️ | ▶️ | ▶️ | ▶️ | ▶️ |
| S3 | ▶️ | ▶️ | ▶️ | ▶️ | ▶️ |

Quick‑Look Metrics

Here are the results we obtained for the four models.

| Model | RTF ↓ | MOS ↑ | PESQ ↑ | STOI ↑ |
|----|:---:|:---:|:---:|:---:|
| WaveNet | 1.24 | 3.4 | 1.0590 | 0.1616 |
| WaveGlow | 0.058 | 3.7 | 1.0853 | 0.1769 |
| HiFi‑GAN | 0.072 | 3.9 | 1.098 | 0.186 |
| FastDiff | 0.081 | 4.0 | 1.131 | 0.19 |

*For the MOS evaluation, we collected ratings from 150 participants with no background in music.

** As an acoustic model, we used Tacotron2 for WaveNet and WaveGlow, and FastSpeech2 for HiFi‑GAN and FastDiff.

Bottom line

Our journey through the vocoder zoo shows that while the gap between speed and quality is shrinking, there’s no one-size-fits-all solution. Your choice of a vocoder in 2025 and beyond should primarily depend on your project's needs and technical requirements, including:

  • Runtime constraints (Is it an offline generation or a live, interactive application?)
  • Quality requirements (What’s a higher priority: raw speed or maximum fidelity?)
  • Deployment targets (Will it run on a powerful cloud GPU, a local CPU, or a mobile device?)
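These criteria can be turned into a small selection helper. The sketch below hard-codes the RTF and MOS values from our Quick‑Look Metrics table and picks the highest-MOS model that fits a latency budget. The `pick_vocoder` function and its rule are a hypothetical simplification: a real decision would also weigh model size, memory, and your target hardware, and RTF measured on your own machine will differ from ours.

```python
# RTF and MOS values copied from the Quick-Look Metrics table in this article.
BENCHMARKS = {
    "WaveNet": {"rtf": 1.24, "mos": 3.4},
    "WaveGlow": {"rtf": 0.058, "mos": 3.7},
    "HiFi-GAN": {"rtf": 0.072, "mos": 3.9},
    "FastDiff": {"rtf": 0.081, "mos": 4.0},
}


def pick_vocoder(max_rtf):
    """Return the highest-MOS model whose RTF fits the budget, or None."""
    candidates = {name: m for name, m in BENCHMARKS.items() if m["rtf"] <= max_rtf}
    if not candidates:
        return None
    return max(candidates, key=lambda name: candidates[name]["mos"])


print(pick_vocoder(1.0))    # anything at or faster than real time
print(pick_vocoder(0.06))   # a much tighter latency budget
```

With a budget of RTF ≤ 1.0 the helper returns FastDiff, while a budget of 0.06 leaves only WaveGlow, which mirrors the speed-versus-quality trade-off discussed above.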

As the field progresses, the lines between these choices will continue to blur, paving the way for universally accessible, high-fidelity speech that is heard and felt.
