Speech Enhancement Demo

Abstract

This page demonstrates the performance of our proposed method compared with SEMamba w/o PCS, ZipEnhancer, MP-SENet-Up, and Universe++ across universal speech degradation scenarios.

Training Logs

The following figures show training convergence and PESQ improvement over epochs for the universal speech enhancement task. All models are trained on synthesized DNS2021 data (300 hours) and validated on a subset of WSJ+WHAMR! (150 utterances), with the same loss and optimizer configurations.

Figure 1: Training Loss Convergence (Pink: Proposed, Blue: ZipEnhancer, Green: MP-SENet-Up, Light Blue: SEMamba w/o PCS).

Figure 2: Validation PESQ (Pink: Proposed, Blue: ZipEnhancer, Green: MP-SENet-Up, Light Blue: SEMamba w/o PCS).

DNS-2020 Large-Scale Denoising (3000h)

The following results are from the ultra-large-scale DNS-2020 denoising-only task. All models use the same loss and optimizer configurations as MP-SENet. The grey curve denotes the proposed method, the blue curve denotes SEMamba w/o PCS, and the red curve denotes MP-SENet-Up.

Figure 3: DNS-2020 validation PESQ on the large-scale 3000h denoising task.

Figure 4: DNS-2020 validation phase metric on the large-scale 3000h denoising task.

Phase Retrieval on VoiceBank (Validation)

The orange curve denotes the proposed method, the blue curve denotes MP-SENet-Up, and the red curve denotes SEMamba. The validation phase loss is computed as GD + IF + PD. Because the proposed method, trained for fewer than 150k steps, already surpasses the baselines trained for about 250k steps, we did not train it further; longer training may yield better results. All models are trained with the same loss and optimizer configurations.

Figure 5: Phase Retrieval Validation (PESQ) on VoiceBank.

Figure 6: Phase Retrieval Validation Phase Loss (GD + IF + PD) on VoiceBank.
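
For readers unfamiliar with phase losses, the sum GD + IF + PD can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the figures: it assumes GD, IF, and PD denote a group-delay term (difference along frequency), an instantaneous-frequency term (difference along time), and a direct phase-difference term, each passed through the anti-wrapping function commonly used for phase losses, and summed with equal weights. All function names are hypothetical, and the phase spectrum is represented as a plain nested list indexed as [frequency][time].

```python
import math


def anti_wrap(x):
    """Fold a phase difference into [0, pi], removing 2*pi ambiguity."""
    return abs(x - 2.0 * math.pi * round(x / (2.0 * math.pi)))


def mean_anti_wrapped(diff):
    """Average the anti-wrapped values of a 2-D list [freq][time]."""
    values = [anti_wrap(v) for row in diff for v in row]
    return sum(values) / len(values)


def subtract(a, b):
    """Element-wise difference of two equally shaped 2-D lists."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def diff_freq(phase):
    """First-order difference along the frequency axis (rows)."""
    return [
        [hi - lo for hi, lo in zip(phase[k + 1], phase[k])]
        for k in range(len(phase) - 1)
    ]


def diff_time(phase):
    """First-order difference along the time axis (columns)."""
    return [[row[t + 1] - row[t] for t in range(len(row) - 1)] for row in phase]


def phase_loss(pred, target):
    """Assumed form of the loss: GD + IF + PD with equal weights."""
    pd_term = mean_anti_wrapped(subtract(pred, target))
    gd_term = mean_anti_wrapped(subtract(diff_freq(pred), diff_freq(target)))
    if_term = mean_anti_wrapped(subtract(diff_time(pred), diff_time(target)))
    return gd_term + if_term + pd_term
```

The anti-wrapping step is what makes the comparison meaningful for phase: two spectra that differ by a full turn of 2π in every bin receive a loss of (numerically) zero rather than a large penalty. Inputs need at least two frequency bins and two frames so that both difference terms are defined.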

Acknowledgment of Baseline Methods

We gratefully acknowledge the authors of the compared baseline methods for sharing their open-sourced code. The following papers present the methods used in this comparison:

Y.-X. Lu, Y. Ai, and Z.-H. Ling, "Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement," Neural Netw., vol. 189, p. 107562, 2025.
R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y. Tsao, "An investigation of incorporating mamba for speech enhancement," in Proc. IEEE Spoken Lang. Technol. Workshop (SLT), 2024, pp. 302–308.
H. Wang and B. Tian, "ZipEnhancer: Dual-path down-up sampling-based zipformer for monaural speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2025, pp. 1–5.

Statistical Significance (significance test for Universal SE scenario: Proposed vs. ZipEnhancer)

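We report p-values from two-sided significance tests comparing the proposed method with the second-ranked ZipEnhancer. “Significant (+)” indicates a significant increase in the metric value; “Significant (-)” indicates a significant decrease in the metric value (for metrics marked ↓, lower is better). A blank result indicates no significant difference.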

Composite (DN+DR, DN+DR+BWE)

Metric	P-value	Result
PESQ	7.05602e-04	Significant (+)
STOI	2.45487e-01
SI-SDR	2.07514e-05	Significant (+)
COVL	2.61178e-03	Significant (+)
UTMOS	7.55323e-04	Significant (+)
PD(↓)	6.64324e-03	Significant (-)
WOPD(↓)	2.79556e-05	Significant (-)

Overall (DN, DR, BWE, DN+DR, DN+DR+BWE)

Metric	P-value	Result
PESQ	2.05312e-11	Significant (+)
STOI	5.54429e-02
SI-SDR	8.56033e-07	Significant (+)
COVL	4.49094e-12	Significant (+)
UTMOS	1.52879e-10	Significant (+)
PD(↓)	1.17751e-05	Significant (-)
WOPD(↓)	4.83145e-12	Significant (-)
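
The page does not state which statistical test produced the p-values above. As one possible way to obtain a two-sided p-value from per-utterance scores of two systems, the sketch below uses an exact paired sign-flip (permutation) test. The test choice, the 0.05 threshold, the function names, and the example scores are all assumptions for illustration, not the procedure or data behind the tables.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided p-value for the summed paired difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so every sign assignment is enumerated (2**n cases;
    only practical for small n).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    return at_least_as_extreme / total


def result_label(p_value, mean_diff, alpha=0.05):
    """Label in the style of the tables; alpha = 0.05 is an assumed threshold."""
    if p_value >= alpha:
        return ""
    return "Significant (+)" if mean_diff > 0 else "Significant (-)"


# Hypothetical per-utterance scores, for illustration only.
proposed = [3.1, 2.9, 3.4, 3.0, 3.3, 3.2]
baseline = [2.8, 2.7, 3.1, 2.9, 3.0, 3.0]
p = paired_sign_flip_p_value(proposed, baseline)
mean_diff = sum(a - b for a, b in zip(proposed, baseline)) / len(proposed)
print(p, result_label(p, mean_diff))
```

For realistic test-set sizes, exhaustive enumeration is infeasible, and a randomly sampled subset of sign assignments or a standard paired test would be used instead.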

Audio Samples

Please use headphones for the best listening experience.

Supplementary Notes

For readers' convenience, we provide a brief note on the model's Global Rotation Equivariance (GRE) on a dedicated page.
