Abstract
This page compares the proposed method with SEMamba w/o PCS, ZipEnhancer, MP-SENet Up, and Universe++ across universal speech degradation scenarios.
Training Logs
The following figures show training convergence and PESQ improvement over epochs for the universal speech enhancement task. All models are trained on 300 hours of synthesized DNS2021 data and validated on a 150-utterance subset of WSJ+WHAMR!, with the same loss and optimizer configurations.
DNS-2020 Large-Scale Denoising (3000h)
The following results are from the ultra-large-scale DNS-2020 pure denoising task. All models use the same loss and optimizer configurations (those of MP-SENet). The grey curve denotes the proposed method, the blue curve denotes SEMamba w/o PCS, and the red curve denotes MP-SENet Up.
Phase Retrieval on VoiceBank (Validation)
The orange curve denotes the proposed method, the blue curve denotes MP-SENet Up, and the red curve denotes SEMamba. The validation phase loss is computed as GD + IF + PD. All models are trained with the same loss and optimizer configurations. Because the proposed method, after fewer than 150k training steps, already surpasses the baselines at about 250k steps, we did not train it further; longer training may yield better results.
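The GD + IF + PD sum can be illustrated with a minimal sketch. This page does not define the three terms, so the code below assumes an anti-wrapping formulation in the style of MP-SENet: PD is read as the direct phase-difference term, GD as the difference of the phase error along the frequency axis, and IF as the difference along the time axis. The function names and the `[freq][time]` list layout are hypothetical choices for illustration, not the authors' implementation.

```python
import math

TWO_PI = 2.0 * math.pi


def anti_wrap(x):
    """Fold a phase difference into [0, pi] so that 2*pi jumps cost nothing."""
    return abs(x - TWO_PI * round(x / TWO_PI))


def mean_anti_wrap(grid):
    """Average the anti-wrapped values of a 2-D list."""
    values = [anti_wrap(v) for row in grid for v in row]
    return sum(values) / len(values)


def phase_loss(est, ref):
    """Assumed GD + IF + PD loss on phase spectra indexed as [freq][time].

    Needs at least two frequency bins and two frames.
    """
    n_freq, n_time = len(ref), len(ref[0])
    # Element-wise phase error between estimate and reference.
    err = [[est[f][t] - ref[f][t] for t in range(n_time)] for f in range(n_freq)]
    # GD term (assumed): error differenced along the frequency axis.
    gd = [[err[f + 1][t] - err[f][t] for t in range(n_time)]
          for f in range(n_freq - 1)]
    # IF term (assumed): error differenced along the time axis.
    inst_freq = [[err[f][t + 1] - err[f][t] for t in range(n_time - 1)]
                 for f in range(n_freq)]
    # PD term (assumed): the anti-wrapped phase error itself.
    return mean_anti_wrap(gd) + mean_anti_wrap(inst_freq) + mean_anti_wrap(err)
```

In this sketch, adding a constant offset to every phase value leaves the GD and IF terms at zero, so only the PD term responds to it, and an offset of exactly 2π costs nothing at all.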
Acknowledgment of Baseline Methods
We gratefully acknowledge the authors of the compared baseline methods for sharing their open-sourced code. The following papers present the methods used in this comparison:
- Y.-X. Lu, Y. Ai, and Z.-H. Ling, "Explicit estimation of magnitude and phase spectra in parallel for high-quality speech enhancement," Neural Netw., vol. 189, p. 107562, 2025.
- R. Chao, W.-H. Cheng, M. La Quatra, S. M. Siniscalchi, C.-H. H. Yang, S.-W. Fu, and Y. Tsao, "An investigation of incorporating mamba for speech enhancement," in Proc. IEEE Spoken Lang. Technol. Workshop (SLT), 2024, pp. 302–308.
- H. Wang and B. Tian, "ZipEnhancer: Dual-path down-up sampling-based zipformer for monaural speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2025, pp. 1–5.
Statistical Significance (p-test for Universal SE scenario: Proposed vs. ZipEnhancer)
We report two-sided p-test results comparing the proposed method with the second-ranked method, ZipEnhancer. “Significant (+)” indicates a significant increase in the metric value; “Significant (-)” indicates a significant decrease. Metrics marked (↓) are lower-is-better.
| Metric | P-value | Result |
|---|---|---|
| PESQ | 7.05602e-04 | Significant (+) |
| STOI | 2.45487e-01 | |
| SI-SDR | 2.07514e-05 | Significant (+) |
| COVL | 2.61178e-03 | Significant (+) |
| UTMOS | 7.55323e-04 | Significant (+) |
| PD(↓) | 6.64324e-03 | Significant (-) |
| WOPD(↓) | 2.79556e-05 | Significant (-) |

| Metric | P-value | Result |
|---|---|---|
| PESQ | 2.05312e-11 | Significant (+) |
| STOI | 5.54429e-02 | |
| SI-SDR | 8.56033e-07 | Significant (+) |
| COVL | 4.49094e-12 | Significant (+) |
| UTMOS | 1.52879e-10 | Significant (+) |
| PD(↓) | 1.17751e-05 | Significant (-) |
| WOPD(↓) | 4.83145e-12 | Significant (-) |
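
For readers who want to run a comparable check on their own per-utterance scores, the following is a minimal sketch of one possible two-sided paired test. The page does not state which test produced the p-values above, so a sign-flip permutation test is substituted here purely for illustration. The function names, the 0.05 threshold, and the reading of (+) and (-) as the sign of the mean difference are all assumptions.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-utterance scores.

    Returns (mean difference a - b, estimated p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs two score lists of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    observed = abs(mean_diff)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_resamples + 1)
    return mean_diff, p_value


def result_label(mean_diff, p_value, alpha=0.05):
    """Build a Result cell; alpha = 0.05 is an assumed threshold."""
    if p_value >= alpha:
        return ""
    return "Significant (+)" if mean_diff > 0 else "Significant (-)"
```

Calling `paired_permutation_test(proposed_scores, baseline_scores)` and passing its two outputs to `result_label` gives a table cell in the same style as above. The smallest p-value this estimator can return is 1 / (n_resamples + 1), so values as small as those in the tables would need a different test or far more resamples.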
Audio Samples
Please use headphones for the best listening experience.
Supplementary Notes
For readers' convenience, we provide a brief note on the model's Global Rotation Equivariance (GRE) on a dedicated page.