Global Rotation Equivariant Phase Modeling for Speech Enhancement with Deep Magnitude-Phase Interaction

Supplementary Material

1. Introduction

Conventional neural networks typically operate in Euclidean space, which creates a topological mismatch when modeling phase information that resides on a circular manifold ($S^1$). To address this, we propose a framework that fundamentally respects this geometry.

This page provides a brief note (with supporting equations) on why our key components—the Magnitude-Phase Interactive Convolutional Module (MPICM) and the Hybrid-Attention Dual-FFN (HADF)—preserve Global Rotation Equivariance (GRE). This property ensures that a global rotation of the input phase results in an identical rotation of the output features, encouraging the model to learn relative structural patterns (e.g., group delay) rather than absolute orientation.

2. A Note on Global Rotation Equivariance (GRE)

We demonstrate that both the convolutional building block (MPICM) and the attention bottleneck (HADF) strictly preserve the rotation equivariance of the phase stream. Let the input to the network be defined by a complex tensor $\mathbf{Z}_{in}$ (phase stream) and a real tensor $\mathbf{M}_{in}$ (magnitude stream). We define a global phase rotation operator $T_\theta$ such that the rotated inputs are:

$$ \tilde{\mathbf{Z}}_{in} = \mathbf{Z}_{in} \cdot e^{j\theta}, \quad \tilde{\mathbf{M}}_{in} = \mathbf{M}_{in} $$

Enforced constraints (by design):

(i) all complex-valued linear layers in the phase stream (ComplexConv/ComplexLinear) are complex-linear and bias-free;

(ii) any phase-stream normalization (e.g., cRMS) consists only of multiplication by a real scale computed from rotation-invariant quantities (e.g., the RMS of $|\mathbf{Z}|$), optionally with a real learnable scale;

(iii) interactive gates are real-valued and depend only on rotation-invariant inputs (e.g., $|\mathbf{Z}|$) and/or the magnitude stream (which does not rotate).

Under these constraints, the equalities below hold exactly.
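As a sanity check of constraints (ii) and (iii), here is a minimal NumPy sketch verifying that a real rescaling computed from the rotation-invariant modulus commutes with a global rotation. The function name `crms_normalize`, the shapes, and the seed are ours, purely illustrative, not the model's implementation:

```python
import numpy as np

def crms_normalize(z, eps=1e-8, gamma=1.0):
    """Rescale a complex feature map by a real factor computed from |z|.

    The scale depends only on the rotation-invariant modulus |z|, so
    crms_normalize(z * e^{j theta}) == crms_normalize(z) * e^{j theta}.
    gamma stands in for an (optional) real learnable scale.
    """
    rms = np.sqrt(np.mean(np.abs(z) ** 2) + eps)  # rotation-invariant
    return gamma * z / rms

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
rot = np.exp(1j * 0.7)  # global rotation e^{j theta}

# Normalizing the rotated input equals rotating the normalized output.
assert np.allclose(crms_normalize(z * rot), crms_normalize(z) * rot)
```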

2.1 Equivariance in MPICM Convolution

The phase stream employs a bias-free complex convolution, denoted as $W_{ang}$. For the rotated input $\tilde{\mathbf{Z}}_{in}$:

$$ \tilde{\mathbf{Z}}_{out} = W_{ang} * \tilde{\mathbf{Z}}_{in} = W_{ang} * (\mathbf{Z}_{in} \cdot e^{j\theta}) $$

Due to the linearity of the convolution and the absence of a bias term, the scalar rotation factor $e^{j\theta}$ factors out:

$$ \tilde{\mathbf{Z}}_{out} = (W_{ang} * \mathbf{Z}_{in}) \cdot e^{j\theta} = \mathbf{Z}_{out} \cdot e^{j\theta} $$

Thus, the output of the bias-free complex convolution is strictly equivariant to global phase rotation.
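This identity is easy to confirm numerically. In the sketch below (a minimal stand-in, not the model's layer), NumPy's complex-valued `np.convolve` plays the role of the bias-free complex convolution $W_{ang}$; kernel length, signal length, and seed are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-ins for W_ang and Z_in (shapes are illustrative).
w = rng.standard_normal(5) + 1j * rng.standard_normal(5)
z = rng.standard_normal(64) + 1j * rng.standard_normal(64)
rot = np.exp(1j * 1.2)  # global rotation e^{j theta}

out = np.convolve(z, w, mode="same")                  # W_ang * Z_in
out_rotated_in = np.convolve(z * rot, w, mode="same")  # W_ang * (Z_in e^{j theta})

# Linearity + no bias: the scalar e^{j theta} factors out exactly
# (up to floating-point tolerance).
assert np.allclose(out_rotated_in, out * rot)
```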

2.2 Invariance in MPICM Interactive Gating

The MPICM utilizes a cross-stream gating mechanism whose gates depend on the magnitude stream feature $\tilde{\mathbf{M}}$ and the modulus of the phase stream feature $|\tilde{\mathbf{Z}}|$. (Any phase-stream normalization such as cRMS also preserves equivariance, since it rescales by a real factor computed from the rotation-invariant $|\mathbf{Z}|$; see constraint (ii) above.) First, observe that the modulus of the complex feature is rotation invariant:

$$ |\tilde{\mathbf{Z}}_{out}| = |\mathbf{Z}_{out} \cdot e^{j\theta}| = |\mathbf{Z}_{out}| \cdot |e^{j\theta}| = |\mathbf{Z}_{out}| $$

Since $\mathbf{M}_{in}$ is unchanged, the magnitude stream feature $\tilde{\mathbf{M}}$ is also invariant. Consequently, the gating functions $\Psi$, which rely solely on $\tilde{\mathbf{M}}$ and $|\tilde{\mathbf{Z}}|$, produce identical gating coefficients regardless of the global phase rotation $\theta$.

Conclusion: The magnitude output remains invariant ($\tilde{\mathbf{M}}_{out} = \mathbf{M}_{out}$), and the phase-stream output remains equivariant ($\tilde{\mathbf{Z}}_{out} = \mathbf{Z}_{out} \cdot e^{j\theta}$), proving the MPICM is Global Rotation Equivariant.
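The sketch below illustrates both halves of this claim: the gate is invariant, and gating the phase stream therefore stays equivariant. Here `mpicm_gate` is a hypothetical stand-in for $\Psi$ (the model's actual gate differs in form), built only from $\tilde{\mathbf{M}}$ and the rotation-invariant $|\tilde{\mathbf{Z}}|$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mpicm_gate(m, z):
    """Hypothetical stand-in for Psi: real-valued, and a function only
    of the magnitude stream m and the rotation-invariant modulus |z|."""
    return sigmoid(m + np.abs(z))

rng = np.random.default_rng(2)
m = rng.standard_normal((4, 8))  # magnitude stream (does not rotate)
z = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
rot = np.exp(1j * 0.9)

# The gate sees only rotation-invariant quantities, so it is unchanged...
assert np.allclose(mpicm_gate(m, z * rot), mpicm_gate(m, z))
# ...and the gated phase stream therefore remains equivariant.
assert np.allclose((z * rot) * mpicm_gate(m, z * rot),
                   (z * mpicm_gate(m, z)) * rot)
```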

2.3 Equivariance in Hybrid Attention (HADF)

The Hybrid-Attention Dual-FFN (HADF) fuses information using a unified attention map. Let the complex Query and Key vectors for the phase stream be $\mathbf{Q}_{pha}$ and $\mathbf{K}_{pha}$. If the input rotates by $e^{j\theta}$, the bias-free linear projections rotate identically:

$$ \tilde{\mathbf{Q}}_{pha} = \mathbf{Q}_{pha} \cdot e^{j\theta}, \quad \tilde{\mathbf{K}}_{pha} = \mathbf{K}_{pha} \cdot e^{j\theta} $$

Invariance of the Attention Score: The unified attention score is derived from the concatenation of magnitude terms and the complex phase terms. The contribution of the phase stream to the dot product is equivalent to the real part of the Hermitian inner product:

$$ \text{Score}_{pha} = \text{Re}(\mathbf{Q}_{pha} \cdot \mathbf{K}_{pha}^\mathcal{H}) $$

Substituting the rotated vectors:

$$ \text{Score}_{rotated} = \text{Re}\left( (\mathbf{Q}_{pha} e^{j\theta}) \cdot (\mathbf{K}_{pha} e^{j\theta})^\mathcal{H} \right) = \text{Re}\left( \mathbf{Q}_{pha} \, e^{j\theta} e^{-j\theta} \, \mathbf{K}_{pha}^\mathcal{H} \right) = \text{Re}(\mathbf{Q}_{pha} \cdot \mathbf{K}_{pha}^\mathcal{H}) = \text{Score}_{pha} $$

The rotation terms cancel out. Thus, the attention matrix is rotation invariant.

(Equivalently, if implementation uses concatenation of real/imag parts, i.e., $[\text{Re}(\mathbf{Q}_{pha}),\text{Im}(\mathbf{Q}_{pha})]$ and the standard real dot product, the score is invariant because a common 2D rotation preserves dot products.)
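Both formulations of the score can be verified numerically. The sketch below (dimensions and seed are arbitrary choices of ours) checks that the Hermitian form $\text{Re}(\mathbf{Q}_{pha} \mathbf{K}_{pha}^\mathcal{H})$ is invariant under a common rotation and coincides with the real/imag-concatenation form:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
q = rng.standard_normal((10, d)) + 1j * rng.standard_normal((10, d))
k = rng.standard_normal((10, d)) + 1j * rng.standard_normal((10, d))
rot = np.exp(1j * 0.5)

score = np.real(q @ k.conj().T)                      # Re(Q K^H)
score_rot = np.real((q * rot) @ (k * rot).conj().T)  # rotated Q and K
assert np.allclose(score_rot, score)                 # rotation cancels

# Equivalent real/imag-concatenation view: a common 2-D rotation
# preserves the standard real dot product.
q_ri = np.concatenate([q.real, q.imag], axis=-1)
k_ri = np.concatenate([k.real, k.imag], axis=-1)
assert np.allclose(q_ri @ k_ri.T, score)
```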

Equivariance of the Output: The final output is the product of the invariant attention matrix and the Value vectors $\mathbf{V}_{pha}$. Since $\mathbf{V}_{pha}$ rotates by $e^{j\theta}$ (due to the bias-free projection), the final weighted sum also rotates by $e^{j\theta}$. This confirms that the HADF module fuses cross-modal information without breaking the global rotation equivariance of the phase stream.
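A compact check of this full chain; `attend` is an illustrative stand-in in which we assume a standard scaled softmax over the invariant real score (the HADF's exact normalization may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Invariant real score Re(Q K^H), then a weighted sum of complex V.
    a = softmax(np.real(q @ k.conj().T) / np.sqrt(q.shape[-1]))
    return a @ v

rng = np.random.default_rng(4)
d = 16
q = rng.standard_normal((10, d)) + 1j * rng.standard_normal((10, d))
k = rng.standard_normal((10, d)) + 1j * rng.standard_normal((10, d))
v = rng.standard_normal((10, d)) + 1j * rng.standard_normal((10, d))
rot = np.exp(1j * 0.5)

# Rotating Q, K, V together rotates the output by the same e^{j theta}:
# the attention weights are invariant, and V carries the rotation.
assert np.allclose(attend(q * rot, k * rot, v * rot),
                   attend(q, k, v) * rot)
```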