Echo Cancellation Method Combining Kalman Filtering and Taylor Residual Expansion

李勇; 孙成立; 陈飞龙

doi:10.3969/j.issn.2096-8566.2024.01.005

Abstract

Existing deep-learning-based acoustic echo cancellation algorithms mainly adopt an end-to-end structure, which makes the design of the neural network model hard to interpret. To address this problem, an echo cancellation method combining Kalman filtering and Taylor residual expansion is proposed, which gives the design of the network structure good interpretability. The method consists of two parts: linear adaptive filtering and a deep neural network. First, Neural Kalman Filtering (NKF) is used as the adaptive filter to remove the linear echo component and obtain a coarse spectral estimate of the target speech. Then, a Taylor expansion neural network further processes the coarse estimate to suppress nonlinear residual echo and progressively restore the complex spectrum of the target speech. Within the Taylor expansion neural network, an encoder-decoder network that fuses time-frequency features at different scales is designed for zero-order term estimation, and a lightweight network is built for high-order term estimation, so that the complex spectrum of the target speech is reconstructed from coarse to fine granularity. Experimental results show that the proposed method performs significantly better than existing mainstream echo cancellation methods. In the double-talk case, both Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI) improve substantially. In the single-talk case, Echo Return Loss Enhancement (ERLE) reaches 56.106, a 6.5% improvement over the advanced UNET neural network method.
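As an illustration of the single-talk metric reported above, ERLE is conventionally defined as the ratio, in decibels, of the microphone signal power to the residual signal power after cancellation. The following is a minimal sketch under that standard definition; the function name `erle_db` and the `eps` guard are assumptions made here for illustration and are not taken from the paper, whose exact evaluation procedure (framing, averaging) is not given in this abstract.

```python
import math


def erle_db(mic, residual, eps=1e-12):
    """Echo Return Loss Enhancement in dB (illustrative helper, not from the paper).

    mic      -- microphone samples recorded during far-end single talk
    residual -- the same segment after echo cancellation
    eps      -- small constant (assumed) to avoid division by zero
    """
    if len(mic) == 0 or len(mic) != len(residual):
        raise ValueError("mic and residual must be non-empty and equal in length")

    # Mean power of each signal over the whole segment.
    p_mic = sum(x * x for x in mic) / len(mic)
    p_res = sum(x * x for x in residual) / len(residual)

    # ERLE = 10 * log10(power before cancellation / power after cancellation).
    return 10.0 * math.log10((p_mic + eps) / (p_res + eps))


if __name__ == "__main__":
    mic = [1.0, -1.0, 1.0, -1.0]
    residual = [0.1 * x for x in mic]  # amplitude reduced by a factor of 10
    print(round(erle_db(mic, residual), 3))
```

A higher value means more echo has been removed; reducing the residual amplitude by a factor of 10, as in the usage example, corresponds to about 20 dB.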
