Quantao LI, Feilong CHEN, Chengli SUN, Biyun DING, Jiankun PENG. Sound Event Localization and Detection Based on Multi-scale and Coordination Attention[J]. Journal of Nanchang Hangkong University (Natural Science Edition), 2025, 39(5): 1-10. DOI: 10.3969/j.issn.2096-8566.2025.05.001

Sound Event Localization and Detection Based on Multi-scale and Coordination Attention


    Abstract: Sound Event Localization and Detection (SELD) in real spatial scenes is an emerging and challenging task. To address the limitation that traditional deep learning methods extract features in a single, fixed way, this paper proposes a U-shaped sound event localization and detection method (U-SELD). By designing a serial multi-scale feature extraction sub-network, the proposed method enables the input features to carry multi-level information, thereby enhancing the network's feature representation capability and improving model performance. In addition, this paper proposes a scale feature coordinate attention (SFCA) mechanism, which allows the network to capture the location information of key features and reduce redundant feature information. Extensive experiments on the STARSS22 Task 3 dataset demonstrate that the proposed method converges faster and achieves an overall performance score of 0.4987, an approximately 9% improvement over the baseline. Moreover, the proposed approach exhibits better stability across the 13 classes of target sound events.
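The abstract does not give the authors' implementation of the coordinate attention component, but the general idea behind coordinate attention is direction-aware pooling: the feature map is averaged along each spatial axis separately, so positional information along the other axis is preserved in the resulting attention weights. The following is a minimal NumPy sketch of that pooling-and-reweighting step only; it is an illustrative simplification (the shared 1x1 convolutional transform of the full method is omitted), not the paper's SFCA module.

```python
import numpy as np

def sigmoid(z):
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x):
    """Simplified coordinate attention over a (C, H, W) feature map.

    Pools along width and height separately, producing one attention
    vector per spatial direction, then reweights the input by their
    broadcast product. Illustrative sketch only.
    """
    pool_h = x.mean(axis=2, keepdims=True)  # (C, H, 1): aggregate over width
    pool_w = x.mean(axis=1, keepdims=True)  # (C, 1, W): aggregate over height
    # Broadcasting (C,H,1) * (C,1,W) yields per-position (C,H,W) weights in (0,1)
    attn = sigmoid(pool_h) * sigmoid(pool_w)
    return x * attn
```

Because the two pooled vectors retain position along one axis each, the product recovers a coarse two-dimensional attention map, which is why this family of mechanisms is described as capturing the location information of key features.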

     
