Abstract:
Sound Event Localization and Detection (SELD) in real spatial scenarios is an emerging and challenging task. To address the reliance on a single feature extraction method in conventional deep learning approaches, this paper proposes a U-shaped Sound Event Localization and Detection method (U-SELD). A serial multi-scale feature extraction sub-network enables the input features to carry multi-level information, which strengthens the network’s feature representation capability and improves model performance. In addition, this paper proposes a scale feature coordinate attention (SFCA) mechanism, which allows the network to capture the location information of key features and to reduce redundant feature information. Experiments on the STARSS22 Task 3 dataset demonstrate that the proposed method converges faster and achieves an overall performance score of 0.4987, an improvement of approximately 9% over the baseline. Moreover, the proposed approach exhibits better stability across the 13 types of target sound events.