An efficient Short-Time Discrete Cosine Transform and Attentive MultiResUNet framework for Music Source Separation

Thomas Sgouros, Angelos Bousis and Nikolaos Mitianoudis

Department of Electrical and Computer Engineering,
Democritus University of Thrace,
67100 Xanthi, Greece
tsgouros@ee.duth.gr, bousis.ang@gmail.com, nmitiano@ee.duth.gr

Abstract

The music source separation problem, where the task is to estimate the audio components present in a mixture, has been the centre of research activity for a long time. In more recent frameworks, the problem is tackled with deep learning models that attempt to extract information about each component from Short-Time Fourier Transform (STFT) spectrograms used as input. Most approaches assume that only one source is present at each time-frequency point, which allows that point of the mixture to be allocated to the desired source. Because this assumption is very strong and is reported not to hold in practice, using only the magnitude of the STFT as input to these networks creates a problem: the Fourier phase information is missing when the separated sources are reconstructed. Recovering the Fourier phase is neither easily tractable nor computationally efficient. In this paper, we propose a novel Attentive MultiResUNet architecture that uses real-valued Short-Time Discrete Cosine Transform data as input. This avoids the phase recovery problem, because the appropriate values are estimated within the network itself rather than by complex estimation or post-processing algorithms. The proposed network features a U-Net type structure with residual skip connections and an attention mechanism that correlates the skip connection and the decoder output at the previous level.
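To illustrate why a real-valued Short-Time Discrete Cosine Transform sidesteps phase recovery, the following is a minimal sketch, not the authors' implementation. It assumes an orthonormal DCT-II per frame, a sine window, 50% overlap and a tiny frame length of 8 samples; the names `stdct`, `istdct`, `dct_basis` and `sine_window` are hypothetical and chosen only for this example. Every coefficient is a plain real number, and overlap-add of the inverse-transformed frames recovers the signal without any separate phase estimate.

```python
import math

def dct_basis(n_fft):
    """Orthonormal DCT-II basis: one row per coefficient index k."""
    rows = []
    for k in range(n_fft):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_fft)
        rows.append([scale * math.cos(math.pi * (n + 0.5) * k / n_fft)
                     for n in range(n_fft)])
    return rows

def sine_window(n_fft):
    """Sine window; its square sums to 1 across frames at 50% overlap."""
    return [math.sin(math.pi * (n + 0.5) / n_fft) for n in range(n_fft)]

def stdct(signal, n_fft=8):
    """Forward transform (assumed settings): returns real-valued frames."""
    if n_fft % 2:
        raise ValueError("n_fft must be even for 50% overlap")
    hop = n_fft // 2
    basis, win = dct_basis(n_fft), sine_window(n_fft)
    # Zero-pad so that every original sample is covered by exactly two frames.
    pad_end = hop + (-len(signal)) % hop
    padded = [0.0] * hop + list(signal) + [0.0] * pad_end
    frames = []
    for start in range(0, len(padded) - n_fft + 1, hop):
        seg = [padded[start + n] * win[n] for n in range(n_fft)]
        frames.append([sum(b * s for b, s in zip(row, seg)) for row in basis])
    return frames

def istdct(frames, length, n_fft=8):
    """Inverse transform: inverse DCT per frame, window, then overlap-add."""
    hop = n_fft // 2
    basis, win = dct_basis(n_fft), sine_window(n_fft)
    out = [0.0] * (hop * (len(frames) + 1))
    for i, coeffs in enumerate(frames):
        for n in range(n_fft):
            # The transpose of an orthonormal basis is its inverse.
            sample = sum(basis[k][n] * coeffs[k] for k in range(n_fft))
            out[i * hop + n] += sample * win[n]
    # Drop the leading padding and trim to the original length.
    return out[hop:hop + length]

if __name__ == "__main__":
    x = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) for t in range(37)]
    X = stdct(x)
    y = istdct(X, len(x))
    print("frames:", len(X), "coefficients per frame:", len(X[0]))
    print("max reconstruction error:", max(abs(a - b) for a, b in zip(x, y)))
```

In a separation setting, a network would modify the real-valued frames returned by `stdct` before `istdct` is applied; that step is omitted here, and practical frame lengths are far larger than the 8 samples assumed above.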


Proposed Architecture

Audio Samples

Example 1:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 2:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 3:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 4:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 5:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 6:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 7:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 8:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 9:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT:   

Example 10:


Vocals:   Vocals GT :
Bass:       Bass GT:      
Drums:   Drums GT:  
Other:   Other GT: