First, the spectrum matrix $x(f,r)$ is obtained with the short-time Fourier transform (STFT):

$$x(f,r)=\int_{-\infty}^{\infty} w(t-r)\,y(t)\,e^{-i2\pi f t}\,dt \tag{1}$$

In (1), $y(t)$ is the pre-emphasized time-domain birdsong audio signal, $w(t-r)$ is the Hamming window centered at position $r$, $f$ is the frequency, and $r$ is the frame position of the current window. The length of the Hamming window is set to 512 and the window shift to 256. The Fourier transform is carried out within each window to obtain the two-dimensional spectrogram matrix $x(f,r)$.

The Mel frequency is a nonlinear frequency scale inspired by the hearing characteristics of the human ear. Because this scale resolves low frequencies more finely than high frequencies, the Mel-spectrogram represents low-frequency signals in particular detail. The two-dimensional spectrogram matrix $x(f,r)$ is passed through the Mel filter bank, which uses the following conversion:

$$f_{mel}=2595\log_{10}\left(1+\frac{f}{700}\right) \tag{2}$$

where $f_{mel}$ is the Mel-scale frequency and $f$ is the ordinary frequency in Hertz. The Mel filter bank imitates the way the human ear filters speech. It consists of 512 triangular filters spanning the frequency range of a section of birdsong audio. The filter widths increase from low to high frequency, and adjacent filters overlap by 50% to avoid loss of information. On the Mel scale, these filters have equal width. Finally, the output matrix is converted into a spectrogram.
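
The steps described above can be sketched in discrete form as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The window length (512), window shift (256), and the Mel formula come from the text. The function names, the `sample_rate` and `n_filters` parameters, the use of the power spectrum (squared magnitude), and the placement of filter edges at equal Mel spacing are assumptions made for illustration.

```python
import cmath
import math

WIN_LENGTH = 512  # Hamming window length, from the text
HOP_LENGTH = 256  # window shift, from the text


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def stft_power(y, win_length=WIN_LENGTH, hop_length=HOP_LENGTH):
    """Discrete analogue of Eq. (1); returns power values as frames x bins.

    Taking the squared magnitude is an assumption; the text does not say
    whether magnitude or power is used.
    """
    w = hamming(win_length)
    n_bins = win_length // 2 + 1  # non-negative frequencies only
    # Precomputed DFT factors e^{-i 2 pi k / N}
    twiddle = [cmath.exp(-2j * math.pi * k / win_length) for k in range(win_length)]
    frames = []
    for start in range(0, len(y) - win_length + 1, hop_length):
        seg = [w[k] * y[start + k] for k in range(win_length)]
        row = []
        for f in range(n_bins):
            s = sum(seg[t] * twiddle[(f * t) % win_length] for t in range(win_length))
            row.append(abs(s) ** 2)
        frames.append(row)
    return frames


def hz_to_mel(f):
    """Eq. (2): Hertz to Mel."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of Eq. (2): Mel to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters whose edges are equally spaced on the Mel scale.

    Each filter runs from the previous center to the next center, so
    neighbouring filters overlap by half their width.
    """
    max_mel = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(max_mel * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))  # rising slope
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrogram(y, sample_rate, n_filters):
    """Apply the Mel filter bank to every STFT frame (frames x filters)."""
    power = stft_power(y)
    bank = mel_filterbank(n_filters, len(power[0]), sample_rate)
    return [[sum(b * p for b, p in zip(filt, frame)) for filt in bank] for frame in power]
```

For example, calling `mel_spectrogram(signal, 8000, 8)` on a 1024-sample signal yields 3 frames of 8 Mel-band values each; the sample rate of 8000 Hz and the 8 filters are placeholder values chosen to keep the sketch fast, not settings from the text. The direct DFT loop is written for clarity and would normally be replaced by an FFT routine.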