How to Create & Understand Mel-Spectrograms
What is a Spectrogram?
Spectrograms are immensely useful tools for dissecting the information in an audio file and turning it into an image. In a spectrogram, the horizontal axis represents time, the vertical axis represents frequency, and the color intensity represents the amplitude of a frequency at a certain point in time. In case you can’t quite picture that, here is an example of what a spectrogram looks like:
The cool part about these images is that we can use them with Deep Learning and Computer Vision to train convolutional neural networks for a wide variety of classification tasks! These tasks range from music style classification to diagnosing whether a person has an infectious disease based on audio of their cough.
Loading the Signal and Sample Rate
So how exactly can we create spectrograms from audio? First, we will import the necessary libraries and then load our target audio file. The main library we will be using is Librosa, a Python package for audio analysis¹.
import librosa, librosa.display
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

signal, sr = librosa.load(librosa.util.example('brahms'))
When we use librosa.load() to load an audio file, we assign its two return values to variables for the signal and the sample rate. The signal is a 1-dimensional NumPy array whose length equals the sample rate multiplied by the duration of the audio file; each value in it is the amplitude of the audio at a particular point in time. The sample rate (sr) defaults to 22050, which means that for every second of audio there are 22,050 samples. After a quick check of that relationship below, we can view the loaded audio with librosa.display.waveplot().
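Here is a small sanity check of that relationship, reusing the signal and sr variables we just loaded (librosa.get_duration() is an extra helper that we don’t rely on later in this post):
# The number of samples equals the sample rate multiplied by the duration in seconds
print(sr)                                     # 22050 by default
print(len(signal))                            # total number of samples in the clip
print(len(signal) / sr)                       # duration in seconds, computed by hand
print(librosa.get_duration(y=signal, sr=sr))  # the same duration, via librosa
With that confirmed, let’s plot the waveform itself.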
plt.figure(figsize=(20, 5))
librosa.display.waveplot(signal, sr=sr)
plt.title('Waveplot', fontdict=dict(size=18))
plt.xlabel('Time', fontdict=dict(size=15))
plt.ylabel('Amplitude', fontdict=dict(size=15))
plt.show()
Using the above code, we can see our audio file as a waveform. We can even use another library, IPython.display, to hear the audio with only one line of code!
import IPython.display as ipd

ipd.Audio(signal, rate=sr)
To see how much each frequency contributes to the overall sound of the audio file, we need to apply a Fast Fourier Transform.
Fast Fourier & Discrete Fourier Transforms
A Fast Fourier Transform (FFT) is an algorithm that efficiently computes a Discrete Fourier Transform (DFT). To put it into perspective, think of the DFT as the car (any Mustang) and the FFT as the factory (Ford) that builds it. The FFT takes our audio from the time domain into the frequency domain, where the magnitude of each frequency present in the audio can be plotted on a graph. This allows us to identify the different frequencies that make up the audio. Essentially, it is a snapshot of the audio’s frequency content.
The output of the FFT is a 1-dimensional NumPy array with as many values as there are samples in the waveform (remember that, by default, our sample rate is 22,050 samples per second). Each value is a complex number, and taking its absolute value gives the magnitude of the corresponding frequency. These magnitudes tell us how much each frequency contributes to the sound: the larger the magnitude, the greater that frequency’s contribution to the overall audio. Because our signal is real-valued, the spectrum is mirrored around its midpoint, so only the first half of the frequency bins carries unique information; that is why the code below keeps just the left half.
# Creating a Discrete Fourier Transform with the FFT algorithm
fast_fourier_transf = np.fft.fft(signal)

# Magnitudes indicate the contribution of each frequency
magnitude = np.abs(fast_fourier_transf)

# Mapping the magnitudes to their relative frequency bins
frequency = np.linspace(0, sr, len(magnitude))

# We only need the first half of the magnitudes and frequencies
left_mag = magnitude[:int(len(magnitude)/2)]
left_freq = frequency[:int(len(frequency)/2)]

plt.plot(left_freq, left_mag)
plt.title('Discrete Fourier Transform', fontdict=dict(size=15))
plt.xlabel('Frequency', fontdict=dict(size=12))
plt.ylabel('Magnitude', fontdict=dict(size=12))
plt.show()
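As a quick illustration of reading this plot, we can look up the single strongest frequency component; a small sketch reusing the left_freq and left_mag arrays from above:
# Index of the largest magnitude, mapped back to its frequency
peak_freq = left_freq[np.argmax(left_mag)]
print(f'Strongest frequency: {peak_freq:.1f} Hz')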
The only issue with a DFT is that it is static: there is no time information in the plot. To see which frequencies occur at which points in time, we need to make a spectrogram.
Short-Time Fourier Transform Algorithm
To create a spectrogram, we can’t apply a Fast Fourier Transform to the entire audio at once. Instead of performing an FFT across the whole signal, we take small segments, or frames, of the audio signal and apply an FFT to each of these frames. This is called a Short-Time Fourier Transform (STFT). Doing so preserves information about how the audio signal evolves over time. Furthermore, the frames overlap each other as we slide across the audio signal. How far each frame slides is determined by the hop length, which tells the function how many samples to shift to the right before computing the next frame. Let’s go ahead and put this into code.
# Number of samples in each window (frame) per FFT
n_fft = 2048

# Number of samples we shift between successive frames
hop_length = 512

# Short-time Fourier transform of our audio data
audio_stft = librosa.core.stft(signal, hop_length=hop_length, n_fft=n_fft)

# Taking the absolute values gives the magnitude spectrogram
spectrogram = np.abs(audio_stft)

# Plotting the short-time Fourier transform with librosa.display.specshow()
plt.figure(figsize=(20, 5))
librosa.display.specshow(spectrogram, sr=sr, x_axis='time', y_axis='hz', hop_length=hop_length)
plt.colorbar(label='Amplitude')
plt.title('Spectrogram (amp)', fontdict=dict(size=18))
plt.xlabel('Time', fontdict=dict(size=15))
plt.ylabel('Frequency', fontdict=dict(size=15))
plt.show()
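Before reading the plot, it is worth glancing at the array we just computed; a quick sketch reusing the variables above (the exact number of frames depends on the length of the example clip):
# One row per frequency bin, one column per frame
print(spectrogram.shape)  # (1 + n_fft // 2, number_of_frames) -> (1025, ...)
# Each frequency bin spans sr / n_fft Hz, and each frame advances hop_length / sr seconds
print(sr / n_fft)         # roughly 10.8 Hz per bin
print(hop_length / sr)    # roughly 23 milliseconds per hop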
In the amplitude spectrogram above, we can see (or, more accurately, not see) that most of these frequencies contribute very little to the overall sound. Human perception of loudness is logarithmic rather than linear, so a better way to visualize loudness is to convert our spectrogram from amplitude to decibels.
# Short-time Fourier transform of our audio data
audio_stft = librosa.core.stft(signal, hop_length=hop_length, n_fft=n_fft)

# Taking the absolute values gives the magnitude spectrogram
spectrogram = np.abs(audio_stft)

# Converting the amplitudes to decibels
log_spectro = librosa.amplitude_to_db(spectrogram)

# Plotting the decibel spectrogram with librosa.display.specshow()
plt.figure(figsize=(20, 5))
librosa.display.specshow(log_spectro, sr=sr, x_axis='time', y_axis='hz', hop_length=hop_length, cmap='magma')
plt.colorbar(label='Decibels')
plt.title('Spectrogram (dB)', fontdict=dict(size=18))
plt.xlabel('Time', fontdict=dict(size=15))
plt.ylabel('Frequency', fontdict=dict(size=15))
plt.show()
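For intuition, librosa.amplitude_to_db() is essentially computing 20 * log10 of the amplitude, with a small floor and a dynamic-range clip. Here is a rough sketch that reproduces it by hand, assuming the function’s documented defaults of amin=1e-5 and top_db=80:
# 20 * log10 of the amplitude, floored at amin to avoid log(0)
manual_db = 20.0 * np.log10(np.maximum(1e-5, spectrogram))
# Values more than top_db (80 dB) below the maximum are clipped
manual_db = np.maximum(manual_db, manual_db.max() - 80.0)
print(np.allclose(manual_db, log_spectro))  # should print True if those defaults hold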
Creating the Mel-Spectrogram
The difference between a spectrogram and a Mel-spectrogram is that a Mel-spectrogram converts the frequencies to the mel scale. According to Tamara Smyth of the University of California, San Diego, the mel scale is “a perceptual scale of pitches judged by listeners to be equal in distance from one another”². If you are familiar with playing or reading music, this may help you visualize the conversion and the reasoning behind it. Let’s picture this as notes on a musical scale:
- From C to D is one whole step, and from D to E is another whole step. Perceptually, to the human ear, these step sizes are equal.
- However, if we compare these steps in hertz, they are not equal. Around middle C, a C is about 261.63 Hz, a D is 293.66 Hz, and an E is 329.63 Hz.
- C to D difference = 32.03 Hz
- D to E difference = 35.97 Hz
As the notes go higher in octave, the difference in hertz between the same perceptual step grows dramatically. The mel scale compresses these gaps so that equal distances on it correspond more closely to equal perceived differences in pitch, which is what makes Mel-spectrograms a perceptually relevant amplitude and frequency representation. We can see this compression in a couple of lines of code before plotting the Mel-spectrogram itself.
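As a quick check of the idea (not something we need for the plot below), we can compare the same whole step two octaves apart, using librosa’s hz_to_mel() helper with the HTK variant of the mel formula:
# Frequencies in Hz for the whole step C -> D, two octaves apart (C4/D4 vs C6/D6)
low_step = np.array([261.63, 293.66])     # C4 -> D4
high_step = np.array([1046.50, 1174.66])  # C6 -> D6
# In hertz, the same musical step is roughly four times wider two octaves up
print(np.diff(low_step), np.diff(high_step))
# On the mel scale, the gap grows much more slowly, closer to how we perceive pitch
print(np.diff(librosa.hz_to_mel(low_step, htk=True)),
      np.diff(librosa.hz_to_mel(high_step, htk=True)))
Now let’s build and plot the Mel-spectrogram.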
# Computing the Mel-spectrogram (a power spectrogram mapped onto the mel scale)
mel_signal = librosa.feature.melspectrogram(y=signal, sr=sr, hop_length=hop_length, n_fft=n_fft)

# The Mel-spectrogram is already non-negative, so np.abs() leaves it unchanged
spectrogram = np.abs(mel_signal)

# Converting the power values to decibels, relative to the maximum
power_to_db = librosa.power_to_db(spectrogram, ref=np.max)

# Plotting the Mel-spectrogram
plt.figure(figsize=(8, 7))
librosa.display.specshow(power_to_db, sr=sr, x_axis='time', y_axis='mel', cmap='magma', hop_length=hop_length)
plt.colorbar(label='dB')
plt.title('Mel-Spectrogram (dB)', fontdict=dict(size=18))
plt.xlabel('Time', fontdict=dict(size=15))
plt.ylabel('Frequency', fontdict=dict(size=15))
plt.show()
As we can see, the Mel-spectrogram presents the content of our audio file in a more perceptually meaningful way. Providing our convolutional neural network with this representation through Mel-spectrograms helps the model better differentiate between whatever classes we are training on.
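One last note: if the goal is to train a CNN, it is the numeric array rather than the rendered figure that the model consumes. Here is a small sketch of inspecting it (librosa.feature.melspectrogram() defaults to 128 mel bands, and the extra dimensions added here are purely illustrative):
# The Mel-spectrogram in decibels is a 2-D array: (n_mels, number_of_frames)
print(power_to_db.shape)  # (128, ...) with librosa's default n_mels=128
# Adding batch and channel dimensions turns it into a single-channel 'image' for a CNN
cnn_input = power_to_db[np.newaxis, :, :, np.newaxis]
print(cnn_input.shape)    # (1, 128, number_of_frames, 1)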
1. “Librosa — Librosa 0.8.0 Documentation.” Librosa, librosa.org/doc/latest/index.html. Accessed 7 Apr. 2021.
2. Smyth, Tamara. “The Mel Scale.” University of California, San Diego, 4 June 2019, musicweb.ucsd.edu/~trsmyth/pitch2/Mel_Scale.html.