Project Group 15
Kariko Kimathi (s2911337)
Gianmarco Lodi (s3103544)
Jan Menzel (s2933152)
Pablo Laso Mielgo (s2808161)
Managing Big Data Project
Music Genre Prediction on the Million Song Dataset
2022-1B
Abstract
In this paper, a study on the use of machine learning techniques for music genre prediction on the
Million Song Dataset (MSD) is presented. The MSD is one of the largest and most widely used datasets
in the field of Music Information Retrieval. A statistical analysis of the dataset is performed before
supervised machine learning algorithms, including Decision Tree, Random Forest, and Logistic Regression,
are applied to predict the genre of a song. The highest-performing model was a Random Forest
implementation with an accuracy of 0.48 (48%). This can likely be improved by hyperparameter tuning,
more accurate genre labels, and balancing the classes.
1 Introduction
Music and music streaming are an essential part of people's daily activities. A study conducted in
Germany showed that 74% of people use a music streaming service at least once per day. Furthermore,
32% of the respondents answered that they constantly have it running in the background [1]. The reasons
people listen to music vary, e.g., changing their mood, exploring their personality, or passing the
time [2].
Listening to music constantly requires a sufficient volume of music that matches the taste of the
listener and is ordered in a certain way. Doing this manually can be time-intensive, which is why today's
music streaming services offer features that take care of it for the user [3]. This requires information
such as metadata about the individual songs, and the process of obtaining this information is called
information retrieval [4]. Metadata helps with identifying whether songs belong to the same genre or
style. Listeners often combine music based on their desired goal or based on the genre of the music [3].
This paper aims to assist this process of classifying music into a genre by using the Million Song
Dataset [5] (MSD) and supervised machine learning methods. Firstly, it gives insight into the nature of
the dataset and determines which features are important for a song to be popular. Furthermore, it aims
to provide insight into which of the features provided in the dataset are most important for classifying
the genre of a song. In addition, the dataset is extended with external features from the Spotify API,
such as danceability, energy, and more, to enable a better classification. The features danceability and
energy were originally part of the MSD, but they were present only as zero values.
1.1 Research Questions
This paper aims to contribute to research by answering the following research questions:
- Given the dataset, is a machine learning-based classification feasible, and to what extent?
- What are the characteristics of this dataset?
- How do features impact the popularity of a song?
- Which genre is the easiest to predict?
- Which features have the greatest impact on the prediction?
1.2 Related work
Applying machine learning to classify music genres is not a novel task, as it has been approached
several times before. Silla et al. [6] performed a classification using Support Vector Machine (SVM) and
Multilayer Perceptron models by extracting specific time segments of a song, using a dataset of 3,000
songs and 10 predefined genres; the top model was SVM with an accuracy of 65%. While this appears
to be a good prototype, having only 3,000 songs and 10 genres does not seem applicable to the actual
music world with its many niches, which is why this work can be trialed on a larger dataset such as the
MSD. Liang et al. [7] used the MSD for genre classification but included lyrics, as well as a cross-modal
approach, referring to the blending of song features together instead of using them independently. This
paper extends the work done by Liang et al. by shedding more light on how much each genre contributes
to machine learning classification, while still keeping the features apart. Oramas et al. [8] applied a deep
learning model to an extract of the MSD in 2018, while also merging the song features. As this was the
first deep learning model using a multi-modal approach on the MSD, the focus remained on exploring the
impact of the multi-modality, since a song can also contain extracts from, e.g., a reggae genre while still
belonging generally to rock.
Finally, Scaringella et al., in their survey on automatic genre classification [9], shed light on the main
obstacle of these tasks: the definitions of music genres are extremely convoluted, even though they are
highly relevant in our cultural background. This strongly affects music genre classification, which has
always been influenced by experience and background.
2 Methodology
The first step taken was an inspection of the MSD and its features, its quality, and its data quantity.
Based on this, the possible objectives for research were derived. This step included an additional literature
review to identify what research had already been done on the topic and the dataset. Furthermore, because
features such as danceability and energy were not usable initially, it was decided to extend the dataset
with these respective features, as they were perceived as important for the classification. This required
initial research to identify possible data sources that contained the desired information.
The algorithms were chosen based on the task of classification and the availability of methods in the
PySpark 2.4.7 version. While more algorithms were suitable for the task, they would also have required
adjustment for big data operations. As this extra research would have required extra time, it was perceived
as out of scope.
Once the approach was chosen, the next step was the preprocessing and cleaning of the dataset. The
specific actions of the preprocessing can be found in Section 2.2. Since the algorithms chosen are all
supervised machine learning methods, the data must be labeled. As mentioned by [10], the MSD does
not contain readily accessible genre labels. Here, the Last.fm API came in handy, helping to retrieve
the genre for each song. It was queried based on a predefined list of genres which was constructed using
multiple sources. Once the data were preprocessed and labeled, they were stored on the Hadoop Distributed
File System (HDFS) for further use. All of the operations were first performed on a small sample of the
data to confirm the correctness of the code before applying them to the entire dataset. On a technical
level, the operations were performed on HDFS using PySpark 2.4.7 and additional Python 3 libraries such
as requests, pandas, and NumPy.
2.1 Dataset
As already mentioned, most of the work was done employing the MSD [5]. It is a freely available
collection of audio features and metadata for a million contemporary popular music tracks, created in
2011 by gathering data from different sources. More specifically, it contains 280 GB of data covering
1,000,000 songs by 44,745 unique artists, dating from 1922 to 2011. Each song sample is described by 55
attribute fields, including artist metadata (such as the name, location, and related tags), song metadata
(e.g., name, sample rate, year of release), and song audio features, which describe the main characteristics
of the audio signal itself and the musical aspects of the song, e.g., duration, key, mode, tempo, loudness,
and energy.
2.2 Preprocessing
After reading the data into a dataframe and inspecting it initially, it became apparent that the dataset
required preprocessing in order to be used for its intended purpose. Therefore, the following preprocessing
steps were taken: Firstly, it was identified which features should remain and which ones should be removed,
and the respective features were removed accordingly. Once the dataset contained only the desired
features, its quality was checked by inspecting for missing values and NaN values. Columns that contained
any of these were removed. Because some columns contained arrays, some modifications had to be made,
since PySpark by default reads everything as a string if a schema is not provided. Therefore, the inferSchema
option was enabled. However, since the dataset was stored in CSV files, the schema for array-type
columns could not be inferred, and a manual schema had to be applied once the data was read. Once
cleaned, the data were stored in HDFS for further use.
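A minimal sketch of this reading step is shown below. The HDFS path, the column name artist_terms,
and the ";" delimiter are illustrative assumptions, not the exact MSD field layout:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, col

    spark = SparkSession.builder.appName("msd-preprocessing").getOrCreate()

    # inferSchema handles the scalar columns; array-valued fields cannot be
    # inferred from CSV, so they arrive as delimited strings and are
    # converted manually after reading.
    df = spark.read.csv("hdfs:///msd/raw", header=True, inferSchema=True)
    df = df.withColumn("artist_terms", split(col("artist_terms"), ";"))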
The initial dataset had a compressed size of 300 gigabytes (GB), which is around 676 GB uncompressed.
After the preprocessing, the dataset had a compressed size of 104 GB.
2.2.1 Machine Learning
Data processing prior to building ML models is crucial. For the machine learning task, a subset of
15,000 songs was used, since these songs had the additional features mentioned in Section 2.5. After
reading the sample from the directory, the six most common genres were taken, namely "house", "trance",
"metal", "pop", "rock", and "jazz". Since the label is categorical and the models require numerical
data, the genres were cast to integers. Then, all features and labels were assembled into a vector
that PySpark classifiers can handle. The data was divided into a training and a test set. These two sets
are treated independently in order to avoid overfitting. The data was then scaled using the Min-Max
functionality of PySpark.
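A minimal sketch of this preparation, assuming a dataframe sample with a string column genre and the
feature columns of Section 2.5 (StringIndexer is one way to implement the casting to integers; the exact
implementation may differ):

    from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

    # Encode the six genre strings as integer labels.
    indexer = StringIndexer(inputCol="genre", outputCol="label")
    indexed = indexer.fit(sample).transform(sample)

    # Assemble the numeric features into the single vector column
    # that PySpark classifiers expect; column names are illustrative.
    feature_cols = ["danceability", "energy", "valence", "liveness",
                    "acousticness", "instrumentalness", "tempo", "duration",
                    "key", "mode", "loudness", "time_signature", "year"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
    assembled = assembler.transform(indexed)

    # Split first, then fit the scaler on the training set only,
    # so the test set stays independent.
    train, test = assembled.randomSplit([0.7, 0.3], seed=42)
    scaler_model = MinMaxScaler(inputCol="features_raw",
                                outputCol="features").fit(train)
    train, test = scaler_model.transform(train), scaler_model.transform(test)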
2.3 Last.fm API
Last.fm is one of the largest online music services, offering features like music recommendations, radios,
and charts per country and genre [11]. One unique feature of Last.fm is the community aspect: users can
collaborate, send each other songs, and put so-called tags on songs, artists, and albums. These tags
are visible to every user of the platform. It is also possible to "vote" for a tag that a previous user has
assigned to a song or artist, increasing the relevance of that tag.
Multiple GET requests can be sent to the API; one of them is a GET request for the "top tags" of
a specific song, which returns the most-voted tags of that song [11]. Since it is not defined when a tag
is considered a top tag, and the API does not provide a GET method for all the tags a song has received,
this project relied solely on the top-tags function to retrieve an up-to-date genre for the songs. As a
result, the Last.fm API returned a genre for only 5,000 of the 934,000 songs. Initially, this was a setback,
as the project relied on accurate genres: many songs carried tags one could believe to be the genre of
the song, but these had not been voted for enough by other users, so they were not considered top tags
by Last.fm. To solve this issue, the MSD column containing artist terms was reviewed for the songs for
which the Last.fm API did return a genre, and it became clear that only about one in seven would have
been labeled differently. Therefore, the remaining songs of the MSD were labeled using the artist terms
column.
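As a sketch, the labeling query could look as follows. The track.getTopTags method and the base URL
are part of the public Last.fm API; the helper name and the genres argument (the predefined genre list
mentioned above) are illustrative:

    import requests

    API_URL = "http://ws.audioscrobbler.com/2.0/"

    def lastfm_genre(artist, title, api_key, genres):
        # Query the top tags of a track and return the first tag that
        # matches the predefined genre list, or None if there is none.
        params = {"method": "track.gettoptags", "artist": artist,
                  "track": title, "api_key": api_key, "format": "json"}
        resp = requests.get(API_URL, params=params, timeout=10)
        for tag in resp.json().get("toptags", {}).get("tag", []):
            if tag["name"].lower() in genres:
                return tag["name"].lower()
        return None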
When considering distributed code that calls an API, load balancing can become an issue, depending
on the API implementation. The rate limiting of Last.fm is defined as follows: a token is allowed to
retrieve up to 100 MB of data from the API, and the time window until the next request is defined as
"reasonable usage", which did not pose a problem in this project [11].
2.4 Spotify API
Spotify offers an API that allows developers to interact with the Spotify application [12]. For this research
project, the "search" and "audio features" endpoints were of interest. First, the artist name and song
title were used to retrieve the Spotify ID of a particular song. When searching, the endpoint will almost
always return a result for the search query, but it will not always be the intended result. Therefore,
a function was added that checked whether the returned data was indeed the desired one. The function
compared the strings and only accepted strings that match with at least 80 percent. The identifier was
then stored with the corresponding song. Once all IDs were collected, they were used to retrieve audio
features for the songs via the get-track-audio-features endpoint. This endpoint requires a Spotify ID as
input and returns a JSON object containing the following audio features of a song: acousticness,
danceability, energy, liveness, valence, duration, instrumentalness, key, loudness, mode, tempo, speechiness,
and time signature. Features such as duration, key, and tempo were already part of the MSD, but
the remaining features were seen as interesting and important for the classification.
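A minimal sketch of such a check, using Python's standard difflib similarity ratio as one way to realize
the 80 percent threshold (the original implementation may differ):

    from difflib import SequenceMatcher

    def is_match(expected, returned, threshold=0.8):
        # Accept a Spotify search hit only if the returned title is at
        # least ~80% similar to the title that was searched for.
        ratio = SequenceMatcher(None, expected.lower(),
                                returned.lower()).ratio()
        return ratio >= threshold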
One challenge posed by the Spotify API was its rate limiting: a token can retrieve between roughly
180 and 500 songs per minute. When trying to go above this threshold, the API returns a response
with status 429, stating how much time should pass until the next API call will be successful again [12].
If this is not obeyed, the timeout of the token increases indefinitely, essentially rendering the token
useless. With the cluster running distributed code and multiple cores calling the API at the same time
with the same key, the Spotify API quickly sends the key into a timeout. This led to the choice of
limiting the cluster to a single machine and using time.sleep calls inside every API call to stay on the
safe side of a timeout.
Another challenge was the fact that a token is only valid for one hour, but calling the API for 920
thousand songs takes considerably longer than that. Therefore, before each song ID was retrieved from
the API, it was checked whether one hour had passed. If so, a new token was requested from the API
and used for the subsequent calls.
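Together, the two mechanisms could look roughly like the sketch below. The token endpoint, the
client-credentials flow, and the Retry-After header are part of the public Spotify API; the helper names
and the sleep durations are illustrative assumptions:

    import time
    import requests

    TOKEN_URL = "https://accounts.spotify.com/api/token"

    def new_token(client_id, client_secret):
        resp = requests.post(TOKEN_URL,
                             data={"grant_type": "client_credentials"},
                             auth=(client_id, client_secret))
        return resp.json()["access_token"], time.time()

    def spotify_get(url, token, issued_at, client_id, client_secret):
        # Refresh the token shortly before the one-hour expiry.
        if time.time() - issued_at > 3300:
            token, issued_at = new_token(client_id, client_secret)
        headers = {"Authorization": "Bearer " + token}
        resp = requests.get(url, headers=headers)
        if resp.status_code == 429:
            # Honor the server-suggested wait instead of letting the
            # token's timeout grow indefinitely.
            time.sleep(int(resp.headers.get("Retry-After", "5")) + 1)
            resp = requests.get(url, headers=headers)
        time.sleep(0.2)  # stay well below the per-minute ceiling
        return resp.json(), token, issued_at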
2.5 Relevant Audio features used for Machine Learning
By calling the Spotify API, the following acoustic features could be retrieved:
- Danceability: describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
- Valence: a measure that describes the musical positivity conveyed by a track.
- Energy: a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
- Liveness: a probability value for the presence of an audience in the recording.
- Acousticness: a measure of how acoustic the track is.
- Instrumentalness: predicts whether a track contains no vocals.
These features were combined with specific features from the MSD (those perceived as the most
reasonable for a genre prediction task), obtaining the input values for training and testing the machine
learning algorithms. Those are:
- Tempo: the estimated number of beats per minute.
- Duration (in seconds).
- Key: the musical key the song is in, ranging from 0 (C) to 11 (B).
- Mode: the musical mode the song is in, being minor (value 0) or major (value 1).
- Loudness: the overall loudness of a track in decibels (dB). Values are averaged across the entire track and usually range from -60 to 0 (samples with a loudness value greater than 0 were found; above the 0 threshold the song distorts).
- Time Signature: the number of beats per bar.
- Year (of release).
2.6 Machine Learning Algorithms
After gathering and preprocessing all data, three different ML algorithms were tested, namely Decision
Tree (DT), Random Forest (RF), and Logistic Regression (LR). All available and usable features were
assembled, together with the encoded "genre" label, and used as input for the ML models. The dataset
was also split into a training set, encompassing 70% of all available data, and a test set with the remaining
data. The latter was used for evaluation. The metric for performance was accuracy. For error analysis,
the predictions are visualized in a confusion matrix.
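A minimal sketch of the training and evaluation loop, assuming the scaled train and test dataframes
from the sketch in Section 2.2.1:

    from pyspark.ml.classification import (DecisionTreeClassifier,
                                           RandomForestClassifier,
                                           LogisticRegression)
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")

    models = {
        "DT": DecisionTreeClassifier(featuresCol="features", labelCol="label"),
        "RF": RandomForestClassifier(featuresCol="features", labelCol="label"),
        "LR": LogisticRegression(featuresCol="features", labelCol="label"),
    }

    # Fit each classifier on the training set and report test accuracy.
    for name, clf in models.items():
        fitted = clf.fit(train)
        acc = evaluator.evaluate(fitted.transform(test))
        print(name, "accuracy:", round(acc, 2))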
3 Results and Discussion
General feature analysis of the MSD. Before the preparation and training of the models, the dataset
was analyzed for its properties. The properties were examined through the following questions:
- Which artists appear the most in the dataset?
- How many songs were released yearly?
- What is the distribution of genres?
- What are the five hottest songs in the dataset, and what do they have in common?
- How does the hotness of a song correlate with the features in the dataset?
3.1 Top Ten most occurring artists in the dataset
To answer the first question about the dataset properties, the PySpark functions groupBy and count
gave insight into the most frequently occurring artists (see Figure 1). Figure 1 shows that the ten most
present artists are well-known names like Michael Jackson, Neil Diamond, and Johnny Cash, each with
between 171 and 194 songs in the dataset.
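As a sketch, this aggregation could be written as follows, assuming the preprocessed dataframe df with
an artist name column (column name illustrative):

    # Ten most frequently occurring artists in the dataset (Figure 1).
    top_artists = (df.groupBy("artist_name")
                     .count()
                     .orderBy("count", ascending=False)
                     .limit(10))
    top_artists.show()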
3.2 Song releases per year
Next, it was analyzed how many songs were released each year, using the same aggregation. Figure 2
shows that the years with the most released songs all fall after the year 2000. It is assumed that this is
the case because, during this time span, the barrier to releasing music was lowered by the rise of the
internet. However, this cannot be conclusively established and requires further research.
Figure 1: Ten most occurring artists.
Figure 2: Song distribution per year of release.
3.3 Genre distribution
Not everybody enjoys the same type of music, and one way for people to find music is to look for songs
in the same genre [3]. The five most common genres in the MSD are rock, pop, jazz, house, and folk
(see Figure 3). Together, they represent approximately one-third of the dataset. Overall, the dataset
contains music of 35 genres.
Figure 3: Genre distribution in the MSD.
3.4 The five most popular songs
The five most popular songs can be found in Table 1. Of these five songs, two were released before
2000 and the others after 2000. One trait they all have in common is that they are in a minor key. It
was assumed that hotness is tightly connected to the number of listeners a song has had. When these
songs were looked up on Spotify to verify this, each of them turned out to have at least 280 million
streams, which indicates that these are indeed popular songs. When analyzing the mode in relation to
hotness using the mean, it was found that songs in a minor key have an average hotness of 0.37, which
is two percent higher than the average hotness of songs in a major key.
Artist | Song Title
Pearl Jam | Black
Black Sabbath | Iron Man
One Republic | Apologize
Keri Hilson and Kanye West | Knock You Down
Amy Winehouse | Rehab
Table 1: “Hottest” songs.
3.5 Correlation between song hotness and features
One objective of this research project was to determine how different features might influence the hotness
or popularity of a song. This was done individually for the preprocessed MSD and for the extended
version with the Spotify features. The Pearson correlation with song hotness was calculated for the
following features:
Feature | Pearson correlation
Danceability | -0.15
Valence | -0.12
Energy | 0.04
Liveness | 0.01
Acousticness | -0.04
Instrumentalness | -0.02
Table 2: Pearson correlation of Spotify features with song hotness.
Feature | Pearson correlation
Duration | -0.01
Loudness | 0.15
Tempo | 0.03
Table 3: Pearson correlation of MSD features with song hotness.
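These coefficients can be computed directly in PySpark; a sketch follows, assuming the hotness column
is named song_hotttnesss as in the MSD field list (other column names illustrative):

    # Pearson correlation (the default of DataFrame.stat.corr) of each
    # feature column with song hotness.
    features = ["danceability", "valence", "energy", "liveness",
                "acousticness", "instrumentalness",
                "duration", "loudness", "tempo"]
    for f in features:
        print(f, round(df.stat.corr("song_hotttnesss", f), 2))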
As shown in Table 2, higher danceability and higher valence have a slight negative impact on the
hotness of a song. The other observed features show no statistically relevant correlation with song
hotness.
When reviewing the song features already contained in the MSD and their correlation with the hotness
of a song, only loudness appears to have a minor impact: if a song is louder, it is likely to have a slightly
higher hotness as well. Duration and tempo showed no significant correlation in Table 3.
3.6 Machine Learning Evaluation
Table 4 shows the test accuracy of each trained ML model. Note that the training data is not balanced
in this example, but meaningful insights can still be inferred by visualizing the results with the
corresponding confusion matrix. Among the three, Random Forest outperforms the others, presumably
because it is an ensemble method. It builds several decision trees and averages their outputs, thus
reducing variance [13]. It also takes a random sample with replacement of the total training set, which
reduces overfitting [14].
Model | Accuracy
Decision Tree | 0.44
Random Forest | 0.48
Logistic Regression | 0.44
Table 4: ML model accuracies.

Figure 4: Confusion matrix for the RF classifier.
Figure 5: Feature importance for the RF classifier.

The ML-predicted labels and the actual true labels are shown in Figure 4. Class imbalance becomes
apparent in the figure, where the "house" genre is clearly dominant. Nevertheless, the majority of genres
are classified correctly in most cases, i.e., along the diagonal of the matrix. The only exception is "pop"
songs, which are often classified as "house". The reason behind this is most likely the class imbalance:
there is a tendency towards the majority class, i.e., the class with the most samples (house), which can
increase the misclassification of some genres. On the other hand, jazz songs are often classified correctly,
showing that this genre has very distinct features that most likely differentiate it from the others.
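As a sketch, such a matrix can be derived from the test predictions with a pivot, assuming rf_model is
the fitted Random Forest from the training loop above:

    # Cross-tabulate true labels against predicted labels to form the
    # confusion matrix underlying Figure 4.
    preds = rf_model.transform(test).select("label", "prediction")
    confusion = preds.groupBy("label").pivot("prediction").count().fillna(0)
    confusion.orderBy("label").show()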
Given that RF achieved the highest accuracy score, it was used to obtain the feature importance, that
is, the weight the classifier assigns to each feature when making a decision. The higher the importance,
the more relevant the feature is to the model when making a prediction. Feature importance is illustrated
in Figure 5.
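In PySpark, the fitted Random Forest exposes these weights directly; a sketch, reusing the feature_cols
list from the Section 2.2.1 sketch:

    import pandas as pd

    # featureImportances is a vector whose entries sum to 1, ordered like
    # the assembled feature columns; pairing them with the column names
    # gives the ranking behind Figure 5.
    importances = rf_model.featureImportances.toArray()
    ranking = (pd.DataFrame({"feature": feature_cols,
                             "importance": importances})
                 .sort_values("importance", ascending=False))
    print(ranking)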
4 Conclusion
The MSD contains a comprehensive collection of musical attributes and songs spanning 90 years, and
with its size of 300 GB it is located in the Big Data domain. It contains numerous columns per song,
such as tatums, valence, danceability, mode, and keys. After removing duplicates and dropping columns
not usable for the purposes described in this report, it shrank to about 100 GB.
After enriching the dataset with the Last.fm genres and the per-song audio features, the research questions
formulated at the beginning of the paper can be answered. Reviewing the dataset from a statistical/descriptive
perspective: the most popular genre is rock, followed by pop. The most prominent artist
in the dataset is Michael Jackson, with 194 songs contained in the MSD. The only attribute positively
correlated with song hotness is loudness, while danceability and valence have a slight negative impact.
The machine learning models constructed to answer the other research questions consisted of an RF,
a DT, and an LR model. When evaluating the accuracy of these models, RF was slightly ahead with 0.48
accuracy, while DT and LR both reached 0.44. House was the genre that was easiest to predict by the
RF, and danceability, instrumentalness, and acousticness were the features with the highest impact on
the models, as shown by the feature importance analysis. However, due to the small sample size of
15,000 songs, it is difficult to determine how valuable the results are, as described in the following
section.
5 Further Research
Considering the state of this research project, there are more topics to look into. First, the current
labeling of the dataset is not optimal. Therefore, a new source for labeling needs to be identified and
applied. This could help to increase the accuracy of the algorithms. Furthermore, it should prove
helpful to extend the given dataset with a substantial number of songs to provide more training data,
which could ultimately result in better models. A further developed model can be built by taking some
additional steps: First, data balancing (either by downsampling, given our large dataset, or by oversampling)
will likely cause a major improvement in model accuracy, as seen in other studies [15, 16]. Likewise,
hyperparameter optimization might also improve the accuracy scores of the ML models, as would a larger
and more curated dataset [17]. Finally, predicting only one genre per song is not meaningful nowadays.
We believe allowing several genre predictions, as in other multi-label music genre classification studies,
together with a likelihood as a measure of the confidence for each genre, might result in a more accurate
and more practical algorithm [18].
References
[1] Statista. Audiostreaming - Nutzungshäufigkeit 2022. https://de.statista.com/statistik/
daten/studie/189751/umfrage/nutzung-von-musik-streaming/.
[2] Adam J Lonsdale and Adrian C North. Why do we listen to music? A uses and gratifications analysis.
British Journal of Psychology, 102(1):108–134, 2011.
[3] Geoffray Bonnin and Dietmar Jannach. Automated generation of music playlists: Survey and exper-
iments. ACM Computing Surveys (CSUR), 47(2):1–35, 2014.
[4] Peter Knees, Tim Pohle, Markus Schedl, and Gerhard Widmer. Combining audio-based similarity
with web-based data to accelerate automatic music playlist generation. In Proceedings of the 8th
ACM international workshop on Multimedia information retrieval, pages 147–154, 2006.
[5] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The million song
dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR
2011), 2011.
[6] Carlos N Silla, Alessandro L Koerich, and Celso AA Kaestner. A machine learning approach to
automatic music genre classification. Journal of the Brazilian Computer Society, 14(3):7–18, 2008.
[7] Dawen Liang, Haijie Gu, and Brendan O’Connor. Music genre classification with the million song
dataset. Machine Learning Department, CMU, 2011.
[8] Sergio Oramas, Francesco Barbieri, Oriol Nieto Caballero, and Xavier Serra. Multimodal deep learn-
ing for music genre classification. Transactions of the International Society for Music Information
Retrieval, 1(1):4–21, 2018.
[9] Nicolas Scaringella, Giorgio Zoia, and Daniel Mlynek. Automatic genre classification of music content:
a survey. IEEE Signal Processing Magazine, 23(2):133–141, 2006.
[10] Hendrik Schreiber. Improving genre annotations for the million song dataset. In ISMIR, pages
241–247, 2015.
[11] Last.fm API. https://www.last.fm/api. Accessed: 2022-01.
[12] Spotify for Developers. https://developer.spotify.com/. Accessed: 2022-01.
[13] Robin Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics,
24(3):543–562, 2012.
[14] Michele Fratello and Roberto Tagliaferri. Decision trees and random forests. Encyclopedia of Bioin-
formatics and Computational Biology: ABC of Bioinformatics, 1:3, 2018.
[15] Nicolas Dauban, Christine Sénac, Julien Pinquier, and Pascal Gaillard. Towards a content-based pre-
diction of personalized musical preferences using transfer learning. In 2021 International Conference
on Content-Based Multimedia Indexing (CBMI), pages 1–6. IEEE, 2021.
[16] Alexander AS Gunawan, Derwin Suhartono, et al. Music recommender system based on genre using
convolutional recurrent neural networks. Procedia Computer Science, 157:99–109, 2019.
[17] Dorien Herremans, David Martens, and Kenneth Sörensen. Dance hit song prediction. Journal of
New Music Research, 43(3):291–302, 2014.
[18] Chris Sanden and John Z Zhang. Enhancing multi-label music genre classification through ensem-
ble techniques. In Proceedings of the 34th international ACM SIGIR conference on Research and
development in Information Retrieval, pages 705–714, 2011.