attribute fields, including artist metadata (such as name, location, and related tags), song metadata (e.g. name, sample rate, year of release), and song audio features, which describe the main characteristics of the audio signal and the musical aspects of the song, e.g. duration, key, mode, tempo, loudness, and energy.
2.2 Preprocessing
After reading the data into a dataframe and inspecting it, it became apparent that the dataset required preprocessing before it could be used for its intended purpose. Therefore, the following preprocessing steps were carried out. First, the features to keep were identified, and all other features were removed. Once the dataset contained only the desired features, its quality was checked by inspecting for missing and NaN values; columns that contained any of these were removed. Because some columns contain arrays, additional handling was needed: PySpark reads every column as a string if no schema is provided, so the infer-schema option was enabled. However, since the dataset was stored in CSV files, the schema of array-type columns could not be inferred, and a manual schema had to be applied once the data was read. After cleaning, the data were stored in HDFS for further use.
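A minimal PySpark sketch of these cleaning steps is shown below. The HDFS paths, the selected column names, and the output format (Parquet) are assumptions made for illustration, and parsing the array-type columns with from_json is one possible way to apply the manual schema after reading.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, FloatType, StringType

spark = SparkSession.builder.appName("msd-preprocessing").getOrCreate()

# Read the raw CSV files; inferSchema resolves scalar column types,
# but array-valued columns still arrive as plain strings.
raw_df = spark.read.csv("hdfs:///msd/raw/*.csv",   # hypothetical input path
                        header=True, inferSchema=True)

# Keep only the desired attribute fields (illustrative selection).
keep_cols = ["song_id", "artist_name", "year", "duration", "key", "mode",
             "tempo", "loudness", "energy", "artist_terms"]
df = raw_df.select(*keep_cols)

# Apply the manual schema for array-type columns after reading:
# the string representation is parsed into a real array column.
df = df.withColumn("artist_terms",
                   F.from_json(F.col("artist_terms"), ArrayType(StringType())))

# Drop every column that contains missing or NaN values.
numeric_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, (DoubleType, FloatType))]

def missing_count(c):
    cond = F.col(c).isNull()
    if c in numeric_cols:
        cond = cond | F.isnan(c)  # NaN check only applies to numeric columns
    return F.sum(cond.cast("int")).alias(c)

counts = df.select([missing_count(c) for c in df.columns]).first().asDict()
df = df.drop(*[c for c, n in counts.items() if n > 0])

# Persist the cleaned data to HDFS for further use.
df.write.mode("overwrite").parquet("hdfs:///msd/cleaned/")  # hypothetical output path
```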
The initial dataset had a compressed size of 300 gigabytes (GB), which corresponds to roughly 676 GB uncompressed. After preprocessing, the compressed size was reduced to 104 GB.
2.2.1 Machine Learning
Data processing prior to building ML models is crucial. For the machine learning task, a subset of 15,000 songs was used, since this subset contains the additional features mentioned in section 2.5. After reading the sample from the directory, the six most common genres were selected, namely “house”, “trance”, “metal”, “pop”, “rock”, and “jazz”. Since the label is categorical and the models require numerical data, the genres were cast to integers. Subsequently, all features and labels were assembled into a vector column that PySpark classifiers can handle. The data was divided into a training set and a test set; these two sets are treated independently in order to avoid overfitting. The data was then scaled using the Min-Max functionality of PySpark.
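A sketch of this preparation is given below. The input path, the feature columns, the 80/20 split ratio, and the use of StringIndexer to cast the genre labels to integers are assumptions chosen for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

spark = SparkSession.builder.appName("msd-ml-prep").getOrCreate()

# Load the 15,000-song sample (hypothetical path and column names).
sample_df = spark.read.parquet("hdfs:///msd/ml_sample/")

# Keep only the six most common genres.
genres = ["house", "trance", "metal", "pop", "rock", "jazz"]
sample_df = sample_df.filter(F.col("genre").isin(genres))

# Cast the categorical genre label to an integer index.
indexer = StringIndexer(inputCol="genre", outputCol="label")
indexed_df = indexer.fit(sample_df).transform(sample_df)

# Assemble the numeric audio features into the single vector column
# expected by PySpark classifiers.
feature_cols = ["duration", "key", "mode", "tempo", "loudness", "energy"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
assembled_df = assembler.transform(indexed_df)

# Split into independent training and test sets (ratio assumed).
train_df, test_df = assembled_df.randomSplit([0.8, 0.2], seed=42)

# Fit Min-Max scaling on the training set only and apply it to both sets.
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")
scaler_model = scaler.fit(train_df)
train_df = scaler_model.transform(train_df)
test_df = scaler_model.transform(test_df)
```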
2.3 Last.fm API
Last.fm is the largest online music service, offering features such as music recommendations, radio stations, and charts per country and genre [11]. One unique feature of Last.fm is its community aspect: users can collaborate, send each other songs, and attach so-called tags to songs, artists, and albums. These tags are visible to every user of the platform. It is also possible to “vote” for a tag that a previous user has assigned to a song or artist, which increases the relevance of that tag.
Multiple GET requests can be sent to the API; one of them retrieves the “top tags” of a specific song, i.e. the most-voted tags for that song [11]. Since it is not defined when a tag counts as a top tag, and the API does not provide a method to retrieve all tags a song has received, this project relied solely on the top-tags function to obtain an up-to-date genre for the songs. As a result, the Last.fm API returned a genre for only 5,000 of the 934,000 songs. Initially, this was a setback: the project relied on accurate genres, and many songs carried tags that could plausibly be the genre of the song but had not received enough votes from other users to be considered a top tag by Last.fm. To solve this issue, the artist-terms column of the MSD was reviewed for the songs for which the Last.fm API did return a genre, and it became clear that only about one in seven songs would have been labeled differently. Therefore, the remaining songs of the MSD were labeled using the artist-terms column.
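For illustration, the sketch below requests the top tags of a single song through the Last.fm track.getTopTags method; the API key, the example artist/track, and the response handling (based on the documented JSON format) are assumptions.

```python
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_API_KEY"  # placeholder for a personal Last.fm API key

def get_top_tags(artist, track):
    """Return the most-voted tags for a track via track.getTopTags."""
    params = {
        "method": "track.getTopTags",
        "artist": artist,
        "track": track,
        "api_key": API_KEY,
        "format": "json",
    }
    response = requests.get(API_ROOT, params=params, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if "error" in payload:  # e.g. track not found
        return []
    return payload.get("toptags", {}).get("tag", [])

# Example usage: treat the most-voted tag as the genre of the song.
tags = get_top_tags("Daft Punk", "One More Time")  # illustrative song
genre = tags[0]["name"] if tags else None
```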
When running distributed code that calls an external API, load balancing can become an issue, although this depends on the API implementation. The load balancing of Last.fm is defined as follows: A token is