attribute fields, including artist metadata (such as name, location, and related tags), song metadata (e.g. name, sample rate, year of release), and song audio features, which describe the main characteristics of the audio signal and the musical aspects of the song, e.g. duration, key, mode, tempo, loudness, and energy.
2.2 Preprocessing
After reading the data into a dataframe and inspecting it, it became apparent that the dataset required preprocessing before it could be used for its intended purpose. Therefore, the following preprocessing steps were carried out. First, the features to keep were identified, and all other features were removed. Once the dataset contained only the desired features, its quality was checked by inspecting for missing and NaN values; columns that contained any of these were removed. Because some columns contain arrays, additional handling was needed: PySpark reads every column as a string if no schema is provided, so the infer-schema option was enabled. However, since the dataset was stored in CSV files, the schema of array-type columns could not be inferred, and a manual schema had to be applied once the data was read. After cleaning, the data were stored in HDFS for further use.
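A minimal PySpark sketch of these cleaning steps is shown below. The HDFS paths, the selected column names, and the output format (Parquet) are assumptions made for illustration, and parsing the array-type columns with from_json is one possible way to apply the manual schema after reading.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, FloatType, StringType

spark = SparkSession.builder.appName("msd-preprocessing").getOrCreate()

# Read the raw CSV files; inferSchema resolves scalar column types,
# but array-valued columns still arrive as plain strings.
raw_df = spark.read.csv("hdfs:///msd/raw/*.csv",   # hypothetical input path
                        header=True, inferSchema=True)

# Keep only the desired attribute fields (illustrative selection).
keep_cols = ["song_id", "artist_name", "year", "duration", "key", "mode",
             "tempo", "loudness", "energy", "artist_terms"]
df = raw_df.select(*keep_cols)

# Apply the manual schema for array-type columns after reading:
# the string representation is parsed into a real array column.
df = df.withColumn("artist_terms",
                   F.from_json(F.col("artist_terms"), ArrayType(StringType())))

# Drop every column that contains missing or NaN values.
numeric_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, (DoubleType, FloatType))]

def missing_count(c):
    cond = F.col(c).isNull()
    if c in numeric_cols:
        cond = cond | F.isnan(c)  # NaN check only applies to numeric columns
    return F.sum(cond.cast("int")).alias(c)

counts = df.select([missing_count(c) for c in df.columns]).first().asDict()
df = df.drop(*[c for c, n in counts.items() if n > 0])

# Persist the cleaned data to HDFS for further use.
df.write.mode("overwrite").parquet("hdfs:///msd/cleaned/")  # hypothetical output path
```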
The initial dataset had a compressed size of 300 gigabytes (GB), which corresponds to roughly 676 GB uncompressed. After preprocessing, the compressed size was reduced to 104 GB.
2.2.1 Machine Learning
Data processing prior to building ML models is crucial. For the machine learning task, a subset of 15,000 songs was used, since this subset contains the additional features mentioned in section 2.5. After reading the sample from the directory, the six most common genres were selected, namely “house”, “trance”, “metal”, “pop”, “rock”, and “jazz”. Since the label is categorical and the models require numerical data, the genres were cast to integers. Subsequently, all features and labels were assembled into a vector column that PySpark classifiers can handle. The data was divided into a training set and a test set; these two sets are treated independently in order to avoid overfitting. The data was then scaled using the Min-Max functionality of PySpark.
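A sketch of this preparation is given below. The input path, the feature columns, the 80/20 split ratio, and the use of StringIndexer to cast the genre labels to integers are assumptions chosen for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

spark = SparkSession.builder.appName("msd-ml-prep").getOrCreate()

# Load the 15,000-song sample (hypothetical path and column names).
sample_df = spark.read.parquet("hdfs:///msd/ml_sample/")

# Keep only the six most common genres.
genres = ["house", "trance", "metal", "pop", "rock", "jazz"]
sample_df = sample_df.filter(F.col("genre").isin(genres))

# Cast the categorical genre label to an integer index.
indexer = StringIndexer(inputCol="genre", outputCol="label")
indexed_df = indexer.fit(sample_df).transform(sample_df)

# Assemble the numeric audio features into the single vector column
# expected by PySpark classifiers.
feature_cols = ["duration", "key", "mode", "tempo", "loudness", "energy"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
assembled_df = assembler.transform(indexed_df)

# Split into independent training and test sets (ratio assumed).
train_df, test_df = assembled_df.randomSplit([0.8, 0.2], seed=42)

# Fit Min-Max scaling on the training set only and apply it to both sets.
scaler = MinMaxScaler(inputCol="features_raw", outputCol="features")
scaler_model = scaler.fit(train_df)
train_df = scaler_model.transform(train_df)
test_df = scaler_model.transform(test_df)
```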
2.3 Last.fm API
Last.fm is the largest online music service, offering features such as music recommendations, radio stations, and charts per country and genre [11]. One unique feature of Last.fm is its community aspect: users can collaborate, send each other songs, and attach so-called tags to songs, artists, and albums. These tags are visible to every user of the platform. It is also possible to “vote” for a tag that a previous user has assigned to a song or artist, which increases the relevance of that tag.
Multiple GET requests can be sent to the API; one of them retrieves the “top tags” of a specific song, i.e. the most-voted tags for that song [11]. Since it is not defined when a tag counts as a top tag, and the API does not provide a method to retrieve all tags a song has received, this project relied solely on the top-tags function to obtain an up-to-date genre for the songs. As a result, the Last.fm API returned a genre for only 5,000 of the 934,000 songs. Initially, this was a setback: the project relied on accurate genres, and many songs carried tags that could plausibly be the genre of the song but had not received enough votes from other users to be considered a top tag by Last.fm. To solve this issue, the artist-terms column of the MSD was reviewed for the songs for which the Last.fm API did return a genre, and it became clear that only about one in seven songs would have been labeled differently. Therefore, the remaining songs of the MSD were labeled using the artist-terms column.
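For illustration, the sketch below requests the top tags of a single song through the Last.fm track.getTopTags method; the API key, the example artist/track, and the response handling (based on the documented JSON format) are assumptions.

```python
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_API_KEY"  # placeholder for a personal Last.fm API key

def get_top_tags(artist, track):
    """Return the most-voted tags for a track via track.getTopTags."""
    params = {
        "method": "track.getTopTags",
        "artist": artist,
        "track": track,
        "api_key": API_KEY,
        "format": "json",
    }
    response = requests.get(API_ROOT, params=params, timeout=10)
    response.raise_for_status()
    payload = response.json()
    if "error" in payload:  # e.g. track not found
        return []
    return payload.get("toptags", {}).get("tag", [])

# Example usage: treat the most-voted tag as the genre of the song.
tags = get_top_tags("Daft Punk", "One More Time")  # illustrative song
genre = tags[0]["name"] if tags else None
```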
When running distributed code that calls an external API, load balancing can become an issue, although this depends on the API implementation. The load balancing of Last.fm is defined as follows: A token is