Figure 3: Hip-Hop top-10 word occurrences per genre.
our data set. For example, for all stop words like “I’ll” and “would’ve” we added the version without the apostrophe, “ill” and “wouldve”, because that is the form in which they appear in our lyrics. Finally, after some fine-tuning, we found that keeping the 2000 most important features leads to the best results.
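As a minimal sketch (using hypothetical variable names such as base_stop_words and lyrics, and only a small excerpt of the stop-word list), this count-based embedding could be built with scikit-learn as follows:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical base stop-word list; the full list is longer than shown here.
base_stop_words = ["i'll", "would've", "i", "the", "a", "and"]

# Add the apostrophe-free variants ("ill", "wouldve"), since that is the form
# in which these contractions appear in our lyrics.
stop_words = base_stop_words + [w.replace("'", "") for w in base_stop_words if "'" in w]

# Keep only the 2000 most important (most frequent) tokens, as found by fine-tuning.
count_vectorizer = CountVectorizer(stop_words=stop_words, max_features=2000)

# lyrics is assumed to be a list of lyric strings, one per song.
X_counts = count_vectorizer.fit_transform(lyrics)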
We exploited the count vectors to plot word fre-
quency comparisons between genres. The word
counts in each genre have been standardized as fol-
lows:
word count = (word count − mean length_genre) / (max length_genre − min length_genre)
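A minimal sketch of this standardization, assuming the per-genre length statistics are precomputed (their exact computation is not shown here):

import numpy as np

def standardize_counts(word_counts, mean_length, max_length, min_length):
    # Apply the per-genre standardization from the formula above.
    # word_counts is assumed to be an array of raw counts for one genre;
    # the three length statistics are assumed to be precomputed for that genre.
    return (np.asarray(word_counts) - mean_length) / (max_length - min_length)

# Hypothetical usage for a single genre:
# normalized = standardize_counts(hip_hop_counts, mean_len, max_len, min_len)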
This already provides further insight into the distribution of words across genres. Figure 3 compares the min-max normalized frequencies of the most relevant words, in this sense, for the “Hip Hop” category. We can see that “like”, probably used in rap music to build metaphors, occurs almost twice as often as in any other genre. Moreover, “nigga” and “shit” have almost zero relevance in the other genres. This confirms that a “Hip Hop” song is easy to identify.
4.2. TF-IDF Vector
The second method of embedding a song into a vector uses the term frequency – inverse document frequency (TF-IDF) of the words in the lyrics. The features are the 2000 most relevant word types, and the feature values are calculated as follows:
tf(t, d) = f_{t,d} / Σ_{t' ∈ d} f_{t',d}

idf(t) = log[(1 + n)/(1 + df(t))] + 1

tfidf(t, d) = tf(t, d) · idf(t)
Notice that in idf(t) we used smoothing to prevent divisions by zero when a word does not appear in any training document, which could happen for words encountered only in the testing phase. In this way we scale down the impact of tokens that appear very frequently in our lyrics and are therefore, empirically, less informative for our classification task, compared to tokens which appear only in a small subset of our data set.
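As an illustration, the formulas above, including the smoothed idf, can be computed directly on a hypothetical toy corpus:

import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [["love", "you", "love"], ["money", "love"], ["shit", "money"]]
n = len(docs)

def tf(term, doc):
    # Term frequency: raw count of the term divided by the total tokens in doc.
    return doc.count(term) / len(doc)

def idf(term):
    # Smoothed inverse document frequency, as defined above.
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("love", docs[0]))  # frequent term: lower idf weight
print(tfidf("shit", docs[2]))  # rare term: higher idf weight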
We still remove stop words before computing the vector values. However, a downside of this embedding is, again, sparsity. For this reason, we apply dimensionality reduction using truncated singular value decomposition (SVD), which is known as latent semantic analysis (LSA) in the context of TF-IDF matrices. We keep the first 100 principal components. We used the scikit-learn TfidfVectorizer to obtain this embedding.
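A minimal sketch of this TF-IDF + LSA pipeline, assuming lyrics and the extended stop_words list from the sketch above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# TF-IDF with 2000 features; smooth_idf=True corresponds to the smoothed
# idf(t) = log[(1 + n)/(1 + df(t))] + 1 used above.
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=2000, smooth_idf=True)

# Truncated SVD (LSA) keeping the first 100 components to reduce sparsity.
lsa = TruncatedSVD(n_components=100, random_state=0)

tfidf_lsa = make_pipeline(tfidf_vectorizer, lsa)
X_tfidf_lsa = tfidf_lsa.fit_transform(lyrics)  # shape: (n_songs, 100)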
4.3. Word2Vec
The third and last embeddings are obtained by averaging the word vector representations learned through Word2Vec. We explored two variations of this embedding: first, we used the word vectors of the Word2Vec model pre-trained on the Google News data set; second, we trained the Word2Vec model on our corpus in order to embed case-specific information in the word vectors. Word2Vec compresses the sparse representation into 300 features, using a skip-gram model that learns to predict the nearby context words given the current word. Once we have the word embeddings, we represent each song as the average of its word vectors.
The advantage of these vectors over the ones produced by the count vectorizer and the TF-IDF vectorizer is that they are dense rather than sparse, and they carry context information, because the embedding of each word depends on the surrounding words. In this way each song vector carries not only information about word frequencies but also about word meanings expressed as a function of their context. We used Word2Vec from the gensim library to perform this embedding.
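A minimal sketch of the corpus-trained variant with gensim and the averaging step, assuming tokenized_lyrics is a list of token lists (one per song); apart from the 300-dimensional skip-gram setting, the parameter values are illustrative:

import numpy as np
from gensim.models import Word2Vec

# Train a 300-dimensional skip-gram model (sg=1) on our own corpus.
w2v = Word2Vec(sentences=tokenized_lyrics, vector_size=300, sg=1,
               window=5, min_count=2, workers=4)

def song_vector(tokens, keyed_vectors, dim=300):
    # Average the vectors of the tokens found in the vocabulary;
    # out-of-vocabulary tokens contribute nothing.
    vecs = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_w2v = np.vstack([song_vector(toks, w2v.wv) for toks in tokenized_lyrics])

For the pre-trained variant, the Google News vectors can be loaded (for example through gensim.downloader) and passed to song_vector in the same way.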
Note that, in the first case, only the words present in the Google News vocabulary contribute to the song’s representation vector. This could be a problem mainly for the “Hip Hop” songs, which contain many slang words.
4.4. Model Training and Evaluation
We now have a class-balanced training set over the five genres, preprocessed independently with each of the embeddings described above. We feed each differently processed dataset to different ML models. Six classifiers are fitted on the training data, namely Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Multi-Layer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM). Naive Bayes cannot handle the negative values produced by some methods, for instance SVD or Word2Vec, so it was left out in those cases. Each of the aforementioned models is fitted on training data processed by a different NLP technique, in order to find the most effective one.
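As a minimal sketch of this comparison step, assuming X_train, y_train, X_test, and y_test hold one of the embedded training/test sets and their genre labels (the hyperparameters below are illustrative, not necessarily those used here):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(),
    "MLP": MLPClassifier(max_iter=500),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}

# Fit every classifier on the same embedded training set and report accuracy.
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))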