Classification of song genres based on the lyrics

Natural Language Processing – Final report - Group 9

Amir Bachir, Pablo Laso

University of Twente, Enschede, Netherlands

October, 2022

Abstract

Classifying songs into music genres is a hard task: because of the artistic nature of music, genre labels are often subjective and controversial, and some genres overlap. Most research on this topic tries to extract knowledge from the audio features of songs.

In this project, we look for distinctions between the lyrics of songs that belong to different genres, and we investigate how accurate and reliable NLP tools can be for genre prediction. We classify songs into five genres by training machine learning classifiers on vector representations of the songs’ lyrics. We explore three different ways of embedding text and use them as input for different classifiers to compare the results.

1. Introduction

“Next song recommendation” is an important part of the success of music platforms such as Spotify and iTunes music. The recommendation algorithms take into account multiple features related to the listening history of the user, one of which is the song genre. Songs are usually assigned the genre of their artist, but it is not uncommon for artists to play in multiple genres, so a more effective approach is to determine the genre of each song individually.

Genres surely differ in the instrumental, musical, rhythmic, and audio properties of their tracks. Our hypothesis is that lyrics are also a strong signal for identifying song genres.

It is commonly held that most songs talk about the same four topics: love, friendship, statements of discontent, and death. The topic might be the same across genres, but there are countless ways to talk about it, and we want to know whether the words used within a genre are similar to each other and distinct from those of other genres.

2. Related work

Several previous works address the classification of song genres. For example, D. Bužić and J. Dobša [1] used Naive Bayes on a dataset of 207 songs performed by Nirvana and Metallica. They obtained very good results: a precision of 0.93, a recall of 0.95, and an F1-measure of 0.94. Lyrics classification with Naive Bayes can therefore be considered successful on a small dataset. Still, classifying song genres based on the lyrics alone is a difficult task, which is why many different approaches have been attempted. For instance, A. Tsaptsinos [2] uses a Hierarchical Attention Network to capture the hierarchical layer structure that lyrics exhibit.

3. Dataset and Data Processing

The dataset was obtained from the GitHub repository “Cypher” by tmthyjames [3]. Finding a large and “correct” lyrics database is difficult due to copyright issues; in fact, this one was built by scraping web pages. The data was scraped by the author through the Python Cypher library. The dataset has 2,778,359 rows. The columns are ‘album’, ‘song’, ‘artist’, ‘album genre’, ‘genre’, ‘year’, ‘lyric’ and ‘ranker genre’. We keep the ‘ranker genre’ column, obtained through the Ranker API, as the song genre label. It has seven unique values (‘Hip-Hop’, ‘Rhythm and blues’, ‘Pop’, ‘Heavy metal’, ‘Screamo’, ‘Punk rock’, ‘Country’) and no null values. For the purpose of our project, we merge the songs belonging to the ‘Heavy metal’, ‘Screamo’ and ‘Punk rock’ genres under the new genre ‘alt rock’, because they are related subgenres. Our final five genres are therefore ‘Hip-Hop’, ‘Rhythm and blues’, ‘Pop’, ‘Country’ and ‘alt rock’. Furthermore, the lyric of each song is spread over multiple rows, so we processed the data to have the complete lyric in one cell. After this step, the dataset has 62,155 rows x 7 columns.
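The row-merging and genre-unification steps described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' code: the record layout (a list of dictionaries), the key name `ranker_genre` written with an underscore, and the choice of (artist, album, song) as the grouping key are all assumptions made for the example.

```python
from collections import defaultdict

# Mapping taken from the text: three subgenres are merged into 'alt rock'.
GENRE_MAP = {
    "Heavy metal": "alt rock",
    "Screamo": "alt rock",
    "Punk rock": "alt rock",
}


def merge_lyric_rows(rows):
    """Join per-line rows into one record per (artist, album, song).

    `rows` is assumed to be an iterable of dicts with the keys
    'artist', 'album', 'song', 'lyric' and 'ranker_genre' (hypothetical names).
    """
    lines = defaultdict(list)
    genres = {}
    for row in rows:
        key = (row["artist"], row["album"], row["song"])
        lines[key].append(row["lyric"])
        # Unmapped genres (Hip-Hop, Pop, ...) are kept unchanged.
        genres[key] = GENRE_MAP.get(row["ranker_genre"], row["ranker_genre"])
    return [
        {
            "artist": key[0],
            "album": key[1],
            "song": key[2],
            "lyric": "\n".join(parts),
            "ranker_genre": genres[key],
        }
        for key, parts in lines.items()
    ]
```

Grouping on the full (artist, album, song) triple rather than on the song title alone avoids merging different songs that happen to share a title.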

There are two imbalance problems in the data. The first is dataset dependent: the number of songs per genre is not uniform, as shown in Figure 1. The second is dataset independent: the number of words per song varies significantly between genres (Figure 2); e.g., Hip-Hop lyrics have on average more than twice as many words as those of other genres. The first problem is handled by sampling the same number of songs per genre. The second, which may cause issues when using CountVectorizer, is in part implicitly handled by embedding/feature extraction algorithms such as TF-IDF.

Figure 1: Song genres distribution in the dataset.

Figure 2: Lyrics length per song genre.

As Figure 2 shows, the distributions of song length are approximately Gaussian in every genre, but with very different means and variances, as shown in Table 1.

Genre           Mean   Variance
Alt Rock        1049     241724
Hip-Hop         2736    1119291
Pop             1271     289529
Rhythm&Blues    1168     359480
Country          906     148766

Table 1: Song length mean and variance per genre.

We cleaned some of the noise from the data, which includes nonsense tokens caused by scraping problems and lyrics that are most probably incorrect or partial. Nonsense tokens were removed by selecting, via a regular expression, only words formed by letters. Partial lyrics were avoided by sampling only songs with more than 400 characters. Prior to any training, we looked at the 20 most frequent words per genre, excluding stop words. By this simple means, we can already extract some interesting insights. All genres except “Hip-Hop” share some words, such as “time”, “love”, “heart” and “oh”. Every genre except “Alt rock” has “yeah” as a very frequent filler. On the other hand, “Hip-Hop” has some slang words which do not appear in any other genre. This suggests that “Hip-Hop” may be easier to classify than the other categories.
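The cleaning and word-frequency steps above can be illustrated with a short sketch. The exact regular expression used in the project is not given, so the pattern below (whole tokens made only of the ASCII letters a-z, after lowercasing) is an assumption, as are the function names. The 400-character threshold and the top-20 count come from the text.

```python
import re
from collections import Counter

# Assumption: "words formed by letters" means whole tokens of ASCII letters.
WORD_RE = re.compile(r"\b[a-z]+\b")
MIN_CHARS = 400  # threshold stated in the text


def tokenize(lyric):
    """Lowercase the lyric and keep only tokens made entirely of letters."""
    return WORD_RE.findall(lyric.lower())


def is_long_enough(lyric, min_chars=MIN_CHARS):
    """Filter used to avoid partial lyrics."""
    return len(lyric) > min_chars


def top_words(lyrics, stop_words, k=20):
    """Return the k most frequent non-stop-word tokens in a list of lyrics."""
    counts = Counter()
    for lyric in lyrics:
        counts.update(t for t in tokenize(lyric) if t not in stop_words)
    return [word for word, _ in counts.most_common(k)]
```

Calling `top_words` once per genre, on the lyrics of that genre only, gives the kind of per-genre list discussed above. Note that this pattern drops mixed tokens such as "4ever" entirely instead of keeping their letter-only part.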

3.1. Training dataset

Finally, the dataset we use to train our models is composed of two columns: ‘lyric’ and ‘ranker genre’. From ‘lyric’ we extract our features, while ‘ranker genre’ is the label we want to predict. We sampled 3500 songs from each genre, with at least 400 characters per lyric. We further divided the sample into training (75%) and test (25%) sets, keeping the genres balanced in both.
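A balanced sample with a per-genre 75/25 split can be sketched as below. This is only an illustration under stated assumptions: songs are represented as (lyric, genre) tuples, the function name and the fixed random seed are hypothetical, and the project may well have used a library routine instead.

```python
import random
from collections import defaultdict


def balanced_split(songs, per_genre=3500, train_frac=0.75, seed=0):
    """Sample `per_genre` songs from each genre, then split each genre 75/25.

    `songs` is assumed to be a list of (lyric, genre) tuples.
    Returns (train, test); both keep the genres balanced.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for lyric, genre in songs:
        by_genre[genre].append((lyric, genre))

    train, test = [], []
    for items in by_genre.values():
        # random.sample raises ValueError if a genre has too few songs.
        sample = rng.sample(items, per_genre)
        cut = int(per_genre * train_frac)
        train.extend(sample[:cut])
        test.extend(sample[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With the numbers in the text, each genre contributes 2625 training and 875 test songs, so five genres give 13,125 training and 4,375 test songs in total.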

4. Methods

To train a classifier, we need a vector representation of our songs, so we must decide what the features of the vector are and how their values are computed. The better the representation, the better the classifier will discern genres.

4.1. Count Vectors

One of the simplest embeddings represents a document as a bag of words (BOW): a vector whose dimension is the size of the vocabulary, where the features are the words of the vocabulary and the values are the number of occurrences of each word in the document. Thus, we identify a document by the counts of its unigrams.

It is simple and quite effective, but it also has some downsides. The vector is sparse (most of the values are 0), because a song usually contains only a restricted subset of the vocabulary. Moreover, the values depend on the length of the song: longer songs have higher values. This can be a problem for classifiers such as KNN if they do not use cosine similarity as a distance measure.

We used scikit-learn’s CountVectorizer to obtain this embedding. We also removed stop words from the vocabulary. Stop words are, in general, words that appear very frequently in all kinds of texts and serve a functional purpose in sentences without adding meaning, such as “the”, “is” and “i’ll”. We used NLTK’s English stop word set and added some stop words specific to our dataset. For example, for all stop words like “I’ll” and “would’ve” we added the version without the apostrophe, “ill” and “wouldve”, because that is the form in which they appear in our lyrics. Finally, after some fine-tuning, we found that keeping the 2000 most important features led to the best results.

Figure 3: Hip-Hop top-10 words occurrences per genre.
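The bag-of-words embedding described in this section can be illustrated without any library. The sketch below is a plain-Python stand-in for CountVectorizer, not the call used in the project. It assumes that "most important features" means the words with the highest total count in the corpus, and the function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(token_lists, stop_words, max_features=2000):
    """Map the `max_features` most frequent non-stop-words to vector indices."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(t for t in tokens if t not in stop_words)
    words = [word for word, _ in totals.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(words))}


def count_vector(tokens, vocabulary):
    """Return the unigram counts of one document over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    return vector
```

Because each value is a raw count, a lyric repeated twice yields a vector with doubled values, which is the length dependence mentioned above.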

We exploited the count vectors to plot word frequency comparisons between genres. The word counts in each genre have been standardized as follows:

\[
\text{word count} = \frac{\text{word count} - \text{mean length}_{\text{genre}}}{\text{max length}_{\text{genre}} - \text{min length}_{\text{genre}}}
\]

This already gives further insight into the distribution of words among genres. The graph (Figure 3) shows a comparison of the normalized frequency of the most relevant words, in this sense, in the “Hip-Hop” category. We can see how “like”, probably used in rap music to build metaphors, appears almost twice as often as in all

other genres. Moreover, ”nigga” and ”shit” have

almost zero relevance for other genres. This con-

ﬁrms the ease of identiﬁcation of a ”Hip Hop” song.
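The standardization formula given earlier in this section subtracts a mean and divides by a range, a scheme usually called mean normalization. The text does not specify exactly which values the mean, maximum and minimum are computed over, so the sketch below simply applies the same arithmetic to any list of numbers; the function name is hypothetical.

```python
def mean_normalize(values):
    """Apply (x - mean) / (max - min) to every value in the list."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant list has no spread; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    mean = sum(values) / len(values)
    return [(v - mean) / (highest - lowest) for v in values]
```

For example, `mean_normalize([1, 2, 3])` gives `[-0.5, 0.0, 0.5]`: the results are centered on zero and span a range of exactly 1.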

4.2. TF-IDF Vector

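The second method of embedding a song into a vector uses the term frequency – inverse document frequency (TF-IDF) of the lyric’s words. The features are the 2000 most relevant word types and the feature values are calculated as follows:

\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\]

\[
\mathrm{idf}(t) = \log\frac{1 + n}{1 + \mathrm{df}(t)} + 1
\]

\[
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]

Here f_{t,d} is the number of occurrences of term t in document d, n is the number of documents, and df(t) is the number of documents that contain t.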

Notice that in idf(t) we used smoothing to prevent division by zero when no training document contains a term, which could happen in the testing phase. With TF-IDF weighting, we scale down the impact of tokens that occur very frequently in our lyrics and are therefore, empirically, less informative for our classification task than tokens which appear in a small subset of our dataset.
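The TF-IDF weighting with smoothing can be sketched directly from the formulas in this section. The code below is a plain-Python illustration with hypothetical function names. It follows the formulas as written here and is not meant to reproduce the exact output of scikit-learn’s TfidfVectorizer, which by default also L2-normalizes each document vector.

```python
import math
from collections import Counter


def document_frequencies(token_lists):
    """df(t): number of documents that contain term t."""
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df


def idf(term, df, n_docs):
    """Smoothed idf: log((1 + n) / (1 + df(t))) + 1.

    A Counter returns 0 for unseen terms, so the result stays finite
    even for a term that appears in no training document.
    """
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tf_idf(tokens, df, n_docs):
    """Return {term: tf(t, d) * idf(t)} for one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: (c / total) * idf(t, df, n_docs) for t, c in counts.items()}
```

A term that occurs in every document gets idf(t) = 1, the lowest possible value, while rarer terms get larger weights.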

We still remove stop words before computing the vector values. However, a downside of this embedding is still sparsity. For this reason, we apply dimensionality reduction using truncated singular value decomposition (SVD), which is known as latent semantic analysis (LSA) in the context of TF-IDF matrices. We keep the first 100 components. We used scikit-learn’s TfidfVectorizer to obtain this embedding.

4.3. Word2Vec

The third and last embeddings are obtained by averaging the word vectors learned through Word2Vec. We explored two variations of this embedding: first, we used the word vectors of the Word2Vec model pre-trained on the Google News dataset; second, we trained the Word2Vec model on our own corpus in order to embed case-specific information in the word vectors. Word2Vec compresses the sparse representation into 300 features with a skip-gram model, which learns to predict a word given a nearby word. Once we have the word embeddings, we represent each song as the average of its word vectors.

The advantage of these vectors over the ones produced by CountVectorizer and the TF-IDF vectorizer is that they are dense, not sparse, and that they carry context information, because the embedding of each word is learned from its surrounding words. In this way, each song vector carries information not only about word frequencies but also about word meaning, expressed as a function of context. We used Word2Vec from the gensim library to perform this embedding.

In the first case, only words present in the Google News vocabulary contribute to the song’s representation vector. This could be a problem mainly for “Hip-Hop” songs, which contain many slang words.
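The averaging step, including the fact that out-of-vocabulary words contribute nothing, can be sketched as follows. The plain dictionary `word_vectors` is a hypothetical stand-in for a trained or pre-trained Word2Vec lookup; the function name and the zero-vector fallback for songs with no known words are assumptions made for the example.

```python
def average_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have an embedding.

    `word_vectors` is assumed to map a word to a list of `dim` floats.
    Tokens without an embedding (e.g. slang missing from a pre-trained
    vocabulary) are skipped, so they do not influence the song vector.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue
        known += 1
        for i, x in enumerate(vector):
            total[i] += x
    if known == 0:
        return total  # all-zero vector when no token is known
    return [x / known for x in total]
```

In the project the vectors have 300 dimensions. A song made mostly of unknown slang is then represented only by its few known words, which illustrates the limitation described above.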

4.4. Model training and Evaluation

We now have an independently preprocessed training set that is balanced across five classes. We feed different ML models with each differently processed dataset.

Six classifiers are fitted on the training data, namely Logistic Regression (LR), K-Nearest Neighbors (KNN), Decision Tree (DT), Multi-Layer Perceptron (MLP), Random Forest (RF), and Support Vector Machine (SVM). Naive Bayes could not handle the negative values produced by some methods, for instance SVD or Word2Vec, so it was left out in those cases. Each of the aforementioned models is fitted on training data processed by different NLP techniques, to find the most effective one.

Performance is measured by means of accuracy, recall, and F-score. Since this is a multi-class classification problem, metric results vary per genre and model. We compare different models by means of the test accuracy. For further insight, we look at how each classifier performs for each genre by also considering recall and F-score and, more graphically, by means of the confusion matrix.
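The relation between the confusion matrix and the per-genre metrics can be made concrete with a short sketch. The indexing convention (`confusion[i][j]` counts samples of true class i predicted as class j) and the function name are assumptions for the example; the standard definitions of accuracy, precision, recall and F1 are used.

```python
def per_class_metrics(confusion, labels):
    """Compute overall accuracy and per-class precision, recall and F1.

    confusion[i][j] = number of samples with true class i predicted as j.
    """
    n = len(labels)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    report = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        predicted = sum(confusion[r][i] for r in range(n))  # column sum
        actual = sum(confusion[i])                           # row sum
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = correct / total if total else 0.0
    return accuracy, report
```

Accuracy summarizes the whole matrix in one number, while precision and recall expose which genres are over-predicted or missed, which is why both views are used in the evaluation.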

5. Results

Performance was measured by means of accuracy, recall, precision, and F1 score. We evaluated all models on all our NLP techniques. Figure 4 shows the accuracy scores for each of the aforementioned classifiers on each dataset resulting from feature extraction (TF-IDF, TF-IDF + SVD, Word2Vec, self-trained Word2Vec, and CountVectorizer). The highest values were obtained by the MLP, SVM, and RF classifiers. More specifically, RF with CountVectorizer and TF-IDF achieves the best score (73.25% test accuracy on average over the two), closely followed by SVM (71.05% on average). Word2Vec yields the lowest scores. SVD does not seem to increase scores significantly, as shown when it is used with TF-IDF.

Figure 4: Test accuracy for each combination of lyric embedding and classifier.

Table 2 reports the test accuracy for each classifier trained on the different datasets, namely TF-IDF, Word2Vec, and Count Vectorizer. Figure 5 shows the confusion matrix of the RF classifier fitted on TF-IDF training data. The diagonal values are the correctly classified samples. The x-axis corresponds to the true label, whereas the y-axis represents the label that the classifier actually assigned to the samples. Other classifiers, such as SVM, behave similarly (see A.11).

Further evaluation statistics computed for RF (see A.12) show high values for Hip-Hop: a precision of 0.91, a recall of 0.95, and an F-score of 0.93. In contrast, the lowest F-scores are obtained for RNB and Country. Alt Rock has the lowest precision, although its F-score is close to the average. Both Alt Rock and Pop show precision values ranging from 0.65 to 0.72 and an F-score of 0.70.

Model            CountVec   Tf-Idf   Word2Vec
Logistic Regr.   0.678      0.691    0.449
KNN              0.453      0.482    0.452
Decision Tree    0.605      0.600    0.480
MLP (NN)         0.723      0.713    0.481
Random Forest    0.735      0.730    0.580
SVM              0.689      0.732    0.450

Table 2: Test accuracy results for ML classifiers on CountVec, Tf-Idf, Word2Vec processed data.

Figure 5: RF on TF-IDF vectors Confusion Matrix.

6. Discussion

The highest results are obtained by LR, MLP, RF and SVM on the TF-IDF and Count Vectorizer training sets, with SVM and especially RF reaching the best performance of all. The accuracies of DT and KNN are considerably lower, which makes these models unreliable. Therefore, we can state that TF-IDF or Count Vectorizer for processing together with RF for classification is a suitable technique for our goal.

Based on Figure 5, we noticed that Hip-Hop accounts for the fewest errors. In Figure 3, we can observe that the top Hip-Hop words appear with a distinctly different frequency in other genres, which makes Hip-Hop easier to differentiate and classify. In contrast, “alt Rock” and “Country” have a higher number of samples confused between them. In fact, even the top-10 normalized word counts for Country (Figure A.10), which should be the most characteristic of the genre, are similar in the rest of the genres. For example, the occurrences of “love” are very similar to those in “Pop”, and it is a popular word in all other genres, too. The same holds for “know”. The counts of the subsequent words are increasingly similar to those of other genres. In other words, the Country vocabulary does not have as high a differentiating potential as that of Hip-Hop. Other genres fall in between these two, although none of them gets close to Hip-Hop.

Furthermore, the RF statistics (A.12) show that Hip-Hop obtains the highest scores in all metrics (precision, recall, and F-score), all of them above 0.90. This score was achieved using only NLP techniques, which shows that these methods can be very useful for genre prediction. However, the statistics also show that other genres are not classified as accurately (most values range around 0.65-0.70 in the aforementioned evaluation metrics).

This suggests that, although NLP can be very useful for some genres, it might not be enough to classify all song genres. That is, our methods could be used as a fundamental component of a more complex algorithm that takes into account additional data types (other than text) for song genre classification.

Regarding the other NLP techniques, none has shown greater potential than TF-IDF and Count Vectorizer. We believe that pre-trained Word2Vec embeddings, for example, may have given worse results than TF-IDF encoding because the pre-trained vocabulary does not include slang words or filler words, so their contribution is not considered. Somewhat surprisingly, the accuracy does not increase with the Word2Vec embeddings learned on our corpus either. This might be due to insufficient training and to the quality of the data.

7. Conclusion

We have tried different NLP techniques to process our data. We also trained different ML models on that data and evaluated them with different metrics.

The best results are obtained when using TF-IDF or Count Vectorizer on song lyrics and RF for song genre prediction, with an accuracy of 73% for a five-class classification problem.

The easiest genre to classify is Hip-Hop, mostly due to its idiosyncratic vocabulary. Other genres, albeit still classified rather accurately, show some bias towards other classes. For example, Country, which has proven to be the most difficult genre to classify, is confused with RNB and alt rock, and vice versa.

In conclusion, applying NLP techniques to song lyrics for song genre classification is genuinely effective on some genres. It can therefore be a very powerful tool for genre classification algorithms, which might benefit from including additional data types (such as rhythm or audio) to make up for the complexity that certain genres pose.

References

[1] D. Bužić and J. Dobša, Lyrics classification using Naive Bayes (41st International Convention on Information, Communication Technology, Electronics, and Microelectronics (MIPRO), 2018, pp. 1011-1015, doi: 10.23919/MIPRO.2018.8400185).

[2] A. Tsaptsinos, Lyrics-based music genre classification using a hierarchical attention network (arXiv preprint arXiv:1707.04678, 2017).

[3] tmthyjames, Cypher, GitHub (2017), https://github.com/tmthyjames/cypher.git (visited on 10/18/2022).

Appendix A. Additional Information

Below we show other figures and graphs that might be useful to gain better insight into the dataset, but were omitted from the main text for the sake of simplicity and conciseness.

Appendix A.1. Top words in the whole dataset

Figure A.6: Top-10 dataset words per genre.

Appendix A.2. Top words per genre

Figure A.7: Top-10 RNB words per genre.

Figure A.8: Top-10 Pop words per genre.

Figure A.9: Top-10 alt Rock words per genre.

Figure A.10: Top-10 Country words per genre.

Figure A.11: SVM Confusion Matrix.

Appendix A.3. Evaluation

Figure A.12: RF Evaluation Statistics.

Figure A.13: SVM Evaluation Statistics.