Using Bigram Paragraph Vectors for Concept Detection

6 minute read | Updated:

Recently, I was working on a project using paragraph vectors at work (with gensim's `Doc2Vec` model) and noticed that the `Doc2Vec` model didn't natively interact well with their `Phrases` class, and there was no easy workaround (that I noticed). I saw <a href="http://lmgtfy.com/?q=bigrams+doc2vec+gensim">very little activity</a> around the interwebs about using bigrams with paragraph vectors, which I thought was surprising since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram *paragraph* vectors.
So I decided to take a look at <a href="https://github.com/RaRe-Technologies/gensim">gensim's source code</a> and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams *and* bigrams by only passing an additional argument to the `Phrases` class. If you want to dig into the code I added, you can find it <a href="https://github.com/RaRe-Technologies/gensim/pull/2158/files">on github</a>. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often. 
First, let's create a helper class to stream in our documents, much like we did in my <a href="https://tmthyjames.github.io/posts/Analyzing-Rap-Lyrics-Using-Word-Vectors/">previous post</a> on analyzing rap lyrics with word vectors. Since the `Doc2Vec` model accepts as input a `TaggedDocument` object, that's what we'll `yield` from our `__iter__` method. 

Recently, I was working on a project using paragraph vectors at work (with gensim's Doc2Vec model) and noticed that the Doc2Vec model didn't natively interact well with their Phrases class, and there was no easy workaround (that I noticed). I saw very little activity around the interwebs about using bigrams with paragraph vectors, which I thought was surprising since paragraph vectors can be much more illuminating than word vectors, especially when trying to disambiguate the various meanings of a given text. This is the main reason I was looking to move from bigram word vectors to bigram paragraph vectors.

So I decided to take a look at gensim's source code and incorporate this interaction into its API. With this commit, you can build paragraph vectors with unigrams and bigrams by only passing an additional argument to the Phrases class. If you want to dig into the code I added, you can find it on github. Here, I'll just explain how to use this new code to detect concepts, something I have to do quite often.

First, let's create a helper class to stream in our documents, much like we did in my previous post on analyzing rap lyrics with word vectors. Since the Doc2Vec model accepts as input a TaggedDocument object, that's what we'll yield from our __iter__ method.

In [509]:
from gensim.models.doc2vec import TaggedDocument
import nltk, csv
class Sentences(object):
    
    def __init__(self, filename=None, col=None, 
                 stopwords=None, ID=None):
        self.filename = filename
        self.col = col
        self.stopwords = stopwords
        self.ID = ID
        
    @staticmethod
    def get_tokens(text):
        """Helper function for retrieving for tokenizing data, 
           ignoring stemming and lemmatizing here for simplicity"""
        return [r.lower() for r in text.split()]
 
    def __iter__(self):
        reader = csv.DictReader(open(self.filename, 'r' ))
        for row in reader:
            song = row[self.ID]
            if not row[self.col]: continue
            words = self.get_tokens(row[self.col])
            tags = ['%s' % (row[self.ID].strip())]
            yield TaggedDocument(words=words, tags=tags)
Now let's initialize a `Sentences` object and pass in the `id` column so that we can tag each document with an identifier. 

Now let's initialize a Sentences object and pass in the id column so that we can tag each document with an identifier.

In [511]:
sentences = Sentences(
    filename='rap-lyrics.csv', # our filename
    col='lyric', # the text field
    ID='id', # our ID column for document tagging
    stopwords=nltk.corpus.stopwords.words('english') # default stopwords from NLTK
)
Previous to this commit, you had to pass a list of strings to the `Phrases` class in gensim, which is fine if you pass the output of `Phrases` to the `Word2Vec` model since that's exactly what the `Word2Vec` model expects. But `Doc2vec` expects a list (or any iterable) of `TaggedDocument` objects. This is really the only thing preventing the easy use of bigrams with `Doc2Vec`. So let's initialize a `Phrases` object and pass our `doc2vec` argument in.

Previous to this commit, you had to pass a list of strings to the Phrases class in gensim, which is fine if you pass the output of Phrases to the Word2Vec model since that's exactly what the Word2Vec model expects. But Doc2vec expects a list (or any iterable) of TaggedDocument objects. This is really the only thing preventing the easy use of bigrams with Doc2Vec. So let's initialize a Phrases object and pass our doc2vec argument in.

In [481]:
phrases = Phrases(sentences, doc2vec=True)
To those familiar with gensim's API, the only difference in the workflow is that now you must set `doc2vec=True` when initializing the `Phrases` object. That's it! **Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors**. The following is the code I used to train a `Word2vec` model using bigrams (phrases), but just replacing `Word2Vec` with `Doc2vec`. Previously, it would error out because the two weren't compatiable. Now they interact seemlessly.

To those familiar with gensim's API, the only difference in the workflow is that now you must set doc2vec=True when initializing the Phrases object. That's it! Now you can leverage the insight of bigrams while harnessing the totality of document-level vectors. The following is the code I used to train a Word2vec model using bigrams (phrases), but just replacing Word2Vec with Doc2vec. Previously, it would error out because the two weren't compatiable. Now they interact seemlessly.

In [482]:
from gensim.models.doc2vec import Doc2Vec
phrase2vec = Doc2Vec( 
    workers=5, 
    window=20, 
    alpha=0.025, 
    min_alpha=0.025,
    min_count=1,
)
phrase2vec.build_vocab(phrases[sentences])
for epoch in range(3):
    phrase2vec.train(phrases[sentences], total_examples=phrase2vec.corpus_count, epochs=phrase2vec.iter)
    phrase2vec.alpha -= 0.002  # decrease the learning rate
    phrase2vec.min_alpha = phrase2vec.alpha  # fix the learning rate, no decay
phrase2vec.save('rap-lyrics3.bigrams.doc2vec')
/home/ubuntu/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:561: UserWarning: For a faster implementation, use the gensim.models.phrases.Phraser class
Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already. 

Now, let's use these bigram paragraph vectors to detect various concepts. We'll use rap lyrics since I have those on my machine already.

In [500]:
concepts = ['weed', 'mom', 'religion', 'party', 'spanish', 'politics']
for concept in concepts:
    query = concept.split()
    scores = pd.DataFrame(
        phrase2vec.docvecs.most_similar(
            phrase2vec[query], 
            topn=len(lyrics)
        ), 
        columns=['id', concept + '_concept']
    )
    scores['id'] = scores['id'].astype(int)
    lyrics = lyrics.merge(scores, on=['id'])
In [501]:
lyrics.sort_values('politics_concept', ascending=False).head()
Out[501]:
id song year album artist lyric weed_concept mom_concept religion_concept party_concept spanish_concept politics_concept
10036 10036 The Wormhole 2013.0 Gravitas (2013) Talib_Kweli symbologists at the college oh you n****s wann... -0.011319 0.059534 0.451099 0.049866 0.076093 0.462608
8963 8963 Speak Your Mind (Hidden Track) 2001.0 Revolutionary Vol. 1 (2001) Immortal_Technique you have to speak the truth you have to speak ... 0.047033 0.066683 0.477586 -0.000200 0.032641 0.461523
4663 4663 I Want to Talk to You 1999.0 I Am (1999) Nas chorus repeat 2x i wanna talk to the mayor the... -0.112713 0.040988 0.299852 -0.011642 0.187577 0.460955
8861 8861 Solidified 2003.0 Free Agents: The Murda Mixtape (2003) Mobb_Deep prodigy yeah you know the shit dont stop never... -0.083604 0.032377 0.175557 -0.128048 0.116116 0.431326
7243 7243 Open Your Eyes 2008.0 The 3rd World (2008) Immortal_Technique were here because of you were here because you... -0.035662 0.011329 0.379785 -0.144915 0.059664 0.424974
xxxxxxxxxx
Our model does a pretty good job of detecting very high level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately from the artists that these are artists who wear their politics on their sleeves—especially Talib Kweli and Immortal Technique.

Our model does a pretty good job of detecting very high level concepts, as you can see with our politics example. I won't show the entire songs since most are NSFW, but you can look these songs up and see for yourself. For the rap connoisseurs out there, you can tell immediately from the artists that these are artists who wear their politics on their sleeves—especially Talib Kweli and Immortal Technique.