How to extract common / significant phrases from a series of text entries

With NLTK, it's easy enough to get bigrams and trigrams, but what I'm looking for are phrases that are more likely 7 or 8 words in length. I haven't figured out how to make NLTK (or some other method) provide such 'octograms' and above.
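For reference, here's the bigram/trigram case I mean, a minimal sketch on a made-up sentence:

import nltk
from nltk.util import ngrams

tokens = "the quick brown fox jumps over the lazy dog".split()

# bigrams and trigrams come straight out of NLTK's helpers
print(list(nltk.bigrams(tokens)))
print(list(ngrams(tokens, 3)))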

Comment (Mar 28, 2018): Maybe you can try graph-based algorithms like TextRank - github.com/ceteri/pytextrank

4 Answers

I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.

To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur together much more often than you would expect by chance.
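To make PMI concrete, here's a small sketch of the calculation with toy counts (all the numbers are made up for illustration):

import math

def pmi(count_xy, count_x, count_y, total_tokens):
    # PMI in bits: log2 of the joint probability of the pair
    # over the product of the individual word probabilities
    p_xy = count_xy / total_tokens
    p_x = count_x / total_tokens
    p_y = count_y / total_tokens
    return math.log2(p_xy / (p_x * p_y))

# toy corpus of 100,000 tokens: 'New' appears 200 times, 'York'
# 60 times, and the pair 'New York' 50 times
print(pmi(50, 200, 60, 100_000))  # ~8.7 bits, a strong association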

The NLTK collocations how-to covers how to do this in about 7 lines of code, e.g.:

import nltk
from nltk.collocations import *

bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
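As far as I know, the built-in collocation finders only go up to quadgrams, so for the 7-8 word phrases the question asks about, one rough fallback (frequency only, no PMI scoring) is to count raw 8-grams with nltk.util.ngrams and a FreqDist:

import nltk
from nltk.util import ngrams

# reuse the same Genesis corpus as above; swap in your own tokens
words = nltk.corpus.genesis.words('english-web.txt')

# count every 8-gram and keep the most frequent ones
octograms = nltk.FreqDist(ngrams(words, 8))
for gram, count in octograms.most_common(10):
    print(count, ' '.join(gram))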