
N-gram and collocation measures

Find the best trigrams by log-likelihood score


Take the Brown corpus in NLTK and choose the government subcorpus:

import nltk
from nltk.corpus import brown

nltk.download('brown')  # fetch the Brown corpus if it is not installed yet

gov_text = brown.words(categories='government')

Preprocess the text by filtering out words shorter than three letters and words that are stopwords.
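The filtering step could be sketched roughly as follows. The helper name `filter_tokens`, its `min_len` parameter, and the case-insensitive stopword comparison are assumptions made for illustration, not part of the task statement; the stopword list is supplied by the caller (for example NLTK's `stopwords.words('english')`).

```python
def filter_tokens(words, stopword_list, min_len=3):
    """Drop tokens shorter than min_len and tokens found in stopword_list.

    Hypothetical helper: the name, the min_len default, and the
    lowercase comparison are assumptions, not taken from the task.
    """
    stop = {s.lower() for s in stopword_list}
    return [w for w in words if len(w) >= min_len and w.lower() not in stop]


# Toy input standing in for the corpus words
sample = ["The", "Office", "of", "the", "Secretary", "is", "in", "Washington"]
filtered = filter_tokens(sample, ["the", "of", "is", "in"])
print(filtered)  # ['Office', 'Secretary', 'Washington']
```

Filtering the word list before building n-grams means that words which were separated by a removed token become neighbours, so the resulting trigrams are taken over the cleaned sequence.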

Find the ten best trigrams according to the log-likelihood association score.

The output should be a list of the trigram tuples you have found, with each trigram on a new line. For example:
[('First word', 'Second word', 'Third word'),
('First word', 'Second word', 'Third word'),
('First word', 'Second word', 'Third word')]
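One possible way to rank trigrams by log-likelihood uses NLTK's `TrigramCollocationFinder` with the `likelihood_ratio` measure from `TrigramAssocMeasures`. The sketch below is only an illustration: the function name `top_trigrams` and the toy token list are assumptions, and the input is expected to be already preprocessed.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder


def top_trigrams(tokens, n=10):
    """Return up to n trigrams with the highest log-likelihood ratio.

    Hypothetical helper; expects an already filtered list of tokens.
    """
    measures = TrigramAssocMeasures()
    finder = TrigramCollocationFinder.from_words(tokens)
    # nbest sorts trigrams by the given score, best first, and keeps n of them
    return finder.nbest(measures.likelihood_ratio, n)


# Toy token list standing in for the preprocessed corpus (an assumption)
demo = ("Federal Reserve Board United States said "
        "Federal Reserve Board United States").split()
best = top_trigrams(demo)

# Print each trigram tuple on its own line
for trigram in best:
    print(trigram)
```

For the task itself, `tokens` would be the filtered `gov_text` words rather than the toy list. When the input contains fewer than `n` distinct trigrams, `nbest` simply returns all of them.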
