Parsing is the process of identifying and understanding the syntactic structure of a text by analyzing its individual elements. The machine reads the text one word at a time, then two words at a time, then three, and so on.
When the machine parses the text one word at a time, each unit is a unigram.
When it parses the text two words at a time, each unit is a bigram.
When it parses the text three words at a time, each unit is a trigram.
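The three definitions above can be sketched in plain Python as a sliding window over the words. This is only an illustration, assuming the words are separated by whitespace; `sliding_ngrams` is a hypothetical helper name and not part of nltk.

```python
def sliding_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # A window of size n can start at positions 0 .. len(words) - n.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# Assumption: whitespace splitting is a good enough tokenizer for this sentence.
words = "Top 30 NLP interview questions and answers".split()

unigrams = sliding_ngrams(words, 1)  # one word at a time
bigrams = sliding_ngrams(words, 2)   # two words at a time
trigrams = sliding_ngrams(words, 3)  # three words at a time

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of 7 words yields 7 unigrams, 6 bigrams, and 5 trigrams, because each window moves forward by one word and overlaps the previous one.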
Look at the diagram below to understand unigrams, bigrams, and trigrams.
Now, let’s implement parsing with the help of the nltk package.
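import nltk
from nltk.tokenize import word_tokenize
text = "Top 30 NLP interview questions and answers"
We will now tokenize the text using word_tokenize, which relies on NLTK's Punkt tokenizer data (downloadable with nltk.download).
text_token = word_tokenize(text)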
Now, we will extract unigrams, bigrams, and trigrams from the tokens. Each n-gram is returned as a tuple of words.
list(nltk.ngrams(text_token, 1))
Output:
[('Top',), ('30',), ('NLP',), ('interview',), ('questions',), ('and',), ('answers',)]
list(nltk.bigrams(text_token))
Output:
[('Top', '30'), ('30', 'NLP'), ('NLP', 'interview'), ('interview', 'questions'), ('questions', 'and'), ('and', 'answers')]
list(nltk.trigrams(text_token))
Output:
[('Top', '30', 'NLP'), ('30', 'NLP', 'interview'), ('NLP', 'interview', 'questions'), ('interview', 'questions', 'and'), ('questions', 'and', 'answers')]
To extract n-grams of any size, use the function nltk.ngrams and pass n, the number of words in each n-gram.
list(nltk.ngrams(text_token, n))
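As a concrete case of the general call, the sketch below extracts 4-grams from the same sentence. It assumes nltk is installed and, to avoid needing any tokenizer data, uses plain whitespace splitting in place of word_tokenize; `ngram_strings` is a hypothetical helper added only to make the tuples easier to read.

```python
from nltk.util import ngrams

text = "Top 30 NLP interview questions and answers"

# Assumption: whitespace splitting stands in for word_tokenize here,
# so the example runs without downloading tokenizer data.
tokens = text.split()


def ngram_strings(tokens, n):
    """Return the n-grams of `tokens` as space-joined strings."""
    # ngrams yields tuples of n consecutive tokens; join each one for display.
    return [" ".join(gram) for gram in ngrams(tokens, n)]


four_grams = ngram_strings(tokens, 4)
for gram in four_grams:
    print(gram)
```

With 7 tokens and n = 4, the window fits in 4 positions, so 4 four-grams are produced; if n is larger than the number of tokens, the result is empty.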