Tokenization is the process in NLP of splitting text into smaller units called tokens. Sentence tokenization refers to splitting a text or paragraph into sentences.
For tokenizing, we will import sent_tokenize from the nltk package:
from nltk.tokenize import sent_tokenize
We will use the below paragraph for sentence tokenization:
Para = "Hi Guys. Welcome to Intellipaat. This is a blog on the NLP interview questions and answers."
sent_tokenize(Para)
Output:
['Hi Guys.',
 'Welcome to Intellipaat.',
 'This is a blog on the NLP interview questions and answers.']
Word tokenization refers to splitting a sentence into words.
To tokenize words, we will import word_tokenize from the nltk package:
from nltk.tokenize import word_tokenize
Para = "Hi Guys. Welcome to Intellipaat. This is a blog on the NLP interview questions and answers."
word_tokenize(Para)
Output:
['Hi', 'Guys', '.', 'Welcome', 'to', 'Intellipaat', '.', 'This', 'is', 'a', 'blog', 'on', 'the', 'NLP', 'interview', 'questions', 'and', 'answers', '.']