Word Tokenization
Text Normalization
Every NLP task needs text normalization. This involves three steps (a minimal pipeline is sketched after this list):
- Segmenting/tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in text
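The video doesn't prescribe a particular library, but as a minimal sketch of these three steps, here is one possible pipeline using NLTK (the library choice and the lowercasing policy are assumptions, not from the lecture; the "punkt" model name may vary by NLTK version):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # one-time sentence-model download

text = "Finland's capital is Helsinki. San Francisco is in California."

# Step: sentence segmentation - split raw text into sentences.
for sentence in sent_tokenize(text):
    # Step: word tokenization - split each sentence into tokens.
    tokens = word_tokenize(sentence)
    # Step: normalizing word formats - lowercasing is one simple policy choice.
    print([t.lower() for t in tokens])
```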
Issues in tokenization (see the tokenizer demo after this list):
- Finland's could be tokenized as Finlands / Finland / Finland's
- lowercase could be lowercase / lower-case / lower case
- San Francisco - one token or two?
- Ph.D., MD - how should abbreviations with internal periods be tokenized?
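To see how one concrete tokenizer resolves these cases, here is NLTK's `word_tokenize` applied to each example (the outputs in the comments are typical, but exact behavior can differ across NLTK versions):

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's"))      # ['Finland', "'s"] - possessive clitic split off
print(word_tokenize("lower-case"))     # ['lower-case']    - hyphenated word kept whole
print(word_tokenize("San Francisco"))  # ['San', 'Francisco'] - two tokens, name unit lost
print(word_tokenize("Ph.D."))          # abbreviation handling varies by version
```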
We need to decide on a single normalization policy for these cases, and apply it consistently, to achieve better tokenization; one way to encode such a policy is sketched below.
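For example, a regular-expression tokenizer makes each decision explicit in its pattern. The pattern below is a hypothetical illustration of committing to a policy, not a standard tokenizer:

```python
import re

# Each alternative encodes an explicit decision; order matters (earlier wins).
pattern = r"""(?x)
      (?:[A-Za-z]+\.)+      # abbreviations with internal periods: Ph.D., M.D.
    | \w+(?:-\w+)*          # words, keeping hyphenated forms whole: lower-case
    | '\w+                  # clitics split off as their own token: 's
    | [^\w\s]               # any other punctuation character
"""

print(re.findall(pattern, "Finland's Ph.D. program uses lower-case text."))
# -> ['Finland', "'s", 'Ph.D.', 'program', 'uses', 'lower-case', 'text', '.']
```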
Ref: https://www.youtube.com/watch?v=jBk24DI8kg0