Word Tokenization
Text Normalization
Every NLP task needs text normalization. This involves three steps (a minimal pipeline is sketched after this list):
- Segmenting/tokenizing words in running text
- Normalizing word formats
- Segmenting sentences in text
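The video doesn't prescribe a particular library, but as a minimal sketch of these three steps, here is one possible pipeline using NLTK (the library choice and the lowercasing policy are assumptions, not from the lecture; the "punkt" model name may vary by NLTK version):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # one-time sentence-model download

text = "Finland's capital is Helsinki. San Francisco is in California."

# Step: sentence segmentation - split raw text into sentences.
for sentence in sent_tokenize(text):
    # Step: word tokenization - split each sentence into tokens.
    tokens = word_tokenize(sentence)
    # Step: normalizing word formats - lowercasing is one simple policy choice.
    print([t.lower() for t in tokens])
```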
Issues in tokenization (see the tokenizer demo after this list):
- Finland's could be tokenized as Finlands / Finland / Finland's
- lowercase could be lowercase / lower-case / lower case
- San Francisco - one token or two?
- Ph.D., MD - how should abbreviations with internal periods be tokenized?
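To see how one concrete tokenizer resolves these cases, here is NLTK's `word_tokenize` applied to each example (the outputs in the comments are typical, but exact behavior can differ across NLTK versions):

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's"))      # ['Finland', "'s"] - possessive clitic split off
print(word_tokenize("lower-case"))     # ['lower-case']    - hyphenated word kept whole
print(word_tokenize("San Francisco"))  # ['San', 'Francisco'] - two tokens, name unit lost
print(word_tokenize("Ph.D."))          # abbreviation handling varies by version
```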
We need to decide on a single normalization policy for these cases, and apply it consistently, to achieve better tokenization; one way to encode such a policy is sketched below.
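For example, a regular-expression tokenizer makes each decision explicit in its pattern. The pattern below is a hypothetical illustration of committing to a policy, not a standard tokenizer:

```python
import re

# Each alternative encodes an explicit decision; order matters (earlier wins).
pattern = r"""(?x)
      (?:[A-Za-z]+\.)+      # abbreviations with internal periods: Ph.D., M.D.
    | \w+(?:-\w+)*          # words, keeping hyphenated forms whole: lower-case
    | '\w+                  # clitics split off as their own token: 's
    | [^\w\s]               # any other punctuation character
"""

print(re.findall(pattern, "Finland's Ph.D. program uses lower-case text."))
# -> ['Finland', "'s", 'Ph.D.', 'program', 'uses', 'lower-case', 'text', '.']
```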
Ref: https://www.youtube.com/watch?v=jBk24DI8kg0