Lemmatization and Stemming
Once tokenization is done, the resulting tokens have to be normalized and stemmed.
Normalization
- This means putting the indexed text and the query text into the same form, e.g., converting all the text to lowercase.
- Alternatively, normalization can use asymmetric expansion, e.g., a query for window also matches windows, but not necessarily the other way around.
- Note that every normalization choice has both good and bad effects. For example, US (the country) and us (the pronoun) are completely different words, so case sometimes must be preserved (see the sketch below).
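To make the trade-off concrete, here is a minimal Python sketch of case folding with an exception list for tokens whose case carries meaning. The list name and contents are illustrative assumptions, not from the video; a real system would derive the exceptions from data or task requirements.

```python
# Hypothetical exception list: tokens whose case carries meaning.
CASE_SENSITIVE = {"US"}

def normalize(tokens):
    """Lowercase every token except those where case is meaningful."""
    return [t if t in CASE_SENSITIVE else t.lower() for t in tokens]

print(normalize(["The", "US", "wants", "us", "to", "win"]))
# ['the', 'US', 'wants', 'us', 'to', 'win']
```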
Lemmatization
- Lemmatization is the process of reducing words to their base dictionary form, called the lemma.
- For example:
- am, are, is => be
- car, car’s, cars => car
- So the boy’s car is red becomes the boy car be red (see the sketch below).
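As one concrete implementation (the video does not prescribe a particular tool), here is a minimal sketch using NLTK's WordNetLemmatizer. Note that it needs a part-of-speech hint to map is to be:

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the lemma dictionary

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The lemmatizer needs a part-of-speech hint: "v" for verbs, "n" for nouns.
for word in ["am", "are", "is"]:
    print(word, "=>", lemmatizer.lemmatize(word, pos="v"))  # all => be
print("cars", "=>", lemmatizer.lemmatize("cars", pos="n"))  # => car
```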
Morphology
- Morphemes are the smallest meaningful units that make up words.
- Two types:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems, often carrying grammatical function
- For example, in the word cars, car is the stem and -s is an affix.
Stemming
- Stemming is the crude process of chopping off affixes to reduce words to their stems.
- For example,
- Compression => Compress
- Compressed => Compress
- Automatic => Automat
- The most widely used stemming algorithm for English is Porter’s algorithm (see the sketch below).
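Here is a minimal sketch using NLTK's implementation of Porter's algorithm, reproducing the examples above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Porter's algorithm strips affixes in a series of rule-based passes.
for word in ["compression", "compressed", "automatic"]:
    print(word, "=>", stemmer.stem(word))
# compression => compress
# compressed => compress
# automatic => automat
```

Note that the output (automat) need not be a dictionary word: stemming only has to map related forms to the same string, which is enough for indexing and retrieval.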
Ref: https://www.youtube.com/watch?v=2s7f8mBwnko