Stemming vs. Lemmatization
Stemming and lemmatization are text normalization procedures used to prepare words, sentences, and documents for further processing in Natural Language Processing (NLP). In this blog you will learn about stemming and lemmatization in a very practical way: the background, the applications, and how to stem and lemmatize words, sentences, and documents using nltk, a widely used natural language toolkit for Python, along with spaCy.
In natural language processing you often want your program to recognize that the terms “laugh” and “laughed” are just different forms of the same verb. This is the idea behind reducing the various forms of a word to a single root.
Stemming is the process of reducing the morphological variants of a word to a common root or base form. Programs that perform stemming are called stemming algorithms, or simply stemmers. When looking for a specific keyword in text, it often helps if the search also returns variations of the word: a search for “boat” might also match “boats” and “boating.” The stem for [boat, boater, boating, boats] is “boat.”
Stemming is a rudimentary approach to grouping related words: letters are chopped off the end of a word until the stem is reached. In most circumstances this works fine, but English has a number of exceptions where a more sophisticated process is required.
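To see why this counts as rudimentary, here is a deliberately naive sketch of the chop-the-end idea (the naive_stem helper is hypothetical and only for illustration; real stemmers apply many more rules and conditions):

# A deliberately naive suffix-chopper, for illustration only
def naive_stem(word):
    for suffix in ('ing', 'ed', 'er', 's'):
        # Only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ['boat', 'boater', 'boating', 'boats']])
# ['boat', 'boat', 'boat', 'boat']

This already handles the boat example above, but it would also mangle “corner” into “corn,” which is exactly the kind of exception that makes real stemming algorithms more involved.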
import nltk
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']
for word in words:
    print(word + ' --> ' + p_stemmer.stem(word))
The printed stems are ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fairli']. Note that a stem does not have to be a real word: “easily” and “fairly” are reduced to “easili” and “fairli.”
The Snowball stemmer used next is more accurately called the “English Stemmer” or “Porter2 Stemmer.” In terms of both logic and speed it is a slight improvement over the original Porter stemmer. We’ll use the name SnowballStemmer because that is what nltk calls it.
from nltk.stem.snowball import SnowballStemmer

s_stemmer = SnowballStemmer(language='english')

words = ['run', 'runner', 'running', 'ran', 'runs', 'easily', 'fairly']
for word in words:
    print(word + ' --> ' + s_stemmer.stem(word))
The printed stems are ['run', 'runner', 'run', 'ran', 'run', 'easili', 'fair'].
Here the Snowball stemmer did the same job as the Porter stemmer, except that it handled the stem of “fairly” more correctly, reducing it to “fair.”
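Stemmers operate on individual tokens, so to stem a whole sentence we tokenize it first. Here is a minimal sketch using nltk's word_tokenize (the sentence is made up for illustration, and the punkt tokenizer data must be downloaded once):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

nltk.download('punkt', quiet=True)  # one-time download of the tokenizer data

s_stemmer = SnowballStemmer(language='english')
sentence = "The runners were running fairly easily."

# Tokenize the sentence, then stem each token
print([s_stemmer.stem(token) for token in word_tokenize(sentence)])
# Expected output: ['the', 'runner', 'were', 'run', 'fair', 'easili', '.']

The same pattern scales to whole documents: split the text into sentences, tokenize, and stem token by token.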
Stemming has certain disadvantages. Given the token “saw,” stemming will always yield “saw,” whereas lemmatization can return either “see” or “saw” depending on whether the token was used as a verb or a noun.
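This part-of-speech sensitivity can be seen in nltk itself with the WordNetLemmatizer, which accepts a POS hint. A small sketch (the WordNet corpus must be downloaded once):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet', quiet=True)  # one-time download of the WordNet data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('was'))                   # 'wa'    -- stemming just chops the suffix
print(lemmatizer.lemmatize('was', pos='v'))  # 'be'    -- the lemma depends on the POS hint
print(lemmatizer.lemmatize('mice'))          # 'mouse' -- nouns are the default POS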
In contrast to stemming, lemmatization considers the entire lexicon of a language when applying morphological analysis to words. As the example above shows, the lemma of “was” is “be,” while the lemma of “mice” is “mouse.” spaCy offers only lemmatization, not stemming, because lemmatization is generally viewed as significantly more informative than basic stemming. Lemmatization uses the context of the surrounding text to determine the part of speech of a particular word; it does not categorize phrases.
import spacy

nlp = spacy.load('en_core_web_sm')

def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}} {token.lemma_}')
Here we’re using an f-string to format the printed text by setting minimum field widths and adding a left-align to the lemma hash value.
doc = nlp(u"I saw eighteen mice today!")
show_lemmas(doc)
The output here is:

I            PRON   -PRON-
saw          VERB   see
eighteen     NUM    eighteen
mice         NOUN   mouse
today        NOUN   today
!            PUNCT  !

(The lemma hash values printed by token.lemma are omitted above for readability. With spaCy v2 models, every pronoun receives the special lemma -PRON-; newer versions return the pronoun itself.)
Notice that “saw” has the lemma “see,” that “mice” is lemmatized to its singular form “mouse,” and that “eighteen” is its own number, not an inflected form of “eight.”
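We can also check the verb-versus-noun behaviour of “saw” described earlier. A quick sketch reusing the nlp object and show_lemmas from above (the exact lemmas can vary slightly between spaCy model versions):

doc2 = nlp(u"She saw a rusty saw on the bench.")
show_lemmas(doc2)

# Expected rows for the two occurrences of 'saw' (approximately):
# saw          VERB   see
# saw          NOUN   saw

The same surface form receives two different lemmas because spaCy first tags each token's part of speech from context and only then lemmatizes.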