Basics of Natural Language Processing (NLP)

Computer Science / AI & Machine Learning tutorial chapter - Published 2025-12-17 - AI & Machine Learning

1. Tokenization:

  • Definition: The process of splitting text into smaller units called tokens, typically words, subwords, or punctuation marks.
  • Purpose: Produces the basic units that later steps (counting, tagging, parsing, or feeding a model) operate on, which makes it the first stage of most NLP pipelines.
  • Example:
    • Input: “Artificial Intelligence is the future.”
    • Tokens: [“Artificial”, “Intelligence”, “is”, “the”, “future”, “.”]
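
The example above can be reproduced with a minimal sketch that uses only Python's standard `re` module. The function name `tokenize` and the regular expression are assumptions chosen for illustration; production tokenizers handle contractions, URLs, and subwords with far more care.

```python
import re

# Illustrative pattern: a run of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (simplified sketch)."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Artificial Intelligence is the future."))
```

This simple rule treats every punctuation mark as its own token, so a contraction such as "don't" is split into three pieces. That is one reason real tokenizers rely on richer rules or learned vocabularies.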

2. Stemming:

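  • Definition: The process of reducing words to a base or root form by stripping suffixes with heuristic rules, without consulting a dictionary.
  • Purpose: Groups related word forms together for analysis. Because the rules are purely mechanical, the result may be a non-standard word form, and irregular forms are usually missed.
  • Example:
    • Input: “running”, “runner”, “ran”
    • Stemmed: “run”, “run”, “ran”
    • Note: The exact output depends on the stemming algorithm. An aggressive rule set reduces “runner” to “run”, while more conservative stemmers such as Porter's typically leave it as “runner”. The irregular form “ran” stays unchanged because it has no suffix to strip.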
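
To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny sketch. The suffix list and the name `naive_stem` are hypothetical and chosen only so that the output matches the example above; this is not the Porter algorithm or any library's implementation.

```python
# Hypothetical, deliberately small rule set (checked in this order).
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, then undouble a final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run": collapse the doubled consonant left behind.
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word  # no rule applied, e.g. the irregular form "ran"


print([naive_stem(w) for w in ["running", "runner", "ran"]])
```

Mechanical rules like these also show how non-standard forms arise: `naive_stem("studies")` returns "studie", which is not a dictionary word.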

3. Lemmatization:

  • Definition: Like stemming, lemmatization reduces words to a base form, but it maps each word to its dictionary form (lemma), so the result is always a valid word.
  • Purpose: Gives a more accurate representation of the word’s meaning by using a vocabulary and the word’s part of speech in context, which lets it handle irregular forms such as “ran”.
  • Example:
    • Input: “running”, “runner”, “ran”
    • Lemmatized: “run”, “runner”, “run” (treating “running” and “ran” as verbs and “runner” as a noun)
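
The role of a dictionary and part of speech can be sketched with a small lookup table. The table `LEMMA_TABLE`, the tag names, and the function `lemmatize` are all hypothetical; real lemmatizers draw on a full lexical resource and usually a part-of-speech tagger.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("runs", "VERB"): "run",
    ("runners", "NOUN"): "runner",
    ("running", "NOUN"): "running",  # the activity, as in "running is fun"
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word and POS tag; fall back to the word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


words = [("running", "VERB"), ("runner", "NOUN"), ("ran", "VERB")]
print([lemmatize(word, pos) for word, pos in words])
```

Unlike suffix stripping, the lookup resolves the irregular form "ran" to "run", and the same surface form "running" yields different lemmas depending on its part of speech.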