What are n-grams
An n-gram is a contiguous sequence of n items (for example, words or characters) from a given sample of text or speech
Types of n-grams: unigram (n = 1), bigram (n = 2), trigram (n = 3), and higher-order n-grams
For example, take the sentence “The cat has magic power”:
- The bigrams (after lowercasing) would be
  - “the cat”
  - “cat has”
  - “has magic”
  - “magic power”
Typically the window moves forward one word at a time, but a larger step (stride) can be used in more advanced scenarios
When moving one step at a time, the number of n-grams is $N - n + 1$, where $N$ is the number of words in the sentence. For the five-word example sentence, that gives $5 - 2 + 1 = 4$ bigrams
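The sliding window and the count formula can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper name `ngrams`, its `step` parameter, and the naive lowercase-and-split tokenization are all assumptions made for the example.

```python
def ngrams(tokens, n, step=1):
    """Return the n-grams of `tokens` as tuples, sliding the window by `step`."""
    if n < 1 or step < 1:
        raise ValueError("n and step must be positive")
    # The last valid start index is len(tokens) - n, so there are
    # len(tokens) - n + 1 windows when step == 1 (none if the text is too short).
    return [tuple(tokens[i:i + n]) for i in range(0, len(tokens) - n + 1, step)]


sentence = "The cat has magic power"
tokens = sentence.lower().split()  # naive whitespace tokenization (assumption)

bigrams = ngrams(tokens, 2)
print(bigrams)
# [('the', 'cat'), ('cat', 'has'), ('has', 'magic'), ('magic', 'power')]

# Check the count formula N - n + 1 for a one-word step
assert len(bigrams) == len(tokens) - 2 + 1

# Moving two words at a time skips every other window
print(ngrams(tokens, 2, step=2))
# [('the', 'cat'), ('has', 'magic')]
```

With `step=1` the result matches the bigram list for the example sentence; a larger `step` simply keeps every `step`-th window.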
Applications
- Language Modeling: Predicting the next word based on the previous words (for more details, see makemore).
- Text Classification: Identifying categories of text by analyzing word patterns.
- Machine Translation: Assisting in translating text by understanding sequences of words.
- Sentiment Analysis: Evaluating opinions expressed in text by examining word combinations.
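To make the language modeling application concrete, here is a toy count-based bigram model that predicts the most frequent follower of a word. It is only a sketch: the three-sentence corpus is made up for illustration, the function names `train_bigram_counts` and `predict_next` are hypothetical, and it is not meant to reflect how makemore or any particular library does it.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenization (assumption)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Tiny made-up corpus, for illustration only
corpus = [
    "the cat has magic power",
    "the cat has a hat",
    "the dog has a bone",
]
counts = train_bigram_counts(corpus)

print(predict_next(counts, "cat"))    # has
print(predict_next(counts, "has"))    # a
print(predict_next(counts, "power"))  # None (never followed by anything)
```

A real model would normalize the counts into probabilities and apply smoothing for unseen pairs; the sketch only shows the core idea of conditioning on the previous word.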
Advantages & Limitations
- Advantages
  - N-grams capture local context and word order
  - This can improve accuracy on tasks such as sentiment analysis, tagging, and text-to-speech (TTS)
- Limitations
  - As n increases, the number of possible n-grams grows rapidly (exponentially in n for a fixed vocabulary), so the dimensionality of the data can grow significantly, leading to issues such as data sparsity
  - No understanding of context beyond the window size: dependencies that span more than n items are missed
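The sparsity limitation can be seen even on a tiny text. The sketch below compares the number of distinct n-grams actually observed with the number of possible n-grams, taken as V^n for a vocabulary of V words. The sample text and the helper name `ngram_sparsity` are assumptions chosen for illustration.

```python
def ngram_sparsity(tokens, n):
    """Return (distinct n-grams observed, n-grams possible for this vocabulary)."""
    vocab = set(tokens)
    observed = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    possible = len(vocab) ** n  # every ordered combination of n vocabulary words
    return len(observed), possible


# Made-up sample text, for illustration only
text = "the cat has magic power and the dog has magic power too"
tokens = text.split()

for n in (1, 2, 3):
    observed, possible = ngram_sparsity(tokens, n)
    print(f"n={n}: observed {observed} of {possible} possible "
          f"({observed / possible:.1%} coverage)")
```

The number of possible n-grams grows by a factor of V with each increase in n, while the number observed is bounded by the length of the text, so coverage drops quickly.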