NGram Language Model 1

A language model predicts the probability of a word in a document.

Unigram Language Model

  • n-gram: a sequence of n words.
  • “hello” is a unigram (1-gram), “hello world” is a bigram (2-gram), and so on.
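As a quick illustration, here is a minimal sketch of reading n-grams off a tokenized sentence (the ngrams helper below is our own illustration, not a library function):

```python
def ngrams(tokens, n):
    """Return all n-grams: tuples of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "hello world how are you".split()
print(ngrams(tokens, 1))  # unigrams: [('hello',), ('world',), ('how',), ('are',), ('you',)]
print(ngrams(tokens, 2))  # bigrams: [('hello', 'world'), ('world', 'how'), ('how', 'are'), ('are', 'you')]
```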

Training

In general, the probability of a word in a sentence is estimated based on the words that have come before it.

Assumptions

  • The probability of each word is independent of the other words.
  • The probability of a word is the fraction of times it occurs among all words in the training text. Training is complete once the probabilities of all words in the training text have been calculated.

Ptrain(“bottle” | “This” “is” “a”) = Ptrain(“bottle”) = ntrain(“bottle”) / Ntrain
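A minimal training sketch under these assumptions, assuming the training text is already tokenized into a list of words (the train_unigram name and the toy corpus are purely illustrative):

```python
from collections import Counter

def train_unigram(tokens):
    """Ptrain(word) = ntrain(word) / Ntrain for every word in the training text."""
    counts = Counter(tokens)   # ntrain(word)
    total = len(tokens)        # Ntrain
    return {word: count / total for word, count in counts.items()}

train_tokens = "this is a bottle . this is a dream .".split()
p_train = train_unigram(train_tokens)
print(p_train["bottle"])  # 1 occurrence out of 10 tokens -> 0.1
```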

Prediction

Probability of each sentence:

Peval(“This is a bottle .”)
= Ptrain(“This”) · Ptrain(“is” | “This”) · Ptrain(“a” | “This” “is”) · Ptrain(“bottle” | “This” “is” “a”) · Ptrain(“.” | “This” “is” “a” “bottle”)
= Ptrain(“This”) · Ptrain(“is”) · Ptrain(“a”) · Ptrain(“bottle”) · Ptrain(“.”)

We can see here that probabilities are assigned not only to words but also to sentences. The “.” token acts as the [END] symbol, which ensures that the probabilities of all possible sentences sum to 1.
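Under the independence assumption, the sentence probability is just the product of the trained unigram probabilities. A sketch, with an illustrative p_train taken from the toy corpus above:

```python
# illustrative unigram probabilities (from the toy training text above)
p_train = {"this": 0.2, "is": 0.2, "a": 0.2, "bottle": 0.1, "dream": 0.1, ".": 0.2}

def sentence_probability(tokens, p_train):
    """Peval(sentence) as a product of independent unigram probabilities."""
    prob = 1.0
    for word in tokens:
        prob *= p_train.get(word, 0.0)  # an unseen word zeroes the whole product (handled later)
    return prob

print(sentence_probability("this is a bottle .".split(), p_train))  # 0.2 * 0.2 * 0.2 * 0.1 * 0.2
```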

Evaluation Metric

Average Log Likelihood

Peval(text) = Π Ptrain(word)

log(Peval(text)) = Σ log(Ptrain(word))

Average log likelihoodeval = Σ log(Ptrain(word)) / Neval
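A sketch of the metric, reusing the illustrative p_train from above; note that any word missing from p_train would make math.log fail, which is exactly the problem the next section addresses:

```python
import math

p_train = {"this": 0.2, "is": 0.2, "a": 0.2, "bottle": 0.1, "dream": 0.1, ".": 0.2}

def average_log_likelihood(eval_tokens, p_train):
    """Sum of log Ptrain(word) over the evaluation text, divided by Neval."""
    log_sum = sum(math.log(p_train[word]) for word in eval_tokens)
    return log_sum / len(eval_tokens)

print(average_log_likelihood("this is a dream .".split(), p_train))
```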

Unknown Unigrams

If our model encounters a word that never occurred in the training data, we run into a problem.

ntrain(unigram) = 0  →  Ptrain(unigram) = 0  →  log(Ptrain(unigram)) = log(0) = -∞

Laplace Smoothing

  • Introduce a new [UNK] unigram. This unigram never occurs in the data, so its count is always 0.

  • Now a pseudo-count k is added to all unigrams in the data, including [UNK]. Most commonly k = 1, which is also called “add-one smoothing”. The smoothed probability becomes Ptrain(unigram) = (ntrain(unigram) + k) / (Ntrain + k·V), where V is the vocabulary size including [UNK]; a sketch follows this list.

  • The -∞ problem is now avoided, since every unigram, even an unseen one, gets a non-zero probability.
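A minimal sketch of add-k smoothing on the same toy corpus (the function name and [UNK] handling below are our illustration of the idea, not a particular library's API):

```python
from collections import Counter

def train_unigram_smoothed(tokens, k=1):
    """Add a pseudo-count k to every unigram, including the never-seen [UNK]."""
    counts = Counter(tokens)
    counts["[UNK]"] = 0                    # [UNK] never occurs in the training data
    vocab_size = len(counts)               # V, including [UNK]
    total = len(tokens) + k * vocab_size   # Ntrain + k * V
    return {word: (count + k) / total for word, count in counts.items()}

p_smoothed = train_unigram_smoothed("this is a bottle . this is a dream .".split(), k=1)
print(p_smoothed["[UNK]"])  # small but non-zero, so log() stays finite
```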
