A language model predicts the probability of a word in a document.
Unigram Language Model
- n-gram: sequence of n words.
- “hello” is a unigram or 1-gram. “hello world” is a bigram or 2-gram and so on.
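As a quick illustration, a minimal Python sketch for extracting n-grams (the token list is made up, not from the notes):

```python
# Illustrative sketch: extract the n-grams of a tokenized sentence.
def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "hello world again".split()
print(ngrams(tokens, 1))  # unigrams: [('hello',), ('world',), ('again',)]
print(ngrams(tokens, 2))  # bigrams: [('hello', 'world'), ('world', 'again')]
```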
Training
Estimate the probability of a word in a sentence based on the words that come before it.
Assumptions
- The probability of each word is independent of the other words.
- The probability of a word is the fraction of times it occurs among all words in the training text. Training is complete when the probabilities of all words in the training text have been calculated.
Ptrain(“bottle” | “This” “is” “a”) = Ptrain(“bottle”) = ntrain(“bottle”) / Ntrain
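A minimal sketch of this counting step in Python, assuming a tiny made-up training text (illustrative only):

```python
from collections import Counter

# Toy training text; in practice this would be a large corpus.
train_tokens = "this is a bottle . this is a dream .".split()

counts = Counter(train_tokens)   # n_train(word) for every word
N_train = len(train_tokens)      # total number of training tokens

# P_train(word) = n_train(word) / N_train
p_train = {word: n / N_train for word, n in counts.items()}
print(p_train["bottle"])         # 1/10 = 0.1
```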
Prediction
Probability of each sentence:
Peval(“This is a bottle.”)
= Ptrain(“This”) · Ptrain(“is” | “This”) · Ptrain(“a” | “This” “is”) · Ptrain(“bottle” | “This” “is” “a”) · Ptrain(“.” | “This” “is” “a” “bottle”)
= Ptrain(“This”) · Ptrain(“is”) · Ptrain(“a”) · Ptrain(“bottle”) · Ptrain(“.”)
We can see here that probabilities are assigned not only to words but also to sentences. The “.” character acts as the [END] character, which ensures that the probabilities of all possible sentences sum to 1.
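A rough sketch of the prediction step, reusing the same toy corpus as above (illustrative only):

```python
from collections import Counter

# Same toy corpus as in the training sketch.
train_tokens = "this is a bottle . this is a dream .".split()
N_train = len(train_tokens)
p_train = {w: n / N_train for w, n in Counter(train_tokens).items()}

def sentence_probability(sentence):
    """Unigram model: the probability of a sentence is the product of the
    probabilities of its words, since each word is assumed independent."""
    prob = 1.0
    for word in sentence.split():
        prob *= p_train.get(word, 0.0)  # unseen words get probability 0 (fixed later by smoothing)
    return prob

print(sentence_probability("this is a bottle ."))  # (2/10)*(2/10)*(2/10)*(1/10)*(2/10) ≈ 0.00016
```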
Evaluation Metric
Average Log Likelihood
Peval(text) = Π Ptrain(word)
log(Peval(text)) = Σ log(Ptrain(word))
Average log likelihood over the evaluation text = Σ log(Ptrain(word)) / Neval
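A sketch of computing this metric on a small evaluation text (toy data; it assumes every evaluation word was seen in training):

```python
import math
from collections import Counter

# Same toy training setup as in the earlier sketches.
train_tokens = "this is a bottle . this is a dream .".split()
N_train = len(train_tokens)
p_train = {w: n / N_train for w, n in Counter(train_tokens).items()}

def average_log_likelihood(eval_text):
    """Sum of log P_train(word) over the evaluation text, divided by N_eval."""
    eval_tokens = eval_text.split()
    total = sum(math.log(p_train[w]) for w in eval_tokens)  # raises KeyError on unseen words
    return total / len(eval_tokens)

print(average_log_likelihood("this is a dream ."))
```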
Unknown Unigrams
If our model gets a word which never occurred in the training data, we run into a problem.
ntrain(unigram) = 0 → Ptrain(unigram) = 0 → log(Ptrain(unigram)) = log(0) = -∞
Laplace smoothing
Introduce a new [UNK] unigram. This unigram never occurs in the data, so its count is always 0.
Now a pseudo-count k is added to all unigrams in the data, most commonly k = 1. This is also called “add-one smoothing”.
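One common form of the smoothed estimate is Ptrain(unigram) = (ntrain(unigram) + k) / (Ntrain + k·V), where V is the number of distinct unigrams including [UNK]. A rough sketch on the same toy corpus (illustrative, not the notes' exact setup):

```python
from collections import Counter

train_tokens = "this is a bottle . this is a dream .".split()
counts = Counter(train_tokens)
counts["[UNK]"] = 0                 # [UNK] never occurs, so its raw count is 0

k = 1                               # pseudo-count ("add-one" smoothing)
vocab_size = len(counts)            # distinct unigrams, including [UNK]
N_train = len(train_tokens)

def p_smoothed(word):
    """Add-k estimate: (n(word) + k) / (N + k * V).
    Unseen words are mapped to [UNK], so the probability is never 0."""
    word = word if word in counts else "[UNK]"
    return (counts[word] + k) / (N_train + k * vocab_size)

print(p_smoothed("bottle"))    # (1 + 1) / (10 + 1 * 7) ≈ 0.118
print(p_smoothed("airplane"))  # unseen → [UNK]: (0 + 1) / 17 ≈ 0.059
```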
The -∞ problem is avoided now.