Let’s start by estimating bi-gram probabilities
Bi-gram probability can be calculated using:
P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
In other words: of all the times we saw the word w_{i-1}, in what fraction of them was it immediately followed by w_i?
For example, consider the sentences,
P(I | -start-) = count(-start-, I) / count(-start-) = (number of times I immediately follows -start-) / (number of times -start- occurs) = 2/3 ≈ 0.67
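The counting above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the three-sentence toy corpus and the function name `bigram_probabilities` are assumptions introduced here, and the corpus is not necessarily the one the example refers to. It was chosen only because it also gives P(I | -start-) = 2/3.

```python
from collections import Counter

START, END = "-start-", "-end-"

# Hypothetical toy corpus, assumed for illustration only.
corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

def bigram_probabilities(sentences):
    """Return a dict mapping (previous, word) -> P(word | previous)."""
    context_counts = Counter()  # count(w_{i-1})
    bigram_counts = Counter()   # count(w_{i-1}, w_i)
    for sentence in sentences:
        # Pad each sentence with the boundary markers before counting.
        tokens = [START] + sentence.split() + [END]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {
        (prev, word): count / context_counts[prev]
        for (prev, word), count in bigram_counts.items()
    }

probs = bigram_probabilities(corpus)
print(round(probs[(START, "I")], 2))  # 0.67
```

Only bi-grams that actually occur in the corpus get an entry, so any unseen pair implicitly has probability zero in this sketch; no smoothing is applied.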
Now we have the bi-gram probabilities of individual words. The bi-gram probability of a whole sentence is the product of the bi-gram probabilities of its words:
P(-start- I am Sam -end-) = P(I | -start-) * P(am | I) * P(Sam | am) * P(-end- | Sam)
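As a rough sketch of this product, the snippet below scores a sentence by multiplying its bi-gram probabilities. The toy corpus and the helper names `train` and `sentence_probability` are assumptions made for illustration; the estimates are plain relative frequencies with no smoothing, so a sentence containing any unseen bi-gram scores zero.

```python
from collections import Counter
from math import prod

START, END = "-start-", "-end-"

# Hypothetical toy corpus, assumed for illustration only.
corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

def train(sentences):
    """Count contexts and bi-grams over sentences padded with boundary markers."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = [START] + sentence.split() + [END]
        context_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply P(word | previous) over all adjacent pairs in the padded sentence."""
    tokens = [START] + sentence.split() + [END]
    return prod(
        bigram_counts[(prev, word)] / context_counts[prev]
        if context_counts[prev] else 0.0  # unseen context: probability 0
        for prev, word in zip(tokens, tokens[1:])
    )

context_counts, bigram_counts = train(corpus)
p = sentence_probability("I am Sam", context_counts, bigram_counts)
print(round(p, 4))  # 0.1111, i.e. 2/3 * 2/3 * 1/2 * 1/2
```

For longer sentences the product of many small numbers can underflow, so practical implementations usually sum log-probabilities instead; the direct product is kept here to mirror the formula.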
- SRILM
- Google N-gram corpus (released in 2006). Contains about 13 million unique words.
- Google Books N-gram corpus