Background
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Typing on mobile devices can be easier if the keyboard presents three options for what the next word might be.
Summary
This report presents how a smart keyboard application that uses predictive text models can be built. It assumes that the words predicted by the application belong to the English language.
Humans have the ability to predict future words in any utterance. We use domain or subject knowledge (for example, after the word red we can predict either hat or blood), syntactic knowledge (for example, after the word the, we know an adjective or a noun should appear), or lexical knowledge (to choose potato and not steak after the word baked).
For a machine to predict the next word in a sentence in the same fashion as humans, the approach in this report assumes that
- Simple statistical techniques can fairly address the domain or subject knowledge needed for word prediction.
- Techniques will be based on the probability of a sequence (of letters, words, …), and will not cover Natural Language Parsing with Context Free Grammars to parse sentences into subject → verb → object structures.
The biggest challenge to accurate prediction is that infinitely many sentences or strings can be made from a finite vocabulary of words.
This report explores a class of language model called the N-gram model, a probabilistic method that uses the previous (N−1) words in a sentence sequence to predict the next word.
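The idea can be sketched with a toy example (an illustrative Python sketch; the analysis in this report itself is done in R). A trigram model (N = 3) looks up the two preceding words and proposes the words that most often followed them in training:

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    # Count which word follows each two-word history (the N-1 = 2 previous words).
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def predict_top3(counts, history):
    # Return up to three most likely next words, as a smart keyboard would offer.
    return [word for word, _ in counts[tuple(history)].most_common(3)]

tokens = "i want to eat lunch i want to go home i want to eat pizza".split()
model = train_trigram(tokens)
print(predict_top3(model, ["want", "to"]))  # → ['eat', 'go']
```

Here "eat" is ranked first because it followed "want to" twice in the toy corpus, while "go" followed it once.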
This current version of the report covers only the first part: exploring and analyzing what data, techniques and approach are needed to create a web-based application to predict the next word.
Data
Large corpora are needed to train these machine learning models. Corpora are online collections of text and speech. To build the learning model, a training sample from a corpus called HC Corpora (www.corpora.heliohost.org) was used. The data consists of Twitter, news and blog sentences, and can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Random samples of 1% of the sentences from each set were taken as the training data set.
Summary of the Training Data (1% Sample)
| Source  | Size (MB) | Sentences | Words   |
|---------|-----------|-----------|---------|
| blogs   | 2.078230  | 8,992     | 372,543 |
| twitter | 1.660501  | 23,601    | 302,750 |
| news    | 0.156155  | 772       | 26,326  |
Twitter sentences are short, and their many informal characters and poor spelling add non-useful terms to the data model; however, Twitter could be argued to be the de facto language of the mobile user community. The report (and application) does not aim to enforce the Queen's English, but it will refrain from proposing profane words. News documents are well written in a formal manner, though their topical context may be limited to news items. Three varied sources should help in generalizing the probabilities of words and sentences.
A separate held-out test corpus (0.5% of the corpus data) is also used to evaluate the model.
Data Cleaning
Regular expressions can be used to classify and clean the sample training data, reducing its size in the process. The R package tm has many useful functions that make this easy and reasonably complete. A keep-it-simple approach has been followed, on the assumption that although a more rigorous approach (e.g. taking a 50% sample instead of 1%, or using highly customized regular expressions that consider the context of punctuation in sentences) would tend to yield better predictive results, it would take more computational resources and may not be significantly better. The major steps to clean the training data are:
- Non-English characters and words were removed from the training data.
- Stop words are very commonly used words; if you try to predict the next word you may always get a stop word instead of something you are more likely to type. However, if these words are in frequent use, why not propose them? I therefore chose to keep them for fitting and validating the prediction model.
- Stemming is another common technique that condenses terms originating from a common stem in a text classification corpus. I chose not to stem, in order to maintain the sense of sentence context while predicting.
- Punctuation and numbers were removed completely to create simple terms in the sentences, even though certain words (and sentences) with punctuation and numbers are in human use. This also removed most poor uses of punctuation, cleaning the corpus considerably.
- All words in the corpus were lower-cased to remove duplicate terms.
- Extra blanks or spaces were removed from the sentences to compress the data.
- Profanity was not removed from the training documents; profane words will instead be removed from the predictions.
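The cleaning steps above can be sketched with a single regular-expression pipeline (an illustrative Python sketch; the report itself uses R's tm package for these transformations):

```python
import re

def clean_sentence(text):
    # Mirror the cleaning steps described above, in order.
    text = text.lower()                       # lower-case all words
    text = re.sub(r"[^a-z\s]", " ", text)     # drop non-English characters, punctuation, numbers
    text = re.sub(r"\s+", " ", text).strip()  # remove extra blanks and spaces
    return text

print(clean_sentence("  Baked POTATOES cost $3.50!!  "))  # → "baked potatoes cost"
```

Note how removing digits and symbols can delete information ("$3.50" disappears entirely); that is the accepted trade-off of the keep-it-simple approach.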
Data Language Modelling
An estimation technique using the probability of occurrence of a whole sentence can be used to determine the most likely outcome in a language model. However, it has a major flaw: the probability of any sentence that is not in the training data cannot be determined. Since there are only finitely many sentences in the training data, this model assigns a probability of zero to any sentence it has not seen before.
As an alternative, sentences can be treated as sequences of words, with each occurrence of a word assigned a probability: a word's probability is its count of occurrence (its frequency) divided by the total number of words in the training corpus. The probability of any sentence is then the product of its individual word probabilities. However, with this approach there is an enormously large number of combinations to compute: a sentence can be of any length, and there is a vast choice of values for each word, which means numerically very many possibilities to predict.
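This word-frequency approach can be made concrete with a toy corpus (an illustrative Python sketch, not the report's R code):

```python
from collections import Counter
from math import prod

corpus = "the cat sat on the mat the dog sat".split()
freq = Counter(corpus)   # count of each word
total = len(corpus)      # 9 words in this toy corpus

def sentence_prob(sentence):
    # Product of per-word probabilities: count(word) / total words in corpus.
    return prod(freq[w] / total for w in sentence.split())

print(sentence_prob("the cat sat"))  # (3/9) * (1/9) * (2/9) = 6/729
```

Any sentence containing a word absent from the corpus still gets probability zero, and the product shrinks rapidly with sentence length, which motivates the Markov approximation below.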
Markov process probabilistic models use the conditional probability chain rule and the assumption that the probability of a word depends only on a limited history: the probability of a word depends only on the n previous words. This approximation addresses the issue that the longer the sequence, the less likely we are to find it in a training corpus.
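In standard notation, the chain rule together with this Markov assumption gives:

```latex
P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1})
                \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

Each conditional probability now depends only on a fixed-length history of n−1 words, which is short enough to be observed repeatedly in a training corpus.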
In this language model, the words of sentences are split into groups of size n, also referred to as tokens. Tokenization of the corpus produces all possible groups of n words that appear together in sentences, along with their counts of occurrence (frequencies).
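Tokenization into n-word groups can be sketched as follows (an illustrative Python sketch; the report uses the RWeka package for this in R):

```python
from collections import Counter

def ngram_counts(sentences, n):
    # Count every group of n adjacent words within each sentence.
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

bigrams = ngram_counts(["one of the best", "one of the worst"], 2)
print(bigrams[("one", "of")])  # → 2
```

Groups are counted within sentences only, so a bigram never spans a sentence boundary.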
The tokenization of the training data is stored in a term-document matrix, which holds each term with its occurrence count (frequency) in each sampled document. Some terms appear frequently in several documents while many appear rarely, leading to sparsity (emptiness) in the term-document matrix. I chose not to remove sparse terms (i.e. not to require a highly dense or 0% sparse matrix), because each short sentence from Twitter is considered a separate document, and the high count of such documents leads to spurious sparsity. The term-document matrix of the training data resulted in 43,968 unique terms.
An n-gram language model with token value n = 3 is called a trigram, n = 2 a bigram and n = 1 a unigram. The higher n is, the more data is needed to train the model.
In this exploratory analysis, only bigram and trigram language models were built, using the RWeka package. For the prediction models and the web application, the intent is to use maximum likelihood estimation techniques based on Markov processes, linear interpolation, and discounting or backoff models to deal with words unseen in the corpus. Perplexity will be used to evaluate the information content and goodness of fit of the language model.
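The backoff idea can be sketched as follows (a simplified "stupid backoff" style sketch in Python, not the report's final R implementation): try the longest history first, and fall back to shorter histories whenever the full history was never seen in training.

```python
from collections import Counter, defaultdict

def train(tokens, max_n=3):
    # Count next-word occurrences for histories of length 0 .. max_n-1.
    counts = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            history, word = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
            counts[history][word] += 1
    return counts

def predict(counts, history, k=3):
    # Back off to shorter histories until one has been seen in training;
    # the empty history falls back to plain unigram frequencies.
    history = tuple(history)
    while history not in counts and history:
        history = history[1:]
    return [word for word, _ in counts[history].most_common(k)]

tokens = "a lot of people like a lot of pizza".split()
model = train(tokens)
print(predict(model, ["unseen", "lot", "of"]))
```

The unseen three-word history backs off to the bigram history ("lot", "of"), which was observed in training. A production model would also discount the counts when backing off rather than use them raw.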
Data Analysis
Top 3 Frequent words in N-Gram Model
| Model   | Top 3 Terms                        | Frequency of Top Term | Total Terms in Corpus |
|---------|------------------------------------|-----------------------|-----------------------|
| Unigram | the, and, you                      | 53,629                | 534,618               |
| Bigram  | in the, of the, to the             | 6,431                 | 652,478               |
| Trigram | thanks for the, a lot of, one of the | 632                 | 619,322               |
Least 3 Frequent words in N-Gram Model
| Model   | 3 Least Frequent Terms (examples)  | Frequency | Total Terms in Corpus |
|---------|------------------------------------|-----------|-----------------------|
| Unigram | aaaaaahhhhhi, aaaah, aaahhh        | 3         | 534,618               |
| Bigram  | a able, a abomination, a accounts  | 3         | 652,478               |
| Trigram | a a big, a a couple, a a plate     | 3         | 619,322               |
Summary of Word Frequency in N-Gram Model
| Model   | Min. | 1st Qu. | Median | Mean   | 3rd Qu. | Max.   |
|---------|------|---------|--------|--------|---------|--------|
| Unigram | 1    | 1       | 1      | 12.160 | 3       | 29,120 |
| Bigram  | 1    | 1       | 1      | 2.109  | 1       | 2,526  |
| Trigram | 1    | 1       | 1      | 1.180  | 1       | 234    |
Corpus Percentage of frequent terms in N-Gram Model
| Model   | Top 10 Terms (%) | Top 100 Terms (%) | Top 1000 Terms (%) |
|---------|------------------|-------------------|--------------------|
| Unigram | 16.7317599       | 38.602516         | 67.197139          |
| Bigram  | 2.0707212        | 7.542936          | 19.583189          |
| Trigram | 0.2373563        | 1.090063          | 3.918801           |
Words or terms with high frequency in the unigram, bigram and trigram models are few, and about three-fourths of terms have a frequency of less than 3. The numbers of unique terms in the unigram, bigram and trigram language models created from the training data corpus are 43,968, 309,337 and 524,889 respectively. The table above shows that the 10 most frequently used terms, which are 0.0227% of terms in the unigram model, account for 16.73% of usage in the corpus. Similarly, the 10 most frequently used terms of the bigram and trigram models are only 0.0032% and 0.0019% of terms respectively, yet occur significantly more often in the corpus than other terms.
The table above shows a language model in which very few words are repeated very often.
Often used Terms in Unigram & Bigram model
A 1% random sample of the corpus data for the training and test data provides enough term associations to be usable for the next-word prediction algorithm, and tokenization is done with considerable ease within my computer's memory.
The unigram, bigram and trigram term-document matrices take 2.52 MB, 19.18 MB and 35.66 MB of memory respectively.
Machine Learning Model Evaluation
Prediction success increases with the number of keystrokes entered and the number of choices presented.
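One way to quantify this relationship (a hypothetical evaluation sketch, not the report's final metric): on held-out text, measure how often the word actually typed appears among the model's top-k suggestions.

```python
def top_k_accuracy(predictions, actual_words, k=3):
    # Fraction of positions where the true next word is among the top-k choices.
    hits = sum(1 for preds, word in zip(predictions, actual_words) if word in preds[:k])
    return hits / len(actual_words)

# Hypothetical held-out examples: ranked model suggestions vs. word actually typed.
preds = [["the", "a", "to"], ["eat", "go", "be"], ["pizza", "pasta", "rice"]]
truth = ["the", "go", "soup"]
print(top_k_accuracy(preds, truth))  # 2 of 3 hits → 0.666...
```

Increasing k (offering more choices) can only raise this score, which makes the trade-off between accuracy and screen space explicit.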