How to Use the Application

The following user options are available in the Settings section.

Number of Suggested Words:
Controls how many words the application should propose for selection.

Stopwords:
These are commonly used words (e.g. the, are, a). By default they are included as word predictions.

Prediction Mode: Pattern
This word prediction method saves typing by completing the partially typed current word. The application predicts the most likely words by considering the previously typed words and auto-completes the sentence.

Prediction Mode: Next Word
The application proposes, auto-completes, or predicts the subsequent word(s) it considers most likely to follow.

Prediction Mode: Last Word
Quite similar to Next Word mode: the application proposes, auto-completes, or predicts word(s) that may end the current phrase or sentence. In this mode the application further distinguishes words that are often used as the actual last word of a sentence from the same words being used in the middle of a sentence.

Learn Mode:
If selected, the application trains its predictions on the currently typed phrases. The learning becomes available for word predictions in subsequent user sessions only after launching the application again, i.e. closing the application or restarting it by opening a new browser window.
Important: pressing the (Accept Sentence) button is needed to initiate Learn mode. A Learning Data message is flashed on screen, and the user may notice a slight delay before the next user action.
The application considers the previous words that have been typed before proposing the next word. However, what counts as the next word is a user choice.

Keyboard Pattern
By choosing <pattern mode>, which is the popular preferred way, the application proposes choice(s) to auto-complete the currently typed word. This method has the best prediction accuracy.

Type Ahead
By choosing <next word mode>, the application proposes the subsequent word, i.e. it attempts to guess what is on the user's mind. This helps reduce typing even further, although accuracy is not always good.

Conclude Sentence
By choosing <last word mode>, the application proposes what could be the phrase- or sentence-ending word. This is similar to <next word> mode; however, not all words that appear in the middle of a sentence can be considered at the end of a sentence. Accuracy is similar to <next word> predictions. This is the application's default mode, to help users play and get acclimatized to the application.
Upon the initial (first) launch of the application, an (Application Initialized) message is flashed and the word (the) is available as the first word choice. This indicates the application is ready for the user.
Pressing the Accept Sentence button replaces the input text with the proposed sentence, including the next word. In the Settings section, the user has the option to exclude commonly used words, e.g. <the>, from the proposal list; by default they are included. Also in the Settings section, the user can choose up to 3 words to be proposed; the default is to propose the top 1 word.
The training and test corpus data are sourced from www.corpora.heliohost.org.
Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Typing on mobile devices can be easier if the keyboard can present three options for what the next word might be.
This report presents how a smart keyboard application that uses predictive text models can be built. It assumes that the words predicted by the application belong to the English language.
Humans have the ability to predict future words in any utterance. We use domain or subject knowledge: for example, after the word red we can predict either hat or blood. We use syntactic knowledge: for example, after the word the, we know an adjective or noun should appear. Or we use lexical knowledge to choose potato and not steak after the word baked.
For a machine to predict the next word in a sentence in the same fashion as humans, the approach in this report assumes that:
The biggest challenge to accurate prediction is that infinite sentences or strings can be made from a finite vocabulary of words.
This report explores a language model called the N-gram model, a probabilistic method that uses the previous (N-1) words in a sentence sequence to predict the next word.
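As a minimal illustration of the N-gram idea, the sketch below counts trigrams in a hypothetical toy corpus and uses the previous two words as the history to propose the most likely next word. (The report itself builds its models in R; this Python sketch only mirrors the concept, and the corpus and function names are assumptions.)

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; the report samples HC Corpora instead.
corpus = "the cat sat on the mat . the cat sat on the chair .".split()

# Count trigrams: the two previous words (the history) predict the third.
next_word_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    next_word_counts[(w1, w2)][w3] += 1

def predict(w1, w2, k=3):
    """Return up to k most likely next words after the bigram (w1, w2)."""
    return [w for w, _ in next_word_counts[(w1, w2)].most_common(k)]

print(predict("cat", "sat"))  # -> ['on']
```

With a real training corpus, `most_common(k)` is what lets the keyboard surface up to three ranked choices rather than a single guess.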
This current version of the report covers only the first part: exploring and analyzing what data, techniques, and approach are needed to create a web-based application to predict the next word.
Large corpora are needed to train these models for machine learning. Corpora are online collections of text and speech. To build the learning model, a training sample from a corpus called HC Corpora (www.corpora.heliohost.org) was used. The data is in the form of Twitter, news, and blog sentences. It can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Random samples of 1% of the sentences from each set were taken as the training data set.
Twitter sentences are short, with many informal characters, and their poor spelling adds non-useful terms to the data model; however, it could be argued that they are the de facto language of the mobile user community. The report (and application) does not aim to be an enforcer of the Queen's English, but it will refrain from proposing profanity. News documents are well written in a formal manner, although their topical context may be limited to news items. Three varied sources should help generalize the probabilities of words and sentences.
A separate held-out test set (0.5% of the corpus data) is also used to evaluate the model.
Regular expressions can be used to clean the sample training data, reducing its size in the process. The R package tm has many useful functions that make this easy and fairly complete. A keep-it-simple approach has been followed, on the assumption that although a more rigorous approach (e.g. taking 50% of the data instead of 1%, or using highly customized regular expressions that consider the context of punctuation in sentences) would tend to yield better predictive results, it would take more computational resources and may not be significantly better. The major steps to clean the training data are:
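The kind of regular-expression cleanup described above can be sketched as follows. This is a Python illustration only (the report uses the R tm package), and the exact cleaning rules shown are assumptions about typical steps: lowercasing, dropping numbers and punctuation, and collapsing whitespace.

```python
import re

def clean_text(text):
    """Simple corpus cleanup sketch: lowercase, strip numbers and
    punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"[0-9]+", " ", text)       # drop numbers
    text = re.sub(r"[^a-z'\s]", " ", text)    # drop punctuation/symbols
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(clean_text("Hello, World!! 123  It's  #awesome..."))
# -> "hello world it's awesome"
```

Apostrophes are deliberately kept here so contractions like "it's" survive; a real pipeline would also handle profanity filtering as mentioned earlier.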
An estimation technique using the probability of occurrence of a sentence can be used to determine the most likely outcome of a sentence in a language model. However, it has a major flaw: the probability of any sentence not in the training data cannot be determined. Since there are only finitely many sentences in the training data, this model assigns a probability of zero to any sentence it has not seen before.
As an alternative, sentences can be treated as sequences of words, with each occurrence of a word assigned a probability: a word's probability is its count of occurrence (its frequency) divided by the total number of words in the training corpus. The probability of any sentence is then the product of its individual word probabilities. However, with this approach there are enormously many combinations to compute for the probability of a sentence: since a sentence can be of any length and there is a vast choice of words, the number of possibilities to predict is numerically very large.
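The word-frequency-product idea above can be sketched in a few lines. This is a hypothetical toy example in Python (not the report's R code); it also shows the zero-probability problem for unseen words.

```python
from collections import Counter

# Hypothetical tiny training corpus.
words = "the cat sat on the mat the dog sat".split()
counts = Counter(words)
total = len(words)  # 9 words in total

def unigram_sentence_prob(sentence):
    """Product of each word's relative frequency in the training corpus.
    Any word unseen in training drives the whole product to zero."""
    p = 1.0
    for w in sentence.split():
        p *= counts[w] / total
    return p

print(unigram_sentence_prob("the cat sat"))    # (3/9) * (1/9) * (2/9)
print(unigram_sentence_prob("the zebra sat"))  # 0.0 -- 'zebra' is unseen
```

Note how quickly the product shrinks even for short sentences; this is one reason real implementations work with log probabilities.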
Markov process probabilistic models use the conditional probability chain rule and the assumption that the probability of a word depends only on a limited history, i.e. the probability of a word depends only on the n previous words. This approximation addresses the issue that the longer the sequence, the less likely we are to find it in a training corpus.
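Under the Markov assumption with a history of one previous word, the maximum likelihood estimate is P(word | prev) = c(prev, word) / c(prev). A minimal sketch, again on a hypothetical toy corpus in Python rather than the report's R:

```python
from collections import Counter

# Hypothetical toy corpus; the report trains on HC Corpora samples.
words = "the cat sat on the mat the cat ran".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

def p_next(prev, word):
    """MLE conditional probability P(word | prev) = c(prev, word) / c(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """Bigram approximation: product of conditional probabilities along
    the sentence (start/end padding omitted for simplicity)."""
    ws = sentence.split()
    p = 1.0
    for prev, word in zip(ws, ws[1:]):
        p *= p_next(prev, word)
    return p

print(p_next("the", "cat"))  # 2/3: 'the' occurs 3 times, 'the cat' twice
```

Compared with the unigram product, only short local histories need to be counted, which is exactly what makes the approach tractable on a finite corpus.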
In this language model, the words of sentences are split into groups of n words, also referred to as tokens. Tokenization of the corpus produces all possible groups of n words that appear together in sentences, along with their counts of occurrence (frequencies).
The tokenized training data is stored in a term-document matrix, which holds each term with its occurrence count (frequency) in each sampled document. Some terms appear frequently in several documents while many appear rarely, leading to sparsity (emptiness) in the term-document matrix. I chose not to remove sparse terms, nor to require a highly dense (0% sparsity) matrix, because each short Twitter sentence is treated as a separate document, and the high count of these documents leads to spurious sparsity. The term-document matrix of the training data resulted in 43,968 unique terms.
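A term-document matrix and its sparsity can be illustrated on a hypothetical mini-corpus (Python sketch; the report builds its matrix with the R tm package). Treating each short sentence as its own document, as the report does, is what drives the zero count up:

```python
from collections import Counter

# Hypothetical mini-corpus: each tweet/sentence is its own document.
docs = ["the cat sat", "the dog ran", "a bird sang"]
doc_counts = [Counter(d.split()) for d in docs]
terms = sorted(set(t for c in doc_counts for t in c))

# Term-document matrix: rows = terms, columns = documents.
tdm = [[c[t] for c in doc_counts] for t in terms]

zeros = sum(row.count(0) for row in tdm)
sparsity = zeros / (len(terms) * len(docs))
print(f"{len(terms)} terms x {len(docs)} docs, sparsity = {sparsity:.2f}")
```

Even with three tiny documents, most cells are zero; with tens of thousands of one-sentence documents the matrix is almost entirely empty, which is why removing "sparse" terms would discard real vocabulary.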
An n-gram language model with token size n = 3 is called a trigram model, n = 2 a bigram model, and n = 1 a unigram model. The higher n is, the more data is needed to train.
In this exploratory analysis, only bigram and trigram language models were built, using the RWeka package. For the prediction models and web application, the intent is to use maximum likelihood estimation techniques based on Markov processes and linear interpolation, and to use discounting methods or backoff models to deal with words unseen in the corpus. Perplexity will be used to evaluate the information content and goodness of fit of the language model.
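Linear interpolation and perplexity, both mentioned above as planned techniques, can be sketched together. This is a hedged Python illustration on a hypothetical toy corpus: the lambda weight and the add-one smoothing on the unigram term are arbitrary assumptions, not the report's chosen parameters.

```python
import math
from collections import Counter

# Hypothetical toy training text; the lambda weight below is arbitrary.
train = "the cat sat on the mat the cat ran".split()
uni = Counter(train)
bi = Counter(zip(train, train[1:]))
V = len(uni)   # vocabulary size
N = len(train) # total word count

def p_interp(prev, word, lam=0.7):
    """Linear interpolation: lam * P_bigram + (1 - lam) * P_unigram,
    with add-one smoothing on the unigram so unseen words keep some mass."""
    p_bi = bi[(prev, word)] / uni[prev] if uni[prev] else 0.0
    p_uni = (uni[word] + 1) / (N + V)
    return lam * p_bi + (1 - lam) * p_uni

def perplexity(test_words):
    """Perplexity = exp(-average log-probability) over the test sequence;
    lower is better."""
    log_p = 0.0
    for prev, word in zip(test_words, test_words[1:]):
        log_p += math.log(p_interp(prev, word))
    return math.exp(-log_p / (len(test_words) - 1))

print(perplexity("the cat sat".split()))
```

Because the smoothed unigram never returns zero, perplexity stays finite even on held-out sentences containing unseen bigrams, which is the whole point of interpolation and backoff.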
| Model   | Term 1         | Term 2   | Term 3     | Top 3 Count | Total Word Count |
|---------|----------------|----------|------------|-------------|------------------|
| Bigram  | in the         | of the   | to the     | 6431        | 652478           |
| Trigram | thanks for the | a lot of | one of the | 632         | 619322           |
| Model   | Term 1 | Term 2        | Term 3     | Least 3 Count | Total Word Count |
|---------|--------|---------------|------------|---------------|------------------|
| Bigram  | a able | a abomination | a accounts | 3             | 652478           |
| Trigram | a a big| a a couple    | a a plate  | 3             | 619322           |
(Summary statistics of term frequencies: Min., 1st Qu., Median, Mean, 3rd Qu., Max.; values not recovered.)
(Coverage by the top 10, 100, and 1000 terms; values not recovered.)
Words or terms with high frequency in the unigram, bigram, and trigram models are few, and about three-quarters of terms have a frequency of less than 3. The numbers of unique terms in the unigram, bigram, and trigram language models created from the training data corpus are 43,968, 309,337, and 524,889 respectively. The table above shows that the 10 most frequently used terms, which are 0.0227438% of terms in the unigram model, account for 16.7317599% of usage in the corpus. Similarly, the 10 most frequently used terms of the bigram and trigram models are 0.0032327% and 0.0019052% of terms respectively, yet these occur significantly more often in the corpus.
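The coverage percentages quoted above can be computed directly from a frequency table. A minimal Python sketch with hypothetical frequencies (the report derives its numbers from the actual term-document matrices):

```python
from collections import Counter

# Hypothetical term frequencies, standing in for the report's real counts.
freqs = Counter({"the": 500, "to": 300, "and": 200,
                 "cat": 5, "mat": 3, "zebra": 1})

def top_k_coverage(freqs, k):
    """Percentage of total corpus usage accounted for by the k most
    frequent terms."""
    total = sum(freqs.values())
    top = sum(c for _, c in freqs.most_common(k))
    return 100.0 * top / total

print(f"top-3 terms cover {top_k_coverage(freqs, 3):.1f}% of usage")
```

Applying the same function with k = 10 to the real unigram counts is how a figure like "0.02% of terms account for 16.7% of usage" is obtained.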
The table above shows a language model in which very few words are repeated very often.
A 1% random sample of the corpus data for the training and test sets provides enough term associations to be usable for the next word prediction algorithm, and tokenization is done with considerable ease within my computer's memory.
The unigram, bigram, and trigram term-document matrices take 2.52 MB, 19.18 MB, and 35.66 MB of memory respectively.
Prediction success increases with the number of keystrokes and choices.