# Next Word

###### Click Speak and say the command word 'Next' followed by your sentence to start predicting

___________________

ThinkWord Contact Me!

#### Suggested Sentence

How to Use Application: Following User Options are available in Settings section.

Number of suggested Words:

Control upto how many words application should propose for selection.

Stopwords:

These are some of the commonly used words i.e. (the,are,a etc.), by default these are included as word predictions.

Prediction Mode : Pattern

This word prediction method is very useful to save typing by predicting the partially typed current word. Application predicts words that are most likely by considering previously typed words & auto completes the sentence.

Prediction Mode : Next Word

Application provides options, auto completes or predicts for the subsequent next word(s) it considers most likely to follow.

Prediction Mode : Last Word

Quite similar to next word,Application provides options, auto completes or predicts for word(s) that may end the current phrase or sentence. In this mode Application further takes into consideration those words that are often used as actual last word of the sentence versus the same word being used in the middle of sentence.

Learn Mode:

If selected, Application train predictions based on the currently typed phrases. The learning is available for word predictions in subsequent user sessions, only after launching the application again i.e.close application or restart by opening it again a new browser window.

Important : (Accept Sentence) button user selection action is needed to initiate Learning mode. Learning Data Message is flashed on screen & User may notice slight delay for next user action.

#### Predict Next

Typing on Mobile & Computer Devices is made easier by Predicting the next word in the sentence It's Helpful to Millions of Mobile Device users for E-mail, Banking, Social Networking etc. activities

Wide range of applications possible by extending Natural Language Processings use

• Speech Recognition & Complex web searches
• Machine Translation, Customer Services Dialogue systems
• Text Summarization & Document Classification
• Brand Management & Plagiarism detection
• Investigate service invoices & Electronic medical record data for quality service, care & predict costs
• #### Next Word

Application considers the previous words that have been typed before proposing Next Word.However what is considered Next word is a user choice.

Keyboard Pattern

By choosing <pattern mode> which is the popular preferred way the application proposes choice(s) to auto complete the currently typed word. This method has the best prediction accuracy.

By choosing <next word mode> application proposes subsequent word i.e. kind of attempts to guess what is in users mind, and helps in even less typing, however accuracy is not always good.

Conclude Sentence

By choosing <last word mode> application proposes whact could be the phrase or sentence ending word. This is similar to <next word> mode, however not all words that appear in the middle of sentence can be considered at end of sentence. Accuracy is similar to <next word> predictions. It is the applications default mode to help users play and get acclimitized with the application.

Upon Initial or first Launch of Application, message of (Application Initialized) is flashed and word (the) is available as first word choice. It indicates application is ready for the user now.

Accept Sentence Button Press action shall replace input text with the proposed suggested sentence including the next word.

In the settings section, User has option to exclude commonly used words i.e. <the>, from the proposal list. By Default they are included. In the settings section, User has option to choose upto 3 number of words being proposed. The default option is to propose Top 1 word.

#### Summary

This application presents how a smart keyboard application that uses predictive text models can be built. It assumes, Predicted words by the application shall belong to English language. Application algorithm uses N-Gram Language Model & Katz Backoff Methods Implementation. If Prediction from Trigram is unavailable , back off to Bigram , & then to unigram. Linear modelling smoothing is performed if different & multiple matches are available between N-gram (Trigram , Bigram & Unigram) language models. To model unseen sentences, algortithm models approach akin to Discounting method to replace low frequency words as Unknowns.

Training & test corpus data source used are from www.corpora.heliohost.org

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

Next Word Prediction:Exploratory Report

### Background

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. Typing on mobile devices can be easier if the keyboard can present three options for what the next word might be.

## Summary

This report presents how a smart keyboard application that uses predictive text models can be built. It assumes, Predicted words by the application shall belong to English language.

Humans have the ability to predict future words in any utterance, by using domain or subject knowledge for example after the word red we can predict either word hat or blood, or we use syntactic knowledge for example after word the , we know a adjective or noun word should appear, or we use lexical knowledge to chose potato & not steak after the word baked.

For a machine to predict next word in a sentence, in the same fashion as humans, the appraoch in this report assumes that

1. Simple statistical techniques can fairly address the part of the domain or subject knowledge needed to allow Word Prediction.
2. Techniques will be based on the probability of a sequence (of letters, words,…), and not cover Natural Langauge Parsing Context Free Grammar to parse words i.e. into subject->verb->Object objects.

To accurately predict, the biggest challenge is that Infinite sentences or strings can be made from some finite vocabulary of words.

This report explores a Language Model called N-Gram models of language, which is the probablistic method using all the previous (N-1) words in a sentence sequence to predict the next word.

This current version of the report covers only the first part to explore & analyze what data, techniques & approach are needed to create a web based application to predict next word.

## Data

To train these models for machine learning large corpora are needed. Corpora are online collections of text and speech. To build the learning model, training sample from a corpus called HC Corpora (www.corpora.heliohost.org) was used. The data is in the form of twitter, news and blogs sentences. It can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip Random samples of 1% of data sentences from each set were taken as the Training data set

Summary of the Training Data (1% Sample)
Size(Mb) Lines Words
blogs 2.078230 8992 372543
news 0.156155 772 26326

Twitter sentences are short with a lot of informal characters and poor spelling in these add to non useful terms in data model, however it could be argued as defacto language of mobile user community. The report (or application) shall not aim to be enforcer of queens english , however will refrain from proposing profanity words. News documents are well written in a formal manner, however topics context could be limited to news items only. Three varied sources should help in generalizing the probabilities of words & sentences.

A separate test corpus (0.5% of corpus data) with held out test set is also used to evaluate the model.

## Data Cleaning

Regular Expressions can be used for text classification of the sample training data. It will do clean up & in process reduce the size of the sample data. R Package tm has lot of useful functions that can be used for easiness & fair completeness. Keep it simple approach has been followed with the assumption that allthough the rigrous approach (i.e. instead of taking 1% of sample data, take 50% or use of highly custom regular expressions to consider context of punctuations in sentences etc ) , will tend to yield better predictive results, but it will take more computational resources and it may not be significantly better. Major steps to clean the training data’re

1. Non English language Characters & Words in the Training Data were removed from the training data.
2. Stop words are so commonly used words , & if you try to predict the next word you may always get a stop word instead of something you are more likely to type, however if these are in often use why not propose it.So I choose to keep those for fitting & validating the prediction model.
3. Stemming is another common technique to concise the terms or words that are originating with common stem for text classification corpus, I choose not to do it to maintain sentence context sense while predicting.
4. Punctuation and Numbers were removed completely from data for creating simple terms in the sentences , eventhough certain words (& sentences) with punctuation and numbers are humanly used. However it mostly got rid of poor use of punctuation as well and thus cleaning the corpus a lot.
5. Lower casing of all the words in the corpus was done to remove duplicity of terms.
6. Extra blanks or spaces, from the sentences were removed, to compress the data.
7. Profanity words were not removed in the training document , these will be removed from the predictions.

## Data Language Modelling

Estimation technique using probability of occurence of a sentence can be used to determine most likelyhood outcome of sentence in a language model. However it has a major flaw that probability of any sentence which is not in the training data can not be determined. Since there are only finite training sentences in training data it assigns probability of zero to any sentence which this model has not seen before.

As an alternative, if sentences are treated as sequences of words and each occurence of word is assigned a probability. Words probability being its count of occurence or its frequency divided by total words in the training corpus.Probability of any sentence is then product of its individual word probabilities. However with this apporach, there are enormously huge combinations possible to compute the probability of a sentence. Since a sentence can be made of any length of words and there’s vast choice for word value, it means numerically very large possibilities to predict.

Markov Processes Probablistic models use conditional probability chain rules and assumption that the probability of a word depends only on the probability of a limited history . It means the probability of a word depends only on the probability of the n previous words.This approximation addreseses the issue that the longer the sequence, the less likely we are to find it in a training corpus.

In this language model, words of sentences are split into groups of size n number also referred as Tokens, or Tokenization of corpus is done to make all possible groups of n words, that appear together in sentences with their count of occurence or frequencies.

Tokenization of training data is stored in Term document matrix which holds Term with their occurence count or frequency in each sampled document in a matrix form. It appears that some terms appear frequently in several documents and many appear rarely , thus leading to sparsity ( or emptyness) in term document matrix. I choose not to remove sparse terms or use highly dense or 0% sparsity as each short sentence from twitter is considered a separate document & high count of these document leads to spurious sparsity. Term document matrix of training data resulted in 43,968 unique terms.

A n-Gram Language model with token value n = 3 is called Trigram , n = 2 is called Bigram and n = 1 is called unigram. Higher the n is, the more data is needed to train

In this exploratory analysis , only Bigram and Trigram Language Models were built using RWeka package. For the prediction models & web application intent is to use maximum likelyhood estimation techniques using Markovs Processes , Linear interpolation and use discounting method or Backoff models to deal with unseen words in corpus. Perplexity will be used to evaluate the information content and goodness of fit of Language Model.

## Data Analysis

Top 3 Frequent words in N-Gram Model
Term1 Term2 Term3 Top 3 Count Total Word Count
Unigram the and you 53629 534618
Bigram in the of the to the 6431 652478
Trigram thanks for the a lot of one of the 632 619322
Least 3 Frequent words in N-Gram Model
Term1 Term2 Term3 Least 3 Count Total Word Count
Unigram aaaaaahhhhhi aaaah aaahhh 3 534618
Bigram a able a abomination a accounts 3 652478
Trigram a a big a a couple a a plate 3 619322
Summary of Word Frequency in N-Gram Model
Min. 1st Qu. Median Mean 3rd Qu. Max.
Unigram 1 1 1 12.160 3 29120
Bigram 1 1 1 2.109 1 2526
Trigram 1 1 1 1.180 1 234
Corpus Percentage of frequent terms in N-Gram Model
10 Terms 100 Terms 1000 Terms
Unigram 16.7317599 38.602516 67.197139
Bigram 2.0707212 7.542936 19.583189
Trigram 0.2373563 1.090063 3.918801

Words or terms with high frequency in unigram , bigram & trigram model are few and about 3/4th of words have frequency less than 3. Number of unique terms in unigram, bigram and trigram language model created out of the training data corpus are 43968 , 309337 & 524889 respectively. Above table shows that the 10 most frequently used Terms which are 0.0227438 % of terms in unigram model account for 16.7317599 % of usage in corpus. Similarly most frequently used 10 Terms of bigram and trigram model are 0.0032327% and 0.0019052% of terms respectively however these occur significantly much more times in corpus.

Above table shows a language model in which very few words are repeated very often

### Often used Terms in Unigram & Bigram model

1% of random sample of corpus data for Training and Test Data provides enough term associations to be able for it to be used for the next word prediction algorithm and tokenization’s done with considerable ease on my computers memory.

Unigram, Bigram and Trigram language Term Document Matrix are taking 2.52 Mb, 19.18 Mb and 35.66 Mb of memory space respectively

## Machine Learning Model Evaluation

Prediction success increases by number of keystrokes & choices