1 Installing Libraries

For this lab, we will need the twitteR, ROAuth and tm libraries. These were installed in the week 2 lab; if you have not yet installed them, follow the instructions from that lab now.

2 Accessing Twitter

To allow R to access Twitter, we need to provide R with Twitter's OAuth keys. In this section we will walk through this process. A video of the process is also available:

Note that this video is stored in a Matroska media container (.mkv) and uses H.264 video encoding and Vorbis audio encoding. These are free formats (not restricted in use by patents) and so are not promoted by commercial software companies. If you cannot play the video with your current media player, download VLC.

2.1 Obtaining OAuth keys

To access the Twitter API, a Twitter account is needed. If you do not have one, sign up here:

To access the Twitter API, we must register a program that is accessing the API. To register the program, visit this link:

and log in. Once you have logged in, click "Create a new application" and fill in the form. Ensure that the Callback URL is set to http://127.0.0.1:1410. When the application is created, it will be provided with a consumer key and a consumer secret (both text strings); these strings are needed shortly, so create a new R script and record the values:

key = "put key string here"
secret = "put secret string here"

2.2 Authorising R

We can access the Twitter API in R by loading the twitteR library (installing it first if needed).

library("twitteR")

To perform browser based authentication, we need to install the additional libraries base64enc and httpuv:

install.packages("base64enc")
install.packages("httpuv")

We then need to set up the OAuth session details. To perform the OAuth authentication using twitteR and your previously recorded key and secret, add the following code to your R script and run the script:

setup_twitter_oauth(key, secret)

A Web browser window should open asking you to verify yourself to Twitter. Once you have entered your Twitter password, the OAuth handshake is complete and you can start using twitteR to download tweets.

If the process failed, it may be due to a missing SSL certificate. If so, add the following line to the top of your script to disable SSL verification, and rerun the script.

options(RCurlOptions = list(verbose = FALSE,
        capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"),
        ssl.verifypeer = FALSE))

We are now ready to request information from the Twitter API.

2.3 Searching Twitter

NOTE: Anyone with Web access can upload information to Twitter, so we are unable to control the text that appears in any Twitter downloads. The tasks we have provided did not contain offensive material at the time of writing, but we cannot guarantee that no offensive language will appear at the time of the lab.

Let's find which words are mentioned with "Kevin Bacon".

First get the latest 100 tweets written in English, containing the terms "Kevin" and "Bacon":

tweets = searchTwitter('kevin bacon', n = 100, lang = "en")

The variable tweets is a list. Examine the list items using double brackets (e.g. tweets[[2]]).
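For example, the following optional checks show how many tweets were returned and display one of them:

length(tweets)  # number of tweets returned
tweets[[2]]     # examine the second tweet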

Convert the list to a data frame:

tweets.df = twListToDF(tweets)

Remember that a data frame is a table with column headings. Examine the column headings:

names(tweets.df)

There are many columns that we can explore. At the moment, we are interested in the text:

tweets.df$text

To examine the word frequencies, we must build a frequency table. To do this, we first extract the words from the strings by splitting them on all non-letter characters. We can do this for the first tweet:

strsplit(tweets.df$text[1], "[^A-Za-z]+")

or for all tweets:

tweet.words = strsplit(tweets.df$text, "[^A-Za-z]+")

The variable tweet.words is a list, where each list item is a vector of the words from a tweet. We want to combine all of the words to count them, so we flatten the list into a single vector and tabulate it:

word.table = table(unlist(tweet.words))

To identify the top 20 occurring words, we must sort the table and examine the top 20 items.

Examine the help page for sort and work out how to obtain the top 20 occurring words from the table.
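If you get stuck, one possible approach (a sketch only; see ?sort for the decreasing argument) is to sort the table from most to least frequent and keep the first 20 entries:

sort(word.table, decreasing = TRUE)[1:20]  # the 20 most frequent words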

Do these words tell us anything about Kevin Bacon? It is likely that they don't: the list probably contains words such as is, of, a and so on. We need a more sophisticated method to extract meaningful terms.

3 Text Mining

In this section, we will use the library tm (text mining) to assist us in finding more useful information about Kevin Bacon.

First load the library:

library("tm")

The tm library works with its own Corpus object, so we must convert the data frame into a corpus:

tweet.corpus = Corpus(VectorSource(tweets.df$text))

then convert the characters to UTF-8, running only the line appropriate for your operating system:

tweet.corpus = tm_map(tweet.corpus,
         function(x) iconv(x, to='UTF8', sub='byte')) # for Windows
tweet.corpus = tm_map(tweet.corpus, 
         function(x) iconv(x, to='UTF-8-MAC', sub='byte')) # for OS X

Note: tm_map applies the given function to each document in the corpus. Keep this in mind for the next section.

3.1 Preprocessing the document set

Now that we have our corpus, we want to

  1. remove numbers
  2. remove punctuation
  3. remove whitespace
  4. case fold all characters to lower case
  5. remove a set of stop words
  6. reduce each word to its stem

To perform these tasks, tm provides the functions removeNumbers, removePunctuation, stripWhitespace, removeWords and stemDocument (tolower comes from base R), along with the application function tm_map.

Use your knowledge of R, the help pages and your favourite Web search engine to work out how to perform the six tasks, then implement them on our corpus. You can examine the changes in the corpus by printing the contents of the first document using tweet.corpus[[1]]. Hint: tm_map applies a function to all documents in the corpus. Look at the examples at the bottom of the tm_map help page.
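As a starting point, here is a sketch of the tm_map pattern for two of the tasks (removing numbers and removing English stop words); the remaining tasks follow the same pattern. Note that tolower is a base R function rather than a tm transformation, so it may need to be wrapped in content_transformer, and stemDocument may require the SnowballC package to be installed.

tweet.corpus = tm_map(tweet.corpus, removeNumbers)                      # task 1: remove numbers
tweet.corpus = tm_map(tweet.corpus, removeWords, stopwords("english"))  # task 5: remove stop words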

3.2 Weighting the Document Matrix

In the lecture, we saw a form of TF-IDF weighting, where each document term weight is computed as:

\[ w_{d,t} = \log_e{\left ( f_{d,t} + 1 \right )}\log_e{\left ( \frac{N}{f_t} \right )} \]

where \(f_{d,t}\) is the frequency of term \(t\) in document \(d\) (found at tweet.matrix[d,t]), \(N\) is the number of documents (given by dim(tweet.matrix)[1]) and \(f_t\) is the number of documents containing term \(t\) (found using sum(tweet.matrix[,t] > 0)).
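As a sketch, the formula for a single cell looks like this in R (assuming tweet.matrix has been built as described below, and that d and t hold a document index and a term index respectively):

N = dim(tweet.matrix)[1]                       # number of documents
ft = sum(tweet.matrix[, t] > 0)                # number of documents containing term t
w = log(tweet.matrix[d, t] + 1) * log(N / ft)  # weight of term t in document d (log is natural log)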

To apply this document term weighting, we need to extract the term frequency matrix from the corpus object.

First we must make sure that each document in the corpus is stored as a plain text document:

tweet.corpus = tm_map(tweet.corpus, PlainTextDocument)

Then we can pass the corpus to DocumentTermMatrix, to create a document-term matrix:

tweet.dtm = DocumentTermMatrix(tweet.corpus) # create the DocumentTermMatrix object
tweet.matrix = as.matrix(tweet.dtm) # convert to a matrix

For the following exercises, remember that tweet.matrix[i,] is the \(i\)th row of the matrix (a document), tweet.matrix[,i] is the \(i\)th column (a term), and tweet.matrix[i,j] is the entry in the \(i\)th row and \(j\)th column (the count of word \(j\) in document \(i\)).
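For example, the following optional checks confirm the size of the matrix and show the counts of the first ten terms in the first document:

dim(tweet.matrix)      # number of documents and number of terms
tweet.matrix[1, 1:10]  # counts of the first ten terms in document 1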

Compute the weighted document term matrix tweet.weighted.matrix containing the values of \(w_{d,t}\).

Sum the weights in tweet.weighted.matrix to obtain an overall weight for each term.

Locate the position of the top 20 words, according to the overall word weight. Use the vector colnames(tweet.matrix) to locate the word names.
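If you are unsure where to start, one possible approach (a sketch only, assuming you have computed tweet.weighted.matrix as above; term.weights and top20 are illustrative names) is:

term.weights = colSums(tweet.weighted.matrix)         # overall weight of each term
top20 = order(term.weights, decreasing = TRUE)[1:20]  # positions of the top 20 terms
colnames(tweet.matrix)[top20]                          # the top 20 words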

Are these words more descriptive of Kevin Bacon than those computed using only term frequencies? They should be, after all the effort you put into calculating them!

4 Label your friends

You have found words associated with Kevin Bacon. Now use Twitter to find words associated with your friends' names.