For this lab, we will need the twitteR, ROAuth and tm libraries. These were installed in the week 2 lab. If you have not installed them using the instructions from the week 2 lab, do so now.
To allow R to access Twitter, we need to provide R with Twitter's keys. In this section we will examine the process for doing this. A video of this process is also available:
Note that this video is stored in a Matroska media container (.mkv) and uses H.264 video coding and Vorbis audio encoding. Matroska and Vorbis are free formats (not limited in use by patents) and so are not promoted by commercial software companies. If you can't play this video with your current video player, download VLC.
To access the Twitter API, a Twitter account is needed. If you do not have one, sign up here:
We must also register the program that will access the API. To register the program, visit this link:
and log in. Once you have logged in, click "Create a new application" and fill in the form. Ensure that the Callback URL is set to
When the application is created, it will be provided with a consumer key and a consumer secret (both text sequences). These text sequences are needed soon, so create a new R script and record the values:
key = "put key string here"
secret = "put secret string here"
NOTE: Anyone with Web access can upload information to Twitter. Therefore, we are unable to control the text that appears from any Twitter downloads. The tasks we have provided did not contain offensive material at the time of writing, but we cannot guarantee that there will not be offensive language during the time of the lab.
Let's find which words are mentioned with "Kevin Bacon".
First get the latest 100 tweets written in English, containing the terms "Kevin" and "Bacon":
tweets = searchTwitter('kevin bacon', n = 100, lang = "en")
The variable tweets is a list. Examine the list items using double brackets (e.g. tweets[[2]]).
Convert the list to a data frame:
tweets.df = twListToDF(tweets)
Remember that a data frame is a table, with column headings. Examine the column headings:
There are many columns that we can explore. At this moment, we are interested in the text:
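As a minimal sketch of these two steps, the column headings of a data frame can be listed with names() and a single column extracted with $. The data frame toy.df below is made up for illustration and only stands in for tweets.df; its column names are assumptions, not the full set returned by twListToDF.

```r
# toy.df is a hypothetical stand-in for tweets.df (made-up rows and columns)
toy.df = data.frame(
  text = c("Kevin Bacon rocks", "Bacon for breakfast"),
  retweetCount = c(3, 0),
  stringsAsFactors = FALSE
)

# list the column headings
toy.headings = names(toy.df)
print(toy.headings)

# extract the text column as a character vector
toy.text = toy.df$text
print(toy.text)
```

The same two calls applied to tweets.df show the real headings and the tweet text.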
To examine the word frequencies, we must build a frequency table. To do this, we must extract the words from the strings. We can extract sequences of letters by splitting the strings on all non-letter characters. We can do this for the first tweet:
strsplit(tweets.df$text[1], "[^A-Za-z]+")
or for all tweets:
tweet.words = strsplit(tweets.df$text, "[^A-Za-z]+")
The variable tweet.words is a list, where each list item is a vector of the words from a tweet. We want to combine all words to count them, so we flatten the list into a single vector with unlist and tabulate the resulting vector:
word.table = table(unlist(tweet.words))
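To see what the split and tabulate steps produce, here is a small sketch on two made-up strings (toy.text is hypothetical and stands in for tweets.df$text). Note that a string starting with a non-letter character yields an empty string as its first "word", which then also appears in the table.

```r
# two made-up tweets standing in for tweets.df$text
toy.text = c("Kevin Bacon is great!", "@fan: Kevin Bacon, again")

# split each string on runs of non-letter characters
toy.words = strsplit(toy.text, "[^A-Za-z]+")
print(toy.words)

# flatten the list into one vector and count each distinct word
toy.table = table(unlist(toy.words))
print(toy.table)
```

The counts are case sensitive, so "Bacon" and "bacon" would be counted separately.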
To identify the top 20 occurring words, we must sort the table and examine the top 20 items.
Examine the help page for sort and work out how to obtain the top 20 occurring words from the table.
Do these words tell us anything about Kevin Bacon? It is likely that they don't. The list is likely to contain words such as is, of, a and so on. We need to use a more sophisticated method to extract meaningful terms.
In this section, we will use the library tm (text mining) to assist us in finding more useful information about Kevin Bacon.
First load the library:
library(tm)
The tm functions work with the package's own Corpus object, so we must convert the data frame into a corpus:
tweet.corpus = Corpus(VectorSource(tweets.df$text))
Then convert the characters to UTF-8, running only the call that matches your operating system:
tweet.corpus = tm_map(tweet.corpus,
function(x) iconv(x, to='UTF8', sub='byte')) # for Windows
tweet.corpus = tm_map(tweet.corpus,
function(x) iconv(x, to='UTF-8-MAC', sub='byte')) # for OS X
Note: tm_map applies the given function to each document in the corpus. Keep this in mind for the next section.
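Now that we have our corpus, we want to remove numbers, remove punctuation, strip excess whitespace, convert the text to lower case, remove stop words, and reduce the remaining words to their stems.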
To perform each of these tasks, we can use the functions removeNumbers, removePunctuation, stripWhitespace, tolower, removeWords and stemDocument, together with the application function tm_map.
Use your knowledge of R, the help pages and your favourite Web search engine to work out how to perform the six tasks, then implement them on our corpus. You can examine the changes in the corpus by printing the contents of the first document using tweet.corpus[[1]]. Hint: tm_map applies a function to all documents in the corpus. Look at the examples at the bottom of the tm_map help page.
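To get a feel for what some of these transformations do to a single document, the sketch below applies rough base R equivalents to one made-up string. It does not use tm, it is not the tm_map solution, and it leaves out word removal and stemming; the gsub patterns are an approximation of the tm functions, not their actual implementation.

```r
# a made-up string standing in for one document
toy.doc = "Kevin Bacon starred in 2 films!!  Footloose, 1984."

# rough base R approximations of four of the cleaning steps
toy.doc = gsub("[0-9]+", "", toy.doc)          # drop digits
toy.doc = gsub("[[:punct:]]+", "", toy.doc)    # drop punctuation
toy.doc = tolower(toy.doc)                     # lower case
toy.doc = gsub("[[:space:]]+", " ", toy.doc)   # collapse runs of whitespace
toy.doc = gsub("^ +| +$", "", toy.doc)         # trim leading and trailing spaces

print(toy.doc)
```

Comparing the string before and after each step is a useful way to check your own tm_map calls on tweet.corpus[[1]].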
In the lecture, we saw a form of TF-IDF weighting, where each document term weight is computed as:
\[ w_{d,t} = \log_e{\left ( f_{d,t} + 1 \right )}\log_e{\left ( \frac{N}{f_t} \right )} \]
where \(f_{d,t}\) is the frequency of term \(t\) in document \(d\) (found at tweet.matrix[d,t]), \(N\) is the number of documents (given by dim(tweet.matrix)[1]) and \(f_t\) is the number of documents containing term \(t\) (found using sum(tweet.matrix[,t] > 0)).
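As a small worked example of the formula, the sketch below computes the weight of a single matrix entry. The matrix toy.matrix is made up (3 documents, 2 terms) and only stands in for tweet.matrix; it shows one cell at a time, not the full weighted matrix.

```r
# made-up term frequencies: rows are documents, columns are terms
toy.matrix = matrix(c(2, 0, 1,
                      1, 1, 1),
                    nrow = 3,
                    dimnames = list(NULL, c("bacon", "the")))

N = dim(toy.matrix)[1]                        # number of documents

# "bacon" appears in 2 of the 3 documents
f.bacon = sum(toy.matrix[, "bacon"] > 0)
w.bacon = log(toy.matrix[1, "bacon"] + 1) * log(N / f.bacon)

# "the" appears in every document, so log(N / f_t) = log(1) = 0
f.the = sum(toy.matrix[, "the"] > 0)
w.the = log(toy.matrix[1, "the"] + 1) * log(N / f.the)

print(c(w.bacon, w.the))
```

A term that occurs in every document receives a weight of zero, which is how this weighting suppresses words such as is, of and a.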
To apply this document term weighting, we need to extract the term frequency matrix from the corpus object.
First we must make sure that each document in the corpus is stored as a PlainTextDocument:
tweet.corpus = tm_map(tweet.corpus, PlainTextDocument)
Then we can pass the corpus to DocumentTermMatrix, to create a document-term matrix:
tweet.dtm = DocumentTermMatrix(tweet.corpus) # create the DocumentTermMatrix object
tweet.matrix = as.matrix(tweet.dtm) # convert to a matrix
For the following exercises, remember that tweet.matrix[i,] is the \(i\)th row of the matrix (a document), tweet.matrix[,i] is the \(i\)th column (a term), and tweet.matrix[i,j] is the entry in the \(i\)th row and \(j\)th column (the frequency of term \(j\) in document \(i\)).
Compute the weighted document term matrix tweet.weighted.matrix containing the values of \(w_{d,t}\).
Sum the weights in each column of tweet.weighted.matrix to obtain an overall weight for each term.
Locate the position of the top 20 words, according to the overall word weight. Use the vector colnames(tweet.matrix) to locate the word names.
Are these words more descriptive of Kevin Bacon than those computed using only term frequencies? They should be after all the effort you put into calculating them!
You found words associated with Kevin Bacon. Use Twitter to find words associated with your friends' names.