
Data | Jun 02, 2014

Twitter Analytics Using R Part 2: Create Word Cloud

Bailey Adam and Sugandha Choudhary

We’ve been exploring the power of the programming language R for data mining. In this post we will use R to visualize tweets as a word cloud to find out what people are tweeting about the NBA (#nba). A word cloud is a visual representation of word frequency: the more often a word appears in our sample of tweets, the larger it is drawn. See Twitter Analytics Using R Part 1: Extract Tweets for how to extract data from Twitter. The final result should look similar to the following:

1. Extract Tweets

Load the Twitter authentication and extract tweets using #nba.

# twitteR provides registerTwitterOAuth() and searchTwitter()
library(twitteR)
load("twitter authentication.Rdata")
registerTwitterOAuth(cred)

tweets <- searchTwitter("#nba", n = 1499, cainfo = "cacert.pem", lang = "en")

tweets.text <- sapply(tweets, function(x) x$getText())

2. Clean Up Text

We have already authenticated and retrieved the text of the tweets tagged #nba. The first step in creating a word cloud is to clean up the text: convert it to lowercase and remove punctuation, usernames, links, and so on. We use the function gsub to replace unwanted text. gsub replaces every match of a given pattern, whereas sub replaces only the first. Although other packages can perform this operation, we chose gsub for its simplicity and readability.
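To make the behavior of gsub concrete, here is a minimal sketch that compares it with sub on a single string. The sample tweet and the variable names are made up for illustration and are not part of the data used in this post.

```r
# Hypothetical sample tweet, used only for illustration
tweet <- "go lakers go lakers go"

# sub() replaces only the first match of the pattern
first.only <- sub("lakers", "LAL", tweet)

# gsub() replaces every match of the pattern
all.matches <- gsub("lakers", "LAL", tweet)

print(first.only)
print(all.matches)
```

Because every cleanup step needs to apply to all matches within each tweet, gsub is the natural fit here.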

# convert all text to lower case
tweets.text <- tolower(tweets.text)

# remove the retweet marker "rt" (whole word only, so words like "court" stay intact)
tweets.text <- gsub("\\brt\\b", "", tweets.text)

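# remove @usernames
tweets.text <- gsub("@\\w+", "", tweets.text)

# remove punctuation
tweets.text <- gsub("[[:punct:]]", "", tweets.text)

# remove links (punctuation is already gone, so each URL is now a single run of word characters starting with "http")
tweets.text <- gsub("http\\w+", "", tweets.text)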

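# collapse runs of spaces and tabs into a single space (replacing them with "" would glue neighboring words together)
tweets.text <- gsub("[ \t]+", " ", tweets.text)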

# remove blank spaces at the beginning
tweets.text <- gsub("^ +", "", tweets.text)

# remove blank spaces at the end
tweets.text <- gsub(" +$", "", tweets.text)
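To check the cleanup without a Twitter connection, the steps can be bundled into a helper and run on a couple of made-up tweets. This is only a sketch: clean.tweets and the sample strings are hypothetical names introduced here. It matches "rt" only as a whole word and collapses whitespace runs to a single space, so that neighboring words are neither truncated nor joined.

```r
# Hypothetical helper (not part of the original post) that bundles the cleanup steps
clean.tweets <- function(x) {
  x <- tolower(x)                        # lower case
  x <- gsub("\\brt\\b", "", x)           # retweet marker, whole word only
  x <- gsub("@\\w+", "", x)              # @usernames
  x <- gsub("[[:punct:]]", "", x)        # punctuation
  x <- gsub("http\\w+", "", x)           # links (already stripped of punctuation)
  x <- gsub("[ \t]+", " ", x)            # collapse spaces and tabs
  x <- gsub("^ +| +$", "", x)            # trim both ends
  x
}

# Made-up tweets for illustration only
sample.tweets <- c("RT @nbafan: Great game tonight! http://t.co/abc123 #NBA",
                   "Courtside seats   for the Lakers")

cleaned <- clean.tweets(sample.tweets)
print(cleaned)
```

Running the helper on a handful of real tweets before building the corpus is a quick way to spot patterns the regular expressions miss.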

3. Remove Stop Words

In the next step we will use the text mining package tm to remove stop words. A stop word is a very common word such as “the” that carries little meaning on its own. Left in, stop words would dominate the word cloud, so we exclude them from the analysis. If tm is not already installed, you will need to install it from the Comprehensive R Archive Network (CRAN).

# install tm if not already installed
install.packages("tm")
library("tm")

# create corpus
tweets.text.corpus <- Corpus(VectorSource(tweets.text))

# clean up by removing stop words
tweets.text.corpus <- tm_map(tweets.text.corpus, removeWords, stopwords("english"))
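To see what stop-word removal does to the text, here is a rough base R sketch that needs no extra packages. It assumes that removing stop words amounts to deleting whole-word matches from each string. The four-word stop list and the sample strings are hypothetical, and the list that tm supplies is much longer.

```r
# Tiny hypothetical stop list, for illustration only
stop.list <- c("the", "a", "is", "for")

# Build one pattern that matches any stop word as a whole word
pattern <- paste0("\\b(", paste(stop.list, collapse = "|"), ")\\b")

# Made-up, already cleaned tweets
cleaned <- c("the lakers game is tonight", "a great night for the nba")

# Delete the stop words
no.stops <- gsub(pattern, "", cleaned, perl = TRUE)

# Tidy up the gaps left behind by the removed words
no.stops <- gsub(" +", " ", no.stops)
no.stops <- gsub("^ | $", "", no.stops)
print(no.stops)
```

The word-boundary anchors matter: without them, the "a" in "great" or "lakers" would be deleted as well.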

4. Generate Word Cloud

Now we’ll generate the word cloud using the wordcloud package. For this example we plot no more than 150 words that occur more than once, with colors assigned at random. Because random.order is set to FALSE, words are plotted in decreasing order of frequency rather than in random order. If wordcloud is not already installed, you will need to install it from the Comprehensive R Archive Network (CRAN).

# install wordcloud if not already installed
install.packages("wordcloud")
library("wordcloud")

# generate word cloud
wordcloud(tweets.text.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = TRUE, random.order = FALSE, max.words = 150)
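The min.freq and max.words settings are easier to reason about with the word counts in hand. The following base R sketch counts words in a few made-up tweets and applies the same two thresholds. It illustrates the filtering idea only and is not how the wordcloud package is implemented.

```r
# Made-up, already cleaned tweets for illustration only
cleaned <- c("lakers win tonight", "lakers heat finals", "heat win", "paul injured")

# Split into individual words and count them
words <- unlist(strsplit(cleaned, " +"))
freq <- sort(table(words), decreasing = TRUE)

# Keep words that occur at least twice (like min.freq = 2)
freq <- freq[freq >= 2]

# Keep at most the 150 most frequent words (like max.words = 150)
freq <- head(freq, 150)
print(freq)
```

Inspecting such a table before plotting shows which words will be drawn largest and whether the thresholds leave enough words to fill the cloud.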

Summary

This post highlights how easily R can extract and visualize Twitter data as a word cloud. There are many ways to represent data in R, and you’ll need to dig deeper to understand how all the words relate to the NBA. “Lakers” might be obvious, but “Paul” or “girlfriend” might require more context. In the next post we will learn how to perform sentiment analysis and chart the results as a graph using R.