NLP with AirBnb
First, the survey text was sent to me in a .csv file, aptly named “text.csv.” I then did some basic data cleaning to get the data properly labeled.
data <- read.csv("~/coding/text.csv", header = TRUE, sep = ",")
uid <- 0:45
airbdata <- data.frame("uid" = uid, "date" = data$Response.Date, "text" = data$Respondents)
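As a quick sanity check (my addition, not part of the original post), we can glance at the structure of the newly labeled data frame:

```r
str(airbdata)      #should show 46 rows with columns uid, date, and text
head(airbdata, 3)  #peek at the first few labeled responses
```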
Next, we load the packages we'll need: tm, SnowballC, wordcloud, topicmodels, ggplot2, ggthemes, and dplyr.
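In code, that is simply (assuming the packages are already installed):

```r
library(tm)          #text mining framework
library(SnowballC)   #Porter stemming
library(wordcloud)   #word cloud plots
library(topicmodels) #LDA topic modeling
library(ggplot2)     #plotting
library(ggthemes)    #extra ggplot themes
library(dplyr)       #data manipulation
```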
Now we can create the text corpus that will be used for our analysis and visualizations.
corpus <- Corpus(VectorSource(airbdata$text)) %>%
  tm_map(content_transformer(tolower)) %>% #converts all text to lower case
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removeWords, c(stopwords('english'), 'like')) #removes English stop words plus 'like'
The code above lowercases the text and strips out punctuation, numbers, extra whitespace, English stop words, and the filler word ‘like.’
Here is what our corpus looks like now:
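The original post showed a screenshot of the cleaned corpus at this point; the same view can be reproduced with tm's inspect function:

```r
inspect(corpus[1:3]) #prints the first three cleaned documents to the console
```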
At this point we could generate our word cloud and word-frequency chart, but the words ‘travel’ in entry 17 and ‘traveling’ in entry 22 would be counted as separate, different words. The two are clearly related, and through a technique called stemming we can strip the word endings so that like words are combined for analysis.
Because we prepared our corpus in the previous step, stemming the text is rather easy.
dictCorpus <- corpus #duplicates corpus prior to stemming
corpus <- tm_map(corpus,stemDocument, language='english') #removes word endings, forming stems
Here is what our corpus looks like now:
Both words are now reduced to ‘travel,’ which will sharpen the impact of the visualizations.
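The effect of stemming can be verified directly with SnowballC's wordStem function, which tm's stemDocument uses under the hood:

```r
library(SnowballC)

#all three variants reduce to the same stem, "travel"
wordStem(c("travel", "traveling", "travels"), language = "english")
```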
Now for the "difficult" part – stem completion. Stem completion maps each newly stemmed word back to a complete word from a reference dictionary, so that our word cloud won’t contain truncated stems like ‘sometim’ or ‘vacat.’
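A toy illustration of what stemCompletion does (my example, with a made-up two-word dictionary, not data from the post):

```r
library(tm)

#each stem should complete to the matching dictionary word,
#e.g. "vacat" -> "vacation" and "sometim" -> "sometimes"
stemCompletion(c("vacat", "sometim"), dictionary = c("vacation", "sometimes"))
```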
Stem completion is a multi-part process. First, we define the function:
stemCompletion2 <- function(x, dictionary) {
  x <- unlist(strsplit(as.character(x), " "))
  x <- x[x != ""] #drops empty tokens left over from splitting
  x <- stemCompletion(x, dictionary = dictionary)
  x <- paste(x, sep = "", collapse = " ")
  PlainTextDocument(stripWhitespace(x))
}
Then we ‘lapply’ the function over the contents of the corpus, passing the argument dictionary=dictCorpus:
corpus <- lapply(corpus, stemCompletion2, dictionary=dictCorpus)
Next, we rebuild a corpus from the completed text:
corpus <- Corpus(VectorSource(corpus))
Here’s what our newly stem-completed corpus contains:
Now that our corpus is complete, we need to generate a TermDocumentMatrix (TDM) before we can use the wordcloud function. Here is the code to prepare the TDM:
h <- TermDocumentMatrix(corpus)
m <- as.matrix(h)
v <- sort(rowSums(m),decreasing = TRUE)
d <- data.frame(word=names(v),freq=v)
The output includes some extra words introduced by the stem-completion process, such as ‘character’, ‘wday’, ‘min’, and ‘meta’. Because these artifacts show up in nearly every document, they have the highest counts, so I filtered the data frame d to keep only words with a frequency of 16 or fewer.
d <- d %>% filter(freq <= 16) #drops the high-frequency artifact terms from stem completion
Everything we need is now in place for our wordcloud.
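A minimal sketch of what the wordcloud call could look like, using the d data frame built above (the seed, minimum frequency, and palette are my assumptions, not the post's):

```r
library(wordcloud) #also attaches RColorBrewer for the palette

set.seed(42) #hypothetical seed, for a reproducible layout
wordcloud(words = d$word, freq = d$freq,
          min.freq = 2, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```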
Continue to the next page for final instructions.
NLP with AirBnb – Mubashir Qasim April 24, 2017
[…] article was first published on R – NYC Data Science Academy Blog, and kindly contributed to […]