Data Study on financial products and services Complaints
The skills the author demoed here can be learned through taking Data Science with Machine Learning bootcamp with NYC Data Science Academy.
Contributed by Wansang Lim. He is currently in the NYC Data Science Academy 12 week full time Data Science Bootcamp program taking place between January 11th to April 1st, 2016. This post is based on his first class project - R visualization (due on the 2th week of the program).
Introduction
Each week, data shows the government sent thousands of consumers’ complaints about financial products and services to companies for response. Those complaints are published after the company responds or after 15 days, whichever comes first. By adding their voice, consumers help improve the financial marketplace.
The narrative of consumer complaints is like "The insurance company is responsible for this charge, I was and presently am a Federal employee and medical /dental insurance is extended 31 days after separation from employment. I left my job the same month I visited this dental office, they are aware of the 31 day extension and refuse to bill the correct party for this charge, the dental insurance company. I 've regained my Federal employment status and hold the same insurance. They sent this account to collections in error. They need to bill the insurance company and remove the fraudulent charge off of all my credit reports, they have caused me a tremendous hardship. "
The web site for reporting complaints is https://www.consumerfinance.gov/complaintdatabase/
Data
The data was downloaded from https://catalog.data.gov/dataset/consumer-complaint-database. The data is quite huge. It has 0.5 million rows and 16 columns.The time period is 2011/December ~ 2015/December.
First, the number of reports are grouped by year to see if there is any relationships.
```{r}
dateData <- select(Consumer_Complaints, year, month)
groupByYear % summarise(., yearNum = n())
groupByYear
<pre><code>Table 1. Consumer complaints data group by Year.
year yearNum
2012 72523
2013 108272
2014 153164
2015 161703
The table 1 shows that consumer increases by year. Then I decide to divide by month to look at the trends and regress it.
```{R}
ByYearMonth % summarise(., total = n())
ByYearMonth$seq <- c(1:nrow(ByYearMonth))
regre <- lm(total ~ seq, data=ByYearMonth)
summary(regre)
Figure 1. Regress plots of total reported number of complaints by each month from 2011/Dec to 2012/Dec
Then, I grouped it by company. The number of all reported companies is 3, 4111. The 20 reported companies are 66% of all complaints.
````````````{r}
top20Company <- companyRank[1:20,]
top20Company$Company <- factor(top20Company$Company, levels = top20Company$Company)
top20Company$percentage <- top20Company$total/nrow(Consumer_complainYearMonth)*100
library(ggplot2)
barCompany <- ggplot(top20Company, aes(x=Company, y=percentage)) + geom_bar(stat="identity")
barCompany + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylab("%")
````````````````
Figure. The top 20 reported companies
The complaints was grouped by products.
````````{r}
byProduct % summarise(., total= n() ) %>% arrange(., desc(total))
byProduct$Product <- factor(byProduct$Product, levels = byProduct$Product)
byProduct$Percentage <- byProduct$total/nrow(Consumer_complainYearMonth)*100
barPro <- ggplot(byProduct, aes(x=Product, y=Percentage)) + geom_bar(stat="identity")
barPro + theme(axis.text.x = element_text(angle = 90, hjust = 1), panel.background = element_blank()) + ylab("%")
`````````
The total number of reports are grouped by top 10 company and product.
```{r}
CcTop10Company <- filter(Consumer_complainYearMonth, Company %in% top10Company$Company)
CcTop10Company$Company <- factor(CcTop10Company$Company, levels = top10Company$Company)
byCompanyProduct <- group_by(CcTop10Company, Company, Product)
byCompanyProductTotal <- summarise(byCompanyProduct, total = n())
byCompanyProductArrange <- arrange(byCompanyProductTotal, desc(total))
library(ggplot2)
library(lattice)
groupXYplot <- xyplot(byCompanyProductArrange$total ~ byCompanyProductArrange$Product | byCompanyProductArrange$Company, data=byCompanyProductArrange, xlab = list(label = "Product", cex=1), scales=list(x=list(rot=45)), groups = Product)
groupXYplot
<a href="https://nycdsa-blog-files.s3.us-east-2.amazonaws.com/2016/01/ComPro.jpg"><img src="https://nycdsa-blog-files.s3.us-east-2.amazonaws.com/2016/01/ComPro-300x214.jpg" alt="ComPro" width="300" height="214" class="alignnone size-medium wp-image-8998" /></a>
The total complaints are grouped by state.
```{r}
onlyState <- select(Consumer_complainYearMonth, State)
byState % summarise(., total = n())
byState <- byState[-1,]
onlyState <- select(Consumer_complainYearMonth, State)
byState % summarise(., total = n())
byState <- byState[-1,]
stateAbb <- read.csv("/media/wan/64GB/R_CLASS/Project/data/stateAbb.csv", stringsAsFactors=FALSE)
fullName <- c()
total <- c()
byStateFullname <- data.frame(fullName, total)
for (i in 1: nrow(byState)) {
posi <- match(byState$State[i], stateAbb$abbreviation)
if (!is.na(posi)) {
lastPosi <- length(fullName) + 1
fullName[lastPosi] <- stateAbb$state.name[posi]
total[lastPosi] <- byState$total[i]
}
}
byState <- data.frame(fullName, total)
library(maps)
library(ggplot2)
all_state <-map_data("state")
byState$region <- byState$fullName
Total <- merge(all_state, byState, by="region")
Total <- Total[Total$region!="district of columbia",]
p <- ggplot()
p <- p + geom_polygon(data=Total, aes(x=long, y=lat, group = group, fill=Total$total),colour="white"
) + scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar")
p1 <- p + theme_bw() + labs(fill = "Total Reports"
,title = "Total Reports by State", x="", y="")
p1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border = element_blank())
</code></pre>
<a href="https://nycdsa-blog-files.s3.us-east-2.amazonaws.com/2016/01/State.jpg"><img src="https://nycdsa-blog-files.s3.us-east-2.amazonaws.com/2016/01/State-300x214.jpg" alt="State" width="300" height="214" class="alignnone size-medium wp-image-8999" /></a>
Then total reports are analyzed by reporting methods like mail, interent.
The result is checked by time passing.
```{r}
byMediaYearMonth % summarise(., total = n())
byMediaYearMonth <- arrange(byMediaYearMonth, total)
byMediaYearMonth$month <- as.character(byMediaYearMonth$month)
for (i in 1: nrow(byMediaYearMonth)) {
if ( nchar(byMediaYearMonth$month[i], type = "chars", allowNA = FALSE, keepNA = NA) == 1) {
byMediaYearMonth$month[i] <- paste(0, byMediaYearMonth$month[i], sep = "" )
}
}
byMediaYearMonth$yymm <- paste(byMediaYearMonth$year, byMediaYearMonth$month, sep = "/")
byMediaYearMonth <- filter(byMediaYearMonth, yymm != "2015/12")
graph <- qplot(yymm, total, data = byMediaYearMonth, color = Submitted.via)
graph45 <- graph + theme(axis.text.x = element_text(angle = 45,size = 14, hjust = 1), panel.background = element_blank())
yymm <- byMediaYearMonth$yymm
yymm <- unique(yymm)
YearMonth <- data.frame(yymm, stringsAsFactors=FALSE)
YearMonth$yymm <- as.character(YearMonth$yymm)
ticks5 <- c()
lastRow <- nrow(YearMonth)
for (i in 1: lastRow) {
if (i %% 5 == 0) {
ticks5[i] <- YearMonth$yymm[i]
cat("yymm", YearMonth$yymm[i])
} else {
ticks5[i] <- ""
}
}
graph45five <- graph45 + scale_x_discrete(breaks = ticks5,labels =ticks5)
graph45five+ theme(axis.text.x = element_text(angle = 45, hjust = 1))
```