Why Is R a Must for Data Scientists?

Aiko Liu, Hanqing Zhang, and Pranjali Galgali
Posted on Dec 4, 2019

Although Python is gaining popularity and can handle tasks in engineering, data wrangling, and more, R's close ties with statistics and statistical machine learning still make it an important language in data science, and R remains a desired skill at many hiring companies. Here are several top reasons to learn R for data science:

A Language Designed for Data Analysis 

Unlike Python, which was a general-purpose high-level programming language before it became a data analysis tool, R was designed for data analysis from the very beginning. Python assumes that the coder has been trained as a programmer; R's designers recognized that in many data-related tasks, gaining valuable insights matters more than following strict coding practices. This is reflected in how easy and intuitive R is to learn. An analyst or data scientist does not need to master programming like a data engineer before deriving insights that affect the business or research.

As of this writing, there are more than 15,000 open-source, data-related R packages on CRAN (https://cran.r-project.org/web/packages/), the major R package repository. R users all over the world are free to install the packages and modify the code base for research and industrial use. These packages are developed or contributed by top research universities and private companies. Thus, when R users install new packages into their RStudio IDE, they are implicitly supported by the collective wisdom of the whole R community.

The well-known Python data analysis package pandas was modeled on R's data frame construct. Moreover, many statistical packages migrate from R to Python after a technique gains popularity.

Ease of Processing

R tries its best to close the gap between human thought and computational code. The convenience of the data analyst is placed at the top of the list, which is hard for a general-purpose language such as Python or JavaScript to match. This design philosophy is especially important for those whose main interest is data insight rather than engineering. One popular R package, dplyr, provides a pipe operator, %>% (re-exported from the magrittr package), which passes a data object as the first argument of a function. For example, to preview a dataset, instead of head(cars) we can write cars %>% head(). When processing a data object takes multiple steps of function composition, chaining them with %>% makes the code elegant and simple to read.

This simple invention lets a data analyst avoid nested parentheses across multiple function calls. The code becomes more intuitive, easier to maintain and debug, and easier for third parties to read. This design advantage translates into savings of time and energy and improved productivity.
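To make the contrast concrete, here is a minimal sketch using the built-in cars dataset. The filtering and sorting steps are illustrative choices of ours rather than part of any particular analysis, and the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Nested form: the steps must be read from the inside out.
nested <- head(arrange(filter(cars, speed > 10), desc(dist)), 3)

# Piped form: the same steps, read from top to bottom.
piped <- cars %>%
  filter(speed > 10) %>%   # keep rows with speed above 10
  arrange(desc(dist)) %>%  # sort by stopping distance, largest first
  head(3)                  # keep the first three rows

# Both forms produce the same data frame.
identical(nested, piped)
```

Each piped line corresponds to one step of the analysis, so adding, removing, or reordering a step means editing a single line rather than rebalancing parentheses.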

Convenience in Statistical Analysis 

As you probably know, R is also widely used in statistics-related fields. Its convenience for statistical testing and modeling continues to draw researchers. When researchers develop new ideas, they often create new R packages that circulate in the R community. Over the decades, this has formed a very robust ecosystem: when we encounter a data analysis issue, the problem has often already been thoroughly investigated by forerunners in the field. For typical users, this means a significant reduction in development time and labor, allowing data analysts to bypass technical issues and focus on high-level insights.
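As a rough illustration of this convenience, the sketch below runs a two-sample t-test and fits a linear regression using only base R and the built-in mtcars dataset. The choice of dataset and variables is ours, purely for demonstration.

```r
# Welch two-sample t-test: does fuel economy (mpg) differ between
# automatic (am = 0) and manual (am = 1) cars?
test <- t.test(mpg ~ am, data = mtcars)
print(test)

# Linear regression of fuel economy on weight and horsepower,
# written with the same formula interface.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(fit))
```

The same formula notation (response ~ predictors) is shared by many modeling functions in R, which is part of why moving between tests and models feels uniform.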

Graphs Made to Talk 

You have probably heard of ggplot2, one of the best-known R data visualization packages. Not only does the package produce excellent graphs, it is also devoted to creating graphs that lead to better insights.


Beyond that, graphs can be fully customized, including backgrounds and themes, labels and legends, and other aesthetics, based on a unique grammar of graphics that allows maximal flexibility.
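The layered grammar can be sketched as follows, assuming the ggplot2 package is installed. The dataset (mtcars), the variable mappings, and the label text are our own illustrative choices.

```r
library(ggplot2)

# Build the plot layer by layer: data and aesthetic mappings first,
# then geometries, then labels and a theme.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                     # one point per car
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # linear trend per group
  labs(
    title  = "Fuel economy versus weight",
    x      = "Weight (1000 lbs)",
    y      = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal()                                            # swap the background and theme

print(p)  # draw the plot on the active graphics device
```

Because each component is added with +, changing the theme or adding a layer does not require rewriting the rest of the plot.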

Interactive Web Application 

In the life cycle of data analysis, the insights gained often need to be conveyed to non-technical people, whether corporate leaders or the general public. It is vital to display high-level insights visually so that users with no data science background can interpret them. This task falls on the shoulders of web applications.

Shiny is more than an R package for building interactive web pages: it lets a typical analyst build web applications as a web developer does, but with no prior knowledge of lower-level languages such as JavaScript. A Shiny app lets the designer wire up clickable buttons, drop-down menus, sliders, and other controls on fully customized pages. Once the app skeleton is built in Shiny, the data analysis and visualization can be delegated seamlessly to base R, ggplot2, leaflet, googleVis, maps, and other packages. For more details and interesting examples, take a look at the Shiny gallery.
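A minimal sketch of such an app skeleton is shown below, assuming the shiny package is installed. The input and output names ("bins", "hist"), the labels, and the use of the built-in faithful dataset are our own choices for illustration.

```r
library(shiny)

# User interface: a slider and a placeholder for a plot.
ui <- fluidPage(
  titlePanel("Old Faithful waiting times"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting,
         breaks = input$bins,  # suggested number of bins from the slider
         main   = "Waiting time between eruptions",
         xlab   = "Minutes")
  })
}

# Combine the two parts into an app object.
app <- shinyApp(ui = ui, server = server)

# To launch it in a browser from an interactive session:
# runApp(app)
```

Note that no HTML or JavaScript is written by hand; the plotting itself is done by base R inside renderPlot().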

The Advantage of R Machine Learning over Python 

While Python offers a centralized, general-purpose machine learning package in scikit-learn, its ecosystem is much less diversified than R's. scikit-learn's primary focus on predictive tasks tends to downplay other important aspects of machine learning, such as inference. For example, the principle behind analyzing customer retention in marketing and sales also applies in many other industries, where it is known as a hazard model (insurance), survival model (healthcare), product reliability model (manufacturing), or credit default model (lending). In these use cases, both predictive accuracy and the ability to interpret the model and understand its inner workings are important. When we look for data analysis and machine learning resources on such an important topic, we find over 50 packages in the R ecosystem devoted to survival analysis, while the corresponding resources in Python are minimal.

This suggests that, when applying machine learning techniques to a specific domain, a data scientist fluent only in Python will probably need to start from scratch with scikit-learn. An R data scientist, on the other hand, can draw on multiple packages backed by decades of prior research, with algorithms tailored to these use cases.
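As one example of this kind of tailored support, the sketch below uses the survival package, one of many R packages for survival analysis, with its bundled lung dataset. The choice of covariates (age and sex) is ours and only illustrative.

```r
library(survival)

# Kaplan-Meier survival curves, estimated separately by sex.
km <- survfit(Surv(time, status) ~ sex, data = lung)
print(km)

# Cox proportional hazards model: an interpretable model of how
# age and sex relate to the hazard of the event.
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
print(summary(cox))

# Exponentiated coefficients are hazard ratios, which support
# inference as well as prediction.
exp(coef(cox))
```

The model summary reports coefficient estimates, standard errors, and significance tests, which is the kind of inferential output the paragraph above contrasts with purely predictive tooling.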

Interactive Documents–R Markdown 

Last but not least, the R Markdown document format helps non-technical users as well. Good thoughts don't last forever, so an .Rmd file takes over the formatting, which helps writers focus on the content. For scientific researchers and practitioners, an .Rmd file also supports LaTeX, which makes typing scientific formulas a pleasant experience. Easily inserted links and pictures, together with the flexibility of knitting to different document types, support theses and dissertations, news reports, and other styles of writing and presentation. Believe it or not, the PDF version of this write-up was generated from an R Markdown file.

While code readability and its GitHub interface add value to R, it was never designed as a speed- or performance-centric language. As a result, R is not as robust as Python in handling large datasets or resource-intensive tasks. A data scientist bilingual in Python and R can take advantage of the unique strengths of each language and its packages to support different use cases in their business.

All in all, the choice of R depends on where your objectives fall between statistical analysis and deployment, and on how much time you can invest. A smart data scientist who knows both R and Python can take advantage of what each language offers and combine their strengths efficiently.

About Authors

Aiko Liu

Aiko was born and raised in Taiwan. After graduating from college, he came to the U.S. and earned his Ph.D. at Harvard University, specializing in geometry. After having done research at several top research universities for years, he switched gear...

Hanqing Zhang

R Data Analysis Instructor at NYC Data Science Academy; Master's in Statistics, Indiana University Bloomington; Master's in Education, Purdue University

Pranjali Galgali

Pranjali Galgali is a Marketing and Communications Associate at NYC Data Science Academy. She holds a Master's in Digital Media and Strategic Communications from Rutgers University. She enjoys reading and writing about data science, upcoming technologies and loves interviewing...
