Why Is R a Must-Learn for Data Scientists?
The skills demonstrated here can be learned in the Data Science with Machine Learning bootcamp at NYC Data Science Academy.
Although Python keeps gaining popularity and can handle tasks in engineering, data wrangling, and more, R's close tie with statistics and statistical machine learning still makes it an important language in the data science field, not to mention that R remains a desired skill for many hiring companies. Here are several top reasons to learn R for data science:
A Language Designed for Data Analysis
Unlike Python, which was a general-purpose high-level programming language before it became a data analysis tool, R was designed for data analysis from the very beginning. Where Python assumes that the coder has been trained as a programmer, R's designers recognized that in many data-related tasks, gaining valuable insights matters more than following strict coding practices. This shows in how intuitive R is to learn. An analyst or data scientist does not need to master programming like a data engineer before deriving insights that affect the business or research.
As of now, there are 15000+ open-source, data-related R packages on CRAN (https://cran.r-project.org/web/packages/), the major R code repository. R users all over the world are free to install the packages and modify the code base for research and industrial use. These packages are developed or donated by top research universities and private companies. Thus, when R users download a new package into their RStudio IDE, they are implicitly supported by the collective wisdom of the whole R community.
The well-known Python data analysis package pandas was modeled on R's data frame construct. Moreover, many statistics-related packages migrate from R to Python once a technique gains popularity.
Ease of Processing
R tries its best to close the gap between human thought and computational code. The convenience of the data analyst sits at the top of its priority list, which is hard to match in a general-purpose language like Python or JavaScript. This design philosophy matters a great deal to those whose main interest is the data insight rather than the engineering. The popular R package dplyr provides a pipe operator %>% (re-exported from the magrittr package), which passes the object on its left as the first argument of the function on its right. For example, to preview a dataset, instead of head(cars) we can write cars %>% head(). When processing a data object takes multiple steps of function composition, chaining them with consecutive %>% operators makes the code elegant and simple to read.
This simple invention lets a data analyst avoid the nested parentheses of multiple function calls. The code becomes more intuitive, easier to maintain and debug, and easier for third parties to read. This design advantage translates into savings of time and energy and into higher productivity.
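To make the contrast concrete, here is a minimal sketch that runs the same three steps twice, once nested and once piped. The built-in cars dataset comes from the text above; the particular steps (a filter on speed, a sort on dist, and a three-row preview) are only assumptions chosen for illustration.

```r
library(dplyr)  # re-exports the %>% pipe from magrittr

# Nested form: has to be read from the innermost call outwards.
nested <- head(arrange(filter(cars, speed > 10), desc(dist)), 3)

# Piped form: reads top to bottom, one step per line.
piped <- cars %>%
  filter(speed > 10) %>%   # keep rows with speed above 10
  arrange(desc(dist)) %>%  # sort by stopping distance, largest first
  head(3)                  # preview the first three rows

# Both forms compute the same data frame.
print(identical(nested, piped))
print(piped)
```

Each new step in the piped version is one more line at the end of the chain, whereas the nested version needs another layer of parentheses wrapped around everything.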
Convenience in Statistical Analysis
As you probably know, R is also widely used in statistics-related fields. Its convenience for statistical testing and modeling continues to draw researchers to it. When researchers develop new ideas, they often publish them as new R packages that circulate in the R community. Over decades, this has formed a very robust ecosystem: when we run into a data analysis problem, it has often already been thoroughly investigated by forerunners in the field. For typical users, this means a significant reduction in development time and labor, so data analysts can bypass technical issues and focus on high-level insights.
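As a small illustration of that convenience, the sketch below runs a hypothesis test and fits a regression model using only the stats package that ships with R. The built-in mtcars dataset and the chosen variables are assumptions for illustration, not part of the original discussion.

```r
# Welch two-sample t-test: does fuel economy (mpg) differ between
# automatic (am = 0) and manual (am = 1) cars?
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Linear regression of mpg on weight and horsepower, using the same
# formula interface as the test above.
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(summary(fit))  # coefficients, standard errors, R-squared

# Confidence intervals for the coefficients take one more call.
print(confint(fit))
```

The formula notation (response ~ predictors) is shared by most modeling functions in R, so moving from a test to a model changes very little code.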
Graphs Made to Talk
You have probably heard of ggplot2, one of the best-known R data visualization packages. Not only does the package produce excellent graphs, it is also devoted to creating graphs that give better insight into the data.
Beyond that, graphs can be fully customized in their backgrounds and themes, labels and legends, and other aesthetics, thanks to a unique graphical grammar that allows maximal flexibility.
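The layered grammar can be sketched as follows: data and aesthetic mappings come first, then geometric layers, then labels and theme adjustments. The mtcars dataset, the variables, and the styling choices are all assumptions made for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                    # layer 1: scatter plot
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) + # layer 2: one fitted line per group
  labs(
    title = "Fuel economy versus weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    colour = "Cylinders"
  ) +
  theme_minimal() +                                         # swap the whole background theme
  theme(legend.position = "bottom")                         # then adjust individual elements

# Draw the plot when running in an interactive session.
if (interactive()) print(p)
```

Because every piece is added with +, a layer, label, or theme can be changed without touching the rest of the plot definition.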
Interactive Web Application
In the life cycle of a data analysis, the insights gained often need to be conveyed to non-technical people, whether corporate leaders or the general public. It is vital to display the high-level insights visually so that typical users with no data science background can interpret them. This task falls to web applications.
Shiny is more than an R package that supports building interactive web pages: it lets a typical analyst build web applications the way a web developer does, but with zero prior knowledge of lower-level languages like JavaScript. A Shiny app lets the app designer hook up clickable buttons, drop-down menus, sliders, and so on, placed on fully customized pages. Once the app skeleton is built in Shiny, the data analysis and data visualization tasks can be delegated seamlessly to base R, ggplot2, leaflet, googleVis, maps, and other packages. For more details and interesting Shiny app examples, please take a look at the Shiny gallery.
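A minimal app skeleton might look like the sketch below: a slider and a drop-down menu on the page, wired to a plot drawn with base R. The built-in faithful dataset and the input names "bins" and "var" are hypothetical choices for illustration.

```r
library(shiny)

# UI: a slider and a drop-down menu in a sidebar, with a plot in the main panel.
ui <- fluidPage(
  titlePanel("Old Faithful explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
      selectInput("var", "Variable:", choices = c("eruptions", "waiting"))
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: redraws the histogram whenever either input changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(
      faithful[[input$var]],
      breaks = input$bins,  # treated by hist() as a suggested number of bins
      main = paste("Histogram of", input$var),
      xlab = input$var
    )
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only in an interactive session.
if (interactive()) runApp(app)
```

No HTML or JavaScript is written by hand here; Shiny generates the page from the ui object and keeps the plot in sync with the inputs.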
The Advantage of R Machine Learning over Python
While Python offers a centralized, general-purpose machine learning package in scikit-learn, its ecosystem is much less diversified than R's. The scikit-learn package focuses primarily on predictive tasks, which tends to downplay other important aspects of machine learning such as inferential tasks. For example, the scientific principle behind analyzing customer retention in marketing and sales also works in many other industries, where it is known as a hazard model (insurance), survival model (healthcare), product reliability model (manufacturing), credit default model (lending), and so on. In these use cases, predictive accuracy matters, but so do the ability to interpret the models and an understanding of their internal workings. When we look for data analysis and machine learning resources on such an important topic, we find over 50 packages in the R ecosystem devoted to survival analysis, while the corresponding resources in Python are far more limited.
This suggests that a data scientist fluent only in Python who wants to apply machine learning techniques to a specific domain will probably need to start from scratch with scikit-learn. An R data scientist, on the other hand, enjoys the implicit support of multiple packages, backed by decades of prior research, with algorithms tailored to these use cases.
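To give a flavor of this, here is a minimal sketch using the survival package, which is distributed with standard R installations. The lung dataset bundled with that package and the covariates age and sex are assumptions chosen for illustration.

```r
library(survival)

# In the lung dataset, status is coded 1 = censored, 2 = dead.
# Kaplan-Meier survival curves, one per sex.
km <- survfit(Surv(time, status) ~ sex, data = lung)
print(km)

# Cox proportional hazards model: an interpretable, inferential model
# of how age and sex relate to the hazard of death.
cox <- coxph(Surv(time, status) ~ age + sex, data = lung)
print(summary(cox))

# Hazard ratios: values below 1 indicate a lower hazard.
print(exp(coef(cox)))
```

The summary reports coefficients, hazard ratios, and significance tests together, which is the interpretability these use cases call for alongside prediction.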
Interactive Documents–R Markdown
Last but not least, the R Markdown documentation format helps non-technical users as well. An .rmd file takes over the formatting task, which lets writers focus on the content. For scientific researchers and practitioners, an .rmd file also supports LaTeX, which makes typing scientific formulas a pleasant journey. Easily inserted links and pictures, along with the flexibility to knit different types of documents, serve theses and dissertations, news reports, and other styles of writing and presentation. Believe it or not, the PDF of this write-up was generated from an R Markdown file.
While code readability and the GitHub interface add value to R, R was never designed as a speed- or performance-centric language. As a result, R is not as robust as Python at handling large datasets or resource-intensive tasks. A data scientist who is bilingual in Python and R can use the unique edge of each language, and of its packages, to support the different use cases their business needs.
All in all, the choice of R depends on where your objectives fall between statistical analysis and deployment, and on how much time you can invest. A smart data scientist who knows both R and Python takes advantage of what each language offers and combines their strengths efficiently.