# Data Science Bootcamp Pre-Work

# Data Science

*Andrew Nichols, NYC Data Science Academy*

*July 9, 2015*

- Where to Start
- Setup
- Overall Goals
- Command Line
- Git and Github
- Foundational Statistics
- R
- Python
- Machine Learning
- Proprietary Platforms

# Where to Start

Data Science is built around 3 concepts: programming, statistics, and domain expertise. In preparing this prework guide, we focused on establishing a strong programming foundation that can later be enriched with statistical learning methods so you can apply them to your own domain expertise.

The focus of this guide is on R and Python. R is a statistical programming language developed by statisticians, and Python is a more general programming language used in a variety of disciplines for solving a wide range of problems. Our curriculum focuses on leveraging both languages so their strengths can balance out each other's weaknesses and also prepare you to work as a data scientist in either environment.

We believe in the power of free and open-source content. The tech and data analytics communities are both moving towards a more democratic access to technology and learning. The bulk of this content will focus on these technologies with a short section at the end dedicated to proprietary products.

Depending on your experience, you can determine which sections will require more of your time. We have placed time guides to help you understand how much emphasis and time you should spend on each section. For bootcamp students, we want you to at least have read *An Introduction to Statistical Learning* and have had some exposure to the command line, git, and foundational knowledge in stats, Python, and R.

# Setup

You should install R and Python on your computer.

In addition to R, you should also install RStudio, an integrated development environment (IDE) that makes programming in R faster and easier.

We will be using Python 2 since it will take several years for Python 3 to fully replace it. Python 2 is still used, and will continue to be used, across the data science industry until enough libraries have been updated and the new ecosystem has proven stable. The Anaconda distribution of Python contains most of the libraries you will need to get started.

# Overall Goals

- Understand what data science is, develop the appropriate vocabulary, and learn where to get help.
- Learn about version control and why it's needed.
- Gain knowledge about storing and accessing your data in databases.
- Develop a working knowledge of R and Python.
- Develop a working knowledge of statistics and machine learning.

# Command Line

`2 to 4 hours`

### Goals:

- Navigate filepaths.
- Basic commands for using your terminal.
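To connect these goals to the Python you will learn later, here is a minimal sketch that mirrors a few common terminal commands (`pwd`, `cd`, `mkdir`, `touch`, `ls`) using Python's standard `os` module. The function name `explore` and the file and folder names are made up for illustration, and everything happens inside a temporary directory that is deleted afterwards.

```python
from __future__ import print_function

import os
import shutil
import tempfile


def explore(start_dir):
    """Mirror a few terminal commands using Python's os module."""
    original = os.getcwd()                  # like `pwd`
    os.chdir(start_dir)                     # like `cd start_dir`
    os.mkdir("projects")                    # like `mkdir projects`
    open("notes.txt", "w").close()          # like `touch notes.txt`
    listing = sorted(os.listdir("."))       # like `ls`
    os.chdir("projects")                    # like `cd projects`
    # `..` always means "one level up" from the current directory
    parent = os.path.basename(os.path.abspath(".."))
    os.chdir(original)                      # go back to where we started
    return listing, parent


scratch = tempfile.mkdtemp()                # throwaway directory for the demo
try:
    listing, parent = explore(scratch)
    print(listing)
finally:
    shutil.rmtree(scratch)                  # clean up, like `rm -r`
```

The terminal versions of these commands are shorter to type, which is one reason the command line is worth learning alongside a programming language.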

(The Command Line Crash Course)[http://cli.

# Git and Github

`2 to 4 hours`

Version control is a system that allows you to track changes to files and recall their previous versions. In the data science and tech communities, Git and Github have become the industry-wide standard. In many ways, Github accounts have become equivalent to technical resumes, especially for those who are trying to break into the industry.

### Goals:

- Understand what version control is.
- Understand what git and github are.
- Install git and open an account on github.
- Work with git: cloning, committing, etc.

#### Git Immersion - Full length tutorial

#### Git - The Simple Guide - One page tutorial

# Foundational Statistics

`5 to 8 hours`

### Goals:

- Understand foundational statistical concepts.

*Make sure you understand these basic ideas:*

Population / Sample

Distributions

Discrete / Continuous

Distribution Functions

Null Hypothesis / Alternative Hypothesis / P-value

Mean / Variance / Skewness / Kurtosis / Percentile / Quantile

T-Test / F-Test / Chi-Square Test / ANOVA / Normality Test
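As a rough illustration of a few of these ideas, the sketch below computes a sample mean, a sample variance, and a one-sample t statistic by hand. The data values and the function names are invented for this example, and the code uses only the standard library so that it should run on Python 2.7 or Python 3. It stops at the t statistic; turning that into a p-value requires the t distribution, which a statistics library would normally provide.

```python
from __future__ import division, print_function

import math


def mean(values):
    """Arithmetic mean of a sample."""
    return sum(values) / len(values)


def sample_variance(values):
    """Unbiased sample variance (divides by n - 1, not n)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


def one_sample_t(values, mu0):
    """t statistic for the null hypothesis that the population mean is mu0."""
    standard_error = math.sqrt(sample_variance(values) / len(values))
    return (mean(values) - mu0) / standard_error


# Made-up sample drawn from some larger (unobserved) population
sample = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", mean(sample))
print("variance:", sample_variance(sample))
print("t statistic vs. mu0 = 4:", one_sample_t(sample, 4))
```

A t statistic close to zero means the sample mean sits near the hypothesized value; larger absolute values are stronger evidence against the null hypothesis.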

*If you have more time:*

# R

`20 to 40 hours`

### Goals:

- Learn the basics of R Syntax.

#### Swirl - Learn R in the console

#### Codeschool: TryR - Online Interactive Learning

__Book Recommendations__

- R in a Nutshell
- The Art of R Programming

# Python

`20 to 40 hours`

### Goals:

- Learn the basics of Python Syntax.

#### Learn Python the Hard Way

We recommend this book above all others. It is available online for free, or you can buy a copy.

# Machine Learning

`As much time as you have, after finishing the prior sections.`

### Goals:

- Understand the basics of machine learning.
- Read *Introduction to Statistical Learning*.

### Basic Ideas:

Prediction / Inference

Parametric / Nonparametric

Supervised / Unsupervised

Linear Regression / Logistic Regression

Regression / Classification
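To make a few of these terms concrete, here is a minimal sketch of simple linear regression fit by ordinary least squares. It is supervised (we have known outcomes), parametric (the model is a slope and an intercept), and used here for prediction. The data and the names `fit_line` and `predict` are invented for illustration; in practice you would typically use a library rather than write this by hand.

```python
from __future__ import division, print_function


def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Predict y for a new x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return intercept + slope * x


# Made-up training data: hours studied and the resulting test score
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

model = fit_line(hours, scores)
print("slope, intercept:", model)
print("predicted score for 6 hours:", predict(model, 6))
```

Examining the fitted slope to understand how the score changes with each extra hour would be inference; using the model to guess the score for an unseen value is prediction.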

#### Introduction to Statistical Learning - The best book for learning the foundation for more advanced machine learning. Available for free online or you can buy the book.

# Proprietary Platforms

`Optional`

This section is not required. It is meant solely to give you an idea of the proprietary platforms that exist, in case you read about them or hear them mentioned in conversation.

- Tableau

Tableau is a platform that removes the technical difficulty of visualizing data and is designed for non-technical users so they can perform data analysis. It integrates with R and Python and is not designed to be a replacement for them, in particular because of its limitations in cleaning data, manipulating data, and modeling. Tableau shines when you need to quickly make beautiful visualizations. In some ways, it is akin to a video editor using iMovie. It definitely has its place, but it might not be for everyone. The best way to know is to try it.

- SAS and SPSS

SAS and SPSS are both analytics software packages that are on their way out for various reasons. They are powerful tools, but they are facing difficulty in the market due to their cost and, in turn, the difficulty of getting access for learning purposes. Furthermore, SAS and SPSS are not open-source, which limits growth since users cannot build upon and improve the software.

- CartoDB

CartoDB allows you to quickly make beautiful maps with your data. Like Tableau, this platform abstracts away the technical difficulty. There is a free version of the software, which is good for the average user, especially if you just need a quick map for a presentation or for exploratory analysis.