Big Data with Amazon Cloud, Hadoop/Spark and Docker

Course Overview

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers the key components of that ecosystem: HDFS, MapReduce with streaming, Hive, and Spark, with all programming done in Python. It begins with a review of the Python concepts needed for the examples. The course format is interactive, and students will need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions will be provided ahead of time on how to connect to AWS and obtain an account.

* Tuition paid for part-time courses can be applied to the Data Science Bootcamps if admitted within 9 months.

January Session: Jan 14 - Feb 20, 2020, 7:00-9:30pm
Tuition: $2,990.00 (early-bird pricing: $2,840.50)

April Session: Apr 21 - May 28, 2020, 7:00-9:30pm
Tuition: $2,990.00 (early-bird pricing: $2,840.50)

Date and Time

January Session Early-bird Pricing!

Jan 14 - Feb 20, 2020, 7:00-9:30pm
Day 1: January 14, 2020
Day 2: January 16, 2020
Day 3: January 21, 2020
Day 4: January 23, 2020
Day 5: January 28, 2020
Day 6: January 30, 2020
Day 7: February 4, 2020
Day 8: February 6, 2020
Day 9: February 11, 2020
Day 10: February 13, 2020
Day 11: February 18, 2020
Day 12: February 20, 2020
$2,990.00 (early-bird pricing: $2,840.50). Register before Dec 15th to take advantage of this price!

April Session Early-bird Pricing!

Apr 21 - May 28, 2020, 7:00-9:30pm
Day 1: April 21, 2020
Day 2: April 23, 2020
Day 3: April 28, 2020
Day 4: April 30, 2020
Day 5: May 5, 2020
Day 6: May 7, 2020
Day 7: May 12, 2020
Day 8: May 14, 2020
Day 9: May 19, 2020
Day 10: May 21, 2020
Day 11: May 26, 2020
Day 12: May 28, 2020
$2,990.00 (early-bird pricing: $2,840.50). Register before Mar 22nd to take advantage of this price!

Instructors

Jake Bialer
Jake Bialer is a full-stack developer and data scientist who has spent the last decade immersed in data problems at online media organizations, e-commerce sites, and other web businesses. He currently runs his own consultancy, Bialerology, and teaches web scraping and big data engineering at the NYC Data Science Academy.

Product Description


Overview

This 6-week program provides a hands-on introduction to Apache Hadoop and Spark programming using Python and cloud computing. The key components covered include the Hadoop Distributed File System (HDFS), MapReduce using MRJob, Apache Hive, Pig, and Spark. Tools and platforms used include Docker, Amazon Web Services (AWS), and Databricks. In the first half of the program, students pull a pre-built Docker image and run most of the exercises locally in Docker containers; in the second half, they access their AWS and Databricks accounts to run cloud computing exercises. Students will need to bring their laptops to class. Detailed instructions will be provided ahead of time on how to pull and run a Docker image, how to connect to AWS and Databricks, and so on.

Details

What is Hadoop?

Hadoop is a set of open-source programs running on computer clusters that simplifies the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has expanded in many ways: it now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In short, Hadoop is a popular and rapidly growing set of cluster computing solutions that is becoming an essential tool for data scientists.
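
To make the MapReduce paradigm concrete, here is the classic word-count example in the Hadoop Streaming style that the course overview mentions: a mapper and a reducer written as plain Python scripts reading from standard input. This is a minimal sketch rather than course material, and the file names mapper.py and reducer.py are hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word on every input line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each word. Hadoop Streaming sorts
# the mapper output by key, so equal words arrive in one contiguous run.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

The same pair of scripts can be smoke-tested locally without a cluster: cat input.txt | python mapper.py | sort | python reducer.py.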

Prerequisites

To get the most out of the class, you should be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
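
For concreteness, the kind of map() usage the prerequisites refer to looks like this (a minimal illustration with made-up data):

```python
# Split each comma-separated record into a list of fields,
# turning a flat list of strings into a nested list.
rows = ["alice,30,NYC", "bob,25,SF"]
nested = list(map(lambda s: s.split(","), rows))
print(nested)  # [['alice', '30', 'NYC'], ['bob', '25', 'SF']]
```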

Certificate

Certificates are awarded at the end of the program upon satisfactory completion of the course.

Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete 80% of the homework and attend a minimum of 85% of all classes are eligible for the certificate of completion.

Syllabus

Unit 1: Introduction to Hadoop

1. Data Engineering Toolkits
  • Running Linux using Docker containers
  • Linux CLI commands and bash scripts
  • Python basics
2. Hadoop and MapReduce
  • Big Data Overview
  • HDFS
  • YARN
  • MapReduce

Unit 2: MapReduce

3. MapReduce using MRJob 1
  • Protocols for Input & Output
  • Filtering
4. MapReduce using MRJob 2 (see the sketch after this unit)
  • Top N
  • Inverted Index
  • Multi-step Jobs
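
As a taste of Unit 2, the sketch below combines the top-N and multi-step topics in a single MRJob class. It is a minimal sketch assuming the mrjob package is installed (pip install mrjob), not the course's own solution; the cutoff N = 10 is arbitrary.

```python
# top_words.py -- a two-step MRJob: count words, then keep the top N.
from mrjob.job import MRJob
from mrjob.step import MRStep


class MRTopWords(MRJob):
    N = 10  # arbitrary cutoff for illustration

    def steps(self):
        return [
            MRStep(mapper=self.mapper_words, reducer=self.reducer_count),
            MRStep(reducer=self.reducer_top_n),
        ]

    def mapper_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count(self, word, counts):
        # Funnel every (count, word) pair to a single key so the
        # final reducer sees the whole list and can rank it.
        yield None, (sum(counts), word)

    def reducer_top_n(self, _, count_word_pairs):
        for count, word in sorted(count_word_pairs, reverse=True)[:self.N]:
            yield word, count


if __name__ == "__main__":
    MRTopWords.run()
```

Run it locally with python top_words.py input.txt; mrjob's -r hadoop and -r emr options submit the same class to a Hadoop cluster or Amazon EMR.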

Unit 3: Apache Hive

5. Apache Hive 1 (see the query sketch after this unit)
  • Databases for Big Data
  • HiveQL and Querying Data
  • Windowing and Analytics Functions
  • MapReduce Scripts
6. Apache Hive 2
  • Tables in Hive
  • Managed Tables and External Tables
  • Storage Formats
  • Partitions and Buckets
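
As a taste of the HiveQL topics above, the sketch below runs a windowing query that ranks rows within groups. It assumes a hypothetical employees table and reaches HiveServer2 on localhost:10000 through the PyHive package; the course may use the Hive CLI or Beeline instead, so treat the connection details as assumptions.

```python
# Query Hive from Python via PyHive (pip install pyhive).
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# A windowing/analytics query: rank employees by salary within
# each department (the employees table is hypothetical).
cursor.execute("""
    SELECT name, dept, salary,
           RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS rnk
    FROM employees
""")
for row in cursor.fetchall():
    print(row)
```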

Unit 4: Apache Pig

7. Apache Pig 1 (see the sketch after this unit)
  • Overview
  • Pig Latin: Data Types
  • Pig Latin: Relational Operators
8. Apache Pig 2
  • More Pig Latin: Relational Operators
  • More Pig Latin: Functions
  • Compiling Pig to MapReduce
  • The Parallel Clause
  • Join Optimizations
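
To give the flavor of Pig Latin, the sketch below writes a word-count script and runs it in Pig's local mode from Python, keeping one language across these examples. It assumes the pig executable is on the PATH; the file and directory names are hypothetical, and the output directory must not already exist.

```python
# Drive a Pig Latin word count from Python via the pig CLI.
import subprocess

PIG_SCRIPT = """
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'word_counts';  -- output dir must not exist yet
"""

with open("wordcount.pig", "w") as f:
    f.write(PIG_SCRIPT)

# -x local runs Pig against the local filesystem instead of a cluster.
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```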

Unit 5: Apache Spark and AWS

9. Apache Spark – Spark Core (see the sketches after this unit)
  • Spark Overview
  • Running Spark using Databricks Notebooks
  • Working with PySpark: RDDs
  • Transformations and Actions
10. Apache Spark – Spark SQL
  • Spark DataFrame
  • SQL Operations using Spark SQL
11. Apache Spark – Spark ML
  • ML Pipeline using PySpark
12. Amazon Elastic MapReduce
  • Overview
  • Amazon Web Services: IAM, EC2, S3
  • Creating an EMR Cluster
  • Submitting Jobs
  • Intro to AWS CLI
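
The sketch below ties together the Spark topics in this unit: RDD transformations and actions, DataFrames, Spark SQL, and a small ML pipeline. It is a minimal sketch with made-up data and column names; on Databricks the spark session is predefined, so the builder line can be dropped there.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("unit5-demo").getOrCreate()

# RDDs: transformations are lazy; collect() is an action.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).filter(lambda x: x > 4).collect())  # [9, 16]

# DataFrames and Spark SQL over the same data.
df = spark.createDataFrame([(1.0, 2.1), (2.0, 4.0), (3.0, 6.2)], ["x", "y"])
df.createOrReplaceTempView("points")
spark.sql("SELECT x, y FROM points WHERE x > 1").show()

# A two-stage ML pipeline: assemble a feature vector, fit a regression.
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["x"], outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="y"),
])
pipeline.fit(df).transform(df).select("x", "y", "prediction").show()
```

For the EMR topics, a hedged sketch of launching a cluster with boto3 (the AWS SDK for Python) follows; the release label, instance types, roles, and S3 bucket are placeholders, and a real run incurs AWS charges.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder
response = emr.run_job_flow(
    Name="class-demo",
    ReleaseLabel="emr-5.29.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    LogUri="s3://my-log-bucket/emr/",   # placeholder bucket
)
print(response["JobFlowId"])  # cluster id, e.g. j-XXXXXXXX
```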

Project: Data Engineering Project


Testimonials

Sebastian Nordgren
Senior Vice President at Citi
I attended the Big Data with Amazon Cloud, Hadoop/Spark and Docker course, hosted and led by NYC Data Science Academy. My objective was two-fold: first, to gain a deeper and practical understanding of emerging 'Big Data' technologies, more so than what academic publications or industry white papers currently provide; and, second, to familiarize myself with the skill set and experience to expect from the new generation of statisticians, or Data Scientists. With a background in Business Intelligence, Architecture, Risk Management and Governance on Wall Street, I find that foundational skills remain the same: mathematics and statistics. However, with the commoditization of data storage and massively parallel computing, Data Scientists today are capable of solving problems reserved for an exclusive few in decades past. The course did not cover configuration of the Hadoop environment, but thanks to the engaging and knowledgeable instructor, clues on challenges and potential pitfalls were generously shared. I highly recommend this course not only to professionals or recent graduates looking to hone data analysis skills, but to anyone with an interest or stake in Big Data.
