Intermediate
Big Data with Amazon Cloud, Hadoop/Spark and Docker

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course will cover these key components of Apache Hadoop: HDFS, MapReduce with streaming, Hive, and Spark. Programming will be done in Python. The course will begin with a review of Python concepts needed for our examples. The course format is interactive. Students will need to bring laptops to class.

* Tuition paid for part-time courses can be applied to the Data Science Bootcamps if you are admitted within 9 months.
Following the state's COVID-19 reopening, all our courses can be taken either in person or remotely (live online). Please indicate your preference by emailing [email protected] after registering for the class.

Course Dates

Earlybird ends on 12/12
January Session

Jan 11 - Feb 17, 2022
Tuesday, Thursday
7:00-9:30pm

$2990.00
$2840.50
Earlybird ends on 02/20
March Session

Mar 22 - Apr 28, 2022
Tuesday, Thursday
7:00-9:30pm

$2990.00
$2840.50
Earlybird ends on 05/01
May Session

May 31 - Jul 7, 2022
Tuesday, Thursday
7:00-9:30pm

$2990.00
$2840.50

Product Description

Course Overview

This 6-week program provides a hands-on introduction to Apache Hadoop and Spark programming using Python and cloud computing. The key components covered include the Hadoop Distributed File System (HDFS), MapReduce using MRJob, Apache Hive, Apache Pig, and Spark. The tools and platforms used include Docker, Amazon Web Services (AWS), and Databricks. In the first half of the program, students pull a pre-built Docker image and run most of the exercises locally in Docker containers. In the second half, students use their AWS and Databricks accounts to run cloud computing exercises. Students will need to bring their laptops to class.

Prerequisites

To get the most out of the class, you need to be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, knowing how to use the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
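
As a rough self-check for the map() prerequisite mentioned above, the following minimal sketch splits a list of strings into a nested list. The sample strings and variable names are made up for illustration and are not taken from the course materials.

```python
# Hypothetical sample data: each string is a line of space-separated words.
lines = ["cd ls cp", "map reduce", "hive pig spark"]

# map() applies str.split to every string; list() materializes the result.
nested = list(map(str.split, lines))

print(nested)
# [['cd', 'ls', 'cp'], ['map', 'reduce'], ['hive', 'pig', 'spark']]
```

If this reads naturally to you, the functional style used in the MapReduce and Spark exercises should feel familiar.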

Certificate

Certificates are awarded at the end of the program upon satisfactory completion of the course. Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete at least 80% of the homework and attend at least 85% of all classes are eligible for the certificate of completion.

Certificate of Completion

Demo Lecture

MapReduce using MRJob
Module: MapReduce
Instructor: Jake Bialer
Description: NYC Data Science Academy instructor Jake Bialer walks through a lecture on MapReduce examples.

Syllabus

Unit 1 – Introduction to Hadoop

  • 1. Data Engineering Toolkits
    • Running Linux using Docker containers
    • Linux CLI commands and Bash scripts
    • Python basics
  • 2. Hadoop and MapReduce
    • Big Data Overview
    • HDFS
    • YARN
    • MapReduce

Unit 2 – MapReduce

  • 3. MapReduce using MRJob 1
    • Protocols for Input & Output
    • Filtering
  • 4. MapReduce using MRJob 2
    • Top n
    • Inverted Index
    • Multi-step Jobs
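
To give a feel for the kind of problem listed in this unit, here is a minimal sketch of an inverted index written as separate map, shuffle, and reduce steps. It is plain Python, not MRJob code and not the course's own solution; the function names (mapper, reducer, build_inverted_index) and the sample documents are assumptions made for illustration only.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word."""
    for word in set(text.lower().split()):
        yield word, doc_id


def reducer(word, doc_ids):
    """Reduce step: collect the sorted documents that contain the word."""
    return word, sorted(doc_ids)


def build_inverted_index(docs):
    # Shuffle step: group the mapper output by key (the word).
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in mapper(doc_id, text):
            grouped[word].append(emitted_id)
    # Run the reducer once per key.
    return dict(reducer(word, ids) for word, ids in grouped.items())


# Hypothetical sample documents.
docs = {
    "a.txt": "hadoop stores data",
    "b.txt": "spark reads hadoop data",
}
index = build_inverted_index(docs)
print(index["hadoop"])  # ['a.txt', 'b.txt']
```

In a real framework the shuffle step is handled by the cluster, so only the mapper and reducer logic would need to be written by hand.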

Unit 3 – Apache Hive

  • 5. Apache Hive 1
    • Databases for Big Data
    • HiveQL and Querying Data
    • Windowing and Analytics Functions
    • MapReduce Scripts
  • 6. Apache Hive 2
    • Tables in Hive
    • Managed Tables and External Tables
    • Storage Formats
    • Partitions and Buckets

Unit 4 – Apache Pig

  • 7. Apache Pig 1
    • Overview
    • Pig Latin: Data Types
    • Pig Latin: Relational Operators
  • 8. Apache Pig 2
    • More Pig Latin: Relational Operators
    • More Pig Latin: Functions
    • Compiling Pig to MapReduce
    • The Parallel Clause
    • Join Optimizations

Unit 5 – Apache Spark and AWS

  • 9. Apache Spark – Spark Core
    • Spark Overview
    • Running Spark using Databricks Notebooks
    • Working with PySpark: RDDs
    • Transformations and Actions
  • 10. Apache Spark – Spark SQL
    • Spark DataFrame
    • SQL Operations using Spark SQL
  • 11. Apache Spark – Spark ML
    • ML Pipeline using PySpark
  • 12. Amazon Elastic MapReduce
    • Overview
    • Amazon Web Services: IAM, EC2, S3
    • Creating EMR Cluster
    • Submitting Jobs
    • Intro to AWS CLI

Campus Location

500 8th Ave Suite 905, New York, NY 10018
Nearby Subways
1 2 3 34th, Penn Station
A C E 34th, Penn Station
N Q R B D F M 34th, Herald Square

Instructors

Jake Bialer
Instructor
Jake Bialer is a full stack developer and data scientist who has spent the last decade immersed in data problems at online media organizations, e-commerce sites, and other web businesses. He currently runs his own consultancy, Bialerology, and teaches web scraping and big data engineering at the NYC Data Science Academy.

Session Schedule

Earlybird ends on 12/12
January Session

Jan 11 - Feb 17, 2022 Tuesday & Thursday
  • 1. January 11, 2022
  • 2. January 13, 2022
  • 3. January 18, 2022
  • 4. January 20, 2022
  • 5. January 25, 2022
  • 6. January 27, 2022
  • 7. February 1, 2022
  • 8. February 3, 2022
  • 9. February 8, 2022
  • 10. February 10, 2022
  • 11. February 15, 2022
  • 12. February 17, 2022
7:00-9:30pm

$2990.00
$2840.50
Earlybird ends on 02/20
March Session

Mar 22 - Apr 28, 2022 Tuesday & Thursday
  • 1. March 22, 2022
  • 2. March 24, 2022
  • 3. March 29, 2022
  • 4. March 31, 2022
  • 5. April 5, 2022
  • 6. April 7, 2022
  • 7. April 12, 2022
  • 8. April 14, 2022
  • 9. April 19, 2022
  • 10. April 21, 2022
  • 11. April 26, 2022
  • 12. April 28, 2022
7:00-9:30pm

$2990.00
$2840.50
Earlybird ends on 05/01
May Session

May 31 - Jul 7, 2022 Tuesday & Thursday
  • 1. May 31, 2022
  • 2. June 2, 2022
  • 3. June 7, 2022
  • 4. June 9, 2022
  • 5. June 14, 2022
  • 6. June 16, 2022
  • 7. June 21, 2022
  • 8. June 23, 2022
  • 9. June 28, 2022
  • 10. June 30, 2022
  • 11. July 5, 2022
  • 12. July 7, 2022
7:00-9:30pm

$2990.00
$2840.50

Save More by Enrolling in a Bundle

Data Science Mastery
Data Science with R: Machine Learning
Data Science with Python: Machine Learning
Big Data with Amazon Cloud, Hadoop/Spark and Docker
Total: $7410.00 (regular price $7970.00)