Big Data with Hadoop and Spark

Beginner

Course Overview

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course will cover these key components of the ecosystem: HDFS, MapReduce with Hadoop Streaming, Hive, and Spark. Programming will be done in Python. The course will begin with a review of the Python concepts needed for our examples. The course format is interactive. Students will need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions on how to connect to AWS and obtain an account will be provided ahead of time.


Date and Time

January Session

Jan 22 - Mar 5, 2019, 7:00-9:30pm
Day 1: January 22, 2019
Day 2: January 24, 2019
Day 3: January 29, 2019
Day 4: January 31, 2019
Day 5: February 5, 2019
Day 6: February 7, 2019
Day 7: February 12, 2019
Day 8: February 19, 2019
Day 9: February 21, 2019
Day 10: February 26, 2019
Day 11: February 28, 2019
Day 12: March 5, 2019
$2990.00

April Session Early-bird Pricing!

Apr 23 - Jun 4, 2019, 7:00-9:30pm
Day 1: April 23, 2019
Day 2: April 25, 2019
Day 3: April 30, 2019
Day 4: May 2, 2019
Day 5: May 7, 2019
Day 6: May 9, 2019
Day 7: May 16, 2019
Day 8: May 21, 2019
Day 9: May 23, 2019
Day 10: May 28, 2019
Day 11: May 30, 2019
Day 12: June 4, 2019
$2840.50 early-bird pricing (regular price $2990.00)
Register before Feb 22nd to take advantage of this price!

September Session Early-bird Pricing!

Sep 10 - Oct 17, 2019, 7:00-9:30pm
Day 1: September 10, 2019
Day 2: September 12, 2019
Day 3: September 17, 2019
Day 4: September 19, 2019
Day 5: September 24, 2019
Day 6: September 26, 2019
Day 7: October 1, 2019
Day 8: October 3, 2019
Day 9: October 8, 2019
Day 10: October 10, 2019
Day 11: October 15, 2019
Day 12: October 17, 2019
$2840.50 early-bird pricing (regular price $2990.00)
Register before Jul 12th to take advantage of this price!

Instructors

Jake Bialer
Jake Bialer is a full stack developer and data scientist who has spent the last decade immersed in data problems at online media organizations, e-commerce sites, and other web businesses. He currently runs his own consultancy, Bialerology, and teaches web scraping and big data engineering at the NYC Data Science Academy.

Product Description

Details

What is Hadoop?

Hadoop is a set of open-source programs, running on computer clusters, that simplifies the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many ways. It now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a very popular and rapidly growing set of cluster-computing solutions that is becoming an essential tool for data scientists.
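
To make the MapReduce idea concrete, here is a minimal word-count sketch in the Hadoop Streaming style used later in the course: the mapper and reducer are plain Python scripts that read standard input and write standard output. The file names mapper.py and reducer.py are only illustrative.

    #!/usr/bin/env python
    # mapper.py: emit one "word<TAB>1" line per word seen on stdin
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py: sum the counts per word; Hadoop Streaming delivers the
    # mapper output sorted by key, so identical words arrive on adjacent lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Outside of Hadoop, the same pipeline can be simulated on a laptop with cat input.txt | python mapper.py | sort | python reducer.py, which is a handy way to test streaming jobs before submitting them to the cluster.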

Prerequisites

To get the most out of the class, you need to be familiar with Linux file systems, the Linux command-line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
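
As a quick self-check, here is a short sketch of the kind of functional-style Python the prerequisite refers to (the sample records are made up):

    # Split each comma-separated record into a list of fields, producing a
    # nested list: the functional style used throughout the course examples.
    lines = ["alice,34,NYC", "bob,29,Boston"]          # hypothetical sample data
    records = list(map(lambda line: line.split(","), lines))
    # records == [['alice', '34', 'NYC'], ['bob', '29', 'Boston']]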

Certificate

Certificates are awarded at the end of the program upon satisfactory completion of the course.

Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete 80% of the homework and attend a minimum of 85% of all classes are eligible for the certificate of completion.


Syllabus

Unit 1 – Introduction: Hadoop, MapReduce, Python

  • Overview of Big Data and the Hadoop ecosystem
  • The concept of MapReduce
  • HDFS – Hadoop Distributed File System
  • Python for MapReduce

Unit 2 – MapReduce

  • More Python for MapReduce
  • Implementing MapReduce with Python streaming

Unit 3 – Hive: A database for Big Data

  • Hive concepts, Hive query language (HiveQL)
  • User-defined functions in Python (using streaming; see the sketch after this list)
  • Accessing Hive from Python
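
To give a flavor of the streaming-based UDF approach, here is a minimal sketch of a Python script that Hive can invoke through its TRANSFORM clause. It reads tab-separated rows on standard input and writes one transformed row per line; the two-column layout (user_id, city) is purely hypothetical.

    #!/usr/bin/env python
    # normalize_city.py: example streaming "UDF" for Hive's TRANSFORM clause.
    # Assumes each input row carries two tab-separated columns: user_id, city.
    import sys

    for line in sys.stdin:
        user_id, city = line.rstrip("\n").split("\t", 1)
        # Emit the row again with the city name normalized to lower case.
        print("%s\t%s" % (user_id, city.strip().lower()))

On the Hive side, a script like this is typically registered with ADD FILE and invoked via SELECT TRANSFORM(...) USING; the exact HiveQL syntax is part of the Unit 3 material.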

Unit 4 – Pig: A Platform for Analyzing Large Datasets Using MapReduce

  • Intro to Apache Pig
  • Data Types in Pig
  • Pig Latin
  • Compiling Pig to MapReduce

Unit 5 – Spark

  • Intro to Spark using PySpark
  • Basic Spark concepts: RDDs, transformations, actions (see the sketch after this list)
  • PairRDDs and aggregating transformations
  • Advanced Spark: partitions; shared variables
  • SparkSQL
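
For a feel of the PySpark material, here is a minimal sketch, assuming a working Spark installation and the pyspark package: it builds an RDD, chains transformations into a pair RDD, aggregates with reduceByKey, and triggers an action.

    from pyspark import SparkContext

    # Assumes Spark is installed locally or on the cluster used in class.
    sc = SparkContext(appName="wordCountSketch")

    # Transformations: build an RDD of (word, 1) pairs from a tiny in-memory sample.
    lines = sc.parallelize(["big data with hadoop", "big data with spark"])
    pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

    # Pair-RDD aggregation (still a transformation), then an action that
    # pulls the results back to the driver.
    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())   # e.g. [('big', 2), ('data', 2), ('with', 2), ...]

    sc.stop()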

Unit 6 – Project Week

  • Case studies/Final projects

Reviews

There are no reviews yet.


Testimonials

Sebastian Nordgren
Senior Vice President at Citi

I attended the Big Data with Hadoop and Spark course, hosted and led by NYC Data Science Academy. My objective was two-fold: first, to gain a deeper and more practical understanding of emerging 'Big Data' technologies than academic publications or industry white papers currently provide; and, second, to familiarize myself with the skill set and experience to expect from the new generation of statisticians, or Data Scientists. With a background in Business Intelligence, Architecture, Risk Management and Governance on Wall Street, I find that the foundational skills remain the same: mathematics and statistics. However, with the commoditization of data storage and massively parallel computing, Data Scientists today are capable of solving problems reserved for an exclusive few in decades past. The course did not cover configuration of the Hadoop environment, but thanks to the engaging and knowledgeable instructor, clues on challenges and potential pitfalls were generously shared. I highly recommend this course not only to professionals or recent graduates looking to hone data analysis skills, but to anyone with an interest or stake in Big Data.
