Big Data with Hadoop and Spark

Beginner

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers these key components of the ecosystem: HDFS, MapReduce with streaming, Hive, Pig, and Spark. All programming is done in Python, and the course begins with a review of the Python concepts needed for our examples. The format is interactive, so students need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions on how to obtain an account and connect to AWS will be provided ahead of time.

We're sorry, this class isn't on the schedule at the moment. Please join our waiting list to be notified when it becomes available again.

Date and Time

September Session

Sep 24 - Nov 5, 2018, 7:00-9:30pm
Day 1: September 24, 2018
Day 2: September 26, 2018
Day 3: October 1, 2018
Day 4: October 3, 2018
Day 5: October 10, 2018
Day 6: October 15, 2018
Day 7: October 17, 2018
Day 8: October 22, 2018
Day 9: October 24, 2018
Day 10: October 29, 2018
Day 11: October 31, 2018
Day 12: November 5, 2018
$2990.00

Instructors

Jake Bialer
Jake Bialer is a full stack developer and data scientist who has spent the last decade immersed in data problems at online media organizations, e-commerce sites, and other web businesses. He currently runs his own consultancy, Bialerology, and teaches web scraping and big data engineering at the NYC Data Science Academy.

Product Description


Details

What is Hadoop?

Hadoop is a set of open-source programs that run on computer clusters and simplify the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many directions. The ecosystem now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a popular and rapidly growing set of cluster computing solutions that is becoming an essential tool for data scientists.

Prerequisites

To get the most out of the class, you need to be familiar with Linux file systems, the Linux command-line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
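
As a quick self-check for the map() prerequisite, here is a minimal sketch of splitting a list of strings into a nested list. The sample strings and variable names are made up for illustration and are not course material.

```python
# Each string is one whitespace-separated record (made-up sample data).
lines = ["alpha beta gamma", "delta epsilon", "zeta"]

# map() applies str.split to every element; list() materializes the lazy
# map object that Python 3 returns.
nested = list(map(str.split, lines))

print(nested)
# [['alpha', 'beta', 'gamma'], ['delta', 'epsilon'], ['zeta']]
```

If this reads naturally to you, the functional style used in the MapReduce and Spark units should feel familiar.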

Certificate

Certificates are awarded upon satisfactory completion of the course.

Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete at least 80% of the homework and attend at least 85% of all classes are eligible for the certificate of completion.

Syllabus

Unit 1 – Introduction: Hadoop, MapReduce, Python

  • Overview of Big Data and the Hadoop ecosystem
  • The concept of MapReduce
  • HDFS – Hadoop Distributed File System
  • Python for MapReduce

Unit 2 – MapReduce

  • More Python for MapReduce
  • Implementing MapReduce with Python streaming
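
To give a feel for the streaming style covered in Unit 2, here is a minimal word-count sketch. It is not taken from the course materials. In an actual Hadoop Streaming job the mapper and reducer are separate scripts that read stdin and write tab-separated lines to stdout; here they are plain generator functions, and the hypothetical helper run_local imitates the shuffle step with a sort so the whole thing runs on a laptop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; records must arrive sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Hypothetical local stand-in for the cluster: map, sort (the shuffle), reduce."""
    return list(reducer(sorted(mapper(lines))))


# Made-up sample input.
sample = ["the quick brown fox", "the lazy dog", "The end"]
for record in run_local(sample):
    print(record)
```

The reducer relies on its input being sorted by key, which is why run_local sorts the mapper output first; on a cluster, the framework's shuffle phase provides that ordering.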

Unit 3 – Hive: A database for Big Data

  • Hive concepts, Hive query language (HiveQL)
  • User-defined functions in Python (using streaming)
  • Accessing Hive from Python

Unit 4 – Pig: A Platform for Analyzing Large Datasets Using MapReduce

  • Intro to Apache Pig
  • Data Types in Pig
  • Pig Latin
  • Compiling Pig to MapReduce

Unit 5 – Spark

  • Intro to Spark using PySpark
  • Basic Spark concepts: RDDs, transformations, actions
  • PairRDDs and aggregating transformations
  • Advanced Spark: partitions; shared variables
  • Spark SQL
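
The pair-aggregation idea in Unit 5 can be previewed without a cluster. The sketch below is plain Python, not PySpark: reduce_by_key is a hypothetical helper that mimics what the RDD reduceByKey transformation computes (merging all values that share a key with a binary function), and a generator stands in for a lazy transformation that only runs when something consumes it.

```python
from functools import reduce
from operator import add


def reduce_by_key(pairs, func):
    """Merge the values of each key with func (a plain-Python stand-in, not Spark's API)."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


# Made-up (page, visits) pairs standing in for a pair RDD.
visits = [("home", 3), ("about", 1), ("home", 2), ("blog", 4), ("about", 5)]

# "Transformation": a lazy generator, so nothing is computed yet.
doubled = ((page, n * 2) for page, n in visits)

# "Action": consuming the generator forces the computation.
totals = reduce_by_key(doubled, add)
print(totals)
# {'home': 10, 'about': 12, 'blog': 8}
```

In Spark the same computation would be distributed across partitions; this version only illustrates the key-wise merge and the lazy-then-forced evaluation order.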

Unit 6 – Project Week

  • Case studies/Final projects

Testimonials

Sebastian Nordgren
Senior Vice President at Citi

I attended the Big Data with Hadoop and Spark course, hosted and led by NYC Data Science Academy. My objective was two-fold: first, to gain a deeper and practical understanding of emerging 'Big Data' technologies, more so than what academic publications or industry white papers currently provide; and, second, to familiarize myself with the skill set and experience to expect from the new generation of statisticians, or Data Scientists. With a background in Business Intelligence, Architecture, Risk Management and Governance on Wall Street, I find that foundational skills remain the same: mathematics and statistics. However, with the commoditizing of data storage and massively parallel computing, Data Scientists today are capable of solving problems reserved for an exclusive few in decades past. The course did not cover configuration of the Hadoop environment, but thanks to the engaging and knowledgeable instructor, clues on challenges and potential pitfalls were generously shared. I highly recommend this course not only to professionals or recent graduates looking to hone data analysis skills, but to anyone with an interest or stake in Big Data.
