Big Data with Hadoop and Spark

Course Overview
Beginner

This is a 6-week evening program providing a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers these key components of the ecosystem: HDFS, MapReduce with streaming, Hive, Pig, and Spark. Programming is done in Python, and the course begins with a review of the Python concepts needed for our examples. The course format is interactive, so students need to bring laptops to class. We will work on AWS (Amazon Web Services); instructions on how to obtain an account and connect to AWS will be provided ahead of time.

Date and Time

August Session

Aug 14 - Sep 20, 2018, 7:00-9:30pm
Day 1: August 14, 2018
Day 2: August 16, 2018
Day 3: August 21, 2018
Day 4: August 23, 2018
Day 5: August 28, 2018
Day 6: August 30, 2018
Day 7: September 4, 2018
Day 8: September 6, 2018
Day 9: September 11, 2018
Day 10: September 13, 2018
Day 11: September 18, 2018
Day 12: September 20, 2018
$2990.00

Instructors

Luke Lin
Luke holds a PhD in Mathematics from Stony Brook University, where he specialized in partial differential equations. A lifelong learner of mathematics, he is highly efficient at quantitative analysis and skilled at communicating abstract concepts. With proficiency in R and Python, Luke is primed to be a major asset to any analytics team. Passionate about sharing insights from data across a variety of industries, he looks forward to meeting talented students from all kinds of backgrounds at NYC Data Science Academy.

Details

What is Hadoop?

Hadoop is a set of open-source programs that run on computer clusters and simplify the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many ways. The ecosystem now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a popular and rapidly growing set of cluster computing solutions that is becoming an essential tool for data scientists.
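To make the MapReduce idea concrete, here is a minimal sketch of a word count in plain Python. It is not Hadoop code: the function names mapper, reducer, and word_count and the sample sentences are illustrative assumptions, and an in-memory sort stands in for the shuffle step that a cluster performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the counts within each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # Sorting by key stands in for the shuffle phase (an assumption of this
    # sketch); on a real cluster the framework groups keys across machines.
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


# Hypothetical sample input, made up for illustration.
sample = ["big data with hadoop", "hadoop and spark", "big data"]
counts = word_count(sample)
print(counts)
```

The reducer only works because its input arrives sorted by key, which is why the sort sits between the two steps. In this sketch everything runs in one process, so it illustrates the paradigm rather than the parallelism.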

Prerequisites

To get the most out of the class, you need to be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
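As a quick self-check of the functional style mentioned above, the following sketch uses map() to split a list of strings into a nested list. The variable names and sample strings are made up for illustration.

```python
# Hypothetical sample data: each string is one line of text.
lines = ["the quick brown fox", "jumps over", "the lazy dog"]

# map() applies str.split to every string; list() materializes the result,
# because map() returns a lazy iterator in Python 3.
nested = list(map(str.split, lines))

print(nested)
# [['the', 'quick', 'brown', 'fox'], ['jumps', 'over'], ['the', 'lazy', 'dog']]
```

If this reads naturally to you, the Python used in the MapReduce and Spark units should be approachable.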

Certificate

Certificates are awarded at the end of the program upon satisfactory completion of the course.

Students are evaluated on a pass/fail basis for their performance on the required homework and final project (where applicable). Students who complete at least 80% of the homework and attend at least 85% of all classes are eligible for the certificate of completion.

Syllabus

Unit 1 – Introduction: Hadoop, MapReduce, Python

  • Overview of Big Data and the Hadoop ecosystem
  • The concept of MapReduce
  • HDFS – Hadoop Distributed File System
  • Python for MapReduce

Unit 2 – MapReduce

  • More Python for MapReduce
  • Implementing MapReduce with Python streaming

Unit 3 – Hive: A database for Big Data

  • Hive concepts, Hive query language (HiveQL)
  • User-defined functions in Python (using streaming)
  • Accessing Hive from Python

Unit 4 – Pig: A Platform for Analyzing Large Datasets Using MapReduce

  • Intro to Apache Pig
  • Data Types in Pig
  • Pig Latin
  • Compiling Pig to MapReduce

Unit 5 – Spark

  • Intro to Spark using PySpark
  • Basic Spark concepts: RDDs, transformations, actions
  • PairRDDs and aggregating transformations
  • Advanced Spark: partitions; shared variables
  • SparkSQL

Unit 6 – Project Week

  • Case studies/Final projects

Testimonials

Sebastian Nordgren
Senior Vice President at Citi

I attended the Big Data with Hadoop and Spark course, hosted and led by NYC Data Science Academy. My objective was two-fold: first, to gain a deeper, practical understanding of emerging 'Big Data' technologies, more so than what academic publications or industry white papers currently provide; and, second, to familiarize myself with the skill set and experience to expect from the new generation of statisticians, or Data Scientists. With a background in Business Intelligence, Architecture, Risk Management and Governance on Wall Street, I find that foundational skills remain the same: mathematics and statistics. However, with the commoditization of data storage and massively parallel computing, Data Scientists today are capable of solving problems reserved for an exclusive few in decades past. The course did not cover configuration of the Hadoop environment, but thanks to the engaging and knowledgeable instructor, clues on challenges and potential pitfalls were generously shared. I highly recommend this course not only to professionals or recent graduates looking to hone data analysis skills, but to anyone with an interest or stake in Big Data.
