Big Data with Hadoop and Spark

Course Overview

This is a 6-week evening program that provides a hands-on introduction to the Hadoop and Spark ecosystem of Big Data technologies. The course covers these key components of the ecosystem: HDFS, MapReduce with streaming, Hive, and Spark. Programming will be done in Python, and the course begins with a review of the Python concepts needed for our examples. The format is interactive, so students will need to bring laptops to class. We will do our work on AWS (Amazon Web Services); instructions on how to obtain an account and connect to AWS will be provided ahead of time.

Dates & Time Venue Tuition  
January 23, 2017 - March 6, 2017 7:00-9:30pm Workdays
Day 1: January 23, 2017
Day 2: January 25, 2017
Day 3: January 30, 2017
Day 4: February 1, 2017
Day 5: February 6, 2017
Day 6: February 8, 2017
Day 7: February 13, 2017
Day 8: February 15, 2017
Day 9: February 22, 2017
Day 10: February 27, 2017
Day 11: March 1, 2017
Day 12: March 6, 2017
New York
500 8th Ave., Suite 905
New York, NY 10018
$2990.00
April 17, 2017 - May 24, 2017 7:00-9:30pm Workdays
Day 1: April 17, 2017
Day 2: April 19, 2017
Day 3: April 24, 2017
Day 4: April 26, 2017
Day 5: May 1, 2017
Day 6: May 3, 2017
Day 7: May 8, 2017
Day 8: May 10, 2017
Day 9: May 15, 2017
Day 10: May 17, 2017
Day 11: May 22, 2017
Day 12: May 24, 2017
New York
500 8th Ave., Suite 905
New York, NY 10018
$2990.00
July 10, 2017 - August 16, 2017 7:00-9:30pm Workdays
Early-Bird Pricing!
Day 1: July 10, 2017
Day 2: July 12, 2017
Day 3: July 17, 2017
Day 4: July 19, 2017
Day 5: July 24, 2017
Day 6: July 26, 2017
Day 7: July 31, 2017
Day 8: August 2, 2017
Day 9: August 7, 2017
Day 10: August 9, 2017
Day 11: August 14, 2017
Day 12: August 16, 2017
New York
500 8th Ave., Suite 905
New York, NY 10018
$2990.00 (early-bird price: $2840.50)
September 18, 2017 - October 30, 2017 7:00-9:30pm Workdays
Early-Bird Pricing!
Day 1: September 18, 2017
Day 2: September 20, 2017
Day 3: September 25, 2017
Day 4: September 27, 2017
Day 5: October 2, 2017
Day 6: October 4, 2017
Day 7: October 11, 2017
Day 8: October 16, 2017
Day 9: October 18, 2017
Day 10: October 23, 2017
Day 11: October 25, 2017
Day 12: October 30, 2017
New York
500 8th Ave., Suite 905
New York, NY 10018
$2990.00 (early-bird price: $2840.50)
Instructor
Shu Yan
Shu Yan obtained his Ph.D. in Physics from the University of South Carolina. As a physicist with strong analytical skills and a solid programming background, he brings coding, data science, and critical problem-solving skills together to tackle real-world problems. His physical intuition and mathematical reasoning bring additional insight to statistical models and machine learning.


What is Hadoop?

Hadoop is a set of open-source programs that run on computer clusters and simplify the handling of large amounts of data. Originally, Hadoop consisted of a distributed file system tuned for large data sets and an implementation of the MapReduce parallelism paradigm, but it has since expanded in many ways. The ecosystem now includes database systems, languages for parallelism, libraries for machine learning, its own job scheduler, and much more. Furthermore, MapReduce is no longer the only parallelism framework; Spark is an increasingly popular alternative. In summary, Hadoop is a popular and rapidly growing set of cluster computing solutions that is becoming an essential tool for data scientists.
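
To give a feel for the MapReduce paradigm mentioned above, here is a minimal single-machine sketch in plain Python. It is an illustration only, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines and perform the grouping step itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big data", "Big clusters"]) counts "big" twice and the other words once. The point of the split is that map_phase looks at one line at a time and reduce_phase looks at one key at a time, which is what lets a cluster distribute the work.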

Syllabus

Unit 1 – Introduction: Hadoop, MapReduce, Python

  • Overview of Big Data and the Hadoop ecosystem
  • The concept of MapReduce
  • HDFS – Hadoop Distributed File System
  • Python for MapReduce

Unit 2 – MapReduce

  • More Python for MapReduce
  • Implementing MapReduce with Python streaming
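
As a rough idea of what Unit 2 involves: with Hadoop streaming, the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, and the framework sorts the mapper output by key before the reducer sees it. The sketch below assumes that contract; the function names and the run_locally helper are hypothetical and exist only so the logic can be tried on one machine without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Yield one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; the input must already be sorted by key."""
    records = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce on one machine (stand-in for the cluster)."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual streaming job, each function would live in its own script that loops over sys.stdin and prints every record it produces; the sorting done here by sorted() is handled by Hadoop.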

Unit 3 – Hive: A database for Big Data

  • Hive concepts, Hive query language (HiveQL)
  • User-defined functions in Python (using streaming)
  • Accessing Hive from Python
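
A sketch of what a Python "user-defined function" via streaming can look like: Hive's TRANSFORM clause pipes each selected row to a script as tab-separated text and reads tab-separated rows back. The table and column names below (users, user_id, email) and the script name email_domain.py are hypothetical, and the use of \N as the null marker assumes Hive's default text serialization. A query along the lines of SELECT TRANSFORM (user_id, email) USING 'python email_domain.py' AS (user_id, domain) FROM users would feed rows to the script.

```python
import sys


def transform_row(line):
    """Turn one 'user_id<TAB>email' row into 'user_id<TAB>domain'."""
    user_id, email = line.rstrip("\n").split("\t")
    if "@" in email:
        domain = email.rsplit("@", 1)[-1].lower()
    else:
        # Assumption: \N is how the default Hive text format represents NULL.
        domain = r"\N"
    return f"{user_id}\t{domain}"


def main(stream=sys.stdin):
    """Read rows from Hive on stdin and print the transformed rows."""
    for line in stream:
        print(transform_row(line))
```

Keeping the per-row logic in transform_row makes it easy to test on a laptop before the script is shipped to the cluster; the script file would end by calling main().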

Units 4 & 5 – Spark

  • Intro to Spark using PySpark
  • Basic Spark concepts: RDDs, transformations, actions
  • PairRDDs and aggregating transformations
  • Advanced Spark: partitions; shared variables
  • SparkSQL

Unit 6 – Project Week

  • Case studies/Final projects


Testimonials

Sebastian Nordgren

I attended the Big Data with Hadoop and Spark course, hosted and led by NYC Data Science Academy. My objective was two-fold: first, to gain a deeper and more practical understanding of emerging 'Big Data' technologies than academic publications or industry white papers currently provide; and, second, to familiarize myself with the skill set and experience to expect from the new generation of statisticians, or Data Scientists. With a background in Business Intelligence, Architecture, Risk Management and Governance on Wall Street, I find that the foundational skills remain the same: mathematics and statistics. However, with the commoditizing of data storage and massively parallel computing, Data Scientists today are capable of solving problems reserved for an exclusive few in decades past. The course did not cover configuration of the Hadoop environment, but thanks to the engaging and knowledgeable instructor, clues on challenges and potential pitfalls were generously shared. I highly recommend this course not only to professionals or recent graduates looking to hone data analysis skills, but to anyone with an interest or stake in Big Data.