This 6-week program provides a hands-on introduction to Apache Hadoop and Spark programming using Python and cloud computing. The key components covered include the Hadoop Distributed File System (HDFS), MapReduce using MRJob, Apache Hive, Pig, and Spark. Tools and platforms used include Docker, Amazon Web Services (AWS), and Databricks. In the first half of the program, students pull a pre-built Docker image and run most of the exercises locally in Docker containers. In the second half, students use their AWS and Databricks accounts to run cloud computing exercises. Students need to bring their laptops to class. Detailed instructions on how to pull and run a Docker image, how to connect to AWS and Databricks, and so on will be provided ahead of time.
To get the most out of the class, you need to be familiar with Linux file systems, the Linux command line interface (CLI), and basic Linux commands such as cd, ls, and cp. You also need basic programming skills in Python and should be comfortable with a functional programming style, for example, using the map() function to split a list of strings into a nested list. Object-oriented programming (OOP) in Python is not required.
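As a quick self-check for the map() prerequisite mentioned above, the following is a minimal sketch of the kind of transformation meant. The sample strings and the variable names (lines, nested) are illustrative assumptions, not course material.

```python
# Illustrative input: each string is a whitespace-separated line of text.
lines = ["hadoop spark hive", "pig mrjob", "docker aws databricks"]

# map() applies str.split to every string, yielding one list of words per line.
# list() materializes the lazy map object into a nested list.
nested = list(map(str.split, lines))

print(nested)
```

If reading and writing a one-liner like this feels natural, the functional style used in the MapReduce and Spark exercises should be approachable.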
Certificates are awarded at the end of the program upon satisfactory completion of the course. Students are evaluated on a pass/fail basis on the required homework and final project (where applicable). Students who complete at least 80% of the homework and attend at least 85% of all classes are eligible for the certificate of completion.