NYC Open Data: Streaming Python on Hadoop

Posted on Sep 19, 2015

Our upcoming 12-week Data Science bootcamp starts on January 11, 2016. Apply now to secure a spot in our winter cohort!

In the meantime, come join the NYC Open Data Meetup Group and learn how easily you can use Hadoop for Machine Learning.

If you are hiring Data Scientists, call us at +1-888-752-7585 (USA) or reach us at [email protected] to share your openings and set up interviews with our excellent students.

What is an NYC Open Data Meetup event like? Here’s an example:

This summer, Sam Kamin (Vice President of Engineering, NYC Data Science Academy) gave a teaser version of NYC Data Science Academy’s five-week Hadoop course. (The current version on offer is a six-week evening program, “Big Data with Hadoop and Spark.”) See below for a class syllabus from the five-week Hadoop course.

First, though, here are the slides from the Meetup:

(Slides can also be accessed here through SlideShare.)

Plus video of the event:

Speaker Bio:

Sam Kamin is currently a Vice President of Engineering at NYC Data Science Academy. He is also Associate Professor Emeritus at the University of Illinois at Urbana-Champaign, where he taught computer science for over 20 years and was head of the undergraduate program. Most recently, he was an engineer at Google before joining NYC Data Science Academy.

Now for the five-week Hadoop course: 

This five-week course is an intensive, hands-on introduction to the Hadoop ecosystem of Big Data technologies. The emphasis is on learning several of the major components of Apache Hadoop – HDFS, MapReduce, Hive, Pig, Streaming – by doing exercises of increasing complexity. Programming will be done in Python. Students are expected to be comfortable using an operating system from the command line. Knowledge of Python is helpful, and the material in Learn Python the Hard Way is sufficient background. The course format is mixed lecture/lab. Students will need to bring their own laptops to connect to our server; instructions for installing any required software will be provided ahead of time.

SYLLABUS
Week 1 – Introduction: MapReduce
Overview of Big Data and the Hadoop ecosystem
The concept of MapReduce
HDFS – Hadoop Distributed File System
MapReduce with Python streaming

Week 2 – More on MapReduce
More on Big Data, the Hadoop ecosystem, and MapReduce.
Mixed case studies and exercises using MR with Python streaming

Week 3 – Hive: A database for Big Data
Hive concepts
HiveQL
User-defined functions in the Hive language
User-defined functions in Python (using streaming)
Advanced topic: Hive queries in Python code

Week 4 – Pig: Simplified MapReduce
Basic concepts
Pig Latin
Pig functions and macros
User-defined functions

Week 5 – Project day
The Hadoop ecosystem
Brief intro to Spark
Brief intro to Mahout
Case studies/project ideas
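
For a flavor of the "MapReduce with Python streaming" topic in Weeks 1 and 2, here is a minimal word-count sketch. It is not taken from the course materials. The function names (mapper, reducer, run_local) are hypothetical, and the sort step only simulates, inside a single process, the shuffle that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word,
    the way a streaming mapper would print them to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key. The input must already be
    sorted by key, which is how Hadoop delivers it to a reducer."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort (shuffle) -> reduce in one process (assumption:
    a local stand-in for the cluster, useful only for testing the logic)."""
    return list(reducer(sorted(mapper(lines))))


# Small illustrative input; not real course data.
sample = ["the cat sat", "The dog sat"]
for record in run_local(sample):
    print(record)
```

On a real cluster, the mapper and reducer would typically live in two separate scripts that read from standard input and print to standard output, and would be handed to the Hadoop streaming jar through its -mapper and -reducer options. The location of that jar depends on the installation.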

About Author

Vivian Zhang

Vivian Zhang is the CTO and School Director of the NYC Data Science Academy. She started the NYC Open Data meetup group. She earned her M.S. in Computer Science and Statistics and B.S. in Computer Science. She is...
