Hadoop Workshop II: Run Map Reduce Jobs on your Amazon cloud cluster

Vivian Zhang
Posted on Apr 11, 2014

I was so happy to get many upvotes!

Many thanks go to Conductor Inc (Conductor makes the most widely used SEO platform - empowering enterprise marketers to take control of their search performance.)

Special thanks go to Caitlin Wilterdink, Jon Torodash, and Chris Lee (now Googler) for their assistance and for hosting us in the wonderful space!

NYC Data Science Academy is offering two related courses:
RSVP Hadoop Beginner level classes
RSVP Hadoop Intermediate level classes


Meetup Announcement:

Speaker: Vivian Zhang, CTO and co-founder of SupStat Inc, organizer of NYC Open Data Meetup, and founder of NYC Data Science Academy (https://nycdatascience.com/). She teaches R and Hadoop.

Her data school hires the best working professionals to teach Python, D3.js and related data science skills. All the courses are designed to teach you employable skills. We teach the skills and toolkits in class and help students carry out projects of their own choice. Students will showcase their projects in this meetup group at the end of their courses.


In Hadoop workshop I and II, I will walk you through the steps to configure a Hadoop cluster on Amazon EC2 and run two simple map-reduce jobs on the cluster.


  1. Sign up for an Amazon AWS account at http://aws.amazon.com/account/

  2. Get familiar with basic vi commands (if you don't know them, I can show you quickly; you are welcome to read more before coming).

  3. You don't need to know Java at this moment. If you know Java, you can program in Hadoop quickly in later workshops.

Tutorial Repo:

You can follow along to set up R and rJava and run map-reduce using our Hackpad tutorial.

1. Goal:
  • Configure a Hadoop Cluster on Amazon EC2

  • Run two sample map reduce jobs on the cluster

2. I will go over the details:

  • Preparation

  • Server configuration

  • Hadoop installation and configuration

3. Preparation:

1) apply for an AWS account

  • go to http://aws.amazon.com/account/

  • enter your name and email, click on "sign in";

  • enter your full name, email, billing info, and credit card;

  • pass the phone number verification;

  • select an AWS support plan, use Basic (Free);

  • see the message "Thank you for updating your Amazon Web Services Account!".

2) log in to your account

  • sign in using the account and password you just made.
  • choose "Launch the AWS Management Console";
  • click on button "sign in to the AWS console";
  • choose "I am a returning user and my password is";
  • click "sign in using our secure server";
  • see "EC2 Dashboard" on the top left corner.

3) create your key pair

  • you authenticate yourself and communicate with the Amazon cloud using a public and private key pair.
  • on the left panel of the EC2 dashboard, under the "network & security" category, click on "Key Pairs";
  • click on the "create key pair" button;
  • type your key pair name; I used "EC2UbuntuLTSThreeT1Micro";
  • a download window appears; choose a secure folder to save your .pem file, which will be your private key. My file is called "EC2UbuntuLTSThreeT1Micro.pem".

4) create EC2 instances

  • What is an instance? Each instance is like a separate machine (a PC or laptop).
  • on the left panel of the EC2 dashboard, under the "instances" category, click on "instances";
  • click on the "launch instance" button;
  • choose the system: find the 9th option, "Ubuntu Server 12.04 LTS (HVM) - ami-5fa4a236 64 bit";
  • click on "select";
  • keep the default settings, choose "next:configure instance details";
  • put "3" into "Number of instances";
  • click on "review and launch";
  • click on "launch";
  • choose the key pair "EC2UbuntuLTSThreeT1Micro";
  • check the acknowledgement box;
  • click on "launch instance"; you should see "Your instances are now launching";
  • scroll down to the bottom and click on "view instances";
  • click on each instance and rename it, such as "meetup1", "meetup2", "meetup3";
  • wait until the status checks change from "initializing" to "2/2 checks passed".

5) configure security group

  • each group acts like a firewall. The nodes of the same cluster need to be in the same security group.
  • click on "create security groups", put a name such as "launch-wizard-1" or change it to your own name;
  • under the "inbound" tab, click on "add rules";
  • add five rules:
    • choose type="SSH", "save";
    • choose type="All ICMP", source="anywhere", "save";
    • choose type="Custom TCP", port range="9000", source="anywhere", "save";
    • choose type="Custom TCP", port range="9001", source="anywhere", "save";
    • choose type="Custom TCP", port range="50000 - 50100", source="anywhere", "save";
    double check all your inbound settings; they should look like:
    | Type            | Protocol  | Port Range      | Source            |
    | ----------------|:---------:| ---------------:|------------------:|
    | SSH             | TCP       | 22              |Anywhere |
    | Custom TCP Rule | TCP       | 9001            |Anywhere |
    | Custom TCP Rule | TCP       | 9000            |Anywhere |
    | Custom TCP Rule | TCP       | 50000 - 50100   |Anywhere |
    | All ICMP        | ICMP      | 0 - 65535(or na)|Anywhere |

6) manage your instance

  • know how to turn the instances on and off
  • select the instance, right click and choose "stop": you won't be able to use this instance and won't be charged;
  • select the instance, right click and choose "start": you can use the instance and will be charged. Note that after you stop and start an instance, its "public DNS" will be different, but its "private DNS" will not change. If you only reboot the instance, its "public DNS" stays the same as before.

4. Server configuration

1) You are required to use the vi editor. The basic operations are:

 Cheat sheet for vi

| what you want to do              | Keystrokes        |
| ---------------------------------|:-----------------:|
| enter insert/edit mode           | i                 |
| finish and save                  | ESC->:->wq->Enter |
| finish without saving            | ESC->:->q!->Enter |
| open a new line below the cursor | ESC->o            |

2) generate your server rsa key for three instances

  • find your meetup1's public DNS and generate its RSA public key
  • open terminal 1, ssh to your server
  • before you run ssh, make sure you are in the folder where your .pem file is. I saved my .pem in my ".ssh" folder, so I do "cd .ssh" first.
  • remote access to your instance by "ssh -i EC2UbuntuLTSThreeT1Micro.pem ubuntu@your_public_dns".
  • make your own reference table for public DNS

  | machine name| public DNS                             | 
  | ------------|:--------------------------------------:|      
  | meetup1     |ec2-54-86-2-169.compute-1.amazonaws.com |
  | meetup2     |ec2-54-86-4-68.compute-1.amazonaws.com  |
  | meetup3     |ec2-54-86-10-200.compute-1.amazonaws.com|
  • you should be able to assemble the ssh commands based on the table
  • for "are you sure you want to continue connecting (yes/no)?" put "yes"
  • generate your server key
    • ssh-keygen -t rsa
    • press "Enter" three times
    • cd .ssh
    • vi id_rsa.pub
    • copy and paste its content into a text file for future use
  • find your meetup2's public DNS and generate its RSA public key
  • open terminal 2, ssh to your server
  • two parts are different from meetup1
    • meetup2 has a different public DNS address
    • copy and paste meetup2's id_rsa.pub into the same file as the second line
  • find your meetup3's public DNS and generate its RSA public key
  • open terminal 3, ssh to your server
  • two parts are different from meetup1
    • meetup3 has a different public DNS address
    • copy and paste meetup3's id_rsa.pub into the same file as the third line
  • in the end, you should have a file which contains three RSA public keys
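
The three login commands differ only in the public DNS name, so they can be assembled from the reference table. Here is a minimal sketch, assuming the default "ubuntu" login user of the Ubuntu image and that the .pem file sits in the current folder. The helper name ssh_command is made up for illustration, and the DNS names are the sample ones from the table, so replace them with your own.

```shell
#!/bin/sh
# Key file downloaded in the "create your key pair" step.
KEY_FILE="EC2UbuntuLTSThreeT1Micro.pem"

# ssh_command is a hypothetical helper: it only prints the ssh command for
# the given public DNS name, so you can review it before running it.
ssh_command() {
    printf 'ssh -i %s ubuntu@%s\n' "$KEY_FILE" "$1"
}

# Sample public DNS names from the reference table; replace with your own.
for host in \
    ec2-54-86-2-169.compute-1.amazonaws.com \
    ec2-54-86-4-68.compute-1.amazonaws.com \
    ec2-54-86-10-200.compute-1.amazonaws.com
do
    ssh_command "$host"
done
```

Printing the commands instead of running them keeps the sketch safe to try on your laptop; copy each printed line into its own terminal to log in.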

3) configure "authorized_keys" file for three instances

  • on meetup1 instance
  • cd .ssh
  • vi authorized_keys
  • go to the last line and open a new line below it by "ESC->G->o"
  • copy/paste the three RSA keys at the end of the file; you should have the one original key and three new keys in the end
  • save and exit the file by "ESC->:->wq!->Enter"
  • on meetup2 instance, do the same
  • on meetup3 instance, do the same
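
If you prefer not to paste the keys by hand in vi, the same step can be scripted. This is only a sketch under assumptions: append_keys is a hypothetical helper, and it assumes you uploaded the text file with the three public keys to the instance under a name of your choice (cluster_keys.txt below is just an example name). It skips keys that are already present, so running it twice does no harm.

```shell
#!/bin/sh
# append_keys is a hypothetical helper: it appends every non-empty line of a
# key file ($1) to an authorized_keys file ($2), skipping keys already there.
append_keys() {
    keys_file=$1
    auth_file=$2
    touch "$auth_file"
    while IFS= read -r key || [ -n "$key" ]; do
        # Skip blank lines.
        [ -n "$key" ] || continue
        # -F: fixed string, -x: match the whole line, -q: no output.
        if ! grep -qxF -- "$key" "$auth_file"; then
            printf '%s\n' "$key" >> "$auth_file"
        fi
    done < "$keys_file"
}

# Example, to be run on each instance (file name is an assumption):
#   append_keys ~/cluster_keys.txt ~/.ssh/authorized_keys
```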

4) configure "hosts" file for three instances

  • on meetup1 instance
  • sudo vi /etc/hosts
  • find the private IP of each instance in the EC2 console and add three lines to the end of the file, one per instance, each with the private IP followed by the name meetup1, meetup2 or meetup3
  • on meetup2 instance, do the same
  • on meetup3 instance, do the same
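
Each line in the hosts file is an IP address followed by a name. The sketch below prints the three lines so you can check them before adding them. The helper hosts_entries is hypothetical, and the 10.0.0.x addresses are placeholders, not real values; use the private IPs of your own instances.

```shell
#!/bin/sh
# hosts_entries is a hypothetical helper: given alternating
# "private-IP name" arguments, it prints one /etc/hosts line per pair.
hosts_entries() {
    while [ "$#" -ge 2 ]; do
        printf '%s %s\n' "$1" "$2"
        shift 2
    done
}

# Placeholder addresses for illustration only.
hosts_entries 10.0.0.11 meetup1 10.0.0.12 meetup2 10.0.0.13 meetup3

# To append the lines on an instance (needs root):
#   hosts_entries 10.0.0.11 meetup1 10.0.0.12 meetup2 10.0.0.13 meetup3 | sudo tee -a /etc/hosts
```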

5) test connections among cluster

  • from meetup1
  • ping meetup2
  • ping meetup3
  • from meetup2
  • ping meetup1
  • ping meetup3
  • from meetup3
  • ping meetup1
  • ping meetup2
  • ctrl +c to stop pinging
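
The pings can also be run in a small loop. A minimal sketch, assuming the names meetup1 to meetup3 already resolve through the hosts file; ping_peers is a hypothetical helper name.

```shell
#!/bin/sh
# ping_peers is a hypothetical helper: it pings each host name given as an
# argument and reports whether it answered. It returns non-zero if any
# host did not answer.
ping_peers() {
    status=0
    for host in "$@"; do
        # -c 3 sends three packets and then stops, so no ctrl+c is needed.
        if ping -c 3 "$host" > /dev/null 2>&1; then
            printf '%s: reachable\n' "$host"
        else
            printf '%s: NOT reachable\n' "$host"
            status=1
        fi
    done
    return "$status"
}

# From meetup1:
#   ping_peers meetup2 meetup3
```

If a host shows up as not reachable, double check the "All ICMP" rule in the security group and the lines in the hosts file.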

6) install Java

  • run the commands on each instance
  • sudo add-apt-repository ppa:webupd8team/java
  • sudo apt-get update
  • sudo apt-get install oracle-java7-installer (press "Enter" to select "ok", use the cursor to select the second "ok" and press "Enter" again)
  • sudo apt-get install oracle-java7-set-default
  • exit
  • ssh -i EC2UbuntuLTSThreeT1Micro.pem ubuntu@your_public_dns (log out and log in again so that your new configuration takes effect)
  • echo $JAVA_HOME (you should see "/usr/lib/jvm/java-7-oracle")

5. Hadoop installation and configuration

  • We are using stable Hadoop version 1.2.1 (2014-04-04)
  • the mirror is from Columbia Univ
  • all the operations below will be run on meetup1 (your master node)

1) download Hadoop source code

2) configure the environment

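  • configure Java Path
  • cd ~/hadoop-1.2.1/conf
  • vi hadoop-env.sh
  • search for the line "# The java implementation to use. Required."
  • on the "# export JAVA_HOME=..." line right below it, delete the "#"
  • set "export JAVA_HOME=/usr/lib/jvm/java-7-oracle"
  • save the file
  • configure core-site file
  • vi core-site.xml
  • between <configuration> and </configuration>, put
    <property><name>fs.default.name</name><value>hdfs://meetup1:9000</value></property>
    <property><name>hadoop.tmp.dir</name><value>/home/ubuntu/hadoop/tmp</value></property>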
  • make new folder "~/hadoop/tmp"
  • cd ~
  • mkdir hadoop
  • cd hadoop
  • mkdir tmp
  • configure replication: the value is usually set to 1 or 2 and should be no greater than the total number of slave nodes.
  • cd ~
  • cd hadoop-1.2.1/conf
  • vi hdfs-site.xml
  • between <configuration> and </configuration>, put
    <property><name>dfs.replication</name><value>2</value></property>
  • configure the masters file (vi masters)
  • delete localhost
  • put meetup1
  • configure the slaves file (vi slaves)
  • delete localhost
  • put meetup2 and meetup3, one per line
  • copy all the configuration from master node to two slave nodes
  • scp -r ~/hadoop-1.2.1 meetup2:/home/ubuntu
  • scp -r ~/hadoop-1.2.1 meetup3:/home/ubuntu
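
The file edits in this step (everything before the scp copy) can also be scripted instead of typed in vi. The sketch below is one possible way, not the workshop's method: write_hadoop_conf is a hypothetical helper that overwrites core-site.xml, hdfs-site.xml, masters and slaves in the folder you pass in, using the property names and values from the steps above. It does not touch hadoop-env.sh.

```shell
#!/bin/sh
# write_hadoop_conf is a hypothetical helper: it writes the four config
# files into the directory given as $1 (existing files are overwritten).
write_hadoop_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir"

    # NameNode address and Hadoop temp folder.
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://meetup1:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/ubuntu/hadoop/tmp</value>
  </property>
</configuration>
EOF

    # Number of copies kept of each HDFS block.
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

    # One host name per line.
    printf 'meetup1\n' > "$conf_dir/masters"
    printf 'meetup2\nmeetup3\n' > "$conf_dir/slaves"
}

# Example, on meetup1:
#   write_hadoop_conf ~/hadoop-1.2.1/conf
```

Try it on a scratch folder first (for example "write_hadoop_conf /tmp/conf-test") and compare the result with what you typed by hand.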

3) format

  • ~/hadoop-1.2.1/bin/hadoop namenode -format
  • you should get "Storage directory /home/ubuntu/hadoop/tmp/dfs/name has been successfully formatted."
  • cd ~/hadoop/tmp/
  • you should find the two folders
  • dfs->name->current
  • dfs->name->image

4) start your hadoop

  • ~/hadoop-1.2.1/bin/start-all.sh
  • test whether Hadoop is running
  • run "jps" on three instances
  • on your master node, you should see (the numbers will vary):
    • 6919 NameNode
    • 7237 JobTracker
    • 7155 SecondaryNameNode
    • 7445 Jps
  • on the two slave nodes, you should see (the numbers will vary):
    • 7653 TaskTracker
    • 7490 DataNode
    • 7713 Jps
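
Checking the jps output by eye works fine for three machines, but it can be automated too. A minimal sketch: missing_daemons is a hypothetical helper that reads jps output and prints the names of the expected daemons that are not running. It prints nothing when everything is up.

```shell
#!/bin/sh
# missing_daemons is a hypothetical helper: it reads `jps` output on stdin
# and prints every expected daemon name (given as arguments) that is absent.
missing_daemons() {
    jps_output=$(cat)
    for daemon in "$@"; do
        # jps prints "<pid> <name>" per line; compare the name in column 2
        # exactly, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$daemon"
        fi
    done
}

# On the master node:
#   jps | missing_daemons NameNode SecondaryNameNode JobTracker
# On each slave node:
#   jps | missing_daemons DataNode TaskTracker
```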

5) stop your hadoop

  • ~/hadoop-1.2.1/bin/stop-all.sh

6. Congratulations! You have your first Hadoop cluster!

Extra note on your hadoop log

  • you can find log files under "cd ~/hadoop/tmp/mapred/local/userlogs/"
  • pick one job folder, such as "job_201404071413_0002"
  • pick one log file, such as "vi attempt_201404071413_0002_m_000001_3"

Other Useful Info Link:
REvolution RMR map reduce examples
Hadoop R for airline
Example rmr script to calculate average departure delays per month for each airline
Kmeans implementation in R
Finding Frequent Itemsets
Facebook social network plot made by R and Hadoop

Wiki page for Gradient_descent
Wiki page for Logistic Function
MIT page to explain using logistic function for Gradient Descent

About Author

Vivian Zhang

Vivian is a data scientist who has been devoted to the analytics industry and the development and use of data technologies for several years. She obtained expertise in data analysis and data management as a Senior Analyst and...