Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Networking - always 1 letter away from 'Not Working'
Assuming you have your Pi 3 computers connected to your network through a wired router that acts as a DHCP server, you can follow along nearly identically.
Most consumer routers can display the devices currently attached to them, including each device's hostname (which we specified in the raspi-config settings), IP address, and MAC address.
Here's what the attached-devices page looks like for my router; yours will look different depending on the vendor and model. You will also need your router's IP address and login information. The router's IP address is typically the same as your computer's, with the last group of digits replaced by a 1. For example, my IP address is 10.0.0.55, so my router's IP address is 10.0.0.1.
Notice the convenient consecutive IP addresses for the devices with hostnames SME1-SME5? That was intentional, but it is optional. There are many ways to assign static IP addresses to your computers/Pis; I chose to configure them at the router level through my router's web interface. Note that I intentionally blocked out the device MAC addresses; they will be visible for you.
To keep the next steps easy, I suggest setting up a similar range of static IP addresses for your Pi computers, though since we can also reference them by hostname this isn't a required step.
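If your router doesn't support address reservations, you can instead pin the address on the Pi itself. Recent Raspbian releases handle addressing through the dhcpcd daemon, so a minimal sketch for /etc/dhcpcd.conf would look something like the following (the interface name and addresses are examples from my network; adjust them to yours):

# /etc/dhcpcd.conf - give the wired interface a fixed address
interface eth0
static ip_address=10.0.0.49/24
static routers=10.0.0.1
static domain_name_servers=10.0.0.1

Reboot the Pi (or restart the dhcpcd service) for the change to take effect.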
Now let's connect to the Pi computer.
In a terminal, use the ssh command to connect to the Pi. You can use either the IP address or the hostname with the .local extension. The default username is pi, and the default password is raspberry.
I used the following command to connect to my Raspberry Pi with IP 10.0.0.49, aka SME5:
ssh pi@sme5.local
If you connected successfully, you will be greeted with a shell prompt on the Pi.
Next we need to set up our /etc/hosts file so that each node can reach the others by hostname.
Edit the file with the following command:
sudo nano /etc/hosts
Once you have the file /etc/hosts open in the nano text editor, you can enter the IP addresses and hostnames for ALL of the Pi computers in the cluster. Below is my edited /etc/hosts.
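# /etc/hosts - keep the existing localhost lines, then append one line per node
# (entries reconstructed from the cluster's IP-to-hostname mapping shown later in this section)
10.0.0.49    sme5
10.0.0.50    sme4
10.0.0.51    sme3
10.0.0.52    sme2
10.0.0.53    sme1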
The above step will need to be repeated for each other Pi in the cluster.
Now we can create SSH keys to allow password-free SSH connections between the Pi computers. In your terminal, type the following command to generate the SSH key:
ssh-keygen -t rsa -C pi@hostname
This will create a directory ~/.ssh/ holding your new key pair. Public keys from other machines get appended to ~/.ssh/authorized_keys; you may have to create that file first.
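Assuming you accepted the default file name at the prompts, you can confirm the key pair exists:

ls ~/.ssh/
# id_rsa      <- private key, stays on this machine
# id_rsa.pub  <- public key, shared with the other nodes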
In my case, the key-generation command was:
ssh-keygen -t rsa -C pi@sme5
This step will need to be repeated on each of the other Pis; be sure to adjust the hostname for each one.
Time to share these keys amongst the cluster!
cat ~/.ssh/id_rsa.pub | ssh pi@hostname 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
You will be prompted to enter the password for the remote machine you're connecting to.
This line outputs the newly generated public SSH key, pipes it over ssh to the specified node, and appends it to that node's ~/.ssh/authorized_keys file.
You'll need to repeat this line, adjusting it for each of the other Pis in the cluster.
As an example, my cluster has the following format:
10.0.0.49 - SME5
10.0.0.50 - SME4
10.0.0.51 - SME3
10.0.0.52 - SME2
10.0.0.53 - SME1
cat ~/.ssh/id_rsa.pub | ssh pi@sme5 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme4 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme3 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme2 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme1 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
Alternatively, if your Pi hostnames follow a naming scheme with consecutive numbers as the suffix, you can use one line of code to send the SSH key to every worker at once.
for i in 1 2 3 4 5; do cat ~/.ssh/id_rsa.pub | ssh 'pi@sme'$i 'mkdir -p .ssh; cat >> .ssh/authorized_keys'; done;
Keep in mind, this took the ssh key on SME5 and sent it to SME1, SME2, SME3, SME4, and SME5. The above process needs to be repeated for SME1, SME2, SME3, and SME4.
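As an aside, OpenSSH ships a helper called ssh-copy-id that performs the same append-to-authorized_keys step for you. If you prefer it, a loop like this (using my hostnames; you will still be prompted for each password) should do the same job:

# Copy this node's public key to every node in the cluster
for i in 1 2 3 4 5; do ssh-copy-id pi@sme$i; done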
Now you can test the connectivity between nodes. Once connected to pi@sme5, I should be able to run ssh sme1 through ssh sme4 without needing to enter a password. Likewise, from pi@sme1 I should be able to run ssh sme2 without needing a password.
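A quick loop can spot-check every connection from whichever node you are on; this is just a convenience sketch using my hostnames:

# Print each node's hostname over ssh - a password prompt means a key is missing
for i in 1 2 3 4 5; do ssh pi@sme$i hostname; done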
If you are unable to verify that every Pi-to-Pi connection works without a password, go back through the previous steps and check for typos!
Networking is working! Let's update some software before our network changes its mind!
Software Updating
The Raspbian operating system is based on Debian Linux, which uses the apt-get (APT) package management system.
Log into each Pi in your cluster and run the following commands to update it:
sudo apt-get update && sudo apt-get upgrade
Type Y when prompted to continue with the software upgrade. Note that the number of packages to be installed will differ from the example above.
Connect to the remaining Pi computers in your cluster and run the same command to upgrade them too.
sudo apt-get update && sudo apt-get upgrade
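Since passwordless SSH is already in place, you can optionally push the upgrade to all the nodes from a single terminal instead of logging in to each one. A sketch, again assuming my sme1-sme5 hostnames:

# Update every node in turn, answering yes automatically
for i in 1 2 3 4 5; do ssh pi@sme$i 'sudo apt-get update && sudo apt-get -y upgrade'; done

This works because the pi user on Raspbian can run sudo without a password by default.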
Now let's install some dependencies.
On all of the Pi computers, install Java, the Python 3 dependencies, and pip3 with the following commands:
sudo apt-get -y install openjdk-8-jdk python3-pip python3-dev python-dev python-setuptools
sudo pip3 install -U pip
sudo pip3 install numpy pandas matplotlib ipython3
For the master node, in this case SME5, we'll also need to install Jupyter Notebook:
sudo pip3 install jupyter
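To reach the notebook from your desktop's browser rather than from the Pi itself, the server has to listen on more than localhost. One way, using standard Jupyter flags:

# Listen on all interfaces so other machines on the LAN can connect
jupyter notebook --ip=0.0.0.0 --no-browser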
Now we can move on to Hadoop & Spark.
You can download the Hadoop binary package from http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.
wget http://mirror.nus.edu.sg/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# Unzip the downloaded file
tar xvfz hadoop-2.7.3.tar.gz
# Move the Hadoop installation to /usr/local/hadoop
sudo mv hadoop-2.7.3 /usr/local/hadoop
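As a quick sanity check that everything landed where expected, you can print the version banner. Hadoop needs JAVA_HOME set first; the readlink trick below simply resolves wherever the openjdk-8 package we installed earlier put Java:

# Point Hadoop at our Java install, then verify
export JAVA_HOME=$(readlink -f /usr/bin/java | sed 's:/bin/java::')
/usr/local/hadoop/bin/hadoop version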

