Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Networking - always 1 letter away from 'Not Working'
Assuming you have your Pi 3 computers connected to your network through a wired router that acts as a DHCP server, you can follow along nearly identically.
Most consumer routers can display the devices currently attached to them, including each device's hostname (which we specified in the raspi-config settings), IP address, and MAC address.
Here's what the attached-devices page looks like for my router; yours will look different depending on the vendor and model. You will also need your router's IP address and login information. The router's IP address is typically the same as your computer's, with the last group of digits replaced by a 1. For example, my IP address is 10.0.0.55, so my router's IP address is 10.0.0.1.
Notice the convenient consecutive IP addresses for the devices with hostnames SME1-SME5? That was intentional, but it is optional. There are many ways to assign static IP addresses to your computers/Pis; I chose to configure them at the router level through my router's web interface. Note that I intentionally blocked out the device MAC addresses; they will be visible for you.
To keep the next steps easy, I suggest setting up a similar range of static IP addresses for your Pi computers, though since we can also reference them by hostname this isn't a required step.
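If your router doesn't support address reservations, you can instead pin the address on the Pi itself. Recent Raspbian releases handle addressing through the dhcpcd daemon, so a minimal sketch for /etc/dhcpcd.conf would look something like the following (the interface name and addresses are examples from my network; adjust them to yours):

# /etc/dhcpcd.conf - give the wired interface a fixed address
interface eth0
static ip_address=10.0.0.49/24
static routers=10.0.0.1
static domain_name_servers=10.0.0.1

Reboot the Pi (or restart the dhcpcd service) for the change to take effect.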
Now let's connect to the Pi computer.
In a terminal, use the ssh command to connect to the Pi. You can use either the IP address or the hostname with the .local extension. The default username is pi, and the default password is raspberry.
I used the following command to connect to my Raspberry Pi with IP 10.0.0.49, aka SME5:
ssh pi@sme5.local
If you connected successfully, you will be greeted with a shell prompt on the Pi.
Next we need to set up our /etc/hosts file so that each node can reach the others by hostname.
Edit the file with the following command:
sudo nano /etc/hosts
Once you have the file /etc/hosts open in the nano text editor, you can enter the IP addresses and hostnames for ALL of the Pi computers in the cluster. Below is my edited /etc/hosts.
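# /etc/hosts - keep the existing localhost lines, then append one line per node
# (entries reconstructed from the cluster's IP-to-hostname mapping shown later in this section)
10.0.0.49    sme5
10.0.0.50    sme4
10.0.0.51    sme3
10.0.0.52    sme2
10.0.0.53    sme1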
The above step will need to be repeated for each other Pi in the cluster.
Now we can create SSH keys to allow password-free SSH connections between the Pi computers. In your terminal, type the following command to generate the SSH key:
ssh-keygen -t rsa -C pi@hostname
This will create a directory ~/.ssh/ holding your new key pair. Public keys from other machines get appended to ~/.ssh/authorized_keys; you may have to create that file first.
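Assuming you accepted the default file name at the prompts, you can confirm the key pair exists:

ls ~/.ssh/
# id_rsa      <- private key, stays on this machine
# id_rsa.pub  <- public key, shared with the other nodes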
In my case, the key-generation command was:
ssh-keygen -t rsa -C pi@sme5
This step will need to be repeated on each of the other Pis; be sure to adjust the hostname for each one.
Time to share these keys amongst the cluster!
cat ~/.ssh/id_rsa.pub | ssh pi@hostname 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
You will be prompted to enter the password for the remote machine you're connecting to.
This line outputs the newly generated public SSH key, pipes it over ssh to the specified node, and appends it to that node's ~/.ssh/authorized_keys file.
You'll need to repeat this line, adjusting it for each of the other Pis in the cluster.
As an example, my cluster has the following format:
10.0.0.49 - SME5
10.0.0.50 - SME4
10.0.0.51 - SME3
10.0.0.52 - SME2
10.0.0.53 - SME1
cat ~/.ssh/id_rsa.pub | ssh pi@sme5 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme4 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme3 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme2 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh pi@sme1 'mkdir -p .ssh; cat >> .ssh/authorized_keys'
Alternatively, if your Pi hostnames follow a naming scheme with consecutive numbers as the suffix, you can use one line of code to send the SSH key to every worker at once.
for i in 1 2 3 4 5; do cat ~/.ssh/id_rsa.pub | ssh 'pi@sme'$i 'mkdir -p .ssh; cat >> .ssh/authorized_keys'; done;
Keep in mind, this took the ssh key on SME5 and sent it to SME1, SME2, SME3, SME4, and SME5. The above process needs to be repeated for SME1, SME2, SME3, and SME4.
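As an aside, OpenSSH ships a helper called ssh-copy-id that performs the same append-to-authorized_keys step for you. If you prefer it, a loop like this (using my hostnames; you will still be prompted for each password) should do the same job:

# Copy this node's public key to every node in the cluster
for i in 1 2 3 4 5; do ssh-copy-id pi@sme$i; done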
Now you can test the connectivity between nodes. Once connected to pi@sme5, I should be able to run ssh sme1 through ssh sme4 without needing to enter a password. Likewise, from pi@sme1 I should be able to run ssh sme2 without needing a password.
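A quick loop can spot-check every connection from whichever node you are on; this is just a convenience sketch using my hostnames:

# Print each node's hostname over ssh - a password prompt means a key is missing
for i in 1 2 3 4 5; do ssh pi@sme$i hostname; done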
If you are unable to verify that every Pi-to-Pi connection works without a password, go back through the previous steps and check for typos!
Networking is working! Let's update some software before our network changes its mind!
Software Updating
The Raspbian operating system is based on Debian Linux, which uses the apt-get (APT) package management system.
Log into each Pi in your cluster and run the following commands to update it:
sudo apt-get update && sudo apt-get upgrade
Type Y when prompted to continue with the software upgrade. Note that the number of packages to be installed will differ from the example above.
Connect to the remaining Pi computers in your cluster and run the same command to upgrade them too.
sudo apt-get update && sudo apt-get upgrade
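Since passwordless SSH is already in place, you can optionally push the upgrade to all the nodes from a single terminal instead of logging in to each one. A sketch, again assuming my sme1-sme5 hostnames:

# Update every node in turn, answering yes automatically
for i in 1 2 3 4 5; do ssh pi@sme$i 'sudo apt-get update && sudo apt-get -y upgrade'; done

This works because the pi user on Raspbian can run sudo without a password by default.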
Now let's install some dependencies.
On all of the Pi computers, install Java, the Python 3 dependencies, and pip3 with the following commands:
sudo apt-get -y install openjdk-8-jdk python3-pip python3-dev python-dev python-setuptools
sudo pip3 install -U pip
sudo pip3 install numpy pandas matplotlib ipython3
For the master node, in this case SME5, we'll also need to install Jupyter Notebook:
sudo pip3 install jupyter
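To reach the notebook from your desktop's browser rather than from the Pi itself, the server has to listen on more than localhost. One way, using standard Jupyter flags:

# Listen on all interfaces so other machines on the LAN can connect
jupyter notebook --ip=0.0.0.0 --no-browser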
Now we can move on to Hadoop & Spark.
You can download the Hadoop binary package from http://www-us.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz.
wget http://mirror.nus.edu.sg/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
# Unzip the downloaded file
tar xvfz hadoop-2.7.3.tar.gz
# Move the Hadoop installation to /usr/local/hadoop
sudo mv hadoop-2.7.3 /usr/local/hadoop
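As a quick sanity check that everything landed where expected, you can print the version banner. Hadoop needs JAVA_HOME set first; the readlink trick below simply resolves wherever the openjdk-8 package we installed earlier put Java:

# Point Hadoop at our Java install, then verify
export JAVA_HOME=$(readlink -f /usr/bin/java | sed 's:/bin/java::')
/usr/local/hadoop/bin/hadoop version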

