Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Raspberry Pi 3
Specifications
Hardware
At a mere 3.37” x 2.24” (85.6mm x 56.5mm), weighing 1.6 oz, and requiring a 5V power source at ~2.5A during max draw, the Pi is both physically small and light on power requirements. Add in the option of passive cooling (i.e., metal heatsinks) and the $35 price tag, and you end up with a very capable tool with seemingly unlimited opportunities for deployment in a variety of physical environments.
A 1.2 GHz quad-core ARM CPU with 1 GB of RAM drives the Pi's computational performance. These specifications are comparable to a high-end cell phone from ~2012 (similar to a Samsung Galaxy S3). In terms of connectivity, the Pi 3 has a 10/100 Mbps Ethernet NIC, four USB 2.0 ports, 802.11n Wi-Fi, and Bluetooth 4.1.
Software
On a foundational level, there is not much difference between a Raspberry Pi and a typical desktop/laptop computer. The CPU is of the ARM architecture, so typical software built for the x86 architecture (i.e., Intel and AMD CPUs) will not execute. With that in mind, the Raspberry Pi Foundation openly recommends a community-developed port of Debian Linux: Raspbian.
From a data science perspective, Python, R, SQL, numpy, pandas, scikit-learn, tensorflow: nearly all of the most popular libraries/packages have been built for the ARM architecture and support the Raspberry Pi. Even the containerization platform Docker has recently released a version compiled specifically for the Raspberry Pi. Best of all, the ever-popular Hadoop and Spark distributed computing platforms run on the JVM, which supports ARM, so they too can run on the Raspberry Pi.
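For instance, on a Pi running Raspbian you can confirm the CPU architecture and pull in part of the usual Python data stack straight from the package repositories (the package names here are the standard Debian/Raspbian ones; exact availability varies a little by release):
  uname -m    # prints armv7l on a Pi 3 running 32-bit Raspbian
  sudo apt-get install python3-numpy python3-pandas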
Project Details & Motivations
Aside from my clearly biased opinion in favor of the Pi 3, I have found these computers incredibly useful. Currently I use a Raspberry Pi 3 running Kodi as a media center for my home, sharing multimedia content to my television over its HDMI connection. I also have its smaller cousin, the Pi Zero, connecting a USB printer to my Wi-Fi network, enabling wireless printing from all the networked computers. During my time enrolled as a student with the NYC Data Science Academy, I regularly used a Raspberry Pi 3 as a MySQL server hosting project databases, or to host a Jupyter Notebook or RStudio Server for running Python and R code without the need to install any software on my laptop.
Once the coursework moved towards Big Data and I was introduced to Hadoop and Spark, I understood their growing importance and looked for the best options for developing in a distributed environment. There are two clear ways to access a distributed computing environment for Hadoop & Spark: either through virtualization or by running directly on 'bare metal' hardware.
There are numerous online vendors providing access to virtualized computer clusters for operating Hadoop/Spark, but they all operate on a subscription pricing model, and the free tiers are severely limited. The remaining option is to run the software directly on hardware, with each node on a separate computer. Normally this is a pricey endeavor, at $400 or more for even a modest computer, but the Raspberry Pi 3 presents a unique and cost-effective solution.
Project Scope
I decided the best course of action would be to build a cluster using 5 Raspberry Pi 3 computers: 4 functioning as worker/executor nodes, and 1 as the master/driver node. In addition, there are two main ways to set up the software: one is to run it directly on the Pi, the other is to run the code in a Docker container. I'll run through how to get Hadoop/Spark up and running both ways.
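To make the layout concrete, this is the sort of /etc/hosts mapping the finished cluster will use (the sme* hostnames follow my naming scheme below, and the static IPs are purely illustrative; the actual network configuration comes on the next page):
  192.168.0.101    sme1    # master/driver
  192.168.0.102    sme2    # worker/executor
  192.168.0.103    sme3    # worker/executor
  192.168.0.104    sme4    # worker/executor
  192.168.0.105    sme5    # worker/executor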
Execution
Direct Installation
Before we can get to the fun part of installing Hadoop and Spark, we must install Raspbian Linux and set up our Pis!
- Grab a copy of Raspbian Linux: https://www.raspberrypi.org/downloads/raspbian/
- Write the Raspbian Linux image file to the Raspberry Pi's microSD card.
  - On a Mac: plug in your microSD card (a USB adapter may be necessary)
  - Open Disk Utility
  - Take note of the 'Device' value corresponding to the memory card; for a computer with a single internal drive, the microSD card should be assigned /dev/disk2 (you can confirm this from the terminal, as shown below)
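  You can also list the attached disks from the terminal, which reports the same device identifiers Disk Utility uses:
    diskutil list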
  - Unmount the card's volume within Disk Utility (unmount rather than eject, so the /dev/disk2 device stays available for writing; the terminal equivalent is below)
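  The same step from the terminal (again assuming the card is disk2):
    diskutil unmountDisk /dev/disk2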
  - Disk Utility should now show the card's volume as unmounted
  - Open a Terminal and run the following command:
    sudo dd bs=10m if=/path/to/download/2017-03-02-raspbian-jessie.img of=/dev/rdisk#
    (Where /path/to/download/ is the path to the .img file, and rdisk# corresponds to disk2 from Disk Utility before disconnecting the memory card; bs=10m sets the block size, and anything much lower will make this take a long time!)
  - RUN THIS COMMAND WITH EXTREME CAUTION! dd is nicknamed 'disk destroyer' for a reason; it's not hard to accidentally swap the order of 'if' and 'of' and end up deleting the wrong drive.
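  As one extra sanity check before writing (still assuming your card is disk2), confirm the device details describe your SD card and not an internal drive:
    diskutil info /dev/disk2
  While dd runs it prints nothing by default; pressing Ctrl+T in the Terminal sends it SIGINFO and makes it report its progress so far.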
- The latest version of Raspbian has the ssh server disabled by default. To enable it and allow headless operation, type the following command:
  touch /Volumes/boot/ssh
  We can verify this worked by typing ls /Volumes/boot and checking for the newly created 'ssh' file.
- Eject the microSD card, plug it into your Raspberry Pi 3, and fire it up!
- At this point you can connect an Ethernet cable, TV, and keyboard to the Raspberry Pi, or continue setting it up remotely.
- If everything went okay, you should be able to ssh into your Pi with the command:
  ssh pi@raspberrypi.local
  - Note: the default username is pi, and the password is raspberry.
  - Alternatively, if you are not able to ssh to the default hostname, you may need to log into your router to find the IP address of the Raspberry Pi (or scan for it, as shown below).
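  If mDNS isn't resolving, a quick (if blunt) way to spot the Pi from a Mac is a ping sweep of your subnet (192.168.0.x here is an assumption; adjust for your network) followed by a look through the ARP cache for the Raspberry Pi Foundation's MAC prefix, b8:27:eb:
    for i in $(seq 1 254); do ping -c 1 -t 1 192.168.0.$i >/dev/null 2>&1 & done; wait
    arp -a | grep -i b8:27:eb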
- Once connected, we need to change a few of the default settings.
- We will run the built-in configuration tool to set up the initial settings by typing:
  sudo raspi-config
  - Change the hostname; choose a pattern such as pi[1-5]
  - I used my initials SME followed by a digit for each Pi (1 in this case, giving sme1)
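  If you'd rather make this change by hand than through the menus, the hostname lives in two files on Raspbian (sme1 is my naming; substitute your own):
    echo "sme1" | sudo tee /etc/hostname
    sudo sed -i "s/raspberrypi/sme1/" /etc/hosts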
  - Set the default boot environment to CLI
  - Set the locale settings (timezone, etc.)
  - Go to Advanced settings
  - Set the memory split to 16, the minimum; these nodes will run headless, so the GPU needs as little RAM as possible
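  raspi-config stores this value in /boot/config.txt, so you can confirm it took effect after the reboot:
    grep gpu_mem /boot/config.txt
  The expected output is a single line, gpu_mem=16.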
  - Finish and reboot!
- Now rinse and repeat for each of the remaining Pis in your cluster.
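If stepping through the menus five times feels tedious, recent raspi-config builds also ship a non-interactive mode that can script pieces of this; it is worth verifying that your Raspbian image supports it before relying on it. For example, setting the hostname on the second Pi:
  sudo raspi-config nonint do_hostname sme2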
Join me on the following page where we configure the network next!
Leave a Comment
Jair June 13, 2017
Hi Scott,
Thanks for posting. As a heads up, "sudo pip3 install ipython3" didn't work for me. However, "sudo pip3 install ipython" seems to work fine.
Jair