Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Raspberry Pi 3
Specifications
Hardware
At a mere 3.37” x 2.24” (85.6mm x 56.5mm), weighing 1.6 oz, and requiring a 5V power source at ~2.5A during max draw, the Pi is both physically small and light on power requirements. Add in the option of passive cooling (i.e., metal heatsinks) and the $35 price tag, and you end up with a very capable tool with seemingly unlimited opportunities for deployment in a variety of physical environments.
A 1.2 GHz quad-core ARM CPU with 1 GB of RAM drives the Pi's computational performance. These specifications are comparable to a high-end cell phone from ~2012 (similar to a Samsung Galaxy S3). In terms of connectivity, the Pi 3 has a 10/100 Mbps Ethernet NIC, four USB 2.0 ports, 802.11n Wi-Fi, and Bluetooth 4.1.
Software
On a foundational level, there is not much difference between a Raspberry Pi and a typical desktop/laptop computer. The CPU is of the ARM architecture, so typical software built for the x86 architecture (i.e., Intel and AMD CPUs) will not execute. With that in mind, the Raspberry Pi Foundation openly recommends a community-developed port of Debian Linux: Raspbian.
From a data science perspective, Python, R, SQL, numpy, pandas, scikit-learn, tensorflow: nearly all of the most popular libraries/packages have been built for the ARM architecture and support the Raspberry Pi. Even the containerization platform Docker has recently released a version compiled specifically for the Raspberry Pi. Best of all, the ever-popular Hadoop and Spark distributed computing platforms run on the JVM, which supports ARM, so they too can run on the Raspberry Pi.
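For instance, on a Pi running Raspbian you can confirm the CPU architecture and pull in part of the usual Python data stack straight from the package repositories (the package names here are the standard Debian/Raspbian ones; exact availability varies a little by release):
  uname -m    # prints armv7l on a Pi 3 running 32-bit Raspbian
  sudo apt-get install python3-numpy python3-pandas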
Project Details & Motivations
Aside from my clearly biased opinion in favor of the Pi 3, I have found these computers incredibly useful. Currently I use a Raspberry Pi 3 running Kodi as a media center for my home, sharing multimedia content to my television over its HDMI connection. I also have its smaller cousin, the Pi Zero, connecting a USB printer to my Wi-Fi network, enabling wireless printing from all the networked computers. During my time enrolled as a student with the NYC Data Science Academy, I regularly used a Raspberry Pi 3 as a MySQL server hosting project databases, or to host a Jupyter Notebook or RStudio Server for running Python and R code without the need to install any software on my laptop.
Once the coursework moved towards Big Data and I was introduced to Hadoop and Spark, I understood their growing importance and looked for the best options for developing in a distributed environment. There are two clear ways to access a distributed computing environment for Hadoop & Spark: either through virtualization or by running directly on 'bare metal' hardware.
There are numerous online vendors providing access to virtualized computer clusters for operating Hadoop/Spark, but they all operate on a subscription pricing model, and the free tiers are severely limited. The remaining option is to run the software directly on hardware, with each node on a separate computer. Normally this is a pricey endeavor, at $400 or more for even a modest computer, but the Raspberry Pi 3 presents a unique and cost-effective solution.
Project Scope
I decided the best course of action would be to build a cluster using 5 Raspberry Pi 3 computers: 4 functioning as worker/executor nodes, and 1 as the master/driver node. In addition, there are two main ways to set up the software: one is to run it directly on the Pi, the other is to run the code in a Docker container. I'll run through how to get Hadoop/Spark up and running both ways.
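To make the layout concrete, this is the sort of /etc/hosts mapping the finished cluster will use (the sme* hostnames follow my naming scheme below, and the static IPs are purely illustrative; the actual network configuration comes on the next page):
  192.168.0.101    sme1    # master/driver
  192.168.0.102    sme2    # worker/executor
  192.168.0.103    sme3    # worker/executor
  192.168.0.104    sme4    # worker/executor
  192.168.0.105    sme5    # worker/executor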
Execution
Direct Installation
Before we can get to the fun part of installing Hadoop and Spark, we must install Raspbian Linux and set up our Pis!
- Grab a copy of Raspbian Linux: https://www.raspberrypi.org/downloads/raspbian/
- Write the Raspbian Linux image file to the Raspberry Pi's microSD card.
  - On a Mac: plug in your microSD card (a USB adapter may be necessary)
  - Open Disk Utility
  - Take note of the 'Device' value corresponding to the memory card; for a computer with a single internal drive, the microSD card should be assigned /dev/disk2 (you can confirm this from the terminal, as shown below)
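  You can also list the attached disks from the terminal, which reports the same device identifiers Disk Utility uses:
    diskutil list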
  - Unmount the card's volume within Disk Utility (unmount rather than eject, so the /dev/disk2 device stays available for writing; the terminal equivalent is below)
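  The same step from the terminal (again assuming the card is disk2):
    diskutil unmountDisk /dev/disk2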
  - Disk Utility should now show the card's volume as unmounted
  - Open a Terminal and run the following command:
    sudo dd bs=10m if=/path/to/download/2017-03-02-raspbian-jessie.img of=/dev/rdisk#
    (Where /path/to/download/ is the path to the .img file, and rdisk# corresponds to disk2 from Disk Utility before disconnecting the memory card; bs=10m sets the block size, and anything much lower will make this take a long time!)
  - RUN THIS COMMAND WITH EXTREME CAUTION! dd is nicknamed 'disk destroyer' for a reason; it's not hard to accidentally swap the order of 'if' and 'of' and end up deleting the wrong drive.
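  As one extra sanity check before writing (still assuming your card is disk2), confirm the device details describe your SD card and not an internal drive:
    diskutil info /dev/disk2
  While dd runs it prints nothing by default; pressing Ctrl+T in the Terminal sends it SIGINFO and makes it report its progress so far.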
- The latest version of Raspbian has the ssh server disabled by default. To enable it and allow headless operation, type the following command:
  touch /Volumes/boot/ssh
  We can verify this worked by typing ls /Volumes/boot and checking for the newly created 'ssh' file.
- Eject the microSD card, plug it into your Raspberry Pi 3, and fire it up!
- At this point you can connect an Ethernet cable, TV, and keyboard to the Raspberry Pi, or continue setting it up remotely.
- If everything went okay, you should be able to ssh into your Pi with the command:
  ssh pi@raspberrypi.local
  - Note: the default username is pi, and the password is raspberry.
  - Alternatively, if you are not able to ssh to the default hostname, you may need to log into your router to find the IP address of the Raspberry Pi (or scan for it, as shown below).
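  If mDNS isn't resolving, a quick (if blunt) way to spot the Pi from a Mac is a ping sweep of your subnet (192.168.0.x here is an assumption; adjust for your network) followed by a look through the ARP cache for the Raspberry Pi Foundation's MAC prefix, b8:27:eb:
    for i in $(seq 1 254); do ping -c 1 -t 1 192.168.0.$i >/dev/null 2>&1 & done; wait
    arp -a | grep -i b8:27:eb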
- Once connected, we need to change a few of the default settings.
- We will run the built-in configuration tool to set up the initial settings by typing:
  sudo raspi-config
  - Change the hostname; choose a pattern such as pi[1-5]
  - I used my initials SME followed by a digit for each Pi (1 in this case, giving sme1)
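  If you'd rather make this change by hand than through the menus, the hostname lives in two files on Raspbian (sme1 is my naming; substitute your own):
    echo "sme1" | sudo tee /etc/hostname
    sudo sed -i "s/raspberrypi/sme1/" /etc/hosts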
  - Set the default boot environment to CLI
  - Set the locale settings (timezone, etc.)
  - Go to Advanced settings
  - Set the memory split to 16, the minimum; these nodes will run headless, so the GPU needs as little RAM as possible
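  raspi-config stores this value in /boot/config.txt, so you can confirm it took effect after the reboot:
    grep gpu_mem /boot/config.txt
  The expected output is a single line, gpu_mem=16.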
  - Finish and reboot!
- Now rinse and repeat for each of the remaining Pis in your cluster.
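If stepping through the menus five times feels tedious, recent raspi-config builds also ship a non-interactive mode that can script pieces of this; it is worth verifying that your Raspbian image supports it before relying on it. For example, setting the hostname on the second Pi:
  sudo raspi-config nonint do_hostname sme2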
Join me on the following page where we configure the network next!
Leave a Comment
Jair June 13, 2017
Hi Scott,
Thanks for posting. As a heads up, "sudo pip3 install ipython3" didn't work for me. However, "sudo pip3 install ipython" seems to work fine.
Jair