Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark

Scott Edenbaum

Posted on May 6, 2017

Before we go ahead with our installation of Spark, let's confirm Hadoop is running through its web interface.

Since SME1 is set as the cluster NameNode, the Hadoop web interface is available at two URLs: http://SME1:8088 and http://SME1:50070

The interface should show that all 5 nodes in the cluster registered correctly as DataNodes. Now, on to Spark!
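If you prefer to check from a terminal rather than a browser, the same two addresses can be requested with curl. This is only a rough sketch: the function names `hadoop_ui_urls` and `check_hadoop_ui` are made up for illustration, and it assumes curl is installed on the node you run it from.

```shell
# Print the two Hadoop web UI addresses for a given NameNode host.
hadoop_ui_urls() {
    printf 'http://%s:8088\nhttp://%s:50070\n' "$1" "$1"
}

# Request each address and report whether the server answered.
# (Hypothetical helper; it only tests reachability, not DataNode counts.)
check_hadoop_ui() {
    for url in $(hadoop_ui_urls "$1"); do
        if curl -s -o /dev/null --max-time 10 "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
        fi
    done
}

# Usage on the cluster:
#   check_hadoop_ui SME1
```

A reachable page only tells you the service is up; you still need to open the page itself to see how many DataNodes registered.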

First, we need to download Spark from the Apache website http://www-us.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz

From our master node (SME5) download the compressed file from the link above with wget:

`wget http://www-us.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz`

Decompress the package with tar:

`tar -xvf spark-2.1.1-bin-hadoop2.7.tgz`

Then move the newly unpacked directory to /usr/local/spark:

`mv spark-2.1.1-bin-hadoop2.7 /usr/local/spark`
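Since the version number appears in the URL, the archive name, and the unpacked directory name, it is easy to mistype one of them. One way to avoid that is to build all three from a single variable. The sketch below assumes the archive unpacks to a directory with the same name as the file minus `.tgz`; the variable and function names are my own.

```shell
# Build every name from one version string so they cannot drift apart.
SPARK_VERSION="2.1.1"
HADOOP_PROFILE="hadoop2.7"
SPARK_DIR="spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"
SPARK_TGZ="${SPARK_DIR}.tgz"
SPARK_URL="http://www-us.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_TGZ}"

# Download, unpack, and move Spark into place; stops at the first failure.
# (Hypothetical helper; you may need sufficient permissions on /usr/local.)
install_spark() {
    wget "$SPARK_URL" &&
    tar -xvf "$SPARK_TGZ" &&
    mv "$SPARK_DIR" /usr/local/spark
}

# Usage on the master node:
#   install_spark
```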

Now let's edit /usr/local/spark/conf/spark-env.sh and make some adjustments for our cluster resources. If the file does not exist yet, copy it from spark-env.sh.template in the same directory.

Also include the following two lines in spark-env.sh:

`export PYSPARK_PYTHON=/usr/bin/python3`

`export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython`
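The resource adjustments themselves are not listed in the text above, so here is a rough sketch of what they might look like. `SPARK_MASTER_HOST`, `SPARK_WORKER_MEMORY`, and `SPARK_WORKER_CORES` are standard Spark standalone settings, but the values below are my assumptions for a Pi with little RAM, not the author's actual numbers. The `write_spark_env` helper is also just for illustration.

```shell
# Append example resource settings to spark-env.sh in the given conf directory.
# The values are illustrative assumptions; tune them to your own hardware.
write_spark_env() {
    conf_dir="${1:-/usr/local/spark/conf}"
    cat >> "${conf_dir}/spark-env.sh" <<'EOF'
export SPARK_MASTER_HOST=SME5     # master node named in this post
export SPARK_WORKER_MEMORY=512m   # assumed value, not from the original post
export SPARK_WORKER_CORES=2       # assumed value, not from the original post
EOF
}

# Usage on the master node:
#   write_spark_env /usr/local/spark/conf
```

Because the helper appends rather than overwrites, any lines already in spark-env.sh are left in place.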

Now create a `master` file in /usr/local/spark/conf/

Either use nano:

`nano /usr/local/spark/conf/master`

and add the cluster master node:

`SME5`

or use echo:

`echo SME5 > /usr/local/spark/conf/master`

Now that Spark is configured (aside from the slaves file on the master node), we can copy our Spark installation to the other nodes with the scp command in a bash for loop. For my cluster, I used the following command:

`for i in 1 2 3 4 5; do scp -r /usr/local/spark/ 'sme'$i:/usr/local/; done;`

Note, this may take a while as it copies the files to each node one at a time. Also, if there are any errors, you may need to use the chown or chmod commands to adjust file permissions/ownership.
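Because the copy is slow, it can help to preview the commands before running them for real. Below is a small sketch of the same loop wrapped in a function with a dry-run switch. The `copy_spark` name and the `DRY_RUN` variable are my own invention, and the `sme<N>` hostnames follow the pattern used in the loop above.

```shell
# Copy the Spark directory to each node number given as an argument.
# With DRY_RUN set to a non-empty value, the scp commands are only printed.
copy_spark() {
    run="${DRY_RUN:+echo}"
    for i in "$@"; do
        $run scp -r /usr/local/spark/ "sme${i}:/usr/local/"
    done
}

# Preview the commands for nodes 1 to 4:
#   DRY_RUN=1 copy_spark 1 2 3 4
# Run the copy for real:
#   copy_spark 1 2 3 4
```

Passing the node numbers as arguments also makes it easy to retry a single node if one copy fails.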

Alternatively, you could just download, unpack, move, and edit the Spark configuration files on each node, but who has time for that?

At this point each node in the cluster should have an identical installation of Spark in the /usr/local/spark directory.

On your master node, run the following command with nano to create the slaves file:

`nano /usr/local/spark/conf/slaves`

Note, since this is a very low memory environment, I kept the Spark master node out of the slaves list so it won't double as a worker node, the way the Hadoop NameNode on SME1 doubles as a DataNode.
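The slaves file is simply a list of worker hostnames, one per line. As a sketch, it can also be written without an editor. I am assuming here that the workers are SME1 through SME4 (the five nodes minus the master, SME5); adjust the names to match your own cluster. The `write_slaves` helper is hypothetical.

```shell
# Write one worker hostname per line into the slaves file
# inside the given conf directory, replacing any existing file.
write_slaves() {
    conf_dir="$1"
    shift
    printf '%s\n' "$@" > "${conf_dir}/slaves"
}

# Usage on the master node (hostnames are an assumption):
#   write_slaves /usr/local/spark/conf SME1 SME2 SME3 SME4
```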

Let's test PySpark on each node before we try it in cluster mode.

Change directories to /usr/local/spark, then run PySpark with the following commands:

`cd /usr/local/spark`

`./bin/pyspark`

If everything works, the PySpark shell will start and display the Spark logo.

Time to start our Spark cluster! From the master node run:

`cd /usr/local/spark`

`./sbin/start-all.sh`

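The Spark web UI should load on http://<master>:8080 if everything is working. Port 7077 is the address workers and applications use to reach the master (spark://<master>:7077), not the web UI.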
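By default, the standalone master serves its web UI on port 8080 and accepts workers and applications on port 7077. Here is a quick sketch for checking the UI from a terminal and for building the address a PySpark shell would connect to. The function names are made up, it assumes the default ports have not been changed, and it assumes curl is installed.

```shell
# Master node named in this post.
MASTER_HOST="SME5"

# Address that workers and applications use to reach the master.
master_url() { printf 'spark://%s:7077\n' "$1"; }

# Address of the master's web UI.
ui_url() { printf 'http://%s:8080\n' "$1"; }

# Report whether the web UI answers (hypothetical helper).
check_spark_ui() {
    if curl -sf -o /dev/null --max-time 10 "$(ui_url "$1")"; then
        echo "Spark web UI reachable at $(ui_url "$1")"
    else
        echo "Spark web UI not reachable at $(ui_url "$1")"
    fi
}

# Usage on the cluster:
#   check_spark_ui "$MASTER_HOST"
#   /usr/local/spark/bin/pyspark --master "$(master_url "$MASTER_HOST")"
```

Starting PySpark with `--master` pointed at the cluster is what makes the shell use the workers instead of running locally.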

Spark works! Continue to the next page for the final setup of Jupyter-notebooks.


Jair June 13, 2017

Hi Scott, Thanks for posting. As a heads up, "sudo pip3 install ipython3" didn't work for me. However, "sudo pip3 install ipython" seems to work fine. Jair
