Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Most of the Jupyter Notebook configuration has already taken place earlier; the relevant environment variables have already been added to the .bashrc file (among them PYSPARK_DRIVER_PYTHON=jupyter, which is what makes pyspark launch the notebook rather than the plain shell).
It takes one long command to run our notebook from the master node:
PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip sme5" /usr/local/spark/bin/pyspark \
--master spark://sme5:7077 --driver-memory 500m --executor-memory 500m
If we don't include the parameter --master spark://<master>:7077, PySpark will only run locally, not on the cluster!
Jupyter Notebook defaults to port 8888, so if another application is already using that port, pass the --port flag to pick a different one for the web interface.
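For example, here is a variant of the launch command above that serves the notebook on port 8889 instead (--port is a standard Jupyter flag; everything else is unchanged):
PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip sme5 --port 8889" /usr/local/spark/bin/pyspark \
--master spark://sme5:7077 --driver-memory 500m --executor-memory 500m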
Copy and paste the full HTTP URL (token included) into a browser to load the Jupyter Notebook web interface. Start up a Python 3 notebook as normal; just be sure to import pyspark!
Let's see it in action! Below is a quick example of reading in some text files (acquired from Project Gutenberg) that have been stored in Hadoop's HDFS and counting the number of lines. There is also a basic MapReduce example generating a list of words and their occurrences throughout the texts. Lastly, I included a simple parallel calculation of pi with increasing sample size.
Add files to HDFS with this command:
hadoop fs -put /path/to/files /path/on/HDFS
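Here is roughly what those examples look like in the notebook. This is a minimal sketch: it assumes the Gutenberg texts were uploaded to an HDFS directory called /gutenberg and that the NameNode listens on the default port 9000 on the master (adjust the hdfs:// URI to match your setup). The SparkContext sc is created for us automatically when PySpark launches the notebook.
import random

# 1) Read the text files from HDFS and count the lines
lines = sc.textFile("hdfs://sme5:9000/gutenberg/*.txt")
print(lines.count())

# 2) Basic map-reduce word count: emit (word, 1) pairs, then sum the counts per word
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.takeOrdered(10, key=lambda pair: -pair[1]))  # ten most frequent words

# 3) Monte Carlo estimate of pi, parallelized across the workers, with increasing sample size
def inside(_):
    x, y = random.random(), random.random()
    return x * x + y * y < 1

for n in (10**5, 10**6, 10**7):
    hits = sc.parallelize(range(n)).filter(inside).count()
    print(n, 4.0 * hits / n)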
Also, when we go to the Spark master's main web interface (port 8080), it will show PySparkShell as a running application.
Thoughts & Conclusion
This project gave me a great 'under the hood' view of Spark and Hadoop. Together they form an incredibly flexible and scalable platform, and a great way to work with vast amounts of data on 'commodity' computer hardware.
Now I have a portable 'desktop' cluster I can keep working on even during an outage at the popular Big Data cloud service providers, and one I can bring along for demonstrations. When I first discussed this project, the main criticism was that the cluster is too underpowered for practical use. Counterintuitive as it sounds, working with Big Data techniques on 'small' hardware seems like a great opportunity to learn more efficient techniques, rather than relying on 'greedy' algorithms and hoping there is enough computing horsepower to compensate.
An alternative method for setting up the cluster involves loading the software from a Docker image. Although I won't go into full details, and there is no HDFS support, the image can be found at edenbaus/pyspark-notebook-arm (https://hub.docker.com/r/edenbaus/pyspark-notebook-arm/).
Create a Docker swarm on your master node
Create an attachable network for the swarm on a separate subnet (I chose armnet)
Have the workers join the swarm
Load the Docker image on each node in the cluster (the commands are sketched below)
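The first three steps might look something like this; treat it as a sketch, assuming the master's address is 10.0.100.10 and using the join token that docker swarm init prints (your token and addresses will differ):
# on the master
docker swarm init
docker network create --driver overlay --attachable --subnet 10.0.100.0/24 armnet
# on each worker, using the token printed by swarm init
docker swarm join --token <token> 10.0.100.10:2377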
Then run the image on each node; note that each published port needs its own -p flag:
docker run -it -p 8080:8080 -p 8888:8888 -p 8088:8088 -p 7077:7077 --network armnet edenbaus/pyspark-notebook-arm
Run ifconfig on the master to find its IP address (e.g. 10.0.100.10), then add it to the /etc/hosts file on each node with this command:
echo 10.0.100.10 spark-master >> /etc/hosts
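Note that the >> redirection needs write access to /etc/hosts, and putting sudo in front of echo won't help because the shell performs the redirection before sudo applies. If you aren't running as root, pipe through sudo tee instead:
echo "10.0.100.10 spark-master" | sudo tee -a /etc/hosts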
Finally, start Spark and launch the notebook:
/opt/spark/sbin/start-all.sh
PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip 0.0.0.0 --port 8889" \
/usr/local/spark/bin/pyspark --master spark://spark-master:7077 \
--driver-memory 500m --executor-memory 500m
Going forward, I plan on running a series of benchmarks to test the performance of this Raspberry Pi cluster under various workloads and with varying numbers of workers attached, as well as doing some more serious data crunching for future projects and data fun.
With projects like BerryNet (https://github.com/DT42/BerryNet) offering deep-learning image recognition that runs in real time on a Raspberry Pi 3, the opportunities for standalone machine learning applications with the resources of multiple Raspberry Pi 3 computers (and similar single-board computers) seem numerous and appealing!