Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Edit the Hadoop configuration
To configure Hadoop, we need to edit the following files:
- ~/.bashrc: To set the environment variables needed by Hadoop and Spark.
- <hadoop_directory>/etc/hadoop/hadoop-env.sh: To set Hadoop-specific environment variables, including JAVA_HOME.
- <hadoop_directory>/etc/hadoop/core-site.xml: To set site-specific properties, including fs.default.name, which contains the path and port of the NameNode.
- <hadoop_directory>/etc/hadoop/hdfs-site.xml: This has to be configured on each node in the cluster. We will need to specify the directories used by the NameNode (on the master node) and the DataNode (on all nodes).
- <hadoop_directory>/etc/hadoop/yarn-site.xml: To set the YARN configuration.
- <hadoop_directory>/etc/hadoop/mapred-site.xml: To set the MapReduce configuration.
- <hadoop_directory>/etc/hadoop/slaves: To list the worker nodes; for the master node only.
.bashrc (note: this file also contains entries for Spark's configuration)
Use nano ~/.bashrc to edit the file and enter the following lines at the top. Included are the environment variables necessary for Hadoop and Spark; setting a constant value across the cluster for PYTHONHASHSEED stops Spark from crashing during operations that call the random number generator, such as .distinct().
export PYTHONHASHSEED=123
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython3
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
After saving your changes (ctrl + x, then y), apply the updates to your current session with the following command:
source ~/.bashrc
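As a quick sanity check that the new variables are in place (a sketch; this assumes Hadoop was unpacked to /usr/local/hadoop as set above), you can print one of the variables and ask Hadoop for its version:
echo $HADOOP_INSTALL
hadoop version
If hadoop version prints the installed release, the PATH entries are working.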
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
In hadoop-env.sh, the line we need to amend is the one that exports the JAVA_HOME variable. Change it to the corresponding value on your machines. For my machines, I made the amendment as below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
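If you are not sure where Java lives on your Pis, one way to find out (assuming java is on the PATH) is to resolve the symlink behind the java binary:
readlink -f $(which java)
The output ends in .../bin/java; JAVA_HOME is the installation directory at the front of that path, which is /usr/lib/jvm/java-8-openjdk-armhf here.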
/usr/local/hadoop/etc/hadoop/core-site.xml
In core-site.xml, we need to specify the NameNode's address. Enter the following content between the <configuration> and </configuration> tags.
<property>
<name>fs.default.name</name>
<value>hdfs://<URI_of_Namenode>:<Port_of_Namenode></value>
</property>
Below is the core-site.xml currently on my cluster. I set sme1 as the NameNode (IP 10.0.0.53), so the additional overhead is kept off the Spark master node, sme5.
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.0.53:9000</value>
</property>
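Once the file is saved, a quick check that the XML is well formed and being read is to ask Hadoop which NameNode address it picked up:
hdfs getconf -confKey fs.default.name
This should print hdfs://10.0.0.53:9000 (or whichever address you configured).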
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
hdfs-site.xml has to be configured on each of your nodes. It specifies the directories which are used for the DataNode and the NameNode on each node.
Now, we need to create two directories for the NameNode and the DataNode:
mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
Note, if you use different directories, adjust your hdfs-site.xml file accordingly.
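Depending on which user runs Hadoop, these directories may also need to be writable by that user. A sketch, assuming Hadoop runs as the default pi user:
sudo chown -R pi:pi /usr/local/hadoop_store   # adjust pi:pi to the user/group that runs Hadoop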
Now, add the following content between the <configuration> and </configuration> tags in the hdfs-site.xml file:
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
Aside from setting the directory paths for the NameNode and DataNode, the other property we need to set is dfs.replication. This integer is the number of copies HDFS keeps of each block; here it is set equal to the number of nodes in the cluster, 5 in my case.
/usr/local/hadoop/etc/hadoop/yarn-site.xml
The yarn-site.xml file contains our YARN configuration. This example is rather minimal because the cluster relies on Hadoop for its HDFS filesystem rather than for the YARN scheduler.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
/usr/local/hadoop/etc/hadoop/mapred-site.xml
This file holds Hadoop's MapReduce configuration. Again, this is a minimal example because we are not using Hadoop to run MapReduce jobs on the cluster.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
/usr/local/hadoop/etc/hadoop/slaves
This file contains the hostnames of the worker nodes and only needs to be stored on the master node, in this case sme5.
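As a sketch of what the slaves file might contain, assuming the five Pis are named sme1 through sme5 and every node should run a DataNode (which dfs.replication = 5 requires), the file simply lists one hostname per line:
sme1
sme2
sme3
sme4
sme5
If you only want DataNodes on a subset of nodes, list only those hostnames and lower dfs.replication to match.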
Format the New File System
sudo hdfs namenode -format
NOTE: This step is only needed when you first set up your HDFS, as formatting erases all existing data. Run it on the NameNode (sme1 in my case).
Start HDFS
start-dfs.sh
The above command will start the NameNode on your master node, a DataNode on all worker nodes listed in the slaves file, as well as the SecondaryNameNode on the master node. At this point, HDFS is accessible from any node on the cluster.
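As a quick sanity check that HDFS is reachable (a sketch; the directory name here is arbitrary), you can create a directory and then list the filesystem root from any node:
hdfs dfs -mkdir /test
hdfs dfs -ls /
The listing should show the /test directory you just created.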
To use the resource management feature, you need to start YARN.
start-yarn.sh
This will start the ResourceManager on your master node and a NodeManager on all nodes. Note, the jps command can check whether the processes mentioned above are running.
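For example, running jps on a node might print something like the following (the process IDs are placeholders, and which daemons appear depends on that node's role - the NameNode, SecondaryNameNode, and ResourceManager only run on the nodes configured for them):
1234 NameNode
2345 SecondaryNameNode
3456 DataNode
4567 NodeManager
5678 Jps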
The output from jps confirms Hadoop is up and running on the cluster!
Note, we can stop the Hadoop cluster with stop-yarn.sh followed by stop-dfs.sh.
Now we can move on to setting up Spark!