Big Data in a Small Package - Building a Raspberry Pi Cluster for Hadoop and Spark
Edit the Hadoop configuration
To configure Hadoop, we need to edit the following files:
- ~/.bashrc: To set the environment variables needed by Hadoop and Spark.
- <hadoop_directory>/etc/hadoop/hadoop-env.sh: To set Hadoop-specific environment variables, including JAVA_HOME.
- <hadoop_directory>/etc/hadoop/core-site.xml: To set site-specific properties, including fs.default.name, which contains the path and port of the NameNode.
- <hadoop_directory>/etc/hadoop/hdfs-site.xml: This has to be configured on each node in the cluster. We will need to specify the directories used by the NameNode (on the master node) and the DataNode (on all nodes).
- <hadoop_directory>/etc/hadoop/yarn-site.xml: To set the YARN configuration.
- <hadoop_directory>/etc/hadoop/mapred-site.xml: To set the MapReduce configuration.
- <hadoop_directory>/etc/hadoop/slaves: To list the worker nodes; for the master node only.
.bashrc (note: this file also contains entries for Spark's configuration)
Use nano ~/.bashrc to edit the file and enter the following lines at the top. Included are the environment variables necessary for Hadoop and Spark; setting a constant value across the cluster for PYTHONHASHSEED stops Spark from crashing during operations that call the random number generator, such as .distinct().
export PYTHONHASHSEED=123
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/ipython3
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
After saving your changes (ctrl + x, then y), apply the updates to your current session with the following command:
source ~/.bashrc
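As a quick sanity check that the new variables are in place (a sketch; this assumes Hadoop was unpacked to /usr/local/hadoop as set above), you can print one of the variables and ask Hadoop for its version:
echo $HADOOP_INSTALL
hadoop version
If hadoop version prints the installed release, the PATH entries are working.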
/usr/local/hadoop/etc/hadoop/hadoop-env.sh
In hadoop-env.sh, the line we need to amend is the one that exports the JAVA_HOME variable. Change it to the corresponding value on your machines. For my machines, I made the amendment as below:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-armhf
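If you are not sure where Java lives on your Pis, one way to find out (assuming java is on the PATH) is to resolve the symlink behind the java binary:
readlink -f $(which java)
The output ends in .../bin/java; JAVA_HOME is the installation directory at the front of that path, which is /usr/lib/jvm/java-8-openjdk-armhf here.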
/usr/local/hadoop/etc/hadoop/core-site.xml
In core-site.xml, we need to specify the NameNode's address. Enter the following content between the <configuration> and </configuration> tags.
<property>
<name>fs.default.name</name>
<value>hdfs://<URI_of_Namenode>:<Port_of_Namenode></value>
</property>
Below is the core-site.xml currently on my cluster. I set sme1 as the NameNode (IP 10.0.0.53), so the additional overhead is kept off the Spark master node, sme5.
<property>
<name>fs.default.name</name>
<value>hdfs://10.0.0.53:9000</value>
</property>
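Once the file is saved, a quick check that the XML is well formed and being read is to ask Hadoop which NameNode address it picked up:
hdfs getconf -confKey fs.default.name
This should print hdfs://10.0.0.53:9000 (or whichever address you configured).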
/usr/local/hadoop/etc/hadoop/hdfs-site.xml
hdfs-site.xml has to be configured on each of your nodes. It specifies the directories which are used for the DataNode and the NameNode on each node.
Now, we need to create two directories for the NameNode and the DataNode:
mkdir -p /usr/local/hadoop_store/hdfs/namenode
mkdir -p /usr/local/hadoop_store/hdfs/datanode
Note, if you use different directories, adjust your hdfs-site.xml file accordingly.
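Depending on which user runs Hadoop, these directories may also need to be writable by that user. A sketch, assuming Hadoop runs as the default pi user:
sudo chown -R pi:pi /usr/local/hadoop_store   # adjust pi:pi to the user/group that runs Hadoop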
Now, add the following content between the <configuration> and </configuration> tags in the hdfs-site.xml file:
<property>
<name>dfs.replication</name>
<value>5</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
Aside from setting the directory paths for the NameNode and DataNode, the other property we need to set is dfs.replication. This integer is the number of copies HDFS keeps of each block; here it is set equal to the number of nodes in the cluster, 5 in my case.
/usr/local/hadoop/etc/hadoop/yarn-site.xml
The yarn-site.xml file contains our YARN configuration. This example is rather minimal because the cluster relies on Hadoop for its HDFS filesystem rather than for the YARN scheduler.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
/usr/local/hadoop/etc/hadoop/mapred-site.xml
This file holds Hadoop's MapReduce configuration. Again, this is a minimal example because we are not using Hadoop to run MapReduce jobs on the cluster.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
/usr/local/hadoop/etc/hadoop/slaves
This file contains the hostnames of the worker nodes and only needs to be stored on the master node, in this case sme5.
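As a sketch of what the slaves file might contain, assuming the five Pis are named sme1 through sme5 and every node should run a DataNode (which dfs.replication = 5 requires), the file simply lists one hostname per line:
sme1
sme2
sme3
sme4
sme5
If you only want DataNodes on a subset of nodes, list only those hostnames and lower dfs.replication to match.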
Format the New File System
sudo hdfs namenode -format
NOTE: This step is only needed when you first set up your HDFS, as formatting erases all existing data. Run it on the NameNode (sme1 in my case).
Start HDFS
start-dfs.sh
The above command will start the NameNode on your master node, a DataNode on all worker nodes listed in the slaves file, as well as the SecondaryNameNode on the master node. At this point, HDFS is accessible from any node on the cluster.
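As a quick sanity check that HDFS is reachable (a sketch; the directory name here is arbitrary), you can create a directory and then list the filesystem root from any node:
hdfs dfs -mkdir /test
hdfs dfs -ls /
The listing should show the /test directory you just created.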
To use the resource management feature, you need to start YARN.
start-yarn.sh
This will start the ResourceManager on your master node and a NodeManager on all nodes. Note, the jps command can check whether the processes mentioned above are running.
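For example, running jps on a node might print something like the following (the process IDs are placeholders, and which daemons appear depends on that node's role - the NameNode, SecondaryNameNode, and ResourceManager only run on the nodes configured for them):
1234 NameNode
2345 SecondaryNameNode
3456 DataNode
4567 NodeManager
5678 Jps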
The output from jps confirms Hadoop is up and running on the cluster!
Note, we can stop the Hadoop cluster with stop-yarn.sh followed by stop-dfs.sh.
Now we can move on to setting up Spark!