🐘 Scaling Big Data from Scratch: Setting Up a Hadoop Multi-Node Cluster

Before technologies like Spark or cloud data lakes took over, Apache Hadoop laid the foundation for the big data revolution. It introduced the world to an open-source framework capable of storing and processing massive datasets across clusters of commodity hardware.

Even today, understanding Hadoop’s underlying infrastructure is a rite of passage for data engineers.

In this architectural guide, we will break down Hadoop’s node topology and walk through the step-by-step configuration required to stand up a functional Multi-Node Hadoop Cluster using a primary orchestrator (Master) and a compute instance (Worker).

🏗️ The Multi-Node Cluster Architecture

Hadoop scales horizontally using two primary structural layers: HDFS (Hadoop Distributed File System) for storage, and YARN (Yet Another Resource Negotiator) for compute cluster management.

Figure: HDFS cluster architecture, showing the NameNode and the DataNodes. Source: hadoop.apache.org

  • The Master Node (Orchestration): Runs the NameNode (the directory bookkeeper that tracks where file blocks live across the cluster) and the ResourceManager (the arbiter that allocates computing resources to running jobs).
  • The Worker Nodes (Execution): Run the DataNode (which physically writes file blocks to local hard drives) and the NodeManager (which executes computational tasks under YARN’s direction).

🛠️ Step 1: Network & System Prerequisites

For a multi-node cluster, your machines must communicate seamlessly. Execute these system steps on both the Master and Worker servers.

1. Configure the Hosts File

Make sure your servers can resolve each other by hostname, so the cluster configuration never depends on IP addresses that may change. Edit /etc/hosts:

Bash

sudo nano /etc/hosts

Add the private IP mappings of your infrastructure:

Plaintext

192.168.1.50  hadoop-master
192.168.1.51  hadoop-worker1
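Since this edit has to be repeated on every machine, it can help to script it. The following is a minimal sketch, not an official tool: the function name add_host_entry and the HOSTS_FILE override are made up for illustration, and the addresses are the example ones from above.

```bash
# Append "IP  hostname" to a hosts file only if the hostname is not mapped yet.
# HOSTS_FILE is a hypothetical override so the function can be tried on a copy.
add_host_entry() {
  local ip="$1" name="$2" file="${HOSTS_FILE:-/etc/hosts}"
  # Look for the hostname as a whole field on any line that is not a comment.
  if awk -v n="$name" '!/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$file"; then
    echo "already present: $name"
  else
    printf '%s  %s\n' "$ip" "$name" >> "$file"
    echo "added: $name"
  fi
}
```

Calling add_host_entry 192.168.1.50 hadoop-master twice adds the line only once. Writing to the real /etc/hosts requires root, so the function would need to run inside a script started with sudo.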

2. Configure Passwordless SSH

The Master node must be able to securely log into Worker nodes to spin up execution daemons automatically.

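On the Master Node, generate an SSH key and copy it to every host that will run daemons. That includes hadoop-master itself, because the start scripts also connect to the local machine over SSH:

Bash

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id hadoop-master
ssh-copy-id hadoop-worker1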

Verify the setup by running ssh hadoop-worker1 from your master terminal. You should be logged in right away, without being asked for a password.
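The same check can be scripted so that it fails loudly instead of waiting at a password prompt. This is a small sketch under the assumption that OpenSSH is the client in use; check_passwordless_ssh is an illustrative name.

```bash
# Return 0 only if key-based login to the host works without any prompt.
# BatchMode=yes makes ssh fail instead of asking for a password.
check_passwordless_ssh() {
  local host="$1"
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
    echo "ok: $host"
  else
    echo "FAILED: $host"
    return 1
  fi
}
```

Run it once per hostname, for example check_passwordless_ssh hadoop-worker1. A host whose key has never been accepted will also report FAILED, so log in by hand once before relying on the check.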

3. Install Java Environment

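Hadoop runs on the JVM, so every node needs a Java Development Kit. Java 8 works with all Hadoop 3.x releases; Java 11 is supported as a runtime from Hadoop 3.3 onward. This guide uses OpenJDK 8: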

Bash

sudo apt update
sudo apt install openjdk-8-jdk -y
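To confirm which Java a node will actually use, the version banner can be parsed. The sketch below assumes the usual OpenJDK banner format (a quoted version string on the first line); java_major_version is an illustrative helper, not part of the JDK.

```bash
# Print the major Java version (8, 11, ...) parsed from `java -version` text.
# Java 8 reports itself as "1.8.0_xxx"; later releases report "11.0.x" and so on.
java_major_version() {
  local text="$1" ver
  ver=$(printf '%s\n' "$text" | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) ver=${ver#1.}; echo "${ver%%.*}" ;;   # legacy 1.x numbering
    "")  return 1 ;;                           # no version string found
    *)   echo "${ver%%[.+-]*}" ;;              # modern numbering
  esac
}
```

Because java -version writes to standard error, a typical call is java_major_version "$(java -version 2>&1)".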

🛠️ Step 2: Download & Extract Hadoop

Execute these steps on the Master Node first:

  1. Download a stable Hadoop release tarball (e.g., version 3.x) from the Apache Hadoop Official Releases.
  2. Extract the tarball and move its contents to a system-wide directory such as /opt/hadoop.
  3. Append these environment variables to your shell profile (~/.bashrc):

Bash

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Run source ~/.bashrc to apply the changes to your current terminal session.
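Item 2 in the list above, extracting the tarball into /opt/hadoop, can be sketched as a small helper. It assumes the release archive has a single top-level folder, as Apache tarballs normally do, and takes the tarball path from the caller rather than guessing a version. The name install_hadoop_tarball is illustrative.

```bash
# Extract a Hadoop release tarball into a target directory, dropping the
# top-level folder so that bin/, sbin/ and etc/ land directly in the target.
# Both arguments come from the caller; nothing here picks a version.
install_hadoop_tarball() {
  local tarball="$1" dest="${2:-/opt/hadoop}"
  [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
  mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" --strip-components=1
}
```

If /opt is owned by root, create /opt/hadoop with sudo first and hand it to the user that will run Hadoop, so the daemons can write their data and logs there.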

🛠️ Step 3: Editing the Configuration Files

All configuration files live in the $HADOOP_HOME/etc/hadoop/ directory on your Master Node.

1. hadoop-env.sh

Explicitly define the Java runtime path inside the environment shell script:

Bash

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

2. core-site.xml

Set fs.defaultFS to the URI of the NameNode so that clients and DataNodes know where to find it:

XML

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
</configuration>

3. hdfs-site.xml

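Set how many copies of each file block HDFS keeps. The workers file in this guide lists both hadoop-master and hadoop-worker1, so the cluster runs two DataNodes and a value of 2 places one replica on each. The other two properties set where the NameNode metadata and the DataNode blocks are stored on local disk: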

XML

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///opt/hadoop/data/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///opt/hadoop/data/hdfs/datanode</value>
    </property>
</configuration>

4. yarn-site.xml

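Point the NodeManagers at the ResourceManager on hadoop-master and enable the shuffle service that MapReduce jobs rely on. (To actually run MapReduce jobs on YARN, mapreduce.framework.name must also be set to yarn in mapred-site.xml.)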

XML

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
    </property>
</configuration>

5. Define the Workers List

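Open the file named workers and list the hostname of every machine that should run a DataNode and a NodeManager. Listing hadoop-master here means the master also stores blocks and runs tasks, which is what gives this small cluster two DataNodes: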

Plaintext

hadoop-master
hadoop-worker1
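Before copying the configuration anywhere, a quick consistency check can catch a replication factor larger than the number of DataNodes. The sketch below is a rough pre-flight helper with made-up function names; it assumes each <name> and <value> element sits on its own line, as in the snippets above, and it is not a real XML parser. Once Hadoop is on the PATH, hdfs getconf -confKey dfs.replication reports the effective value more reliably.

```bash
# Read one property value from a Hadoop *-site.xml file.
# Assumes <name> and <value> each sit on a single line.
get_hadoop_property() {
  local file="$1" key="$2"
  awk -v key="$key" '
    /<name>/ { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n); found = (n == key) }
    found && /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v); print v; exit }
  ' "$file"
}

# Count the hosts in a workers file, skipping blank lines and # comments.
count_workers() {
  grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1"
}

# Warn when dfs.replication is larger than the number of listed workers.
# Falls back to 3, the HDFS default, when the property is not set.
check_replication() {
  local site="$1" workers="$2" rep n
  rep=$(get_hadoop_property "$site" dfs.replication)
  rep=${rep:-3}
  n=$(count_workers "$workers")
  if [ "$rep" -gt "$n" ]; then
    echo "WARN: replication $rep exceeds $n DataNode(s)"
    return 1
  fi
  echo "ok: replication $rep with $n DataNode(s)"
}
```

With the files from this guide, check_replication hdfs-site.xml workers run inside $HADOOP_HOME/etc/hadoop should report replication 2 with 2 DataNodes.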

🛠️ Step 4: Sync Configurations to Worker Nodes

Rather than repeating these edits by hand on every worker, copy the configured Hadoop tree from the master node with rsync:

Bash

rsync -avz /opt/hadoop hadoop-worker1:/opt/

(Make sure the same user account owns /opt/hadoop on the receiving machine and can write to it.)
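With more than one worker, the copy can be driven by the workers file itself. The following is a sketch rather than a hardened deployment script: sync_to_workers and the SELF_HOST override are hypothetical names, and it assumes the local machine's hostname matches its entry in the workers file.

```bash
# Copy the local Hadoop tree to every host in the workers file except this one.
# The default paths are the ones used in this guide.
sync_to_workers() {
  local workers="${1:-/opt/hadoop/etc/hadoop/workers}" src="${2:-/opt/hadoop}"
  local self host
  self="${SELF_HOST:-$(uname -n)}"
  while read -r host; do
    # Skip blank lines, comments, and the machine we are running on.
    case "$host" in ""|"#"*|"$self") continue ;; esac
    echo "syncing to $host"
    # </dev/null keeps rsync (and the ssh underneath) from eating the loop input.
    rsync -az "$src" "$host:$(dirname "$src")/" </dev/null || return 1
  done < "$workers"
}
```

Re-running it after any configuration change keeps the nodes identical, since rsync only transfers files that differ.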

🛠️ Step 5: Formatting and Starting the Cluster

Everything is configured. Now we initialize the HDFS metadata and start the cluster daemons.

1. Format the NameNode File System

Execute this command only once, on your Master Node, before launching the cluster for the first time. Warning: formatting the NameNode of an active cluster erases its filesystem metadata, which makes the data already stored in HDFS unreachable.

Bash

hdfs namenode -format
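Because an accidental re-format is so destructive, a small guard around the command is cheap insurance. The sketch below relies on the fact that a formatted NameNode directory contains a current/VERSION file; format_namenode_once is an illustrative wrapper, and its default path is the dfs.namenode.name.dir value from this guide.

```bash
# Format the NameNode only if its metadata directory is not yet initialized.
# A formatted NameNode directory contains current/VERSION.
format_namenode_once() {
  local name_dir="${1:-/opt/hadoop/data/hdfs/namenode}"
  if [ -f "$name_dir/current/VERSION" ]; then
    echo "refusing to format: $name_dir already holds NameNode metadata" >&2
    return 1
  fi
  hdfs namenode -format
}
```

On a fresh install the wrapper behaves exactly like the plain command; on a node that has already been formatted it stops before anything is touched.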

2. Fire Up HDFS Daemons

Start the NameNode, the SecondaryNameNode, and the DataNodes:

Bash

start-dfs.sh

3. Fire Up YARN Compute Daemons

Start the ResourceManager and the NodeManagers:

Bash

start-yarn.sh

🔍 Step 6: Verifying Cluster Health

Open your terminal and execute jps (Java Virtual Machine Process Status Tool) on both servers to see the active daemons:

  • On hadoop-master, you should see: NameNode, SecondaryNameNode, and ResourceManager, plus DataNode and NodeManager because hadoop-master is also listed in the workers file.
  • On hadoop-worker1, you should see: DataNode and NodeManager.
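Reading jps output by eye gets tedious once there are several nodes. As a rough sketch, the expected daemon names can be compared against the output automatically; missing_daemons is a hypothetical helper that assumes the usual "PID Name" layout printed by jps.

```bash
# Print each expected daemon that is absent from jps output; return 1 if any.
# First argument: the jps output. Remaining arguments: daemon names to expect.
missing_daemons() {
  local jps_output="$1" name missing=0
  shift
  for name in "$@"; do
    # Compare the whole second column so NameNode does not match SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

On the worker, for example, missing_daemons "$(jps)" DataNode NodeManager prints nothing and returns 0 when both daemons are running.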

The Web Interface Check

Open your browser to visually inspect the state of your infrastructure:

  • HDFS Storage Panel: http://hadoop-master:9870
  • YARN Cluster Dashboard: http://hadoop-master:8088
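On a server without a browser, the same two endpoints can be probed from the terminal. This is a minimal sketch assuming curl is installed; check_web_ui is an illustrative name, and it only confirms that the page answers, not that the cluster is healthy.

```bash
# Return 0 if an HTTP endpoint answers without an error status within 5 seconds.
check_web_ui() {
  local url="$1"
  if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
    echo "up: $url"
  else
    echo "DOWN: $url"
    return 1
  fi
}
```

Call it with each of the two URLs above. For a deeper look than a reachability check, hdfs dfsadmin -report lists the DataNodes that have registered with the NameNode.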

You now have a working two-node Hadoop cluster, built on the same master and worker topology that larger production clusters use.