
How to Optimize a Hadoop Cluster for High Performance?

A Hadoop cluster is the central part of the Hadoop framework: a group of systems
linked together over a LAN that stores and processes big data sets. The cluster
typically consists of many commodity machines acting as workers, which
communicate with a higher-end system that acts as the master.
Running a Hadoop cluster in production is a very different challenge from setting
it up in a lab, and a Hadoop admin needs to tune the cluster setup to achieve high
performance. Out of the box, Hadoop ships with conservative default settings that
assume modest hardware, so admins must be familiar with the different hardware
configurations in their cluster.
There is no single performance tuning technique that fits all Hadoop workloads.
The right tools and tricks differ based on the size of the data being moved and on
the type of Hadoop job that runs in production.
The biggest selling point of Apache Hadoop as a big data processing framework is
its cost-effectiveness: data centers can be built from inexpensive machines to
process large amounts of structured and unstructured data. However, that same
commodity hardware stack is also the main obstacle to getting high performance
from a Hadoop cluster.
Therefore, the Hadoop admin has to make the best use of the cluster's capacity to
get the best performance out of the hardware stack.
Hadoop cluster performance tuning
Let us discuss in detail some of the most effective performance tuning techniques
for Hadoop clusters built on commodity hardware, aimed at increasing cluster
performance while keeping operational cost low.

Hadoop cluster memory
The first step toward high performance for a Hadoop job is tuning the memory
configuration, which requires observing memory usage on the servers. Hadoop
exposes a range of memory, CPU, and network options that help optimize the
performance of the cluster. In addition, every Hadoop MapReduce task reports
counters such as the number of input records read, the number of records passed
on for further processing, reducer records, and swap usage. Most Hadoop jobs are
not CPU-bound, so the prime concern is optimizing memory usage.

The rule of thumb when tuning memory is to make sure the tasks never start
swapping. The heap available to a job's tasks is adjusted through the
mapred.child.java.opts property in the mapred-site.xml file.
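As a rough sketch, the same memory settings can also be applied per job through
the Hadoop Java API. The property names are the standard MapReduce ones, but the
heap and container sizes below are only placeholders; they should be chosen from
the memory usage actually observed on your nodes.

import org.apache.hadoop.conf.Configuration;

public class MemoryTuning {
    // Apply illustrative memory settings to a job configuration.
    public static void tune(Configuration conf) {
        // Legacy property that covers both map and reduce task JVMs.
        conf.set("mapred.child.java.opts", "-Xmx1536m");
        // MRv2 equivalents: separate JVM heap sizes for map and reduce tasks.
        conf.set("mapreduce.map.java.opts", "-Xmx1536m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3072m");
        // Container memory should stay above the JVM heap to leave headroom
        // and keep tasks away from swapping.
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
    }
}
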
IO Performance improvement
A few key factors, followed while optimizing MapReduce performance, ensure that
the Hadoop cluster setup is tuned well.
The Linux file system records metadata for every file, such as the last access
time, creation time, and the owner of the file. HDFS follows a
write-once-read-many model, so keeping the last access time up to date adds IO
overhead with little benefit; disabling access-time updates on the data
directories yields better IO performance.
In practice, the mount points used by the DataNodes are configured with the
noatime option (for example in /etc/fstab), which ensures that the access-time
metadata is not rewritten every time the data is read. Mounting both the
MapReduce intermediate storage and the HDFS data directories with noatime
automatically disables access time tracking and gives increased IO performance.
It is also important not to use LVM or RAID on DataNode disks, since they reduce
performance; HDFS already handles replication itself.

Minimize Disk Spill
Disk IO is often the major bottleneck to speeding up performance. Two ways to
minimize disk spilling are:
● Give the mapper's sort/spill buffer enough room, around 70% of the heap
memory (a configuration sketch follows below).
● Compress the mapper output.
Ideally a mapper should not spill more than once; every extra spill means the
data has to be written to disk and read back again.
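As an illustration, the sort/spill buffer can be sized per job through the Java
API. The property names are the MRv2 ones, and the values below are examples
only; they should be adjusted against the spill counters observed for your jobs.

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static void configureSpill(Configuration conf) {
        // Size of the in-memory sort buffer used by each map task, in MB (example value).
        conf.setInt("mapreduce.task.io.sort.mb", 512);
        // Fraction of the buffer that may fill before a background spill starts.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);
        // Compressing map output further reduces the amount of data spilled to disk.
        conf.setBoolean("mapreduce.map.output.compress", true);
    }
}
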

Compressing with LZO
When the map output is large, the intermediate data can be reduced with
compression codecs such as LZO, BZIP2, or Snappy. Map output is not compressed
by default; setting mapreduce.map.output.compress to true enables it, and a
companion property selects the codec to use (LZO, Snappy, and so on).
Jobs that emit large map output benefit the most from compressing intermediate
data with LZO: each 1 GB of output data can save up to about 3 GB of disk space.
If a large amount of spilled data still lands on disk while the map tasks run,
increasing the size of the sort buffer (described above) helps as well.
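Here is a minimal sketch of enabling map output compression through the Java
API. The Snappy codec used below ships with Hadoop; the LZO codec class
mentioned in the comment comes from the separately installed hadoop-lzo package,
so treat that class name as an assumption about your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class MapOutputCompression {
    public static void enable(Configuration conf) {
        // Turn on compression of the intermediate map output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // SnappyCodec ships with Hadoop; for LZO, substitute
        // com.hadoop.compression.lzo.LzoCodec from the hadoop-lzo package.
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
    }
}
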
Tuning the Number of Mapper/Reducer Tasks
As a rule of thumb, each map or reduce task should run for at least around 40
seconds, and a single big job often does not use all the slots available in the
Hadoop cluster. It is therefore important to tune the number of map and reduce
tasks using the techniques below (a configuration sketch follows the list):
● If the MapReduce job has more than 1 TB of input, reduce the number of tasks
by increasing the block size of the input dataset to 512 MB or more. The block
size of existing files can also be changed by rewriting them with the
dfs.block.size property set to the new value; once the data has been copied with
the larger block size, the original files can be deleted.
● If the job launches many map tasks that each complete in only a few seconds,
reducing the number of maps launched, without otherwise changing the cluster
setup, will help optimize its performance.
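One possible way to apply both ideas from the Java API is sketched below; the
512 MB block/split size and the reducer count of 20 are illustrative values, not
recommendations.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountTuning {
    public static Job buildJob(Configuration conf) throws IOException {
        // Block size used for newly written files: 512 MB (example value).
        conf.setLong("dfs.blocksize", 512L * 1024 * 1024);
        // Raising the minimum split size reduces the number of map tasks launched.
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 512L * 1024 * 1024);
        Job job = Job.getInstance(conf, "task-count-tuning");
        // Choose the reducer count explicitly rather than relying on the default.
        job.setNumReduceTasks(20);
        return job;
    }
}
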
Using Skewed Joins
Using standard joins in the transform logic written with Pig or Hive can cripple
the performance of the underlying MapReduce tasks, because the data being
processed may be skewed. For example, 80% of the data may end up going to a
single reducer: when there is a very large amount of data for one key, that
reducer gets stuck processing most of the work. This is exactly the situation a
skewed join prevents. A skewed join first computes a histogram to identify the
dominant keys, and then splits the data for those keys across different reducers
to achieve better balance and optimal performance.
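The skewed join itself is a feature of Pig and Hive, but the underlying idea can
be illustrated with a hand-written MapReduce partitioner that scatters a known
hot key over several reducers. Everything in this sketch is hypothetical: the
class name, the hot key value, and the fan-out are placeholders.

import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads records for one known dominant key across several reducers,
// while hashing all other keys normally.
public class HotKeyPartitioner extends Partitioner<Text, IntWritable> {
    private static final String HOT_KEY = "popular_key"; // assumed dominant key
    private static final int HOT_KEY_FANOUT = 8;         // reducers reserved for it
    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.toString().equals(HOT_KEY)) {
            // Scatter the hot key's records across the first few reducers.
            return random.nextInt(Math.min(HOT_KEY_FANOUT, numPartitions));
        }
        // Everything else uses a plain hash partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Scattering a key like this only works for operations whose partial results can
be re-aggregated afterwards; Pig's built-in skewed join handles that bookkeeping
automatically, which is why it is usually the better choice.
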
Writing a Combiner
Depending on the workload running on the cluster, writing a combiner reduces the
amount of data transferred between the map and reduce stages. Alongside the data
compression technique, it is another effective way to cut shuffle traffic, and
it proves beneficial in enhancing cluster performance.
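As a small sketch, a word-count style combiner might look like the class below;
the class name is illustrative, and because addition is associative and
commutative the same class can typically serve as the reducer too.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted for each key before the data leaves the map side.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

It is registered on the job with job.setCombinerClass(SumCombiner.class), and the
same class can also be passed to job.setReducerClass.
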
Speculative Execution
Overall MapReduce performance suffers badly when a few tasks take much longer
than the rest to finish. Speculative execution is a general approach to this
problem: slow-running tasks are backed up by duplicate attempts on other
systems, and whichever attempt finishes first wins.
Setting the configuration parameters mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution (mapreduce.map.speculative and
mapreduce.reduce.speculative in newer releases) to true enables speculative
execution. This helps reduce job execution time when task progress slows down,
for example because of memory pressure on a node.
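For completeness, a minimal sketch of enabling both flags through the Java API:

import org.apache.hadoop.conf.Configuration;

public class SpeculativeExecution {
    public static void enable(Configuration conf) {
        // MRv2 property names; older releases use mapred.*.tasks.speculative.execution.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
    }
}
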
There are many performance-optimization tips and tricks for a Hadoop cluster,
and we have discussed some of the most important ones above. The Hadoop
community also keeps publishing new guidance for getting the best performance.
Because Hadoop scales horizontally, admins can keep adding instances to the
cluster to raise performance further, which is also why many users prefer to
have their own Hadoop cluster rather than share one. The tips above should help
in achieving the best performance from it.
For more practical, expert-led guidance, consider the hadoop admin online
training at Online IT Guru. This learning can help you build the Hadoop skills
needed for a better career.