YARN - Next Generation Distributed Computing Using Hadoop

By Technology Last updated Tuesday, April/23/2024

Introduction

When someone mentions Map/Reduce, we immediately think of Hadoop, and vice versa.
Map/Reduce, an idea that originated at Google, generated immense interest in the computing world.
This interest materialized in Hadoop, much of which was developed at Yahoo.
Once generally available, Hadoop was used to build solutions on commodity hardware, even when Map/Reduce was not a suitable model for the problem at hand.
This triggered a rethink in the Hadoop world.
Hadoop was re-architected, making it capable of supporting distributed computing solutions, rather than only supporting Map/Reduce.
After the re-architecture, the main feature that differentiates Hadoop 2 (as the re-architected version is called) from Hadoop 1 is YARN (Yet Another Resource Negotiator).
YARN was developed as a component of the Map/Reduce project to overcome some of the performance and scalability issues in Hadoop's original design, but it was soon realized that it could be extended to support other solution models, such as DAG (Directed Acyclic Graph) execution.

Why another programming model?

For many years, Map/Reduce has been at the heart of Hadoop for distributed computing and has served it well.
But Map/Reduce is restrictive: it is batch oriented, it incurs costly disk and network transfer operations, and it does not allow data or messages to be exchanged between Map/Reduce jobs.
Some of the use cases where Map/Reduce is not suitable are listed below.

1) Interactive Queries: The volume of data stored in Hadoop HDFS is growing exponentially, and in some enterprises it has reached the petabyte scale.
Typically, Hive, Pig and Map/Reduce jobs are used to extract and process the data.
But enterprises are demanding quick retrieval of data via interactive queries, which need to generate results in a matter of a few seconds.
Some examples of interactive queries are display of dynamic, analytical charts, creation of aggregated data, etc.

2) Real time data processing: While it is known that Big Data must cater to the three V's of data, i.e. Volume, Variety and Velocity, in most cases Hadoop could only cater to two of them, namely Volume and Variety.
Velocity had to be addressed using technologies like In-Memory Computing (IMC) and Data Stream Processing.
Some of the use cases that require a near real time response are credit card fraud detection, network fault prediction from sensor data, security threat prediction in networks, etc.

3) Efficient Machine Learning: Most machine learning algorithms are iterative in nature and consider the complete data set for accurate results, and each iteration generates intermediate data.
Though Apache Mahout is popular and commonly used for implementing machine learning solutions on top of Hadoop, it uses Map/Reduce for each iteration and stores intermediate data in HDFS, which reduces application performance.
Some of the use cases which require efficient machine learning algorithms are Customer Segmentation using K-means clustering, Sentiment Analysis using Latent Dirichlet Allocation (LDA), etc.

4) Efficient Graph Processing: When Google published Pregel, a graph processing architecture, in 2010, it caught the attention of many enterprises.
Enterprises started demanding graph processing on top of Hadoop.
Apache Giraph, the open source answer to Google Pregel, used Map/Reduce for its iterative graph processing.
But Giraph is inefficient on Map/Reduce because of its iterative nature, and its processing engine uses only the Map part of Map/Reduce.
Some of the use cases for graph processing are impact analysis and network planning, social graphs for friend recommendations, etc.

In the following sections, we cover each of the points mentioned above along with the tools and techniques provided by Hadoop 2 and YARN.

Interactive Queries on YARN

Apache Tez is an application framework built on top of YARN that allows solutions to be developed as a Directed Acyclic Graph (DAG) of tasks in a single job.
A DAG of tasks is more powerful than traditional Map/Reduce because it reduces the need to execute multiple jobs to query Hadoop.
With Map/Reduce, a single query often requires many jobs.
Each Map/Reduce job has to be initialized, and intermediate data has to be stored and passed between jobs, all of which slows down query execution.
With a DAG, the query runs as a single job, and intermediate data does not need to be persisted between jobs.
It is expected that Hive and Pig will eventually use Tez for interactive queries.
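
The single-job DAG idea can be sketched in a few lines of plain Python. This is only an illustration of the execution model, not the Tez API (Tez itself is a Java framework): the task names, the sample data and the `run_dag` helper are all hypothetical, and the sketch assumes Python 3.9+ for the standard `graphlib` module. Each task receives the in-memory results of its upstream tasks, so nothing is written out between stages.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, deps):
    """Run every task exactly once, in dependency order, inside one 'job'.

    tasks: task name -> function taking a dict of upstream results
    deps:  task name -> set of upstream task names
    """
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = tasks[name](upstream)  # handed over in memory
    return results


# Hypothetical query: total order amount per country.
def scan_orders(_):
    return [("alice", 30), ("bob", 20), ("alice", 5)]


def scan_customers(_):
    return {"alice": "DE", "bob": "US"}


def join(up):
    countries = up["scan_customers"]
    return [(countries[name], amount) for name, amount in up["scan_orders"]]


def total_by_country(up):
    totals = {}
    for country, amount in up["join"]:
        totals[country] = totals.get(country, 0) + amount
    return totals


TASKS = {
    "scan_orders": scan_orders,
    "scan_customers": scan_customers,
    "join": join,
    "total_by_country": total_by_country,
}
DEPS = {
    "join": {"scan_orders", "scan_customers"},
    "total_by_country": {"join"},
}

results = run_dag(TASKS, DEPS)
print(results["total_by_country"])
```

Expressed as chained Map/Reduce jobs, the join and the aggregation would each be a separate job with its output stored in between; here they are two vertices of one graph.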

Real time Processing on YARN

Apache STORM brings real time processing of high velocity data using the Spout-Bolt model.
A Spout is the message source and a Bolt processes the data.
YARN is expected to allow placement of STORM closer to the data, which in turn will reduce network transfer and the cost of acquiring data.
The acquired data can in turn be used by tasks that use DAG or Map/Reduce for further processing.
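
To make the Spout-Bolt model concrete, here is a minimal single-process sketch using Python generators. It does not use STORM's API: a real topology is distributed and runs over an unbounded stream. The function names, the sample card events and the 5000 threshold are assumptions made purely for illustration, loosely following the fraud detection use case mentioned earlier.

```python
def card_spout(events):
    """Spout: the message source. Here it simply replays an in-memory list."""
    for event in events:
        yield event


def threshold_bolt(stream, limit):
    """Bolt: passes on only the transactions whose amount exceeds `limit`."""
    for card, amount in stream:
        if amount > limit:
            yield (card, amount)


def count_bolt(stream):
    """Bolt: counts suspicious transactions per card."""
    counts = {}
    for card, _amount in stream:
        counts[card] = counts.get(card, 0) + 1
    return counts


# Hypothetical (card id, amount) events.
EVENTS = [("c1", 50), ("c2", 9000), ("c1", 12000), ("c2", 7000), ("c3", 10)]

# Wire the topology: spout -> threshold bolt -> count bolt.
alerts = count_bolt(threshold_bolt(card_spout(EVENTS), limit=5000))
print(alerts)
```

Because each stage is a generator, every event flows through the whole chain as soon as it is emitted, rather than waiting for a batch to complete.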

Iterative Machine Learning on YARN

Apache SPARK is an in-memory computing framework that has been ported to Hadoop YARN.
SPARK is designed to make iterative machine learning algorithms faster by storing the data in memory.
MLlib is a machine learning library that uses SPARK to keep data in memory for efficient execution of iterative machine learning algorithms.
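
The benefit of keeping data in memory across iterations can be illustrated with a tiny one-dimensional K-means in plain Python. This is not SPARK or MLlib code; `load_points` is a hypothetical stand-in for an expensive read from distributed storage, and the points and starting centroids are made up. The point of the sketch is that the data set is loaded once and then reused by every iteration, whereas a Map/Reduce-per-iteration design would reread its input and store intermediate results each time.

```python
def load_points():
    """Stand-in for an expensive read from distributed storage (hypothetical)."""
    load_points.calls += 1
    return [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]


load_points.calls = 0  # counts how often the data set is read


def kmeans_1d(points, centroids, iterations):
    """Plain K-means on 1-D points; `points` stays in memory throughout."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


points = load_points()  # read once, then reuse for every iteration
centroids = kmeans_1d(points, [1.0, 12.0], iterations=5)
print(centroids, "loads:", load_points.calls)
```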

Graph Processing on YARN

Apache Giraph is an iterative graph processing system built for high scalability.
Giraph has been upgraded to run on YARN.
It uses YARN to run Bulk Synchronous Parallel (BSP) processing over huge volumes of semi-structured graph data.
Giraph was originally designed to run on top of Hadoop 1, but was inefficient there due to its use of Map/Reduce and its iterative nature.
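
The BSP model proceeds in supersteps: in each one, every vertex reads the messages sent to it in the previous superstep, updates its state, and sends new messages, with a barrier between supersteps. The sketch below propagates the maximum value through a small graph, a common introductory example for Pregel-style systems. It is plain Python, not Giraph's Java API, and the graph, the values and the `bsp_max` helper are hypothetical.

```python
def bsp_max(graph, values):
    """Propagate the maximum vertex value using BSP-style supersteps.

    graph:  vertex -> list of neighbours (each undirected edge listed both ways)
    values: vertex -> initial value
    """
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])

    # Later supersteps: only vertices whose value changed send messages.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox  # barrier: messages become visible in the next superstep
    return values


# Hypothetical chain graph a - b - c - d.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final = bsp_max(GRAPH, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final)
```

The computation halts when a superstep produces no messages, which is why iterative graph algorithms fit this model better than a fresh Map/Reduce job per iteration.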

How everything stacks up on YARN

The Hadoop 2 technology stack is expected to have a significant impact on application development.
Applications will be able to use batch processing, interactive queries, real-time computing and in-memory computing on top of YARN and federated HDFS.
The YARN technology stack includes different engines, such as Map/Reduce, Tez and Slider.
Different Hadoop components can execute on these engines or on YARN directly.
Some components, such as Tez and Slider, are still in the incubation phase.
The technology stack of the Hadoop 2 ecosystem is as follows.

1) Map/Reduce: Map/Reduce will run on top of YARN.
Programmatically, the code remains the same, but configuration changes will be required to migrate an application to Hadoop 2.

2) Batch and Interactive: Tez is being built on top of YARN to provide interactive query support.
Tez generalizes the Map/Reduce paradigm to a more powerful framework for executing a complex DAG of tasks for near real-time big data processing.
Currently, Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the Map/Reduce framework for processing these programs.
Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
Both Pig and Hive currently use multiple Map/Reduce jobs, which harms latency and throughput.
Eventually, Pig and Hive are expected to take advantage of the Tez engine to deliver fast response times and high throughput at petabyte scale.

3) Real Time-Slider: The Slider engine will bridge the gap between existing applications and YARN applications, allowing existing applications to use the Hadoop 2 ecosystem via YARN.
With Slider, distributed applications that aren't YARN-aware can now "slide into YARN" to run on Hadoop - usually with no code changes.
STORM is planned to slide in initially.

4) Existing Products which have migrated to YARN: Some products, such as SPARK and STORM, have made the required changes and use the capabilities of YARN directly, without engines like Tez or Slider.

Conclusion

YARN makes Hadoop 2 a more powerful, scalable and extensible architecture than its previous version.
YARN will eventually give the development and architecture community a platform for big data applications that offers capabilities such as batch processing, interactive queries, real-time computing and others in one ecosystem.
