Distributed Data Processing Using Apache Software Foundation Tools
This article focuses on Apache Software Foundation technologies that make processing large amounts of data easier. I am going to describe the Apache Hadoop ecosystem and the MapReduce model for distributed processing together with its implementation used in Hadoop. Furthermore, I am going to introduce the HDFS file system, the HBase column-oriented database, the Hive data warehouse, and ZooKeeper, a centralized service for configuration management, distributed synchronization and group services.
Hadoop is a framework for running parallel applications, designed for processing large amounts of data on ordinary PCs. It includes several tools that are independent of each other and that each solve a certain part of the process of capturing, storing and processing data (e.g. the freely available Cassandra non-relational database). Hadoop was created in 2005 by Mike Cafarella and Doug Cutting. It is designed to detect and handle failures at the application level in order to ensure higher reliability. Currently, it is a project unique in its functionality and popularity: many commercial solutions are based on it, and the largest "players", such as Microsoft, Yahoo, Facebook, eBay, etc., use it in their solutions and are also involved in its development.
Hadoop Common manages a set of auxiliary functions and interfaces for distributed file systems and general input/output, which are used by all framework libraries (tools for serialization, Java RPC - Remote Procedure Call, robust data structures, etc.).
Hadoop YARN plans and monitors tasks in the Hadoop framework. It introduces the concept of a resource manager which, based on information from the scheduler and from node managers, distributes tasks to individual nodes (usually single devices) in the cluster. The resource manager receives information not only about the condition of nodes, but also about the success or failure of individual tasks. It also monitors the health of the nodes on which Hadoop is running, and when a node fails, it forwards its work to another node in the cluster.
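The failover behaviour described above can be illustrated with a small sketch. The class below is a hypothetical toy model, not the real YARN API: it tracks which node runs which task and re-forwards the tasks of a node reported as failed.

```python
class ResourceManager:
    """Toy model (not the real YARN API) of a resource manager that
    tracks task assignments and re-forwards tasks on node failure."""

    def __init__(self, nodes):
        self.assignments = {node: [] for node in nodes}

    def submit(self, task):
        # Naive load balancing: pick the node with the fewest tasks.
        node = min(self.assignments, key=lambda n: len(self.assignments[n]))
        self.assignments[node].append(task)
        return node

    def report_failure(self, failed_node):
        # Re-forward every task from the failed node to healthy nodes.
        orphaned = self.assignments.pop(failed_node)
        for task in orphaned:
            self.submit(task)

rm = ResourceManager(["node-1", "node-2", "node-3"])
for task in ["t1", "t2", "t3"]:
    rm.submit(task)
rm.report_failure("node-2")
print(rm.assignments)   # t2 has been re-forwarded to a healthy node
```

The real resource manager also weighs node capacity and data locality when placing tasks; the sketch keeps only the core idea of reassignment after failure.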
Processing large amounts of text data is, in essence, a very simple task. However, if the main requirement is to finish the computation in a reasonable time, it is necessary to divide it among multiple devices, which makes this relatively simple task complicated. Therefore, Google came up with an abstraction that hides parallelization, load balancing and data splitting behind a library with the MapReduce programming model, which can be run on ordinary computers. Its makers took inspiration for the name from the "primitives" of functional programming languages (Haskell, Lisp).
The "Map" in the name refers to mapping a unary function over a list and is used to divide the task among the individual slave nodes. The "Reduce" means reducing the resulting lists using a binary function and serves to combine the partial results on the master node (merging duplicate keys and records).
The principle of master and slave nodes is clearly applied here. The master node receives a user request and forwards the map function to its slave servers to perform. It then collects the results from the slave servers, reduces them and returns the final result to the user.
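The model can be sketched in a few lines of plain Python using the classic word-count task. This is an illustration of the programming model only, not the Hadoop API: `map_phase`, `reduce_phase` and `map_reduce` are hypothetical names, and the "shuffle" between the two phases is simulated with an in-memory sort.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # "Map": emit a (key, value) pair for every word -- in Hadoop this
    # part runs independently on each slave node.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # "Reduce": fold all values emitted for one key into one result.
    return (key, sum(values))

def map_reduce(documents):
    # Shuffle step: collect all intermediate pairs and group them by
    # key, as the framework would do between the two phases.
    intermediate = sorted(pair for doc in documents for pair in map_phase(doc))
    return [reduce_phase(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(map_reduce(["big data", "big cluster"]))
# → [('big', 2), ('cluster', 1), ('data', 1)]
```

In a real cluster the map calls run in parallel on the slave nodes and the framework moves the intermediate pairs over the network; the structure of the computation, however, stays exactly this.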
Hadoop MapReduce is built on Hadoop YARN, from which it acquires the following characteristics:
- distributability - even distribution of tasks to individual computing nodes with regard to their performance
- scalability - the possibility of adding additional nodes to the cluster
- robustness - task recovery after a local failure
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system designed to store very large files. It is an application written in Java forming another layer above the file system of the operating system. The HDFS client provides an application programming interface (API) similar to the POSIX standard. However, the authors deliberately diverged from strict POSIX compliance in exchange for higher performance. HDFS stores information about the data and the data itself separately: metadata are stored on a dedicated server called the "NameNode", while application data are located on multiple servers called "DataNodes". All servers are interconnected and communicate with each other using TCP-based protocols.
Briefly, this file system can be described as:
- economical - it runs on standard servers
- reliable - it solves data redundancy and recovery in case of failure (without RAID)
- expandable - it automatically transfers data to new nodes when the cluster grows
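The separation of metadata and data described above can be sketched with a toy model. The function below is illustrative only (the names, the tiny block size and the round-robin placement are my assumptions, not NameNode internals): it splits a file into fixed-size blocks and records which DataNodes hold each replica, which is precisely the kind of block map the NameNode keeps as metadata.

```python
BLOCK_SIZE = 4      # bytes per block here; real HDFS defaults to 128 MB
REPLICATION = 3     # HDFS's default replication factor

def place_blocks(file_bytes, datanodes):
    """Toy sketch of NameNode bookkeeping: split a file into blocks
    and assign each block to REPLICATION distinct DataNodes."""
    block_map = []
    for i in range(0, len(file_bytes), BLOCK_SIZE):
        block = file_bytes[i:i + BLOCK_SIZE]
        # Round-robin placement; the real NameNode also considers
        # rack topology and free space on each DataNode.
        replicas = [datanodes[(i // BLOCK_SIZE + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        block_map.append((block, replicas))
    return block_map

nodes = ["dn1", "dn2", "dn3", "dn4"]
for block, replicas in place_blocks(b"hello hdfs!", nodes):
    print(block, replicas)
```

Because each block lives on three different DataNodes, losing any single node leaves every block recoverable, which is how HDFS achieves reliability without RAID.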
These tools form the basis for a wide range of solutions, which gives us freedom in choosing the most efficient way to access large data. In terms of the price/performance ratio, they make it possible to build truly high-end solutions that are trusted by the biggest companies. Now I would like to mention a couple of them that are, in my opinion, crucial.
Apache HBase is a non-relational (column-oriented) distributed database. HBase uses HDFS as its storage and supports both batch-oriented computation using MapReduce and point queries (random reads). It suits workloads where response speed matters more than momentary consistency and completeness.
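To make "column-oriented" concrete, the following toy class sketches HBase's data model rather than its client API (the class and method names are my own): a value is addressed by a row key and a `family:qualifier` column, and every write is kept with a timestamp, so a cell holds multiple versions and a point query returns the newest one.

```python
import time
from collections import defaultdict

class ColumnStore:
    """Toy sketch of HBase's data model (not the HBase client API)."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        # Each write becomes a new timestamped version of the cell.
        self.rows[row][column].insert(0, (ts or time.time(), value))

    def get(self, row, column):
        # Point query (random read): return only the newest version.
        versions = self.rows[row][column]
        return versions[0][1] if versions else None

table = ColumnStore()
table.put("user#42", "info:name", "Alice")
table.put("user#42", "info:name", "Alicia")   # newer version shadows older
print(table.get("user#42", "info:name"))      # → Alicia
```

Rows in HBase are sparse: a row stores only the columns actually written to it, which is what makes wide, mostly-empty tables cheap, and the `defaultdict` above mimics exactly that.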
Apache Hive is a distributed data warehouse whose data is stored in HDFS. Using HQL (a query language based on SQL that is translated on the fly into MapReduce tasks), Hive allows tables, charts and reports to be created directly on top of the data, enabling a structured approach.
Apache ZooKeeper is a centralized service for configuration management, naming, distributed synchronization and group services. It simplifies management of the servers on which Hadoop is running and reduces the risk of inconsistent settings across servers.
Besides these solutions, many other projects and extensions are built on the Hadoop framework. The non-relational database Apache Cassandra has gained great popularity, as has Apache Ambari, a user-friendly web interface for managing and monitoring clusters.
It is possible to achieve very good results with the Apache Hadoop family of tools at very low cost. Since Hadoop is designed to run on commodity hardware, there is no need to process data on specialized computers designed for large data volumes.