Big Data Implementation - Java-Based Tools
What follows is a brief presentation of the major
open-source, Java-based tools that are available today and support Big Data:
HDFS is the primary distributed storage used by Hadoop
applications. An HDFS cluster primarily consists of a NameNode, which manages
the file system metadata, and DataNodes, which store the actual data. HDFS is
specifically designed for storing vast amounts of data, so it is optimized for
storing and accessing a relatively small number of very large files, in contrast
to traditional file systems, which are optimized to handle large numbers of
relatively small files.
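To make this concrete, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API; the NameNode address and file path are placeholders for your own cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt");

        // Write a small file to HDFS (overwriting if present).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read it back.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```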
Hadoop MapReduce is a software framework for easily writing
applications which process vast amounts of data (multi-terabyte data-sets)
in-parallel on large clusters (thousands of nodes) of commodity hardware in a
reliable, fault-tolerant manner.
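The canonical example is word counting: the mapper emits a (word, 1) pair per token and the reducer sums the counts per word. The sketch below follows the standard Hadoop WordCount, taking input and output paths from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```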
Apache HBase is the Hadoop database: a distributed, scalable
big data store. It provides random, real-time read/write access to Big Data
and is optimized for hosting very large tables (billions of rows by millions
of columns) atop clusters of commodity hardware. At its core, Apache HBase is
a distributed, versioned, column-oriented store modeled after Google’s Bigtable,
described in "Bigtable: A Distributed Storage System for Structured Data" by
Chang et al. Just as Bigtable leverages the distributed data storage provided
by the Google File System, Apache HBase provides Bigtable-like capabilities on
top of Hadoop and HDFS.
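A minimal sketch of that random, real-time read/write access, using the classic (pre-1.0) HBase client API; the 'users' table and its 'info' column family are assumed to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a table 'users' with column family 'info' already exists.
        HTable table = new HTable(conf, "users");

        // Random, real-time write: one row keyed by user id.
        Put put = new Put(Bytes.toBytes("user-42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Random, real-time read of the same row.
        Get get = new Get(Bytes.toBytes("user-42"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

        table.close();
    }
}
```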
Apache Cassandra is a performant, linearly scalable, and highly
available database that can run on commodity hardware or cloud infrastructure,
making it the perfect platform for mission-critical data. Cassandra’s support
for replicating across multiple datacenters is best-in-class, providing lower
latency for users and the peace of mind of knowing that you can survive
regional outages. Cassandra’s data model offers the convenience of column
indexes with the performance of log-structured updates, strong support for
denormalization and materialized views, and powerful built-in caching.
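As a quick illustration, the sketch below talks to a single-node cluster over CQL using the DataStax Java driver 3.x (a separate dependency); the contact point, keyspace, and table are placeholder assumptions.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Placeholder contact point; adjust to your cluster.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(id int PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')");

            // Read the row back with CQL.
            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            for (Row row : rs) {
                System.out.println(row.getString("name"));
            }
        }
    }
}
```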
Apache Hive is a data warehouse system for Hadoop that
facilitates easy data summarization, ad-hoc queries, and the analysis of
large datasets stored in Hadoop-compatible file systems. Hive provides a
mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL. At the same time, this language also allows
traditional map/reduce programmers to plug in their custom mappers and
reducers when it is inconvenient or inefficient to express this logic in
HiveQL.
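For example, a Java client can submit HiveQL through the HiveServer2 JDBC driver; the sketch below projects a table onto tab-delimited data and runs an aggregate query. The connection URL, credentials, and table are illustrative.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL below is a placeholder.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Project structure onto data already sitting in HDFS ...
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (host STRING, bytes BIGINT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            // ... then query it with HiveQL; Hive compiles this to MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT host, SUM(bytes) AS total FROM logs GROUP BY host");
            while (rs.next()) {
                System.out.println(rs.getString("host") + "\t" + rs.getLong("total"));
            }
        }
    }
}
```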
Apache Pig is a platform for analyzing large data sets. It
consists of a high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs. The salient
property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
Pig’s infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs. Pig’s language layer currently consists of a textual
language called Pig Latin, which is developed with ease of programming,
optimization opportunities and extensibility in mind.
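A small sketch of embedding Pig Latin in Java through the PigServer API: the classic word count, run in local mode so it needs no cluster. The input and output paths are placeholders.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for experimenting; use ExecType.MAPREDUCE on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load lines, split into words, group, and count.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Compile to a sequence of Map-Reduce jobs and write the result.
        pig.store("counts", "wordcounts");
        pig.shutdown();
    }
}
```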
Apache Chukwa is an open source data collection system for
monitoring large distributed systems. It is built on top of the Hadoop
Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s
scalability and robustness. Chukwa also includes a flexible and powerful
toolkit for displaying, monitoring and analyzing results to make the best use
of the collected data.
Apache Ambari is a web-based tool for provisioning, managing,
and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS,
Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.
Ambari also provides a dashboard for viewing cluster health, with features such
as heatmaps, and the ability to view MapReduce, Pig, and Hive applications
visually, along with tools to diagnose their performance characteristics in a
user-friendly manner.
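Ambari is driven mainly through its web UI, but it also exposes a REST API. The sketch below lists the registered clusters from Java; the host, port, and admin credentials are placeholder assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and credentials for an Ambari server.
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("X-Requested-By", "ambari");

        // Print the JSON description of the registered clusters.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```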
Apache ZooKeeper is a centralized service for maintaining
configuration information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in some
form or another by distributed applications. In short, Apache ZooKeeper is a
high-performance coordination service for distributed applications, such as
those that run on a Hadoop cluster.
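A minimal sketch of the coordination idea using ZooKeeper's Java client: one process stores a piece of shared configuration in a znode, and any other process can read the same value. The connection string and znode path are illustrative.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Wait until the session is actually connected before using it.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration in a znode.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "batch.size=128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can now read the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```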
Apache Sqoop is a tool designed for efficiently transferring
bulk data between Apache Hadoop and structured datastores such as relational
databases.
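Sqoop is normally driven from the command line, but the same import can be launched from Java. The sketch below assumes Sqoop 1.x's org.apache.sqoop.Sqoop entry point and a MySQL source; all connection details are placeholders.

```java
import org.apache.sqoop.Sqoop;

public class SqoopExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host/shop",   // placeholder JDBC URL
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",                        // source relational table
            "--target-dir", "/data/orders"              // HDFS destination
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```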
Apache Oozie is a scalable, reliable and extensible workflow
scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are
Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are
recurrent Oozie Workflow jobs triggered by time (frequency) and data
availability. Oozie is integrated with the rest of the Hadoop stack, supporting
several types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific
jobs (such as Java programs and shell scripts).
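A short sketch of submitting a workflow through Oozie's Java client API; the server URL, application path, and properties are placeholder assumptions.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL and HDFS application path.
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8021");

        // Submit and start the workflow DAG, then poll its status.
        String jobId = client.run(conf);
        System.out.println("Workflow job " + jobId + " submitted");

        WorkflowJob job = client.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```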
Apache Mahout is a scalable machine learning and data mining
library. Currently Mahout mainly supports four use cases:
§ Recommendation mining: takes users’ behavior and from that tries to find items users might like (see the sketch after this list).
§ Clustering: takes e.g. text documents and groups them into groups of topically related documents.
§ Classification: learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category.
§ Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
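For the first use case, here is a minimal user-based recommender built on Mahout's Taste API; it assumes a ratings.csv file of userID,itemID,preference triples, and the neighborhood size is arbitrary.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference: the users' behavior.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Find the 10 users most similar to each user ...
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

        // ... and recommend items those neighbors liked.
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```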
Apache HCatalog is a table and storage management service for
data created using Apache Hadoop. This includes:
§ Providing a shared schema and data type mechanism.
§ Providing a table abstraction so that users need not be concerned with where or how their data is stored.
§ Providing interoperability across data processing tools such as Pig, MapReduce, and Hive (see the sketch below).
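A brief sketch of that interoperability: Pig (embedded via PigServer) reads a Hive-managed table through HCatalog's HCatLoader, so the script declares no path and no schema. The 'weblogs' table and its status column are hypothetical, and the HCatLoader package name varies across HCatalog versions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class HCatalogExample {
    public static void main(String[] args) throws Exception {
        // Pig reads a Hive-managed table through HCatalog: no paths,
        // no schema declaration, just the table name.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("raw = LOAD 'weblogs' "
                + "USING org.apache.hive.hcatalog.pig.HCatLoader();");
        pig.registerQuery("errors = FILTER raw BY status == 500;");
        pig.store("errors", "/out/errors");
        pig.shutdown();
    }
}
```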
That’s it: Big Data, a short theoretical introduction and a compact matrix of
implementation approaches focused on overcoming the problems of a new era, the
era that forces us to ask bigger questions!