BIG Data Implementation - Java Based Tools
What follows is a brief presentation of the major open-source, Java-based tools available today that support Big Data:

HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster primarily consists of a NameNode, which manages the file system metadata, and DataNodes, which store the actual data. HDFS is specifically designed for storing vast amounts of data, so it is optimized for storing and accessing a relatively small number of very large files, in contrast to traditional file systems, which are optimized to handle large numbers of relatively small files.
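
As a taste of how applications talk to HDFS, here is a minimal sketch using the Hadoop FileSystem Java API; the file path is made up for the example, and a core-site.xml on the classpath pointing at your cluster is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (the path is illustrative)
        Path file = new Path("/tmp/hdfs-example.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello HDFS");
        }

        // Read it back
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}
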
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
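
The classic illustration is WordCount; the sketch below uses the org.apache.hadoop.mapreduce API and takes its input and output paths from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}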

Apache HBase is the Hadoop database, a distributed, scalable, big data store. It provides random, realtime read/write access to Big Data and is optimized for hosting very large tables – billions of rows by millions of columns – atop clusters of commodity hardware. At its core, Apache HBase is a distributed, versioned, column-oriented store modeled after Google’s Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
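
A short sketch of the HBase Java client API is shown below, writing and then reading back a single cell; the table name and column family are illustrative, and an hbase-site.xml on the classpath is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random, real-time read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}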

Apache Cassandra is a performant, linearly scalable, and highly available database that can run on commodity hardware or cloud infrastructure, making it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for users and the peace of mind of knowing that you can survive regional outages. Cassandra’s data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.
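
A minimal Java sketch, assuming the DataStax Java driver (3.x) and an illustrative keyspace and table, looks like this:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Contact point is illustrative; a local single-node cluster is assumed
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)");

            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Alice')");

            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            Row row = rs.one();
            System.out.println(row.getString("name"));
        }
    }
}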

Apache Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
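
Hive is commonly queried from Java over JDBC via HiveServer2; a minimal sketch, in which the connection URL and table name are illustrative, is:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL assumes a local HiveServer2 on the default port
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement();
             // HiveQL looks like SQL; the word_counts table is made up for the example
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM word_counts GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}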

Apache Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. Pig’s infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig’s language layer currently consists of a textual language called Pig Latin, which is developed with ease of programming, optimization opportunities and extensibility in mind.
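
Pig Latin can also be embedded in Java through the PigServer class; below is a minimal word-count sketch run in local mode, where the input path and field layout are illustrative:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Run Pig Latin in local mode; switch to ExecType.MAPREDUCE for a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // The store triggers compilation of the script into Map-Reduce jobs
        pig.store("counts", "output");
        pig.shutdown();
    }
}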

Apache Chukwa is an open source data collection system for monitoring large distributed systems. It is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, and includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Apache ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. In short, Apache ZooKeeper is a high-performance coordination service for distributed applications like those run on a Hadoop cluster.
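
A minimal sketch of the ZooKeeper Java client, publishing a configuration value as a znode and reading it back (the connection string and znode path are illustrative):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (address is illustrative)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a piece of configuration as a znode, then read it back
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "value".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}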

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
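
Sqoop is normally driven from the command line, but Sqoop 1.x also exposes a programmatic entry point through its org.apache.sqoop.Sqoop class; the sketch below is the rough equivalent of a sqoop import command, with the JDBC URL, credentials, table and target directory all made up for the example:

import org.apache.sqoop.Sqoop;

public class SqoopExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",  // illustrative JDBC URL
            "--username", "etl",                       // illustrative credentials
            "--table", "orders",                       // illustrative source table
            "--target-dir", "/data/orders"             // HDFS destination
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}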

Apache Oozie is a scalable, reliable and extensible workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
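
Workflows can be submitted programmatically through the Oozie Java client API; a minimal sketch, in which the server URL, workflow path and properties are illustrative:

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // URL of the Oozie server (illustrative)
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");

        // Submit and start the workflow, then poll its status until it finishes
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10_000);
        }
        System.out.println("Workflow finished: " + oozie.getJobInfo(jobId).getStatus());
    }
}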

Apache Mahout is a scalable machine learning and data mining library. Currently Mahout mainly supports four use cases (a recommender sketch follows the list):
- Recommendation mining: takes users’ behavior and from that tries to find items users might like.
- Clustering: takes e.g. text documents and groups them into groups of topically related documents.
- Classification: learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the (hopefully) correct category.
- Frequent itemset mining: takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.
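
As a taste of the first use case, here is a minimal user-based recommender sketch built on Mahout’s Taste API; the ratings file name, neighborhood size and user ID are illustrative:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference rows; the file name is illustrative
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}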

Apache HCatalog is a table and storage management service for data created using Apache Hadoop. This includes:
- Providing a shared schema and data type mechanism.
- Providing a table abstraction so that users need not be concerned with where or how their data is stored.
- Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.
That’s it: a short theoretical introduction to Big Data and a compact matrix of implementation approaches, focused on overcoming the problems of a new era – the era that forces us to ask bigger questions!
