What is Hadoop?
Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data (multiple petabytes) on large clusters of commodity hardware. The original concepts behind Hadoop come from Google’s MapReduce and Google File System (GFS) papers. Hadoop was started by Doug Cutting as part of the Nutch project, and Yahoo! drove much of its early development. Today Apache Hadoop has become an enterprise-ready cloud computing technology and is becoming the industry’s de facto framework for big data processing.
Yahoo! runs the world’s largest Hadoop clusters and works with academic institutions and other large corporations on advanced cloud computing research. Yahoo! engineers are among the leading participants in the Hadoop community.
Why Hadoop?
A primary goal of Hadoop is to reduce the impact of a rack power outage or switch failure, so that even if these events occur the data remains readable. This reliability is achieved by replicating the data across multiple hosts, which removes the need for expensive RAID storage on individual machines. In addition, replication and node failures are handled automatically by the framework.
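For illustration, below is a minimal sketch of adjusting a single file’s replication factor through the HDFS Java API. The file path is hypothetical, and the cluster settings (core-site.xml, hdfs-site.xml) are assumed to be on the classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; ask HDFS to keep 3 copies of its blocks.
        // The NameNode re-replicates in the background -- no RAID needed.
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
    }
}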
Another thing unique to this model is data locality. In traditional data warehouses, I/O takes a major chunk of the time because data must be brought to the server for processing; Hadoop instead exposes the data locations and ships the processing to where the data resides, which provides very high aggregate bandwidth. Think of sending a 2MB JAR file to the node where the data lives versus pulling 2GB of data across the network to a processing server.
Hadoop support and tools are available from major enterprise players such as Amazon, IBM and others, and almost every big internet company (Facebook, The New York Times, Last.fm, Netflix, etc.) uses Hadoop to some extent.
When and where can Hadoop be used?
Hadoop can be used when processing can easily be made parallel (certain types of sort algorithms, for example), running batch jobs is acceptable, access to lots of cheap hardware is easy, and there are no real-time or user-facing requirements; typical workloads include document analysis and indexing, and web graphs and crawling.
Applications that require a high degree of parallel, data-intensive, distributed operation, such as very large production deployments (GRID) or processing large amounts of unstructured data, will also find Hadoop a good fit.
Having said this, one should also know where Hadoop should not be used:
- HDFS is not designed for low-latency access or for huge numbers of small files
- Hadoop MapReduce is not designed for interactive applications
- HBase is not a relational database or a POSIX file system and does not have transactions or SQL support
- HDFS and HBase are not focused on security, encryption or multi-tenancy
- Hadoop is not a classical GRID solution
It is for these reasons that big data platforms cannot replace traditional DW systems; they can only co-exist with them, easing the bottlenecks that traditional DWs pose in today’s environment. A few industries and use cases where Hadoop is already being used:
- Modeling true risk (Insurance & Healthcare)
- Fraud detection (Insurance, Banking & Financial Services)
- Customer churn analysis (Retail)
- Recommendation engine (e-commerce & retail)
- Image processing and analysis (criminal databases: face detection/matching)
- Trade surveillance (Stock Exchange)
- Genome analysis (protein folding)
- Check-ins by users (Foursquare, Gowalla, TripAdvisor, etc.)
- Sort large amounts of data
- Ad targeting (contextual ads)
- Point of sale analysis
- Network data analysis
- Search quality (Search engines)
- Internet archive processing
- Physics labs (e.g. the Large Hadron Collider, Switzerland, which generates 15PB of data per year)
How is Hadoop helping?
Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed to handle node failures automatically as part of the framework.
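To make this concrete, here is a minimal sketch of the canonical word-count job written against Hadoop’s Java MapReduce API. This is an illustration rather than a tuned production job; the input is assumed to be plain text, one record per line:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs on the node holding the data block; emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one small fragment of work per input split
            }
        }
    }
}

// Reduce phase: the framework shuffles all counts for a word to one reducer.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum));
    }
}
If a node fails mid-job, the framework simply re-executes its fragments on another node that holds a replica of the same data block.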
A few Hadoop buzzwords:
- Pig – A high-level data-flow language and execution framework for parallel computation; a platform for analyzing large data sets. The structure of Pig programs is amenable to substantial parallelization, which enables them to handle very large data sets. Pig’s infrastructure consists of a compiler that produces sequences of MapReduce programs.
- ZooKeeper – A high-performance coordination service for distributed applications: a centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. In the HBase stack, its primary work is master election, locating the ROOT region, and tracking region server membership.
- Hive – Facilitates ad-hoc querying, data summarization and analysis of large datasets. Provides a simple query language called HiveQL, which is based on SQL and can be used by both SQL users and MapReduce experts.
- HBase – A database. HBase is an open-source, distributed, versioned, column-oriented store modelled after Google’s Bigtable (“A Distributed Storage System for Structured Data”). HBase provides Bigtable-like capabilities on top of the Hadoop core. It is written in Java and focused more on scalability and robustness. HBase is recommended when you have records that are very sparse, and it is also great for versioned data; it is not recommended for storing large amounts of binary data. It uses HDFS as its file system and locates data via its (row:string, column:string, timestamp) data model. A minimal usage sketch appears after this list.
- MapReduce – The core programming model for filtering and aggregating data in parallel (more about this in a separate post)
- Cassandra – Cassandra was open sourced by Facebook in 2008. It is a highly scalable second-generation column-oriented distributed database. It brings together Dynamo’s fully distributed design and Bigtable’s Column Family-based data model.
- Oozie – Yahoo!’s workflow engine to manage and coordinate data processing jobs running on Hadoop, including HDFS, Pig and MapReduce. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop
- Nutch – It is a crawler and search engine built on Lucene and Solr.
- Lucene – A free-text indexing and search library
- Mahout – Apache Mahout is a scalable machine learning library that supports large data sets. It currently provides collaborative filtering, user- and item-based recommendation, various types of clustering, frequent pattern mining, decision-tree-based classification, etc.
- Solr – A high-performance enterprise search server
- Tika – Toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
- Hypertable – An HBase alternative, written in C++ and primarily focused on performance. It is not designed to support transactional applications, but it is designed to power high-traffic websites.
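As promised under the HBase entry above, here is a minimal sketch of writing and reading one cell through the classic HBase Java client. The table name (‘users’) and column family (‘profile’) are hypothetical and must already exist on the cluster:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users"); // hypothetical table

        // Write one cell: (row, family:qualifier, timestamp) -> value
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        table.put(put);

        // Read it back by row key
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"))));

        table.close();
    }
}
Note how lookups are always by row key and column, matching the (row, column, timestamp) data model described above; there is no SQL layer or secondary index in the core API.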
Difference between Pig and Hive:
Hive is supposed to be closer to a traditional RDBMS and will appeal more to a community comfortable with SQL. Hive is designed to store data in tables, with a managed schema. It is possible to integrate it with existing BI tools like MicroStrategy once the required drivers (e.g. ODBC), which are under development, are in place.
On the other hand, Pig can be easier for someone who has no experience with SQL. Pig Latin is procedural, whereas SQL is declarative. A simple comparison:
In SQL:
insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;
In Pig Latin:
Users = load 'users' as (name, age, ipaddr);
Clicks = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks = join Users by name, ValuableClicks by user;
Geoinfo = load 'geoinfo' as (ipaddr, dma);
UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Source: http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/
Categories: Big Data
Tags: big data, big table, bigdata, bigtable, Cassandra, distributed file system, Hadoop, Hbase, HDFS, hive, hypertable, mahout, Map reduce, MapReduce, Nutch, Solr, tika, zookeeper