News for the ‘Big Data’ Category

Hadoop – The Little Known Yellow Elephant

What is Hadoop?

Apache Hadoop is an open-source Java framework for processing and querying vast amounts of data (multiple petabytes) on large clusters of commodity hardware. The original concepts behind Hadoop come from Google’s papers on the Google File System and MapReduce. Hadoop started life inside the Nutch project, with Yahoo! becoming its biggest early sponsor and contributor. Today Apache Hadoop has become an enterprise-ready cloud computing technology and is becoming the industry’s de facto framework for big data processing.

The little yellow elephant

Yahoo! runs the world’s largest Hadoop clusters, works with academic institutions and other large corporations on advanced cloud computing research, and its engineers are among the leading participants in the Hadoop community.

Why Hadoop?

A primary design goal of Hadoop is to reduce the impact of a rack power outage or switch failure, so that even when these events occur the data remains readable. This reliability is achieved by replicating the data across multiple hosts, which removes the need for expensive RAID storage on individual machines. On top of this, replication and node failures are handled automatically.
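
How many copies of each block HDFS keeps is just a configuration property. The sketch below is a minimal illustration using the standard Hadoop FileSystem API; the dfs.replication property is real, but the file path is only a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Number of copies HDFS keeps of each block written by this client;
        // the cluster-wide default lives in hdfs-site.xml under the same key.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);
        // The replication factor of an existing file can also be changed later.
        fs.setReplication(new Path("/data/clicks.log"), (short) 3);
    }
}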

Another thing that sets Big Data platforms apart: in traditional data warehouses, I/O takes a major chunk of the time because data has to be brought to the server for processing. Here, the data locations are exposed and the processing is sent to wherever the data resides, which yields a very high aggregate bandwidth. Think of shipping a 2 MB jar file to the node where the data lives instead of pulling 2 GB of data across the network to a central server.

Hadoop support and tools are available from major enterprise players such as Amazon, IBM and others, and many well-known companies, including Facebook, The New York Times, Last.fm and Netflix, use Hadoop to some extent.

When and where can Hadoop be used?

Hadoop can be used when the processing can easily be made parallel (certain types of sort algorithms, for example), running batch jobs is acceptable, access to lots of cheap hardware is easy, and there are no real-time or user-facing requirements. Typical workloads include document analysis and indexing, web graphs and crawling.

Applications that require a high degree of parallel, data-intensive, distributed operation, such as very large production deployments (GRID) or processing large amounts of unstructured data, will also find Hadoop a good fit.

Having said this, one should also know where Hadoop should not be used. Below are a few such points:

  • HDFS is not designed for low-latency access to a huge number of small files
  • Hadoop MapReduce is not designed for interactive applications
  • HBase is not a relational database and has no transactions or SQL support (and HDFS is not a POSIX file system)
  • HDFS and HBase are not focused on security, encryption or multi-tenancy
  • Hadoop is not a classical GRID solution

It is for these reasons that Big Data platforms are not expected to replace traditional DW systems; rather, they co-exist with them and ease the bottlenecks that traditional DWs face in today’s environment. A few industries where Hadoop is already being used are listed below:

  1. Modeling true risk (Insurance & Healthcare)
  2. Fraud detection (Insurance, Banking & Financial Services)
  3. Customer churn analysis (Retail)
  4. Recommendation engine (e-commerce & retail)
  5. Image processing and analysis (criminal databases – face detection/matching)
  6. Trade surveillance (Stock Exchange)
  7. Genome analysis (protein folding)
  8. Check-ins by users (Foursquare, Gowalla, TripAdvisor, etc.)
  9. Sorting large amounts of data
  10. Ad targeting (contextual ads)
  11. Point of sale analysis
  12. Network data analysis
  13. Search quality (Search engines)
  14. Internet archive processing
  15. Physics labs (e.g. the Large Hadron Collider, Switzerland – generates about 15 PB of data per year)

How is Hadoop helping?

Hadoop implements a computational paradigm named MapReduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed to handle node failures automatically as part of the framework.
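
To make the paradigm concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. It is an illustration rather than production code: the /input and /output paths are placeholders and a configured Hadoop installation (or local mode) is assumed.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: runs on the node holding the input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the framework groups values by key, so we just sum the counts per word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Notice that only the small job jar is shipped to the nodes; the gigabytes of input never leave the machines they are stored on.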

A few Hadoop buzzwords:

  • Pig – A high-level data-flow language and execution framework for parallel computation; a platform for analyzing large data sets. The structure of Pig programs is amenable to substantial parallelization, which enables them to handle very large data sets. Pig includes a compiler that produces sequences of MapReduce programs.

  • ZooKeeper – A high-performance coordination service for distributed applications: a centralized service for maintaining configuration information, naming, distributed synchronization and group services. Within HBase, for example, it handles master election, locating the ROOT region and region server membership.

  • Hive – Facilitates ad-hoc querying, data summarization and analysis of large datasets. It provides a simple query language called HiveQL, which is based on SQL, and can be used by both SQL users and MapReduce experts.

  • HBase – The Hadoop database. HBase is an open-source, distributed, versioned, column-oriented store modelled after Google’s Bigtable (a distributed storage system for structured data) and provides Bigtable-like capabilities on top of the Hadoop core. It is written in Java and focuses on scalability and robustness. HBase is recommended when your records are very sparse, and it is also great for versioned data; it is not recommended for storing large amounts of binary data. It uses HDFS as its file system, and its data model locates each value by (row:string, column:string, timestamp). A minimal client sketch follows this list.


  • MapReduce – The fundamental programming model Hadoop uses to filter and aggregate data in parallel (more about this in a separate post).

  • Cassandra – Cassandra was open sourced by Facebook in 2008. It is a highly scalable second-generation column-oriented distributed database. It brings together Dynamo’s fully distributed design and Bigtable’s Column Family-based data model.

  • Oozie – Yahoo!’s workflow engine to manage and coordinate data processing jobs running on Hadoop, including HDFS, Pig and MapReduce. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop

  • Nutch – It is a crawler and search engine built on Lucene and Solr.

  • Lucene – A free full-text indexing and search library.

  • Mahout – Apache Mahout is a scalable machine learning library that supports large data sets. It currently covers collaborative filtering, user- and item-based recommendation, various types of clustering, frequent pattern mining, decision-tree-based classification, and more.

  • Solr – High performance Enterprise search server

  • Tika – Toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

  • Hypertable – An HBase alternative, written in C++ and primarily focused on performance. It is not designed to support transactional applications, but rather to power high-traffic websites.
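
As promised under the HBase entry above, here is a minimal client sketch against the classic HTable API (pre-1.0 HBase). The table name webtable, the column family content and the row key are made-up examples, and a running HBase instance is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "webtable");

        // Write one cell: addressed by (row, column family:qualifier, timestamp).
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
        table.put(put);

        // Read the latest version back by row key.
        Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
        byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(html));

        table.close();
    }
}

Note how both the write and the read address a cell by row key and column family:qualifier, with the timestamp handled implicitly – exactly the (row, column, timestamp) data model described above.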

Difference between Pig and Hive:

Hive is closer to a traditional RDBMS and will appeal more to a community comfortable with SQL. Hive is designed to store data in tables with a managed schema, and it can be integrated with existing BI tools such as MicroStrategy once the required drivers (e.g. ODBC), which are under development, are in place.

On the other hand, Pig can be easier for someone who has no experience with SQL. Pig Latin is procedural, whereas SQL is declarative. A simple comparison:

In SQL:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
    select name, ipaddr
    from users join clicks on (users.name = clicks.user)
    where value > 0
) using (ipaddr)
group by dma;

In Pig Latin:

Users                = load 'users' as (name, age, ipaddr);
Clicks               = load 'clicks' as (user, url, value);
ValuableClicks       = filter Clicks by value > 0;
UserClicks           = join Users by name, ValuableClicks by user;
Geoinfo              = load 'geoinfo' as (ipaddr, dma);
UserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA                = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Source: http://developer.yahoo.com/blogs/hadoop/posts/2010/01/comparing_pig_latin_and_sql_fo/


Before Big Data gets any bigger… Catch it.

 

What is Big Data?

In non-technical language, as the name suggests, ‘Big Data’ is the term used for voluminous data (structured, semi-structured and unstructured). Unfortunately, most traditional tools are not capable of handling terabytes to petabytes of data, and hence the definition of Big Data also includes the ability to process data at this scale.

Now for some jargon: ‘Big Data’ platforms store and query huge amounts of data (on the order of petabytes and above) on large clusters of commodity hardware[1]. They do not need expensive storage devices such as RAID arrays or powerful systems such as supercomputers. A Big Data platform is horizontally scalable and fault tolerant, with a high concurrency rate. It has a distributed database architecture[2] as its backbone to perform data-intensive distributed computing, and it can be set up either on a cluster of machines or on a single high-performance server.

Big Data is not exactly the disruptive force a few people make it out to be, though it has the potential to change the way we see and work with data warehouses. Big Data is here to complement the existing traditional data warehouse by helping organizations process enormous amounts of structured and unstructured data, in multiple formats and containing a wealth of information, in far less time than a traditional data warehouse would need.

 

Where is Big Data used or most applicable?

  • Tasks that require batch data processing and are not real-time or user-facing (e.g. document analysis and indexing, web graphs and crawling)
  • Applications that require a high degree of parallel, data-intensive, distributed computing
  • Very large, industry-specific production deployments (GRID), such as geological exploration in energy, genome research, medical research applications that predict disease, and predicting terrorist threats

Source: http://www.movingtothecloud.com/

What can we do with Big Data?

If one has to categorize how various industries can leverage the Big Data concept, it would look like this:

  • Life Science – genome analysis; developing drug models
  • Healthcare – patient behaviour studies to treat chronic diseases; adverse drug effect analysis
  • Retail – contextual and targeted ad marketing; point-of-sale analysis; product recommendation engines (e.g. Amazon); customer churn analysis
  • Insurance – risk modelling; location intelligence; catastrophe modelling and mapping services; claims fraud detection and incident tracking
  • Banking & Financial Services – processing and surveillance of stock exchange trade data; credit card fraud detection
  • Government/Others – Internet Archive (approx. 20 TB per month); Large Hadron Collider, Switzerland (approx. 15 PB per year); user check-ins (Foursquare, Gowalla, etc.)

As you can see above, Big Data can help in a big way with fraud detection and prevention in the financial services sector and with digital marketing optimization in sectors like retail, consumer goods, healthcare and life science. Big Data can also help organizations take strategic decisions by analyzing the wealth of information available inside social networks. After analysis, the results can be brought back into the DW and applied to production data so that the necessary action can be taken.

For example, if an online retailer’s customer always buys designer wear, the search indexes feeding the recommendation engine can be revised continuously. A Hadoop-based system can scrub web clicks and the most popular search indexes on the fly, whereas the traditional data warehouse would need several years of integrated historical data to do the same.

 

When is it used?

  • To process large amounts of semi-structured data, such as analyzing log files
  • When the processing can easily be made parallel, such as sorting an entire country’s census data
  • When running batch jobs is acceptable, for example website crawling by search engines
  • When you have access to lots of cheap hardware

 

When not to use Big Data?

If you are talking about data that fits into memory, or that can be processed without too much trouble by existing tools, then Big Data is not for you. For example, data sets of up to a few terabytes can be handled by existing tools such as MySQL and do not require a Big Data backend. Anyone reaching for a Big Data stack to process just a few gigabytes of data has money to burn and time to waste.

 

Concepts/Buzzwords for Big Data:

  • Open source
  • Fault tolerant systems
  • Horizontally scalable
  • Commodity hardware
  • MapReduce Algorithm
  • Multi Petabyte Datasets
  • Open data format
  • High throughput
  • Move computation to data
  • Column-oriented DBMS
  • Massively Parallel Processing (MPP)
  • Distributed File System
  • Resource Description Framework (RDF)
  • Data mining grids

 

Reference:

http://www.teradatamagazine.com/
http://en.wikipedia.org/wiki/
http://www.gigaom.com
http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10739/ds_concepts.htm

 

 


[1] Commodity hardware is nothing but large numbers of already available computing components from various vendors put together in clusters for parallel computing. This helps achieve maximum computation power at low costs.
[2] Set of databases in a distributed system that can appear to applications as a single data source.