News for April 2011

Before Big Data gets any bigger… Catch it.

 

What is Big Data?

In non-technical language, as the name suggests, ‘Big Data’ is the term used for voluminous data (structured, semi-structured and unstructured). Unfortunately, most traditional tools are not capable of handling terabytes to petabytes of data, and hence the definition of Big Data also includes the ability to process this monster data.

Now for some jargon: ‘Big Data’ systems store and query huge amounts of data (of the order of petabytes and above) on large clusters of commodity hardware[1]. They do not need expensive storage devices like RAID arrays or powerful systems like supercomputers. Big Data platforms are horizontally scalable and fault tolerant, with a high concurrency rate. They have a distributed database architecture[2] as their backbone to perform data-intensive distributed computing, and can be set up either on a cluster of machines or on a single high-performance server.
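To make the programming model behind this concrete, here is a minimal, single-process Python sketch of the MapReduce pattern (map, shuffle, reduce) that frameworks such as Hadoop run in parallel across clusters of commodity machines. The sample records and the word-count task are purely illustrative assumptions, not any particular product's API.

# Toy word count written in MapReduce style: map -> shuffle (group by key) -> reduce.
# Single-process sketch of the programming model that Big Data frameworks
# such as Hadoop run in parallel over clusters of commodity machines.
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in record.lower().split()]

def reduce_phase(word, counts):
    """Sum the counts collected for a single word."""
    return word, sum(counts)

# Illustrative input; in a real cluster each record would come from a block
# of a distributed file system and be mapped on the node that stores it.
records = ["big data is big", "data lives on many nodes", "nodes hold data"]

emitted = [pair for record in records for pair in map_phase(record)]
emitted.sort(key=itemgetter(0))                          # shuffle: group by key
results = [reduce_phase(word, (c for _, c in group))
           for word, group in groupby(emitted, key=itemgetter(0))]
print(sorted(results, key=lambda kv: -kv[1])[:3])        # top 3 most frequent words

In a real cluster the map calls run on the nodes that already hold the data blocks (“move computation to data”), and the framework takes care of the shuffle, the reduce and any node failures.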

Big Data is not exactly the disruptive force a few people make it out to be, though it does have the potential to change the way we see and work with data warehouses. Big Data is here to complement the existing traditional data warehouse, helping organizations process enormous amounts of structured and unstructured data, in multiple formats and containing a wealth of information, in far less time than a traditional data warehouse would take.

 

Where is Big Data used or most applicable?

  • Tasks that require batch data processing and are not real-time/user facing (e.g. document analysis and indexing, web graphs and crawling); a small indexing sketch follows the source link below
  • Applications with a high degree of parallel, data-intensive, distributed computing
  • Big Data applications are often very industry specific and used in very large production (grid) deployments, such as geological exploration in the energy sector, genome research, medical research applications that predict disease, and predicting terrorist threats

Source: http://www.movingtothecloud.com/
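As an illustration of the batch document analysis and indexing mentioned in the first bullet above, the following is a small, hypothetical Python sketch of an inverted index, i.e. a map from each word to the documents containing it. Production crawlers and indexers build the same structure over billions of pages; the documents here are illustrative only.

# Minimal inverted index: the core data structure behind document indexing.
# Batch-friendly: documents can be processed in any order and merged later.
from collections import defaultdict

def build_index(documents):
    """Return {word: set of document ids containing that word}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Illustrative documents; a real deployment would read crawled pages instead.
docs = {
    "doc1": "big data batch processing",
    "doc2": "web crawling and indexing",
    "doc3": "batch indexing of web documents",
}
index = build_index(docs)
print(sorted(index["indexing"]))   # -> ['doc2', 'doc3']
print(sorted(index["batch"]))      # -> ['doc1', 'doc3']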

What can we do with Big Data?

If one had to categorize how various industries can leverage the Big Data concept, it would be as shown in the table below:

Industry                        Big Data Purpose
Life Science                    Genome analysis; developing drug models
Healthcare                      Patient behaviour studies to treat chronic diseases; adverse drug effect analysis
Retail                          Contextual and targeted ad marketing; point-of-sale analysis; product recommendation engines (e.g. Amazon); customer churn analysis
Insurance                       Risk modelling; location intelligence; catastrophe modelling and mapping services; claims fraud detection and incident tracking
Banking & Financial Services    Stock exchange trade data processing and surveillance; credit card fraud detection
Government/Others               Internet Archive (approx. 20 TB per month); Large Hadron Collider, Switzerland (approx. 15 PB per year); user check-ins (Foursquare, Gowalla, etc.)

As you can see in the table, Big Data can help in a big way with fraud detection and prevention in the financial services sector, and with digital marketing optimization in sectors like retail, consumer goods, healthcare and life science. Big Data can also help organizations take strategic decisions by analyzing the wealth of information available inside social networks. After analysis, the results can be brought back into the data warehouse and applied to production data so that the necessary action can be taken.

For example, if an online retailer’s customer always buys designer wear, the search indexes behind the recommendation engine can be revised constantly. A Hadoop-based system can scrub web clicks and the most popular search indexes as they arrive, whereas a traditional data warehouse would need several years of integrated historical data to do the same.
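A tiny sketch of the “customers who bought X also bought Y” idea behind such recommendation engines is shown below. The baskets and product names are illustrative assumptions; a real engine would compute these co-occurrence counts over billions of transactions on a Hadoop-style cluster.

# Toy item-to-item recommendation via co-occurrence counting.
# For each product, count how often other products appear in the same basket.
from collections import defaultdict, Counter
from itertools import permutations

baskets = [  # illustrative purchase history, one basket per customer visit
    ["designer jeans", "silk scarf"],
    ["designer jeans", "leather belt", "silk scarf"],
    ["running shoes", "sports socks"],
]

co_occurrence = defaultdict(Counter)
for basket in baskets:
    for bought, also_bought in permutations(set(basket), 2):
        co_occurrence[bought][also_bought] += 1

def recommend(product, top_n=2):
    """Recommend the items most often bought together with `product`."""
    return [item for item, _ in co_occurrence[product].most_common(top_n)]

print(recommend("designer jeans"))   # -> ['silk scarf', 'leather belt']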

 

When is it used?

  • To process large amounts of semi-structured data, such as analyzing log files
  • When your processing can easily be made parallel, such as sorting an entire country’s census data (see the sketch after this list)
  • When running batch jobs is acceptable, for example website crawling by search engines
  • When you have access to lots of cheap hardware
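As a small illustration of the second bullet (parallelizable processing such as a large sort), the sketch below sorts independent chunks concurrently and then merges the sorted runs. The random records and pool size are illustrative assumptions; this is essentially the shuffle/sort step that MapReduce frameworks perform for you across machines.

# Sketch of a parallel sort: sort chunks independently, then merge the runs.
# This is the shape of the shuffle/sort step that MapReduce systems automate.
import heapq
import random
from multiprocessing import Pool

def sort_chunk(chunk):
    """Each worker sorts its own slice of the records."""
    return sorted(chunk)

if __name__ == "__main__":
    # Illustrative stand-in for census records (e.g. sort by a numeric id).
    records = [random.randint(0, 10_000_000) for _ in range(100_000)]
    chunks = [records[i::4] for i in range(4)]         # 4 independent chunks
    with Pool(processes=4) as pool:
        sorted_runs = pool.map(sort_chunk, chunks)     # parallel sort phase
    merged = list(heapq.merge(*sorted_runs))           # merge the sorted runs
    assert merged == sorted(records)
    print(merged[:5], "...", merged[-5:])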

 

When not to use Big Data?

If you are talking about data that can fit into memory and be processed without too much trouble, then Big Data is not for you. For example, up to a few TB of data can be processed using existing tools like MySQL and does not require a Big Data back end. Put another way, if someone uses a Big Data stack to process just a few GB or TB of data, they have money to burn and time to waste.
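To make the contrast concrete, the sketch below loads a small, purely illustrative dataset into SQLite (any conventional relational database such as MySQL would serve equally well) and answers an aggregate query directly, with no cluster involved.

# A dataset of this size needs no Big Data stack: an ordinary relational
# database (SQLite here, MySQL/PostgreSQL in practice) answers it directly.
import sqlite3

conn = sqlite3.connect(":memory:")                 # illustrative in-memory DB
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("north", 120.0), ("south", 80.5), ("north", 99.9), ("east", 10.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Plain SQL aggregation: no cluster, no batch job, millisecond latency.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)
conn.close()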

 

Concepts/Buzzwords for Big Data:

  • Open source
  • Fault tolerant systems
  • Horizontally scalable
  • Commodity hardware
  • MapReduce Algorithm
  • Multi Petabyte Datasets
  • Open data format
  • High throughput
  • Move computation to data
  • Column-oriented DBMS
  • Massively Parallel Processing (MPP)
  • Distributed File System
  • Resource Description Framework (RDF)
  • Data mining grids

 

Reference:

http://www.teradatamagazine.com/
http://en.wikipedia.org/wiki/
http://www.gigaom.com
http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10739/ds_concepts.htm

 

 


[1] Commodity hardware is nothing but large numbers of readily available computing components from various vendors, put together in clusters for parallel computing. This helps achieve maximum computational power at low cost.
[2] Set of databases in a distributed system that can appear to applications as a single data source.