News for April 2011

Before Big Data gets any bigger… Catch it.

 

What is Big Data?

In non-technical language, as the name suggests, ‘Big Data’ is the term used for voluminous data (structured, semi-structured and unstructured). Unfortunately, most traditional tools are not capable of handling terabytes to petabytes of data, and hence the definition of Big Data also includes the ability to process this monster data.

Now for some jargon: ‘Big Data’ systems store and query huge amounts of data (of the order of petabytes and above) on large clusters of commodity hardware[1]. They do not need expensive storage devices like RAID arrays or powerful systems like supercomputers. Big Data platforms are horizontally scalable and fault tolerant, with a high concurrency rate. They have a distributed database architecture[2] as their backbone to perform data-intensive distributed computing, and can be set up either on a cluster of machines or on a single high-performance server.
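To make the programming model behind this concrete, here is a minimal, single-process Python sketch of the MapReduce pattern (map, shuffle, reduce) that frameworks such as Hadoop run in parallel across clusters of commodity machines. The sample records and the word-count task are purely illustrative assumptions, not any particular product's API.

# Toy word count written in MapReduce style: map -> shuffle (group by key) -> reduce.
# Single-process sketch of the programming model that Big Data frameworks
# such as Hadoop run in parallel over clusters of commodity machines.
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in record.lower().split()]

def reduce_phase(word, counts):
    """Sum the counts collected for a single word."""
    return word, sum(counts)

# Illustrative input; in a real cluster each record would come from a block
# of a distributed file system and be mapped on the node that stores it.
records = ["big data is big", "data lives on many nodes", "nodes hold data"]

emitted = [pair for record in records for pair in map_phase(record)]
emitted.sort(key=itemgetter(0))                          # shuffle: group by key
results = [reduce_phase(word, (c for _, c in group))
           for word, group in groupby(emitted, key=itemgetter(0))]
print(sorted(results, key=lambda kv: -kv[1])[:3])        # top 3 most frequent words

In a real cluster the map calls run on the nodes that already hold the data blocks (“move computation to data”), and the framework takes care of the shuffle, the reduce and any node failures.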

Big Data is not exactly the disruptive force a few people make it out to be, though it does have the potential to change the way we see and work with data warehouses. Big Data is here to complement the existing traditional data warehouse, helping organizations process enormous amounts of structured and unstructured data, in multiple formats and containing a wealth of information, in far less time than a traditional data warehouse would take.

 

Where is Big Data used or most applicable?

  • Tasks that require batch data processing and are not real-time/user facing (e.g. document analysis and indexing, web graphs and crawling); a small indexing sketch follows the source link below
  • Applications with a high degree of parallel, data-intensive, distributed computing
  • Big Data applications are often very industry specific and used in very large production (grid) deployments, such as geological exploration in the energy sector, genome research, medical research applications that predict disease, and predicting terrorist threats

Source: http://www.movingtothecloud.com/
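As an illustration of the batch document analysis and indexing mentioned in the first bullet above, the following is a small, hypothetical Python sketch of an inverted index, i.e. a map from each word to the documents containing it. Production crawlers and indexers build the same structure over billions of pages; the documents here are illustrative only.

# Minimal inverted index: the core data structure behind document indexing.
# Batch-friendly: documents can be processed in any order and merged later.
from collections import defaultdict

def build_index(documents):
    """Return {word: set of document ids containing that word}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Illustrative documents; a real deployment would read crawled pages instead.
docs = {
    "doc1": "big data batch processing",
    "doc2": "web crawling and indexing",
    "doc3": "batch indexing of web documents",
}
index = build_index(docs)
print(sorted(index["indexing"]))   # -> ['doc2', 'doc3']
print(sorted(index["batch"]))      # -> ['doc1', 'doc3']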

What can we do with Big Data?

If one had to categorize how various industries can leverage the Big Data concept, it would be as shown in the table below:

Industry                        Big Data Purpose
Life Science                    Genome analysis; developing drug models
Healthcare                      Patient behaviour studies to treat chronic diseases; adverse drug effect analysis
Retail                          Contextual and targeted ad marketing; point-of-sale analysis; product recommendation engines (e.g. Amazon); customer churn analysis
Insurance                       Risk modelling; location intelligence; catastrophe modelling and mapping services; claims fraud detection and incident tracking
Banking & Financial Services    Stock exchange trade data processing and surveillance; credit card fraud detection
Government/Others               Internet Archive (approx. 20 TB per month); Large Hadron Collider, Switzerland (approx. 15 PB per year); user check-ins (Foursquare, Gowalla, etc.)

As you can see in the table, Big Data can help in a big way with fraud detection and prevention in the financial services sector, and with digital marketing optimization in sectors like retail, consumer goods, healthcare and life science. Big Data can also help organizations take strategic decisions by analyzing the wealth of information available inside social networks. After analysis, the results can be brought back into the data warehouse and applied to production data so that the necessary action can be taken.

For example, if an online retailer’s customer always buys designer wear, the search indexes behind the recommendation engine can be revised constantly. A Hadoop-based system can scrub web clicks and the most popular search indexes as they arrive, whereas a traditional data warehouse would need several years of integrated historical data to do the same.
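A tiny sketch of the “customers who bought X also bought Y” idea behind such recommendation engines is shown below. The baskets and product names are illustrative assumptions; a real engine would compute these co-occurrence counts over billions of transactions on a Hadoop-style cluster.

# Toy item-to-item recommendation via co-occurrence counting.
# For each product, count how often other products appear in the same basket.
from collections import defaultdict, Counter
from itertools import permutations

baskets = [  # illustrative purchase history, one basket per customer visit
    ["designer jeans", "silk scarf"],
    ["designer jeans", "leather belt", "silk scarf"],
    ["running shoes", "sports socks"],
]

co_occurrence = defaultdict(Counter)
for basket in baskets:
    for bought, also_bought in permutations(set(basket), 2):
        co_occurrence[bought][also_bought] += 1

def recommend(product, top_n=2):
    """Recommend the items most often bought together with `product`."""
    return [item for item, _ in co_occurrence[product].most_common(top_n)]

print(recommend("designer jeans"))   # -> ['silk scarf', 'leather belt']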

 

When is it used?

  • To process large amounts of semi-structured data, such as analyzing log files
  • When your processing can easily be made parallel, such as sorting an entire country’s census data (see the sketch after this list)
  • When running batch jobs is acceptable, for example website crawling by search engines
  • When you have access to lots of cheap hardware
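As a small illustration of the second bullet (parallelizable processing such as a large sort), the sketch below sorts independent chunks concurrently and then merges the sorted runs. The random records and pool size are illustrative assumptions; this is essentially the shuffle/sort step that MapReduce frameworks perform for you across machines.

# Sketch of a parallel sort: sort chunks independently, then merge the runs.
# This is the shape of the shuffle/sort step that MapReduce systems automate.
import heapq
import random
from multiprocessing import Pool

def sort_chunk(chunk):
    """Each worker sorts its own slice of the records."""
    return sorted(chunk)

if __name__ == "__main__":
    # Illustrative stand-in for census records (e.g. sort by a numeric id).
    records = [random.randint(0, 10_000_000) for _ in range(100_000)]
    chunks = [records[i::4] for i in range(4)]         # 4 independent chunks
    with Pool(processes=4) as pool:
        sorted_runs = pool.map(sort_chunk, chunks)     # parallel sort phase
    merged = list(heapq.merge(*sorted_runs))           # merge the sorted runs
    assert merged == sorted(records)
    print(merged[:5], "...", merged[-5:])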

 

When not to use Big Data?

If you are talking about data that can fit into memory and be processed without too much trouble, then Big Data is not for you. For example, up to a few TB of data can be processed using existing tools like MySQL and does not require a Big Data back end. Put another way, if someone uses a Big Data stack to process just a few GB or TB of data, they have money to burn and time to waste.
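To make the contrast concrete, the sketch below loads a small, purely illustrative dataset into SQLite (any conventional relational database such as MySQL would serve equally well) and answers an aggregate query directly, with no cluster involved.

# A dataset of this size needs no Big Data stack: an ordinary relational
# database (SQLite here, MySQL/PostgreSQL in practice) answers it directly.
import sqlite3

conn = sqlite3.connect(":memory:")                 # illustrative in-memory DB
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
rows = [("north", 120.0), ("south", 80.5), ("north", 99.9), ("east", 10.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Plain SQL aggregation: no cluster, no batch job, millisecond latency.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)
conn.close()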

 

Concepts/Buzzwords for Big Data:

  • Open source
  • Fault tolerant systems
  • Horizontally scalable
  • Commodity hardware
  • MapReduce Algorithm
  • Multi Petabyte Datasets
  • Open data format
  • High throughput
  • Move computation to data
  • Column-oriented DBMS
  • Massively Parallel Processing (MPP)
  • Distributed File System
  • Resource Description Framework (RDF)
  • Data mining grids

 

Reference:

http://www.teradatamagazine.com/
http://en.wikipedia.org/wiki/
http://www.gigaom.com
http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10739/ds_concepts.htm

 

 


[1] Commodity hardware is nothing but large numbers of readily available computing components from various vendors, put together in clusters for parallel computing. This helps achieve maximum computational power at low cost.
[2] Set of databases in a distributed system that can appear to applications as a single data source.