{"id":140,"date":"2011-05-02T18:33:14","date_gmt":"2011-05-02T13:03:14","guid":{"rendered":"http:\/\/www.jkspeaks.com\/wordpress\/?p=140"},"modified":"2011-05-02T18:40:44","modified_gmt":"2011-05-02T13:10:44","slug":"hadoop-the-little-known-yellow-elephant","status":"publish","type":"post","link":"https:\/\/www.jkspeaks.com\/wordpress\/big-data\/hadoop-the-little-known-yellow-elephant\/","title":{"rendered":"Hadoop &#8211; The Little Known Yellow Elephant"},"content":{"rendered":"<p><strong>What is Hadoop!<\/strong><\/p>\n<p style=\"text-align: justify;\">Apache Hadoop is an open source Java framework for processing  and querying vast amounts of data (Multi Petabytes) on large clusters of  commodity hardware. The original concept behind Hadoop comes from Google\u2019s  BigTable. Hadoop is an initiative started and led by Yahoo! Today Apache Hadoop  has become an enterprise-ready cloud computing technology and is becoming the  industry de-facto framework for big data processing.<\/p>\n<p style=\"text-align: left;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hadoop.png\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-full wp-image-143\" title=\"Hadoop\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hadoop.png\" alt=\"The little yellow elephant\" width=\"322\" height=\"78\" srcset=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hadoop.png 322w, https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hadoop-300x72.png 300w\" sizes=\"auto, (max-width: 322px) 100vw, 322px\" \/><\/a><\/p>\n<p style=\"text-align: justify;\">Yahoo! runs the world\u2019s largest Hadoop clusters. They work with  academic institutions and other large corporations on advanced cloud computing  research. 
Yahoo engineers are among the leading participants in the Hadoop community.<\/p>\n<p><strong>Why Hadoop?<\/strong><\/p>\n<p style=\"text-align: justify;\">A primary goal of Hadoop is to reduce the impact of a rack power outage or a switch failure so that even when these events occur, the data remains readable. This reliability is achieved by replicating the data across multiple hosts, which removes the need for expensive RAID storage on individual hosts. Replication and recovery from node failures are handled automatically.<\/p>\n<p style=\"text-align: justify;\">Another distinguishing trait: in traditional data warehouses, IO operations (bringing data to the server for processing) take a major chunk of the time. Hadoop instead exposes the data locations and ships the processing to where the data resides, which provides very high aggregate bandwidth. Think of sending a 2MB jar file to where the data resides versus bringing 2GB of data to the server for processing.<\/p>\n<p style=\"text-align: justify;\">Hadoop support and tools are available from major enterprise players such as Amazon, IBM and others. Almost all of the big internet companies, including Facebook, the NY Times, Last.fm and Netflix, 
are using Hadoop to some extent.<\/p>\n<p><strong>When and where can Hadoop be used?<\/strong><\/p>\n<p style=\"text-align: justify;\">Hadoop can be used when processing can easily be made parallel (certain types of sort algorithms), running batch jobs is acceptable, plenty of cheap hardware is available, and there are no real-time or user-facing requirements; typical workloads include document analysis and indexing, and web graphs and crawling.<\/p>\n<p style=\"text-align: justify;\">Applications that require a high degree of parallel, data-intensive, distributed operation, such as very large production (GRID) deployments or processing large amounts of unstructured data, will also find Hadoop a good fit.<\/p>\n<p style=\"text-align: justify;\">Having said this, one should also know where Hadoop should not be used:<\/p>\n<ul style=\"text-align: justify;\">\n<li>HDFS is not designed for low-latency access to a huge number of small files<\/li>\n<li>Hadoop MapReduce is not designed for interactive applications<\/li>\n<li>HBase is not a relational database or a POSIX file system and does not have transactions or SQL support<\/li>\n<li>HDFS and HBase are not focused on security, encryption or multi-tenancy<\/li>\n<li>Hadoop is not a classical GRID solution<\/li>\n<\/ul>\n<p style=\"text-align: justify;\">For these reasons, Big Data cannot replace traditional DW systems; it can only co-exist with them, easing the bottlenecks that traditional DWs face in today\u2019s environment.\u00a0A few industries where Hadoop is already being used:<\/p>\n<ol>\n<li style=\"text-align: justify;\">Modeling true risk (Insurance &amp; Healthcare)<\/li>\n<li style=\"text-align: justify;\">Fraud detection (Insurance, Banking &amp; Financial Services)<\/li>\n<li style=\"text-align: justify;\">Customer churn analysis (Retail)<\/li>\n<li style=\"text-align: justify;\">Recommendation engine (e-commerce &amp; 
retail)<\/li>\n<li style=\"text-align: justify;\">Image processing and analysis (criminal databases: face detection\/matching)<\/li>\n<li style=\"text-align: justify;\">Trade surveillance (Stock Exchange)<\/li>\n<li style=\"text-align: justify;\">Genome analysis (protein folding)<\/li>\n<li style=\"text-align: justify;\">Check-ins by users (Foursquare, Gowalla, TripAdvisor, etc.)<\/li>\n<li style=\"text-align: justify;\">Sorting large amounts of data<\/li>\n<li style=\"text-align: justify;\">Ad targeting (contextual ads)<\/li>\n<li style=\"text-align: justify;\">Point-of-sale analysis<\/li>\n<li style=\"text-align: justify;\">Network data analysis<\/li>\n<li style=\"text-align: justify;\">Search quality (search engines)<\/li>\n<li style=\"text-align: justify;\">Internet archive processing<\/li>\n<li style=\"text-align: justify;\">Physics labs (e.g. the Large Hadron Collider in Switzerland, which generates 15PB of data per year)<\/li>\n<\/ol>\n<p><strong>How is Hadoop helping?<\/strong><\/p>\n<p style=\"text-align: justify;\">Hadoop implements a computational paradigm named <strong>Map\/Reduce<\/strong>, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (<strong>HDFS<\/strong>) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map\/Reduce and the distributed file system are designed to handle node failures automatically as part of the framework.<\/p>\n<p><strong>A Few Hadoop Buzzwords:<\/strong><\/p>\n<ul>\n<li style=\"text-align: justify;\"><strong><span style=\"color: #0000ff;\">Pig<\/span> <\/strong>&#8211; High-level data-flow language and execution framework for parallel computation. It\u2019s a platform for analyzing large data sets; the structure of Pig programs is amenable to substantial parallelization, which enables them to handle very large data sets. 
It consists of a compiler that produces sequences of Map\/Reduce programs<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/pig.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-153\" title=\"pig\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/pig.png\" alt=\"\" width=\"83\" height=\"122\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">ZooKeeper<\/span><\/strong> &#8211; High-performance coordination service for distributed applications: a centralized service for maintaining configuration information, naming, distributed synchronization and group services. Primary work (in HBase): master election, locating the ROOT region, and tracking region server membership<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/zk-logo.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-156\" title=\"zookeeper\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/zk-logo.png\" alt=\"\" width=\"217\" height=\"57\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Hive<\/span><\/strong> &#8211; Facilitates ad-hoc querying, data summarization and analysis of large datasets. Provides a simple query language called HiveQL, which is based on SQL. 
Can be used by both SQL users and MapReduce experts.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hive.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-145\" title=\"Hive\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hive.png\" alt=\"\" width=\"100\" height=\"90\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">HBase<\/span><\/strong> &#8211; Database. HBase is an open-source, distributed, versioned, column-oriented store modelled after Google\u2019s Bigtable (a distributed storage system for structured data). HBase provides Bigtable-like capabilities on top of Hadoop core. It is written in Java and focused on scalability and robustness. HBase is recommended when you have records that are very sparse, and it is also great for versioned data. It is not recommended for storing large amounts of binary data. It uses the HDFS file system. 
Data is located via its data model of <em>(Row:String, Column:String, Timestamp)<\/em> keys<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hbase.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-144\" title=\"Hbase\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/Hbase.png\" alt=\"\" width=\"90\" height=\"90\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">MapReduce<\/span><\/strong> &#8211; Hadoop\u2019s core programming model for distributed data processing. (<em>More about this in a separate post<\/em>)<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mapreduce-logo.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-150\" title=\"mapreduce-logo\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mapreduce-logo.jpg\" alt=\"\" width=\"340\" height=\"118\" srcset=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mapreduce-logo.jpg 340w, https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mapreduce-logo-300x104.jpg 300w\" sizes=\"auto, (max-width: 340px) 100vw, 340px\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Cassandra<\/span><\/strong> &#8211; Cassandra was open-sourced by Facebook in 2008. It is a highly scalable second-generation column-oriented distributed database. 
It brings together Dynamo\u2019s fully distributed design and Bigtable\u2019s Column Family-based data model.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/cassandra.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-142\" title=\"cassandra\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/cassandra.png\" alt=\"\" width=\"142\" height=\"95\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Oozie<\/span> <\/strong>&#8211; Yahoo!\u2019s workflow engine to manage and coordinate data processing\u00a0jobs running on Hadoop, including HDFS, Pig and MapReduce. It is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/oozie.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-152\" title=\"oozie\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/oozie.png\" alt=\"\" width=\"225\" height=\"52\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Nutch<\/span><\/strong> &#8211; A web crawler and search engine built on Lucene and Solr.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/nutch-logo.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-151\" title=\"nutch-logo\" 
src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/nutch-logo.gif\" alt=\"\" width=\"121\" height=\"48\" \/><\/a><\/p>\n<p style=\"text-align: justify;\"><a title=\"Nutch\" href=\"https:\/\/ch1blogs.cognizant.com\/blogs\/187735\/files\/2011\/05\/nutch-logo.gif\"><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Lucene<\/span><\/strong> &#8211; free text  indexing and search engine<strong> <\/strong><\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/lucene_green_150.gif\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-148\" title=\"lucene_green_150\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/lucene_green_150.gif\" alt=\"\" width=\"150\" height=\"23\" \/><\/a><\/p>\n<p style=\"text-align: justify;\"><a title=\"Lucene\" href=\"https:\/\/ch1blogs.cognizant.com\/blogs\/187735\/files\/2011\/05\/lucene_green_150.gif\"><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Mahout<\/span> <\/strong>&#8211; Apache  Mahout is a scalable machine learning library that supports large data sets. 
It currently supports collaborative filtering, user- and item-based recommendation, various types of clustering, frequent pattern mining, decision tree-based classification, etc.<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mahout.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-149\" title=\"mahout\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/mahout.png\" alt=\"\" width=\"185\" height=\"78\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Solr<\/span><\/strong> &#8211; High-performance enterprise search server<\/li>\n<\/ul>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/solr.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-154\" title=\"solr\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/solr.png\" alt=\"\" width=\"142\" height=\"61\" \/><\/a><\/p>\n<ul style=\"text-align: justify;\">\n<li><strong><span style=\"color: #0000ff;\">Tika<\/span><\/strong> &#8211; Toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/tika.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-155\" title=\"tika\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/tika.png\" alt=\"\" width=\"171\" height=\"29\" \/><\/a><\/p>\n<p 
style=\"text-align: justify;\"><a title=\"Tika\" href=\"https:\/\/ch1blogs.cognizant.com\/blogs\/187735\/files\/2011\/05\/tika.png\"><\/a><\/p>\n<ul>\n<li style=\"text-align: justify;\"><strong><span style=\"color: #0000ff;\">Hypertable<\/span><\/strong> &#8211;  Hypertable is an HBase alternative. Written in C++ and primarily focused on  Performance. It is not designed to support transactional applications but is  designed to power any high traffic website.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/hypertable.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-146\" title=\"hypertable\" src=\"https:\/\/www.jkspeaks.com\/wordpress\/wp-content\/uploads\/2011\/05\/hypertable.jpg\" alt=\"\" width=\"288\" height=\"75\" \/><\/a><\/p>\n<p><a title=\"Hypertable\" href=\"https:\/\/ch1blogs.cognizant.com\/blogs\/187735\/files\/2011\/05\/hypertable.jpg\"><\/a><\/p>\n<p><span style=\"color: #333333;\"><strong>Difference between Pig and Hive: <\/strong><\/span><span style=\"color: #333333;\"> <\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"color: #000000;\">Hive is supposed to be closer to a traditional RDBMS and will  appeal more to a community comfortable with SQL. Hive is designed to store data  in tables, with a managed schema. It is possible to integrate this with existing  BI tools like MicroStrategy once the required drivers (e.g. ODBC) which are  under development are in place.<\/span><\/p>\n<p style=\"text-align: justify;\">On the other hand, Pig can be easier for someone who had no  experience in SQL. Pig Latin is procedural, whereas SQL is declarative. 
A simple comparison:<\/p>\n<p><strong><em><span style=\"color: #0000ff;\">In SQL:<\/span><\/em><\/strong><\/p>\n<pre>insert into ValuableClicksPerDMA\nselect dma, count(*)\nfrom geoinfo join (\n    select name, ipaddr\n    from users join clicks on (users.name = clicks.user)\n    where value &gt; 0\n) using ipaddr\ngroup by dma;<\/pre>\n<p><strong><em><span style=\"color: #0000ff;\">In Pig Latin:<\/span><\/em><\/strong><\/p>\n<pre>Users                = load 'users' as (name, age, ipaddr);\nClicks               = load 'clicks' as (user, url, value);\nValuableClicks       = filter Clicks by value &gt; 0;\nUserClicks           = join Users by name, ValuableClicks by user;\nGeoinfo              = load 'geoinfo' as (ipaddr, dma);\nUserGeo              = join UserClicks by ipaddr, Geoinfo by ipaddr;\nByDMA                = group UserGeo by dma;\nValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);\nstore ValuableClicksPerDMA into 'ValuableClicksPerDMA';<\/pre>\n<p><em><span style=\"color: #808080;\">Source: <\/span><a 
href=\"http:\/\/developer.yahoo.com\/blogs\/hadoop\/posts\/2010\/01\/comparing_pig_latin_and_sql_fo\/\">http:\/\/developer.yahoo.com\/blogs\/hadoop\/posts\/2010\/01\/comparing_pig_latin_and_sql_fo\/<\/a><\/em><\/p>\n<p><em><br \/>\n<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is Hadoop! Apache Hadoop is an open source Java framework for processing and querying vast amounts of data (Multi Petabytes) on large clusters of commodity hardware. The original concept behind Hadoop comes from Google\u2019s BigTable. Hadoop is an initiative started and led by Yahoo! Today Apache Hadoop has become an enterprise-ready cloud computing technology [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[21],"tags":[22,33,32,31,30,25,29,28,26,34,35,36,37,23,38,39,40,41],"class_list":["post-140","post","type-post","status-publish","format-standard","hentry","category-big-data","tag-big-data-2","tag-big-table","tag-bigdata","tag-bigtable","tag-cassandra","tag-distributed-file-system","tag-hadoop","tag-hbase","tag-hdfs","tag-hive","tag-hypertable","tag-mahout","tag-map-reduce","tag-mapreduce","tag-nutch","tag-solr","tag-tika","tag-zookeeper"],"_links":{"self":[{"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/posts\/140","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2
\/comments?post=140"}],"version-history":[{"count":7,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/posts\/140\/revisions"}],"predecessor-version":[{"id":161,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/posts\/140\/revisions\/161"}],"wp:attachment":[{"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/media?parent=140"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/categories?post=140"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jkspeaks.com\/wordpress\/wp-json\/wp\/v2\/tags?post=140"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}