Back

TechnologyMar 11, 2014

Apache Pig Explained in 5 Minutes or Less

Michael Wilson

Apache Pig is a platform for analyzing large data sets that consist of a high-level language for creating MapReduce programs.  These programs can then be run in parallel on large-scale Hadoop clusters.  Originally Pig was developed at Yahoo around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large datasets.  In 2007 it was moved to the Apache Software Foundation and made open source.

Using Pig Latin, Pig’s platform language, parallel execution of simple data analysis becomes trivial. Complex tasks are broken into small data flow sequences, which make them easier to write, maintain, and understand.  Users are able to focus more on semantics rather than efficiency with Pig Latin, because tasks are encoded in a way that allows the system to automatically optimize the execution. By utilizing user-defined functions, users are also able to extend Pig Latin.  These functions can be written in many popular programming languages such as Java, Python, JavaScript, Ruby, or Groovy and then called directly using Pig Latin.

Common Use Cases

  • Searching and Reporting – Search through massive data sets to clean, analyze, and then create reports from that data on the fly.

  • Recommendation Engine – Analyze data sets and find correlations of data and results in order to create recommendations or predict possible outcomes.

  • Scheduled and Ad-hoc Jobs – Jobs can be created and run as both an ad-hoc, single-off job or be called on a given schedule. 

Key Features

  • Ease of Programming – Simple data analysis tasks easily achieve parallel execution of simple data analysis tasks.  Complex tasks that have multiple interrelated data transformations are encoded as data flow sequence.

  • Automatic Optimization – The way in which tasks are encoded permits the system to optimize their execution automatically.

  • Extensibility – Users can create their own functions in multiple languages to do special-purpose processing.

Companies Using Pig

  • Yahoo! – Uses Pig to support research for ad systems and web search.  It’s estimated that between 40 – 60% of their Hadoop jobs are generated using Pig.

  • LinkedIn – Uses Hadoop with Pig for discovering “People You May Know” and other fun facts for users.

  • Rovi Corporation – Uses Hadoop, Pig, and MapReduce to process extracted SQL data to generate json objects that are stored in MongoDB and served through our web services.  Each night they process over 160 Pig scripts and 50 MapReduce jobs that process over 600 GB of data.

  • Twitter – Uses Pig heavily for both scheduled and ad-hoc jobs such as processing logs or mining tweet data, due to its ability to accomplish a lot with few statements.

Take Away

  • Pig is a platform that can be used for analyzing large, distributed data sets through MapReduce programs.

  • Pig encodes programs that allow for automatic optimization.

  • Pig Latin is a language that’s easy to program, automatically optimizes, and is extensible using many different languages.