Apache Pig is a platform for analyzing large data sets that consist of a high-level language for creating MapReduce programs. These programs can then be run in parallel on large-scale Hadoop clusters. Originally Pig was developed at Yahoo around 2006 for researchers to have an ad-hoc way of creating and executing map-reduce jobs on very large datasets. In 2007 it was moved to the Apache Software Foundation and made open source.
Common Use Cases
Searching and Reporting – Search through massive data sets to clean, analyze, and then create reports from that data on the fly.
Recommendation Engine – Analyze data sets and find correlations of data and results in order to create recommendations or predict possible outcomes.
Scheduled and Ad-hoc Jobs – Jobs can be created and run as both an ad-hoc, single-off job or be called on a given schedule.
Ease of Programming – Simple data analysis tasks easily achieve parallel execution of simple data analysis tasks. Complex tasks that have multiple interrelated data transformations are encoded as data flow sequence.
Automatic Optimization – The way in which tasks are encoded permits the system to optimize their execution automatically.
Extensibility – Users can create their own functions in multiple languages to do special-purpose processing.
Companies Using Pig
Yahoo! – Uses Pig to support research for ad systems and web search. It’s estimated that between 40 – 60% of their Hadoop jobs are generated using Pig.
LinkedIn – Uses Hadoop with Pig for discovering “People You May Know” and other fun facts for users.
Rovi Corporation – Uses Hadoop, Pig, and MapReduce to process extracted SQL data to generate json objects that are stored in MongoDB and served through our web services. Each night they process over 160 Pig scripts and 50 MapReduce jobs that process over 600 GB of data.
Twitter – Uses Pig heavily for both scheduled and ad-hoc jobs such as processing logs or mining tweet data, due to its ability to accomplish a lot with few statements.
Pig is a platform that can be used for analyzing large, distributed data sets through MapReduce programs.
Pig encodes programs that allow for automatic optimization.
Pig Latin is a language that’s easy to program, automatically optimizes, and is extensible using many different languages.