PIG: Web-Scale Processing
· Christopher Olston
· The project originated in Y! Research.
· Example data analysis task: Find users that visit “good” web pages.
· Christopher points out that joins are hard to write in Hadoop: there are many ways to write a join, and choosing a join technique is a problem that requires some skill. This is basically the same point the DB community made years ago. PIG is a dataflow language in which you describe what you want to happen logically, and the system then maps it to map/reduce. The language of PIG is called Pig Latin.
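To make the point about hand-written joins concrete, here is a minimal Python sketch of one common technique, a reduce-side join, simulated in memory. It is not from the talk: the sample data, the (user, url) and (url, rank) schemas, and every function name are assumptions chosen for illustration, and the shuffle step stands in for what the framework does between map and reduce.

```python
from collections import defaultdict

# Hypothetical sample inputs; both schemas are assumptions for illustration.
visits = [("amy", "cnn.com"), ("amy", "spam.biz"), ("bob", "spam.biz")]
pages = [("cnn.com", 0.9), ("spam.biz", 0.1)]


def map_phase(visits, pages):
    """Emit (url, tagged value) pairs so the reducer can tell the inputs apart."""
    for user, url in visits:
        yield url, ("V", user)
    for url, rank in pages:
        yield url, ("P", rank)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(url, values):
    """Pair every visit with every page record that shares the same url."""
    users = [v for tag, v in values if tag == "V"]
    ranks = [v for tag, v in values if tag == "P"]
    for user in users:
        for rank in ranks:
            yield user, url, rank


def reduce_side_join(visits, pages):
    """Run map, shuffle, and reduce in sequence and return sorted join rows."""
    joined = []
    for url, values in shuffle(map_phase(visits, pages)).items():
        joined.extend(reduce_phase(url, values))
    return sorted(joined)


joined = reduce_side_join(visits, pages)
```

Even in this toy form, the tagging, grouping, and cross-product logic all have to be written by hand, and a different technique (for example, loading the smaller input into memory on the map side) would need different code. That is the burden a higher-level dataflow layer is meant to remove.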
· Pig Latin allows the declaration of “views” (late bound queries)
· Pig Latin is essentially a text form of a data flow graph. It generates Hadoop Map/Reduce jobs.
o Operators: filter, foreach … generate, & group
o Binary operators: join, cogroup (“more customizable type of join”), & union
o Also supports a split operator
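The "sequence of simple steps" style can be sketched in plain Python for the example task of finding users that visit "good" pages. This is an illustration only, not Pig Latin and not how PIG is implemented: the helper names merely echo the operators listed above, the schemas and sample data are hypothetical, and treating "good" as an average page rank above 0.4 is an assumption.

```python
from collections import defaultdict

# Hypothetical inputs: visits are (user, url), pages are (url, rank).
visits = [("amy", "cnn.com"), ("amy", "spam.biz"), ("bob", "spam.biz")]
pages = [("cnn.com", 0.9), ("spam.biz", 0.1)]


def join(left, right, left_key, right_key):
    """Equi-join: concatenate each left tuple with every matching right tuple."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [l + r for l in left for r in index.get(l[left_key], [])]


def group(rows, key):
    """Collect rows that share a key field into (key, rows) pairs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return list(groups.items())


def cogroup(left, right, left_key, right_key):
    """Like join, but keep each input's matching rows in separate bags per key."""
    out = defaultdict(lambda: ([], []))
    for l in left:
        out[l[left_key]][0].append(l)
    for r in right:
        out[r[right_key]][1].append(r)
    return dict(out)


# Step 1 (join): rows become (user, url, url, rank).
visited = join(visits, pages, 1, 0)
# Step 2 (group): one entry per user.
by_user = group(visited, 0)
# Step 3 (foreach ... generate): compute each user's average rank.
avg_rank = [(user, sum(r[3] for r in rows) / len(rows)) for user, rows in by_user]
# Step 4 (filter): keep users above the assumed "good" threshold.
good_users = [user for user, avg in avg_rank if avg > 0.4]
```

Each intermediate result (visited, by_user, avg_rank) can be inspected on its own before the next step is added. The cogroup helper shows how it differs from join: for the key "spam.biz" it returns the two visit rows and the one page row in separate lists, leaving it to the caller to decide how to combine them.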
· How different from SQL?
o It’s a sequence of simple steps rather than a declarative expression. SQL is declarative whereas Pig Latin says what steps you want done in what order. Much closer to imperative programming and, consequently, they argue it is simpler.
o They argue that it’s easier to write a set of steps, work with each one at a time, and slowly build them up into a complete and correct program.
· PIG is written as a language processing layer over Map/Reduce
· He proposes writing SQL as a processing layer over PIG, but this code isn’t yet written
· Is PIG+Hadoop a DBMS? (there have been lots of blogs on this question :-))
o P+H supports only sequential scans very efficiently (no indexes or other access methods)
o P+H operates on any data format (PIGS eat anything) whereas a DBMS runs only on data that it stores
o P+H expresses a query as a sequence of steps rather than as the set of constraints used in a DBMS
o P+H has custom processing as a “first class object” whereas UDFs were added to DBMSs later
· They want an Eclipse development environment but don’t have it running yet. Planning an Eclipse Plugin.
· Team of 10 engineers currently working on it.
· A new version of PIG, due out next week, will include “explain” (shows the mapping to map/reduce jobs to help with debugging).
· Today PIG does joins exactly one way. They are adding more join techniques. There aren’t explicit stats tracked other than file size. Next version will allow user to specify. They will explore optimization.
James Hamilton, Windows Live Platform Services
H:mvdirona.com | W:research.microsoft.com/~jamesrh | blog:http://perspectives.mvdirona.com