[Dmbu-l] Fwd: TALK:Wednesday 9-26-12 Programming and Debugging Large-Scale Data Processing Workflows

Charalampos Mavroforakis cmav at bu.edu
Fri Sep 21 12:31:13 EDT 2012

Hi everyone,

As we discussed in the meeting, I will be forwarding announcements of
talks at MIT that may be of interest to the group.

- Harry

---------- Forwarded message ----------
From: CSAIL Event Calendar <eventcalendar at csail.mit.edu>
Date: Tue, Sep 18, 2012 at 10:29 AM
Subject: TALK:Wednesday 9-26-12 Programming and Debugging Large-Scale Data
Processing Workflows
To: seminars at csail.mit.edu

Programming and Debugging Large-Scale Data Processing Workflows
Speaker: Christopher Olston
Speaker Affiliation: Google
Host: Samuel Madden
Host Affiliation: CSAIL

Date: 9-26-2012
Time: 4:00 PM - 5:00 PM
Refreshments: 3:45 PM
Location: 32-G449 (Patil/Kiva)

This talk gives an overview of my team's work on large-scale data
processing at Yahoo! Research. The talk begins by introducing two data
processing systems we helped develop: PIG, a dataflow programming
environment and Hadoop-based runtime, and NOVA, a workflow manager for
Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at
what can be done before, during and after execution of a data
processing operation:
* Pig's automatic EXAMPLE DATA GENERATOR is used before running a
Pig job to get a feel for what it will do, enabling certain kinds of
mistakes to be caught early and cheaply. The algorithm behind the
example generator performs a combination of sampling and synthesis to
balance several key factors---realism, conciseness and
completeness---of the example data it produces.
* INSPECTOR GADGET is a framework for creating custom tools that
monitor Pig job execution. We implemented a dozen user-requested
tools, ranging from data integrity checks to crash cause investigation
to performance profiling, each in just a few hundred lines of code.
* IBIS is a system that collects metadata about what happened during
data processing, for post-hoc analysis. The metadata is collected from
multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and
processing elements at different granularities (e.g. tables vs.
records; relational operators vs. reduce task attempts) and offer
disparate ways of querying it. IBIS integrates this metadata and
presents a uniform and powerful query interface to users.


Christopher Olston is a staff research scientist at Google, working on
structured data. He previously worked at Yahoo! (principal research
scientist) and Carnegie Mellon (assistant professor). He holds
computer science degrees from Stanford (2003 Ph.D., M.S.; funded by
NSF and Stanford fellowships) and UC Berkeley (B.S. with highest

At Yahoo, Olston co-created Apache Pig, which is used for large-scale
data processing by LinkedIn, Netflix, Salesforce, Twitter, Yahoo and
others, and is offered by Amazon as a cloud service. He gave the 2011
Symposium on Cloud Computing keynote, and won the 2009 SIGMOD best
paper award. During his flirtation with academia, Olston taught
undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford,
and signed several Ph.D. dissertations.

For more information please contact: Sheila Marian, x3-1996,
sheila at csail.mit.edu

Seminars mailing list
Seminars at lists.csail.mit.edu