[cs-talks] Upcoming CS Seminars: NRG (Mon) + Data Management (Tues) + IVC (Tues) + Student Sem (Thur)
fgreen1 at bu.edu
Mon Nov 16 11:04:48 EST 2015
NRG Seminar
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
Ugur Kaynar, BU
Monday, November 16, 2015 at 11am in MCS 148
Authors: Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle (Google)
Abstract: Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business (e.g. Web logs, mobile usage statistics, and sensor networks). At the same time, consumers of these datasets have evolved sophisticated requirements, such as event-time ordering and windowing by features of the data themselves, in addition to an insatiable hunger for faster answers. Meanwhile, practicality dictates that one can never fully optimize along all dimensions of correctness, latency, and cost for these types of input. As a result, data processing practitioners are left with the quandary of how to reconcile the tensions between these seemingly competing propositions, often resulting in disparate implementations and systems. We propose that a fundamental shift of approach is necessary to deal with these evolved requirements in modern data processing. We as a field must stop trying to groom unbounded datasets into finite pools of information that eventually become complete, and instead live and breathe under the assumption that we will never know if or when we have seen all of our data, only that new data will arrive, old data may be retracted, and the only way to make this problem tractable is via principled abstractions that allow the practitioner the choice of appropriate tradeoffs along the axes of interest: correctness, latency, and cost. In this paper, we present one such approach, the Dataflow Model, along with a detailed examination of the semantics it enables, an overview of the core principles that guided its design, and a validation of the model itself via the real-world experiences that led to its development.
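For readers unfamiliar with event-time windowing, here is a minimal sketch of the idea the abstract mentions: records are grouped by the timestamp they carry, not by the order in which they arrive. This is not the Dataflow SDK or the paper's API; the function names, the fixed 60-second window size, and the sample records are all assumptions made up for illustration, and triggers, watermarks, and retractions are left out.

```python
from collections import defaultdict


def fixed_window_start(event_time, size):
    """Start of the fixed window (hypothetical helper) that contains event_time."""
    return event_time - (event_time % size)


def sum_by_event_time_window(events, size):
    """Sum values per fixed event-time window.

    `events` is an iterable of (event_time, value) pairs in ARRIVAL order,
    which may differ from event-time order.
    """
    totals = defaultdict(int)
    for event_time, value in events:
        # The window is chosen from the record's own timestamp,
        # so late or out-of-order records still land in the right window.
        totals[fixed_window_start(event_time, size)] += value
    return dict(totals)


# Made-up records: (event time in seconds, value), listed out of order.
arrivals = [(12, 1), (3, 5), (61, 2), (59, 4)]
print(sum_by_event_time_window(arrivals, 60))
```

A batch computation like this only works once all records are in hand; the abstract's point is that with unbounded input one never knows when a window is complete, which is why the tradeoff between correctness, latency, and cost arises.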
Data Management Seminar
Discovering and Interpreting Overlapping Communities in Graphs
Merrielle Spain (Lincoln Labs)
Tuesday, November 17, 2015 at 11am in MCS 148
Abstract: Graph communities reveal efficient resource configurations, simplify structure for supervised learning, and identify individuals of interest. In graphs such as communication networks, nodes naturally belong to multiple communities. Link clustering captures this by enabling nodes to inherit community membership from their edges [Ahn et al., 2010]. We identified how link clustering scales and harnessed this insight to increase speed. We then temporarily removed high-degree nodes to further increase speed. Finally, we considered measuring cluster quality with incomplete ground truth and interpreted clusters using metadata.
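As a rough illustration of how nodes can "inherit community membership from their edges": once each edge has been assigned to a single community, a node belongs to every community that any of its edges belongs to, so overlap falls out naturally. The sketch below covers only that final step. The edge labels are hypothetical inputs; the edge-clustering procedure of Ahn et al. and the speed-ups described in the abstract are not shown.

```python
from collections import defaultdict


def node_memberships(edge_communities):
    """Map each node to the set of communities of its incident edges.

    `edge_communities` maps an edge (u, v) to one community label.
    """
    memberships = defaultdict(set)
    for (u, v), community in edge_communities.items():
        # Both endpoints inherit the community of the edge.
        memberships[u].add(community)
        memberships[v].add(community)
    return dict(memberships)


# Made-up example: "carol" has edges in two different communities.
edges = {
    ("alice", "bob"): "work",
    ("bob", "carol"): "work",
    ("carol", "dave"): "family",
}
for node, communities in sorted(node_memberships(edges).items()):
    print(node, sorted(communities))
```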
IVC Seminar
Small-Variance Asymptotics for Large-Scale Learning
Brian Kulis, BU
Tuesday, November 17, 2015 at 2pm in MCS 148
Abstract: This talk will focus on designing scalable learning algorithms via the technique of small-variance asymptotics. We will take as a starting point the widely known relationship between the Gaussian mixture model and k-means for clustering data: as the covariances of the clusters shrink, the EM algorithm approaches the k-means algorithm and the negative log-likelihood approaches the k-means objective. Similar asymptotic connections exist for other machine learning models, including dimensionality reduction (probabilistic PCA becomes PCA), multiview learning (probabilistic CCA becomes CCA), and classification (a restricted Bayes optimal classifier becomes the SVM). The asymptotic non-probabilistic counterparts to the probabilistic models are almost always more scalable, and are typically easier to analyze, making them useful alternatives to the probabilistic models in many situations. We will explore how to extend such asymptotics to a richer class of probabilistic models, with a focus on large-scale graphical models, Bayesian nonparametric models, and time-series data. We will develop the mathematical tools needed for these extensions and will describe a framework for designing scalable optimization problems derived from the rich probabilistic models. Applications are diverse, and include topic modeling, network evolution, and deep feature learning for large-scale data.
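The starting point of the abstract can be checked numerically. The sketch below is not taken from the talk; it assumes a one-dimensional mixture with equal mixing weights and a shared variance sigma^2, and the means and data point are made up. Under those assumptions, the E-step responsibilities of EM collapse toward the hard nearest-mean assignment of k-means as sigma shrinks.

```python
import math


def responsibilities(x, means, sigma):
    """E-step responsibilities for point x (equal weights, shared variance)."""
    # Log-densities up to an additive constant common to all components.
    logits = [-((x - m) ** 2) / (2.0 * sigma ** 2) for m in means]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kmeans_assignment(x, means):
    """Hard k-means assignment: index of the nearest mean."""
    return min(range(len(means)), key=lambda k: (x - means[k]) ** 2)


means = [0.0, 4.0]
x = 1.5
for sigma in (5.0, 1.0, 0.1):
    print("sigma =", sigma, [round(r, 4) for r in responsibilities(x, means, sigma)])
print("k-means assigns x to cluster", kmeans_assignment(x, means))
```

With a large sigma the two responsibilities are close to equal; with a small sigma nearly all of the mass sits on the nearest mean, which is the cluster k-means picks.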
Bio: Brian Kulis is the Peter J. Levine Career Development Assistant Professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at Boston University. His research focuses on machine learning, statistics, computer vision, data mining, and large-scale optimization. Previously, he was an assistant professor in computer science and in statistics at Ohio State University, and prior to that was a postdoctoral fellow at UC Berkeley EECS. He obtained his PhD in computer science from the University of Texas in 2008, and his BA degree from Cornell University in computer science and mathematics in 2003. For his research, he has won three best paper awards at top-tier conferences: two at the International Conference on Machine Learning (in 2005 and 2007) and one at the IEEE Conference on Computer Vision and Pattern Recognition (in 2008). He is also the recipient of an NSF CAREER Award in 2015, an MCD graduate fellowship from the University of Texas (2003-2007), and an Award of Excellence from the College of Natural Sciences at the University of Texas.
Student Seminar
How to Trick People into Hiring You
Thursday, November 19, 2015 at 12pm in MCS 148
Description: Foteini will talk about how to create the illusion of being a likable and cooperative hard-worker.