Intelligent Real-time Query Engines

     Monday, June 24, 2019

Speaker: Anastasia Ailamaki, IEEE Fellow, ACM Fellow

Abstract:  Data preparation is crucial for data analysis and applications, but it involves multiple transformation steps. Users often need to integrate heterogeneous data, so they must first homogenize it into a common format. Then, to accurately execute queries over the transformed data, users have to remove any inconsistencies by applying cleaning operations. Finally, to efficiently execute queries, they need to tune access paths over the data. Data preparation is therefore not only time-consuming but also wasteful, as it lacks knowledge of the workload: a lot of preparation effort is wasted on data never meant to be used. The talk will explain how we redesign query engines so that data preparation is woven into data analysis, thereby eliminating the transform-and-load cost. We enable in-situ query processing, which adapts to any data format and facilitates querying diverse datasets. To address the scalability issues of cleaning and tuning tasks, we inject cleaning operations into query processing and adapt access paths on the fly. By integrating the aforementioned tasks into data analysis, we adapt data preparation to each workload, thereby minimizing query execution times. We incorporate these ideas in Proteus, the academic prototype of the code-generated query engine RAW, and demonstrate that a powerful query language and a potent mathematical infrastructure are the basis of high-performance real-time query engines.
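
As a rough illustration of the in-situ idea in the abstract, the sketch below answers a query directly over raw CSV text and applies cleaning only to the fields and rows the query touches. It is a minimal toy under stated assumptions, not Proteus or RAW: the sample data and the names `scan`, `clean_city`, `parse_temp`, and `avg_temp` are all hypothetical.

```python
import csv
import io

# Hypothetical raw file contents; nothing is transformed or loaded up front.
RAW_CSV = """city,temp_c
Lausanne,21.5
 lausanne ,19.0
Geneva,n/a
Geneva,23.0
"""


def scan(raw_text):
    """Yield rows lazily, straight from the raw text."""
    yield from csv.DictReader(io.StringIO(raw_text))


def clean_city(value):
    """Cleaning step injected into the query: normalize spacing and case."""
    return value.strip().title()


def parse_temp(value):
    """Cleaning step injected into the query: reject unparsable readings."""
    try:
        return float(value)
    except ValueError:
        return None


def avg_temp(raw_text, city):
    """Average temperature for one city, cleaning only what the query needs."""
    total, count = 0.0, 0
    for row in scan(raw_text):
        if clean_city(row["city"]) != city:
            continue  # temperatures of other cities are never parsed or cleaned
        temp = parse_temp(row["temp_c"])
        if temp is None:
            continue
        total += temp
        count += 1
    return total / count if count else None


print(avg_temp(RAW_CSV, "Lausanne"))
```

A query about Lausanne never parses the malformed Geneva reading, which is the sense in which preparation effort follows the workload in this toy.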

Short Bio: Anastasia Ailamaki is a Professor of Computer Sciences at the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland and the co-founder of RAW Labs SA, a Swiss company developing real-time analytics infrastructures for heterogeneous big data. Her research interests are in data-intensive systems and applications, and in particular (a) in strengthening the interaction between the database software and emerging hardware and I/O devices, and (b) in automating data management to support computationally demanding, data-intensive scientific applications. She has received an ERC Consolidator Award (2013), a Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), an NSF CAREER award (2002), and ten best-paper awards in database, storage, and computer architecture conferences. She received her Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She is an ACM fellow, an IEEE fellow, the Laureate for the 2018 Nemitsas Prize in Computer Science, and an elected member of the Swiss National Research Council. She has served as a CRA-W mentor, and is a member of the Expert Network of the World Economic Forum.

Value Creation from Massive Vehicle Trajectory Data: the Case of Routing

     Monday, June 24, 2019

Speaker: Christian S. Jensen, IEEE Fellow, ACM Fellow

Abstract:  As society-wide digitalization continues, important societal processes are being captured at a level of detail never seen before, in turn enabling us to better understand and improve those processes. Vehicular transportation is one such process, where populations of vehicles are able to generate massive volumes of trajectory data that hold the potential to fuel a broad range of value-creating analytics involving query processing, data mining, and machine learning. In particular, with massive trajectory data available, the traditional vehicle routing paradigm, where a road network is modeled as an edge-weighted graph, is no longer adequate. Instead, new paradigms that thrive on massive trajectory data are called for. The talk will focus on describing several such paradigms. As even massive volumes of trajectory data are sparse in these settings, a key challenge is to be able to make good use of the available data.
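
For readers unfamiliar with the traditional paradigm the abstract refers to, the sketch below models a road network as an edge-weighted graph and finds the cheapest route with Dijkstra's algorithm. The toy network `ROADS`, its weights (read here as travel minutes), and the function name are assumptions for illustration; the trajectory-based paradigms the talk describes are not shown.

```python
import heapq


def shortest_travel_time(graph, source, target):
    """Dijkstra's algorithm over a graph with one fixed cost per edge."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper path was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return None  # target is unreachable from source


# Hypothetical road network: intersections as nodes, minutes as edge weights.
ROADS = {
    "A": {"B": 4.0, "C": 2.0},
    "B": {"D": 5.0},
    "C": {"B": 1.0, "D": 8.0},
    "D": {},
}

print(shortest_travel_time(ROADS, "A", "D"))  # → 8.0 (A, C, B, D)
```

Each road segment here carries a single static number, which is the modeling choice the abstract calls no longer adequate once massive trajectory data is available.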

Short Bio: Christian S. Jensen is an Obel Professor of Computer Science at Aalborg University, Denmark. He was a Professor at Aarhus University for a 3-year period from 2010 to 2013, and he was previously at Aalborg University for two decades. He recently spent a 1-year sabbatical at Google Inc., Mountain View. His research concerns data management and data-intensive systems, and its focus is on temporal and spatio-temporal data management. Christian is an ACM and an IEEE fellow, and he is a member of the Academia Europaea, the Royal Danish Academy of Sciences and Letters, and the Danish Academy of Technical Sciences. He has received several national and international awards for his research. He is Editor-in-Chief of ACM TODS and was an Editor-in-Chief of The VLDB Journal from 2008 to 2014.

AI for the Physical World

     Monday, June 24, 2019

Speaker: Haixun Wang, IEEE Fellow

Abstract:  Artificial Intelligence and Machine Learning are making big strides in cyberspace. Yet, there has been limited progress with AI in the physical world. With over 400 buildings around the world, WeWork has a fleet of spaces ripe for experimenting with how to blend the physical and the digital. At this scale, every decision, from day-to-day ones, such as how to schedule room cleaning, to billion-dollar ones, such as how to source our next building and location, becomes a non-trivial data science problem. We believe that intelligent environments will help make space more efficient, and the addition of human insight will make for a more engaging experience. From using AI to inform interior design and space layout, to using ML to reshuffle conference room bookings so that guests are matched with the perfect space for their meetings, to predicting the health of an organization based on engagement insights, we are exploring a variety of ways to use cutting-edge data science techniques in the real world.

Short Bio: Haixun Wang is VP of Engineering and Distinguished Scientist at WeWork, where he leads the Research and Applied Science division. He was Director of Natural Language Processing at Amazon. Before Amazon, he led the NLP Infra team at Facebook, working on Query and Document Understanding. From 2013 to 2015, he was with Google Research, working on natural language processing. From 2009 to 2013, he led research in semantic search, graph data processing systems, and distributed query processing at Microsoft Research Asia. His knowledge base project Probase has created significant impact in industry and academia. He was a research staff member at IBM T. J. Watson Research Center from 2000 to 2009. He was Technical Assistant to Stuart Feldman (Vice President of Computer Science of IBM Research) from 2006 to 2007, and Technical Assistant to Mark Wegman (Head of Computer Science of IBM Research) from 2007 to 2009. Haixun is an IEEE fellow. He received the Ph.D. degree in Computer Science from the University of California, Los Angeles in 2000. He has published more than 150 research papers in refereed international journals and conference proceedings. He served as PC Chair of conferences such as CIKM’12, and he is on the editorial board of journals such as IEEE Transactions on Knowledge and Data Engineering (TKDE) and Journal of Computer Science and Technology (JCST). He won the best paper award at ICDE 2015, the 10-year best paper award at ICDM 2013, and the best paper award at ER 2009.

AI for Data Quality: Automating Data Science Pipelines

     Monday, June 24, 2019

Speaker: Ihab Francis Ilyas, ACM Distinguished Scientist, ACM SIGMOD Vice Chair

Abstract:  Data scientists spend a big chunk of their time preparing, cleaning, and transforming raw data before getting the chance to feed this data to their well-crafted models. Despite the efforts to build robust prediction and classification models, data errors are still the main reason for low-quality results. These massive, labor-intensive exercises to clean data remain the main impediment to automatic end-to-end AI pipelines for data science. In this talk, I focus on data cleaning as an inference problem that can be automated by leveraging the great advancements in AI and ML in the last few years. I will describe the HoloClean++ framework, a machine learning framework for data profiling and cleaning (error detection and repair). The framework has multiple successful deployments with cleaning census data, and pilots with commercial enterprises to boost the quality of source (training) data before feeding them to downstream analytics. HoloClean++ builds two main probabilistic models: a data generation model (describing how data was intended to look); and a realization model (describing how errors might be introduced to the intended clean data). The framework uses few-shot learning, data augmentation, and weak supervision to learn the parameters of these models, and uses them to predict both errors and their possible repairs. While the idea of using statistical inference to model the joint data distribution of the underlying data is not new, the problem has always been: (1) how to scale a model with millions of data cells (corresponding to random variables); and (2) how to get enough training data to learn the complex models that are capable of accurately predicting the anomalies and the repairs. HoloClean++ tackles exactly these two problems.
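
The two-model view in the abstract can be pictured with a toy noisy-channel repair: a generation model scores how likely each intended value is, a realization model scores how likely the observed cell is given that intended value, and the repair is the candidate that maximizes their product. This is not HoloClean++ code and uses none of its learning techniques; the column data, the assumed `ERROR_RATE`, the unnormalized string-similarity score, and all function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

ERROR_RATE = 0.2  # assumed chance that a cell was corrupted (not learned here)


def generation_model(column):
    """Toy generation model: empirical frequency of each value in the column."""
    counts = Counter(column)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def realization_score(observed, intended):
    """Toy realization model: unnormalized score for seeing `observed`
    when `intended` was meant; typos are likelier between similar strings."""
    if observed == intended:
        return 1.0 - ERROR_RATE
    return ERROR_RATE * SequenceMatcher(None, observed, intended).ratio()


def repair(observed, prior):
    """Return the candidate maximizing prior(intended) * score(observed | intended)."""
    return max(prior, key=lambda v: prior[v] * realization_score(observed, v))


# Hypothetical column with one misspelled cell.
column = ["Waterloo"] * 8 + ["Toronto"] * 5 + ["Waterlo"]
prior = generation_model(column)

print(repair("Waterlo", prior))   # the rare, near-duplicate value is repaired
print(repair("Toronto", prior))   # a frequent, clean value is left unchanged
```

In this toy the candidate set is just the distinct values in the column, and every parameter is fixed by hand; the abstract's two challenges, scaling to millions of cells and obtaining training data, are exactly what such a sketch leaves out.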

Short Bio: Ihab Ilyas is a Professor in the Cheriton School of Computer Science at the University of Waterloo. He received his PhD in computer science from Purdue University, West Lafayette. He holds BS and MS degrees in computer science from Alexandria University. His main research is in the area of database systems, with special interest in data quality, managing uncertain data, rank-aware query processing, and information extraction. From 2011 to 2013 he was on leave, leading the Data Analytics Group at the Qatar Computing Research Institute. Ihab is a recipient of an Ontario Early Researcher Award, a Cheriton Faculty Fellowship, an NSERC Discovery Accelerator Award, and a Google Faculty Award. He is also an ACM Distinguished Scientist. Ihab is a co-founder of Tamr, a startup focusing on large-scale data integration and cleaning.