Peter Bailis (Stanford) – July 2017

Prioritizing Attention in High-Volume, High-Dimensional Data Streams

The rise of Big Data infrastructure has supercharged the collection of high-volume, highly heterogeneous data. Increasingly, however, this data is too large for cost-effective manual inspection, and in practice much of this “fast data” is accessed only in exceptional cases (e.g., to debug a failure). As a result, important behaviors often go unnoticed, leading to inefficiency, wasted resources, and limited visibility into complex application deployments.

In response, this research focuses on developing infrastructure for prioritizing human attention in fast data by automatically providing users with high-level, interpretable summaries and explanations of key behaviors in data streams. Early results with a fast data prototype, MacroBase, have proven extremely promising at small scale: by combining statistical classification and explanation, MacroBase has discovered a range of previously unknown behaviors in mobile applications, online services, and manufacturing. This allows users to do more with the data they store and can also provide insights into deployments of complex software such as storage systems.
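As a rough illustration of what pairing statistical classification with explanation can look like, the sketch below flags outlying readings with a robust z-score (median and MAD) and then ranks attribute values by how much more often they appear among outliers than among inliers (a relative risk ratio). Both choices are assumptions made for this example and are not taken from the text above; the function names, the threshold of 3.0, and the toy device labels are hypothetical.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Sequence


def classify_outliers(values: Sequence[float], threshold: float = 3.0) -> List[bool]:
    """Label each value as an outlier using a robust z-score (assumed method)."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be scored as outlying.
        return [False] * len(values)
    # 1.4826 scales the MAD to match the standard deviation under normality.
    return [abs(v - med) / (1.4826 * mad) > threshold for v in values]


def risk_ratios(attrs: Sequence[str], is_outlier: Sequence[bool]) -> Dict[str, float]:
    """Relative risk of being an outlier for each attribute value."""
    out_counts = Counter(a for a, o in zip(attrs, is_outlier) if o)
    all_counts = Counter(attrs)
    total, total_out = len(attrs), sum(is_outlier)
    ratios: Dict[str, float] = {}
    for attr, n_with in all_counts.items():
        n_without = total - n_with
        rate_with = out_counts[attr] / n_with
        out_without = total_out - out_counts[attr]
        rate_without = out_without / n_without if n_without else 0.0
        if rate_without == 0:
            ratios[attr] = float("inf") if rate_with > 0 else 0.0
        else:
            ratios[attr] = rate_with / rate_without
    return ratios


# Toy data: 20 ordinary readings from device "A", 3 extreme ones from device "B".
values = [10 + (i % 5) * 0.1 for i in range(20)] + [100.0] * 3
attrs = ["A"] * 20 + ["B"] * 3
labels = classify_outliers(values)
explanation = risk_ratios(attrs, labels)
```

In this toy run, the attribute value with the highest ratio is the one a user would be pointed to first; a real system would also need support thresholds and combinations of attributes, which this sketch omits.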

However, scalability along two dimensions remains particularly challenging. First, data volumes continue to rise at much faster rates than processor speeds. Second, data is collected from an increasing number of sources, requiring data fusion and multi-source processing. In response, this research proposes to adapt methods from conventional query processing to this new domain of fast data analysis: approximate query processing techniques, in the form of sample-based stochastic stream processing (maintaining accuracy while running on a small subset of the stream), and multi-query processing to analyze multiple sources at once. Both may have substantial impact on the design and implementation of storage software. We plan to integrate and evaluate our techniques within MacroBase on real-world IoT sensor, time-series, and telemetry data, as well as any available telemetry from NetApp storage systems.
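To make the idea of running on a small subset of the stream concrete, here is a minimal sketch of reservoir sampling (Algorithm R), one standard way to keep a fixed-size uniform sample of a stream of unknown length and answer an aggregate query approximately from it. It is an illustrative assumption, not necessarily the sampling scheme this research would adopt; the function names and the sample size are hypothetical.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(stream: Iterable[float], k: int,
                     seed: Optional[int] = None) -> List[float]:
    """Keep a uniform random sample of at most k items in one pass."""
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, x in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(x)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = x
    return reservoir


def approximate_mean(stream: Iterable[float], k: int,
                     seed: Optional[int] = None) -> float:
    """Estimate the stream mean from the sample instead of the full stream."""
    sample = reservoir_sample(stream, k, seed)
    if not sample:
        raise ValueError("cannot estimate the mean of an empty stream")
    return sum(sample) / len(sample)


# Estimate the mean of 100,000 values while storing only 5,000 of them.
estimate = approximate_mean(range(100_000), k=5_000, seed=0)
```

The memory used is bounded by `k` regardless of stream length, and the error of the estimate shrinks roughly with the square root of `k`, which is the accuracy-versus-cost trade-off that approximate query processing exploits.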