Russel Pears: “Mining Data Streams: Issues, Challenges and Some Solutions”


Speaker: Russel Pears

Date: Tuesday, August 29

Time: 3:30-4:30 p.m.

Location: Rice Hall, room 242

Host: Tom Horton (tbh3f)

Title: Mining Data Streams: Issues, Challenges and Some Solutions

Abstract: Data streams present unique challenges for mining and knowledge extraction. Because many streams are open-ended and arrive at high rates, standard machine learning approaches cannot be applied within the resulting memory and speed constraints. In addition, data streams are often dynamic in nature, with the underlying data distribution changing over time.

In the first part of this talk I will touch on some of the methods that have been proposed to deal with these issues. The second part will concentrate on my recent approach, which senses stream volatility and tailors the mining strategy to the level of volatility in the stream. Preliminary results from an experimental study show that significant speed-ups over state-of-the-art approaches can be achieved while maintaining prediction accuracy.
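As a rough illustration of the general volatility-sensing idea (a toy sketch, not the algorithm presented in the talk; the class name, window size, and threshold below are assumptions made for this example), a stream miner can watch its own recent prediction errors and treat a large shift between the older and newer halves of a sliding window as a sign of high volatility:

    from collections import deque

    class VolatilitySensor:
        """Toy volatility detector over a stream of per-example errors."""

        def __init__(self, window=200, threshold=0.1):
            self.errors = deque(maxlen=window)
            self.threshold = threshold

        def update(self, error):
            # Feed the model's error on the latest stream item.
            self.errors.append(error)
            n = len(self.errors)
            if n < 20:
                return "warmup"
            half = n // 2
            older = sum(list(self.errors)[:half]) / half
            recent = sum(list(self.errors)[half:]) / (n - half)
            # A large shift between the window halves suggests the
            # underlying distribution is changing quickly.
            return "high" if abs(recent - older) > self.threshold else "low"

A miner could, for example, rebuild its model from a short recent window whenever the sensor reports "high" and keep a larger, cheaper-to-maintain model otherwise.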

About the speaker: Russel Pears is currently attached to the Department of Computer Science at the Auckland University of Technology (AUT) in New Zealand. Russel's career spans more than three decades in tertiary education, the last 16 years of them at AUT University. During this period he has taught across a wide spectrum of courses in the Computer Science curriculum. He has also held senior leadership positions, such as Programme Leader for the MSc and PhD programmes run by the School of Engineering, Computing and Mathematical Sciences at AUT University.

Russel’s research interests are in the Data Mining and Machine Learning areas, where he has published widely in peer-reviewed international conferences and journals. He currently supervises four doctoral students and two MSc students in their thesis research.

Jonathan Dorn: PhD Dissertation Defense


Thursday, July 20, 2017 at 12:30 pm in Rice 536

Committee members: Westley Weimer (Advisor), Baishakhi Ray (Committee Chair), Jason Lawrence, Stephanie Forrest (UNM) and Chad Dodson (Minor Representative).

Title: Optimizing Tradeoffs of Non-Functional Properties in Software

ABSTRACT:

Software systems have become integral to the daily lives of millions of people. These systems provide much of our entertainment (e.g., video games, feature-length movies, and YouTube) and our transportation (e.g., planes, trains, and automobiles). They ensure that the electricity to power homes and businesses is delivered, and they are significant consumers of that electricity themselves. With so many people consuming software, the best balance between runtime, energy or battery use, and accuracy differs from one user to another. With so many applications playing so many different roles, and so many developers producing and modifying them, the tradeoff between maintainability and other properties must be managed as well.

Existing methodologies for managing these “non-functional” properties require significant additional effort. Some techniques impose restrictions on how software may be designed or require time-consuming manual reviews. These techniques are frequently specific to a single application domain, programming language, or architecture, and are primarily applicable during initial software design and development. Further, modifying one property, such as runtime, often changes another property as well, such as maintainability.

In this dissertation, we present a framework, exemplified by three case studies, for automatically manipulating interconnected program properties to find the optimal tradeoffs. We exploit evolutionary search to explore the complex interactions of diverse properties and present the results to users. We demonstrate the applicability and effectiveness of this approach in three application domains, involving different combinations of dynamic properties (how the program behaves as it runs) and static properties (what the source code itself is like). In doing so, we describe the ways in which those domains impact the choices of how to represent programs, how to measure their properties effectively, and how to search for the best among many candidate program implementations. We show that effective choices enable the framework to take unmodified human-written programs and automatically produce new implementations with better properties—and better tradeoffs between properties—than before.
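To give a flavor of the search component, here is a minimal sketch under assumed interfaces (not the dissertation's actual framework; mutate and measure are hypothetical stand-ins for domain-specific program transformations and property measurements). An evolutionary loop maintains a Pareto front over two minimized properties, such as runtime and output error:

    import random

    def dominates(a, b):
        # a dominates b if it is no worse on every (minimized) objective
        # and strictly better on at least one.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    def pareto_search(seed, mutate, measure, generations=1000):
        front = [(seed, measure(seed))]  # non-dominated (program, scores) pairs
        for _ in range(generations):
            parent, _ = random.choice(front)
            child = mutate(parent)
            scores = measure(child)  # e.g., (runtime_seconds, output_error)
            if any(dominates(s, scores) for _, s in front):
                continue  # some existing variant is strictly better
            front = [(p, s) for p, s in front if not dominates(scores, s)]
            front.append((child, scores))
        return front  # users pick the tradeoff point that suits them

In this sketch, the returned front plays the role of the results presented to users: each surviving variant is unbeaten at some tradeoff between the measured properties.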

Qingyun Wu, Qualifying Exam Presentation

Date: Monday, March 27, 2017 at 3:00 PM in Rice 404


Committee: Hongning Wang (Advisor), Jack Stankovic (Chair), Worthy Martin, and Yanjun Qi

Contextual Bandits in a Collaborative Environment

ABSTRACT

Contextual bandit algorithms provide principled online learning solutions for finding optimal trade-offs between exploration and exploitation using companion side information. They have been used extensively in many important practical scenarios, such as display advertising and content recommendation. A common practice is to estimate the unknown bandit parameters for each user independently. This unfortunately ignores dependency among users and thus leads to suboptimal solutions, especially for applications with strong social components. We develop a collaborative contextual bandit algorithm in which the adjacency graph among users is leveraged to share context and payoffs among neighboring users during online updates. We rigorously prove an improved upper regret bound for the proposed collaborative bandit algorithm compared to conventional independent bandit algorithms.

We also study the upper regret bound of the collaborative bandit algorithm when the input user dependency is erroneous, and prove that under certain mild assumptions a sub-linear upper regret bound is still achievable in that setting. Based on this theoretical analysis, we enhance the collaborative contextual bandit algorithm to estimate user dependency automatically from the interaction data accumulated on the fly. With this collaborative framework, greatly reduced per-user learning complexity can be achieved, which corresponds to rapid user preference learning and improved satisfaction. Extensive experiments on synthetic data and three large-scale real-world datasets verify the improvement of our proposed algorithms over several state-of-the-art contextual bandit algorithms.
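A minimal sketch of the collaborative idea follows (illustrative only; the exact algorithm, weighting scheme, and confidence term in the talk may differ). Each user runs a LinUCB-style linear contextual bandit, but parameter estimation pools sufficient statistics across the user's neighbors in a nonnegative adjacency matrix W, which is row-normalized here by assumption:

    import numpy as np

    class CollaborativeLinUCB:
        """Per-user linear bandits that share statistics over a user graph."""

        def __init__(self, n_users, d, W, alpha=0.5):
            self.A = np.stack([np.eye(d) for _ in range(n_users)])
            self.b = np.zeros((n_users, d))
            self.W = W / W.sum(axis=1, keepdims=True)  # row-normalize the graph
            self.alpha = alpha  # exploration strength

        def choose(self, user, arms):
            # Pool neighbors' statistics before estimating this user's parameter.
            A = sum(w * self.A[j] for j, w in enumerate(self.W[user]))
            b = self.W[user] @ self.b
            theta = np.linalg.solve(A, b)
            A_inv = np.linalg.inv(A)
            # Upper confidence bound: estimated payoff plus exploration bonus.
            ucb = [x @ theta + self.alpha * np.sqrt(x @ A_inv @ x) for x in arms]
            return int(np.argmax(ucb))

        def update(self, user, x, reward):
            self.A[user] += np.outer(x, x)
            self.b[user] += reward * x

When W is the identity matrix this reduces to independent per-user bandits; off-diagonal weight is what lets observations from one user speed up learning for a neighbor.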

Elaheh Sadredini, PhD Qualifying Exam Presentation



Date: Thursday, December 15, 2016 at 10:00 AM in Rice 242

Committee: Kevin Skadron (Advisor), Worthy N. Martin (Committee Chair), Hongning Wang and Jane Qi.

Frequent Subtree Mining on the Automata Processor: Challenges and Opportunities

Tree structures are used abundantly in a wide variety of domains such as XML databases, bioinformatics, web mining, natural language processing, and chemical compound structure extraction. Frequency counting of complex patterns such as subtrees and subgraphs is more challenging than for simple itemsets and sequences, as the number of possible candidates in a tree is much higher than for one-dimensional data structures, with dramatically higher processing times. Considering the ever-increasing amount of data and the size of individual datasets, developing a scalable and efficient high-performance computing solution becomes more and more important. In this project, we propose a new and scalable solution to the frequent subtree mining (FTM) problem using a new and highly parallel accelerator architecture, the Automata Processor (AP). We present a four-stage pruning strategy on the AP to reduce the search space of FTM candidates. This achieves up to 353X speedup on four real-world and synthetic datasets when compared with PatternMatcher, a practical CPU solution, at lower support thresholds. To provide a fully accurate and still scalable solution, we propose a hybrid method that combines the AP-FTM with a CPU exact-matching approach, achieving up to 262X speedup over PatternMatcher on a challenging database. Finally, we develop a level-wise FTM algorithm on GPU to provide a comprehensive comparison of FTM on different platforms. The AP advantage grows further with larger datasets.
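The core primitive behind any FTM system is support counting: how many database trees contain a given candidate subtree. Below is a plain-CPU sketch for rooted, ordered, induced subtrees (illustrative only; it does not reflect the four-stage AP pruning strategy, and Node, occurs_at, and the other names are invented for this example):

    class Node:
        def __init__(self, label, children=None):
            self.label = label
            self.children = children or []

    def occurs_at(cand, data):
        # Match cand rooted at data: labels agree, and cand's children map
        # order-preservingly onto a subsequence of data's children.
        if cand.label != data.label:
            return False
        i = 0
        for c in cand.children:
            while i < len(data.children) and not occurs_at(c, data.children[i]):
                i += 1
            if i == len(data.children):
                return False
            i += 1
        return True

    def contains(cand, tree):
        # Does cand occur anywhere in tree?
        return occurs_at(cand, tree) or any(contains(cand, ch) for ch in tree.children)

    def support(cand, database):
        # Number of database trees containing at least one occurrence.
        return sum(1 for t in database if contains(cand, t))

With candidate sets growing combinatorially, this per-candidate traversal is the work that pruning strategies and parallel hardware such as the AP aim to cut down.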

Nathan Brunelle, PhD Proposal Presentation


Date: Tuesday, November 15, 2016 at 9:30 AM in Rice 242
Advisor: Gabriel Robins
Committee: James Cohoon (Committee Chair), Kevin Skadron, Mircea Stan (Minor Representative) and Ke Wang.

Scalable Algorithms for the Post Moore’s Law Era

ABSTRACT
As Moore’s Law wanes over the coming years, new approaches will be necessary to handle massive data volumes. We propose two general methods for scalability that can persist beyond Moore’s Law, namely (1) designing algorithms to operate directly on highly compressed data, thereby avoiding expensive decompression and re-compression, and (2) leveraging massively parallel finite automata-based architectures for specific problem domains. We evaluate the efficiency, extensibility, and generality of these non-traditional approaches in big data environments, and present promising preliminary results.
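As a tiny, self-contained example of approach (1) above (an illustration of the principle only, not code from the proposal), a count of symbol occurrences can be computed directly on a run-length-encoded string, in time proportional to the number of runs rather than to the decompressed length:

    def rle_encode(s):
        # Encode a string as (symbol, run_length) pairs.
        runs = []
        for ch in s:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def count_symbol(runs, target):
        # Operates on the compressed form directly: O(#runs), no decompression.
        return sum(length for sym, length in runs if sym == target)

    assert count_symbol(rle_encode("aaabbbbccaa"), "a") == 5

The same spirit, applied to richer queries and stronger compression schemes, avoids the decompress-process-recompress cycle entirely.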

Scott Ransom, NRAO (National Radio Astronomy Observatory)


Computer Science Special Guest Speaker

Scott Ransom, NRAO (National Radio Astronomy Observatory)
Friday, November 11, 2016
3:30 pm (short reception before talk at 3:00)
Rice Hall Auditorium
Host: Andrew Grimshaw

The (obscene) Challenges of Next-Generation Pulsar Surveys

ABSTRACT

In the last decade, large-scale surveys for new radio pulsars have made incredible progress, particularly in their ability to find important binary and millisecond pulsars.  The reason for this progress has been Moore’s Law, the same reason behind our current efforts and plans to build fantastic next-generation radio facilities.

These new facilities, though, especially the radio arrays, will make pulsar searching incredibly difficult due to the (obscene) data rates that will be generated.  Dealing with data rates that we cannot record will demand new ways of thinking about and processing our pulsar data. And unfortunately these challenges apply not only to the SKA in some distant future; they are with us already today in the arrays we have in operation or under construction.

Bio

Scott is a tenured astronomer with the National Radio Astronomy Observatory (NRAO) in Charlottesville, VA where he studies all things “pulsar”.  He is also a Research Professor with the Astronomy Department at the University of Virginia where he has several graduate students and teaches the occasional graduate class.  He works on a wide variety of projects involving finding, timing, and exploiting pulsars of various types, using data from many different instruments and at energies from radio waves to gamma-rays.  His main focus is on searching for exotic pulsar systems, such as millisecond pulsars and binaries.  Once these pulsars are identified, he uses them as tools to probe a variety of basic physics, including tests of general relativity, the emission (and hopefully soon the direct detection) of gravitational waves (as part of the NANOGrav collaboration), and the physics of matter at supra-nuclear densities.  Much of his time is spent working on the state-of-the-art signal-processing instrumentation, high-performance computing and software that pulsar astronomy requires.

Bethany Teachman, Department of Psychology, UVA


Computer Science Special Guest Speaker

Bethany Teachman, Department of Psychology, UVA
Friday, December 2, 2016
3:30 pm (short reception before talk at 3:00)
Rice Hall Auditorium
Host: Hongning Wang

Moving from the clinic to the lab to a screen near you:
Challenges in disseminating interpretation training

Michael Franklin, University of Chicago


UVA Computer Science Distinguished Speaker Series

Michael Franklin, Liew Family Chair of Computer Science
University of Chicago
Friday, December 9, 2016
3:30 PM
Olsson Hall, Room 011 (Note change in venue)
Host: Jane Qi

A Retrospective on the AMPLab and the Berkeley Data Analytics Stack

ABSTRACT

The Algorithms, Machines and People Laboratory (AMPLab) was launched by a group of systems and machine learning faculty at UC Berkeley in early 2011 and was awarded an NSF CISE Expeditions in Computing grant in 2012. The goal of the lab is to develop a new approach to large-scale data analytics (i.e., Big Data processing) that seamlessly integrates the three main resources available for making sense of data at scale: Algorithms (machine learning, statistical, and query processing techniques), Machines (scalable clusters and elastic cloud computing), and People (as individual analysts and as crowds). The lab has had significant impact on the Big Data software landscape through the development of a freely available open-source software stack called BDAS: the Berkeley Data Analytics Stack. BDAS is a comprehensive analytics platform that has been the incubator for a number of influential systems, including the Mesos cluster resource manager (now Apache Mesos), the Spark in-memory computation framework (now Apache Spark), and the Tachyon distributed storage system (now called Alluxio). It contains interfaces for streaming analytics, distributed machine learning, high-performance SQL processing, and graph analytics, among others. While serving as a unifying artifact for dozens of Ph.D. and postdoctoral researchers, BDAS software features prominently in many industry discussions of the future of the Big Data analytics ecosystem – a rare degree of impact for an academic project.

The AMPLab is in the final year of its planned six-year existence. In this talk I will provide an overview of AMPLab and BDAS with a focus on identifying the overarching themes of this large software research project. I will highlight some risks we took and a few that we didn’t and will then provide some thoughts on the future of data analytics systems and the best ways for academic researchers to influence that future.

Bio

Michael Franklin is the Liew Family Chair of Computer Science and a Senior Advisor to the Provost for Computation and Data at the University of Chicago. He joined U. Chicago in summer 2016 to help implement a major expansion of the Computer Science Department and to initiate a cross-campus effort in Data Science. He is also an adjunct faculty member at UC Berkeley, where he was the Thomas M. Siebel Professor of Computer Science and former Chair of the Computer Science Division. Prof. Franklin is currently serving as PI of the Algorithms, Machines, and People Laboratory (AMPLab) expedition. In addition to its status as an NSF CISE Expeditions project, AMPLab works with dozens of industrial sponsors, including founding sponsors Amazon Web Services, Google, IBM, and SAP. Prof. Franklin is a PI of the NSF Western Region Big Data Innovation Hub and was formerly a co-PI and Executive Committee member for the Berkeley Institute for Data Science, part of the Moore and Sloan Foundations’ initiative to advance Data Science Environments.  He is an ACM Fellow, a two-time winner of the ACM SIGMOD “Test of Time” award, has won several recent “Best Paper” awards and two CACM Research Highlights selections, and is a recipient of the Outstanding Advisor Award from the Computer Science Graduate Student Association at Berkeley.