Books

Learning over joins

Author / Creator
Kumar, Arun, 1988- author
Available as
Online
Physical
Summary

Advanced analytics using machine learning (ML) is increasingly critical for a wide variety of data-driven applications that underpin the modern world. Many real-world datasets have multiple tables with key-foreign key relationships, but most ML toolkits force data scientists to join them into a single table before using ML. This process of "learning after joins" introduces redundancy in the data, which results in storage and runtime inefficiencies, as well as data maintenance headaches for data scientists. To mitigate these issues, this dissertation introduces the paradigm of "learning over joins," which includes two orthogonal techniques: avoiding joins physically and avoiding joins logically. The former shows how to push ML computations through joins to the base tables, which improves runtime performance without affecting ML accuracy. The latter shows that in many cases, it is possible, somewhat surprisingly, to ignore entire base tables without affecting ML accuracy significantly. Overall, our techniques help improve the usability and runtime performance of ML over multi-table datasets, sometimes by orders of magnitude, without degrading accuracy significantly. Our work forces one to rethink a prevalent practice in advanced analytics and opens up new connections between data management systems, database dependency theory, and machine learning.
