Learning from code and non-code artifacts

Available as
Online
Physical
Summary

Three things are fundamentally true about software: (i) every day we, as a society, generate more software (more code, more documentation, and more software-related artifacts of all kinds); (ii) it is easier to write new software than to understand and maintain existing software; and (iii) we depend on software in every area of our lives (from critical infrastructure to entertainment and everything in between). These three truths set the stage for one massive problem: if there is more software every day, and it is hard to understand and maintain, how can we ever "keep up" with this unbounded growth? How can we ever truly understand the software we depend on if we are adding to it every single day? In this thesis, we provide new ideas and tools that help with some of these issues. More specifically, this thesis takes the position that we need tools and techniques for understanding and learning from software. To do this, we consider software to be a composite of source code and other, non-code artifacts (build scripts, documentation, etc.). We introduce techniques for working with both. For code, we introduce a form of code embeddings learned from a semantic representation of code, abstracted symbol traces; we then create a novel specification mining technique that uses these semantic code embeddings, and we explore the robustness of models of code. To address non-code artifacts, we mine tree-association rules from Dockerfiles, from which we learn best practices; we then use these learned best practices to create a human-in-the-loop technique for automated repair of Dockerfiles. Finally, to accelerate empirical research on software and lay the groundwork for a more comprehensive solution to trusting the growing amount of software we, as a society, create, we introduce code-book.
code-book is a tool for interactively querying and analyzing code inspired by the great successes of the Data Science community. With code-book, we introduce a novel query-by-example-based query language for asking questions about code. Furthermore, we develop this query language so that users can ask questions that incorporate both code structure and "fuzzy" semantic constraints (based on code embeddings).