A new look at fault localization and repair in debug using learning based on deep semantic features. Paul Cunningham (Senior VP/GM, Verification at Cadence), Raúl Camposano (Silicon Catalyst, entrepreneur, former Synopsys CTO and now Silvaco CTO) and I continue our series on research ideas. As always, feedback welcome.
This month’s pick is Improving Fault Localization and Program Repair with Deep Semantic Features and Transferred Knowledge. The authors presented the paper at the 2022 International Conference on Software Engineering and are from Beihang University in Beijing and the University of Newcastle in New South Wales, Australia.
The method goes beyond familiar spectrum-based (SBFL) and mutation-based (MBFL) localization techniques to use deep learning from pre-qualified datasets of bugs and committed fixes. The fix aspect is important here because it depends on very accurate localization of a fault (in fact, localization and the nature of the fix are closely linked). The authors use SBFL and MBFL metrics as inputs to their deep-learning method. They demonstrate that their method is more effective than selected SBFL and MBFL approaches, and argue this is because other methods have either no semantic understanding or only a shallow understanding of semantic features of the code, whereas their approach deliberately learns deeper semantic features.
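As a reminder of the spectrum-based starting point these methods build on, here is a minimal sketch of the classic Ochiai suspiciousness score, one common SBFL metric (the coverage numbers below are invented for illustration):

```python
import math

def ochiai(failed_cov, passed_cov, total_failed):
    """Ochiai suspiciousness: failed_cov = # failing tests covering the
    statement, passed_cov = # passing tests covering it."""
    if failed_cov == 0:
        return 0.0
    return failed_cov / math.sqrt(total_failed * (failed_cov + passed_cov))

# Toy coverage spectra for three statements (invented numbers):
# s2 is covered by all failing tests and only one passing test.
spectra = {"s1": (1, 8), "s2": (4, 1), "s3": (2, 6)}
total_failed = 4

ranked = sorted(spectra, key=lambda s: ochiai(*spectra[s], total_failed),
                reverse=True)
print(ranked)  # s2 ranks as most suspicious
```

Scores like these (and their MBFL analogs) become input features to the learned model rather than the final answer.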
Fully automatic repair might be a step too far for RTL debug; however, suggested fixes are already familiar from spelling and grammar checkers, hinting that this feature might also prove valuable in verification.
One year on from our review of DeepFL, we take a look at another paper that tries to move the needle further. Automatic bug detection is now a regular topic with Cadence customers, and expectations are high that deep neural networks or large language models can dramatically reduce time spent root-causing bugs.
DeepFL used an RNN to rank code for bugs based on suspiciousness features (complexity-based, mutation-based, spectrum-based, text-based). This month’s paper adds a bug template matching step as a further input feature to improve accuracy. The bug template matcher is itself another RNN that matches individual code statements to one or more of 11 possible bug templates, e.g. a missing null-pointer check, incorrect type cast, incorrect statement position, or incorrect arithmetic operator.
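To make the template-matching idea concrete, here is a toy stand-in: the paper’s matcher is an RNN, but a few regexes can flag statements that merely *look like* template candidates. Template names follow the article’s examples; the patterns are invented and far cruder than a learned matcher:

```python
import re

# Toy stand-in for the paper's RNN template matcher: regexes flag
# statements that superficially resemble template candidates.
TEMPLATES = {
    "missing_null_check": re.compile(r"\.\w+\("),         # a dereference, maybe unchecked
    "incorrect_type_cast": re.compile(r"\(\s*\w+\s*\)\s*\w+"),  # (Type) var
    "incorrect_operator": re.compile(r"[+\-*/]\s*1\b"),   # off-by-one-ish arithmetic
}

def match_templates(statement):
    """Return the list of template names whose pattern fires on the statement."""
    return [name for name, pat in TEMPLATES.items() if pat.search(statement)]

print(match_templates("int n = list.size() - 1;"))
# -> ['missing_null_check', 'incorrect_operator']
```

A real matcher has to understand context (is the dereference actually guarded elsewhere?), which is exactly why the authors train a network rather than write rules.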
The key contribution of the paper for me is the dataset the authors build to train their bug template matching network. They mine open-source GitHub projects to find real bug fixes that match their bug templates. For each match they require that there be another matching statement elsewhere in the same source file that is not part of the bug fix, so that the dataset contains both positive and false-positive matches. The final dataset has around 400,000 positive/false-positive bug fix pairs. Nice!
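A sketch of that pairing rule as I read it: the statement the fix changed is the positive example, and an unchanged statement in the same file matching the same template becomes its paired false positive. The helper names and data here are invented:

```python
def build_pair(file_statements, fixed_idx, matches_template):
    """Return (positive, false_positive) or None if no unfixed match exists.

    file_statements: all statements in the file; fixed_idx: index of the
    statement the bug-fix commit changed; matches_template: predicate for
    the template the fix matched.
    """
    positive = file_statements[fixed_idx]
    for i, stmt in enumerate(file_statements):
        if i != fixed_idx and matches_template(stmt):
            return positive, stmt
    return None  # no hard-negative counterpart in this file; skip the match

stmts = ["x = a + 1;", "y = b + 1;", "z = c * 2;"]
pair = build_pair(stmts, 0, lambda s: "+ 1" in s)
print(pair)  # ('x = a + 1;', 'y = b + 1;')
```

The false-positive half of each pair is what teaches the matcher that “looks like the template” is not the same as “is the bug”.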
As with DeepFL, the authors benchmark their tool TRANSFER-FL using Defects4J. Results look decent – 171 of the 395 bugs in Defects4J are ranked Top-5 by TRANSFER-FL vs. 140 using DeepFL. However, from a commercial standpoint 171 is still less than half of the total 395 benchmark set. If you look at the average rank across all 395 it’s 80, a long way from Top-5, so a ways off commercial deployment. I’m looking forward to reviewing some large language model-based papers next year that move the needle yet further 😊
This month we move into the areas of fault localization and automatic program repair for SW; the reviewed paper explores these techniques for Java code. In May 2022 we reviewed DeepFL, which is similar to this paper in that it extends traditional spectrum- and mutation-based techniques for fault localization with deep learning models.
To state my conclusion upfront: perhaps automatic RTL or SystemC fault localization and code repair will become routine in the foreseeable future. The authors are optimistic about applicability to other languages, noting that “most of the fix templates can be generalized to other languages because of the generic representation of AST (Abstract Syntax Tree)”, with the caveat that sufficient data must be available to train the different networks used in their approach. For the paper, 2,000 open-source Java projects from GitHub were collected to construct 11 fault localization datasets with a total of 392,567 samples (faulty statements for 11 bug types that have a bug-fix commit), and a program repair dataset with 11 categories of bug fixes totaling 408,091 samples, each sample consisting of a faulty statement with its contextual method and corresponding bug type. An AST representation is used to do this matching.
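The paper matches Java fix templates over ASTs; to illustrate why an AST representation generalizes better than raw text, here is the same idea in Python’s own `ast` module. This toy walker flags one invented template candidate, a subtraction-by-one expression (a frequent off-by-one suspect), regardless of how the source is formatted:

```python
import ast

def find_candidates(src):
    """Flag BinOp nodes of the form <expr> - 1 as template candidates,
    returning the node type of the left-hand expression for each hit."""
    hits = []
    for node in ast.walk(ast.parse(src)):
        if (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Sub)
                and isinstance(node.right, ast.Constant)
                and node.right.value == 1):
            hits.append(type(node.left).__name__)
    return hits

print(find_candidates("n = len(items) - 1"))  # -> ['Call']
```

Because the match is on tree structure rather than tokens, the same template survives whitespace, parenthesization, and (in principle) translation to another language with a comparable AST.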
The detailed approach, called TRANSFER, is rather complex and takes some time to digest; 67 references help to dive into the details. It leverages existing approaches for fault localization: 1) spectrum-based features, which take source code and relevant test cases as inputs and output a list of code elements ordered by suspiciousness scores calculated from test-case executions, and 2) mutation-based features, which calculate suspiciousness scores by analyzing how execution results change between an original code element and its mutants. To these it adds 3) deep semantic features, obtained from BiLSTM (Bidirectional Long Short-Term Memory) binary classifiers trained on the fault localization datasets. Program repair is done using a fine-tuned multi-class classifier trained on the program repair dataset.
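The three feature families ultimately have to be fused into a single ranking of suspicious statements. TRANSFER learns that combination; as a toy stand-in, a fixed weighted sum (weights and scores invented) shows the shape of the final step:

```python
def rank(statements, w_spec=0.4, w_mut=0.4, w_sem=0.2):
    """Rank statements by a weighted sum of their per-family scores.
    The real system learns this fusion; the weights here are invented."""
    score = lambda s: (w_spec * s["spectrum"] + w_mut * s["mutation"]
                       + w_sem * s["semantic"])
    return sorted(statements, key=score, reverse=True)

stmts = [
    {"id": "s1", "spectrum": 0.2, "mutation": 0.1, "semantic": 0.9},
    {"id": "s2", "spectrum": 0.8, "mutation": 0.7, "semantic": 0.6},
]
print([s["id"] for s in rank(stmts)])  # -> ['s2', 's1']
```

The repair stage then consumes this ranked list, attempting template-guided fixes on the most suspicious statements first.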
The bottom line is that TRANSFER outperforms existing approaches, successfully fixing 47 bugs (6 more than the best existing approaches) on the Defects4J benchmark.
Writing and debugging SW is already routinely assisted by AI such as GitHub Copilot; designing hardware, aka writing RTL or higher-level code, can’t be too far behind, with perhaps the largest obstacle being the availability of the data required for training.