The value of using static code attributes to learn defect predictors has been widely debated. Prior work has explored issues like the merits of “McCabes versus Halstead versus lines of code counts” for generating defect predictors. We show here that such debates are irrelevant since how the attributes are used to build predictors is much more important than which particular attributes are used. Also, contrary to prior pessimism, we show that such defect predictors are demonstrably useful and, on the data studied here, yield predictors with a mean probability of detection of 71 percent and mean false alarms rates of 25 percent. These predictors would be useful for prioritizing a resource-bound exploration of code that has yet to be inspected.
Yann-Gaël Guéhéneuc, 2014/02/12
The authors make the case that, when it comes to defect prediction, “how the attributes are used to build predictors is much more important than which particular attributes are used”. To come to this conclusion, they use 38 attributes and three different learners: OneR, J48, and naïve Bayes. As their dataset, they use the NASA MDP (Metrics Data Program). They show that they can build a defect predictor that has “a mean probability of detection of 71 percent and mean false alarms rates of 25 percent”.
The authors start by justifying the need for such predictors: “These potential defect-prone trouble spots can then be examined in more detail by, say, modelchecking, intensive testing, etc.” Then, they justify the use of static code metrics, against Fenton and Pfleeger's “insightful example where the same functionality is achieved using different programming language constructs resulting in different static measurements for that module” by stating that static code metrics “are useful as probabilistic statements that the frequency of faults tends to increase in code modules that trigger the predictor” and that “[w]e [should] actively [research] better code metrics which, potentially, [would] yield “better” predictors.” But, before finding “better” code metrics, the authors argue that we need a baseline for prediction models. They propose such a baseline in their paper.
The authors build a baseline using the following data:
The authors justify their choice of measures by noting that “accuracy is a poor measure of a learner's performance. For example, a learner could score 90 percent accuracy on a data set with 10 percent defective modules, even if it predicts that all defective modules were defect-free”. They also avoid using self-tests that “can grossly overestimate performance”. A self-test is a test on part of the data used to build the predictor, typically when doing 10-fold validation. The authors favour using “holdout modules not used in the generation of that predictor”. Finally, the authors apply a logarithmic filter “on all numeric values [to] improve predictor performance”.
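The accuracy argument can be checked with a small worked example. The sketch below is not the authors' code: the function names `rates` and `log_filter` are hypothetical, the probability of detection (pd) and probability of false alarm (pf) use the usual confusion-matrix definitions, and the `floor` parameter in the logarithmic filter is an assumption added here only to avoid taking the logarithm of zero.

```python
import math

def rates(actual, predicted):
    """Return (accuracy, pd, pf) for two equal-length lists of booleans.

    True means "defective". pd is the fraction of defective modules that
    are flagged; pf is the fraction of defect-free modules that are flagged.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # defective, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # defective, missed
    fp = sum(1 for a, p in pairs if not a and p)      # clean, flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # clean, not flagged
    accuracy = (tp + tn) / len(pairs)
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, pd, pf

def log_filter(value, floor=1e-6):
    """Replace a numeric attribute value by its natural logarithm.

    The floor is an assumption of this sketch: it keeps log() defined
    for zero-valued attributes.
    """
    return math.log(max(value, floor))

# 100 modules, 10 percent defective, learner predicts "defect-free" for all.
actual = [True] * 10 + [False] * 90
predicted = [False] * 100
accuracy, pd, pf = rates(actual, predicted)
print(accuracy, pd, pf)  # 0.9 0.0 0.0
```

The learner reaches 90 percent accuracy while detecting none of the defective modules, which is why pd and pf are reported instead of accuracy.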