data_mining_static_code_attributes_to_learn_defect_predictors

The authors make the case that, when it comes to defect prediction, "how the attributes are used to build predictors is much more important than which particular attributes are used". To come to this conclusion, they use 38 attributes and three different learners: OneR, J48, and naïve Bayes. As their dataset, they use the [[http://|NASA MDP]] repository.

The authors start by justifying the need for such a predictor: "These potential defect-prone trouble spots can then be examined in more detail by, say, model checking [...]".

The authors build a baseline using the following data:
  * Independent variables: the three learners OneR, J48, and naïve Bayes;
  * Studied objects: 8 systems whose metric values are available in the NASA MDP dataset;
  * Input data to the independent variables: 38 code metrics available in the NASA MDP dataset;
  * Dependent variable: the binary variable //defective//, i.e., whether or not a module contains a defect;
  * Measures: the probability of detection, //pd//, and of false alarm, //pf//.
| + | |||
| + | The authors explain their choice of the measures by explaining that " | ||
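
To make these measures concrete, here is a minimal Python sketch (not the authors' code) computing //pd//, //pf//, and their balance from the counts of a binary confusion matrix. The balance formula, the normalised distance from the ideal point (//pd// = 1, //pf// = 0), follows the paper's definition; the function and variable names are assumptions.

<code python>
# Minimal sketch (not the authors' code): pd, pf, and their balance from
# binary confusion-matrix counts. "balance" is the normalised distance from
# the ideal point (pd = 1, pf = 0); names are my own.
from math import sqrt

def pd_pf_balance(tp, fn, fp, tn):
    """Return (pd, pf, balance) given confusion-matrix counts."""
    pd = tp / (tp + fn) if (tp + fn) else 0.0   # defective modules correctly flagged
    pf = fp / (fp + tn) if (fp + tn) else 0.0   # defect-free modules wrongly flagged
    balance = 1.0 - sqrt((1.0 - pd) ** 2 + pf ** 2) / sqrt(2.0)
    return pd, pf, balance

# Toy example: 40 defects caught, 10 missed, 25 false alarms, 425 clean modules.
print(pd_pf_balance(tp=40, fn=10, fp=25, tn=425))
</code>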
| + | |||
| + | The authors describe in details the procedure for building the predictors and comparing them with one another: the " | ||
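
As an illustration of that kind of procedure, here is a rough scikit-learn sketch (a reconstruction, not the paper's setup): repeated stratified 10-fold cross-validation of two of the learners, scored on //pd// and //pf//. J48 (a C4.5 implementation) is approximated by scikit-learn's CART decision tree, OneR is left out because scikit-learn has no equivalent, and the NASA MDP data is stood in for by a hypothetical //kc3.csv// file with one row per module, its metric values, and a binary //defective// column.

<code python>
# Rough sketch of the comparison procedure (a reconstruction, not the
# authors' code): repeated stratified 10-fold cross-validation, scoring each
# learner on pd and pf. The file name "kc3.csv" and the "defective" column
# are hypothetical stand-ins for the NASA MDP data.
import numpy as np
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def pd_pf(y_true, y_pred):
    """Probability of detection and of false alarm for binary labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / max(tp + fn, 1), fp / max(fp + tn, 1)

data = pd.read_csv("kc3.csv")                 # hypothetical layout: one row per module
X = data.drop(columns=["defective"]).values   # the static code metrics
y = data["defective"].values                  # 1 = defective module, 0 = defect-free

learners = {
    "naive Bayes": GaussianNB(),
    "CART (stand-in for J48)": DecisionTreeClassifier(random_state=0),
}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
for name, learner in learners.items():
    scores = [pd_pf(y[test], learner.fit(X[train], y[train]).predict(X[test]))
              for train, test in cv.split(X, y)]
    mean_pd, mean_pf = np.mean(scores, axis=0)
    print(f"{name}: pd = {mean_pd:.2f}, pf = {mean_pf:.2f}")
</code>

Stratified folds keep the proportion of defective modules roughly constant across train and test sets, which matters because defective modules are typically a small minority of each system.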
| + | |||
| + | In conclusion, the authors show that the naïve Bayes-based predictor was the best, i.e., has the best balance between //pd// and //pf// over all other possible combination of attributes and independent variables. But they also show that the different attributes were better for different object systems: | ||
  * For //pc1//, the best code metrics are call_pairs, μ2, and number_of_lines;
  * For //mw1//, the best code metrics are B, node_count, and μ2;
  * For //kc3//, the best code metrics include loc_executable;
  * For //cm1//, the best code metrics include loc_comments;
  * For //pc2//, the best code metrics include loc_comments;
  * For //kc4//, the best code metrics are call_pairs, edge_count, and node_count;
  * For //pc3//, the best code metrics are loc_blanks, I, and number_of_lines;
  * For //pc4//, the best code metrics include loc_blanks and loc_code_and_comment.
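
To illustrate how "the best code metrics" per system could be identified, here is a simplified sketch of an exhaustive search over small attribute subsets, each scored by the balance between //pd// and //pf//. This is an approximation of the idea under stated assumptions (hypothetical //kc3.csv// file and //defective// column, naïve Bayes as the learner, subsets of at most three metrics), not the paper's exact selection procedure.

<code python>
# Simplified sketch of an exhaustive search over small attribute subsets,
# scored by the balance between pd and pf (distance from the ideal point
# pd = 1, pf = 0). An approximation of the idea, not the paper's exact
# procedure; "kc3.csv" and the "defective" column are hypothetical.
from itertools import combinations
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

def balance(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    prob_d = tp / max(tp + fn, 1)   # pd
    prob_f = fp / max(fp + tn, 1)   # pf
    return 1.0 - sqrt((1.0 - prob_d) ** 2 + prob_f ** 2) / sqrt(2.0)

data = pd.read_csv("kc3.csv")
metrics = [c for c in data.columns if c != "defective"]
y = data["defective"].values

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
best_subset, best_score = None, -1.0
for size in (1, 2, 3):              # small subsets, like the 2-3 metrics reported per system
    for subset in combinations(metrics, size):
        X = data[list(subset)].values
        score = balance(y, cross_val_predict(GaussianNB(), X, y, cv=cv))
        if score > best_score:
            best_subset, best_score = subset, score

print("best subset:", best_subset, "with balance", round(best_score, 2))
</code>

This brute-force search gets slow as the number of metrics grows, but it conveys the point made above: the subset that maximises the balance differs from one system to another.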
| + | |||
| + | The only limitations to the study (in addition to the threats mentioned in the paper) are that one of the authors worked with the NASA on the MDP program, thus there is possibly an experimenter bias. More seriously, the NASA MDP only provide metric values, no source code is available to check the quality of the data, compute different metrics, and apply different analyses! | ||