Machine Learning for Defect Prediction

Aug 6, 2021

What is Defect Prediction

Software defect prediction is the process of identifying the modules that are defect-prone and require extensive testing. It gives development teams observable outcomes to act on: predicting which code areas are likely to be defective helps developers find bugs and organize their testing activities.

Why is Defect Prediction Necessary

Defect prediction is one of the most important activities in software development because it is an efficient way to relieve the burden of code inspection and testing. Predicting defective code regions helps developers find bugs and prioritize their testing efforts, so that resources are used efficiently within the project's constraints.

Non ML Approaches

Search Based Techniques
Evolutionary Computing Techniques

Machine Learning Approaches

Within-Project Defect Prediction [WPDP]
Cross-Project Defect Prediction [CPDP] for Similar Datasets
Cross-Project Defect Prediction [CPDP] for Heterogeneous Datasets
Personalized Defect Prediction
Defect Prediction for Unlabeled Data
Better Feature Generation

Advantages of Defect Prediction

Defect prediction improves the efficiency of the testing phase and helps developers assess the quality and defect-proneness of the software product. Finding and fixing defects is estimated to cost billions, so automated help in reliably predicting where the faults are, and in focusing the efforts of testers, has a significant impact on the cost of producing and maintaining software.

Personalized Defect Prediction

In software engineering, different developers have different coding styles, so a generalized model may not work well for every developer’s code.
A better solution is to train a separate defect prediction model for each individual developer.
Each developer tends to write certain kinds of code that frequently contain errors. For example, 50% of developer A’s errors might be related to for loops, while only 10% of developer B’s errors are.

Change Classifier: A change classifier predicts whether a change is defective whenever lines of code are changed.

A hybrid approach that combines the personalized predictors with a global predictor trained on all developers gives even better results.

Steps involved in training a ‘Change Predictor’:

Label each change as clean or buggy by mining the project’s revision history.
Identify bug-fixing changes by searching the commit logs for the word “fix”.
Assume that the lines modified by a bug-fixing change are the location of a bug.

The changes that introduced those buggy lines are the bug-introducing changes.
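
The labeling steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the tooling used in any particular study: the `Commit` class, its `touched_line_origins` field (standing in for what `git blame` would report for the lines a commit modifies) and the sample history are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Matches "fix", "fixes", "fixed", "fixing" as whole words (so not "prefix").
FIX_PATTERN = re.compile(r"\bfix(e[sd]|ing)?\b", re.IGNORECASE)


@dataclass
class Commit:
    sha: str
    message: str
    # Hypothetical input: each line this commit modified, mapped to the
    # earlier commit that last touched it (what `git blame` would report).
    touched_line_origins: dict = field(default_factory=dict)


def is_bug_fix(commit):
    """A commit counts as bug-fixing if its log message mentions a fix."""
    return FIX_PATTERN.search(commit.message) is not None


def label_changes(history):
    """Label every commit as 'buggy' (bug-introducing) or 'clean'."""
    bug_introducing = set()
    for commit in history:
        if is_bug_fix(commit):
            # Lines changed by a fix are assumed to be bug locations, so the
            # commits that introduced those lines are bug-introducing.
            bug_introducing.update(commit.touched_line_origins.values())
    return {c.sha: "buggy" if c.sha in bug_introducing else "clean"
            for c in history}


history = [
    Commit("a1", "Add retry loop to uploader"),
    Commit("b2", "Refactor config parsing"),
    Commit("c3", "Fix off-by-one in retry loop", {"uploader.py:42": "a1"}),
]
labels = label_changes(history)
print(labels)
```

In this toy history, commit `c3` is recognized as a fix, and the line it touches traces back to `a1`, so `a1` is the change labeled buggy. A keyword search like this is a heuristic and will mislabel commits whose messages mention fixes that are not bug fixes.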

Extract features such as Bag-of-Words and an additional feature called the characteristic vector.
Characteristic Vectors: represent the syntactic structure of the code by counting the number of nodes of each type in the abstract syntax tree (AST).
For example, if the code uses ‘if’, ‘for’ and ‘while’, the characteristic vector holds the counts of if, for and while nodes, in that order. A change that removes one for loop then has the vector (0, -1, 0).
In addition to these two features, Bag-of-Words and Characteristic Vectors, generate features from metadata such as commit hour, commit day, file age in days and so on.
For the personalized change predictor (PCC), train a machine learning algorithm such as Naive Bayes for each developer.
With PCC, for each instance in the test set, identify the developer who made the change and feed the instance to the model trained for that developer to decide whether the change is defect-introducing.
PCC+: a combination of PCC (personalized change predictor) and CC (normal change predictor).
For each prediction, run PCC, CC and weighted CC, and take the output of the predictor with the highest confidence as the final prediction.
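
Two of these steps are easy to illustrate in code: the characteristic vector of a change, and the highest-confidence rule of PCC+. The sketch below makes several assumptions. It uses Python's built-in `ast` module as a stand-in parser (the approach itself is not tied to Python), it tracks only three node types in the order (if, for, while), and the function names and confidence values are made up for the example.

```python
import ast
from collections import Counter

# Node types tracked by this sketch, in the order (if, for, while).
NODE_TYPES = ("If", "For", "While")


def characteristic_vector(source):
    """Count how many AST nodes of each tracked type the code contains."""
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    return tuple(counts[name] for name in NODE_TYPES)


def change_vector(before, after):
    """Characteristic vector of a change: counts after minus counts before."""
    old = characteristic_vector(before)
    new = characteristic_vector(after)
    return tuple(n - o for o, n in zip(old, new))


def pcc_plus(predictions):
    """Pick the (label, confidence) pair with the highest confidence.

    `predictions` maps a predictor name to its (label, confidence) output.
    Returns the winning label and the name of the predictor that produced it.
    """
    winner = max(predictions, key=lambda name: predictions[name][1])
    return predictions[winner][0], winner


before = """
for x in items:
    if x:
        print(x)
"""
after = """
if items:
    print(items[0])
"""
delta = change_vector(before, after)
print(delta)  # (0, -1, 0): one for loop was removed

# Hypothetical outputs of the three predictors for a single change.
final = pcc_plus({
    "PCC": ("buggy", 0.81),
    "CC": ("clean", 0.64),
    "weighted CC": ("buggy", 0.70),
})
print(final)  # ('buggy', 'PCC')
```

In a real pipeline the confidence values would come from the trained classifiers, for example the class probabilities of a Naive Bayes model, and the change vector would be joined with the Bag-of-Words and metadata features before training.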

How to Learn More

Learning is a never-ending process. To gain further knowledge of the application of Machine Learning to Defect Prediction, the following approaches can be taken:

Follow top software engineering venues, such as the conferences ICSE, ASE, FSE and MSR and the journals TOSEM and TSE.
Read a good survey paper or literature review.
Check out the reference sections of research papers.

Acknowledgement

I would like to thank my professor Dr Pankaj Jalote.
