Software Defect Prediction on Unlabelled Datasets with Machine Learning Techniques

  • Date: 16 May 2019, from 14:30 to 16:30

  • Venue: Sala Venturi, c/o CNAF headquarters, viale Berti Pichat, 6/2

Contact person:

Participants: Elisabetta Ronchieri (CNAF)

Abstract. Machine Learning techniques, both supervised and unsupervised, have been applied to a variety of software engineering tasks, including software defect prediction. The resulting models indicate which program modules (such as files and classes) are likely to be defective, enabling software teams to allocate resources effectively. This requires suitable datasets, usually composed of a set of software metrics computed for each module (that is, features over instances in Machine Learning terminology). These datasets have to be suitably preprocessed before Machine Learning techniques are applied, to avoid bias in the interpretation of the results.

Existing literature reports promising results with supervised defect prediction models. Unfortunately, gathering defect data is expensive in both effort and time: new projects, or projects with only partial historical data, may lack data for some features, and their modules are unlabelled. The vast majority of software datasets are unlabelled.

In this talk we will describe how to preprocess unlabelled data in order to cluster instances and label them as defective or non-defective, and how to build defect prediction models on labelled training datasets and verify them on test datasets. We will also present results obtained with different Machine Learning frameworks, ranging from Weka to TensorFlow, on unlabelled datasets from the Geant4 software.
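
As a rough illustration of the cluster-and-label step described above, the following minimal Python sketch standardizes a small table of module metrics, splits the modules into two groups with a plain k-means, and labels the group with the higher average metric values as defective. It is not the method presented in the talk: the metric values, the two metric names (lines of code and cyclomatic complexity), the choice of k-means, and the "higher metrics means defective" rule are all assumptions made for the example.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each metric column so no single metric dominates distances."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def dist2(a, b):
    """Squared Euclidean distance between two metric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans2(rows, iters=50, seed=0):
    """Plain k-means with k=2; returns a cluster index (0 or 1) per row."""
    rng = random.Random(seed)
    centroids = rng.sample(rows, 2)  # two distinct rows as initial centroids
    assign = None
    for _ in range(iters):
        new_assign = [
            0 if dist2(r, centroids[0]) <= dist2(r, centroids[1]) else 1
            for r in rows
        ]
        if new_assign == assign:  # assignments stable: converged
            break
        assign = new_assign
        for k in (0, 1):
            members = [r for r, a in zip(rows, assign) if a == k]
            if members:
                centroids[k] = [mean(c) for c in zip(*members)]
    return assign

def label_defective(rows, assign):
    """Assumed heuristic: the cluster with the higher mean metric sum is 'defective'."""
    score = {}
    for k in (0, 1):
        sums = [sum(r) for r, a in zip(rows, assign) if a == k]
        score[k] = mean(sums) if sums else float("-inf")
    defective = max(score, key=score.get)
    return [a == defective for a in assign]

# Hypothetical, unlabelled modules: [lines_of_code, cyclomatic_complexity]
modules = [[120, 4], [95, 3], [110, 5], [130, 6],
           [900, 45], [1100, 60], [850, 38]]

scaled = standardize(modules)
clusters = kmeans2(scaled)
labels = label_defective(scaled, clusters)  # True means "labelled defective"
print(labels)
```

The resulting labels could then serve as the target column for a supervised model trained on one part of the data and verified on the rest, as the abstract outlines; that step is left out of the sketch.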