Bug prediction dataset

Reference the dataset

If you use this dataset for your research, please reference the following paper:

An Extensive Comparison of Bug Prediction Approaches
Marco D'Ambros, Michele Lanza, Romain Robbes
In Proceedings of MSR 2010 (7th IEEE Working Conference on Mining Software Repositories), pp. 31 - 41. IEEE CS Press, 2010.

What is it?

The bug prediction dataset is a collection of models and metrics of software systems and their histories. Its goal is to allow people to compare different bug prediction approaches and to evaluate whether a new technique is an improvement over existing ones. In particular, the dataset contains the data needed to:

run a prediction technique based on source code metrics and/or historical measures and/or process information (CVS log data);
compute the performance of the prediction by comparing its results with an oracle set, i.e., the number of post-release defects reported in the bug tracking system.
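The comparison against the oracle can be sketched as follows. This is a minimal illustration, assuming that Spearman's rank correlation between predicted and actual defect counts is the chosen performance measure; other measures are equally possible. The function names and the sample numbers are hypothetical and are not part of the dataset.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(predicted, actual):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and are not constant.
    """
    rp, ra = ranks(predicted), ranks(actual)
    n = len(rp)
    mean_p, mean_a = sum(rp) / n, sum(ra) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(rp, ra))
    var_p = sum((p - mean_p) ** 2 for p in rp)
    var_a = sum((a - mean_a) ** 2 for a in ra)
    return cov / (var_p * var_a) ** 0.5


# Hypothetical values: one entry per class.
predicted = [0.2, 1.5, 0.1, 3.0, 0.9]  # output of some prediction technique
actual = [0, 2, 0, 5, 1]               # oracle: post-release defects
print(spearman(predicted, actual))
```

A value close to 1 means the technique ranks classes in nearly the same order as the oracle does.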

The dataset is designed for bug prediction at the class level. However, package or subsystem information can be derived by aggregating class data, since the dataset specifies the containing package for each class.
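As a rough sketch of that aggregation, the snippet below sums class-level defect counts per package. The record layout (package, class name, defect count) and the sample package names are assumptions made for illustration; the actual file layout of the dataset may differ.

```python
from collections import defaultdict

# Hypothetical class-level records: (package, class name, post-release defects).
class_records = [
    ("org.example.core", "Parser", 3),
    ("org.example.core", "Lexer", 1),
    ("org.example.ui", "MainWindow", 0),
    ("org.example.ui", "Dialog", 2),
]


def aggregate_by_package(records):
    """Sum the defect counts of all classes that share a package."""
    totals = defaultdict(int)
    for package, _class_name, defects in records:
        totals[package] += defects
    return dict(totals)


package_defects = aggregate_by_package(class_records)
print(package_defects)
```

Summation suits counts such as defects; for other metrics a different aggregate (mean, maximum) may be more appropriate.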

What does it contain?

The bug prediction dataset contains data about the following software systems:

For each system the dataset includes the following pieces of information:

Biweekly versions of the systems, parsed (with the inFusion tool) into object-oriented models and provided as mse files;
Historical information extracted from the CVS change log, including reconstructed transactions and links from transactions to model classes;
Values of 15 metrics computed from CVS change log data, for each class of the systems;
Values of 17 source code metrics (the 6 CK metrics plus 11 further object-oriented metrics), for each version of each class;
Post-release defect counts for each class, categorized by severity and priority.

The data is available for download on the Download page.

What can I do with it?

With the bug prediction dataset, you can use (or compute) a number of metrics to build generalized linear regression models that predict, at the class level, the number of post-release defects. The performance of these models can then be evaluated by comparing the prediction results against the actual post-release defects provided as part of the dataset.
In particular, it is possible to use or compute the following metrics as predictors, or to design and compute novel ones:

Change metrics (from CVS change logs), as proposed by Moser et al.
CK metrics, as proposed by Basili et al.
Object-oriented metrics (e.g., number of methods, number of attributes).
Number of previous defects, as proposed by Kim et al.
Complexity of code change, as proposed by Hassan.
Churn of CK and object-oriented metrics, as proposed by D'Ambros et al.
Entropy of CK and object-oriented metrics, as proposed by D'Ambros et al.
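To give a feel for one of the listed predictors, the sketch below computes the Shannon entropy of how changes are distributed over classes in a time period, which is the idea underlying the complexity of code change. It is a simplified illustration rather than the exact formulation used in the cited work; the function name, the optional normalization by log2(n), and the sample change counts are assumptions.

```python
import math


def change_entropy(changes_per_class, normalize=True):
    """Shannon entropy (in bits) of the distribution of changes over classes.

    changes_per_class maps a class name to its number of changes in one period.
    With normalize=True the result is divided by log2(n), the maximum entropy
    for n classes, so that it lies between 0 and 1.
    """
    total = sum(changes_per_class.values())
    if total == 0:
        return 0.0
    probabilities = [c / total for c in changes_per_class.values() if c > 0]
    entropy = sum(p * math.log2(1 / p) for p in probabilities)
    n = len(changes_per_class)
    if normalize and n > 1:
        entropy /= math.log2(n)
    return entropy


# Hypothetical periods: changes focused on one class vs. spread evenly.
focused = {"A": 10, "B": 0, "C": 0, "D": 0}
scattered = {"A": 3, "B": 3, "C": 3, "D": 3}
print(change_entropy(focused), change_entropy(scattered))
```

Changes concentrated in a single class give the lowest entropy, while changes spread evenly over all classes give the highest.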

All the listed defect prediction techniques, and their application to the bug prediction dataset, are described in detail in the paper:
An Extensive Comparison of Bug Prediction Approaches
Marco D'Ambros, Michele Lanza, Romain Robbes
In Proceedings of MSR 2010 (7th IEEE Working Conference on Mining Software Repositories), pp. 31 - 41. IEEE CS Press, 2010.
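The modelling step mentioned in this section can be illustrated with a very small example. The sketch below fits an ordinary least squares line with a single predictor, which is the simplest special case of a generalized linear model and is swapped in here for brevity. It is not the setup used in the paper; the function names and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Assumes xs and ys have the same length and xs is not constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted number of post-release defects for a predictor value x."""
    return intercept + slope * x


# Hypothetical values: one entry per class.
changes = [1, 4, 2, 8, 5]  # predictor, e.g. number of changes to the class
defects = [0, 2, 1, 4, 3]  # post-release defects of the class

intercept, slope = fit_line(changes, defects)
print(intercept, slope, predict(intercept, slope, 6))
```

In practice the model would be fitted on one set of classes and its predictions compared against the actual post-release defects of another set.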