Abstract
A software project can
comprise many highly connected files. A developer may not
know which files are connected to those being developed or changed by
another developer, and faults can be introduced when necessary edits to
related files are missed. We build a prediction model for identifying files that should be
edited together during a code change, and evaluate the performance of our model
on more than 10 years of development history from two Apache projects. We conduct
an external, conceptual replication study based on Wiese et al.’s prior work on
predicting co-changed files. Our study shares the same goal but differs in
experimental design, namely in data set construction, selection of file
pairs, feature selection, and the model output. Although we use the same
performance measures, our prediction model’s results are much lower than those
reported in Wiese et al.’s study, mainly because of differences in how
these measures are calculated. Recall and precision rates of models evaluated
at commit granularity can be 20% and 45% lower, respectively, than those aggregated
over all file pairs. Although it is
practically more useful, predicting all files that will be co-changed
during a commit is more challenging than predicting whether a particular file
will be changed in that commit. More information about the context of a
co-change, the degree of centrality of a file in the project, or project
characteristics could yield further insight for building such predictors in the
future.
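
The gap between commit-level and pair-level measures can be illustrated with a minimal sketch. The commit identifiers, file names, and function names below are hypothetical and are not taken from the study's data or implementation. The sketch only assumes that commit-granularity evaluation averages precision and recall over commits, while the aggregated evaluation pools all predictions before computing them.

```python
from typing import Dict, Set, Tuple

Predictions = Dict[str, Set[str]]  # commit id -> set of files (hypothetical layout)


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision and recall for one commit; 0.0 when a denominator is empty."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def per_commit_average(predictions: Predictions, truths: Predictions) -> Tuple[float, float]:
    """Score each commit separately, then average the scores over commits."""
    scores = [precision_recall(predictions.get(c, set()), truths[c]) for c in truths]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n


def pooled(predictions: Predictions, truths: Predictions) -> Tuple[float, float]:
    """Pool all predicted and actual files across commits, then score once."""
    tp = sum(len(predictions.get(c, set()) & truths[c]) for c in truths)
    n_pred = sum(len(predictions.get(c, set())) for c in truths)
    n_actual = sum(len(truths[c]) for c in truths)
    return (tp / n_pred if n_pred else 0.0, tp / n_actual if n_actual else 0.0)


# Toy data (assumed): one large commit predicted perfectly, two small ones missed.
truths = {"c1": {"A", "B", "C", "D"}, "c2": {"E"}, "c3": {"F"}}
predictions = {"c1": {"A", "B", "C", "D"}, "c2": {"X"}, "c3": {"Y"}}

commit_precision, commit_recall = per_commit_average(predictions, truths)
pooled_precision, pooled_recall = pooled(predictions, truths)
print(f"per-commit: precision={commit_precision:.2f}, recall={commit_recall:.2f}")
print(f"pooled:     precision={pooled_precision:.2f}, recall={pooled_recall:.2f}")
```

In this toy case the pooled scores are dominated by the single large commit, so they come out higher than the per-commit averages, where every commit counts equally regardless of its size.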