It's almost 2010 and I'm only just deciding what to implement. Sigh.
Basically, it's a supervised learning model which I intend to make available at least through the Findbugs Swing UI and, hopefully, through the Eclipse UI as well. The project is aimed at serving a subset of the workflow models, or scenarios, possible with Findbugs. Examples of possible scenarios include: using rapid feedback from the tool while developing; using the tool as part of a continuous build process; or using the tool to validate a release candidate as part of the QA release process. The scenario I want to focus on is the introduction of Findbugs to a large, existing project.
This scenario presents specific problems that I want to address with the project. These include:
- initial analyses with Findbugs could overwhelm users (consider the use of Findbugs on IntelliJ IDEA, which resulted in ~2,800 bug alerts)
- initial inspections, if ordered badly, could result in a high initial false positive rate, giving the impression that the tool only generates false positives and leading to the tool being discarded
- with a huge number of alerts, users could end up suppressing what they see as similar false positives time and time again, which is slow, laborious, costly and frustrating
The implementation will be based on techniques described by Kremenek et al. in "Correlation Exploitation in Error Ranking". The two important concepts described in that paper are FEEDBACK-RANK and Information Gain. FEEDBACK-RANK is a probabilistic technique that exploits several types of correlation between bug alerts to create an ordering from most-likely-true-positive to least-likely-true-positive. This ordering can then be used to present bug alerts to users such that true positives are shown first. The effect is that when users begin to see high rates of false positives, they can stop inspecting the reports, because they have most likely already seen all the true errors. Information Gain is a factor used as a tie-breaker when bug alerts have an equal probability of being true positives: the alert chosen to be inspected is the one that will provide the most useful information about the remaining alerts.
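To make this concrete for myself, here's a rough sketch of the ranking loop I have in mind. It is not Kremenek et al.'s actual model (their formulation is a proper Bayesian one); it just groups alerts into clusters, tracks inspected true/false counts per cluster with a simple prior, ranks uninspected alerts by the estimated probability of being a true positive, and breaks ties by preferring alerts from the most uncertain cluster. All class and method names here are hypothetical.

```java
import java.util.*;

// Hypothetical sketch: cluster-based ranking with an entropy tie-break.
// Not the paper's exact model; a simplified Beta-Bernoulli approximation.
class RankingSketch {

    // Per-cluster feedback counts, seeded with a uniform Beta(1,1) prior.
    static class Cluster {
        double truePositives = 1;   // pseudo-count for confirmed true positives
        double falsePositives = 1;  // pseudo-count for confirmed false positives

        double probTruePositive() {
            return truePositives / (truePositives + falsePositives);
        }

        // Entropy of the cluster's estimate: high when we know little about it.
        double entropy() {
            double p = probTruePositive();
            return -(p * Math.log(p) + (1 - p) * Math.log(1 - p));
        }
    }

    record Alert(String id, String clusterKey) {}

    private final Map<String, Cluster> clusters = new HashMap<>();

    // Rank uninspected alerts: most-likely-true-positive first,
    // ties broken by preferring the most informative (uncertain) cluster.
    List<Alert> rank(List<Alert> uninspected) {
        List<Alert> ordered = new ArrayList<>(uninspected);
        ordered.sort(
            Comparator.comparingDouble((Alert a) -> cluster(a).probTruePositive()).reversed()
                      .thenComparing(Comparator.comparingDouble((Alert a) -> cluster(a).entropy()).reversed()));
        return ordered;
    }

    // User feedback on an inspected alert updates its cluster's counts.
    void recordFeedback(Alert alert, boolean wasTruePositive) {
        Cluster c = cluster(alert);
        if (wasTruePositive) c.truePositives++; else c.falsePositives++;
    }

    private Cluster cluster(Alert alert) {
        return clusters.computeIfAbsent(alert.clusterKey(), k -> new Cluster());
    }
}
```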
In the scenario of introducing Findbugs to a large project, the initial ordering of results is crucial. Taking a factor from the tool itself is always going to be a limited approach; e.g. the priority factor of a bug alert is chosen by the developers of the tool, with no knowledge of the project the tool is being used on. While the FEEDBACK-RANK algorithm will quickly begin to differentiate between true and false positives, the initial ordering is random, and this could affect the order in which the true positives are eventually seen.
What I hope to achieve with the initial supervised learning session is that Information Gain alone is used to choose which alerts to present to users for their feedback. This makes a larger up-front investment in order to gain a better context for the analysed project. The hope is that this initial learning can result in a net reduction in the number of alerts which have to be inspected before all true positives are seen.
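As a rough illustration of what I mean by driving the initial session purely by Information Gain, the snippet below computes the expected entropy reduction from inspecting one more alert in a cluster, using the same simplified cluster-count view as in the earlier sketch. This is my own simplified reading, not necessarily the exact definition used in the paper.

```java
// Hypothetical sketch: pick the next alert to inspect purely by expected
// information gain, under a simple Beta-Bernoulli view of each cluster.
// The counts include a Beta(1,1) prior; names are illustrative only.
final class InformationGainSketch {

    private static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p));
    }

    // Expected reduction in entropy for a cluster with the given pseudo-counts
    // if the user inspects one more alert from it.
    static double expectedGain(double truePositives, double falsePositives) {
        double total = truePositives + falsePositives;
        double p = truePositives / total;

        double entropyIfTrue = entropy((truePositives + 1) / (total + 1));
        double entropyIfFalse = entropy(truePositives / (total + 1));

        double expectedPosteriorEntropy = p * entropyIfTrue + (1 - p) * entropyIfFalse;
        return entropy(p) - expectedPosteriorEntropy;
    }

    public static void main(String[] args) {
        // A fresh cluster (no feedback yet) is worth more to inspect than a
        // cluster the user has already given plenty of feedback on.
        System.out.printf("fresh cluster:      %.4f%n", expectedGain(1, 1));
        System.out.printf("well-known cluster: %.4f%n", expectedGain(9, 1));
    }
}
```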
One other crucial difference between Kremenek et al.'s techniques and what I plan to implement is the introduction of a priority to the feedback. With FEEDBACK-RANK, the bug alerts are ordered based on the probability of being a true or false positive, but it is possible that the true positives are ordered from lowest to highest priority. On a large project with 2,800 alerts to process, for example, initial reports consisting only of trivial, "true but low impact" defects are likely to be just as damaging to confidence in the tool. The hope is that an initial learning session can provide information not only on which alerts are true positives, but also on how important those true positives are to the project.
The latest version of Findbugs includes several designations for bug alerts, so it is hoped that no new prioritisation scheme needs to be introduced. Although introducing a new one would probably be easier, I'd prefer to retain the terminology already used and endorsed by the Findbugs developers and community.
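As a very rough sketch of what priority-aware feedback could look like, the snippet below maps a designation to a feedback weight instead of a flat true/false answer. The designation names are from the Findbugs UI as I remember them and may not match the current release exactly; the weights are placeholder guesses of my own, not anything from the paper.

```java
import java.util.Map;

// Hypothetical sketch: fold the user's designation into the feedback signal,
// rather than a flat true/false answer. Designation names and weights are
// placeholders for illustration only.
final class DesignationWeightSketch {

    // 0.0 = confirmed false positive, 1.0 = confirmed high-priority true positive.
    private static final Map<String, Double> WEIGHTS = Map.of(
            "Not a bug",       0.0,
            "Bad analysis",    0.0,
            "Mostly harmless", 0.3,
            "Should fix",      0.7,
            "Must fix",        1.0,
            "I will fix",      1.0);

    // Unknown or unclassified designations get a neutral weight.
    static double feedbackWeight(String designation) {
        return WEIGHTS.getOrDefault(designation, 0.5);
    }

    public static void main(String[] args) {
        // e.g. a "Mostly harmless" designation nudges the cluster towards
        // "true but low impact" rather than fully confirming it.
        System.out.println(feedbackWeight("Mostly harmless")); // 0.3
        System.out.println(feedbackWeight("Not a bug"));       // 0.0
    }
}
```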
What I've not described so far is how Kremenek et al. defined 'correlation' between bug alerts. In their model, they measured correlation in terms of code locality: they found on two projects, Linux and a commercial project anonymised as System X, that true positives and false positives clustered by location. The granularities of location range from function to file to leaf directory. I intend to continue using this correlation model, which in terms of Java I will probably define as corresponding to method, class and package. I want to keep this model for comparison purposes: to compare their initial findings on C code with a Java project, and to use that as a baseline before introducing priority into the learning.
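For clarity, here is roughly how I picture the three locality granularities mapping onto Java. The Alert type and its fields are illustrative placeholders, not the real Findbugs API.

```java
// Hypothetical sketch of the locality "cluster keys" I have in mind for Java:
// method, class and package, mirroring the function/file/leaf-directory
// granularities in the paper.
final class LocalitySketch {

    record Alert(String packageName, String className, String methodName) {}

    static String methodKey(Alert a) {
        return a.packageName() + "." + a.className() + "#" + a.methodName();
    }

    static String classKey(Alert a) {
        return a.packageName() + "." + a.className();
    }

    static String packageKey(Alert a) {
        return a.packageName();
    }

    public static void main(String[] args) {
        Alert alert = new Alert("com.example.store", "OrderService", "submit");
        System.out.println(methodKey(alert));  // com.example.store.OrderService#submit
        System.out.println(classKey(alert));   // com.example.store.OrderService
        System.out.println(packageKey(alert)); // com.example.store
    }
}
```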
I best get to work...