Wednesday 30 December 2009

Some notes about project scope

Before I start MoSCoW prioritisation of the tasks for the project, I thought I would define and justify some of my choices of scope. One of the things that has been said to me is that I shouldn't be too ambitious, so I'm really going to try to cut back on what I want to do. With decent MoSCoW'ing I should still be able to think big, as long as I'm disciplined about the order in which I implement parts (i.e. no S's before M's).

Things out of scope
  • Dealing with multiple analyses - going from one run of Findbugs on one day to another run the next day, you would not want bugs you've already annotated reappearing and waiting for inspection. I think there are ways to merge analyses and cut these out, but this will not be in scope for me. When it comes to initiating a training session, I will either ignore already-inspected bugs or assume all bugs need inspecting.
  • Integration with bug-tracking systems - this, again, is something which I think the new version of Findbugs is intended to do. It is well out of scope for me, however; I'm not about to start writing or using integration adapters for different bug-tracking systems. Hopefully I'll be evaluating the system at Sword Ciboodle, who have their own in-house bug-tracking system, so that kind of functionality wouldn't be evaluated anyway.
  • Handling the different Findbugs interfaces - once the model for calculating FEEDBACK-RANK and Information Gain is implemented, I will only be adapting a single interface for it (e.g. the Swing UI or the Eclipse plugin - definitely not IntelliJ or NetBeans). Once the functionality is available, providing it for another interface isn't likely to gain much in terms of a final year project. No doubt there will be a blog post in the next few weeks about which UI I will aim for.
Things in scope
  • Calculating the Information Gain for each bug report at the end of an analysis
  • Providing a learning mode that will show bug alerts in order of information gain
  • Calculating Feedback Rank for each bug report
  • Providing a triage mode that will order alerts based on Feedback Rank, combined with the priority scheme I wish to add on top of that
That is about as basic a requirements list as I can get. It is just a precursor to a full MoSCoW requirements phase. I just wanted to get clear in my head what I won't be implementing.

User feedback in the Findbugs model

Findbugs already provides a way for users to annotate bug reports, which may be useful for this project. The possible designations are:
  • not a bug
  • mostly harmless
  • should fix
  • must fix
  • I will fix
  • bad analysis
  • unclassified
  • obsolete/unused; will not fix
This is stored within the XML generated for a saved project.

If Findbugs does not support ignoring already-viewed bug alerts, the absence of a designation could be used as a flag that an alert still needs to be shown to the user. I think managing the model across different sessions is probably out of scope for my project; I think there are Findbugs commands available to merge analyses anyway.

These designations could be ordered to create the priority I wish to model, although in the latest version there is also the concept of 'Rank' for a bug alert. It's quite annoying that there are these different ways to judge a bug; hopefully they'll be merged somehow.
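To make that concrete for myself, here's a minimal sketch in Java of how the existing designations might be folded into a numeric priority. The designation strings come from the list above; the particular ordering, and the class and method names, are my own assumptions and would need validating.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DesignationPriority {

        // The ordering is my own assumption: "I will fix" treated as most urgent,
        // "not a bug" as least. The strings mirror the designation list above.
        private static final Map<String, Integer> PRIORITY = new LinkedHashMap<String, Integer>();
        static {
            PRIORITY.put("I will fix", 6);
            PRIORITY.put("must fix", 5);
            PRIORITY.put("should fix", 4);
            PRIORITY.put("mostly harmless", 3);
            PRIORITY.put("obsolete/unused; will not fix", 2);
            PRIORITY.put("bad analysis", 1);
            PRIORITY.put("not a bug", 0);
        }

        /** An unclassified (or missing) designation signals "not yet inspected". */
        public static boolean needsInspection(String designation) {
            return designation == null || "unclassified".equals(designation);
        }

        /** Numeric priority for a designation, or -1 if it carries no priority. */
        public static int priorityOf(String designation) {
            Integer p = PRIORITY.get(designation);
            return p == null ? -1 : p.intValue();
        }
    }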

Just wanted to leave a note for myself that this is probably how the XML of an analysis will be handled for the learning mode.

Tuesday 29 December 2009

Final decision on what I'm going to implement.

It's almost 2010 and I'm only just deciding what to implement. Sigh.

Basically it's a supervised learning model which I intend to make available at least through the Findbugs Swing UI, but hopefully through the Eclipse UI as well. The project is going to be aimed at serving a subset of the workflow models or scenarios possible with Findbugs. Examples of the possible scenarios include: using rapid feedback from the tool while developing; using the tool as part of a continuous build process; or using the tool to validate a release candidate as part of the QA release process. The scenario I want to focus on is the introduction of Findbugs to a large, existing project.

This scenario presents specific problems that I want to address with the project. These include:
  • initial analyses with Findbugs could overwhelm users (consider the use of Findbugs on IntelliJ IDEA which resulted in ~2,800 bug alerts)
  • initial inspections, if ordered badly, could result in a high initial false positive rate, giving the impression that the tool only generates false positives and leading to it being discarded
  • with a huge number of alerts, users could be suppressing what they see as similar false positives time and time again, which is slow, laborious, costly and frustrating

The implementation will be based on techniques described by Kremenek et al. in "Correlation Exploitation in Error Ranking". The two important concepts described in that paper are FEEDBACK-RANK and Information Gain. FEEDBACK-RANK is a probabilistic technique which exploits several types of correlation between bug alerts to create an ordering from most-likely-true-positive to least-likely-true-positive. This ordering can then be used to present bug alerts to users such that true positives are shown first. The effect is that when users begin to see high rates of false positives, they can stop inspecting the reports, because they have most likely seen all the true errors. Information Gain is a factor used as a tie-breaker when bug alerts have an equal probability of being true positives: the alert chosen for inspection is the one which will provide the most useful information about the remaining alerts.
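To keep myself honest about what the control flow looks like, here is a toy sketch in Java of the inspection loop I have in mind. It is emphatically not Kremenek et al.'s actual probabilistic model: the Alert interface and both of its methods are placeholders of my own, standing in for the real calculations over correlated groups of alerts.

    import java.util.List;

    public class InspectionLoop {

        interface Alert {
            /** Current estimate of P(true positive), updated as feedback arrives. */
            double truePositiveProbability();
            /** How much inspecting this alert is expected to tell us about the rest. */
            double informationGain();
        }

        /** Pick the next alert to show: highest P(TP), ties broken by information gain. */
        static Alert next(List<? extends Alert> uninspected) {
            Alert best = null;
            for (Alert a : uninspected) {
                if (best == null) {
                    best = a;
                } else if (a.truePositiveProbability() > best.truePositiveProbability()) {
                    best = a;
                } else if (a.truePositiveProbability() == best.truePositiveProbability()
                        && a.informationGain() > best.informationGain()) {
                    best = a;
                }
            }
            return best;
        }
    }

Flipping the comparison so that information gain is checked first would give something closer to the dedicated learning session described below.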

In the scenario of introducing Findbugs to a large project the initial ordering of results is crucial. Taking a factor from the tool itself is always going to be a limited approach, e.g. the priority factor of a bug alert is chosen by developers of the tool, with no knowledge of the project the tool is being used on. While the FEEDBACK-RANK algorithm will quickly begin to differentiate between true and false positives, the initial ordering is random, and this could affect the order in which the true positives are eventually seen.

What I hope to achieve with the initial, supervised learning session is to use Information Gain alone to choose which alerts are presented to users for feedback. This makes a larger up-front investment in order to gain a better context for the analysed project. It is hoped that this initial learning will result in a net reduction in the number of alerts which have to be inspected before all true positives are seen.

One other crucial difference between Kremenek et al.'s techniques and what I plan to implement is introducing a priority into the feedback. With FEEDBACK-RANK, the bug alerts are ordered based on the probability of being a true or false positive, but it is possible that the true positives end up ordered from lowest to highest priority. On a large project with, for example, 2,800 alerts to process, initial reports of only trivial, or "true but low impact", defects are likely to be just as damaging to confidence in the tool. The hope is that an initial learning session can provide information not only on which alerts are true positives, but also on how important those true positives are to the project.
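As a placeholder for how that combination might look, the ordering key could be as simple as the product of the two factors. This is my own guess at a formula, not something taken from the paper; both inputs are assumed to be normalised to [0, 1], with the expected priority derived from something like the Findbugs user designations.

    public class TriageScore {
        public static double score(double probabilityTruePositive, double expectedPriority) {
            // A plain product means an alert has to be both likely-true and
            // likely-important to float to the top of the inspection order.
            return probabilityTruePositive * expectedPriority;
        }
    }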

The latest version of Findbugs includes several designations for bug alerts, so it is hoped that no new prioritisation scheme needs to be introduced. Although introducing a new one would probably be easier, I'd prefer to retain the terminology already used and endorsed by the Findbugs developers and community.

What I've not described so far is how Kremenek et al. defined 'correlation' between bug alerts. In their model, they measured correlation in terms of code locality. They found on two projects, Linux and a commercial project referred to as System X, that true positives and false positives clustered by location. The granularities of location range from function to file to leaf-directory. I intend to continue using this correlation model, which in Java terms I will probably define as method, class and package. I wish to keep this model for comparison reasons: I want to compare their initial findings on C code with a Java project, and use that as a baseline before introducing priority into the learning.
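As a note to myself on what those groupings might look like in code, here is a sketch of bucketing alerts by the three granularities. The Alert fields are placeholders for whatever location information the real bug reports end up carrying; the point is just that each alert belongs to a method-level, class-level and package-level group, mirroring the function/file/leaf-directory granularities used by Kremenek et al.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class LocalityGroups {

        enum Granularity { METHOD, CLASS, PACKAGE }

        static class Alert {
            String packageName;   // e.g. "com.example.io"
            String className;     // e.g. "com.example.io.FileCopier"
            String methodName;    // e.g. "com.example.io.FileCopier.copy(File,File)"
        }

        /** Bucket alerts by the chosen granularity of code locality. */
        static Map<String, List<Alert>> groupBy(List<Alert> alerts, Granularity g) {
            Map<String, List<Alert>> groups = new HashMap<String, List<Alert>>();
            for (Alert a : alerts) {
                String key;
                switch (g) {
                    case METHOD: key = a.methodName; break;
                    case CLASS:  key = a.className;  break;
                    default:     key = a.packageName; break;
                }
                List<Alert> group = groups.get(key);
                if (group == null) {
                    group = new ArrayList<Alert>();
                    groups.put(key, group);
                }
                group.add(a);
            }
            return groups;
        }
    }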

I'd best get to work...

Wednesday 23 December 2009

Using the Cloud to Paint a Picture of the Project

In a previous post I talked about how a system for reducing false positives should get to know the project. The reading listed here includes several descriptions of methods used to reduce false positives. Some techniques, such as Heckman's[1], use developer feedback to directly influence the models used to rank alerts. Others, such as Kim and Ernst's[2], mine the change history of a project to increase warning precision.

With regards to Findbugs, there may be a way to combine both developer feedback and change history. In the latest version, Findbugs provides a social-networking backdrop to static analysis with its introduction of "The Cloud". Through the Swing UI, users of Findbugs can categorise alerts and have them persisted and shared through the Cloud. This functionality could provide the basis for a supervised learning session which could be used to rank alerts, as described by Heckman. Perhaps most appealingly, it could be used retrospectively. Findbugs' analysis also provides mechanisms[3] to recognise when a bug was first introduced and when it last appeared. This information, which includes both the bug type and location, could be used as a change history to apply some of the techniques described by Kim and Ernst.

There are two ways I would like to see this information being used. First and most obviously, the UI could be adapted to show the highest ranked alerts first. Users could be told that alerts are ranked, and once they are consistently seeing false positives, they can stop analysing. A "skimming the cream" approach.

The second way is to use the model from a Findbugs analysis to recognise when a newly introduced bug should be fixed, and alert the developer as soon as possible. This was a point raised by Nat Ayewah* in an email correspondence: the most cost-effective way to fix a bug is to raise it with the person who wrote the code, as soon after they wrote it as possible. In a continuous integration setting, armed with the knowledge that a bug alert will be acted upon, the system could "break the build". The developer who checked in the breaking code would then be alerted to the error shortly after writing it, keeping the context fresh in their mind. This is a double-edged sword though: while raising alerts so visibly could be the most cost-effective way to fix them, the negative effect of false positives would be amplified hugely. As far as I can tell, The Cloud already allows insertion into bug-tracking systems; breaking the build would be a step up from that, and assuming enough confidence in the analysis, I doubt it would be much hassle to adapt the system to break the build, for instance with a plugin for ant/Maven/Hudson.
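To be clear about how small that adaptation could be, here is a sketch of the core check such a build step might run. It is not a real ant/Maven/Hudson plugin; RankedAlert and its methods are placeholders for whatever the analysis model ends up exposing, and the threshold is an assumption that would need tuning.

    import java.util.List;

    public class BuildBreaker {

        interface RankedAlert {
            boolean isNewInThisBuild();
            double probabilityTruePositive();
        }

        /**
         * Returns a non-zero exit code (i.e. "break the build") if any newly
         * introduced alert is ranked as a likely true positive. The threshold
         * would have to be set very conservatively, given how much a false
         * positive is amplified when it stops the whole build.
         */
        public static int check(List<RankedAlert> alerts, double threshold) {
            for (RankedAlert alert : alerts) {
                if (alert.isNewInThisBuild() && alert.probabilityTruePositive() >= threshold) {
                    return 1;
                }
            }
            return 0;
        }

        public static void main(String[] args) {
            // In a real integration the alert list would come from the analysis
            // results of the current build; here there is nothing to check.
            System.exit(check(java.util.Collections.<RankedAlert>emptyList(), 0.9));
        }
    }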

I think I'm going to try to prototype some of these analysis models. The Cloud idea which is intended for Findbugs 2.0 looks like a deep well for this type of integration.

* A PhD student at Maryland who has coauthored papers on Findbugs with Bill Pugh.
[1] Adaptively Ranking Alerts Generated From Automated Static Analysis - Sarah Smith Heckman, 2007
[2] Which Warnings Should I Fix First? - Sunghun Kim, Michael D. Ernst
[3] Tracking Defect Warnings Across Versions - Jaime Spacco, David Hovemeyer, William Pugh

Monday 21 December 2009

Reading Review

I thought it would be useful to review some of the reading that I have done for the project so far. This will allow me to express the related works which I know about, it will also help me when it comes time to write the report (i.e. "Who was it that said that?").


Finding Bugs Is Easy - David Hovemeyer, William Pugh, 2004
The original paper introducing the Findbugs static analysis tool. This paper describes the concept of bug patterns, describes some of the first detectors written for Findbugs, discusses why errors occur, and compares Findbugs to PMD, another static analysis tool.

Finding More Null Pointer Bugs, But Not Too Many - David Hovemeyer, William Pugh, 2007
Report on the challenge laid down to the Findbugs project to find more null pointer bugs.

Using Findbugs On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou, 2007
Short report on the experience of using Findbugs at Google. Quantifies the issues reviewed, reported and fixed. Also discusses the idea of "True but low impact defects" - quite relevant to this project.

Evaluating Static Analysis Defect Warnings On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou, 2007
Longer version of the previous report. Includes more detail on the use of Findbugs at Google. Also describes the results of applying Findbugs to Sun's JDK and Glassfish. Introduces the concept of 'triaging' the bug reports.

A Brief Introduction Into Machine Learning - Gunnar Ratsch
Brief overview of several machine learning techniques including supervised and unsupervised learning and classification.

Adaptively Ranking Alerts Generated From Automated Static Analysis - Sarah Smith Heckman, 2007 (Published in Crossroads Volume 14, Issue 1)
Very interesting article which uses ranking techniques to classify issues reported by static analysis tools on a spectrum from false positive to true positive. Ranking is based on several factors, including code locality, developer suppression of alerts and alert type. Use of these techniques provided a significant benefit when used to order the alerts shown to developers: 81% of true positive alerts were discovered within the first 20% of inspections, compared to 22% with a random ordering of alerts.

Because the techniques address the issue of false positives, and learn from alert suppression, this report is very relevant to the project.

A Report on a Survey and Study of Static Analysis Users - Nathaniel Ayewah, William Pugh, 2008
Reports on the findings of a controlled experiment to determine how developers use Findbugs and another static analysis tool, Fortify SCA. This study found that classification of bugs took less than two minutes on average.

Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach - Joseph R. Ruthruff, John Penix, J. David Morgenthaler, Sebastian Elbaum and Gregg Rothermel, 2008
A paper which uses statistical techniques to reduce the false positives generated. This case study also uses Findbugs at Google. I think I initially read this paper on an early morning bus journey, and I'm not sure how much of it stuck. Definitely requires a re-read.

Reordering Violations by Analyzing Software History - Sunghun Kim, Michael D. Ernst, 2007
Like the paper described earlier, "Adaptively Ranking Alerts Generated From Automated Static Analysis", this paper discusses techniques used to rank error reports. In this case, however, the techniques do not adapt to a developer's use of a static analysis tool; instead, they analyse the lifetime of error reports within a codebase and calculate a ranking model based on how long each error persisted. That is, short-lived error reports were interpreted to indicate that an error was important and was thus fixed quickly; likewise, a long-lived error was interpreted to be unimportant, hence it remained.

Tracking Defect Warnings Across Versions - Jaime Spacco, David Hovemeyer, William Pugh
Paper describing the techniques used by Findbugs to correlate warnings across different versions of software (by different versions it means from build to build, milestone to milestone, etc., rather than across major release versions). The instance hash mechanism was found to be an effective way to match persistent error reports.

Which Warnings Should I Fix First? - Sunghun Kim, Michael D. Ernst
Similar to "Reordering Violations by Analyzing Software History", this paper uses software history to classify error reports as false positives and goes on to describe an algorithm to increase the warning precision of Findbugs (and other tools). Using bug-fix information, the techniques described correlate changes which are part of a bug fix with static analysis alerts, based on matching location. If a bug fix alters a line for which an alert is generated, the alert is considered to be a true positive. This information can then be used to re-prioritise the warnings.



This list is not exhaustive - exhausting, maybe. Now that I've written them down, it looks like I've read a lot less than I feel as though I have.

Sunday 20 December 2009

Time to make a final choice...

I feel I've neglected and ignored this project over the past few months. The pressures of four classes, all with significant programming coursework, have left me with little spare time. But now that the semester is done and I've taken a couple of days for myself to relax (by which I of course mean upgrading from Kubuntu Hardy to Karmic), I can begin to make moves with the project.

The most vital thing now, and this is echoed by the feedback I received from my second marker, Richard Connor, is that I've got to make a decision on what to begin implementing.

Although I've not made a final decision, I have made a decision on one of my earlier brainstorms: I definitely won't be implementing it. Sorry, Triage mode, you're out.

I didn't realise at the time, but the current version of Findbugs (1.3.9-rc) already has a triage mode of sorts. Running against a "social networking backdrop", users can designate a priority which is submitted to the "Cloud". Although I feel there could be improvements to the UI, it's not going to provide a decent enough return on investment.

I can't remember if I had read either of the two papers[1][2] when I suggested a triaging mode. My suggestion was based on a hypothetical improvement which could shave X seconds off each bug classification. Using the data collected in [1], which found bug classification took less than two minutes on average, this approach may still have been a possibility. However, the analysis in [2] suggested that the average triage took 8 minutes. The former collects its data from experiments, the latter from the use of Findbugs in production (at Google).

For me, the analysis using Findbugs on production software, with industry engineers rather than students/academic staff, is more significant. And when it takes around 8 minutes to triage a bug, it is reasonable to assume that the gains from improving the UI (maybe a few mouse clicks or key presses) are not going to put much of a dent in that time. Even if I were being wildly generous to the concept of an optimized UI and said it could shave a minute off that time, which is very unlikely, I'd still say the ROI wasn't high enough.

No, the goal, as alluded to in another post, is that the system should be able to get a better idea of which bugs to report. In the cases where it can make a pretty good guess that a bug report is irrelevant, it could shave 8 minutes off an 8-minute triage time.

That's the kind of ROI I'm aiming for.


[1] A Report on a Survey and Study of Static Analysis Users - Nathaniel Ayewah, William Pugh
[2] Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach - Joseph R. Ruthruff, John Penix, J. David Morgenthaler, Sebastian Elbaum, and Gregg Rothermel