Wednesday 30 December 2009

Some notes about project scope

Before I start MoSCoW prioritisation of the tasks for the project, I thought I would define and justify some of the choices of scope for the project. One of the things that has been said to me is that I shouldn't try to be too ambitious, so I'm really going to try to cut back on what I want to do. With decent MoSCoW'ing I should still be able to think big, as long as I'm disciplined about the order in which I implement parts (i.e. no S's before M's).

Things out of scope
  • Dealing with multiple analyses - going from one run of Findbugs one day to another run the next day, you would not want to see bugs you've already annotated reappearing and looking for inspection. I think there are ways to merge analyses and cut these out, but this will not be in scope for me. When it comes to initiating a training session, I will either ignore already-inspected bugs or assume all bugs need inspecting.
  • Integration with bug-tracking systems - this is again something which I think the new version of Findbugs is intended to do. It is well out of scope for me, however; I'm not about to start writing or using integration adapters for different bug-tracking systems. Hopefully I'll be evaluating the system at Sword Ciboodle, who have their own in-house bug tracking system, so that kind of functionality wouldn't be evaluated anyway.
  • Handling the different Findbugs interfaces - once the model for calculating FEEDBACK-RANK and Information Gain is implemented, I will only be adapting a single interface for it (e.g. the Swing UI or the Eclipse plugin - definitely not IntelliJ or NetBeans). Once the functionality is available, providing it for another interface isn't likely to gain much in terms of a final year project. No doubt there will be a blog post in the next few weeks about which UI I will aim for.
Things in scope
  • Calculating the Information Gain for each bug report at the end of an analysis
  • Providing a learning mode that will show bug alerts in order of information gain
  • Calculating Feedback Rank for each bug report
  • Providing a triage mode that will order alerts based on Feedback Rank, combined with the priority scheme I wish to add on top of that
That is about as basic a requirements list as I can get. It is just a precursor to a full MoSCoW requirements phase. I just wanted to get clear in my head what I won't be implementing.
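To make the split concrete for myself, here is a minimal sketch of the two modes as I currently imagine them. None of these types exist in Findbugs; the names are all placeholders of my own.

import java.util.List;

// Hypothetical interface only - placeholders, not Findbugs code.
interface AlertRanker {

    // Learning mode: order uninspected alerts so that inspecting them yields
    // the most information about the alerts that remain.
    List<BugAlert> orderByInformationGain(List<BugAlert> uninspected);

    // Triage mode: order alerts from most-likely-true-positive to
    // least-likely-true-positive, folding in the priority scheme on top.
    List<BugAlert> orderByFeedbackRank(List<BugAlert> uninspected);
}

// Stand-in for a Findbugs bug report plus any feedback recorded so far.
class BugAlert {
    String type;          // bug pattern type
    String location;      // e.g. fully qualified method name
    Boolean truePositive; // null until the user has inspected it
}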

User feedback in the Findbugs model

Findbugs already provides ways for a user to annotate a bug report that may be useful. The possible categories are:
  • not a bug
  • mostly harmless
  • should fix
  • must fix
  • I will fix
  • bad analysis
  • unclassified
  • obsolete/unused; will not fix
This is stored within the XML generated for a saved project.

If Findbugs does not support ignoring already-viewed bug alerts, the absence of an annotation could be used as a flag that an alert should be shown to the user. I think managing the model across different sessions is probably out of scope for my project. I think there are Findbugs commands available to merge analyses anyway.

These designations could be ordered to create the priority I wish to model, although in the latest version there is also the concept of 'Rank' for a bug alert. It's quite annoying that there are these different ways to judge a bug; hopefully they'll be merged somehow.
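As a rough sketch of what I mean by ordering the designations (the enum and its ordering are entirely my own invention, not Findbugs code):

// One possible ordering of the existing designations into a single priority
// scale; higher ordinal = higher priority for the learning model.
enum Designation {
    NOT_A_BUG,
    BAD_ANALYSIS,
    OBSOLETE_WILL_NOT_FIX,
    UNCLASSIFIED,
    MOSTLY_HARMLESS,
    SHOULD_FIX,
    MUST_FIX,
    I_WILL_FIX;

    int priority() {
        return ordinal();
    }
}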

Just wanted to leave a note for myself that this is probably how the XML of an analysis will be handled for the learning mode.

Tuesday 29 December 2009

Final decision of what I'm going to implement.

It's almost 2010 and I'm only just deciding what to implement. Sigh.

Basically it's a supervised learning model which I intend to make available at least through the Findbugs Swing UI, but hopefully through the Eclipse UI as well. The project is going to be aimed at serving a subset of the workflow models or scenarios possible with Findbugs. Examples of the possible scenarios include: using rapid feedback from the tool while developing; using the tool as part of a continuous build process; or using the tool to validate a release candidate as part of the QA release process. The scenario I want to focus on is the introduction of Findbugs to a large, existing project.

This scenario presents specific problems that I want to address with the project. These include:
  • initial analyses with Findbugs could overwhelm users (consider the use of Findbugs on IntelliJ IDEA which resulted in ~2,800 bug alerts)
  • initial inspections, if ordered badly, could produce a high initial false positive rate, giving the impression that the tool only generates false positives and leading to it being discarded
  • with a huge number of alerts, users could be suppressing what they see as similar false positives time and time again, which is slow, laborious, costly and frustrating

The implementation will be based on techniques described by Kremenek et al. in "Correlation Exploitation in Error Ranking". The two important concepts described in that paper are FEEDBACK-RANK and Information Gain. FEEDBACK-RANK is a probabilistic technique which exploits several types of correlation between bug alerts to create an order from most-likely-true-positive to least-likely-true-positive. This order can then be used to present bug alerts to users such that true positives are shown first. The effect of this is that when users begin to see high rates of false positives, they can stop inspecting the reports because they have most likely seen all the true errors. Information Gain is a factor used as a tie-breaker when bug alerts have an equal probability of being true positives. The alert chosen for inspection is the one which will provide the most useful information about the remaining alerts.
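To get the idea straight in my own head, here is a very rough sketch of the tie-breaking intuition. This is not the paper's algorithm, just my simplification: prefer the alert whose outcome is most uncertain and which is correlated with the largest number of still-uninspected alerts.

// Binary entropy as a measure of how much one inspection stands to teach us.
class InformationGainTieBreak {

    // H(p) = -p*log2(p) - (1-p)*log2(1-p)
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // outcome already certain, nothing left to learn
        }
        return -(p * log2(p)) - ((1 - p) * log2(1 - p));
    }

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    // Score = uncertainty of this alert * number of uninspected alerts correlated with it.
    static double score(double pTruePositive, int correlatedUninspected) {
        return entropy(pTruePositive) * correlatedUninspected;
    }

    public static void main(String[] args) {
        // A 50/50 alert correlated with 10 others beats a 90%-certain alert
        // correlated with 12 others.
        System.out.println(score(0.5, 10)); // 10.0
        System.out.println(score(0.9, 12)); // ~5.6
    }
}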

In the scenario of introducing Findbugs to a large project the initial ordering of results is crucial. Taking a factor from the tool itself is always going to be a limited approach, e.g. the priority factor of a bug alert is chosen by developers of the tool, with no knowledge of the project the tool is being used on. While the FEEDBACK-RANK algorithm will quickly begin to differentiate between true and false positives, the initial ordering is random, and this could affect the order in which the true positives are eventually seen.

What I hope to achieve with the initial, supervised learning session is that Information Gain alone is used to choose which alerts to present to users, in order to get their feedback. This makes a larger up-front investment in order to gain a better context for the analysed project. It is hoped that this initial learning can result in a net reduction in the number of alerts which have to be inspected before all true positives are seen.

One other crucial difference between Kremenek et al.'s techniques and what I plan to implement is introducing a priority to the feedback. With FEEDBACK-RANK, the bug alerts are ordered based on the probability of being a true or false positive, but it is possible that the ordering of true positives runs from lowest to highest priority. On a large project, for example, with 2,800 alerts to process, initial reports of only trivial, or "True but low impact defects", are likely to be just as damaging to confidence in the tool. The hope is that an initial learning session can not only provide information on true positives, but also on how important those true positives are to the project.
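My current thinking, sketched below, is simply to sort by probability of being a true positive and break ties with the priority learned from the designations. This is my own invention, not anything from the paper, and the weighting may well change.

import java.util.Comparator;

class RankedAlert {
    final double probabilityTruePositive; // from the feedback-rank model
    final int learnedPriority;            // derived from designations; higher = more important

    RankedAlert(double probabilityTruePositive, int learnedPriority) {
        this.probabilityTruePositive = probabilityTruePositive;
        this.learnedPriority = learnedPriority;
    }
}

// Most-likely-true-positive first; ties broken by learned priority.
class RankedAlertComparator implements Comparator<RankedAlert> {
    public int compare(RankedAlert a, RankedAlert b) {
        int byProbability = Double.compare(b.probabilityTruePositive, a.probabilityTruePositive);
        if (byProbability != 0) {
            return byProbability;
        }
        return b.learnedPriority - a.learnedPriority;
    }
}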

The latest version of Findbugs includes several designations for bug alerts, so it is hoped that no new prioritisation scheme needs to be introduced. Although introducing a new one would probably be easier, I'd prefer to retain the terminology already used and endorsed by the Findbugs developers and community.

What I've not described so far is how Kremenek et al. defined 'correlation' between bug alerts. In their model, they measured correlation in terms of code locality. They found on two projects, Linux and an anonymised commercial project referred to as System X, that true positives and false positives clustered by their location. The granularities of location range from function to file to leaf-directory. I intend to continue using this correlation model, which in Java terms I will probably define as corresponding to method, class and package. I wish to keep this model for comparison reasons: I want to compare their initial findings on C code with a Java project, and use that as a baseline before introducing priority into the learning.
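A quick sketch of how I picture that locality hierarchy in Java terms; the BugLocation type is a stand-in of my own, not a Findbugs class.

class BugLocation {
    final String packageName; // e.g. "com.example.dao"
    final String className;   // e.g. "com.example.dao.UserDao"
    final String methodName;  // e.g. "com.example.dao.UserDao.findUser(String)"

    BugLocation(String packageName, String className, String methodName) {
        this.packageName = packageName;
        this.className = className;
        this.methodName = methodName;
    }

    // Two alerts are correlated at the finest granularity at which their
    // locations match: method, then class, then package.
    String correlationKeyWith(BugLocation other) {
        if (methodName.equals(other.methodName)) return methodName;
        if (className.equals(other.className)) return className;
        if (packageName.equals(other.packageName)) return packageName;
        return null; // no locality correlation
    }
}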

I best get to work...

Wednesday 23 December 2009

Using the Cloud to Paint a Picture of the Project

In a previous post I talked about how a system for reducing false positives should get to know the project. The reading listed here describes several methods used to reduce false positives. Some techniques, such as Heckman's[1], use developer feedback to directly influence the models used to rank alerts. Others, such as Kim and Ernst's[2], mine the change history of a project to increase warning precision.

With regards to Findbugs, there may be a way to combine both developer feedback and change history. In the latest version, Findbugs provides a social-networking backdrop to static analysis with its introduction of "The Cloud". Through the Swing UI, users of Findbugs can categorise alerts and have them persisted and shared through the Cloud. This functionality could provide the basis for a supervised learning session which could be used to rank alerts, as described by Heckman. Perhaps most appealingly, it could be used retrospectively. Findbugs' analysis also provides mechanisms[3] to recognise when a bug was first introduced, and when it last appeared. Accessing this information, which provides both the bug type and location, could be used as a change history to apply some of the techniques described by Kim and Ernst.

There are two ways I would like to see this information being used. First and most obviously, the UI could be adapted to show the highest ranked alerts first. Users could be told that alerts are ranked, and once they are consistently seeing false positives, they can stop analysing. A "skimming the cream" approach.

The second way is to use the model from a Findbugs analysis to realise when a newly introduced bug should be fixed, and alert the developer as soon as possible. This was a point raised by Nathaniel Ayewah* in an email correspondence: that the most cost-effective way to fix a bug is to raise it with the person who wrote the code, as soon after they wrote it as possible. In a continuous integration setting, armed with the knowledge that a bug alert will be acted upon, the system could "break the build". The developer who checked in the breaking code would then be alerted to the error shortly after writing it, keeping the context fresh in their mind. This is a double-edged sword though. While raising alerts so visibly could be the most cost-effective way to fix them, the negative effect of false positives would be amplified hugely. As far as I can tell, The Cloud already allows insertion into bug tracking systems; breaking the build would be a step up from that. Assuming enough confidence in the analysis, I doubt it would be much hassle to adapt the system to break the build, for instance with a plugin for ant/Maven/Hudson.
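For what it's worth, the build-breaking step itself could be as dumb as the sketch below: a class that exits non-zero when the analysis has flagged confident alerts. The plain-text count file is an assumption purely for the sake of the sketch; how those confident alerts are actually produced is the interesting part.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BreakBuildOnConfidentAlerts {

    public static void main(String[] args) throws IOException {
        // Hypothetical: the analysis step writes the number of high-confidence
        // alerts to a small text file.
        String countFile = args.length > 0 ? args[0] : "confident-alert-count.txt";
        BufferedReader reader = new BufferedReader(new FileReader(countFile));
        int confidentAlerts;
        try {
            confidentAlerts = Integer.parseInt(reader.readLine().trim());
        } finally {
            reader.close();
        }
        if (confidentAlerts > 0) {
            System.err.println(confidentAlerts + " high-confidence alerts - breaking the build");
            System.exit(1); // a non-zero exit code fails an ant/Maven/Hudson step
        }
    }
}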

I think I'm going to try to prototype some of these analysis models. The Cloud idea which is intended for Findbugs 2.0 looks like a deep well for this type of integration.

* A PhD student at Maryland who has coauthored papers on Findbugs with Bill Pugh.
[1] Adaptively Ranking Alerts Generated From Automated Static Analysis - Sarah Smith Heckman, 2007
[2] Which Warnings Should I Fix First? - Sunghun Kim, Michael D. Ernst
[3] Tracking Defect Warnings Across Versions - Jaime Spacco, David Hovemeyer, William Pugh

Monday 21 December 2009

Reading Review

I thought it would be useful to review some of the reading that I have done for the project so far. This will allow me to express the related works which I know about; it will also help me when it comes time to write the report (i.e. "Who was it that said that?").


Finding Bugs Is Easy - David Hovemeyer, William Pugh, 2004
The original paper introducing the Findbugs static analysis tool. This paper describes the concept of bug patterns, describes some of the first detectors written for Findbugs, discusses why errors occur, and compares Findbugs to PMD, another static analysis tool.

Finding More Null Pointer Bugs, But Not Too Many - David Hovemeyer, William Pugh, 2007
Report on the challenge laid down to the Findbugs project to improve its null pointer analysis and find more null pointer bugs.

Using Findbugs On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou, 2007
Short report on the experience of using Findbugs at Google. Quantifies the issues reviewed, reported and fixed. Also discusses the idea of "True but low impact defects" - quite relevant to this project.

Evaluating Static Analysis Defect Warnings On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou, 2007
Longer version of the previous report. Includes more detail on the use of Findbugs at Google. Also describes the results of applying Findbugs to Sun's JDK and Glassfish. Introduces the concept of 'triaging' the bug reports.

A Brief Introduction Into Machine Learning - Gunnar Rätsch
Brief overview of several machine learning techniques including supervised and unsupervised learning and classification.

Adaptively Ranking Alerts Generated From Automated Static Analysis - Sarah Smith Heckman, 2007 (Published in Crossroads Volume 14, Issue 1)
Very interesting article which uses ranking techniques to classify issues reported by static analysis tools on a spectrum from false positive to true positive. Ranking is based on several factors, including code locality, developer suppression of alerts and alert type. Use of these techniques provided a significant benefit when used to order the alerts shown to developers: 81% of true positive alerts were discovered within the first 20% of inspections, compared to 22% with a random ordering of alerts.

Because the techniques address the issue of false positives, and learn from alert suppression, this report is very relevant to the project.

A Report on a Survey and Study of Static Analysis Users - Nathaniel Ayewah, William Pugh, 2008
Reports on the findings of a controlled experiment used to study how Findbugs and another static analysis tool, Fortify SCA, are used. This study found classification of bugs took less than two minutes on average.

Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach - Joseph R. Ruthruff, John Penix, J. David Morgenthaler, Sebastian Elbaum and Gregg Rothermel, 2008
A paper which uses statistical techniques to reduce the false positives generated. This case study also uses Findbugs at Google. I think I initially read this paper on an early morning bus journey, and I'm not sure how much of it stuck. Definitely requires a re-read.

Reordering Violations by Analyzing Software History - Sunghun Kim, Michael D. Ernst, 2007
Like the paper described earlier, "Adaptively Ranking Alerts Generated From Automated Static Analysis", this paper discusses techniques used to rank error reports. In this case, however, the techniques do not adapt to a developer's use of a static analysis tool; instead, they analyse the lifetime of error reports within a codebase and calculate a ranking model based on how long each error persisted. That is, short-lived error reports were interpreted to indicate that an error was important, and thus fixed quickly; likewise, a long-lived error was interpreted to be unimportant, hence it would remain.

Tracking Defect Warnings Across Versions - Jaime Spacco, David Hovemeyer, William Pugh
Paper describing the techniques used by Findbugs to correlate warnings across different versions of software (by different versions it means from build to build, milestone to milestone etc., rather than across major release versions). The instance hash mechanism was found to be an effective way to match persistent error reports.

Which Warnings Should I Fix First? - Sunghun Kim, Michael D. Ernst
Similar to "Reordering Violations by Analyzing Software History", this paper uses software history to classify error reports as false positives and goes on to describe an algorithm to increase the warning precision of Findbugs (and other tools). Using bug-fix information, the techniques described correlate changes which are part of a bug fix to static analysis alerts based on matching location. If a bug fix alters a line which an alert is generated for, the alert is considered to be a true positive. This information can then be used to re-prioritize the warnings.
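As a note to myself, the line-matching idea boils down to something like the sketch below (stand-in types of my own, not code from the paper): if a line changed by a bug-fix commit is a line an alert points at, treat that alert as a true positive.

import java.util.Set;

class FixChangeMatcher {

    // Lines touched by bug-fix commits, keyed as "path/to/File.java:123".
    private final Set<String> fixedLines;

    FixChangeMatcher(Set<String> fixedLines) {
        this.fixedLines = fixedLines;
    }

    boolean looksLikeTruePositive(String alertFile, int alertLine) {
        return fixedLines.contains(alertFile + ":" + alertLine);
    }
}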



This list is not exhaustive - exhausting maybe. Now that I've written them down, it looks like I've read a lot less than I feel I have.

Sunday 20 December 2009

Time to make a final choice...

I feel I've neglected and ignored this project over the past few months. The pressures of four classes, all with significant programming coursework, have left me with little spare time. But now that the semester is done, and I've taken a couple of days for myself to relax (by which I of course mean upgrading from Kubuntu Hardy to Karmic), I can begin to make moves on the project.

The most vital thing now, and this is echoed by the feedback I received from my second marker, Richard Connor, is that I've got to make a decision on what to begin implementing.

Although I've not made a final decision, I have made a decision on one of my earlier brainstorms, in that I definitely won't be implementing it. Sorry Triage mode, you're out.

I didn't realise at the time, but the current version of Findbugs (1.3.9-rc) already has a triage mode of sorts. Running against a "social networking backdrop", users can designate a priority which is submitted to the "Cloud". Although I feel there could be improvements to the UI, it's not going to provide a decent enough return on investment.

I can't remember if I had read either of the two papers[1][2] when I suggested a triaging mode. My suggestion was based on a hypothetical improvement which could shave off X amount of seconds from each bug classification. Using the data collected in [1], which found bug classification took less than two minutes on average, this approach may still have been a possibility. However, the analysis in [2] suggested that the average triage took 8 minutes. The former collects its data from experiments, the latter from the use of Findbugs in production (at Google).

For me, the analysis using Findbugs on production software, with industry engineers rather than students/academic staff, is more significant. And when it takes around 8 minutes to triage a bug, it's reasonable to assume that the gains from improving the UI (maybe a few mouse clicks or key presses) are not going to put much of a dent in that time. Even if I were being wildly generous to the concept of an optimised UI, and said it could shave off one minute from that time, which is very unlikely, I'd still say the ROI wasn't high enough.

No, the goal, as alluded to in another post, is that the system should be able to get a better idea of what bugs to report. In the cases where it can make a pretty good guess that a bug report is irrelevant, it could shave 8 minutes off an 8 minute triage time.

That's the kind of ROI I'm aiming for.


[1] A Report on a Survey and Study of Static Analysis Users - Nathaniel Ayewah, William Pugh, 2008
[2] Predicting Accurate and Actionable Static Analysis Warnings: An Experimental Approach - Joseph R. Ruthruff, John Penix, J. David Morgenthaler, Sebastian Elbaum, and Gregg Rothermel

Sunday 22 November 2009

A Possible Case Study Source - IntelliJ IDEA

One of the important parts of evaluating any system which interacts with software is the ability to deal with production-size projects. When thinking about candidates for a codebase to use in the evaluation, I considered a few: Eclipse, Glassfish, JBoss, Guice and others. However, with the recent news that JetBrains' popular IDE, IntelliJ IDEA, is going open source, I considered it as a candidate too.

There are a couple of reasons using IDEA could be a good choice:
  • it's pretty big, approximately 1,450,000 source lines of code (measured using SLOCCount)
  • it's completely unfamiliar to me. This may seem like a strange thing to call a benefit, but I'll probably explain this in a later post. The system will be evaluated against a known code base as well.
  • the timing is particularly fortunate.

This last point is the most crucial and needs explaining. The view (albeit a simplistic one) is that open source software is not under the same pressures as commercial software, and a commercial project may not have justified adopting Findbugs as part of its process. Commercial software is also likely to have fewer "casual eyeballs" on the source, whereas a project like Eclipse has thousands of contributors, increasing the likelihood that Findbugs has been used on it at some point. More specifically for IDEA, it's unlikely that Findbugs is used there - I'm sure the authors would be proud to claim usage on their "Users and Supporters" page. What this all means is that I can now freely inspect a code base which has not already benefited from the use of Findbugs. A code base for a real, used, respected product.


And what a tasty code base it appears to be, so ripe with bug reports!

The results of running Findbugs (1.3.9-rc1), with default settings, are displayed in the charts below.





Weighing in at 2,872 bug reports, hopefully the IntelliJ IDEA source can act as the large, unfamiliar end of the codebase spectrum. That leaves me requiring a small, familiar codebase to evaluate with; the 3rd Year Group Project Gizmoball system would probably fit the bill there.

Not that there'll be any bugs in our code, of course...

Friday 6 November 2009

Is it possible to classify all false positives? Or "One Project's False Positive is Another Project's Showstopper"

As part of my specification and plan for this project, I included the objective:
Identifying, classifying and quantifying a representative (rather than comprehensive) set of false positives reported by Findbugs
I'm now starting to see that this is a fruitless exercise.

There could be several techniques for achieving this: grokking an open-source project, running Findbugs over the codebase and finding the false positives; trying to create code that will expose false positives; or doing an analysis of the Findbugs bug-tracker (which has a 'False Positive' category). So in theory it certainly would be possible to identify, classify and quantify false positives, but the problem is that the results would either be project-specific, or bugs in the Findbugs code itself.

So how does this relate to the project? Well, an important factor in the success of a system that enhances Findbugs listed in the specification is:
... additions and modifications to Findbugs should not require changes to the system in order for it to function. While being forced to track major changes to Findbugs may be excusable, the system should not require modification every time a new bug pattern detector is added to Findbugs.
Identifying false positives at a certain point in time is surely chasing the wrong prey. Instead, it would be more fruitful to consider that any bug report can be a false positive, depending on the context. For example, consider the following code:

  public String uhOh(Object whoopsie) {
      ...
      if (whoopsie == null) {
          return whoopsie.toString(); // guaranteed NullPointerException
      }
      ...
  }


This is an example of a guaranteed dereference of a null pointer. Findbugs will kindly report this for us. However, throwing an NPE may be part of the contract of the method, albeit implemented in a strange way. This would make the report a false positive in this context. Given this scenario, should the Findbugs null dereference detector be considered as generating a false positive? Definitely not - this would be a valid report in 99% of cases. But it illustrates that what constitutes a false positive depends on the context. This is an example for a single detector, but any bug report could be a false positive. If the code can never be reached, all bets are off: by definition, any bug reported in it will be a false positive.

So what does this mean for the project? Well, it means that trying to reduce false positives solely based on the bug detectors available in Findbugs is going to be a broken design. And when you consider that Findbugs has a plugin architecture for its detectors, allowing projects to add their own, this becomes even clearer.

If false positives are highly dependent on context, then building that context is what the system should be concerned with. In a previous post I talked about using the filter file to build a context, or conducting supervised learning as part of a triage session. At the time of writing, following these avenues is clearly more desirable than trying to identify, classify and quantify the false positives reported by Findbugs.

To put it more simply, one project's false positive is another project's showstopper. The job of this system should be to get to know the project.

Monday 2 November 2009

Brainstorming techniques for Enhancing Findbugs - Triage Mode

In the last entry for brainstorming techniques I discussed the possibility of enhancing Findbugs by implementing more sophisticated filtering abilities through the GUI. However, it seems the manual I used as evidence that such a feature did not exist was out of date. The latest available version of Findbugs (1.3.9-rc1) already has the kind of filtering I discussed, in the Swing GUI (though not the Eclipse plugin). So that isn't really an option.

However, a paper written by Pugh and others[1] discussed the use of Findbugs on large scale projects, including at Google. One interesting point mentioned here is the way Findbugs was used - two developers were assigned to run Findbugs after certain builds, and categorise the bugs reported. The result of this "bug triage" was that only priority bugs reported would be raised with the relevant developer.

The triage model of workflow is an interesting one. One of the important evaluation criteria for a system which enhances Findbugs is that it has a very low cost of use. Machine learning techniques that use supervised learning may be off-putting for developers if they don't fit in with their workflow. However, if a developer is assigned to perform triage, their workflow is already tied to Findbugs. If this is a common approach to deploying Findbugs within a development team, a possible way to enhance the Findbugs system would be to provide an interface which is designed and streamlined for bug triage.

Currently the Findbugs GUI reports errors with the ability to configure the way they are reported. They can be listed by Category, Rank, Pattern and Kind (shown in the image below). The developer then has to manually look through the reported bugs to investigate each one. This results in a couple of mouse clicks per bug report. There are keyboard shortcuts available, but like all keyboard shortcuts, they have a barrier to use. Although the UI issue may seem trivial, if Findbugs reports hundreds of errors the inefficiency will begin to accumulate.


A triage mode could be created to address this. The basic premise is that a UI is made which is specialised for processing a large number of bug reports in one session. The system would roll through each bug report, show all necessary information on one screen, and ask the user for an action on the bug. The action could involve reclassifying the bug (is it 'mostly harmless' or 'must fix'?), or, crucially for this project, creating a filter rule based on the current bug. The use of filters is probably the best strategy available for reducing false positives.
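The skeleton of such a session would be nothing more sophisticated than the loop sketched below. All of these types are placeholders of my own; none of them exist in Findbugs.

import java.util.List;

class TriageSession {

    interface TriageUi {
        // Show everything needed for a decision on one screen and block
        // until the user picks an action.
        TriageAction present(BugAlert alert);
    }

    enum TriageAction { RECLASSIFY, CREATE_FILTER_RULE, SKIP }

    interface FilterWriter {
        void addRuleFor(BugAlert alert); // suppress this and similar reports
    }

    static class BugAlert { /* type, location, current designation, ... */ }

    void run(List<BugAlert> alerts, TriageUi ui, FilterWriter filters) {
        for (BugAlert alert : alerts) {
            TriageAction action = ui.present(alert);
            if (action == TriageAction.CREATE_FILTER_RULE) {
                filters.addRuleFor(alert);
            }
        }
    }
}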

Use of filters is probably the best strategy because what makes a specific bug report a false positive is not black and white. Reducing the number of false positives relies on having a consistent definition of what a false positive is. I'm moving towards the idea that a false positive is any bug reported that the user doesn't care about. Clearly this depends more on the user than on Findbugs itself. For instance, reporting a call to System.exit() makes sense for a class library, but in a standalone GUI application the call may be perfectly acceptable. In one context the bug report is a false positive, in the other it is not. Therefore what constitutes a false positive is highly dependent on the codebase the tool is run over. Filters are created on a per-codebase basis, depending on the perspective of the developers. A triage mode which can define filters is a natural progression for the system to build up an idea of the context of the codebase, based on the user's categorisation of bugs.

I would be keen for a system to be able to build this context solely from the filter definition file, which would have several advantages. However, it could also be possible to build the context from supervised learning, conducted transparently in each triage session. A triage mode will most likely be one of the prototypes developed and evaluated.



[1] Evaluating Static Analysis Defect Warnings On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou

Sunday 25 October 2009

Brainstorming techniques for Enhancing Findbugs - implement a planned feature.

While reading through the Findbugs manual, a section jumped out at me: Chapter 8, Filter Files. Since I had previously mentioned I would have to find out the level of support Findbugs had for excluding error reports, I chose to skip the first 7 chapters to have a quick glance.

Even before the introduction began, another comment caught my eye:
Planned Features
Filters are currently only supported by the Command Line interface. Eventually, filter support will be added to the GUI.
Being naive, I thought support for filter files would already be included in the GUI version of Findbugs. This would represent a serious candidate for a choice of how to enhance Findbugs - taking on the work of implementing filtering options through the GUI.

The filtering is done through a matching scheme, which offers a flexible and powerful way to define which bugs are excluded. Some of the possible ways to match bugs are:
  1. Match all bug reports for a class.
  2. Match certain tests from a class by specifying their abbreviations.
  3. Match certain tests from all classes by specifying their abbreviations.
  4. Match certain tests from all classes by specifying their category.
  5. Match bug types from specified methods of a class by their abbreviations.
  6. Match a particular bug pattern in a particular method.
  7. Match a particular bug pattern with a given priority in a particular method.
  8. Match minor bugs introduced by AspectJ compiler (you are probably not interested in these unless you are an AspectJ developer).
  9. Match bugs in specific parts of the code base
  10. Match bugs on fields or methods with specific signatures
With the exclusion matchers available, it would be possible to exclude specific bugs at as fine a granularity as a single method in a class. It would also allow wider matching, like 'ignore all of this bug type in this list of packages'. Using information attached to each bug report, such as location or bug type, the GUI would be able to offer a choice of potential matchers, greatly reducing the cost of reducing false positives.
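To illustrate, the GUI could build an exclusion entry from a selected bug report with something as simple as the sketch below. The element names follow the filter-file chapter of the manual; the bug pattern used is just an example, and the builder class itself is hypothetical.

class FilterEntryBuilder {

    // Builds a Match element that excludes one bug pattern in one method.
    static String excludeInMethod(String className, String methodName, String bugPattern) {
        return "  <Match>\n"
             + "    <Class name=\"" + className + "\"/>\n"
             + "    <Method name=\"" + methodName + "\"/>\n"
             + "    <Bug pattern=\"" + bugPattern + "\"/>\n"
             + "  </Match>\n";
    }

    public static void main(String[] args) {
        System.out.print("<FindBugsFilter>\n"
            + excludeInMethod("com.example.Foo", "createFoo", "NP_NULL_ON_SOME_PATH")
            + "</FindBugsFilter>\n");
    }
}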

Providing a GUI for defining these exclusion filters would have the following advantages:
  • The bugs to stop reporting are controlled by the developers rather than an algorithm. Developers are a cynical bunch, and it would probably take a lot to convince them to place their faith in machine learning techniques. NB: it may be interesting to survey Findbugs users about this.
  • Filters are already an established model for removing bug reports. Though currently accessible only through the command line, they do have documentation, and will no doubt have users who have become familiar with the technique. This would reduce the barrier to entry for using the developed system.
  • Including filtering in the GUI is a planned feature. This would greatly improve the chances of the developed system being somehow involved with the Findbugs project.
  • Completely compatible with the crowdsourcing technique described in an earlier post, but would still provide a benefit for single or small development teams.
  • The GUI would be able to recognise cases where, rather than specify an exclusion filter, an annotation could be used. For instance, when it's known a parameter is not null, rather than ignore the error, place Findbugs' @Nonnull annotation on it.
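
On that last point, the annotation route would look something like the snippet below. I believe the annotation lives in the edu.umd.cs.findbugs.annotations package that ships with Findbugs, but that needs double-checking.

import edu.umd.cs.findbugs.annotations.NonNull;

class Greeter {
    // Instead of filtering out a null-dereference warning, declare the contract.
    String greet(@NonNull String name) {
        return "Hello, " + name; // the annotation tells the analysis name is never null
    }
}
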
Some disadvantages (including personal rather than technical) are:
  • Implementing a GUI for Findbugs is unlikely to require machine learning techniques, removing the necessity for me to work with them (this may be a hidden benefit :-P)
  • False positive removal would still be a manual process, which is a bit of a burden. Though there is always going to be the trade-off of developer burden vs. trustability I think.
That's not to say I'm committing to this avenue for enhancing Findbugs, but it's probably my favoured technique so far.

Saturday 24 October 2009

Brainstorming techniques for Enhancing Findbugs - Crowdsourcing

One possible technique for reducing the number of false positives (FPs) reported by Findbugs is crowdsourcing. This technique is completely different from the supervised learning[todo:link] described earlier. Crowdsourcing involves combining the efforts of a large group of users (supervised learning does not necessarily have to be single-user; that was just the scenario previously described). The actions required from users tend to be small, quick, unobtrusive and low in number. This is a key difference from a supervised learning approach, where a single user's input would be collected over time, requiring a greater investment.

Using a crowdsourcing technique to reduce FPs would probably involve distributing an ignore-list for Findbugs. If memory serves, this is already a feature of Findbugs, though I don't know the file format, or its flexibility. When a user comes across a false positive, they would simply mark it as such. The rest of the developers on the project would then share the ignore-list and the FP would no longer be reported by their instance of Findbugs. The mechanism for achieving this which springs to mind is simple - commit the ignore-list to version control. Is it likely a team will be using Findbugs if they are not employing one of the first requirements for team development?

However, if a team is not using version control, there are alternatives. The Findbugs system could be modified to submit the ignore action to a centralised server when performed, and retrieve the ignore-list on each run. In this case the centralised server does not need to be anything complicated - a web server would suffice, and the persistence layer would only need to be the ignore-list file itself.
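A sketch of the "web server is enough" idea, fetching the shared ignore-list over HTTP on each run. The URL and the one-entry-per-line file format are both placeholders of my own.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class SharedIgnoreList {

    static List<String> fetch(String url) throws IOException {
        List<String> ignored = new ArrayList<String>();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().length() > 0) {
                    ignored.add(line.trim()); // e.g. one bug-instance identifier per line
                }
            }
        } finally {
            in.close();
        }
        return ignored;
    }
}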

There are some benefits to this system:
  • Simplicity - the concept is incredibly simple. The mapping between user action and the FPs ignored is 1:1. The system does no learning to use one action to ignore many FPs. Not only does this simplify development, it would lower the cognitive load on developers.
  • Safety - the system will not neglect any true errors. Users may set the system to ignore a true error, but as far as I know, that's a risk with Findbugs as it is anyway. But at least in that case the user is consciously making the decision to ignore the error, rather than the system doing it 'behind their back'.

There are also some obvious and insurmountable disadvantages:
  • Inherently unscaledownable (not a word, I know, but I like it) - a single or low-user development team will not benefit from a crowdsourcing system. The only real advantage a system could offer a single developer would be to provide a better interface for ignoring FPs. There may not be any problem with the current interface for doing this; I've yet to find out.
  • Higher maintenance - each FP must be explicitly ignored. If crowdsourcing is the only mechanism used and there's no intelligence in the system, reducing the number of FPs will take more developer effort.

Despite these disadvantages, crowdsourcing is still something to consider. However, whether or not crowdsourcing is the chosen technique, it's likely that whatever system is implemented will have to be team-aware.

Friday 23 October 2009

Brainstorming possible techniques for Enhancing Findbugs - Supervised Learning

One possible technique for reducing the false positives (FPs) would be to have developers train a machine learning system to recognise them. Under this scheme, the machine would recognise a user action which marked a reported bug as an FP. At this point, there are several pieces of information which would be candidates for use as training data:

- location of the FP
- the type(s) involved
- the type of error reported by Findbugs


At this point I'm only considering a scenario where a single developer works on the codebase. This list is also not comprehensive, just what I can think of on a Friday morning journey into uni.

Location of the FP
The location of the FP includes the complete range of scope, e.g. from project to package to class to method (though there are other locations to be considered, such as constructors or static initialisation blocks). The system would look for FPs marked in a similar location. Consider a hypothetical FP which reports a potential null dereference of a parameter in a package-private constructor. The calls to the constructor already check for nullness of the parameter in question, and construction is limited to the package. The system could learn that this constructor, within this package, is checked for nullness. If another package-private constructor is added, it could potentially use the learned information to remove errors relating to null dereferencing.


class Foo {
    ...
    Foo(Object potentiallyNull) {
        this.bar = potentiallyNull.getBar(); // Error: potential null dereference
    }
}

class FooCreator {
    ...
    void createFoo() {
        ...
        if (potentiallyNull == null) throw new NullPointerException();
        this.foo = new Foo(potentiallyNull); // Null dereference inside Foo constructor is not possible*
    }
}


* ignoring thread safety for this example.

This example is not the greatest, as there is the possibility of overlooking true errors (who says a new call to the constructor is doing the null checks previous callers were?) but it does demonstrate how location can be taken into account to learn about the codebase.

It may be that Findbugs can't always trace back to the calling code to check, as doing so may be too expensive. But if this type of learning could trigger an expensive check only in cases where it was likely to remove an FP, it would ensure true errors were not neglected. The original trigger would be the developer performing an action which said "It's safe to trigger this check in this situation". The problem with this approach is that the code to run such an expensive check would depend on the bug pattern detector, and must be explicitly written for each case - which is perhaps anathema to machine learning.

The (Java) Types Involved
Re-using the example from above, but reversed (a method call has an error reported that it could be passing in a potential null value to a constructor), the system could learn about the type involved. The code below shows the scenario:


class Foo {
    ...
    Foo(Object potentiallyNull) {
        if (potentiallyNull != null)
            this.bar = potentiallyNull.getBar(); // Null dereference is not possible
    }
}

class FooCreator {
    ...
    void createFoo() {
        ...
        this.foo = new Foo(potentiallyNull); // Error: passing potentially null value to method
    }
}

In this case, the error is an FP - the type Foo checks its parameters for nullness. This again is similar to the previous point: it might be too expensive for Findbugs to follow method calls to check whether nullness is checked for. But if the system can learn that the type Foo always checks for nullness, it can stop reporting these errors. Or, when the Foo constructor is called, the system can learn to trigger a specific bug detector that will follow the control flow, knowing that checking a call to Foo is not overly expensive.


Type of Error Reported by Findbugs
The last possibility for supervised learning is slightly simpler. An example of how a machine could learn how to handle certain errors concerns one class of errors reported by Findbugs: 'bad style' errors. A developer may have a justifiable reason for ignoring these errors in certain situations. Although a blanket ignore-list for error types is available in Findbugs (I think; note to self: double check), the developer may not want to ignore them everywhere. The system could be used to recognise where a certain error won't be accepted by the developer and begin to remove those reports.

Learning to reduce FPs by only one of these factors is likely to be quite dangerous. In order to learn effectively, without removing true errors, it's likely the location, the code and the type of error would all need to be combined.
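In practice that combination probably means each developer action becomes a labelled example along these lines (all placeholder names of my own, not Findbugs types):

class TrainingInstance {
    final String bugType;        // the type of error reported, e.g. a null-dereference pattern
    final String packageName;    // location, coarse
    final String className;      // location, finer
    final String javaType;       // the (Java) type involved, e.g. "Foo"
    final boolean falsePositive; // the label supplied by the developer's action

    TrainingInstance(String bugType, String packageName, String className,
                     String javaType, boolean falsePositive) {
        this.bugType = bugType;
        this.packageName = packageName;
        this.className = className;
        this.javaType = javaType;
        this.falsePositive = falsePositive;
    }
}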


There will almost certainly be more information which is both available and useful at the point where a system is learning, and this will hopefully become apparent soon. Also, the hypothetical examples here are not necessarily representative of the errors Findbugs actually reports, as I just made them up. Further usage and inspection of Findbugs will hopefully shed more light on this.

It is my uninformed opinion that supervised learning of the kind described here is likely to be fraught with peril, but maybe, unlike the Hoff, I'm just hiding in the darkness, afraid to come into the light.

Wednesday 21 October 2009

LaTeX vs. Open Office 3

One decision that could have a major impact on my productivity during the final year project is the choice of authoring software for the report.

For the last major report I wrote I used LaTeX, with Kile to edit and quick 'n' dirty bash scripts to build. The result looked quite nice (if I do say so myself), and while I'm glad I used LaTeX, it was quite a painful learning curve at times.

For the final year project I'm at a stage where I'm considering whether to use LaTeX again, or go with Open Office 3. Running Kubuntu 8.04 as my primary environment rules out Microsoft Word. I thought listing the pros and cons, as well as MoSCoW'ing the features, would be a fruitful endeavour.

LaTeX
Pros
  • since the entire document, including structure and markup, is written directly into the LaTeX source, the files are quite version control friendly
  • the output is very nice
  • the writing environment can be very lean: Kile requires fewer resources to run than Open Office (though this is an assumption based on no facts at all). Not that I ever did, but I could edit the document through a command terminal if I so desired
  • organising references, with BibTeX, is very easy
  • because there's a "build phase" with the document, changes to diagrams and such like are picked up on the next build, without having to refresh through the document somehow
  • comments could be left in the source, which was useful for leaving TODO notes which could be easily grep'd and counted/located
  • some features which would be incredibly tedious to do manually, such as creating a table of contents, list of figures, alphabetised glossary and references, and an index, are quite straightforward
Cons
  • there is a build phase while the LaTeX source is built and converted to PDF. Upon nearing completion of the report mentioned earlier, when it was over 15,000 words and 50 pages, the build phase took about 20 seconds. Doesn't seem like much, but when you're feeling your way around new commands it's a definite pain. I guess this could be split up into smaller sections but I'm not sure how well that works
  • it's not WYSIWYG, meaning I had to trigger a build to see how something looked
  • not very portable - Open Office is installed on most of the machines I would use, whereas I needed certain LaTeX packages which were not part of the standard distribution. This would mean I couldn't build a PDF on university lab machines, for example.
MoSCoW Requirements
Some of the pros and cons listed are not really things I'm too concerned about. Prioritising my requirements will help me better decide which tool to choose for this project.
Must
  • allow automatic generation of the following: table of contents; bibliography.
  • allow automatic section, page and figure referencing
  • have an easy way to manage and cite references
  • allow exporting to PDF
  • be able to handle large documents - I think the report is going to end up around 30,000 words or so
Should
  • be able to 'link' to files for things like graphics or data sources, so that changes to them will be reflected in the report
  • allow leaving comments or TODOs in the document, which are ignored when the document is printed or a PDF is generated
  • allow splitting up the documents to be able to work on smaller parts at a time, to improve performance
Could
  • allow automatic generation of a list of figures
  • allow generating a PDF from the command line, so I don't need to run Open Office if all I've done is modify a graphic (would depend on 'linking')
Won't need (but Would be nice)
  • a diff view which would highlight the difference between versions
  • a plugin which offered SVN integration (one click commit, decent editor for commit message)
From what I know of Open Office, I think I could tick off most of the requirements as marked, and with the added benefit of portability (in my circumstances), it does have something which it can hold over LaTeX.

With just a little more research I should be able to make my mind up...

Project Overview

A direct excerpt from my project specification (short-titled "Enhancing Findbugs").

Overview

Findbugs is a static analyser which can detect errors in Java programs. Findbugs can find a large number of potential errors; however, they often include false positives. The goal of this project is to enhance Findbugs by reducing the number of false positives it reports, but not at the expense of any true errors.

The aim of this project is to provide a system which will reduce the number of false positives reported by Findbugs. This will include an investigation of a number of potential strategies, for example, machine learning techniques, and implementing at least one of the possible strategies identified.


This is an experimentation-based project with significant software development, and it aims to have a working system when the time comes to submit. The experimentation comes in discovering the best way to reduce the false positives; the one example mentioned, machine learning, is one of many. I'll use a following entry to discuss some brainstormed ideas for approaching the problem.