An important part of evaluating any system that interacts with software is how well it copes with production-size projects. When thinking about candidate codebases for the evaluation, I considered a few: Eclipse, Glassfish, JBoss, Guice and others. However, with the recent news that JetBrains' popular IDE, IntelliJ IDEA, is going open source, it became a candidate too.
There are a few reasons why using IDEA could be a good choice:
* it's pretty big: approximately 1,450,000 source lines of code (measured using SLOCCount)
* it's completely unfamiliar to me. This may seem like a strange thing to call a benefit, but I'll explain why in a later post. The system will be evaluated against a familiar codebase as well.
* the timing is particularly fortunate.
This last point is the most crucial and needs explaining. The view (albeit a simplistic one) of open source software is that it is not under the same pressures as commercial software, and so may never have had reason to adopt Findbugs as part of its process. Commercial software is likely to have fewer "casual eyeballs" on the source, whereas a project like Eclipse has thousands of contributors, increasing the likelihood that Findbugs has been used in its development. More specifically for IDEA, it's unlikely that Findbugs is used there - I'm sure the authors would be proud to claim usage on their "Users and Supporters" page if it were. What all this means is that I can now freely inspect a codebase which has not already benefitted from the use of Findbugs: a codebase for a real, used, respected product.
And what a tasty code base it appears to be, so ripe with bug reports!
The results of running Findbugs (1.3.9-rc1), with default settings, are displayed in the charts below.
Weighing in with 2,872 bug reports, the IntelliJ IDEA source can hopefully act as the larger end of the codebase spectrum. Folding in the 'unfamiliar' end of the spectrum, that leaves me needing a small, familiar codebase to evaluate against. The 3rd Year Group Project Gizmoball system would probably fit the bill there.
Not that there'll be any bugs in our code, of course...
Sunday, 22 November 2009
Friday, 6 November 2009
Is it possible to classify all false positives? Or "One Project's False Positive is Another Project's Showstopper"
As part of my specification and plan for this project, I included the objective:
"Identifying, classifying and quantifying a representative (rather than comprehensive) set of false positives reported by Findbugs"
I'm now starting to see that this is a fruitless exercise.
There could be several techniques for achieving this: grokking an open-source project, running Findbugs over the codebase and finding the false positives; trying to create code that will expose false positives; or analysing the Findbugs bug tracker (which has a 'False Positive' category). So in theory it certainly would be possible to identify, classify and quantify false positives, but the problem is that the results would either be project-specific, or bugs in the Findbugs code itself.
So how does this relate to the project? Well, an important factor listed in the specification for the success of a system that enhances Findbugs is:
"... additions and modifications to Findbugs should not require changes to the system in order for it to function. While being forced to track major changes to Findbugs may be excusable, the system should not require modification every time a new bug pattern detector is added to Findbugs."
Identifying false positives at a certain point in time is surely chasing the wrong prey. Instead, it would be more fruitful to consider that any bug report can be a false positive, depending on the context. For example, consider the following code:
public String uhOh(Object whoopsie) {
    // ...
    if (whoopsie == null) {
        return whoopsie.toString(); // guaranteed NullPointerException
    }
    // ...
}
This is an example of a guaranteed dereference of a null pointer. Findbugs will kindly report this for us. However, throwing an NPE may be part of the contract of the method, albeit implemented in a strange way, which would make the report a false positive in this context. Given this scenario, should the Findbugs null dereference detector be considered as generating a false positive? Definitely not - this would be a valid report in 99% of cases. But it illustrates that what constitutes a false positive depends on the context. This is an example for a single detector, but any bug report could be a false positive: if the code can never be reached, all bets are off, and by definition any bug reported in it will be a false positive.
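For comparison, here is how that same contract would more conventionally be expressed (a hypothetical rewrite of the snippet above); stated as an explicit throw, the intent is obvious and there is no longer a guaranteed null dereference for Findbugs to report:
public String uhOh(Object whoopsie) {
    if (whoopsie == null) {
        // the NPE is part of the method's contract, stated explicitly
        throw new NullPointerException("whoopsie must not be null");
    }
    // ...
}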
So what does this mean for the project? Well, it means that trying to reduce false positives solely based on the bug detectors available in Findbugs is going to be a broken design. And when you consider that Findbugs has a plugin architecture for its detectors, allowing projects to add their own, this becomes even clearer.
If false positives are highly dependent on context, then building that context is what the system should be concerned with. In a previous post I talked about using the filter file to build a context, or conducting supervised learning as part of a triage session. At the time of writing, following these avenues is clearly more desirable than trying to identify, classify and quantify the false positives reported by Findbugs.
To put it more simply: one project's false positive is another project's showstopper. The job of this system should be to get to know the project.
Monday, 2 November 2009
Brainstorming techniques for Enhancing Findbugs - Triage Mode
In the last brainstorming entry I discussed the possibility of enhancing Findbugs by implementing more sophisticated filtering abilities through the GUI. However, it seems the manual I used as evidence that such a feature did not exist was out of date: the latest available version of Findbugs (1.3.9-rc1) already has the kind of filtering I discussed, in the Swing GUI (though not in the Eclipse plugin). So that isn't really an option.
However, a paper by Pugh and others [1] discussed the use of Findbugs on large-scale projects, including at Google. One interesting point is the way Findbugs was used there: two developers were assigned to run Findbugs after certain builds and categorise the bugs reported. The result of this "bug triage" was that only priority bugs would be raised with the relevant developer.
The triage model of workflow is an interesting one. One of the important evaluation criteria for a system which enhances Findbugs is that it has a very low cost of use. Machine learning techniques that use supervised learning may be off-putting for developers if they don't fit in with their workflow. However, if a developer is assigned to perform triage, their workflow is already tied to Findbugs. If this is a common approach to deploying Findbugs within a development team, a possible way to enhance Findbugs would be to provide an interface which is designed and streamlined for bug triage.
Currently the Findbugs GUI reports bugs with the ability to configure how they are grouped: they can be listed by Category, Rank, Pattern and Kind (shown in the image below). The developer then has to manually look through the reported bugs to investigate each one, which costs a couple of mouse clicks per bug report. There are keyboard shortcuts available, but like all keyboard shortcuts, they have a barrier to use. Although the UI issue may seem trivial, if Findbugs reports hundreds of warnings the inefficiency begins to accumulate.
A triage mode could be created to address this. The basic premise is a UI specialised for processing a large number of bug reports in one session. The system would roll through each bug report, show all the necessary information on one screen, and ask the user for an action on the bug. The action could involve reclassifying the bug (is it 'mostly harmless' or 'must fix'?), or crucially for this project, creating a filter rule based on the current bug. The use of filters is probably the best available strategy for reducing false positives.
Use of filters is probably the best strategy because what makes a specific bug report a false positive is not black and white. Reducing the number of false positives relies on having a consistent definition of what a false positive is, and I'm moving towards the idea that a false positive is any reported bug that the user doesn't care about. Clearly this depends more on the user than on Findbugs itself. For instance, reporting a call to System.exit() makes sense for a class library, but in a standalone GUI application such a call may be perfectly acceptable: in one context the bug report is a false positive, in the other it is not. Therefore what constitutes a false positive is highly dependent on the codebase being analysed. Filters are created on a per-codebase basis, depending on the perspective of the developers. A triage mode which can define filters is a natural progression, letting the system build up an idea of the project's context based on the user's categorisation of bugs.
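As a concrete illustration (a hypothetical sketch, with a made-up class name), this is the sort of code that Findbugs' DM_EXIT pattern ("method invokes System.exit") reports. In a reusable library the report is well worth acting on; in an application's own launcher or error handling the call is often intentional:
import java.io.File;

public class ConfigLoader { // hypothetical class, purely illustrative
    public void loadConfiguration(File configFile) {
        if (configFile == null || !configFile.exists()) {
            // Findbugs flags this call with the DM_EXIT pattern. In a class
            // library it would surprise callers; in a standalone application's
            // own error handling it may be exactly what the developer intended.
            System.exit(1);
        }
        // ... parse the file ...
    }
}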
I would be keen for a system to be able to build this context solely from the filter definition file, which would have several advantages. However, it could also be possible to build the context from supervised learning, conducted transparently in each triage session. A triage mode will most likely be one of the prototypes developed and evaluated.
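For example, if the triage decision were that the System.exit() call in the hypothetical ConfigLoader above is intentional for this project, a minimal sketch of the corresponding entry in a Findbugs exclude filter file (using the standard Match/Class/Bug elements) might look like this:
<FindBugsFilter>
    <!-- This project is a standalone application: System.exit() in the loader
         is intentional, so suppress DM_EXIT reports for that class. -->
    <Match>
        <Class name="ConfigLoader" />
        <Bug pattern="DM_EXIT" />
    </Match>
</FindBugsFilter>
A file of such entries is exactly the kind of per-project context the system could read back to learn what the developers consider noise.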
[1] Evaluating Static Analysis Defect Warnings On Production Software - Nathaniel Ayewah, William Pugh, J. David Morgenthaler, John Penix, YuQian Zhou