Saturday 20 February 2010

Improving the Learning Mode - Reflecting on the Evaluation at Sword-Ciboodle

In a previous blog entry I discussed the results of evaluating my additions to Findbugs at Sword-Ciboodle. In assessing the learning mode, a common theme among all the participants was that they were repeatedly shown very similar bugs, even after designating upwards of 100 of them as false positives.

I have done a bit of digging to find out how this happened. The expected pattern is that after designating some alerts, the information gained from designating further similar ones lessens. That is actually the case; the problem lies in the word 'some'. Because the sample code I had used for testing did not exhibit this behaviour, I had performed no analysis into the possibility of many, many similar bugs repeatedly being ranked highest for learning value.

A further investigation into how this occurred at Sword-Ciboodle quickly provided some insight.

First, it's probably best to outline the scale of the analysis. The Findbugs analysis of the codebase turned up over 9,000 issues. Given that Findbugs had not previously been used, and that the codebase is pushing 2 million lines of code, has been continuously developed for around a decade, and is used by many satisfied customers, this is definitely not something the organisation should be embarrassed about. Indeed, as I'm about to discuss, this number includes at least several thousand false positives.

When I viewed the alerts displayed in learning mode, as the developers had, I spotted the cause of the repetition. Sword-Ciboodle employs code generation techniques, creating Java source from an in-house, proprietary grammar. In the area where the bugs were reported, just over 1,000 Java source files had been generated. Because these files were compiled and analysed, Findbugs raised issues on them just as it would with hand-written Java code. The issue repeatedly flagged was "May expose internal representation by returning reference to mutable object", raised where a method returns a primitive array instead of a defensive copy. Because this is a very common pattern in the generator's output, there was an explosion of alerts of this kind.

Even if the learning mode could have recovered from that alone, the convention was to place all the generated code in the same, or very closely located, packages, so the code locality factor added to the explosion of correlated alerts. The result was that, with over 1,400 alerts of the same type in a very tight location, the learning value of each of these many bugs was consistently higher than that of other alerts. This persisted even after hundreds of them had been designated as false positives.
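For readers unfamiliar with that warning, the flagged pattern looks roughly like the following (a hand-written illustration of the shape of the generated code, not the actual generator output), alongside the defensive-copy variant that avoids it:

```java
import java.util.Arrays;

public class ExposeDemo {
    // Illustrative stand-in for a generated class holding a primitive array.
    static class Record {
        private final int[] values = {1, 2, 3};

        // Flagged by Findbugs: "May expose internal representation by returning
        // reference to mutable object" -- callers receive the real array.
        int[] getValues() {
            return values;
        }

        // The conventional fix: return a defensive copy instead.
        int[] getValuesCopy() {
            return values.clone();
        }
    }

    public static void main(String[] args) {
        Record r = new Record();

        r.getValues()[0] = 99;      // mutates r's internal state
        System.out.println(Arrays.toString(r.getValues()));   // [99, 2, 3]

        r.getValuesCopy()[1] = 42;  // only mutates the copy
        System.out.println(Arrays.toString(r.getValues()));   // still [99, 2, 3]
    }
}
```

Whether each such alert is a real problem depends on who can call the generated accessors, which is exactly why so many of them ended up designated as false positives.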

This effect was not something I had expected or considered, and it devastated the usefulness of the learning mode. I believe the learning mode should be robust enough to handle this kind of scenario; after all, one of the central aims of this work on Findbugs is to combat the drop-off effect, and that applies to the learning mode as well.

This introduces the idea of anomaly detection.

Looking at the analysis from a different angle, the extent of the outlying populations becomes clear:

Alert Type outliers (by intuition rather than statistical inference)
May expose internal representation by returning reference to mutable object (973)
May expose internal representation by incorporating reference to mutable object (682)

No other alert type has over 500 alerts.

Package Population outliers
[package containing generated code] (1437)

Only one other package population has over 200 alerts (it has 368).

Given these values, it's not difficult to see how each single alert would dominate the information gain ranking. The questions now are: how can the model be modified to be robust against these anomalies? And once they've been detected, what should be done with them?
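As a very first approximation, populations this extreme could be caught with a crude threshold test before any more principled technique is chosen. The sketch below is my own placeholder, not part of the current model: it flags any population that is more than some multiple of the median population size, using figures close to the package counts above.

```java
import java.util.*;

public class AnomalySketch {
    // Flags populations more than `factor` times the median population size.
    // The threshold rule is a crude placeholder; a real version would need
    // proper anomaly-detection techniques.
    static List<String> outliers(Map<String, Integer> populations, double factor) {
        int[] sizes = populations.values().stream()
                .mapToInt(Integer::intValue).sorted().toArray();
        double median = sizes.length % 2 == 1
                ? sizes[sizes.length / 2]
                : (sizes[sizes.length / 2 - 1] + sizes[sizes.length / 2]) / 2.0;
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : populations.entrySet()) {
            if (e.getValue() > factor * median) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Counts approximating the package populations reported above.
        Map<String, Integer> byPackage = new LinkedHashMap<>();
        byPackage.put("generated code package", 1437);
        byPackage.put("next largest package", 368);
        byPackage.put("typical package A", 120);
        byPackage.put("typical package B", 85);
        byPackage.put("typical package C", 60);

        // Median is 120; only 1437 exceeds 5x that.
        System.out.println(outliers(byPackage, 5.0));  // [generated code package]
    }
}
```

Even this blunt rule separates the generated-code package from everything else in the data above; the open research question is choosing a defensible threshold rather than a hand-picked factor.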

The former will involve anomaly detection techniques which, if I'm to pursue this, will require more research. For the latter, I can make a rough guess at the kind of behaviour that would best help the user. Findbugs' existing support for filter generation would provide a ready solution for these kinds of anomalies: when it is detected that there are so many alerts within a certain package, the learning mode could suggest creating a filter based on that package name. The same could be done for alert types. These are not mutually exclusive; indeed, it may be possible to combine the factors involved in the anomaly into a single filter (i.e. ignore bugs of this type AND in this package). If creating filters proved tricky, it would be just as useful to apply a designation to every alert within the anomalous population - this would work for true positives as well, so that if the user does care about the alerts, that is reflected within the feedback rank mode.
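A suggested filter for the combined case might look like the following fragment of a Findbugs filter file (the package name here is a hypothetical stand-in for the generated-code package; EI_EXPOSE_REP is the pattern code for the "may expose internal representation by returning reference to mutable object" alert). Clauses inside a Match element must all hold, so this suppresses only that alert type within that package:

```xml
<FindBugsFilter>
  <!-- Suppress exposed-representation alerts from the generated-code
       package only. The package name is a hypothetical example. -->
  <Match>
    <Package name="com.example.generated" />
    <Bug pattern="EI_EXPOSE_REP" />
  </Match>
</FindBugsFilter>
```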

This problem may just have come a little too late to be fixed, but the discovery alone was worth the trip to Sword-Ciboodle, in my opinion.
