Saturday, 22 May 2010

Final Year Project: The Result

This blog entry will demonstrate the system developed for the Enhancing FindBugs final year project. It is my hope that the functionality demonstrated here will be considered for inclusion in FindBugs. The extra features were found to consistently present all false positives in less inspections than the default FindBugs ordering. This was evaluated on three open-source codebases (Testability Explorer, Guava, FLL-SW) and a former academic project of mine. The full dissertation, which reports on the evaluation of the system, can be provided on request[1].

The following video gives a brief introduction into the working system.

The Feedback-Rank mode uses alert type, and the location of the source code, to determine how much bugs correlate. This is based on previous work[2][3] which found this an effective technique for prioritising alerts. A further strength of this implementation is that it breaks ties using the default FindBugs ordering. Thus, when no bugs have been inspected, they all tie, but are ordered by the alert priority originally decided by the FindBugs project.

Some points to note about the functionality shown here:
  • a bug being a false or true positive is determined by the FindBugs designation. This means there is no need for extra persistance mechanisms, and old projects with designated bugs can use the ranking modes.
  • it has been used on FindBugs projects with many bug reports (~9000) and the performance was never an issue.
  • bug inspections and designations can be completed efficiently, using only the keyboard.

I would very much like to contribute this work to FindBugs, hopefully the functionality will be compelling enough to interest those running the project.

[1] Note this is for an Honours degree, and contains a lot of material which may be irrelevant, e.g. discussions of the development methodology used.
[2] Kremenek, T., Ashcraft, K., Yang, J., and Engler, D. (2004). Correlation Exploitation in Error Ranking. In SIGSOFT ’04/FSE-12: Proceedings of the 12th ACM SIGSOFT twelfth international symposium on Foundations of software engineering, pages 83–93, New York, NY, USA. ACM.
[3] Heckman, S. S. (2007). Adaptively Ranking Alerts Generated From Automated Static Analysis. Crossroads, 14(1):1–11.

Tuesday, 30 March 2010

March 30th, project submitted...

... ahhhhh. Time for a nice big Cohiba :-D

Sunday, 7 March 2010

Lesson of the day: cloning and inner classes.

Today I spent a while tracking down a bug I had introduced into the Findbugs code. It concerned filtering bugs in the GUI. If you had one analysis opened, then opened a second one, and tried to filter a bug, the bugs displayed would revert to ones from the first analysis.

Long story short, the bug was in this code:

protected BugTreeModel(BugTreeModel other) {
this.root = new BugAspects(other.root);
this.sorter = other.sorter;
this.bugSet = new BugSet(other.bugSet);
this.bugTreeFilterListener = other.bugTreeFilterListener;
this.tree = other.tree;
this.filterActivity = other.filterActivity;

Pretty innocuous clone constructor, you might think. The problem is with the bugTreeFilterListener field. This field is an instance of an inner class. When that field gets copied over, so to does the implicit reference the inner class has to its outer class. This meant when the bugTreeFilterListener was calling methods to the outer class (which holds the collection of bugs displayed), it was calling it on the old, source-of-the-clone, instance. Reconstructing the field solved the problem.

Mental note to self: don't copy references of inner classes across instances of outer classes.

Saturday, 20 February 2010

Improving the Learning Mode - Reflecting on the Evaluation at Sword-Ciboodle

In a previous blog entry I discussed the results of evaluating my additions to Findbugs at Sword-Ciboodle. In assessing the learning mode, a common theme among all the participants was that they were repeatedly shown very similar bugs, even after designating upwards of 100 of them as false positives.

I have done a bit of digging to find how this was caused. The expected pattern would be that after designating some the information from designating more would lessen. Well, that is actually the case. The problem comes in the word 'some'. Because the sample code I have used for testing did not exhibit this behaviour, I have performed no analysis into the possibility of seeing many, many similar bugs repeatedly being ranked highest for learning value.

A further investigation into how this occured at Sword-Ciboodle quickly provided some insight.

First, it's probably best to outline the scale of the analysis. The Findbugs analysis of the codebase turned up over 9,000 issues. Given that: Findbugs had not previously been used; the codebase is pushing 2 million lines of code; has been continuously developed for around a decade; and is used by many satisfied customers, this is definitely not something that the organisation should be embarrassed about. Indeed, as I'm about to discuss, this number includes at least several thousand false positives.

When I viewed the alerts displayed in learning mode, as the developers had, I spotted the cause of the repetition. Sword Ciboodle employs code generation techniques, creating Java source from an in-house, proprietary grammar. In the area where bugs were reported, there was just over 1,000 Java source files created. Because these Java source files were compiled and analysed, Findbugs raised issues just as it would with normal Java code. The issue repeatedly flagged by Findbugs was "May expose internal representation by returning reference to mutable object", where a method returns a primitive array, instead of a defensive copy. Because this is a very common pattern created by the code generator, there was an explosion of alerts of this kind. Even if the learning mode could recover from this, the convention of the generated code was to place it all in the same, or very closely located, packages. This meant that the code locality factor added to the explosion of correlated alerts. The result was, that with over 1,400 alerts with the same type, and in a very tight location, the learning value for each of the many bugs was consistently higher than other alerts. This persisted even after designating hundreds of those alerts as a false positive.

This effect was not something I had expected or considered, and devastated the usefulness of the learning mode. I believe the learning mode should be robust enough to handle this kind of scenario. After all, one of the extremely important concepts of this work on Findbugs is to combat the drop-off effect, and this is required from the learning mode as well.

This introduces the idea of anomaly detection.

Looking at the analysis from a different view, the extent of the outlying populations could be observed:

Alert Type outliers (by intuition rather than statistical inference)
May expose internal representation by returning reference to mutable object (973)
May expose internal representation by incorporating reference to mutable object (682)

No other alert type has over 500 alerts.

Package Population outliers
[package containing generated code] (1437)

Only one other package population has over 200 alerts (it has 368).

Given these values, it's not difficult to see how each single alert would dominate the information gain ranking. The questions now are, how can the model be modified to be robust enough to stand up against these anomalies? And once they've been detected, what should we do with them?

The former will involve anomaly detection techniques, which, if I'm to pursue this, will require more research. The second I can make a rough guess at the kind of behaviour would best help the user. With the support for filter generation already in Findbugs, this would provide a ready solution for these kind of anomalies. For instance, when it's detected that there is so many alerts within a certain package, the learning mode can suggest the creation of a filter based on that package name. The same could be done for alert types. They would not be mutually exclusive, indeed it may be possible to combine the factors involved in the anomaly in the filter (i.e. ignore bugs of this type AND in this package). If creating filters proved to be tricky, it would be just as useful to apply a designation for every alert within the anomalous population - this would work for true positives as well, so that if the user does care about it, that it reflected within the feedback rank mode.

This problem may just have came a little late to be fixed, but the discovery alone was worth the trip to Sword Ciboodle in my opinion.

Friday, 19 February 2010

Results of the Evaluation at Sword-Ciboodle

Having spent two days at the offices of Sword-Ciboodle evaluating my enhancement of Findbugs, I'd say the investment was time well spent. I'm also very grateful to the company donated some resources and developer time for this exercise. Something that was very pleasing for me was that the experience of the developers who took part was higher than I had expected. The range of experience went from roughly 5 to 15 years of professional experience, with large portions of that being experience in Java development.

The actual evaluation differed a great deal from what I discussed in a previous blog entry. It was suggested that expecting developers to slog through an exercise for two hours was a bit too much. Also, since the number of developers was always going to be less than six, and the tool would be used on a single code base, any measurements or findings would not prove statistically significant. Instead I opted to use the time to get an initial feedback on first impressions of the usefulness of the tool. Since drop-off rates of using the tool are likely to happen in a short period of time, I decided to get the developers to use Findbugs (default functionality) for 20 minutes, then use my addition for a further 20 minutes. Following this they were asked to fill in a short questionairre.

The following is some interesting points raised by the questionairre.

Asked to rate the performance of the adaptive ranking modes, all developers answered favourably. Positive comments included that the mode was fast, that there was no instances of waiting when using the model, and another stated that feedback for waiting time was good. Negative comments included that when setting the designation on a parent node for many child nodes, it took some time, but there was no feedback (such as displaying a "waiting" cursor). This is functionality which already existed in Findbugs, which I have not modified, but it is worth noting, and I may take the time to make the small change.

Asked to rate the usability of the adaptive ranking mode, all developers rated it 'Good' (on the scale "Very bad", "Bad", "Neither good nor bad", "Good" or "Very good"). Comments included that there was a very clear set of options, with little to cause confusion. A couple of comments on the choice of bugs used in the learning mode showed that it was a hindrance that the mode repeatedly showed the same type of bug, and it took a while to get through insignificant bugs in the learning mode. Another very interesting comment was that it was difficult to gauge the amount of time to spend in the learning mode before switching to the feedback rank mode. Perhaps it would be worth investigating ways to continuously recalculate the feedback rank order while in learning mode, and working out a way where it could notify the user to switch from learning mode.

Asked to rate the intuitiveness of the adaptive ranking mode, responses were a bit cooler. Two of the four were positive, while the remaining two rated it as "Neither good nor bad". In terms of the interface, comments included that multiple select would be useful on the bug tree, as well as keyboard shortcuts for setting the status (which is available) and refreshing (which isn't). Further comments mentioned how many similar types of bugs were displayed repeatedly. Also mentioned that some of the information was not clear on its purpose, there were no specifics included in the comment, but from further discussion, I believe this was aimed at the exact figure for the learning value being included. The figure by itself is pretty meaningless to users, and it may be better to remove the information. Another interesting comment was that it was not immediately obvious that it's best to use learning mode and feedback rank mode in conjuction, because it's not clear that when you classify a bug in either mode, this is reflected in the other modes, and indeed, saved to the overall analysis.

I don't think there is a single issue which is not fixable. Some are easy (enabling multi-select would require a refactor, but is doable), others are more tricky, but in my opinion a great way to solve this would be to demonstrate usage with a screencast of the application, with additional dialogue to explain what's happening.

Following the focus on the adaptive ranking model in isolation, the questionairre moved on to assessing the enhancements in the overall context of introducing the tool to the developers' workflow. Only one of the four developers had used Findbugs before, but 3 of the 4 were either "In favour", or "Strongly in favour" of introducing Findbugs into their current development environment and workflow. The remaining developer was indifferent to the notion. When asked to give their opinion on the factors preventing introduction of Findbugs, reasons which applied included: "Finds too many bugs initially, is overwhelming"; "Reports too many bugs which we don't care about / are low priority" and "Inspecting alerts is too time consuming". None of the four developers included the "Time / cost to train developers to use Findbugs" or "Findbugs does not find any bugs which are not discovered by other means". Comments also included that Findbugs is most useful when it is applied to give immediate feedback to the developer at the time of writing, and that on a mature codebase the cost/benefit ratio is low, as many bugs are found through testing or customer support calls. These results, although not statistically significant, give the impression that the development team is willing to invest in using Findbugs, if only it could present the results to developers in a way that provided value sooner.

When asked if they felt the adaptive ranking model makes Findbugs more useful, 4/4 developers stated Yes. Comments included that the learning mode would avoid the need to manually create filters, and that their impression of the standard Findbugs mechanisms seemed "tedious", particularly with a large code base.

The final question asked was to describe any improvements to the adaptive ranking model which would increase the viability of introducing Findbugs to their workflow. Comments included that the learning would need to retained over time (this should be possible with existing functionality for merging analyses). Another comment stated it would be useful to be able to create filters for the learning mode (this functionality was removed by me, and could be reinstated). Another comment asked if it could be part of an automated continuous integration build, interestingly a question I asked after the evaluation was, "if an automated build could find the feedback rank of any new bugs, would they be comfortable breaking the build over it?" The feeling was that, if they felt the ranking became accurate enough over time, then yes, they would like to break the build over it.

Another suggestion was that it would be useful to be able to inspect the call heirarchy involved with a bug, further discussions found that it would also be necessary to browse the type heirarchy to classify a bug. Indeed, it's not possible to browse the code directly with the Findbugs GUI, it's only possible to see the source where the bug is registered. I would suggest that the best way to address this is with better integration with IDE's, as this functionality is likely to already be built in. Further gains could also come from this, such as IDE integration with version control, as commit messages are a rich source of information. It's always been at the front of my mind that the adaptive ranking computation should not depend on the Findbugs GUI, for this very reason.

As well as using a questionairre, I also took notes on the questions I was asked throughout the evaluation. Not only did this allow me to note a couple of bugs the developers found with my code, but it gave me a greater insight into the kinds of problems faced by new users. These notes reinforced some of the observations made from the questionairre: what is the meaning of the learning value?; are classifications made from the learning mode carried over into the rest of the analysis?; how long should I use the learning mode? why does the learning mode keep showing the same types of bug again and again (I'll probably return to this point in a later blog)?

Overall I'd say the evaluation was a success, not only in how well Findbugs with the adaptive ranking I've introduced was received, but also the feedback I've obtained from experienced developers.

Sunday, 7 February 2010

The Entropy Strikes Back: Revenge of Information Gain

Despite spending many frustrating hours trying to learn, I still find myself unable to understand the concept of Information Gain in Bayesian systems. The formulas used by Kremenek et al. (which I described here) just won't seem to sink in for me. It's getting a bit late in the day now, so I'm hereby giving up on following the formula to the letter. Instead I'm going to try to define a formula for information gain which has the characteristics I'm looking for.

This unfortunately means that I'm taking the step away from proven formulae in information theory towards intuition. This wouldn't worry me so much if it wasn't for Bayes theory being pretty unintuitive, meaning that although my intuition may provide a model which seems right, but without the backing of information theory and all it's damn log calculations, may be crucially flawed. I'll just have to attempt to find a decent model for what I'm trying to achieve.

I've went through several formulas, and the one I have chosen is:

Given alert A,
Information Gain(A) = (sum[for all populations] (population weight * # unclassified alerts in population) ) / # of bugs
0 <= # unclassified alerts in population <= # of bugs
sum[for all populations](population weight) = 1
If alert A is the only alert in an analysis, Information Gain(A) = 1.

Characteristics of this model include:
  • the result is a number in the range [0,1] - the result has a meaning when viewed alone, if ( result > (1 / # bugs) ), then there is a benefit to classifying it in terms of providing information of the model
  • however, there is no value for result which we can use to say "The benefit of continuing to use the learning mode outweighs the benefit of starting the real bug triaging." This would be dependent on developers. Although we could provide clues and help on this.
  • the model does not try to maximise the amount of true positives, or minimise the amount of false positives. Instead, it attempts to maximise information. This is not a concern, as all bugs will be shown to the user, in order. If we were trying to find alerts which hit a certain threshold for probability of being a true positive, this may have been a concern. But, as it is, finding that an alert is more likely to be a false positive is just as useful as finding that it is a true positive. The model reflects this.

I'll demonstrate the use of this model with some examples. Let's say we have an analysis which yields 20 bug reports from a system. The population weights are as follows: AlertType=0.5, Package=0.0425, Class=0.17, Method=0.2875 (these are the actual values used).
Consider alert A, which is a singleton bug, which has not been classified. There is only one instance of the alert type it has, it is the only alert in its package, class and method. The value for how much information it gives is:
IG(A) = ((AlertType * # unclassified) + (Package * # unclassified) + (Class * # unclassified) + (Method * # unclassified) ) / # of bugs
= ((0.5*1) + (0.0425*1) + (0.17*1) + (0.2875*1)) / 20
= 1/20
= 0.05

Notice how in this case IG(A) = 1 / # of bugs, which as mentioned earlier, shows there is no benefit to using this alert for learning. Let's consider another case, where we have bugs B and C. B shares an alert type with 10 other bugs, C shares a package with 10 other bugs. They share no other populations with other alerts. Here we would expect B to have a higher value of gain than C, since in our model alert type has greater than significance than sharing a package:
IG(B) = ((0.5 * 10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(C) = ( 0 + (0.0425*10) + 0 + 0) / 20 = 0.425 / 20 = 0.02125

So the model matches what we would expect in this case, with the value from C much lower than B. Another characteristic of this model which is crucial is that it takes into account the unclassified bugs, rather than just the number of bugs. There is no information required for bugs which have already been classified, so there's no point in trying to maximise information for them. Let's consider two further bugs, D and E. They have different alert types, but each shares their alert type with 10 other bugs. However, 4 of the 10 bugs with E's alert type have been classified. We would expect to see the classification of D offer greater value.
IG(D) = ( (0.5*10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(E) = ( (0.5*(10-4) + 0 + 0 + 0) / 20 = 3 / 20 = 0.15
The value of using number of unclassified comes when the learning mode is applied to a partially classified codebase, where the learning mode will take into account what the system already knows about the codebase in order to decide which bugs to use to learn.

So, given that I have not been able to fully understand the elements of information theory which would have provided a deeper insight into information gain, I think this is a good enough model, which will do the job.

Wednesday, 3 February 2010

Evaluating the System

Well, it looks like all systems go for visiting Sword Ciboodle in Inchinnan to evaluate the project. Having talked with my manager from my placement days it looks like I should be able to get 4 hours worth of developer time. I'm still at the stage of not knowing exactly how to evaluate the system, and because I want to make the most of the time at Inchinnan, I'm going to go over some possible evaluation strategies.

First I'll discuss what I want to get from the evaluation.

Verification (Am I building the the thing right?)
This is one aspect of evaluation that I think will be transparently tested during any evaluation. Concrete metrics for performance will be available (how long does it take to calculate Information Gain ranks? How long is the update time for Feedback Rank when the user is designating bugs as true or false positives?). Failure rates could be monitored as well (how many times did the application crash?) as well as usability (how many times did the user ask for directions? how did the user rate usability?).

In terms of using the time at Inchinnan, the performance can be measured (I may as well, it will have to be computed) but it may not be statistically valid. What would be better is to run the areas of the system I want to measure several times on my personal machine, to provide the number of results necessary to make it valid. For usability it could be useful to get the feedback of developers in industry, but the same kind of survey could be performed by peers at the university (if I'm careful about the friends I pick, they could at least be useful subjects in determining if the system is idiot proof :-P). So usability is not necessarily what I want to evaluate at Inchinnan.

What I really want to use the time at Inchinnan for is not for verification, but for validation.

Validation (Am I building the right thing?)
This is the really important question for a research based project. In my case the validation strategy is kind of baked right in to the project title. Does what I'm building actually Enhance Findbugs?

The previous studies in this area, from Kremenek et al. (I'll spell it correctly this time) and S. S. Heckman have demonstrated that using bug correlations to rank alerts can reduce the number of false positives a user has to inspect. Comparisons of their ranking were made against the optimal ordering, and a random ordering (S. S. Heckman did include the default Eclipse ordering, which is alphabetical). However, the comparisons were not made with knowledge of the manual ranking capabilities available in Findbugs (Kremenek et al.'s tool was for the C language, S. S. Heckman used a custom Eclipse plugin). This could be a significant factor which I don't want to overlook.

The whole idea of the project is to reduce the cost developers need to spend to see alerts they care about. Any improvement I suggest for Findbugs should be compared in real terms. It's really no use to say my ranking is better than a random ordering of bugs, because users of Findbugs do not generally view the bugs in a random order. Findbugs actually has some pretty strong capabilities for sorting and filtering bugs on a per-project basis. The system I develop should be compared to the real, expected usage pattern of Findbugs.

The actual basis for comparison should be real as well, but this is pretty simple I think: what shows the user more bugs which they care about? Easily measured, and meaningful. This will be a real way to compare my system against the expected usage pattern for a user of Findbugs.

But what is the "real, expected usage pattern"? Well, that's for the developer using it to decide. I don't think there is a more realistic comparison than just allowing a developer to choose how to use Findbugs. So this is my instinct for how to do the evaluation. Get the same developer to perform two sessions of bug triaging. In each they can spend some time getting acquainted with how to use the system. In each they use the same codebase. Since the actual time for deciding if a bug is a false positive or not should be independent of the system, the winning system is the one with the highest proportion of true positives.

Strength in simplicity.

Although a clear winner would provide a lot of information. All is not lost for my project if it doesn't yield the most true positives. Although it would be nice to be better, it would be good to be complementary. This would have to be determined through some other means, perhaps a survey of the developers to ask if they thought usage of the system could provide an enhancement over the vanilla Findbugs.

Would be nice to win though.

So, I've described what I think is a fair and valid way to evaluate the system. Given that developers would be used to perform analysis twice, it's probably better to have 2 hours from 2 developers rather than 1 hour from 4. Filling out surveys could then be done at their leisure. Performance will be evaluated outwith Inchinnan.

Best get crackin' on beating Findbugs then...