Saturday, 20 February 2010

Improving the Learning Mode - Reflecting on the Evaluation at Sword-Ciboodle

In a previous blog entry I discussed the results of evaluating my additions to Findbugs at Sword-Ciboodle. In assessing the learning mode, a common theme among all the participants was that they were repeatedly shown very similar bugs, even after designating upwards of 100 of them as false positives.

I have done a bit of digging to find out how this was caused. The expected pattern is that after designating some alerts, the information to be gained from designating more would lessen. That is actually the case; the problem lies in the word 'some'. Because the sample code I had used for testing did not exhibit this behaviour, I had performed no analysis into the possibility of many, many similar bugs repeatedly being ranked highest for learning value.

A further investigation into how this occurred at Sword-Ciboodle quickly provided some insight.

First, it's probably best to outline the scale of the analysis. The Findbugs analysis of the codebase turned up over 9,000 issues. Given that Findbugs had not previously been used, and that the codebase is pushing 2 million lines of code, has been continuously developed for around a decade, and is used by many satisfied customers, this is definitely not something the organisation should be embarrassed about. Indeed, as I'm about to discuss, this number includes at least several thousand false positives.

When I viewed the alerts displayed in learning mode, as the developers had, I spotted the cause of the repetition. Sword-Ciboodle employs code generation techniques, creating Java source from an in-house, proprietary grammar. In the area where bugs were reported, there were just over 1,000 generated Java source files. Because these source files were compiled and analysed, Findbugs raised issues just as it would with hand-written Java code. The issue repeatedly flagged by Findbugs was "May expose internal representation by returning reference to mutable object", where a method returns a primitive array instead of a defensive copy. Because this is a very common pattern created by the code generator, there was an explosion of alerts of this kind. Even if the learning mode could recover from that, the convention of the generated code was to place it all in the same, or very closely located, packages, which meant the code locality factor added to the explosion of correlated alerts. The result was that, with over 1,400 alerts of the same type, and in a very tight location, the learning value for each of those many bugs was consistently higher than for other alerts. This persisted even after hundreds of them had been designated as false positives.
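
To make the pattern concrete, here is a hypothetical, simplified version of the kind of generated accessor involved (the class and field names are made up; the real generated code is obviously more involved):

// Hypothetical sketch of the kind of generated accessor that triggers
// "May expose internal representation by returning reference to mutable object".
public class GeneratedRecord {
    private final byte[] payload;

    public GeneratedRecord(byte[] payload) {
        // Storing the caller's array directly triggers the related warning,
        // "...by incorporating reference to mutable object".
        this.payload = payload;
    }

    // Flagged: callers receive a reference to the internal array and can mutate it.
    public byte[] getPayload() {
        return payload;
    }

    // What Findbugs would prefer: return a defensive copy.
    public byte[] getPayloadCopy() {
        return payload.clone();
    }
}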

This effect was not something I had expected or considered, and it devastated the usefulness of the learning mode. I believe the learning mode should be robust enough to handle this kind of scenario. After all, one of the most important aims of this work on Findbugs is to combat the drop-off effect, and the learning mode needs to play its part in that as well.

This introduces the idea of anomaly detection.

Looking at the analysis from a different view, the extent of the outlying populations could be observed:

Alert type outliers (by intuition rather than statistical inference):
  • May expose internal representation by returning reference to mutable object (973)
  • May expose internal representation by incorporating reference to mutable object (682)

No other alert type has over 500 alerts.

Package population outliers:
  • [package containing generated code] (1437)

Only one other package population has over 200 alerts (it has 368).

Given these values, it's not difficult to see how each individual alert in those populations would dominate the information gain ranking. The questions now are: how can the model be made robust enough to stand up against these anomalies? And once they've been detected, what should be done with them?

The former will involve anomaly detection techniques, which, if I'm to pursue this, will require more research. For the latter I can make a rough guess at the kind of behaviour that would best help the user; a sketch of the sort of check I have in mind follows below. With the support for filter generation already in Findbugs, there is a ready-made solution for these kinds of anomalies. For instance, when it's detected that there are so many alerts within a certain package, the learning mode can suggest the creation of a filter based on that package name. The same could be done for alert types. The two would not be mutually exclusive; indeed it may be possible to combine the factors involved in the anomaly into one filter (i.e. ignore bugs of this type AND in this package). If creating filters proved to be tricky, it would be just as useful to apply a designation to every alert within the anomalous population - this would work for true positives as well, so that if the user does care about the alerts, that is reflected in the feedback rank mode.
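
To give an idea of the kind of check I have in mind, here is a rough sketch (the Alert type, method name and threshold are all invented for illustration; this is not how Findbugs actually represents alerts):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch: count the alerts in each package and flag any population that
// dwarfs the rest. The Alert class and the multiple-of-average threshold are
// placeholders, not the real Findbugs classes or a tuned heuristic.
public class PopulationAnomalyCheck {

    /** Minimal stand-in for a Findbugs alert. */
    public static class Alert {
        final String packageName;
        final String alertType;

        Alert(String packageName, String alertType) {
            this.packageName = packageName;
            this.alertType = alertType;
        }
    }

    /**
     * Returns the packages whose alert count exceeds the given multiple of the
     * average package population; these are candidates for a suggested filter
     * (or a bulk designation).
     */
    public static Map<String, Integer> anomalousPackages(List<Alert> alerts, double multiple) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Alert alert : alerts) {
            Integer current = counts.get(alert.packageName);
            counts.put(alert.packageName, current == null ? 1 : current + 1);
        }
        double average = (double) alerts.size() / counts.size();
        Map<String, Integer> anomalies = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > average * multiple) {
                anomalies.put(entry.getKey(), entry.getValue());
            }
        }
        return anomalies;
    }
}

The same count could be kept per alert type, and for each anomalous population the learning mode could then offer to generate a filter on that package or type, or apply a bulk designation, as described above.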

This problem may just have come a little too late to be fixed, but the discovery alone was worth the trip to Sword-Ciboodle, in my opinion.

Friday, 19 February 2010

Results of the Evaluation at Sword-Ciboodle

Having spent two days at the offices of Sword-Ciboodle evaluating my enhancement of Findbugs, I'd say the investment was time well spent. I'm also very grateful that the company donated some resources and developer time for this exercise. Something that was very pleasing for me was that the developers who took part had more experience than I had expected: it ranged from roughly 5 to 15 years of professional experience, with large portions of that in Java development.

The actual evaluation differed a great deal from what I discussed in a previous blog entry. It was suggested that expecting developers to slog through an exercise for two hours was a bit too much. Also, since the number of developers was always going to be fewer than six, and the tool would be used on a single codebase, any measurements or findings would not prove statistically significant. Instead I opted to use the time to gather initial feedback on first impressions of the usefulness of the tool. Since drop-off in using the tool is likely to happen within a short period of time, I decided to have the developers use Findbugs (default functionality) for 20 minutes, then use my addition for a further 20 minutes. Following this they were asked to fill in a short questionnaire.

The following are some interesting points raised by the questionnaire.

Asked to rate the performance of the adaptive ranking modes, all developers answered favourably. Positive comments included that the mode was fast, that there were no instances of waiting when using the model, and that the feedback given for waiting time was good. One negative comment was that when setting the designation on a parent node with many child nodes, it took some time but there was no feedback (such as displaying a "waiting" cursor). This is functionality which already existed in Findbugs and which I have not modified, but it is worth noting, and I may take the time to make the small change, along the lines sketched below.
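
For the record, the sort of small change I have in mind would look roughly like this (the component and the Runnable are placeholders, not the real Findbugs code):

import java.awt.Cursor;
import javax.swing.JComponent;
import javax.swing.SwingWorker;

// Sketch only: show a busy cursor while a designation is propagated to many
// child nodes on a background thread, then restore the default cursor.
public final class BusyCursor {
    public static void runWithBusyCursor(final JComponent component, final Runnable slowWork) {
        component.setCursor(Cursor.getPredefinedCursor(Cursor.WAIT_CURSOR));
        new SwingWorker<Void, Void>() {
            @Override protected Void doInBackground() {
                slowWork.run(); // the non-UI part of applying the designation
                return null;
            }
            @Override protected void done() {
                component.setCursor(Cursor.getDefaultCursor());
            }
        }.execute();
    }
}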

Asked to rate the usability of the adaptive ranking mode, all developers rated it 'Good' (on the scale "Very bad", "Bad", "Neither good nor bad", "Good" or "Very good"). Comments included that there was a very clear set of options, with little to cause confusion. A couple of comments on the choice of bugs used in the learning mode noted that it was a hindrance that the mode repeatedly showed the same type of bug, and that it took a while to get through insignificant bugs. Another very interesting comment was that it was difficult to gauge how much time to spend in the learning mode before switching to the feedback rank mode. Perhaps it would be worth investigating ways to continuously recalculate the feedback rank order while in learning mode, and working out how to notify the user when it's time to switch out of learning mode.

Asked to rate the intuitiveness of the adaptive ranking mode, responses were a bit cooler. Two of the four were positive, while the remaining two rated it as "Neither good nor bad". In terms of the interface, comments included that multiple select would be useful on the bug tree, as would keyboard shortcuts for setting the status (which is available) and refreshing (which isn't). Further comments mentioned how many similar types of bugs were displayed repeatedly. It was also mentioned that the purpose of some of the information was not clear; there were no specifics included in the comment, but from further discussion I believe this was aimed at the exact figure for the learning value being displayed. The figure by itself is pretty meaningless to users, and it may be better to remove it. Another interesting comment was that it was not immediately obvious that it's best to use learning mode and feedback rank mode in conjunction, because it's not clear that when you classify a bug in either mode, this is reflected in the other modes and, indeed, saved to the overall analysis.

I don't think there is a single issue raised which is not fixable. Some are easy (enabling multi-select would require a refactor, but is doable), others are more tricky, but in my opinion a great way to address the intuitiveness concerns would be to demonstrate usage with a screencast of the application, with additional dialogue to explain what's happening.

Following the focus on the adaptive ranking model in isolation, the questionnaire moved on to assessing the enhancements in the overall context of introducing the tool to the developers' workflow. Only one of the four developers had used Findbugs before, but three of the four were either "In favour" or "Strongly in favour" of introducing Findbugs into their current development environment and workflow. The remaining developer was indifferent to the notion. When asked to give their opinion on the factors preventing introduction of Findbugs, reasons which applied included: "Finds too many bugs initially, is overwhelming"; "Reports too many bugs which we don't care about / are low priority"; and "Inspecting alerts is too time consuming". None of the four developers selected "Time / cost to train developers to use Findbugs" or "Findbugs does not find any bugs which are not discovered by other means". Comments also included that Findbugs is most useful when it is applied to give immediate feedback to the developer at the time of writing, and that on a mature codebase the cost/benefit ratio is low, as many bugs are found through testing or customer support calls. These results, although not statistically significant, give the impression that the development team is willing to invest in using Findbugs, if only it could present the results to developers in a way that provides value sooner.

When asked if they felt the adaptive ranking model makes Findbugs more useful, all four developers stated yes. Comments included that the learning mode would avoid the need to manually create filters, and that the standard Findbugs mechanisms seemed "tedious", particularly with a large codebase.

The final question asked them to describe any improvements to the adaptive ranking model which would increase the viability of introducing Findbugs to their workflow. Comments included that the learning would need to be retained over time (this should be possible with the existing functionality for merging analyses). Another comment stated it would be useful to be able to create filters for the learning mode (this functionality was removed by me, and could be reinstated). Another asked if it could be part of an automated continuous integration build; interestingly, a question I asked after the evaluation was, "if an automated build could find the feedback rank of any new bugs, would you be comfortable breaking the build over it?" The feeling was that, if the ranking became accurate enough over time, then yes, they would like to break the build over it.

Another suggestion was that it would be useful to be able to inspect the call hierarchy involved with a bug; further discussion found that it would also be necessary to browse the type hierarchy to classify a bug. Indeed, it's not possible to browse the code directly within the Findbugs GUI; it's only possible to see the source where the bug is registered. I would suggest that the best way to address this is with better integration with IDEs, as this functionality is likely to already be built in. Further gains could also come from this, such as IDE integration with version control, as commit messages are a rich source of information. It's always been at the front of my mind that the adaptive ranking computation should not depend on the Findbugs GUI, for this very reason.

As well as using a questionnaire, I also took notes on the questions I was asked throughout the evaluation. Not only did this allow me to note a couple of bugs the developers found in my code, but it gave me greater insight into the kinds of problems faced by new users. These notes reinforced some of the observations made from the questionnaire: what is the meaning of the learning value? Are classifications made in the learning mode carried over into the rest of the analysis? How long should I use the learning mode? Why does the learning mode keep showing the same types of bug again and again (I'll probably return to this point in a later blog)?

Overall I'd say the evaluation was a success, not only in how well Findbugs with the adaptive ranking I've introduced was received, but also in the feedback I've obtained from experienced developers.

Sunday, 7 February 2010

The Entropy Strikes Back: Revenge of Information Gain

Despite spending many frustrating hours trying to learn it, I still find myself unable to understand the concept of Information Gain in Bayesian systems. The formulas used by Kremenek et al. (which I described here) just don't seem to sink in for me. It's getting a bit late in the day now, so I'm hereby giving up on following the formula to the letter. Instead I'm going to try to define a formula for information gain which has the characteristics I'm looking for.

This unfortunately means that I'm taking a step away from proven formulae in information theory towards intuition. This wouldn't worry me so much if it wasn't for Bayes theory being pretty unintuitive, meaning that although my intuition may provide a model which seems right, without the backing of information theory and all its damn log calculations it may be crucially flawed. I'll just have to attempt to find a decent model for what I'm trying to achieve.

I've gone through several formulas, and the one I have chosen is:

Given alert A,
Information Gain(A) = (sum[for all populations] (population weight * # unclassified alerts in population) ) / # of bugs
Where:
0 <= # unclassified alerts in population <= # of bugs
sum[for all populations](population weight) = 1
If alert A is the only alert in an analysis, Information Gain(A) = 1.

Characteristics of this model include:
  • the result is a number in the range [0,1] - the result has a meaning when viewed alone: if ( result > (1 / # bugs) ), then there is a benefit to classifying the alert in terms of providing information to the model
  • however, there is no value of the result which we can use to say "the benefit of continuing to use the learning mode outweighs the benefit of starting the real bug triaging" - this would be dependent on the developers, although we could provide clues and help here
  • the model does not try to maximise the number of true positives, or minimise the number of false positives; instead, it attempts to maximise information. This is not a concern, as all bugs will be shown to the user, in order. If we were trying to find alerts which hit a certain threshold for the probability of being a true positive, this might have been a concern. But, as it is, finding that an alert is more likely to be a false positive is just as useful as finding that it is a true positive, and the model reflects this.

I'll demonstrate the use of this model with some examples. Let's say we have an analysis which yields 20 bug reports from a system. The population weights are as follows: AlertType=0.5, Package=0.0425, Class=0.17, Method=0.2875 (these are the actual values used).
Consider alert A, a singleton bug which has not been classified: there is only one instance of its alert type, and it is the only alert in its package, class and method. The value for how much information it gives is:
IG(A) = ((AlertType * # unclassified) + (Package * # unclassified) + (Class * # unclassified) + (Method * # unclassified) ) / # of bugs
= ((0.5*1) + (0.0425*1) + (0.17*1) + (0.2875*1)) / 20
= 1/20
= 0.05

Notice how in this case IG(A) = 1 / # of bugs, which, as mentioned earlier, shows there is no benefit to using this alert for learning. Let's consider another case, where we have bugs B and C. B shares an alert type with 10 other bugs, C shares a package with 10 other bugs. They share no other populations with other alerts. Here we would expect B to have a higher gain than C, since in our model alert type carries greater significance than sharing a package:
IG(B) = ((0.5 * 10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(C) = ( 0 + (0.0425*10) + 0 + 0) / 20 = 0.425 / 20 = 0.02125

So the model matches what we would expect in this case, with the value for C much lower than for B. Another crucial characteristic of this model is that it takes into account the number of unclassified bugs, rather than just the number of bugs. No information is required for bugs which have already been classified, so there's no point in trying to maximise information for them. Let's consider two further bugs, D and E. They have different alert types, but each shares its alert type with 10 other bugs. However, 4 of the 10 bugs with E's alert type have been classified. We would expect the classification of D to offer greater value.
IG(D) = ( (0.5*10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(E) = ( (0.5*(10-4)) + 0 + 0 + 0) / 20 = 3 / 20 = 0.15
The value of using the number of unclassified alerts comes when the learning mode is applied to a partially classified codebase, where it will take into account what the system already knows about the codebase in order to decide which bugs to use for learning.
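
For my own reference, here is the calculation sketched in code; the map-based representation of populations and weights is purely illustrative, not how my Findbugs additions actually store them:

import java.util.Map;

// Sketch of the information gain calculation described above.
public final class LearningValue {

    /**
     * @param unclassifiedPerPopulation for each population the alert belongs to
     *        (alert type, package, class, method), the number of alerts in that
     *        population which have not yet been classified
     * @param weights the weight of each population; the weights sum to 1
     * @param totalBugs the total number of bugs in the analysis
     */
    public static double informationGain(Map<String, Integer> unclassifiedPerPopulation,
                                         Map<String, Double> weights,
                                         int totalBugs) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> entry : unclassifiedPerPopulation.entrySet()) {
            sum += weights.get(entry.getKey()) * entry.getValue();
        }
        return sum / totalBugs;
    }
}

Plugging in the numbers for bug B from above (alert type weight 0.5, 10 unclassified alerts sharing the type, zero for the other populations, 20 bugs in total) gives 0.25, matching the worked example.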

So, given that I have not been able to fully understand the elements of information theory which would have provided a deeper insight into information gain, I think this is a good enough model, which will do the job.

Wednesday, 3 February 2010

Evaluating the System

Well, it looks like it's all systems go for visiting Sword-Ciboodle in Inchinnan to evaluate the project. Having talked with my manager from my placement days, it looks like I should be able to get 4 hours' worth of developer time. I'm still at the stage of not knowing exactly how to evaluate the system, and because I want to make the most of the time at Inchinnan, I'm going to go over some possible evaluation strategies.

First I'll discuss what I want to get from the evaluation.

Verification (Am I building the thing right?)
This is one aspect of evaluation that I think will be transparently tested during any evaluation. Concrete metrics for performance will be available (how long does it take to calculate Information Gain ranks? how long is the update time for Feedback Rank when the user is designating bugs as true or false positives?). Failure rates could be monitored as well (how many times did the application crash?), as could usability (how many times did the user ask for directions? how did the user rate usability?).

In terms of using the time at Inchinnan, the performance can be measured (I may as well, since it has to be computed anyway), but it may not be statistically valid. What would be better is to run the areas of the system I want to measure several times on my personal machine, to provide the number of results necessary to make it valid. For usability it could be useful to get the feedback of developers in industry, but the same kind of survey could be performed by peers at the university (if I'm careful about the friends I pick, they could at least be useful subjects in determining if the system is idiot proof :-P). So usability is not necessarily what I want to evaluate at Inchinnan.

What I really want to use the time at Inchinnan for is not for verification, but for validation.

Validation (Am I building the right thing?)
This is the really important question for a research-based project. In my case the validation strategy is kind of baked right into the project title: does what I'm building actually enhance Findbugs?

The previous studies in this area, from Kremenek et al. (I'll spell it correctly this time) and S. S. Heckman, have demonstrated that using bug correlations to rank alerts can reduce the number of false positives a user has to inspect. Comparisons of their rankings were made against the optimal ordering and a random ordering (S. S. Heckman did include the default Eclipse ordering, which is alphabetical). However, the comparisons were not made with knowledge of the manual ranking capabilities available in Findbugs (Kremenek et al.'s tool was for the C language; S. S. Heckman used a custom Eclipse plugin). This could be a significant factor which I don't want to overlook.

The whole idea of the project is to reduce the cost developers need to spend to see alerts they care about. Any improvement I suggest for Findbugs should be compared in real terms. It's really no use to say my ranking is better than a random ordering of bugs, because users of Findbugs do not generally view the bugs in a random order. Findbugs actually has some pretty strong capabilities for sorting and filtering bugs on a per-project basis. The system I develop should be compared to the real, expected usage pattern of Findbugs.

The actual basis for comparison should be real as well, but this is pretty simple I think: what shows the user more bugs which they care about? Easily measured, and meaningful. This will be a real way to compare my system against the expected usage pattern for a user of Findbugs.

But what is the "real, expected usage pattern"? Well, that's for the developer using it to decide. I don't think there is a more realistic comparison than simply allowing a developer to choose how to use Findbugs. So my instinct for how to do the evaluation is this: get the same developer to perform two sessions of bug triaging. In each they can spend some time getting acquainted with how to use the system, and in each they use the same codebase. Since the actual time for deciding whether a bug is a false positive or not should be independent of the system, the winning system is the one with the highest proportion of true positives.

Strength in simplicity.

Although a clear winner would provide a lot of information, all is not lost for my project if it doesn't yield the most true positives. While it would be nice to be better, it would be good to be complementary. That would have to be determined through some other means, perhaps a survey asking the developers whether they thought using the system could provide an enhancement over vanilla Findbugs.

Would be nice to win though.

So, I've described what I think is a fair and valid way to evaluate the system. Given that developers would be used to perform analysis twice, it's probably better to have 2 hours from 2 developers rather than 1 hour from 4. Filling out surveys could then be done at their leisure. Performance will be evaluated outwith Inchinnan.

Best get crackin' on beating Findbugs then...

Tuesday, 2 February 2010

The trouble with static calls...

As part of the work to enhance Findbugs, I have reserved an area of the GUI to perform supervised learning. The reasoning behind sectioning off a separate area is that:
  • I can control unnecessary performance hits from calculating stuff that's not needed, by putting up a barrier for the user to stipulate "Yes, I do want you to do this stuff."
  • I can bend the rules in terms of the UI conventions used in other areas; adding buttons that only make sense for "my stuff" is not a problem for the rest of the application.
  • It's easier to develop if you only have yourself to consider.
Despite sequestering myself into a small, separate part of the GUI, I have tried to reuse elements from the main GUI that may be useful. The main example of this is the bug tree:


This is quite a nifty piece of UI. The buttons shown above the tree ("Bug Rank", "Bug Pattern", etc.) are used to sort the tree, and they can be dragged to change the order in which sorters are applied to the list of bugs. I liked the UI here, and although I didn't need the functionality of changing the sort order, I did like how the code was set up for it. The sorters themselves were decoupled from the UI, with enough infrastructure around them to make it a breeze to add a new, custom sorter.

This is what I did with the learning mode:


The learning mode calculates the bug which would provide the most information about all the other bugs. A value is calculated for each bug, and the tree is sorted in order of lowest to highest. This was fairly easy to implement (apart from the large refactoring to extract an interface from an enum type that was used all over the place, but that was necessary for other reasons; the actual sorting and tree building was easy). I was pretty impressed with how the existing code in Findbugs allowed quite a complicated piece of UI to be added. A simplified sketch of the custom sorter is below.
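
For illustration, the shape of the custom sorter is roughly this (the LearningBug class is a stand-in; the real sorter implements the interface I extracted from the existing sorter enum):

import java.util.Comparator;

// Simplified sketch of the learning-mode sorter; not the actual Findbugs code.
public class LearningValueSorter implements Comparator<LearningValueSorter.LearningBug> {

    /** Minimal stand-in: a bug plus the learning value calculated for it. */
    public static class LearningBug {
        final String description;
        final double learningValue;

        LearningBug(String description, double learningValue) {
            this.description = description;
            this.learningValue = learningValue;
        }
    }

    // Lowest learning value first, matching the tree order described above.
    public int compare(LearningBug a, LearningBug b) {
        return Double.compare(a.learningValue, b.learningValue);
    }
}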

That was until I started to see some strange errors.

I mentioned how the part I was working on was segregated, in a kind of application sandbox, away from the rest of the GUI. I constructed and used separate instances of JTree, TreeModel and the class used for sorting. I was then quite surprised to see that the sorting method I had used had leaked out of my sandbox and into the rest of the application.

Somehow the main bug tree with the nice order dragging control no longer worked, and the tree was sorted based on my new, exposed sorting model.

I was a bit confused as to how this could have happened - I didn't publish the tree instance, or the model, or the sorter in my code, so how could it have escaped? I'm still not sure I've found the answer, but I'd be willing to lay down a decent-sized bet that it was to do with this piece of code:
public BugTreeModel(SortableBugTree tree, SorterTableColumnModel st, BugSet data) {
    st.addColumnModelListener(this);
    this.tree = tree;
    this.st = st;
    this.bugSet = data;
    // Publishes the bug set (and the current sort order) to static state on BugSet.
    BugSet.setAsRootAndCache(this.bugSet, st.getOrderAfterDivider());
}

Nothing about this constructor seems suspicious until you see the line:
BugSet.setAsRootAndCache(this.bugSet, st.getOrderAfterDivider());
A hidden dependency. An unexpected behaviour. And, if my instincts prove to be right, a pain in the ol' decouples. Despite making sure the instances I created did not get passed around after I constructed them, one of them had gone ahead and published them for me anyway, behind my back! The call "st.getOrderAfterDivider()" returns a list containing my custom sorter implementation.

Now, for the previous use cases of this code, where there existed only one bug tree in the application, this behaviour is okay. But when it comes to providing clean, decoupled code that follows the principle of least astonishment, it's pretty bad. That is what I see as the main problem with a certain class of static code: it's no better than a global variable, it doesn't have to declare itself (if the BugTreeModel constructor had taken a cache parameter, callers would be alerted to it), and the types of errors it tends to lead to are subtle and difficult to track down.
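
To make that concrete, the kind of change I'd prefer would look something like this sketch (all the types here are placeholders rather than the real Findbugs classes):

// Sketch of the alternative: the cache becomes an explicit collaborator passed
// into the constructor, so the dependency shows up in the signature instead of
// hiding behind a static call.
class BugSetPlaceholder { /* stands in for the Findbugs BugSet */ }

class SortOrderPlaceholder { /* stands in for the list returned by getOrderAfterDivider() */ }

interface BugTreeCache {
    void setAsRootAndCache(BugSetPlaceholder bugs, SortOrderPlaceholder order);
}

class ExplicitCacheBugTreeModel {
    private final BugSetPlaceholder bugSet;

    ExplicitCacheBugTreeModel(BugSetPlaceholder bugSet, SortOrderPlaceholder order, BugTreeCache cache) {
        this.bugSet = bugSet;
        // Each bug tree is given its own cache; nothing global is touched
        // behind the caller's back.
        cache.setAsRootAndCache(bugSet, order);
    }
}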

As I mentioned, the developers of the code were looking after their own use case; as you would expect, they had their objectives, and they delivered. There was nothing at the time (I assume) to suggest that there might ever be more than one bug tree in the application. But that's the problem with writing good code: you have no idea how it will be used in the years to come. Having code like this, with a hidden dependency, is akin to leaving a hidden tripwire behind for future developers (although I don't know the developer who wrote the code, I doubt very much he's as malicious as this statement makes out, but I couldn't think of a metaphor which cast him in a better light).

You can't see it, it hides itself well, but it can trip you up. Not only that, but after you've fallen over, landed on your face, you can take a look around and still not know what happened to you. And that's the trouble with static calls.