Tuesday, 2 February 2010

The trouble with static calls...

As part of the work to enhance Findbugs, I have reserved an area of the GUI to perform supervised learning. The reasoning behind segregating an area off is that:
  • I can avoid unnecessary performance hits from calculating stuff that's not needed, by putting up a barrier for the user to stipulate "Yes, I do want you to do this stuff."
  • I can bend the UI conventions used in other areas; adding buttons that only make sense to "my stuff" is not a problem for the rest of the application.
  • It's easier to develop if you only have yourself to consider.
Despite sequestering myself into a small, separate part of the GUI, I have tried to reuse elements from the main GUI that may be useful. The main example of this is the bug tree:


This is quite a nifty piece of UI. The buttons shown above the tree ("Bug Rank", "Bug Pattern", etc) are used to sort the tree. They can be dragged to change the order in which sorters are applied to the list of bugs. I liked the UI here, and although I didn't need the functionality of changing the sort order, I did like how the code was set up for this. The sorters themselves were decoupled from the UI, and had enough infrastructure around them to make it a breeze to add a new, custom sorter.

This is what I did with the learning mode:


The learning mode calculates the bug which would provide the most information about all the other bugs. A value is calculated for each bug, and the tree is sorted in order of lowest to highest. This was fairly easy to implement: apart from a large refactoring to extract an interface from an enum type that was used all over the place (which was necessary for other reasons anyway), the actual sorting and building of the tree was straightforward. I was pretty impressed with how the existing code in Findbugs allowed quite a complicated piece of UI to be added.
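To give a flavour of how the sorting infrastructure gets reused, here is a minimal sketch of the shape of a custom sorter. The names (BugAlert, BugSorter, InformationGainSorter) are made up for illustration; the real Findbugs classes are different, but the idea is the same: the sorter is just a comparator over bug alerts, decoupled from the tree that displays them.

import java.util.Comparator;
import java.util.Map;

// Hypothetical stand-ins for the Findbugs types involved.
class BugAlert { }

// The interface extracted from the old enum: any comparator over alerts can act as a sorter.
interface BugSorter extends Comparator<BugAlert> { }

class InformationGainSorter implements BugSorter {
    private final Map<BugAlert, Double> informationValue; // value computed per alert before sorting

    InformationGainSorter(Map<BugAlert, Double> informationValue) {
        this.informationValue = informationValue;
    }

    @Override
    public int compare(BugAlert a, BugAlert b) {
        // lowest value first, matching the ordering described above
        return Double.compare(informationValue.get(a), informationValue.get(b));
    }
}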

That was until I started to see some strange errors.

I mentioned how the part I was working on was segregated, in a kind of application sandbox, away from the rest of the GUI. I constructed and used separate instances of JTree, TreeModel and the class used for sorting. I was then quite surprised to see that the sorting method I had used had leaked out of my sandbox and into the rest of the application.

Somehow the main bug tree with the nice order dragging control no longer worked, and the tree was sorted based on my new, exposed sorting model.

I was a bit confused how this could have happened - I didn't publish the tree instance, or the model, or the sorter in my code, so how could it have escaped? I'm still not sure I've found the answer, but I'd be willing to lay down a decent sized bet that it was to do with this piece of code:
public BugTreeModel(SortableBugTree tree, SorterTableColumnModel st, BugSet data) {
    st.addColumnModelListener(this);
    this.tree = tree;
    this.st = st;
    this.bugSet = data;
    BugSet.setAsRootAndCache(this.bugSet, st.getOrderAfterDivider());


Nothing about this constructor seems suspicious until you see the line:
BugSet.setAsRootAndCache(this.bugSet, st.getOrderAfterDivider());
A hidden dependency. An unexpected behaviour. And, if my instincts prove to be right, a pain in the ol' decouples. I had made sure the instances I created did not get passed around after I constructed them, yet the constructor had gone ahead and published them for me anyway, behind my back! The call "st.getOrderAfterDivider()" returns a list containing my custom sorter implementation.

Now, for the previous use cases of this code, where there was only one bug tree in the application, this behaviour is okay. But when it comes to providing clean, decoupled code that exercises the principle of least astonishment, it's pretty bad. That is what I see as the main problem with a certain class of static code: it's no better than a global variable, it doesn't have to declare itself (if the BugTreeModel constructor had taken a cache parameter, callers would be alerted to it), and the errors it tends to lead to are subtle and difficult to track down.
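To make that concrete, here is a minimal sketch of the two shapes, with made-up names (SortOrderCache, TreeModelWithHiddenDependency and so on are not the real Findbugs classes). The first constructor quietly publishes its arguments to application-wide static state; the second has to be handed the cache, so the side effect is visible at every construction site.

import java.util.List;

interface Sorter { }

// A global variable in disguise: whoever calls setCurrent() last wins,
// regardless of which tree they thought they were configuring.
final class SortOrderCache {
    private static List<Sorter> current;
    static void setCurrent(List<Sorter> sorters) { current = sorters; }
    static List<Sorter> getCurrent() { return current; }
}

class TreeModelWithHiddenDependency {
    TreeModelWithHiddenDependency(List<Sorter> sorters) {
        SortOrderCache.setCurrent(sorters); // nothing in the signature warns the caller
    }
}

// The alternative: the shared state is a parameter, so a second, sandboxed
// tree can simply be given its own holder instead of trampling a global one.
final class SortOrderHolder {
    private List<Sorter> current;
    void setCurrent(List<Sorter> sorters) { current = sorters; }
    List<Sorter> getCurrent() { return current; }
}

class TreeModelWithExplicitDependency {
    TreeModelWithExplicitDependency(List<Sorter> sorters, SortOrderHolder holder) {
        holder.setCurrent(sorters); // the side effect is now visible at the call site
    }
}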

As I mentioned, the developers of the code were looking after their own use case; as you would expect, they had their objectives, and they delivered. There was nothing at the time (I assume) to suggest that there might ever be more than one bug tree in the application. But that's the problem with writing good code: you have no idea how it will be used in the years to come. Code like this, with a hidden dependency, is akin to leaving a hidden tripwire behind for future developers (although I don't know the developer who wrote the code, I doubt very much he's as malicious as that makes out, but I couldn't think of a metaphor which cast him in a better light).

You can't see it, it hides itself well, but it can trip you up. Not only that, but after you've fallen over, landed on your face, you can take a look around and still not know what happened to you. And that's the trouble with static calls.

Saturday, 2 January 2010

The Mechanics of Information Gain

Information Gain is one of the features of Kremenek et al.'s "Correlation Exploitation in Error Ranking" that I identified as one of the central points. For this project I intend to use it to identify the training data for a supervised learning session. In the paper Information Gain is described in less than half a page, with some pretty mean-looking formulas. Here I'm going to try to decipher the algorithm in a way that I can apply in the code. Since the technique is heavily rooted in information theory, I'll probably descend into a narrative to attempt to claw my way out of n00bdom.

Information Gain is a value assigned to each bug alert, showing how much classifying that alert will tell us about the remaining alerts. The higher the value, the more information we can get from classifying the bug alert. The overall Information Gain is calculated using the gain from each pairing of bug alerts, also called "Mutual Information". Mutual information is the amount of information one single bug alert (Rj) tells us about another single bug alert (Rk). Because we are more interested in gaining information on the entire set of bug reports, we don't really want to calculate the mutual information between two single reports. So instead of calculating MI(Rj; Rk), we want to find the alert with the highest value of MI(Rtot; Rj) - which is how much information Rj can give us about the total set of bug alerts, Rtot.

Kremenek et al.'s formula for this is described in terms of the following symbols:
  • MI - mutual information value
  • R - the set of all bug reports
  • Rj - the single, current report being used
  • b - report which is classified as a bug
  • fp - report which is classified as a false positive
  • i - the values from the set of already inspected reports
  • Pr(rj|i) - the probability of the value rj given all the values i
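Piecing that legend together, my reading is that the formula has the standard "mutual information as an expected relative entropy" shape: average, over the two possible classifications of Rj, how much the distribution over the rest of the reports would shift, weighted by how likely each classification is given the reports already inspected. In my own notation (not necessarily the paper's exact presentation), that is something like:

MI(R; R_j \mid i) = \sum_{r_j \in \{b,\, fp\}} \Pr(r_j \mid i) \, D\big( \Pr(R \mid r_j, i) \,\|\, \Pr(R \mid i) \big)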

D(P||Q) is described as:
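As far as I can tell, this is the relative entropy, or Kullback-Leibler divergence, between two distributions P and Q, which for discrete distributions works out as:

D(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}

Informally, it measures how far the distribution P has moved away from the distribution Q.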
This is all new stuff to me - not only the terminology of information theory, such as entropy, but also the mathematical notation used here. I reckon I could muddle through this stuff and have a working implementation, but I'd much rather reuse implementations from projects with in-depth knowledge of machine learning and information theory.

The Kremenek paper also notes that this is computationally expensive, hence they only use this formula to break ties between alerts with equal FEEDBACK-RANK. Hopefully the time taken to calculate the training examples is not so prohibitive that it can't be done. It could even be appended to the analysis stage so that Findbugs could just be left to do its thing.

I'm sure I'll become more intimately acquainted with Information Gain in the next few weeks. Hopefully, because it's all a bit confusing at the moment...

Optimum Use Case

I've never really written a use case before, but I thought I would try it, in order to focus on the usage of the system I intend to develop. This is a single use case, covering the optimal experience of the system - probably a better experience than I will actually be able to implement.


Introducing Stevie...
Stevie works on a large project which has been in development for many years. In an effort to improve the overall quality of the code, Stevie attempts to introduce Findbugs. Stevie is aware that Findbugs may generate a lot of noise, particularly for a large codebase.

Stevie configures Findbugs and analyses his project. Once the analysis is complete, Stevie inspects it using a Findbugs UI. He sees that Findbugs has discovered a large number of potential bugs (approximately 5,000). Knowing that the majority of reports are likely not to need acting on, and that he needs to make the best use of his time, he attempts to find the best way to use Findbugs.

Optimum Experience
Navigating the Findbugs UI for the first time, Stevie discovers a "Train Findbugs On Your Codebase" option. Upon selection, a dialog screen is presented. This dialog contains information on the estimated time it will take for Findbugs to calculate the best training data. Being unfamiliar with this option, Stevie selects an option to display help information on the purpose of training, when it should be done, what the user should do, and how long it should take. Stevie decides to use the training option, notes that it may take a while before training can begin, and selects the option for Findbugs to calculate the training data. When a progress bar appears, reporting that Findbugs has begun calculating training data, he decides to carry on with other work.

After the estimated time elapses, Stevie returns to Findbugs to see that it has calculated the training data, and that training can begin. Since it has taken a while to calculate the training data, Stevie has forgotten what he should be doing. He selects the context help available in training mode and navigates the help to find that he should begin to classify bug alerts. The help also states that while the more he trains the better the ordering of the 5,000 reports will be, he can stop at any time and begin classifying the bugs which are likely to be the most important to him. The context help also suggests the number of alerts to classify before ending the training, and informs Stevie of an interface control which he can monitor to see if he has reached this point. Stevie closes the context help menu to return to the training interface.

Stevie views the list of alerts and inspects the first. He spends some time viewing the description of the bug, as provided by Findbugs, looking at the code, and finally judges whether this should be fixed or not. He selects an option from the UI to classify the bug as one which should be fixed, but is low priority. Although the UI allows him to store a short, free-text message as a reason for his decision, he decides not to leave one, and confirms his classification. The alert is then removed from the list, and the UI is updated to show the next alert in the list. Stevie repeats this process several times, making note of the number of alerts he has classified, and how many more Findbugs suggests he classify. When Stevie has classified about half the number Findbugs suggests, he selects the "End Training" option. This presents a help dialog which informs him that he may return to the training mode where he left off, and continue to train Findbugs. Stevie reads the message, selects the "Do not show this message again" option, ignores the "Cancel" option and selects a "Confirm" option, which takes him out of training mode and back to the original list of alerts. Stevie changes the ordering of the list, which defaults to ordering by bug type, by choosing an "Order Based On Training" option. A dialog informs Stevie that this may take a while to compute, suggests an expected amount of time, and offers a confirm or cancel option. Stevie selects the "Don't show this message again" option, selects "Confirm", sees the progress bar, and waits for Findbugs to complete processing.

Once Findbugs has completed, Stevie views the list of alerts. He sees the UI controls which show how likely it is that the first alert is one he will consider important. As he inspects the first alert and classifies it, he sees that the list of alerts changes its order slightly, and he is presented with the next alert in the list. Stevie continues to inspect bug alerts until he begins to encounter several alerts which he is not concerned about; at this point, he inspects the UI controls and sees that the remaining alerts are not likely to be ones he cares about. Stevie decides that he has inspected enough, and found enough issues to raise with the other developers on the project. He saves the analysis and exits the program.

MoSCoW Requirements

I'm not entirely sure how requirements are normally arrived at. I decided to stick to a prose style and tried to cover all the requirements I could think of.

Overview

The system is intended to be used after the analysis of a project. When the analysis is complete, the bug alerts will be displayed in an ordered fashion designed to provide the best return on investment for the time spent triaging. To provide this ordering, the user should take part in a supervised learning session to train the system on the analysed project. The learning should select examples from the analysis which will provide the most information about the project. To determine the order of bug alerts for the learning session, Kremenek et al.'s Information Gain algorithm will be used. Users should be able to end the training session whenever they wish. After training, bugs which are more likely to be true positives should be shown first. Within the bugs likely to be true positives there should be an ordering of priority. The probability of being a true positive will be determined using Kremenek et al.'s FEEDBACK-RANK algorithm. A custom priority model should be implemented to bring the bugs the user most cares about to the top of the list.

It is intended that the investment in performing a supervised learning session will provide a net gain in the efficiency of bug triaging. Because priority of a bug is included in the ordering, the system should identify the bugs most important to the user first. If the user has limited time to triage bugs, they should be able to use a "skimming the cream" approach, which should still justify Findbugs' usage. The user should be able to detect when they are encountering a large ratio of false positives, and have a high confidence that they can end the triaging session.


Learning Session
Once an analysis is complete, the user should be able to enable a learning mode. At this point the system should calculate the Information Gain for each bug report. The Information Gain algorithm should be the same as that used by Kremenek et al. Once this value is calculated for each alert, the interface should display the alerts in order. The user should then be able to classify each alert as a true or false positive, with a priority. The system could also alert the user when the Information Gain of the remaining alerts reaches a low enough level that it is no longer worth continuing the learning session. The system could also present a streamlined user interface for classifying the alerts, for example, by displaying all the keyboard shortcuts required to make the mouse unnecessary. The system should also persist the results of the learning session so that they may be shared, triaging can be deferred, and learning can become an incremental part of a project's continuous build.


Triaging Ranked Alerts
Following a learning session, the system should be able to order the alerts based on FEEDBACK-RANK (with priority). The system could display information on the ranking (such as a gauge for each alert, a la Google's PageRank). The system could also provide a streamlined interface, as described for the learning session. The system should use the feedback provided for each alert to update the ordering of the alerts. This could be done automatically, but if performance is an issue it could be at the user's request. The system could provide information on the user's progress through the ordered alerts (for example, traversing the graph of the ordered FEEDBACK-RANKs). The system could provide a separate area of the UI or use the existing UI to display the ordered bugs. Once an alert in the queue is classified, it should be removed from the interface. However, the user must be able to revisit already classified bugs (they may change their mind), so the system should allow already classified alerts to be displayed. The system should persist classifications in this mode in the same way user classifications are already persisted. The system could inform the user when they have reached the point where the remaining bug alerts are more likely to be false positives.

Wednesday, 30 December 2009

Some notes about project scope

Before I start MoSCoW prioritisation of the tasks for the project, I thought I would define and justify some of the choices of scope for the project. One of the things that has been said to me is that I shouldn't try to be too ambitious, so I'm really going to try to cut back on what I want to do. With decent MoSCoW'ing I should still be able to think big, as long as I'm disciplined about the order in which I implement parts (i.e. no S's before M's).

Things out of scope
  • Dealing with multiple analyses - going from one run of Findbugs on one day to another run the next day, you would not want to see bugs you've already annotated reappearing for inspection. I think there are ways to merge analyses and cut these out, but this will not be in scope for me. When it comes to initiating a training session, I will either be ignoring already inspected bugs, or assuming all bugs need inspecting.
  • Integration with bug-tracking systems - this, again, is something which I think the new version of Findbugs is intended to do. It is well out of scope for me, however; I'm not about to start writing or using integration adapters for different bug-tracking systems. Hopefully I'll be evaluating the system at Sword Ciboodle, who have their own in-house bug tracking system, so that kind of functionality wouldn't be evaluated anyway.
  • Handling the different Findbugs interfaces - once the model for calculating FEEDBACK-RANK and Information Gain is implemented, I will only be adapting a single interface for it (e.g. Swing UI or Eclipse plugin - definitely not IntelliJ or NetBeans). Once the functionality is available, providing it for another interface isn't likely to gain much in terms of a final year project. No doubt there will be a blog post of which UI I will aim for in the next few weeks.
Things in scope
  • Calculating the Information Gain for each bug report at the end of an analysis
  • Providing a learning mode that will show bug alerts in order of information gain
  • Calculating Feedback Rank for each bug report
  • Providing a triage mode that will order alerts based on Feedback Rank, combined with the priority scheme I wish to add on top of that
That is about as basic a requirements list as I can get. It is just a precursor to a full MoSCoW requirements phase. I just wanted to get clear in my head what I won't be implementing.

User feedback in the Findbugs model

Findbugs already provides ways for a user to annotate a bug report, which may be useful here. The possible designations are:
  • not a bug
  • mostly harmless
  • should fix
  • must fix
  • I will fix
  • bad analysis
  • unclassified
  • obsolete/unused; will not fix
This is stored within the XML generated for a saved project.

If Findbugs does not support ignoring already viewed bug alerts, the absence of an annotation could be used as a flag that the alert should be shown to the user. I think managing the model across different sessions is probably out of scope for my project. I think there are Findbugs commands available to merge analyses anyway.

These designations could be ordered to create the priority I wish to model, although in the latest version there is also the concept of 'Rank' for a bug alert. It's quite annoying that there are these different ways to judge a bug. Hopefully they'll be merged somehow.
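As a rough sketch of what I mean (the strings here just mirror the list above; the exact designation keys Findbugs stores in the XML may well differ, and the ordering itself is only a first guess):

import java.util.Arrays;
import java.util.List;

// Collapse the existing user designations into a single priority ordering.
// The designation strings and their order are assumptions for illustration only.
final class DesignationPriority {
    private static final List<String> ORDERED = Arrays.asList(
            "must fix",                        // most important true positives first
            "I will fix",
            "should fix",
            "mostly harmless",
            "unclassified",
            "obsolete/unused; will not fix",
            "bad analysis",
            "not a bug");                      // false positives last

    static int priorityOf(String designation) {
        int index = ORDERED.indexOf(designation);
        return index >= 0 ? index : ORDERED.size(); // unknown designations sort last
    }
}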

Just wanted to leave a note for myself that this is probably how the XML of an analysis will be handled for the learning mode.

Tuesday, 29 December 2009

Final decision of what I'm going to implement.

It's almost 2010 and I'm only just deciding what to implement. Sigh.

Basically it's a supervised learning model which I intend to make available at least through the Findbugs Swing UI, but hopefully through the Eclipse UI as well. The project is going to be aimed at serving a subset of the workflow models or scenarios possible with Findbugs. Examples of the possible scenarios include: using rapid feedback from the tool while developing; using the tool as part of a continuous build process; or using the tool to validate a release candidate as part of the QA release process. The scenario I want to focus on is the introduction of Findbugs to a large, existing project.

This scenario presents specific problems that I want to address with the project. These include:
  • initial analyses with Findbugs could overwhelm users (consider the use of Findbugs on IntelliJ IDEA which resulted in ~2,800 bug alerts)
  • initial inspections, if ordered badly, could result in a high initial false positive rate which could give the impression the tool only generates false positives, which could result in the tool being discarded
  • with a huge number of alerts, users could be suppressing what they see as similar false positives time and time again, which is slow, laborious, costly and frustrating

The implementation will be based on techniques described by Kremenek et al. in "Correlation Exploitation in Error Ranking". The two important concepts described in that paper are FEEDBACK-RANK and Information Gain. FEEDBACK-RANK is a probabilistic technique which exploits several types of correlation between bug alerts to create an ordering from most-likely-true-positive to least-likely-true-positive. This order can then be used to present bug alerts to users such that true positives are shown first. The effect of this is that when users begin to see high rates of false positives, they can stop inspecting the reports because they have most likely seen all the true errors. Information Gain is a factor used as a tie-breaker when bug alerts have an equal probability of being true positives. The alert chosen to be inspected is the one which will provide the most useful information about the remaining alerts.
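To show the shape of the feedback loop - and only the shape; this is a deliberately naive stand-in with made-up names, not Kremenek et al.'s actual FEEDBACK-RANK, which uses a proper probabilistic model over the correlation clusters - the idea is that each unclassified alert carries a score, classifying an alert nudges the scores of the other alerts in its cluster, and the queue is always presented most-likely-true-positive first.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class NaiveFeedbackRanker {
    private final Map<String, Double> score = new HashMap<>();          // alert id -> likelihood-ish score
    private final Map<String, List<String>> clusters = new HashMap<>(); // cluster key -> alert ids
    private final Map<String, String> clusterOf = new HashMap<>();      // alert id -> cluster key

    void add(String alertId, String clusterKey) {
        score.put(alertId, 0.5);                                        // no information yet
        clusterOf.put(alertId, clusterKey);
        clusters.computeIfAbsent(clusterKey, k -> new ArrayList<>()).add(alertId);
    }

    // Record the user's classification and nudge the rest of the cluster.
    void classify(String alertId, boolean truePositive) {
        score.remove(alertId);
        for (String neighbour : clusters.get(clusterOf.get(alertId))) {
            score.computeIfPresent(neighbour,
                    (id, s) -> truePositive ? Math.min(1.0, s + 0.1) : Math.max(0.0, s - 0.1));
        }
    }

    // Present the unclassified alerts most-likely-true-positive first.
    List<String> ranked() {
        List<String> ids = new ArrayList<>(score.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> score.get(id)).reversed());
        return ids;
    }
}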

In the scenario of introducing Findbugs to a large project, the initial ordering of results is crucial. Taking a factor from the tool itself is always going to be a limited approach; for example, the priority of a bug alert is chosen by the developers of the tool, with no knowledge of the project the tool is being used on. While the FEEDBACK-RANK algorithm will quickly begin to differentiate between true and false positives, its initial ordering is random, and this could affect the order in which the true positives are eventually seen.

What I hope to achieve with the initial, supervised learning session is that Information Gain is used solely to choose which alerts to present to the users, in order to get their feedback. This amounts to making a larger up-front investment in order to gain a better context for the analysed project. It is hoped that this initial learning can result in a net reduction in the number of alerts which have to be inspected before all true positives are seen.

One other crucial difference between Kremenek et al.'s techniques and what I plan to implement is introducing a priority to the feedback. With FEEDBACK-RANK, the bug alerts are ordered based on the probability of being a true or false positive, but it is possible that the true positives end up ordered from lowest to highest priority. On a large project with, say, 2,800 alerts to process, initial reports consisting only of trivial, "true but low impact" defects are likely to be just as damaging to confidence in the tool. The hope is that an initial learning session can provide information not only on which alerts are true positives, but also on how important those true positives are to the project.

The latest version of Findbugs includes several designations for bug alerts, so it is hoped that no new prioritisation scheme needs to be introduced. Although introducing a new one would probably be easier, I'd prefer to retain the terminology already used and endorsed by the Findbugs developers and community.

What I've not described so far is how Kremenek et al. defined 'correlation' between bug alerts. In their model, they measured correlation in terms of code locality. They found, on two projects - Linux and an anonymised commercial project referred to as System X - that true positives and false positives clustered by their location. The granularities of location range from function to file to leaf directory. I intend to continue using this correlation model, which in terms of Java I will probably define as corresponding to method, class and package. I wish to keep this model for comparison purposes: I want to compare their initial findings on C code with a Java project, and use that as a baseline before introducing priority into the learning.
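As a rough sketch of what that looks like on the Java side (BugAlert here is a made-up type carrying its location; the real Findbugs class and method location API is different), the idea is just to group alerts under increasingly coarse location keys, so that feedback on one alert can be shared with the others in its cluster:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical alert type carrying its location; the real Findbugs API differs.
class BugAlert {
    final String packageName, className, methodName;
    BugAlert(String packageName, String className, String methodName) {
        this.packageName = packageName;
        this.className = className;
        this.methodName = methodName;
    }
}

// Cluster alerts at the three granularities: method, class and package.
final class LocalityClusters {
    static Map<String, List<BugAlert>> group(List<BugAlert> alerts, Function<BugAlert, String> key) {
        Map<String, List<BugAlert>> clusters = new HashMap<>();
        for (BugAlert alert : alerts) {
            clusters.computeIfAbsent(key.apply(alert), k -> new ArrayList<>()).add(alert);
        }
        return clusters;
    }

    static Map<String, List<BugAlert>> byMethod(List<BugAlert> alerts) {
        return group(alerts, a -> a.packageName + "." + a.className + "#" + a.methodName);
    }

    static Map<String, List<BugAlert>> byClass(List<BugAlert> alerts) {
        return group(alerts, a -> a.packageName + "." + a.className);
    }

    static Map<String, List<BugAlert>> byPackage(List<BugAlert> alerts) {
        return group(alerts, a -> a.packageName);
    }
}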

I best get to work...