Saturday 2 January 2010

The Mechanics of Information Gain

Information Gain is one of the features of Kremeneck et. al's "Correlation Exploitation in Error Ranking" that I identified as one of the central points. For this project I intend to use it to identify the training data for a supervised learning session. In the paper Information Gain is described in less than half a page, with some pretty mean looking formulas. Here I'm going to try to decipher the algorithm in a way that I can apply in the code. Since the technique is heavily involved in information theory, I'll probably descend into a narrative to attempt to claw my way out of n00bdom.

Information Gain is a value assigned to each bug alert. It is a value which shows how much the alert will help us deduce about the remaining alerts. The higher the value, the more information we can get from classifying the bug alert. The overall Information Gain is calculated using the gain from each pairing of bug alerts, also called "Mutual Information". Mutual information is the amount of information one single bug alert (Rj) tells us about another, single bug alert (Rk). Because we are more interested in gaining information on the entire set of bug reports, we don't really want to calculate the mutual information between two single reports. So instead of calculating MI(Rj; Rk) we wish to calculate the highest value of MI(Rtot; Rj) - which is how much information Rj can give us for the total set of bug alerts, Rtot.

Kremeneck et. al's formula for this is described as:
MI - mutual information value
R - the set of all bug reports
Rj - the single, current report being used
b - report which is classified as a bug
fp - report which is classified as a false positive
i - the values from the set of already inspected reports
Pr(rj|i) - the probability of the value rj given all the values i

D(P||Q) is described as:
This is all new stuff to me. Not only the terminology in information theory, such as entropy, but the mathematical notation used here. I reckon I could muddle through this stuff and have a working implementation, but I'd much rather reuse implementations from projects with in-depth knowledge of machine learning and information theory.

The Kremeneck paper also notes that this is computationally expensive, hence they only use this formula to break ties in alerts with equal FEEDBACK-RANK. Hopefully the time taken to calculate the training example is not so prohibitive it can't be done. It could even be appended to the analysis stage so that Findbugs could just be left to do its thing.

I'm sure I'll become more intimately acquainted to Information Gain in the next few weeks. Hopefully, because it's all a bit confusing at the moment...

Optimum Use Case

I've never really written a use case before, but I thought I would try it, in order to focus on the usage of the system I intend to develop. This is a single use-case, which covers the optimal experience of the system, which intends to be a better experience than I am able to implement.


Introducing Stevie...
Stevie works on a large project which has been in development for many years. In an effort to improve overall quality of the code, Stevie attempts to introduce Findbugs. Stevie is aware that Findbugs may generate a lot of noise, particularly for a large codebase.

Stevie configures Findbugs and analyses his project. Upon analysing, Stevie inspects the analysis using a Findbugs UI. He sees that Findbugs has discovered a large number of potential bugs (approximately 5,000). Knowing that it is likely the majority of reports do not need to be acted on, and that it is necessary to make the best use of his time, he attempts to find the best way to use Findbugs.

Optimum Experience
Upon navigating the Findbugs UI for the first time, Stevie discovers a "Train Findbugs On Your Codebase" option. Upon selection, a dialog screen is presented. This dialog contains information on the estimated time it will take for Findbugs to calculate the best training data. Being unfamiliar with this option, Stevie selects an option to display help information on the purpose of training, when it should be done, what the user should do, and how long it should take. Stevie decides to use the training option, notes that it may take a while to begin training, selects the options for Findbugs to calculate the training data. When a progress bar is displayed, which reports that Findbugs has begun calculating training data, he decides to carry on with other work.

After the estimated time elapses, Stevie returns to Findbugs to see that it has calculated the training data, and that training can begin. Since it has taken a while to calculate training data, Stevie has forgotten what he should be doing. He selects the context help available in training mode, he navigates the help to find that he should begin to classify bug alerts. The help also states that while the more he trains the data the better the ordering of the 5,000 reports will be, he can stop at any time and begin classifying the bugs which are likely to be the most important to him. The context help also suggests the number of alerts to classify before ending the training, and informs Stevie of an interface control which he can monitor to see if he has reached this point. Stevie closes the context help menu to return to the training interface.

Stevie views the list of alerts and inspects the first. He spends some time viewing the description of the bug, as provided by Findbugs, looking at the code, and finally judges whether this should be fixed or not. He selects an option from the UI to classify the bug as one which should be fixed, but is low priority. Although the UI allows him to store a short, free-text message to leave as a reason for his decision, he decides not to leave a reason, and confirms his classification. The alert is then removed from the list, and the UI is updated to show the next alert in the list. Stevie repeats this process several times, making note of the number of alerts he has classified, and how many more Findbugs suggests to classify. When Stevie has classified about half the amount Findbugs suggests, he selects the "End Training" option. This presents a help dialog which informs him that he may return to the training mode where he left off, and continue to train Findbugs. Stevie reads the message and selects the "Do not show this message again" option, ignores the "Cancel" option and selects a "Confirm" option, which takes him out of training mode back to the original list of alerts. Stevie changes the ordering of the list, which is defaulted based on bug type, choosing an "Order Based On Training" option. A dialog informs Stevie that this may take a while to compute, suggests an expected amount of time, offering a confirm or cancel option. Stevie selects the "Don't show this message again" option, selects "Confirm", sees the progress bar, and waits for Findbugs to complete processing.

Once Findbugs has completed, Stevie views the list of alerts. He sees the UI controls which show how likely the first alert is one expected to be important to Stevie. As he inspects the first alert and classifies it, he sees that the list of alerts changes its order slightly, and he is presented with the next alert in the list. Stevie continues to inspect bug alerts until he begins to encounter several alerts which he is not concerned about, at this point, he inspects the UI controls to see that the remaining alerts are not likely to be ones he cares about. Stevie decides that he has inspected enough, and found enough issues to be raised with other developers on the project. He saves the analysis and exits the program.

MoSCoW Requirements

I'm not entirely sure how requirements are normally arrived at. I decided to stick to a prose style and tried to cover all the requirements I could think of.

Overview

The system is intended to be used after the analysis of a project. When the analysis is complete, the bug alerts will be displayed in an ordered fashion designed to provide the best return on investment for the time spent triaging. To provide this ordering, the user should take part in a supervised learning session to train the system on the analysed project. The learning should select examples from the analysis which will provide the most information about the project. To determine the order of bug alerts for the learning session Kremeneck et. al's Information Gain algorithm will be used. Users should be able to end the training session whenever they wish. This means that bugs which are morely to be true positives are shown first. Within the bugs likely to be true positives there should be an ordering of priority. This probability of a true positive will be determined using Kremeneck et. al's FEEDBACK-RANK algorithm. A custom priority model should be implemented to bring the bugs the user most cares about to the top of the list.

It is intended that the investment in performing a supervised learning session will provide a net gain in the efficiency of bug triaging. Because priority of a bug is included in the ordering, the system should identify the bugs most important to the user first. If the user has limited time to triage bugs, they should be able to use a "skimming the cream" approach, which should still justify Findbugs' usage. The user should be able to detect when they are encountering a large ratio of false positives, and have a high confidence that they can end the triaging session.


Learning Session
Once an analysis is complete, the user should be able to enable a learning mode. At this point the system should calculate the Information Gain for each bug report. The Information Gain algorithm should be the same as used by Kremeneck et. al. Once this value is calculated for each alert, the interface should display the alerts in order. The user should then be able to classify each alert as a true or false positive, with a priority. The system could also alert a user when the Information Gain value from the remaining alerts reaches a low enough level that it is no longer worth continuing the learning session. The system could also present a streamlined user interface for classifying the alerts, for example, by displaying all the keyboard shortcuts required to make the mouse unneccesary. The system should also persist the results of the learning session so they may be shared, triaging can be deferred and learning is an incremental part of the continuous build of a project.


Triaging Ranked Alerts
Following a learning session, the system should be able to order the alerts based on the FEEDBACK-RANK (with priority). The system could display information on the ranking (such as a gauge for each alert ala Google's PageRank). The system could also provide a streamlined interface, as described for the learning session. The system should use the feedback provided for each alert to update the ordering of the alerts. This could be done automatically, but if performance is an issue it could be at the user's request. The system could provide information on the user's progress through the ordered alerts (for example, traversing the graph of the orderered FEEDBACK-RANKs). The system could provide a separate area of the UI or use the existing UI to display the ordered bugs. Once an alert in the queue is classified, it should be removed from the interface. However, the user must be able to revisit already classified bugs (they may change their mind), therefore the system should allow displaying already classified alerts. The system should persist classifications in this mode in the same way user classifications are already persisted. The system could inform the user when they have reached a point where the remaining bug alerts are more likely to be false positives.