Information Gain is one of the features of Kremeneck et. al's "Correlation Exploitation in Error Ranking" that I identified as one of the central points. For this project I intend to use it to identify the training data for a supervised learning session. In the paper Information Gain is described in less than half a page, with some pretty mean looking formulas. Here I'm going to try to decipher the algorithm in a way that I can apply in the code. Since the technique is heavily involved in information theory, I'll probably descend into a narrative to attempt to claw my way out of n00bdom.
Information Gain is a value assigned to each bug alert. It is a value which shows how much the alert will help us deduce about the remaining alerts. The higher the value, the more information we can get from classifying the bug alert. The overall Information Gain is calculated using the gain from each pairing of bug alerts, also called "Mutual Information". Mutual information is the amount of information one single bug alert (Rj) tells us about another, single bug alert (Rk). Because we are more interested in gaining information on the entire set of bug reports, we don't really want to calculate the mutual information between two single reports. So instead of calculating MI(Rj; Rk) we wish to calculate the highest value of MI(Rtot; Rj) - which is how much information Rj can give us for the total set of bug alerts, Rtot.
Kremeneck et. al's formula for this is described as:
MI - mutual information value
R - the set of all bug reports
Rj - the single, current report being used
b - report which is classified as a bug
fp - report which is classified as a false positive
i - the values from the set of already inspected reports
Pr(rj|i) - the probability of the value rj given all the values i
D(P||Q) is described as:
This is all new stuff to me. Not only the terminology in information theory, such as entropy, but the mathematical notation used here. I reckon I could muddle through this stuff and have a working implementation, but I'd much rather reuse implementations from projects with in-depth knowledge of machine learning and information theory.
The Kremeneck paper also notes that this is computationally expensive, hence they only use this formula to break ties in alerts with equal FEEDBACK-RANK. Hopefully the time taken to calculate the training example is not so prohibitive it can't be done. It could even be appended to the analysis stage so that Findbugs could just be left to do its thing.
I'm sure I'll become more intimately acquainted to Information Gain in the next few weeks. Hopefully, because it's all a bit confusing at the moment...
Saturday, 2 January 2010
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment