Sunday 7 February 2010

The Entropy Strikes Back: Revenge of Information Gain

Despite spending many frustrating hours trying to learn, I still find myself unable to understand the concept of Information Gain in Bayesian systems. The formulas used by Kremenek et al. (which I described here) just won't seem to sink in for me. It's getting a bit late in the day now, so I'm hereby giving up on following the formula to the letter. Instead I'm going to try to define a formula for information gain which has the characteristics I'm looking for.

This unfortunately means that I'm taking the step away from proven formulae in information theory towards intuition. This wouldn't worry me so much if it wasn't for Bayes theory being pretty unintuitive, meaning that although my intuition may provide a model which seems right, but without the backing of information theory and all it's damn log calculations, may be crucially flawed. I'll just have to attempt to find a decent model for what I'm trying to achieve.

I've went through several formulas, and the one I have chosen is:

Given alert A,
Information Gain(A) = (sum[for all populations] (population weight * # unclassified alerts in population) ) / # of bugs
Where:
0 <= # unclassified alerts in population <= # of bugs
sum[for all populations](population weight) = 1
If alert A is the only alert in an analysis, Information Gain(A) = 1.

Characteristics of this model include:
  • the result is a number in the range [0,1] - the result has a meaning when viewed alone, if ( result > (1 / # bugs) ), then there is a benefit to classifying it in terms of providing information of the model
  • however, there is no value for result which we can use to say "The benefit of continuing to use the learning mode outweighs the benefit of starting the real bug triaging." This would be dependent on developers. Although we could provide clues and help on this.
  • the model does not try to maximise the amount of true positives, or minimise the amount of false positives. Instead, it attempts to maximise information. This is not a concern, as all bugs will be shown to the user, in order. If we were trying to find alerts which hit a certain threshold for probability of being a true positive, this may have been a concern. But, as it is, finding that an alert is more likely to be a false positive is just as useful as finding that it is a true positive. The model reflects this.

I'll demonstrate the use of this model with some examples. Let's say we have an analysis which yields 20 bug reports from a system. The population weights are as follows: AlertType=0.5, Package=0.0425, Class=0.17, Method=0.2875 (these are the actual values used).
Consider alert A, which is a singleton bug, which has not been classified. There is only one instance of the alert type it has, it is the only alert in its package, class and method. The value for how much information it gives is:
IG(A) = ((AlertType * # unclassified) + (Package * # unclassified) + (Class * # unclassified) + (Method * # unclassified) ) / # of bugs
= ((0.5*1) + (0.0425*1) + (0.17*1) + (0.2875*1)) / 20
= 1/20
= 0.05

Notice how in this case IG(A) = 1 / # of bugs, which as mentioned earlier, shows there is no benefit to using this alert for learning. Let's consider another case, where we have bugs B and C. B shares an alert type with 10 other bugs, C shares a package with 10 other bugs. They share no other populations with other alerts. Here we would expect B to have a higher value of gain than C, since in our model alert type has greater than significance than sharing a package:
IG(B) = ((0.5 * 10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(C) = ( 0 + (0.0425*10) + 0 + 0) / 20 = 0.425 / 20 = 0.02125

So the model matches what we would expect in this case, with the value from C much lower than B. Another characteristic of this model which is crucial is that it takes into account the unclassified bugs, rather than just the number of bugs. There is no information required for bugs which have already been classified, so there's no point in trying to maximise information for them. Let's consider two further bugs, D and E. They have different alert types, but each shares their alert type with 10 other bugs. However, 4 of the 10 bugs with E's alert type have been classified. We would expect to see the classification of D offer greater value.
IG(D) = ( (0.5*10) + 0 + 0 + 0) / 20 = 5 / 20 = 0.25
IG(E) = ( (0.5*(10-4) + 0 + 0 + 0) / 20 = 3 / 20 = 0.15
The value of using number of unclassified comes when the learning mode is applied to a partially classified codebase, where the learning mode will take into account what the system already knows about the codebase in order to decide which bugs to use to learn.

So, given that I have not been able to fully understand the elements of information theory which would have provided a deeper insight into information gain, I think this is a good enough model, which will do the job.

No comments:

Post a Comment