Wednesday 3 February 2010

Evaluating the System

Well, it looks like all systems go for visiting Sword Ciboodle in Inchinnan to evaluate the project. Having talked with my manager from my placement days it looks like I should be able to get 4 hours worth of developer time. I'm still at the stage of not knowing exactly how to evaluate the system, and because I want to make the most of the time at Inchinnan, I'm going to go over some possible evaluation strategies.

First I'll discuss what I want to get from the evaluation.

Verification (Am I building the thing right?)
This is one aspect I think will be transparently tested during any evaluation session. Concrete metrics for performance will be available (how long does it take to calculate Information Gain ranks? How long does Feedback Rank take to update when the user designates bugs as true or false positives?). Failure rates could be monitored too (how many times did the application crash?), as could usability (how many times did the user ask for directions? How did the user rate usability?).
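To make the timing side concrete, a thin harness around the ranking calls is probably all that's needed. This is only a sketch: BugRanker, rankByInformationGain and updateFeedbackRank are stand-in names for my own code, not anything that exists in Findbugs.

import java.util.concurrent.TimeUnit;

// Timing sketch only: BugRanker is a placeholder for whatever my ranking
// code ends up looking like, not part of Findbugs itself.
public class RankingTimer {

    interface BugRanker {
        void rankByInformationGain();                  // full Information Gain ranking pass
        void updateFeedbackRank(boolean truePositive); // single user designation
    }

    public static void timeRanking(BugRanker ranker) {
        long start = System.nanoTime();
        ranker.rankByInformationGain();
        long rankMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        start = System.nanoTime();
        ranker.updateFeedbackRank(false);              // user marks a false positive
        long updateMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        System.out.println("IG ranking: " + rankMs + " ms, feedback update: " + updateMs + " ms");
    }
}

Timing the feedback update after each designation during a session would give per-interaction latency numbers without getting in the user's way.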

In terms of using the time at Inchinnan, the performance can be measured (I may as well, since it will have to be computed anyway), but a single session may not be statistically valid. What would be better is to run the parts of the system I want to measure several times on my personal machine, to get enough results to make the numbers meaningful. For usability it could be useful to get the feedback of developers in industry, but the same kind of survey could be performed by peers at the university (if I'm careful about the friends I pick, they could at least be useful subjects in determining if the system is idiot-proof :-P). So usability is not necessarily what I want to evaluate at Inchinnan.
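For the repeated runs on my own machine, something like the following would do: time the same task a fixed number of times and report the mean and standard deviation. What gets timed and how many runs are needed is still to be decided.

// Repeat a measurement and report mean and sample standard deviation, so the
// numbers from my own machine are at least statistically defensible. The
// Runnable is whatever I want to time (e.g. the Information Gain ranking pass).
public class RepeatedTiming {

    public static void report(Runnable task, int runs) {
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1e6;   // nanoseconds -> ms
        }

        double mean = 0.0;
        for (double m : millis) mean += m;
        mean /= runs;

        double variance = 0.0;
        for (double m : millis) variance += (m - mean) * (m - mean);
        double stdDev = Math.sqrt(variance / (runs - 1));    // assumes runs > 1

        System.out.printf("%d runs: mean %.1f ms, std dev %.1f ms%n", runs, mean, stdDev);
    }
}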

What I really want to use the time at Inchinnan for is not for verification, but for validation.

Validation (Am I building the right thing?)
This is the really important question for a research-based project. In my case the validation strategy is kind of baked right into the project title: does what I'm building actually Enhance Findbugs?

The previous studies in this area, from Kremenek et al. (I'll spell it correctly this time) and S. S. Heckman, have demonstrated that using bug correlations to rank alerts can reduce the number of false positives a user has to inspect. Comparisons of their rankings were made against the optimal ordering and a random ordering (S. S. Heckman did include the default Eclipse ordering, which is alphabetical). However, the comparisons were not made with knowledge of the manual ranking capabilities available in Findbugs (Kremenek et al.'s tool was for the C language, and S. S. Heckman used a custom Eclipse plugin). This could be a significant factor which I don't want to overlook.

The whole idea of the project is to reduce the effort developers have to spend to see the alerts they care about. Any improvement I suggest for Findbugs should be compared in real terms. It's really no use to say my ranking is better than a random ordering of bugs, because users of Findbugs do not generally view the bugs in a random order. Findbugs actually has some pretty strong capabilities for sorting and filtering bugs on a per-project basis. The system I develop should be compared to the real, expected usage pattern of Findbugs.

The actual basis for comparison should be real as well, but this is pretty simple I think: which shows the user more bugs they care about? Easily measured, and meaningful. This will be a real way to compare my system against the expected usage pattern of a Findbugs user.

But what is the "real, expected usage pattern"? Well, that's for the developer using it to decide. I don't think there is a more realistic comparison than just letting a developer choose how to use Findbugs. So this is my instinct for how to do the evaluation: get the same developer to perform two sessions of bug triaging, one with vanilla Findbugs and one with my system. In each session they can spend some time getting acquainted with how to use the system, and in each they use the same codebase. Since the actual time for deciding whether a bug is a false positive should be independent of the system, the winning system is the one with the highest proportion of true positives (a quick sketch of that comparison below).
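To spell out the comparison, it really is just a ratio per session. The numbers below are made up purely to show the calculation; the real ones would come from the triage sessions themselves.

// The comparison itself: given the counts from the two triage sessions
// (vanilla Findbugs vs. my ranking), the winner is whichever showed the
// higher proportion of true positives. The figures here are illustrative only.
public class SessionComparison {

    static double truePositiveRate(int truePositives, int inspected) {
        return inspected == 0 ? 0.0 : (double) truePositives / inspected;
    }

    public static void main(String[] args) {
        double vanilla = truePositiveRate(9, 40);   // made-up: plain Findbugs session
        double ranked  = truePositiveRate(17, 40);  // made-up: session with my ranking

        System.out.printf("Vanilla Findbugs: %.0f%% true positives%n", vanilla * 100);
        System.out.printf("My ranking:       %.0f%% true positives%n", ranked * 100);
        System.out.println(ranked > vanilla ? "My ranking wins" : "Vanilla wins");
    }
}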

Strength in simplicity.

A clear winner would provide a lot of information, but all is not lost for my project if it doesn't yield the most true positives. Although it would be nice to be better, it would also be good to be complementary. That would have to be determined through some other means, perhaps a survey asking the developers whether they thought using the system could provide an enhancement over vanilla Findbugs.

Would be nice to win though.

So, I've described what I think is a fair and valid way to evaluate the system. Given that each developer would perform the analysis twice, it's probably better to have 2 hours from 2 developers rather than 1 hour from 4. Filling out surveys could then be done at their leisure. Performance will be evaluated outwith Inchinnan.

Best get crackin' on beating Findbugs then...
