Friday 19 February 2010

Results of the Evaluation at Sword-Ciboodle

Having spent two days at the offices of Sword-Ciboodle evaluating my enhancement of Findbugs, I'd say the investment was time well spent. I'm also very grateful to the company for donating resources and developer time for this exercise. I was particularly pleased that the developers who took part were more experienced than I had expected: their professional experience ranged from roughly 5 to 15 years, with large portions of that being in Java development.

The actual evaluation differed a great deal from what I discussed in a previous blog entry. It was suggested that expecting developers to slog through an exercise for two hours was a bit too much. Also, since the number of developers was always going to be fewer than six, and the tool would be used on a single code base, any measurements or findings would not be statistically significant. Instead I opted to use the time to gather initial feedback on first impressions of the usefulness of the tool. Since any drop-off in use of the tool is likely to happen within a short period, I decided to have the developers use Findbugs (default functionality) for 20 minutes, then use my addition for a further 20 minutes. Following this they were asked to fill in a short questionnaire.

The following are some of the more interesting points raised by the questionnaire.

Asked to rate the performance of the adaptive ranking modes, all developers answered favourably. Positive comments included that the mode was fast, that there were no instances of waiting when using it, and that the feedback given for waiting time was good. One negative comment was that setting the designation on a parent node with many child nodes took some time with no feedback (such as displaying a "waiting" cursor). This is functionality which already existed in Findbugs and which I have not modified, but it is worth noting, and I may take the time to make the small change.
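For reference, this is roughly the kind of small change involved: show a busy cursor while the bulk update runs, then restore it afterwards. The sketch below is mine, not FindBugs' actual code, and the frame and update callback are placeholders.

    import java.awt.Cursor;
    import javax.swing.JFrame;
    import javax.swing.SwingWorker;

    public class BusyCursorExample {

        // Applies a designation to every child node, showing a wait cursor meanwhile.
        static void setDesignationOnChildren(final JFrame frame, final Runnable bulkUpdate) {
            // Show the "busy" cursor before starting the long-running update.
            frame.setCursor(Cursor.getPredefinedCursor(Cursor.WAIT_CURSOR));

            new SwingWorker<Void, Void>() {
                @Override
                protected Void doInBackground() {
                    // The slow part: setting the designation on each child bug.
                    // (If this touches Swing models it may need to run on the EDT instead.)
                    bulkUpdate.run();
                    return null;
                }

                @Override
                protected void done() {
                    // Restore the normal cursor on the Event Dispatch Thread once finished.
                    frame.setCursor(Cursor.getDefaultCursor());
                }
            }.execute();
        }
    }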

Asked to rate the usability of the adaptive ranking mode, all developers rated it 'Good' (on the scale "Very bad", "Bad", "Neither good nor bad", "Good" or "Very good"). Comments included that there was a very clear set of options, with little to cause confusion. A couple of comments on the choice of bugs used in the learning mode noted that it was a hindrance that the mode repeatedly showed the same type of bug, and that it took a while to get through insignificant bugs. Another very interesting comment was that it was difficult to gauge how much time to spend in the learning mode before switching to the feedback rank mode. Perhaps it would be worth investigating ways to continuously recalculate the feedback rank order while in learning mode, and a way to notify the user when it is worth switching out of learning mode.
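One possible heuristic for this, purely a sketch on my part and not something I have implemented: recompute the feedback rank order after each classification in learning mode, compare the top of the new ordering with the previous one, and suggest switching once the ordering stops changing. The bug identifiers, class names and thresholds below are all placeholders.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch only: decide when the feedback rank order has stabilised enough
    // to suggest leaving learning mode. All names and thresholds are hypothetical.
    public class RankStabilityCheck {

        private static final int TOP_N = 20;                    // how many top-ranked bugs to compare
        private static final double STABILITY_THRESHOLD = 0.9;  // 90% overlap counts as "stable"

        private List<String> previousTop;  // unique ids of the previously top-ranked bugs

        // Call after each classification, passing the freshly recalculated ranking
        // (best first). Returns true when the user could be prompted to switch modes.
        public boolean shouldSuggestSwitching(List<String> rankedBugIds) {
            List<String> newTop = new ArrayList<String>(
                    rankedBugIds.subList(0, Math.min(TOP_N, rankedBugIds.size())));
            boolean stable = false;
            if (previousTop != null && !previousTop.isEmpty()) {
                int unchanged = 0;
                for (String id : newTop) {
                    if (previousTop.contains(id)) {
                        unchanged++;
                    }
                }
                stable = unchanged >= STABILITY_THRESHOLD * previousTop.size();
            }
            previousTop = newTop;
            return stable;
        }
    }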

Asked to rate the intuitiveness of the adaptive ranking mode, responses were a bit cooler. Two of the four were positive, while the remaining two rated it as "Neither good nor bad". In terms of the interface, comments included that multiple selection would be useful on the bug tree, as would keyboard shortcuts for setting the status (which is available) and for refreshing (which isn't). Further comments mentioned how many similar types of bugs were displayed repeatedly. One comment mentioned that the purpose of some of the displayed information was not clear; no specifics were given, but from further discussion I believe this referred to the exact figure for the learning value being shown. The figure by itself is pretty meaningless to users, and it may be better to remove it. Another interesting comment was that it was not immediately obvious that learning mode and feedback rank mode are best used in conjunction, because it is not clear that when you classify a bug in either mode this is reflected in the other modes, and indeed saved to the overall analysis.

I don't think there is a single issue raised which is not fixable. Some are easy (enabling multi-select would require a refactor, but is doable), others are trickier, but in my opinion a great way to address the less tangible ones would be to demonstrate usage with a screencast of the application, with narration to explain what's happening.
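On the multi-select and refresh-shortcut suggestions, the underlying Swing calls are straightforward; the refactoring work would be in wiring them into the existing bug tree. A rough sketch, with placeholder names rather than the actual Findbugs GUI classes:

    import java.awt.event.ActionEvent;
    import javax.swing.AbstractAction;
    import javax.swing.JComponent;
    import javax.swing.JTree;
    import javax.swing.KeyStroke;
    import javax.swing.tree.TreeSelectionModel;

    public class BugTreeTweaks {

        static void enableMultiSelectAndRefresh(JTree bugTree, final Runnable refreshAnalysis) {
            // Allow several bugs to be selected at once, so a designation
            // could be applied to all of them in one go.
            bugTree.getSelectionModel()
                   .setSelectionMode(TreeSelectionModel.DISCONTIGUOUS_TREE_SELECTION);

            // Bind F5 to a refresh action while the tree is in a focused window.
            bugTree.getInputMap(JComponent.WHEN_IN_FOCUSED_WINDOW)
                   .put(KeyStroke.getKeyStroke("F5"), "refreshRanking");
            bugTree.getActionMap().put("refreshRanking", new AbstractAction() {
                public void actionPerformed(ActionEvent e) {
                    refreshAnalysis.run();  // placeholder for recalculating the feedback rank
                }
            });
        }
    }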

Following the focus on the adaptive ranking model in isolation, the questionnaire moved on to assessing the enhancements in the overall context of introducing the tool into the developers' workflow. Only one of the four developers had used Findbugs before, but three of the four were either "In favour" or "Strongly in favour" of introducing Findbugs into their current development environment and workflow. The remaining developer was indifferent to the notion. When asked for their opinion on the factors preventing the introduction of Findbugs, the reasons selected included: "Finds too many bugs initially, is overwhelming"; "Reports too many bugs which we don't care about / are low priority"; and "Inspecting alerts is too time consuming". None of the four developers selected "Time / cost to train developers to use Findbugs" or "Findbugs does not find any bugs which are not discovered by other means". Comments also included that Findbugs is most useful when it gives immediate feedback to the developer at the time of writing, and that on a mature codebase the cost/benefit ratio is low, as many bugs are found through testing or customer support calls. These results, although not statistically significant, give the impression that the development team is willing to invest in using Findbugs, if only it could present the results to developers in a way that provided value sooner.

When asked if they felt the adaptive ranking model makes Findbugs more useful, all four developers said yes. Comments included that the learning mode would avoid the need to manually create filters, and that the standard Findbugs mechanisms seemed "tedious" by comparison, particularly on a large code base.

The final question asked them to describe any improvements to the adaptive ranking model which would increase the viability of introducing Findbugs into their workflow. Comments included that the learning would need to be retained over time (this should be possible with the existing functionality for merging analyses). Another comment stated that it would be useful to be able to create filters for the learning mode (this functionality was removed by me, and could be reinstated). Another asked whether it could be part of an automated continuous integration build. Interestingly, a question I asked after the evaluation was: if an automated build could find the feedback rank of any new bugs, would they be comfortable breaking the build over it? The feeling was that, if the ranking became accurate enough over time, then yes, they would like to break the build over it.
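To make that concrete, a build-breaking check could be as small as the sketch below, which scans a Findbugs XML report and fails the build if any bug carries too high a rank. Note that the "feedbackRank" attribute is hypothetical: the adaptive ranking model does not currently write its rank into the report, so this is only an illustration of the shape such a check might take.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Sketch of a build-breaking check over a Findbugs XML report.
    // The "feedbackRank" attribute is hypothetical.
    public class RankGate {

        public static void main(String[] args) throws Exception {
            File report = new File(args[0]);           // e.g. the XML report produced by the CI build
            int threshold = Integer.parseInt(args[1]); // fail on any bug ranked at or above this value

            NodeList bugs = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(report)
                    .getElementsByTagName("BugInstance");

            int failures = 0;
            for (int i = 0; i < bugs.getLength(); i++) {
                Element bug = (Element) bugs.item(i);
                String rank = bug.getAttribute("feedbackRank"); // hypothetical attribute
                if (rank.length() > 0 && Integer.parseInt(rank) >= threshold) {
                    failures++;
                }
            }

            if (failures > 0) {
                System.err.println(failures + " bugs at or above feedback rank " + threshold);
                System.exit(1);  // non-zero exit breaks the build
            }
        }
    }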

Another suggestion was that it would be useful to be able to inspect the call hierarchy involved with a bug; further discussion found that it would also be necessary to browse the type hierarchy to classify a bug. Indeed, it is not possible to browse the code directly within the Findbugs GUI; it is only possible to see the source where the bug is registered. I would suggest that the best way to address this is with better integration with IDEs, as this functionality is likely to already be built in. Further gains could also come from this, such as IDE integration with version control, as commit messages are a rich source of information. It has always been at the front of my mind that the adaptive ranking computation should not depend on the Findbugs GUI, for this very reason.
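As an illustration of the separation I have in mind (the interface and method names here are hypothetical, not part of my current code), the ranking model would sit behind something GUI-agnostic like this, so an IDE plugin or a headless build could drive it just as easily as the Findbugs GUI:

    import edu.umd.cs.findbugs.BugInstance;

    // Hypothetical interface: the ranking model knows nothing about Swing,
    // so any front end (GUI, IDE plugin, headless build) can feed it
    // classifications and read back ranks.
    public interface FeedbackRanker {

        // Record a developer's classification of a bug (e.g. "not a bug",
        // "must fix"), from whichever front end captured it.
        void recordClassification(BugInstance bug, String designation);

        // Current rank of the bug under the adaptive model; a lower value
        // means "show this to the developer sooner".
        int rankOf(BugInstance bug);
    }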

As well as using a questionnaire, I also took notes on the questions I was asked throughout the evaluation. Not only did this allow me to note a couple of bugs the developers found in my code, but it gave me greater insight into the kinds of problems faced by new users. These notes reinforced some of the observations made from the questionnaire: what is the meaning of the learning value? Are classifications made in the learning mode carried over into the rest of the analysis? How long should I use the learning mode? Why does the learning mode keep showing the same types of bug again and again (I'll probably return to this point in a later blog)?

Overall I'd say the evaluation was a success, not only in how well the adaptive ranking I've introduced to Findbugs was received, but also in the quality of the feedback I've obtained from experienced developers.
