Frequent Errors In Scientific Software May Undermine Many Published Results
from the it's-a-bug-not-a-feature dept
It’s a commonplace that software permeates modern society. But it’s less appreciated that increasingly it permeates many fields of science too. The move from traditional analog instruments to digital ones that run software brings with it a new kind of issue. Although analog instruments can be (and usually are) inaccurate to some degree, they don’t have bugs in the same way as digital ones do. Bugs are much more complex and variable in their effects, and can be much harder to spot. A study in the F1000 Research journal by David A. W. Soergel, published as open access using open peer review, tries to estimate just how much of an issue that might be. He points out that software bugs are really quite common, especially for hand-crafted scientific software:
It has been estimated that the industry average rate of programming errors is “about 15-50 errors per 1000 lines of delivered code”. That estimate describes the work of professional software engineers — not of the graduate students who write most scientific data analysis programs, usually without the benefit of training in software engineering and testing. The recent increase in attention to such training is a welcome and essential development. Nonetheless, even the most careful software engineering practices in industry rarely achieve an error rate better than 1 per 1000 lines. Since software programs commonly have many thousands of lines of code (Table 1), it follows that many defects remain in delivered code — even after all testing and debugging is complete.
Not every bug meaningfully affects the result, and a scientist may spot an implausible result before it gets published. To take both of those into account, Soergel uses the following formula to estimate the scale of the problem:
Number of errors per program execution =
total lines of code (LOC)
* proportion executed
* probability of error per line
* probability that the error meaningfully affects the result
* probability that an erroneous result appears plausible to the scientist.
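To get a feel for how quickly those factors multiply up, here is a minimal Python sketch of the formula above. The function name and every input value are illustrative assumptions, not figures taken from this article. The inputs were picked only so that the products line up with the two outcomes quoted below (about two errors per run, and a 5% chance of a wrong output).

```python
def expected_errors(total_loc, proportion_executed, error_rate_per_line,
                    p_affects_result, p_looks_plausible):
    """Expected number of result-changing, unnoticed errors per program run.

    This is just the product of the five factors in the formula above.
    """
    return (total_loc
            * proportion_executed
            * error_rate_per_line
            * p_affects_result
            * p_looks_plausible)


# Hypothetical inputs for a medium-scale analysis (assumed values):
# 100,000 lines, 20% executed, 10 errors per 1000 lines,
# 10% of errors matter, 10% of wrong results still look plausible.
medium = expected_errors(100_000, 0.2, 0.01, 0.1, 0.1)

# Hypothetical inputs for a small, rigorously executed analysis (assumed values):
# 1,000 lines, all executed, 1 error per 1000 lines,
# 10% of errors matter, 50% of wrong results still look plausible.
small = expected_errors(1_000, 1.0, 0.001, 0.1, 0.5)

print(f"medium-scale analysis: {medium:.2f} expected errors per run")
print(f"small focused analysis: {small:.2f} expected errors per run")
```

When the expected count is well below one, as in the second case, it can be read roughly as the probability that a given run produces a wrong output. Changing any single factor by a factor of ten moves the answer by the same amount, which is why the estimate is so sensitive to the guessed values.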
He then considers some different cases. For what he calls a “typical medium-scale bioinformatics analysis”:
we expect that two errors changed the output of this program run, so the probability of a wrong output is effectively 100%. All bets are off regarding scientific conclusions drawn from such an analysis.
Things are better for what he calls a “small focused analysis, rigorously executed”: here the probability of a wrong output is 5%. Soergel freely admits:
The factors going into the above estimates are rank speculation, and the conclusion varies widely depending on the guessed values.
But he rightly goes on to point out:
Nonetheless it is sobering that some plausible values can produce high total error rates, and that even conservative values suggest that an appreciable proportion of results may be erroneous due to software defects — above and beyond those that are erroneous for more widely appreciated reasons.
That’s an important point, and it is likely to become even more relevant as increasingly complex code starts to turn up in scientific apparatus, and researchers routinely write even more programs. At the very least, Soergel’s results suggest that more research needs to be done to explore the issue of erroneous results caused by bugs in scientific software, although it might be a good idea not to use computers for this particular work…