posted 4 years ago
I don't think this will teach you anything, but the hardest-to-debug situation I ever encountered was never solved.
It was one batch of orders which didn't come out right. It looked like one if-statement in one program was always taking the wrong branch, so naturally the results were wrong. Other batches of orders that night worked fine, the problem had never happened before, and in fact it never occurred again. This was a program which we had run for at least 20 years in 20 different warehouses, several times every day. There was no reason why that if-statement should have been defective in that one batch of orders. I just happened to be the one in the office that evening, working late for some reason, and my main task at that point was to reconstruct the orders and get them re-entered so the batch could be run again and the orders could go out.
As I said, we never solved the problem. Our best guess was this: memory hardware contains error-checking and error-correcting mechanisms, because there's always the possibility that a very tiny electrical fluctuation can flip a bit in memory. Those mechanisms catch and correct something like 99.999999% of those errors. (I don't know how many 9's there actually were in the statistics for our systems.) But not 100%. So on any given day there's only a tiny, tiny probability that a random error slips through, but if you run your machines for enough years, sooner or later one almost certainly will. Of course, we could never prove that was what actually happened.
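To put rough numbers on that "not 100%" argument, here is a small sketch in Python. It assumes each bit flip independently escapes correction with probability 1e-8 (my "something like 99.999999%" guess turned around), and the flip counts are made up purely for illustration. The function name `prob_at_least_one_miss` is my own, not anything from our systems.

```python
import math

def prob_at_least_one_miss(p_miss: float, n_flips: int) -> float:
    """Chance that at least one of n_flips bit flips escapes correction.

    Assumes every flip is independent and escapes with probability p_miss.
    Computes 1 - (1 - p_miss) ** n_flips using log1p/expm1, which stays
    accurate when p_miss is very small.
    """
    return -math.expm1(n_flips * math.log1p(-p_miss))

# Assumed miss rate: 1 - 0.99999999 (a guess, not a measured figure).
P_MISS = 1e-8

# Hypothetical totals of bit flips seen over the life of the machines.
for n_flips in (10**4, 10**6, 10**8, 10**9):
    chance = prob_at_least_one_miss(P_MISS, n_flips)
    print(f"{n_flips:>13,} flips: {chance:.4%} chance of an uncorrected error")
```

The point is only the shape of the curve: the chance is negligible for any single flip, but it creeps toward certainty as the number of flips grows, which is what 20 years in 20 warehouses gives you.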