It's January 2012 and I'm sitting on a cross-Atlantic flight. Sweat is beading on my brow and it's nothing to do with the cabin temperature. I am not a happy bunny. I'm a very unhappy bunny and somebody is going to pay.
On this fateful day I'm on my way to Chicago to run an Arm DDT training seminar at NCSA. Having exhausted the list of second-rate in-flight movies, I've started working through our training material in preparation.
It's all been going swimmingly, when suddenly I spot a problem - a major problem. It looks like there's a bug in our just-hit-the-website 3.1 release. The one I'm going to demo at a hands-on workshop. Tomorrow!
The problem is this: in 3.1 we added a fancy new feature called sparklines, which draws a tiny graph next to each variable in the interface, comparing its value across all the processes, instantly. Normally this is really useful, but today it looks... wrong:
The graphs are all corrupt! The graph next to my_rank should be a nice diagonal line, showing that process 0 has a rank of 0 and process 9 has a rank of 9! And p is the size of the job, which should be the same across all processes, but there's some kind of peak in it!
Somebody has broken the build. And tomorrow I'm going to be running a hands-on training session with it. Definitely. Not. Happy.
My first instinct is to raise a positively incandescent bug report. I draft one that starts with "WHO BROKE MY #$@! SPARKLINES?!?!!11", but there's no in-flight WiFi so submitting it has to wait. Instead, I anxiously poke around in the interface to find out how bad the damage is.
The first thing I do is hover my mouse over the sparkline to see the range of values reported:
OK, so there's clearly some junk in there. 1126236160 is definitely not a valid process rank.
That raises the question as to what the values all actually are, so I click once on the sparkline, which brings up a quick cross-process comparison dialog that shows me the actual values across every process:
That's odd, why would just three processes have the same random value? Suddenly, this doesn't feel quite like a problem with Arm DDT any more. I right-click and make a group out of the three processes with the incorrect value and it all drops into place:
I'm not looking at a bug in Arm DDT at all - I'm looking at a bug in the training program. All three of these processes are merrily looping around and around overwriting memory. The type of the tables array is shown underneath the variables list - it's just 12 by 12. Yet these processes are already writing to tables[0][112623621] and beyond! They've trashed the stack, including my_rank, p and a whole lot of other variables. It's a small miracle the program hasn't crashed yet!
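For anyone who wants to see the mechanism without a debugger to hand, here is a minimal sketch in C. It is not the training program: the frame layout, the helper names fill_table and count_corrupt_ranks, and the slot positions are all assumptions made for illustration. A real out-of-bounds write is undefined behaviour, so the sketch models one process's stack frame as a flat array in which the 144 table cells sit directly in front of my_rank and p, and then lets a loop run past the end of the table.

```c
#include <stddef.h>

#define ROWS 12
#define COLS 12
#define TABLE_CELLS (ROWS * COLS)   /* 144 ints belong to the table */

/* Hypothetical stand-in for one process's stack frame: the table's cells
 * come first, followed by the slots that hold my_rank and p. A real
 * compiler is free to lay the stack out differently. */
enum { SLOT_MY_RANK = TABLE_CELLS, SLOT_P, FRAME_SLOTS };

/* Write `value` into `count` consecutive cells starting at tables[0][0].
 * Nothing compares count with TABLE_CELLS, which is the bug being
 * illustrated; the only limit is the size of the simulated frame. */
static void fill_table(int frame[FRAME_SLOTS], size_t count, int value)
{
    for (size_t i = 0; i < count && i < FRAME_SLOTS; ++i)
        frame[i] = value;
}

/* The check a sparkline makes visible at a glance: count the processes
 * whose my_rank slot no longer holds their own rank. */
static int count_corrupt_ranks(int frames[][FRAME_SLOTS], int nprocs)
{
    int bad = 0;
    for (int r = 0; r < nprocs; ++r)
        if (frames[r][SLOT_MY_RANK] != r)
            ++bad;
    return bad;
}
```

Filling exactly TABLE_CELLS cells leaves my_rank and p alone. Ask for two cells more and both are silently replaced with whatever the loop was writing, which is the kind of junk that showed up in my sparklines.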
I look back at the training material. Oh, yes, there we are. Exercise 1: why does the program crash or loop indefinitely when run with 10 processes?
Glancing around to see if anybody has noticed, I delete the outraged bug report from my drafts folder and insert a note into the training material:
"An excellent use of sparklines is spotting memory corruption, even with data on the stack or when memory debugging is turned off."
I glance back at the screen and somewhat grudgingly accept that it's actually pretty cool. The relief is palpable, but I still need a drink. Stewardess!