We have a product built from the t89c51cc01 or 03 chips using C51. Timer 0 does a 10mS interrupt to check a few hardware inputs. Timer 1 runs the serial port. Timer 2 provides an adjustable clock output around 40 KHz. CAN is initialized, but during most of our testing no messages are being transmitted or received.
The program spends most of its time in a measurement loop that can be set anywhere from 20 mS to 1400 mS per iteration, with some of the time extension taken by added measurements and some by timer delays. If the user presses 'm' on a serial terminal, the loop stops while a configuration menu runs.
C51 printf causes us more problems than any other software aspect of this product. Trying to guess what will happen if we change a printf statement, or change something else near a printf statement, seems an exercise in superstition as much as logic.
When I refer to printf, I mean to include sprintf as well. I had hopes that doing an sprintf and then sending the string out with putchar would solve some of our printf problems, but so far the sprintf version has always shown the same failure as the printf version.
I've searched the web and the Keil site, and while <http://www.keil.com/support/docs/867.htm> explains some of our issues, it is far from explaining the spooky ones:
----- ... variable length argument lists must be limited to some predetermined size. By default, we limit that to 15 bytes of arguments for memory models that use DATA memory. The limit is 40 bytes for memory models using XDATA. The MAXARGS compiler directive lets you change the default limit to whatever you desire. Note that since the limit [o]f DATA space is 128 bytes (including register banks and the bit area) you can't just set MAXARGS to 128. -----
We've learned to live with a maximum of 7 ints, 5 string pointers, or 3 long ints or floats per printf. What made this initially confusing is that using more doesn't always fail. We've had printfs that had worked for years suddenly start crashing the entire program because we changed the timing of the loop, or changed a different nearby printf. It also seems like as long as printf prints at least one of its values incorrectly, it doesn't crash. If you manage to get that incorrect display problem fixed, it is very likely the "fixed" printf will crash within ten or maybe a hundred iterations.
The output of printf is apparently "spooled" somehow. When we speed our loop up to 20 mS per iteration, we see characters output at the serial port almost continuously, though we know the actual printf calls complete in very tiny fractions of that total time. Recently we saw a case where this "spooling" repeatably failed:
We added a new prompt in our operator menu: printf("\n 'U' moves +/right, 'D' -/left, 'Q' accepts: "); Even written as: printf("\n 'U' moves +/right, 'D' -/left, "); printf("'Q' accepts: "); it would always be truncated exactly after the 'Q'.
After the function including the above printf exited, its calling function called two more printfs:
ANSI_cls(); // #define ANSI_cls() printf("\x1B[2J") printf("\n --- RUNNING...");
Both of those, which had previously worked fine for years, vanished from the serial output when we added the new prompt.
When that calling function exited, the main loop resumed, and its first printf did still appear properly:
printf("\n\n\nTime(ms)%cSmpl%cCnt%cNDVI%cselected_VI%c", s1, s1, s1, s1, s1);
Adding a delay after the first printf:
printf("\n 'U' moves +/right, 'D' -/left, 'Q' accepts: "); TmrDelay(10);
seemed to solve (or hide) this problem. Sometimes this helps and sometimes it doesn't, and we haven't a clue why.
Sometimes there seems to be a problem with passing the results of calculations to printf. This worked perfectly...
printf("\r%02d <%+05.1f %+05.1f> ", (NDVIsmoothed_16s+5)/10, (FP)(NDVIvalley_16s-target_nom_16s)/10.0, (FP)(NDVIpeak_16s-target_nom_16s)/10.0);
until the result of the third calculation got bigger than about 40000, which would crash the program almost instantly.
By the time it worked for all values, it looked like this:
minF = (FP)(NDVIvalley_16s-target_nom_16s)/10.0; maxF = (FP)(NDVIpeak_16s-target_nom_16s)/10.0; printf("\r% 03.0f <", NDVIsmoothed*100); // this is an FP of NDVIsmoothed_16s printf("%+05.1f ", minF); printf("%+05.1f> ", maxF);
Sometimes no amount of separating or dividing will make a certain printf work:
NDVIdev_16s = (NDVIpeak_16s-NDVIvalley_16s)/2; printf("\n%03d ±%03d ", NDVIsmoothed_16s, NDVIdev_16s);
That crashes with loop times under 400mS, even with the calculation separate.
Same with this sequence:
printf("\n%03d ", NDVIsmoothed_16s); printf("{-%3d +%3d} ", minus16s, plus16s);
Both of those would crash, almost always after printing a single '-' on the new line, within maybe 100 iterations - if the loop time was under 400 mS.
This works in exactly that same spot:
printf("\n% 05.1f ", (FP)NDVIsmoothed_16s/10.0);
Then there is the long series of printfs that form the normal output of the product, that have remained unchanged for many years, and that typically repeat every 100 mS whenever it is on. With a few particular recent hardware instances, if we set the loop time to exactly 50 mS, the program will crash after several minutes of operation. Choose any other loop time and the same device will run forever. Probably 99% of our systems run fine at any speed. We don't know for sure this is caused by printf, but the resemblance seems suggestive.
There must be much more to this than just the variable length argument limit. Could we have an in-depth explanation of the other limitations of C51 printf?
Today I discovered an overrun buffer in one of the putchar sequences adjacent to one of the printf issues in my original post. After the printfs shown in the two following versions - the first of which failed if the result of the third calculation got bigger than about 40000:
and the second of which worked even with the overrun buffer following it:
there was code to configure and print a string:
if ((g_idx >= 0) && (g_idx <= 28)) g_ctr[g_idx] = '>'; // insert more chars at various positions in the buffer... for (i = 0; i <= 28; i++) { putchar(g_ctr[i]); }
The buffer "g_ctr[]" was declared too small for the number of locations we referred to in configuring and printing it. In every case we wrote '-' characters to the same excess length in the buffer, but when the result of the third calculation got bigger than about 40000, we began writing other characters, like '<', '^', or '>', past the declared space.
I don't have the ability to go back to the map files for the failed versions, but the buffer was always in plain data memory, not xdata, and in the version that worked following the second printf form shown above, it was located here:
D:0007H SYMBOL g_idx D:0029H SYMBOL g_ctr
I've gotten pretty good at understanding maps of variables in xdata, but I'm not aware of the fine points of regular 'D' memory. It looks like almost every routine that has any 'D' variables has them at 0007H and 0029H, or in that area, and nobody has any beyond 004FH. Then there comes:
------- MODULE PRINTF D:0050H PUBLIC ?_PRINTF?BYTE D:0050H PUBLIC ?_SPRINTF?BYTE C:7F7AH PUBLIC _PRINTF C:7F74H PUBLIC _SPRINTF ------- ENDMOD PRINTF
At one point the g_ctr buffer was getting characters written from 0 out to 32 (instead of just 28), which would run from 0x29 to 0x50, just exactly enough to step on the "?_PRINTF?BYTE", whatever that is. If the buffer was actually located there in the problem versions, that would be highly suspicious...
Except if the "?_PRINTF?BYTE" is critical to printf, how could overwriting it with '-' hundreds of times be acceptable, but overwriting it with '>' not be?
How could the second form of printf survive having its "?_PRINTF?BYTE" overwritten when the first form clearly couldn't?
And how could overwriting the buffer _after_ the printf had returned control to the following code spoil the printf that should already have been complete? Maybe it killed the _next_ printf? In every case where it crashed, it printed the initial '\r' of the printf, and then a single '-' character. None of the printfs should have printed a '-' there, but the following putchar sequence almost always should have printed one right after the printf returned.
What is the "?_PRINTF?BYTE", anyway? Google took me directly to <http://www.keil.com/support/man/docs/bl51/bl51_ln_mapfile.htm> which shows the item but doesn't explain it to me.
Does this line of exploration possibly explain at least this one printf mystery? It can't be a general explanation, because we hardly have any plain Data variables; almost everything is in xdata.
If you've got a buffer overrun, then you're really into "undefined behaviour" territory - so anything could happen!
Especially with Data Overlaying!