Does anyone have a good hint on how to implement a parity calculation of a byte in the most efficient way on a C51 device?
An excellent example of the cost of portability. Many who are more concerned with 'purity' (which, probably, is a great idea for PC code) end up with high hardware costs, simply because they do not realize that, in small embedded systems, efficiency is the name of the game.
Erik
"excellent example of the cost of portability"
Actually, it's a poor example. Some other example would demonstrate the cost much better. On the non-8051 architecture where this is used, the expression compiles to the same 6-instruction sequence as the hand-optimized assembly.
I am away from my C51 toolchain at the moment and cannot include the compiler-generated assembly, but if you look at http://www.keil.com/support/docs/1619.htm and account for its function parameter and call/return overhead, you will find that the non-portable function and the portable expression versions compare quite closely.
That's why, if you want to shave off a few bytes and cycles, you should write it in assembly, but do it inline within a larger assembly module that uses the parity; otherwise you've still got the overhead of a function call without the benefit.
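For reference, the "portable expression" approach is typically an XOR-folding reduction along these lines (a sketch of the general technique, not necessarily the exact code from the linked note; the function name is mine):

unsigned char odd_parity(unsigned char v)
{
    v ^= v >> 4;      /* fold the upper nibble into the lower nibble */
    v ^= v >> 2;      /* fold bits 3..2 into bits 1..0               */
    v ^= v >> 1;      /* fold bit 1 into bit 0                       */
    return v & 1u;    /* 1 for odd parity, 0 for even                */
}

On a 32-bit RISC target this folds down to roughly half a dozen shift/XOR instructions; on a '51 it costs considerably more, which is what the comparison above is about.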
Dan, you are right; however, I see a possible misunderstanding of your statement and thus, without any malice, I correct it:
"but do it within a larger assembly routine that uses the parity"
On my first read I took 'inline' to mean "inline assembly in a C module".
Once again I agree substantially with what is said above about mixed-mode development and non-portable implementation.
However, there are some things that simply don't justify the cost of a function call. One such thing is the parity function on a '51 derivative.
If you are in C, you can get the parity in 2 machine instructions:
d = 0x54;
if (ACC = d, P)             // if odd parity
{
    d = 0x55;
    par = (ACC = d, P);     // par is 1 for odd, 0 for even.
}
The expression (ACC=value, P) uses the comma operator to guarantee that the accumulator will not be dirty before testing the PSW.P bit. It is an ancient C trick to force the compiler to perform low-level operations in a certain sequence.
Albeit not 'portable', it is ANSI, and that use of the comma operator is guaranteed not to be optimized away or reordered by the compiler.
It produces a parity evaluation in 2 machine instructions. I doubt that this can be done faster in any other implementation.
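To make the trick reusable, it can be wrapped in a function-like macro. Here is a minimal sketch, assuming the Keil C51 <reg51.h> declarations of ACC and P (the PSW parity bit); the macro name ODD_PARITY and the demo function are mine, not part of any library:

#include <reg51.h>                       /* declares sfr ACC and sbit P (PSW.0) */

/* Evaluates to 1 if 'x' has odd parity, 0 if even. Loading ACC updates the
   hardware parity flag P; the comma operator forces the load to happen
   immediately before P is read. */
#define ODD_PARITY(x)   ((ACC = (x)), P)

void demo(void)
{
    unsigned char d = 0x54;              /* three 1-bits: odd parity */
    bit par;

    par = ODD_PARITY(d);                 /* par is 1 here */
}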
The comma operator is perfect for yet another nice trick: updating a 16-bit timer in C, by adding the 16-bit period constant to the timer registers:
///////////////////////////////////////////////////////////////////////////
// The following macros reload the 1ms timer, compensating for the interrupt
// latency to achieve an average period of 1ms, corrected at every period.
//
// The code uses a few C tricks:
// - Accesses the PSW carry directly, to obtain the overflow from an unsigned
//   char addition. We use the comma operator for that, to perform a side-effect
//   addition right before a carry bit test, not allowing the compiler to
//   reorder the code and make the carry dirty.
// - The comma operator is also used to insert padding NOPs before the addition,
//   in the cases where the lower byte of the adjust value is {0,1,2}, because
//   the C51 compiler generates INC instead, or suppresses the operation.
//
// This code was verified with optimization levels from 0 to 9, and the compiler
// does not break it at any of them.
//
// The whole procedure is wrapped in a function-like macro.
//
// This code was verified on a P89C668 @ 18.432MHz; the measured frequency
// deviation was ±20ppm, essentially the CPU crystal tolerance, with the timer
// interrupt latency completely removed.
///////////////////////////////////////////////////////////////////////////
#ifdef ADJ_DELAY
#undef ADJ_DELAY
#endif
#ifdef RELOAD_TIMER0
#undef RELOAD_TIMER0
#endif

#if (((TMR_1MS & 0xff) < (0xff-8)) || ((TMR_1MS & 0xff) > (0xff-6)))
#define ADJ_DELAY 9
#define RELOAD_TIMER0() do { \
    TR0 = 0; \
    if (TL0 += ((TMR_1MS & 0xff) + ADJ_DELAY), CY) TH0++; \
    TH0 += (TMR_1MS >> 8); \
    TR0 = 1; \
} while(0)
#elif ((TMR_1MS & 0xff) == (0xff-6))
#define ADJ_DELAY 10
#define RELOAD_TIMER0() do { \
    TR0 = 0; \
    if (_nop_(), TL0 += ((TMR_1MS & 0xff) + ADJ_DELAY), CY) TH0++; \
    TH0 += (TMR_1MS >> 8); \
    TR0 = 1; \
} while(0)
#elif ((TMR_1MS & 0xff) == (0xff-7))
#define ADJ_DELAY 11
#define RELOAD_TIMER0() do { \
    TR0 = 0; \
    if (_nop_(), _nop_(), TL0 += ((TMR_1MS & 0xff) + ADJ_DELAY), CY) TH0++; \
    TH0 += (TMR_1MS >> 8); \
    TR0 = 1; \
} while(0)
#elif ((TMR_1MS & 0xff) == (0xff-8))
#define ADJ_DELAY 12
#define RELOAD_TIMER0() do { \
    TR0 = 0; \
    if (_nop_(), _nop_(), _nop_(), TL0 += ((TMR_1MS & 0xff) + ADJ_DELAY), CY) TH0++; \
    TH0 += (TMR_1MS >> 8); \
    TR0 = 1; \
} while(0)
#endif
In the timer handler, a call to RELOAD_TIMER0() invokes the macro, which is synthesized for whatever reload value is in use.
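A minimal usage sketch might look like this, assuming the RELOAD_TIMER0() macro above, Keil C51 interrupt syntax, and a hypothetical TMR_1MS reload constant (the names TMR_1MS, ms_ticks and timer0_isr are mine):

#include <reg51.h>                        /* TR0, TL0, TH0, CY                */
#include <intrins.h>                      /* _nop_() used by the macro        */

#define TMR_1MS  (0x10000u - 1536u)       /* hypothetical reload value for a 1ms period */

/* ... RELOAD_TIMER0() macro block from above goes here ... */

volatile unsigned int data ms_ticks;

void timer0_isr(void) interrupt 1         /* Timer 0 overflow vector          */
{
    RELOAD_TIMER0();                      /* reload with latency compensation */
    ms_ticks++;                           /* free-running 1ms tick counter    */
}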
Another use of the comma operator: you can get the 16-bit result of multiplying two unsigned chars without any call to the math library:
unsigned char data a, b;
unsigned int data x;

a = 36;
b = 130;

// this is 400% faster than:  x = (unsigned int)a * b
*(unsigned char*)&x = ((((unsigned char*)&x)[1] = a * b), B);
I am just being lazy here: you can make the int variable a union to access the MSB and LSB parts, which improves readability, but the generated code is the same: 11 instruction cycles against 40 cycles using the 16-bit cast.
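A sketch of that union version, under the same assumptions (Keil C51, which stores the high byte of an int at the lower address, and the B register declared in <reg51.h>); the type and member names are mine:

#include <reg51.h>               /* declares sfr B */

typedef union {
    unsigned int  u16;           /* whole 16-bit result                  */
    struct {
        unsigned char msb;       /* Keil C51 stores the high byte first  */
        unsigned char lsb;
    } u8;
} word16_t;

unsigned char data a, b;
word16_t data x;

void demo(void)
{
    a = 36;
    b = 130;

    /* MUL AB leaves the low byte of the product in ACC and the high
       byte in B; the comma operator stores the low byte first, then
       picks up B before anything can clobber it. */
    x.u8.msb = ((x.u8.lsb = a * b), B);

    /* x.u16 now holds a * b (4680) */
}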