Hello everyone,
I thought I'd share some experiments and observations I have made recently. If any Keil staff are reading this, here's a great opportunity to improve your product (i.e. make it generate smaller and faster code).
I am working with an AT91SAM7S, which features early termination of multiplication instructions: a multiplication takes between 2 and 5 CPU cycles depending on the number of significant bits of the second multiplicand (Rs in the ARM architecture manual, which already mentions the possibility of this feature and states that early termination must be implemented using Rs and not using Rm).
The test program is fairly simple: it multiplies a signed integer variable (32 bit) by a constant with few significant bits, either 170, which fits in 8 bits, or 45839 (0xB345), which fits in 16 bits.
volatile signed int vol_var1 = 0x12345678;
volatile signed int vol_var2 = 0x98765432;
volatile signed int vol_var3 = 0xDEADBEEF;
volatile signed int vol_var4 = 0xDECAFBAD;

int main(void)
{
    const signed int multiplicand = 170;

    vol_var1 = multiplicand * vol_var1;
    vol_var2 = multiplicand * vol_var2;
    vol_var3 = multiplicand * vol_var3;
    vol_var4 = multiplicand * vol_var4;

    vol_var1 = vol_var1 * multiplicand;
    vol_var2 = vol_var2 * multiplicand;
    vol_var3 = vol_var3 * multiplicand;
    vol_var4 = vol_var4 * multiplicand;

    for(;;);
    return(0);
}
What I have observed is that if multiplicand is 170, the compiler performs the multiplications with three shift/add operations (taking 3 cycles), even though a regular MUL would only take two cycles (and one register, which could be re-used in each multiplication).
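For illustration, here is one way a multiplication by 170 can be broken into three shift/add steps, using 170 = 2 * 5 * 17. This is only a sketch: the function name is made up, the compiler's actual instruction sequence may differ, and unsigned arithmetic is used so the wrap-around is well defined (the low 32 bits are the same as for the signed case).

```c
#include <stdint.h>

/* Hypothetical helper: x * 170 as three shift/add steps.
   Each step corresponds to one ARM data-processing instruction
   with a shifted operand. */
uint32_t mul170_shift_add(uint32_t x)
{
    uint32_t x5  = x + (x << 2);    /* x * 5   = x + 4x        */
    uint32_t x85 = x5 + (x5 << 4);  /* x * 85  = 5x + 16 * 5x  */
    return x85 << 1;                /* x * 170 = 2 * 85x       */
}
```

Each of the three steps costs one cycle, which is where the 3 cycles come from, compared to 2 cycles for a MUL with 170 in Rs.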
If multiplicand is 0xB345, the compiler loads the value into R1 and then multiplies with "MUL R2, R1, R2", which uses the input variable, rather than the constant, to determine whether early termination is possible. This means that the multiplication takes between 2 and 5 cycles (biased towards 5 if the value of the variable is random, which is what the compiler should assume), instead of a constant 3 cycles.
I have found no way of influencing the compiler's behavior in this case, so the only way to use the chip optimally would be to do the multiplications in assembly.
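The cycle counts above can be modelled with a small C function. This is a sketch under an assumption: that early termination follows the rule I know from the ARM7TDMI documentation, where the multiply finishes sooner when the upper 24, 16 or 8 bits of Rs are all zeros or all ones. The function names are made up for illustration.

```c
#include <stdint.h>

/* Returns nonzero if all bits of rs from position 'shift' upwards
   are identical (all zeros or all ones). */
static int upper_bits_uniform(uint32_t rs, unsigned shift)
{
    uint32_t upper    = rs >> shift;
    uint32_t all_ones = 0xFFFFFFFFu >> shift;
    return upper == 0u || upper == all_ones;
}

/* Assumed model of MUL timing with early termination:
   2 to 5 cycles depending on the value in Rs. */
int mul_cycles(uint32_t rs)
{
    if (upper_bits_uniform(rs, 8))  return 2;  /* fits in 8 bits  */
    if (upper_bits_uniform(rs, 16)) return 3;  /* fits in 16 bits */
    if (upper_bits_uniform(rs, 24)) return 4;  /* fits in 24 bits */
    return 5;                                  /* full 32 bits    */
}
```

With this model, mul_cycles(0xB345) is 3, while each of the four initial values in the test program gives 5, so putting the constant in Rs would be the better choice here.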
> First of all, I cannot reproduce this.
I can reproduce this. It seems that Keil's version of armcc does generate three instructions when -Otime is specified.
Regards
Marcus
http://www.doulos.com/arm/