Hello everyone,
I thought I'd share some information about some experiments and observations I have made recently. If any Keil staff is reading this: Here's a great opportunity to improve your product (i.e. make it generate smaller and faster code).
I am working with an AT91SAM7S, which features early termination of multiplication instructions: a multiplication takes between 2 and 5 CPU cycles, depending on the number of significant bits of the second multiplicand. That operand is Rs in the ARM architecture manual, which already mentions the possibility of this feature and states that early termination must be implemented using Rs and not Rm.
The test program is fairly simple: it multiplies a signed 32-bit integer variable by a constant with few significant bits, either 170, which fits in 8 bits, or 45839 (0xB345), which fits in 16 bits.
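To make the early-termination rule concrete, here is a minimal C sketch that models the cycle count from the value held in Rs. It assumes the thresholds usually documented for the ARM7TDMI core (the multiply finishes early when the upper 24, 16, or 8 bits of Rs are all zeros or all ones); the helper name mul_cycles is made up for illustration, and nothing here was measured on the AT91SAM7S.

```c
#include <stdint.h>

/* Hypothetical helper: estimated cycle count of "MUL Rd, Rm, Rs" on an
 * ARM7TDMI-style core, given the value held in Rs.
 * Assumption: the multiplier terminates early when the upper 24, 16 or
 * 8 bits of Rs are all zeros or all ones. */
unsigned mul_cycles(uint32_t rs)
{
    uint32_t top24 = rs >> 8;   /* bits [31:8]  */
    uint32_t top16 = rs >> 16;  /* bits [31:16] */
    uint32_t top8  = rs >> 24;  /* bits [31:24] */

    if (top24 == 0u || top24 == 0xFFFFFFu) return 2u;
    if (top16 == 0u || top16 == 0xFFFFu)   return 3u;
    if (top8  == 0u || top8  == 0xFFu)     return 4u;
    return 5u;
}
```

Under this model, mul_cycles(170) gives 2 and mul_cycles(0xB345) gives 3, while a value such as 0x12345678 gives 5.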
volatile signed int vol_var1 = 0x12345678;
volatile signed int vol_var2 = 0x98765432;
volatile signed int vol_var3 = 0xDEADBEEF;
volatile signed int vol_var4 = 0xDECAFBAD;

int main(void)
{
    const signed int multiplicand = 170;

    vol_var1 = multiplicand * vol_var1;
    vol_var2 = multiplicand * vol_var2;
    vol_var3 = multiplicand * vol_var3;
    vol_var4 = multiplicand * vol_var4;

    vol_var1 = vol_var1 * multiplicand;
    vol_var2 = vol_var2 * multiplicand;
    vol_var3 = vol_var3 * multiplicand;
    vol_var4 = vol_var4 * multiplicand;

    for(;;);
    return(0);
}
What I have observed is that if multiplicand is 170, the compiler performs the multiplications with three shift/add operations (taking 3 cycles), even though a regular MUL would only take two cycles (and one register, which could be re-used in each multiplication).
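For readers wondering what a three-operation shift/add sequence for 170 can look like, here is one possible decomposition, using 170 = 2 * 5 * 17. This is only a sketch of the technique, not the compiler's actual output; the function name is made up for illustration, and unsigned arithmetic is used so that overflow wraps in a well-defined way.

```c
#include <stdint.h>

/* Hypothetical illustration: multiply by 170 with three shift/add steps.
 * Each step corresponds to one ARM data-processing instruction with a
 * shifted second operand, e.g. ADD Rd, Rn, Rn, LSL #2. */
uint32_t mul170_shift_add(uint32_t x)
{
    uint32_t t = x + (x << 2);  /* t = 5 * x            */
    t = t + (t << 4);           /* t = 17 * 5 * x = 85x */
    return t << 1;              /* 2 * 85x = 170x       */
}
```

The result matches x * 170u for every 32-bit input, because both wrap modulo 2^32.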
If multiplicand is 0xB345, the compiler loads the value into R1 and then multiplies with "MUL R2, R1, R2". That puts the input variable, not the constant, in Rs, so the variable's value decides whether early termination is possible. The multiplication therefore takes between 2 and 5 cycles (biased towards 5 if the value of the variable is random, which is what the compiler should assume) instead of a constant 3 cycles.
I have found no way of influencing the compiler's behavior in this case, so the only way to use the chip optimally would be to do the multiplications in assembly.
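To put a number on "biased towards 5", the following sketch computes the expected cycle count when Rs holds a uniformly random 32-bit value. It assumes the usual ARM7TDMI rule (early termination when the upper 24, 16, or 8 bits of Rs are all zeros or all ones); the function name is hypothetical and the figures are a model, not a measurement.

```c
#include <stdint.h>

/* Hypothetical helper: expected MUL cycle count for a uniformly random
 * 32-bit value in Rs, assuming early termination when the upper 24, 16
 * or 8 bits of Rs are all zeros or all ones. */
double expected_mul_cycles_random_rs(void)
{
    const double total = 4294967296.0;               /* 2^32 possible values  */
    const double n2 = 2.0 * 256.0;                   /* upper 24 bits uniform */
    const double n3 = 2.0 * 65536.0 - n2;            /* upper 16 bits uniform */
    const double n4 = 2.0 * 16777216.0 - (n2 + n3);  /* upper 8 bits uniform  */
    const double n5 = total - (n2 + n3 + n4);        /* no early termination  */

    return (2.0 * n2 + 3.0 * n3 + 4.0 * n4 + 5.0 * n5) / total;
}
```

Under these assumptions the expectation comes out at roughly 4.99 cycles, so a random variable in Rs practically always costs the full 5 cycles, compared with 3 cycles when 0xB345 is in Rs.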
I'm not a member of Keil's staff, but I'll give it a try anyway...
First of all, I cannot reproduce this. Both RealView Compilers that I use[1] produce MUL instructions.
I see two different aspects of this:
The compiler sees that "multiplicand" is a 16-bit type. It doesn't care whether the value would fit in 8 bits or not. Multiplying (MUL) a 16-bit value would take three cycles worst case to execute on an ARM7TDMI. That is no different in execution time from three single-cycle instructions.
It is arguable just "how constant" a constant actually is. One thing compilers agree about is that a constant cannot be assigned at run-time. Whether the initial value can always be relied upon or not is not defined. Think of a "const uint32_t fwcrc = 0x0L" that will be altered after linking the firmware.
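If a constant really is meant to be patched after linking, one common way to say so in C is to declare it const volatile, so that the compiler has to read it from memory instead of folding the initializer into the code. A minimal sketch, reusing the fwcrc name from above; the function name is hypothetical:

```c
#include <stdint.h>

/* Assumption for this sketch: an external tool overwrites this value in
 * the image after linking. "volatile" forces a real load on every use. */
const volatile uint32_t fwcrc = 0x0u;

/* Hypothetical function: the multiplication cannot be folded at compile
 * time, because the value of fwcrc has to be read at run time. */
uint32_t scale_by_fwcrc(uint32_t x)
{
    return x * fwcrc;
}
```

With a plain const int or a literal, as in the test program, the compiler is free to treat the value as known at compile time.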
Regards Marcus http://www.doulos.com/arm/
Footnotes:
[1] ARM/Thumb C/C++ Compiler with , RVCT3.1 [Build 942] for uVision
    ARM C/C++ Compiler, RVCT4.0 [Build 471]
> First of all, I cannot reproduce this.
I can reproduce this. It seems that Keil's version of armcc does generate three instructions, when -Otime is specified.
At what optimization setting? I would assume that on low optimization settings, the compiler always uses a MUL to stick as closely to the C code as possible.
You should be able to observe the second behavior, though, where the compiler uses the variable as the second multiplicand and the constant as the first.
> The compiler sees that "multiplicand" is a 16bit type.
Actually, since I didn't specify otherwise in the program, it should see it as an int, which is a 32-bit type.
However, looking at the compiler elaborately constructing the multiplication with a "small" constant from shifts and adds, I would assume that the compiler looks closely at the constant to determine the "best" approach for calculating the multiplication, but does not take the early termination feature into account.
> It is arguable just "how constant" a constant actually is.
The compiler does not behave differently if the constant is given as a literal, which is pretty much as constant as it can get.
> Actually, since I didn't specify otherwise in the program, it should see it as an int, which is a 32-bit type.
Of course. So just looking at the type, the MUL would take five cycles, which might still end up better than three core cycles plus instruction fetches.
I agree that this is not ideal. In any case, newer RealView compiler versions do generate better code in this case.