Hello everyone,
I thought I'd share some information about some experiments and observations I have made recently. If any Keil staff is reading this: Here's a great opportunity to improve your product (i.e. make it generate smaller and faster code).
I am working with an AT91SAM7S, whose ARM7TDMI core features early termination of multiplication instructions: a multiplication takes between 2 and 5 CPU cycles depending on the number of significant bits in the second operand (Rs in the ARM architecture manual, which mentions the possibility of this feature and states that early termination must be based on Rs, not on Rm).
The test program is fairly simple: it multiplies a signed 32-bit integer variable by a constant with few significant bits (either 170, which fits in 8 bits, or 45893 (0xB345), which fits in 16 bits).
volatile signed int vol_var1 = 0x12345678;
volatile signed int vol_var2 = 0x98765432;
volatile signed int vol_var3 = 0xDEADBEEF;
volatile signed int vol_var4 = 0xDECAFBAD;

int main(void)
{
    const signed int multiplicand = 170;

    vol_var1 = multiplicand * vol_var1;
    vol_var2 = multiplicand * vol_var2;
    vol_var3 = multiplicand * vol_var3;
    vol_var4 = multiplicand * vol_var4;

    vol_var1 = vol_var1 * multiplicand;
    vol_var2 = vol_var2 * multiplicand;
    vol_var3 = vol_var3 * multiplicand;
    vol_var4 = vol_var4 * multiplicand;

    for (;;)
        ;
    return 0;
}
What I have observed is that if multiplicand is 170, the compiler performs each multiplication with three shift/add operations (taking 3 cycles), even though a regular MUL would take only two cycles (and use one register, which could be reused for each multiplication).
If multiplicand is 0xB345, the compiler loads the value into R1 and then multiplies with "MUL R2, R1, R2", so early termination is determined by the input variable rather than the constant. This means that the multiplication takes between 2 and 5 cycles (biased towards 5 if the value of the variable is random, which is what the compiler should assume), instead of a constant 3 cycles.
I have found no way of influencing the compiler's behavior in this case, so the only way to use the chip optimally is to do the multiplications in assembly.
First of all, I cannot reproduce this. Both RealView Compilers that I use[1] produce MUL instructions.
At what optimization setting? I would assume that on low optimization settings, the compiler always uses a MUL to stick as closely to the C code as possible.
You should be able to observe the second behavior, though, where the compiler uses the variable as the second multiplicand and the constant as the first.
The compiler sees that "multiplicand" is a 16-bit type.
Actually, since I didn't specify otherwise in the program, it should see it as an int, which is a 32-bit type.
However, looking at the compiler elaborately constructing the multiplication with a "small" constant from shifts and adds, I would assume that the compiler looks closely at the constant to determine the "best" approach for calculating the multiplication, but does not take the early termination feature into account.
It is arguable just "how constant" a constant actually is.
The compiler does not behave differently if the constant is given as a literal, which is pretty much as constant as it can get.
> Actually, since I didn't specify otherwise in the
> program, it should see it as an int, which is a 32-bit
> type.
Of course. So just looking at the type, the MUL would take five cycles. Which might still end up better than three core cycles plus instruction fetches.
I agree that this is not ideal. In any case, newer RealView compiler versions do generate better code in this case.
Regards,
Marcus
http://www.doulos.com/arm/