Is there any way the RealView compiler can take advantage of the multiply-and-accumulate instructions of the Cortex M3?
I wrote a simple MAC loop and the compiler didn't generate any SMLAL or UMLAL instructions, which was disappointing.
Thanks, Andrew Queisser HP
Yes, the compiler does generate SMLAL/UMLAL instructions.
Simple test code (UMLAL):
unsigned long long mac_test (unsigned long *a, unsigned long *b, int cnt) {
  unsigned long long res = 0;

  while (cnt--) {
    res += (unsigned long long)*a++ * (unsigned long long)*b++;
  }
  return (res);
}
Compiler output:
mac_test PROC
;;;1    unsigned long long mac_test (unsigned long *a, unsigned long *b, int cnt) {
000000  b570      PUSH     {r4-r6,lr}
000002  4603      MOV      r3,r0
000004  460c      MOV      r4,r1
000006  2000      MOVS     r0,#0
000008  4601      MOV      r1,r0
;;;2      unsigned long long res = 0;
;;;3
;;;4      while (cnt--) {
00000a  e005      B        |L1.24|
                  |L1.12|
;;;5        res += (unsigned long long)*a++ * (unsigned long long)*b++;
00000c  cb20      LDM      r3!,{r5}
00000e  cc40      LDM      r4!,{r6}
000010  fba56506  UMULL    r6,r5,r5,r6
000014  1830      ADDS     r0,r6,r0
000016  4169      ADCS     r1,r1,r5
                  |L1.24|
000018  1e52      SUBS     r2,r2,#1              ;4
00001a  d2f7      BCS      |L1.12|
;;;6      }
;;;7      return (res);
;;;8    }
00001c  bd70      POP      {r4-r6,pc}
                  ENDP
000010  fba56506  UMULL    r6,r5,r5,r6
000014  1830      ADDS     r0,r6,r0
000016  4169      ADCS     r1,r1,r5
Ironically, this compiler output doesn't use UMLAL at all: the accumulation is done with UMULL followed by ADDS/ADCS. With my local versions of armcc (RVCT4.0 [Build 677], RVCT4.0 [Build 821]), however, UMLAL is indeed generated.
Regards Marcus http://www.doulos.com/arm/
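For anyone comparing the two code shapes: the UMULL/ADDS/ADCS sequence in the listing and a single UMLAL produce the same 64-bit accumulator value, so the difference is code size and cycles, not the result. The following is a host-side C model of both, written only to illustrate that equivalence. It is not compiler output, and the function names are made up for this sketch.

```c
#include <stdint.h>

/* Model of the three-instruction sequence from the listing.
 * The 64-bit accumulator lives in two 32-bit halves (lo, hi),
 * like r0/r1 in the generated code. */
void umull_adds_adcs(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)a * b;        /* UMULL: 32x32 -> 64 */
    uint32_t plo  = (uint32_t)prod;
    uint32_t phi  = (uint32_t)(prod >> 32);

    uint32_t sum_lo = *lo + plo;            /* ADDS: add low words */
    uint32_t carry  = (sum_lo < plo);       /* carry out of the low add */

    *lo = sum_lo;
    *hi = *hi + phi + carry;                /* ADCS: add high words plus carry */
}

/* Model of UMLAL RdLo,RdHi,Rn,Rm: {RdHi:RdLo} += Rn * Rm */
void umlal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t acc = ((uint64_t)*hi << 32) | *lo;

    acc += (uint64_t)a * b;
    *lo = (uint32_t)acc;
    *hi = (uint32_t)(acc >> 32);
}
```

Both functions update the accumulator identically, including when the low-word addition carries into the high word.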
Hi Marcus,
Thanks for the tip - I'm using armcc 4.1 Build 561 from the Keil UV4 installation. What are the command line options you use to see the UMLAL instructions?
Thanks, Andrew
You mean other than "--cpu=cortex-m3"? Nothing, in the case of my quick-and-dirty test. I think the compiler selects "-O2 -Ospace" by default.
Regarding your statement
> Once I've forced the compiler to generate close to what I want, in this case the MAC
> instructions, I throw away the C-code and tweak the assembly.
May I ask why? The RealView compiler is fairly good at generating very efficient code. Implementing things in C increases the chance that you will remember what you did two weeks from now. I don't find many places where I could have outsmarted the compiler.
-- Marcus
>> Once I've forced the compiler to generate close to what I want, in this case the MAC
>> instructions, I throw away the C-code and tweak the assembly.

> May I ask why? The RealView compiler is fairly good at generating very efficient code.
> Implementing things in C increases the chance that you will remember what you did two
> weeks from now. I don't find many places where I could have outsmarted the compiler.
Exactly. In this particular case, if the compiler uses the MAC instructions, I'm happy. Otherwise I might grudgingly resort to assembly, but only if our profiling shows that optimizing this particular operation is worthwhile. Since our application is very power-sensitive, we want to get the RMS calculation done as quickly as possible so we can put the CPU back to sleep.
Andrew
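As a rough illustration of the RMS calculation mentioned above, here is a minimal C sketch whose inner loop is a signed 32x32 -> 64 multiply-accumulate, the pattern SMLAL is designed for. Whether armcc actually emits SMLAL for it depends on the compiler build and options, as this thread shows. The function names, the use of `stdint.h` types, and the integer square root are assumptions made for the example, not code from the original application.

```c
#include <stdint.h>

/* Sum of squares with a 64-bit accumulator.
 * Assumption: the samples are small enough (e.g. ADC readings) that
 * the running sum cannot overflow a signed 64-bit value. */
int64_t sum_of_squares(const int32_t *x, unsigned n)
{
    int64_t acc = 0;

    while (n--) {
        int32_t s = *x++;
        acc += (int64_t)s * s;      /* 32x32 -> 64 MAC, an SMLAL candidate */
    }
    return acc;
}

/* Bit-by-bit integer square root: returns floor(sqrt(v)). */
uint32_t isqrt64(uint64_t v)
{
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    while (bit > v) {
        bit >>= 2;
    }
    while (bit != 0) {
        if (v >= res + bit) {
            v -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}

/* RMS = sqrt(mean of squares), truncated to an integer. */
uint32_t rms(const int32_t *x, unsigned n)
{
    if (n == 0) {
        return 0;
    }
    return isqrt64((uint64_t)sum_of_squares(x, n) / n);
}
```

Keeping the accumulation in its own small function makes it easy to inspect the generated listing for that loop alone and to profile it before deciding whether hand-written assembly is worth the effort.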