At first I was very happy to see that with ARMCC 5 the inline assembler now works in Thumb mode on the Cortex-M4.
However, I cannot manage to write an efficient inline assembler routine for 64-bit addition (later I want to add overflow checking, which is why I need inline assembly).
If I use the following function in C++:
    __forceinline int satAdd( int a, int b)
    {
        int c;
        __asm
        {
            ADDS c, a, b
        }
        return c;
    }
This generates very nice and compact code (just one assembly line, as expected).
But I have no clue how to do this with 64 bits.
In GNU CC inline assembler, this would be very easy:
    ADDS %2, %1, %0
    ADCS %d2, %d1, %d0
I am afraid there is no way to access the upper register of a 64-bit register pair in Keil C++ (something like "%d2")?
(Even accessing the lower register of the pair seems to be impossible, because the Keil inline assembler appears to type-check the variables used in the assembly part, so a 64-bit variable cannot simply be used as an operand of ADDS.)
Furthermore, the inline assembler crashes if I try to check the overflow flag with BVS/BVC (BEQ/BNE/BCS/BCC/BMI/BPL all work ...).
SOS - I hope somebody can help here.
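To make clear what the two-instruction ADDS/ADCS sequence is expected to compute, here is a minimal portable C++ sketch that models it on two separate 32-bit halves. The names Halves, split64, join64 and addHalves are hypothetical and only serve this illustration. It assumes a two's complement target, and it is a model of the arithmetic, not Keil inline assembler code.

```cpp
#include <stdint.h>

// Hypothetical helper type: the two 32-bit halves of a 64-bit value,
// mirroring a register pair such as R0 (low word) and R1 (high word).
struct Halves {
    uint32_t lo;
    uint32_t hi;
};

// Split a signed 64-bit value into its low and high 32-bit words.
inline Halves split64(int64_t v)
{
    const uint64_t u = static_cast<uint64_t>(v);
    Halves h;
    h.lo = static_cast<uint32_t>(u);
    h.hi = static_cast<uint32_t>(u >> 32);
    return h;
}

// Join the two words again (assumes two's complement conversion).
inline int64_t join64(Halves h)
{
    const uint64_t u = (static_cast<uint64_t>(h.hi) << 32) | h.lo;
    return static_cast<int64_t>(u);
}

// Model of "ADDS lo, lo, lo" followed by "ADCS hi, hi, hi".
// 'overflow' receives what the V flag would hold after the ADCS.
inline Halves addHalves(Halves a, Halves b, bool& overflow)
{
    Halves r;
    r.lo = a.lo + b.lo;                               // ADDS: add low words
    const uint32_t carry = (r.lo < a.lo) ? 1u : 0u;   // carry out of the low add
    r.hi = a.hi + b.hi + carry;                       // ADCS: add high words plus carry
    // Signed overflow: both operands have the same sign, the result does not.
    overflow = (((a.hi ^ r.hi) & (b.hi ^ r.hi)) >> 31) != 0;
    return r;
}
```

In this model the carry travels from the low word into the high word, and only the high-word addition decides the signed overflow, which is the flag that BVS/BVC would test.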
Why not do it in an assembler module?
Erik
When the compiler inlines, it can skip the prologue/epilogue code.
Inline would indeed be very helpful, because this 64-bit addition consists of only 2 assembly instructions (3 if I add the BVS check), so it is just too short to justify a function call.
With good inline assembler support, this can be done very fast.
If I do it in an assembly function (I am trying the "embedded assembly" feature, i.e. declaring an assembly function in my C++ module), then it basically works now with the following code:
__asm long long addSat( long long a, long long b)
{
        ADDS    R0, R0, R2
        ADCS    R1, R1, R3
        BVC     AllOk
        BMI     Oflw
        // underflow into pos. number range: limit to 0x8000...
        MOV     R0, #0
        MOV     R1, #1
        LSLS    R1, R1, #31
        B       AllOk
Oflw
        // overflow into neg. number range: limit to 0x7FFFF...
        MOV     R0, #0
        SUBS    R0, #1
        LSRS    R1, R0, #1
AllOk
        BX      LR
}
However, if the addition does not overflow, this is a terrible waste of processor time. The processor needs at least two branches (the call into the function and the return from it). Counting only these, we have 2-4 cycles of overhead for a 3-cycle addition, i.e. an overhead of about 100%.
If we also take into account that the compiler needs some extra work to force the variables into R0-R1 and R2-R3, the overhead becomes much larger. On top of that, branching usually costs quite a bit of additional time on a modern processor with a prefetch queue.
This is really annoying if you want to do saturation-safe addition several times in a time-critical loop. (Inline assembly would usually handle this MUCH more efficiently.)
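For comparison, the same saturation behaviour can be written as a small inlinable function in portable C++. The following is only a sketch: the name addSatPortable is hypothetical, a two's complement target is assumed, and whether ARMCC 5 actually reduces it to an ADDS/ADCS pair plus a conditional branch has not been verified here. It can at least serve as a reference for testing the assembly version.

```cpp
#include <stdint.h>

// Hypothetical name: portable reference model of a saturating 64-bit addition.
static inline int64_t addSatPortable(int64_t a, int64_t b)
{
    const int64_t kMax = 0x7FFFFFFFFFFFFFFFLL;
    const int64_t kMin = -kMax - 1;

    // Add as unsigned so that wraparound is well defined in C++.
    const uint64_t ua  = static_cast<uint64_t>(a);
    const uint64_t ub  = static_cast<uint64_t>(b);
    const uint64_t sum = ua + ub;

    // Signed overflow happens only if both operands have the same sign
    // and the sign of the wrapped result differs from it.
    if ((((ua ^ sum) & (ub ^ sum)) >> 63) != 0) {
        // Two negative operands underflowed: clamp to the minimum.
        // Two non-negative operands overflowed: clamp to the maximum.
        return (a < 0) ? kMin : kMax;
    }

    // No overflow: the wrapped sum is the exact result
    // (conversion assumes two's complement).
    return static_cast<int64_t>(sum);
}
```

Because the function is visible to the compiler, there is no mandatory call and return, and the operands do not have to be moved into R0-R3 first.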