At first I was very happy to see that, with ARMCC 5, the inline assembler now works in Thumb mode on the Cortex-M4.
However, I cannot manage to write an efficient inline assembler routine for 64-bit addition (later I want to add overflow checking, which is why I need inline assembly).
If I use the following function in C++:
__forceinline int satAdd( int a, int b)
{
    int c;
    __asm
    {
        ADDS c, a, b
    }
    return c;
}
This generates very nice and compact code (just one assembly line, as expected).
But I have no clue how to do this with 64-bit values.
In GNU CC inline assembler, this would be very easy:
ADDS %2, %1, %0
ADCS %d2, %d1, %d0
I am afraid there is no way to access the upper register of a 64-bit register pair in Keil C++ (something like "%d2")?
(Even accessing the lower register of the register pair seems to be impossible, because the Keil inline assembler appears to type-check the variables used in the assembly part, so it is not possible to simply use the 64-bit variable in an ADDS instruction.)
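To make explicit what the ADDS/ADCS pair is meant to compute, here is a minimal portable C++ sketch that adds two 64-bit values as 32-bit halves with a manually propagated carry. The function name `add64ViaHalves` is hypothetical and only illustrates the arithmetic; it is not a Keil or GCC facility, and it does not address the question of how to name the two registers in inline assembly.

```cpp
#include <cstdint>

// Hypothetical helper (not part of any toolchain): models an ADDS/ADCS pair
// by adding the low words first and feeding the carry into the high words.
inline std::uint64_t add64ViaHalves(std::uint64_t a, std::uint64_t b)
{
    const std::uint32_t aLo = static_cast<std::uint32_t>(a);
    const std::uint32_t aHi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t bLo = static_cast<std::uint32_t>(b);
    const std::uint32_t bHi = static_cast<std::uint32_t>(b >> 32);

    // ADDS: low words; unsigned arithmetic wraps modulo 2^32.
    const std::uint32_t lo = aLo + bLo;
    // The carry flag is set exactly when the low sum wrapped around.
    const std::uint32_t carry = (lo < aLo) ? 1u : 0u;
    // ADCS: high words plus the carry from the low words.
    const std::uint32_t hi = aHi + bHi + carry;

    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

The comparison `lo < aLo` stands in for the carry flag, which is the piece of state that the second instruction has to pick up from the first.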
Furthermore, the inline assembler crashes if I try to check the overflow flag with BVS / BVC (BEQ / BNE / BCS / BCC / BMI / BPL all work).
SOS - I hope somebody can help here?
Why not do it in an assembler module?
Erik
When the compiler inlines, it can skip the prologue/epilogue code.
Inlining would indeed be very helpful, because this 64-bit addition consists of only two assembly instructions (three with the BVS check), so it is simply too short to justify a function call.
With a good inline assembler support, this can be done very fast.
If I do it in an assembly function (I am trying the "embedded assembly" feature, i.e. declaring an assembly function in my C++ module), then it works in principle with the following code:
__asm long long addSat( long long a, long long b)
{
        ADDS    R0, R0, R2
        ADCS    R1, R1, R3
        BVC     AllOk
        BMI     Oflw
        // underflow into pos. number range: limit to 0x8000...
        MOV     R0, #0
        MOV     R1, #1
        LSLS    R1, R1, #31
        B       AllOk
Oflw
        // overflow into neg. number range: limit to 0x7FFFF...
        MOV     R0, #0
        SUBS    R0, #1
        LSRS    R1, R0, #1
AllOk
        BX      LR
}
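For checking the assembly routine against known values, a portable C++ reference model of the same saturation logic can be useful. The following is only a sketch: the name `addSatRef` is hypothetical, and the overflow test is the usual sign-bit comparison rather than a read of the processor's V flag. It is not meant as a replacement for the inline version asked about here, since the compiler is free to turn it into a branchy sequence.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical reference model of addSat (illustration only).
inline std::int64_t addSatRef(std::int64_t a, std::int64_t b)
{
    // Add as unsigned so that wrap-around is well defined in C++.
    const std::uint64_t ua = static_cast<std::uint64_t>(a);
    const std::uint64_t ub = static_cast<std::uint64_t>(b);
    const std::uint64_t ur = ua + ub;

    // Signed overflow (what the V flag reports): both operands have the
    // same sign and the wrapped result has the opposite sign.
    const bool overflow = (((ua ^ ur) & (ub ^ ur)) >> 63) != 0;
    if (!overflow) {
        // The value fits; converting back yields the true sum (two's
        // complement conversion is assumed, as on mainstream compilers).
        return static_cast<std::int64_t>(ur);
    }

    // Wrapped result negative: the true sum was too large, clamp to
    // 0x7FFF...; otherwise it was too small, clamp to 0x8000...
    return (ur >> 63) != 0 ? std::numeric_limits<std::int64_t>::max()
                           : std::numeric_limits<std::int64_t>::min();
}
```

The two clamp values correspond to the two branches of the assembly routine: the `Oflw` path loads 0x7FFF..., the fall-through path loads 0x8000....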
However, if the addition does not overflow, this is a terrible waste of processor time. The processor needs at least two branches (the call into the function and the return from it). Counting only these, we have 2-4 cycles of overhead for this 3-cycle addition, so an overhead of about 100%.
If we also take into account that the compiler needs some further work to force the variables into R0-R1 and R2-R3, the overhead will be much larger. In addition, branching usually costs quite a bit of extra time on a modern processor with a prefetch queue.
This is really annoying if you want to do saturation-safe addition many times in a time-critical loop. (Inline assembly would usually handle this much more efficiently.)