Hi.
I wonder how to calculate absolute value of a 32-bit signed integer in C as fast as possible. I saw that there is a FPU instruction VABS.F32, which do that in one cycle (above the floats). I thought, if it is possible to use it also with integers (sign bit is in both cases MSB bit). Maybe in some way with inline or embedded assembly?
Or do you have any other advice how to get absolute value of an integer in C in the fastest way.
Not to forget - I am speaking regarding Cortex-M4 processor.
Thanks
Before you create any of the above codes, you may want first to discover what is available from your C compiler package. For example examine the assembly language instructions for abs(), AbsVal(), etc. to avoid unnecessary duplication.
I've never actually looked at what the compilers produce for that but I can't think you'll get anything much faster unless ARM sticks in a special instruction to do it. If you have very many more positive than negative values you could branch out of line to negate and that would bring the cycle count down near to two on the M4. And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
asrs r1, r0, #31
eors r0, r1
sub r0, r1
In fact just checked the timings and I think this sequence should only take two cycles even though it occupies 8 bytes
eor r1, r0, r0, asr #31
sub r0, r1, r0, asr #31
In C terms that would be
int src, tmp, dst;
...
tmp = src ^ (src >> 31);
dst = tmp - (src >> 31);
but what an optimising compiler does with that - who knows except by looking at the output. Which I think is an exercise I'll leave to the reader ;-)
I want to say something about the codes that you posted. However, I don't have enough time anymore so, in the meantime, I'll just reply about the last portion of your reply.
Invoke common subexpression elimination replacing the code with
tmp = src >> 31;
dst = src ^ (tmp);
dst -= tmp;
.
Well, C compilers for ARM architecture are aware that the shift is a "built-in" feature of EOR and SUB so at least we are assured that the code will not be transformed as above. The rest of the compiler's behavior remains an exercise for us.
And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-) asrs r1, r0, #31 eors r0, r1 sub r0, r1
And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
There's a good reason, not just to be different, for using this method. I frequently use this sequence of instructions (Arithmetic Shift Right, Exclusive OR, Subtract) for determining the absolute value of integral data type in various processor architectures that I use. My main reason is not necessarily speed based on the total or statistical average of instruction cycle count but the property that the code does not branch so we get the same number of cycles for calculating the absolute value regardless of the sign of the input. The code just make unnecessary operations when the input is positive.
eor r1, r0, r0, asr #31 sub r0, r1, r0, asr #31
Comparing this two-instruction and the previous three-instruction method, the increase in the size of code from 6 to 8 bytes might have a negative impact on the overall performance so further analysis may be needed before making a choice between the two. In ARM state (matic specified Cortex-M4, just for comparison) this is irrelevant.
For all the codes that were posted, when the input is 0x80000000 (-2147483648) the output will be the same (0x80000000). The simplest way to handle this special case is to treat the output (dst in your C code) as unsigned. This will allow 0x80000000 to be treated as (+)2147483648 without requiring additional bit(s) of storage. Also, or consequently, we can just proceed with the arithmetic instruction, simply ignoring the flags. With (signed) integer data type, the output 0x80000000 is still -2147483648 after the supposed absolute value operation. Although we can still make the (signed) integer data type work, especially since C is loosely typed, the unsigned type will help make the code easier to comprehend and work with.
If you don't want (or cannot afford) to treat absolute output as unsigned as @goodwin suggested, here is what I would do to perform Absolute function on signed integer keeping output as signed (i.e. saturating abs(-2^31) to -1+2^31)
if (val < 0) { val = __qsub(0, val); }
This would be translated to
cmp r0,#0 itt MI movMI r1,#0 qsubMI r0,r1,r0
This will take 4 cycles regardless input value (maybe 3 if you have some 0 constant already in your registers !) on a Cortex-M4 but requires that you use intrinsics (or inline assembly) from your compiler !
The trick with eor can be used to do the saturating abs in 3 cycles
qsub r0, r1
tmp = (int)src >> 31;
dst = __qsub(val^tmp, tmp);
Thanks. I like that solution most.
I write it as __QSUB(src^src >> 31, src >> 31) and it works well also with 0x80000000 input. Result (if signed) become 2147483647 (0x7FFFFFFF). So, perhaps it isn't necesary to compute intermediate "tmp" value as you did, is it? Therefor you avoid additional STR operation.
Also, is casting with (int) there necessary? Because, the dissasembly code and result are the same witout it. As in my example.
Looks that this solution is the most efficient, I would put some parenthesis though ....
Regarding the STR instruction, as long as tmp variable is local to your function, there is no reason a compiler with decent optimization would store this intermediate result !! Did you compile without optimization ? Or maybe, your tmp variable is declared as global ?
For the cast to int, it may not be necessary (if you declared src as an int), but this way you can make sure that compiler will use ASR (arithmetic shift) instead of LSR (logical shift). Also anyone that reads this C code will understand that your intention was to sign-fill your variable !