This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

How to get absolute value of a 32-bit signed integer as fast as possible?

Hi.

I wonder how to calculate absolute value of a 32-bit signed integer in C as fast as possible. I saw that there is a FPU instruction VABS.F32, which do that in one cycle (above the floats). I thought, if it is possible to use it also with integers (sign bit is in both cases MSB bit). Maybe in some way with inline or embedded assembly?

Or do you have any other advice how to get absolute value of an integer in C in the fastest way.

Not to forget - I am speaking regarding Cortex-M4 processor.

Thanks

Parents

0 G. Goodwin L. Pitos over 8 years ago

Before you create any of the above codes, you may want first to discover what is available from your C compiler package. For example examine the assembly language instructions for abs(), AbsVal(), etc. to avoid unnecessary duplication.
Cancel
Up 0 Down

Cancel

Reply

0 G. Goodwin L. Pitos over 8 years ago

Before you create any of the above codes, you may want first to discover what is available from your C compiler package. For example examine the assembly language instructions for abs(), AbsVal(), etc. to avoid unnecessary duplication.
Cancel
Up 0 Down

Cancel

Children

0 daith over 8 years ago in reply to G. Goodwin L. Pitos

I've never actually looked at what the compilers produce for that but I can't think you'll get anything much faster unless ARM sticks in a special instruction to do it. If you have very many more positive than negative values you could branch out of line to negate and that would bring the cycle count down near to two on the M4. And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
     asrs    r1, r0, #31
     eors    r0, r1
     sub     r0, r1
In fact just checked the timings and I think this sequence should only take two cycles even though it occupies 8 bytes
     eor     r1, r0, r0, asr #31
     sub     r0, r1, r0, asr #31
In C terms that would be
     int src, tmp, dst;
     ...
     tmp = src ^ (src >> 31);
     dst = tmp - (src >> 31);
but what an optimising compiler does with that - who knows except by looking at the output. Which I think is an exercise I'll leave to the reader ;-)
Cancel
Up 0 Down

Cancel
0 G. Goodwin L. Pitos over 8 years ago in reply to daith

I want to say something about the codes that you posted. However, I don't have enough time anymore so, in the meantime, I'll just reply about the last portion of your reply.
but what an optimising compiler does with that - who knows except by looking at the output. Which I think is an exercise I'll leave to the reader ;-)
Invoke common subexpression elimination replacing the code with
     tmp = src >> 31;
     dst = src ^ (tmp);
     dst -= tmp;
.
Well, C compilers for ARM architecture are aware that the shift is a "built-in" feature of EOR and SUB so at least we are assured that the code will not be transformed as above. The rest of the compiler's behavior remains an exercise for us.
Cancel
Up 0 Down

Cancel
0 G. Goodwin L. Pitos over 8 years ago in reply to daith

And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
     asrs    r1, r0, #31
     eors    r0, r1
     sub     r0, r1
There's a good reason, not just to be different, for using this method. I frequently use this sequence of instructions (Arithmetic Shift Right, Exclusive OR, Subtract) for determining the absolute value of integral data type in various processor architectures that I use. My main reason is not necessarily speed based on the total or statistical average of instruction cycle count but the property that the code does not branch so we get the same number of cycles for calculating the absolute value regardless of the sign of the input. The code just make unnecessary operations when the input is positive.
     eor     r1, r0, r0, asr #31
     sub     r0, r1, r0, asr #31
Comparing this two-instruction and the previous three-instruction method, the increase in the size of code from 6 to 8 bytes might have a negative impact on the overall performance so further analysis may be needed before making a choice between the two. In ARM state (matic specified Cortex-M4, just for comparison) this is irrelevant.
     int src, tmp, dst;
For all the codes that were posted, when the input is 0x80000000 (-2147483648) the output will be the same (0x80000000). The simplest way to handle this special case is to treat the output (dst in your C code) as unsigned. This will allow 0x80000000 to be treated as (+)2147483648 without requiring additional bit(s) of storage. Also, or consequently, we can just proceed with the arithmetic instruction, simply ignoring the flags. With (signed) integer data type, the output 0x80000000 is still -2147483648 after the supposed absolute value operation. Although we can still make the (signed) integer data type work, especially since C is loosely typed, the unsigned type will help make the code easier to comprehend and work with.
Cancel
Up 0 Down

Cancel
0 Thibaut ZEISSLOFF over 8 years ago in reply to G. Goodwin L. Pitos
If you don't want (or cannot afford) to treat absolute output as unsigned as @goodwin suggested, here is what I would do to perform Absolute function on signed integer keeping output as signed (i.e. saturating abs(-2^31) to -1+2^31)
if (val < 0) { val = __qsub(0, val); }
This would be translated to
cmp r0,#0 itt MI movMI r1,#0 qsubMI r0,r1,r0
This will take 4 cycles regardless input value (maybe 3 if you have some 0 constant already in your registers !) on a Cortex-M4 but requires that you use intrinsics (or inline assembly) from your compiler !
Cancel
Up 0 Down

Cancel
+1 daith over 8 years ago in reply to Thibaut ZEISSLOFF

The trick with eor can be used to do the saturating abs in 3 cycles
     asrs    r1, r0, #31
     eors    r0, r1
     qsub    r0, r1
     tmp = (int)src >> 31;
     dst = __qsub(val^tmp, tmp);
Cancel
Up 0 Down

Cancel
0 Matic over 8 years ago in reply to daith

Thanks. I like that solution most.
I write it as __QSUB(src^src >> 31, src >> 31) and it works well also with 0x80000000 input. Result (if signed) become 2147483647 (0x7FFFFFFF). So, perhaps it isn't necesary to compute intermediate "tmp" value as you did, is it? Therefor you avoid additional STR operation.
Also, is casting with (int) there necessary? Because, the dissasembly code and result are the same witout it. As in my example.
Cancel
Up 0 Down

Cancel
0 Thibaut ZEISSLOFF over 8 years ago in reply to Matic

Looks that this solution is the most efficient, I would put some parenthesis though ....
Regarding the STR instruction, as long as tmp variable is local to your function, there is no reason a compiler with decent optimization would store this intermediate result !! Did you compile without optimization ? Or maybe, your tmp variable is declared as global ?
For the cast to int, it may not be necessary (if you declared src as an int), but this way you can make sure that compiler will use ASR (arithmetic shift) instead of LSR (logical shift). Also anyone that reads this C code will understand that your intention was to sign-fill your variable !
Cancel
Up 0 Down

Cancel