This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

How to get absolute value of a 32-bit signed integer as fast as possible?

Matic over 9 years ago

Hi.

I wonder how to calculate absolute value of a 32-bit signed integer in C as fast as possible. I saw that there is a FPU instruction VABS.F32, which do that in one cycle (above the floats). I thought, if it is possible to use it also with integers (sign bit is in both cases MSB bit). Maybe in some way with inline or embedded assembly?

Or do you have any other advice how to get absolute value of an integer in C in the fastest way.

Not to forget - I am speaking regarding Cortex-M4 processor.

Thanks

0 Yasuhiko Koumoto over 9 years ago
Hi,
I think you might have some misunderstandings.
Regarding the sign bit, the FPU data format is different from the format of the integer.
In the FPU case, data consists of a sign bit and an absolute value.
In the integer case, data has 2's complement format.
Therefore, the absolute value of FPU data can be gotten because the sign bit can be cleared.
To the contrary, to get the absolute value of the integer there would be 2 steps. The first is to compare with zero. The second is to negate the value if it is a negative value.
The instruction sequence will be the following and it will take at least 2 cycles.
cmp r0,#0 it lt neglt r0,r0
Best regards,
Yasuhiko Koumoto.
Cancel
Vote up 0 Vote down

Cancel
0 G. Goodwin L. Pitos over 9 years ago

Before you create any of the above codes, you may want first to discover what is available from your C compiler package. For example examine the assembly language instructions for abs(), AbsVal(), etc. to avoid unnecessary duplication.
Cancel
Vote up 0 Vote down

Cancel
0 G. Goodwin L. Pitos over 9 years ago in reply to Yasuhiko Koumoto

Hi matic and Yasuhiko,
If the original value needs to be preserved, below is Example 3.1 from section 3.3.7. Conditional execution of Cortex-M4 Devices Generic User Guide.
Example 3.1 shows the use of a conditional instruction to find the absolute value of a number. R0 = abs(R1).
Example 3.1. Absolute value
    MOVS    R0, R1          ; R0 = R1, setting flags
    IT      MI              ; skipping next instruction if value 0 or positive
    RSBMI   R0, R0, #0      ; If negative, R0 = -R0
For both the negate and reverse subtract instructions, we have to handle the special case when the input is 0x80000000 (sign bit=1, all others are 0; -2147483648). We have to check the behavior of the flags particularly V (overflow).
Regards,
Goodwin
Cancel
Vote up 0 Vote down

Cancel
0 daith over 9 years ago in reply to G. Goodwin L. Pitos

I've never actually looked at what the compilers produce for that but I can't think you'll get anything much faster unless ARM sticks in a special instruction to do it. If you have very many more positive than negative values you could branch out of line to negate and that would bring the cycle count down near to two on the M4. And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
     asrs    r1, r0, #31
     eors    r0, r1
     sub     r0, r1
In fact just checked the timings and I think this sequence should only take two cycles even though it occupies 8 bytes
     eor     r1, r0, r0, asr #31
     sub     r0, r1, r0, asr #31
In C terms that would be
     int src, tmp, dst;
     ...
     tmp = src ^ (src >> 31);
     dst = tmp - (src >> 31);
but what an optimising compiler does with that - who knows except by looking at the output. Which I think is an exercise I'll leave to the reader ;-)
Cancel
Vote up 0 Vote down

Cancel
0 G. Goodwin L. Pitos over 9 years ago in reply to daith

I want to say something about the codes that you posted. However, I don't have enough time anymore so, in the meantime, I'll just reply about the last portion of your reply.
but what an optimising compiler does with that - who knows except by looking at the output. Which I think is an exercise I'll leave to the reader ;-)
Invoke common subexpression elimination replacing the code with
     tmp = src >> 31;
     dst = src ^ (tmp);
     dst -= tmp;
.
Well, C compilers for ARM architecture are aware that the shift is a "built-in" feature of EOR and SUB so at least we are assured that the code will not be transformed as above. The rest of the compiler's behavior remains an exercise for us.
Cancel
Vote up 0 Vote down

Cancel
0 G. Goodwin L. Pitos over 9 years ago in reply to daith

And if you just want to be different here's some interesting code which loses a register and doesn't save any cycles on the M4 ;-)
     asrs    r1, r0, #31
     eors    r0, r1
     sub     r0, r1
There's a good reason, not just to be different, for using this method. I frequently use this sequence of instructions (Arithmetic Shift Right, Exclusive OR, Subtract) for determining the absolute value of integral data type in various processor architectures that I use. My main reason is not necessarily speed based on the total or statistical average of instruction cycle count but the property that the code does not branch so we get the same number of cycles for calculating the absolute value regardless of the sign of the input. The code just make unnecessary operations when the input is positive.
     eor     r1, r0, r0, asr #31
     sub     r0, r1, r0, asr #31
Comparing this two-instruction and the previous three-instruction method, the increase in the size of code from 6 to 8 bytes might have a negative impact on the overall performance so further analysis may be needed before making a choice between the two. In ARM state (matic specified Cortex-M4, just for comparison) this is irrelevant.
     int src, tmp, dst;
For all the codes that were posted, when the input is 0x80000000 (-2147483648) the output will be the same (0x80000000). The simplest way to handle this special case is to treat the output (dst in your C code) as unsigned. This will allow 0x80000000 to be treated as (+)2147483648 without requiring additional bit(s) of storage. Also, or consequently, we can just proceed with the arithmetic instruction, simply ignoring the flags. With (signed) integer data type, the output 0x80000000 is still -2147483648 after the supposed absolute value operation. Although we can still make the (signed) integer data type work, especially since C is loosely typed, the unsigned type will help make the code easier to comprehend and work with.
Cancel
Vote up 0 Vote down

Cancel
0 Thibaut ZEISSLOFF over 9 years ago in reply to G. Goodwin L. Pitos
If you don't want (or cannot afford) to treat absolute output as unsigned as @goodwin suggested, here is what I would do to perform Absolute function on signed integer keeping output as signed (i.e. saturating abs(-2^31) to -1+2^31)
if (val < 0) { val = __qsub(0, val); }
This would be translated to
cmp r0,#0 itt MI movMI r1,#0 qsubMI r0,r1,r0
This will take 4 cycles regardless input value (maybe 3 if you have some 0 constant already in your registers !) on a Cortex-M4 but requires that you use intrinsics (or inline assembly) from your compiler !
Cancel
Vote up 0 Vote down

Cancel
0 Matic over 9 years ago

Hello.
I found out that sequence of ASR, EOR and SUB takes the same amount of instructions as if I simply write a C code like: if(x < 0) x = -x; (of course with optimization level 3 in Keil, otherwise "if" version takes much more instructions)
It is true, that if branching is preferred to be avoid, then ASR, EOR, SUB combination is better then "if". Otherwise, I am not really sure.
Thank you all for the replies.
Cancel
Vote up 0 Vote down

Cancel
+1 daith over 9 years ago in reply to Thibaut ZEISSLOFF

The trick with eor can be used to do the saturating abs in 3 cycles
     asrs    r1, r0, #31
     eors    r0, r1
     qsub    r0, r1
     tmp = (int)src >> 31;
     dst = __qsub(val^tmp, tmp);
Cancel
Vote up 0 Vote down

Cancel
0 Matic over 9 years ago in reply to daith

Thanks. I like that solution most.
I write it as __QSUB(src^src >> 31, src >> 31) and it works well also with 0x80000000 input. Result (if signed) become 2147483647 (0x7FFFFFFF). So, perhaps it isn't necesary to compute intermediate "tmp" value as you did, is it? Therefor you avoid additional STR operation.
Also, is casting with (int) there necessary? Because, the dissasembly code and result are the same witout it. As in my example.
Cancel
Vote up 0 Vote down

Cancel
0 Thibaut ZEISSLOFF over 9 years ago in reply to Matic

Looks that this solution is the most efficient, I would put some parenthesis though ....
Regarding the STR instruction, as long as tmp variable is local to your function, there is no reason a compiler with decent optimization would store this intermediate result !! Did you compile without optimization ? Or maybe, your tmp variable is declared as global ?
For the cast to int, it may not be necessary (if you declared src as an int), but this way you can make sure that compiler will use ASR (arithmetic shift) instead of LSR (logical shift). Also anyone that reads this C code will understand that your intention was to sign-fill your variable !
Cancel
Vote up 0 Vote down

Cancel

0 Thibaut ZEISSLOFF over 9 years ago

Hi everybody,

Just worked again on this absolute value topic and reached another way to achieve both good efficiency and portability including the saturation !

The trick is to start inverting the input value in order to test its sign in the process.

Here is the 32-bit version:

int32_t myAbs32(int32_t parVal)
{
    int32_t wResult;
   
    /* Take opposite of value and test value in the process */
    wResult = -parVal;
    
    if (wResult < 0)
    {
        /* result is negative, that means :
        - parVal is positive : following operation will subtract 0 to parVal
        - parVal is 0x80000000 : following operation will subtract 1 to parVal => 0x7FFFFFFF */   
        wResult = parVal - (((uint32_t)parVal) >> 31);
    }
    /* else result is positive, that means that parVal is negative, nothing else to do */
    return wResult;
}

When inlined in calling code of Cortex M3/M4/M7 (Thumb 2), this leads to a 2 or 3 cycles (most probably 2 cycles because of alignment of 16-bit RSBS and IT:

RSBS      R1, R0, #0
IT        MI
SUBMI     R1, R0, R0, LSR #31

In ARM mode, this takes exactly two instructions !

A very similar approach can be applied for all data sizes below, here is the 16-bit version:

int32_t myAbs16(int16_t parVal)
{
    int32_t wResult;
   
    /* Take opposite of value and test value in the process */
    wResult = -parVal;
    
    if (wResult >= 0)
    {
        /* result is positive or null, that means :
        - parVal is 0xFFFF8000, result is 0x00008000 : following operation will subtract 1 => 0x00007FFF
        - parVal is negative >= 0xFFFF8001 : following operation will subtract 0
        - parVal is 0 : following operation will sutract 0 */   
        wResult -= (wResult >> 15);
    }
    else
    {
        /* result is negative, that means that parVal is positive, restore it */
        wResult = parVal;
    }
    return wResult;
}

Same remark, this should take 2 cycles :

RSBS      R1, R0, #0
IT        PL
SUBPL     R0, R1, R1, ASR #15

It is even good for Cortex M0(+), assembly code for 32-bit absolute (3 or 4 cycles for M0+, 4 cycles for M0) :

    RSBS      R1, R0, #0
    BPL       next
    LSRS      R1, R0, #31
    SUBS      R1, R0, R1
next:

I thought this was worth digging up this topic !