
ARMCC generating incompetent VFP code

Note: This was originally posted on 6th September 2013 at http://forums.arm.com

Hi all.


I'm currently working on some teaching materials about ARM/VFP/NEON optimization for the A-15.

Since assembly has always been my primary programming language, and NEON my primary choice for performance-critical routines, this is the first time I've taken a look at armcc-generated VFP code...

And I'm shocked at how bad the generated code is:

Example 1 :



#include <math.h>

void pythagoras(float *pDst, float *pSrc1, float *pSrc2, unsigned int size)
{
    float a, b, c;
    if (size == 0) return;
    do
    {
        a = *pSrc1++;
        b = *pSrc2++;
        c = a*a + b*b;
        *pDst++ = sqrtf(c);
    } while (--size);
}


A simple Pythagoras calculation turns into a nightmare, judging by the generated code:


pythagoras
        0x00000018:    CMP      r3,#0
        0x0000001C:    BXEQ  lr
        0x00000020:    PUSH  {r4-r8,lr}
        0x00000024:    MOV      r7,r3
        0x00000028:    MOV      r4,r2
        0x0000002C:    MOV      r5,r1
        0x00000030:    MOV      r6,r0
        0x00000034:    VLDM  r5!,{s1}
        0x00000038:    VLDM  r4!,{s0}
        0x0000003C:    VMUL.F32 s1,s1,s1
        0x00000040:    VMLA.F32 s1,s0,s0
        0x00000044:    VSQRT.F32 s0,s1
        0x00000048:    VCMP.F32 s0,s0
        0x0000004C:    VMRS  APSR_nzcv,FPSCR
        0x00000050:    BEQ      pythagoras+68 ; 0x5C
        0x00000054:    VMOV.F32 s0,s1
        0x00000058:    BL    pythagoras+64 ; 0x58
        0x0000005C:    SUBS  r7,r7,#1
        0x00000060:    VSTM  r6!,{s0}
        0x00000064:    BNE      pythagoras+28 ; 0x34
        0x00000068:    POP      {r4-r8,pc}



Even an elementary-school kid knows that the variable c cannot be negative, but armcc insists on checking whether the VSQRT produced a NaN :(
Changing to sqrtf(fabsf(c)) doesn't help either. It just adds a meaningless VABS.F32, and the NaN check still remains.

I simply cannot believe that this piece of s*** is generated by a $7k, "best-in-its-class" toolchain.

The same applies to VDIV. I'm well aware that VSQRT and VDIV can raise exceptions, but most of the time programmers simply know that this never occurs in their own routines.

There should be a way to override those NaN checks, since VMRS is an extremely expensive instruction due to the pipeline stalls it causes. Is it possible via compiler options? Did I miss something?

Example 2:



void integer_mix(unsigned int *pDst, unsigned int *pSrc1, unsigned int *pSrc2, float scalar, unsigned int size)
{
    unsigned int ratio1 = (unsigned int)(scalar * 256.0f);
    unsigned int ratio2 = 256 - ratio1;
    unsigned int a, b, c;
    if (size == 0) return;
    do
    {
        a = *pSrc1++;
        b = *pSrc2++;
        c = ratio1*a + ratio2*b;
        c += 128;
        c >>= 8;
        *pDst++ = c;
    } while (--size);
}


The parameter "scalar" holds a value between 0.0 and 1.0, serving as the mix ratio. It's converted to q8 fixed point at the start.
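To make the q8 arithmetic in the loop concrete, here is one mix step extracted into a standalone function (the function name and the example values are mine, purely for illustration):

```c
/* One mix step in q8 fixed point, exactly as in the loop above.
   e.g. scalar = 0.75 -> ratio1 = 192, ratio2 = 64. */
unsigned int mix_one(unsigned int a, unsigned int b, float scalar)
{
    unsigned int ratio1 = (unsigned int)(scalar * 256.0f);
    unsigned int ratio2 = 256u - ratio1;
    unsigned int c = ratio1 * a + ratio2 * b;
    c += 128u;          /* add half an LSB to round to nearest */
    return c >> 8;      /* drop the 8 fraction bits */
}
```

For a = 200, b = 100, scalar = 0.75 this gives (192*200 + 64*100 + 128) >> 8 = 175, matching the real-valued 0.75*200 + 0.25*100 = 175.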




integer_mix
        0x0000006C:    PUSH  {r4-r6}
        0x00000070:    VLDR  s1,[pc,#220]
        0x00000074:    CMP      r3,#0
        0x00000078:    VMUL.F32 s0,s0,s1
        0x0000007C:    VCVT.U32.F32 s0,s0
        0x00000080:    VMOV  r12,s0
        0x00000084:    RSB      r4,r12,#0x100
        0x00000088:    BEQ      integer_mix+76 ; 0xB8
        0x0000008C:    CMN      r3,#0x80000001
        0x00000090:    BLS      integer_mix+84 ; 0xC0
        0x00000094:    LDR      r5,[r1],#4
        0x00000098:    LDR      r6,[r2],#4
        0x0000009C:    MUL      r5,r12,r5
        0x000000A0:    MLA      r5,r4,r6,r5
        0x000000A4:    ADD      r5,r5,#0x80
        0x000000A8:    LSR      r5,r5,#8
        0x000000AC:    SUBS  r3,r3,#1
        0x000000B0:    STR      r5,[r0],#4
        0x000000B4:    BNE      integer_mix+40 ; 0x94
        0x000000B8:    POP      {r4-r6}
        0x000000BC:    BX    lr
        0x000000C0:    CMP      r3,#1
        0x000000C4:    MOVLE    r3,#1
        0x000000C8:    BLE      integer_mix+104 ; 0xD4
        0x000000CC:    CMP      r3,#0
        0x000000D0:    BLE      integer_mix+76 ; 0xB8
        0x000000D4:    TST      r3,#1
        0x000000D8:    SUB      r1,r1,#4
        0x000000DC:    SUB      r2,r2,#4
        0x000000E0:    SUB      r0,r0,#4
        0x000000E4:    BEQ      integer_mix+152 ; 0x104
        0x000000E8:    LDR      r5,[r1,#4]!
        0x000000EC:    MUL      r5,r5,r12
        0x000000F0:    LDR      r6,[r2,#4]!
        0x000000F4:    MLA      r5,r4,r6,r5
        0x000000F8:    ADD      r5,r5,#0x80
        0x000000FC:    LSR      r5,r5,#8
        0x00000100:    STR      r5,[r0,#4]!
        0x00000104:    ASRS  r3,r3,#1
        0x00000108:    BEQ      integer_mix+76 ; 0xB8
        0x0000010C:    LDR      r6,[r1,#4]
        0x00000110:    MUL      r6,r6,r12
        0x00000114:    LDR      r5,[r2,#4]
        0x00000118:    SUBS  r3,r3,#1
        0x0000011C:    MLA      r5,r4,r5,r6
        0x00000120:    ADD      r5,r5,#0x80
        0x00000124:    LSR      r5,r5,#8
        0x00000128:    STR      r5,[r0,#4]
        0x0000012C:    LDR      r5,[r1,#8]!
        0x00000130:    MUL      r5,r5,r12
        0x00000134:    LDR      r6,[r2,#8]!
        0x00000138:    MLA      r5,r4,r6,r5
        0x0000013C:    ADD      r5,r5,#0x80
        0x00000140:    LSR      r5,r5,#8
        0x00000144:    STR      r5,[r0,#8]!
        0x00000148:    BNE      integer_mix+160 ; 0x10C
        0x0000014C:    POP      {r4-r6}
        0x00000150:    BX    lr
        0x00000154:    DCD      0x43800000


What an unpleasant surprise!
Instead of converting to q8 fixed point via VCVT's immediate fraction-bits operand (#8), it simply follows the expression as written in C: a literal-pool load followed by a multiply prior to the conversion. What exactly is optimized here?

As you can see above, the loop is unrolled due to the -Otime compiler option, but does that make sense without utilizing load-multiple? With branch prediction working properly, loop overhead is rather negligible, and the extra code length only hurts performance.

I know very well that writing compilers isn't an easy job. But I'm not asking for the impossible:
1. If it's difficult for the compiler to decide whether a NaN check is necessary, provide a compiler option to disable it manually. (Is there one already?)
2. Smart float-to-fixed conversion, or vice versa, is trivial IMO. It must be easy to fix.
3. LDM and STM should be utilized when the loop is unrolled. What's so difficult about that?

True, ARMCC is far better than GCC. But isn't it already a disgrace to be compared to GCC, which only generates "working" ****?

I'm really disappointed.
  • Note: This was originally posted on 7th September 2013 at http://forums.arm.com

    The parameter "scalar" holds a value between 0.0~1.0, serving as the mix-ratio


    From the example you have posted, the compiler doesn't know this; the input could have any range. ARMCC does have the "__promise" intrinsic, though, which lets you teach the compiler about the validity of data values. [I've never tried it with floating point, mind you; normally just for integer vectorization hints for NEON.]

    http://infocenter.ar...c/CJACHIDG.html

    Does it make sense without utilizing load multiple?


    Yes; the compiler gets more freedom to schedule instructions (pull loads away from first use, etc.). Internally, single and multiple load instructions use the same load/store hardware, so on modern ARM cores it is unlikely to make a huge amount of difference: the hardware will merge reads and writes where possible anyway (which is the reason the load/store-multiple instructions are gone in ARMv8's AArch64 state — they don't help much and are a right pain to implement in the microarchitecture).

    In terms of the NaN check being expensive: probably historically true, but on the Cortex-A15 the floating-point unit is "just another pipeline" rather than a bolted-on coprocessor, and it gets the full benefit of out-of-order execution. It may well not be as expensive as you fear.

    Aside: assuming your input pointers don't overlap, I would recommend using "restrict" on the parameters; this should give the compiler more freedom to vectorize.
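    Putting both suggestions together, the signature might look like the sketch below. This is only an illustration of the advice above: `integer_mix_r` is a hypothetical name, and `__promise` is stubbed out on non-armcc compilers so the code remains portable.

```c
/* armcc's __promise(expr) asserts to the compiler that expr is true.
   It is armcc-specific, so stub it out elsewhere (assumption: armcc
   defines __CC_ARM). */
#ifndef __CC_ARM
#define __promise(e) ((void)0)
#endif

void integer_mix_r(unsigned int *restrict pDst,
                   const unsigned int *restrict pSrc1,
                   const unsigned int *restrict pSrc2,
                   float scalar, unsigned int size)
{
    __promise(size > 0);    /* caller guarantees a non-empty loop */
    unsigned int ratio1 = (unsigned int)(scalar * 256.0f);
    unsigned int ratio2 = 256u - ratio1;
    do
    {
        unsigned int c = ratio1 * *pSrc1++ + ratio2 * *pSrc2++;
        *pDst++ = (c + 128u) >> 8;   /* round and drop the q8 fraction bits */
    } while (--size);
}
```

    With `restrict` on all three pointers, the compiler is entitled to assume the stores through pDst never feed the loads through pSrc1/pSrc2, which is exactly the guarantee vectorization needs.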