
ARMCC generating incompetent VFP code

Note: This was originally posted on 6th September 2013 at http://forums.arm.com

Hi all.


I'm currently working on some teaching materials about ARM/VFP/NEON optimization for the A-15.

Since assembly has always been my primary programming language, and NEON my primary choice for performance-critical routines, this is the first time I've taken a look at armcc-generated VFP code...

And I'm shocked at how bad the generated code is:

Example 1:



#include <math.h>   /* for sqrtf() */

void pythagoras(float *pDst, float *pSrc1, float *pSrc2, unsigned int size)
{
    float a, b, c;
    if (size == 0) return;
    do
    {
        a = *pSrc1++;
        b = *pSrc2++;
        c = a*a + b*b;
        *pDst++ = sqrtf(c);
    } while (--size);
}


A simple Pythagoras calculation turns into a nightmare once you look at the generated code:


pythagoras
        0x00000018:    CMP      r3,#0
        0x0000001C:    BXEQ  lr
        0x00000020:    PUSH  {r4-r8,lr}
        0x00000024:    MOV      r7,r3
        0x00000028:    MOV      r4,r2
        0x0000002C:    MOV      r5,r1
        0x00000030:    MOV      r6,r0
        0x00000034:    VLDM  r5!,{s1}
        0x00000038:    VLDM  r4!,{s0}
        0x0000003C:    VMUL.F32 s1,s1,s1
        0x00000040:    VMLA.F32 s1,s0,s0
        0x00000044:    VSQRT.F32 s0,s1
        0x00000048:    VCMP.F32 s0,s0
        0x0000004C:    VMRS  APSR_nzcv,FPSCR
        0x00000050:    BEQ      pythagoras+68 ; 0x5C
        0x00000054:    VMOV.F32 s0,s1
        0x00000058:    BL    pythagoras+64 ; 0x58
        0x0000005C:    SUBS  r7,r7,#1
        0x00000060:    VSTM  r6!,{s0}
        0x00000064:    BNE      pythagoras+28 ; 0x34
        0x00000068:    POP      {r4-r8,pc}



Even elementary school kids would know that the variable c cannot be negative at all, but armcc insists on checking if the VSQRT resulted in NaN :(
Changing it to sqrtf(fabsf(c)) doesn't help either. It just inserts a meaningless VABS.F32, and the NaN check still remains.

I simply cannot believe that this piece of s*** is generated by a $7k, "best-in-its-class" toolchain.

The same applies to vdiv. I'm very aware that vsqrt and vdiv can cause exceptions, but most of the time programmers simply know that this never occurs in their own routines.

There should be a way to override those NaN checks, since VMRS is an extremely expensive instruction that stalls the pipeline. Is it possible via compiler options? Did I miss something?
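
For what it's worth, the workaround I'm experimenting with is armcc's __sqrtf intrinsic, which is supposed to insert the hardware square-root instruction directly. Whether it really skips the library NaN/error path on every toolchain version is an assumption on my part, so treat this as an untested sketch with a hypothetical helper name:

    /* Hypothetical helper. Assumption: __sqrtf maps straight to
       VSQRT.F32 with no library fallback - check the armcc
       documentation for your toolchain version. */
    float pythagoras_step(float a, float b)
    {
        return __sqrtf(a*a + b*b);
    }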

Example 2:



void integer_mix(unsigned int *pDst, unsigned int *pSrc1, unsigned int *pSrc2, float scalar, unsigned int size)
{
    unsigned int ratio1 = (unsigned int)(scalar * 256.0f);
    unsigned int ratio2 = 256 - ratio1;
    unsigned int a, b, c;
    if (size == 0) return;
    do
    {
        a = *pSrc1++;
        b = *pSrc2++;
        c = ratio1*a + ratio2*b;
        c += 128;
        c >>= 8;
        *pDst++ = c;
    } while (--size);
}


The parameter "scalar" holds a value between 0.0 and 1.0, serving as the mix ratio. It's converted to Q8 fixed point at the start (e.g. scalar = 0.5 gives ratio1 = 128 and ratio2 = 128).

integer_mix
        0x0000006C:    PUSH  {r4-r6}
        0x00000070:    VLDR  s1,[pc,#220]
        0x00000074:    CMP      r3,#0
        0x00000078:    VMUL.F32 s0,s0,s1
        0x0000007C:    VCVT.U32.F32 s0,s0
        0x00000080:    VMOV  r12,s0
        0x00000084:    RSB      r4,r12,#0x100
        0x00000088:    BEQ      integer_mix+76 ; 0xB8
        0x0000008C:    CMN      r3,#0x80000001
        0x00000090:    BLS      integer_mix+84 ; 0xC0
        0x00000094:    LDR      r5,[r1],#4
        0x00000098:    LDR      r6,[r2],#4
        0x0000009C:    MUL      r5,r12,r5
        0x000000A0:    MLA      r5,r4,r6,r5
        0x000000A4:    ADD      r5,r5,#0x80
        0x000000A8:    LSR      r5,r5,#8
        0x000000AC:    SUBS  r3,r3,#1
        0x000000B0:    STR      r5,[r0],#4
        0x000000B4:    BNE      integer_mix+40 ; 0x94
        0x000000B8:    POP      {r4-r6}
        0x000000BC:    BX    lr
        0x000000C0:    CMP      r3,#1
        0x000000C4:    MOVLE    r3,#1
        0x000000C8:    BLE      integer_mix+104 ; 0xD4
        0x000000CC:    CMP      r3,#0
        0x000000D0:    BLE      integer_mix+76 ; 0xB8
        0x000000D4:    TST      r3,#1
        0x000000D8:    SUB      r1,r1,#4
        0x000000DC:    SUB      r2,r2,#4
        0x000000E0:    SUB      r0,r0,#4
        0x000000E4:    BEQ      integer_mix+152 ; 0x104
        0x000000E8:    LDR      r5,[r1,#4]!
        0x000000EC:    MUL      r5,r5,r12
        0x000000F0:    LDR      r6,[r2,#4]!
        0x000000F4:    MLA      r5,r4,r6,r5
        0x000000F8:    ADD      r5,r5,#0x80
        0x000000FC:    LSR      r5,r5,#8
        0x00000100:    STR      r5,[r0,#4]!
        0x00000104:    ASRS  r3,r3,#1
        0x00000108:    BEQ      integer_mix+76 ; 0xB8
        0x0000010C:    LDR      r6,[r1,#4]
        0x00000110:    MUL      r6,r6,r12
        0x00000114:    LDR      r5,[r2,#4]
        0x00000118:    SUBS  r3,r3,#1
        0x0000011C:    MLA      r5,r4,r5,r6
        0x00000120:    ADD      r5,r5,#0x80
        0x00000124:    LSR      r5,r5,#8
        0x00000128:    STR      r5,[r0,#4]
        0x0000012C:    LDR      r5,[r1,#8]!
        0x00000130:    MUL      r5,r5,r12
        0x00000134:    LDR      r6,[r2,#8]!
        0x00000138:    MLA      r5,r4,r6,r5
        0x0000013C:    ADD      r5,r5,#0x80
        0x00000140:    LSR      r5,r5,#8
        0x00000144:    STR      r5,[r0,#8]!
        0x00000148:    BNE      integer_mix+160 ; 0x10C
        0x0000014C:    POP      {r4-r6}
        0x00000150:    BX    lr
        0x00000154:    DCD      0x43800000


What an unpleasant surprise!
Instead of converting to Q8 fixed point with a single VCVT using an immediate 8 as the fraction-bit count, it simply follows the expression as written in C: a literal-pool load of 256.0f followed by a floating-point multiply prior to the conversion. What's optimized here?
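
For comparison, VFPv3 provides VCVT between floating-point and fixed-point with an immediate fraction-bit count, so the whole prologue could collapse into something like this (hand-written sketch, not compiler output):

        VCVT.U32.F32 s0,s0,#8    ; float -> unsigned Q8 in one instruction
        VMOV         r12,s0      ; no literal-pool load, no VMUL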

As you can see above, the loop is unrolled due to the compiler option -Otime, but does it make sense without utilizing load-multiple? With branch prediction working properly, the loop overhead is rather negligible, and the extended code length only harms performance.

I know very well that writing compilers isn't an easy job. But I'm not asking for the impossible:
1. If it's difficult for the compiler to decide whether a NaN check is necessary, provide a compiler option to disable it manually. (Is there one already?)
2. Smart float-to-fixed conversion, or vice versa, is trivial IMO. It must be easy to fix.
3. LDM and STM should be utilized when the loop is unrolled. What's so difficult about this? (See the sketch below for the kind of inner loop I have in mind.)
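
Here's an unrolled-by-two inner loop using load/store-multiple - a hand-written, untested sketch (register allocation is arbitrary, r7/r8 would need saving in the real prologue, and it assumes size is a nonzero multiple of 2):

loop
        LDM      r1!,{r5,r6}    ; two elements from pSrc1 in one access
        LDM      r2!,{r7,r8}    ; two elements from pSrc2
        MUL      r5,r5,r12      ; ratio1*a0
        MLA      r5,r4,r7,r5    ; + ratio2*b0
        MUL      r6,r6,r12      ; ratio1*a1
        MLA      r6,r4,r8,r6    ; + ratio2*b1
        ADD      r5,r5,#0x80    ; rounding
        ADD      r6,r6,#0x80
        LSR      r5,r5,#8
        LSR      r6,r6,#8
        SUBS     r3,r3,#2
        STM      r0!,{r5,r6}    ; two results in one access
        BNE      loop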

True, ARMCC is far better than GCC. But isn't it already a disgrace to be compared to GCC, which only generates "working" ****?

I'm really disappointed.
  • Note: This was originally posted on 8th September 2013 at http://forums.arm.com

    Unfortunately, I have to report that the __promise intrinsic didn't work with vsqrt.


    Thanks for trying - thought it was a long shot anyway but worth a crack

    And in the case of scalar - I repeat - the compiler doesn't have to know the range of the value it contains.


    Hmm, with on-paper maths (i.e. infinite precision) you are probably right, but if you don't do the full multiply by 256.0 in the floating-point domain, I suspect you run into issues with some of the corner cases (infinities, denormals, correct rounding mode) where we run headlong into the C spec's correctness rules. While you might "get away with it" for common-case values, and if you were always rounding towards zero, the fact that it will fail for the corner cases means that this turns into something the compiler isn't allowed to do (without hacks like GCC's fast-math option that basically ignore the spec).

    I don't have the compiler to hand, but I suspect it is more likely to produce better code if you convert the scalar float to fixed point first (with 16 fractional bits), and then shift in integer domain to drop the 8 fractional bits.

    Given you know the value is between 0 and 1 you are not going to lose any accuracy doing it this way and integer maths is generally more amenable to optimization (fewer horrible rules like rounding modes to worry about). Still won't be one instruction though.
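
    In C, that would look something like this (untested; whether armcc actually folds the multiply and conversion into a single fixed-point VCVT is an assumption on my part):

        /* Convert to Q16 first, then drop the extra 8 fractional bits
           in the integer domain - no accuracy is lost relative to the
           Q8 version for 0.0 <= scalar <= 1.0. */
        unsigned int ratio1 = ((unsigned int)(scalar * 65536.0f)) >> 8;
        unsigned int ratio2 = 256 - ratio1;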

    I do miss working with Ada - range hinting on variable declaration was great for this kind of thing.