This discussion has been locked.

You can no longer post new replies to this discussion. If you have a question you can start a new discussion

ARMCC generating incompetent VFP codes

Note: This was originally posted on 6th September 2013 at http://forums.arm.com

Hi all.

I'm currently working on some teaching materials about ARM/VFP/NEON optimization for the A-15.

Since assembly has always been my primary programming language, and NEON being my primary choice for performance critical routines, it's the first time for me taking a look at armcc generated VFP codes....

And I'm shocked at how bad the generated codes are :

Example 1 :



void pythagoras(float *pDst, float * pSrc1, float * pSrc2, unsigned int size)
{
 float a, b, c;
 if (size == 0) return;
 do
 {
  a = *pSrc1++;
  b = *pSrc2++;
  c = a*a + b*b;
  *pDst++ = sqrtf(c);
 } while (--size);
}

A simple pythagoras calculation turns into a nightmare looking at the generated code :


pythagoras
        0x00000018:    CMP      r3,#0
        0x0000001C:    BXEQ  lr
        0x00000020:    PUSH  {r4-r8,lr}
        0x00000024:    MOV      r7,r3
        0x00000028:    MOV      r4,r2
        0x0000002C:    MOV      r5,r1
        0x00000030:    MOV      r6,r0
        0x00000034:    VLDM  r5!,{s1}
        0x00000038:    VLDM  r4!,{s0}
        0x0000003C:    VMUL.F32 s1,s1,s1
        0x00000040:    VMLA.F32 s1,s0,s0
        0x00000044:    VSQRT.F32 s0,s1
        0x00000048:    VCMP.F32 s0,s0
        0x0000004C:    VMRS  APSR_nzcv,FPSCR
        0x00000050:    BEQ      pythagoras+68 ; 0x5C
        0x00000054:    VMOV.F32 s0,s1
        0x00000058:    BL    pythagoras+64 ; 0x58
        0x0000005C:    SUBS  r7,r7,#1
        0x00000060:    VSTM  r6!,{s0}
        0x00000064:    BNE      pythagoras+28 ; 0x34
        0x00000068:    POP      {r4-r8,pc}

Even elementary school kids would know that the variable c cannot be negative at all, but armcc insists on checking if the VSQRT resulted in NaN

Changing to sqrtf(fabsf©) doesn't help either. It just puts a meaningless vabs.f32, and the NaN check still remains.

I simply cannot believe that this piece of s*** is generated by $7k, "best-in-its-class" tools.

The same applies to vdiv. I'm very aware that vsqrt and vdiv can cause exceptions, but most of the time programmers simply know that this never occurs in their own routines.

There should be a way to override those NaN checks since VMRS is an extremely expensive instruction with the pipeline stalls. Is it possible via compiler options? Did I miss something?

Example 2:



void integer_mix(unsigned int * pDst, unsigned int * pSrc1, unsigned int * pSrc2, float scalar, unsigned int size)
{
 unsigned int ratio1 = (unsigned int) (scalar*256.0f);
 unsigned int ratio2 = 256-ratio1;
 unsigned int a, b, c;
 if (size == 0) return;
 do
 {
  a = *pSrc1++;
  b = *pSrc2++;
  c = ratio1*a + ratio2*b;
  c += 128;
  c >>= 8;
  *pDst++ = c;
 } while (--size);
}

The parameter "scalar" holds a value between 0.0~1.0, serving as the mix-ratio. It's converted to q8 fixed point at the start.



integer_mix
        0x0000006C:    PUSH  {r4-r6}
        0x00000070:    VLDR  s1,[pc,#220]
        0x00000074:    CMP      r3,#0
        0x00000078:    VMUL.F32 s0,s0,s1
        0x0000007C:    VCVT.U32.F32 s0,s0
        0x00000080:    VMOV  r12,s0
        0x00000084:    RSB      r4,r12,#0x100
        0x00000088:    BEQ      integer_mix+76 ; 0xB8
        0x0000008C:    CMN      r3,#0x80000001
        0x00000090:    BLS      integer_mix+84 ; 0xC0
        0x00000094:    LDR      r5,[r1],#4
        0x00000098:    LDR      r6,[r2],#4
        0x0000009C:    MUL      r5,r12,r5
        0x000000A0:    MLA      r5,r4,r6,r5
        0x000000A4:    ADD      r5,r5,#0x80
        0x000000A8:    LSR      r5,r5,#8
        0x000000AC:    SUBS  r3,r3,#1
        0x000000B0:    STR      r5,[r0],#4
        0x000000B4:    BNE      integer_mix+40 ; 0x94
        0x000000B8:    POP      {r4-r6}
        0x000000BC:    BX    lr
        0x000000C0:    CMP      r3,#1
        0x000000C4:    MOVLE    r3,#1
        0x000000C8:    BLE      integer_mix+104 ; 0xD4
        0x000000CC:    CMP      r3,#0
        0x000000D0:    BLE      integer_mix+76 ; 0xB8
        0x000000D4:    TST      r3,#1
        0x000000D8:    SUB      r1,r1,#4
        0x000000DC:    SUB      r2,r2,#4
        0x000000E0:    SUB      r0,r0,#4
        0x000000E4:    BEQ      integer_mix+152 ; 0x104
        0x000000E8:    LDR      r5,[r1,#4]!
        0x000000EC:    MUL      r5,r5,r12
        0x000000F0:    LDR      r6,[r2,#4]!
        0x000000F4:    MLA      r5,r4,r6,r5
        0x000000F8:    ADD      r5,r5,#0x80
        0x000000FC:    LSR      r5,r5,#8
        0x00000100:    STR      r5,[r0,#4]!
        0x00000104:    ASRS  r3,r3,#1
        0x00000108:    BEQ      integer_mix+76 ; 0xB8
        0x0000010C:    LDR      r6,[r1,#4]
        0x00000110:    MUL      r6,r6,r12
        0x00000114:    LDR      r5,[r2,#4]
        0x00000118:    SUBS  r3,r3,#1
        0x0000011C:    MLA      r5,r4,r5,r6
        0x00000120:    ADD      r5,r5,#0x80
   0x00000124:    LSR      r5,r5,#8
        0x00000128:    STR      r5,[r0,#4]
        0x0000012C:    LDR      r5,[r1,#8]!
        0x00000130:    MUL      r5,r5,r12
        0x00000134:    LDR      r6,[r2,#8]!
        0x00000138:    MLA      r5,r4,r6,r5
        0x0000013C:    ADD      r5,r5,#0x80
        0x00000140:    LSR      r5,r5,#8
        0x00000144:    STR      r5,[r0,#8]!
        0x00000148:    BNE      integer_mix+160 ; 0x10C
        0x0000014C:    POP      {r4-r6}
        0x00000150:    BX    lr
        0x00000154:    DCD      0x43800000

What an unpleasant surprise!
Instead of converting to q8 fixed point via an immediate 8 as fraction bits, it simply follows the expression written in C : with a literal pool load followed by a multiply prior to the converting. What's optimized here?

As you can see above, the loop is unrolled due to the compiler option -Otime, but does it make sense without utilizing load multiple? With the branch prediction properly working, loop overhead is rather negligible, and the extended code length only harms the performance.

I know very well that writing compilers isn't an easy job. But I'm not asking for the impossible :
1. If it's difficult for the compiler to decide whether a NaN check is necessary, make a compiler option available disabling this manually. (Is there already?)
2. Smart float to fixed conversion or vice versa is trivial IMO. It must be easy fixing this.
3. ldm and stm should be utilized when the loop is unrolled. What's so difficult about this?

True, ARMCC is far better than GCC. But isn't it already a disgrace being compared to GCC that only generates "working" ****?

I'm really disappointed.

Peter Harris over 12 years ago

Note: This was originally posted on 7th September 2013 at http://forums.arm.com

The parameter "scalar" holds a value between 0.0~1.0, serving as the mix-ratio

From the example you have posted the compiler doesn't know this. The input could have any range. ARMCC does have the "__promise" intrinsic statements which let you teach the compiler about validity of data values though. [I've never tried these with floating point mind you, normally just for integer vectorization hints for NEON].

http://infocenter.ar...c/CJACHIDG.html

Does it make sense without utilizing load multiple?

Yes; the compiler gets more ability to schedule instructions (pull loads away from first use, etc). Internally both single and multiple loads instructions use the same load store hardware so on modern ARM cores it is unlikely to make a huge amount of difference because the hardware will merge reads and writes where possible anyway (which is the reason the load/store multiple instructions have gone in ARMv8 - they don't help much and are a right pain to implement in the microarchitecture).

In terms of the NaN check being expensive - probably historically true, but on Cortex-A15 the floating point unit is "just another pipeline" rather than a bolt-on coprocessor and gets full benefits of the out-of-order pipeline execution. It may well not be as expensive as you fear.

Aside - assuming your input pointers don't overlap I would recommend using "restrict" on the parameters - this should give the compiler more free range to vectorize.
Cancel
Vote up 0 Vote down

Cancel
Jake Lee over 12 years ago

Note: This was originally posted on 8th September 2013 at http://forums.arm.com

Thanks a lot for the enlightening reply. I really appreciate that.

From the example you have posted the compiler doesn't know this. The input could have any range. ARMCC does have the "__promise" intrinsic statements which let you teach the compiler about validity of data values though. [I've never tried these with floating point mind you, normally just for integer vectorization hints for NEON].

http://infocenter.ar...c/CJACHIDG.html

I'm complaining about armcc not doing such a trivial optimization compiling :

unsigned int ratio1 = (unsigned int) (scalar*256.0f);

into a single instruction :

VCVT.U32.F32 s0,s0, #8

In this case, the compiler doesn't have to know the range of the float value given as the parameter.
Should an "overflow" occur, vcvt results in 0x7fffffff either way.

The __promise intrinsic assures the compiler a value never being zero. It doesn't help in case of sqrt.

Yes; the compiler gets more ability to schedule instructions (pull loads away from first use, etc). Internally both single and multiple loads instructions use the same load store hardware so on modern ARM cores it is unlikely to make a huge amount of difference because the hardware will merge reads and writes where possible anyway (which is the reason the load/store multiple instructions have gone in ARMv8 - they don't help much and are a right pain to implement in the microarchitecture).

That's very interesting. I'm glad that I don't have to "enforce" ldm and stm when writing assembly codes anymore.

But looking at the armcc and gcc generated codes, compilers don't do the scheduling very well (if at all) in conjunction with the loop unrolling. They rather tend to use the same registers over and over rendering the loop unrolling meaningless.

In terms of the NaN check being expensive - probably historically true, but on Cortex-A15 the floating point unit is "just another pipeline" rather than a bolt-on coprocessor and gets full benefits of the out-of-order pipeline execution. It may well not be as expensive as you fear.

Hmmm.... I don't see how a pipeline stall plus a conditional branch can be compensated by the out-of-order execution. Does the instruction VMSR behave THAT differently on the A15 than on older ones? Not causing pipeline stall for example?

Thanks in advance

Jake
Cancel
Vote up 0 Vote down

Cancel
Peter Harris over 12 years ago
Cancel
Vote up 0 Vote down

Cancel
Jake Lee over 12 years ago

Note: This was originally posted on 8th September 2013 at http://forums.arm.com

Unfortunately, I have to report that the __promise intrinsic didn't work with vsqrt.

And in case of scalar, - I repeat - the compiler doesn't have to know the range of the value it contains.

It's just as trivial as "a += b*256" compiled to "add a, a, b, lsl #8" which armcc has been doing so well since forever.

On the other hand, I was really surprised that armcc removed an addition prior shift right for the rounding purpose and put rounding shift right instead within the vectorized code. THAT I call smart. (GCC doesn't do that, of course).

Overall, I have the feeling that the people at ARM did a wonderful job optimizing the compiler for the integer part and more or less skipped VFP optimizations while concentrating on NEON.

Thanks again for your wonderful reply.
Cancel
Vote up 0 Vote down

Cancel
Peter Harris over 12 years ago

Note: This was originally posted on 8th September 2013 at http://forums.arm.com

Unfortunately, I have to report that the __promise intrinsic didn't work with vsqrt.

Thanks for trying - thought it was a long shot anyway but worth a crack

And in case of scalar, - I repeat - the compiler doesn't have to know the range of the value it contains.

Hmm, on paper maths (i.e. infinite precision) you are probably right, but if you don't do the full multiply by 256.0 in floating point domain I suspect you run into issues with some of the corner cases (infinities, denormals, correct rounding mode) where we run headlong into the C spec's correctness rules. While you might "get away with it" for common case values and if you were always rounding towards zero, the fact it will fail for the corner cases means that this turns into the something the compiler isn't allowed to do (without hacks like GCC's fast-math option to basically ignore the spec).

I don't have the compiler to hand, but I suspect it is more likely to produce better code if you convert the scalar float to fixed point first (with 16 fractional bits), and then shift in integer domain to drop the 8 fractional bits.

Given you know the value is between 0 and 1 you are not going to lose any accuracy doing it this way and integer maths is generally more amenable to optimization (fewer horrible rules like rounding modes to worry about). Still won't be one instruction though.

I do miss working with Ada - range hinting on variable declaration was great for this kind of thing.
Cancel
Vote up 0 Vote down

Cancel
Simon Craske over 12 years ago
Cancel
Vote up 0 Vote down

Cancel
Jake Lee over 12 years ago

Note: This was originally posted on 10th September 2013 at http://forums.arm.com

Thank you very much.

Now I can sleep in peace~

PS : I'm taking back the word "incompetent"
Cancel
Vote up 0 Vote down

Cancel