void pythagoras(float *pDst, float *pSrc1, float *pSrc2, unsigned int size)
{
    float a, b, c;

    if (size == 0) return;

    do {
        a = *pSrc1++;
        b = *pSrc2++;
        c = a*a + b*b;
        *pDst++ = sqrtf(c);
    } while (--size);
}
pythagoras
    0x00000018: CMP r3,#0
    0x0000001C: BXEQ lr
    0x00000020: PUSH {r4-r8,lr}
    0x00000024: MOV r7,r3
    0x00000028: MOV r4,r2
    0x0000002C: MOV r5,r1
    0x00000030: MOV r6,r0
    0x00000034: VLDM r5!,{s1}
    0x00000038: VLDM r4!,{s0}
    0x0000003C: VMUL.F32 s1,s1,s1
    0x00000040: VMLA.F32 s1,s0,s0
    0x00000044: VSQRT.F32 s0,s1
    0x00000048: VCMP.F32 s0,s0
    0x0000004C: VMRS APSR_nzcv,FPSCR
    0x00000050: BEQ pythagoras+68 ; 0x5C
    0x00000054: VMOV.F32 s0,s1
    0x00000058: BL pythagoras+64 ; 0x58
    0x0000005C: SUBS r7,r7,#1
    0x00000060: VSTM r6!,{s0}
    0x00000064: BNE pythagoras+28 ; 0x34
    0x00000068: POP {r4-r8,pc}
void integer_mix(unsigned int *pDst, unsigned int *pSrc1, unsigned int *pSrc2, float scalar, unsigned int size)
{
    unsigned int ratio1 = (unsigned int) (scalar*256.0f);
    unsigned int ratio2 = 256-ratio1;
    unsigned int a, b, c;

    if (size == 0) return;

    do {
        a = *pSrc1++;
        b = *pSrc2++;
        c = ratio1*a + ratio2*b;
        c += 128;
        c >>= 8;
        *pDst++ = c;
    } while (--size);
}
integer_mix
    0x0000006C: PUSH {r4-r6}
    0x00000070: VLDR s1,[pc,#220]
    0x00000074: CMP r3,#0
    0x00000078: VMUL.F32 s0,s0,s1
    0x0000007C: VCVT.U32.F32 s0,s0
    0x00000080: VMOV r12,s0
    0x00000084: RSB r4,r12,#0x100
    0x00000088: BEQ integer_mix+76 ; 0xB8
    0x0000008C: CMN r3,#0x80000001
    0x00000090: BLS integer_mix+84 ; 0xC0
    0x00000094: LDR r5,[r1],#4
    0x00000098: LDR r6,[r2],#4
    0x0000009C: MUL r5,r12,r5
    0x000000A0: MLA r5,r4,r6,r5
    0x000000A4: ADD r5,r5,#0x80
    0x000000A8: LSR r5,r5,#8
    0x000000AC: SUBS r3,r3,#1
    0x000000B0: STR r5,[r0],#4
    0x000000B4: BNE integer_mix+40 ; 0x94
    0x000000B8: POP {r4-r6}
    0x000000BC: BX lr
    0x000000C0: CMP r3,#1
    0x000000C4: MOVLE r3,#1
    0x000000C8: BLE integer_mix+104 ; 0xD4
    0x000000CC: CMP r3,#0
    0x000000D0: BLE integer_mix+76 ; 0xB8
    0x000000D4: TST r3,#1
    0x000000D8: SUB r1,r1,#4
    0x000000DC: SUB r2,r2,#4
    0x000000E0: SUB r0,r0,#4
    0x000000E4: BEQ integer_mix+152 ; 0x104
    0x000000E8: LDR r5,[r1,#4]!
    0x000000EC: MUL r5,r5,r12
    0x000000F0: LDR r6,[r2,#4]!
    0x000000F4: MLA r5,r4,r6,r5
    0x000000F8: ADD r5,r5,#0x80
    0x000000FC: LSR r5,r5,#8
    0x00000100: STR r5,[r0,#4]!
    0x00000104: ASRS r3,r3,#1
    0x00000108: BEQ integer_mix+76 ; 0xB8
    0x0000010C: LDR r6,[r1,#4]
    0x00000110: MUL r6,r6,r12
    0x00000114: LDR r5,[r2,#4]
    0x00000118: SUBS r3,r3,#1
    0x0000011C: MLA r5,r4,r5,r6
    0x00000120: ADD r5,r5,#0x80
    0x00000124: LSR r5,r5,#8
    0x00000128: STR r5,[r0,#4]
    0x0000012C: LDR r5,[r1,#8]!
    0x00000130: MUL r5,r5,r12
    0x00000134: LDR r6,[r2,#8]!
    0x00000138: MLA r5,r4,r6,r5
    0x0000013C: ADD r5,r5,#0x80
    0x00000140: LSR r5,r5,#8
    0x00000144: STR r5,[r0,#8]!
    0x00000148: BNE integer_mix+160 ; 0x10C
    0x0000014C: POP {r4-r6}
    0x00000150: BX lr
    0x00000154: DCD 0x43800000
The parameter "scalar" holds a value between 0.0 and 1.0 and serves as the mix ratio.
Does it make sense without utilizing load multiple?
From the example you have posted, the compiler doesn't know this; the input could have any range. ARMCC does have the "__promise" intrinsic, though, which lets you teach the compiler about the validity of data values. [I've never tried it with floating point, mind you; normally I use it just for integer vectorization hints for NEON.]
http://infocenter.ar...c/CJACHIDG.html
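For anyone who hasn't used it, here is a minimal sketch of the kind of integer hint meant above. The PROMISE wrapper macro and the integer_mix_promised name are made up for this post, and taking ratio1 as an already converted integer is a simplification; only __promise itself is the armcc intrinsic. On other compilers the wrapper expands to nothing, so the code still builds. Whether armcc actually makes use of these particular promises is not something this sketch demonstrates.

```c
/* armcc predefines __CC_ARM; on any other compiler the hint compiles away. */
#if defined(__CC_ARM)
#define PROMISE(expr) __promise(expr)   /* armcc intrinsic */
#else
#define PROMISE(expr) ((void)0)         /* hypothetical fallback: no-op */
#endif

/* Hypothetical variant of integer_mix that takes the ratio as an integer
   (0..256) and tells the compiler what it may assume about size. */
void integer_mix_promised(unsigned int *pDst, const unsigned int *pSrc1,
                          const unsigned int *pSrc2, unsigned int ratio1,
                          unsigned int size)
{
    unsigned int ratio2 = 256 - ratio1;
    unsigned int i;

    PROMISE(size != 0);        /* caller guarantees at least one element   */
    PROMISE((size % 4) == 0);  /* caller guarantees a multiple of four     */

    for (i = 0; i < size; i++) {
        unsigned int c = ratio1 * pSrc1[i] + ratio2 * pSrc2[i];
        pDst[i] = (c + 128) >> 8;   /* round and drop the 8 fraction bits */
    }
}
```

The promises are only as good as the caller: passing a size that breaks them is the caller's bug, not something the function checks.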
unsigned int ratio1 = (unsigned int) (scalar*256.0f);
VCVT.U32.F32 s0,s0, #8
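To make the arithmetic behind those two lines concrete, here is a small sketch in plain C. As I understand it, the fixed-point form of VCVT with #8 scales by 2^8 and rounds toward zero, which is what the cast of scalar*256.0f does for in-range values. The helper names to_q8 and mix_one are made up for illustration; out-of-range or negative inputs are deliberately not handled here.

```c
/* Hypothetical helper: convert a 0.0..1.0 mix ratio to an integer with
   8 fraction bits, the same computation as in integer_mix. */
unsigned int to_q8(float scalar)
{
    return (unsigned int)(scalar * 256.0f);   /* truncates toward zero */
}

/* Hypothetical helper: mix a single pair of values, as in the loop body. */
unsigned int mix_one(unsigned int a, unsigned int b, float scalar)
{
    unsigned int ratio1 = to_q8(scalar);
    unsigned int ratio2 = 256 - ratio1;
    unsigned int c = ratio1 * a + ratio2 * b;

    c += 128;   /* add half an LSB so the shift rounds to nearest */
    c >>= 8;    /* drop the 8 fraction bits */
    return c;
}
```

For example, a scalar of 0.25 gives ratio1 = 64 and ratio2 = 192, so mixing 100 and 200 gives (64*100 + 192*200 + 128) >> 8 = 175.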
Yes; the compiler gets more freedom to schedule instructions (pull loads away from first use, etc.). Internally, single and multiple load instructions use the same load/store hardware, so on modern ARM cores it is unlikely to make a huge difference, because the hardware will merge reads and writes where possible anyway. (That is the reason the load/store multiple instructions have gone in ARMv8: they don't help much and are a right pain to implement in the microarchitecture.)
In terms of the NaN check being expensive - probably historically true, but on Cortex-A15 the floating point unit is "just another pipeline" rather than a bolt-on coprocessor and gets full benefits of the out-of-order pipeline execution. It may well not be as expensive as you fear.
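For reference, the NaN check under discussion is the VCMP.F32 s0,s0 after VSQRT.F32 in the pythagoras listing: a value compares unequal to itself only when it is NaN, and in that case the code falls through to the BL, presumably the library sqrtf. A minimal C sketch of the same test follows; the function name is made up, and the volatile is there only so the compiler doesn't fold the comparison away.

```c
#include <math.h>   /* only for the NAN macro */

/* Hypothetical helper: returns 1 if x is NaN, 0 otherwise, using the same
   self-comparison trick as VCMP.F32 s0,s0 in the listing. */
int is_nan_by_self_compare(float x)
{
    volatile float v = x;   /* keep the comparison from being optimized out */
    return v != v;
}
```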
Unfortunately, I have to report that the __promise intrinsic didn't work with vsqrt.
And in the case of scalar, I repeat, the compiler doesn't have to know the range of the value it contains.