vcmp.f64 d0, d6
vmrs APSR_nzcv, fpscr
vcmpeq.f64 d1, d7
vmrseq APSR_nzcv, fpscr
vceq.i32 q15, q0, q3
vmovn.i32 d31, q15
vshl.s16 d31, d31, #8
vcmp.f64 d31, d29
vmrs APSR_nzcv, fpscr
vmov.i16 d29, #65280 ; 0xff00
vmrs
for i = 1, n do
    if A[i] != K then break
done
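For reference, the scalar loop above could be written in C roughly as follows. This is only a sketch: the function name `first_mismatch`, the 0-based indexing, and the 32-bit element type (guessed from the `.i32` suffix in the NEON code) are assumptions, not part of the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the index of the first element that
   differs from k, or n if every element equals k (0-based indexing). */
size_t first_mismatch(const int32_t *a, size_t n, int32_t k)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (a[i] != k)
            break;  /* stop at the first element that is not equal to k */
    }
    return i;
}
```

A return value equal to `n` means the loop ran to completion without finding a mismatch.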
Wouldn't it be better to "unroll" the "pre-filling loop" in order to avoid the branch mispredictions?
Why are there exactly FOUR iterations (comparison pairs) in the "pre-filling loop"?
Why are there exactly TWO 128-bit comparisons in one iteration? Is this because the A9 is dual-issue?
How does the memory disambiguation mechanism impact the performance in your solution? Doesn't it stall the ARM pipeline?
Why is the transfer from VFP registers to ARM registers (more precisely, the VMRS instruction) so frowned upon? The "Cortex A9 MPE TRM" (section 3.4.10) states that a transfer from a VFP register to an "integer core" register has a latency of only 3 cycles. How does the pipeline impact of VMRS compare to the impact of the memory disambiguation mechanism?