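On the Cortex-A15, my memcpy runs slower with PLD instructions than without them. Why would that happen? In theory, PLD should raise the cache hit rate, so the CPU should get through the copy faster, but my tests show no improvement.
Here is the memcpy assembly code I wrote: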
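loop:
    vldm  r1!, {d0-d7}      @ load 64 bytes from the source, post-incrementing r1
    vldm  r1!, {d16-d23}    @ load the next 64 bytes
    pld   [r1, #0x0]        @ preload where r1 now points (the next iteration's data)
    pld   [r1, #0x40]       @ preload 64 bytes beyond that
    vstm  r0!, {d0-d7}      @ store 64 bytes to the destination, post-incrementing r0
    vstm  r0!, {d16-d23}    @ store the next 64 bytes
    subs  r2, #0x80         @ 128 bytes copied per iteration; update count and flags
    bgt   loop              @ repeat while bytes remain
    bx    lr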
I've seen code preload 0xc0 ahead, but not 0x100. One thing that worries me is that extra, unnecessary fetches are being done. I'd put in a check that the data will actually be required before preloading it.
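
To make that check concrete, here is a minimal sketch in C rather than assembly, using the GCC/Clang builtin `__builtin_prefetch` (which those compilers normally lower to PLD on ARM). The function name, the 256-byte prefetch distance, and the 64-byte line size are my assumptions for illustration, not values taken from the code above; the right distance has to be found by measurement. The return value exists only so the guard can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK          128u  /* bytes copied per iteration, as in the asm loop */
#define CACHE_LINE      64u  /* assumed cache-line size */
#define PREFETCH_AHEAD 256u  /* assumed prefetch distance; tune by measurement */

/* Hypothetical helper: copies n bytes (n must be a multiple of CHUNK).
 * Returns the number of iterations in which prefetches were issued. */
static size_t copy_with_guarded_prefetch(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t prefetching_iterations = 0;

    assert(n % CHUNK == 0);

    for (size_t off = 0; off < n; off += CHUNK) {
        size_t remaining = n - off;

        /* Prefetch only when both target lines lie inside the source
         * buffer, so no fetch is issued for data that is never copied. */
        if (remaining >= PREFETCH_AHEAD + CHUNK) {
            __builtin_prefetch(src + off + PREFETCH_AHEAD, 0, 0);
            __builtin_prefetch(src + off + PREFETCH_AHEAD + CACHE_LINE, 0, 0);
            prefetching_iterations++;
        }

        /* Stands in for the two vldm/vstm pairs of the asm loop. */
        memcpy(dst + off, src + off, CHUNK);
    }
    return prefetching_iterations;
}
```

In assembly the same idea would be a compare of the remaining count in r2 against the prefetch distance, with the two pld instructions skipped (or made conditional) once fewer bytes than that remain. Whether the extra compare pays for itself is something only a benchmark on the target will show.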