Hello, I have the following piece of code:
So, after bx*16 bytes, we need to jump to another location in memory and read/store bx*16 bytes again, and so on. One possible NEON assembly implementation of the aforementioned function is the following (assuming that bx=by=4):
Hi George,
2) Now I understand. Thanks. Just one question: does the execution throughput refer to instances of the same instruction? I mean, which one of the following is preferable:
    ld1b {z0.b}, p0/z, [x0]
    ld1b {z1.b}, p0/z, [x1]
    add x0, x0, x5
    add x1, x1, x6
or
    ld1b {z0.b}, p0/z, [x0]
    add x0, x0, x5
    ld1b {z1.b}, p0/z, [x1]
    add x1, x1, x6
If the execution throughput refers to instances of the same instruction, I guess the first option is the best. Is that right?
3) Your code works perfectly. Thanks!
4) I switched back to using the zero-extending loads instead. Regarding the performance, I think it is better, as you said. Thanks!
I might come back to you if I need anything else. Thanks for everything!
BR,
Akis
Hi Akis,
Throughput in this case is referring to the number of the same instruction that can begin execution on each cycle. The exact code layout is not particularly important for large out-of-order cores like Neoverse N2 or Neoverse V2, so I would expect both arrangements to perform more or less the same. The bottleneck in such cores is instead usually any dependency chain between instructions. For example, in the case of the load instructions here, the loads cannot begin execution until the addresses x0 and x1 have been calculated.
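To make the dependency structure concrete, here is a rough scalar C model of the two orderings from the question. It is only a sketch: the function names are made up for illustration, a single byte load stands in for each `ld1b` vector load, and a C compiler is free to reorder the statements anyway. The point is that both orderings carry the same dependencies (each load needs its own pointer, and each pointer update feeds the load in the next iteration), so they compute the same result.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the first ordering: both loads, then both adds. */
static uint32_t walk_loads_then_adds(const uint8_t *a, const uint8_t *b,
                                     size_t stride_a, size_t stride_b,
                                     size_t rows)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < rows; i++) {
        uint8_t va = *a;   /* ld1b {z0.b}, p0/z, [x0] */
        uint8_t vb = *b;   /* ld1b {z1.b}, p0/z, [x1] */
        a += stride_a;     /* add x0, x0, x5 */
        b += stride_b;     /* add x1, x1, x6 */
        acc += (uint32_t)va + (uint32_t)vb;
    }
    return acc;
}

/* Hypothetical model of the second ordering: load, add, load, add. */
static uint32_t walk_interleaved(const uint8_t *a, const uint8_t *b,
                                 size_t stride_a, size_t stride_b,
                                 size_t rows)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < rows; i++) {
        uint8_t va = *a;   /* ld1b {z0.b}, p0/z, [x0] */
        a += stride_a;     /* add x0, x0, x5 */
        uint8_t vb = *b;   /* ld1b {z1.b}, p0/z, [x1] */
        b += stride_b;     /* add x1, x1, x6 */
        acc += (uint32_t)va + (uint32_t)vb;
    }
    return acc;
}
```

In both functions the two streams (`a` and `b`) are independent of each other, and the only chain that carries across iterations is pointer update to next load, which is the same in either layout.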
Glad to hear the new code worked as expected!
Thanks,
George