I'm having trouble finding any information on partial NEON register dependencies.
Take for example the following code:
    ld2 {v0.b, v1.b}[0], [x0]
    ld2 {v0.b, v1.b}[1], [x1]
    ld2 {v0.b, v1.b}[2], [x2]
    ...
Does the second load have to wait for the previous one to complete because it writes to the same registers, or can it proceed right away?
I'm working with image data that needs to be palettised using a 256-entry table of 16-bit values, and I want to process it further with NEON. Unfortunately, because of the table size, tbl instructions are not an option: the table (512 bytes) would take up all 32 NEON registers. Would it be faster to do the lookups with general-purpose ARM instructions first, then combine the results into four 64-bit registers and transfer those to NEON?
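To make the second approach concrete, here is a rough sketch in C of what I have in mind. The function names (pack4_lookup, palettise) and the packing order (first pixel in the lowest 16 bits) are just assumptions for illustration, not code I have measured. Each resulting 64-bit word would then be moved into a NEON register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: look up four 8-bit indices in a 256-entry table
   of 16-bit values and pack the results into one 64-bit word, with the
   first pixel in the lowest 16 bits. */
static inline uint64_t pack4_lookup(const uint16_t table[256],
                                    const uint8_t idx[4])
{
    return  (uint64_t)table[idx[0]]
         | ((uint64_t)table[idx[1]] << 16)
         | ((uint64_t)table[idx[2]] << 32)
         | ((uint64_t)table[idx[3]] << 48);
}

/* Hypothetical driver: converts n indexed pixels, four at a time.
   Any trailing pixels beyond a multiple of four are ignored here. */
static void palettise(const uint16_t table[256], const uint8_t *src,
                      uint64_t *dst, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4)
        dst[i / 4] = pack4_lookup(table, src + i);
}
```

Four such words would hold 16 pixels, i.e. two 128-bit NEON registers' worth of 16-bit data.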
If it helps, I'm targeting Cortex-A57.