Partial register dependency neon

doofenstein over 4 years ago

I'm having trouble finding any information on partial NEON register dependencies.

Take for example the following code:

ld2 {v0.b, v1.b}[0], [x0]
ld2 {v0.b, v1.b}[1], [x1]
ld2 {v0.b, v1.b}[2], [x2]
...

Does the second load have to wait for the previous one to complete or may it continue right away?

I'm working with image data that needs to be palettised using a 256-entry table of 16-bit values, and I want to process the result further with NEON. Unfortunately, the table size rules out tbl instructions, since the table would take up all 32 vector registers. Would it be faster to do the lookups with ARM (general-purpose) instructions first, then combine the results into four 64-bit registers and transfer them to NEON?

If it helps, I'm targeting Cortex-A57.
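For illustration, the scalar-lookup alternative asked about here could look roughly like the following C sketch. It is not code from this thread: the function names `lookup4_packed` and `palettise` are hypothetical, the NEON path assumes a little-endian AArch64 target, and it builds one 128-bit vector from two 64-bit words (so four 64-bit words would cover 16 pixels). On other targets a plain scalar fallback is used so the sketch still compiles and runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Look up four 8-bit indices in a 256-entry, 16-bit palette with scalar
 * loads and pack the results into one 64-bit word.
 * Entry 0 ends up in bits 0..15, entry 3 in bits 48..63. */
uint64_t lookup4_packed(const uint16_t palette[256], const uint8_t idx[4])
{
    return  (uint64_t)palette[idx[0]]
         | ((uint64_t)palette[idx[1]] << 16)
         | ((uint64_t)palette[idx[2]] << 32)
         | ((uint64_t)palette[idx[3]] << 48);
}

/* Palettise n pixels (n must be a multiple of 8), 8 pixels per iteration. */
void palettise(const uint16_t palette[256], const uint8_t *src,
               uint16_t *dst, size_t n)
{
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        uint64_t lo = lookup4_packed(palette, src + i);
        uint64_t hi = lookup4_packed(palette, src + i + 4);
#if defined(__aarch64__)
        /* Two general-purpose to vector transfers replace eight
         * single-lane loads. Assumes little-endian lane order. */
        uint16x8_t v = vcombine_u16(vcreate_u16(lo), vcreate_u16(hi));
        /* ...further NEON processing of v would go here... */
        vst1q_u16(dst + i, v);
#else
        /* Portable fallback: unpack the two 64-bit words again. */
        for (int k = 0; k < 4; k++) {
            dst[i + k]     = (uint16_t)(lo >> (16 * k));
            dst[i + 4 + k] = (uint16_t)(hi >> (16 * k));
        }
#endif
    }
}
```

The idea behind this layout is that each vector register is written as a whole, so no instruction has to merge a new lane into the previous contents of the register. Whether that is actually faster on a given core has to be measured.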

doofenstein over 4 years ago

I recently got access to the PMU on my target processor and was able to do some profiling. After some trouble I got simple cycle counting to work; looking back, I could have used the regular timer to get the same result.

Anyway, my results were that loading everything into ARM registers first and then transferring it to NEON performed slightly better than loading directly into NEON registers via ld2.
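A comparison like this can also be timed without the PMU. The following is a minimal sketch, assuming a POSIX system with `clock_gettime`; the names `kernel_fn`, `time_best_ns` and `busy_loop` are hypothetical and are not the harness used for the measurement above. It runs a kernel several times and keeps the shortest run, which reduces noise from interrupts and cold caches.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for a kernel under test. */
typedef void (*kernel_fn)(void *ctx);

/* Call fn(ctx) `runs` times and return the shortest elapsed time in
 * nanoseconds. Returns UINT64_MAX if runs <= 0. */
uint64_t time_best_ns(kernel_fn fn, void *ctx, int runs)
{
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < runs; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn(ctx);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                   + (int64_t)(t1.tv_nsec - t0.tv_nsec);
        if ((uint64_t)ns < best)
            best = (uint64_t)ns;
    }
    return best;
}

/* Placeholder workload: ctx points to an int iteration count. */
void busy_loop(void *ctx)
{
    volatile uint32_t sink = 0;
    int n = *(int *)ctx;
    for (int i = 0; i < n; i++)
        sink += (uint32_t)i;
}
```

To compare two variants, each would be wrapped in a function of type `kernel_fn` and passed to `time_best_ns` with the same input data and run count.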