Dear ARM Support Team,
I am reaching out to ask whether pipeline stalls may occur during the execution of vector operations such as multiplication, addition, or subtraction... when using NEON, SVE, or SVE2 instructions on the target hardware platform.
Specifically, I am interested in the following:
Are there any stalls during sequential execution of vector arithmetic instructions?
What is the latency between dependent instructions, especially when operating on the same registers?
Does the microarchitecture apply techniques to mitigate stalls in these cases?
Does the vector length (in SVE) influence the likelihood or duration of stalls?
I would greatly appreciate any technical details or references to documentation that might provide deeper insights into how the processor handles such scenarios.
Illustrative examples to better understand the question:
Best regards,
Yevh
Hi Yevh,
This forum is for questions about Arm Development Studio. Your questions about pipeline stalls, etc., relate more to architectures and processors than to Arm DS, so they would be best handled by another route.
You could try posting to the Architectures and Processors forum, but, given your specific interest in our latest Cortex-A320 processor in your previous post, I suggest instead that you "Open a Support Case" from the links at the bottom of this web page, and our Support team will be able to help with your enquiry.
Stephen
Thanks.
Moved topic to the Architectures and Processors forum.
Adding to Stephen's answer, you might also want to look at the Software Optimisation Guide for the relevant core(s). For example, for the Cortex-A320:
https://developer.arm.com/documentation/110285/r0p1/?lang=en
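When reading such a guide, it helps to keep the latency and throughput figures for each instruction apart, since the dependent-instruction question above is really about latency. As a rough sketch of the difference (plain portable C rather than NEON or SVE intrinsics, with hypothetical function names; not taken from the guide or from the original post), the first loop below forms a single dependency chain through one accumulator, while the second splits the work across four independent accumulators so that consecutive multiply-adds do not wait on each other:

```c
#include <stddef.h>

/* Single accumulator: each multiply-add needs the previous value of acc,
   so consecutive operations form one dependency chain. Without
   reassociation (e.g. no -ffast-math), the loop tends to be bound by
   the latency of the floating-point add or fused multiply-add. */
float dot_single_chain(const float *a, const float *b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];
    }
    return acc;
}

/* Four independent accumulators: operations on acc0..acc3 do not depend
   on each other, so a pipelined core can overlap them and the loop is
   closer to being bound by throughput than by latency. */
float dot_four_chains(const float *a, const float *b, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i) {
        acc0 += a[i] * b[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Timing both functions on the target core over a large array is one way to see whether dependent vector arithmetic is limited by latency there. Note that the two versions add the products in a different order, so for general floating-point inputs their results can differ in the last bits.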