Yes, the fetch and decode unit is shared: it does the initial NEON decode before issue to the ARM pipeline. The NEON backend then does some further decoding to work out exactly what to do with the instruction. NEON reuses some ARM registers for things like address generation, which the NEON unit does not have direct access to, so things like address calculation happen "on the way through" the ARM pipeline. NEON has its own issue queue after the ARM pipeline. The two are decoupled to some extent. See above - ARM functional units are reused for part of NEON operation.
Here I have some more questions:

A dual-cycle NEON instruction like VMUL.F32 Qd, Qn, Dm[x] computes Qd in two cycles, the low half first and then the high half. What does the instruction do after finishing the first cycle: does it return to the instruction queue, or does it return to the beginning of the pipeline? I'm confused here. It seems it should go back to the instruction queue, because the last cycle of a multi-cycle data processing instruction can be dual issued with a load/store instruction. But it seems it should also stay in the pipeline, because the result must be generated in the N9 stage. Which one is right?

A similar question: in the TRM, instructions like VMLA and VMLS are said to start execution on the fp multiply pipeline first, and then the result is forwarded to the fp add pipeline. Does this "forward" mean a shortcut that skips the instruction queue?

The final one: what would happen to the two-cycle multiply-accumulate instruction VMLA.F32 Qd, Qn, Dm[x]?
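To make the instruction under discussion concrete, here is a minimal scalar sketch in plain C of what VMLA.F32 Qd, Qn, Dm[x] computes, split into a low-half pass and a high-half pass as described in the question. The type and function names (`q_reg_t`, `d_reg_t`, `vmla_f32_half`, `vmla_f32_q_lane`) are hypothetical and only for illustration. The two-pass split is an assumption taken from the description above. The sketch models the arithmetic result only; it says nothing about how the hardware queues, issues, or forwards the two halves.

```c
#include <assert.h>

/* Hypothetical scalar model of VMLA.F32 Qd, Qn, Dm[x].
 * It models the arithmetic only, not the NEON pipeline or issue queue. */
typedef struct { float lane[4]; } q_reg_t;  /* 128-bit Q register as 4 floats */
typedef struct { float lane[2]; } d_reg_t;  /* 64-bit D register as 2 floats  */

/* One pass over one 64-bit half (two lanes) of the Q register.
 * half == 0 selects lanes 0..1, half == 1 selects lanes 2..3. */
static void vmla_f32_half(q_reg_t *qd, const q_reg_t *qn, float scalar, int half)
{
    for (int i = 0; i < 2; ++i) {
        int lane = half * 2 + i;
        float product = qn->lane[lane] * scalar;  /* multiply step   */
        qd->lane[lane] += product;                /* accumulate step */
    }
}

/* Whole instruction: every lane of Qn is multiplied by the scalar Dm[x]
 * and accumulated into the matching lane of Qd. */
static void vmla_f32_q_lane(q_reg_t *qd, const q_reg_t *qn, const d_reg_t *dm, int x)
{
    assert(x == 0 || x == 1);          /* a D register holds two F32 lanes */
    float scalar = dm->lane[x];
    vmla_f32_half(qd, qn, scalar, 0);  /* assumed first pass: low half     */
    vmla_f32_half(qd, qn, scalar, 1);  /* assumed second pass: high half   */
}
```

The multiply and the add are written as separate steps to mirror the TRM wording about the fp multiply pipeline feeding the fp add pipeline. A compiler may still fuse them, so this is not a bit-exact model of the rounding behaviour.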