Hi, can anyone suggest how to find the instruction cycle timings of ARMv8 instructions? Does it take more cycles to go from NEON to basic ARM instructions in ARMv8?
Please suggest how to calculate instruction cycles in ARMv8.
Instruction timings are processor-specific, so they can vary between processors, even those that implement the same version of the architecture.
This means you have to refer to the processor specific documentation, for example for the Cortex-A57:
Cortex-A57 Software Optimisation Guide
That guide is quite interesting, but it says practically nothing about memory, caches or prefetching, so it leaves out an absolutely crucial part of the whole picture.
Hi daith,
Memory, cache and prefetching behaviours are system-specific. Access latencies at the various cache levels can vary between SoCs, and even between revisions of the same SoC. They depend on the RAM choices at your silicon vendor and on configurable items such as cache sizes (i.e. how big L2 is), bridges and register slices, the size of queues (configurable on some cores), the choice of interconnect (CCI or CCN, for example), and whether that interconnect has an L3 cache or not (and its size). Prefetch behaviour (such as the prefetch read-ahead distance) can usually be configured on a per-core basis.
Usually, when writing code where you care about cycle timings, you assume perfect memory system behaviour (i.e. all my loads and stores will be from/to L1 with a particular best-case time). If that isn't the case, you can use PRFM instructions to game the caches into pulling the data there ahead of time (or hint that it should stay in L2, if that is what you want). You can't do much more than that once you're in silicon.
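To make the PRFM point concrete, here is a minimal sketch in C. It assumes GCC or Clang, whose `__builtin_prefetch` builtin is typically lowered to a PRFM instruction when targeting AArch64. The function name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are hypothetical, chosen only for illustration; a useful distance depends entirely on the SoC's memory system, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch.
 * The right value is system-specific and has to be found by measurement. */
#define PREFETCH_DISTANCE 64

/* Sums an array while hinting that data further along will be read soon. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, rw = 0 (prefetch for read),
             * locality = 3 (highest temporal locality).
             * This is only a hint; the core is free to ignore it.
             * For simplicity it is issued for every element rather
             * than once per cache line. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The result is the same with or without the hint; only the timing can change. A lower locality argument is the usual way to express that the data need not stay in L1, but which PRFM variant is actually emitted is up to the compiler.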
Most of the time with an ARMv8 core, you really won't care about cycle timings. You read those documents to be sure that you aren't inadvertently triggering some implementation-specific quirk through re-use of a particular part of the pipeline or ALU logic. The most important parts of the Cortex-A57 (and -A72) documentation, for example, would be the pairs of instructions that can undergo fusion within the pipeline, and the miscellaneous hazards from conditional execution in AArch32 state. The rest of the information CAN be useful (especially as a comparison between Cortex-A57 and Cortex-A72 performance), but it is of little interest to most people implementing an ARMv8 core at the clock speeds most people ship ARMv8 cores at, especially as any one of the options above could drastically change the performance of a particular code sequence.
Ta,
Matt Sealey
Well, that's true and I sympathize with it, but the document does say it is an optimisation guide. As you say, to do reasonable optimisation one should probably be putting in the occasional prefetch. It would be good to have some guidance about that, since there are a large number of options and possibilities for use. At the next level, it would be good to have something said about the overheads or problems of memory barriers or operations like LDREX, even if most people should not be using them directly.
Again, all dependent on the SoC as a whole... It's up to the SiP to provide guidance on how their memory system behaves.