I am programming an A53 without an OS to run some arithmetic operations.
The task generates 2K 32-bit numbers using the CRC32 polynomial, then continuously stores, moves, and compares different portions of those 32-bit numbers in the L1 data cache.
Right now I get an IPC (instructions per cycle) of 1.05 on Xilinx's Zynq Ultrascale device. What is a guesstimated IPC for such workloads?
I am wondering whether there is room for improvement over 1.05.
I understand the A53 has two instruction decoders; would that mean the peak IPC is 2?
Thank you.
Yes, the peak IPC would be two, and the core can dual-issue most arithmetic/logic instructions. A few instruction types have only a single backend pipe, so only one operation of that type can issue per clock: there is a single load/store pipe, only one integer pipe has a multiplier, and there is only a single vector pipe for NEON.
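As a rough back-of-envelope for what the single load/store pipe implies: if a fraction f of the N executed instructions are loads or stores, they need at least f*N cycles, while dual issue needs at least N/2 cycles, so IPC cannot exceed min(2, 1/f). Here is a small sketch in C. The names `ipc_upper_bound` and `mem_fraction` are made up for illustration, and the model deliberately ignores cache misses, load-use stalls, branches, and every other issue restriction, so treat the result as a ceiling rather than a prediction.

```c
/*
 * Rough IPC ceiling for a dual-issue core with a single load/store pipe.
 * mem_fraction: fraction of executed instructions that are loads or stores
 *               (0.0 .. 1.0), e.g. measured from a disassembly of the hot loop.
 * Assumption: at most 2 instructions issue per cycle, and at most 1 of them
 * may be a load or store. Everything else is treated as free.
 */
double ipc_upper_bound(double mem_fraction)
{
    const double peak = 2.0;      /* dual-issue limit */

    if (mem_fraction <= 0.5)      /* load/store pipe is not the bottleneck */
        return peak;

    return 1.0 / mem_fraction;    /* one memory operation per cycle at best */
}
```

For example, a loop in which every instruction is a load or a store is capped at an IPC of 1.0 under this model, which is why a store/move/compare-heavy workload may sit close to 1 even when nothing is stalling.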
The main thing to be aware of when optimizing for A53 is that it's an in-order pipeline, so it is much more sensitive to instruction ordering and load-use time than the bigger out-of-order cores.
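To illustrate the ordering point, here is a minimal sketch in C of a compare loop written two ways. The function names are hypothetical and this is not the original poster's code. The second version issues its loads ahead of the compares that use them and keeps two independent counters, which gives an in-order core independent work to pair up. The compiler is free to reschedule either version, so it is worth checking the generated assembly before assuming any gain.

```c
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: each compare depends on the loads just before it,
 * and every iteration updates the same counter. */
size_t count_equal_simple(const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        hits += (a[i] == b[i]);
    return hits;
}

/* Unrolled by two: loads are grouped ahead of their uses, and two separate
 * counters form two independent dependency chains. */
size_t count_equal_paired(const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t hits0 = 0, hits1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        uint32_t a0 = a[i],     b0 = b[i];
        uint32_t a1 = a[i + 1], b1 = b[i + 1];
        hits0 += (a0 == b0);
        hits1 += (a1 == b1);
    }
    if (i < n)                    /* leftover element when n is odd */
        hits0 += (a[i] == b[i]);

    return hits0 + hits1;
}
```

Both functions return the same count; only the shape of the dependency chains differs.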
I can't find a version of the software optimization guide for Cortex-A53, but here is the one for Cortex-A55 which should be similar in terms of broad concepts:
developer.arm.com/.../
HTH, Pete
Thank you Pete.
I know the A55 is very similar to the A53, so the optimization guide should be applicable to the A53 too.