
Why does FPU performance differ in AArch64 and AArch32 with Cortex-A53?

Hello experts,


I have a question.
VFP Benchmark is a benchmark application written by a Japanese developer to measure ARM VFP performance, especially on ARMv7-A and ARMv8-A.
The software can be downloaded from the following link.
http://dench.flatlib.jp/app/vfpbench

Below are some SoC performance results from VFP Benchmark, taken from a web site (http://wlog.flatlib.jp/item/1793).

SP: Single Precision  DP: Double Precision ST: Single Thread  MT: Multi-Thread

I am very surprised by these results, because the Cortex-A53 FPU performance differs between AArch32 and AArch64.
I had believed that an FPU operation would be executed in the same way in AArch64 and AArch32.
From this viewpoint, the Cortex-A72 results are reasonable.
That is, the FPU performance is the same in AArch64 and AArch32.
However, on Cortex-A53 the double-precision performance is the same in both AArch64 and AArch32, while the single-precision performance in AArch32 is half that of AArch64.
My question is why the Cortex-A53 SP performance differs between AArch64 and AArch32.
Could anyone answer this question, as far as it does not violate the NDA covering the hardware implementation?


Best regards,
Yasuhiko Koumoto.

  • Just came across this. I don't know why it happens, but my guess is that it has to do with register layout: in AArch32, two single-precision registers fit into each of the first 16 double-precision registers, whereas in AArch64 each of the 32 registers holds just one single-precision register.

    They probably implemented fetching and storing the floating-point registers fine, so I would guess there is some nasty dependency problem: if you access an odd and an even single-precision register that map to the same double, the second has to wait for the first to finish before proceeding. If so, it might be possible to fix quite a bit of the problem with a compiler change, for example by using all the even single-precision registers first.
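
    To make the register layout concrete, here is a small Python sketch of the architectural aliasing described above. It models only the register naming (AArch32 S0..S31 overlaying D0..D15, and each AArch64 Sn being the low 32 bits of its own Vn), not the Cortex-A53 pipeline, so it cannot confirm the dependency guess. The function names and the `even_first_order` allocation order are hypothetical, made up for illustration.

    ```python
    # Illustrative model of architectural register aliasing only.
    # It says nothing about how Cortex-A53 actually tracks dependencies.

    def aarch32_alias(s_index):
        """Return (d_index, half) for AArch32 register S<s_index>.

        S(2n) is the low half of Dn and S(2n+1) is the high half,
        so S0..S31 overlay D0..D15.
        """
        if not 0 <= s_index <= 31:
            raise ValueError("AArch32 has S0..S31")
        return s_index // 2, "low" if s_index % 2 == 0 else "high"


    def aarch64_alias(s_index):
        """Return the V register index holding AArch64 scalar S<s_index>.

        Each Sn is the low 32 bits of its own Vn; no two S registers
        share a wider register.
        """
        if not 0 <= s_index <= 31:
            raise ValueError("AArch64 has S0..S31")
        return s_index


    def even_first_order(count=32):
        """Hypothetical allocation order: all even S registers, then the odd ones."""
        return list(range(0, count, 2)) + list(range(1, count, 2))


    if __name__ == "__main__":
        for s in (0, 1, 2, 3):
            d, half = aarch32_alias(s)
            v = aarch64_alias(s)
            print(f"S{s}: AArch32 -> {half} half of D{d}; "
                  f"AArch64 -> low 32 bits of V{v}")
        print("even-first order:", even_first_order(8))
    ```

    With this mapping, S0 and S1 land in the same D0 in AArch32 but in separate V0 and V1 in AArch64. An even-first order would keep consecutive allocations in different D registers until the even ones run out.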
