Why does FPU performance differ in AArch64 and AArch32 with Cortex-A53?

Hello experts,

I have a question.
VFP Benchmark is a benchmark application written by a Japanese developer to measure ARM VFP performance, especially on ARMv7-A and ARMv8-A.
The software can be downloaded from the following link.
http://dench.flatlib.jp/app/vfpbench

Below are VFP Benchmark results for several SoCs, taken from a web site (http://wlog.flatlib.jp/item/1793).

SP: Single Precision, DP: Double Precision, ST: Single Thread, MT: Multi-Thread

I am very surprised by these results because the Cortex-A53 FPU performance differs between AArch32 and AArch64.
I had believed that an FPU operation would be executed in the same way in AArch64 and AArch32.
From this viewpoint, the Cortex-A72 results are reasonable.
That is, the FPU performance is the same for AArch64 and AArch32.
However, for the Cortex-A53, the double-precision performance is the same for AArch64 and AArch32, but the single-precision performance in AArch32 is half that in AArch64.
My question is why the Cortex-A53 SP performance differs between AArch64 and AArch32.
Could anyone answer this question, as far as it does not violate any NDA covering the hardware implementation?

Best regards,
Yasuhiko Koumoto.

+1 daith over 9 years ago in reply to Yasuhiko Koumoto

Sorry, it seems we both made a mistake. Originally you said the single-precision performance differed, whereas you now say it is the double-precision figures that differ, and that is what the figures say. I should have looked at the figures. What I said about a possible dependency clash is therefore just wrong.
Thinking about the problem again: if the A53 and A72 versions of the armv7a and arm64 code are the same, then I haven't the foggiest idea why the times for the double-precision code should be so different.
About the four times difference in general:
When using SIMD, if double and single operations take about the same time, then overall an algorithm using double will take double the time: with 128-bit registers, two doubles are operated on at a time as opposed to four single-precision numbers. The stores and loads will also take about double the time, as there is double the data.
If SIMD double operations took double the time, then overall the time might approach four times longer, provided all the data is in local caches.
However, this does not explain the difference between the two A53 figures.
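To make the two-times and four-times arithmetic above concrete, here is a rough sketch in Python. It is only a toy cost model, not a measurement: the function names, the element count of 1024, and the "cycles per operation" values are assumptions chosen for illustration, and it ignores loads, stores, and pipelining entirely.

```python
# Toy model of SIMD throughput with 128-bit registers.
# All names and cycle counts here are illustrative assumptions,
# not figures taken from any Cortex-A53 or Cortex-A72 documentation.

REGISTER_BITS = 128


def lanes(element_bits):
    """Number of elements that fit in one 128-bit SIMD register."""
    return REGISTER_BITS // element_bits


def relative_time(element_bits, cycles_per_op, n_elements=1024):
    """Cycles needed to process n_elements, one SIMD operation per register."""
    ops = n_elements // lanes(element_bits)
    return ops * cycles_per_op


single = relative_time(32, cycles_per_op=1)

# Case 1: a double SIMD operation costs the same as a single one.
double_same_cost = relative_time(64, cycles_per_op=1)

# Case 2: a double SIMD operation costs twice as much (hypothetical).
double_twice_cost = relative_time(64, cycles_per_op=2)

print(double_same_cost / single)   # 2.0
print(double_twice_cost / single)  # 4.0
```

The factor of two comes purely from the lane count (four singles versus two doubles per register); the factor of four appears only if each double operation is also assumed to take twice as long.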