Hello experts,
I have a question. VFP Benchmark is a benchmark application written by a Japanese developer to measure ARM VFP performance, especially on ARMv7-A and ARMv8-A. The software can be downloaded from the following link: http://dench.flatlib.jp/app/vfpbench
Below are VFP Benchmark results for some SoCs, taken from a web site (http://wlog.flatlib.jp/item/1793).
SP: Single Precision, DP: Double Precision, ST: Single Thread, MT: Multi-Thread
I am very surprised by these results because the Cortex-A53 FPU performance differs between AArch32 and AArch64. I had believed that an FPU operation would be executed in the same way in AArch64 and AArch32. From this viewpoint, the Cortex-A72 results are reasonable: the FPU performance is the same for AArch64 and AArch32. However, on Cortex-A53 the double precision performance is the same for both AArch64 and AArch32, but the single precision performance of AArch32 is half that of AArch64. My question is why the Cortex-A53 SP performance differs between AArch64 and AArch32. Could anyone answer this question, as far as it does not violate the NDA covering the hardware implementation?
Best regards, Yasuhiko Koumoto.
Hi daith,
Thank you for your comments. The point of concern is that Cortex-A72 has the same FPU performance in AArch64 and AArch32, whereas Cortex-A53 does not.
I think DP (Double Precision) performance being half of SP (Single Precision) would be reasonable. However, the DP performance of Cortex-A53 is a fourth of SP.
My biggest question is why the FPU pipeline would differ between AArch64 and AArch32 on Cortex-A53, although on Cortex-A72 it appears to be the same.
There is additional information at the following link: http://dench.flatlib.jp/opengl/vfpbenchlog
In those results, Cortex-A8, A9 and A15 are in the same situation as Cortex-A53 in AArch32 (i.e. DP performance is a fourth of SP performance). Given these results, daith's guess might be correct.
I think that Cortex-A53 is the ordinary case and Cortex-A72 is the special one. Regarding the Cortex-A53 AArch64 FPU, I guess that it was improved at some point. The AArch64 DP performance of the Raspberry Pi 3 being a fourth of SP may be because its Cortex-A53 is an older revision of the processor (just my guess).
Anyway, I know the truth about the implementation cannot be established on this forum, and I must accept the facts.
Sorry, it seems we both made a mistake. Originally you said the single precision performance differed, whereas you now say it is the double precision figures which differ - and that is what the figures say. I should have looked at the figures. What I said about a possible dependency clash is therefore just wrong.
Thinking about the problem again, if the A53 and A72 versions of the armv7a and arm64 code are the same, then I haven't the foggiest idea why the times for the double precision code should be so different.
About the four times difference in general:
When using SIMD, if double and single operations take about the same time, then overall an algorithm using double will take double the time: with 128-bit registers, 2 doubles are operated on at a time as opposed to 4 single precision numbers. The loads and stores will also take about double the time, as there is double the data.
If SIMD double operations took double the time, then overall the time might approach four times longer if all the data is in local caches.
However, this does not explain the difference between the two A53 figures.
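To make the lane arithmetic above concrete, here is a rough sketch in Python. It is only a toy model, not a description of any real pipeline: the function names, the element count and the "cycles per operation" values are all assumptions chosen for illustration, and it ignores loads, stores and caches entirely.

```python
REGISTER_BITS = 128  # width of one SIMD register, as in the discussion above


def lanes(element_bits):
    """Number of elements that fit in one 128-bit SIMD register."""
    return REGISTER_BITS // element_bits


def relative_time(element_bits, cycles_per_op, n_elements=1024):
    """Hypothetical cost model: one SIMD op per register's worth of data.

    cycles_per_op is an assumed figure, not a measured one.
    """
    ops = n_elements / lanes(element_bits)
    return ops * cycles_per_op


sp = relative_time(32, 1.0)       # single precision, 4 lanes
dp_same = relative_time(64, 1.0)  # double precision, same cost per op
dp_slow = relative_time(64, 2.0)  # double precision, ops assumed twice as slow

print(dp_same / sp)  # 2.0
print(dp_slow / sp)  # 4.0
```

Under these assumptions the factor of two comes purely from the lane count, and the factor of four needs the double precision operation itself to be twice as slow. Neither case accounts for a difference between AArch32 and AArch64 on the same core.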