Cortex-A72 Maximum Theoretical Linpack Performance R_peak

As part of my MSc Scientific Computing at UCL, I'm benchmarking a small Raspberry Pi 4 Model B cluster.

I would like to reference the theoretical maximum performance of the BCM2711 (4 x ARM Cortex-A72) in Linpack terminology, R_peak.

I believe R_peak to be: 1.5 GHz x 3-way dispatch x 4 cores = 18 Gflops. This seems to be the "standard" Linpack methodology.

It would be very helpful if someone more knowledgeable than me could confirm that this seems reasonable. Even better, is there any official ARM benchmarking material that I can reference in my dissertation?

Best wishes

John

Top replies

+1 Chris Goodyer over 5 years ago

Hi John,

An A72 core has a single 128-bit vector pipeline. A 128-bit vector holds two double-precision values, so the core can perform two double-precision operations per cycle, such as those used in High Performance Linpack. With FMA (fused multiply-add) instructions, each cycle performs 2 FLOPs on each of 2 doubles, i.e. 4 FLOPs per cycle. At 1.5 GHz you therefore have a maximum performance per core of 6 GFLOPS. This gives a peak performance for 4 cores of 24 GFLOPS.
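
As a rough sketch of the arithmetic above, the peak figure is just the product of clock rate, vector lanes, FLOPs per lane per cycle, and core count. The function name `peak_gflops` and its parameter names are made up for illustration; the numbers are the ones quoted in this thread, not measured values.

```python
def peak_gflops(clock_ghz, lanes, flops_per_lane, cores):
    """Theoretical peak in GFLOPS: clock x lanes x FLOPs per lane per cycle x cores."""
    return clock_ghz * lanes * flops_per_lane * cores


# Figures from this thread: 1.5 GHz clock, one 128-bit vector = 2 doubles,
# FMA counted as 2 FLOPs (one multiply plus one add) per lane per cycle.
per_core = peak_gflops(clock_ghz=1.5, lanes=2, flops_per_lane=2, cores=1)
four_cores = peak_gflops(clock_ghz=1.5, lanes=2, flops_per_lane=2, cores=4)

print(f"per core: {per_core} GFLOPS")    # 6.0
print(f"4 cores:  {four_cores} GFLOPS")  # 24.0
```

This is a theoretical ceiling only; a measured Linpack result (R_max) will come in below it.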

Hope this helps.

Chris

0 John Duffy over 5 years ago in reply to Chris Goodyer

Hi Chris

Thank you. That is very helpful!

If you don't mind a follow up question...

On Page 6 of the Cortex-A72 Software Optimisation Guide, 2.1 Pipeline Overview, the block diagram, and subsequent instruction details, refer to two floating point pipelines, FP/ASIMD 0 and FP/ASIMD 1. I'm not sure how these relate to "a single 128-bit vector pipeline"?

Kind regards

John

0 Chris Goodyer over 5 years ago in reply to John Duffy

Hi.

I think the easiest way to imagine this is that the core can either use the two pipelines as two 64-bit pipelines or combine them into a single 128-bit vector. If you look at the throughput numbers in the document you reference, hopefully they make this a bit clearer.

Chris

+1 John Duffy over 5 years ago in reply to Chris Goodyer

Thank you again Chris.

John

0 Timo over 3 years ago in reply to Chris Goodyer

Hi Chris,

The 4 FLOPs per cycle above are based on 64-bit double precision, right?

For 32-bit single-precision operations, should the FLOPs per cycle be doubled?

Thanks,

Timo

+1 Chris Goodyer over 3 years ago in reply to Timo

Hi Timo,

Yes, that's right. Each element takes half as much space in single precision, so you can fit 4 "lanes" of single precision into a 128-bit vector, and the overall computation rate is doubled. This enables the A72 to achieve 8 FLOPs per core per cycle when using FMA operations.
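
A minimal sketch of the lane counting described here, assuming a 128-bit vector and counting an FMA as 2 FLOPs per lane. The names `VECTOR_BITS` and `flops_per_cycle` are hypothetical helpers, not part of any ARM tool or library.

```python
VECTOR_BITS = 128  # vector width discussed in this thread


def flops_per_cycle(element_bits, fma=True):
    """FLOPs per core per cycle for one full-width vector operation."""
    lanes = VECTOR_BITS // element_bits  # elements that fit in one vector
    return lanes * (2 if fma else 1)     # FMA = multiply + add per lane


double_rate = flops_per_cycle(64)  # 2 lanes x 2 FLOPs
single_rate = flops_per_cycle(32)  # 4 lanes x 2 FLOPs

print(double_rate, single_rate)  # 4 8
```

Halving the element width doubles the lane count, which is where the factor of two between single and double precision comes from.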

Note it is not universally true for every core in the world that double precision can be calculated at the same rate as single precision, but almost all that I've used do so.

Chris

0 Timo over 3 years ago in reply to Timo

Hi Chris,

Thanks for the answer which helps me understand it more accurately.

Regards,

Timo

0 Annie over 3 years ago in reply to Timo

Hi Timo, if you think that Chris has answered your question, then please mark it as an accepted answer. Many thanks!

0 Timo over 3 years ago in reply to Chris Goodyer

Thanks Chris for your answer.