
How to calculate GFLOPS / GOPS?

Hi. 

I'm an SoC design engineer and want to provide my customers with the Mali-G51's nominal ML performance in terms of GFLOPS, GOPS, or GMACs.

My GPU is a Mali-G51 MP4 at 800 MHz, and the following is the information I have gathered from articles:

 - G51MP4 = 6 execution engines.

 - Each execution engine = 4 SIMD Lanes.

 - Each Lane = 4 MACs per cycle (not sure which data type is assumed: fp32, fp16, or int8?)

Then my calculation of the nominal (ideal) ML processing capability is:

 (6 engines/MP4) x (4 lanes/engine) x (4 MACs/lane) x 800 MHz = 76.8 GMAC/s = 153.6 GFLOPS.

Is it correct?

If the data type above is fp32, can I expect double those numbers for fp16?

How about the int8 data type and the GOPS calculation?

  •  - Each Lane = 4 MACs per cycle (not sure which data type is assumed: fp32, fp16, or int8?)

    You get 4 FP32 MACs per execution engine.

     (6 engines/MP4) x (4 lanes/engine) x (4 MACs/lane) x 800 MHz = 76.8 GMAC/s = 153.6 GFLOPS.

    There are some additional simple maths units which can run in parallel to the MAC hardware (adders, etc.), but in terms of MAC performance, which is probably what you care about for ML, that looks correct to me.
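
    Written out as a quick back-of-the-envelope sketch (plain host-side C, just restating the arithmetic above with illustrative names; not a measurement of any kind):

        #include <stdio.h>

        int main(void)
        {
            /* Nominal (peak) MAC throughput from the figures quoted above. */
            const double clock_hz      = 800e6; /* 800 MHz                      */
            const double engines       = 6.0;   /* execution engines in the MP4 */
            const double lanes         = 4.0;   /* SIMD lanes per engine        */
            const double macs_per_lane = 4.0;   /* FP32 MACs per lane per cycle */

            double gmacs  = engines * lanes * macs_per_lane * clock_hz / 1e9;
            double gflops = gmacs * 2.0;        /* 1 MAC = 1 multiply + 1 add   */

            printf("FP32 peak: %.1f GMAC/s = %.1f GFLOPS\n", gmacs, gflops);
            return 0;
        }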

    If the data type above is fp32, can I expect double those numbers for fp16?

    Yes.

    How about the int8 data type and the GOPS calculation?

    Should be double again.

    Cheers, 
    Pete

  • Thank you Peter.

    I have one thing I'd like to clarify about your answer on int8.

    The article about the new G52 product (below) says that proper (i.e. fast) 8-bit operations only start from Mali-G52 in the mid-range GPU line-up.

    It sounds like Mali-G51 isn't architected to perform well with int8 operations as expected (4x faster than 32-bit operations), and that matches what I saw during my ML benchmark evaluation: int8 inferencing on G51 was not faster than fp16, or even fp32.

    For int8 inferencing there is overhead of quantization and dequantization. But with 4x faster arithmetic operations, int8 should still lead to at least slightly faster execution than fp32.

    Could you confirm that 2D convolution in int8 (the int8 dot product is the major operation there) can be 4x faster than fp32 on Mali-G51?

    " Arm says that this is where most of the gains in performance and density come from as the doubling of the ALU lanes only increases the core area by ~1.22x. The 3.6x increase in machine learning workloads is attributed to the fact that the new ALUs can now handle 8-bit dot product operations." (https://www.anandtech.com/show/12501/arm-launches-new-mali-g52-g31-gpus-new-display-and-video-ip)

  • It sounds like Mali-G51 isn't architected to perform well with int8 operations as expected (4x faster than 32-bit operations), and that matches what I saw during my ML benchmark evaluation: int8 inferencing on G51 was not faster than fp16, or even fp32.

    We have 8-bit operations in all of the Midgard and Bifrost Mali GPUs - char and uchar are part of the core specification for OpenCL. However, basic vector operations in OpenCL do not benefit from the integer promotion rules, so if you multiply two 8-bit vector values you get an 8-bit result which is therefore prone to overflow. To avoid that you have to cast up to shorts, at which point you're no longer doing 8-bit processing.
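
    To make that concrete, here is a minimal OpenCL C sketch (kernel and buffer names are purely illustrative):

        __kernel void mul8(__global const char4 *a,
                           __global const char4 *b,
                           __global char4  *narrow_out,
                           __global short4 *wide_out)
        {
            const int i = get_global_id(0);

            /* Vector operations get no integer promotion: this multiply is
               done at 8 bits, so e.g. 100 * 2 no longer fits in the result. */
            narrow_out[i] = a[i] * b[i];

            /* To avoid the overflow you have to widen first, at which point
               the arithmetic is really 16-bit, not 8-bit. */
            wide_out[i] = convert_short4(a[i]) * convert_short4(b[i]);
        }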

    For int8 inferencing there is overhead of quantization and dequantization.

    Ideally you wouldn't need to do this and could get close to zero copy for many use cases, in particular when dealing with image data which is commonly 8-bit per color channel to start off with. 

    Could you confirm that 2D convolution in int8 (the int8 dot product is the major operation there) can be 4x faster than fp32 on Mali-G51?

    The operations added in Mali-G52 are an explicit 8-bit DOT and DOT+ACCUMULATE with internal result widening. These both do four 8-bit x 8-bit multiplies to a 16-bit intermediate precision and accumulate to a 32-bit result, so you avoid the overflow issues you would get trying to do this without a hardware-backed function.

    See the extension specification here:

    www.khronos.org/.../cl_arm_integer_dot_product.txt
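
    Usage looks roughly like the sketch below. I'm assuming the signed char4 arm_dot_acc built-in as described in that spec, and the kernel is purely illustrative rather than a real convolution, so please check the extension text for the exact signatures your driver exposes:

        /* Requires the cl_arm_integer_dot_product_accumulate_int8 extension. */
        __kernel void int8_dot_acc(__global const char4 *weights,
                                   __global const char4 *inputs,
                                   __global int *out,
                                   const int vecs_per_item)
        {
            const int gid  = get_global_id(0);
            const int base = gid * vecs_per_item;

            int acc = 0;
            for (int i = 0; i < vecs_per_item; ++i) {
                /* Four 8-bit x 8-bit multiplies, widened internally and
                   accumulated into a 32-bit result in a single operation. */
                acc = arm_dot_acc(weights[base + i], inputs[base + i], acc);
            }
            out[gid] = acc;
        }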

    Both of these are single-cycle operations, so they will be at least 4x faster than manually written fp32 code at the instruction level. Exactly what you see at the overall kernel level will vary depending on what your kernel does: if the 8-bit dot product is only 25% of the workload then obviously any benefit will be reduced.

    HTH,
    Pete