
How to calculate GFLOPS / GOPS ?

Hi. 

I'm an SoC design engineer and want to provide my customers with the Mali-G51's nominal ML performance in terms of GFLOPS, GOPS, or GMACS.

My GPU is a Mali-G51 MP4 at 800 MHz, and the following is the information I have gathered from articles.

 - G51MP4 = 6 execution engines.

 - Each execution engine = 4 SIMD Lanes.

 - Each lane = 4 MACs per cycle (not sure which data type is assumed: fp32, fp16, or int8?)

Then my calculation of the nominal (ideal) ML processing capability is:

 (6 engines per MP4) x (4 lanes per engine) x (4 MACs per lane) x 800 MHz = 76.8 GMAC/s = 153.6 GFLOPS.

Is it correct?

If the data type assumed above is fp32, can I expect double those numbers for fp16?

How about the int8 data type and the corresponding GOPS calculation?
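
To make the arithmetic explicit, here is a small C sketch of the same calculation (my assumption: one MAC counts as two FLOPs, a multiply plus an add):

    #include <stdio.h>

    int main(void) {
        /* Mali-G51 MP4 figures as listed above */
        const double engines       = 6.0;  /* execution engines in the MP4 */
        const double lanes         = 4.0;  /* SIMD lanes per engine        */
        const double macs_per_lane = 4.0;  /* MACs per lane per cycle      */
        const double clock_ghz     = 0.8;  /* 800 MHz                      */

        double gmacs  = engines * lanes * macs_per_lane * clock_ghz;  /* 76.8 GMAC/s */
        double gflops = 2.0 * gmacs;  /* 153.6 GFLOPS if one MAC = 2 FLOPs */

        printf("Nominal: %.1f GMAC/s = %.1f GFLOPS\n", gmacs, gflops);
        return 0;
    }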

Parents
  • It sounds like the Mali-G51 is not architected to perform as well with int8 operations as expected (4x faster than 32-bit operations), and that matches what I saw during my ML benchmark evaluation: int8 inferencing on the G51 was not faster than fp16, or even fp32.

    We have 8-bit operations in all of the Midgard and Bifrost Mali GPUs - char and uchar are part of the core specification for OpenCL. However, basic vector operations in OpenCL do not benefit from the integer promotion rules, so if you multiply two 8-bit vector values you get an 8-bit result which is therefore prone to overflow. To avoid that you have to cast up to shorts, at which point you're no longer doing 8-bit processing.
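
    To illustrate the overflow point, a rough (untested) OpenCL C sketch of what happens with a plain 8-bit vector multiply:

        // uchar4 * uchar4 stays 8-bit: e.g. 100 * 100 wraps rather than giving 10000,
        // because the integer promotion rules do not apply to OpenCL vector types.
        __kernel void mul8(__global const uchar4 *a,
                           __global const uchar4 *b,
                           __global uchar4       *narrow,
                           __global ushort4      *wide)
        {
            size_t i = get_global_id(0);

            narrow[i] = a[i] * b[i];   // 8-bit result, prone to overflow

            // Avoiding the overflow means widening to shorts first, at which
            // point you are no longer doing 8-bit processing.
            wide[i] = convert_ushort4(a[i]) * convert_ushort4(b[i]);
        }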

    For int8 inferencing there is the overhead of quantization and dequantization.

    Ideally you wouldn't need to do this and could get close to zero copy for many use cases, in particular when dealing with image data which is commonly 8-bit per color channel to start off with. 

    Could you confirm that 2D convolution in int8 (where the int8 dot product is the dominant operation) can be 4x faster than fp32 on the Mali-G51?

    The operations added in Mali-G52 are an explicit 8-bit DOT and DOT+ACCUMULATE with internal result widening. These both do four 8-bit * 8-bit multiplies at 16-bit intermediate precision and accumulate to a 32-bit result, so you avoid the overflow issues you would hit trying to do this without a hardware-backed function.

    See the extension specification here:

    www.khronos.org/.../cl_arm_integer_dot_product.txt

    Both of these are single-cycle operations, so they will be at least 4x faster than manually written fp32 code at the instruction level. Exactly what you see at the overall kernel level will vary depending on what your kernel does - if the 8-bit dot product is only 25% of the workload then obviously any benefit will be reduced.
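
    Something along these lines (an untested sketch, using the arm_dot_acc built-in as named in the extension spec above) shows how it is meant to be used:

        // Requires a device exposing cl_arm_integer_dot_product_accumulate_int8
        // (Mali-G52 onwards, not Mali-G51).
        #pragma OPENCL EXTENSION cl_arm_integer_dot_product_accumulate_int8 : enable

        // One work-item per output: dot product of one int8 row against a
        // shared int8 vector, four 8-bit MACs per arm_dot_acc call.
        __kernel void dot_rows_int8(__global const char4 *rows,
                                    __global const char4 *vec,
                                    __global int         *out,
                                    int                   n4)  /* char4 elements per row */
        {
            size_t row = get_global_id(0);
            int acc = 0;

            for (int i = 0; i < n4; ++i)
                acc = arm_dot_acc(rows[row * n4 + i], vec[i], acc);

            out[row] = acc;
        }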

    HTH,
    Pete

