Mali T760MP4 OpenCL performance issue

Hi,

I am using an RK3288 SoC and have forced the Mali T760MP4 to run at 600 MHz. I am testing performance with the "clpeak" program from GitHub, but "clpeak" always reports that the Mali is running at 200 MHz, not 600 MHz.

(1) The OS is TinkerOS_Debian V1.8. It can be downloaded from dlcdnet.asus.com/.../20170417-tinker-board-linaro-stretch-alip-v1.8.zip

(2) clpeak can be downloaded from https://github.com/krrishnarraj/clpeak

(3) linaro@linaro-alip:/proc$ cat /sys/class/misc/mali0/device/devfreq/ffa30000.gpu/cur_freq
600000000
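
To check the same value from a program rather than with cat, a minimal C sketch could look like the following. The helper name read_freq_mhz is hypothetical, and it assumes the file holds a single integer in Hz, as the cat output above suggests.

```c
#include <stdio.h>

/* Reads a file that holds one frequency value in Hz (as the devfreq
   cur_freq node above appears to) and returns it in MHz.
   Returns -1.0 if the file cannot be opened or parsed.
   read_freq_mhz is a hypothetical helper name, not part of any driver API. */
double read_freq_mhz(const char *path)
{
    unsigned long long hz = 0;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1.0;

    int parsed = fscanf(f, "%llu", &hz);
    fclose(f);

    return parsed == 1 ? (double)hz / 1e6 : -1.0;
}
```

Called with the path from (3), this would be expected to report the same 600 MHz that cat shows, independently of what clpeak prints.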

I got the following results. It seems the Mali T760 MP4 has very poor performance. What is wrong with the Mali T760 MP4?

I also found that while the Mali T760 is running, the Linux API "clock_t clock(void);" always returns a wrong value, but "gettimeofday()" returns the correct time. Is this the reason why "clpeak" generates a wrong performance report?
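
One possible explanation for the clock() observation, not confirmed anywhere in this thread: clock() reports the processor time consumed by the calling process, not elapsed time, so it barely advances while the host thread is blocked waiting for the GPU. gettimeofday() reports wall-clock time. The sketch below uses a sleep as a stand-in for a host thread waiting on the GPU; the function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <time.h>
#include <sys/time.h>

/* Wall-clock seconds since the epoch, from gettimeofday(). */
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* CPU seconds consumed by this process so far, from clock(). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Sleeps for ms milliseconds (standing in for a host thread that is
   blocked while the GPU works) and reports how far each clock advanced. */
void measure_idle_wait(unsigned ms, double *wall_elapsed, double *cpu_elapsed)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (long)(ms % 1000) * 1000000L;

    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    nanosleep(&ts, NULL);
    *wall_elapsed = wall_seconds() - w0;
    *cpu_elapsed = cpu_seconds() - c0;
}
```

After a 200 ms wait, the wall-clock delta should be about 0.2 s while the clock() delta stays close to zero, so a timer built on clock() would under-report the duration of any interval in which the CPU is mostly idle.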


-----------------------------------------------------------------

Platform: ARM Platform
  Device: Mali-T760
    Driver version  : 1.2 (Linux ARM)
    Compute units   : 4
    Clock frequency : 200 MHz

    Global memory bandwidth (GBPS)
      float   : 2.90
      float2  : 4.60
      float4  : 4.74
      float8  : 3.94
      float16 : 3.61

    Single-precision compute (GFLOPS)
      float   : 12.94
      float2  : 5.93
      float4  : 5.95
      float8  : 31.21
      float16 : 7.04

    half-precision compute (GFLOPS)
      half   : 2.89
      half2  : 6.14
      half4  : 14.32
      half8  : 13.91
      half16 : 18.97

    Double-precision compute (GFLOPS)
      double   : 1.66
      double2  : 1.55
      double4  : 15.70
      double8  : 15.46
      double16 : 15.26

    Integer compute (GIOPS)
      int   : 2.61
      int2  : 6.10
      int4  : 6.71
      int8  : 7.50
      int16 : 30.89

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 3.86
      enqueueReadBuffer          : 1.38
      enqueueMapBuffer(for read) : 1237.03
        memcpy from mapped ptr   : 1.37
      enqueueUnmap(after write)  : 2350.57
        memcpy to mapped ptr     : 1.34

    Kernel launch latency : 74.72 us

-----------------------------------------------------------------

Thank you

-Jack

  • hiwu said:
    It seems the Mali T760 MP4 has very poor performance.

    How many GFLOPS were you expecting? Depending on what you call a "flop", the architectural maximum for Mali-T760 is 14 single precision flops per clock per core, so 14*4*600M = 33.6 GFLOPS best case. You are getting close to that in the single precision tests at some vector lengths (float8 reaches 31.21 GFLOPS).
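
The best-case arithmetic can be written out as a small C helper. This is only a sketch: peak_gflops is a hypothetical name, and the 14 flops per clock per core figure is the one quoted in this reply, not something the code derives.

```c
/* Theoretical peak in GFLOPS:
   flops per clock per core * number of cores * clock frequency in Hz.
   peak_gflops is a hypothetical helper used for illustration only. */
double peak_gflops(double flops_per_clock_per_core, int cores, double clock_hz)
{
    return flops_per_clock_per_core * cores * clock_hz / 1e9;
}
```

With the figures from this thread, peak_gflops(14, 4, 600e6) gives 33.6, while peak_gflops(14, 4, 200e6) gives 11.2. The measured 31.21 GFLOPS for float8 is well above the 200 MHz ceiling, which suggests the GPU really was running faster than the 200 MHz that clpeak printed.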

    In terms of why the performance is so erratic, I can see two problems.

    Firstly, it is important to note that the Midgard Mali GPUs are a SIMD vector architecture per thread with a 128-bit data path. The MAC chains in the short vector length test kernels are scalar with a hard data dependency due to the accumulation, so are not able to benefit from the datapath widths of the hardware here. For Mali you really need to write vector code to get the full performance out of the maths units (and the load/store unit - use vector loads and stores too).
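
To illustrate the difference, here is a rough sketch in plain host C, not an OpenCL kernel. The vec4f struct stands in for the four 32-bit lanes of a 128-bit datapath (in OpenCL C the built-in float4 type would play that role), and all names are hypothetical.

```c
#include <stddef.h>

#define LANES 4 /* 128-bit datapath / 32-bit float */

/* Stand-in for a 4-wide float vector. */
typedef struct { float lane[LANES]; } vec4f;

/* Scalar MAC chain: every step needs the previous result, so there is
   only one value in flight and the other lanes have nothing to do. */
float mac_chain_scalar(float x, float y, size_t iters)
{
    float acc = y;
    for (size_t i = 0; i < iters; ++i)
        acc = acc * x + y;
    return acc;
}

/* Vector MAC chain: the same dependent chain, but four independent
   chains advance together, one per lane. */
vec4f mac_chain_vec4(vec4f x, vec4f y, size_t iters)
{
    vec4f acc = y;
    for (size_t i = 0; i < iters; ++i)
        for (int l = 0; l < LANES; ++l)
            acc.lane[l] = acc.lane[l] * x.lane[l] + y.lane[l];
    return acc;
}
```

Each lane of mac_chain_vec4 computes exactly what mac_chain_scalar computes for that lane's inputs, so one vector chain does the work of four scalar chains per step. The chain is also kept in a loop here rather than being fully unrolled.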

    Secondly, looking at the benchmark a lot of the kernel variants are rather naive in terms of how they have been structured; the basic test variants contain 2048 unrolled MAC calls once you expand out the macros. I firmly expect most of these to be thrashing the instruction caches on a mobile GPU, rather than actually telling you anything useful about floating point performance.

    Good benchmarks look like real use cases; anything too extreme is likely to hit some caching problems, especially for mobile and embedded parts, so try and write kernels which look "realistic" to some extent if you can.

    HTH, 
    Pete
