Mali-G715-Immortalis MC11 r1p2 slower than Mali-G72 MC24

Hi,

I got a new Xiaomi 13T Pro with a Mali-G715-Immortalis MC11 r1p2 GPU and a MediaTek Dimensity 9200+.

My program runs 20% slower on it than on my old Huawei Honor Play with a Mali-G72 MC24.

I added

#pragma OPENCL EXTENSION cl_khr_priority_hints : enable


This reduced the run time from 140 ms to 110 ms, but that is still slower than the 80 ms on the Huawei.

Could using CL buffers on the Mali-G715 reduce the speed?
  • Without knowing what you are doing, it's going to be very hard to provide any specific advice. Have you profiled both platforms with our Streamline profiler? It's free-of-charge as part of Arm Performance Studio.

    Kind regards, 
    Pete

  • Thanks for the information. I ran Streamline but I got this error:

    Could not initialize class com.arm.streamline.jni.elfdwarf.ElfDwarfParser
      java.lang.NoClassDefFoundError: Could not initialize class com.arm.streamline.jni.elfdwarf.ElfDwarfParser
          at com.arm.streamline.analysis.elfdwarf.ElfDwarf.isProcessingNeeded(ElfDwarf.java:102)
          at com.arm.streamline.analysis.session.SessionProcessor.produceReport(SessionProcessor.java:486)
          at com.arm.streamline.capture.apc.APCCapture.lambda$17(APCCapture.java:366)
          at com.arm.streamline.capture.apc.APCCapture.doIfValidCaptureSettings(APCCapture.java:430)
          at com.arm.streamline.capture.apc.APCCapture.analyze(APCCapture.java:339)
          at com.arm.streamline.live.LiveCaptureUiUtils.lambda$1(LiveCaptureUiUtils.java:59)
          at com.arm.streamline.common.utility.Task.run(Task.java:291)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
          at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
          at java.base/java.lang.Thread.run(Thread.java:1583)
      Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.AssertionError: Failed to load Streamline JNI lib [in thread "Main thread for report"]
          at com.arm.streamline.jni.elfdwarf.ElfDwarfParser.<clinit>(Unknown Source)
          ... 13 more

    I was not able to execute these commands:

    1. Run the following command on the device:
    setprop dalvik.vm.dex2oat-flags --no-strip-symbols
    2. Re-install the APK file
    3. To verify the options for dex2oat are set correctly, run the command:
    getprop dalvik.vm.dex2oat-flags
    4. To check whether DEX files contain .debug_* sections, you can use the GNU readelf
    command, for example:
    readelf -S .../images/*.dex

  • The NoClassDefFoundError occurs because I use Windows 7.

    But I could still see the charts. The Xiaomi uses only 4 cores when running OpenCL, and all the cores when running only OpenGL.

    Could that be the problem?

  • If you are only seeing 4 cores used for OpenCL that could certainly explain it. 

    This sounds like a customization from the OEM outside of our standard driver, so you might be able to find out more on the OEM forums.

    Kind regards, 
    Pete

  • I ran a new test: if I use OpenCL and OpenGL in the same APK, the Xiaomi uses only 4 cores and the kernel becomes even slower.

  • Is it possible to install your standard driver in place of the Xiaomi driver? Is there anything I can do other than buying another phone for testing?

    And maybe you could tell me which phone is best for running OpenCL. ;))

  • It is definitely not an OEM problem. It is a problem with the Android toolchain and the NDK version. It is a mess.

  • Hi,

    I managed to build the Android APK as 64-bit for the Xiaomi 13T Pro, and now all the cores are used with OpenCL. I get good performance, nearly the same as on the Huawei Honor Play.

    But the Honor Play still has more stable processing times. The Xiaomi fluctuates a lot, from 35 ms to 80 ms from one frame to the next.
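
To put a number on this frame-to-frame fluctuation, the per-frame times can be summarized as minimum, maximum, mean and variance. The following is only a minimal C sketch: FrameStats and frame_stats are hypothetical names, not part of OpenCL or any Arm tool, and it assumes the times were already measured in milliseconds (for instance with a host timer around each frame).

```c
#include <assert.h>
#include <stddef.h>

/* Summary of per-frame processing times, all in milliseconds
 * (variance in milliseconds squared). Hypothetical helper type. */
typedef struct {
    double min_ms;
    double max_ms;
    double mean_ms;
    double variance_ms2; /* population variance */
} FrameStats;

/* Compute min, max, mean and population variance of n frame times. */
static FrameStats frame_stats(const double *times_ms, size_t n)
{
    assert(times_ms != NULL && n > 0);

    FrameStats s = { times_ms[0], times_ms[0], 0.0, 0.0 };
    double sum = 0.0;

    /* First pass: extremes and sum for the mean. */
    for (size_t i = 0; i < n; ++i) {
        if (times_ms[i] < s.min_ms) s.min_ms = times_ms[i];
        if (times_ms[i] > s.max_ms) s.max_ms = times_ms[i];
        sum += times_ms[i];
    }
    s.mean_ms = sum / (double)n;

    /* Second pass: squared deviations from the mean. */
    double sq = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = times_ms[i] - s.mean_ms;
        sq += d * d;
    }
    s.variance_ms2 = sq / (double)n;
    return s;
}
```

For four frames alternating between 35 ms and 80 ms, this gives a mean of 57.5 ms and a spread (max minus min) of 45 ms. Comparing the same figures from both phones would show how much more stable one is than the other.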

  • OpenCL support definitely varies from OEM to OEM, and over time as well, which can be frustrating.

    Unfortunately we can't provide a driver, we deliver to our partners who adapt the driver code to put it on their phones, so it has to come from them.

    Xiaomi forums are in theory at https://c.mi.com/global/ but I can't get to it (from UK). However, there's an unofficial EU forum that apparently works with Xiaomi at https://xiaomi.eu/community/ - I've not used it to see what the quality of support is, but it seems active at any rate.

    That said - looks like you've done great work yourself to get better performance!

  • Thanks for the information, but I already tried to contact Xiaomi by e-mail about OpenCL. After that, my e-mail address was refused when I tried to create an account. They said that they do not have any information about it. Xiaomi is just an integrator, not a chip producer like Huawei. They may not even know that OpenCL exists ;))

    Yes, I spent nearly 2 months on the Xiaomi problem, not full time of course, but thinking about possible causes until I found it. ;))

    But I think I understand why OpenCL is a problem for OEMs. It uses a lot of battery and produces a lot of heat when run together with CPU multithreading, whereas OpenGL consumes less battery and produces less heat.

    Yes, it is quite frustrating that OpenCL differs so much from one OEM to another, and that we cannot get much help from any OEM.

    But I noticed that OpenCL performance is related to the number of cores on the GPU. So a G72 with 24 cores should be faster than a G715 with 11 cores. I may be wrong, but that is what I think.

    The only improvement I would like to see in OpenCL is the ability to execute work-groups in order rather than in random order. That would be great: we could then do processing on the GPU that currently has to be done on the CPU. But maybe in the future ;)).

    PS. Concerning the Xiaomi: it looks like they treat 64-bit as two 32-bit halves for the GPU and CPU driver commands. That may be why a 64-bit build uses all the cores and a 32-bit build only 4 cores. On the Huawei I found that 64-bit and 32-bit give the same result, and that 32-bit is slightly faster than 64-bit. I still do not understand why 64-bit uses all the cores and 32-bit only 4. Maybe someone can explain it.

    Regards.
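
One way to confirm which variant of the native library actually runs on each phone is to report the ABI from inside the native code. The following is a minimal C sketch under assumptions: build_abi and is_64bit_build are hypothetical helper names, and the check relies on the predefined macros that GCC and Clang set for each target architecture. It does not explain why the 32-bit build is limited to 4 cores; it only makes it easy to see which build is loaded.

```c
/* Return the Android ABI name this translation unit was compiled for,
 * based on compiler-predefined architecture macros.
 * Hypothetical helper, not part of the NDK or OpenCL. */
static const char *build_abi(void)
{
#if defined(__aarch64__)
    return "arm64-v8a";    /* 64-bit Arm */
#elif defined(__arm__)
    return "armeabi-v7a";  /* 32-bit Arm */
#elif defined(__x86_64__)
    return "x86_64";
#elif defined(__i386__)
    return "x86";
#else
    return "unknown";
#endif
}

/* Return 1 when pointers are 8 bytes wide, i.e. a 64-bit build. */
static int is_64bit_build(void)
{
    return sizeof(void *) == 8;
}
```

Logging the result at startup, next to the measured kernel time, makes runs of the 32-bit and 64-bit builds easier to compare.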