Summary
Is OpenCL support for the Mali-T628 (for example as found in the Exynos 5420 SoC on the Arndale Octa board) available? If so, how do I set it up?
More details
According to the vendor, OpenCL should be supported, but the Arndale Octa Wiki does not state how this can be achieved.
I am using the latest Linaro developer build and installed Mali drivers that contain OpenCL libraries for Mali T604. According to this guide, the driver actually contains references to the Mali T628. So I tried to create the udev rule as specified, which is supposed to solve a permission problem with /dev/mali0, but I found that there is no /dev/mali0 on my installation at all. So my conclusion is that the driver indeed does not support T628.
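A missing /dev/mali0 may say more about the running kernel than about the userspace driver, since the device node is created by the Mali kernel driver. The following is a minimal sketch for checking the node before touching udev rules. The function name `check_mali_node` is hypothetical, and the default path is simply the one named in the post.

```shell
# Report whether the Mali device node exists and is accessible to the current user.
# $1: device node to check (assumed default: /dev/mali0, as named in the post).
check_mali_node() {
    node="${1:-/dev/mali0}"
    if [ ! -e "$node" ]; then
        echo "missing: $node (kernel driver probably not loaded)"
        return 1
    fi
    if [ -r "$node" ] && [ -w "$node" ]; then
        echo "ok: $node is readable and writable"
    else
        echo "no access: $node exists but needs a udev rule or group change"
        return 2
    fi
}
```

Calling `check_mali_node` with no argument checks /dev/mali0. Only the "no access" case is something a udev rule can fix.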
When I run the clinfo utility, clGetDeviceInfo returns CL_OUT_OF_HOST_MEMORY for some device properties. Why can I query the GPU for some characteristics while the query fails for others? When running a normal application, the same error appears when trying to create an OpenCL context.
I was surprised to find this topic, where yoshi seems to have OpenCL working and can run benchmarks on his Arndale Octa board. How is this possible if there is no driver available? Or am I just missing something? I hope that you can help me to also establish a working OpenCL development environment.
You are right, I accidentally mixed up the two version numbers. So I should use the unmodified kernel in combination with the r4p0 userspace binary. This binary comes in two forms, X11 and fbdev. I tried both but don't see much of a difference. Which one should I use?
I am using the same clpeak benchmark that I used before:
Platform: ARM Platform
Device: Mali-T628
Driver version : 1.1
Compute units : 4
Clock frequency : 533 MHz
Single-precision compute (GFLOPS)
float : 1.56654
float2 : 3.92411
float4 : 3.92181
float8 : 4.84877
float16 : 4.82142
Device: Mali-T628
Driver version : 1.1
Compute units : 2
Clock frequency : 533 MHz
Single-precision compute (GFLOPS)
float : 0.747759
float2 : 1.96387
float4 : 1.98183
float8 : 2.43132
float16 : 2.41775
It is nice to see that the two clusters are now properly recognized, but the results are nowhere near the roughly 33 GFLOPS that Chris reported:
I am seeing 33.27 and 33.17 SP GFLOPS for float2 and float4 respectively, or 45% of max theoretical peak
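To put the gap in numbers, here is a small sketch of the arithmetic using awk. It takes the float2 result of the 4-unit device (3.92411) and the quoted 33.27 GFLOPS at 45% of peak. Whether the quoted figure was measured on the same device cluster is not stated in the thread, so the comparison is only indicative. The function name `gflops_gap` is made up for this example.

```shell
# Compare a measured GFLOPS figure with the reference figure quoted above.
# $1: measured GFLOPS, $2: reference GFLOPS, $3: reference as a fraction of peak.
gflops_gap() {
    awk -v m="$1" -v r="$2" -v f="$3" 'BEGIN {
        peak = r / f    # theoretical peak implied by the quoted percentage
        printf "implied peak: %.1f GFLOPS\n", peak
        printf "measured is %.1f%% of peak\n", 100 * m / peak
        printf "reference is %.1fx faster\n", r / m
    }'
}

gflops_gap 3.92411 33.27 0.45
```

With these inputs the implied peak is about 73.9 GFLOPS, the measured result is about 5.3% of it, and the reference run is about 8.5 times faster.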
Hi bramv,
It's probably worth someone at our end testing this out as well, but as a quick sanity check can you ensure that the CPU and GPU DVFS are disabled/otherwise pinned (set CPU DVFS to performance and frequency to something high like 1.7GHz) before running the benchmark? These SoCs have a tendency to take the busfreq down with the CPUfreq when the CPU is idle, and as these GPU benchmarks tend not to stress the CPU too much, this has the effect that the bus clocks down to v/fmin and severely throttles the GPU's memory bandwidth.
Hth,
Chris
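On kernels that expose the standard cpufreq sysfs interface, the CPU governor can be set without a separate utility by writing to each CPU's `scaling_governor` file. The sketch below assumes that interface is present on this BSP, which the thread does not confirm. The function name `set_cpu_governor` and the overridable root directory are illustrative only.

```shell
# Set every CPU's cpufreq governor; needs root on a real system.
# $1: governor name (default: performance)
# $2: sysfs CPU root (default: /sys/devices/system/cpu; overridable for dry runs)
set_cpu_governor() {
    governor="${1:-performance}"
    root="${2:-/sys/devices/system/cpu}"
    changed=0
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue    # skip CPUs without cpufreq or without write access
        echo "$governor" > "$f" && changed=$((changed + 1))
    done
    echo "governor '$governor' written for $changed CPU(s)"
    [ "$changed" -gt 0 ]
}
```

Running `set_cpu_governor performance` as root before the benchmark is the intended use. A count of 0 suggests cpufreq is not compiled in or not exposed.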
Hi Chris,
I guess I would need a utility like cpufreq to set DVFS or change the clock frequency of the SoC, but don't know exactly how to do this. Instead I looked in the kernel configuration and disabled DVFS altogether and only enabled the performance governor. This however does not seem to have any impact on performance.
I agree that it would be useful if someone at our end could run the same benchmark. You can find my code here.
The X11 and fbdev choice gives you the choice to render GLES 3d content either directly into the framebuffer (fbdev), or within the X windowing system (X11). If you are doing nothing graphics related and have no need to install X11, then I would suggest you just use the fbdev version of the driver userspace; in terms of OpenCL they will be exactly the same.
Following on from what Chris said, it could well be that DVFS is clocking down bandwidth and therefore causing the reduced performance you are seeing. I can take a look at this at my end and try and verify if this is the case.
To test at your end quickly, you should be able to disable dvfs as follows:
echo off > /sys/class/misc/mali0/device/dvfs
Hope this Helps,
Rich
Hi Rich,
Good to know the difference between X11 and fbdev. I have one follow-up question on this. So far I have linked my OpenCL programs against libmali.so from the userspace binary package instead of libOpenCL.so: linking against libOpenCL.so gives me lots of undefined reference errors, whereas with libmali.so the build completes without any errors. I also noticed that libmali.so is the only substantial file in the package:
# ll ../fbdev/
total 21056
drwxr-xr-x 2 root root 4096 Jan 1 2000 ./
drwx------ 11 root root 4096 Jan 1 2000 ../
-rwxr-x--- 1 16580 16580 4806 Jul 23 14:44 libEGL.so*
-rwxr-x--- 1 16580 16580 4806 Jul 23 14:44 libGLESv1_CM.so*
-rwxr-x--- 1 16580 16580 4806 Jul 23 14:44 libGLESv2.so*
-rwxr-x--- 1 16580 16580 4806 Jul 23 14:44 libOpenCL.so*
-rwxr-x--- 1 16580 16580 21518354 Jul 23 14:44 libmali.so*
Is linking against libmali.so the correct way to go?
Regarding DVFS, the /device/dvfs file is not present on my system:
# tree /sys/class/misc/mali0
/sys/class/misc/mali0
├── dev
├── device -> ../../../11800000.mali
├── power
│ ├── autosuspend_delay_ms
│ ├── control
│ ├── runtime_active_time
│ ├── runtime_status
│ └── runtime_suspended_time
├── subsystem -> ../../../../class/misc
└── uevent
3 directories, 7 files
And your command results in a permission denied error, even though I am using a root shell:
# echo off > /sys/class/misc/mali0/device/dvfs
-bash: /sys/class/misc/mali0/device/dvfs: Permission denied
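A plausible reading of this error is that sysfs does not let a shell create new files, so redirecting into an attribute that does not exist fails with "Permission denied" even for root. Before writing, it may help to list which attributes the device actually exposes. This is a rough sketch: the function name `find_dvfs_attrs` and the name patterns are assumptions, and the default directory is the device link from the tree output above.

```shell
# List sysfs attributes under a device directory whose names mention dvfs or clock.
# $1: directory to search (assumed default: the mali0 device link; the trailing
#     slash makes find follow the symlink).
find_dvfs_attrs() {
    dir="${1:-/sys/class/misc/mali0/device/}"
    [ -d "$dir" ] || { echo "no such directory: $dir"; return 1; }
    find "$dir" -maxdepth 2 \( -iname '*dvfs*' -o -iname '*clock*' \) 2>/dev/null | sort
}
```

Empty output from `find_dvfs_attrs` would be consistent with the BSP not exposing a DVFS control through this driver.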
I am looking forward to seeing your clpeak results!
"libOpenCL.so" is the spec defined library name that OpenCL should be exposed as on the platform. In our case we implement this (and the GLES libs) as shims which pass through to the libmali.so binary, which is why that's the largest one there. For development purposes you can link against either, but obviously you wouldn't want to do this for a release, you'd want to link against libOpenCL.so for portability.
That said, it SHOULD work, so if you are having issues linking against libOpenCL.so feel free to share them here and we will take a look. I can't think of a good reason why one should work and the other fail.
Thanks,
All the errors are undefined reference errors at link time. The symbols should be there, so offhand I'm not sure why that's happening, but in any case linking against libmali.so is working.
In your above output, ldd a.out is reporting /usr/lib/libmali.so on the first line, so that's working as expected.
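The same check can be done mechanically by filtering the `ldd` output for the two library names. This is a minimal sketch with a hypothetical helper name, `gpu_libs_in_ldd`.

```shell
# Read `ldd` output on stdin and print the lines that mention the Mali or OpenCL libraries.
# Exits non-zero when neither library is listed.
gpu_libs_in_ldd() {
    grep -E 'lib(mali|OpenCL)\.so'
}
```

For example, `ldd ./a.out | gpu_libs_in_ldd` prints the libmali.so line for the binary discussed here. No output would mean the binary is not dynamically linked against either library.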
I put all files from the binary userspace driver in the /usr/lib/ directory and compiled just using: "g++ program.cpp -lOpenCL". This gives the same result as compiling like this: "g++ program.cpp /usr/lib/libOpenCL.so". The output is too long to show here.
Compiling using "g++ clpeak-arndale-octa.cpp /usr/lib/libmali.so" works just fine. But now that I am looking further into the compilation, I noticed the following:
# ldd a.out
/usr/lib/libmali.so (0xb5e03000)
libstdc++.so.6 => /usr/lib/arm-linux-gnueabihf/libstdc++.so.6 (0xb5d4a000)
libm.so.6 => /lib/arm-linux-gnueabihf/libm.so.6 (0xb5cde000)
libgcc_s.so.1 => /lib/arm-linux-gnueabihf/libgcc_s.so.1 (0xb5cbd000)
libc.so.6 => /lib/arm-linux-gnueabihf/libc.so.6 (0xb5bd6000)
/lib/ld-linux-armhf.so.3 (0xb6f6d000)
librt.so.1 => /lib/arm-linux-gnueabihf/librt.so.1 (0xb5bc8000)
libpthread.so.0 => /lib/arm-linux-gnueabihf/libpthread.so.0 (0xb5bac000)
libdl.so.2 => /lib/arm-linux-gnueabihf/libdl.so.2 (0xb5ba1000)
The resulting binary does not contain any reference to an OpenCL or Mali library. Does this mean that the program runs on the CPU instead of GPU? But then, why does it report the two Mali-T628 devices? I am confused.
Whoops, I missed that line. In that case I can conclude the following:
I am running a kernel containing the proper Mali kernel driver, my program is linked to the corresponding userspace binary and T628-MP6 is correctly recognized.
Is it indeed DVFS that is preventing the benchmark from achieving the expected performance level? Why is /sys/class/misc/mali0/device/dvfs missing?
Hi Bramv,
Taken shamelessly from an answer by peterharris in another thread:
"The DVFS code for the GPU is not directly managed by our drivers - it is part of the platform integration provided in the BSP from Insignal. This style of integration occurs because the DVFS analogue parts which control F and V for the power domains are not part of the ARM IP. This question is probably best asked to Samsung or Insignal, as they maintain the BSP for that platform."
It is possible to disable features such as DVFS by recompiling the Linux kernel and Mali kernel module with the correct configuration. The reason this reduced performance (normally) happens is that DVFS ties the GPU frequencies to the workload of the CPU. As you are running your intensive GPU test, the CPU is left to idle, so DVFS drops the CPU core speed and, unfortunately, the GPU frequency along with it.
A means of stopping this happening would be to add some CPU intensive code to run whilst the GPU code is running to stop DVFS dropping the frequencies.
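One way to try this without changing the benchmark itself is to spin a few busy loops in the background for the duration of the run. The wrapper below is only a sketch. The name `run_with_cpu_load` is invented for this example, and the number of loops to use is a guess to be tuned against the core count.

```shell
# Run a command while N busy-loop processes keep the CPU loaded, then stop them.
# $1: number of busy loops; remaining arguments: the command to run (e.g. the benchmark).
run_with_cpu_load() {
    n="$1"; shift
    pids=""
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while :; do :; done ) &    # one spinning subshell per requested loop
        pids="$pids $!"
        i=$((i + 1))
    done
    status=0
    "$@" || status=$?                # keep the command's exit status
    kill $pids 2>/dev/null || true
    wait $pids 2>/dev/null || true
    return "$status"
}
```

For example, `run_with_cpu_load 4 ./a.out` would run the benchmark with four busy loops active and return the benchmark's exit status.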
With regard to the linker errors, you should be able to fix the issue by linking against both mali and OpenCL. OpenCL will provide the runtime linker target (even in the absence of mali) whilst mali will provide the symbols at compile time, stopping the errors in your shared paste.
Hope this helps,
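As a sketch of what linking against both could look like, the helper below builds a g++ command that names both libraries. It assumes the libraries sit in /usr/lib as described earlier in the thread. The function name `link_with_both`, the `RUN` switch and the library order are choices made for this example, not something the driver documentation prescribes.

```shell
# Print (or, with RUN=1, execute) a link command that names both libraries.
# $1: source file, $2: output binary. Both libraries are assumed to live in /usr/lib.
link_with_both() {
    src="${1:-clpeak-arndale-octa.cpp}"
    out="${2:-a.out}"
    set -- g++ "$src" -o "$out" -L/usr/lib -lOpenCL -lmali
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}
```

By default the function only prints the command, so it can be inspected first. `RUN=1 link_with_both program.cpp clpeak` performs the actual build.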
I just tried to run the GPU benchmark while the CPU was busy compiling a new kernel. There are however no significant differences in the results.
To see whether I could make DVFS work, I enabled "CONFIG_MALI_MIDGARD_DVFS=y", but this results in undefined references while compiling:
LD init/built-in.o
drivers/built-in.o: In function `mali_dvfs_update_asv':
:(.text+0x517a8): undefined reference to `exynos_lot_id'
:(.text+0x517ac): undefined reference to `exynos_lot_id'
drivers/built-in.o: In function `mali_sysfs_show_asv':
:(.text+0x51ab6): undefined reference to `exynos_asv_group_get'
:(.text+0x51b2a): undefined reference to `exynos5420_is_g3d_mp6'
:(.text+0x51b50): undefined reference to `exynos_lot_id'
:(.text+0x51b58): undefined reference to `exynos_lot_id'
drivers/built-in.o: In function `mali_dvfs_event_proc':
:(.text+0x51fc8): undefined reference to `exynos_result_of_asv'
:(.text+0x51fcc): undefined reference to `exynos_result_of_asv'
make: *** [vmlinux] Error 1
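All of the missing symbols carry an `exynos` prefix and several mention ASV, which hints that the Mali DVFS code expects platform support that this kernel configuration does not build. That is a guess, not something the log proves. To see the distinct symbols at a glance, the log can be reduced to a unique list. The sketch below uses a hypothetical helper, `undefined_symbols`.

```shell
# Read a linker log on stdin and print each undefined symbol once, sorted.
undefined_symbols() {
    sed -n "s/.*undefined reference to \`\([^']*\)'.*/\1/p" | LC_ALL=C sort -u
}
```

Piping the failing `make` output through it (for example `make 2>&1 | undefined_symbols`) reduces the log above to four names: exynos5420_is_g3d_mp6, exynos_asv_group_get, exynos_lot_id and exynos_result_of_asv. Searching the kernel tree for their definitions would show which option has to be enabled alongside the DVFS one.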
Are you using the same kernel configuration that Guillaume mentioned?
./scripts/kconfig/merge_config.sh linaro/configs/linaro-base.conf linaro/configs/distribution.conf linaro/configs/arndale_octa.conf linaro/configs/lt-arndale_octa.conf linaro/configs/mali-arndale-octa.conf
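After merging the fragments, it can be worth confirming what ended up in the resulting .config, since later fragments override earlier ones. The following sketch lists the Mali-related options, both enabled and disabled. The helper name `mali_config_options` is hypothetical.

```shell
# Print the Mali-related options from a kernel config file.
# $1: path to the config (default: ./.config)
mali_config_options() {
    cfg="${1:-.config}"
    [ -r "$cfg" ] || { echo "cannot read $cfg"; return 1; }
    grep -E '^(# )?CONFIG_MALI[A-Z0-9_]*( is not set|=)' "$cfg"
}
```

Running `mali_config_options` in the kernel source directory after the merge shows at a glance whether the Mali driver options, including the DVFS one, ended up enabled.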
I've not been able to look into this a huge amount for you unfortunately.
I was able to compile your code and run it on an Odroid X-U3, which is running a very similar SoC, and the performance numbers I saw were nowhere near yours; they are somewhat better. I can only assume this is a BSP issue, but I haven't had an Arndale Octa at hand to create my own BSP and test.
I will try and investigate at this end when I get the opportunity.