Summary
Is OpenCL support for the Mali-T628 (for example as found in the Exynos 5420 SoC on the Arndale Octa board) available? If so, how to set it up?
More details
According to the vendor, OpenCL should be supported, but the Arndale Octa Wiki does not state how this can be achieved.
I am using the latest Linaro developer build and installed Mali drivers that contain OpenCL libraries for Mali T604. According to this guide, the driver actually contains references to the Mali T628. So I tried to create the udev rule as specified, which is supposed to solve a permission problem with /dev/mali0, but I found that there is no /dev/mali0 on my installation at all. So my conclusion is that the driver indeed does not support T628.
When I execute a clinfo utility, clGetDeviceInfo returns CL_OUT_OF_HOST_MEMORY for some device properties. Why can I query the GPU for some characteristics, but does this fail for some others? When running a normal application, the same error appears when trying to create an OpenCL Context.
I was surprised to find this topic, where yoshi seems to have OpenCL working and can run benchmarks on his Arndale Octa board. How is this possible if there is no driver available? Or am I just missing something? I hope that you can help me to also establish a working OpenCL development environment.
Hi Veeranna,
No unfortunately not. I tried to run this image to get it up and running but the board wouldn't boot.
I am now running a linaro image using kernel 3.15.0-1-linaro-arndale-octa. But so far I haven't been able to build a mali kernel driver module myself either: Update: see end of post
root@arndale-octa:~/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard# make -C /lib/modules/`uname -r`/build M=$PWD modules make: Entering directory `/usr/src/linux-headers-3.15.0-1-linaro-arndale-octa' Building modules, stage 2. MODPOST 0 modules make: Leaving directory `/usr/src/linux-headers-3.15.0-1-linaro-arndale-octa'
root@arndale-octa:~/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard# make -C /lib/modules/`uname -r`/build M=$PWD modules
make: Entering directory `/usr/src/linux-headers-3.15.0-1-linaro-arndale-octa'
Building modules, stage 2.
MODPOST 0 modules
make: Leaving directory `/usr/src/linux-headers-3.15.0-1-linaro-arndale-octa'
No ko file is being generated.
The driver includes some scons files, but they do not work for me. First of all, a Sconstruct file is not included. When adding one myself and running scons, the environment is not recognized:
root@arndale-octa:~/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm# scons scons: Reading SConscript files ... scons: *** Import of non-existent variable ''env'' File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard/sconscript", line 20, in <module>
root@arndale-octa:~/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm# scons
scons: Reading SConscript files ...
scons: *** Import of non-existent variable ''env''
File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard/sconscript", line 20, in <module>
So I figured, let's fix this by replacing Import('env') by env = Environment(ENV = os.environ) but the dictionary class that scons uses does not handle non-existent keys the way the sconscript expects it:
scons: Reading SConscript files ... KeyError: 'v': File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/Sconstruct", line 1: SConscript('midgard/sconscript') File "/usr/lib/scons/SCons/Script/SConscript.py", line 609: return method(*args, **kw) File "/usr/lib/scons/SCons/Script/SConscript.py", line 546: return _SConscript(self.fs, *files, **subst_kw) File "/usr/lib/scons/SCons/Script/SConscript.py", line 260: exec _file_ in call_stack[-1].globals File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard/sconscript", line 28: if env['v'] != '1': File "/usr/lib/scons/SCons/Environment.py", line 412: return self._dict[key]
KeyError: 'v':
File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/Sconstruct", line 1:
SConscript('midgard/sconscript')
File "/usr/lib/scons/SCons/Script/SConscript.py", line 609:
return method(*args, **kw)
File "/usr/lib/scons/SCons/Script/SConscript.py", line 546:
return _SConscript(self.fs, *files, **subst_kw)
File "/usr/lib/scons/SCons/Script/SConscript.py", line 260:
exec _file_ in call_stack[-1].globals
File "/root/TX011-SW-99002-r4p1-00rel0/driver/product/kernel/drivers/gpu/arm/midgard/sconscript", line 28:
if env['v'] != '1':
File "/usr/lib/scons/SCons/Environment.py", line 412:
return self._dict[key]
Any help on compiling this kernel module is highly appreciated!
Update:
I managed to compile the kernel module. It turns out that you don't need the scons files at all and can just use the included Kbuild and Makefile. Compiling mali_kbase.ko is a matter of running:
CONFIG_MALI_MIDGARD=m make
But inserting this module into the kernel does not provide a /dev/mali, neither does lshw detects a GPU. This means that we still cannot run OpenCL on the T628. How can we make the /dev/mali device show up?
Hi Bramv,
Are you able use all 6 cores T628?. If you are succeeded can you give complete steps to builds required binaries.
Thanks,
Veeranna
Hi Bram,
Yes thats correct the old kernel driver will not work with the new userspace driver, the kernel needs recompiling with the new kernel driver.
Hth,
Chris
I cannot stress enough that if you are interested in real world performance, you should move away from these synthetic benchmarks and look at actual data for proper use cases. There are some real-world oriented benchmarks out there already, which already have Mali powered devices in their results.
I totally agree that synthetic benchmarks are no proper measure for real world performance, but these benchmarks do help to determine whether the environment is set-up properly, which is clearly not (yet) the case.
Have spoken to someone from the driver team, if you're only seeing one device then it must be an old driver, it's worth asking Insignal what their roadmap is for providing updates. On current drivers you will see 2 devices, one with 4 cores and one with 2.
By now the r4p0-02rel0 drivers are available. I tried running the benchmarks using them instead of the r3p0-02rel0 drivers that I used before, however the runtime fails to create an OpenCL context.
Someone opened a topic on the Insignal forum about the latest drivers, it seems that I have to compile my own kernel using the latest kernel driver to get it working.
Thank you for the information.
Is this API/driver overhead?
Probably, yes. When I try this I'm seeing the same sort of additional time using gettimeofday vs the profiling values output. With the Sobel sample I looked at the difference in this additional time when I increased the size of the image being processed. It appeared to stay roughly the same, which would back up the theory that this is due to additional overhead in the API.
Hope that helps,
Tim
Hi Tim,
Thank you for providing information on performance number. With "cl_event" profiling we are getting numbers as
Queued time: 0.11ms
Wait time: 0.2725ms
Run time: 0.7586ms
But if we measure numbers across clEnqueueNDRangeKernel() and clfinish() using gettimeofday we are seeing number is 1.5msec. Is this API/driver overhead?
I realise I didn't include the platform details. I'm running on a Mali-T604 platform with the GPU clocked @ 533MHz. I'm using a later version of the drivers than you, I believe, and that would likely mean the performance you would expect on your platform is slightly worse. On balance though I would expect the performance on your Mali-T628 system to be roughly equivalent to mine.
HTH, Tim
I ran the Sobel filter as provided by the Mali OpenCL SDK, modifying it to run a 1024x1024 image and I see the following...
Wait time: 0.222292ms
Run time: 0.680683ms
The Mali OpenCL SDK doesn't include a Laplace transform... which one are you using?
Best regards, Tim
We really appreciate your quick reply. Looking forward to see the performance numbers.
Hi Veerannah,
We don't have stock performance numbers for the SDK samples to hand. However what I will do later this week is run Sobel and Laplace on a platform here and get some numbers back to you.
Hi Chris,
Where can we find benchmark numbers for examples kernels provided with SDK?. We ran "sobel" filter example for 1024x1024 image, profile number measured across kernel trigger and wait measures as 5msec. And Laplace transform for 4096x2048 image, number measures as 10msec. Is these numbers expected for for Mali T628 @ 533Mhz GPU? We measured the numbered across clEnqueueNDRange() and immediate clFinish().
I get 6 and 600 respectively. So I would assume that I am running on all six cores at 600 Mhz. Is there anything I can do to verify which core group is actually used?
Just to add to what Chris has said about this, this is indeed what we see with older versions of the driver. As you are seeing, with this version it does return 6 cores, though it really will only run on 4 of them. This particular SoC configures the 6 cores into 2 core-groups, 4 + 2. With the driver version you are using you can only target the 4-core group with CL... it is only later drivers that allow access to both core groups, and then it will be as separate devices so you will need to manually split the work if you want to use all the cores. (Graphics works differently and can automatically spread the workload across the entire GPU).
Hope that helps, Tim
Is there any reference document (by Mali?) that lists all important specs, like the number of FLOPS (including the exact number of vector, scalar and dot units per compute unit?
Here the number of flops is 16 instead of 17
It must be a typo for them, it's definitely 17 max.
Is there anything I can do to verify which core group is actually used?
I normally use http://graphics.stanford.edu/~yoel/notes/clInfo.c to quickly sanity check a platform. Will run this on the chromebook to confirm when I get it running again this afternoon.
What benchmark are you using to measure those 33.27 GFLOPS?
That's with clPeak, it gives me the numbers I would expect for the work that it's doing, i.e. just vector add and multiply, no scalar, no dot product. We have a kernel which exercises all functional units, but it is obviously synthetic and not representive of a real workload. I cannot stress enough that if you are interested in real world performance, you should move away from these synthetic benchmarks and look at actual data for proper use cases. There are some real-world oriented benchmarks out there already, which already have Mali powered devices in their results.
directly compare the efficiency of this presumably energy-efficient GPU...
Do you have a workload in mind? Does clPeak represent the sort of work you will be doing? If not I'd recommend looking into existing benchmarks/applications which represent the sort of work you will be doing and use that as your basis for comparison, rather than synthetic benchmarks like this one. Will your application run on handhend smartphones/tablets, or is it intended to be run in a compute farm somewhere?
On T628 MP[5-8] there will be 2 core groups, so it should not be possible to see one device with 6 cores, they will be exposes as 2 separate devices. It might be an issue with the benchmark? That does explain where your expectation came from however
UPDATE: Have spoken to someone from the driver team, if you're only seeing one device then it must be an old driver, it's worth asking Insignal what their roadmap is for providing updates. On current drivers you will see 2 devices, one with 4 cores and one with 2.
Hi Chris, I am glad that you did reply, you bring up some very interesting topics!
For me only one OpenCL platform containing only one device is found and when I request the number of compute units and clock frequency:
cl::Device device device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); device.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>()
cl::Device device
device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
device.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>()
Somewhere else mentioned the following formula for peak floating points performance:
16FP x 2 Vec4 x 6 Clusters x 0.600 GHz = 115.2 GFLOPS
Here the number of flops is 16 instead of 17, so I wonder where those numbers come from. Is there any reference document (by Mali?) that lists all important specs, like the number of FLOPS (including the exact number of vector, scalar and dot units per compute unit?
I ran clpeak on another architecture (Nvidia Tesla K40) and the program indeed measured peak performance (of about 4000 GFLOPS), therefore I assumed that I could also use this benchmark to measure peak performance of this Mali GPU. SInce this clearly doesn't work, I wonder how if I we could come up with some kernel that runs at +/- 100 GFLOPS. What benchmark are you using to measure those 33.27 GFLOPS?
You might be wondering why I am so determined to run the GPU at peak performance, but that is to ultimately be able to directly compare the efficiency of this presumably energy-efficient GPU to more common accelerators like the Nvidia Tesla K40, AMD Firepro W9100 and Intel Xeon Phi.
View all questions in Graphics and Gaming forum