Summary
Is OpenCL support for the Mali-T628 (for example, as found in the Exynos 5420 SoC on the Arndale Octa board) available? If so, how do I set it up?
More details
According to the vendor, OpenCL should be supported, but the Arndale Octa Wiki does not state how this can be achieved.
I am using the latest Linaro developer build and installed Mali drivers that contain OpenCL libraries for Mali T604. According to this guide, the driver actually contains references to the Mali T628. So I tried to create the udev rule as specified, which is supposed to solve a permission problem with /dev/mali0, but I found that there is no /dev/mali0 on my installation at all. So my conclusion is that the driver indeed does not support T628.
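For anyone checking the same thing, here is a minimal Python sketch that tells a missing device node apart from a permission problem. A udev rule can only change permissions on a node that already exists, so a missing `/dev/mali0` most likely means the node was never created on the kernel side. The helper name `describe_device_node` is hypothetical and not part of any Mali tooling; the default path is the one mentioned above.

```python
import os
import stat


def describe_device_node(path="/dev/mali0"):
    """Return a short status string for a device node (hypothetical helper)."""
    if not os.path.exists(path):
        # No node at all: a udev permission rule has nothing to act on.
        return "missing"
    mode = os.stat(path).st_mode
    if not stat.S_ISCHR(mode):
        return "not a character device"
    if os.access(path, os.R_OK | os.W_OK):
        return "accessible"
    # The node exists but the current user cannot open it read/write.
    return "permission denied"


print("/dev/mali0:", describe_device_node())
```

If this prints "missing", changing udev rules will not help; if it prints "permission denied", the udev rule from the guide is the relevant fix.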
When I run a clinfo utility, clGetDeviceInfo returns CL_OUT_OF_HOST_MEMORY for some device properties. Why can I query some GPU characteristics while others fail? When running a normal application, the same error appears when trying to create an OpenCL context.
I was surprised to find this topic, where yoshi seems to have OpenCL working and can run benchmarks on his Arndale Octa board. How is this possible if there is no driver available? Or am I just missing something? I hope that you can help me to also establish a working OpenCL development environment.
UPDATE: I've taken a look at the kernels and, sure enough, apart from right at the end they only touch the VMUL and VADD units. The workload is purely vector add and multiply and does not touch the scalar or dot units at all, so you can take that 17 FLOPS down to 8, and our peak realistic performance becomes:
8 FLOPS * 2 ALUs * 4 cores (in one core group, which is 1 CL device) * 0.533 = 34.112 GFLOPS peak realistic performance.
We're seeing 33.27 GFLOPS, which is 97.5% of peak realistic performance, so not too shabby.
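The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as stated in this post; the function name `peak_gflops` is made up for illustration, and all figures (8 FLOPS per ALU per cycle, 2 ALUs per core, 4 cores, 0.533 GHz, 33.27 GFLOPS measured) are taken from the text.

```python
def peak_gflops(flops_per_alu_per_cycle, alus_per_core, cores, clock_ghz):
    """Peak throughput in GFLOPS for one OpenCL device (one core group)."""
    return flops_per_alu_per_cycle * alus_per_core * cores * clock_ghz


# Figures quoted in the post: VMUL + VADD only, first core group (4 cores).
realistic_peak = peak_gflops(8, 2, 4, 0.533)
measured = 33.27  # SP GFLOPS reported for float2

efficiency = measured / realistic_peak
print(f"peak: {realistic_peak:.3f} GFLOPS, achieved: {efficiency:.1%}")
```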
Hi Bramv,
At the risk of beating a dead horse, by my math the peak throughput of the Mali T628 MP6 in the Arndale Octa is:
17 FLOPS * 2 ALUs per core * 6 cores * 0.533 GHz (I think this is the max frequency for this board?) = 108.732 GFLOPS. But as those cores are distributed across 2 core groups (4+2), simply using the first device available means you're only running on 4 cores, so that comes down to 72.488 GFLOPS. That's the peak possible throughput of the A pipes, if you were to keep pumping instructions through them that perfectly exercise all the functional units. In the real world, kernels tend not to map perfectly to the hardware, tend to include some LS activity, tend to load data that is not in the cache, and bandwidth is finite, so you should not expect to actually achieve this in real-world use cases. Asking for peak FLOPS is the wrong question if you're really interested in peak real-world performance.
I am seeing 33.27 and 33.17 SP GFLOPS for float2 and float4 respectively, or about 46% of the maximum theoretical peak. That is very good considering the benchmark is probably just using the VMUL and VADD units (I haven't looked at its code yet), and it seems to indicate that your environment has some issue. I will investigate why this benchmark reports almost identical performance numbers for float2 and float4. I also have to state that I've never seen clpeak before and am in no way vouching for its accuracy or usefulness when comparing GPU products.
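To make the core-group split concrete, here is a small Python sketch that applies the same formula to each group. The 4+2 split, 17 FLOPS per ALU per cycle, 2 ALUs per core and 0.533 GHz are the figures quoted above (the clock is the poster's own guess); the names `CORE_GROUPS` and `theoretical_peak_gflops` are hypothetical.

```python
# Core groups of the Mali-T628 MP6 as described in the post; each group
# shows up as a separate OpenCL device.
CORE_GROUPS = (4, 2)


def theoretical_peak_gflops(cores, flops_per_alu_per_cycle=17,
                            alus_per_core=2, clock_ghz=0.533):
    """Theoretical arithmetic-pipe peak for a given number of cores."""
    return flops_per_alu_per_cycle * alus_per_core * cores * clock_ghz


per_group = [theoretical_peak_gflops(n) for n in CORE_GROUPS]
total = sum(per_group)

for index, (cores, peak) in enumerate(zip(CORE_GROUPS, per_group)):
    print(f"device {index}: {cores} cores, {peak:.3f} GFLOPS")
print(f"all cores: {total:.3f} GFLOPS")
```

A benchmark that only uses the first device is therefore bounded by the 4-core figure, not by the total.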
Hth,
Chris
I tried again using the rootfs, hwpack and boot files that you provided and can now also compile and run the samples from the Mali SDK!
With some more fiddling I also managed to compile and run my own programs. Performance, however, is way lower than I expected. I modified the clpeak benchmark so that it runs on the Arndale Octa, and get the following results:
I would have expected to measure around 100 GFLOPS for SP compute using float4 vectors. I will have to look into this, but at least I have OpenCL running, and that was the goal of this topic after all.
Thanks a lot to all of you for helping me.
Btw,
When creating this OpenCL environment on the Arndale Octa, I kept track of all commands that I executed and used them to write a brief guide on the Insignal Developer forum.
Hi bramv,
I managed to get OpenCL working on the Arndale Octa board. I used the latest Linaro hwpack and file system.
then did:
sudo linaro-media-create --image-file test_image.img --dev arndale-octa --hwpack hwpack_linaro-arndale-octa_20140525-654_armhf_supported.tar.gz --binary linaro-trusty-server-20140522-661.tar.gz
to create an image to make a bootable SD card.
I then copied ahbijeet's compiled kernel files board.dtb, uImage, uInitrd from http://livehopper.com/boot.tar over the existing ones in the boot partition of the SD card.
Using the fbdev version of the userspace drivers, I was able to successfully run the sgemm example from the SDK as root, linking with libmali.
I'm not sure exactly what I did differently from you.
Thanks
Simon
Hi Veeranna,
There are a number of performance improvements in the r4p0 driver that are not present in previous releases. Keep in mind that OpenCL is not performance portable, so an application optimized for another platform, or otherwise written with another architecture in mind, may not perform well when run on a different platform or architecture. The materials below contain advice and detail some of the differences and considerations when moving from desktop to Mali, so let us know if they help or if you have any further queries and we'll be happy to help.
There is the Developer Guide: Mali-T600 Series GPU OpenCL Developer Guide (Mali Developer Center)
There is also the OpenCL FAQ: http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf
And the Laplace case study by timhar01: Technical presentation about ARM Mali-T600 GPU and ARM Mali-T700 GPU Compute (YouTube), although I recommend you watch the whole video.
Hope this helps,
Hi Chris,
We are finally able to run our application on the T628, but the performance numbers are not good. Would we get any improvement if we moved to the r4p0 driver?
Any other suggestions to improve the GPU performance will be helpful.
Thanks,
Veeranna
Hi Raoul,
can you give a rough timeline for when Octa board OpenCL support would be available in a Linaro hwpack/filesystem (not necessarily in the main builds)?
So the Linaro Mali support is available if you look up this thread, but it looks like OpenCL could be broken at the moment. I'd recommend grabbing whatever is available from Linaro or from the post above, test it, and if it doesn't work for you report that to Linaro so it can be fixed. This is downstream of us so I can't give dates or anything, you'd need to talk to them.
As an alternative to the Octa board in the shorter term, does anyone know if there is a prebuilt download of the Chromebook/Mali image that is described in many steps on the Mali site at: https://developer.arm.com/graphics/development-platforms/samsung-chromebook
ARM provide the userspace binaries necessary for GLES/CL support on the Chromebook, but we do not have legal approval to release a public BSP image at this time, which is why we wrote the guide for the Chromebook. I don't think the click-through licence on malideveloper.arm.com for the userspace binaries allows redistribution, but as far as I know there is nothing stopping someone from following that guide, and creating an image from that and distributing it, with instructions to just grab the mali blob from the site. So far to my knowledge no-one has done this, but assuming there's nothing in any applicable licences precluding it, feel free to do so!
Hello,
I'm interested in the functional capability, not so much in absolute performance as in being able to put a board in front of some people so they can develop with OpenCL.
As an alternative to the Octa board in the shorter term, does anyone know if there is a prebuilt download of the Chromebook/Mali image that is described in many steps on the developer site: https://developer.arm.com/graphics/development-platforms/samsung-chromebook
thanks, Raoul
That looks like an integration problem to me, it's worth reporting to Linaro, as I believe that's the kernel you're using?
I tried the SGEMM sample from the SDK:
root@arndale-octa:~/Mali_OpenCL_SDK_v1.1.0/samples/sgemm# ./sgemm
[PLUGIN INFO] Plugin initializing
[PLUGIN DEBUG] './override.instr_config' not found, trying to open the process config file
[PLUGIN DEBUG] './sgemm.instr_config' not found, trying to open the default config file
[PLUGIN ERROR] Couldn't open default config file './default.instr_config'.
[PLUGIN INFO] No configuration file found, attempting to use environment
[PLUGIN INFO] CINSTR GENERAL: Output directory set to: .
[PLUGIN INFO] No instrumentation features requested.
^C[ 384.380000] Mali<ERROR, BASE_MMU>: kbase_mmu_report_fault_and_kill Unhandled Page fault in AS0 at VA 0x00000000B6F0E100
[ 384.380000] raw fault status 0x820003C3
[ 384.380000] decoded fault status: SLAVE FAULT
[ 384.380000] exception type 0xC3: TRANSLATION_FAULT
[ 384.380000] access type 0x3: WRITE
[ 384.380000] source id 0x8200
[ 384.400000] Mali<ERROR, BASE_JM>: kbase_job_done_slot t6xx: GPU fault 0x43 from job slot 1
The program hangs after the [PLUGIN INFO] messages are printed. When I press Ctrl+C, the kernel messages appear, indicating that the GPU ran into trouble?
It's interesting that it's not deterministic where it fails. Without a reproducer I'm afraid I'm just guessing at possible causes. Also, the driver itself is quite old now (we're currently at r4p0), so it is worth asking them when they plan to provide an up-to-date version, as this could fix it.
We create all our kernels first, then we allocate the required buffers, and then we call clEnqueueNDRangeKernel. Sometimes it fails to create some of the kernels (others are created successfully), and sometimes it fails in buffer allocation. It has not reached clEnqueueNDRangeKernel yet.
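When reporting a fault like this, it can help to pull the raw status word out of the kernel log and split it into its fields. The Python sketch below does that for the log shown above. Note the assumption: the bit positions are inferred only from how this one log decodes 0x820003C3 (exception type 0xC3, access type 0x3, source id 0x8200), not from driver documentation, and the helper names are hypothetical.

```python
import re

# Matches the "raw fault status" line as it appears in the log above.
RAW_STATUS_RE = re.compile(r"raw fault status (0x[0-9A-Fa-f]+)")


def decode_fault_status(raw):
    """Split a raw status word into the fields printed in the log.

    ASSUMPTION: bit positions are inferred from this single log,
    not taken from driver documentation.
    """
    return {
        "exception_type": raw & 0xFF,       # 0xC3 in the log
        "access_type": (raw >> 8) & 0x3,    # 0x3 in the log
        "source_id": (raw >> 16) & 0xFFFF,  # 0x8200 in the log
    }


def find_fault(log_text):
    """Return decoded fields for the first raw fault status, or None."""
    match = RAW_STATUS_RE.search(log_text)
    if match is None:
        return None
    return decode_fault_status(int(match.group(1), 16))


sample = "[ 384.380000] raw fault status 0x820003C3"
fields = find_fault(sample)
print({name: hex(value) for name, value in fields.items()})
```

Feeding the full `dmesg` output into `find_fault` instead of `sample` works the same way, since the function only looks for the first matching line.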
Hi veerannah,
I doubt simply having a lot of kernels would in itself be a problem, nor would I expect memory bandwidth to be an issue. Can you confirm whether you are creating all of your kernels up front before any clEnqueueNDRangeKernel commands take place, or whether you interleave kernel compilation and execution? Can you also confirm whether all program objects compile without error prior to the calls to clCreateKernel that consume them?
Interestingly, a simple median filter example runs fine (OpenCL on the T628), but our application fails. Our application has many kernels and needs a lot of memory bandwidth. Could that be the reason?
Any suggestion for debugging will be helpful.
With the command mentioned above, I got the version 1.4 Midgard-"r3p0-01bet0. I will try to provide a simple app to run.
Thank you for help.