We have two Malis on our board, an Odroid XU4. We wish to create a large image, with one Mali creating half the image and the other Mali creating the other half. We also want the image to be memory mapped, as it is quite large. Can we map the image so that each Mali sees at least the part it is working on and, of course, the memory for the whole image is contiguous for the CPU?
A rephrasing of the question might be: can different devices see the same memory that is shared with the host?
One possible way (which I would like to confirm before moving forward) involves using buffers instead of images:
1. Create the entire buffer in the context with CL_MEM_ALLOC_HOST_PTR.
2. Create two disjoint sub-buffers (I did not see a way to create sub-images).
3. Map each sub-buffer on its own command queue.
However, this depends on when the memory gets allocated on the host:
Is the memory allocated in step 1 or in step 3? If in step 1, the memory will be contiguous on the host. If in step 3, ... ? I suspect step 3, since that is when the host pointer becomes available.
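Step 2 above could be sketched as follows. This is only an illustration: `region_t` and `split_buffer` are hypothetical names (the struct mirrors the origin/size pair of OpenCL's `cl_buffer_region`), and the alignment is passed in by the caller. On a real system the alignment would presumably come from querying `CL_DEVICE_MEM_BASE_ADDR_ALIGN` (reported in bits) for each device, since `clCreateSubBuffer` can fail with `CL_MISALIGNED_SUB_BUFFER_OFFSET` when the origin is not suitably aligned.

```c
#include <stddef.h>

/* Hypothetical stand-in for cl_buffer_region: origin and size in bytes. */
typedef struct {
    size_t origin;
    size_t size;
} region_t;

/* Split total_bytes into two disjoint, adjacent regions.
 * The origin of the second region is rounded down to a multiple of
 * align_bytes, so it can serve as a sub-buffer origin.
 * Returns 0 on success, -1 if no valid split exists. */
int split_buffer(size_t total_bytes, size_t align_bytes, region_t out[2])
{
    if (align_bytes == 0 || total_bytes < 2 * align_bytes)
        return -1;

    /* Midpoint, rounded down to the alignment boundary. */
    size_t mid = (total_bytes / 2 / align_bytes) * align_bytes;

    out[0].origin = 0;
    out[0].size   = mid;
    out[1].origin = mid;
    out[1].size   = total_bytes - mid;  /* second region takes the remainder */
    return 0;
}
```

For example, with an assumed 128-byte alignment, `split_buffer(1000, 128, r)` gives the regions (origin 0, size 384) and (origin 384, size 616). Each region would then be handed to `clCreateSubBuffer` with `CL_BUFFER_CREATE_TYPE_REGION`.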
Message was edited by: Norman Goldstein
Everything for Mali is memory mapped; there is no dedicated graphics RAM and everything is stored in system memory. The two logical Mali GPUs in your system for OpenCL share the same MMU, so they will both be able to see all GPU-related data in your application process.
There are two points to watch out for.
First, note that the two OpenCL devices are not memory coherent, so if you have two parallel parts trying to modify the same cache line it won't work and you will get data corruption. Provided you partition the work so that each device gets a unique set of cache lines, it should all be OK.
Second, note that the two GPU partitions are not equal in performance (the first has 4 GPU cores, the second has 2), so you will want to factor that into your problem partitioning.
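As a rough sketch of both points, the helpers below pad each image row to a whole number of cache lines and split the rows in proportion to the core counts. The names `padded_row_bytes` and `rows_for_first` are made up for illustration, and the 64-byte cache line is an assumption that should be checked against the actual hardware. Padding only keeps rows in separate cache lines if the base address of the allocation is itself cache-line aligned, and core count is only a first approximation of relative throughput.

```c
#include <stddef.h>

/* Assumed cache-line size in bytes; verify for the real hardware. */
#define CACHE_LINE_BYTES 64

/* Round a row length in bytes up to a whole number of cache lines,
 * so that no cache line holds pixels from two different rows. */
size_t padded_row_bytes(size_t width_px, size_t bytes_per_px)
{
    size_t raw = width_px * bytes_per_px;
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}

/* Number of rows given to the first device, proportional to its core
 * count. The second device gets the remaining height - result rows. */
size_t rows_for_first(size_t height, unsigned cores0, unsigned cores1)
{
    unsigned total = cores0 + cores1;
    if (total == 0)
        return 0;
    return height * cores0 / total;
}
```

With the 4-core and 2-core split described above, `rows_for_first(1080, 4, 2)` assigns 720 rows to the first device and leaves 360 for the second.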
HTH, Pete
Thanks for the info and pointers. Here is an outline of what I understand from this:
-- A single context having the two devices, device0 and device1, and two queues, queue0 and queue1
// The float single channel image that we want to generate
-- image = clCreateImage2D( context,
CL_MEM_WRITE_ONLY |
CL_MEM_ALLOC_HOST_PTR,
...,
nullptr, // host ptr
... );
// Map the entire image
-- float* ptr = (float*) clEnqueueMapImage( queue0,
image,
... );
After running the kernels, ptr will point to the entire (contiguous)
image, as created by the kernels of the two devices. We could have used
"queue1" instead of "queue0" to do the mapping -- it makes no
difference, due to the Mali memory architecture.
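One detail worth keeping in mind when reading through ptr: `clEnqueueMapImage` also reports a row pitch in bytes through its `image_row_pitch` argument, and that pitch is not guaranteed to equal width * sizeof(float). A minimal sketch of addressing a pixel in the mapped image, assuming a single-channel float format, might look like this (`pixel_at` is a hypothetical helper, not part of OpenCL):

```c
#include <stddef.h>

/* Address of pixel (x, y) in a mapped single-channel float image.
 * base            : pointer returned by the map call
 * row_pitch_bytes : row pitch in bytes reported by the map call */
float *pixel_at(void *base, size_t row_pitch_bytes, size_t x, size_t y)
{
    /* Step to the start of row y in bytes, then index floats within the row. */
    unsigned char *row = (unsigned char *)base + y * row_pitch_bytes;
    return (float *)row + x;
}
```

Indexing the mapping as ptr[y * width + x] would only be equivalent when the reported row pitch happens to be exactly width * sizeof(float).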
I should have added another sentence to the end of my previous post:
Do I have this correct: either queue0 or queue1 can be used to map the entire buffer for the CPU, and the CPU will then see the results generated by the kernels on both devices?
I'm a graphics guy, not a compute guy, but yes, I'm pretty sure that should work. I'll try to find a handy OpenCL driver dev to comment!
Cheers, Pete