Hi,
I asked this question on the Khronos forum but got no answer, so I decided to ask it on this forum.
I used to do the following process with OpenCL on Android.
Working with:
- Mali-G715-Immortalis MC11 r1p2
- OpenCL 3.0 v1.r38p1-01eac0.c1a71ccca2acf211eb87c5db5322f569
- SVM_COARSE_GRAIN_BUFFER supported
1. I create the platform, queue, and device, create all my cl::Buffer objects, and compile all the kernels at the start of my application.
2. I get a picture from my camera and send the byte data through JNI (jbyteArray => (uint8_t*)inPtr) to my C++ function.
3. I take the (uint8_t*)inPtr pointer and use a cl::Buffer to feed the buffer with the camera picture data, using: bufferNV21 = cl::Buffer(gContext, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, isize*sizeof(cl_uchar), inPtr, NULL); this takes less than 1 ms.
4. I run my NV21toRGB kernel, then do some work with my output buffer.
5. I use enqueueMapBuffer to map the buffer to a local pointer, buf, in my program's memory, which is then used by pthread CPU processing. This takes less than 2 ms.
6. I copy the CPU result back to the GPU buffer with: bufferligne = cl::Buffer(gContext, CL_MEM_USE_HOST_PTR, (1024*1024)*sizeof(cl_uchar4), buf, NULL); // replaces enqueueWriteBuffer. This takes less than 3 ms.
7. I run some kernels on the bufferligne cl::Buffer.
8. Then I send the GPU buffer (bufferMMM) back to the Java output bitmap using gQueue.enqueueReadBuffer(bufferMMM, CL_TRUE, 0, osize*sizeof(cl_uchar4), out, 0, &arraySecondEvent); this last part takes between 3 and 5 ms, depending, sometimes less.
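The steps above can be sketched as host code. This is a minimal sketch only, assuming the names from the post (gContext, gQueue, a compiled NV21toRGB kernel, inPtr, isize, osize, out) already exist; sizes and kernel arguments are placeholders:

```cpp
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include <cstdint>

// Sketch of the zero-copy pipeline described in steps 3-8 above.
void processFrame(cl::Context& gContext, cl::CommandQueue& gQueue,
                  cl::Kernel& nv21ToRgb, uint8_t* inPtr,
                  size_t isize, size_t osize, cl_uchar4* out) {
    // Step 3: wrap the camera bytes without copying (CL_MEM_USE_HOST_PTR).
    cl::Buffer bufferNV21(gContext, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                          isize * sizeof(cl_uchar), inPtr);
    cl::Buffer bufferRGB(gContext, CL_MEM_READ_WRITE,
                         osize * sizeof(cl_uchar4));

    // Step 4: run the conversion kernel.
    nv21ToRgb.setArg(0, bufferNV21);
    nv21ToRgb.setArg(1, bufferRGB);
    gQueue.enqueueNDRangeKernel(nv21ToRgb, cl::NullRange,
                                cl::NDRange(osize), cl::NullRange);

    // Step 5: map the result for CPU (pthread) processing. On an in-order
    // queue the blocking CL_TRUE flag also waits for the kernel to finish.
    auto* buf = static_cast<cl_uchar4*>(gQueue.enqueueMapBuffer(
        bufferRGB, CL_TRUE, CL_MAP_READ | CL_MAP_WRITE, 0,
        osize * sizeof(cl_uchar4)));
    // ... CPU work on buf ...
    gQueue.enqueueUnmapMemObject(bufferRGB, buf);

    // Step 8: blocking read back into the Java-side bitmap memory.
    gQueue.enqueueReadBuffer(bufferRGB, CL_TRUE, 0,
                             osize * sizeof(cl_uchar4), out);
}
```

This requires an OpenCL 3.0 driver and device at runtime, so it is a shape to compare against rather than something to paste in as-is.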
So, is it relevant to use SVM with my configuration, and what should I change if I want to use SVM? Change at step 3, 5, 7 or 8?
And what does SVM do that cl::Buffer does not? I would like to understand why to use it.
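For context, what coarse-grain SVM offers over cl::Buffer is that a single pointer is valid on both the CPU and GPU sides: it is passed to the kernel with clSetKernelArgSVMPointer instead of being copied in and out through buffer objects. A minimal sketch using the C API (the names ctx, queue, and kernel are illustrative, not from the post):

```cpp
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include <cstring>

// Coarse-grain SVM sketch: one allocation visible to both CPU and GPU.
void svmExample(cl::Context& ctx, cl::CommandQueue& queue, cl::Kernel& kernel,
                size_t bytes) {
    // Allocate shared memory instead of a cl::Buffer over a host pointer.
    void* p = clSVMAlloc(ctx(), CL_MEM_READ_WRITE, bytes, 0);

    // CPU side: coarse-grain SVM still requires map/unmap around host access.
    clEnqueueSVMMap(queue(), CL_TRUE, CL_MAP_WRITE, p, bytes, 0, NULL, NULL);
    std::memset(p, 0, bytes);          // fill with camera data here instead
    clEnqueueSVMUnmap(queue(), p, 0, NULL, NULL);

    // GPU side: pass the same pointer directly; no enqueueRead/WriteBuffer.
    clSetKernelArgSVMPointer(kernel(), 0, p);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(bytes),
                               cl::NullRange);
    queue.finish();

    clSVMFree(ctx(), p);
}
```

The catch, as discussed below, is that SVM only helps if you control the allocation: memory that arrives already allocated (for example from Java/JNI) cannot be turned into an SVM pointer after the fact.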
I could improve the speed by using, in the kernel.cl file:

#pragma OPENCL EXTENSION cl_khr_priority_hints : enable // speeds up the OpenCL queue driver
#pragma OPENCL EXTENSION CL_QUEUE_PRIORITY_HIGH_KHR : enable

and in the .cpp file:

// Optional extension support
#define CL_HPP_USE_IL_KHR
#define CL_HPP_USE_CL_SUB_GROUPS_KHR
#define CL_HPP_OPENCL_API_WRAPPER
Thanks, John, for the help.
I tried it. But step 5 cannot be removed from the CPU: the CPU treats information as rows, so we can process X and Y sequentially in a FOR loop.
The GPU cannot treat information the same way the CPU does. It works on matrices (work_groups), and the data rows need to be reorganized to process a work_group. The CPU is faster for this; the MediaTek 9200+ is surprising in 64-bit. The CPU is more usable for computing arrays, databases, and structs. For me it is not the same use; otherwise I would use only one of them.
GPUs are faster than CPUs for multitasking. A CPU has 8/16 cores while a GPU has thousands; I am joking, but still.
It is definitely not the same use, because they do not process information the same way.
It all comes down to how you structure the information for your use.
A GPU could and should become like a CPU by processing the work_groups in order, not in random order. (Sorry for my very bad English writing; it is the same in French.)
GPUs and CPUs do not do the same things.
A CPU can sum very quickly over a big struct.
But thanks again for the response. I will figure it out, no problem: time and patience.
Let's come back to SVM. You said:
"For your example above that would mean allocating inPtr via clSVMAlloc. You will most likely see a performance improvement in the map/unmap calls."
But inPtr is a Java object; we do not know where the memory is allocated, and the same goes for out. So from Java I am not sure that SVM can be used, and I do not know how to do it if it can be done.
Hi Hterolle, if the memory is allocated externally then there is no way to use SVM. You can try the other suggestions from my initial reply:
John Kesapides said:
Although you will probably see the same performance benefit on map/unmap if you use CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR.
Another way you can improve your application is by importing your data. You can use either of these extensions, cl_import_memory_arm or cl_khr_external_memory, to import memory.
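As a rough illustration of the Arm import path mentioned here, a sketch of wrapping externally allocated host memory (such as the JNI byte array) as a cl_mem without copying. This assumes the cl_arm_import_memory extension is present; the property list and flags below are taken from the extension header and should be checked against the Arm extension specification:

```c
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* clImportMemoryARM, CL_IMPORT_TYPE_ARM (Arm ext.) */
#include <stddef.h>

/* Import host memory into the OpenCL context without a copy.
 * Sketch only: requires the cl_arm_import_memory extension. */
cl_mem importHostMemory(cl_context ctx, void* hostPtr, size_t size) {
    const cl_import_properties_arm props[] = {
        CL_IMPORT_TYPE_ARM, CL_IMPORT_TYPE_HOST_ARM, 0
    };
    cl_int err = CL_SUCCESS;
    cl_mem mem = clImportMemoryARM(ctx, CL_MEM_READ_ONLY, props,
                                   hostPtr, size, &err);
    return (err == CL_SUCCESS) ? mem : NULL;
}
```

The returned cl_mem can then be used like any other buffer, while the data stays in the original host allocation.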
Hope this helps.
-John
Hi John,
Thanks for the tips. The input from Java costs nothing until the map/unmap. The only cost I can see, after a year of testing now, is on the output to Java memory. So there is no need to use SVM; CL_MEM_USE_HOST_PTR looks fast enough. There is no need to call unmap after map, just a flush at the end of the processing before the next frame.
For the output to Java I use enqueueReadBuffer, and that can cost me between 2 and 5 ms. I tried to use map/unmap, but while it works fine from Java to OpenCL, it is not the same from OpenCL to Java.
The other problem I found: removing the Event.wait() sped up all the kernels, but I need at least one wait() before the transfer from GPU to CPU (map/unmap), otherwise the data on the CPU is not up to date. That wait costs around 10 ms, sometimes more, and it is not the map/unmap that costs.
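One way to avoid a global Event.wait() is to make the map depend only on the kernel that produces its data: on an in-order queue a blocking map (CL_TRUE) already waits for the preceding commands, and an explicit wait list narrows the dependency further. A minimal sketch, with illustrative names:

```cpp
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>
#include <vector>

// Map a buffer for CPU use right after the kernel that writes it,
// without a separate Event.wait() on the host.
void mapAfterKernel(cl::CommandQueue& queue, cl::Kernel& k, cl::Buffer& buf,
                    size_t bytes) {
    cl::Event done;
    queue.enqueueNDRangeKernel(k, cl::NullRange, cl::NDRange(bytes),
                               cl::NullRange, nullptr, &done);

    // The wait list makes the map depend only on this kernel, and the
    // blocking flag (CL_TRUE) returns only once the data is up to date.
    std::vector<cl::Event> deps{done};
    void* p = queue.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, bytes,
                                     &deps);
    // ... CPU processing on p ...
    queue.enqueueUnmapMemObject(buf, p);
}
```

If the 10 ms is really the kernels finishing rather than the synchronization call itself, this will not remove the cost, but it keeps the host from waiting on more commands than necessary.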
So the last improvement I can see is from OpenCL to Java. I need to copy the data; I cannot use a pointer. I may be wrong, but you should have a look at it. For me it is strange that it works from Java to OpenCL but not in the other direction.
PS: I just tested it with the CL_FALSE flag rather than CL_TRUE on enqueueReadBuffer, and that looks to be the solution, with 0 ms of work. ;))
Thanks for your answer, I found a new solution for improvement ;)) ;))
So it looks like enqueueReadBuffer with CL_FALSE works like map/unmap. So why does map/unmap not work from OpenCL to Java?
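A caveat on that result: a non-blocking enqueueReadBuffer returns immediately, so the 0 ms measures the enqueue, not the copy. The copy still runs in the background, and the destination must not be touched until the read's event has completed. A sketch, assuming the names from the post:

```cpp
#define CL_HPP_TARGET_OPENCL_VERSION 300
#include <CL/opencl.hpp>

// Non-blocking read back to the Java-side bitmap memory.
void asyncReadBack(cl::CommandQueue& gQueue, cl::Buffer& bufferMMM,
                   size_t osize, cl_uchar4* out) {
    cl::Event readDone;
    // CL_FALSE: the call returns right away; the copy runs asynchronously.
    gQueue.enqueueReadBuffer(bufferMMM, CL_FALSE, 0,
                             osize * sizeof(cl_uchar4), out,
                             nullptr, &readDone);

    // ... overlap other CPU work here ...

    // "out" is only valid after the event completes.
    readDone.wait();
}
```

So the 2 to 5 ms has not disappeared; it has moved off the measured path, which is still a real win if the CPU has other work to overlap with the copy.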
On shared memory there should normally not be any problem and no need for SVM. It is just a matter of implementing the old functions on the new hardware. Some do it well, some do not. ;))
MediaTek and Huawei seem to do it well. If Qualcomm offers me a Snapdragon equal to the MediaTek 9200+, I could check it. ;)) ;)) ;))
Thanks again for posting an answer that forced me to think about performance rather than about data processing.
Best regards,
Hervé.