What is the difference between SVM and cl::Buffer?

hi,

I asked this question on the Khronos forum but got no answer, so I decided to ask it on this forum.

I usually do the following process with OpenCL on Android.

Working with:

  - Mali-G715-Immortalis MC11 r1p2

  - OpenCL 3.0 v1.r38p1-01eac0.c1a71ccca2acf211eb87c5db5322f569

  - SVM_COARSE_GRAIN_BUFFER supported

  1. I create the platform, queue, and device, create all my cl::Buffer objects, and compile all the kernels at the start of my application.

  2. I get a picture from my camera and send the byte data to my C++ function through JNI (jbyteArray => (uint8_t*)inPtr).

  3. I take the (uint8_t*)inPtr pointer and use a cl::Buffer to wrap the camera picture data:
    bufferNV21 = cl::Buffer(gContext, CL_MEM_READ_ONLY|CL_MEM_USE_HOST_PTR, isize*sizeof(cl_uchar), inPtr, NULL);
    This takes less than 1 ms.

  4. I run my NV21toRGB kernel, then do some work with my output buffer.

  5. I use enqueueMapBuffer to map the buffer into my local program memory as buf, which is then used for pthread CPU processing (see the sketch after this list). This takes less than 2 ms.

  6. Then I copy the CPU result back to the GPU buffer with:
    bufferligne = cl::Buffer(gContext, CL_MEM_USE_HOST_PTR, (1024*1024)*sizeof(cl_uchar4), buf, NULL); // replaces enqueueWriteBuffer
    This takes less than 3 ms.

  7. I run some kernels on the bufferligne cl::Buffer.

  8. Then I send the GPU buffer (bufferMMM) back to the Java output bitmap using:
    gQueue.enqueueReadBuffer(bufferMMM, CL_TRUE, 0, osize*sizeof(cl_uchar4), out, 0, &arraySecondEvent); // for OpenCL
    This last part takes between 3 and 5 ms, sometimes less.
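
To make step 5 concrete, here is roughly what the map/unmap looks like with the C++ wrapper (a simplified sketch: the buffer name bufferRGB and the size are placeholders, error handling removed):

    cl_int err = CL_SUCCESS;
    // Map the GPU buffer into host memory so the pthreads can work on it directly.
    void* buf = gQueue.enqueueMapBuffer(bufferRGB, CL_TRUE,
                                        CL_MAP_READ | CL_MAP_WRITE,
                                        0, (1024*1024)*sizeof(cl_uchar4),
                                        NULL, NULL, &err);

    // ... pthread CPU processing works directly on buf ...

    // Hand the buffer back before the GPU touches it again.
    gQueue.enqueueUnmapMemObject(bufferRGB, buf);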

So, is it relevant to use SVM with my configuration, and what should I change if I want to use SVM? Changes at steps 3, 5, 7, or 8?

And what does SVM do that cl::Buffer does not? I would like to understand why to use it.

I could improve the speed by using the following in the kernel.cl file:

#pragma OPENCL EXTENSION cl_khr_priority_hints : enable // speeds up the OpenCL queue driver
#pragma OPENCL EXTENSION CL_QUEUE_PRIORITY_HIGH_KHR : enable

and in the .cpp file:

// Optional extension support
#define CL_HPP_USE_IL_KHR
#define CL_HPP_USE_CL_SUB_GROUPS_KHR
#define CL_HPP_OPENCL_API_WRAPPER
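
As far as I understand, cl_khr_priority_hints can also be applied on the host side when the command queue is created; this is a rough sketch of that pattern (ctx and dev stand for the raw handles created in step 1, error checking omitted):

    // Ask the driver for a high-priority queue (cl_khr_priority_hints).
    cl_queue_properties props[] = {
        CL_QUEUE_PRIORITY_KHR, CL_QUEUE_PRIORITY_HIGH_KHR,
        0
    };
    cl_int err = CL_SUCCESS;
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);
    gQueue = cl::CommandQueue(q);   // wrap it in the C++ object used above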

  • As the blog mentions, these are the benefits of using SVM:
    * It has lower overheads.
    * It is easier to use because it is just a pointer to data.
    * If your platform supports coherency, it allows you to use coherent memory.
    * Because the address of the memory is guaranteed to be the same on the host and the device, it allows
    you to write kernels using dynamic data structures that rely on pointers (e.g. linked lists); see the sketch below.
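
    As a rough illustration of that last point (coarse-grain SVM, using the C API; context, queue, and kernel are the raw handles, and the Node type is only for the sketch, error handling omitted):

      // A node type shared by host and kernel; the "next" pointers stay valid on
      // both sides because SVM guarantees the same addresses.
      typedef struct Node { cl_int value; struct Node* next; } Node;

      Node* nodes = (Node*)clSVMAlloc(context, CL_MEM_READ_WRITE, 3 * sizeof(Node), 0);

      // Coarse-grain SVM: map before the host writes, unmap before the device reads.
      clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, nodes, 3 * sizeof(Node), 0, NULL, NULL);
      nodes[0].value = 1; nodes[0].next = &nodes[1];
      nodes[1].value = 2; nodes[1].next = &nodes[2];
      nodes[2].value = 3; nodes[2].next = NULL;
      clEnqueueSVMUnmap(queue, nodes, 0, NULL, NULL);

      // The kernel argument is the pointer itself, and the kernel can walk node->next.
      clSetKernelArgSVMPointer(kernel, 0, nodes);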

    With regards to your application, with the information supplied, you would probably benefit from the use of SVM.
    The fastest way to convert an existing application to use SVM is to allocate memory via clSVMAlloc and pass it as a host_ptr to a cl object along with CL_MEM_USE_HOST_PTR.

    For your example above, that would mean allocating "inPtr" via clSVMAlloc; you will most likely see a performance improvement in the map/unmap calls.
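
    Something like this, as a minimal sketch (reusing the names and sizes from your step 3; error handling omitted):

      // Coarse-grain SVM allocation takes the place of the externally supplied pointer.
      uint8_t* inPtr = (uint8_t*)clSVMAlloc(gContext(), CL_MEM_READ_ONLY,
                                            isize * sizeof(cl_uchar), 0);

      // Copy the camera frame into inPtr, then wrap it exactly as before:
      cl_int err = CL_SUCCESS;
      bufferNV21 = cl::Buffer(gContext, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                              isize * sizeof(cl_uchar), inPtr, &err);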

    That said, you will probably see the same performance benefit on map/unmap if you use CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR.

    Another way you can improve your application is by importing your data. You can use either of the extensions cl_arm_import_memory or cl_khr_external_memory to import memory.

    Also, as a general rule, try to avoid switching between the CPU and GPU if possible.
    For instance, can step 5 be moved to the GPU? Also, in step 8, can't the kernel write directly to "out" (see if you can import "out" into bufferMMM)?

    Hope this helps.

  • Thanks John for the help.

    I tried it. But step 5 cannot be moved off the CPU. The CPU treats the information as rows, so we can process sequentially on X and Y in a FOR loop.

    The GPU cannot treat the information the same way the CPU does: it works on matrices (work_groups), and it needs the data rows reorganized to process each work_group. The CPU is faster here; the MediaTek 9200+ is surprising in 64-bit. The CPU is more usable for computing arrays, databases, and structs. For me it is not the same use case; otherwise I would use only one of them.

    GPUs are faster than CPUs for multitasking. The CPU has 8/16 cores while the GPU has thousands (I am joking, but still).

    It is definitely not the same use because they do not process information the same way.

    It all comes down to how you structure the information for your use.

    The GPU could and should become more like a CPU by processing the work_groups in order, not in random mode, I think. (Sorry for my very bad English writing; it is the same in French.)

    The GPU and the CPU do not do the same things.

    The CPU can sum very quickly over big structs.

    But thanks again for the response. I will figure it out, no problem; time and patience.

  • hi,

    let's come back to SVM. You said:

    For your example above, that would mean allocating "inPtr" via clSVMAlloc; you will most likely see a performance improvement in the map/unmap calls.

    But "inPtr" is a java Object, we do not know where the memoriy is allocated and the same for "out". So using JAVA i am not sure that the use of SVM can be done. And i do not know how to do it is it could be done.

  • Hi Hterolle, if the memory is allocated externally then there is no way to use SVM. You can try the other suggestions from my initial reply:

    That said, you will probably see the same performance benefit on map/unmap if you use CL_MEM_COPY_HOST_PTR instead of CL_MEM_USE_HOST_PTR.

    Another way you can improve your application is by importing your data. You can use either of the extensions cl_arm_import_memory or cl_khr_external_memory to import memory.

    Hope this helps.

    -John

  • hi John,

    Thanks for the tips. The input from Java costs nothing until the map/unmap. The only cost I can see, after a year of testing now, is on the output to Java memory. So there is no need to use SVM; CL_MEM_USE_HOST_PTR looks fast enough. I do not need to unmap after the map, just a flush at the end of the processing before the next frame.

    For the output to Java I use enqueueReadBuffer, and that can cost me between 2 and 5 ms. I tried to use map/unmap, but while it works fine from Java to OpenCL, it is not the same from OpenCL to Java.

    The other problem I found is that removing the Event.wait() calls sped up all the kernels, but I need to keep at least one wait() before the transfer from GPU to CPU (map/unmap), otherwise the data on the CPU is not up to date. That wait costs around 10 ms, sometimes more, and it is not the map/unmap that costs.

    So the last improvement I can see is from OpenCL to Java. I need to copy the data; I cannot use a pointer. I may be wrong, but you should have a look at it. For me it is strange that it can work from Java to OpenCL but not the other way.

    PS: I just tested it with the CL_FALSE flag rather than CL_TRUE on enqueueReadBuffer; that looks to be the solution, with 0 ms of work. ;))
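
    In code that looks roughly like this (a sketch with my variable names; with CL_FALSE the call only enqueues the copy, so the event still has to complete somewhere before Java touches "out"):

      cl::Event readEvent;
      // Non-blocking: enqueue the copy and return immediately (the "0 ms" I measured).
      gQueue.enqueueReadBuffer(bufferMMM, CL_FALSE, 0, osize*sizeof(cl_uchar4),
                               out, NULL, &readEvent);

      // ... other CPU work can overlap with the transfer here ...

      readEvent.wait();   // "out" is only guaranteed to be up to date after this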

    Thanks for your answer; I found a new solution for improvement. ;)) ;))

    So it looks like enqueueReadBuffer with CL_FALSE works like map/unmap. So why does map/unmap not work from OpenCL to Java?

    On shared memory there should normally be no problem and no need for SVM. It is just a matter of how the old functions are implemented on the new hardware. Some vendors do it well, some do not. ;))

    MediaTek and Huawei seem to do it well. If Qualcomm offered me a Snapdragon equal to the MediaTek 9200+, I could check it. ;)) ;)) ;))

    Thanks again for posting an answer that forced me to think about performance rather than about data processing.

    Best Regards

    hervé.