Overhead generated by calling clCreateBuffer

Hi everyone,

I'm using OpenCL on an Exynos 8890 octa-core CPU with an ARM Mali-T880 MP12 GPU (Samsung S7 edge), and calls to clCreateBuffer are showing a high overhead. I'd like to know more about this issue. Is it something in the driver that takes all this time? Why does it take so long to create the buffer?

Below are the example I used and the time measured for each size. Note that I'm creating two buffers, each holding N*N elements of type float.

    #define DATA_TYPE float

    int N = 8192;

    t_start = rtclock();

    #ifdef OFFLOAD
    a_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    b_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    #else
    a_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    b_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    #endif

    t_end = rtclock();

    printf("Total time of clCreateBuffer %lf \n", t_end - t_start);

    N (size)    clCreateBuffer (seconds)
    2048        0.010235
    4096        0.251183
    8192        1.385209
    9000        1.622948
    10000       2.054119
    11000       2.501804

P.S. Executing the same program on an Intel GPU takes much less time than on the Mali GPU.

Thanks!!!

  • Hi,

    When you create a buffer, the driver needs to map the corresponding pages, then do some cache maintenance and zero those pages, which is where all this time goes. However, on some platforms these operations don't trigger the CPU governor, so they are all performed with the CPU running at its minimum frequency.

    So make sure your device is running with the CPU in performance mode and it should be much quicker.

    I would expect N=10000 to take about 300ms (I just tried on a Samsung Chromebook).

    Hope this helps,

    Anthony
