Overhead generated by calling clCreateBuffer

Hi everyone,

I'm using OpenCL on an Exynos 8890 octa-core CPU with an ARM Mali-T880 MP12 GPU (Samsung S7 edge), and calls to clCreateBuffer have a high overhead. I'd like to understand this better. Is the time being spent in the driver? Why does it take so long to create a buffer?

The example I used is shown below, followed by the sizes and their respective times. Note that I'm creating two buffers, each holding N*N elements of type float.

    #define DATA_TYPE float

    int N = 8192;

    t_start = rtclock();
#ifdef OFFLOAD
    a_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    b_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
#else
    a_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
    b_mem_obj = clCreateBuffer(clGPUContext, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, sizeof(DATA_TYPE) * N * N, NULL, &errcode);
#endif
    t_end = rtclock();

    printf("Total time of clCreateBuffer %lf \n", t_end - t_start);

    N (size)    clCreateBuffer (seconds)
    2048        0.010235
    4096        0.251183
    8192        1.385209
    9000        1.622948
    10000       2.054119
    11000       2.501804

P.S. The same program running on an Intel GPU takes far less time than it does on the Mali GPU.

Thanks!!!

  • Hi,

    When you create a buffer, the driver needs to map the corresponding pages, do some cache maintenance, and zero those pages, which is where all this time goes. However, on some platforms these operations don't trigger the CPU governor, so they are all performed with the CPU running at its minimum frequency.

    So make sure your device is running with the CPU in performance mode and it should be much quicker.

    I would expect N=10000 to take about 300ms (I just tried on a Samsung Chromebook).

    Hope this helps,

    Anthony

  • "cat" only reads the file; it doesn't write to it.

    If you run

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    does it say the processor is in performance mode?

    If not, you need to run

    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

  • Hi Anthony Barbier,

    Running it with the performance governor made almost no difference.

    I used the following commands to set performance mode:

    echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

    echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor

    ...

    After setting performance mode, I checked the result:

    $cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    performance

    $cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

    performance

    $cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor

    performance

    ...

    I'd appreciate any additional feedback.

    Thanks.

  • I've updated the previous message with the correct information about what I did to run in performance mode.

    Briefly, I did:

    • To set performance mode:

              $echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

              $echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

              $echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor

              ...

    • After setting performance mode, I checked it by running the following commands:

              $cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

              performance

              $cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor

              performance

              $cat /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor

              performance

              ...

    Is there anything more I can do to check or reduce the time taken by calling clCreateBuffer?

    The source file that I've used to measure the times is attached.