ARM Mali-T628 (Samsung Exynos Octa 5420 board): GPU cache issue in kernel

Hi,

I am facing a cache issue on the Mali GPU. Do you have any idea how to resolve it? I will explain the problem below.

We are working on a Samsung Exynos Octa 5420 board, and we have an algorithm to port to the GPU.

1. First we thought of having a separate GPU buffer (created with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR") into which we copy the input data from the CPU global buffer (created with malloc). Because it is a separate GPU buffer, we arrange the data in it without any gaps. For example, if the 1st thread operates on the 1st block of data, the 2nd thread (or any other thread) works on the 2nd block, which is located immediately after the 1st block. In this design the GPU algorithm's timings are fine and within the expected range.

2. In the above design we observed that the copy takes a huge amount of time, so we decided to create the CPU global buffer itself with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR", so that by mapping it (using clEnqueueMapBuffer(CPU Buffer)) we can use this buffer on both the CPU and the GPU. But the data in this buffer is arranged so that the blocks required by the GPU algorithm sit at scattered positions. For example, if the 1st thread operates on the 1st block of data, the 2nd thread (or any other thread) works on a 2nd block that is not located directly beside the 1st block. We are observing that the run time nearly doubles compared to the earlier design (the 1st design takes 41 msec, the 2nd design takes 79 msec, roughly 93% longer). Can you suggest any way to avoid the cache issue? A quick response would be very helpful.
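
For reference, the copy step of design 1 can be sketched in plain C as below. This is only an illustration under assumptions: pack_blocks, src_blocks and block_bytes are hypothetical names, and in the real application the destination would be the pointer returned by clEnqueueMapBuffer rather than ordinary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: gather num_blocks blocks, each block_bytes long,
 * from separate source pointers into one gap-free destination buffer.
 * In the real application dst would be the mapped pointer of the
 * CL_MEM_ALLOC_HOST_PTR buffer; here it is just ordinary memory. */
static void pack_blocks(unsigned char *dst,
                        const unsigned char *const *src_blocks,
                        size_t num_blocks, size_t block_bytes)
{
    assert(dst != NULL && src_blocks != NULL);
    for (size_t i = 0; i < num_blocks; ++i) {
        /* Block i lands directly after block i-1, so threads that
         * process neighbouring blocks read neighbouring addresses. */
        memcpy(dst + i * block_bytes, src_blocks[i], block_bytes);
    }
}
```

Design 2 skips this loop entirely, which saves the copy but leaves the blocks wherever they already sit in the large buffer.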

Thanks & Regards,

Narendra Kumar

  • Hi Hartley,

    Your understanding is almost correct. In the first case I am using malloc'ed buffers in which the data is distributed among different buffer locations, so to make it one unified memory I copy the blocks into a GPU-accessible buffer (created with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR"). Since there is only one such buffer, I store the data block by block: the 1st block, from CPUBuf1, is copied to GPUBuf[0], and the 2nd block, from CPUBuf2, is copied to GPUBuf[1]. The offset between GPUBuf[0] and GPUBuf[1] is therefore 1, i.e. they are right next to each other. As I mentioned, the performance here meets our expectation: the profiled time of "41 msec" is the number we expect for this algorithm. But in this first case we need to copy the data from multiple CPU buffers into a single GPU buffer. We are avoiding this memcpy by creating the actual CPU buffers with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR" and passing this buffer to the GPU after mapping it with clEnqueueMapBuffer(CPU Buffer).

    The CPU malloc'ed buffer is 8 MB in size, of which only 1 MB of data is used by my GPU algorithm, and this 1 MB is not contiguous: my first input block for the GPU is located at an offset of 0 from the base source pointer, and the 2nd input block is located at an offset of 1x1024x1024, i.e. at the 1 MB boundary. So in the first case I created a 1 MB GPU buffer and copied each of the blocks, located at different offsets on the CPU side, into that single GPU buffer, so that the block-to-block offset is only 1 and not 1 MB as in the CPU buffer. That requires copying 1 MB from the CPU buffer to the GPU buffer. Instead, we can directly create the CPU buffer with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR" so that the GPU can access this memory as well. But when the GPU fetches that 1 MB of data block by block, with each block located at a different offset, it takes a huge amount of time: if our algorithm takes "n msec" in the first case, it takes "2n msec" in the second case due to the random access.
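
    To make the two layouts concrete, here is a small sketch in plain C that computes where block i starts in each case. The 1 MB stride is taken from the description above, but it is only stated for the first two blocks, so treating it as constant is an assumption. The block size is not given anywhere in this thread, so block_bytes is left as a parameter, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Block-to-block distance in the large CPU buffer (1x1024x1024 bytes).
 * Assumed constant for every block; the post only gives it for blocks 1 and 2. */
#define STRIDE_BYTES ((size_t)1 * 1024 * 1024)

/* First case: blocks are packed back to back in the small GPU buffer. */
static size_t packed_offset(size_t block_index, size_t block_bytes)
{
    return block_index * block_bytes;
}

/* Second case: blocks are read in place from the large mapped buffer. */
static size_t strided_offset(size_t block_index)
{
    return block_index * STRIDE_BYTES;
}

/* Unused bytes between the end of one block and the start of the next
 * in the second case. */
static size_t strided_gap(size_t block_bytes)
{
    assert(block_bytes <= STRIDE_BYTES);
    return STRIDE_BYTES - block_bytes;
}
```

    In the first case consecutive blocks are adjacent in memory; in the second case every block starts a full 1 MB after the previous one, so neighbouring blocks never share nearby addresses.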

    Please let me know if the problem is still not clear.

    Thanks & Regards,

    Narendra Kumar.

  • Hi Narendra Kumar,

    I think I understand this more clearly now, thank you.


    It sounds like it could be a cache issue, but without looking at the application it is very difficult to comment definitively.  My suggestion is to look into using DS5 Streamline as I suggested before.  This will allow you to clearly track performance counters from CPU and GPU and will show where the bottlenecks are.


    Alternatively, you could experiment with the workgroup size used for your kernels.  This can be a useful way to influence memory access patterns, and the result would be a good indicator of whether cache maintenance is causing this problem.
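
    One way to run that experiment is to time the kernel once for each local size that divides the global size evenly. The sketch below, in plain C, only builds the list of candidates. list_local_sizes is a hypothetical helper, restricting the candidates to powers of two is just a choice made here, and max_local would in practice come from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE. Each candidate would then be passed as the local_work_size argument of clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: collect the power-of-two local sizes up to max_local
 * that divide global_size evenly, since the global size must be a multiple
 * of the local size. Returns how many candidates were written to out. */
static size_t list_local_sizes(size_t global_size, size_t max_local,
                               size_t *out, size_t out_cap)
{
    size_t count = 0;

    assert(out != NULL);
    if (max_local == 0) {
        return 0;
    }
    for (size_t local = 1; count < out_cap; local *= 2) {
        if (global_size % local == 0) {
            out[count++] = local;
        }
        /* Stop before doubling would exceed max_local (also avoids overflow). */
        if (local > max_local / 2) {
            break;
        }
    }
    return count;
}
```

    If the run time changes noticeably between candidates, that points towards the memory access pattern rather than the arithmetic in the kernel.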

    I'm sorry I can't be more definitive, but do post back any other observations and we'll see if that helps identify potential solutions.

    HTH, Tim

  • Hi Tim,

    I am not allowed to share the code, but I am working on creating source code with the same kind of functionality that shows the same behavior I described in the issue. I will share that code within 2 days.

    Thanks & Regards,

    Narendra Kumar.

  • Hi Narendra,

    Thank you.  For IP reasons we would also not want to receive actual source code from your application, so that's good.  A reproducer like the one you describe would be great, though, and could be very helpful.

    Regards, Tim