Hi,
I am facing a cache issue on a Mali GPU. Do you have any idea how to resolve it? I will explain the problem in detail below.
We are working on a Samsung Exynos Octa 5420 board, and we have an algorithm to port to the GPU.
1. First, we used separate GPU buffers (created with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR") and copied the input data into them from a CPU global buffer (created with malloc). Because these are separate GPU buffers, we arrange the data in them without any gaps. For example, if the 1st thread operates on the 1st block of data, the 2nd thread (or any other thread) may work on the 2nd block of data, which is located immediately after the 1st block. Here in this design GPU algorithm numbers are fine with in the range.
2. In the above design we observed that the copying takes a very long time, so we decided to create the CPU global buffer itself with "clCreateBuffer" and "CL_MEM_ALLOC_HOST_PTR", so that by mapping it (using clEnqueueMapBuffer on the CPU buffer) we can use the same buffer on both the CPU and the GPU. But this buffer data is arranged in such a way that data required by GPU algorithm will be arranged at different position. For example, if the 1st thread operates on the 1st block of data, the 2nd thread (or any other thread) may work on the 2nd block of data, which is not located directly beside the 1st block. We are observing a slowdown of roughly 93% compared to the first design (the 1st design takes 41 ms; the 2nd design takes 79 ms). Can you suggest any way to avoid the cache issue? A quick response would be very helpful.
Thanks & Regards,
Narendra Kumar
Hi Narendra Kumar,
To be able to help you on this I think you will need to describe in more detail what it is you are doing. From the description so far it's not clear what is going on.
For example, in your first version you have multiple buffers created through clCreateBuffer and a single malloc'd buffer you copy data from/to? Is that right? And in your second version instead of using a malloc'd buffer, you are creating the main buffer with clCreateBuffer? When you then say... "But this buffer data is arranged in such a way that data required by GPU algorithm will be arranged at different position"... I don't quite understand why the layout of the data is different. Perhaps you could explain this in more detail.
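To make the question concrete, here is a minimal sketch in plain C of how I currently picture the two layouts. It is only my guess at your setup, not a description of your actual code: the block size, block count, and stride are hypothetical values, and the helper names (packed_offset, strided_offset, gather_blocks) are made up for illustration. No OpenCL calls are involved; it only models where each block sits in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { BLOCK_BYTES = 64, NUM_BLOCKS = 4, STRIDE_BYTES = 256 };

/* Version 1 (assumed): block i starts right after block i-1, with no gaps. */
static size_t packed_offset(size_t block)
{
    return block * BLOCK_BYTES;
}

/* Version 2 (assumed): consecutive blocks are STRIDE_BYTES apart, so there
 * is a gap of STRIDE_BYTES - BLOCK_BYTES between them. */
static size_t strided_offset(size_t block)
{
    return block * STRIDE_BYTES;
}

/* The copy that version 1 pays for: gather the scattered blocks into a
 * buffer where they sit back to back. */
static void gather_blocks(unsigned char *packed,
                          const unsigned char *strided,
                          size_t num_blocks)
{
    assert(packed != NULL && strided != NULL);
    for (size_t b = 0; b < num_blocks; ++b) {
        memcpy(packed + packed_offset(b),
               strided + strided_offset(b),
               BLOCK_BYTES);
    }
}
```

If your second version looks roughly like strided_offset, then neighbouring threads read addresses that are far apart, which could plausibly explain different cache behaviour. Please correct the sketch where it does not match what you are doing.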
Also when you say "Here in this design GPU algorithm numbers are fine with in the range", do you mean the data is correct or do you mean the speed is acceptable? If you mean the data is correct, are you suggesting that the data in the second version is not correct?
There is a tool that can help to profile both CPU and GPU operation together, and it may be useful here to determine why the performance is so different between your versions. You can find out more about the tool here: ARM DS-5 Streamline - Mali Developer Center.
Regards, Tim