I am trying to perform an operation inside an OpenCL kernel. I have a buffer named filter, which is a 3x3 matrix with every element initialized to 1.
I pass it as an argument to the OpenCL kernel from the host side. The issue appears when I try to read this buffer on the device side as float3 vectors. For example:
__kernel void(constant float3* restrict filter)
float3 temp1 = filter[0];
float3 temp2 = filter[1];
float3 temp3 = filter[2];
The first two temp variables behave as expected, with all components equal to 1. But the third one (temp3) has only its x component equal to 1; its y and z components are 0. When I read the buffer as plain float values, everything behaves as expected. Am I doing something wrong? I don't want to use the vload instructions because they add overhead.
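3-element vectors are aligned like 4-element vectors, so indexing a float3 pointer advances by 16 bytes per element, not 12. The third element therefore starts at the ninth float of your buffer, and its y and z components fall outside the nine floats you passed in. See section 6.1.5 in the OpenCL C specification.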
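To make the stride arithmetic concrete, here is a minimal host-side sketch in plain C, not kernel code. The `vec3` type, the `load_as_float3` helper and the `filter_mem` array are hypothetical names made up for illustration. The three trailing zeros are an assumption standing in for whatever lies past the real buffer; on a device that memory is out of bounds and its contents are not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-in for an OpenCL float3 value (hypothetical helper type). */
typedef struct { float x, y, z; } vec3;

/* Emulates indexing a `float3 *`: element i starts at float offset 4 * i,
   because float3 is sized and aligned like float4 (16 bytes). */
static vec3 load_as_float3(const float *buf, size_t len, size_t i)
{
    size_t base = 4 * i;
    assert(base + 2 < len);   /* stay inside the emulated memory */
    vec3 v = { buf[base], buf[base + 1], buf[base + 2] };
    return v;
}

/* Nine packed 1.0f values followed by zero padding (assumed contents of the
   memory just past the buffer, chosen to match the reported symptom). */
static const float filter_mem[12] = {
    1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0
};
```

Under these assumptions, `load_as_float3(filter_mem, 12, 2)` picks up floats 8, 9 and 10, so only the x component comes from the real data, which matches what temp3 shows.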
To load packed data you have to use vload/vstore, which don't come with any overhead on Mali GPUs. What makes you think that there is an overhead?
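For reference, the addressing that vload3 performs can be sketched on the host in plain C. OpenCL's `vload3(offset, p)` reads three consecutive elements starting at `p + offset * 3` and only needs scalar alignment. The `vec3` type and the `emulated_vload3` and `row_sums` functions below are hypothetical stand-ins for illustration, not part of any OpenCL API. In an actual kernel you would presumably declare the argument as `constant float* restrict filter` and call `vload3(row, filter)` directly.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-in for an OpenCL float3 value (hypothetical helper type). */
typedef struct { float x, y, z; } vec3;

/* Host-side emulation of vload3(offset, p): reads three packed floats
   starting at p + offset * 3, with no 16-byte stride. */
static vec3 emulated_vload3(size_t offset, const float *p)
{
    assert(p != NULL);
    const float *src = p + offset * 3;
    vec3 v = { src[0], src[1], src[2] };
    return v;
}

/* Sums each row of a packed 3x3 filter (nine consecutive floats). */
static void row_sums(const float *filter, float out[3])
{
    for (size_t row = 0; row < 3; ++row) {
        vec3 r = emulated_vload3(row, filter);
        out[row] = r.x + r.y + r.z;
    }
}
```

With a nine-float buffer of ones, every row is read from inside the buffer, so all three rows come back fully populated.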