I am trying to make my image processing program faster,
so I changed my scalar code into vectorized code.
For context, the program reads the 4 pixels to the left and the 4 pixels to the right (8 in total) of each target pixel from the input buffer,
compares them with the target pixel to calculate a weight, and then writes the result into another buffer.
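To make the goal concrete, here is a minimal scalar sketch in plain C of what the program is meant to compute. The weight formula is not shown in this post, so `pixel_weight` is a hypothetical placeholder, and `weigh_row_scalar`, the summing of the 8 weights, and the skipped border pixels are assumptions made only for illustration.

```c
/* Hypothetical weight: closer pixel values give a weight nearer to 1.
   The real weight formula is not part of this post. */
static float pixel_weight(float center, float neighbor)
{
    float diff = center - neighbor;
    if (diff < 0.0f) {
        diff = -diff;
    }
    return 1.0f / (1.0f + diff);
}

/* For each target pixel, combine the weights of its 4 left and 4 right
   neighbors (summed here, as an assumption). Pixels closer than 4 to
   either edge are left untouched. */
void weigh_row_scalar(const float *in, float *out, int n)
{
    for (int i = 4; i + 4 < n; ++i) {
        float sum = 0.0f;
        for (int d = -4; d <= 4; ++d) {
            if (d == 0) {
                continue;  /* skip the target pixel itself */
            }
            sum += pixel_weight(in[i], in[i + d]);
        }
        out[i] = sum;
    }
}
```

A host-side reference like this can also be used to check that the vectorized kernel produces the same values as the scalar one.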
So I coded it in this style:
Read 16 pixels (Readin); 8 of them are the 8 target (center) pixels.
After that, split the pixels into 4 (left) and 4 (right), and store them in vector variables.
float8 splited1=(float8)(Readin.s0123,Readin.s5678);
float8 splited2=(float8)(Readin.s1234,Readin.s6789); // and so on...
Then I compare splited1~n with the center using vector operators and calculate the weight.
Finally, I store the result data (float8) into the buffer.
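The indexing behind those swizzles can be sketched in plain C as follows. This is only an emulation of the layout described above, not the kernel itself: `gather_neighbors` and its array parameters are hypothetical names, and it assumes the 8 target pixels sit at positions 4 to 11 of the 16 pixels read.

```c
/* Emulates the swizzles for one block of 16 pixels.
   block[4..11] are assumed to be the 8 target pixels. For target k (0..7),
   neighbors[k] receives the 4 pixels to its left followed by the 4 to its
   right, e.g. k = 0 corresponds to (float8)(Readin.s0123, Readin.s5678). */
void gather_neighbors(const float block[16], float neighbors[8][8])
{
    for (int k = 0; k < 8; ++k) {
        int center = 4 + k;
        for (int j = 0; j < 4; ++j) {
            neighbors[k][j]     = block[center - 4 + j];  /* left side  */
            neighbors[k][4 + j] = block[center + 1 + j];  /* right side */
        }
    }
}
```

Each row of `neighbors` plays the role of one `splitedN` vector, so 16 loaded pixels turn into 8 eight-wide vectors (64 floats) before any weight is computed.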
According to the Mali optimization guide, vectorized code should be faster than scalar code.
But in my case, the vectorized code is about 3 times slower than the scalar code.
Why does this happen? Is it caused by using too many vector elements?
My device is a Samsung Galaxy S6 equipped with a Mali-T760 MP8.
Thank you for the comments. I changed my code to use 128-bit-wide SIMD operations,
and it got a little faster.