Why is vectorized code slower?

I am trying to make my image processing program faster.

So I changed my scalar code into vectorized code.

For context, the program reads the 8 pixels around a target pixel (4 to the left and 4 to the right) from the input buffer:

pixel1 pixel2 pixel3 pixel4 target pixel5 pixel6 pixel7 pixel8

It then compares them with the target pixel, calculates a weight, and writes the result into another buffer.

So I coded it in this style:

Read 16 pixels (Readin); 8 of these will be the 8 target pixels (center).

After that, split the pixels into 4 (left) and 4 (right) neighbors and store them in vector variables:

float8 splited1=(float8)(Readin.s0123,Readin.s5678)

float8 splited2=(float8)(Readin.s1234,Readin.s6789) and so on...
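For readers following the swizzles above, here is a minimal host-side sketch in plain C of the same gather, not the poster's actual kernel. It assumes the 8 target pixels sit at positions 4 to 11 of the 16-pixel window, which is what `s0123`/`s5678` for the first target and `s1234`/`s6789` for the second imply. The name `gather_neighbors` is hypothetical.

```c
/* Hypothetical helper: copy the 8 neighbors of target t (0..7) out of a
   16-pixel window. Target t is assumed to sit at window[t + 4]. */
static void gather_neighbors(const float window[16], int t, float out[8])
{
    for (int k = 0; k < 4; ++k) {
        out[k]     = window[t + k];      /* 4 pixels left of the target  */
        out[k + 4] = window[t + 5 + k];  /* 4 pixels right of the target */
    }
}
```

Calling it with `t = 0` picks window elements 0 to 3 and 5 to 8, matching `splited1`; `t = 1` picks 1 to 4 and 6 to 9, matching `splited2`. Each call produces one 8-element vector, so covering all 8 targets keeps eight such vectors live at once.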

Then I compare splited1~n with center using vector operators and calculate the weight.

Finally, I store the result (float8) into the buffer.

According to the Mali optimization guide, vectorized code is faster than scalar code.

But in my case, the vectorized code is about 3 times slower than the scalar code.

Why does this happen? Is it caused by using too many vector elements?

My device is a Samsung Galaxy S6 equipped with a Mali-T760 MP8.

  • If your data is all fp32 floats then I would guess you are hitting issues around register availability - if you are reading 8 fp32 values for input and needing 8 fp32 values for output then that is a huge amount of data to have hanging about in registers. See Anton's blog here for some basic info on register limits which you probably want to stick to:

    ARM Mali Compute Architecture Fundamentals

    Also remember that GPUs have finite memory bandwidth - 8 cores all making vec8 fp32 accesses are likely to end up memory limited anyway, depending how much maths you are doing per access. Look at whether you really need to use fp32 inputs - fp16 is much faster (half the bandwidth, almost twice the arithmetic performance in Mali's maths units) and normally fine for processing color values (which end up as 8-bit int unorm before being displayed anyway), and many algorithms can operate on int8 luminance values (using int16 or fp16 for temporary precision inside the core if needed, storing an int8 result) which halves the memory bandwidth again.

    ... but without seeing your exact code it's going to be hard to provide specific advice.

    HTH,
    Pete
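To put rough numbers on the register and bandwidth points in the reply, here is a small arithmetic sketch in plain C. It is only an estimate under stated assumptions: the helper names are hypothetical, the 16-byte (128-bit) register size is an assumption rather than something stated in this thread, and the count of eight live neighbor vectors is inferred from the question.

```c
#include <stddef.h>

/* Bytes held by `count` vectors of `lanes` elements of `elem_size` bytes. */
static size_t vector_bytes(size_t count, size_t lanes, size_t elem_size)
{
    return count * lanes * elem_size;
}

/* Registers needed for `bytes`, rounding up. `reg_bytes` is an assumption
   supplied by the caller (for example 16 for a 128-bit register). */
static size_t registers_needed(size_t bytes, size_t reg_bytes)
{
    return (bytes + reg_bytes - 1) / reg_bytes;
}
```

Under those assumptions, eight float8 vectors of fp32 come to `vector_bytes(8, 8, 4)` = 256 bytes, or 16 registers of 16 bytes each, before counting the 16-pixel input and the result. The same data in fp16 is 128 bytes (8 registers), and a 16-pixel input read drops from 64 bytes in fp32 to 16 bytes in int8.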
