How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient way of summing 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
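
For reference, the baseline described above (widen each byte to 32 bits, add, shift right by 2) can be sketched in portable C as below. The second function is one possible alternative, a "SIMD within a register" trick that loads all 4 bytes as one 32-bit word and adds them in two steps. The names `avg4_baseline` and `avg4_swar` are made up for illustration and do not come from this thread, and this is only a sketch, not a tuned Cortex-A8 routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Baseline: widen each byte to 32 bits, add, then shift right by 2. */
static uint32_t avg4_baseline(const uint8_t *p)
{
    assert(p != NULL);
    uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
    return sum >> 2;                      /* truncating average of 4 bytes */
}

/* Alternative (assumption, not from the thread): one 32-bit load,
 * then add the bytes pairwise inside the register. */
static uint32_t avg4_swar(const uint8_t *p)
{
    assert(p != NULL);
    uint32_t w;
    memcpy(&w, p, sizeof w);              /* alignment-safe 32-bit load */

    /* Add even and odd bytes: two 16-bit partial sums, each at most 510. */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);

    /* Add the two 16-bit partial sums: total is at most 1020. */
    uint32_t sum = (pairs & 0xFFFFu) + (pairs >> 16);
    return sum >> 2;
}
```

Both functions return the same result regardless of byte order, because addition does not care which byte ends up in which lane. NEON itself also works on 8-bit lanes; its widening pairwise add (`VPADDL.U8`) performs the same kind of step on many bytes at once.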

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

Shervin Emami over 12 years ago

Note: This was originally posted on 25th September 2010 at http://forums.arm.com

Hi guys,

After spending about a week optimising my two resizing functions, learning NEON and writing NEON versions, I have some timing results that you guys might be interested in. As part of my project I made separate functions to divide the number of pixels by 4 (half width and half height) or by 16 (quarter width and quarter height), where the resizing is done by adding up all the pixel values in each 2x2 or 4x4 block and dividing the sum by 4 or 16.
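
As a rough illustration of that block-averaging scheme, here is a plain C sketch. It is not the code that was timed below; the function name `shrink_box` and its parameters are hypothetical, and it assumes a tightly packed 8-bit image whose width and height are multiples of the block size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shrink an 8-bit greyscale image by averaging n x n blocks,
 * where n = 1 << log2_n (log2_n = 1 for 2x2, log2_n = 2 for 4x4).
 * Hypothetical helper for illustration only. */
static void shrink_box(const uint8_t *src, size_t src_w, size_t src_h,
                       uint8_t *dst, unsigned log2_n)
{
    assert(log2_n >= 1 && log2_n <= 2);
    const size_t n = (size_t)1 << log2_n;
    assert(src_w % n == 0 && src_h % n == 0);

    const size_t dst_w = src_w / n;
    const size_t dst_h = src_h / n;

    for (size_t y = 0; y < dst_h; y++) {
        for (size_t x = 0; x < dst_w; x++) {
            uint32_t sum = 0;             /* at most 16 * 255 = 4080 */
            for (size_t j = 0; j < n; j++)
                for (size_t i = 0; i < n; i++)
                    sum += src[(y * n + j) * src_w + (x * n + i)];

            /* Dividing by n*n (4 or 16) is a right shift by 2*log2_n. */
            dst[y * dst_w + x] = (uint8_t)(sum >> (2 * log2_n));
        }
    }
}
```

The shift truncates rather than rounds; adding half the divisor to `sum` before shifting would give a rounded average if that matters for the application.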

To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):

Resize function in OpenCV C library (GCC4.2 with -O3) takes about 38 msec
My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
My hand-optimised ARMv7 Assembly code takes about 7.8 msec
My hand-optimised NEON Assembly code takes about 0.9 msec

So I'm very happy to see that the resizing function that was the bottleneck of my program is now about 42 times faster (38 msec down to 0.9 msec)! Especially since my optimised C code is structured to do exactly the same thing as my ARM assembly code, but obviously the C compiler didn't agree!

Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/