How to efficiently sum 4 x 8bit integers with ARM or NEON
Shervin Emami
over 12 years ago
Note: This was originally posted on 17th September 2010 at http://forums.arm.com
Hi,
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.
But I'm just a beginner, so I'm wondering if there is a more efficient way to sum 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.
It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
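As a baseline for comparison, one way to sum four packed bytes without widening each one to a 32-bit int first can be sketched in plain C. This is only a sketch under some assumptions: the names sum4_bytes and avg4_bytes are hypothetical, the four pixels are assumed to be already loaded as a single 32-bit word, and the average truncates rather than rounds.

```c
#include <stdint.h>

/* Hypothetical helper: sum the four 8-bit values packed in one 32-bit word.
 * Bytes are first added in pairs inside two 16-bit lanes, then the two
 * lanes are added together. The result is at most 4 * 255 = 1020. */
static uint32_t sum4_bytes(uint32_t w)
{
    /* Lane 0 holds byte0 + byte1, lane 1 holds byte2 + byte3. */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    /* Fold the upper lane onto the lower lane and keep the low 16 bits. */
    return (pairs + (pairs >> 16)) & 0xFFFFu;
}

/* Hypothetical helper: truncating average of the four packed bytes. */
static uint8_t avg4_bytes(uint32_t w)
{
    return (uint8_t)(sum4_bytes(w) >> 2);
}
```

For example, sum4_bytes(0x04030201) adds 1 + 2 + 3 + 4. Whether this beats four byte loads and adds on a Cortex-A8 would need to be measured.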
Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
Shervin Emami
over 12 years ago
Note: This was originally posted on 25th September 2010 at http://forums.arm.com
Hi guys,
After spending about a week optimising my 2 resizing functions, learning NEON and writing NEON versions of them, I have some timing results that you guys might be interested in. As part of my project I made separate functions to divide the number of pixels by 4 (half width and half height) or by 16 (quarter width and quarter height), where the resizing is done by adding up the pixel values in each 2x2 or 4x4 block and dividing the sum by 4 or 16.
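To make the 2x2 case concrete, here is a minimal plain C sketch of that averaging. It is not my actual optimised code: the function name shrink_half is hypothetical, and it assumes an even width and height, tightly packed rows (no padding) and a truncating divide by 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: halve the width and height of an 8-bit greyscale
 * image by averaging each 2x2 block of source pixels.
 * Assumes src_w and src_h are even and rows are tightly packed.
 * dst must have room for (src_w / 2) * (src_h / 2) bytes. */
static void shrink_half(const uint8_t *src, size_t src_w, size_t src_h,
                        uint8_t *dst)
{
    size_t dst_w = src_w / 2;
    size_t dst_h = src_h / 2;

    for (size_t y = 0; y < dst_h; ++y) {
        /* The two source rows that feed output row y. */
        const uint8_t *row0 = src + (2 * y) * src_w;
        const uint8_t *row1 = row0 + src_w;

        for (size_t x = 0; x < dst_w; ++x) {
            /* Sum of the 2x2 block; at most 4 * 255, so it fits easily. */
            unsigned sum = row0[2 * x] + row0[2 * x + 1]
                         + row1[2 * x] + row1[2 * x + 1];
            dst[y * dst_w + x] = (uint8_t)(sum >> 2); /* divide by 4 */
        }
    }
}
```

The 4x4 version follows the same pattern with four source rows per output row and a shift right by 4 instead of 2.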
To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):
Resize function in OpenCV C library (GCC4.2 with -O3) takes about 38 msec
My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
My hand-optimised ARMv7 Assembly code takes about 7.8 msec
My hand-optimised NEON Assembly code takes about 0.9 msec
So I'm very happy to see that the resizing function that was the bottleneck of my program is now about 42 times as fast as the OpenCV version (38 msec down to 0.9 msec)! That is especially surprising since my optimised C code is structured to do exactly the same thing as my ARM assembly code, but obviously the C compiler didn't agree!
Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/