How to efficiently sum 4 x 8bit integers with ARM or NEON
Shervin Emami
over 12 years ago
Note: This was originally posted on 17th September 2010 at
http://forums.arm.com
Hi,
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.
But I'm just a beginner, so I'm wondering if there is a more efficient method of summing 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.
It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
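For reference, the baseline described in the question (widen each byte, add, shift right by 2) might look like the following in C. This is only a sketch of the scalar approach, not the poster's actual code; the function name `shrink4_scalar` and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: average each group of 4 consecutive source bytes
   into one output byte. src must hold 4 * n_out bytes. */
void shrink4_scalar(const uint8_t *src, uint8_t *dst, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        const uint8_t *p = src + 4 * i;
        /* Each byte is widened before adding, so the sum
           (at most 4 * 255 = 1020) cannot overflow. */
        uint32_t sum = (uint32_t)p[0] + p[1] + p[2] + p[3];
        dst[i] = (uint8_t)(sum >> 2); /* truncating divide by 4 */
    }
}
```

The shift truncates rather than rounds; adding 2 to the sum before shifting would give round-to-nearest instead.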
Shervin Emami
over 12 years ago
Note: This was originally posted on 27th September 2010 at
http://forums.arm.com
Actually, I just used VPADDL instead of VPADAL, so your code should run even faster than mine :-) But there is one other important difference: my code doesn't specify the data alignment (such as @64 and @128), because I'm using the default assembler in Xcode (GCC 4.2 -assembler-as-cpp), and I can't figure out how to specify the NEON data alignment. Maybe I should be using NASM or something instead of GCC to assemble my code...
And one question about your code: You specify @128 alignment for 2 of your instructions and @64 for the other 2 loads & store. The timing diagram says that @64 is the max alignment it can take advantage of in VLD1.8, so is there a reason you wrote @128 for some of your instructions and not others?
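To make the VPADDL idea concrete: a pairwise widening add works directly on 8-bit lanes, so two of them in a row (u8 to u16, then u16 to u32) leave one sum per group of 4 bytes. The sketch below is an assumption about how that could be written in C, not the code from either post. It uses the `arm_neon.h` intrinsics `vpaddlq_u8` and `vpaddlq_u16` when NEON is available and an equivalent plain-C model otherwise; the names `sum4x4` and `shrink4_pairwise` are hypothetical, and alignment hints such as @64 or @128 are not expressed here.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Sum 4 groups of 4 consecutive bytes (16 input bytes -> 4 sums). */
static void sum4x4(const uint8_t *in, uint32_t out[4])
{
#if defined(__ARM_NEON)
    uint16x8_t pairs = vpaddlq_u8(vld1q_u8(in)); /* 8 sums of 2 bytes */
    vst1q_u32(out, vpaddlq_u16(pairs));          /* 4 sums of 4 bytes */
#else
    /* Plain-C model of the same two pairwise widening adds. */
    uint16_t pairs[8];
    for (int k = 0; k < 8; k++)
        pairs[k] = (uint16_t)(in[2 * k] + in[2 * k + 1]);
    for (int k = 0; k < 4; k++)
        out[k] = (uint32_t)pairs[2 * k] + pairs[2 * k + 1];
#endif
}

/* Average each group of 4 source bytes; src must hold 4 * n_out bytes. */
void shrink4_pairwise(const uint8_t *src, uint8_t *dst, size_t n_out)
{
    size_t i = 0;
    for (; i + 4 <= n_out; i += 4) {
        uint32_t sums[4];
        sum4x4(src + 4 * i, sums);
        for (int k = 0; k < 4; k++)
            dst[i + k] = (uint8_t)(sums[k] >> 2);
    }
    for (; i < n_out; i++) { /* scalar tail for leftover outputs */
        const uint8_t *p = src + 4 * i;
        dst[i] = (uint8_t)(((uint32_t)p[0] + p[1] + p[2] + p[3]) >> 2);
    }
}
```

VPADAL differs only in that it adds the pairwise sums into an existing accumulator register instead of overwriting the destination.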