How to efficiently sum 4 x 8bit integers with ARM or NEON

Note: This was originally posted on 17th September 2010 at http://forums.arm.com

Hi,

I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.

But I'm just a beginner, so I'm wondering if there is a more efficient way to sum 4 consecutive bytes than converting each byte to a 32-bit int, summing them, and then shifting right to get the average.

It's for a Cortex-A8 (ARMv7-A), and the data is aligned to 32 bytes or whatever I want.
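For reference, the baseline described above (widen each byte to 32 bits, add, shift right) might look like this in plain C. This is only a sketch of the straightforward approach; the function names `sum4_u8` and `avg4_u8` are made up for illustration and are not from any library.

```c
#include <stdint.h>

/* Baseline: widen each of four consecutive bytes to 32 bits and add them.
   The maximum result is 4 * 255 = 1020, so it cannot overflow. */
static uint32_t sum4_u8(const uint8_t *p)
{
    return (uint32_t)p[0] + (uint32_t)p[1] + (uint32_t)p[2] + (uint32_t)p[3];
}

/* Truncated average of the same four bytes: sum, then shift right by 2. */
static uint8_t avg4_u8(const uint8_t *p)
{
    return (uint8_t)(sum4_u8(p) >> 2);
}
```

Calling `avg4_u8` on the pixels 10, 20, 30, 40 gives 25. Anything faster needs to beat four byte loads plus three adds and a shift per output pixel.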

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/

Shervin Emami over 12 years ago

Note: This was originally posted on 21st September 2010 at http://forums.arm.com

Wow, thanks so much guys, that's exactly what I needed to know! I still haven't learnt enough NEON to have used it, and I totally overlooked USADA8 for this operation, so until now I had just come up with something like this:

LDR    r0, [r4]              // Load 4 pixels A:B:C:D from (x,y)
LDR    r1, [r5]              // Load 4 pixels E:F:G:H from (x,y+1)
UHADD8 r2, r0, r1            // Add pixels A:B:C:D with pixels E:F:G:H and divide each pixel by 2.
UXTB   r3, r2                // Set r3 = (D+H)/2
UXTAB  r3, r3, r2, ROR #8    // Set r3 = r3 + (C+G)/2
UXTAB  r3, r3, r2, ROR #16   // Set r3 = r3 + (B+F)/2
UXTAB  r3, r3, r2, ROR #24   // Set r3 = r3 + (A+E)/2
// r3 is now (A+E)/2 + (B+F)/2 + (C+G)/2 + (D+H)/2
// which is (A+B+C+D + E+F+G+H) / 2
LSR    r3, r3, #2            // Set r3 = average of 8 pixels A to H
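To sanity-check the arithmetic of that sequence off-target, here is a rough C model of it. It assumes the documented behaviour of UHADD8 (a truncating per-byte `(a + b) >> 1`) and treats the UXTB/UXTAB chain as a plain sum of the four byte lanes. The names `uhadd8_model`, `sum_lanes` and `avg8_model` are hypothetical helpers, not real intrinsics.

```c
#include <stdint.h>

/* Model of UHADD8: unsigned halving add, (a + b) >> 1 in each byte lane. */
static uint32_t uhadd8_model(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (a >> (8 * lane)) & 0xFFu;
        uint32_t y = (b >> (8 * lane)) & 0xFFu;
        r |= ((x + y) >> 1) << (8 * lane);
    }
    return r;
}

/* Model of the UXTB / UXTAB chain: add the four byte lanes of w. */
static uint32_t sum_lanes(uint32_t w)
{
    return (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
}

/* Whole sequence: halving add of two rows, sum the lanes, shift right by 2. */
static uint32_t avg8_model(uint32_t row0, uint32_t row1)
{
    return sum_lanes(uhadd8_model(row0, row1)) >> 2;
}
```

One thing the model makes visible: because each lane is halved before the lanes are summed, up to half a count per lane is dropped, so the result can come out one below the exact truncated average. For example, rows 3,3,5,5 and 0,0,0,0 sum to 16 (exact average 2), but the model returns 1.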

So obviously your 2 solutions are much better! I used to be an assembly programmer about 10 years ago for Intel 16-bit and 32-bit CPUs, but I only started learning ARM last week, and now I finally see why so many people used to say that ARM RISC is better than Intel CISC! NEON seems really powerful, and I'm amazed that Thumb-2 can fit something like USADA8 in just 16 bits!
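The replies being thanked are not shown in this thread, so the following is an assumption about how USADA8 would be used here rather than a copy of the suggested code. USADA8 computes `Rd = Ra + sum of |Rn[i] - Rm[i]|` over the four byte lanes, so with a zero second operand it adds four bytes into an accumulator in one instruction. A portable C model of that idea (`usada8_model` and `sum4_via_usada8` are hypothetical names):

```c
#include <stdint.h>

/* Portable model of USADA8 Rd, Rn, Rm, Ra:
   Rd = Ra + sum over the four byte lanes of |Rn[lane] - Rm[lane]|. */
static uint32_t usada8_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    uint32_t acc = ra;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t x = (rn >> (8 * lane)) & 0xFFu;
        uint32_t y = (rm >> (8 * lane)) & 0xFFu;
        acc += (x > y) ? (x - y) : (y - x);
    }
    return acc;
}

/* With Rm = 0 the absolute differences are just the bytes themselves,
   so one call adds four packed pixels into the running total. */
static uint32_t sum4_via_usada8(uint32_t pixels, uint32_t acc)
{
    return usada8_model(pixels, 0u, acc);
}
```

Chaining the accumulator across rows would give the sum of a whole block with one such operation per 32-bit load, and unlike the UHADD8 version nothing is truncated until the final shift.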

Glad to finally be part of the ARM community :-)

Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
