How to efficiently sum 4 x 8bit integers with ARM or NEON
Shervin Emami
over 7 years ago
Note: This was originally posted on 17th September 2010 at http://forums.arm.com
Hi,
I am trying to write an ASM function to shrink an 8-bit greyscale image by 4, so I need to get the sum of 4 bytes very quickly. From what I have read, NEON needs at least 32-bit integers and VFP is for floats, so it looks like I should just stick with ARM (or Thumb-2) instructions.
But I'm just a beginner, so I'm wondering if there is a more efficient way of summing 4 consecutive bytes than converting each byte to a 32-bit int and then summing them (and then shifting right to get the average).
It's for a Cortex-A8 (ARMv7-A), and the data can be aligned to 32 bytes or whatever I want.
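For illustration only, here is a minimal portable C sketch of one way to sum 4 consecutive bytes without converting each one separately: load them as a single 32-bit word and add them pairwise with masks. The names sum4_bytes and avg4_pixels are hypothetical, not taken from any library, and this is not necessarily the fastest option on a Cortex-A8. As far as I know NEON does support 8-bit lanes (for example the pairwise widening add VPADDL.U8), so a NEON version may well be worth trying too.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sum the four bytes packed in one 32-bit word (result is 0..1020).
 * Hypothetical helper name, for illustration only. */
static inline uint32_t sum4_bytes(uint32_t w)
{
    /* Add bytes 0+1 into the low halfword and bytes 2+3 into the high one.
     * Each partial sum is at most 510, so it cannot overflow 16 bits. */
    uint32_t pairs = (w & 0x00FF00FFu) + ((w >> 8) & 0x00FF00FFu);
    /* Add the two 16-bit partial sums together. */
    return (pairs & 0xFFFFu) + (pairs >> 16);
}

/* Average of 4 consecutive pixels, truncating (shift right by 2). */
static inline uint8_t avg4_pixels(const uint8_t *p)
{
    uint32_t w;
    assert(p != NULL);
    memcpy(&w, p, sizeof w);   /* byte order does not matter for a sum */
    return (uint8_t)(sum4_bytes(w) >> 2);
}
```

For example, the bytes 10, 20, 30, 40 sum to 100, so avg4_pixels returns 25. Use (sum + 2) >> 2 instead if you want rounding rather than truncation.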
Cheers,
Shervin Emami
http://www.shervinemami.co.cc/
Shervin Emami
over 7 years ago
Note: This was originally posted on 25th September 2010 at http://forums.arm.com
Hi guys,
After spending about a week optimising my two resizing functions, learning NEON and writing NEON versions, I have some timing results that you might find interesting. As part of my project I made separate functions to reduce the number of pixels by a factor of 4 (half width and half height) or by a factor of 16 (quarter width and quarter height). The resizing is done by summing each 2x2 or 4x4 block of pixel values and dividing by 4 or 16.
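For readers who want to see the idea in code, here is a plain C reference sketch of the quarter-width, quarter-height case: sum each 4x4 block and divide by 16. It is not the optimised C, ARM or NEON code that was timed. The function name shrink_by_4x4 is hypothetical, and the sketch assumes a tightly packed 8-bit image whose width and height are multiples of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shrink an 8-bit greyscale image to quarter width and quarter height by
 * averaging each 4x4 block of pixels.
 * Assumptions: rows are tightly packed (stride == width), width and height
 * are multiples of 4, and dst holds (width / 4) * (height / 4) bytes. */
static void shrink_by_4x4(const uint8_t *src, size_t width, size_t height,
                          uint8_t *dst)
{
    assert(width % 4 == 0 && height % 4 == 0);
    const size_t out_w = width / 4;
    const size_t out_h = height / 4;

    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            uint32_t sum = 0;   /* at most 16 * 255 = 4080 */
            for (size_t dy = 0; dy < 4; dy++) {
                const uint8_t *row = src + (oy * 4 + dy) * width + ox * 4;
                sum += (uint32_t)row[0] + row[1] + row[2] + row[3];
            }
            dst[oy * out_w + ox] = (uint8_t)(sum >> 4);   /* divide by 16 */
        }
    }
}
```

The half-width, half-height case is the same loop with 2x2 blocks and a shift right by 2.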
To shrink a 480x360 pixel greyscale image by 16 (quarter width and quarter height) on a Cortex-A8 (iPhone 3GS):
Resize function in OpenCV C library (GCC4.2 with -O3) takes about 38 msec
My optimised C code without intrinsics (GCC4.2 with -O3) takes about 27 msec
My hand-optimised ARMv7 Assembly code takes about 7.8 msec
My hand-optimised NEON Assembly code takes about 0.9 msec
So I'm very happy to see that the resizing function that was the bottleneck of my program is now about 42 times faster than the OpenCV version (38 msec down to 0.9 msec)! Especially since my optimised C code is structured to do exactly the same thing as my ARM assembly code, but obviously the C compiler didn't agree!
Cheers,
Shervin Emami.
http://www.shervinemami.co.cc/