NEON Running slower
John Mudumbe
over 9 years ago
Note: This was originally posted on 16th April 2013 at
http://forums.arm.com
Hi Shervin,
Thanks a lot for the reply. I removed the auto-vectorization flag and used vst1q_s16() to get the data back out of the NEON register, and it sped up my code, though not by much. I think that is because of the reason you mentioned, memory being slow, and also because my loop only runs for a short time, even though it is called many times.
I wanted to know: when profiling, is it advisable to use gprof, or is there another profiling tool I can use?
I have purchased D-Stream, but it does not give me fine-grained profiling like gprof does on the JMVC software that I am currently working on. Also, when I tried to run it under RTSM, I had a compiler problem where some libraries were not included.
Thanks for the help again.
Shervin Emami
over 9 years ago
Note: This was originally posted on 10th April 2013 at
http://forums.arm.com
There are several reasons why this code can be slower than plain C code.
The first thing to note is that you are manually using NEON intrinsics but also telling GCC to try to generate NEON code from your plain C code (auto-vectorization), since you use -ftree-vectorize -O3. That may be giving you a misleading comparison (e.g. your C code might already be using GCC-generated NEON, and I'm not sure, but perhaps your NEON code is interfering somehow with GCC's NEON code).
Also, you are using a for loop that only runs twice, so GCC might actually be generating the loop rather than automatically unrolling it. Never assume GCC for ARM will automatically figure out any optimization; you are better off unrolling it yourself to be sure.
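As a rough illustration of unrolling by hand (the original loop is not shown in this thread, so the helper sub8 and both diff16_* functions are made-up names, and the arithmetic is just a placeholder), a two-iteration loop and its unrolled form might look like this. The scalar fallback is only there so the sketch also compiles on a machine without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[0..7] = a[0..7] - b[0..7]. */
static inline void sub8(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Load 8 lanes from each input, subtract, store 8 lanes. */
    vst1q_s16(out, vsubq_s16(vld1q_s16(a), vld1q_s16(b)));
#else
    /* Scalar fallback for builds without NEON. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] - b[i]);
#endif
}

/* Loop form: only two iterations, which the compiler may or may not unroll. */
void diff16_loop(const int16_t *a, const int16_t *b, int16_t *out)
{
    for (int blk = 0; blk < 2; blk++)
        sub8(a + 8 * blk, b + 8 * blk, out + 8 * blk);
}

/* Manually unrolled form: no loop counter, no branch. */
void diff16_unrolled(const int16_t *a, const int16_t *b, int16_t *out)
{
    sub8(a, b, out);
    sub8(a + 8, b + 8, out + 8);
}
```

Both functions should produce identical output; whether the unrolled one is actually faster depends on what GCC did with the loop, so it is worth checking the generated assembly or timing both.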
Also, I'm willing to bet that your speed is not limited by the CPU (ARM or NEON) but by your memory access. Memory access isn't necessarily faster using NEON than plain ARM; often plain ARM will have better memory speeds than NEON.
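One rough way to check this is to time each version over many calls and compare the average cost per call. The sketch below is only an assumption about how such a harness could look: kernel_fn, time_kernel and add8_plain are hypothetical names, and clock() measures CPU time at coarse resolution, so it needs a large repetition count to give a usable number. If the plain C and NEON versions come out about the same, the arithmetic is probably not the bottleneck.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the kernels being compared. */
typedef void (*kernel_fn)(const int16_t *a, const int16_t *b, int16_t *out);

/* Plain C reference kernel: out[i] = a[i] + b[i] for 8 values. */
void add8_plain(const int16_t *a, const int16_t *b, int16_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
}

/* Returns the average CPU time in seconds per call over `reps` calls. */
double time_kernel(kernel_fn fn, const int16_t *a, const int16_t *b,
                   int16_t *out, long reps)
{
    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        fn(a, b, out);
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)reps;
}
```

Calling time_kernel(add8_plain, a, b, out, 10000000L) and then the same with a NEON kernel gives two numbers to compare. This is not a replacement for a real profiler; it only times one function in isolation, with its data most likely already in cache.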
Also, extracting each 16-bit lane with vgetq_lane_s16() might not be an efficient solution. Try using vst1q_s16() to store the whole 16 bytes with one NEON instruction instead of 8 lines of code (which GCC might turn into 1 NEON instruction if you are lucky, but might turn into 24 NEON instructions if you are unlucky!).
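To make the difference concrete, here is a minimal sketch of both approaches. The function names add8_lanes and add8_store and the addition itself are assumptions for illustration, since the original code is not shown here; the scalar fallback only exists so the sketch compiles without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Lane-by-lane extraction: eight separate vgetq_lane_s16() calls. */
void add8_lanes(const int16_t *a, const int16_t *b, int16_t *out)
{
#if HAVE_NEON
    int16x8_t sum = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    /* The lane index must be a compile-time constant, hence no loop. */
    out[0] = vgetq_lane_s16(sum, 0);
    out[1] = vgetq_lane_s16(sum, 1);
    out[2] = vgetq_lane_s16(sum, 2);
    out[3] = vgetq_lane_s16(sum, 3);
    out[4] = vgetq_lane_s16(sum, 4);
    out[5] = vgetq_lane_s16(sum, 5);
    out[6] = vgetq_lane_s16(sum, 6);
    out[7] = vgetq_lane_s16(sum, 7);
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}

/* Single store: vst1q_s16() writes all eight lanes (16 bytes) at once. */
void add8_store(const int16_t *a, const int16_t *b, int16_t *out)
{
#if HAVE_NEON
    int16x8_t sum = vaddq_s16(vld1q_s16(a), vld1q_s16(b));
    vst1q_s16(out, sum);
#else
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

Both functions write the same eight values to out; the second simply leaves the compiler far less room to generate a long sequence of lane moves.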