Differences between NEON in Cortex-A8 and A9
Kun Feng
over 12 years ago
Note: This was originally posted on 25th July 2011 at http://forums.arm.com
I am currently working on a single-core Cortex-A9 chip (AML8726-m, if you want to know more), and the datasheet says it includes NEON. But when I test the code from http://hilbert-space.de/?p=22 I see no speedup at all; sometimes the NEON assembly-optimized code even runs slower than the plain ARM C code. The same code gets a good speedup on my i.MX515, which is a Cortex-A8 chip.
I am using the Android NDK to build a test app that runs on Android. Could that be the reason?
Can anyone tell me why this happens?
Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms
Android is a Linux-based OS, so I can call gettimeofday() to measure elapsed time with microsecond resolution. The results on the A9 are not identical, but they are almost the same, and I am sure I did not run the same binary three times.
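For reference, the timing can be sketched roughly as follows. This is a minimal sketch, not the exact code from my test app: the names `elapsed_us`, `time_best_of`, and `example_workload` are hypothetical, and taking the fastest of several runs is an assumption added here to reduce scheduling noise.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
int64_t elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000
         + (int64_t)(end->tv_usec - start->tv_usec);
}

/* Run fn() `runs` times; return the fastest wall-clock time in
 * microseconds, or -1 if runs <= 0. */
int64_t time_best_of(void (*fn)(void), int runs)
{
    int64_t best = INT64_MAX;
    for (int i = 0; i < runs; ++i) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        fn();
        gettimeofday(&t1, NULL);
        int64_t us = elapsed_us(&t0, &t1);
        if (us < best)
            best = us;
    }
    return runs > 0 ? best : -1;
}

/* Hypothetical stand-in for the routine under test. */
void example_workload(void)
{
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 100000u; ++i)
        sink += i;
}
```

Calling `time_best_of(example_workload, 10)` once per variant (C, NEON intrinsics, NEON assembly) gives numbers that can be compared directly. gettimeofday() measures wall-clock time, so other activity on the device still shows up in the result.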
Thanks and looking forward to any useful suggestions.
Krish ks
over 12 years ago
Note: This was originally posted on 23rd March 2013 at http://forums.arm.com
Hi,
I ran a NEON test on a Linux board: a 4*4 matrix multiplication implemented with both ARM and NEON instructions.
(1) Matrix multiplication, calculating one element at a time. Here I used only S registers. (Normal ARM instructions)
Here I load the float array contents into S registers (32-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication, with S registers as operands and to hold the result.
(2) Matrix multiplication: since the calculation is done 128 bits at a time, the number of instructions drops to 1/4 of (1). Here I used Q and D registers. (NEON instructions)
Here I load the complete float array contents into Q registers (128-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication with Q (128-bit) and D (64-bit) registers, which reduces the number of instructions (load, store, multiplication) to 1/4 of the code in (1).
I am using Linux 3.0.35, and the test code runs on a Linux platform (Cortex-A9).
But there is no speed difference between (1) and (2).
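The two variants can be sketched in C roughly as below. This is only an illustration of the idea, not my actual assembly: it uses NEON intrinsics instead of hand-written "vldmia"/"vmul.f32"/"vmla.f32", it assumes column-major 4*4 float matrices, and the function names `matmul4_scalar` and `matmul4_neon` are hypothetical. When the compiler does not target NEON, the second function simply falls back to the scalar one.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* (1) One element at a time: c = a * b, 4x4, column-major. */
void matmul4_scalar(const float *a, const float *b, float *c)
{
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[k * 4 + row] * b[col * 4 + k];
            c[col * 4 + row] = sum;
        }
    }
}

/* (2) One whole column (four floats, 128 bits) per instruction. */
void matmul4_neon(const float *a, const float *b, float *c)
{
#if HAVE_NEON
    /* Load the four columns of a into Q registers. */
    float32x4_t a0 = vld1q_f32(a + 0);
    float32x4_t a1 = vld1q_f32(a + 4);
    float32x4_t a2 = vld1q_f32(a + 8);
    float32x4_t a3 = vld1q_f32(a + 12);

    for (int col = 0; col < 4; ++col) {
        /* Column of c = sum over k of (column k of a) * b[k][col]. */
        float32x4_t acc = vmulq_n_f32(a0, b[col * 4 + 0]);
        acc = vmlaq_n_f32(acc, a1, b[col * 4 + 1]);
        acc = vmlaq_n_f32(acc, a2, b[col * 4 + 2]);
        acc = vmlaq_n_f32(acc, a3, b[col * 4 + 3]);
        vst1q_f32(c + col * 4, acc);
    }
#else
    /* Assumption for portability: no NEON target, reuse the scalar path. */
    matmul4_scalar(a, b, c);
#endif
}
```

Per output column, variant (2) issues one multiply and three multiply-accumulates where variant (1) issues sixteen scalar multiply-adds, which is where the expected 1/4 instruction count comes from.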
The following options are enabled in my Linux kernel configuration:
CONFIG_VFP=y
CONFIG_VFPv3=y
CONFIG_NEON=y
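These options only enable kernel support. Whether the running CPU actually reports NEON to user space can be checked with something like the sketch below, which assumes the ARM Linux /proc/cpuinfo layout where a "Features" line lists "neon". The helper names `line_has_flag` and `cpu_reports_neon` are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears in `line` as a whole whitespace-separated
 * word, 0 otherwise. */
int line_has_flag(const char *line, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = line;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += n;
    }
    return 0;
}

/* Return 1 if a "Features" line in /proc/cpuinfo lists "neon",
 * 0 if none does, -1 if the file cannot be opened. */
int cpu_reports_neon(void)
{
    char line[1024];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0 &&
            line_has_flag(line, "neon")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

On a non-ARM machine there is no "Features" line, so `cpu_reports_neon()` returns 0 there.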
I used the following gcc command to build the NEON application; the compiler version is gcc 4.6.2:
gcc -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard -o test.out test.c
Why didn't I find any performance difference between the normal ARM and NEON code?
I have tested the same code on a Cortex-A8, and there I do see a performance difference.
Thanks in advance