Differences between NEON in Cortex-A8 and A9
Kun Feng
over 12 years ago
Note: This was originally posted on 25th July 2011 at http://forums.arm.com
Currently I am working on a single-core Cortex-A9 chip (AML8726-m, if you want to know more), and the datasheet says it includes NEON. But when I test the code here (http://hilbert-space.de/?p=22), I cannot find any acceleration; sometimes the NEON-assembly-optimized code runs even slower than the ARM C code. At the same time, the same code gets a pretty good acceleration on my i.MX515, which is a Cortex-A8 chip.
I am using the Android NDK to build a test app running on Android. Could that be the reason?
Can anyone tell me why this happens?
Here are some results:
#####On A8#####
arm c code: 116.*** ms
neon c code: 83.*** ms
neon asm code: 51.*** ms
#####On A9#####
arm c code: 107.*** ms
neon c code: 106-107.*** ms
neon asm code: 106-107.*** ms
Android is a Linux-based OS, so I can call gettimeofday() to measure a time period with microsecond resolution. The results on A9 are not identical but almost the same, and I am sure I didn't run the same binary 3 times.
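For reference, the gettimeofday() timing described above can be sketched roughly as follows. This is only an illustration, not the code used for the numbers in this post: `elapsed_us` and `best_of_us` are hypothetical helper names, and taking the fastest of several runs is an assumption about how one might reduce scheduling noise.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
static long long elapsed_us(const struct timeval *start,
                            const struct timeval *end)
{
    long long sec  = (long long)end->tv_sec  - (long long)start->tv_sec;
    long long usec = (long long)end->tv_usec - (long long)start->tv_usec;
    return sec * 1000000LL + usec;
}

/* Hypothetical helper: call fn() `runs` times and return the fastest
 * single run in microseconds. */
static long long best_of_us(void (*fn)(void), int runs)
{
    assert(fn != NULL && runs > 0);
    long long best = LLONG_MAX;
    for (int i = 0; i < runs; i++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        fn();
        gettimeofday(&t1, NULL);
        long long us = elapsed_us(&t0, &t1);
        if (us < best)
            best = us;
    }
    return best;
}
```

Each variant (ARM C, NEON C, NEON asm) would be wrapped in a `void fn(void)` and passed to `best_of_us`. Note that gettimeofday() reads the wall clock, so a clock adjustment during a run would distort that sample.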
Thanks and looking forward to any useful suggestions.
Krish ks
over 12 years ago
Note: This was originally posted on 23rd March 2013 at http://forums.arm.com
Hi,
I ran a NEON test on a Linux board, doing a 4*4 matrix multiplication with ARM and NEON instructions.
(1) Matrix multiplication, calculating the elements one by one. Here I have used only S registers. (Normal ARM instructions)
Here I load the float array contents into S registers (32-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication, with S registers as the operands and to hold the result.
(2) Matrix multiplication, 128 bits at a time, so the number of instructions becomes 1/4 of that in (1). Here I have used Q and D registers. (NEON instructions)
Here I load the complete float array into Q registers (128-bit) using "vldmia", then use "vmul.f32" and "vmla.f32" to perform the matrix multiplication with Q (128-bit) and D (64-bit) registers, which reduces the number of instructions (load, store, multiply) to 1/4 of that in code (1).
I am using Linux 3.0.35, and the test code runs on a Linux platform (Cortex-A9).
But there is no speed difference between (1) and (2).
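To make the comparison between (1) and (2) concrete, here is a rough C sketch of the two approaches. It is not the assembly that was tested: it uses NEON intrinsics from `arm_neon.h` instead of hand-written "vldmia"/"vmul.f32"/"vmla.f32", it assumes column-major 4*4 matrices, and the function names are made up for illustration. On a compiler without NEON support the vector version simply falls back to the scalar one.

```c
#include <assert.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* (1) Scalar version: c = a * b, one element at a time.
 * Matrices are column-major; c must not alias a or b. */
static void mat4_mul_scalar(const float *a, const float *b, float *c)
{
    assert(c != a && c != b);
    for (int col = 0; col < 4; col++) {
        for (int row = 0; row < 4; row++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a[k * 4 + row] * b[col * 4 + k];
            c[col * 4 + row] = sum;
        }
    }
}

/* (2) Vector version: each 128-bit Q register holds one whole column,
 * so every multiply or multiply-accumulate handles four floats. */
static void mat4_mul_vector(const float *a, const float *b, float *c)
{
    assert(c != a && c != b);
#if HAVE_NEON
    float32x4_t a0 = vld1q_f32(a + 0);
    float32x4_t a1 = vld1q_f32(a + 4);
    float32x4_t a2 = vld1q_f32(a + 8);
    float32x4_t a3 = vld1q_f32(a + 12);
    for (int col = 0; col < 4; col++) {
        const float *bc = b + col * 4;
        float32x4_t r = vmulq_n_f32(a0, bc[0]);
        r = vmlaq_n_f32(r, a1, bc[1]);
        r = vmlaq_n_f32(r, a2, bc[2]);
        r = vmlaq_n_f32(r, a3, bc[3]);
        vst1q_f32(c + col * 4, r);
    }
#else
    /* Assumption: no NEON available at compile time, use the scalar path. */
    mat4_mul_scalar(a, b, c);
#endif
}
```

Both functions should produce the same result for the same inputs, so one can be used to check the other before timing them.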
The following options are enabled in my Linux kernel configuration:
CONFIG_VFP=y
CONFIG_VFPv3=y
CONFIG_NEON=y
I used the following gcc command to build the NEON application; the compiler version is gcc 4.6.2:
gcc -march=armv7-a -mtune=cortex-a9 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=hard -o test.out test.c
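One way to confirm that flags like these actually took effect is to check the compiler's predefined macros from inside the program. The sketch below is an assumption about how that check might look, not part of the original test: the function names are hypothetical, and it relies on GCC defining `__ARM_NEON__` (or the ACLE spelling `__ARM_NEON`) when NEON is enabled and `__ARM_PCS_VFP` for the hard-float ABI. It only rules out a build misconfiguration; it does not by itself explain a timing difference between cores.

```c
#include <assert.h>
#include <stdio.h>

/* 1 if this file was compiled with NEON enabled, else 0. */
static int built_with_neon(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

/* 1 if this file was compiled for the ARM hard-float calling convention. */
static int built_with_hard_float_abi(void)
{
#if defined(__ARM_PCS_VFP)
    return 1;
#else
    return 0;
#endif
}

/* Writes a one-line summary into buf and returns the snprintf() result. */
static int describe_build(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "neon=%d hard_float_abi=%d",
                    built_with_neon(), built_with_hard_float_abi());
}
```

Printing the summary once at program start makes it easy to see which build variant produced a given timing result.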
Why didn't I find any performance difference between the normal ARM and NEON code?
I have tested the same code on a Cortex-A8, and there I do see a performance difference.
Thanks in advance.