Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing FFT code using NEON on ARM. My target is a BeagleBoard-xM, and I run my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near operating at that frequency. Currently I am having difficulty with a basic understanding of NEON. Could anyone please help me with this?

The following are sample programs I ran. LOOP CODE:









Loop Unrolled code:





The following are the results I got at different frequencies:
The above does not make any sense: different cycles per instruction at different frequencies?




  • Note: This was originally posted on 26th March 2013 at http://forums.arm.com

    Yes, please post your code. If it's large I recommend using something like pastebin.com instead of posting it directly.

    When you say you're using on-chip SRAM, are you referring to the 64KB at 0x40200000? That is still on the other side of the L3 bus, so you'd definitely be accessing it at bus speed and not at something derived from the CPU clock. I can't find any SRAM internal to the CPU, unless you're using part of the L2 cache in lockdown. If you're going through L3, that's a 200MHz clock, if it is set up correctly of course.

    I'm still not sure why you're getting what appears to be variable perf/MHz for your larger data set if the L2 cache is enabled. It could be that page attributes aren't set up correctly, or that something is uninitialized with NEON. There's a lot of stuff to set up.
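To illustrate the bus-speed point: here is a rough sketch of why the CPU cycle count per loop iteration can change with CPU frequency when the data sits behind a fixed-clock bus. This is only a toy model, not a measurement. The function name `cpu_cycles_per_iter` and the per-iteration costs (4 core cycles, 2 bus cycles) are made-up assumptions; only the 200MHz bus clock comes from the reply above.

```python
def cpu_cycles_per_iter(cpu_mhz, bus_mhz, core_cycles, bus_cycles):
    """Toy model: CPU cycles spent per loop iteration.

    core_cycles: work that scales with the CPU clock (assumed value).
    bus_cycles:  memory time counted in bus clocks (assumed value).
    One bus cycle lasts cpu_mhz / bus_mhz CPU cycles, so the memory
    part looks more expensive (in CPU cycles) as the CPU clock rises.
    """
    return core_cycles + bus_cycles * (cpu_mhz / bus_mhz)


BUS_MHZ = 200.0      # L3 clock mentioned in the reply
CORE_CYCLES = 4.0    # hypothetical compute cost per iteration
BUS_CYCLES = 2.0     # hypothetical memory cost per iteration, in bus clocks

for cpu_mhz in (300.0, 600.0, 800.0, 1000.0):
    cycles = cpu_cycles_per_iter(cpu_mhz, BUS_MHZ, CORE_CYCLES, BUS_CYCLES)
    # Wall-clock time per iteration in nanoseconds: cycles / MHz * 1000
    ns = cycles / cpu_mhz * 1000.0
    print(f"{cpu_mhz:6.0f} MHz: {cycles:5.1f} CPU cycles/iter, {ns:6.2f} ns/iter")
```

In this model the wall-clock time per iteration still drops as the CPU clock goes up, but the count in CPU cycles grows, which would look like "different cycles per instruction at different frequencies" if the working set is served from memory on the L3 side.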