
Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing the code for an FFT algorithm using NEON on ARM. My target is a BeagleBoard-xM, and I am running my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am operating nowhere near that frequency. Currently I am having difficulty with a basic understanding of NEON. Could anyone please help me with this?

The following are sample programs I ran. LOOP CODE:









Loop Unrolled code:





The following are the results I got at different frequencies:

The above does not make any sense. Why are there different cycles per instruction at different frequencies?




  • Note: This was originally posted on 29th March 2013 at http://forums.arm.com

    As far as the whole configuration for the DM3730 goes, I don't have any real experience with it, and I don't think you'll get a lot of help here. Maybe you should ask on TI's forums? For instance here: http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/537.aspx You could also try the BeagleBoard newsgroup: http://beagleboard.org/discuss

    From what you've said, I think it's clear at least that the block labeled "local interconnect", which runs on ARM_FCLK, isn't connected to L3. The fact that you have to set the two separate PLLs correctly proves that they're not in the same clock domain. You happen to be able to set L3 to a value that scales the way you want because you're using such low CPU clock speeds, but if you want to run the CPU at 1 GHz you won't be able to run L3 at half the CPU clock rate.

    I'm still not really sure why the performance seems to suggest your data isn't going through the L2 cache. Maybe the page tables aren't set up to allow this for the internal SRAM. That would make sense, since the SRAM is supposed to be shared, but it doesn't make sense that the data would still be cached in L1, which is what appears to be the case.

    When I mentioned L2 cache in lockdown, I was referring to this feature:

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Chdeghcb.html

    If you use L2 in lockdown you can treat it somewhat like a scratchpad memory, but it still needs to be backed by some real RAM. Anyway, since you've confirmed you aren't doing this, it isn't really important.
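
To make the clock-domain point concrete, here is a rough sketch of why a cycle count can change with CPU frequency. It assumes a simple model that is not taken from the thread: the core needs a fixed number of cycles for the instructions themselves, and each access that misses the caches waits a fixed time in nanoseconds, because the memory sits in a clock domain that does not scale with the CPU. The function name and all the numbers are hypothetical and are not measurements from the DM3730.

```python
def apparent_cpu_cycles(core_cycles, mem_accesses, mem_latency_ns, cpu_mhz):
    """Estimate CPU cycles for a loop under a simple stall model.

    Assumption (hypothetical model): memory latency is fixed in
    nanoseconds, so the number of CPU cycles spent waiting grows
    linearly with the CPU clock frequency.
    """
    # One cycle lasts 1000 / cpu_mhz nanoseconds, so a wait of
    # mem_latency_ns costs mem_latency_ns * cpu_mhz / 1000 cycles.
    stall_cycles = mem_accesses * mem_latency_ns * cpu_mhz / 1000.0
    return core_cycles + stall_cycles


# Hypothetical workload: 1000 core cycles, 100 accesses at 50 ns each.
for mhz in (100, 300, 600, 1000):
    cycles = apparent_cpu_cycles(1000, 100, 50.0, mhz)
    print(f"{mhz:5d} MHz: {cycles:8.0f} cycles")
```

Under this model the same code appears to cost more cycles as the CPU clock goes up, even though the wall-clock time still goes down. If the measured cycle count stays constant across frequencies, the loop is probably core-bound or the memory clock is scaling together with the CPU clock.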