
Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing the code for an FFT algorithm using NEON on ARM. My target is a Beagle Board xM, and I am running my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am operating nowhere near that frequency. I am currently having difficulty with the basic understanding of NEON. Could anyone please help me with this?

The following are the sample programs I ran.

Loop code:









Loop Unrolled code:





The following are the results I got at different frequencies:
The above does not make any sense. Why are there different cycles per instruction at different frequencies?




  • Note: This was originally posted on 25th March 2013 at http://forums.arm.com

    You say you're running without an operating system. Did you enable the L2 cache? If not, that would explain why you're getting a different number of cycles/MHz for the 2048 loads case. One load instruction accesses 32 bytes, so 1024 loads access 32 KB, which fits in the CPU's L1 cache, but 2048 loads access 64 KB, which doesn't. If the data has to come from main memory, the number of cycles will vary with clock speed because the memory bus runs on a different clock.
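The cache arithmetic in this reply can be checked with a quick sketch. The 32 KB L1 data cache size is taken from the reply's own reasoning, and the 100 ns memory latency is a purely hypothetical figure, chosen only to show how a fixed memory delay turns into a different number of CPU cycles at each core clock.

```python
BYTES_PER_LOAD = 32               # one vld1 of four D registers (4 x 8 bytes)
L1_DATA_CACHE_BYTES = 32 * 1024   # L1 size implied by the reply (assumption)


def working_set_bytes(num_loads):
    """Bytes touched by num_loads consecutive 32-byte loads."""
    return num_loads * BYTES_PER_LOAD


def fits_in_l1(num_loads):
    """True if the whole buffer can sit in the L1 data cache at once."""
    return working_set_bytes(num_loads) <= L1_DATA_CACHE_BYTES


def stall_cycles(latency_ns, cpu_mhz):
    """CPU cycles that elapse while waiting latency_ns for memory."""
    return latency_ns * cpu_mhz / 1000.0


if __name__ == "__main__":
    for loads in (1024, 2048):
        print(loads, "loads:", working_set_bytes(loads) // 1024, "KB,",
              "fits in L1" if fits_in_l1(loads) else "exceeds L1")
    # Hypothetical 100 ns miss latency, seen from two core clocks.
    for mhz in (500, 1000):
        print(mhz, "MHz:", stall_cycles(100, mhz), "cycles per miss")
```

Under these assumptions the same wall-clock delay costs twice as many cycles at 1000 MHz as at 500 MHz, which is why cycle counts for the cache-missing case would not stay constant across frequencies.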

    I don't know how much work your unrolled version does, so I can't really say what it should take in an ideal case. It's possible that you're exhausting the L1 instruction cache. The cycles for the non-unrolled version look to be what I would expect. Each load instruction takes four cycles. Three cycles are for the load itself: NEON can access the L1 cache 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned, it needs to perform three 128-bit loads to access 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back. Since the memory base register access and update happen in different cycles, you get a stall when you do this.
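The cycle accounting described above can be written down as a small model. This is only a sketch of the reasoning in this reply, not a cycle-accurate description of the NEON pipeline, and the function and parameter names are made up for illustration.

```python
L1_ACCESS_BITS = 128  # width of one NEON access to the L1 cache, per the reply


def cycles_per_load(load_bits=256, aligned=False, back_to_back_writeback=True):
    """Estimate cycles for one vld1, following the reply's reasoning."""
    accesses = -(-load_bits // L1_ACCESS_BITS)  # ceiling division
    if not aligned:
        # A possibly unaligned load can straddle one extra 128-bit chunk.
        accesses += 1
    # One stall when the next load reuses a base register that was just updated.
    stall = 1 if back_to_back_writeback else 0
    return accesses + stall


if __name__ == "__main__":
    print(cycles_per_load())  # unaligned, same base register: 4
    print(cycles_per_load(aligned=True, back_to_back_writeback=False))  # 2
```

With these assumptions the model gives 4 cycles per load for a loop that auto-increments one unaligned pointer, and 2 cycles when the address is declared aligned and consecutive loads use different base registers.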

    You can save two cycles if you use 128-bit aligned addresses and use two different pointers, like this:


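    mov const64, #64                ; post-increment stride: 64 bytes per load
    add pSrcB, pSrc, #32            ; second pointer starts 32 bytes after the first

    fftCopyLoop
    vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ], const64      ; bytes 0-31 of this iteration
    vld1.f32 { d4, d5, d6, d7 }, [ pSrcB, :128 ], const64     ; bytes 32-63
    vld1.f32 { d8, d9, d10, d11 }, [ pSrc, :128 ], const64    ; bytes 64-95
    vld1.f32 { d12, d13, d14, d15 }, [ pSrcB, :128 ], const64 ; bytes 96-127
    subs count, #1                  ; one iteration = 128 bytes loaded
    bgt fftCopyLoop                 ; repeat while count > 0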


    This assumes that pSrc is 128-bit aligned in the first place. Unrolling the code more than this is pointless. Even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads.
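To see that the two-pointer loop still walks the buffer contiguously, the address stream can be traced in a few lines. This sketch mirrors the register names from the loop above; the Python function itself is illustrative and assumes the buffer starts at a 128-bit aligned base.

```python
def two_pointer_offsets(iterations, base=0):
    """Return the byte address of every 32-byte load, in issue order."""
    p_src = base            # pSrc
    p_src_b = base + 32     # add pSrcB, pSrc, #32
    stride = 64             # mov const64, #64
    offsets = []
    for _ in range(iterations):
        for _ in range(2):  # the loop body contains two pSrc/pSrcB pairs
            offsets.append(p_src)
            p_src += stride      # post-increment by const64
            offsets.append(p_src_b)
            p_src_b += stride    # post-increment by const64
    return offsets


if __name__ == "__main__":
    trace = two_pointer_offsets(2)
    print(trace)
    # Every load address should be a multiple of 16 bytes (128 bits).
    print(all(addr % 16 == 0 for addr in trace))
```

Each iteration covers 128 contiguous bytes, so under this reading `count` would be the buffer size in bytes divided by 128.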

    Not that any of this really matters since the code doesn't do anything, but maybe it'll help you in the future.

    (Also, in the future, could you please copy and paste code inside code tags instead of pasting a screenshot of it, so we don't have to re-type it manually?)