
Arm NEON: unable to understand the cycle counts

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing the code for an FFT algorithm using NEON on ARM. My target is a Beagle Board xM, and I am running my program directly on the board, without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near operating at that frequency. I am currently having difficulty with the basics of NEON. Could anyone please help me with this?

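The following are the sample programs I ran.

Loop code:

(code was posted as a screenshot and is not reproduced here)

Loop unrolled code:

(code was posted as a screenshot and is not reproduced here)
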
The following are the results I got at different frequencies:

(results not preserved)

The above does not make any sense: why do I get different cycles per instruction at different frequencies?




  • Note: This was originally posted on 26th March 2013 at http://forums.arm.com

    Thanks, "Exophase", for your time.
    I am running directly on the board (Beagle Board), using the ARM compiler suite.
    The L1 and L2 caches are enabled in the processor.
    The code below is from ARM itself.

    Final assembly:


      PRESERVE8


      AREA  newarea,CODE,READONLY,ALIGN=4

      ARM

      EXPORT ne10_radix4_butterfly_float_neon_ls

    ne10_radix4_butterfly_float_neon_ls FUNCTION


            PUSH    {r4-r12,lr}    ;push r12 to keep stack 8 bytes aligned
            VPUSH   {d8-d15}


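    pSrc        RN  R0                  ; source buffer pointer
    count       RN  R6                  ; loop iteration counter
    const256    RN  R7                  ; post-increment stride (holds 64 bytes, despite the name)
    pSrcB       RN  R8                  ; second pointer, 32 bytes ahead of pSrc
            MOV        count,#256       ; 256 iterations x 4 loads = 1024 loads
            MOV        const256,#64
            ADD        pSrcB, pSrc, #32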

    fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 }, [ pSrc@256 ] ,const256
        vld1.f32 { d4, d5, d6, d7 }, [ pSrcB@256 ] ,const256
        vld1.f32 { d8, d9, d10, d11 }, [ pSrc@256 ] ,const256
        vld1.f32 { d12, d13, d14, d15 },[ pSrcB@256 ] ,const256
            SUBS        count,#1
            BGT   fftCopyLoop

            ; Return from function
            VPOP    {d8-d15}
            POP  {r4-r12,pc}
            ENDP
            END




    When I added alignment specifiers to the instructions in my original FFT code, I was able to reduce the cycle count by around 10,000 cycles. Thank you very much!
    If you do not mind, may I post my FFT code here to see whether there is any scope for further performance improvement?






    You say you're running without an operating system - did you enable L2 cache? If not, that would explain why you're getting a different number of cycles/MHz for the 2048 loads case. Each load instruction accesses 32 bytes, so 1024 loads access 32KB, which fits in the CPU's L1 cache. But 2048 loads access 64KB, which doesn't. If the data has to come from main memory then the number of cycles will vary with clock speed, because the memory bus is on a different clock.
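The arithmetic in that paragraph can be written down as a minimal C sketch. The 32 KB L1 size and 32 bytes per load are the figures quoted in the reply; the function names and the example latency value are hypothetical and are there only for illustration, not as measurements from the board.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Figures taken from the reply above, treated here as assumptions:
 * 32 KB of L1 data cache and 32 bytes per four-register vld1. */
enum { L1_DATA_BYTES = 32 * 1024, BYTES_PER_LOAD = 32 };

/* Total bytes touched by `loads` sequential 32-byte loads. */
static size_t working_set_bytes(size_t loads)
{
    return loads * BYTES_PER_LOAD;
}

/* True when the whole buffer is no larger than L1.
 * This ignores conflict misses and anything else sharing the cache. */
static bool fits_in_l1(size_t loads)
{
    return working_set_bytes(loads) <= L1_DATA_BYTES;
}

/* CPU cycles spent waiting on one memory access whose latency is fixed
 * in nanoseconds (hypothetical value supplied by the caller).
 * cycles = ns * MHz / 1000, rounded up. */
static unsigned long miss_penalty_cycles(unsigned long latency_ns,
                                         unsigned long cpu_mhz)
{
    assert(cpu_mhz > 0);
    return (latency_ns * cpu_mhz + 999) / 1000;
}
```

With these assumptions, `fits_in_l1(1024)` is true and `fits_in_l1(2048)` is false. A miss with a fixed latency in nanoseconds costs more CPU cycles at a higher core clock, which is one way the cycle count can change with frequency even though the code is identical.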

    I don't know how much work your unrolled version does, so I can't really say what it should take in an ideal case. It's possible that you're exhausting the L1 instruction cache. The cycles for the non-unrolled version look to be what I would expect. Each load instruction takes four cycles. There are 3 cycles for the load itself: NEON can access the L1 cache at 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned it needs to perform three 128-bit loads to access 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back. Since the memory base register access and update happen in different cycles, you get a stall when you do this.
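The cycle accounting described above can be sketched as a toy model in C. It only encodes the reasoning in this reply (2 or 3 accesses of 128 bits per 256-bit load, plus one stall cycle when the base register is reused back to back). It is not a Cortex-A8 pipeline simulator, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Estimated cycles for one 256-bit vld1 under the reply's reasoning. */
static unsigned load_cycles(bool aligned_128, bool base_reused_back_to_back)
{
    /* 256 bits is two 128-bit L1 accesses when the address is known to be
     * aligned; a third access is needed when it may straddle a boundary. */
    unsigned transfer = aligned_128 ? 2u : 3u;

    /* One stall cycle when the next load needs the base register that
     * this load is still updating through post-increment. */
    unsigned stall = base_reused_back_to_back ? 1u : 0u;

    return transfer + stall;
}

/* Estimated cycles spent in loads for a whole loop.
 * Loop overhead (subs/bgt) and cache misses are deliberately ignored. */
static unsigned long loop_load_cycles(unsigned long iterations,
                                      unsigned loads_per_iteration,
                                      bool aligned_128,
                                      bool base_reused_back_to_back)
{
    return iterations * loads_per_iteration *
           load_cycles(aligned_128, base_reused_back_to_back);
}
```

Under this model, an unaligned load with a reused base register costs 4 cycles, and an aligned load with alternating base registers costs 2, so a loop of 256 iterations with 4 loads each drops from an estimated 4096 load cycles to 2048. Real measurements will differ once loop overhead and cache behaviour are included.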

    You can save two cycles if you use 128-bit aligned addresses and use two different pointers, like this:


    mov const64, #64
    add pSrcB, pSrc, #32

    fftCopyLoop
    vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ], const64
    vld1.f32 { d4, d5, d6, d7 }, [ pSrcB, :128 ], const64
    vld1.f32 { d8, d9, d10, d11 }, [ pSrc, :128 ], const64
    vld1.f32 { d12, d13, d14, d15 }, [ pSrcB, :128 ], const64
    subs count, #1
    bgt fftCopyLoop


    This assumes that pSrc is 128-bit aligned in the first place. Unrolling the code more than this is pointless. Even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads.
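Since the `:128` qualifier relies on the pointer really being 128-bit aligned, it can help to check that on the C side before calling the assembly routine. The following is a minimal sketch, assuming a hosted C11 toolchain that provides `aligned_alloc`; the helper names are hypothetical, and on a bare-metal setup without `aligned_alloc` a statically aligned buffer would serve the same purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* True when p is aligned to `bits` bits.
 * `bits` must be a power of two and at least 8. */
static bool is_aligned_bits(const void *p, unsigned bits)
{
    uintptr_t bytes = bits / 8u;
    assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
    return ((uintptr_t)p & (bytes - 1)) == 0;
}

/* Allocate room for `count` floats on a 16-byte (128-bit) boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. The caller releases it with free(). */
static float *alloc_floats_128(size_t count)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 15u) & ~(size_t)15u;
    return aligned_alloc(16, rounded);
}
```

A buffer from `alloc_floats_128` passes `is_aligned_bits(buf, 128)`, and so does `buf` plus 32 bytes, which is the offset used for `pSrcB` in the loop above. An offset of 4 bytes would fail the check.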

    Not that any of this really matters since the code doesn't do anything, but maybe it'll help you in the future.

    (also, in the future could you please copy and paste code inside code tags instead of pasting a screenshot of your code, so we don't have to manually re-type it)