Arm NEON not able to understand the cycles?

Note: This was originally posted on 25th March 2013 at http://forums.arm.com

I am working on optimizing the code for an FFT algorithm using NEON on ARM. My target is a Beagle Board xM, and I am running my program directly on the board without any operating system. The board is supposed to run at 1 GHz, but I am nowhere near operating at that frequency. I am currently having difficulty with the basic understanding of NEON. Could anyone please help me with this?

The following are the sample programs I ran.

Loop code:

Loop Unrolled code:

The following are the results I ran for different frequencies

The above does not make any sense. Why are there different cycles per instruction at different frequencies?

wolfrum aurum over 12 years ago

Note: This was originally posted on 26th March 2013 at http://forums.arm.com

Thanks "Exophase" for your time.
I am running directly on the board (Beagle Board). I am using the ARM compiler suite.
The L1 and L2 caches are enabled in the processor.
The code below is from ARM itself.

Final assembly:

            PRESERVE8
            AREA    newarea,CODE,READONLY,ALIGN=4
            ARM
            EXPORT  ne10_radix4_butterfly_float_neon_ls
    ne10_radix4_butterfly_float_neon_ls FUNCTION
            PUSH    {r4-r12,lr}     ;push r12 to keep stack 8 bytes aligned
            VPUSH   {d8-d15}
    pSrc        RN R0
    count       RN R6
    const256    RN R7
    pSrcB       RN R8
            MOV     count,#256
            MOV     const256,#64
            ADD     pSrcB, pSrc, #32
    fftCopyLoop
            vld1.f32 { d0, d1, d2, d3 }, [ pSrc@256 ] ,const256
            vld1.f32 { d4, d5, d6, d7 }, [ pSrcB@256 ] ,const256
            vld1.f32 { d8, d9, d10, d11 }, [ pSrc@256 ] ,const256
            vld1.f32 { d12, d13, d14, d15 },[ pSrcB@256 ] ,const256
            SUBS    count,#1
            BGT     fftCopyLoop
            ;/* Return From Function*/
            VPOP    {d8-d15}
            POP     {r4-r12,pc}
            ENDP
            END

When I used alignment specifiers on the instructions in my original FFT code, I was able to reduce the cycle count by around 10000 cycles. Thank you very much!!
If you do not mind, may I post my FFT code here to see if there is any scope for further improvement in performance?

You say you're running without an operating system - did you enable L2 cache? If not, that would explain why you're getting a different number of cycles/MHz for the 2048 loads case. Each of the load instructions accesses 32 bytes, so 1024 loads access 32 KB, which fits in the CPU's L1 cache. But 2048 loads access 64 KB, which doesn't. If the data has to come from main memory then the number of cycles will vary with clock speed, because the memory bus is on a different clock.
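The working-set arithmetic above can be checked with a few lines of Python. This is only a sketch: the 32 KB L1 data cache size is the figure implied by the reply, and the function names are made up for illustration.

```python
L1_DATA_CACHE_BYTES = 32 * 1024  # L1 size implied by the reply (assumption for this sketch)
BYTES_PER_LOAD = 32              # one vld1.f32 {d0-d3} moves four 64-bit D registers


def working_set_bytes(num_loads, bytes_per_load=BYTES_PER_LOAD):
    """Total bytes touched by num_loads sequential, non-overlapping loads."""
    return num_loads * bytes_per_load


def fits_in_l1(num_loads, cache_bytes=L1_DATA_CACHE_BYTES):
    """True if the data touched is no larger than the L1 data cache."""
    return working_set_bytes(num_loads) <= cache_bytes


if __name__ == "__main__":
    for loads in (1024, 2048):
        size_kb = working_set_bytes(loads) // 1024
        print(f"{loads} loads touch {size_kb} KB, fits in L1: {fits_in_l1(loads)}")
```

With 1024 loads the data just fills the assumed L1 size; with 2048 loads it is twice as large, so the second case depends on L2 or main memory.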

I don't know how much work your unrolled version does, so I can't really say what it should be taking in an ideal case. It's possible that you're exhausting the L1 instruction cache. The cycles for the non-unrolled version look to be what I would expect. Each load instruction takes four cycles. There are 3 cycles for the load itself, because NEON can access the L1 cache at 128 bits per cycle at an aligned address, and since your address isn't specified to be aligned it needs to perform three 128-bit loads to access 256 bits at what could be an arbitrarily unaligned address. The fourth cycle is because you issue auto-increment instructions back to back. Since the memory base register access and update happen in different cycles, you get a stall when you do this.
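The per-load cycle reasoning can be written down as a small Python model. It is a simplified sketch of the explanation in this reply, not a Cortex-A8 pipeline simulation; the function name, its parameters, and the decision to ignore loop overhead (`subs`/`bgt`) are assumptions made for illustration.

```python
def cycles_per_load(access_bits=256, port_bits=128,
                    aligned=False, back_to_back_writeback=True):
    """Estimate cycles for one vld1 under the simplified model in the reply."""
    # Number of port-wide accesses needed to cover the data (ceiling division).
    accesses = -(-access_bits // port_bits)
    if not aligned:
        # An address not known to be aligned may straddle one extra 128-bit chunk.
        accesses += 1
    # One stall cycle when the next load reuses the base register just updated.
    stall = 1 if back_to_back_writeback else 0
    return accesses + stall


def loop_cycles(num_loads, **kwargs):
    """Cycles spent in the loads alone, ignoring loop overhead."""
    return num_loads * cycles_per_load(**kwargs)


if __name__ == "__main__":
    print("unaligned, same pointer back to back:", cycles_per_load())
    print("1024 such loads:", loop_cycles(1024))
```

Under this model, declaring the address aligned removes the extra access and alternating between two base registers removes the stall, which is where a saving of two cycles per load would come from.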

You can save two cycles if you use 128-bit aligned addresses and use two different pointers, like this:

        mov     const64, #64
        add     pSrcB, pSrc, #32
    fftCopyLoop
        vld1.f32 { d0, d1, d2, d3 }, [ pSrc, :128 ], const64
        vld1.f32 { d4, d5, d6, d7 }, [ pSrcB, :128 ], const64
        vld1.f32 { d8, d9, d10, d11 }, [ pSrc, :128 ], const64
        vld1.f32 { d12, d13, d14, d15 }, [ pSrcB, :128 ], const64
        subs    count, #1
        bgt     fftCopyLoop

This assumes that pSrc is 128-bit aligned in the first place. Unrolling the code more than this is pointless. Even this much is probably unnecessary. And it's only worth using separate load pointers if you can't fit other instructions in between the loads.
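The alignment assumption can be sanity-checked with plain address arithmetic. The Python sketch below mirrors the loop's pointer updates (a second pointer at +32 bytes, and a 64-byte post-increment after every load); the helper names and the example addresses are hypothetical.

```python
def is_aligned(address, bits=128):
    """True if address is a multiple of bits/8 bytes."""
    return address % (bits // 8) == 0


def pointers_stay_aligned(p_src, iterations=256, stride=64, offset=32, bits=128):
    """Walk both pointers as the loop does and check alignment at every load."""
    p_src_b = p_src + offset
    for _ in range(iterations):
        for _ in range(2):  # each pointer is used by two loads per iteration
            if not (is_aligned(p_src, bits) and is_aligned(p_src_b, bits)):
                return False
            p_src += stride      # post-increment of pSrc by const64
            p_src_b += stride    # post-increment of pSrcB by const64
    return True


if __name__ == "__main__":
    print(pointers_stay_aligned(0x80000000))  # hypothetical 16-byte-aligned buffer
    print(pointers_stay_aligned(0x80000004))  # hypothetical misaligned buffer
```

Because both the 32-byte offset and the 64-byte stride are multiples of 16 bytes, only the starting address of the buffer decides whether every load meets the `:128` hint.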

Not that any of this really matters since the code doesn't do anything, but maybe it'll help you in the future.

(also, in the future could you please copy and paste code inside code tags instead of pasting a screenshot of your code, so we don't have to manually re-type it)