Instruction cycle timings for LDR1, STR1 on Cortex-A8

Note: This was originally posted on 13th March 2012 at http://forums.arm.com

Hi,

I am new to the BeagleBoard and the Cortex-A8. I have written a small piece of code to understand the instruction cycle timings of the Cortex-A8. The code runs in a loop of 10,000 iterations, and it behaves differently with different instruction combinations. Below is my code with the cycle timings.

When I keep only the loads, the code takes 10 cycles instead of 6. There should be no cache issues in this case, since the same memory location is loaded into every 'q' register.

VLD1.S32  {rq0},[r11@128] 
VLD1.S32  {rq1},[r11@128]
      
VLD1.S32  {rq3},[r11@128]
VLD1.S32  {rq5},[r11@128]
 
VLD1.S32  {rq6},[r11@128]
VLD1.S32  {rq7},[r11@128]

The code below takes 13 cycles instead of 6. The only difference is that the code above has loads and this code has stores.

VST1.S32  {rq0},[r12@128] 
VST1.S32  {rq1},[r12@128]
      
VST1.S32  {rq3},[r12@128]
VST1.S32  {rq5},[r12@128]
 
VST1.S32  {rq6},[r12@128]
VST1.S32  {rq7},[r12@128]

A combination of loads and stores works fine: it takes 12 cycles, which is expected. But when I change the register r12 to r11 in the store operations, the code takes 32 cycles. Why does using r11 in both the loads and the stores cost more cycles?

VLD1.S32  {rq0},[r11@128] 
VLD1.S32  {rq1},[r11@128]
      
VLD1.S32  {rq3},[r11@128]
VLD1.S32  {rq5},[r11@128]
 
VLD1.S32  {rq6},[r11@128]
VLD1.S32  {rq7},[r11@128]

VST1.S32  {rq0},[r12@128] 
VST1.S32  {rq1},[r12@128]
      
VST1.S32  {rq3},[r12@128]
VST1.S32  {rq5},[r12@128]
 
VST1.S32  {rq6},[r12@128]
VST1.S32  {rq7},[r12@128]

Why is this happening? Why do different combinations behave differently? Can anyone please explain?

Thanks in advance,
Chandrakala
  • Note: This was originally posted on 15th March 2012 at http://forums.arm.com

    I can't give a very thorough answer, but in my tests I've found that the Cortex-A8 core can't sustain one 128-bit load or one 128-bit store per cycle, every cycle, over a long run. There could be a bottleneck somewhere, such as the load queue for NEON or the write buffer for stores. If you mix them with other instructions you can sustain one per cycle for a while, which is what you seem to be achieving with the mix of loads and stores. I can't give exact numbers, but my rough heuristic is to have at least one non-load/store instruction for every two loads/stores, although you're probably better off with more than that.
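To set the numbers from the question next to this observation, the short Python sketch below converts each reported per-iteration cycle count into cycles per instruction. The instruction counts and cycle counts are the ones reported in the question; the helper name `cycles_per_instruction` and the case labels are illustrative only, and the arithmetic says nothing about where in the pipeline the extra cycles go.

```python
# (case label, NEON load/store instructions per iteration, reported cycles per iteration)
# All figures are taken from the question above; the labels are only descriptive.
measurements = [
    ("loads only", 6, 10),
    ("stores only", 6, 13),
    ("loads + stores, different addresses", 12, 12),
    ("loads + stores, same address", 12, 32),
]


def cycles_per_instruction(instruction_count, cycles):
    """Average cycles spent per instruction in one loop iteration."""
    return cycles / instruction_count


for label, count, cycles in measurements:
    print(f"{label}: {cycles_per_instruction(count, cycles):.2f} cycles/instruction")
```

Only the mixed case with separate addresses averages one cycle per instruction; the loads-only and stores-only runs come out at roughly 1.67 and 2.17, and the same-address case at about 2.67.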

    I've also heard that you can get slower throughput using the same register for loads. You might be able to do better if you try different registers.

    As for your second case, the problem is that NEON (unlike the integer core) doesn't have store-to-load forwarding. Since you're performing these operations in a loop, the loads from the address in r11 come immediately after the stores to that same address. Before the load can happen, the write buffer that holds the new value has to be emptied, causing a big stall. You can also see this sort of stall if you perform an unaligned load immediately after an unaligned store, where the load address immediately follows the stored data. This is because unaligned accesses get split into multiple aligned accesses, which in this case partially overlap.
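The overlap described in the last paragraph can be illustrated with a small Python model. This is a sketch under stated assumptions, not a description of the actual Cortex-A8 hazard logic: it assumes accesses are split on 16-byte (128-bit) boundaries, which the thread does not state, and the names `aligned_chunks` and `may_conflict` are made up for illustration.

```python
CHUNK = 16  # ASSUMPTION: accesses are split on 16-byte boundaries; not stated in the thread


def aligned_chunks(addr, size, chunk=CHUNK):
    """Base addresses of every chunk-aligned block touched by bytes [addr, addr + size)."""
    first = addr - addr % chunk
    last_byte = addr + size - 1
    last = last_byte - last_byte % chunk
    return set(range(first, last + 1, chunk))


def may_conflict(store_addr, load_addr, size=16, chunk=CHUNK):
    """True if the store and the load share at least one aligned block."""
    return bool(aligned_chunks(store_addr, size, chunk) & aligned_chunks(load_addr, size, chunk))


# Aligned store followed by an aligned load of the next 16 bytes: no shared block.
print(may_conflict(0x1000, 0x1010))
# Unaligned store at 0x1004 followed by a load of the bytes right after it (0x1014):
# the byte ranges do not overlap, but both touch the aligned block at 0x1010.
print(may_conflict(0x1004, 0x1014))
# Store and load at the same address, as in the r11 case from the question.
print(may_conflict(0x1000, 0x1000))
```

In this model only the first pair is conflict-free. That is consistent with the advice above: keep NEON stores and the loads that follow them on separate, aligned blocks where possible.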