Dual Issue and Pipeline stalls

Note: This was originally posted on 27th August 2010 at http://forums.arm.com

Hi,

I have the following situation in my current work. Could you explain whether this sequence causes any pipeline stalls, or whether anything in it prevents dual issue?

    LDRB     R2,[R0,#2]
    LDRB     R1,[R0,#1]
    VLD1.32  {D4,D5},[R7]
    VLD1.32  {D22,D23},[R8]
    VADD.I32 Q2,Q2,Q11
    STR      R2,[R5]
    STR      R1,[R5]


In another instance, the following sequence occurs:

    VADD.I32 Q2,Q11,Q2
    VADD.I32 Q2,Q2,Q9
    VSHR.S32 Q11,Q2,#1
    VCGT.S32 Q2,Q11,Q0
    VLD1.32  {D22,D23},[R4]
    VBSL     Q2,Q12,Q13
    VADD.I32 Q2,Q11,Q2
    VADD.I32 Q2,Q2,Q10
    VST1.32  {D4,D5},[R1]

Is there any chance of pipeline stalls of any type in this set of instructions on a Cortex-A8 processor?

Thanks
Dave.
  • Note: This was originally posted on 27th August 2010 at http://forums.arm.com

    I'm not too sure about the second block of code, but in the first case the critical thing to note is that the ARM and NEON units only have a single load/store pipe, so back-to-back loads and stores cannot be dual-issued in the same cycle. For Cortex-A8, the following should be faster:

        LDRB     R2,[R0,#2]
        VLD1.32  {D4,D5},[R7]
        LDRB     R1,[R0,#1]
        VLD1.32  {D22,D23},[R8]
        STR      R2,[R5]
        VADD.I32 Q2,Q2,Q11
        STR      R1,[R5]


    NEON instructions are forwarded to the NEON unit through the integer pipelines of the ARM unit, so in the cycle in which a NEON instruction is forwarded (it counts as an ARM integer instruction) you can also issue a load to the ARM load/store unit. That said, I have not tried the code, but I'm pretty sure that is right.
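
The pairing argument in the reply can be checked with a toy model. The sketch below is an illustration only, not a Cortex-A8 simulator: it assumes in-order issue of at most two instructions per cycle, treats every NEON instruction as an ordinary integer instruction at issue time (as the reply describes), and applies the single rule that two adjacent ARM loads/stores cannot issue together. It ignores data dependencies, result latencies, and the NEON instruction queue, and the helper names `is_arm_load_store` and `count_issue_cycles` are hypothetical.

```python
def is_arm_load_store(instr):
    """Return True for ARM-side loads/stores (LDR*, STR*) in this toy model."""
    mnemonic = instr.split()[0].upper()
    return mnemonic.startswith(("LDR", "STR"))


def count_issue_cycles(instrs):
    """Count issue cycles under the simplified pairing rule.

    Assumption: two adjacent instructions issue in the same cycle unless
    both are ARM loads/stores. Dependencies and latencies are ignored.
    """
    cycles = 0
    i = 0
    while i < len(instrs):
        can_pair = (
            i + 1 < len(instrs)
            and not (is_arm_load_store(instrs[i])
                     and is_arm_load_store(instrs[i + 1]))
        )
        i += 2 if can_pair else 1
        cycles += 1
    return cycles


# Sequence from the question.
original = [
    "LDRB R2,[R0,#2]",
    "LDRB R1,[R0,#1]",
    "VLD1.32 {D4,D5},[R7]",
    "VLD1.32 {D22,D23},[R8]",
    "VADD.I32 Q2,Q2,Q11",
    "STR R2,[R5]",
    "STR R1,[R5]",
]

# Sequence suggested in the reply.
reordered = [
    "LDRB R2,[R0,#2]",
    "VLD1.32 {D4,D5},[R7]",
    "LDRB R1,[R0,#1]",
    "VLD1.32 {D22,D23},[R8]",
    "STR R2,[R5]",
    "VADD.I32 Q2,Q2,Q11",
    "STR R1,[R5]",
]

print("original :", count_issue_cycles(original), "issue cycles")
print("reordered:", count_issue_cycles(reordered), "issue cycles")
```

Under these assumptions the original order needs 5 issue cycles and the reordered one needs 4, because each ARM load or store now sits next to a NEON instruction it can pair with. Real timings on hardware depend on factors the model leaves out, so the counts only show the direction of the effect.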