
AXI transactions when ldm/stm instructions are used on Cortex-A9

Note: This was originally posted on 15th September 2011 at http://forums.arm.com

Hi, ARM experts,

I used ldm/stm instructions to copy (read, then write) memory with the caches disabled. The code is listed below:
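#include <stdint.h>

/* Copies len bytes from src to dst, 32 bytes (eight registers) per pass.
 * len is a byte count and should be a multiple of 32. */
int memcpy_8_regs(uint32_t *dst, const uint32_t *src, uint32_t len)
{
    while (len >= 32) {
        /* One asm statement with operands and clobbers, so the compiler
         * knows which registers and pointers are modified. */
        asm volatile ("ldmia   %1!, {r3-r10}\n\t"
                      "stmia   %0!, {r3-r10}\n\t"
                      : "+r" (dst), "+r" (src)
                      :
                      : "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "memory");
        len -= 32;
    }

    return 0;
}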
The code was run in a simulation environment, where the behaviour of the ARM core can be observed.
The ARM core issued an AXI transaction of 16 (bits) * 4 (length) for the write access, but a 32 (bits) * 8 (length) transaction for the read access. This is rather weird. Why isn't a 32 (bits) * 8 (length) transaction issued for the write as well, to get maximum throughput? Is this a limitation of the Cortex-A9 architecture?

BR
Jerry


  • Note: This was originally posted on 16th September 2011 at http://forums.arm.com

    The destination and source addresses are both 32-byte aligned. Actually, for the read access the ARM core issues a 32*8 AXI transaction, while for the write it issues a 32*2 transaction.

    BR
    Jerry
  • Note: This was originally posted on 17th September 2011 at http://forums.arm.com

    What type is this memory? Normal, Device, or Strongly Ordered? Is your MMU enabled at all?
    [Jerry] The MMU and caches are both disabled.
            BTW, if the MMU and caches were enabled, how would the memory type affect the AXI transactions?
    Do you mean a sequence of four 32*2 transactions? Otherwise only two registers will be stored.
    [Jerry] Yes, a sequence of four 32*2 transactions, actually.
  • Note: This was originally posted on 15th September 2011 at http://forums.arm.com


    The code was run in a simulation environment, where the behaviour of the ARM core can be observed.
    The ARM core issued an AXI transaction of 16 (bits) * 4 (length) for the write access, but a 32 (bits) * 8 (length) transaction for the read access. This is rather weird. Why isn't a 32 (bits) * 8 (length) transaction issued for the write as well, to get maximum throughput? Is this a limitation of the Cortex-A9 architecture?

    Are you sure about the write access? That would be the content of only two registers. Depending on the memory type of the destination pointer, I'd have expected 64 (bits) * 4 (length) due to store merging. As far as the reads are concerned, you may also get 64 * 4 bursts if the addresses are 8-byte aligned.

    Regards
    Marcus
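As a quick sanity check of the arithmetic in this reply, the sketch below computes how many bytes (and how many 32-bit registers) a burst of a given shape carries. The helper names `burst_bytes` and `burst_registers` are made up for illustration and are not part of any ARM tool or library.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved by one AXI burst: (beat width in bits / 8) * number of beats. */
static uint32_t burst_bytes(uint32_t beat_bits, uint32_t beats)
{
    assert(beat_bits % 8u == 0u);   /* beat widths are whole bytes */
    return (beat_bits / 8u) * beats;
}

/* Number of 32-bit core registers whose contents fit in one burst. */
static uint32_t burst_registers(uint32_t beat_bits, uint32_t beats)
{
    return burst_bytes(beat_bits, beats) / 4u;
}
```

With these helpers, a 16-bit * 4-beat burst works out to 8 bytes (two registers), while 32 * 8 and 64 * 4 both carry the 32 bytes that an eight-register ldm/stm moves.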
  • Note: This was originally posted on 16th September 2011 at http://forums.arm.com


    The destination and source addresses are both 32-byte aligned. Actually, for the read access the ARM core issues a 32*8 AXI transaction

    What type is this memory? Normal, Device, or Strongly Ordered? Is your MMU enabled at all?

    , while for the write it issues a 32*2 transaction.

    Do you mean a sequence of four 32*2 transactions? Otherwise only two registers will be stored.

    Regards
    Marcus
  • Note: This was originally posted on 19th September 2011 at http://forums.arm.com


    The MMU and caches are both disabled.

    OK, so there can't be any memory access optimization. All memory is treated as Strongly Ordered (sort of). This means that all accesses made to load or store registers must be 32-bit accesses. According to section 8.1.2 of the Cortex-A9 TRM, 32-bit write transactions can only come in bursts of length 1 or 2. For reads, burst lengths of up to 16 beats are supported. That explains the observed behaviour.

    Kindly
    Marcus
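To relate this to what shows up in a waveform viewer, here is a minimal sketch, assuming the limits quoted above (32-bit beats, at most 2 beats per write burst). It counts the bursts needed for one eight-register transfer and derives the AxLEN/AxSIZE field values, which in AXI encode the number of beats minus one and log2 of the bytes per beat. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* AxSIZE encodes bytes per beat as log2: 1 -> 0, 2 -> 1, 4 -> 2, 8 -> 3. */
static uint32_t axsize_for(uint32_t beat_bytes)
{
    uint32_t size = 0;
    /* beat_bytes must be a non-zero power of two */
    assert(beat_bytes != 0u && (beat_bytes & (beat_bytes - 1u)) == 0u);
    while ((1u << size) < beat_bytes)
        size++;
    return size;
}

/* AxLEN encodes the burst length as beats minus one. */
static uint32_t axlen_for(uint32_t beats)
{
    assert(beats >= 1u);
    return beats - 1u;
}

/* Bursts needed to move total_bytes when one burst carries at most
 * max_beats beats of beat_bytes each (rounded up). */
static uint32_t bursts_needed(uint32_t total_bytes, uint32_t beat_bytes,
                              uint32_t max_beats)
{
    uint32_t per_burst = beat_bytes * max_beats;
    assert(per_burst != 0u);
    return (total_bytes + per_burst - 1u) / per_burst;
}
```

Under those assumptions `bursts_needed(32, 4, 2)` gives 4, matching the four 32*2 write transactions Jerry observed (AWLEN = 1, AWSIZE = 2 each), while the 32-byte read fits in a single 8-beat burst (ARLEN = 7, ARSIZE = 2).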