
LDM to LDP Reason

Hi all,

The LDM and STM instructions are no longer supported in ARMv8 (A64); LDP and STP are used instead.

What is the key difference between them, and why are registers now loaded and stored in pairs?

  • I think the differences are:

    (1) LDP and STP always transfer exactly 2 registers (destinations for LDP, sources for STP).

    (2) LDP and STP can use a base register plus a signed offset as the address. LDM and STM don't take an offset.

    (3) LDP and STP have pre-index and post-index forms, such as "ldp w3, w1, [x0, #4]!" (pre-index) and "ldp w3, w1, [x0], #4" (post-index).

    LDM and STM usually need ADD instructions to compute the right address first. So I think the reason for using LDP and STP may be that they are more flexible.

    I think LDM with 2 destination registers is slower than LDP (as it usually needs ADD instructions to compute the address). And LDM with 4 destination registers may not be faster than 2 LDPs (I'm not sure).
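
To make the three addressing forms in points (2) and (3) concrete, here is a minimal sketch in C that models what a 32-bit LDP does to its two destination registers and its base register. The `ldp_state` struct and the `ldp_offset`, `ldp_pre` and `ldp_post` helpers are hypothetical names invented for illustration, and memory is modelled as a plain byte array rather than real addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the state touched by a 32-bit LDP. */
typedef struct {
    uint32_t rt;    /* first destination register  (e.g. w3) */
    uint32_t rt2;   /* second destination register (e.g. w1) */
    uint64_t base;  /* base register (e.g. x0), modelled as a byte offset into mem */
} ldp_state;

/* Load two consecutive 32-bit words starting at byte offset addr. */
static void load_pair(const uint8_t *mem, uint64_t addr, ldp_state *s) {
    memcpy(&s->rt,  mem + addr,     sizeof s->rt);
    memcpy(&s->rt2, mem + addr + 4, sizeof s->rt2);
}

/* The 32-bit LDP immediate is a multiple of 4 in the range -256..252. */
static void check_imm(int64_t imm) {
    assert(imm % 4 == 0 && imm >= -256 && imm <= 252);
}

/* ldp w3, w1, [x0, #imm]  : signed offset, base register unchanged */
static void ldp_offset(const uint8_t *mem, ldp_state *s, int64_t imm) {
    check_imm(imm);
    load_pair(mem, s->base + (uint64_t)imm, s);
}

/* ldp w3, w1, [x0, #imm]! : pre-index, base updated before the access */
static void ldp_pre(const uint8_t *mem, ldp_state *s, int64_t imm) {
    check_imm(imm);
    s->base += (uint64_t)imm;
    load_pair(mem, s->base, s);
}

/* ldp w3, w1, [x0], #imm  : post-index, access at old base, then update */
static void ldp_post(const uint8_t *mem, ldp_state *s, int64_t imm) {
    check_imm(imm);
    load_pair(mem, s->base, s);
    s->base += (uint64_t)imm;
}
```

With the base at 0 and an immediate of 4, `ldp_offset` and `ldp_pre` both read the words at byte offsets 4 and 8, but only `ldp_pre` leaves the base at 4. `ldp_post` reads at the current base and advances it afterwards. In all three cases the address arithmetic is folded into the load itself, so no separate ADD is needed.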

  • Hi Haoliu,

    Thanks for the reply. Can you point me to the white paper or architecture manual that has the exact information?

  • ARMv8 ISA overview :

    The LDM, STM, PUSH and POP instructions do not exist in A64, however bulk transfers can be constructed using the LDP and STP instructions which load and store a pair of independent registers from consecutive memory locations, and which support unaligned addresses when accessing normal memory. The LDNP and STNP instructions additionally provide a “streaming” or “non-temporal” hint that the data does not need to be retained in caches. The PRFM (prefetch memory) instructions also include hints for “streaming” or “non-temporal” accesses, and allow targeting of a prefetch to a specific cache level.
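
As a rough illustration of how a bulk transfer "can be constructed" from pair loads, the following C sketch models restoring a run of 64-bit registers from consecutive memory. The function name `bulk_load_pairs` and the array-based register file are assumptions made for this example. Each loop iteration stands in for one post-indexed LDP such as "ldp xA, xB, [base], #16", and an odd leftover register is assumed to be handled by a single LDR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: fill `count` 64-bit registers from consecutive
 * memory using pair loads only. Returns how many LDP-equivalent steps
 * were needed. */
static size_t bulk_load_pairs(uint64_t *regs, const uint64_t *mem, size_t count) {
    assert(regs != NULL && mem != NULL);

    size_t ldp_issued = 0;
    size_t i = 0;

    /* One iteration models one "ldp xA, xB, [base], #16". */
    for (; i + 1 < count; i += 2) {
        regs[i]     = mem[i];
        regs[i + 1] = mem[i + 1];
        ldp_issued++;
    }

    /* An odd register left over is modelled as a single LDR. */
    if (i < count) {
        regs[i] = mem[i];
    }
    return ldp_issued;
}
```

Under this model, what would be one four-register LDM in A32 becomes two pair loads, and a five-register transfer becomes two pair loads plus one single load.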

  • Hi Pravinchanm,

    You will find the definition of LDP/STP behaviour in the ARMv8 Architecture Reference Manual which you can get from infocenter.

    One of the main reasons for moving from LDM/STM to LDP/STP is that the load/store multiple instructions make for a lot of extra work in complex superscalar out-of-order pipelines. When an instruction can touch 15 or 16 registers, that is a lot of dependencies which the pipeline has to manage when trying to schedule that instruction against others which are in the stream at the same time. Restricting the number of registers to two simplifies this considerably and makes for a more efficient pipeline design.
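
The dependency argument can be made concrete with a small sketch. An A32 LDM encodes its targets as a 16-bit register list, so the number of registers it writes varies from instruction to instruction, whereas an LDP always writes two data registers. The helper names below are hypothetical, and the model is simplified: it ignores corner cases such as the base register also appearing in the list.

```c
#include <assert.h>
#include <stdint.h>

/* Registers written by an A32 LDM: one per set bit in the 16-bit
 * register list, plus the base register when writeback is requested. */
static unsigned ldm_regs_written(uint16_t register_list, int writeback) {
    /* An empty register list is not a meaningful LDM. */
    assert(register_list != 0);

    unsigned n = 0;
    /* Clear the lowest set bit on each pass and count the passes. */
    for (uint16_t bits = register_list; bits != 0; bits &= (uint16_t)(bits - 1)) {
        n++;
    }
    return n + (writeback ? 1u : 0u);
}

/* Registers written by an A64 LDP: always two data registers, plus
 * the base register for the pre-index and post-index forms. */
static unsigned ldp_regs_written(int writeback) {
    return 2u + (writeback ? 1u : 0u);
}
```

In this model the scheduler has to allow for anywhere between 1 and 17 register writes for an LDM, depending on the encoding, while an LDP is bounded at 3.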

    From a software point of view, LDP/STP have addressing modes which are much the same as other load/store instructions so they are easier for programmers and compilers to use when generating code. And, as chinatiger has pointed out above, they support unaligned accesses.

    Hope this helps.

    Chris

  • Thank you, Chris. I'm glad to have such a detailed explanation.