Additional memory cycles during LDR with unaligned address

When LDR makes an unaligned memory access on a Cortex-M4 (ARMv7), I would expect two memory read cycles to be needed to retrieve the data, whether the address is off by 1, 2, or 3 bytes. However, I measured the execution time of a long sequence of LDRs and found that the "off by 2" case is about 25% faster than either the "off by 1" or the "off by 3" case. Does this mean that the retrieval requires two read cycles for off by 2, but three for off by 1 and off by 3?
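
The hypothesis in the question can be made concrete with a small host-side model. This is only a sketch under a stated assumption: that an unaligned word access is split greedily into naturally aligned byte, halfword, and word pieces. The function name `aligned_pieces_for_word` is hypothetical, and the model is not taken from ARM documentation or confirmed by the measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (an assumption, not documented hardware behavior):
 * split a 4-byte access starting at `addr` into naturally aligned pieces.
 * At each step, take the largest piece (4, 2, or 1 bytes) that is aligned
 * at the current address and does not exceed the bytes still needed. */
static unsigned aligned_pieces_for_word(uint32_t addr)
{
    unsigned remaining = 4; /* bytes of the word not yet covered */
    unsigned pieces = 0;    /* number of aligned pieces used so far */

    while (remaining > 0) {
        unsigned size = 4;
        /* Shrink the piece until it fits and is naturally aligned. */
        while (size > remaining || (addr % size) != 0) {
            size /= 2;
        }
        assert(size >= 1);
        addr += size;
        remaining -= size;
        pieces++;
    }
    return pieces;
}
```

Under this model, an address that is off by 2 splits into two halfword pieces, while an address that is off by 1 or 3 splits into three pieces (byte, halfword, byte). That matches the direction of the measured difference, but it does not by itself prove what the bus interface does.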

Dan

  • FYI, here are the results from one of my lab assignments in a course I teach on ARM assembly using the STM32F429I-Discovery board. They show the number of clock cycles required to copy 1000 bytes of data using four different functions (OffBy0, OffBy1, OffBy2 and OffBy3). Each function is designed to copy as much of the 1000 bytes as possible using repeated word-aligned LDR/STR pairs, and all remaining bytes using LDRB/STRB. (OffBy0 uses only LDR/STR instructions, while the other functions have some LDRB/STRB instructions both before and after the LDR/STR sequence.)

    There are three test cases (from left to right) corresponding to the starting address of the source and destination data regions. Each case compares the execution time of OffBy0 (yellow) to that of a function (green) that is optimized for data that is not word-aligned. In the left-most case, these regions start at an address that is one greater than a word-aligned address. In the middle case, the source and destination regions start at an address that is two greater than a word-aligned address. And in the right-most case, the source and destination regions start at an address that is three greater. 

    The yellow bars show that the unaligned access time is lower when the unaligned regions start at an address that is two greater than a word-aligned address.
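
The copy strategy described above (byte copies until the address is word-aligned, word copies for the bulk, byte copies for the tail) can be sketched in C. This is an illustration, not the lab's assembly code: `copy_stats` and `copy_aligned_middle` are hypothetical names, and the sketch assumes the source and destination share the same offset from a word boundary, as in the three test cases.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: how many copy steps of each kind were used. */
struct copy_stats {
    size_t byte_copies; /* steps corresponding to LDRB/STRB pairs */
    size_t word_copies; /* steps corresponding to LDR/STR pairs */
};

/* Copy n bytes, using word copies only at word-aligned addresses.
 * Assumption: dst and src have the same offset from a word boundary,
 * so aligning src also aligns dst. */
static struct copy_stats copy_aligned_middle(uint8_t *dst, const uint8_t *src,
                                             size_t n)
{
    struct copy_stats stats = {0, 0};

    /* Leading bytes: copy one at a time until src is word-aligned. */
    while (n > 0 && ((uintptr_t)src % 4) != 0) {
        *dst++ = *src++;
        n--;
        stats.byte_copies++;
    }

    /* Bulk: aligned word copies (memcpy keeps this portable C). */
    while (n >= 4) {
        uint32_t word;
        memcpy(&word, src, sizeof word);
        memcpy(dst, &word, sizeof word);
        src += 4;
        dst += 4;
        n -= 4;
        stats.word_copies++;
    }

    /* Trailing bytes: whatever is left after the last full word. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
        stats.byte_copies++;
    }
    return stats;
}
```

For 1000 bytes starting one, two, or three bytes past a word boundary, this sketch performs 249 word copies and 4 byte copies in each case; starting on a word boundary it performs 250 word copies and no byte copies.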
