Hi
The Cortex-M3 should support unaligned data access to save RAM space without loss of performance. Is there a possibility to enable this feature for an entire project, or do I have to use the __packed attribute for each data structure?
This is simply not true. Of course there will be a loss of performance compared to aligned access to the same type. Each unaligned access causes multiple bus accesses which will prevent other components from using the bus.
That said, the performance loss you actually see will be far smaller than it would be without hardware-supported unaligned access.
> Is there a possibility to enable this feature for an entire project, or do I have to use the __packed attribute for each data structure?
You will have to use __packed. I am not aware of any (documented) compiler option that would implicitly apply the __packed attribute.
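As a rough illustration of what packing a single structure buys, here is a minimal sketch. It uses the GCC/Clang spelling __attribute__((packed)) so that it builds on a desktop compiler; that spelling is an assumption on my part, and with the ARM/Keil compiler the __packed qualifier discussed here would be written in front of "struct" instead. The struct and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naturally aligned layout: the compiler inserts padding after 'tag'
   so that 'value' starts on a 4-byte boundary (typically 12 bytes total). */
struct sample_aligned {
    uint8_t  tag;
    uint32_t value;
    uint16_t count;
};

/* Packed layout: no padding, so 'value' and 'count' end up unaligned.
   __attribute__((packed)) is the GCC/Clang spelling (assumption); the
   ARM/Keil compiler would use the __packed qualifier instead. */
struct __attribute__((packed)) sample_packed {
    uint8_t  tag;    /* offset 0 */
    uint32_t value;  /* offset 1, unaligned */
    uint16_t count;  /* offset 5, unaligned */
};

/* Number of bytes saved per instance by packing (hypothetical helper). */
size_t bytes_saved(void)
{
    assert(sizeof(struct sample_aligned) >= sizeof(struct sample_packed));
    return sizeof(struct sample_aligned) - sizeof(struct sample_packed);
}
```

Every access to 'value' or 'count' in the packed variant is an unaligned access, which is where the extra bus transactions discussed in this thread come from.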
Regards Marcus http://www.doulos.com/arm/
> to save RAM space without loss of performance
But this is not possible (to my knowledge - I never worked with your processor). Have a look at the generated assembly for an unaligned access compared to the same code that uses aligned access.
> Have a look at the generated assembly for an unaligned access compared to the same code that uses aligned access.
It will be the same. But the number of bus transactions increases.
ARM processors v6 and higher support this behavior.
I did some tests with aligned and unaligned data access using the __packed keyword. As Marcus wrote, there is no extra code for unaligned data access on the Cortex-M3 core (compared to ARM7). In other words, the Cortex-M3 supports unaligned data access.
I also found the following statement in the Cortex-M3 Technical Reference Manual.
1.2.4 Bus Matrix ... The bus matrix also controls the following: - Unaligned accesses. The bus matrix converts unaligned processor accesses into aligned accesses. ...
The internal RAM is connected to the D-code bus, which is 32 bits wide. There has to be more than one bus transaction on the D-code bus to get unaligned data. This should be done independently of the CPU by the bus matrix. Right? Where do I lose performance? Do I only lose performance if I have to read/write a block of unaligned data?
> The internal RAM is connected to the D-code bus, which is 32 bits wide.
Perhaps. Depending on your device, of course.
> There has to be more than one bus transaction on the D-code bus to get unaligned data. This should be done independently of the CPU by the bus matrix. Right?
Yes.
> Where do I lose performance? Do I only lose performance if I have to read/write a block of unaligned data?
Any back-to-back sequence involving at least one unaligned access will not be quite as fast as in the aligned case. A concurrent DMA transfer accessing the same port of the bus matrix might see hiccups where there would not have been any with aligned transfers.
As I said: Whether you should worry or not, I don't know.
I was merely arguing against the statement that unaligned access comes without performance loss. That is not true.
Please be sure to update your other thread(s) on other forum(s) so that everybody gets to benefit, and people on one forum don't waste time repeating what's already been said on another forum.
You should always do this as a matter of courtesy when posting the same question on multiple forums, known as "cross-posting".
Providing clickable links between the threads is generally sufficient...
(This forum automatically makes URLs clickable; on the STM forum, you have to do it manually)
The x86 has always supported unaligned accesses by having the memory interface hide the dual-access.
This has given a lot of PC programmers bad habits: they get strange "bus error" messages when they move their code from the PC to other processors and notice that it isn't OK to just typecast a void pointer into a short* or long* pointer and use it for multi-byte memory accesses.
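For illustration, here is a minimal sketch of two common ways to read a multi-byte value from an arbitrary address without casting the pointer. The function names are invented for this example; memcpy is the standard library call and makes no alignment assumption, so the compiler is free to emit whatever access sequence the target allows.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value in host byte order from a possibly unaligned
   address. Unlike *(const uint32_t *)p, this does not assume that p is
   4-byte aligned. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Read a 32-bit little-endian value one byte at a time. This works on
   any alignment and any host byte order, at the cost of four byte reads. */
uint32_t load_u32_le(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

On a core with hardware support for unaligned access, a compiler may well turn load_u32 into a single load instruction; on one without, it falls back to smaller accesses instead of faulting.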
But what you introduce when you use nonaligned data is either stall cycles or loss of memory bandwidth for other devices.
The most common stall cycle is where the processor has to wait an extra cycle for the result of a read operation. The PC processor tries to mask this with huge cache memories, which make sure that:
1) the memory interface can do "read ahead" and always read larger chunks than 1, 2 or 4 bytes. You may have 8- or 16-byte or even wider memory interfaces.
2) after the cache has been loaded with both of the addresses needed for combining a nonaligned access, no memory read (or extra stall cycle) is needed for merging the data.
But you can also have stall cycles on writes, since a processor either performs memory writes synchronously or supports a limited number of outstanding writes (waiting for the memory interface to be ready to accept one more address + write data). If the processor can't support multiple outstanding writes, then every nonaligned write will stall the processor. And a processor with delayed writes will be stalled if you do several unaligned writes back to back (unless possibly they are to a contiguous memory area, in which case the memory interface might be able to combine several unaligned writes into several aligned writes instead of performing "smaller" writes).
Another thing is that an aligned write to a memory interface of the same width is just a write. If you have an unaligned write, then the memory controller must do:
1) read of first word.
2) bit-and + bit-or of the part of the word that should be replaced.
3) write of first word.
4) read of second word.
5) bit-and + bit-or of the part of the word that should be replaced.
6) write of second word.
Some of the above steps can be combined or reordered, but it should be obvious that any extra access does cost time, unless you have a memory interface running at a higher clock speed than your processor (in the real world, it is the reverse, unless the processor is "intentionally" slowed down to 1:1).
The really big advantage (besides reduced code size) with a processor that handles read and write combining in the memory controller is that it saves on required bandwidth to supply new op-codes to process.
Thanks to all for the feedback about this issue.
The link to the cross-post on the STM32 forum is:
www.st.com/.../forums-cat-7816-23.html
Regards
> Another thing is that an aligned write to a memory interface of the same width is just a write. If you have an unaligned write, then the memory controller must do: [...]
Not with ARM processors that I am aware of. Unaligned access will be broken down into aligned accesses of smaller size.
E.g. a word access to address 0x55 will be accessing a byte at address 0x55, a half word at address 0x56 and another byte at address 0x58.
Since memory systems in ARM are required to support all access sizes, all is taken care of by byte enable. No R-M-W needed.
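To make the 0x55 example concrete, here is a small sketch that splits an access into naturally aligned byte, halfword and word transfers. The struct and function names are invented for this illustration, and the splitting rule (always take the largest aligned piece that fits) is an assumption that happens to reproduce the example above; the exact sequence a given core or bus matrix generates may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One aligned transfer on the bus (hypothetical type for this sketch). */
struct bus_access {
    uint32_t addr;
    uint32_t size;   /* 1, 2 or 4 bytes */
};

/* Split an access of 'size' bytes at 'addr' into naturally aligned
   transfers of at most 4 bytes. 'out' must have room for 'size'
   entries in the worst case. Returns the number of transfers. */
size_t split_access(uint32_t addr, uint32_t size, struct bus_access *out)
{
    size_t n = 0;

    assert(out != NULL);

    while (size > 0u) {
        uint32_t chunk = 4u;

        /* Shrink the transfer until it fits the remaining length and
           the current address is aligned to it. */
        while (chunk > size || (addr % chunk) != 0u) {
            chunk /= 2u;
        }

        out[n].addr = addr;
        out[n].size = chunk;
        n++;

        addr += chunk;
        size -= chunk;
    }
    return n;
}
```

Calling split_access(0x55, 4, buf) yields three transfers: a byte at 0x55, a halfword at 0x56 and a byte at 0x58, whereas the same call with address 0x54 yields a single word transfer.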
Yes, the ARM line of processors has this requirement for all memory interfaces. But this only goes for the interface to the core - you will not know if the physical memory supports byte or half-word accesses or if this is done by glue logic that activates the nWAIT signal or by stretching MCLK while performing a read-modify-write.
Your example shows another important thing relevant to the ARM core and unaligned accesses. Your unaligned write resulted in three writes, since the ARM can't signal a three-byte write or an unaligned two-byte write.