Hi all,
I have some questions.
Q1
If a master starts a write burst at an unaligned address, how can it know whether the slave supports unaligned transfers?
Q2
The AXI spec mentions that the AXI protocol does not require the slave to take special action based on any alignment information from the master. What does that mean?
Thanks a lot
Hi Norbert,
Not sure I agree fully with your first statement.
The AXI protocol definitely allows unaligned transactions, and it is then up to the AXI master bus interface logic how it translates what the SW is requesting into HW bus accesses.
From the SW perspective, the code might just request 16 bytes as 4 word transfers (4 x 32 bits), starting at unaligned address 0x01. So the data being accessed is in the range 0x01 to 0x10. The HW then has two options:
a) use one 5-beat, 32-bit unaligned transaction to 0x01, which accesses 0x01-0x03 (3 bytes), then 0x04-0x07 (4 bytes), 0x08-0x0B (4 bytes), 0x0C-0x0F (4 bytes) and finally 0x10 (1 byte).
b) the master's bus interface logic between the core and the bus converts this into a number of separate ALIGNED access sequences, so perhaps a 16-beat 8-bit burst (0x01-0x10), or a 3-beat 8-bit burst to 0x01 (0x01-0x03) followed by a 3-beat 32-bit burst to 0x04 (0x04-0x0F) and finally a 1-beat 8-bit burst to 0x10, or other combinations that complete the transfers requested by SW.
Option a) is the simplest for the AXI master logic, and is the lowest impact on bus bandwidth as there is only 1 transaction, but it means a slight bit more complexity at the slave. Option b) is simplest for the slave, but complicates the master and adds to bus traffic.
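To make the beat counts concrete, here is a minimal Python sketch that splits a byte range into the beats of an incrementing burst, where only the first beat may start off an AxSIZE boundary. The function name `incr_burst_beats` and its parameters are made up for illustration; this is a model of the address arithmetic only, not of any real master.

```python
def incr_burst_beats(start_addr, size_bytes, total_bytes):
    """Return a list of (first_byte, last_byte) address ranges, one per beat.

    Hypothetical helper: models an incrementing burst where the first beat
    starts at start_addr and every later beat starts on a size_bytes boundary.
    """
    beats = []
    end = start_addr + total_bytes  # exclusive end of the requested data
    addr = start_addr
    while addr < end:
        # Next address that is a multiple of the transfer size.
        next_boundary = (addr // size_bytes + 1) * size_bytes
        last = min(next_boundary, end) - 1
        beats.append((addr, last))
        addr = next_boundary
    return beats


# Option a): one unaligned 32-bit burst to 0x01 covering 16 bytes.
for first, last in incr_burst_beats(0x01, 4, 16):
    print(f"0x{first:02X}-0x{last:02X}")

# Option b), first variant: 8-bit beats are always aligned, one byte each.
print(len(incr_burst_beats(0x01, 1, 16)), "beats of 8 bits")
```

With a 32-bit transfer size this gives the 5 beats listed in option a), and with an 8-bit transfer size it gives the 16-beat burst from option b).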
However, even thinking of the slave complexity for unaligned transfers, for a write transaction the slave will see WSTRB indicating the valid bytes for each of the 5 transfers, so the unaligned start address isn't important (WSTRB=0xE for the first transfer tells it there was no data for address 0x00). And for read transactions the slave could just drive all 32 bits of the RDATA bus for the 5 transfers, so returning data for 0x00-0x13, and the master then takes the data it actually requested. So is unaligned transfer support difficult for the slave?
(Obviously if the slave contains read sensitive addresses you would need to ensure only the requested read data is returned, but for general memory slaves this isn't an issue)
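As a rough illustration of the write side, the sketch below derives the strobe for a beat and shows a memory slave that stores only the strobed byte lanes and never looks at the low address bits. It assumes a 32-bit data bus where byte lane = address mod 4; `wstrb_for_beat` and `slave_write` are hypothetical names, and the dict stands in for a simple byte-addressed memory.

```python
def wstrb_for_beat(first_byte, last_byte, bus_bytes=4):
    """Strobe mask (one bit per byte lane) for bytes first_byte..last_byte.

    Assumes both addresses fall inside the same bus-width word.
    """
    strb = 0
    for addr in range(first_byte, last_byte + 1):
        strb |= 1 << (addr % bus_bytes)
    return strb


def slave_write(mem, addr, wdata, wstrb, bus_bytes=4):
    """Toy memory slave: ignore address alignment, write only strobed lanes."""
    base = (addr // bus_bytes) * bus_bytes  # low address bits are dropped
    for lane in range(bus_bytes):
        if wstrb & (1 << lane):
            mem[base + lane] = (wdata >> (8 * lane)) & 0xFF


mem = {}
# First beat of the unaligned burst: bytes 0x01-0x03, so lane 0 is not written.
strb = wstrb_for_beat(0x01, 0x03)
slave_write(mem, 0x01, 0xAABBCC00, strb)
print(hex(strb), mem)
```

The slave code is the same whether the burst started aligned or not, which is the point being made above: the strobes carry all the information a plain memory slave needs.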
I agree with comment 1) in your post, an unaligned 4 word 32-bit burst would need 5 bus transfers to implement on the bus, and it would be much more efficient if the SW did consider alignment and so only requiring 4 bus transfers, but that isn't mandatory, so the system has to support possible unalignment using one of the methods I described.
For your comment 2), device registers often need to be updated in one transfer so that all bits of a control register update at the same time, so it is important that the SW does consider alignment of the target for that memory type. I'm less aware of issues with the shared memory structure argument, but I suppose if the semaphores only track accesses to aligned structures, you would either have to do unaligned accesses within that monitored semaphore range (so in my above 16-byte example using an aligned 32-byte semaphore region), or just keep it simple and again the SW considers the target of the transaction before deciding if unaligned transfers make functional sense.
So both HW approaches are valid. The AXI support for unaligned transfers makes the bus interface on the master quite simple, in that it just does what the SW requests. Alternatively, the HW bus interface can convert what the SW requests into "simpler" AXI transfers, but do these simpler accesses simplify anything for the slave, given the cost of master complexity and increased bus traffic?
I've just realised that this question was posted in the Cortex-A profile forum, whereas I have been answering based purely from the AXI protocol perspective, so I can't comment on what the Cortex-A cores will do, but hopefully they go for the simpler option of just using the AXI protocol support for unaligned transactions.
Hi Colin,
Thank you for your response.
Right, the AXI protocol allows unaligned transactions, but this does not mean that it is worth using this feature. It significantly complicates the whole system, because all memories and interfaces must support unaligned transactions, and where memories do support such accesses, performance is much lower than for regular accesses.
Maybe you are right, but there is an option 3: unaligned accesses handled by the core with its L1 cache. In this case the system interfaces and memories (like DDR and higher-level memories) do not need to support, or even see, unaligned accesses.
In more detail:
Unaligned access to cacheable write-back memory: a load or store that misses in the L1 cache always results in an aligned read to the system, meaning the address (AxADDR) is a multiple of the size of the data being transferred (AxSIZE). In your example, the 32-bit unaligned transaction will be fetched by the L1 cache as 32 bytes (the cache line size) from address 0.
Unaligned access to device memory is not supported.
Unaligned access to normal non-cacheable memory can be issued as aligned accesses by the core or by the L1 cache. For example, a 32-bit unaligned read can be sent to the system as one aligned 64-bit transaction (the better option) or as 2 x 32-bit transactions, and the core/L1 cache decides which part of the data to use. For writes, use an aligned write plus write strobes.
This solution gives better bus/memory performance and utilization than supporting unaligned accesses in system memories and interfaces, and it also does not complicate the system.
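A small Python sketch of the address arithmetic behind option 3, assuming a 32-byte cache line as in the example above. The helper names `cache_line_base` and `aligned_cover` are invented for illustration and do not describe what any particular core does.

```python
def cache_line_base(addr, line_bytes=32):
    """Aligned address of the cache line containing addr (line_bytes is a power of 2)."""
    return addr & ~(line_bytes - 1)


def aligned_cover(addr, nbytes, size_bytes):
    """Aligned start addresses of the size_bytes transfers that cover
    the byte range [addr, addr + nbytes)."""
    first = (addr // size_bytes) * size_bytes
    last = ((addr + nbytes - 1) // size_bytes) * size_bytes
    return list(range(first, last + size_bytes, size_bytes))


# Cacheable case: an unaligned access at 0x01 becomes a line fill from 0x00.
print(hex(cache_line_base(0x01)))

# Non-cacheable case: a 32-bit read at 0x01 (bytes 0x01-0x04).
print(aligned_cover(0x01, 4, 8))  # one aligned 64-bit transfer
print(aligned_cover(0x01, 4, 4))  # two aligned 32-bit transfers
```

Note that the single 64-bit transfer only works while the data stays inside one 8-byte window; for example `aligned_cover(0x06, 4, 8)` returns two addresses.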
I do agree with virtually everything you write, an L1 cache can simplify things to some extent, and will usually improve master performance.
But where I still disagree is that unaligned access support complicates designs. As I mentioned previously, slaves (unless they are read sensitive) can ignore any AxADDR/AxSIZE misalignment. For write transactions the slave will see WSTRB indicate which byte lanes contain valid data (and this always needs checking because transfers can be sparse as well as unaligned), and for read transactions the slave can just return ARSIZE-aligned data and let the master use the byte lanes it requested.
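For the read side, a minimal sketch of that division of labour: the slave returns whole aligned words, and the master keeps only the byte lanes it asked for. It assumes a 32-bit bus and a memory that is not read sensitive; `slave_read_word` and `master_read` are hypothetical names.

```python
def slave_read_word(mem, addr, bus_bytes=4):
    """Toy memory slave: return the full aligned word containing addr."""
    base = (addr // bus_bytes) * bus_bytes
    return [mem.get(base + lane, 0) for lane in range(bus_bytes)]


def master_read(mem, start_addr, nbytes, bus_bytes=4):
    """Toy master: issue one beat per bus word and keep only requested lanes."""
    out = []
    end = start_addr + nbytes  # exclusive end of the requested data
    addr = start_addr
    while addr < end:
        word = slave_read_word(mem, addr, bus_bytes)
        base = (addr // bus_bytes) * bus_bytes
        for lane in range(bus_bytes):
            if addr <= base + lane < end:
                out.append(word[lane])
        addr = base + bus_bytes  # later beats start on a word boundary
    return bytes(out)


# Memory where each byte holds its own address, covering 0x00-0x13.
mem = {a: a for a in range(0x14)}
data = master_read(mem, 0x01, 16)
print(data.hex())
```

The slave drives bytes 0x00-0x13 across the 5 beats, and the master ends up with exactly the 16 bytes from 0x01 to 0x10.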
So the best option of all is for the SW to think of alignment when accessing data, but where the SW is not so tightly controlled you can use an L1 to improve performance and avoid unaligned accesses, and/or you support unaligned accesses when they are required, knowing that these don't complicate slave designs (unless your slave is also not supporting WSTRB), especially for non-cacheable transactions which should bypass the L1.
I'm not saying you are wrong, it's just that an L1 cache is not the only solution. As your Cortex-A master probably has an L1, it will reduce how often unaligned transactions to the memory system are needed, but they are not a complication for slaves to support when they are needed.
Hi Colin,
Thanks for the reply.
The address starts from 0x01 instead of 0x00, so why is option B an ALIGNED access sequence?
Thanks!
Hi Tom,
In option B I was describing either a 16-beat 8-bit burst (so 0x1 is 8-bit aligned), or a series of different width transactions starting with a 3-beat 8-bit burst (so again 0x1 is 8-bit aligned).
These would be aligned alternatives for performing the original example transfer of 4x4 bytes to address 0x1, which the HW might otherwise simply translate into a 5-beat unaligned 32-bit sequence.
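Put another way, alignment is always relative to the transfer size: an address is aligned when it is a multiple of the number of bytes per beat. A quick Python check of the option B sequence, using a made-up `is_aligned` helper and writing each transaction as (start address, bytes per beat, number of beats):

```python
def is_aligned(addr, size_bytes):
    """True when addr is a multiple of the transfer size in bytes."""
    return addr % size_bytes == 0


# Option B, second variant, as (start address, bytes per beat, beats).
option_b = [(0x01, 1, 3), (0x04, 4, 3), (0x10, 1, 1)]

for addr, size, beats in option_b:
    print(f"0x{addr:02X} size={size} aligned={is_aligned(addr, size)}")

# The three transactions together still move the original 16 bytes.
print(sum(size * beats for _, size, beats in option_b))

# The same start address with 32-bit beats is the unaligned case (option a).
print(is_aligned(0x01, 4))
```

So 0x01 is aligned for an 8-bit transfer but unaligned for a 32-bit one.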