Hi there,
I am the author of the open source Orbuculum tools for SWO data parsing on Cortex-M targets. I am currently expanding those tools by implementing 1, 2 & 4-bit parallel TRACEDATA capture from Cortex-M3/M4 CPUs using a small FPGA connected to the TRACE port running in continuous mode. There is plenty of documentation about parsing the data once it's reached the host machine but very little about the precise details of the format of the data bits out of the port.
By trial and error I think I am now collecting these DDR data correctly but I want to verify/validate what I have done. I am not certain of the exact detail of the re-sync (7F FF FF FF) and pacing (7F FF) messages, nor of the potential for bit-shifting of data in relation to the port (e.g. is b0 always on TRACEDATA0? From experimentation it would appear to have the potential to be shifted an arbitrary number of bits). Is there any documentation or, even better, Verilog examples, of collecting the data from this port? I haven't found anything so far, but there's a _lot_ of documentation in the world of Arm, so I could easily have missed something.
Thanks in advance
DAVE
Hi Joseph,
Thanks for the reply...it was starting to feel a bit tumbleweed-y around here!
I am aware of that doc, and indeed Orbuculum already decodes TPIU frames over both SWO and parallel trace using the info in it, but unfortunately it doesn't tell me how the data are presented at the TRACEDATA pins. From experimentation I have observed that it is only shifted by the number of bits in the port (which, I guess, isn't such a surprise given how a DDR stream is generally created) but I would like to find that information written down somewhere to be certain that arbitrary shifts aren't possible.
To provide a little more information, assuming an eight bit byte with bits 76543210 is being presented to a four bit port with pins DCBA then I have observed that bits 3210 can first appear on the high or low edge of TRACECLK, aligned to the width of the port (i.e. D<--3, C<--2, B<--1, A<--0 on _either_ edge) but I have never observed D<--5, C<--4, B<--3, A<--2, for example. Similarly, for a two bit port, the alignment always seems to be B<--1 and A<--0 but, again, starting on either TRACECLK edge....for completeness, A<--0 is a given for a one bit port of course!
I am just trying to verify that this is indeed the case, that this behaviour is correct, and that the trace output can never be shifted relative to the width of the port, 'cos that lets me remove a fair amount of logic related to finding the sync pattern - I don't need to be searching for bit shifts that can never occur!
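To make the alignment question concrete, here is a minimal host-side sketch in Python (not the Orbuculum Verilog) of what "aligned to the port width" implies for a 4-bit port: only two candidate phases need to be tried, one per TRACECLK edge. The names pack_nibbles and find_phase are hypothetical, the low nibble is assumed to arrive first as described above, and the sync bytes are written in the order used in this thread; the on-wire byte order may differ, so the pattern is a parameter.

```python
# Sketch only: models a 4-bit DDR capture as a list of 4-bit samples,
# one sample per TRACECLK edge. Assumes bits 3..0 of each byte arrive
# before bits 7..4, as observed in the post above.

SYNC = bytes([0x7F, 0xFF, 0xFF, 0xFF])  # order as written in this thread (assumption)


def pack_nibbles(samples, phase):
    """Join pairs of 4-bit samples into bytes, starting at index `phase` (0 or 1)."""
    out = bytearray()
    for i in range(phase, len(samples) - 1, 2):
        low = samples[i] & 0xF
        high = samples[i + 1] & 0xF
        out.append(low | (high << 4))
    return bytes(out)


def find_phase(samples, sync=SYNC):
    """Return the phase (0 or 1) at which `sync` appears, or None if it never does."""
    for phase in (0, 1):
        if sync in pack_nibbles(samples, phase):
            return phase
    return None
```

If arbitrary bit shifts were possible the search would need eight candidates per byte rather than two, which is the extra logic the port-width alignment makes unnecessary.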
Regards
I see. I will check internally to see if one of the engineers will be able to provide more information.
Hi Dave,
The clock edge variation you mention is not completely obvious - I expect that once the port becomes active, the edge alignment will remain static (as the formatter is running continuously). You're correct that we present the bytes of data aligned with the port width (i.e. the select is a simple mux rather than a fifo). The trace architecture doesn't make this guarantee, but it's a safe assumption for the 4 bit trace port that is typically used with Cortex-M devices.
One detail which you may not have observed yet is that the synchronisation between frames (which will be 7F FF FF FF in the normal case) can be cut short and presented as just 7F FF if the trace source is very active. You should expect to see 7F FF used as padding within the frame, but in unusual cases you might observe new data which is synchronised with the end of the frame (so the expected synchronisation can be missing).
Sean
Hi Sean,
Thanks for the speedy reply. Indeed the 'phase' does seem to remain constant once the communication is established (although I constantly check for syncs, so if it changed I would re-adjust automatically). Given the limitations of the FPGA I'm using I can accept that it's constrained to the 4-bit port for the time being :-) I suspect the phase might change depending on the low power handling of a particular CPU, but for now the only way I've been able to get it to change is by power cycling the target.
I do see 7f ff within the data stream as 'pace' data both inter- and intra-frame. These I filter at the lowest level, which makes the data rate up to the fifo much more manageable...but I think your reply implies one other thing that I hadn't realised - I had assumed that multiple 16-octet frames could be strung together with no intervening sync (7f ff ff ff), but is that not the case, and will there _always_ be a sync before a frame? If that is indeed the case then I can improve the robustness of the flow a little by enforcing that requirement.
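As an illustration of the low-level pace filtering described above, here is a rough Python sketch rather than the actual FPGA logic. It assumes the byte stream is already aligned to 16-bit boundaries, that the marker halfwords never occur as payload, and that the patterns appear in the byte order written in this thread (on the wire the order may be reversed, hence the parameters). The name strip_pacing is hypothetical.

```python
# Sketch only: drop 7F FF pacing halfwords, keep full syncs and payload.
# The stream is walked two bytes at a time; a trailing odd byte is ignored.

HALF_SYNC = bytes([0x7F, 0xFF])              # pacing marker, order as written here
FULL_SYNC = bytes([0x7F, 0xFF, 0xFF, 0xFF])  # full sync, order as written here


def strip_pacing(stream, half_sync=HALF_SYNC, full_sync=FULL_SYNC):
    """Return `stream` with pacing halfwords removed and full syncs preserved."""
    out = bytearray()
    i = 0
    while i + 1 < len(stream):
        if stream[i:i + 4] == full_sync:
            # Test the full sync first, since it begins with the pacing pattern.
            out += full_sync
            i += 4
        elif stream[i:i + 2] == half_sync:
            i += 2  # pacing only, nothing to forward
        else:
            out += stream[i:i + 2]
            i += 2
    return bytes(out)
```

Keeping the full syncs in the output lets a later stage re-establish frame boundaries, while the pacing halfwords never reach the fifo.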
It all seems to be working quite well at the moment, and it makes me smile to have these kinds of capabilities on a $22 dev board, but I'm just trying to make sure the assumptions I've made are valid before I start making the stuff too openly available to others.
Your assumption is correct. The 16 byte frames _can_ run back to back, but this depends on the timing of input data to the TPIU. After byte ~12 has been output, one more data byte is sufficient to close the frame (depending on the current ID state), and I think there is a 'rounding up' which prefers a full frame (since we may be forced to use ID 0 for the NULL source). If we're at byte 12 and get more than 3 more bytes of input, the subsequent frame starts with no delay (so best case protocol overhead is 1/16). Typical trace packets are single byte, so I'd expect roughly 50/50 chance of seeing back to back frames, or maybe less if the clock ratio is 1:1. In some cases, only a multi-byte packet in the right place would trigger this.
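One way to read this for a host-side framer is that a sync between frames is allowed but must not be required. The Python sketch below illustrates that idea under stated assumptions: pacing halfwords have already been removed, the sync is in the byte order written in this thread (a parameter, since the wire order may differ), and a frame never begins with the sync pattern. The names split_frames and BEST_CASE_OVERHEAD are hypothetical.

```python
# Sketch only: split a pace-filtered stream into 16-byte frames.
# Frames may follow a sync or run back to back with no sync between them.

FRAME_LEN = 16
FULL_SYNC = bytes([0x7F, 0xFF, 0xFF, 0xFF])  # order as written in this thread

# One byte in every 16 is protocol overhead in the best (back-to-back) case.
BEST_CASE_OVERHEAD = 1 / FRAME_LEN  # 0.0625


def split_frames(stream, full_sync=FULL_SYNC):
    """Return the list of complete frames found after the first sync."""
    start = stream.find(full_sync)
    if start < 0:
        return []  # never synchronised, so no frame boundary is known
    frames = []
    i = start
    while i < len(stream):
        if stream.startswith(full_sync, i):
            i += len(full_sync)  # consume a sync wherever one appears
        elif i + FRAME_LEN <= len(stream):
            frames.append(stream[i:i + FRAME_LEN])
            i += FRAME_LEN
        else:
            break  # incomplete trailing frame, wait for more data
    return frames
```

Enforcing a sync before every frame would reject the back-to-back case described above, so this sketch only uses syncs to confirm or re-establish alignment.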
Sounds like you're about to make trace much more accessible to developers, which is nice.
Sean,
Thanks for this. Very helpful indeed. When the thing is solid enough for primetime I'll post back here so anyone following can pick it up. In the meantime the development status is available at https://github.com/mubes/orbuculum .... that's the link to the generic decode package (which also supports SWO over various physical interfaces), and the trace stuff is in the orbtrace directory under there. It's changing pretty much daily at the moment though....no laughing at the Verilog, I'm an embedded guy, not a PL one.