Programming ARM's System Trace Macrocell

June 7, 2016

In this blog, the second in a series, we explore the programmers' model for the ARM System Trace Macrocell. A previous blog covered basic concepts of the STM architecture and implementation. Example code is provided, which is minimally targeted at the Juno ARM Development Platform.

STM Programmers’ Model
- Memory Map
- Configuration
Which Stimulus Port?
Effective use of the STM
Next

STM Programmers’ Model

Memory Map

The STM Architecture defines a memory map that is split into two regions. The first is a configuration interface (4KiB in size), which contains all the registers used to configure the behavior of the STM and provides access to the Basic Stimulus Ports, if implemented.

A second region of memory contains the Extended Stimulus Ports and can be up to 16MiB in size. How this is represented in the system memory map depends on the design of the SoC: all Masters (CPUs and devices) may access the same address, or each Master may access a dedicated and independent address.

All registers in the STM Architecture are defined as being located at an offset relative to the base address of their constituent region. On the Juno SoC, the base address of the configuration (or "APB") interface is 0x2010_0000 and the base address of the Extended Stimulus (or "AXI") region is 0x2800_0000, with this address being common to all Masters.
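As a minimal sketch of the "base plus offset" scheme described above, the two Juno base addresses and the region sizes can be captured as constants. The macro names and the helper stm_apb_reg() are hypothetical (they are not part of the example code), and the 4-byte alignment check assumes 32-bit configuration registers.

```c
#include <assert.h>
#include <stdint.h>

/* Base addresses quoted in the text for the Juno SoC. */
#define JUNO_STM_APB_BASE  UINT64_C(0x20100000)  /* configuration ("APB") interface */
#define JUNO_STM_AXI_BASE  UINT64_C(0x28000000)  /* Extended Stimulus ("AXI") region */

/* Region sizes from the memory map description. */
#define STM_APB_SIZE       UINT64_C(0x1000)      /* 4KiB */
#define STM_AXI_MAX_SIZE   UINT64_C(0x1000000)   /* up to 16MiB */

/*
 * Hypothetical helper: absolute address of a configuration register,
 * given the region base and the register's offset within the region.
 * Assumes 32-bit (4-byte aligned) configuration registers.
 */
static inline uint64_t stm_apb_reg(uint64_t apb_base, uint32_t offset)
{
    assert(offset < STM_APB_SIZE);
    assert((offset & 0x3u) == 0);
    return apb_base + offset;
}
```

On a bare-metal target the result would typically be cast to a volatile pointer before use; on a hosted OS the region would first need to be mapped.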

Configuration

There are two key steps to configuring the STM via the APB interface. The first is that the STM needs to be configured with a valid Trace ID, since it outputs the instrumentation data over the CoreSight trace subsystem.

This value is exported over the ATB bus interface and is required not only for the transactions to be valid, but to discern between STM trace data and, for example, trace data from another CoreSight component such as an Embedded Trace Macrocell (ETM).

When using an external debugger (such as ARM DS-5) to collect the trace, it is possible to have the debugger set up the Trace ID as part of the connection sequence. The responsibility for this truly depends on your use case; if an external debugger is involved then it may be configuring other CoreSight components and giving them Trace IDs. You do not want the STM Trace ID and the Trace ID for another component to be the same, but you also do not want the debugger to conflict with your application STM configuration.

If you have an external debugger connected you can modify your instrumentation software to compensate; there is no harm whatsoever in having the debugger set the same trace ID as your instrumentation software.

We show an example function stmTRACEID() which performs this operation:

/*
 * stmTRACEID(stm, traceid)
 *
 * Set STM's TRACEID (which goes out over ATB bus ATBID)
 *
 * Note it is illegal per CoreSight to set the trace ID
 * to 0x00 or one of the reserved values (0x70 onwards)
 * (see IHI0029D D4.2.4 Special trace source IDs).
 *
 */
unsigned int stmTRACEID(struct STM *stm, unsigned int traceid)
{
  if ((traceid > 0x00) && (traceid < 0x70)) {
    unsigned int tcsr;

    traceid = traceid & TRACEID_MASK;

    tcsr = (stm->APB->STMTCSR & ~(TRACEID_MASK << TRACEID_SHIFT));
    stm->APB->STMTCSR = (tcsr | (traceid << TRACEID_SHIFT));

    return traceid;
  }

  return 0;
}
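To exercise the same range check and read-modify-write on a host machine, here is a minimal sketch that operates on a plain 32-bit value instead of the memory-mapped register. The helper names are hypothetical, and the field position used here (a 7-bit field starting at bit 16) is an assumption for illustration; the real TRACEID_MASK and TRACEID_SHIFT values should come from the documentation for your STM.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_TRACEID_SHIFT 16u    /* assumed field position */
#define SKETCH_TRACEID_MASK  0x7Fu  /* assumed 7-bit field */

/* Valid IDs are 0x01..0x6F: 0x00 and 0x70 onwards are not allowed. */
static inline int trace_id_is_valid(uint32_t traceid)
{
    return (traceid > 0x00u) && (traceid < 0x70u);
}

/*
 * Return a copy of 'tcsr' with the trace ID field replaced by 'traceid',
 * leaving every other bit untouched.
 */
static inline uint32_t tcsr_with_trace_id(uint32_t tcsr, uint32_t traceid)
{
    assert(trace_id_is_valid(traceid));
    tcsr &= ~((uint32_t)SKETCH_TRACEID_MASK << SKETCH_TRACEID_SHIFT);
    return tcsr | (traceid << SKETCH_TRACEID_SHIFT);
}
```

Keeping the field update as a pure function makes it easy to check that unrelated control bits survive the update before the code is run against hardware.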

The second requirement is to enable the stimulus ports in question. This is actually an optional part of STM Architecture that offers configuration registers to enable and disable the generation of trace packets when a particular stimulus port is accessed. It is possible to enable and disable stimulus ports with a certain granularity, but this will be completely dependent on the design of the instrumented code and the system it runs on. This example code enables all Extended stimulus ports such that any stimulus write to any stimulus port will generate a packet.

/*
 * Set STMSPSCR.PORTCTL to 0x0 to ensure port selection is not
 * used. STMSPSCR.PORTSEL is ignored and STMSPER and STMSPTER
 * bits apply equally to all groups of ports.
 *
 * Whether the STM has 32 or 65536 ports, they'll all be
 * enabled.
 */
stm->APB->STMSPSCR = 0x00000000;
stm->APB->STMSPER = 0xffffffff;
stm->APB->STMSPTER = 0xffffffff;

Once configured, we can then enable the STM with appropriate register access:

stm->APB->STMTCSR = (stm->APB->STMTCSR | STMTCSR_EN);

This is the bare minimum setup for an STM. There are obviously other configuration options such as Compression, Timestamping, and Synchronization that may or may not be configured dependent on the application.

Which Stimulus Port?

Each of the 65536 possible Extended Stimulus Ports maps to an STPv2 Channel. A trace decoder can then look for trace belonging to this channel to retrieve the instrumentation and differentiate it from other instrumentation sources.

The layout in memory of the stimulus ports means that for each packet, a data item is written to a particular address and offset within the STM stimulus port address space. Recall that each Extended Stimulus Port is a 256-byte region of memory. The address of the start of the stimulus port, and therefore all the registers which will generate trace for that "channel" within the AXI interface, can be calculated.

channel_address  = STM_AXI_BASE + (0x100 * channel_number)
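The calculation above can be sketched as a small C helper. The function and macro names are hypothetical; the 256-byte port size and the limit of 65536 channels come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define STM_PORT_SIZE     0x100u   /* each Extended Stimulus Port is 256 bytes */
#define STM_MAX_CHANNELS  65536u   /* architectural maximum number of ports */

/*
 * Start address of the stimulus port for 'channel', given the base
 * address of the Extended Stimulus (AXI) region.
 */
static inline uint64_t stm_channel_address(uint64_t axi_base, uint32_t channel)
{
    assert(channel < STM_MAX_CHANNELS);
    return axi_base + (uint64_t)STM_PORT_SIZE * channel;
}
```

With the Juno base address of 0x2800_0000, channel 1 starts at 0x2800_0100 and the last possible channel (65535) starts at 0x28FF_FF00, which fits within the 16MiB limit of the region. An implementation may provide fewer ports than the architectural maximum, so real code should also respect the number of ports the STM reports.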

We present code which provides two examples of access methods. The first uses logical operations to exploit the defined address decode logic within the STM Architecture, and returns a pointer which can be used to perform the memory write.

The finer points of the address decode used by the STM are documented in the STM Architecture, section 3.3. The code for stm.c:stmPortAddress() in the example code shows a method of calculating the address and offset using a flag-based API.

The second uses a C struct defining the layout of each stimulus port offset as an array. In this manner, assigning a value to a particular structure member would generate the appropriate store. Additionally, using C macros can simplify and increase readability of the actual stimulus port access.

struct stmPort {
  STM_STIM G_DMTS;
  STM_STIM G_DM;
  STM_STIM G_DTS;
  STM_STIM G_D;
  STM_NA G_reserved[16];

  STM_STIM G_FLAGTS;
  STM_STIM G_FLAG;
  STM_STIM G_TRIGTS;
  STM_STIM G_TRIG;

  STM_STIM I_DMTS;
  STM_STIM I_DM;
  STM_STIM I_DTS;
  STM_STIM I_D;
  STM_NA I_reserved[16];

  STM_STIM I_FLAGTS;
  STM_STIM I_FLAG;
  STM_STIM I_TRIGTS;
  STM_STIM I_TRIG;
};

/*
 * STM AXI Stimulus Interface
 *
 * The STM Architecture defines up to 65536 stimulus ports, all of which are
 * implemented on the STM and STM-500 from ARM, Ltd.
 */
struct stmAXI {
    /*
     * access the port array based on the limit in
     * (stmAPB->STMDEVID & 0x1ffff) so nothing we
     * can define at compile time..
     */
    struct stmPort port[0];
};

/*
 * STMn(a, p, type)
 *
 * Write an n-bit value to stimulus port p of the AXI region a, using a
 * stimulus register of a particular type (e.g. G_DMTS)
 */
#define STM8(a, p, type)  (*((volatile unsigned char *) &((a)->port[(p)].type)))
#define STM16(a, p, type) (*((volatile unsigned short *) &((a)->port[(p)].type)))
#define STM32(a, p, type) (*((volatile unsigned int *) &((a)->port[(p)].type)))
#define STM64(a, p, type) (*((volatile unsigned long long *) &((a)->port[(p)].type)))
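Because the struct approach relies on the compiler laying out the members at exactly the right offsets, it is worth checking the layout at compile time. The sketch below is a stand-alone copy of the port layout with compile-time checks. The definitions of STM_STIM and STM_NA are not shown in this article, so the 8-byte and 4-byte stand-in types here are assumptions chosen so that a port occupies 256 bytes; the checked offsets for G_DM, G_D and G_FLAGTS are the ones quoted later in this article.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

typedef uint64_t sketch_stim_t;  /* assumed: 8 bytes per stimulus register */
typedef uint32_t sketch_na_t;    /* assumed: 4 bytes per reserved element */

struct sketch_port {
    sketch_stim_t G_DMTS, G_DM, G_DTS, G_D;
    sketch_na_t   G_reserved[16];
    sketch_stim_t G_FLAGTS, G_FLAG, G_TRIGTS, G_TRIG;
    sketch_stim_t I_DMTS, I_DM, I_DTS, I_D;
    sketch_na_t   I_reserved[16];
    sketch_stim_t I_FLAGTS, I_FLAG, I_TRIGTS, I_TRIG;
};

/* Each Extended Stimulus Port is a 256-byte region. */
static_assert(sizeof(struct sketch_port) == 0x100, "port must be 256 bytes");

/* Offsets quoted in the text for the registers used to send a string. */
static_assert(offsetof(struct sketch_port, G_DM) == 0x08, "G_DM offset");
static_assert(offsetof(struct sketch_port, G_D) == 0x18, "G_D offset");
static_assert(offsetof(struct sketch_port, G_FLAGTS) == 0x60, "G_FLAGTS offset");
```

If a toolchain packs or pads the struct differently, the build fails instead of the instrumentation silently writing to the wrong stimulus register.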

We can re-create "printf debug" functionality by passing formatted strings to a function which outputs them as data over the requested STM channel.

The example function stm.c:stmSendString() outputs a string as instrumentation using macros STMn() (where n is {8,16,32,64}) which resolve to a C struct access as defined above.

/*
 * void stmSendString(stm, channel, string)
 *
 * We specifically write a byte to ensure that we get a D8 packet,
 * although that limits the function to 8-bit encodings.
 *
 * It doesn't matter what we use for the last write (if we see
 * a null character) -- G_FLAGTS has no data except the flag and
 * the timestamp, so a 32-bit access will be just fine.
 */
void stmSendString(struct STM *stm, unsigned int channel, const char *string)
{
    /*
     * Send a string to the STM extended stimulus registers.
     * The first character goes out as a D8M (Marker) packet.
     * The last character is followed by a FLAG_TS (Flag with
     * Timestamp) packet.
     *
     * This is the Annex C example from the STPv2 spec
     */
    struct stmAXI *axi = stm->AXI;
    int first = 1;

    while (*string != '\0') {
        /*
         * If the character is a linefeed, then don't output
         * it as data -- emit a FLAG_TS in its place and reset
         * our 'first' state to 1 so that the next character
         * (the start of the next line) is marked
         */
        if (*string == '\n') {
            STM32(axi, channel, G_FLAGTS) = *string++;
            first = 1;
        } else {
            /*
             * Continue to output characters -- if it's the
             * first character in a string, or just after a
             * linefeed (handled above), mark it.
             */
            if (first) {
                STM8(axi, channel, G_DM) = (*string++);
                first = 0;
            } else {
                STM8(axi, channel, G_D) = (*string++);
            }
        }
    }

    /*
     * Flag the end of the string
     *
     * Access size doesn't matter as we have no data for flag
     * packets
     */
    STM32(axi, channel, G_FLAGTS) = 0x0;
}

Effective use of the STM

Annex C of the STPv2 specification gives an example of encoding an ASCII string as a data item using the metadata functionality of the extended stimulus ports. Strings are delimited with a Marked packet at the start of the string, and each string is terminated with a FLAG_TS packet in place of a linefeed or NUL character. Within each stimulus port, the offset for one type of Marked Data packet is 0x08 (G_DM), for a (plain) Data packet it is 0x18 (G_D), and for a Flag packet with Timestamp it is 0x60 (G_FLAGTS), so we can break down sending the string into individual writes to those addresses. When we look at the trace output for a NUL-terminated string “Cambridge”, we might expect to see the following in the trace stream as a result of those writes.

D8M	D8	D8	D8	D8	D8	D8	D8	D8	FLAG_TS
C	a	m	b	r	i	d	g	e	...

This allows a trace decoder to adequately identify individual lines within text output, and additionally gives the trace decoder a method of determining when the string was output in time by way of the Timestamp. For binary data, a similar construct may be used with Marked data or Flag metadata surrounding the elements of an instrumentation message.
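To illustrate the decoder's side of this framing, here is a minimal sketch that reassembles one string from a sequence of already-decoded packets. The packet representation (struct pkt and enum pkt_type) and the function decode_string() are hypothetical and exist only for this example; a real decoder would first have to parse the STPv2 stream for the Master and Channel of interest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum pkt_type { PKT_D8, PKT_D8M, PKT_FLAG_TS };

struct pkt {
    enum pkt_type type;
    uint8_t       data;       /* payload, used by D8 and D8M */
    uint64_t      timestamp;  /* used by FLAG_TS */
};

/*
 * Reassemble the first complete string found in p[0..n-1].
 * A D8M packet starts (or restarts) a string, D8 packets extend it and
 * a FLAG_TS packet ends it. Characters that do not fit in 'out' are
 * dropped. Returns the number of packets consumed, or 0 if no complete
 * string was found.
 */
static size_t decode_string(const struct pkt *p, size_t n,
                            char *out, size_t cap, uint64_t *ts)
{
    size_t len = 0;
    int in_string = 0;

    assert(p != NULL && out != NULL && cap > 0 && ts != NULL);

    for (size_t i = 0; i < n; i++) {
        if (p[i].type == PKT_FLAG_TS) {
            if (in_string) {
                out[len] = '\0';
                *ts = p[i].timestamp;
                return i + 1;
            }
            continue;  /* flag outside a string: ignore it */
        }
        if (p[i].type == PKT_D8M) {
            len = 0;        /* marker: start a new string */
            in_string = 1;
        }
        if (in_string && len + 1 < cap)
            out[len++] = (char)p[i].data;
    }
    return 0;
}
```

Data packets seen before the first Marker are ignored, which lets the decoder resynchronize at the next string if it starts listening part-way through one.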

It might become obvious that outputting ASCII strings over a trace bus with a single packet per character is possibly not the most efficient way to use the STM. Since each data item is encapsulated in the STPv2 protocol, there is some overhead. The example string "Cambridge" sent as D8 packets and surrounded by D8M and FLAG_TS could be, rather than 9 bytes long (1 byte per character), somewhat more than 20 bytes. Packet headers are easily accounted for, but a timestamp may be quite large (up to 7 bytes, not inclusive of the FLAG_TS packet header) and may vary in size. This also does not take into account reporting of Channel and Master information. There are many ways of encoding a string within larger packet types using marker and flag 'framing' to differentiate between strings, but in the end "printf", whether over a USART or an STM interface, is simply not an efficient method of instrumentation.

In fact, in industrial applications, instrumentation is usually binary data formatted to be compact and useful, not console output. This is especially true of use cases such as network packet processing instrumentation, where the relevant data needn't be prefixed or human readable, and indeed may be far too vast for a human to spend time reading -- the point of said instrumentation would be statistical analysis.

The onus, therefore, is upon the trace decoder to make sense of that packetized binary data. With any instrumentation data, an appropriate format for that data can be designed – ASCII strings or binary structures – and this will very much inform how the Stimulus Ports are used. Simply, you will need to at least define the usage of channels and the metadata packets before you start writing instrumentation code. By modulating the access size and the use of the extended stimulus ports' abilities to add metadata, extremely efficient output of binary instrumentation data can be effected.

Annex C also gives an example of formatting binary data in such a manner that can be constructed using the stimulus port accessor methods (as previously described). Let us imagine an application which calculates prime numbers. When it finds a prime number, it outputs the prime number itself, and the position or index of the prime, as 32-bit stimulus accesses to the STM. For example, 41 is the 13th prime number, so it outputs "41" and "13."

Stimulus Port Register	Data
G_D	prime
G_DMTS	count

A trace decoder can then look for pairs of 32-bit data items, with the second one Marked and carrying a Timestamp. From the difference in timestamps between successive pairs, we could work out how long it took to generate each prime number.

D32	D32MTS	D32	D32MTS	D32	D32MTS	...
41	13	43	14	47	15	...

This takes up six 32-bit words (24 data bytes), not including overhead, for the 3 sets of data shown. Unless our first prime number is very, very large, we would not need to encode the number or the count in a 32-bit data packet. Since each value is packetized independently (the STM will never merge two packets), the accessor could be chosen conditionally on the size of the output data (by counting leading zeros), or the data could be automatically emitted as a smaller packet using optional STM Compression features.

The trace decoder would still be able to look for pairs of data packets (with a Marker+Timestamp), but we would have more efficient usage of bits in the resultant trace. Below we show how an efficient trace output could be achieved when counting primes, where we increase the packet payload size as we reach the limit of the previous type (again, the first field is a "prime" and the second, marked field is a "count" of which prime). The data below shows 5 sets of data reported using 15 data bytes (again, not including overhead).

D8	D8MTS	D16	D8MTS	D16	D8MTS	...	D16	D8MTS	D16	D16MTS
251	54	257	55	263	56	...	1613	255	1619	256

We can see that since the first prime can be encoded in 8 bits, we can use a D8 packet. Since its position can also be encoded in 8 bits, we can use a D8MTS packet. The next prime is 257, which requires more than 8 bits to encode, but its position does not, so we see D16+D8MTS. And so on. Eventually we will see D32 and possibly D64 packets if we calculate enough primes, but only if we need that number of bits to encode the value.
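The size selection can be sketched in a few lines of C. This version uses plain range comparisons rather than counting leading zeros, and the function names are hypothetical; the returned width would be used to pick between the STM8, STM16, STM32 and STM64 accessors.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest stimulus access width, in bytes, that can hold 'value'. */
static inline unsigned int stm_access_bytes(uint64_t value)
{
    if (value <= UINT8_MAX)
        return 1;   /* D8  */
    if (value <= UINT16_MAX)
        return 2;   /* D16 */
    if (value <= UINT32_MAX)
        return 4;   /* D32 */
    return 8;       /* D64 */
}

/*
 * Data bytes (excluding protocol overhead) needed to report one
 * prime/count pair when each field uses its smallest access width.
 */
static inline unsigned int prime_record_bytes(uint64_t prime, uint64_t count)
{
    assert(count > 0 && count <= prime);  /* the n-th prime is never below n */
    return stm_access_bytes(prime) + stm_access_bytes(count);
}
```

Applied to the five records in the table above, the widths come to 2 + 3 + 3 + 3 + 4 = 15 data bytes, compared with 40 if every field were sent as a 32-bit access.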

We now know fundamentally how to program the STM and generate stimulus which implements our instrumentation. Next we'll discuss how to configure DS-5 to collect the instrumentation as Trace, in Configuring DS-5 for the System Trace Macrocell.
