Working with its architecture licensees and ecosystem partners, Arm continues to evolve its architecture, developing new functionality to meet the needs of both new and existing markets.
This blog discusses some of the key additions to the A-profile architecture in 2020.
This blog also introduces two new additions to the Future Architecture Technologies program, which provides advanced information on unreleased versions of the architecture.
Full Instruction Set and System Register information will be available via our technical webpages. The complete Armv8-A Architecture Reference Manual (ArmARM), documenting the 2020 extensions and earlier functionality, is due for release in early 2021. XML releases will be available soon and we will link to those when available.
Details of previous updates to the A-profile architecture are available here: 2014, 2015, 2016, 2017, 2018 and 2019.
As part of the 2020 extensions, Arm is adding a memory attribute, XS, that identifies devices whose accesses can be subject to long delays. TLB invalidate (TLBI) operations and barriers can also be annotated with this attribute.
Technologies such as PCIe allow devices to be hot-unplugged. This can occur even when there are outstanding requests to the device. When a device is removed, the PCIe root complex returns a default response after a timeout period, which is typically on the order of 50 ms.
Some impact on the software directly interacting with the removed device is expected. However, we want to minimize the impact on other, unrelated tasks. Consider the following example:
Figure 1 - Hot-unplug causing delayed TLBI response
Core 1 was interacting with the removed device and is now waiting for a response.
Core 2 broadcasts an unrelated TLBI and waits for the acknowledgment from core 1. Ideally core 1 would respond quickly, as it has no outstanding transactions for the location covered by the TLBI. However, some micro-architectures do not track the translation used for issued transactions. To meet the architectural requirements, core 1 would have to wait for all transactions to complete before replying to the TLBI, making core 2 also subject to the PCIe timeout.
The XS attribute gives an efficient mechanism for avoiding this. The mappings for the PCIe devices have XS=1, indicating that long delays are possible. Other regions, such as RAM, are marked as XS=0. A core can track whether outstanding transactions are XS=0 or XS=1 without needing to record the full original translation. In our example scenario, core 1 knows that only XS=1 accesses are outstanding, which allows it to respond quickly to core 2’s TLBI if that TLBI is marked as applying only to XS=0 mappings.
Figure 2 - XS attribute used to avoid TLBI response delay
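The tracking described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `core_model` type and the `issue_access`, `complete_access` and `can_ack_tlbi` functions are hypothetical names invented for illustration, and real hardware tracks this state in the micro-architecture rather than in software.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: the core counts outstanding accesses per XS value
 * instead of recording the translation that each access used. */
typedef struct {
    uint32_t outstanding_xs0; /* accesses to XS=0 regions, such as RAM */
    uint32_t outstanding_xs1; /* accesses to XS=1 regions, such as PCIe devices */
} core_model;

/* Record that an access with the given XS attribute has been issued. */
static void issue_access(core_model *c, bool xs)
{
    if (xs) {
        c->outstanding_xs1++;
    } else {
        c->outstanding_xs0++;
    }
}

/* Record that a previously issued access has completed. */
static void complete_access(core_model *c, bool xs)
{
    uint32_t *count = xs ? &c->outstanding_xs1 : &c->outstanding_xs0;
    assert(*count > 0);
    (*count)--;
}

/* Can the core acknowledge a broadcast TLBI right now?
 * tlbi_xs0_only: the TLBI is annotated as applying to XS=0 mappings only. */
static bool can_ack_tlbi(const core_model *c, bool tlbi_xs0_only)
{
    if (c->outstanding_xs0 != 0) {
        return false; /* XS=0 accesses must always complete first */
    }
    /* Outstanding XS=1 accesses only block a TLBI that is not annotated. */
    return tlbi_xs0_only || c->outstanding_xs1 == 0;
}
```

In this model, a core that has only an XS=1 access outstanding (the hot-unplug case) can acknowledge an annotated TLBI immediately, while an unannotated TLBI still has to wait for the device access to complete.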
A growing trend in enterprise systems is the introduction of accelerators that can be accessed using 64-byte atomic loads or stores. These accesses are used to add items to queues and can, in some cases, signal success or failure of the enqueue operation.
To support this new breed of accelerators, a 64-byte atomic load instruction (LD64B) and three 64-byte atomic store instructions (ST64Bx) are added to the architecture.
Figure 3 - Adding a work item to a work queue
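The enqueue pattern can be sketched in portable C as a model of the accelerator's side of the exchange. This does not execute the new instructions. The `accel_queue` type, the `store64_with_status` and `drain_one` functions, the queue depth, and the convention that a status of 0 means success are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DESC_BYTES  64 /* one work item is a single 64-byte descriptor */
#define QUEUE_SLOTS 4  /* hypothetical queue depth */

typedef struct {
    uint8_t bytes[DESC_BYTES];
} work_item;

typedef struct {
    work_item slots[QUEUE_SLOTS];
    unsigned  count;
} accel_queue;

/* Model of a 64-byte atomic store that returns a status: the accelerator
 * accepts the whole descriptor or none of it.
 * Assumed convention: 0 means enqueued, nonzero means the queue was full. */
static uint64_t store64_with_status(accel_queue *q, const work_item *item)
{
    assert(q != NULL && item != NULL);
    if (q->count == QUEUE_SLOTS) {
        return 1; /* rejected: nothing was written */
    }
    memcpy(&q->slots[q->count], item, DESC_BYTES);
    q->count++;
    return 0;
}

/* Model of the accelerator consuming the most recently queued item. */
static void drain_one(accel_queue *q)
{
    assert(q != NULL);
    if (q->count > 0) {
        q->count--;
    }
}
```

Because the descriptor is transferred as one atomic unit and the status comes back with the same operation, software in this model never sees a half-written work item and can retry as soon as a rejected store reports failure.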
The WFE and WFI instructions allow the core to be put into standby, for example, while waiting for a resource to become available. One limitation of these instructions is that, if no event or interrupt is received, there is no limit to how long the core stays in standby.
To address this limitation, new variants of the WFI and WFE instructions are introduced which take a register operand containing a counter value. The core resumes from standby when the CNTVCT_EL0 virtual counter reaches or exceeds the specified value. This allows software to specify a maximum time to remain in standby.
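The wake-up condition can be sketched as a small model. This is an illustration only, not the instructions themselves: `timed_wait_model`, `deadline_after` and the `event_at` parameter are hypothetical, and `counter` stands in for CNTVCT_EL0, with each loop iteration modelling one counter tick spent in standby.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { WAKE_EVENT, WAKE_TIMEOUT } wake_reason;

/* Compute the absolute counter value at which the wait should give up. */
static uint64_t deadline_after(uint64_t now, uint64_t ticks)
{
    return now + ticks;
}

/* Model of a timed wait.
 * counter:  stand-in for the virtual counter, advanced while in standby
 * deadline: absolute counter value passed as the register operand
 * event_at: counter value at which an event arrives (UINT64_MAX for never)
 * The wait ends when an event arrives, or when the counter reaches or
 * exceeds the deadline, whichever happens first. */
static wake_reason timed_wait_model(uint64_t *counter, uint64_t deadline,
                                    uint64_t event_at)
{
    assert(counter != NULL);
    while (*counter < deadline) {
        if (*counter >= event_at) {
            return WAKE_EVENT;
        }
        (*counter)++; /* one tick in standby */
    }
    return WAKE_TIMEOUT;
}
```

Note that the operand is an absolute counter value rather than a duration, so software in this model computes the deadline once and can compare the counter against it after waking to tell a timeout from an event.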
The 2020 extensions also include other small features:
Alongside the 2020 enhancements, Arm is introducing two new extensions under the Future Architecture Technologies program. Future Architecture Technologies are not released architectures, but technologies for which we want to share advance information so that the ecosystem can prepare.
The Call-Stack Recorder Extension (CSRE) and Branch-Record Buffer Extension (BRBE) aim to improve the experience of developing software for Arm by providing enhanced visibility of how code is executing. This information can be used for debugging, profiling, identifying hot-spots, Feedback Driven Optimization (FDO), and many other uses.
CSRE provides a low impact mechanism to record and unwind the stack. A live view of the current call stack is recorded in memory, where it can be efficiently captured for performance analysis or interpreted for debug.
BRBE captures a recent sequence of branches in an easily consumable format. This information can be used for debugging or fed into profiling tools for hot-spot analysis and AutoFDO.
This blog provides a brief introduction to the latest features included in the Armv8-A architecture as Armv8.7-A, and some information on Future Architecture Technologies. More detailed information will soon be available on our Developer website.
The next step will be working with our ecosystem partners, including Linaro, to ensure that open source software is enabled to make use of this functionality as soon as the hardware becomes available. Join us at Linaro Connect to learn more about the 2020 extensions and take part in the discussions.
Will "LD64B" be limited to 64-byte aligned (cache line size) addresses? Also, I am confused about the "PCIe root" in the diagram. Will it not work on DDRAM?
As for "WFE and WFI with timeouts", it would be cool if the instruction would return 0 or 1 in the register to detect the timeout w/o additional overhead.