Hi
I am currently studying the ARM Cortex-A53 platform for high-performance real-time control solutions. Among the possible approaches, I am trying to evaluate and understand how to handle an AMP scheme on a Cortex-A53.
My setup runs an instance of FreeRTOS on each of the 4 cores (currently, for simplicity, only 2). Using the linker script, I have assigned a segment of the total RAM to each core and left a trailing region for a shared piece of memory. The shared memory is needed because the first core produces data that the second core consumes.
I am facing a series of practical and theoretical issues and cannot find a solid foundation, so I will try to summarize them here:
1. Is this the right scenario for AMP (Asymmetric Multiprocessing)? Conversely, is the fact that I am sharing a portion of the RAM between cores enough to say it is no longer an AMP scenario?
2. How should the TLB/MMU setup be done correctly? Do I need to set up the translation tables manually to account for the private memory of each core and to mark the shared piece as shared? This question is truly crucial, because marking a piece of memory as shared disables caching and has a tremendous performance impact on the platform. On the other hand, since the exchanged data also matters, I would like to be assured of cache coherence between cores and RAM (imagine a producer/consumer scheme between cores).
3. What is the right cache setup? Is disabling the SMPEN bit sufficient to still have the L2 cache enabled? Can the software remain unaware of caching issues once the coherency engine is shut down? Can the performance hit be avoided, so that the cache can still operate correctly?
4. Can a spinlock scheme, based on a test-and-set (TSL) instruction with a variable residing in high-speed shared OCM, suffice as an arbitration mechanism between cores when they are set up in AMP mode? (A sketch of what I have in mind follows this list.)
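To make question 4 concrete, this is roughly the spinlock I have in mind. It is only a sketch: I am assuming the lock word lives in OCM mapped with the same attributes on both cores and that exclusive accesses are globally monitored there; the names (amp_lock_t, amp_lock, amp_unlock) are mine.

/* Minimal AMP spinlock sketch for Cortex-A53 (GCC, aarch64). */
#include <stdint.h>

typedef volatile uint32_t amp_lock_t;        /* 0 = free, 1 = taken */

static inline void amp_lock(amp_lock_t *l)
{
    /* __atomic_exchange_n compiles to an LDAXR/STXR loop (or SWPA with LSE). */
    while (__atomic_exchange_n(l, 1u, __ATOMIC_ACQUIRE) != 0u) {
        __asm__ volatile("wfe" ::: "memory"); /* park until the owner signals */
    }
}

static inline void amp_unlock(amp_lock_t *l)
{
    __atomic_store_n(l, 0u, __ATOMIC_RELEASE); /* publish protected writes first */
    __asm__ volatile("sev" ::: "memory");      /* wake any core waiting in WFE */
}

The acquire/release orderings are intended to replace explicit DMB barriers around the critical section.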
The issues I am currently having also show up with very simple for loops that assign data in the shared memory regions. Bandwidth falls below 20-30 MB/s when working with single bytes with the cache disabled or in write-through mode. The spinlock also has a large performance hit.
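For completeness, this is my current understanding of the alternative hinted at in questions 2 and 3, i.e. keeping the shared buffer cacheable (write-back) and doing explicit clean/invalidate around every exchange. It is only a sketch: I am assuming a 64-byte cache line, execution at EL1, and the function names are mine.

#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u   /* Cortex-A53 L1/L2 cache line length */

/* Producer side: push the buffer's dirty lines out to RAM (DC CVAC). */
static void shared_buf_clean(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("dc cvac, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory");   /* maintenance done before signalling */
}

/* Consumer side: drop possibly stale lines so the next read fetches from RAM (DC IVAC).
 * The buffer must be cache-line aligned and sized, otherwise unrelated data
 * sharing a line would be invalidated as well. */
static void shared_buf_invalidate(const void *buf, size_t len)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1u);
    uintptr_t end = (uintptr_t)buf + len;

    for (; p < end; p += CACHE_LINE)
        __asm__ volatile("dc ivac, %0" :: "r"(p) : "memory");
    __asm__ volatile("dsb sy" ::: "memory");
}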
I know the question may sound confusing and touches too many topics, but I am at an early stage of understanding and would like to have my ideas in order before going to the metal. Moreover, the bare-metal approach is giving me a lot of confusing results which I do not really understand, mainly from the performance point of view.
Thanks in advance for all the suggestions and comments here.
Giuseppe
You need a translation table (MMU/TLB) setup for each core, as the TLBs are private to the respective core.
See: https://github.com/42Bastian/arm64-pgtable-tool
//
// Memory description for zcu102
// Parse with CPP then feed into arm64-pgtable-tool to generate a setup file.
//
#ifndef CORE
#define CORE 0
#endif
#if CORE < 0 || CORE > 3
#error CORE: 0..3
#endif

#define ROM_BASE  0x80000000-0x01000000+CORE*0x00400000
#if CORE == 0
// Do not map first 2M for NULL pointer trapping
#define RAM_BASE  0x00200000
#else
#define RAM_BASE  CORE*0x10000000
#endif
#define NOCACHE   (CORE+1)*0x10000000-0x00200000
#define SRAM_BASE 0xfffc0000+CORE*0x8000
#define C2C_RAM   0xfffe0000

ROM_BASE,,   4M,   SHARED:CACHE_WB, SRO_URO:SX:UX, ROM
RAM_BASE,,   16M,  SHARED:CACHE_WB, SRW_URW,       RAM
NOCACHE,,    2M,   SHARED:NO_CACHE, SRW_URW,       No Cache
SRAM_BASE,,  32K,  SHARED:CACHE_WB, SRW_URW,       local SRAM
C2C_RAM,,    32K,  SHARED:CACHE_WB, SRW_URW,       C2C_SRAM
0x80000000,, 1GB,  SHARED:DEVICE,   SRW_URW,       lower PL
0xc0000000,, 512M, SHARED:DEVICE,   SRW_URW,       QSPI
0xe0000000,, 256M, SHARED:DEVICE,   SRW_URW,       lower PCIe
0xf8000000,, 16M,  SHARED:DEVICE,   SRW_URW,       CoreSight
0xf9000000,, 1M,   SHARED:DEVICE,   SRW_URW,       RPU
0xfd000000,, 16M,  SHARED:DEVICE,   SRW_URW,       FPS
0xfe000000,, 26M,  SHARED:DEVICE,   SRW_URW,       LPS
0xffc00000,, 2M,   SHARED:DEVICE,   SRW_URW,       PMU
#0xffe00000,, 2M,  SHARED:CACHE_WB, SRW_URW,       OCM/TCM

// Mapping needed for Connector: Other core's RAM must be R/O
#if CORE == 0
0x10000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE1
0x20000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE2
0x30000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE3
#elif CORE == 1
0x00200000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE0
0x20000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE2
0x30000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE3
#elif CORE == 2
0x00200000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE0
0x10000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE1
0x30000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE3
#else
0x00200000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE0
0x10000000,, 16M, SHARED:CACHE_WB, SRO_URO, RAM_CORE1
0x20000000,, 48M, SHARED:CACHE_WB, SRO_URO, RAM_CORE2
#endif
"Connector" is a SCIOPTA application for core to core message passing.
I prefer to build one ELF per core.