Cortex-A53 and AMP (Asymmetric Multiprocessing)

Hi

I am currently studying the ARM Cortex-A53 platform for high-performance real-time control solutions. Among the possible approaches, I am trying to evaluate and understand how to handle an AMP scheme on a Cortex-A53.

My setup is based on one instance of FreeRTOS running on each of the 4 cores (currently, for simplicity, only 2). Using the linker script, I have assigned a segment of the total RAM to each core and left a trailing region for shared memory. The shared region is needed because the first core produces data that the second core consumes.
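To make the intended data flow concrete, here is a minimal sketch of a single-slot mailbox placed in the shared region, written with C11 atomics. The names `c2c_mailbox_t`, `mailbox_put` and `mailbox_get` are made up for illustration and are not part of FreeRTOS. The sketch assumes the region is mapped so that both cores observe each other's writes (either coherent cacheable or uncached); whether that holds is exactly what the questions below are about.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define C2C_SLOT_WORDS 16u  /* hypothetical payload size */

/* One slot living in the shared region; layout is an assumption. */
typedef struct {
    uint32_t    payload[C2C_SLOT_WORDS];
    atomic_uint ready;      /* 0 = empty, 1 = full */
} c2c_mailbox_t;

/* Producer side: returns false if the slot is still full or n is too large. */
static bool mailbox_put(c2c_mailbox_t *mb, const uint32_t *src, unsigned n)
{
    if (n > C2C_SLOT_WORDS ||
        atomic_load_explicit(&mb->ready, memory_order_acquire) != 0u) {
        return false;
    }
    for (unsigned i = 0; i < n; i++) {
        mb->payload[i] = src[i];
    }
    /* Release: payload writes become visible before the flag does. */
    atomic_store_explicit(&mb->ready, 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false if nothing is pending or n is too large. */
static bool mailbox_get(c2c_mailbox_t *mb, uint32_t *dst, unsigned n)
{
    if (n > C2C_SLOT_WORDS ||
        atomic_load_explicit(&mb->ready, memory_order_acquire) != 1u) {
        return false;
    }
    for (unsigned i = 0; i < n; i++) {
        dst[i] = mb->payload[i];
    }
    /* Release: hand the slot back to the producer. */
    atomic_store_explicit(&mb->ready, 0u, memory_order_release);
    return true;
}
```

The acquire/release pair only orders the accesses; it does not by itself make a non-coherent cache coherent, so on a mapping without hardware coherence explicit cache maintenance would still be needed.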

I am facing a series of practical and theoretical issues for which I cannot find a solid reference, so I will try to summarize them here:

1. Is this the right scenario for AMP (Asymmetric Multiprocessing)? Conversely, does sharing a portion of the RAM between cores mean it is no longer an AMP scenario?
2. How should the MMU/TLB setup be done correctly? Do I need to set up the translation tables manually to account for each core's private memory and to mark the shared region as shared? This question is crucial, because marking a region as shared seems to disable caching for it, which carries a tremendous performance hit on the platform. On the other hand, since the exchanged data is important, I would like to be assured of coherence between the cores' caches and RAM (imagine a producer/consumer scheme between cores).
3. What is the right cache setup? Is disabling the SMPEN bit sufficient to keep the L2 cache enabled? Can the software ignore caching issues once the coherence engine is shut down? Can the performance hit be avoided while the cache still operates correctly?
4. Is a spinlock based on a test-and-set (TSL) instruction, with the lock variable residing in high-speed shared OCM, sufficient for arbitration between cores running in AMP mode?
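For reference, the test-and-set scheme from question 4 can be sketched in portable C11 roughly as follows. The `c2c_spin_*` names are hypothetical. On AArch64 a compiler would typically lower `atomic_flag_test_and_set` to a load-exclusive/store-exclusive loop; whether that works between cores depends on the memory attributes of the page holding the lock and on the platform's exclusive monitor, which this sketch simply assumes to be in order.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock word intended to live in memory visible to all cores (assumption). */
typedef struct {
    atomic_flag locked;
} c2c_spinlock_t;

/* Put the lock into the unlocked state. */
static inline void c2c_spin_init(c2c_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Single test-and-set attempt: returns true if the lock was acquired. */
static inline bool c2c_spin_trylock(c2c_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the test-and-set succeeds. */
static inline void c2c_spin_lock(c2c_spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        /* spin; a real port might insert a WFE or yield here */
    }
}

/* Release the lock; prior writes are ordered before the clear. */
static inline void c2c_spin_unlock(c2c_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A critical section would then be bracketed by `c2c_spin_lock(&lock)` and `c2c_spin_unlock(&lock)`, with `lock` placed in the shared region by the linker script.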

The issues I am currently having occur even with very simple for loops that assign data to shared memory regions. Bandwidth falls below 20-30 MB/s when working with single bytes with the cache disabled or in write-through mode. The spinlock also carries a large performance hit.

I know the question may sound confusing and touches too many topics, but I am at an early stage of understanding and I would like to have my ideas in order before going to the metal. Moreover, the bare-metal approach is giving me a lot of confusing results which I do not really understand, mainly from the performance point of view.

Thanks in advance for all the suggestions and comments here.

Giuseppe

Parents
  • You need a separate MMU/translation-table setup for each core, because the TLB is private to the respective core.

    See: https://github.com/42Bastian/arm64-pgtable-tool

    //
    // Memory description for zcu102
    // Parse with CPP then feed into arm64-pgtable-tool to generate a setup file.
    //
    #ifndef CORE
    #define CORE 0
    #endif
    
    #if CORE < 0 || CORE > 3
    #error CORE: 0..3
    #endif
    
    #define ROM_BASE  0x80000000-0x01000000+CORE*0x00400000
    #if CORE == 0
    // Do not map first 2M for NULL pointer trapping
    #define RAM_BASE  0x00200000
    #else
    #define RAM_BASE  CORE*0x10000000
    #endif
    #define NOCACHE   (CORE+1)*0x10000000-0x00200000
    #define SRAM_BASE 0xfffc0000+CORE*0x8000
    #define C2C_RAM   0xfffe0000
    
    ROM_BASE,,     4M, SHARED:CACHE_WB, SRO_URO:SX:UX, ROM
    RAM_BASE,,    16M, SHARED:CACHE_WB, SRW_URW      , RAM
    NOCACHE,,      2M, SHARED:NO_CACHE, SRW_URW      , No Cache
    SRAM_BASE,,   32K, SHARED:CACHE_WB, SRW_URW      , local SRAM
    C2C_RAM,,     32K, SHARED:CACHE_WB, SRW_URW      , C2C_SRAM
    
    0x80000000,,  1GB, SHARED:DEVICE,   SRW_URW      , lower PL
    0xc0000000,, 512M, SHARED:DEVICE,   SRW_URW      , QSPI
    0xe0000000,, 256M, SHARED:DEVICE,   SRW_URW      , lower PCIe
    0xf8000000,,  16M, SHARED:DEVICE,   SRW_URW      , CoreSight
    0xf9000000,,   1M, SHARED:DEVICE,   SRW_URW      , RPU
    0xfd000000,,  16M, SHARED:DEVICE,   SRW_URW      , FPS
    0xfe000000,,  26M, SHARED:DEVICE,   SRW_URW      , LPS
    0xffc00000,,   2M, SHARED:DEVICE,   SRW_URW      , PMU
    #0xffe00000,,   2M, SHARED:CACHE_WB, SRW_URW      , OCM/TCM
    
    // Mapping needed for Connector: Other core's RAM must be R/O
    #if CORE == 0
    0x10000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE1
    0x20000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE2
    0x30000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE3
    #elif CORE == 1
    0x00200000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE0
    0x20000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE2
    0x30000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE3
    #elif CORE == 2
    0x00200000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE0
    0x10000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE1
    0x30000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE3
    #else
    0x00200000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE0
    0x10000000,,   16M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE1
    0x20000000,,   48M, SHARED:CACHE_WB, SRO_URO,       RAM_CORE2
    #endif
    

    "Connector" is a SCIOPTA application for core to core message passing.

    I prefer to build one ELF per core.
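As a worked example of the arithmetic in the macros above, the following C sketch evaluates the same expressions for a given core number. The `core_map_t` struct and `core_map` function are illustrative names only and are not part of arm64-pgtable-tool; the constants are copied from the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Per-core base addresses, mirroring the macros in the memory description. */
typedef struct {
    uint64_t rom_base;
    uint64_t ram_base;
    uint64_t nocache_base;
    uint64_t sram_base;
    uint64_t c2c_base;
} core_map_t;

static core_map_t core_map(unsigned core)
{
    assert(core <= 3u);  /* same range check as "#error CORE: 0..3" */

    core_map_t m;
    /* ROM_BASE: 0x80000000-0x01000000+CORE*0x00400000 */
    m.rom_base = UINT64_C(0x80000000) - UINT64_C(0x01000000)
               + (uint64_t)core * UINT64_C(0x00400000);
    /* RAM_BASE: core 0 skips the first 2M for NULL pointer trapping */
    m.ram_base = (core == 0u) ? UINT64_C(0x00200000)
                              : (uint64_t)core * UINT64_C(0x10000000);
    /* NOCACHE: (CORE+1)*0x10000000-0x00200000 */
    m.nocache_base = ((uint64_t)core + 1u) * UINT64_C(0x10000000)
                   - UINT64_C(0x00200000);
    /* SRAM_BASE: 0xfffc0000+CORE*0x8000 */
    m.sram_base = UINT64_C(0xfffc0000) + (uint64_t)core * UINT64_C(0x8000);
    /* C2C_RAM is the same for every core */
    m.c2c_base = UINT64_C(0xfffe0000);
    return m;
}
```

For core 1, for instance, this gives a ROM base of 0x7f400000, a RAM base of 0x10000000, an uncached window at 0x1fe00000 and local SRAM at 0xfffc8000. The four 32K local SRAM windows end exactly where the shared C2C_RAM window begins.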
