Cortex-A53 and AMP (Asymmetric Multiprocessing)

Hi

I am currently studying the ARM Cortex-A53 platform for high-performance real-time control solutions. Among the possible approaches, I am trying to evaluate and understand how to set up an AMP scheme on a Cortex-A53.

My setup is based on one instance of FreeRTOS running on each of the 4 cores (currently, for simplicity, only 2). Using the linker script, I have assigned a segment of the total RAM to each core and left a trailing region reserved for shared memory. The shared memory is needed because the first core produces data that the second core consumes.
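For illustration, the producer/consumer exchange through the shared region could be sketched as a single-producer, single-consumer ring buffer. This is a hypothetical sketch, not the code running on the board: the names `shared_ring_t`, `ring_push`, `ring_pop`, the slot count and the `.shared_ram` section name are all assumptions, and the acquire/release ordering is only meaningful if the region is either kept hardware-coherent between the cores or mapped non-cacheable.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u /* hypothetical size; must be a power of two */

typedef struct {
    _Atomic uint32_t head;      /* advanced only by the producer core */
    _Atomic uint32_t tail;      /* advanced only by the consumer core */
    uint32_t slots[RING_SLOTS]; /* payload exchanged between the cores */
} shared_ring_t;

/* On the target this object would be placed in the shared RAM region,
 * for example with __attribute__((section(".shared_ram"))); the section
 * name is an assumption. Here it is an ordinary global. */
static shared_ring_t g_ring;

/* Producer side: returns false when the ring is full. */
static bool ring_push(shared_ring_t *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS) {
        return false;
    }
    r->slots[head & (RING_SLOTS - 1u)] = value;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
static bool ring_pop(shared_ring_t *r, uint32_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) {
        return false;
    }
    *out = r->slots[tail & (RING_SLOTS - 1u)];
    /* Release: the slot read completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each index is written by exactly one core, this layout needs no lock; whether it is correct on the real platform still depends on the memory attributes discussed in the questions below.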

I am facing a series of practical and theoretical issues and I cannot find a solid reference. I will try to summarize them here:

1. Is this the right scenario for AMP (Asymmetric Multiprocessing)? Conversely, does the fact that I am sharing a portion of the RAM between cores mean it is no longer an AMP scenario?
2. How should the MMU/TLB setup be done correctly? Do I need to set up the translation tables manually to account for the private memory of each core and to mark the shared region as shared? This question is truly crucial, because in my setup marking a region as shared disables caching for it and has a tremendous performance hit on the platform. On the other hand, since the exchanged data is also important, I would like to be sure about the coherence of that data between the cores' caches and RAM (imagine a producer/consumer scheme between cores).
3. What is the right cache setup? Is it sufficient to clear the SMPEN bit while leaving the L2 cache enabled? Can the software ignore caching issues once the coherency engine is shut down? Can the performance hit be avoided while the cache still operates correctly?
4. Can a spinlock based on a test-and-set (TSL-style) instruction, with the lock variable residing in a high-speed shared OCM, suffice as an arbitration scheme between cores when they are set up in AMP mode?
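To make question 4 concrete, here is a minimal sketch of the kind of test-and-set spinlock meant there, written with C11 atomics. The names `amp_spinlock_t`, `amp_spin_trylock`, `amp_spin_lock` and `amp_spin_unlock` are hypothetical. On AArch64 without the ARMv8.1 atomic extensions, a compiler would normally implement the test-and-set with a load-exclusive/store-exclusive loop, so the sketch assumes the lock variable lives in memory for which the SoC actually supports exclusive accesses; whether that holds for OCM is platform-specific and should be checked.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_flag locked; /* set while some core holds the lock */
} amp_spinlock_t;

#define AMP_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Single test-and-set attempt: returns true if the lock was acquired. */
static bool amp_spin_trylock(amp_spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait until the lock is acquired. On the target, a WFE/SEV pair
 * could reduce bus traffic while waiting; that is omitted here. */
static void amp_spin_lock(amp_spinlock_t *l)
{
    while (!amp_spin_trylock(l)) {
        /* spin */
    }
}

/* Release the lock; earlier writes become visible before the clear. */
static void amp_spin_unlock(amp_spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical lock instance; on the target it would sit in shared memory. */
static amp_spinlock_t g_lock = AMP_SPINLOCK_INIT;
```

A critical section is then `amp_spin_lock(&g_lock); /* touch shared data */ amp_spin_unlock(&g_lock);` on either core.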

The issues I am currently having occur even with very simple for loops that assign data in the shared memory region. When writing single bytes, bandwidth falls below 20-30 MB/s with the cache disabled or in write-through mode. The spinlock also causes a large performance hit.
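The loops in question are of roughly the following shape. This is an illustrative sketch with hypothetical function names (`fill_bytes`, `fill_words`), not the exact test code. The second variant writes 64 bits per store; assuming the region is mapped so that stores are not merged (for example as Device memory), it should need one eighth of the bus accesses of the byte-wise loop, which may be worth comparing when measuring bandwidth.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte-wise fill: one 8-bit store per loop iteration. */
static void fill_bytes(volatile uint8_t *dst, size_t len, uint8_t value)
{
    for (size_t i = 0; i < len; i++) {
        dst[i] = value;
    }
}

/* Word-wise fill: one 64-bit store per loop iteration.
 * dst must be 8-byte aligned; words is the length in 64-bit units. */
static void fill_words(volatile uint64_t *dst, size_t words, uint8_t value)
{
    /* Replicate the byte into all eight lanes of the word. */
    uint64_t pattern = 0x0101010101010101ULL * value;
    for (size_t i = 0; i < words; i++) {
        dst[i] = pattern;
    }
}
```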

I know the question may sound confusing and touches on too many topics, but I am at an early stage of understanding and I would like to have my ideas in order before going down to the metal. Moreover, the bare-metal approach is giving me a lot of confusing results that I do not really understand, mainly from the performance point of view.

Thanks in advance for all the suggestions and comments here.

Giuseppe