
Problems understanding memory barriers, and barriers taking too long to execute on an ARM Cortex-A15 corepack

Hello community and experts,

I am having a hard time understanding memory barriers. The further I read, the more paranoid I become about speculative reads and CPU re-ordering.

I have some questions and I will really appreciate any help. First of all, allow me to give some background. This is our first ARM experience; before this we mostly used single-core DSPs. In this project we are using an ARM Cortex-A15 corepack, with four ARM cores and a shared L2 cache. Our compiler is GCC.

Main Problem and a meta-question:

My core problem, besides properly understanding memory ordering and memory barriers, is that the execution of memory barriers (particularly DMB SY) sometimes takes too many cycles to complete (the equivalent of 50 usec on a 1.4 GHz core). This does not always happen, but when it does, it seems random to me. I am measuring with the core cycle counter register. Note that 50 usec is far too long for our system, since we have 1 msec real-time constraints on many operations.

Is there a particular reason for this performance spike? (Maybe there are too many outstanding operations in the pipeline, so when I force them to drain it takes a long time...) How can I prevent these "spikes" from happening?

Question 1 about memory barriers:

There is a simple mailbox example in the ARM documentation on memory barriers with caches. It uses two DMBs to prevent CPU reordering.

Let's say I have changed the code to a software queue style:

#define N 16    /* queue capacity, for illustration */

volatile unsigned int write_cnt = 0, read_cnt = 0;
unsigned int msg_array[N];

void coreA()
{
    msg_array[write_cnt] = 1;              /* or any value; 1 is just an example */
    asm volatile ("dmb" ::: "memory");     /* "memory" clobber stops the compiler reordering too */
    ++write_cnt;
}

void coreB()
{
    unsigned int my_read = 0;

    while (read_cnt != write_cnt) {
        my_read = msg_array[read_cnt];     /* I know my_read is unused; example only */
        ++read_cnt;
    }
}

I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer I have spent years writing code where reordering is invisible or unknown to me. My intuition says the loop in coreB will not reorder the accesses, because they all appear to depend on each other.

Question 2 about memory barriers:

We use a hardware queue system to communicate between ARM cores and co-processors (ARM core <-> ARM core, ARM core <-> co-processor).

We have some descriptors, or buffers, to write data into. Some of these buffers are in a cacheable region, some are not. Only the start address of the buffer is pushed onto the hardware queue.

Buffer Pool A, in a cacheable region

Core1

pop a buffer from Pool A

write the data into the buffer

// Do I need a memory barrier (DMB) at this point?

push this buffer into the hw queue no 1

Core2

read loop on the hw queue no 1

if there is an entry in the hw queue,

     pop the buffer

     read the contents of the buffer

Do I need a memory barrier in the Core2? If yes, why?

Question 3 about memory barriers:

In our system, we use the scenario from question 2 without any memory barriers, and there are no visible problems. How is this possible? At the very least, Core1 should require a DMB.

Again, the same scenario, but this time the buffer pool is in a non-cacheable region. Without a memory barrier in Core1, we see problems, such as Core2 not observing the data change in the buffer. But we do not use a barrier in Core2 either.

As you can see, I am pretty confused, partly because of our system's behaviour and partly because memory barriers are complex and new to me. I would appreciate any help.

Thank you,

Erman

  • I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer I have spent years writing code where reordering is invisible or unknown to me. My intuition says the loop in coreB will not reorder the accesses, because they all appear to depend on each other.

    As I understand it, you absolutely MUST have a DMB in the coreB function.  If you do not, your code will only work by chance.

    The barrier in the coreA function ensures that the write you have made will be published to main memory prior to moving the write pointer.  Remember that barriers do not cause any memory writes or flushes to actually occur; all they do is ensure that when writes to memory finally DO happen, all the writes before the barrier happen before all the writes after the barrier.  All the barrier is doing is ensuring *ordering* of events - nothing else.  In theory, in fact, with this data structure, coreB might NEVER see ANY writes made by coreA.  In practice, of course, it does, because coreA is busy running code that performs memory writes all the time.

    In coreB, the barrier there ensures the core will SEE the writes made by coreA - and, moreover, *see them in the correct order*.  If there is no barrier here, it may be that the write pointer move in coreA is seen by coreB *BEFORE* the data written to the queue element is seen; coreB will think there is a new element, but it will read the OLD data in that element.

    I have implemented this queue (it's the bounded, single-consumer, single-producer queue - not the other queue, which is multiple consumer/producer) in my own lock-free data structure library.  You can have a look at the code, if you want: http://www.liblfds.org

    To be honest, memory barriers and the like are complex stuff which is easy to get wrong.  If you're new to this, you're likely to get it wrong.  I would not advise you to be learning memory barriers in the process of creating an actual product.

    In our system, we use the scenario from question 2 without any memory barriers, and there are no visible problems. How is this possible? At the very least, Core1 should require a DMB.

    Probably the other load in the system is causing enough memory reads and writes to be keeping itself honest.  However, it's pure chance.

    Again, in the same scenario, but this time the Buffer Pool is in a non-cacheable region. Without memory barrier in the Core1, we see some problems, like Core2 does not observe the data change in the buffer.

    Careful.  The way you write makes me think you are thinking the memory barrier is causing data to be flushed to memory.  It's not doing that at all.  Once you realise that, things will be fabulously confusing (at least, they were for me!) but then you're on the right track.
