
Problem with understanding memory barriers, and barriers taking too long to execute on an ARM Cortex-A15 cluster

Hello community and experts,

I am having a hard time understanding memory barriers. The further I read, the more paranoid I become about speculative reads and CPU reordering.

I have some questions and will really appreciate any help. First of all, allow me to give some background. This is our first ARM experience; before this we mostly used single-core DSPs. In this project we are using an ARM Cortex-A15 cluster with four ARM cores and a shared L2 cache. Our compiler is GCC.

Main Problem and a meta-question:

My core problem, besides properly understanding memory ordering and memory barriers, is that the execution of a memory barrier (particularly DMB SY) sometimes takes far too many cycles to complete (cycles equal to 50 usec on a 1.4 GHz core). This does not always happen, but when it does, it seems random to me. I measure it using the core cycle counter register. Note that 50 usec is far too long for our system, since we have 1 msec real-time constraints on many operations.

Is there a particular reason for these performance spikes? (Maybe too many operations are pending in the store buffers and pipeline, so forcing them to drain takes a long time...) How can I prevent these spikes from happening?

Question 1 about memory barriers:

There is a simple mailbox example in the ARM documentation on memory barriers with caches. It uses two DMBs to prevent CPU reordering.

Let's say I have changed the code to a software queue style:

unsigned int write_cnt = 0, read_cnt = 0, msg_array[N];

void coreA()
{
    msg_array[write_cnt] = 1;   /* 1 is just an example value */
    asm volatile ("DMB");
    ++write_cnt;
}

void coreB()
{
    unsigned int my_read = 0;

    while (read_cnt != write_cnt) {
        my_read = msg_array[read_cnt];   /* I know my_read is unused; this is only an example */
        ++read_cnt;
    }
}

I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer, I have spent years writing code on the assumption that reordering is invisible or unknown to me. My intuition says the loop in coreB will not reorder the accesses, because they all seem to depend on each other.

Question 2 about memory barriers:

We use a hardware queue system to communicate between ARM cores and co-processors (ARM core <-> ARM core, ARM core <-> co-processor).

We have some descriptors, or buffers, to write data into. Some of these buffers are in a cacheable region, some are not. Only the start address of the buffer is pushed onto the hardware queue.

Buffer Pool A, in a cacheable region

Core1

pop a buffer from Pool A

write the data into the buffer

//Do I need a memory barrier (DMB) here in this line?

push this buffer into the hw queue no 1

Core2

read loop on the hw queue no 1

if there is an entry in the hw queue,

     pop the buffer

     read the contents of the buffer

Do I need a memory barrier on Core2? If yes, why?

Question 3 about memory barriers:

In our system, we use the scenario from question 2 without any memory barriers, and there are no visible problems. How is this possible? At the very least, Core1 should require a DMB.

Again the same scenario, but this time the buffer pool is in a non-cacheable region. Without a memory barrier on Core1, we see problems: for example, Core2 does not observe the data change in the buffer. Yet we do not use a barrier on Core2 either.

As you can see, I am pretty confused, partly because of our system's behaviour and partly because memory barriers are complex and new to me. I will appreciate any help.

Thank you,

Erman
