Hello community and experts,
I am having a hard time understanding memory barriers. The more I read, the more paranoid I become about speculative reads and CPU re-ordering.
I have some questions, and I will really appreciate any help. First, allow me to give some background. This is our first ARM experience; before this we mostly used single-core DSPs. In this project we are using an ARM Cortex-A15 cluster with four ARM cores and a shared L2 cache. Our compiler is GCC.
Main Problem and a meta-question:
My core problem, beyond understanding memory ordering and memory barriers in general, is that executing a memory barrier (particularly DMB SY) sometimes takes a very long time to complete (cycle counts equivalent to 50 usec on a 1.4 GHz core). This does not always happen, but when it does it looks random to me. I am measuring with the core cycle counter register. Note that 50 usec is far too long for our system, since we have 1 msec real-time constraints on many operations.
Is there a particular reason for this performance spike? (Maybe there are too many operations waiting in the pipeline, so when I force them all to complete it takes a long time...) How can I prevent these "spikes" from happening?
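For reference, here is roughly how I take the measurement (a simplified sketch; it assumes user-mode access to the PMU cycle counter PMCCNTR has already been enabled, e.g. via PMUSERENR, and it ignores counter wrap-around):

#include <stdio.h>

/* Read the ARMv7 PMU cycle counter (PMCCNTR) */
static inline unsigned int read_cycle_counter(void)
{
    unsigned int cycles;
    asm volatile ("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
    return cycles;
}

void measure_dmb(void)
{
    unsigned int start = read_cycle_counter();
    asm volatile ("dmb sy" ::: "memory");
    unsigned int end = read_cycle_counter();

    /* At 1.4 GHz, 50 usec corresponds to roughly 70000 cycles */
    printf("DMB SY took %u cycles\n", end - start);
}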
Question 1 about memory barriers:
There is a simple mailbox example in the ARM material on memory barriers with caches; it uses two DMBs to prevent CPU reordering.
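If I remember it correctly, the example has roughly this shape in C (reconstructed from memory, so the details may differ from the original):

volatile unsigned int mailbox_flag = 0;
unsigned int mailbox_data;

void sender(unsigned int value)
{
    mailbox_data = value;
    asm volatile ("DMB" ::: "memory"); /* first DMB: data is visible before the flag */
    mailbox_flag = 1;
}

unsigned int receiver(void)
{
    while (mailbox_flag == 0)
        ; /* spin until the sender raises the flag */
    asm volatile ("DMB" ::: "memory"); /* second DMB: do not read the data before the flag */
    return mailbox_data;
}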
Let's say I have changed the code to a software queue style:
#define N 16 /* example queue size */

volatile unsigned int write_cnt = 0, read_cnt = 0; /* volatile so the compiler re-reads them */
unsigned int msg_array[N];

void coreA(void)
{
    msg_array[write_cnt] = 1; // or any value other than 1; 1 stands as an example
    asm volatile ("DMB" ::: "memory");
    ++write_cnt;
}

void coreB(void)
{
    unsigned int my_read = 0;
    while (read_cnt != write_cnt) {
        my_read = msg_array[read_cnt]; // I know I don't use my_read anywhere in the example
        ++read_cnt;
    }
}
I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer, I have managed for years to write code without reordering ever being visible to me. My intuition says that the loop in coreB will not reorder the accesses, because they all seem to depend on each other.
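If it turns out that coreB does need a barrier, I assume it would have to go between the check of write_cnt and the read of msg_array, something like this (my guess, not verified):

void coreB_with_barrier(void)
{
    unsigned int my_read = 0;
    while (read_cnt != write_cnt) {
        asm volatile ("DMB" ::: "memory"); /* order the counter check before the data load */
        my_read = msg_array[read_cnt];
        ++read_cnt;
    }
}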
Question 2 about memory barriers:
We use a hardware queue system to communicate between ARM cores and co-processors (ARM core <-> ARM core, ARM core <-> co-processor).
We have some descriptors, or buffers, to write data into. Some of these buffers are in a cacheable region, some are not. Only the start address of the buffer is pushed into the hardware queue.
Buffer Pool A, in a cacheable region:

Core1:
    pop a buffer from Pool A
    write the data into the buffer
    // Do I need a memory barrier (DMB) here, at this point?
    push this buffer into hw queue no 1

Core2:
    read loop on hw queue no 1
    if there is an entry in the hw queue:
        pop the buffer
        read the contents of the buffer

Do I need a memory barrier on Core2? If yes, why?
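To make the question concrete, here is a sketch of the flow in C. pool_a_pop, fill_buffer, hw_queue_push, hw_queue_pop and process_buffer are made-up names standing in for our real driver calls:

struct buffer;                                  /* opaque payload buffer */
extern struct buffer *pool_a_pop(void);         /* hypothetical: pop a free buffer from Pool A */
extern void fill_buffer(struct buffer *buf);    /* hypothetical: write the payload */
extern void hw_queue_push(int q, struct buffer *buf);
extern struct buffer *hw_queue_pop(int q);      /* hypothetical: NULL if the queue is empty */
extern void process_buffer(struct buffer *buf);

void core1_producer(void)
{
    struct buffer *buf = pool_a_pop();
    fill_buffer(buf);
    /* Is a DMB needed here, before publishing the address? */
    asm volatile ("DMB" ::: "memory");
    hw_queue_push(1, buf);
}

void core2_consumer(void)
{
    struct buffer *buf;
    while ((buf = hw_queue_pop(1)) == NULL)
        ; /* poll hw queue no 1 */
    /* And is a DMB needed here, before reading the contents? */
    process_buffer(buf);
}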
Question 3 about memory barriers:
In our system, we are using the scenario from question 2 without any memory barriers, and there are no visible problems. How is this possible? At the very least, Core1 should require a DMB, shouldn't it?
Again, the same scenario, but this time the buffer pool is in a non-cacheable region. Without a memory barrier on Core1 we see problems, e.g. Core2 does not observe the data change in the buffer. But we still don't use a barrier on Core2.
As you can see, I am pretty confused, partly because of our system's behaviour and partly because memory barriers are complex and new to me. I would appreciate any help,
Thank you,
Erman
In terms of general background reading around barriers and usage for different algorithms, this is a good starting point:
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
Hard to say. Cortex-A cores are not real-time cores: they have multiple levels of caching, out-of-order execution, virtual memory, etc. One data load instruction can easily take 700 cycles if you manage to miss in the TLB and in the data cache (that is three round trips to DDR, which may be a couple of hundred CPU cycles each on a slow memory system), possibly more if the other cores in the system are busy doing the same thing at the same time.
The first thing to check is that your cache and page tables are set up correctly. For inner shared data the DMB should only have to commit to the L2, not to main memory.
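As a rough illustration of what "set up correctly" means here: with the ARMv7 short-descriptor format, a 1 MB section covering shared data would want attribute bits along these lines (a sketch assuming TEX remap is disabled; check it against your actual MMU setup):

/* ARMv7 short-descriptor section entry attribute bits (sketch) */
#define SECTION_TYPE     0x2u          /* bits[1:0] = 0b10: section entry */
#define SECTION_B        (1u << 2)
#define SECTION_C        (1u << 3)
#define SECTION_AP_RW    (0x3u << 10)  /* full read/write access */
#define SECTION_TEX(x)   ((unsigned)(x) << 12)
#define SECTION_S        (1u << 16)    /* shareable */

/* Normal memory, outer and inner write-back write-allocate, shareable:
 * TEX = 0b001, C = 1, B = 1, S = 1 */
#define NORMAL_WBWA_SHARED (SECTION_TYPE | SECTION_B | SECTION_C | \
                            SECTION_TEX(1) | SECTION_S | SECTION_AP_RW)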
I am not sure whether I need a DMB in the coreB function. What is your opinion?
"It depends". If there is no address dependency between the flag and the data which is touched, then yes you probably need a barrier. Otherwise the core may out-of-order and load the data before the flag is set. If there is an address dependency, for example a data lookup where the "flag" is a pointer which is deferenced if not NULL then you don't need a barrier. See the message passing example in the PDF linked above.
Quite possibly the answer is "luck". Barriers, by definition, are there to prevent what the core "might" do and to avoid timing races in accesses which "might" happen at the same time on two cores. If those situations never line up on your platform, then it may well just work. I have seen programs fail only after a couple of days of stress testing - finding missing barriers by testing is a pain.
HTH, Pete
Hello Peter,
Thanks for the answer. I will read the document and try to understand it, and I will check the cache and TLB configuration as soon as possible. As far as I know, the L2 cache is active for both data and instructions, but I do not know the current state of the TLB. If there is a quick reference for the correct configuration of the cache and TLB, I would gladly use it...
Regarding my last question, you say that it works properly by "luck". Then I do need a memory barrier on Core1, but I am still not sure about Core2.
asm volatile ("DMB")