Hello community and experts,
I am having a hard time understanding memory barriers. The further I read, the more paranoid I become about speculative reads and CPU reordering.
I have some questions, and I will really appreciate any help. First, allow me to give some background info. This is our first ARM experience; before this we mostly used single-core DSPs. In this project, we are using an ARM Cortex-A15 corepack with four ARM cores and a shared L2 cache. Our compiler is GCC.
Main Problem and a meta-question:
My core problem, besides properly understanding memory ordering and memory barriers, is that the execution of memory barriers (particularly DMB SY) sometimes takes far too many cycles to complete (cycles equal to 50 usec on a 1.4 GHz core). This does not always happen, but when it does, it seems random to me. I am measuring with the core cycle counter register. Note that 50 usec is far too long for our system, since we have 1 msec real-time constraints on many operations.
Is there a particular reason for this performance spike? (Maybe there are too many operations waiting in the pipeline, so when I force it to drain, it takes a long time...) How can I prevent these "spikes" from happening?
Question 1 about memory barriers:
There is a simple mailbox example in "ARM memory barrier with cache". It uses two DMBs to prevent CPU reordering.
Let's say I have changed the code to a software queue style:
unsigned int write_cnt = 0, read_cnt = 0, msg_array[N];

void coreA()
{
    msg_array[write_cnt] = 1; /* or any value other than 1; 1 is just an example */
    asm volatile ("DMB");     /* order the data write before the counter update */
    ++write_cnt;
}

void coreB()
{
    unsigned int my_read = 0;
    while (read_cnt != write_cnt) {
        my_read = msg_array[read_cnt]; /* my_read is not used elsewhere; example only */
        ++read_cnt;
    }
}
I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer, I have written code for years with reordering invisible or unknown to me, and my intuition says the loop in coreB will not reorder the accesses, because they all seem to depend on each other.
Question 2 about memory barriers:
We use a hardware queue system to communicate between ARM cores and co-processors (ARM core <-> ARM core, ARM core <-> co-processor).
We have some descriptors, or buffers, to write data into. Some of these buffers are in a cacheable region, some are not. Only the start address of the buffer is pushed into the hardware queue.
Buffer Pool A, in a cacheable region

Core1:
    pop a buffer from Pool A
    write the data into the buffer
    // Do I need a memory barrier (DMB) here, on this line?
    push this buffer into hw queue no. 1

Core2:
    read loop on hw queue no. 1
    if there is an entry in the hw queue:
        pop the buffer
        read the contents of the buffer
Do I need a memory barrier in Core2? If yes, why?
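To make the question concrete, here is roughly what Core1 does, written as C (pool_pop() and hw_queue_push() are made-up placeholders for our real driver calls):

#include <stddef.h>
#include <stdint.h>

extern uint32_t *pool_pop(void);                 /* hypothetical: pop a free buffer from Pool A       */
extern void hw_queue_push(int q, uint32_t *buf); /* hypothetical: push a buffer address to a hw queue */

void core1_send(const uint32_t *data, size_t len_words)
{
    uint32_t *buf = pool_pop();
    for (size_t i = 0; i < len_words; i++)
        buf[i] = data[i];                        /* fill the buffer */
    /* The question: is a DMB needed here, so the buffer contents become
       observable before the buffer address appears in the hw queue?     */
    asm volatile ("dmb" ::: "memory");
    hw_queue_push(1, buf);                       /* only the start address is pushed */
}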
Question 3 about memory barriers:
In our system, we use the scenario from Question 2 without any memory barriers, and there are no visible problems. How is this possible? At the very least, Core1 should require a DMB.
Again, in the same scenario, but this time the buffer pool is in a non-cacheable region. Without a memory barrier in Core1, we see problems, such as Core2 not observing the data change in the buffer. But we do not use a barrier in Core2 either.
As you can see, I am pretty confused, partly because of our system's behaviour and partly because memory barriers are complex and new to me. I would appreciate any help.
Thank you,
Erman
In terms of general background reading around barriers and usage for different algorithms, this is a good starting point:
http://infocenter.arm.com/help/topic/com.arm.doc.genc007826/Barrier_Litmus_Tests_and_Cookbook_A08.pdf
Hard to say. Cortex-A cores are not real-time cores: they have multiple levels of caching, out-of-order execution, virtual memory, etc. A single data load instruction can easily take 700 cycles if you manage to miss in the TLB and in the data cache (that is three round trips to DDR, which may be a couple of hundred CPU cycles each on a slow memory system), and possibly more if the other cores in the system are busy doing the same thing at the same time.
The first thing to check is that your cache and page tables are set up correctly. For Inner Shareable data, the DMB should only have to commit to the L2, not to main memory.
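If your page tables do mark the region Inner Shareable, you can also tell the hardware that explicitly by using the narrower barrier variant instead of the full-system default (a sketch in GCC inline assembly for ARMv7-A):

/* Full-system barrier: orders accesses against all observers, including
   devices and outer-domain masters - the most expensive option.          */
asm volatile ("dmb sy" ::: "memory");

/* Inner Shareable barrier: only orders accesses within the Inner
   Shareable domain (the four A15 cores and the shared L2), which is all
   that core-to-core message passing needs, assuming the data really is
   mapped Inner Shareable.                                                */
asm volatile ("dmb ish" ::: "memory");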
I am not sure whether I need a DMB in the coreB function. What is your opinion?
"It depends". If there is no address dependency between the flag and the data which is touched, then yes you probably need a barrier. Otherwise the core may out-of-order and load the data before the flag is set. If there is an address dependency, for example a data lookup where the "flag" is a pointer which is deferenced if not NULL then you don't need a barrier. See the message passing example in the PDF linked above.
Quite possibly the answer is "luck". Barriers, by definition, are there to prevent what the core "might" do and to avoid timing races in accesses that "might" happen at the same time on two cores. If those situations never line up on your platform, then it may well just work. I've seen programs fail only after a couple of days of stress testing; testing for and then finding missing-barrier cases is a pain.
HTH, Pete
Hello Peter,
Thanks for the answer. I will read the document and try to understand it. I will also check the cache and TLB configuration as soon as possible. As far as I know, the L2 cache is active for both data and instructions, but I do not know the current state of the TLB. If there is a quick reference for the correct configuration of the cache and TLB, I would gladly use it...
Regarding my last question, you say that it is "luck" that it works properly. Then I need a memory barrier in Core1, but I am still not sure about Core2.
asm volatile ("DMB")
I am not sure whether I need a DMB in the coreB function. What is your opinion? And why should I need a barrier at all? As a programmer, I have written code for years with reordering invisible or unknown to me, and my intuition says the loop in coreB will not reorder the accesses, because they all seem to depend on each other.
As I understand it, you absolutely MUST have a DMB in the coreB function. If you do not, your code will only work by chance.
The barrier in the coreA function ensures that the write you have made will be published to main memory before the write pointer moves. Remember that barriers do not cause any memory writes or flushes to actually occur; all they do is ensure that, when writes to memory finally DO happen, all the writes before the barrier happen before all the writes after the barrier. The barrier is ensuring *ordering* of events - nothing else. In theory, in fact, with this data structure, coreB might NEVER see ANY writes made by coreA. In practice, of course, it does, because coreA, simply by running code, performs memory writes all the time.
In coreB, the barrier there ensures the core will SEE the writes made by coreA - and, moreover, *see them in the correct order*. If there is no barrier there, it may be that the write-pointer move in coreA is seen by coreB *BEFORE* the data written to the queue element is seen; coreB will think there is a new element, but it will read the OLD data in that element.
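Concretely, for the queue in Question 1, the barrier goes between observing write_cnt and reading the element - a sketch of the corrected coreB:

void coreB()
{
    unsigned int my_read = 0;
    while (read_cnt != write_cnt) {
        /* Order the load of write_cnt (in the loop condition) before the
           load of the element; without this, coreB can observe the new
           write_cnt but still read stale msg_array contents.             */
        asm volatile ("dmb" ::: "memory");
        my_read = msg_array[read_cnt];
        ++read_cnt;
    }
}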
I have implemented this queue (the bounded, single-consumer, single-producer queue - not the other queue, which is multiple-consumer/multiple-producer) in my own lock-free data structure library. You can have a look at the code if you want: http://www.liblfds.org
To be honest, memory barriers and the like are complex stuff that is easy to get wrong. If you're new to this, you're likely to get it wrong. I would not advise learning memory barriers while creating an actual product.
Probably the other load on the system causes enough memory reads and writes to keep things honest. However, it's pure chance.
Again, in the same scenario, but this time the buffer pool is in a non-cacheable region. Without a memory barrier in Core1, we see problems, such as Core2 not observing the data change in the buffer.
Careful. The way you write makes me think you believe the memory barrier causes data to be flushed to memory. It does not do that at all. Once you realise that, things will be fabulously confusing (at least, they were for me!), but then you're on the right track.
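If you genuinely do need to push dirty data out of the cache (say, for a non-coherent bus master), that is a separate cache maintenance operation, not a barrier. On ARMv7-A, privileged code would use something like the following (illustrative only; the helper name is made up):

/* Clean one data cache line by virtual address to the point of
   coherency (the DCCMVAC CP15 operation). This actually writes dirty
   data back to memory; a DMB never does - it only orders accesses.   */
static inline void dcache_clean_line(const void *addr)
{
    asm volatile ("mcr p15, 0, %0, c7, c10, 1" : : "r" (addr) : "memory");
}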