
Memory barriers and ldrexd - race condition?

Note: This was originally posted on 2nd October 2012 at http://forums.arm.com

Hi -

Can someone please please PLEASE help me understand something which, even after a LOT of effort, is still confusing and puzzling me.

I'm trying to get my head around memory barriers and load-link/conditional-store.

In the ARM Linux kernel, I see this function;

static inline u64 atomic64_cmpxchg(atomic64_t *ptr, u64 old, u64 new)
{
  u64 oldval;
  unsigned long res;

  smp_mb();

  do {
    __asm__ __volatile__("@ atomic64_cmpxchg\n"
    "ldrexd      %1, %H1, [%3]\n"
    "mov         %0, #0\n"
    "teq         %1, %4\n"
    "teqeq       %H1, %H4\n"
    "strexdeq    %0, %5, %H5, [%3]"
    : "=&r" (res), "=&r" (oldval), "+Qo" (ptr->counter)
    : "r" (&ptr->counter), "r" (old), "r" (new)
    : "cc");
  } while (res);

  smp_mb();

  return oldval;
}


Now, I recently read Paul McKenney's paper explaining the hardware issues involved in memory barriers.

So what I understand is that the initial memory barrier flushes the invalidate queue and the store buffer.

That's all good and well.

We then come to the ldrexd.  Now, this will perform the initial load and mark the exclusive monitor for the ERG (Exclusives Reservation Granule) covering the address of the load.

(And then we compare, and if the compare passes, we store.  We then issue another memory barrier to ensure our store will be seen before any later stores we perform (assuming the other CPUs issue read barriers, of course).)

Now, here is what I am so confused about: after the first memory barrier there's a period between the invalidate queue/store buffer flush and the execution of ldrexd.

During this time another CPU (especially if we're running a lock-free data structure, where we're using the same pointer a lot on different CPUs) could issue an invalidate on that pointer - and we in our CPU could fail to process it before executing ldrexd.

So we haven't yet activated the ERG - we're blind to the invalidate AND we're about to load an out-of-date value from our cache!

So we can imagine doing so, doing the compare, the compare passes, and then we store - which confuses me even more, because we could then store and THEN find the invalidate...

So, I'm missing something here.  Question is, what?

PLEASE someone enlighten me!

It's that or insanity and I think I'm pretty close by now :-)
  • Note: This was originally posted on 3rd October 2012 at http://forums.arm.com

    I may very well be wrong


    Yes you may well be =) Please don't confuse cache coherency and a weak memory model - the two are not related, and you'll just give yourself a headache if you try and make them part of the same problem.

    The caches of the ARM SMP cores behave as I describe, and transient store buffers do not bypass MESI. If you think about it, allowing things to bypass MESI would very rapidly make the system non-coherent, and it would be impossible to write software for it. The snoop control hardware does ensure that every core gets visibility of the data it needs at the right time - as far as this goes, "it just works" and it is transparent to software; no barriers needed.

    Where barriers are needed is where software starts caring about the ordering of memory accesses relative to each other. On strongly ordered architectures you get some implied ordering constraints; on ARM you don't get that. Unless there is an explicit address dependency, the hardware is free to reorder memory accesses relative to each other to achieve higher performance. This is fine for most memory accesses, but some are sensitive (e.g. a lock) and you have to ensure that the lock state is committed to memory before you touch the memory it protects. If you consider the typical spinlock sequence (ignoring the test and retry for the sake of brevity):


        Acquire lock protecting structure A
        Load or store some value from structure A
        Release lock


    On a strongly ordered machine this "works", but on a weakly ordered machine there is nothing stopping the hardware re-ordering the sequence to:


        Load or store some value from structure A
        Acquire lock protecting structure A
        Release lock


    Obviously that's not what the programmer intended ... this is what barriers are for. Barriers are used to tell the compiler/out-of-order-hardware that they cannot reorder across that instruction, and that they must ensure visibility of that memory access to other cores in the coherent system before proceeding. What you would typically see is:


        Acquire lock protecting structure A
        dmb
        Load or store some value from structure A
        dmb
        Release lock


    The first DMB ensures that the lock has been acquired (and is visible to other masters) before the thread messes about with the contents of the structure, and the last DMB ensures that the thread has finished messing about with the structure before actually releasing the lock.

    In your specific case, the atomic compare-exchange sequence is usually implemented to allow cross-thread spin-lock type implementations, so the defensive kernel approach is to wrap them in barriers to ensure that they function as expected when developers expect the same "strongly ordered" behaviour as existing x86 code. But it's definitely nothing to do with cache coherency - promise ;)