ldrex/strex failures on Cortex?

Note: This was originally posted on 20th November 2009 at http://forums.arm.com

Hi.

I'm trying to implement an atomic compare-and-swap.  I've been working on this for the last month or so, and basically non-stop for the last half week, in an effort to get the **** thing working.

I'm working on a quad-core Cortex-A9, on Linux 2.6.28 with GCC 4.3.4.

Initially, I tried using the GCC built-in __sync_val_compare_and_swap().

Turns out this calls the kernel compare-and-swap which, until about 2.6.31, was implemented *without* memory barriers.
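
For reference, a minimal sketch of how the GCC builtin maps onto the argument order used by abstraction_cas() further down. The wrapper name cas_builtin is made up for illustration; the only real call here is __sync_val_compare_and_swap(), whose documented argument order is (ptr, oldval, newval).

```c
#include <assert.h>

typedef unsigned long int atom_t;

/* Hypothetical wrapper (name invented for this sketch): takes arguments in
   the order (destination, exchange, compare) and forwards them to the GCC
   builtin, whose order is (ptr, expected old value, new value). */
static atom_t cas_builtin(volatile atom_t *destination, atom_t exchange, atom_t compare)
{
  assert(destination != 0);

  /* Returns the value found at *destination before the operation; the
     store took place only if that value was equal to compare. */
  return __sync_val_compare_and_swap(destination, compare, exchange);
}
```

The caller decides whether the swap happened by comparing the return value with the compare argument.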

So I rolled my own __asm__ - I've never done ARM before (and hardly any assembly) but I've a little experience with GCC inline assembly and figuring out ldrex/strex with the help of some Googling wasn't too hard...except of course it's still not working.

In fact, I started out with my own version, then gradually simplified more and more, as it wasn't working, until now, when my code is in fact *identical* to the code in the current Linux kernel.

Still doesn't work.

Couple of days later, I then started to wonder if __sync_synchronize() was working...

Turns out it's not (cue much gnashing of teeth - is no one even doing SMP on ARM with GCC?).  On ARM, as of GCC 4.3.4, it emits no instructions at all.

So I replaced that with an __asm__ "dmb".  (I've got -march set to "armv7-a").
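
A minimal sketch of that replacement as a standalone helper, assuming a build with -march=armv7-a. The name full_barrier is hypothetical, and the non-ARM fallback is only there so the sketch compiles on other targets.

```c
/* Hypothetical helper name; not part of the original code. */
static inline void full_barrier(void)
{
#if defined(__arm__) && defined(__ARM_ARCH_7A__)
  /* ARMv7-A data memory barrier. The "memory" clobber also stops the
     compiler from moving memory accesses across this point. */
  __asm__ __volatile__ ("dmb" : : : "memory");
#else
  /* On other targets, fall back to the GCC builtin. */
  __sync_synchronize();
#endif
}
```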

Thing is...STILL doesn't work!  argghhh!

The code I have is a freelist, a populated Treiber stack.  It's actually written on top of an abstraction layer.  There are four tests.  The freelist works just fine (or appears to and passes the tests) on x86 and x64, on Windows and Linux, with GCC and MSVC.

Of the tests, THREE pass.  It's the fourth which fails - but the fourth is the most intensive: it does the most rapid sets of pushes/pops to the freelist, which of course means the greatest number of CAS operations and, because of the way the test is constructed, also the greatest number of CAS collisions.  (The fourth test is basically a single freelist, with one thread per CPU, where each thread simply pops and then immediately pushes; the test is set to run for ten seconds.  At the end, there is a loop in the freelist: the first element points to itself.)
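
To make the shape of that fourth test concrete, here is a rough sketch of a pop-then-push step on a Treiber-style stack. This is not the real freelist code: the types and the names fl_push, fl_pop and pop_then_push are invented for illustration, and it uses the GCC builtin __sync_bool_compare_and_swap() rather than the abstraction layer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node and list types, not the real freelist's. */
struct node { struct node *next; };
struct freelist { struct node * volatile top; };

static void fl_push(struct freelist *fl, struct node *n)
{
  struct node *old;

  assert(n != NULL);

  do
  {
    old = fl->top;
    n->next = old;   /* link the node before trying to publish it */
  }
  while( !__sync_bool_compare_and_swap(&fl->top, old, n) );
}

static struct node *fl_pop(struct freelist *fl)
{
  struct node *old, *next;

  do
  {
    old = fl->top;
    if( old == NULL )
      return NULL;   /* list is empty */
    next = old->next;
  }
  while( !__sync_bool_compare_and_swap(&fl->top, old, next) );

  return old;
}

/* One iteration of what each test thread does: pop, then push straight back.
   Note: a plain single-word CAS pop like this is exposed to the ABA problem
   when several threads run it concurrently. */
static void pop_then_push(struct freelist *fl)
{
  struct node *n = fl_pop(fl);

  if( n != NULL )
    fl_push(fl, n);
}
```

Each test thread would call something like pop_then_push() in a loop for the duration of the test. The ABA note in the comment is a general property of this simplified form; whether it has anything to do with the failure described here is not established.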

So, this is my code;

(atom_t is unsigned long int)

  INLINE atom_t abstraction_cas( volatile atom_t *destination, atom_t exchange, atom_t compare )
  {
    atom_t
      original_destination,
      stored_flag;

    __sync_synchronize();

    __asm__ __volatile__
    (
      "dmb;"
      :
      :
      : "memory"
    );

    do
    {
      __asm__ __volatile__
      (
        "ldrex %1, [%2];"
        "mov %0, #0;"
        "teq %1, %3;"
        "strexeq %0, %4, [%2];"

        // output
        : "=&r" (stored_flag), "=&r" (original_destination)

        // input
        : "r" (destination), "Ir" (compare), "r" (exchange)

        // clobbered
        : "memory", "cc"
     );
    }
    while( stored_flag == 1 );

    __asm__ __volatile__
    (
      "dmb;"
      :
      :
      : "memory"
    );

    __sync_synchronize();

    return( original_destination );
  }


Any thoughts, ideas or suggestions?

I left the __sync_synchronize() in for now - doesn't do any harm.  I think it may also act as a compiler re-ordering barrier, but the __asm__ I think does the same.

I have a specific question: what's the "Ir" for with (compare) in the input list?  "I" in the GCC docs for ARM assembly indicates something apparently completely unrelated to what I'm doing here.  Using it or not using it appears to make no difference, and the Linux kernel code uses it (which is why I'm still using it).
  • Note: This was originally posted on 21st November 2009 at http://forums.arm.com

    Bit more info;

    With two, three, four, eight, twelve or sixteen threads, the test detects a loop between about 70,000 and 1,200,000 pops.

    However, with two threads, the test passes fairly often.  I've never seen a test pass with three or more threads (almost all of my testing has been done with four threads, which is one per core).

    With one thread of course the test always passes.

    I've removed the second dmb (after the ldrex/strex); I think it's not necessary.  I've also removed the __sync_synchronize() calls.
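
For anyone wanting to reproduce the check, a loop in a singly linked freelist can be detected with the usual two-pointer (Floyd) walk. This is only a sketch under assumed types; the node struct and the name has_loop are hypothetical and not taken from the actual test code, which may detect the loop differently.

```c
#include <stddef.h>

/* Hypothetical node type; the real freelist element is not shown in the post. */
struct node { struct node *next; };

/* Returns 1 if following next pointers from head never reaches NULL
   (i.e. the list contains a loop), otherwise 0. */
static int has_loop(const struct node *head)
{
  const struct node *slow = head, *fast = head;

  while( fast != NULL && fast->next != NULL )
  {
    slow = slow->next;         /* advances one element */
    fast = fast->next->next;   /* advances two elements */

    if( slow == fast )
      return 1;                /* the pointers met, so there is a loop */
  }

  return 0;                    /* reached the end of the list */
}
```

A first element that points to itself, as described above, is caught on the first iteration.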