I got a really weird memory barrier test result on Cortex-A9

  • Note: This was originally posted on 27th November 2011 at http://forums.arm.com

    Today I ran more tests. No matter how I insert DMB or DSB as a pair, the result is the same: the kernel panic is triggered. Can anyone explain it?
  • Note: This was originally posted on 27th November 2011 at http://forums.arm.com

    What loop values does this fail with?

    I can't see a terminating condition, so my best guess is that it runs correctly until you overflow the 32-bit integer.

    0x0 is indeed less than 0xFFFFFFFE, so it panics ...
  • Note: This was originally posted on 28th November 2011 at http://forums.arm.com

    So DMB is obviously enough here, with lower overhead?


    Yes, DMB is what you want here. DSB additionally stalls the core: no later instruction executes until all outstanding memory accesses have completed, which is obviously more expensive.
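
    As a rough illustration of using barriers as a pair, here is a minimal sketch in portable C11 rather than raw DMB instructions. `atomic_thread_fence` is standard C11, and ARMv7 compilers typically lower these fences to DMB. The names `writer_step`, `reader_check` and `fence_pair_smoke_test` are hypothetical, and the reader order (int2 first, then int1) is an assumption about what the test intends, since the original test code is not shown in this thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two counters used in the test. */
static atomic_uint int1;
static atomic_uint int2;

/* Writer: int1 is always incremented before int2. */
void writer_step(void)
{
    atomic_fetch_add_explicit(&int1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* first half of the pair */
    atomic_fetch_add_explicit(&int2, 1, memory_order_relaxed);
}

/* Reader (assumed order): samples int2 first, then int1. */
bool reader_check(void)
{
    unsigned seen2 = atomic_load_explicit(&int2, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* second half of the pair */
    unsigned seen1 = atomic_load_explicit(&int1, memory_order_relaxed);
    return seen1 >= seen2;  /* counter wraparound is ignored in this sketch */
}

/* Single-threaded smoke test; the interesting case needs two threads. */
void fence_pair_smoke_test(void)
{
    assert(reader_check());
    writer_step();
    assert(reader_check());
}
```

    With a single writer, the fence on each side is what lets the reader rely on "int1 was bumped before int2". Dropping either half removes that guarantee.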

    How can cpu0 get s_byt_test_int1 > s_byt_test_int2?


    After the CPU reorders the accesses you actually end up with:

    thread1:

    int1++;
    thread1_barrier;
    int2++;


    thread2:

    read int1;
    read int2;


    If the first thread runs a couple of times between the two reads in the second thread -> explosion. Remember you have no locks here, so there is no guarantee the two threads run in lock-step.
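
    The interleaving described above can be replayed deterministically in a single thread. This is only a sketch: the function name `reader_sees_int2_ahead` and the schedule encoding are made up for illustration, and the failure condition (the reader's int2 sample being larger than its int1 sample) is taken from the discussion in this thread rather than from the original test code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Replay a hand-written schedule (hypothetical encoding):
 *   'W' = writer runs one full iteration (int1++, then int2++)
 *   '1' = reader samples int1
 *   '2' = reader samples int2
 * Returns true when the reader's sample of int2 is larger than its
 * sample of int1, i.e. the condition treated as a failure here.
 */
bool reader_sees_int2_ahead(const char *schedule)
{
    unsigned int1 = 0, int2 = 0;      /* counters start equal */
    unsigned seen1 = 0, seen2 = 0;    /* what the reader sampled */

    assert(schedule != NULL);
    for (const char *p = schedule; *p != '\0'; ++p) {
        switch (*p) {
        case 'W': int1++; int2++; break;
        case '1': seen1 = int1;   break;
        case '2': seen2 = int2;   break;
        default:  break;          /* ignore unknown characters */
        }
    }
    return seen2 > seen1;
}
```

    For example, the schedule "1W2" (writer runs between the two reads) returns true, while "12" and "2W1" return false.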

    Putting the two variables in the same cache line gives much better results in these tests than keeping them in separate lines (see the results of step 1 and step 2). How do you explain that?

    Sharing a cache line is more likely to make the two threads run in lockstep - the core has to acquire the cache line before it can process the load or store, which will stop the other thread doing a load or a store.
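
    For reference, the two layouts being compared can be sketched like this. The 32-byte line size is an assumption (it is the L1 line size of the Cortex-A9), and the struct names and the `line_of` helper are hypothetical; the actual test may place its variables differently.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 32  /* assumed L1 line size in bytes (Cortex-A9) */

/* Both counters inside one cache line. */
struct same_line {
    alignas(CACHE_LINE) unsigned int1;
    unsigned int2;
};

/* Each counter starts its own cache line. */
struct separate_lines {
    alignas(CACHE_LINE) unsigned int1;
    alignas(CACHE_LINE) unsigned int2;
};

/* Cache line index, relative to the struct start, of a byte offset. */
size_t line_of(size_t offset)
{
    return offset / CACHE_LINE;
}

static_assert(sizeof(struct same_line) == CACHE_LINE,
              "both counters fit in one line");
static_assert(sizeof(struct separate_lines) == 2 * CACHE_LINE,
              "one line per counter");
```

    In the first layout a core that owns the line owns both counters at once; in the second, ownership of int1 and int2 can move between cores independently.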

    Secondly you assigned a larger gap between int1 and int2 as a starting condition, so in the case above the first thread running once between two reads isn't enough to trigger the error, it has to run three times to make int2 larger than int1, so you are less likely to hit this condition.
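
    The arithmetic behind the gap argument, as a small sketch: if int1 starts `gap` ahead of int2 and the writer completes `runs` full iterations between the reader's two reads, the reader only sees int2 larger than int1 when `runs` is greater than `gap`. The helper names are hypothetical, and the gap of 2 used in the examples is an assumption inferred from the "three times" above, since the actual starting values are not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: int1 starts `gap` ahead of int2, the reader samples
 * int1, the writer then completes `runs` full iterations (each bumps both
 * counters), and only then does the reader sample int2.
 */
bool check_fails(unsigned gap, unsigned runs)
{
    unsigned int2 = 100;            /* arbitrary starting value */
    unsigned int1 = int2 + gap;
    unsigned seen1 = int1;          /* first read */

    for (unsigned i = 0; i < runs; ++i) {
        int1++;
        int2++;
    }

    unsigned seen2 = int2;          /* second read, after the writer ran */
    return seen2 > seen1;
}

/* Usage: the cases discussed above. */
void gap_examples(void)
{
    assert(check_fails(0, 1));      /* equal start: one run is enough */
    assert(!check_fails(2, 1));     /* gap of 2: one run is not enough */
    assert(!check_fails(2, 2));
    assert(check_fails(2, 3));      /* gap of 2: it takes three runs */
}
```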

    Iso
  • Note: This was originally posted on 28th November 2011 at http://forums.arm.com

    @isogen74, thank you very much for your reply. I understand most of your points, but I can't quite follow the last one. Can you please correct me?

    1. int1 and int2 are in different cache lines (suppose they are in the index 0 and index 1 cache lines):
    (1). int1 and int2 are both equal to 10.
    (2). int1 is in cache line index 0 of cpu1, and int2 is in cache line index 1 of cpu1.
    (3). int1 is in cache line index 0 of cpu0, and int2 is in cache line index 1 of cpu0.
    (4). Suppose that at this point cpu1 adds 1 to int1. This updates cache line index 0 of cpu1 and asks cpu0 to invalidate its cache line index 0. After this step int1 is 11. I think this step is done by the hardware automatically, right?
    (5). Suppose cpu0 starts to read int1. Because cache line index 0 was invalidated in step (4), cpu0 has to get the value from memory, and in the end it gets the value 11.
    (6). Suppose cpu0 is busy with other bus traffic and is delayed in getting the next value, int2.
    (7). cpu1 starts to update int2, the same way as in step (4). After this step int2 is 11 and cache line index 1 of cpu0 is invalidated.
    (8). cpu1 updates int1 to 12 and then updates int2 to 12. After that, both cache line index 0 and index 1 are invalidated in cpu0.
    (9). Now cpu0 starts to get the value of int2. It tries to get it from memory and fills cache line index 1. In the end it gets the value 12 and triggers the kernel panic.

    2. int1 and int2 are in the same cache line (suppose both of them are in index 0):
    (1). int1 and int2 are both equal to 10.
    (2). int1 and int2 are in cache line index 0 of cpu1.
    (3). int1 and int2 are in cache line index 0 of cpu0.
    (4). Suppose that at this point cpu1 adds 1 to int1. This updates cache line index 0 of cpu1 and asks cpu0 to invalidate its cache line index 0. After this step int1 is 11. I think this step is done by the hardware automatically, right?
    (5). Suppose cpu0 starts to read int1. Because cache line index 0 was invalidated in step (4), cpu0 has to get the value from memory, and in the end it gets the value 11.
    (6). Suppose cpu0 is busy with other bus traffic and is delayed in getting the next value, int2.
    (7). cpu1 starts to update int2, the same way as in step (4). After this step int2 is 11 and cache line index 0 of cpu0 is invalidated.
    (8). cpu1 updates int1 to 12 and then updates int2 to 12. These two actions update the same cache line of cpu1 and ask cpu0 to invalidate the same cache line. But what is the difference between this step and step 1.(8)?
    (9). Now cpu0 starts to get the value of int2. It tries to get it from memory and fills cache line index 0. In the end it gets the value 12 and triggers the kernel panic.

    In both scenarios, the timing of step (6) is very important for the test result. My guess is that cpu1 is doing the writes, and because it always hits in the cache, it is much faster than the reads, so several writes may happen between the two reads.

    In scenario 1, when writing int1, cpu1 asks cpu0 to invalidate cache line 0; when writing int2, cpu1 asks cpu0 to invalidate cache line 1.

    In scenario 2, when writing int1, cpu1 asks cpu0 to invalidate cache line 0; when writing int2, cpu1 again asks cpu0 to invalidate cache line 0. Will these two invalidations be merged into one action?

    How do we get the result that in scenario 2 cpu0 (the "read thread") is blocked for less time than in scenario 1?

    And is this blocking caused by the cache line invalidation requests issued by cpu1?
  • Note: This was originally posted on 28th November 2011 at http://forums.arm.com

    > how to get the result that in scenario 2 the cpu0("read thread") are blocked less time than in scenario 1 ?

    I'm not sure it necessarily is "less time" - it's just "different" and different enough that you hit the race condition more often. It's really hard to say what is actually happening inside the core as the internal micro-architecture is relatively complex (and I'm a software guy).

    The software without barriers is broken, so not worth spending too long explaining why it breaks a particular way, right ... =)

    Cheers,
    Iso
  • Note: This was originally posted on 29th November 2011 at http://forums.arm.com

    Thank you @isogen74. You are right, I only need to remember to use memory barrier instructions as a pair.