
ARM11 MPcore and stale cacheline

Note: This was originally posted on 18th February 2013 at http://forums.arm.com

Hello all,

We are in the middle of porting FreeBSD to the ARM11 MPcore CPU.

Unfortunately, we are stuck on a strange (probably) stale cacheline issue.

In short, after a specific write pattern is performed on the first core and a single write on the second, we get a stale cacheline on the first one. The write (and yes, it is followed by a DSB) from the second core is not visible on the first CPU. But after executing a DMB on the first core, we get the actual data.
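
If it helps before the full pseudocode, the pattern is roughly the following (an illustrative sketch only; the variable name and the CP15 barrier encodings are just how I would write it here, not our actual FreeBSD code):

volatile uint32_t flag;  /* lives in the shared, non-aliased 1MB section */

/* second core */
flag = 1;
__asm __volatile("mcr p15, 0, %0, c7, c10, 4" :: "r"(0) : "memory");  /* DSB */

/* first core, after performing the problematic write pattern */
uint32_t a = flag;  /* still reads the stale 0 */
__asm __volatile("mcr p15, 0, %0, c7, c10, 5" :: "r"(0) : "memory");  /* DMB */
uint32_t b = flag;  /* now reads 1 */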

We have verified that both cores are in SMP mode and that the accessed memory is mapped using a 1MB section with the shared bit set, without any aliasing. The hardware is a Cavium Networks CNS3420 dual-core ARM11 MPcore CPU, revision r2p0.

Unfortunately, we don't have access to any ARM11 MPcore errata. Is there any erratum that could cause this problem? Is it possible to get the errata sheet even though we are not an ARM customer?

We can post the pseudocode that triggers the issue here, if it's necessary/required.

Many thanks

Michal Meloun
  • Note: This was originally posted on 19th February 2013 at http://forums.arm.com

    First question (not that it matters) - why does your cache array have 5 ways? The ARM11 MP only has 4 ...

    Looking at the point at hand, why is this result surprising? The IPI handler on cpu1 first writes ...

    test_arr[2][0][3] += 0x1000000;
    test_arr[4][0][3] += 0x1000000;


    Then writes ...

    WAIT_ITEM = 1;

    Your code sees test_arr being incremented before it sees WAIT_ITEM being incremented, but again this is totally valid behavior (indeed, in this case it must happen this way because of the barriers in the IPI handler between the two).
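
    To make that concrete, the writer side is effectively doing something like this (a sketch only; I'm assuming the raw ARMv6 CP15 barrier encoding rather than whatever macro your port actually uses):

    /* IPI handler on cpu1 -- sketch, not your actual code */
    test_arr[2][0][3] += 0x1000000;
    test_arr[4][0][3] += 0x1000000;
    /* DMB: order the increments before the flag write (ARMv6 encoding, assumed) */
    __asm __volatile("mcr p15, 0, %0, c7, c10, 5" :: "r"(0) : "memory");
    WAIT_ITEM = 1;

    Any observer that sees WAIT_ITEM == 1 must therefore also see both increments, but nothing prevents the increments from becoming visible earlier than the flag.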

    In the run that fails ...

    896 (0xc125ec00:cpu0): 0x00000000 0x00010003 0xD7020003 0x00030003 0xD6040003
    897 (0xc125ec00:cpu0): 0x00000000 0x00010003 0xD7020003 0x00030003 0xD7040003
    898 (0xc125ec00:cpu0): tmp1:0, tmp2:0, tmp3: 1



    ... the memory dumps look perfectly fine to me.

    For the first row of data (sampled after tmp1) the WAIT_ITEM is zero, so there is no guarantee that the increment has happened yet. However this is not the same as a guarantee that it hasn't. It seems that the first increment has happened by this point, but the second hasn't. In the absence of additional synchronization across the threads this is valid.


    For the second row of data (sampled after tmp3) the WAIT_ITEM is now one, so there is a guarantee (because of the barriers in the IPI thread) that the increments have already happened. This is correctly reflected in the data.

    The instruction ordering in each thread seems to be doing what you expect, but it looks like you are trying to infer cross-core data visibility constraints from one thread's local memory barriers. No amount of barriers or ordering limits is going to help here; this is not a single thread's instruction ordering problem, but a cross-thread synchronization problem, i.e. you need a semaphore or condition variable. If the test thread busy-waits on WAIT_ITEM becoming 1 (using it as an inefficient spinlock-style condition variable) and then captures the memory, it should always get the incremented value, as in the sketch below.
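
    A minimal sketch of that reader side (again assuming the raw ARMv6 CP15 DMB encoding; substitute whatever barrier primitive your port provides):

    /* test thread on cpu0 -- WAIT_ITEM must be volatile (or read through a
     * compiler barrier) so the loop really re-loads it each iteration */
    while (WAIT_ITEM == 0)
        ;  /* busy-wait for the IPI handler's flag */
    __asm __volatile("mcr p15, 0, %0, c7, c10, 5" :: "r"(0) : "memory");  /* DMB: order the flag read before the data reads */
    tmp = test_arr[4][0][3];  /* guaranteed to observe the increment */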

    Can you explain what you were expecting the data to look like, and why? Perhaps that will help narrow down where your faulty assumption is.

    HTH,
    Iso