
ARM11 MPCore and stale cacheline

Note: This was originally posted on 18th February 2013 at http://forums.arm.com

Hello all,

We are in the middle of porting FreeBSD to an ARM11 MPCore CPU.

Unfortunately, we are stuck on a strange issue, probably a stale cacheline.

In short, after a specific write pattern on the first core and a single write on the second, we get a stale cacheline on the first one. The write from the second core (and yes, it is followed by a DSB) is not visible on the first CPU. But after executing a DMB on the first core, we get the current data.

We have verified that both cores are in SMP mode and that the accessed memory is mapped using a 1MB section with the shared bit set, without any aliasing. The hardware is a Cavium Networks CNS3420, a dual-core ARM11 MPCore CPU, revision r2p0.

Unfortunately, we don't have access to any ARM11 MPCore errata. Is there any erratum that can cause this problem? Is it possible to get the errata sheet even though we are not an ARM customer?

We can post pseudocode that triggers the issue here, if necessary.

Many thanks

Michal Meloun
  • Note: This was originally posted on 22nd February 2013 at http://forums.arm.com


    I'm not 100% sure it applies to ARM11 MPCore, but on ARMv7A it is not architecturally valid to use clean and invalidate of the whole cache once the CPU is running; you have to do it by set-way or the SCU doesn't necessarily pick up the snoop correctly.

    Good catch, thanks. The reference manual is a bit cryptic on this point (at least for me), so I totally missed this fact. Unfortunately, replacing the full cache maintenance operations with a full set/way loop has no effect.
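
For readers unfamiliar with a "full set/way loop", the sketch below shows the general shape of one. It is not the code used in this thread. The geometry is an assumption: a 4-way cache with 32-byte lines, with the way number in bits [31:30] of the operand and the set index starting at bit 5. The names `setway_operand`, `for_each_setway` and `record_operand` are hypothetical, and the cache size must come from the actual part. On the target, the callback would wrap the privileged clean-and-invalidate-by-set/way operation; here a host-side stand-in only records the operand.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1 data cache geometry; NOT taken from the thread. */
#define LINE_SHIFT 5u   /* 32-byte cache lines */
#define NUM_WAYS   4u   /* 4-way set associative */
#define WAY_SHIFT  30u  /* way number in operand bits [31:30] */

/* Build the operand for one set/way maintenance operation. */
static uint32_t setway_operand(uint32_t way, uint32_t set)
{
    return (way << WAY_SHIFT) | (set << LINE_SHIFT);
}

/*
 * Call op() once per cache line of a cache of cache_bytes bytes.
 * On the target, op() would wrap something like
 *   mcr p15, 0, <Rd>, c7, c14, 2
 * (clean and invalidate data cache line by set/way), in privileged mode.
 * Returns the number of operations issued.
 */
static size_t for_each_setway(uint32_t cache_bytes, void (*op)(uint32_t))
{
    uint32_t num_sets = cache_bytes / (NUM_WAYS << LINE_SHIFT);
    size_t count = 0;

    for (uint32_t way = 0; way < NUM_WAYS; way++) {
        for (uint32_t set = 0; set < num_sets; set++) {
            op(setway_operand(way, set));
            count++;
        }
    }
    return count;
}

/* Host-side stand-in for the real maintenance operation. */
static uint32_t last_operand;

static void record_operand(uint32_t value)
{
    last_operand = value;
}
```

With an assumed 16KB cache, `for_each_setway(16u * 1024u, record_operand)` walks 4 ways of 128 sets each.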


    Additionally why do you have the clean and invalidates everywhere? They shouldn't be needed. The SCU hardware should ensure everything syncs.

    It was added to "be really sure" and to "put the core into a more defined state" when we made the testcase. The whole synchronization sequence can be replaced by a DSB without any effect.


    ... and to check the obvious - you have marked these pages as shared in the MMU, and enabled the SCU?

    Checked using the CP15 VA to PA translation ("mrc  p15, 0, %0, c7, c4, 0"). Both cores return the same value "": normal memory, shared, inner and outer WB WA (beware, ARMv6K uses a different format than ARMv7-A).

    389 (0xc114c000:cpu0):  cache_test: WAIT_ITEM va: 0xC0484000 -> pa: 0x20484194
    390 (0xc114a900:cpu1): intr_event_handle: exec 0xc005b4dc(0xf) for ipi_test
    391 (0xc114a900:cpu1):  ipi_test_handler: WAIT_ITEM va: 0xC0484000 -> pa: 0x20484194
    392 (0xc114c000:cpu0): 0x00000000 0x00010003 0x17020003 0x00030003 0x16040003
    393 (0xc114c000:cpu0): 0x00000000 0x00010003 0x17020003 0x00030003 0x17040003


    When we get stale data, any of the following actions helps (all on cpu0):
    - A DMB.
    - Reading at least 4 words that share the cache index of the wait variable but lie in different ways.
    - Any write to another word in the same cacheline.
    - A cacheline clean and invalidate by MVA of the wait variable.
    Repeated reads from any word in the same cacheline, or a longer timeout, do not help.
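
To make "same cache index" concrete, here is a minimal sketch of how addresses that map to the same set can be computed. It assumes a 4-way cache with 32-byte lines, so addresses one way-size apart share a set index. The cache size is a parameter because the thread does not state it, and the helper names `set_index` and `same_index_addrs` are hypothetical. Reading that this pattern displaces the line holding the wait variable is an interpretation, not something the thread confirms.

```c
#include <stdint.h>

/* Assumed cache geometry; NOT taken from the thread. */
#define LINE_BYTES 32u
#define NUM_WAYS   4u

/* Set index that addr maps to in a cache of cache_bytes bytes. */
static uint32_t set_index(uint32_t addr, uint32_t cache_bytes)
{
    uint32_t way_bytes = cache_bytes / NUM_WAYS;

    return (addr % way_bytes) / LINE_BYTES;
}

/*
 * Fill out[0..n-1] with addresses that share addr's set index but
 * belong to different cache lines (each one way-size further on).
 */
static void same_index_addrs(uint32_t addr, uint32_t cache_bytes,
                             uint32_t *out, unsigned n)
{
    uint32_t way_bytes = cache_bytes / NUM_WAYS;

    for (unsigned i = 0; i < n; i++)
        out[i] = addr + (i + 1u) * way_bytes;
}
```

For an assumed 16KB cache the way size is 4KB, so the candidates for a variable at 0xC0484000 start at 0xC0485000 and continue in 4KB steps.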
  • Note: This was originally posted on 18th February 2013 at http://forums.arm.com

    How do you know that it is a stale line and not a synchronisation problem?
    The DMB may just be preventing some form of re-ordering or changing the timing, so that you get the new(er) data.
  • Note: This was originally posted on 19th February 2013 at http://forums.arm.com

    First question (not that it matters) - why does your cache array have 5 ways? The ARM11 MP only has 4 ...

    Looking at the point at hand, why is this result surprising? The IPI handler on cpu1 first writes ...

    test_arr[2][0][3] += 0x1000000;
    test_arr[4][0][3] += 0x1000000;


    Then writes ...

    WAIT_ITEM = 1;

    Your code sees test_arr being incremented before it sees WAIT_ITEM being incremented, but again this is totally valid behavior (indeed, in this case it must happen this way because of the barriers in the IPI handler between the two).

    In the run that fails ...

    896 (0xc125ec00:cpu0): 0x00000000 0x00010003 0xD7020003 0x00030003 0xD6040003
    897 (0xc125ec00:cpu0): 0x00000000 0x00010003 0xD7020003 0x00030003 0xD7040003
    898 (0xc125ec00:cpu0): tmp1:0, tmp2:0, tmp3: 1



    ... the memory dumps look perfectly fine to me.

    For the first row of data (sampled after tmp1) the WAIT_ITEM is zero, so there is no guarantee that the increment has happened yet. However this is not the same as a guarantee that it hasn't. It seems that the first increment has happened by this point, but the second hasn't. In the absence of additional synchronization across the threads this is valid.


    For the second row of data (sampled after tmp3) the WAIT_ITEM is now one, so there is a guarantee (because of the barriers in the IPI thread) that the increments have already happened. This is correctly reflected in the data.

    The instruction ordering in each thread seems to be doing what you expect, but it looks like you are trying to infer cross-core data visibility constraints from one thread's local memory barriers. No amount of barriers or ordering limits are going to help here; this is not a single thread's instruction ordering problem, but a cross-thread synchronization problem; i.e. you need a semaphore or condition variable. If the test thread busy waits on WAIT_ITEM becoming 1 (using it as an inefficient spinlock-style condition variable), and then captures the memory it should always get the incremented value.
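
The busy-wait handshake described above can be sketched in portable C11 roughly as follows. This is a user-space illustration built with pthreads, not the FreeBSD kernel test from this thread. The names `test_data`, `wait_item`, `ipi_like_writer` and `run_handshake` are hypothetical stand-ins for `test_arr`, `WAIT_ITEM`, the IPI handler and the test thread. The release store and acquire load play the role that explicit barriers around the flag accesses would play in kernel code; that mapping is an assumption.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static uint32_t test_data[2];  /* hypothetical stand-in for test_arr entries */
static atomic_int wait_item;   /* hypothetical stand-in for WAIT_ITEM */

/* Plays the role of the IPI handler on cpu1. */
static void *ipi_like_writer(void *arg)
{
    (void)arg;
    test_data[0] += 0x1000000;
    test_data[1] += 0x1000000;
    /* Release store: both increments are ordered before the flag write. */
    atomic_store_explicit(&wait_item, 1, memory_order_release);
    return NULL;
}

/*
 * Plays the role of the test thread on cpu0. Spins until the flag is
 * set, then copies the data into snapshot. Returns 0 on success and
 * -1 if the writer thread could not be started.
 */
static int run_handshake(uint32_t snapshot[2])
{
    pthread_t writer;

    test_data[0] = 0x16020003;
    test_data[1] = 0x16040003;
    atomic_store_explicit(&wait_item, 0, memory_order_relaxed);

    if (pthread_create(&writer, NULL, ipi_like_writer, NULL) != 0)
        return -1;

    /* Acquire load: once 1 is seen, the writer's increments are visible. */
    while (atomic_load_explicit(&wait_item, memory_order_acquire) == 0)
        ;  /* busy wait */

    snapshot[0] = test_data[0];
    snapshot[1] = test_data[1];

    pthread_join(writer, NULL);
    return 0;
}
```

Because the data is read only after the flag has been observed as 1, the snapshot should always contain both incremented values. On older toolchains this may need to be built with `-pthread`.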

    Can you explain what you were expecting the data to look like, and why? Perhaps that will help narrow down where the faulty assumption is.

    HTH,
    Iso
  • Note: This was originally posted on 21st February 2013 at http://forums.arm.com

    One thing I'd like to check.

    I'm not 100% sure it applies to ARM11 MPCore, but on ARMv7A it is not architecturally valid to use clean and invalidate of the whole cache once the CPU is running; you have to do it by set-way or the SCU doesn't necessarily pick up the snoop correctly.

    Additionally why do you have the clean and invalidates everywhere? They shouldn't be needed. The SCU hardware should ensure everything syncs.

    ... and to check the obvious - you have marked these pages as shared in the MMU, and enabled the SCU?
  • Note: This was originally posted on 24th February 2013 at http://forums.arm.com

    On the errata front, the SoC vendor should be able to provide the relevant errata version - that is the "official" channel for getting the errata for your device.

    If you get really stuck you might be able to get access via the ARM support team (support@arm.com).

    Iso
  • Note: This was originally posted on 19th February 2013 at http://forums.arm.com

    Thanks for sharing.