Cortex A8 preload engine (PLE) error

Note: This was originally posted on 24th November 2011 at http://forums.arm.com

I have a user-mode Linux application running on a Cortex-A8 (a TI 8148 DaVinci chip). I have a shared memory region that I'm using to pass data back and forth between the ARM core and the TI C674x DSP. The shared memory region is a ring buffer made of 32k segments (the size of one of the 8148's L2 cache ways). I've locked down 3 of the L2 cache ways and I'm trying to use the L2 PLE (preload engine) - the L2 feature accessed through coprocessor 15 c11 - to asynchronously preload and write back the ring buffer segments. The ring buffer itself is located in physically and virtually contiguous memory - we're using TI's cmem module to allocate out of a memory hole. Moreover, I've checked the Linux struct page flags for the ring buffer pages and they all seem to be uniform and fairly kosher. Plain-vanilla loads and stores from the ring buffer work just fine, as do coprocessor 15 based cache writeback operations (performed in privileged mode, of course).

Anyway, everything goes quite nicely for a while (anywhere from 3 to 10 PLE transfers complete successfully) until a PLE transfer errors out at a page boundary. It's a different page boundary (both virtual and physical address) each time, and the failure occurs a different number of ring buffer segments in, and a different number of pages into the segment, each time. The error itself, from table 3-132 in the ARM Cortex-A8 Technical Reference Manual, is "b1000101", or "translation fault, section".

Does anyone know what this error means? At first I thought that maybe it was because the page was marked as uncached, but looking at the page properties (with /proc/kpageflags), that doesn't seem to be the case.

Edit: One more detail - this failure only happens with preload operations - not writebacks. Or at least I haven't seen it happen with a writeback yet.
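
For readers following along, the "b1000101" status quoted above can be decoded mechanically. This is only a rough sketch: it assumes (the thread does not confirm this) that the low four bits of the PLE status follow the ARMv7 short-descriptor fault-status encoding, where 0b0101 is a section translation fault and 0b0111 is a page translation fault. The names `parse_binary` and `classify_fault` are made up for illustration; check the TRM table for the real field layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum fault_kind {
    FAULT_SECTION_TRANSLATION,
    FAULT_PAGE_TRANSLATION,
    FAULT_OTHER
};

/* Parse a string of '0'/'1' characters, most significant bit first. */
static uint32_t parse_binary(const char *s)
{
    uint32_t value = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        assert(*s == '0' || *s == '1');
        value = (value << 1) | (uint32_t)(*s - '0');
    }
    return value;
}

/* ASSUMPTION: the low 4 bits use the short-descriptor FS[3:0] encoding.
 * This is not taken from the thread; verify against the TRM. */
static enum fault_kind classify_fault(uint32_t status)
{
    switch (status & 0xFu) {
    case 0x5u: return FAULT_SECTION_TRANSLATION;
    case 0x7u: return FAULT_PAGE_TRANSLATION;
    default:   return FAULT_OTHER;
    }
}
```

Under that assumption, `classify_fault(parse_binary("1000101"))` yields `FAULT_SECTION_TRANSLATION`, which matches the description given in the post. A section-level translation fault means the first-level descriptor for the address was invalid in whichever page tables were being walked at the time.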
  • Note: This was originally posted on 25th November 2011 at http://forums.arm.com

    My guess is that you set the PLE running on one set of virtual addresses, and then your OS context switches and the CPU page tables change. The VA the PLE is trying to use doesn't exist in the new process's address map. It will always fail at the start of a page or PLE range, as this is the first time it will see a translation fault from the MMU.

    HTH,
    Iso
  • Note: This was originally posted on 25th November 2011 at http://forums.arm.com

    I'm wondering if maybe Linux is switching to another process which coincidentally DOES have that same VA mapped, and my L2 cache data is getting written out to the wrong place (i.e. a page in that other process)?

    Yes, corruption would certainly result if the VA->PA translation changed to something else and the PLE was still running.

    What's supposed to prevent this from happening?

    I suspect the answer is "software" =)

    I'm not a PLE expert, but AFAICT the PLE uses the same page tables as currently mapped on the core, so if the OS context switches from one process to another you have to either (1) stall the context switch waiting for the pending PLE requests to complete, (2) cancel pending PLE requests, or (3) "pause" the transfer, switch the process out, and "resume" when it gets switched back in again.

    Cheers,
    Iso
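
The three options listed above can be pictured as a small state machine hooked into the context switch. The sketch below is a pure software model: `ple_channel`, `ple_on_switch_out` and `ple_on_switch_in` are hypothetical names, no real PLE register is touched, and "stalling" is modelled by simply marking the transfer as finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a PLE channel; not a real register interface. */
enum ple_state { PLE_IDLE, PLE_RUNNING, PLE_PAUSED };

/* The three options from the post above. */
enum switch_policy { POLICY_STALL, POLICY_CANCEL, POLICY_PAUSE };

struct ple_channel {
    enum ple_state state;
    unsigned bytes_left;   /* work remaining in the current transfer */
};

/* Called before the page tables change. Returns true if the owner
 * has to reissue the transfer from scratch later. */
static bool ple_on_switch_out(struct ple_channel *ch, enum switch_policy p)
{
    assert(ch != NULL);
    if (ch->state != PLE_RUNNING)
        return false;

    switch (p) {
    case POLICY_STALL:
        /* Stand-in for polling the channel until it completes. */
        ch->bytes_left = 0;
        ch->state = PLE_IDLE;
        return false;
    case POLICY_CANCEL:
        /* Remaining work is thrown away. */
        ch->bytes_left = 0;
        ch->state = PLE_IDLE;
        return true;
    case POLICY_PAUSE:
        /* bytes_left is preserved for the later resume. */
        ch->state = PLE_PAUSED;
        return false;
    }
    return false;
}

/* Called once the owner's page tables are active again. */
static void ple_on_switch_in(struct ple_channel *ch)
{
    assert(ch != NULL);
    if (ch->state == PLE_PAUSED)
        ch->state = PLE_RUNNING;
}
```

Whichever policy is picked, the point is the same: the transfer must not be running while a different process's page tables are live.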
  • Note: This was originally posted on 26th November 2011 at http://forums.arm.com

    Or you can allocate the shared memory in kernel space, since the kernel-space mappings share the same MMU table entries in every Linux process.
  • Note: This was originally posted on 28th November 2011 at http://forums.arm.com



    I wonder - I've played around with this a bit and it seems that the PLE ContextID register might be the key here. I suspect that the ASID field in that register needs to match the ASID field in any TLB entries used by the PLE to do its address translation. With ARMv7 apparently the ASID is part of the TLB lookup - if the current contents of the global ContextID register (c13, c0) don't match the TLB ASID, then the TLB entry won't be a match. It seems like maybe the PLE ContextID register (c11, c15) might serve a similar purpose for these asynchronous PLE transfers.

    Unfortunately, Linux seems to change the ASID whenever it rolls over to 0 (it's an 8-bit counter), so I'm not sure I can guarantee that my process's ASID is always going to be the same. If not, I'd have to set the PLE ContextID ASID often enough to do reliable transfers - and the PLE ContextID register is only accessible in kernel mode. One of the big reasons I'm trying to use the PLE in the first place is to avoid an expensive syscall when writing back memory - it's fairly expensive on this platform (about 8000 cycles for a binary sysfs attribute access, and more for an ioctl or a character sysfs attribute access).
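
To make the rollover concern concrete, here is a rough model of it. The field split is the ARMv7 short-descriptor Context ID layout (bits [7:0] ASID, bits [31:8] PROCID). The allocator is a deliberately simplified illustration, not Linux's actual code: it assumes the hardware ASID lives in the low 8 bits of a counter, the upper bits act as a generation number, and ASID 0 is skipped. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ASID_BITS 8u
#define ASID_MASK ((1u << ASID_BITS) - 1u)

/* ARMv7 short-descriptor layout: bits [7:0] ASID, bits [31:8] PROCID. */
static uint32_t contextidr_asid(uint32_t contextidr)
{
    return contextidr & ASID_MASK;
}

static uint32_t contextidr_procid(uint32_t contextidr)
{
    return contextidr >> ASID_BITS;
}

/* Simplified allocator model (an assumption, not kernel code):
 * low 8 bits = hardware ASID, upper bits = generation. */
struct asid_allocator {
    uint32_t last;
};

static uint32_t asid_alloc(struct asid_allocator *a)
{
    assert(a != NULL);
    a->last++;
    if ((a->last & ASID_MASK) == 0)
        a->last++;   /* skip ASID 0; a real kernel would also flush TLBs */
    return a->last;
}

/* A cached ID can only be trusted while the generation is unchanged. */
static int asid_still_valid(const struct asid_allocator *a, uint32_t id)
{
    assert(a != NULL);
    return (id >> ASID_BITS) == (a->last >> ASID_BITS);
}
```

In this model the 256th allocation hands out hardware ASID 1 again in a new generation, so a value captured earlier and written into a separate register would silently refer to a different process.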


    The real problem that I'm having now seems to be writing the L1 cache back - I've figured out that most (all?) of the corruption I'm seeing now is due to writing the L2 cache back with the PLE but not the L1 cache.
  • Note: This was originally posted on 29th November 2011 at http://forums.arm.com


    Or you can allocate the shared memory in kernel space, since the kernel-space mappings share the same MMU table entries in every Linux process.


    Yes - I considered this too, but I'm not sure my Linux kernel-fu is quite advanced enough yet to accomplish this. We've been using a /dev/mem-like tool that TI provides called cmem to map contiguous memory chunks, and it doesn't provide the ability to use kernel logical mappings - just user mappings.
  • Note: This was originally posted on 29th November 2011 at http://forums.arm.com

    With ARMv7 apparently the ASID is part of the TLB lookup - if the current contents of the global ContextID register (c13, c0) don't match the TLB ASID, then the TLB entry won't be a match.

    Yes, the aim of the ASID is so that you don't have to flush the TLB on context switch. What I am unclear on is what happens when you get a TLB miss while the PLE is running. I assume it would perform a table walk using the current page tables, with the new TLB entry tagged with the ASID value from the ContextID register. That probably isn't what you want it to do (I guess you would want it to stop on an ASID mismatch for your use case).


    Unfortunately, Linux seems to change the ASID whenever it rolls over to 0


    Yes, that's the other issue. If you have more than 255 processes active at the same time you will get ASID rollover, so it is time variant.

    I think Jerry is on the right lines here; the usual approach to exposing this type of hardware is to provide a device driver, so user space allocates the memory via a kernel call to the driver, and performs special operations (start PLE transfer, for example) via a kernel call to the driver. This allows the kernel to have the memory mapped in its address space, which solves the changing page-table problem. You will need the kernel calls at the start and end of each PLE operation anyway, as you will need to issue appropriate L1 cache operations to ensure visibility of the data you've just shovelled into / want to shovel out of the L2.

    Cheers,
    Iso
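
As a rough illustration of the ordering such a driver would have to enforce, here is a sketch that only records the sequence of operations. Every function is a hypothetical stub standing in for privileged CP15 work; nothing here is a real kernel or TI API. The write-out ordering follows from the corruption reported earlier in the thread (L2 written back without L1). The preload ordering is an assumption about when stale L1 lines should be dropped.

```c
#include <assert.h>
#include <stddef.h>

enum step {
    STEP_L1_CLEAN,
    STEP_L1_INVALIDATE,
    STEP_PLE_START,
    STEP_PLE_WAIT
};

#define MAX_STEPS 8

/* Records the order in which the (stubbed) operations were issued. */
struct trace {
    enum step steps[MAX_STEPS];
    size_t count;
};

static void record(struct trace *t, enum step s)
{
    assert(t != NULL && t->count < MAX_STEPS);
    t->steps[t->count++] = s;
}

/* Hypothetical stubs; a real driver would issue CP15 operations here. */
static void l1_clean_range(struct trace *t)      { record(t, STEP_L1_CLEAN); }
static void l1_invalidate_range(struct trace *t) { record(t, STEP_L1_INVALIDATE); }
static void ple_start(struct trace *t)           { record(t, STEP_PLE_START); }
static void ple_wait(struct trace *t)            { record(t, STEP_PLE_WAIT); }

/* L2 -> memory: push dirty L1 lines into L2 first, then run the PLE. */
static void driver_write_out(struct trace *t)
{
    l1_clean_range(t);
    ple_start(t);
    ple_wait(t);
}

/* memory -> L2: ASSUMED ordering, drop stale L1 copies after the preload
 * so later loads see the freshly loaded L2 data. */
static void driver_preload(struct trace *t)
{
    ple_start(t);
    ple_wait(t);
    l1_invalidate_range(t);
}
```

Since both paths need privileged cache maintenance around the transfer, the kernel entry cannot be avoided by the PLE alone.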
  • Note: This was originally posted on 30th November 2011 at http://forums.arm.com

    Yes, the aim of the ASID is so that you don't have to flush the TLB on context switch. What I am unclear on is what happens when you get a TLB miss while the PLE is running. I assume it would perform a table walk using the current page tables, with the new TLB entry tagged with the ASID value from the ContextID register. That probably isn't what you want it to do (I guess you would want it to stop on an ASID mismatch for your use case).


    According to section 8.4.1 of the TRM, the PLE doesn't use the TLB and always walks the page table directly at the start of a transfer and at each 4KB boundary. This should mean that the PLE Context ID register is compared against the global Context ID register, which gives it the entire 32 bits and should avoid potential aliasing. I would guess that you should be setting the PLE Context ID register to the current value of the Context ID register. It could be that the PLE doesn't bother checking equivalence for the first page, which would explain why you succeed until the second page is hit. This would make sense, since the first table walk is done before any data is transferred and is then valid for the entire page regardless of whether or not a context switch occurs, and this first table walk may have to succeed before the start operation can finish.

    Section 8.4.5 does claim that if the Context ID register changes during a PLE operation, the result is unpredictable. You would think they're really referring to the PLE Context ID register, since otherwise I don't understand the point of having it in the first place (if the current process's one can't change).
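
To put a number on the section 8.4.1 behaviour described above: if the PLE walks the page table once at the start and once per 4KB boundary crossed, the walk count for a transfer is just the number of 4KB pages it touches. A small sketch (the function name is hypothetical, and 32-bit virtual addresses are assumed):

```c
#include <assert.h>
#include <stdint.h>

#define PLE_PAGE_SIZE 4096u

/* Number of table walks for a transfer of len bytes starting at va:
 * one at the start plus one for each 4KB boundary crossed. */
static uint32_t ple_walk_count(uint32_t va, uint32_t len)
{
    uint32_t first_page;
    uint32_t last_page;

    assert(len > 0);
    assert(va <= UINT32_MAX - (len - 1u));   /* range must not wrap */

    first_page = va / PLE_PAGE_SIZE;
    last_page  = (va + (len - 1u)) / PLE_PAGE_SIZE;
    return last_page - first_page + 1u;
}
```

For a page-aligned 32k ring buffer segment this gives 8 walks, so there are 8 points per segment at which a change of page tables could surface as a translation fault, which fits failures showing up only at page boundaries.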