SMP ARM cores hang when using DMA and two cores enabled

Hi,

I am experiencing a complete ARM core hang when both cores are enabled in SMP mode and DMA is in use.

I tested with Linux kernels 3.10, 4.1 and 4.6 in SMP mode.

The SoC is an Altera Cyclone V SoC FPGA with a dual-core Cortex-A9.

The DMA transfer goes from the DDR to the FPGA logic.

SignalTap shows that the FPGA logic is still running when the cores hang.

Running the kernel with maxcpus=1 in the command line makes the problem go away.

The two cores are connected via the L2 cache controller and the SCU to the switch fabric (NIC-301).

I added a cyclic history tracer that logs each operation and the pointer fed to the DMA, and enabled the watchdog. After the watchdog reset, I stop the boot process at U-Boot and inspect the log in memory: the addresses fed to the DMA are valid, so the software looks OK.
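A cyclic history tracer of this kind could look roughly like the sketch below. This is not the code used in the original test; every name, the entry layout, and the buffer size are assumptions made for illustration. It is single-writer: with both cores logging, the sequence counter would need a lock or an atomic increment, and for the log to be readable after a watchdog reset it would have to live in memory that the bootloader does not overwrite and be pushed out of the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: names and sizes are illustrative only. */
#define TRACE_ENTRIES 256u         /* must be a power of two */
#define TRACE_MAGIC   0x54524143u  /* "TRAC": helps locate the log post-mortem */

static_assert((TRACE_ENTRIES & (TRACE_ENTRIES - 1u)) == 0,
              "TRACE_ENTRIES must be a power of two");

enum trace_op { TRACE_DMA_MAP = 1, TRACE_DMA_START = 2, TRACE_DMA_DONE = 3 };

struct trace_entry {
    uint32_t seq;   /* monotonically increasing sequence number */
    uint32_t op;    /* one of enum trace_op */
    uint32_t addr;  /* pointer handed to the DMA (32-bit bus address) */
};

struct trace_log {
    uint32_t magic;
    uint32_t next_seq;                    /* sequence number of the next entry */
    struct trace_entry e[TRACE_ENTRIES];  /* oldest entries get overwritten */
};

static void trace_init(struct trace_log *t)
{
    memset(t, 0, sizeof(*t));
    t->magic = TRACE_MAGIC;
}

/* Record one operation; wraps around and overwrites the oldest slot. */
static void trace_record(struct trace_log *t, uint32_t op, uint32_t addr)
{
    struct trace_entry *e = &t->e[t->next_seq & (TRACE_ENTRIES - 1u)];

    e->seq = t->next_seq;
    e->op = op;
    e->addr = addr;
    t->next_seq++;
}

/* Return the most recent entry, or NULL if nothing has been logged. */
static const struct trace_entry *trace_last(const struct trace_log *t)
{
    if (t->next_seq == 0)
        return NULL;
    return &t->e[(t->next_seq - 1u) & (TRACE_ENTRIES - 1u)];
}
```

Calling something like `trace_record(&log, TRACE_DMA_START, bus_addr)` right before each DMA kick leaves the last `TRACE_ENTRIES` operations in memory, and the sequence numbers show where the wrap point is when the buffer is dumped afterwards.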

Taken together, these observations shift the suspicion to the ARM cores themselves.

There are errata describing hangs or memory corruption caused by race conditions between the cache management of the two Cortex-A9 cores.

I have tried applying the workarounds for ARM Cortex-A9 errata 761320, 845369, 764369 and 794072, but I am still experiencing the hang.

I can try turning on, one by one, the bits in the diagnostic debug register mentioned in those errata. Before I do that, I would be glad for any help from anyone who has experienced something similar with other Cortex-A9 SoCs and is aware of additional errata relating to SMP cache coherency or cache management race conditions that might help solve the issue.

Thanks,

Elad.

  • Do you have any means to see if the memory system is still alive? One common cause for CPUs to hang is when they are stuck waiting for external AXI memory requests which never get a response.

    Secondly - is the DMA bypassing the CPU completely, or wired in through the ACP port? If the DMA bypasses the CPU completely it's hard to see where interaction with the CPU comes from except through side-effects at the memory system level.

    Are the CPUs and DMA accessing the same data at the same time when the crash happens or doing something unrelated?

    HTH,
    Pete

  • Hi Peter,

    If I connect a DS-5 DSTREAM hardware debugger, it cannot halt the cores once the hang occurs, so I cannot inspect the memory system or retrieve the trace after everything hangs (I am not aware of a way to stream trace from DS-5 without stopping the target).

    If we connect SignalTap, it still shows the FPGA being alive after the hang.

    Other than that, I have no way to validate the memory system.

    The DMA bypasses the CPU completely; we do not use the ACP for this specific DMA. On the other hand, that requires us to flush the cache before we start the DMA operation.

    The CPUs and the FPGA DMA are not supposed to be accessing the same data in the DDR at the same time when the hang occurs, but they may be working as a pipeline (the FPGA is DMAing one batch of data while the CPU is preparing the next). The only interaction is through the FPGA's memory-mapped I/O, which holds the buffer descriptors; each descriptor has ownership bits that indicate who owns it, the CPU or the FPGA.

    Thanks,

    Elad.
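
For readers following the thread, the ownership-bit handoff described in the last reply can be sketched as below. The actual FPGA register map is not given here, so the descriptor layout, the bit position, and all function names are assumptions. The descriptor is modelled as plain memory so the sketch compiles on a host; in a Linux driver the fence would normally correspond to cleaning the data buffer out of the cache (for example through the DMA mapping API) and writing the descriptor with `writel()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor layout; not taken from the real FPGA design. */
#define BD_OWN_FPGA (1u << 31)  /* assumed: 1 = FPGA owns, 0 = CPU owns */

struct buf_desc {
    volatile uint32_t addr;  /* bus address of the data buffer */
    volatile uint32_t ctrl;  /* length in the low bits, ownership in bit 31 */
};

static_assert(sizeof(struct buf_desc) == 8,
              "descriptor must match the assumed 8-byte layout");

static bool bd_owned_by_cpu(const struct buf_desc *bd)
{
    return (bd->ctrl & BD_OWN_FPGA) == 0;
}

/*
 * Hand a prepared buffer to the FPGA.
 * Returns false, without touching the descriptor, if the FPGA still owns it.
 */
static bool bd_give_to_fpga(struct buf_desc *bd, uint32_t bus_addr, uint32_t len)
{
    if (!bd_owned_by_cpu(bd))
        return false;

    bd->addr = bus_addr;

    /*
     * The buffer contents and the address must be visible to the FPGA
     * before the ownership bit flips. On real hardware this also means
     * cleaning the buffer from the CPU caches, since the DMA bypasses them.
     */
    atomic_thread_fence(memory_order_release);

    bd->ctrl = (len & ~BD_OWN_FPGA) | BD_OWN_FPGA;
    return true;
}
```

The point of the ordering is that the ownership write is the last thing the CPU does to a descriptor; anything written after it may race with the FPGA reading the descriptor or the buffer.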