AM3352 core hang-up

Hello,

We are seeing core hang-ups of unknown origin on our mass-produced board, which uses TI's AM3352 and Linux kernel 3.13.4.
The time to failure varies widely: some units ran for about 2000 hours after system start before hanging,
while the earliest hang occurred about 24 hours after system start.
So far, core hang-ups have occurred on 21 of 232 units.

Here is a trace log from the ETB (Embedded Trace Buffer), acquired via JTAG (CoreSight).

0284.trace_log_20180104.zip

Trace log result summary
Just before the core hang-up, execution stops after the following sequence:

1. Undefined instruction exception (VFP)
2. Processing of userland Process
3. Data abort exception

Trace logs were acquired for a total of 5 core hang-ups. Execution stops at the same point in all of them.

Looking at the last instructions in every trace log captured at the hang-up,
the code writes a value to the CP15 system control register, and this appears to put the MPU into a hung state.
Is it possible for this sequence to hang the MPU?

ldr r0,0xC05E3420
ldr r0,[r0]
mcr p15,0x0,r0,c1,c0,0x0; p15,0,r0,c1,c0,0 (system control)
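
The mcr above writes r0 to the CP15 system control register (SCTLR). A minimal sketch of decoding such a value into a few of its ARMv7-A control bits follows. The value 0x10C5387D is only an illustrative example and the name decode_sctlr is made up here; the actual word stored at 0xC05E3420 on the failing units is not shown in this post.

```python
# Bit positions of a few ARMv7-A SCTLR control bits.
SCTLR_BITS = {
    "M": 0,   # MMU enable
    "A": 1,   # alignment fault checking enable
    "C": 2,   # data/unified cache enable
    "Z": 11,  # branch prediction enable
    "I": 12,  # instruction cache enable
    "V": 13,  # high exception vectors select
}


def decode_sctlr(value):
    """Return the state (0 or 1) of each listed SCTLR bit."""
    return {name: (value >> bit) & 1 for name, bit in SCTLR_BITS.items()}


# Illustrative value only (an assumption), not taken from the trace.
example = 0x10C5387D
for name, state in decode_sctlr(example).items():
    print(f"{name} = {state}")
```

Comparing the decoded bits of the value read back over JTAG with the value the kernel intends to write is one way to check whether the register contents themselves look wrong.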

Please advise us on an effective way to investigate this hang-up.

Best regards,
Takashi

  • Hi Takashi,

    Can you maybe rule out any electrical issue by e.g. changing voltages, frequencies and disabling DVFS and power management?

    Best regards,

    Vincent.

  • Hi Vincent,

    Thank you for the quick reply.
    Here is some additional information.

    1. DVFS is disabled on our board.
    2. CPU frequency: 1GHz (fixed)
    3. VDD_MPU: 1.325V (fixed)
    4. Temperature at the core hang-up: Approximately 20-30 degrees
    5. We ran memtester at +60℃ and -20℃ for 96 hours.
    No problems were found.
    6. After the hang-up we can still access the DRAM via JTAG (CoreSight),
    and both reads and writes work.

    Best Regards,
    Takashi

  • Dear Takashi,

    1. DVFS is disabled on our board.

    This is good, as it should make debugging easier.

    2. CPU frequency: 1GHz (fixed)

    3. VDD_MPU: 1.325V (fixed)

    Do you have the possibility to lower the frequency below 1 GHz, while keeping the same fixed voltage of 1.325V?

    This is probably one path to explore, in the hope of finding a frequency at which the failing chips no longer fail.

    4. Temperature at the core hang-up: Approximately 20-30 degrees

    5. We ran memtester at +60℃ and -20℃ for 96 hours. No problems were found.

    This is good news as it suggests we can rule out thermal issues for now.

    6. After the hang-up we can still access the DRAM via JTAG (CoreSight), and both reads and writes work.

    This is good news as well, as it allows easier "post mortem" analysis.

    Best regards,

    Vincent.

  • Dear Vincent,

    Thank you for your support.

    Do you have the possibility to lower the frequency below 1 GHz, while keeping the same fixed voltage of 1.325V?

    No. When we run below 1 GHz, we change VDD_MPU to the appropriate
    voltage as described in the datasheet.
    We control the MPU voltage with the AM3352's companion PMIC (TPS65217C).
    In addition, to rule out the CPU frequency, we tested other frequencies,
    each at its appropriate voltage. The core hang-up occurred
    regardless of the frequency: 300 MHz, 600 MHz and 1 GHz.

    This is good news as well, as it allows easier "post mortem" analysis.

    We examined the DRAM dump acquired via CoreSight,
    but have not yet found any corrupted data.

    Best Regards,
    Takashi

  • Dear Takashi,

    The core hang-up occurred regardless of the frequency: 300 MHz, 600 MHz and 1 GHz.

    Sorry to hear that.

    Also, in your first post, you wrote:

    So far, core hang-ups have occurred on 21 of 232 units.

    If I understand correctly, some units will fail regardless of the frequency, and other units will never fail.

    Also, I think a "unit" here really means board plus chip, so it is not really possible to separate chip issues from board issues.

    Would it be possible to swap chips between units?

    I mean de-soldering the chip from a known-bad unit and soldering it onto a board from a known-good unit, or the other way round.

    Most probably this is not easy, but it could help separate board issues from chip issues...

    EDIT: Also, I forgot to ask you: did you try to test the memory extensively for the failing units?

    Best regards,

    Vincent.

  • Dear Vincent,

    Thank you for your reply.

    If I understand correctly, some units will fail regardless of the frequency, and other units will never fail.

    We cannot say that for certain. The time until the core hang-up
    occurs is not constant. In our reproduction tests, some units took
    about 2000 hours after system start to hang, while the earliest hang
    occurred about 24 hours after system start. A unit that has not failed
    yet may still hang if we run it for more than 2000 hours, so we cannot
    conclude that any unit will never fail.

    Would it be possible to swap chips between units?
    I mean de-soldering the chip from a known-bad unit and soldering it onto a board from a known-good unit, or the other way round.

    It is not realistic to swap a device and test again,
    because the rework is likely to damage the device and influence the test result.

    EDIT: Also, I forgot to ask you: did you try to test the memory extensively for the failing units?

    Yes, we have already run memtester on the failing units.

    We are interested in the "mcr p15" instruction:
    when a core hang occurs,
    execution stops at the same "mcr p15" instruction on all 5 boards.
    If a board issue were causing the core hang,
    we would expect the stop point to be random.

    Best Regards,
    Takashi

  • Hi Takashi,

    I see the code sequence of your trace in the Linux kernel code. As far as I can tell this is the `alignment_trap' macro in arch/arm/kernel/entry-header.S:


    .macro alignment_trap, rtemp
    #ifdef CONFIG_ALIGNMENT_TRAP
    ldr \rtemp, .LCcralign      @ address of the saved control register value
    ldr \rtemp, [\rtemp]        @ load that value
    mcr p15, 0, \rtemp, c1, c0  @ write it to the CP15 system control register
    #endif
    .endm

    I don't see why this would be a problem. There are indeed 7 occurrences of this sequence in your trace and only the last one had an issue.

    Also, I thought about other debugging experiments:

    - Did you try to enable all Cortex-A8 errata workarounds in your kernel? For example: ARM_ERRATA_430973, ARM_ERRATA_458693 and ARM_ERRATA_460075?

    - Did you try to reproduce your issue with a more recent kernel?

    - Did you try to reproduce your issue on a different board with the same processor? I think the BeagleBone Black has a TI AM3358 with the same Cortex-A8 as the AM3352.

    Best regards,

    Vincent.

  • Dear Vincent,

    Thank you for your reply.

    I see the code sequence of your trace in the Linux kernel code. As far as I can tell this is the `alignment_trap' macro in arch/arm/kernel/entry-header.S:

    It is exactly as you say.

    I don't see why this would be a problem. There are indeed 7 occurrences of this sequence in your trace and only the last one had an issue.

    We are also paying attention to the VFP execution that precedes "mcr p15".
    It therefore looks like a problem related to the coprocessors.

    Did you try to enable all Cortex-A8 errata workarounds in your kernel? For example: ARM_ERRATA_430973, ARM_ERRATA_458693 and ARM_ERRATA_460075?

    We are using the following chip revision.

    CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7)

    In other words, it is r3p2.
    So we think that these errata do not apply.
    Even if we enable the workarounds, the revision checks shown below prevent them from being applied:


    #if defined(CONFIG_ARM_ERRATA_430973) && !defined(CONFIG_ARCH_MULTIPLATFORM)

    teq r5, #0x00100000 @ only present in r1p*
    mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
    orreq r10, r10, #(1 << 6) @ set IBE to 1
    mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
    #endif
    #ifdef CONFIG_ARM_ERRATA_458693
    teq r6, #0x20 @ only present in r2p0
    mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
    orreq r10, r10, #(1 << 5) @ set L1NEON to 1
    orreq r10, r10, #(1 << 9) @ set PLDNOP to 1
    mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
    #endif
    #ifdef CONFIG_ARM_ERRATA_460075
    teq r6, #0x20 @ only present in r2p0
    mrceq p15, 1, r10, c9, c0, 2 @ read L2 cache aux ctrl register
    tsteq r10, #1 << 22
    orreq r10, r10, #(1 << 22) @ set the Write Allocate disable bit
    mcreq p15, 1, r10, c9, c0, 2 @ write the L2 cache aux ctrl register
    #endif
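
The r3p2 reading can be checked by decoding the main ID register value 0x413fc082 from the boot log. The sketch below splits the value into its fields and mirrors the revision tests in the workaround code above. It assumes, as the code comments suggest, that r5 holds the masked variant field and r6 holds the variant and revision packed into one byte. The function names are made up for illustration, and the CONFIG_ARCH_MULTIPLATFORM condition on the first workaround is ignored.

```python
def decode_midr(midr):
    """Split an ARM main ID register value into its fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,  # 0x41 = ARM
        "variant": (midr >> 20) & 0xF,       # the "r" in rNpM
        "architecture": (midr >> 16) & 0xF,
        "part": (midr >> 4) & 0xFFF,         # 0xC08 = Cortex-A8
        "revision": midr & 0xF,              # the "p" in rNpM
    }


def workarounds_applied(midr):
    """Mirror the teq checks: which workarounds would actually run."""
    r5 = midr & 0x00F00000                    # variant, left in place (assumed)
    r6 = ((midr >> 16) & 0xF0) | (midr & 0xF)  # variant and revision (assumed)
    applied = []
    if r5 == 0x00100000:                      # only present in r1p*
        applied.append("ARM_ERRATA_430973")
    if r6 == 0x20:                            # only present in r2p0
        applied.append("ARM_ERRATA_458693")
        applied.append("ARM_ERRATA_460075")
    return applied


fields = decode_midr(0x413FC082)
print(f"r{fields['variant']}p{fields['revision']}")
print(workarounds_applied(0x413FC082))
```

For 0x413fc082 the variant is 3 and the revision is 2, so neither test matches and none of the three workarounds is applied.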


    Did you try to reproduce your issue with a more recent kernel?

    We are also testing with the "ti-linux-4.9.y" branch.


    Did you try to reproduce your issue on a different board with the same processor? I think the BeagleBone Black has a TI AM3358 with the same Cortex-A8 as the AM3352.

    This is a good idea.
    However, no available board has the same hardware configuration as ours,
    so at present this is difficult.

    Additional information about the CP15 system control register:
    In addition to applying the following commit, we tried a patch that does not read CP15 in 'alignment_trap'.


    commit 195b58add463f697fb802ed55e26759094d40a54
    Author: Russell King <rmk+kernel@arm.linux.org.uk>
    Date: Thu Aug 28 13:08:14 2014 +0100

    ARM: Avoid writing to control register on every exception

    If we are not changing the control register value, avoid writing to it.
    Writes to the control register can be very expensive, taking around a
    hundred cycles or so.


    Here is a trace log from when the core hang-up occurred.

    0763.arm_corelock_00014_b35_20180125.zip

    Summary of the trace log just before the core hang-up:

    1. Undefined instruction exception (VFP)
    2. Processing of userland Process
    3. Data abort exception (mrc p15 ...)

    Best regards,
    Takashi

  • Hello,

    Regarding the "AM3352 core hang-up" problem, our verification results show that
    the CPU hang occurs when the Linux kernel HIGHMEM option is enabled.

    [HIGHMEM verification result]

     (1) DRAM 1 GB, HIGHMEM enabled   ---> CPU hang occurs
     (2) DRAM 1 GB, HIGHMEM disabled  ---> No occurrence
     (3) DRAM 512 MB                  ---> No occurrence

      Linux kernel version: 3.13.4

    When 1 GB of DRAM is fitted, the area above 740 MB (LOWMEM) becomes the HIGHMEM area,
    which Linux manages differently from the LOWMEM area.
    If the kernel HIGHMEM option is enabled in order to use this area,
    the CPU hang occurs. If HIGHMEM is disabled, the CPU hang does not occur.
    With 512 MB of DRAM the hang does not occur either, because no HIGHMEM area is used.
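
The split described above can be written out as simple arithmetic. This is only a sketch: it takes the 740 MB LOWMEM limit quoted above as a parameter, and the function name split_memory is made up for illustration.

```python
def split_memory(dram_mb, lowmem_limit_mb=740):
    """Return (lowmem_mb, highmem_mb) for a given DRAM size.

    lowmem_limit_mb is the LOWMEM limit quoted in this post; the real
    limit depends on the kernel configuration.
    """
    lowmem = min(dram_mb, lowmem_limit_mb)
    highmem = dram_mb - lowmem
    return lowmem, highmem


for dram in (1024, 512):
    low, high = split_memory(dram)
    print(f"DRAM {dram} MB: LOWMEM {low} MB, HIGHMEM {high} MB")
```

With 1 GB fitted, 284 MB falls into the HIGHMEM area. With 512 MB everything fits in LOWMEM, which matches case (3) showing no hang.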

    From this result, it seems that the Linux memory management code, including HIGHMEM,
    is involved in the CPU hang issue.

    The JTAG trace log at a CPU hang always ends at an instruction that reads or writes
    a coprocessor (CP15) register.
    There seems to be a problem with the MMU and L1/L2 caches of the AM3352's Cortex-A8
    that affects the coprocessor (CP15).

    - Cortex-A8 processor revision: r3p2 (0x413fc082)

    [Question]
    Assuming that the Linux kernel memory management code, including the HIGHMEM function,
    is causing the CPU hang, could you tell us the possible causes?

    Best regards.

    Kimura