Hello,
We are encountering a core hang-up of unknown origin on our mass-produced board, which uses TI's AM3352 and Linux kernel 3.13.4. Regarding reproducibility, some units ran for about 2000 hours after system start before hanging, while the earliest hang occurred about 24 hours after system start. The core hang-up has occurred on 21 of 232 units.
Here is a trace log of ETB (Embedded Trace Buffer) acquired via JTAG (CoreSight).
0284.trace_log_20180104.zip
Trace log result summary
The trace stops just before the core hang-up with the following processing sequence.
1. Undefined instruction exception (VFP)
2. Processing of a userland process
3. Data abort exception
Trace logs were acquired for a total of 5 core hang-ups. All of them stop at the same processing.
Checking the last processing in every trace log (log_file) acquired at a hang-up, we see that a value is written to the system control register of CP15, and this seems to put the MPU into a hung state. Is it possible for this processing to hang the MPU?
ldr r0,0xC05E3420
ldr r0,[r0]
mcr p15,0x0,r0,c1,c0,0x0 ; p15,0,r0,c1,c0,0 (system control)
Please advise us on an effective way to investigate this hang-up.
Best regards,
Takashi
Hi Takashi,
Can you maybe rule out any electrical issue by e.g. changing voltages, frequencies and disabling DVFS and power management?
Best regards,
Vincent.
Hi Vincent,
Thank you for the quick reply. Here is some additional information.
1. DVFS is disabled on our board.
2. CPU frequency: 1GHz (fixed)
3. VDD_MPU: 1.325V (fixed)
4. Temperature at the core hang-up: Approximately 20-30 degrees
5. We have run memtester at +60℃ and -20℃ for 96 hours. However, no problems were found.
6. We can access the DRAM via JTAG (CoreSight) after the hang-up, and read/write is possible to it.
Best regards,
Takashi
Dear Takashi,
1. DVFS is disabled on our board.
This is good as it should make debug easier.
2. CPU frequency: 1GHz (fixed)
3. VDD_MPU: 1.325V (fixed)
Do you have the possibility to lower the frequency below 1 GHz, while keeping the same fixed voltage of 1.325V?
This is probably one path to explore, in the hope of finding a frequency at which the failing chips no longer fail.
4. Temperature at the core hang-up: Approximately 20-30 degrees
5. We have run memtester at +60℃ and -20℃ for 96 hours. However, no problems were found.
This is good news as it suggests we can rule out thermal issues for now.
6. We can access the DRAM via JTAG (CoreSight) after the hang-up, and read/write is possible to it.
This is good news as well, as it allows easier "post mortem" analysis.
Dear Vincent,
Thank you for your support.
No. When we use a frequency below 1GHz, we change VDD_MPU to the suitable voltage as described in the datasheet. We control the MPU voltage with the companion PMIC (TPS65217C) for the AM3352 processor. In addition, we tried changing the CPU frequency, with the suitable voltage, to rule out the CPU frequency as a cause. The core hang-up occurred regardless of the frequency, 300MHz, 600MHz and 1GHz.
We examined the DRAM dump data acquired via CoreSight, but have yet to find garbled data.
The core hang-up occurred regardless of the frequency, 300MHz, 600MHz and 1GHz.
Sorry to hear that.
Also, in your first post, you wrote:
The core hang-up has occurred on 21 of 232 units.
If I understand correctly, some units will fail, and this regardless of the frequency. And some other units will never fail.
Also I think a "unit" here is really board+chip and it is not really possible to separate chip issues from board issues.
Did you have maybe the possibility to swap chips between units?
I mean: de-soldering the chip of a known-bad unit and soldering it to a board from a known-good unit? Or the contrary?
Most probably this is not something easy, but that could help separate board issues from chip issues...
EDIT: Also, I forgot to ask you: did you try to test the memory extensively for the failing units?
Thank you for your reply.
We can't necessarily say that. The time until the core hang-up occurs is not constant. Regarding reproducibility, some units ran for about 2000 hours after system start before hanging, while the earliest hang occurred about 24 hours after system start. A unit may still hang if we run it for more than 2000 hours. So it is not clear that the other units will never fail.
Did you have maybe the possibility to swap chips between units?
I mean: de-soldering the chip of a known-bad unit and soldering it to a board from a known-good unit? Or the contrary?
It isn't realistic to swap a device and test again, because there is a high possibility that the device would be damaged by the rework, which would influence the test result.
Yes, in failing units, we have already run memtester.
We are interested in the "mcr p15" instruction: when a core hang occurred, execution stopped at the same "mcr p15" instruction on all 5 boards. If a board issue were causing the core hang, we would expect the stop point to be random.
I see the code sequence of your trace in the Linux kernel code. As far as I can tell this is the `alignment_trap' macro in arch/arm/kernel/entry-header.S:
    .macro alignment_trap, rtemp
#ifdef CONFIG_ALIGNMENT_TRAP
    ldr \rtemp, .LCcralign
    ldr \rtemp, [\rtemp]
    mcr p15, 0, \rtemp, c1, c0
#endif
    .endm
I don't see why this would be a problem. There are indeed 7 occurrences of this sequence in your trace and only the last one had an issue.
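For reference, the `mcr p15, 0, <Rt>, c1, c0, 0` at the end of that sequence writes the CP15 system control register (SCTLR), whose low bits enable the MMU, alignment checking and the caches. Below is a small Python sketch of decoding such a value. The bit positions are the ARMv7-A ones, but the sample value 0x10c5387d is only an illustrative assumption (the word actually stored at 0xC05E3420 is not given in this thread), and `decode_sctlr` is a hypothetical helper name.

```python
# Selected ARMv7-A SCTLR bit positions (CP15 c1, c0, 0).
SCTLR_BITS = {
    "M": 0,   # MMU enable
    "A": 1,   # alignment fault checking enable
    "C": 2,   # data/unified cache enable
    "Z": 11,  # branch prediction enable
    "I": 12,  # instruction cache enable
    "V": 13,  # high exception vectors
}


def decode_sctlr(value):
    """Return {bit name: 0 or 1} for the selected SCTLR bits."""
    return {name: (value >> pos) & 1 for name, pos in SCTLR_BITS.items()}


if __name__ == "__main__":
    sample = 0x10C5387D  # assumed example value, NOT taken from the trace
    for name, bit in decode_sctlr(sample).items():
        print(f"{name} = {bit}")
```

Comparing the decoded bits of the value read from DRAM after a hang with the value expected by the kernel would show whether the word being written to SCTLR was itself corrupted.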
Also, I thought about other debugging experiments:
- Did you try to enable all Cortex-A8 errata workarounds in your kernel? For example: ARM_ERRATA_430973, ARM_ERRATA_458693 and ARM_ERRATA_460075?
- Did you try to reproduce your issue with a more recent kernel?
- Did you try to reproduce your issue on a different board with the same processor? I think the beaglebone black has a TI AM3358 with the same Cortex-A8 as AM3352.
It is exactly as you say.
We are also concerned about the VFP execution just before "mcr p15". So it looks like a problem with the coprocessors.
Did you try to enable all Cortex-A8 errata workarounds in your kernel? For example: ARM_ERRATA_430973, ARM_ERRATA_458693 and ARM_ERRATA_460075?
We are using the following chip revision.
CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7)
In other words, it is r3p2. So we think that these errata do not apply. Even if we enable the workarounds, the version checks work as follows.
#if defined(CONFIG_ARM_ERRATA_430973) && !defined(CONFIG_ARCH_MULTIPLATFORM)
    teq r5, #0x00100000 @ only present in r1p*
    mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
    orreq r10, r10, #(1 << 6) @ set IBE to 1
    mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
#endif
#ifdef CONFIG_ARM_ERRATA_458693
    teq r6, #0x20 @ only present in r2p0
    mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
    orreq r10, r10, #(1 << 5) @ set L1NEON to 1
    orreq r10, r10, #(1 << 9) @ set PLDNOP to 1
    mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
#endif
#ifdef CONFIG_ARM_ERRATA_460075
    teq r6, #0x20 @ only present in r2p0
    mrceq p15, 1, r10, c9, c0, 2 @ read L2 cache aux ctrl register
    tsteq r10, #1 << 22
    orreq r10, r10, #(1 << 22) @ set the Write Allocate disable bit
    mcreq p15, 1, r10, c9, c0, 2 @ write the L2 cache aux ctrl register
#endif
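The revision reasoning above can be checked by decoding the main ID register value from the boot log. The Python sketch below assumes the standard ARM MIDR field layout and, for the errata checks, assumes that the kernel's r5 holds the masked variant field and r6 holds (variant << 4) | revision, which is consistent with the "r1p*" and "r2p0" comments in the quoted code. `decode_midr` and `errata_checks` are hypothetical helper names.

```python
def decode_midr(midr):
    """Split a 32-bit ARM main ID register value into its fields."""
    return {
        "implementer": (midr >> 24) & 0xFF,  # 0x41 = ARM
        "variant": (midr >> 20) & 0xF,       # the N in rNpM
        "architecture": (midr >> 16) & 0xF,
        "part": (midr >> 4) & 0xFFF,         # 0xC08 = Cortex-A8
        "revision": midr & 0xF,              # the M in rNpM
    }


def errata_checks(midr):
    """Model the teq conditions of the quoted workarounds (assumed r5/r6)."""
    r5 = midr & 0x00F00000              # variant field, left in place
    r6 = (midr & 0xF) | (r5 >> 16)      # (variant << 4) | revision
    return {
        "ARM_ERRATA_430973": r5 == 0x00100000,  # r1p*
        "ARM_ERRATA_458693": r6 == 0x20,        # r2p0
        "ARM_ERRATA_460075": r6 == 0x20,        # r2p0
    }


if __name__ == "__main__":
    midr = 0x413FC082  # value from the boot log quoted above
    fields = decode_midr(midr)
    print("r%dp%d" % (fields["variant"], fields["revision"]))
    print(errata_checks(midr))
```

Under these assumptions none of the three conditions matches an r3p2 part, so enabling the options would leave the auxiliary control registers untouched.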
Did you try to reproduce your issue with a more recent kernel?
We are testing with the "ti-linux-4.9.y" branch too.
Did you try to reproduce your issue on a different board with the same processor? I think the beaglebone black has a TI AM3358 with the same Cortex-A8 as AM3352.
This is a good idea. However, there is no board with the same hardware configuration as ours, so in the current situation it is difficult.
Additional information about the system control register of CP15: in addition to the following commit, we tried a patch that does not read CP15 in 'alignment_trap'.
commit 195b58add463f697fb802ed55e26759094d40a54
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Thu Aug 28 13:08:14 2014 +0100

    ARM: Avoid writing to control register on every exception

    If we are not changing the control register value, avoid writing to it. Writes to the control register can be very expensive, taking around a hundred cycles or so.
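The idea described in that commit message can be modelled as "compare first, write only on a mismatch". The following Python sketch is a toy model under that assumption, not the kernel code: `ControlRegister`, its counters and the sample value are hypothetical, and it assumes the comparison is made against a value read back from the register.

```python
class ControlRegister:
    """Toy stand-in for the CP15 system control register."""

    def __init__(self, value):
        self.value = value
        self.reads = 0
        self.writes = 0

    def read(self):
        # Models mrc p15, 0, <Rt>, c1, c0, 0
        self.reads += 1
        return self.value

    def write(self, value):
        # Models mcr p15, 0, <Rt>, c1, c0, 0
        self.writes += 1
        self.value = value


def alignment_trap_always(reg, wanted):
    """Behaviour before the commit: write on every exception entry."""
    reg.write(wanted)


def alignment_trap_if_changed(reg, wanted):
    """Behaviour modelled after the commit: write only on a mismatch."""
    if reg.read() != wanted:
        reg.write(wanted)


if __name__ == "__main__":
    wanted = 0x10C5387D  # hypothetical control register value
    old, new = ControlRegister(wanted), ControlRegister(wanted)
    for _ in range(7):  # seven exception entries, as an example
        alignment_trap_always(old, wanted)
        alignment_trap_if_changed(new, wanted)
    print(old.writes, new.writes, new.reads)
```

In this model the write disappears whenever the value is unchanged, but every exception entry still touches CP15 through the read.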
Here is a trace log when the core hang-up occurs.
0763.arm_corelock_00014_b35_20180125.zip
Summary of the trace log just before the core hang-up:
1. Undefined instruction exception (VFP)
2. Processing of a userland process
3. Data abort exception (mrc p15 ...)
Best regards,
Takashi
Hello,
Regarding the "AM3352 core hang-up" problem, our verification results show that the CPU hang occurs when the HIGHMEM option of the Linux kernel is enabled.
[HIGHMEM verification result]
(1) DRAM 1 GB, HIGHMEM enabled ---> CPU hang occurs
(2) DRAM 1 GB, HIGHMEM disabled ---> No occurrence
(3) DRAM 512 MB ---> No occurrence
Linux kernel version: 3.13.4
When 1 GB of DRAM is fitted, the area beyond 740MB (LOWMEM) becomes the HIGHMEM area, which Linux manages differently from the LOWMEM area. If the kernel option HIGHMEM is enabled in order to use this area, the CPU hang occurs; if HIGHMEM is disabled, it does not. With 512 MB of DRAM the hang does not occur either, because the HIGHMEM area is not used.
From this result, it seems that the Linux memory management function, including HIGHMEM, affects the CPU hang issue.
The JTAG trace log at a CPU hang always ends at a read or write instruction for a coprocessor (CP15) register. It seems that there is a problem with the MMU and the L1/L2 caches of the AM3352's Cortex-A8, and that it affects the coprocessor (CP15).
- Cortex-A8 processor revision: r3p2 (0x413fc082)
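The three verification cases can be put in numbers with a short Python sketch. It takes the 740MB LOWMEM limit from the description above as given; `split_memory` is a hypothetical helper, and treating the DRAM beyond LOWMEM as simply unused when HIGHMEM is disabled is a simplifying assumption about what the kernel does.

```python
LOWMEM_LIMIT_MB = 740  # figure quoted in this thread for the board's kernel


def split_memory(dram_mb, highmem_enabled, lowmem_limit_mb=LOWMEM_LIMIT_MB):
    """Return (lowmem_mb, highmem_mb, unused_mb) for a given DRAM size."""
    lowmem = min(dram_mb, lowmem_limit_mb)
    above = dram_mb - lowmem  # DRAM that does not fit in LOWMEM
    if highmem_enabled:
        return lowmem, above, 0
    return lowmem, 0, above


if __name__ == "__main__":
    cases = [(1024, True), (1024, False), (512, True)]
    for dram_mb, highmem in cases:
        print(dram_mb, highmem, split_memory(dram_mb, highmem))
```

Only case (1) ends up with a non-zero HIGHMEM area, which is also the only configuration in which the hang was observed.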
[Question]
Assuming that the Linux kernel memory management function, including HIGHMEM, is causing the CPU hang issue, could you tell us the possible causes?
Best regards,
Kimura