AM3352 core hang-up

takashi over 7 years ago

Hello,

We are seeing core hang-ups of unknown origin on our mass-produced board, which uses TI's AM3352 and Linux kernel 3.13.4.
Regarding reproducibility, some units hung about 2000 hours after system start,
while others hung as early as about 24 hours after system start.
So far the core hang-up has occurred on 21 units out of 232.

Here is a trace log of ETB (Embedded Trace Buffer) acquired via JTAG (CoreSight).

0284.trace_log_20180104.zip

Trace log result summary
Just before the core hang-up, the trace ends with the following processing sequence.

1. Undefined instruction exception (VFP)
2. Userland process execution
3. Data abort exception

We acquired trace logs from a total of 5 core hang-ups. The trace stops at the same processing in all of them.

Checking the last processing in every trace log (log_file) acquired at the hang-up,
a value is written to the CP15 system control register, and this seems to put the MPU into a hung state.
Is it possible for this processing to put the MPU into a hung state?

ldr r0,0xC05E3420
ldr r0,[r0]
mcr p15,0x0,r0,c1,c0,0x0; p15,0,r0,c1,c0,0 (system control)

Please advise us on an effective way to investigate this hang-up.

Best regards,
Takashi

Top replies

vstehle over 7 years ago in reply to takashi +1 verified

Hi Takashi, I see the code sequence of your trace in the Linux kernel code. As far as I can tell this is the `alignment_trap' macro in arch/arm/kernel/entry-header.S: 40 .macro alignment_trap, rtemp...


takashi over 7 years ago in reply to vstehle

Dear Vincent,

Thank you for your reply.

> I see the code sequence of your trace in the Linux kernel code. As far as I can tell this is the `alignment_trap' macro in arch/arm/kernel/entry-header.S:

It is exactly as you say.

> I don't see why this would be a problem. There are indeed 7 occurrences of this sequence in your trace and only the last one had an issue.

We are also concerned about the VFP execution that happens just before the "mcr p15".
So it looks like a problem involving the coprocessors.

> Did you try to enable all Cortex-A8 errata workarounds in your kernel? For example: ARM_ERRATA_430973, ARM_ERRATA_458693 and ARM_ERRATA_460075?

We are using the following chip revision.

CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7)

In other words, it is r3p2.
So we think that these errata do not apply.
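For reference, here is a minimal sketch of how the ID value in the boot message splits into the Main ID Register fields. The function names `decode_midr` and `revision_string` are hypothetical and only for illustration; the bit layout is the standard ARMv7 MIDR layout (implementer, variant, architecture, part number, revision).

```python
def decode_midr(midr: int) -> dict:
    """Split a Main ID Register value into its fields (ARMv7 layout)."""
    return {
        "implementer": (midr >> 24) & 0xFF,   # 0x41 = ARM
        "variant": (midr >> 20) & 0xF,        # major revision, the n in rnpm
        "architecture": (midr >> 16) & 0xF,   # 0xF = CPUID scheme
        "part": (midr >> 4) & 0xFFF,          # 0xC08 = Cortex-A8
        "revision": midr & 0xF,               # minor revision, the m in rnpm
    }


def revision_string(midr: int) -> str:
    """Format the variant and revision fields in the usual rnpm notation."""
    fields = decode_midr(midr)
    return f"r{fields['variant']}p{fields['revision']}"


if __name__ == "__main__":
    # Value taken from the "CPU: ARMv7 Processor [413fc082]" boot line.
    print(revision_string(0x413FC082))
```

The variant field is 3 and the revision field is 2, which is how we arrive at r3p2.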
Even if we enable these errata options, the kernel's revision checks would skip the workarounds on our chip, as the following code shows.

#if defined(CONFIG_ARM_ERRATA_430973) && !defined(CONFIG_ARCH_MULTIPLATFORM)

teq r5, #0x00100000 @ only present in r1p*
mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
orreq r10, r10, #(1 << 6) @ set IBE to 1
mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
#endif
#ifdef CONFIG_ARM_ERRATA_458693
teq r6, #0x20 @ only present in r2p0
mrceq p15, 0, r10, c1, c0, 1 @ read aux control register
orreq r10, r10, #(1 << 5) @ set L1NEON to 1
orreq r10, r10, #(1 << 9) @ set PLDNOP to 1
mcreq p15, 0, r10, c1, c0, 1 @ write aux control register
#endif
#ifdef CONFIG_ARM_ERRATA_460075
teq r6, #0x20 @ only present in r2p0
mrceq p15, 1, r10, c9, c0, 2 @ read L2 cache aux ctrl register
tsteq r10, #1 << 22
orreq r10, r10, #(1 << 22) @ set the Write Allocate disable bit
mcreq p15, 1, r10, c9, c0, 2 @ write the L2 cache aux ctrl register
#endif
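To make the revision checks concrete, here is a small sketch that evaluates the three `teq` conditions for a given MIDR value. It rests on an assumption that is not visible in the excerpt above: that r5 holds the variant field in place and r6 holds the variant and revision combined into one byte, which is our reading of the setup code in arch/arm/mm/proc-v7.S. The function name `errata_checks` is hypothetical.

```python
def errata_checks(midr: int) -> dict:
    """Evaluate the revision tests guarding the three errata workarounds.

    Assumption (from our reading of proc-v7.S, not shown in the excerpt):
      r5 = MIDR & 0x00f00000            (variant, left in place)
      r6 = (variant << 4) | revision    (e.g. 0x20 for r2p0)
    """
    r5 = midr & 0x00F00000
    r6 = (midr & 0x0000000F) | (r5 >> 16)
    return {
        "ARM_ERRATA_430973": r5 == 0x00100000,  # teq r5, #0x00100000 (r1p*)
        "ARM_ERRATA_458693": r6 == 0x20,        # teq r6, #0x20 (r2p0)
        "ARM_ERRATA_460075": r6 == 0x20,        # teq r6, #0x20 (r2p0)
    }


if __name__ == "__main__":
    # Our chip: MIDR 0x413fc082, i.e. r3p2.
    for name, applied in errata_checks(0x413FC082).items():
        print(name, "applied" if applied else "skipped")
```

Under this assumption, all three conditions are false for our r3p2 part, so the `mrceq`/`orreq`/`mcreq` instructions would not execute even with the options enabled.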

> Did you try to reproduce your issue with a more recent kernel?

We are also testing with the "ti-linux-4.9.y" branch.

> Did you try to reproduce your issue on a different board with the same processor? I think the beaglebone black has a TI AM3358 with the same Cortex-A8 as AM3352.

This is a good idea.
However, there is no other board with the same hardware configuration as ours.
At present, this is difficult to do.

Additional information about the CP15 system control register:
In addition to applying the following commit, we tried a patch that does not read CP15 in 'alignment_trap'.

commit 195b58add463f697fb802ed55e26759094d40a54
Author: Russell King <rmk+kernel@arm.linux.org.uk>
Date: Thu Aug 28 13:08:14 2014 +0100

ARM: Avoid writing to control register on every exception

If we are not changing the control register value, avoid writing to it.
Writes to the control register can be very expensive, taking around a
hundred cycles or so.
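As a rough illustration of what this commit changes, the following toy model compares "always write" with "read, compare, and write only on a difference". It is only a sketch of the logic, not the kernel code: `ControlRegisterModel`, `alignment_trap_old`, `alignment_trap_new` and the register value are all hypothetical.

```python
class ControlRegisterModel:
    """Toy stand-in for the CP15 c1 system control register."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.writes = 0  # how many times the register was written

    def read(self) -> int:
        # Stands in for: mrc p15, 0, rX, c1, c0, 0
        return self.value

    def write(self, value: int) -> None:
        # Stands in for: mcr p15, 0, rX, c1, c0, 0
        self.value = value
        self.writes += 1


def alignment_trap_old(sctlr: ControlRegisterModel, cr_alignment: int) -> None:
    """Before the commit: write the register on every exception entry."""
    sctlr.write(cr_alignment)


def alignment_trap_new(sctlr: ControlRegisterModel, cr_alignment: int) -> None:
    """After the commit: write only when the value would actually change."""
    if sctlr.read() != cr_alignment:
        sctlr.write(cr_alignment)


if __name__ == "__main__":
    wanted = 0x10C5387F  # arbitrary example value, not read from our board
    old, new = ControlRegisterModel(wanted), ControlRegisterModel(wanted)
    for _ in range(7):  # seven exception entries
        alignment_trap_old(old, wanted)
        alignment_trap_new(new, wanted)
    print("writes before commit:", old.writes)
    print("writes after commit:", new.writes)
```

With the commit applied, the exception entry path still reads CP15 but normally no longer writes it, which is why we additionally tried a patch that removes the read as well.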

Here is a trace log captured when the core hang-up occurred.

0763.arm_corelock_00014_b35_20180125.zip

Summary of the trace log results just before the core hang-up:

1. Undefined instruction exception (VFP)
2. Userland process execution
3. Data abort exception (mrc p15 ...)

Best regards,
Takashi