I'm (again) facing a very strange problem in my project for ARM Cortex-M4 (STM32F301K8). The project requires some of the functions to be executed from RAM (it's actually a bootloader with encryption and an option to self-update, but that doesn't matter here). In my startup code I have a loop that "initializes" blocks of data by copying them from flash to a given address in RAM. The most common use of this code is to copy the .data section, and it works flawlessly, because it's brain-dead simple.
In my linker script I have something like this:
    /* sub-section: data_array */
    . = ALIGN(4);
    __data_array_start = .;
    PROVIDE(__data_array_start = __data_array_start);
    LONG(LOADADDR(.data)); LONG(ADDR(.data)); LONG(ADDR(.data) + SIZEOF(.data));
    LONG(LOADADDR(.ram_text)); LONG(ADDR(.ram_text)); LONG(ADDR(.ram_text) + SIZEOF(.ram_text));
    . = ALIGN(4);
    __data_array_end = .;
    PROVIDE(__data_array_end = __data_array_end);
    /* end of sub-section: data_array */
Then in my startup code I have this code:
    // Initialize sections from data_array (including .data)
        ldr     r4, =__data_array_start
        ldr     r5, =__data_array_end
    1:  cmp     r4, r5              // outer loop - addresses from data_array
        ittte   lo
        ldrlo   r1, [r4], #4        // start of source address
        ldrlo   r2, [r4], #4        // start of destination address
        ldrlo   r3, [r4], #4        // end of destination address
        bhs     3f
    2:  cmp     r2, r3              // inner loop - section initialization
        ittt    lo
        ldrlo   r0, [r1], #4
        strlo   r0, [r2], #4
        blo     2b
        b       1b                  // go back to start
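For readers who find the assembly hard to follow, the same table-driven copy can be sketched in C. This is only an illustration of the logic, not the code actually used: the names `copy_entry` and `init_sections` are hypothetical, and each table entry is assumed to hold the three values emitted by the LONG() statements (source, destination start, destination end).

```c
#include <stdint.h>
#include <stddef.h>

/* One table entry, mirroring the three LONG() values per section.
 * (Hypothetical type name, introduced for illustration only.) */
struct copy_entry {
    const uint32_t *src;      /* load address of the section (flash) */
    uint32_t       *dst;      /* start of destination (RAM) */
    uint32_t       *dst_end;  /* end of destination, exclusive */
};

/* Walk the table (outer loop) and copy each section
 * word by word (inner loop). */
static void init_sections(const struct copy_entry *entry,
                          const struct copy_entry *table_end)
{
    for (; entry < table_end; ++entry) {
        const uint32_t *src = entry->src;
        uint32_t *dst = entry->dst;

        while (dst < entry->dst_end)
            *dst++ = *src++;
    }
}
```

On a host machine the function can be exercised with two plain arrays standing in for flash and RAM, which is a quick way to confirm that the loop logic itself is not at fault.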
Now the problem I'm facing right now is that _ONE_ single word in RAM is not stored correctly... The problem is very strange, because when I have 0x00000000 in RAM and 0x12345678 is loaded in the register (r0 in my case), after the write I have 0x00005678 in RAM... Somehow only "half" of the data is written and the other half in RAM is not modified. This happens in the middle of the block, so it's not a problem of a wrong range: all the data before and after that problematic spot is copied correctly. The problem happens at the same address (for example, right now it is 0x20000148), but from time to time the particular address changes. If I just move the block to a different address, the problem moves to a different spot within the block. If I take another chip, the problem persists, but at a different address.
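To make "only half of the data is written" checkable rather than eyeballed, a small helper can classify what happened to a word, given the value that was in RAM before, the value that should have been stored, and the value found afterwards. This is a sketch for diagnostics only; `classify_store` and the enum names are hypothetical.

```c
#include <stdint.h>

enum store_result {
    STORE_OK,         /* the full word was written */
    STORE_LOW_HALF,   /* only bits 15:0 took the new value */
    STORE_HIGH_HALF,  /* only bits 31:16 took the new value */
    STORE_OTHER       /* some other kind of corruption */
};

/* Compare the observed RAM word against the expected one and against
 * the two "half-word only" combinations of old and new content. */
static enum store_result classify_store(uint32_t before,
                                        uint32_t expected,
                                        uint32_t after)
{
    if (after == expected)
        return STORE_OK;
    if (after == ((before & 0xFFFF0000u) | (expected & 0x0000FFFFu)))
        return STORE_LOW_HALF;
    if (after == ((before & 0x0000FFFFu) | (expected & 0xFFFF0000u)))
        return STORE_HIGH_HALF;
    return STORE_OTHER;
}
```

With the values from the description (0x00000000 before, 0x12345678 expected, 0x00005678 after) the helper reports `STORE_LOW_HALF`. Running such a check over the whole block right after the copy would also record the failing address without needing the debugger.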
As I wrote above, this is the second time I'm having this issue. Previously I've seen it on STM32F103 and nothing helped on the first day - copying with words, bytes, half-words, double-words, memcpy(). After I went to sleep without solving the issue, the next morning everything worked flawlessly ever since with absolutely no fix - identical code that didn't work on one day worked perfectly fine on the other day...
Someone suggested to me that this may have something to do with the Flash Patch and Breakpoint unit in the core. When I check it with the debugger I see that it is indeed enabled (0x261 in the FP_CTRL register), but all the comparators are disabled (0 in FP_COMPx).
Has anyone faced this issue and found a reliable solution? Thanks in advance for any hints!
I had a similar problem with my LPC1342, so I think it might not be unheard of. Similar but not identical; I think it was a problem with flashing the chip. A few data words (at random) went haywire.
I remember that if I ran the microcontroller at a low speed, the problem went away, but as soon as I ran it at full speed, it went erratic (even if I had flash-programmed it at a low speed).
Looking at the code and comparing it with the symptoms suggests that it's the *read* that goes wrong, not the write.
E.g. somehow, it sounds like the source pointer might 'jump' back or forward by 2.
This could be caused by running the microcontroller at some high (overclocked) speed by accident.
-So first thing: Try using OpenOCD and issue a few mdw commands to dump the RCC registers (something like 'mdw 0x40021000 40' will probably do fine; on these parts the RCC block should start at 0x40021000, while 0x40022000 is the flash interface), and then use the Reference Manual to find out what speed the MCU is actually running at. This is a much better approach than reading code, because you can look at the code over and over and never see the error.
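Once the dump is available, the clock configuration word can be decoded by hand or with a few lines of C. The sketch below assumes the RCC_CFGR layout I remember from the STM32F1/F3 reference manuals (register at offset 0x04 in the RCC block, SWS in bits 3:2, PLLMUL in bits 21:18); please verify these positions against the Reference Manual for your exact part. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: SWS (system clock switch status) sits in bits 3:2
 * of RCC_CFGR, with 0 = HSI, 1 = HSE, 2 = PLL. */
static const char *sysclk_source(uint32_t cfgr)
{
    switch ((cfgr >> 2) & 0x3u) {
    case 0:  return "HSI";
    case 1:  return "HSE";
    case 2:  return "PLL";
    default: return "reserved";
    }
}

/* Assumption: PLLMUL sits in bits 21:18, where field value 0 means x2,
 * 1 means x3, and so on, with 14 and 15 both meaning x16. */
static unsigned pll_multiplier(uint32_t cfgr)
{
    unsigned field = (cfgr >> 18) & 0xFu;

    return field >= 14u ? 16u : field + 2u;
}
```

If the decoded source is still the internal oscillator and the PLL is not selected, an accidental overclock can be ruled out without reading any of the clock setup code.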
I think the first thing you might need to do is to check that the chip gets the power it needs.
If it's a Discovery-board, then it probably does already, but if it's your own design, it's important to remember that something could have gone wrong (also from the PCB manufacturer's side).
What I'm going to suggest is of course trivial (and probably a little annoying).
Check that each of your 100nF VDD capacitors is soldered correctly.
Check that there's a stable voltage on those pins.
Now a bit worse: Make sure your external clock crystal's capacitors are correct.
This may require some advanced equipment; if you have the equipment, then it's cool.
If you don't, then the best bet will be to verify that there are no open connections between the XTAL pins and the crystal's terminals, plus that the capacitors are soldered correctly.
Also the value of the capacitors would most likely be in the range 6pF to 10pF.
If they're for instance 22pF, I'm pretty sure you'll need to re-calculate the values.
-But instead of checking the crystal and capacitors, it might be a lot quicker to switch to using the internal oscillator, run at a low frequency and see if the problem persists.
Please let me know about your findings.
First of all - this is not a problem of flashing, because the data I see in flash is correct. Second thing - this is definitely a problem with writing, because the correct value is read from flash into register r0, and both index registers have correct values. I can write the data "manually" to the RAM address using OpenOCD and it works perfectly fine, so the RAM is working correctly. It's also not related to OpenOCD's loader, because the problem exists after a fresh power-up of the chip without a debugger. Clock, crystal, PLL or any other peripheral cannot be related, because the code is executed right after reset and absolutely NOTHING is enabled.
As before, the problem suddenly disappeared, and now the same code works perfectly every time and I cannot reproduce the issue anymore... I'd still be glad to find the root cause of the problem, because it seems there's some pattern here...
What I meant earlier, in detail:
r1 and r2 start off being correct. r0 is read fine via r1 for the first N words.
Then suddenly something happens, which by accident sets bit 1 in r1, thus r0 is now read from an address that spans two words; you'll see a 'skewed' value, but the block being written is "perfectly contiguous". Bit 1 in r1 stays set for a while, then it might get cleared (perhaps when the code is done executing, because the chip is cooling down).
This could happen if the chip had too much heat during soldering, but it would never happen during debugging, because you'd give the CPU time to cool off.
Thus after the copy is done, both r1 and r2 would look fine.
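To show what such a 'skewed' value would look like, here is a small host-side sketch in C. It assumes little-endian byte order (as on Cortex-M) and that unaligned word loads are permitted; the helper `read_le32` and the example bytes are made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Read a little-endian 32-bit word starting at byte offset 'off'.
 * Building the value from single bytes keeps the example independent
 * of the host's own endianness and alignment rules. */
static uint32_t read_le32(const uint8_t *mem, size_t off)
{
    return (uint32_t)mem[off]
         | ((uint32_t)mem[off + 1] << 8)
         | ((uint32_t)mem[off + 2] << 16)
         | ((uint32_t)mem[off + 3] << 24);
}

/* Two consecutive words as they would sit in flash:
 * 0x11112222 followed by 0x33334444. */
static const uint8_t example_flash[8] = {
    0x22, 0x22, 0x11, 0x11,
    0x44, 0x44, 0x33, 0x33
};
```

Reading at offset 0 gives 0x11112222, while reading at offset 2 (the source pointer with bit 1 set) gives 0x44441111: the upper half of the first word combined with the lower half of the second. Comparing a bad RAM word against its neighbours in flash in this way would confirm or rule out this theory.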
You could try running the chip in raised room temperature (for instance placing the board under a hot lamp) and see if it starts acting funny.
Making a small "heater" before the copying might help triggering the error:
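mov.w r1,#(1 << 24)
loop: subs.n r1,r1,#1
bpl.n loop
(note: the 32-bit mov.w is used because a 16-bit movs cannot encode this immediate. The three instructions take 8 bytes in total, which keeps the alignment of the following code the same as you had previously, as different alignment can cause different execution timing).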
If the error shows up now, you could change the registers to be r4, r5 and r6; just to see if it still happens.
(If the problem goes away, try changing back to r0, r1 and r2).
Note: The chip could also have been damaged by ESD if it at some point had not been handled correctly; but it's not likely that two chips have the exact same symptoms due to ESD; it would be more likely that it had too much heat during soldering.