Hi
I am working with a Microsemi Cortex-M3 ARM with uVision and a ULINK2 (MDK 5.12), and I am getting a PRECISERR bus fault from a memory read during initialization, in the scatter-loader's decompress routine. It is definitely trying to read a bad address. This code is created by the linker to initialize my memory/variables, so it's not something I have direct control over.
Here is the goofy thing: I found that I could get around the hard fault one time if, after hitting it, I get out of the debugger and go right back in. Then I can run as long as I don't try to restart by clicking the reset button in uVision.
I don't understand why this would make any difference, but it does, and it allowed me to work for the past few weeks, although it's slow and awkward.
This weekend my luck ran out and now I get that hard fault every single time I try to run. My trick no longer works. All I did just before this started was add one line of code, recompile, and reload.
Here is an interesting clue. Out of desperation I recompiled my entire project and reflashed, and I was able to get to main! All I did was recompile. I did some testing, had to make one minor change to an index, and recompiled, and now I am back to getting that hard fault every time. None of my tricks work either.
It's still coming from the same loop in the decompress routine, where it reads an address that is just below my DDR memory: 0x9FFFFEFF.
It really sounds like some kind of memory alignment issue or something like that, but the clues do not seem to be related.
Why would adding or removing a few bytes from my code cause the decompress routine to hard fault? Why did recompiling the project allow me to get past the decompress problem once? Why does clicking reset and trying to run the same code cause the decompress to suddenly not work? Why would getting out of the debugger and back in allow it to do the decompression correctly one time?
Arrrgg.
I am totally stopped from doing any more development because the decompression code always causes a hard fault from the same address and memory access.
Has anyone experienced anything like this before?
Does it depend how long it's been running? If you add an idle loop in ResetHandler does it change the behaviour? How about after the DDR is initialized?
Decompression most often fails, or goes off into the weeds, when the data stream it is decoding is corrupted. It's designed for efficiency, not integrity.
The linker chooses to use compression, or not, depending on the static data it is trying to copy into RAM, and if compression actually makes the image smaller, or not.
Hi, thanks for getting back to me so quickly. The decompress routine is called as part of the C library initialization, long before my code starts. So it is not directly an issue with anything that my code does, because the system hard faults before it can run.
The scatter-load process, which decompress is part of, is trying to initialize static and initialized variables before my code runs.
The DDR is successfully initialized by a script that the debugger runs before loading symbols and running any code. I have verified that this does work correctly and that the memory is accessible.
Over the weekend my trick of getting out of and back into the debugger was working, but then I made one small change (added a +1) to a routine in my code, recompiled, and reflashed it into NVM, and now that hard fault happens every single time I run the code. It always faults in the same loop in the decompress routine. It tries to load a value from undefined memory. From looking at the assembly code it appears to be a computed address, but it is pretty complicated.
This is why I was thinking there was some type of memory alignment problem, because the only thing I did was make the code a few bytes bigger.
After this started happening, and having absolutely no other ideas as to what to do, I recompiled all of my code in the project, reflashed, and was able to get to main. No hard fault. I tested, and wanted to restart, so I did my trick of getting out of and back into the debugger, which had cleared this up before (I have no idea why), but it did not work. This was exactly the same code in flash that had started without a hard fault just a few moments before. What would change? Power cycling does not help either.
"Decompression most often fails or goes off in the weeds, when the data stream it is decoding is corrupted. It's designed for efficiency, not integrity."
I had thought about this, but it looks like the addresses and even some of the data to be written are "calculated" using the current PC address or some odd math, which I assume is done to save code space. This code and the calculation are flashed into NVM, so there should not be any corruption. Although if it uses the stack it would be vulnerable to having values corrupted; but since it is the only code running, it would be stepping on its own variables/data.
Since this is generated code, I do not have any direct control over how its logic works. That is the scary thing.
Is there any way to tell the linker not to use compression? Maybe there is a problem with the decompression logic that I can avoid by asking it not to do that.
I am at a total loss as to why the C library decompression routine would make a mistake and cause a hard fault every time now, when before it only happened after clicking reset and could be cleared by getting out of the debugger and back in.
You seem to be having a lot of fun with this platform.
Code that you provide must initialize external buses and memories BEFORE they are used. Keil's C runtime code, which initializes the statics, is called by __main prior to it calling your main() function. The linker's understanding of memory is controlled by the target GUI, or by the scatter file you supply. The processor Hard Faults in the same way a 68K would DTACK fault if the underlying memory (address) you are attempting to access is NOT present, or doesn't permit writing, etc.
You can add code to the startup_arch.s file, and you are expected to initialize SDRAM/DDR via routines in, and called by, SystemInit(). But be conscious that such routines CAN'T use memory that will be subsequently initialized. As such you should explicitly initialize any variables, and not assume they will be zero, or have statically assigned values. The C runtime code has not been run at this point.
I'd start by assuming that the linker/compiler are doing their job, the code has been tested and used in millions of devices.
The linker creates tables of load regions which need zeroing, or copying from flash. There may be options to prevent compression being used, but in general you're just seeing that it's more sensitive to bus/data anomalies, and this is really a big red flag for underlying issues you need to address.
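On the question of preventing compression: armlink (the MDK linker) documents a switch for this — worth verifying against your toolchain version's manual, but in ARM Compiler 5 it is:

```
--datacompressor=off
```

In uVision this can be added under Options for Target -> Linker -> Misc controls. With compression off, the scatter-loader uses a plain copy loop instead of the decompress routine, which can make the init sequence easier to follow in the debugger — but note it won't fix a corrupted or misplaced load image.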
Focus on what's broken in your M3 architecture, flash, and the prefetching strapped to the core. Use a debugger and trace hardware to understand exactly what is supposed to be read and what is actually being read by the executing code.
Review whether the .MAP file suggests data is being placed outside the scope of the hardware design.
Might also suggest you add a small assembler checksum or CRC routine as the first instructions in the ResetHandler, and have it validate the executable image against data you've computed on the PC from the generated .HEX/.AXF files.
This will save you a lot of time chasing ghosts in the machine.
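The checksum idea above can be sketched as follows: a small CRC-32 routine run over the flashed image very early in startup, compared against a reference value computed on the PC from the .HEX/.AXF file. (The image base/length symbols in the usage note are hypothetical; only the CRC routine itself is concrete.)

```c
#include <stdint.h>
#include <stddef.h>

/* Standard bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by
   zip/Ethernet). Table-less to keep the startup footprint tiny. */
uint32_t crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;  /* bit set: shift and xor poly */
            else
                crc >>= 1;                        /* bit clear: just shift */
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Called over the flash region (e.g. `crc32((const uint8_t *)IMAGE_BASE, IMAGE_LEN)` — both names hypothetical) before the scatter-load runs; on a mismatch, park in a loop rather than decompressing corrupt init data. For the conventional check string "123456789" this implementation returns the standard CRC-32 test vector 0xCBF43926.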
Thanks for your response. Yes, I have been turning very grey working with the M3 ARM. I have worked with many different microprocessors over the past 34 years; this is my first experience with an ARM, and it's been very frustrating.
I have verified that the correct device is selected, so the CMSIS initialization should be correct; it initializes the DDR memory, so doing it in the debugger script is redundant when using NVM-based code. I will not use the script until I start using remapping again, and will see if that helps. Other than DDR I do not see anything else being set up or initialized by the CMSIS code. The rest is the scatter-load code.
Outside of having the scatter file locate my variables and heap in DDR memory, I do not force any memory to be in any particular place. I leave that up to the linker.
Below is the part of the code that causes the fault, at address 0x210. R4 contains 0x9FFFFFEE, which is totally illegal, and the LDRB read causes the fault. I do not know what it is trying to read. I would have thought it would only be writing to memory, and using the stack area, if anything, for reads. It looks like it computed this value at 0x1FA by doing a subtraction with R1, which contains the start address of DDR, 0xA0000000. By going backwards from there instead of forwards into the DDR it arrives at a bad address.
__scatterload_copy:
0x000001C4 F8103B01  LDRB  r3,[r0],#0x01
0x000001C8 440A      ADD   r2,r2,r1
0x000001CA F0130403  ANDS  r4,r3,#0x03
0x000001CE BF08      IT    EQ
0x000001D0 F8104B01  LDRB  r4,[r0],#0x01
0x000001D4 111D      ASRS  r5,r3,#4
0x000001D6 BF08      IT    EQ
0x000001D8 F8105B01  LDRB  r5,[r0],#0x01
0x000001DC 1E64      SUBS  r4,r4,#1
0x000001DE D005      BEQ   0x000001EC
0x000001E0 F8106B01  LDRB  r6,[r0],#0x01
0x000001E4 1E64      SUBS  r4,r4,#1
0x000001E6 F8016B01  STRB  r6,[r1],#0x01
0x000001EA D1F9      BNE   0x000001E0
0x000001EC 2D00      CMP   r5,#0x00
0x000001EE D015      BEQ   0x0000021C
0x000001F0 F8104B01  LDRB  r4,[r0],#0x01
0x000001F4 F003030C  AND   r3,r3,#0x0C
0x000001F8 2B0C      CMP   r3,#0x0C
0x000001FA EBA10404  SUB   r4,r1,r4
0x000001FE BF0A      ITET  EQ
0x00000200 F8103B01  LDRB  r3,[r0],#0x01
0x00000204 EBA41483  SUB   r4,r4,r3,LSL #6
0x00000208 EBA42403  SUB   r4,r4,r3,LSL #8
0x0000020C F1050301  ADD   r3,r5,#0x01
0x00000210 F8146B01  LDRB  r6,[r4],#0x01
0x00000214 1E5B      SUBS  r3,r3,#1
0x00000216 F8016B01  STRB  r6,[r1],#0x01
0x0000021A D5F9      BPL   0x00000210
0x0000021C 4291      CMP   r1,r2
0x0000021E BF38      IT    CC
0x00000220 F8103B01  LDRB  r3,[r0],#0x01
0x00000224 D3D1      BCC   0x000001CA
0x00000226 4770      BX    lr
Pretty sure the decompression uses an LZ-type implementation: the compressed bit stream describes raw bytes, and run lengths to copy out of the previously decompressed output stream, i.e. stuff it's already unpacked into RAM, which is accessed by back-referencing into that data. This may go 2KB to 32KB back depending on the implementation.
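To make the back-referencing concrete, here is a toy LZ77-style decoder — this is NOT Keil's actual stream format, just the general mechanism: a match token copies bytes from data the decoder has already written to the output, so a bogus distance sends it reading below the output buffer, exactly the failure mode described above.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy token: dist == 0 means "emit literal", otherwise copy `len` bytes
   starting `dist` bytes behind the current output position. */
typedef struct {
    uint8_t  literal;  /* used when dist == 0 */
    uint16_t dist;     /* back-reference distance into the output, 0 = literal */
    uint16_t len;      /* match length when dist != 0 */
} token_t;

size_t lz_decode(const token_t *tok, size_t ntok, uint8_t *out)
{
    size_t pos = 0;
    for (size_t i = 0; i < ntok; i++) {
        if (tok[i].dist == 0) {
            out[pos++] = tok[i].literal;          /* raw byte from the stream */
        } else {
            /* Byte-by-byte so overlapping matches (dist < len) replicate
               data the decoder has just written, as in classic LZ77. */
            for (size_t j = 0; j < tok[i].len; j++, pos++)
                out[pos] = out[pos - tok[i].dist];
        }
    }
    return pos;  /* total decompressed length */
}
```

Decoding the tokens literal 'a', literal 'b', then a match (distance 2, length 4) yields "ababab": the overlapping copy re-reads bytes it has only just produced.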
In my limited validation of Keil decompression code, there seem to be a number of algorithms, or stream encoding schemes, depending on the nature and compressibility of the code.
All the code presumes the underlying hardware isn't bollixed up. This means that none of the critical paths in the core are clocked too fast, and the memory used for stack (auto/local) and memory used for data and code reliably return the data that was previously written to them.
The M3 doesn't have any cache making that less of an issue. Dynamic memory is notoriously complex, and synchronous ones even more so with respect to starting a transfer from a new address, and bursting streams of data.
en.wikipedia.org/.../LZ77_and_LZ78 en.wikipedia.org/.../Lempel–Ziv–Storer–Szymanski
I have spent this evening looking at what decompress is doing before it fails, and I found something interesting. When the uVision debugger is started, it writes a lot of values to my DDR memory, starting at the start of DDR. I found that the first thing the decompress assembly does is read these values, beginning at the start of DDR, and use them to compute addresses. As it loops, I see the start of DDR being overwritten by what I assume are the initial values that are supposed to be there. It overwrites the very values that it used to compute the addresses.
If I click reset and restart, of course those values are not there, but decompress uses them anyway, and that is what causes it to come up with a bad address. The debugger did not write the values into the DDR for it when restarting.
I do not know why the debugger would write these very critical values to the DDR, or why decompress would look to what should be uninitialized RAM for values to use. When I deploy this software, the debugger will not be there to write these values to DDR for decompress to use.
The code needs to write whatever values it needs for these addresses to DDR itself when starting up. That is not happening. I did a test: I started the debugger, it did its writes to memory, and I cleared the first 5 long words, hoping to see that the code would put the correct values back when it started and that the debugger was not needed. Well, it did not write anything to DDR, and it hard faulted.
So that is what is going on. I just need to know why this dependency on the debugger to write these critical values for decompress to work correctly exists and how to stop it.
Does this make sense?
Is there a compiler switch for debug versus deployment that may be preventing some additional initialization code from being generated?
If I can get that code to put those values out there in DDR, then it will not compute a bad address and hard fault.
How does the application get into DDR on the deployed system?
How does the data in DDR when it first starts relate to the data in the .HEX or .AXF representations of the memory space? Does a checksum mechanism confirm the integrity?
Does the image in DDR corrupt itself? Does it behave like a FLASH/ROM image? Or do things like the vector table get copied, modified, or remapped?
Do the ROM and RAM regions defined for the linker overlap? Or stack or heap create conflict?
Not sure what the debugger or debug script are contributing here, or whether the debugger is attempting to breakpoint the initial run so as to "run to main".
Does the code function normally if you deliver the image to DDR and run it absent the debugger?
I have flashed it into memory and let it run, and it still gets a hard fault.
I have been doing some testing/investigating and I think I was off in thinking it was a compiler setting. But there is something interesting going on. I wish I could post a screenshot, because that would make clear what I have found.
When the debugger starts, it does write to DDR. The eSRAM also contains the exact same data in the same order. I believe decompress is copying the data in eSRAM to the DDR. It writes what will be my initialized data. It writes to 0xA0000000 (mixed hex and ASCII):
0x0 0xD 0x1A SD_NO_ERROR . .. .
My memory map says that at this address should be my error array, whose first value is SD_NO_ERROR (ASCII). The 0 is the offset from the current write address, the 0xD is the length of the string, and the 0x1A is ???. There are 3 hex bytes in front of every string.
I can run to main, and this is what the code initialized the DDR memory to:
SD_NO_ERROR .....
This is exactly correct. What the debugger wrote was ignored and overwritten, as it should be.
Now, if I click reset and go to decompress, I see that the 3 bytes before the string are missing from eSRAM, so when it loads the offset to write it gets 0x53 ("S") instead of 0, and a count of 3 bytes instead of 0xD:
D_NO_ERROR
If I let it continue with these bad values it will eventually cause a hard fault, because it tried to write to address 0x9FFFFB5, which would be below my DDR.
If I reload with the debugger and run to the start of decompress, the eSRAM looks good, with those 3 bytes before each string. I can reset over and over, and every time it looks the same and correct.
But if I allow it to write just the first string, SD_NO_ERROR, to DDR, and then do a reset and run to the start of decompress, those 3 bytes are missing from eSRAM and SD_NO_ERROR is at 0x2000000 instead of those 3 bytes and then the string.
Why writing the string to DDR would cause whatever initializes the eSRAM with the string values to be off by 3 bytes is a mystery. The code is in NVM, and so should these initial values be that are being put into eSRAM; writing them out to DDR should have no effect.
Getting out of the debugger and back in somehow fixes this data in eSRAM, because after doing that and running to decompress, the eSRAM is correct again.
The debugger is doing something to get that data into the eSRAM correctly, although all of that should be done by the code itself.
This is why I thought it might be a compiler switch issue, but now it looks like the act of writing the data to DDR changes what is loaded on the next reset by 3 bytes.
I'll look farther back into the scatter loader to see how that data is being put into eSRAM memory.
OK, I found where the data that is being put into eSRAM is coming from. The scatter loader is called first, and it calls __scatterload_null, which, believe it or not, copies the values in the DDR into the eSRAM!!!!! Holy cow. The DDR has either just been written by the debugger with the correct init values, or was overwritten with the initialization values.
On the second (after reset) and all subsequent runs of the code, it copies the final initialized data in DDR into eSRAM and then copies the eSRAM data BACK INTO DDR. The 3 bytes at the beginning of each string are long gone; they are only there on the first run, when the debugger puts them into DDR. All runs of the code after that pick up whatever the DDR was initialized to.
There is no offset, string length, or any other information to tell the decompress routine how much to copy and where. It's just chaos at this point.
So the code is definitely depending on the debugger to put the proper data into DDR, so it can copy that to eSRAM, so that it can be decompressed back into DDR memory minus the 3 bytes of string information.
This cannot possibly be how this ARM processor and code will run operationally. It cannot depend on a debugger to put that information in DDR memory.
There should be code created by the linker to do this, not the debugger. Obviously this code is not being generated by the linker, so the proper initialization information is not being made available to the decompress routine.
How does one turn that on? I need some help here, because I have no idea why this missing code is not being created.
So exactly what does your scatter file look like?
Sounds a bit like some data that in a normal build should be stored in flash is in your build stored in a RAM region (downloaded by the debugger) that will be overwritten when the program starts to run - so a second run will basically have "parts of the 'nonvolatile flash' destroyed".
Exactly - show us your scatter file. The problem is there, or in your hardware.
OK, that would make total sense. I am sure I have accidentally put that initial data into the DDR memory space through the use of a wildcard *.
I didn't even think about that. I assumed that the data would be an integrated part of the code and not a separate data item. I just put every RW section into DDR. I assumed that this static data would be RO only. It should be RO data and not RW; I think that is a misclassification of that type of data.
Do you know what it would be called? It's the .data section, right? And .bss is my read/write variables?
0xA0000000 is where I found the scatter load putting the initialization data, and that is where the .data sections are grouped together.
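For reference on the section-name question, the standard C placement rules (the Keil toolchain uses .constdata where GNU tools say .rodata, but the idea is the same) can be shown with three one-line definitions:

```c
/* Typical section placement for C objects: */
int counter = 42;         /* .data - initialized RW data: copied from NVM at startup */
int scratch[64];          /* .bss  - zero-initialized RW data: zeroed at startup     */
const char tag[] = "v1";  /* RO data (.constdata/.rodata): stays in flash, no copy   */
```

So initialized read/write objects land in .data and need a load image in NVM to be copied (or decompressed) from; .bss needs no stored image at all, only zeroing; and const data never needs copying.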
FLASH_LOAD 0x00000000 0x00080000        ; load region size_region
{
    ER_RO 0x00000000 0x80000            ; load address = execution address
    {
        *.o (RESET, +First)
        *(InRoot$$Sections)
        ; startup_m2sxxx.o (.text)  system_m2sxxx.o (.text)
        ; sys_config.o (.text)  low_level_init.o (.text)  retarget.o (.text)
        * (+RO)
    }
    ER_RW 0x20000000 UNINIT 0x10000
    {
        startup_m2sxxx.o (STACK)
    }
}
MDDR_RAM 0xA0000000 0x10000000
{
    ER_DDR 0xA0000000 UNINIT 0x10000000 ; RW data DDR
    {
        * (+RW +ZI)
        * (HEAP)
    }
}
Base Addr   Size        Type  Attr  Idx   E Section Name  Object
0xa0000000  0x00001c8e  Data  RW    7       .data         accumulationnode.o
0xa0001c8e  0x00000002  Data  RW    535     .data         statemachine.o
0xa0001c90  0x00000004  Data  RW    777     .data         processidle.o
0xa0001c94  0x00000001  Data  RW    1008    .data         timerirq.o
0xa0001c95  0x00000001  PAD
0xa0001c96  0x00000002  Data  RW    1069    .data         dummy.o
0xa0001c98  0x000001e4  Data  RW    1204    .data         diskio.o
0xa0001e7c  0x00000006  Data  RW    1365    .data         ff.o
0xa0001e82  0x00000002  PAD
0xa0001e84  0x00000055  Data  RW    1435    .data         uart.o
0xa0001ed9  0x00000003  PAD
0xa0001edc  0x0000001c  Data  RW    1603    .data         system_m2sxxx.o
0xa0001ef8  0x00000004  Data  RW    1672    .data         retarget.o
0xa0001efc  0x00000008  Data  RW    1888    .data         mss_can.o
0xa0001f04  0x00000018  Data  RW    2047    .data         mss_pdma.o
0xa0001f1c  0x0000002c  Data  RW    2121    .data         mss_comblk.o
0xa0001f48  0x00004d34  Zero  RW    5       .bss          accumulationnode.o
0xa0006c7c  0x0000000c  Zero  RW    1203    .bss          diskio.o
0xa0006c88  0x000002d0  Zero  RW    1504    .bss          fault_handler.o
0xa0006f58  0x00000018  Zero  RW    1950    .bss          mss_hpdma.o
Moving the .data sections to NVM solved the problem. I can run to main every time now. Decompress does not even get called now. I could not see the forest for the trees.
Thank you very much for your help in leading me to the answer.
Steve
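For anyone who trips over the same thing later: the root cause visible in the scatter file above is that MDDR_RAM was its own load region, so the load addresses of the .data initializers were in DDR itself and only ever existed because the debugger downloaded them there. A sketch of the working shape (region names taken from the scatter file above; exact sizes and the UNINIT/HEAP details need checking against your device) keeps the DDR execution region inside the flash load region, so the initializers are stored in NVM and copied out by the scatter-loader at startup:

```
FLASH_LOAD 0x00000000 0x00080000        ; load region in NVM
{
    ER_RO 0x00000000 0x80000            ; load address = execution address
    {
        *.o (RESET, +First)
        *(InRoot$$Sections)
        * (+RO)
    }
    ER_RW 0x20000000 UNINIT 0x10000
    {
        startup_m2sxxx.o (STACK)
    }
    ER_DDR 0xA0000000 0x10000000        ; executes in DDR, load image in flash
    {                                   ; no UNINIT: .data copied, .bss zeroed
        * (+RW +ZI)
        * (HEAP)
    }
}
```

The key difference from the original is that ER_DDR sits inside the FLASH_LOAD braces and drops the UNINIT attribute, so the runtime actually initializes it.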
Isn't 0xA0000000 the base of your DDR? Why the heck are you copying stuff there, or allowing the areas to conflict?
Stuff that gets unpacked by the linker/loader needs to reside INSIDE the ROM/FLASH LOAD REGION braces.
If you are remapping 0 <-> 0xA0000000 you need to carve that space out of the linkers view so it doesn't put data over the top of it.
For this build I am not remapping. We needed to get this out by our deadline this Friday, so I created a smaller, non-remapped, NVM-based version of my code. So I am only using the DDR for my variables and heap right now. Just a huge hunk of RAM.
Once that delivery is done I will go back and work on the remapped version, and you're right, my code and variables will both be in DDR memory. I have a different scatter file for that mapping.
In fixing this initialization problem by moving the .data sections to NVM, my mallocs no longer allocate memory. They were working OK before making this change.
Do you know if there is anything in the .data section that would affect malloc?