
ARM RL-FlashFS native SD Card corruption ffree()

Hi,
I wonder if anyone could help me; I've got an SD card-related problem.
My project builds under µVision V4.73.0.0 with the ARM C compiler (Armcc.exe V5.03.0.76); the device is an STM32F427II.
My target platform, a custom hardware design, has a native SD card device (2 GB capacity and FAT file structure) for which we're using Keil's RL-FlashFS, i.e. SDIO_STM32F4xx.c and File_Config.c (Rev V4.70).

The problem
The SD card becomes corrupt. I can still read and write files to it from my target platform perfectly well, and finit() returns 0, indicating the card is in good working order. However:

- ffree() takes over 1 second to complete, whereas it normally completes in approximately 70 ms on a card that hasn't been corrupted (this very slow response prevents my target from booting, which is how I discovered the corruption).

- My Windows PC reports 700+ MB of used space on the card, even though the files on it add up to less than 10 MB.

- Windows CHKDSK reports that the SD card has errors and can repair it; it finds around 24,000 × 32 KB bad clusters. Once CHKDSK has repaired the card, the used space roughly equals the size of the files on it, and ffree() calls on my target platform complete in the usual 70 ms.

BTW, when I make a copy of the corrupted card using HDDGuru's HDDRawCopy 1.10, the copy has the same 700+ MB of wasted (corrupted) space as the original, yet when I insert the copy into my target platform, calls to ffree() complete in the normal 70 ms time frame.

Specifically, I would like help with:
1. Detecting the SD card corruption on my target platform; everything appears to work fine apart from the very slow ffree(). Unfortunately fanalyse() and fcheck() aren't available to me because it is a FAT file system.
2. Understanding why a low-level copy of the card doesn't suffer from the very slow ffree() response.
3. Ultimately, stopping the corruption from occurring in the first place.

Many thanks in advance for any assistance/advice you can give me.

Paul

  • One note here - a binary copy of "intelligent" flash media will not result in an identical copy. The binary copy only duplicates the data at the file system level; it cannot duplicate the underlying storage structure on the actual flash memory.

    Remember that the flash controller in the card contains a translation layer that finds suitable raw flash blocks for storing changes to logical file system sectors.

    So a "low-level-formatted" card is much faster because the memory controller will have lots of already erased flash blocks that can be immediately used. While a card that has already been filled and then had the files erased will normally not have erased flash blocks since the memory controller will normally not be informed about file system sectors that are no longer in use.

    So the memory controller may suffer huge write amplification when asked to perform a small write - it has to move data between flash blocks to get one block empty and erasable. For example, if a 512-byte logical write forces a 128 kB flash block to be copied and rewritten, that single write is amplified 256-fold.

    en.wikipedia.org/.../Write_amplification

    This is a reason why, for example, SSDs have a TRIM command - so the OS can say "this region of the logical address space is now unused". The SSD then knows it doesn't have to move that data when trying to get a flash block empty so it can be erased. So TRIM reduces wear while greatly speeding up writes.

    It's important to think twice about the usage patterns when using flash media in embedded devices - especially since an SD card's controller isn't as advanced as an SSD's.

  • Instrument the SDIO sector-level access routines to understand whether it's stuck in there getting errors from the media, or what exactly the pattern of reads leading to the long delays is - i.e. are the reads taking a long time, or is it doing a lot of reading?
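
    For example, a timing shim along these lines (a sketch only - rd_sect() is a hypothetical stand-in for whatever sector-read routine SDIO_STM32F4xx.c actually provides) logs every read and its duration using the Cortex-M4 cycle counter:

    #include <stdint.h>
    #include <stdio.h>
    #include "stm32f4xx.h"   /* CMSIS device header for DWT/CoreDebug */

    extern int rd_sect (uint32_t sect, uint8_t *buf, uint32_t cnt); /* hypothetical */

    int rd_sect_timed (uint32_t sect, uint8_t *buf, uint32_t cnt) {
      uint32_t t0, dt;
      int rc;
      CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable trace block */
      DWT->CTRL        |= DWT_CTRL_CYCCNTENA_Msk;       /* start cycle counter */
      t0 = DWT->CYCCNT;
      rc = rd_sect (sect, buf, cnt);
      dt = DWT->CYCCNT - t0;
      printf ("rd sect=%lu cnt=%lu rc=%d t=%lu us\n",
              (unsigned long)sect, (unsigned long)cnt, rc,
              (unsigned long)(dt / (SystemCoreClock / 1000000u)));
      return rc;
    }

    If the individual reads are quick but the read count explodes during ffree(), the time is going into the file system walking the FAT rather than into the media.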

    The FAT file system isn't unduly complicated, and it's well documented; walk the structures to understand what's happened inside them.

  • Is your SD card formatted with FAT32 or FAT16? ffree() is most likely inspecting the FAT for allocated clusters, and it is probably tripping over one or more missing end-of-chain markers. There are various reasons for this to happen: a worn-out SD card, sudden power loss, etc.

    I would inspect the FAT and compare it with the one repaired by CHKDSK; the comparison could reveal the cause.
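
    As a sketch of what that inspection could look like on the target (read_sect() is a placeholder for your driver's sector read; fat_start and fat_sects come from the BPB in the volume boot sector) - on FAT16 a free cluster is marked 0x0000 and an end-of-chain marker is 0xFFF8-0xFFFF:

    #include <stdint.h>

    extern int read_sect (uint32_t sect, uint8_t *buf);   /* placeholder */

    /* count free FAT16 entries - roughly the work ffree() has to do */
    uint32_t fat16_free_clusters (uint32_t fat_start, uint32_t fat_sects) {
      static uint8_t buf[512];
      uint32_t s, i, free_cnt = 0;
      uint16_t e;
      for (s = 0; s < fat_sects; s++) {
        if (read_sect (fat_start + s, buf) != 0) break;
        for (i = 0; i < 512; i += 2) {      /* 256 16-bit entries per sector */
          e = (uint16_t)(buf[i] | ((uint16_t)buf[i + 1] << 8));
          if (e == 0x0000) free_cnt++;      /* 0x0000 = free cluster */
        }
      }
      return free_cnt;   /* x cluster size = free bytes, comparable to ffree() */
    }

    Dumping the same sectors before and after a CHKDSK repair and diffing the two images would show exactly which entries CHKDSK rewrote.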

  • Guys, thanks for taking the time to respond - I appreciate your thoughts.

    I've read up on the FAT16 file system and write amplification, and I've looked at the card's block I/O timings.

    The block I/O timings look consistent across both good and bad SD cards.

    Comparing a good FAT with a bad one wouldn't help me discover the cause, and it would also be a huge undertaking, as the Windows CHKDSK output contains thousands of entries for the corrupt card; sample output below:

    Lost chain cross-linked at cluster 5728. Orphan truncated.
    Bad links in lost chain at cluster 5730 corrected.
    Bad links in lost chain at cluster 5731 corrected.

    This indicates to me that there is definitely corruption in the FAT. I'm guessing the ffree() call iterates through the entire FAT and totals up the free clusters; maybe the FAT corruption is the reason the call takes so long. I don't have access to Keil's FAT source code, so I'm guessing here.

    I'm inclined to think that a power outage is the most likely cause of the corruption, but I am unable to prevent outages.

    Theoretical Solution

    What I'd like to do is detect, and then repair, any card corruption programmatically on my target platform.

    In theory I could write code that recreates on my target platform the same steps I undertook on my PC to detect and repair the bad card, i.e. look for an error in the number of used bytes on the card, then do a CHKDSK-type repair.

    Problem

    However, I think I've hit a dead end. I could probably add up the size of every file on the card, along the lines of the sketch below, but if I do detect an anomaly between that figure and (total card size - ffree()), how do I effect a repair?
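
    In outline, the detection half might look like this - CARD_BYTES, the 10 MB slack for cluster/directory overhead and the non-recursive "*.*" pattern are all assumptions to adapt; ffind()/FINFO and ffree() are used as documented for RL-ARM 4.x:

    #include <RTL.h>                 /* RL-ARM API: ffind(), ffree(), FINFO */

    #define CARD_BYTES (2000uL * 1024uL * 1024uL)   /* assumed usable capacity */

    /* returns nonzero if used space greatly exceeds the directory contents */
    int card_space_suspect (void) {
      FINFO info;
      U32 used = 0;
      info.fileID = 0;               /* must be zeroed before the first ffind */
      while (ffind ("*.*", &info) == 0)
        used += info.size;           /* note: current directory only */
      return (CARD_BYTES - ffree ("")) > used + (10uL * 1024uL * 1024uL);
    }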

    As I stated in my first post, RL-FlashFS doesn't appear to have any FAT repair routines, so I would have to go for the "nuclear" option of a complete fformat() and lose all of the files.

    Any ideas anyone?

    Thanks

    Paul

  • A raw file system copy should give you a second card with the same cross-linked clusters in the FAT chains - yet you mentioned that you made a perfect disc copy and did not get the slowdown on the copied card.

    So while broken FAT tables can result in big issues when later trying to allocate space or release files, it doesn't sound like this is the problem you are having.

    The FAT file system really isn't very robust for embedded use.

  • Determining the number of unused FAT entries is a fairly trivial task, so I'm going to assume ffree() is instead chasing its tail through the links, and cross-links, in the table rather than just looking for the occupied ones.

    The first test is to see if BOTH tables are the same, and then to go through the cluster links and make sure you don't have more than one reference to the same next cluster (see the sketch below). For lost and truncated chains you need to enumerate all the files on the media, comparing the size in bytes reported in each directory entry against the length and integrity of the cluster chain it points to.
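
    A cross-link check along those lines might look like this sketch - get_fat_entry() is a placeholder, and the 8 KB bitmap (one bit per possible FAT16 entry) is affordable on an STM32F427 but illustrates the RAM cost mentioned below:

    #include <stdint.h>
    #include <string.h>

    extern uint16_t get_fat_entry (uint32_t cluster);   /* placeholder */

    /* returns 1 if any cluster is the "next" of more than one chain */
    int fat16_has_cross_links (uint32_t n_clusters) {
      static uint8_t seen[65536 / 8];    /* 8 KB: one bit per FAT16 entry */
      uint32_t c;
      uint16_t nxt;
      memset (seen, 0, sizeof seen);
      for (c = 2; c < n_clusters + 2; c++) {
        nxt = get_fat_entry (c);
        if (nxt < 2 || nxt >= 0xFFF0) continue;  /* free, reserved, bad, EOC */
        if (seen[nxt >> 3] & (1u << (nxt & 7)))
          return 1;                              /* cluster referenced twice */
        seen[nxt >> 3] |= (uint8_t)(1u << (nxt & 7));
      }
      return 0;
    }

    A complete check would also mark each file's first cluster from the directory entries, since a chain head can be cross-linked too.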

    Repair is more of a challenge: you could truncate cluster chains where the length doesn't match, or record the new length in the directory entry. You could delete the files. In the cross-link case you'd have to delete all the files that share the compromised chain(s).

    A half-decent fsck takes a lot of time and resources - things that are often not workable in an embedded system.

  • Paul:

    I recently built a CAN bus to SD card logger on an STR731 for internal use.

    I had many corruption / lost-file problems with the MDK 4.14 file system. Removing power would corrupt the card if files were not closed first.

    I seem to remember 4.74's file system being better, but all known problems went away with the file system in MDK 5.15. I ended up migrating the project and file system to MDK 5.15 via the legacy pack; it may be worth a try if the newer file system is available to you. 5.15 may lose the tail of the log if power is removed, but that is the worst of it.

    ffree() does take progressively longer to complete as the card fills up, even without corruption.
    I am using a card with fewer than 20 files, but all but 4 of them are quite large; ffree() takes 5 or more seconds on a 32 GB card with 4 GB used. Opening a new file also takes increasingly long as space is used up. I could watch the SPI traffic increase as time went on; I was testing by writing data at over 400 kB/second until the card was full.

    Note: I am using a 5-6 MHz (max) SPI bus and don't have the RAM to cache much from the SD card. The delays would likely be much smaller with a faster clock.

    Chad

  • The FAT32 table can run to 10K or 100K sectors - several orders of magnitude bigger than the memory resources of the average microcontroller being used with it.

  • Removing power would corrupt the card if files were not closed prior.

    That's entirely to be expected of a FAT file system, particularly one running on flash storage.

    FAT is not, and never has been, robust against surprise power loss or hard reset events. In the old days, when PCs ran DOS and had real power switches and reset buttons, nobody was surprised by CHKDSK finding problems after either of those was used without preparation. "Close all programs properly before power-down!" was a routine experienced MS-DOS users had got used to even before Windows 3.1 and '95 drilled it into everyone with force. On development machines, where programs would crash and require a hard reboot more frequently, it was customary to run CHKDSK on every boot.

    On top of that, using FAT on flash media would be essentially impossible unless there's a sector-remapping mechanism for wear levelling sitting between the FAT file system driver and the raw medium - which is why "raw" flash file systems should never use FAT.

    As it is, the necessarily high frequency of updates to the FATs triggered by open, continuously growing files will stress the wear-levelling mechanism quite a bit. A surprise power loss will cause a failure to commit either the FATs themselves or the remapping tables to non-volatile storage. If you utilize the full write speed, the on-medium data will practically never be in a state fit for a clean shutdown, so every power loss will leave it corrupted.

  • I had a similar problem

    I use an STM32F2 driving the SDIO interface to the SD card (all Keil software, from early versions of 4 to 4.74).

    I was collecting CAN data and a few sensors' data and logging it to the SD card.

    I stopped the SD card corrupting by making sure I only ever wrote to it in whole sectors, e.g. once 512+ bytes have accumulated in a RAM buffer, commit 512 of them to the SD card (see the sketch below).

    Why? It reduces the number of sector reads and writes, e.g. to the root directory entry, FAT, cluster chain and data clusters.

    When the system powers down I write out everything left in the RAM buffer.
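
    A minimal sketch of that scheme, assuming the stdio retargeting (fopen/fwrite/fclose) that RL-FlashFS provides:

    #include <stdio.h>
    #include <string.h>

    static unsigned char sect_buf[512];
    static unsigned int  sect_fill;

    /* stage data in RAM; hit the card only with whole 512-byte sectors */
    void log_write (FILE *f, const unsigned char *p, unsigned int n) {
      unsigned int room, take;
      while (n) {
        room = 512u - sect_fill;
        take = (n < room) ? n : room;
        memcpy (&sect_buf[sect_fill], p, take);
        sect_fill += take;  p += take;  n -= take;
        if (sect_fill == 512u) {
          fwrite (sect_buf, 1, 512u, f);   /* one whole sector per commit */
          sect_fill = 0;
        }
      }
    }

    /* called on the power-down notice: push out the final partial sector */
    void log_shutdown (FILE *f) {
      if (sect_fill) fwrite (sect_buf, 1, sect_fill, f);
      fclose (f);
    }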

    Another thing to be aware of: I had an issue where the number of files in a folder slowed down file access times, e.g. no problems with fewer than 30 files, but it started taking much longer with about 60, and with more than, say, 150 the system would crash - the file operations took so long that the system stopped task switching, meaning the watchdog wasn't kicked in time. (There are some good "while" loops in the SDIO driver that the system would appear to crash in.)

    I worked around it by storing the data in month/year-named folders ("MxxY20xx", so a maximum of 31 files per folder); that stopped the file system slowing down.
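
    For illustration, building the folder scheme can be as simple as this (hypothetical names; check which path separator your FlashFS version expects):

    #include <stdio.h>

    /* e.g. mon=5, year=2015, n=17 -> "M05Y2015/LOG17.TXT" */
    void make_log_path (char *path, unsigned mon, unsigned year, unsigned n) {
      sprintf (path, "M%02uY%04u/LOG%02u.TXT", mon, year, n);
    }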

    It's all in the forum threads, so just search for Danny Curran and you'll find the file issues I had and how I got round them.

  • Terrible software! People pay for this :-(!

  • But in the good old days, when a PC had at most 640 kB of memory, people had to remember to split their log data into subdirectories to avoid slowdowns. And most embedded devices have much less RAM available.

    The next thing is that flash media is bad at emulating a hard disk - traditional file systems for HDDs/FDDs are optimized for media with completely different behavior.

    EEPROM is often a better choice for small-to-medium data collection. And for large-scale data collection, it really matters to align data and match writes to the flash block size - e.g. performing 128 kB writes at 128 kB alignment, or even 1 MB writes at 1 MB alignment.
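
    A sketch of that approach, writing raw (no FAT) in erase-block-sized, naturally aligned chunks - the 128 kB figure and media_write() are assumptions, and the buffer is only realistic on parts with enough RAM:

    #include <stdint.h>

    #define BLK (128uL * 1024uL)   /* assumed erase-block size of the card */

    extern void media_write (uint32_t addr, const uint8_t *p, uint32_t n);

    static uint8_t  blk_buf[BLK];  /* needs 128 kB of RAM */
    static uint32_t blk_fill;
    static uint32_t blk_addr;      /* kept BLK-aligned throughout */

    void blk_write (const uint8_t *p, uint32_t n) {
      while (n--) {
        blk_buf[blk_fill++] = *p++;
        if (blk_fill == BLK) {     /* exactly one whole, aligned block */
          media_write (blk_addr, blk_buf, BLK);
          blk_addr += BLK;
          blk_fill  = 0;
        }
      }
    }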

    The FAT file system works badly on embedded equipment running a full Linux and writing to SD cards; it isn't likely to work better on much smaller embedded devices running Keil's small-RAM adaptation.

    The optimum is to not require PC compatibility for the SD card data, and instead optimize for stability and performance. Then either let the embedded device perform some translation and export the data over USB, or let the PC see a single huge file on the memory card and use a custom PC application that extracts the data inside that container.

    In the end, Keil can't do miracles with FAT.

  • Just some basics: most SD cards are essentially a NAND flash chip plus an ASIC controller which handles access to the NAND as well as ECC, wear levelling and so on.

    What happens when you write to a sector on an SD card is that the controller reads the block being accessed into internal RAM, updates the RAM with the new data, erases the block, and writes the whole block back. The worst case on power loss is when it happens during a block erase or write: the whole block's content is lost, so the damage is not limited to one file's data - if it happens on a FAT block it corrupts much more.

    If you write data to a file, the card first writes the data blocks; the FAT is then updated accordingly, as clusters are allocated to the growing file or when the file is closed.

    So, to make it fail-safe, you would need the processor to detect the power failure and to have enough capacitance to keep it running for as long as it needs to properly finish any pending card writes; after that the processor must stop accessing the card, and the system can then power off without leaving the card in an odd state.
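
    A sketch of that strategy - the early-warning interrupt source, its handler name and the hold-up capacitance are all hardware assumptions:

    #include <stdio.h>

    extern FILE *log_file;
    static volatile int power_failing;

    /* supply-comparator interrupt: just set a flag, don't touch the card here */
    void POWER_FAIL_IRQHandler (void) {
      power_failing = 1;
    }

    /* called from the main loop while the hold-up capacitance keeps us alive */
    void logger_poll (void) {
      if (power_failing && log_file != NULL) {
        fclose (log_file);    /* flushes data, completes FAT/directory updates */
        log_file = NULL;      /* no further card access from here on */
      }
    }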