Hi, I wonder if anyone could help me. I've got an SD card related problem. My project builds under uVision V4.73.0.0, C compiler Armcc.exe V5.03.0.76, device = STM32F427II. My target platform is a custom hardware design with a native SD card device (2 GB capacity, FAT file system) for which we're using Keil's RL-FlashFS, i.e. SDIO_STM32F4xx.c and File_Config.c (Rev V4.70).
The problem: the SD card becomes corrupt. I can still read and write files to it from my target platform perfectly well, and finit() returns 0, indicating the card is in good working order. However:
- ffree() takes over 1 second to complete, whereas it normally completes in approximately 70 ms on a card that hasn't been corrupted (this very slow response prevents my target booting, which is how I discovered the card corruption).
- My Windows PC reports 700+ MB of used space on the card when the files on the card add up to less than 10 MB.
- Windows CHKDSK reports the SD card has errors and can repair it; it finds around 24,000 x 32 KB bad clusters. Once CHKDSK has repaired the card, the used space roughly equals the size of the files on the card, and ffree() calls on my target platform complete in the usual 70 ms.
BTW, when I make a copy of the corrupted card using HDDGuru's HDDRawCopy 1.10, the copy has the same 700+ MB of wasted (corrupted) space just like the original, yet when I insert that copy into my target platform, calls to ffree() complete in the normal 70 ms time frame.
Specifically, I would like help with:
1. Detecting the SD card corruption on my target platform; everything appears to work fine apart from the very slow ffree(). Unfortunately fanalyse() and fcheck() aren't available to me because it is a FAT file system.
2. Understanding why a low-level copy of the card doesn't suffer from the very slow ffree() response.
3. Ultimately, stopping the corruption from occurring in the first place.
Many thanks in advance for any assistance/advice you can give me.
Paul
One note here - a binary copy of "intelligent" flash media will not result in an identical copy. The binary copy only duplicates the data at the file system level; it does not duplicate the underlying storage structure on the actual flash memory.
Remember that the flash controller in the card contains a translation layer as it finds suitable raw flash blocks for storing changes to logical file system sectors.
So a "low-level-formatted" card is much faster because the memory controller has lots of already-erased flash blocks that can be used immediately. A card that has been filled and then had its files erased will normally have no erased flash blocks, since the memory controller is normally not informed about file system sectors that are no longer in use.
So the memory controller may suffer huge write amplification when asked to perform a small write - it has to move data between flash blocks before it can get one block empty and erasable.
en.wikipedia.org/.../Write_amplification
This is why SSDs, for example, have a TRIM command - so the OS can say "this region of the logical address space is now unused". The SSD then knows it doesn't have to move that data when trying to empty a flash block for erasure. So TRIM reduces wear while greatly speeding up writes.
It's important to think twice about usage patterns when using flash media in embedded devices - especially since an SD card isn't as advanced as an SSD.
Instrument the SDIO sector-level access routines to understand whether it's stuck in there getting errors from the media, or what exactly the pattern of reads leading to the long delays is - i.e. are individual reads taking a long time, or is it doing a lot of reading?
The FAT file system isn't unduly complicated and is well documented; walk the structures to understand what's happened inside them.
Is your SD card formatted with FAT32 or FAT16? ffree() is most likely inspecting the FAT for allocated clusters, and it is probably missing one or more end-of-chain markers. There are various reasons for this to happen, such as a worn-out SD card, sudden power loss, etc.
I would inspect the FAT and compare it with the one repaired by CHKDSK. The comparison could reveal the cause.
Guys thanks for taking the time to respond I appreciate your thoughts.
I've read up on the FAT16 file system and write amplification, and have had a look at the card's block I/O timings.
The block I/O timings look consistent across both good and bad SD cards.
Comparing a good FAT with a bad FAT wouldn't help me discover the cause; it would also be a huge undertaking, as the Windows CHKDSK output contains thousands of entries for the corrupt card. Sample output below:
Lost chain cross-linked at cluster 5728. Orphan truncated.
Bad links in lost chain at cluster 5730 corrected.
Bad links in lost chain at cluster 5731 corrected.
This indicates to me that there is definitely corruption in the FAT. I'm guessing the ffree() call iterates through the entire FAT and totals up the free clusters; maybe the FAT corruption is why the call to ffree() is taking so long. I don't have access to Keil's FAT source code, so I'm guessing here.
I'm inclined to think a power outage is the most likely cause of the corruption, but I am unable to prevent it.
Theoretical Solution
What I’d like to do programmatically is detect and hence repair any card corruption in my target platform.
In theory I could write code to recreate on my target platform the same steps I undertook on my PC to detect and repair the bad card, i.e. look for an error in the number of used bytes on the card, then do a CHKDSK-type repair.
Problem
However, I think I've hit a dead end. I could probably add up the size of every file on the card, but if I do detect an anomaly between that figure and (total card size – ffree()), how do I effect a repair?
As I stated in my first post, RL-FlashFS doesn't appear to have any FAT repair routines. I would have to go for the "nuclear" option of a complete fformat() and lose all of the files.
Any ideas anyone?
Thanks
A raw copy should give you a second card with the same cross-linked clusters in the FAT chains - and you mentioned that you made a perfect disc copy and did not get the slowdown on the copied card.
So while broken FAT tables can result in big issues when later trying to allocate space or release files, it doesn't sound like this is the problem you are having.
The FAT file system really isn't very robust for embedded use.
Determining the number of unused FAT entries is a fairly trivial task, so I'm going to assume ffree() is instead chasing its tail through the links, and cross-links, in the table rather than just looking for the occupied ones.
The first test is to see if BOTH FAT copies are the same, and then to go through the cluster links and make sure you don't have more than one reference to the same next cluster. For lost and truncated chains you need to enumerate all files on the media, comparing the file size in bytes reported in the directory entry against the length and integrity of the cluster chain it points to.
Repair is more of a challenge: you could truncate cluster chains where the length doesn't match, or write in the new length. You could delete the files. In the cross-link case you'd have to delete all the files that share the compromised chain(s).
A half decent fsck will take a lot of time and resources, things often not workable in an embedded system.