All,
I am trying to track down a problem of some code that was written by an 'overseas' 3rd party (I will be nice and not state the country of origin).
This code uses a single timer set at a 1 ms interrupt rate to determine whether the SPI is still communicating externally with its master. If no communication is detected, the timer handler resets the SPI port, clears the interrupt, and jumps to the reset vector. The problem is that the SPI port remains dead until the board is power cycled.
The obvious fix is to use the watchdog (which is what I will eventually do), but I would like to understand the why of why this does not work (yes, bad coding practice is the real reason)...
Since the code jumps to the reset vector this is what I have been able to analyze:
(1) Since this is not a true reset (ie: via watchdog) all hardware registers are not reset - problem potential here.
(2) The jump to the reset vector was accomplished while in supervisor mode, so the privileged registers (ie: SP, etc) can be written.
(3) The timer interrupt was cleared prior to making the jump to the reset vector, so all interrupts are still enabled.
(4) Since this is not a true reset, all resident code can still execute (ie: interrupt handlers).
(5) The startup code will reset all initialized data, registers, etc prior to jumping to program main(), effectively returning data to a power up state.
One reason I can currently come up with as to why the SPI is never functional after this occurs is that maybe an interrupt occurs while in the startup code (clearing a tracking variable or resetting the processor registers). But the interrupt would also inhibit the startup code until it was serviced. This potential cause is (probably) not the only reason for this issue, and why I am asking for your input(s).
Unfortunately, this board has no JTAG connection, so stepping through the code is not an option. I could write to the serial port if it were connected, but it isn't. Right now I am trying to analyze my way through this code before using a 'hammer' approach to solving this problem.
What else am I missing in this analysis? Thanks.
(1) Since this is not a true reset (ie: via watchdog) all hardware registers are not reset - problem potential here.
It is a common code monkey belief that jumping to the reset vector actually does a reset.
Try timing out the watchdog instead.
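A minimal sketch of what "timing out the watchdog" could look like on an LPC2103-class part. The register layout, the base address 0xE0000000, the bit positions and the minimum reload value 0xFF are assumptions taken from memory of the LPC2101/02/03 user manual and should be checked against it; the names wdt_regs_t and wdt_arm_shortest_timeout are made up for illustration. The register block is passed as a pointer so the sequence can also be exercised against a plain struct on a PC.

```c
#include <stdint.h>

/* Assumed watchdog register block (verify against the user manual). */
typedef struct {
    volatile uint32_t WDMOD;   /* mode: bit 0 = WDEN, bit 1 = WDRESET */
    volatile uint32_t WDTC;    /* timeout reload value */
    volatile uint32_t WDFEED;  /* feed sequence: 0xAA followed by 0x55 */
    volatile uint32_t WDTV;    /* current counter value */
} wdt_regs_t;

#define WDT_BASE_ASSUMED ((wdt_regs_t *)0xE0000000u) /* assumed base address */
#define WDMOD_WDEN       (1u << 0)
#define WDMOD_WDRESET    (1u << 1)
#define WDTC_MIN         0xFFu  /* assumed smallest accepted reload value */

/* Arm the watchdog with the shortest timeout and start it with one feed.
 * The caller is expected to disable interrupts first, so nothing can
 * break the feed sequence, and to spin in an empty loop afterwards
 * until the chip resets. */
static void wdt_arm_shortest_timeout(wdt_regs_t *wdt)
{
    wdt->WDTC   = WDTC_MIN;
    wdt->WDMOD  = WDMOD_WDEN | WDMOD_WDRESET;
    wdt->WDFEED = 0xAAu;
    wdt->WDFEED = 0x55u;
}
```

On the target this would be called as wdt_arm_shortest_timeout(WDT_BASE_ASSUMED) in place of the jump to the reset vector. Unlike the jump, a watchdog reset puts the peripherals back into their documented reset state.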
Erik
Erik,
Yes, I agree. However (from initial post):
"The obvious fix is to use the watchdog (which is what I will eventually do), but I would like to understand the why of why this does not work (yes, bad coding practice is the real reason)..."
Thanks
Let's say that your SPI data may never contain 10 bytes in a row with value 0x00.
So if the master doesn't get an answer - send 12 or more zeroes. Then make a pause of say 50 ms. Then start sending real data again.
The slave should be able to notice the long row of zeroes and know that it is a request to synchronize. When it then sees the pause, it can reinitialize the SPI controller and start waiting for more data, having cleared the internal SPI bit counter.
This method wastes a bit of time for synchronizing, but has the advantage that when the sender and receiver are synchronized, you will be able to keep a quite high speed without wasting time performing a lot of bit manipulations in the slave. Counting the number of consecutive 0x00 bytes is quite cheap. And only after having seen at least 10 bytes of zero do you need to start measuring whether there is a pause in the transfer (which is needed since you may get 10, 11 or 12 zero bytes).
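The slave side of this scheme could be sketched roughly as below. All names (spi_sync_t, spi_sync_on_byte, spi_sync_on_tick) are hypothetical, and the 25 ms threshold is an assumed value chosen to sit safely below the 50 ms pause sent by the master. The byte hook would be called from the SPI receive path and the tick hook from the existing 1 ms timer.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYNC_MIN_ZEROS 10u  /* zero bytes required before a pause counts */
#define SYNC_PAUSE_MS  25u  /* assumed threshold, below the 50 ms pause */

typedef struct {
    uint32_t zero_run;  /* consecutive 0x00 bytes, saturating at the minimum */
    uint32_t idle_ms;   /* milliseconds since the last received byte */
} spi_sync_t;

/* Call for every byte received from the SPI peripheral. */
static void spi_sync_on_byte(spi_sync_t *s, uint8_t byte)
{
    if (byte != 0x00u) {
        s->zero_run = 0u;           /* real data: no sync request pending */
    } else if (s->zero_run < SYNC_MIN_ZEROS) {
        s->zero_run++;              /* count up to the minimum, then hold */
    }
    s->idle_ms = 0u;                /* any byte restarts the pause timer */
}

/* Call once per 1 ms timer tick. Returns true exactly once when a long
 * zero run has been followed by a pause; the caller then reinitializes
 * the SPI controller to clear its internal bit counter. */
static bool spi_sync_on_tick(spi_sync_t *s)
{
    if (s->zero_run < SYNC_MIN_ZEROS) {
        return false;               /* nothing to time yet */
    }
    if (++s->idle_ms < SYNC_PAUSE_MS) {
        return false;               /* pause not long enough so far */
    }
    s->zero_run = 0u;
    s->idle_ms  = 0u;
    return true;
}
```

A run of zero bits reads as 0x00 even when the slave's bit counter is misaligned, which is why the zero run can be detected before the resynchronization has happened.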
The next thing to do if the interface suffers from noise is of course to make sure that all messages have strong integrity checking. At least CRC-32, but possibly even better. Maybe you should even consider a two-dimensional scheme.
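As an illustration of the CRC-32 option, here is a small bitwise implementation of the common reflected CRC-32 (polynomial 0xEDB88320, the variant used by Ethernet and zlib). The frame layout checked by frame_is_valid, payload followed by a 4-byte little-endian CRC, is purely an assumption for the example and not the format used on this board.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320. Pass 0 as the initial
 * value for a new message. Slow but table-free, which suits small parts. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *data, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical frame: payload bytes followed by a little-endian CRC-32. */
static int frame_is_valid(const uint8_t *frame, size_t len)
{
    if (len < 4u) {
        return 0;                   /* too short to carry a CRC */
    }
    size_t n = len - 4u;
    uint32_t received = (uint32_t)frame[n]
                      | ((uint32_t)frame[n + 1u] << 8)
                      | ((uint32_t)frame[n + 2u] << 16)
                      | ((uint32_t)frame[n + 3u] << 24);
    return crc32_update(0u, frame, n) == received;
}
```

The standard check value for this CRC variant is 0xCBF43926 for the ASCII string "123456789", which is a quick way to confirm that master and slave agree on the algorithm.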
Per,
An even better, less time-consuming recovery method!
Excellent suggestion. Thanks.
You don't even need to time out the watchdog on many ARM parts. You can simply tell it "I want you to reset the chip _right now_."
Christoph,
According to NXP's LPC210x manual on reset, the following is stated:
"Reset has two sources on the LPC2101/02/03: the RESET pin and watchdog reset."
I am also aware that you can invoke reset(s) through the VICSoftInt register on the LPC2103 and through the STIR and ISPR registers on the Cortex LPC1765.
The criteria for using the software interrupt register seems to differ between the two chips in that on the Cortex chip it is stated by NXP that:
"The STIR register provides an alternate way for software to generate an interrupt, in addition to using the ISPR registers. This mechanism can only be used to generate peripheral interrupts, not system exceptions. By default, only privileged software can write to the STIR register. Unprivileged software can be given this ability if privileged software sets the USERSETMPEND bit in the CCR register"
However, in tests on the Cortex chip I wrote to the STIR (within an SVC call, to get into privileged mode from user-mode code) and then to the ISPR registers. In both cases the W/D interrupt was entered, BUT neither the WDTOF nor the WDINT bit was set. This implies to me that these registers can be used for test-type purposes only.
If I am missing something else that is available on the LPC2103 please elaborate.
Thanks.
At times, I can understand why some code looks like patchwork or is not written with a clear end-purpose in mind...
Example: Given the two choices for correcting the potential SPI port issue, (1) set independent clocks in an iterative manner until recovery is achieved (this might not even work) or (2) use the much more efficient suggestion from Per, the former was chosen. WHY??? Because the 'decision-makers' want the fix to be 'backwards compatible' with existing field units without making changes to the master (unless a bug is found in the slave code - thus my request for additional ideas I may be (am) missing).
Arrrrrrrrrrrrrrrrrrrrrgh. Why do engineers (seem to) die young? Or feel old?
Additionally, after running simulation tests ad nauseam, no noticeable differences have been observed between the start-up and 'soft' reset states of the system or peripheral registers. Now, I cannot simulate random external influences to any great degree, or rather I can, but the simulations would never end.
Finally, every time I try to give the 'person' who wrote this code the benefit of the doubt (I DO NOT call this 'person' a 'developer' for reasons soon to become obvious) I quickly lose any pity I ever had.
Why?
Dealing with code such as this (no comments anywhere):
memcpy(StartRam, StartFlash, 0x3A7);
OR
if (Bitflag & 16) { ... }
What the h... is 0x3A7 or 16??? Gotta track it down...
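For comparison, the same two lines with the magic numbers given names might look something like the sketch below. The names are placeholders invented here; what 0x3A7 and bit 4 actually mean still has to be tracked down in the original code before they can be named properly.

```c
#include <stdint.h>
#include <string.h>

/* Placeholder names: the real meaning of these values is unknown. */
enum { RAM_IMAGE_SIZE_BYTES = 0x3A7 };     /* was: 0x3A7 */
#define FLAG_BIT4_PLACEHOLDER (1u << 4)    /* was: 16 */

/* Copy the image from flash to RAM; the size lives in exactly one place. */
static void copy_ram_image(uint8_t *start_ram, const uint8_t *start_flash)
{
    memcpy(start_ram, start_flash, RAM_IMAGE_SIZE_BYTES);
}

/* Test the flag by name rather than by a bare decimal constant. */
static int flag_bit4_is_set(uint32_t bitflag)
{
    return (bitflag & FLAG_BIT4_PLACEHOLDER) != 0u;
}
```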
Finally, this 'person' decided that ALL, repeat ALL, variables (whether file global or local) are to be declared as : volatile
Only 'bout 1000 or so, according to the linker .map file. Feelin old....
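For contrast, a rough sketch of where volatile normally earns its keep: an object written by an interrupt handler and read by the main loop (or a memory-mapped register), and nothing else. The names below are hypothetical and not taken from the code in question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Written by the ISR, read by the main loop: volatile makes the compiler
 * re-read it on every access instead of caching it in a register. */
static volatile bool spi_activity_seen;

/* Body of a hypothetical SPI receive interrupt handler. */
static void spi_rx_isr_body(void)
{
    spi_activity_seen = true;
}

/* Main-loop side: report and clear the flag. Note that volatile does not
 * make this read-then-clear atomic; an interrupt arriving between the two
 * statements would be lost unless interrupts are masked around them. */
static bool consume_activity_flag(void)
{
    bool seen = spi_activity_seen;
    spi_activity_seen = false;
    return seen;
}

/* Ordinary computation: locals and parameters need no volatile, which
 * leaves the compiler free to keep them in registers. */
static uint32_t byte_sum(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
```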
I don't know about captain Vince, but I consider this to be a form of terrorism. I'm serious.
The funny thing here is that the "volatile" keyword is known to be one of the most problematic for compilers. Not just in efficiency, but in producing broken code. www.cs.utah.edu/.../emsoft08-preprint.pdf
A good link about volatile: blog.regehr.org/.../28
Found these gems associated with the name of this thread (credit to the authors). Applicability is in the reader's eyes:
"The infinite monkey theorem states that a monkey hitting keys at random on a typewriter keyboard for an infinite amount of time will almost surely type a given text, such as the complete works of William Shakespeare."
Better yet:
"Would a million rednecks shooting at road signs ultimately produce the entire works of Shakespeare in braille?"
Didn't want to 'cross this bridge' in the original post, as most of the responses to this 'structuring' (used very loosely here - actually probably the wrong description altogether) would probably have masked the original question(s) I was trying to get answered above.
After reading the abstract of the first link, I must take the time now to stop and read this information to its conclusion before resuming my other functions (I guess it was also for me, so as not to get waylaid).
Good, informational links though.
Tamir,
I call this criminal (especially since money was accepted for this trash).
The paper submitted by Utah SC should be a MUST READ for anyone that is about to use the volatile keyword.
What is unnerving here is that the lack of clarity from the C specification allows individual interpretation as to how the compiler should/will handle these situations. As a consequence, it was found that compiler bugs are commonplace!
I was well aware that a severe processing hit was occurring because of the declarations, but was unaware of the bug potential generated by the compiler.
Thankfully, I do not see (at this point) any real need to use the volatile keyword within any portion of this code. The downside to removing these keywords is that, given the way this code has been written (read: hacked), some undocumented timing dependency(ies) will rear its (their) ugly head(s) and smite me. What deadline?
A note of caution: the description of volatile in the 'C' standard is sufficiently vague that some of the interpretations used in that paper are open to debate. A particular example that springs to mind is the explanation and example given in section 2.1 - this one has been fairly convincingly debunked out there on the internet.
But the important thing is that the description of the volatile keyword in the 'C' standard is sufficiently vague that it doesn't matter what you debunk in one debate - it still matters what specific compiler developers thought the description meant.
The bottom line is that we are extremely vulnerable, since we are relying on what the compiler developer thought, and not what the standard writers thought. And a large number of people are assuming things that either should be wrong, or are wrong for their specific compiler.
we are relying on what the compiler developer thought, and not what the standard writers thought
"if you heard what I thought I said we would understand each other"