
AR166 TCPnet AX88796 Ethernet driver: prone to crash

Hello,

I work with a PCB built on an XC2287M at 80 MHz; the Ethernet chip is an AX88796B with a 25 MHz crystal. Adding Ethernet functionality revealed unstable firmware behavior that shows up as a sporadic crash: given a 'long enough' time under stress tests, the target was found locked in an endless loop inside the Ethernet interrupt. This intermittent defect was sensitive to such factors as changing the memory model or optimization level, and adding or 'slightly' editing code in non-Ethernet-related modules. I wonder if anyone has met similar issues and how they were solved.

This topic has some prehistory published in http://www.keil.com/forum/20300/; here are the findings showing the solution, plus some remaining questions. I think it will be useful for the next Keil middleware release. Some stack vendors do not like to supply examples with extended error-handling and re-initialization routines that would make a driver really robust, so I will appreciate any specific suggestions on further driver improvement, and on whether other overlooked factors are at play behind the scenes.

To add Ethernet functionality, the AX88796.c module from AR166/TCPnet/Drivers of the commercial ARTX-166 package, version 3.2, was reused as a template for our particular chip. Other details on the IDE are in the referenced post. The starting point was HTTP_DEMO on the MCB-XC167 board with the AX88796L chip.

Recommendations from ASIX were implemented to adjust init_ethernet() for the B chip used on our PCB; the MCB-XC167 uses the L version, and B and L differ. interrupt_ethernet() and send_frame() were left intact. write_PHY() and read_PHY() for some reason turned out to be unusable on our target, and no time was devoted to finding out why; instead, the PHY access routines were successfully replaced with code ported from an older Linux driver. After these preliminary steps the firmware did its work on our target, except that 'sometimes' it hung; a ping from a Windows XP machine was used for the tests.

An investigation has shown with a rather high level of confidence that the following piece of Ethernet code is broken. The snippet initiates an internal DMA transfer and then reads the data port; there are 2 similar blocks in the interrupt handler:

void interrupt_ethernet (void) : {
        ...
        /* Start remote DMA read transfer */
        HVAR(U8, CMDR)  = MSK_PG0 | CR_RRE | CR_START;
        /* Read the length and status of this packet. */
        State =  HVAR(U16, DATAPORT);
        RxLen = (HVAR(U16, DATAPORT) - 3) & 0xFFFE;
        ...
        /* Start remote DMA read transfer */
        HVAR(U8, CMDR)  = MSK_PG0 | CR_RRE | CR_START;
        ...
}

The fix shown below made the code stable, and no crashes or 'bogus' (see below) values/packets were seen anymore. I used some counters and spy variables to monitor the interrupt via a UART.

        HVAR(U8, CMDR)  = MSK_PG0 | CR_RRE | CR_START;   //
        while( ((HVAR(U8, CSR_DSR)) & 0x20 ) != 0x20 );  // the fix: polling 0010 0000b mask RD_RDY

CSR_DSR points to the Device Status Register (DSR) in the MAC's Control & Status Registers, offset 17h. According to the ASIX datasheet, DSR bit 5, RD_RDY, is the Read Data Port Ready bit: when set, it indicates that data from SRAM is ready at the data port for the host to read. In other words, the root cause is that the original driver does not take into account the real speed ratio of the Ethernet chip and the host: code that seemingly worked on the slower MCB-XC167 failed on the faster board. The 80 MHz was the hint.
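One caveat with the polling fix above is that it is itself an unbounded loop inside the interrupt. A minimal sketch of a bounded variant follows, assuming a spin budget is acceptable; wait_rd_rdy(), read_dsr_fn, RD_RDY_SPINS and the sim_* stand-in are hypothetical names that are not part of the Keil driver, and on the target the accessor would simply wrap HVAR(U8, CSR_DSR).

```c
#include <stdint.h>
#include <stdbool.h>

#define DSR_RD_RDY   0x20u   /* DSR bit 5 (RD_RDY), mask taken from the fix above */
#define RD_RDY_SPINS 1000u   /* assumed spin budget; tune for the real bus timing */

/* Hypothetical accessor type: on the target this would return HVAR(U8, CSR_DSR). */
typedef uint8_t (*read_dsr_fn)(void);

/* Poll RD_RDY at most max_spins times.
 * Returns true once the data port is ready, false if the budget runs out,
 * so the caller can run its error/reinit path instead of hanging. */
static bool wait_rd_rdy(read_dsr_fn read_dsr, uint16_t max_spins)
{
    while (max_spins--) {
        if (read_dsr() & DSR_RD_RDY) {
            return true;
        }
    }
    return false;
}

/* Host-side stand-in for the chip, for illustration only:
 * reports "not ready" for a configurable number of polls. */
static unsigned sim_polls_until_ready;

static uint8_t sim_read_dsr(void)
{
    if (sim_polls_until_ready == 0u) {
        return DSR_RD_RDY;
    }
    sim_polls_until_ready--;
    return 0u;
}
```

In the handler this would be called right after writing CMDR, e.g. with RD_RDY_SPINS as the budget, and a false return would be treated like any other receive error.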

During the troubleshooting it was found that State and RxLen can take what the programming community calls 'bogus values'; Google returns quite a few discussions on bogus values, 'strange' packet sizes (runt/large), etc. I would assume that some of these 'bogus' values are due to collisions, as mentioned e.g. by Rich Blum here: www.velocityreviews.com/.../t29148-what-are-runts-packets.html . But other weird cases, I would assume, are due to similar timing negligence.

Now it became clear how the crash happened, at least in one possible scenario: the MCU starts the chip's internal DMA but does not actually wait until the data is really ready to be read; so it reads garbage, giving ('sometimes', depending on concurrent interrupts/tasks) bogus State and RxLen, and then tries to work with these invalid values. As a result, NextRecBuf = (U8)(State>>8) becomes invalid and HVAR(U8, BNRY) = NextRecBuf - 1 results in a senseless assignment; the loop condition HVAR (U8, CPR) != NextRecBuf then never becomes false and the firmware dies in the loop inside the interrupt. This was also confirmed in debug mode via ULink2.
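The checks discussed in this post (receive status, length range, next-page pointer) can be gathered into one header sanity test that runs before BNRY is touched. This is only a sketch: rx_header_ok() and ETH_MIN_LEN are hypothetical names, the numeric values of ETH_MTU, ERR_MASK, SM_RSTART_PG and SM_RSTOP_PG are placeholders (the real ones come from the driver source), and the upper bound on the next-page pointer is an extra assumption not taken from the original driver.

```c
#include <stdint.h>
#include <stdbool.h>

/* Placeholder values for illustration; use the definitions from the driver. */
#define ETH_MTU       1514u
#define ETH_MIN_LEN   64u     /* hypothetical name for the '< 64' check */
#define ERR_MASK      0x0Eu   /* placeholder error bits in the status byte */
#define SM_RSTART_PG  0x46u   /* placeholder first page of the receive ring */
#define SM_RSTOP_PG   0x80u   /* placeholder page just past the receive ring */

/* Returns true only if the 4-byte receive header looks plausible.
 * state:  first word read from the data port (status low, next page high)
 * rx_len: the already adjusted packet length */
static bool rx_header_ok(uint16_t state, uint16_t rx_len)
{
    uint8_t next = (uint8_t)(state >> 8);   /* candidate NextRecBuf */

    if (state & ERR_MASK) {
        return false;                       /* chip flagged a receive error */
    }
    if (rx_len < ETH_MIN_LEN || rx_len > ETH_MTU) {
        return false;                       /* runt or oversized length */
    }
    if (next <= SM_RSTART_PG) {
        return false;                       /* same as (NextRecBuf - 1) < SM_RSTART_PG */
    }
    if (next >= SM_RSTOP_PG) {
        return false;                       /* assumed extra bound: outside the ring */
    }
    return true;
}
```

A false result would then route to the reinitialization path rather than to the BNRY assignment.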

This faulty scenario shows another weakness of the driver implementation: lack of proper error handling. Personally, and not to blame anyone, I expected a more robust and crisp design from commercial middleware. On the good side, the driver and stack are really neat, simple and not messy. It feels like the final implementation is left up to the end user.

Nevertheless, a simple reinitialization works when the error condition is met. If either RxLen > ETH_MTU or RxLen < 64 (a condition skipped in the original code), or State indicates an error via ERR_MASK, it is enough to break out of the while ( HVAR (U8, CPR) != NextRecBuf) loop with the following:

  <Command Register>=  0x21;
  <Page Start Register>= SM_RSTART_PG;
  <Page Stop Register>= SM_RSTOP_PG;
  <Boundary Pointer Register>= SM_RSTART_PG;
  <Interrupt Status Register>= 0xFF;
  <Command Register>=  0x61;
  <Current Page Register>= SM_RSTART_PG + 1;
  NextRecBuf= SM_RSTART_PG + 1;
  <Command Register>= MSK_PG0 | CR_RD2| CR_START;

It also works when (NextRecBuf - 1) < SM_RSTART_PG is met, a faulty condition not handled by the original driver.
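The register sequence above can be wrapped in one helper so that every error path uses the same recovery. The sketch below runs against a plain struct instead of the real chip so it can be tried on a host; nic_regs_t and rx_ring_reinit() are hypothetical names, the page numbers are placeholders, and the MSK_PG0/CR_RD2/CR_START values assume the usual NE2000-style command register layout rather than being copied from the driver header.

```c
#include <stdint.h>

/* Assumed NE2000-style command register bits; check against the driver header. */
#define MSK_PG0      0x00u
#define CR_START     0x02u
#define CR_RD2       0x20u
/* Placeholder ring boundaries for illustration only. */
#define SM_RSTART_PG 0x46u
#define SM_RSTOP_PG  0x80u

/* Hypothetical stand-in for the chip registers used by the sequence.
 * On the target each field would be an HVAR(U8, ...) access instead. */
typedef struct {
    volatile uint8_t cr;      /* Command Register */
    volatile uint8_t pstart;  /* Page Start Register */
    volatile uint8_t pstop;   /* Page Stop Register */
    volatile uint8_t bnry;    /* Boundary Pointer Register */
    volatile uint8_t isr;     /* Interrupt Status Register */
    volatile uint8_t cpr;     /* Current Page Register (page 1) */
} nic_regs_t;

/* Reset the receive ring to its initial state and restart the chip.
 * Returns the new value for NextRecBuf. */
static uint8_t rx_ring_reinit(nic_regs_t *nic)
{
    nic->cr     = 0x21;                  /* page 0, stop, abort remote DMA */
    nic->pstart = SM_RSTART_PG;
    nic->pstop  = SM_RSTOP_PG;
    nic->bnry   = SM_RSTART_PG;
    nic->isr    = 0xFF;                  /* acknowledge all pending interrupts */
    nic->cr     = 0x61;                  /* page 1, still stopped */
    nic->cpr    = (uint8_t)(SM_RSTART_PG + 1u);
    nic->cr     = MSK_PG0 | CR_RD2 | CR_START;   /* back to page 0, start */
    return (uint8_t)(SM_RSTART_PG + 1u);
}
```

Usage would be along the lines of NextRecBuf = rx_ring_reinit(&nic); followed by leaving the receive loop.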

I wrote this in the hope that anyone who has found other critical issues or 'strange' sporadic behavior with the Ethernet driver will share a workaround/fix, to arrive at really strong Ethernet code.

- E.g., what should one do if the interrupt processing takes too long? I saw it take from about 0.3 ms to several ms (!), while other critical tasks need to run. Lowest priority plus nested IRQs do help, but maybe there is a better solution, e.g. avoiding the 2 while() loops in the interrupt; is this possible?

- Hints on implementing a firmware watchdog and using it to work around the physical reconnection problem.

Regards,
Nikolay.