TCPNET HTTP server: TCP checksum errors

Hello,

I have implemented a web server using RL-ARM. The problem I am trying to resolve is that, occasionally, the web server will 'hang' for about two seconds while in the middle of serving an HTTP response to the browser. This does not happen very frequently; 95% of the time, complete pages are served almost instantly.

Using Wireshark, I see that what's happening is that TCPNET is sometimes sending out a TCP packet (containing HTTP data) that has an incorrect checksum. Wireshark actually marks the packet as "Continuation or non-HTTP traffic", and the 'bad checksum' flag is 'true'.

About two seconds after the bad packet is issued, I can see that TCPNET retransmits the packet. Wireshark marks it as "[TCP Retransmission]". Inspection of this retransmitted packet shows that it contains exactly the same HTTP data as the bad packet, except this time the packet is usually a little longer (perhaps just several bytes, or sometimes tens of bytes longer) and has a good checksum value.

So what's happening is that the browser ignores the packet with the bad checksum, and the 'hang' is when it awaits the retransmitted packet.

The retransmitted packet is almost always longer. It's as if the bad packet, with the wrong checksum, has somehow become slightly truncated.
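
For anyone wanting to check a captured segment by hand, here is a minimal sketch of the RFC 1071 ones'-complement sum that TCP uses. The helper name `inet_checksum` is illustrative and is not part of TCPNET. A real TCP checksum also covers the pseudo-header (addresses, protocol, TCP length), which this sketch leaves to the caller. The point is simply that if a segment loses a few trailing bytes after its checksum was computed, verification at the receiver will fail.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum over a byte buffer (illustrative helper,
   not part of TCPNET). Bytes are summed as big-endian 16-bit words. */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (len & 1)                       /* odd trailing byte, zero-padded */
        sum += (uint32_t)data[len - 1] << 8;
    while (sum >> 16)                  /* fold carries back into 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Running the same bytes through the function with a shorter length gives a different result, which matches the symptom of a slightly truncated packet carrying a checksum computed for the full-length data.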

This is the only issue I am experiencing; everything else seems to be running absolutely fine with our web server. It has been going through very extensive testing and I've never seen anything else that would point to data corruption.

The platform is the ST ARM9. The software in use is:

uVision V4.00
MDK-ARM V4.00
RL-ARM V4.00

The problem has been present ever since we started developing using earlier V3.x versions of uVision, MDK and RLARM. This isn't something that has been introduced with any particular release of Keil software.

In our application there are three tasks running: the main application (middle priority), a serial communications task (highest priority), and a web server task. The web server is set to lowest priority. As a test, I have tried making it the highest priority task but this didn't eliminate the checksum errors. At the moment, I am in the process of disabling as much of the main application as I can, along with interrupts, etc. to see if I can determine what, if anything, in our code could be upsetting TCPNET.

In the meantime I am just curious as to whether anyone has experienced anything similar to this. It's something that I'm finding very tricky to debug.

Thanks,

Trevor.

Parents
  • There are a few custom bits in our ethernet driver file, so I copied in your send_frame() and delay function and tried it.

    And it works a total charm. Wireshark isn't showing any corrupted packets, and I'm able to zip around the web server really fast without any sluggishness. It's like a total transformation.

    I can't say thanks enough!

    I have not studied your changes in any great depth yet, but at a glance, I think I understand the solution to be as follows:

    Originally, to send a new frame, you copy the new frame into the next available DMA transmit buffer. If a frame send is already in progress, then I *think* that setting the DMA_CTRL_NEXT_EN bit causes it to automatically send the new frame afterwards. I'm not sure - I need to read up on the documentation.

    So looking at your changes, I think that you're using your own loop to wait until the previous frame transmit has finished, and then you copy in and send the new frame, and you never set DMA_CTRL_NEXT_EN.
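
As a rough illustration of that reading (the actual send_frame() changes are not shown in this thread), here is a host-side model of "wait for the previous frame, then copy and start the new one, with nothing chained". Everything in it is hypothetical: `tx_dma_t`, `tx_dma_complete`, `send_frame_blocking`, `TX_BUF_SIZE` and the `max_polls` timeout are made up for the sketch, and the busy flag stands in for whatever status bit the real ENET DMA exposes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TX_BUF_SIZE 1536u   /* assumed maximum frame size for this sketch */

/* Hypothetical stand-in for the ENET transmit DMA; on real hardware the
   busy flag would come from a DMA status register, not a struct field. */
typedef struct {
    volatile int busy;              /* nonzero while a frame is going out */
    uint8_t      buf[TX_BUF_SIZE];  /* single transmit buffer */
    size_t       len;               /* length of the frame held in buf */
    unsigned     frames_started;    /* counter used only for illustration */
} tx_dma_t;

/* Simulated hardware step: pretend the transmit in progress completes. */
static void tx_dma_complete(tx_dma_t *dma) { dma->busy = 0; }

/* Wait for the previous frame to finish, then copy and start the new one.
   Returns 0 on success, -1 if the frame is too large or the wait times out. */
static int send_frame_blocking(tx_dma_t *dma, const uint8_t *frame,
                               size_t len, unsigned max_polls)
{
    if (len > TX_BUF_SIZE)
        return -1;
    while (dma->busy) {             /* previous transmit still in progress */
        if (max_polls-- == 0)
            return -1;              /* give up rather than hang forever */
        tx_dma_complete(dma);       /* simulation only: hardware would do this */
    }
    memcpy(dma->buf, frame, len);   /* buffer is idle, so the copy is safe */
    dma->len = len;
    dma->busy = 1;                  /* start transmit; nothing is chained */
    dma->frames_started++;
    return 0;
}
```

The trade-off is that the caller stalls while a frame is on the wire, but the buffer and its length are never touched while the DMA might still be reading them.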

    Again many thanks. That's quite an unexpected result to have just 1.5 hours after posting.

    Trev

Children
  • Interesting. Have you checked the errata sheet of the processor? Maybe this is a known hardware issue?

  • Just to throw in some part numbers here:

    I originally started work on the web server using the MCB-STR9 evaluation board which has the STR912FW44X6. On this board I ran the web server standalone using exactly the same project configuration as Keil's 'HTTP demo', without any other application code. If memory serves me right, I never encountered the TCP packet problem back then.

    When integrating the web server with the rest of the application code on our own hardware, the device in use is the STR912FAW46. I'm sure this is the point where the packet corruption started.

  • "And it works a total charm."

    That's such a relief for me. To be blunt, I've been getting VERY frustrated with Keil's support on this one. It's like they 1) don't see it as a problem, 2) don't seem to believe me and 3) don't want to look at [or understand] the evidence I've given them. Then again, to be fair, I still have not found a way of 'forcing' the error.

    "So looking at your changes, I think that you're using your own loop to wait until the previous frame transmit has finished, and then you copy in and send the new frame, and you never set DMA_CTRL_NEXT_EN."

    I don't have the code to hand right now, but that sounds like the way I remember it. Theoretically, it is not as efficient as the 'unmodified' version, but when you take into account the stalls, it works out faster.

    "Interesting. Have you checked the errata sheet of the processor? Maybe this is a known hardware issue?"

    Yes, I've looked numerous times, and each time I've come up blank :(

    I'll try again with the numbers Trevor's given.

    -----------------------------------------------

    Trevor,

    What I'd like to do is give Keil support a reference to this thread and see if I can get them to now look into it further.

    As I mentioned before, I've had terrible trouble getting my application reduced down enough for them to see the problem. Would you be able to give them your application so they can more easily see it?

    Cheers.

  • I could in theory make a new project with the web server as standalone and just minimise it to a few pages, rip out most of the CGI stuff, etc. to the point where it is very minimal, yet still exhibits the problem.

    But the thing is, if we were to speculate that the problem is down to an incompatibility between the standard STR9_ENET.c driver and certain variants of the ST ARM9 device, perhaps we should first see if Keil's original HTTP Demo application exhibits the same fault - when made to run on our own hardware?

    I'm pretty sure that my web server ran fine on the MCB-STR9 board and I only had the problem when I started running it on our own PCB.

    So I'm wondering whether to have a go at making HTTP Demo run on our hardware, with the hope that it still exhibits the fault. Then, hopefully, Keil could replicate the problem for themselves by running their demo on another application board that uses the device we're using.

    Which particular ST ARM9 device are / were you using for your project?

  • "Which particular ST ARM9 device are / were you using for your project?"

    The initial development started with the Keil MCBSTR9 board, fitted with an STR912FW44X6 (rev G).

    Our development board uses an STR912FW44X6 (rev H).

    The project uses raw TCP sessions.

    The problem seemed to start when I went beyond the Keil examples and started putting in the 'real-life' code. I've tried going back to the Keil examples (as have Keil support) and I see no problem.

    I believe that the problem is due to some sort of interaction between the basic TCP communication and 'something else'. Since the 'something else' is missing from the Keil examples, the problem is not seen there.

    What I also found was that a minor change in a part of the project seemingly unrelated to TCP communication would affect whether the problem was visible and how often it occurred.

    In one particular example that I gave Keil support, I created two binary files of the project. One repeatedly failed and the other worked for hours without seeing a problem. The only difference between the two binary images was a series of five instructions that were in a different order. To me, the functionality of that sequence of instructions was the same. I could not see what was causing the apparent difference - Could there be some instruction queuing/caching difference? I just don't know.

    If you could spend the time to create some code that repeatedly and easily fails, then maybe we can convince Keil support to look at it again.

  • Please send an email to: support.intl@keil.com and ask for an updated driver.

  • Franc,

    Does this mean there is now an updated driver that corrects this problem?

  • Just before seeing Franc's post, I had sent another project to Keil Support that seems to reliably show the problem.

    I eagerly await the updated driver to try.

  • Hmmm....

    Just received the updated driver, and still get the problem :(

    I remember trying the very same sort of fix on it myself three or four months ago and then going on to try something else.

    Keil support have my updated project so I'll pass the details on to them.

    Also noticed that the update is based upon an older version of the code, so there is one part that has reverted.

    Up to about 3.22 it was this (and it still is in the update):

    void int_enable_eth (void) {
       /* Ethernet Interrupt Enable function. */
       VIC0->INTER |= 1 << 11;
    }
    

    From about 3.22, this changed to:

    void int_enable_eth (void) {
       /* Ethernet Interrupt Enable function. */
       VIC0->INTER = 1 << 11;
    }
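
Whether `|=` versus `=` matters here depends on how the enable register behaves. The sketch below is a toy model that assumes write-1-to-set semantics (written 1 bits enable a source, written 0 bits are ignored), as documented for ARM's PL190 VIC; that the STR9 VIC0->INTER register behaves this way is an assumption to check against the reference manual. `vic_model_t` and the helper names are invented for the illustration.

```c
#include <stdint.h>

/* Toy model of an interrupt-enable register with assumed write-1-to-set
   semantics: written 1 bits enable a source, written 0 bits change nothing. */
typedef struct { uint32_t enabled; } vic_model_t;

static uint32_t vic_read_inter(const vic_model_t *v) { return v->enabled; }
static void vic_write_inter(vic_model_t *v, uint32_t value) { v->enabled |= value; }

#define ETH_IRQ_BIT (1u << 11)   /* bit 11, as in the snippets above */

/* Equivalent of "VIC0->INTER |= 1 << 11": read, OR, write back. */
static void enable_eth_rmw(vic_model_t *v)
{
    vic_write_inter(v, vic_read_inter(v) | ETH_IRQ_BIT);
}

/* Equivalent of "VIC0->INTER = 1 << 11": a single write. */
static void enable_eth_direct(vic_model_t *v)
{
    vic_write_inter(v, ETH_IRQ_BIT);
}
```

Under that assumption both forms leave the same interrupts enabled, and the plain write just avoids a read-modify-write sequence. If the register were an ordinary read/write register instead, the plain `=` would clear every other enable bit, so the semantics are worth confirming before treating the reverted line as harmless.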
    

    I can also confirm that using Keil's modified send_frame() immediately brings the problem back for me too.

    I shall continue using your modifications for now if that's alright...!

  • "I shall continue using your modifications for now if that's alright...!"

    Sure you can.

    I've notified Keil support and suggested that they try my app to see if they can re-create it.

    I'll keep you informed.

  • Trevor,

    At the moment, Keil support are unable to replicate the error with my code. With that very same code, I'm seeing the problem on the MCBSTR9 board and our own board.

    If you can do something that you think would show the problem more consistently, I think they would appreciate a copy.

  • Please try the last driver that you have received from support and change the number of TX buffers in the header file:

    #define NUM_TX_BUF          3
    

    It seems that this solves the problem.

  • I've been trying my 'test' project with this fix and so far have seen no errors (after more than 1.5 million packets).

    It looks promising.

    I'm now going to put it into my 'live' project and set up a test to run over the weekend.

    Unfortunately, the tests on my 'live' project were not successful; i.e., I still see the error.