
efficient memcpy in Cypress FX2

We are considering the Cypress FX2. I've seen several postings about using the dual data pointers on some 8051 devices, but they don't seem to offer much advantage when you have to waste valuable time executing instructions to toggle between the two. Curiously, I can't find much on exploiting the auto-increment feature of the DPTRs in some 8051 variants, and in the FX2 at least, both can be used without one of them being 'hidden'.

I need an efficient memcpy between the FX2's endpoint memory space, (treated as a FIFO), and an external location. Is this the best way? Only two instructions in the loop... Or have I missed something?

   MOV AUTOPTRSET,#03;; enable, inc 1, not 2
   MOV r7,#64        ;;count
   MOV AUTOPTR1L,#c0
   MOV AUTOPTR1H,#e7 ;; e7c0=EP1OUT buffer
   MOV AUTOPTR2L,#00
   MOV AUTOPTR2H,#40 ;; 4000=external
loop:
   MOV XAUTODAT2,XAUTODAT1
   DJNZ R7,loop


If this is correct, the memcpy figures could be MUCH better than the figures given in the benchmarks. Any input welcome.

  • I don't know much about Cypress extensions, so I figured the question was a good reason to take a look and learn something. Here's what it looks like to me, but feel free to correct me.

    The chip has what I think of as the "usual" dual DPTR scheme; a second DPTR for use with the MOVX instructions, and a DPS register that selects which one is used by those instructions.

    So, the dual-DPTR loop looks something like this (instruction cycle counts in comments):

    loop:
           MOVX A, @DPTR  ; 2..9
           INC DPTR       ; 3
           INC DPS        ; 1
           MOVX @DPTR, A  ; 2..9
           INC DPTR       ; 3
           INC DPS        ; 1
           DJNZ R7, loop  ; 3
    

    Note that it's the same MOVX instruction, and use of "the" DPTR is implied. Toggling the DPS bit does cut into block transfer time compared to some theoretical architecture where you had two real DPTRs. But the point is that the 8051 instruction set only allows for one DPTR. There are no bits in the instruction, and no other opcode, to specify the other DPTR. The low-order bit of DPS selects which DPTR is used.

    In addition to these dual DPTR registers, the FX2 also has some "autoptr" registers. These registers have the virtue of automatically incrementing. However, I didn't see any extension to the instruction set that lets you use a MOV instruction between the two autoptrs. Instead, each AUTOPTR register has an associated AUTODAT register. When you read or write this special address, then a MOVX instruction is executed using the address in the associated AUTOPTR register, which auto-increments if so configured.
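To make the AUTOPTR/AUTODAT behaviour described above concrete, here is a rough host-side model in C. It is only a sketch of the semantics as I read them, not FX2 firmware: `xdata`, `autoptr_t`, `autodat_read`, `autodat_write` and `autoptr_copy` are names made up for illustration, and the array merely stands in for the external data space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model only: xdata[] stands in for the 64 KB external data space. */
uint8_t xdata[0x10000];

/* Hypothetical model of one autopointer: an address plus an auto-increment flag. */
typedef struct {
    uint16_t addr;      /* models the 16-bit AUTOPTR address */
    int      auto_inc;  /* nonzero: step the address after every access */
} autoptr_t;

/* Reading the AUTODAT register fetches the byte at the pointer, then steps it. */
uint8_t autodat_read(autoptr_t *p)
{
    assert(p != NULL);
    uint8_t v = xdata[p->addr];
    if (p->auto_inc) p->addr++;
    return v;
}

/* Writing the AUTODAT register stores the byte at the pointer, then steps it. */
void autodat_write(autoptr_t *p, uint8_t v)
{
    assert(p != NULL);
    xdata[p->addr] = v;
    if (p->auto_inc) p->addr++;
}

/* Copy n bytes by reading one AUTODAT and writing the other. */
void autoptr_copy(autoptr_t *src, autoptr_t *dst, size_t n)
{
    while (n--) autodat_write(dst, autodat_read(src));
}
```

With `src` starting at 0xE7C0 and `dst` at 0x4000 (the two addresses from the original question), `autoptr_copy(&src, &dst, 64)` moves the 64 bytes and leaves both pointers 64 past where they started. On the real part each of those reads and writes is still a MOVX through a DPTR, which is where the cost comes from.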

    The autodat registers are mapped into the external data space of the 8051 (addresses 0E67BH and 0E67CH.) The manual says that holes appear in code space, and not data space, oddly enough. But you must be able to access the autodat registers in data space, with MOVX rather than MOVC, or you wouldn't be able to write them. (MOVC access would also be broken because it always involves using A as an offset to DPTR; loops written this way would have to juggle the accumulator and wind up being even slower than the dual DPTR loop.) To move data, you read one AUTODAT register and write the other AUTODAT register.

    So, the loop would look something like:
         MOV DPTR, #AUTODAT1
         INC DPS
         MOV DPTR, #AUTODAT2
         INC DPS
    loop:
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         DJNZ R7, loop   ; 3
    

    In short, you've saved the increment DPTR instructions, but still have to flip back and forth between DPTRs to flip back and forth between AUTODAT registers. (Reloading a single DPTR will be slower than INC DPS.) Best case advantage is 9 cycles / byte rather than 15, or about 60% of the time of the first loop. Worst case, with slow xdata, is 23 / 29 or about 80%.
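The arithmetic behind those figures can be checked with a few lines of C. This is just a tally of the cycle counts from the comments in the two loops above; the function names are invented for the example, and `movx` is the per-MOVX cost (2 at best, 9 with the slowest xdata).

```c
#include <assert.h>

/* Cycles per byte for the dual-DPTR loop:
   two MOVX, two INC DPTR (3 each), two INC DPS (1 each), one DJNZ (3). */
unsigned dual_dptr_cycles(unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    return 2 * movx + 2 * 3 + 2 * 1 + 3;
}

/* Cycles per byte for the AUTODAT loop: same, minus the two INC DPTR. */
unsigned autodat_cycles(unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    return 2 * movx + 2 * 1 + 3;
}

/* AUTODAT loop time as a percentage of the dual-DPTR loop time,
   rounded to the nearest whole percent. */
unsigned autodat_percent(unsigned movx)
{
    unsigned a = autodat_cycles(movx);
    unsigned d = dual_dptr_cycles(movx);
    return (100 * a + d / 2) / d;
}
```

`autodat_percent(2)` comes out at 60 (9 against 15 cycles) and `autodat_percent(9)` at 79 (23 against 29), which matches the "about 60%" and "about 80%" above.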

  • Thanks guys.

    My mistake was in thinking that the autodat registers were in SFR space like the address registers. But they're in external data space, so they can only be accessed by MOVX, via the 'one at a time' DPTRs. So, 9 cycles/byte when clocking the CPU at maximum rate gives me 1.33 MBytes/sec.

    I can improve on that a BIT because I know I will always be transferring an even number of bytes, so:

    loop:
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         DJNZ R7, loop   ; 3
    

    gives me 15 cycles for 2 bytes = 1.6 MBytes/sec, with R7 loaded with half the byte count. I think that's the best we can do.
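For anyone wanting to check the throughput numbers, here is a small C sketch of the calculation. It assumes a 48 MHz CPU clock and 4 clocks per instruction cycle (12 million cycles per second, which is what the 1.33 MBytes/sec figure implies) and best-case 2-cycle MOVX; the names are made up for the example.

```c
#include <assert.h>

/* Assumption: 48 MHz clock, 4 clocks per instruction cycle. */
#define CYCLES_PER_SECOND (48000000UL / 4UL)

/* Cycles for one pass of a loop unrolled to move `bytes` bytes per pass:
   each byte costs MOVX + INC DPS + MOVX + INC DPS = 2 + 1 + 2 + 1 cycles,
   and the 3-cycle DJNZ is paid once per pass. */
unsigned long pass_cycles(unsigned bytes)
{
    assert(bytes >= 1);
    return 6UL * bytes + 3UL;
}

/* Best-case throughput in bytes per second (truncated). */
unsigned long bytes_per_second(unsigned bytes)
{
    return CYCLES_PER_SECOND * bytes / pass_cycles(bytes);
}
```

`bytes_per_second(1)` gives 1333333 and `bytes_per_second(2)` gives 1600000. Unrolling further only shaves the DJNZ overhead: four bytes per pass gives 1777777, and the limit is 2 MBytes/sec at 6 cycles per byte. In every case R7 holds the number of passes, so 64 bytes at two per pass means R7 = 32.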

    Thanks again,

    Mark.