
efficient memcpy in Cypress FX2

We are considering the Cypress FX2. I've seen several postings about using the dual data pointers on some 8051 devices, but they don't seem to offer much advantage when you have to spend valuable time executing instructions to toggle between the two. Curiously, I can't find much on exploiting the auto-increment feature of the DPTRs in some 8051 variants, and in the FX2 at least, both can be used without one of them being 'hidden'.

I need an efficient memcpy between the FX2's endpoint memory space (treated as a FIFO) and an external location. Is this the best way? There are only two instructions in the loop. Or have I missed something?

   MOV AUTOPTRSET,#03h   ;; enable, inc 1, not 2
   MOV R7,#64            ;; count
   MOV AUTOPTR1L,#0C0h
   MOV AUTOPTR1H,#0E7h   ;; E7C0 = EP1OUT buffer
   MOV AUTOPTR2L,#00h
   MOV AUTOPTR2H,#40h    ;; 4000 = external
loop:
   MOV XAUTODAT2,XAUTODAT1
   DJNZ R7,loop


If this is correct, the memcpy figures could be MUCH better than the figures given in the benchmarks. Any input welcome.

  • I don't know much about Cypress extensions, so I figured the question was a good reason to take a look and learn something. Here's what it looks like to me, but feel free to correct me.

    The chip has what I think of as the "usual" dual DPTR scheme; a second DPTR for use with the MOVX instructions, and a DPS register that selects which one is used by those instructions.

    So, the dual-DPTR loop looks something like this (instruction cycle counts in comments):

    loop:
           MOVX A, @DPTR  ; 2..9
           INC DPTR       ; 3
           INC DPS        ; 1
           MOVX @DPTR, A  ; 2..9
           INC DPTR       ; 3
           INC DPS        ; 1
           DJNZ R7, loop  ; 3
    

    Note that it's the same MOVX instruction, and use of "the" DPTR is implied. Toggling the DPS bit does cut into block transfer time compared to some theoretical architecture where you had two real DPTRs. But the point is that the 8051 instruction set only allows for one DPTR. There are no bits in the instruction, and no other opcode, to specify the other DPTR. The low-order bit of DPS controls which DPTR is used.

    In addition to these dual DPTR registers, the FX2 also has some "autoptr" registers. These registers have the virtue of automatically incrementing. However, I didn't see any extension to the instruction set that lets you use a MOV instruction between the two autoptrs. Instead, each AUTOPTR register has an associated AUTODAT register. When you read or write this special address, then a MOVX instruction is executed using the address in the associated AUTOPTR register, which auto-increments if so configured.

    The autodat registers are mapped into the external data space of the 8051 (addresses 0E67BH and 0E67CH.) The manual says that holes appear in code space, and not data space, oddly enough. But you must be able to access the autodat registers in data space, with MOVX rather than MOVC, or you wouldn't be able to write them. (MOVC access would also be broken because it always involves using A as an offset to DPTR; loops written this way would have to juggle the accumulator and wind up being even slower than the dual DPTR loop.) To move data, you read one AUTODAT register and write the other AUTODAT register.

    So, the loop would look something like:
         MOV DPTR, #AUTODAT1
         INC DPS
         MOV DPTR, #AUTODAT2
         INC DPS
    loop:
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         DJNZ R7, loop   ; 3
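
The behaviour described above can be sketched as a small host-side model in C. This is only an illustration, not FX2 firmware: the type `fx2_model` and the functions `autodat_read`, `autodat_write` and `autoptr_copy` are hypothetical names, and the model assumes both autopointers are set to auto-increment.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model only; all names here are hypothetical. */
typedef struct {
    uint8_t  xdata[0x10000]; /* modelled 64 KB external data space */
    uint16_t autoptr[2];     /* AUTOPTR1 (index 0) and AUTOPTR2 (index 1) */
    int      autoinc[2];     /* non-zero: pointer advances after each access */
} fx2_model;

/* Reading an AUTODAT register returns the byte addressed by the matching
   AUTOPTR, then advances that pointer if auto-increment is enabled. */
static uint8_t autodat_read(fx2_model *m, int which)
{
    assert(which == 0 || which == 1);
    uint8_t value = m->xdata[m->autoptr[which]];
    if (m->autoinc[which])
        m->autoptr[which]++;
    return value;
}

/* Writing an AUTODAT register stores the byte at the address held in the
   matching AUTOPTR, then advances that pointer if enabled. */
static void autodat_write(fx2_model *m, int which, uint8_t value)
{
    assert(which == 0 || which == 1);
    m->xdata[m->autoptr[which]] = value;
    if (m->autoinc[which])
        m->autoptr[which]++;
}

/* Same shape as the assembly loop: read AUTODAT1, write AUTODAT2, count down. */
static void autoptr_copy(fx2_model *m, uint16_t src, uint16_t dst, unsigned count)
{
    m->autoptr[0] = src;
    m->autoptr[1] = dst;
    m->autoinc[0] = 1;
    m->autoinc[1] = 1;
    while (count--)
        autodat_write(m, 1, autodat_read(m, 0));
}
```

Calling `autoptr_copy(&m, 0xE7C0, 0x4000, 64)` mirrors the addresses and count from the original question; afterwards both modelled pointers have advanced by 64.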
    

    In short, you've saved the increment DPTR instructions, but still have to flip back and forth between DPTRs to flip back and forth between AUTODAT registers. (Reloading a single DPTR will be slower than INC DPS.) Best case advantage is 9 cycles / byte rather than 15, or about 60% of the time of the first loop. Worst case, with slow xdata, is 23 / 29 or about 80%.
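
The arithmetic behind those figures can be written out as a short C sketch. It uses only the per-instruction cycle counts quoted in the loop comments above (MOVX 2 to 9, INC DPTR 3, INC DPS 1, DJNZ 3); the function names are hypothetical.

```c
#include <assert.h>

/* Cycles per byte for the plain dual-DPTR loop.
   movx is the MOVX cost: 2 (fastest) to 9 (slowest xdata). */
static unsigned dual_dptr_cycles(unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    /* (MOVX + INC DPTR + INC DPS) twice, then DJNZ */
    return 2 * (movx + 3 + 1) + 3;
}

/* Cycles per byte for the AUTODAT loop, which drops the INC DPTR steps. */
static unsigned autodat_cycles(unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    /* (MOVX + INC DPS) twice, then DJNZ */
    return 2 * (movx + 1) + 3;
}
```

With `movx = 2` the two functions give 15 and 9 cycles per byte, and with `movx = 9` they give 29 and 23, which is where the 60% and roughly 80% ratios come from.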

  • single dptr
    movx a,@dptr
    inc dptr
    mov slot3,dph
    mov slot4,dpl
    mov dph,slot1
    mov dpl,slot2
    movx @dptr,a
    inc dptr
    mov slot1,dph
    mov slot2,dpl
    mov dph,slot3
    mov dpl,slot4
    loop

    dual dptr
    movx a,@dptr
    inc dptr
    inc dps
    movx @dptr,a
    inc dptr
    inc dps
    loop

    no savings ?????


    Erik

  • I think you need to access the XAUTODATx registers using data pointers. For example,
    for single data pointer, this would be...

    loop:
       MOV DPTR,#XAUTODAT1
       MOVX A,@DPTR
       MOV DPTR,#XAUTODAT2
       MOVX @DPTR,A
       DJNZ R7,loop
    

    Using dual data pointers would be..

       MOV DPTR,#XAUTODAT1
       INC DPS
       MOV DPTR,#XAUTODAT2
       INC DPS
    loop:
       MOVX A,@DPTR
       INC DPS
       MOVX @DPTR,A
       INC DPS
       DJNZ R7,loop
    

    The INC DPS takes the same time as MOV DPTR, #. So, in this case, using dual data pointers is NOT more efficient.

    Jon

  • "The INC DPS takes the same time as MOV DPTR, #. So, in this case, using dual data pointers is NOT more efficient."

    Jon, you are mistaken. Yes, the two operations above take the same time, but with one DPTR you need to save the fetch pointer and load the store pointer before the store, and do the reverse before the load.

    Have a peek at my example, regardless of whether it reflects this particular implementation of dual DPTR or not.

    Erik

  • Thanks guys.

    My mistake was in thinking that the autodat registers were in SFR space like the address registers. But they're in external, so can only be accessed by MOVX, via the 'one at a time' DPTRs. So, 9 cycles/byte when clocking the cpu at maximum rate gives me 1.33MBytes/sec.

    I can improve on that a BIT because I know I will always be transferring an even number of bytes, so:

    loop:
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         MOVX A, @DPTR   ; from AUTODAT1   2..9
         INC DPS         ; switch AUTODATs 1
         MOVX @DPTR, A   ; to AUTODAT2     2..9
         INC DPS         ; switch AUTODATs 1
         DJNZ R7, loop   ; 3
    

    gives me 15 cycles for 2 bytes = 1.6MBytes/sec. I think that's the best we can do.

    Thanks again,

    Mark.
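
The two throughput figures in the post above can be checked with a short C sketch. It assumes a 48 MHz CPU clock and 4 clocks per instruction cycle as the "maximum rate"; those two numbers and the function name `throughput` are assumptions made for illustration, not values stated in the thread.

```c
#include <assert.h>

/* Bytes per second for a loop that moves bytes_per_pass bytes in
   cycles_per_pass instruction cycles. */
static unsigned long throughput(unsigned long clock_hz,
                                unsigned clocks_per_cycle,
                                unsigned cycles_per_pass,
                                unsigned bytes_per_pass)
{
    assert(clocks_per_cycle > 0 && cycles_per_pass > 0);
    unsigned long cycles_per_second = clock_hz / clocks_per_cycle;
    return cycles_per_second * bytes_per_pass / cycles_per_pass;
}
```

Under those assumptions, `throughput(48000000UL, 4, 9, 1)` comes to about 1.33 million bytes per second for the one-byte loop, and `throughput(48000000UL, 4, 15, 2)` comes to 1.6 million for the unrolled two-byte loop, matching the figures quoted.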

  • There are two different points here.

    Erik points out that when using a single data pointer, you need to store the (changing) value of the DPTR when you switch, and thus it's more expensive than the traditional dual DPTR implementation. I agree, but the original poster was talking about yet another pointer mechanism in this particular variant (which is in addition to the traditional dual DPTRs).

    Jon is talking about using the DPTRs to point to the FX2 AUTODAT registers. These values are constant, and do not need to be stored. The loop can use MOV immediate to load the DPTRs.

    "The INC DPS takes the same time as MOV DPTR, #."

    According to the Cypress manual, MOV DPTR, #value takes 3 bytes and 3 instruction cycles. INC DPS takes two bytes and two instruction cycles, so pre-loading the DPTRs and using INC DPS would be slightly faster for any but tiny blocks to be copied. The MOV-based code would be smaller, though, as the pre-load instructions are not needed.
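
A rough break-even estimate for that trade-off, sketched in C. It takes the counts given in this post (MOV DPTR,# at 3 cycles, INC DPS at 2) together with the MOVX and DJNZ counts quoted earlier in the thread (2 to 9, and 3). The function names are hypothetical, and the totals ignore any other setup code.

```c
#include <assert.h>

/* Total cycles to copy n bytes when every pass reloads DPTR with
   MOV DPTR,#imm before each MOVX. No pre-load is needed. */
static unsigned long mov_reload_cycles(unsigned long n, unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    /* (MOV DPTR,# + MOVX) twice, then DJNZ */
    return n * (2 * (3 + movx) + 3);
}

/* Total cycles to copy n bytes when both DPTRs are pre-loaded once and
   every pass switches between them with INC DPS. */
static unsigned long dps_toggle_cycles(unsigned long n, unsigned movx)
{
    assert(movx >= 2 && movx <= 9);
    /* pre-load: (MOV DPTR,# + INC DPS) twice */
    unsigned long preload = 2 * (3 + 2);
    /* per pass: (MOVX + INC DPS) twice, then DJNZ */
    return preload + n * (2 * (movx + 2) + 3);
}
```

With the fastest MOVX, the two approaches tie at 5 bytes (65 cycles each). Below that the MOV-based loop is ahead, and above it the pre-loaded INC DPS loop wins, for example 714 against 832 cycles for a 64-byte block.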