
Convincing the optimizer (another fun bit of code) to optimize?

I'm a bit curious as to why this bit of code wasn't AS optimized as it would normally be. I have written compilers, so I am not clueless; it's just strange that the optimizer didn't optimize out some of this code.
This code becomes

  ADCON1 = chan->constants.location_info.con1 & 0x70;
  ADMUX = chan->constants.location_info.mux;
  ADCON0 = chan->constants.type_info.con0;
  OCH = chan->a2d.och;
  OCM = chan->a2d.ocm;
  OCL = chan->a2d.ocl;

  GCH = chan->a2d.gch;
  GCM = chan->a2d.gcm;
  GCL = chan->a2d.gcl;

this code

0010         L?0038:
0010         L?0039:
0010 F582              MOV     DPL,A
0012 E4                CLR     A
0013 3E                ADDC    A,R6
0014 F583              MOV     DPH,A
0016 E0                MOVX    A,@DPTR
0017 22                RET

0000 8F82              MOV     DPL,R7
0002 8E83              MOV     DPH,R6
0004 A3                INC     DPTR
0005 A3                INC     DPTR
0006 E0                MOVX    A,@DPTR
0007 5470              ANL     A,#070H
0009 F5DD              MOV     ADCON1,A
                                           ; SOURCE LINE # 279
000B 8F82              MOV     DPL,R7
000D 8E83              MOV     DPH,R6
000F E0                MOVX    A,@DPTR
0010 F5D7              MOV     ADMUX,A
                                           ; SOURCE LINE # 282
0012 EF                MOV     A,R7
0013 2404              ADD     A,#04H
0015 120000      R     LCALL   L?0038
0018 F5DC              MOV     ADCON0,A
                                           ; SOURCE LINE # 286
001A EF                MOV     A,R7
001B 2417              ADD     A,#017H
001D 120000      R     LCALL   L?0038
0020 F5D3              MOV     OCH,A
                                           ; SOURCE LINE # 287
0022 EF                MOV     A,R7
0023 2418              ADD     A,#018H
0025 120000      R     LCALL   L?0039
0028 F5D2              MOV     OCM,A
                                           ; SOURCE LINE # 288
002A EF                MOV     A,R7
002B 2419              ADD     A,#019H
002D 120000      R     LCALL   L?0039
0030 F5D1              MOV     OCL,A
                                           ; SOURCE LINE # 290
0032 EF                MOV     A,R7
0033 241A              ADD     A,#01AH
0035 120000      R     LCALL   L?0039
0038 F5D6              MOV     GCH,A
                                           ; SOURCE LINE # 291
003A EF                MOV     A,R7
003B 241B              ADD     A,#01BH
003D 120000      R     LCALL   L?0039
0040 F5D5              MOV     GCM,A
                                           ; SOURCE LINE # 292
0042 EF                MOV     A,R7
0043 241C              ADD     A,#01CH
0045 120000      R     LCALL   L?0039
0048 F5D4              MOV     GCL,A


First, all the information is referenced from a pointer.
All variable data access is sequential from said pointer. Why isn't it optimizing the ADD A, #XXX into INC DPTR?
It has done this in a number of other places in the code. Why not here?

Do I have the settings wrong or something?
Do I need to arrange the code differently?

This is embedded in an ISR - could that be the reason? (That wouldn't make sense, however ...)

If anyone can let me know whether I need to do something different, I'd appreciate it.

Stephen

  • First, all the information is referenced from a pointer.

    What kind of pointer? Did you read the chapters about how the C51 compiler handles pointers in the manual?

    All variable data access is sequential from said pointer. Why isn't it optimizing the ADD A, #XXX into INC DPTR?

    Possibly because you are using a generic pointer instead of a memory-specific one (this cannot be determined from your code snippet). Incrementing a memory-specific pointer isn't as simple as adding something to it - it requires calls to library functions (hence all the LCALLs in your code).
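    For reference, the two declaration forms differ like this (a sketch; CHAN stands for the poster's structure type, and this is C51-dialect C, not portable):

```c
CHAN *chan_generic;   /* generic 3-byte pointer: every access goes through
                         a library call that decodes the memory space
                         at run time                                      */
CHAN xdata *chan;     /* memory-specific 2-byte pointer: the compiler can
                         load DPTR and use MOVX directly                  */
```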

  • a memory-specific pointer isn't as simple as adding something to it - it requires calls to library functions (hence all the LCALLs in your code).

    Hm, scratch that. The library function calls in your code are for accessing the target of the pointer, not for the increment.

    Still, it's not clear from your code what kind of variable the pointer is - is it local, static or global? What happens to it after the code snippet?

  • The pointer is a reference to a structure in XDATA space (typed xdata, thus it is not a generic pointer). Something like CHAN xdata * chan.

    The variable itself is in xdata RAM and is global.

    As I said, I'm baffled why it doesn't just increment the DPTR and load the data sequentially. I'm not sure what is being generated underneath. My guess is it's something like

    LOAD CHANPTR
    REFERENCE RECORD N
    MOV A, (N)
    MOV SFR, A
    
    LOAD CHANPTR
    REFERENCE RECORD M
    MOV A, (M)
    MOV SFR, A
    

    Since the CHANPTR is redundant, that gets optimized out, whereas REFERENCE RECORD N gets rolled into a function. The problem, I suspect, is that the optimizer is not recognizing these as sequential address offsets from CHANPTR (i.e. CHANPTR incremented by a byte offset), which would optimize to INC DPTR instead of the verbose code it's emitting now.

    Stephen

  • The problem, I suspect, is that the optimizer is not recognizing these as sequential address offsets from CHANPTR (i.e. CHANPTR incremented by a byte offset), which would optimize to INC DPTR instead of the verbose code it's emitting now.

    That could be. It should be possible to generate "optimized" code by playing with pointers and casting in C.

    Is chan used in any way after the code snippet ? And which optimization levels have you tried ?

  • There is an almost infinite number of instruction combinations possible, and they are more or less affected by the actual parameters to the instructions.

    Each optimization must be manually identified. Then it's time for a feasibility study, to check how hard the sequence is to detect, the probability that the generated code will have matching sequences, and a guesstimate of how much faster/smaller the code will be.

    Given the large number of combinations possible, it should be expected that you now and then see code sequences that could have been optimized (better).

  • No 'optimizer' can optimize as well as a competent programmer.

    If you need optimized, then optimize, do not rely on some mechanical function to do your work for you.

    Yes, the optimizer will optimize, but if what you write is not optimum, what is the point?

    Someone posted about you using generic pointers in this thread and, if something is NOT optimum, that takes the prize.

    I have, in many cases, two 'identical' functions, the only difference being that one takes a pointer to 'code' and one takes a pointer to 'xdata'.

    Now, the smoked sardine is going to barge in and say something stupid, just ignore it.

    Erik

  • Yes - having looked at the back-end optimization of GCC, it was an eyeful. Generally the optimization starts with twiddling redundant stuff out. I suppose if it's done in a single pass, that could lead to reductions (this is more toward algebraic-simplification-type analysis). I was just curious whether it would recognize that it was adding sequential offsets,
    i.e. that P (the instance resulting from REFERENCE MEMBER (N)) was indeed the prior computation P' + 1 - that seems to be the challenge. The optimization doesn't make sense to human eyes because a human tends to use different logic than a programmatic approach would. I suppose changing the symbolic representation by twiddling the code would work (heh). Unfortunately, the stumbling block for the optimizer, I suspect, is that I am using members of a structure instead of a pointer to a series of equally sized variables (i.e. byte-sized entities).
    The optimizer has to pass over several layers of symbol sets and optimizations before it can emit code. Unfortunately, deciding which optimization to apply each time is not as easy.

    I thought I would see if there was some way to get the optimizer to recognize a sequence of incremental addresses like I have here. That would make the ISR time shorter and the code smaller. However, I don't want to end up with code that doesn't perform the same function.

    Thanks for your comments - they give me a little more confidence that I recognize the limitation. Unfortunately, to keep code clarity, recasting a pointer is a bad idea (that would be the easiest way to make the optimizer recognize sequential addresses).

    I am surprised that it doesn't reduce the offset-reference set more, but at the same time I am not. If it were a global variable and not a pointer, it would be optimized, I suspect. It's the pointer, and referencing through a pointer, that makes it difficult for the optimizer.

    Stephen

  • I thought I would see if there was some way to get the optimizer to recognize a sequence of incremental addresses like I have here. That would make the ISR time shorter and the code smaller. However, I don't want to end up with code that doesn't perform the same function.
    I have a place where I have to do a MAJOR sequential-to-sequential move. My derivative does not have two DPTRs, which Keil AFAIK does use when available, so I wrote it in assembler (the ultimate optimizer [if you know what you are doing]) and gained more than a tenfold performance improvement - and the code "does perform the same function".

    Unfortunately, to keep code clarity, recasting a pointer is a bad idea.
    Please elaborate - I do not see why.

    Erik

  • A good optimizer can optimize better than a good developer can manage, unless the developer spends a huge amount of time, and writes down a large number of permutations with paper and pen.

    It's just a question of how hard the processor is to optimize for. The C51 processor is quite easy to optimize for (for a human). It has few registers, and very limited instructions so it is easy to keep all required state information in the head.

    Let's just look back at the Intel Pentium processor. It got an exchange instruction that could swap two fp registers without consuming any clock cycles. Suddenly, it became very hard to match the Watcom compiler on fp code, because a normal human does not like to constantly swap the locations of the fp registers. You can't even document the register contents in a good way, since the fp register stack doesn't behave like a stack anymore.

    A number of processors have extra bits in the instruction to inform the processor about future instructions, for example whether a jump is likely to happen within a short while or whether the processor should be lazy or aggressive about writing back changed memory values. Some processors always process the first instruction after a jump, just to make sure that they have something to do while the pipeline and memory subsystem are busy retrieving the first instruction from the jump destination.

    High-end processors are superscalar, so they process multiple instructions per clock cycle. Even if all execution blocks are maximally fast and produce their result in a single clock cycle, the code must be rewritten so that the pipeline will always be able to issue the full set of integer or fp instructions - and remember that more advanced memory-addressing modes may lock up one ALU for evaluation of scaled multiple-offset addresses.

    Even if you are very good at writing in assembler, these non-regular hw tricks make it very, very hard to keep track of everything and to always produce a correct program.

    In theory, a human should always win over the optimizer in a compiler, but the optimizer has the advantage that it will _always_ remember all the tricks it knows about. It will not, now and then, forget to check if dirty trick "x" is applicable. It will not decide that it is too much work to perform a huge reorder/rewrite just to gain a single clock cycle - a normal developer will reach a stage where he feels: enough is enough.

    This is no different from chess. A chess master does not have it easy against the best chess computers, because his associative memory has to fight against the deep and brutally fast search of the computer. A chess master (or a human developer) has to prune the alternatives much earlier than a computer does. It is only experience, combined with luck, that keeps the chess master/developer from incorrectly pruning a winning tree.

  • The pattern of the generated code indicates that the compiler is keeping the pointer in R6/R7 and passing the offset-adjusted low byte in A to L?0039. The complaint is the repeated add to generate the offset for a field. The optimizer can't really peek inside the L?0039 function and INC DPTR; that would be using the DPTR as a global variable secretly passed to this linker-generated library function.

    What happens if you write the code this way:

      StructA2D xdata * data a2d = &(chan->a2d);
    
      OCH = a2d->och;
      OCM = a2d->ocm;
      OCL = a2d->ocl;
    
      GCH = a2d->gch;
      GCM = a2d->gcm;
      GCL = a2d->gcl;
    


    This might cache the pointer to the a2d struct in the DPTR. I haven't tried it, but I'd expect to see a series of calls without the add, just passing in the offsets of the fields inside the a2d struct, rather than going back to the base pointer of chan-> each time. It might even do away with L?0039 entirely.

    L?0039 seems like a space optimization. It's eliminating repeated code and turning it into a tiny function. Another way to sidestep the problem is perhaps to favor speed instead of space, at least for this region of code. (You can use #pragma to change the optimizer level.)
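    The pragma change might look something like this (a sketch; OPTIMIZE is the Keil C51 directive, the level/emphasis values shown match the poster's OPTIMIZE (9,SIZE) setting, and the function name is made up):

```c
/* Sketch: favor speed for just this routine, then return to the
   project-wide size emphasis. read_channel is a hypothetical name. */
#pragma OPTIMIZE (9, SPEED)
void read_channel(CHAN xdata *chan)
{
    /* ... the SFR loads from the original snippet ... */
}
#pragma OPTIMIZE (9, SIZE)
```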

  • A good optimizer can optimize better than a good developer can manage, unless the developer spends a huge amount of time, and writes down a large number of permutations with paper and pen.
    Divide and conquer. My estimate is that less than 5% of most programs will gain anything noticeable (to the user) from optimization. So instead of "spending a huge amount of time", spend a bit of time superoptimizing the critical function and leave the rest alone.

    The C51 processor is quite easy to optimize for (for a human). It has few registers, and very limited instructions so it is easy to keep all required state information in the head.
    and that is the processor we discuss (see above MCU: Cx51. 8051 or MCS51)

    Let's just look back at the Intel Pentium
    and that is not the processor we discuss (see above MCU: Cx51. 8051 or MCS51)

    your insights in other processors, while valuable, do not apply to the origin of this thread.

    Erik

  • More interesting code checking. This really shows more of what is going on. The code really is very simple and straightforward, and yet ...

    Compiler settings:

    OPTIMIZE (9,SIZE) BROWSE NOINTPROMOTE INCDIR(.\headers)
     DEBUG OBJECTEXTEND CODE LISTINCLUDE SYMBOLS
     PRINT(.\object\list_map\*.lst) TABS (3)
    


    C Code

    void    print_hex4(uint8_t xdata *      dat)
    {
            print_hex(*dat++);
            //dat++;
            print_hex(*dat++);
            //dat++;
            print_hex(*dat++);
            //dat++;
            print_hex(*dat);
    }
    

    Output Listing

                 ; FUNCTION _print_hex4 (BEGIN)
                                               ; SOURCE LINE # 175
    0000 8E00        R     MOV     dat,R6
    0002 8F00        R     MOV     dat+01H,R7
                                               ; SOURCE LINE # 176
                                               ; SOURCE LINE # 177
    0004 0500        R     INC     dat+01H
    0006 E500        R     MOV     A,dat+01H
    0008 7002              JNZ     ?C0051
    000A 0500        R     INC     dat
    000C         ?C0051:
    000C 120000      R     LCALL   L?0058
                                               ; SOURCE LINE # 179
    000F 0500        R     INC     dat+01H
    0011 E500        R     MOV     A,dat+01H
    0013 AE00        R     MOV     R6,dat
    0015 7002              JNZ     ?C0052
    0017 0500        R     INC     dat
    0019         ?C0052:
    0019 120000      R     LCALL   L?0058
                                               ; SOURCE LINE # 181
    001C 0500        R     INC     dat+01H
    001E E500        R     MOV     A,dat+01H
    0020 AE00        R     MOV     R6,dat
    0022 7002              JNZ     ?C0053
    0024 0500        R     INC     dat
    0026         ?C0053:
    0026 14                DEC     A
    0027 F582              MOV     DPL,A
    0029 8E83              MOV     DPH,R6
    002B E0                MOVX    A,@DPTR
    002C FF                MOV     R7,A
    002D 120000      R     LCALL   _print_hex
                                               ; SOURCE LINE # 183
    0030 850082      R     MOV     DPL,dat+01H
    0033 850083      R     MOV     DPH,dat
    0036 E0                MOVX    A,@DPTR
    0037 FF                MOV     R7,A
    0038 020000      R     LJMP    _print_hex
                                               ; SOURCE LINE # 184
    003B         L?0058:
    003B 14                DEC     A
    003C F582              MOV     DPL,A
    003E 8E83              MOV     DPH,R6
    0040 E0                MOVX    A,@DPTR
    0041 FF                MOV     R7,A
    0042 120000      R     LCALL   _print_hex
    0045 22                RET
                 ; FUNCTION _print_hex4 (END)
    


    This is rather ... interesting. I suspect it may come down to the difference in how the optimizer behaves between static and pointer analysis. The pointer analysis appears to be where things aren't as well optimized (although it might not be that easy to transform the code to 'optimal' in the simplification passes).

    Stephen

  • Now, the smoked sardine is going to barge in and say something stupid, just ignore it.

    I'm curious, what sort of a stupid thing do you think I might say?