Arm Community
Arm Community
  • Site
  • User
  • Site
  • Search
  • User
Arm Community blogs
Arm Community blogs
Architectures and Processors blog If Conversion Within .NET - Part 2
  • Blogs
  • Mentions
  • Sub-Groups
  • Tags
  • Jump...
  • Cancel
More blogs in Arm Community blogs
  • AI blog

  • Announcements

  • Architectures and Processors blog

  • Automotive blog

  • Embedded and Microcontrollers blog

  • Internet of Things (IoT) blog

  • Laptops and Desktops blog

  • Mobile, Graphics, and Gaming blog

  • Operating Systems blog

  • Servers and Cloud Computing blog

  • SoC Design and Simulation blog

  • Tools, Software and IDEs blog

Tell us what you think
Tags
  • Compilers
  • Runtime
Actions
  • RSS
  • More
  • Cancel
Related blog posts
Related forum threads

If Conversion Within .NET - Part 2

Alan Hayward
Alan Hayward
October 2, 2023
24 minute read time.

This is the second part of the blog series on If Conversion in .NET. In part 1 we walked through how .NET optimizes Boolean operations. This part walks through the details of various optimizations happening within the lowering phase. The examples continue to use the same example code from the first part.

If Conversion Within .NET - Part 1

Skipping the Lowering Phase

Previously, I showed the result of the lowering phase. Now we will force RyuJIT to skip as much of this phase as possible. Some parts of lowering are required for later phases, so it cannot be turned off entirely without causing compilation errors. For anyone trying this, we can do this via turning off some additional optimizations via some our special config options:

export DOTNET_JitDump=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitDoCompareChainCondLowering=0
export DOTNET_JitDoLowerConstCompare=0
export DOTNET_JitDoLowerJTrue=0
 

Running this, lowering still needs to switch some of the nodes to target specific ones. Then, BB01 to BB05 will be generated as follows:

=============== Generating BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} flags=0x00000000.00020011: i label LIR 
BB01 IN (2)={V00 V01} + ByrefExposed + GcHeap
     OUT(1)={    V01} + ByrefExposed + GcHeap

Recording Var Locations at start of BB01
  V00(x0)  V01(x1)
Change life 0000000000000000 {} -> 0000000000000003 {V00 V01}
                            V00 in reg x0 is becoming live  [------]
                            Live regs: 0000 {} => 0001 {x0}
New debug range: first
                            V01 in reg x1 is becoming live  [------]
                            Live regs: 0001 {x0} => 0003 {x0 x1}
New debug range: first
                            Live regs: (unchanged) 0003 {x0 x1}
                            GC regs: (unchanged) 0000 {}
                            Byref regs: (unchanged) 0000 {}

      L_M6318_BB01:
Label: G_M6318_IG02, GCvars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}

Scope info: begin block BB01, IL range [000..00D)
Added IP mapping: 0x0009 STACK_EMPTY (G_M6318_IG02,ins#0,ofs#0) label
Generating: N003 (???,???) [000030] -----------                            IL_OFFSET void   INLRT @ 0x009[E-] REG NA
Generating: N005 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 x0 (last use) REG x0 $80
Generating: N007 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 REG NA $43
                                                                        /--*  t0     int    
                                                                        +--*  t1     int    
Generating: N009 (  6,  4) [000002] N------N-U-                    t2 = *  GT        int    REG x0
                            V00 in reg x0 is becoming dead  [000000]
                            Live regs: 0003 {x0 x1} => 0002 {x1}
Closing debug range.
                Live vars after [000000]: {V00 V01} => {V01}
Mapped BB01 to G_M6318_IG02
IN0001:             cmp     w0, #3
IN0002:             cset    x0, hi
Generating: N011 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 x1 REG x1 $81
Generating: N013 (  1,  2) [000017] -c---------                   t17 =    CNS_INT   int    10 REG NA $44
                                                                        /--*  t16    int    
                                                                        +--*  t17    int    
Generating: N015 (  6,  4) [000018] -------N---                   t18 = *  EQ        int    REG x2 $101
IN0003:             cmp     w1, #10
IN0004:             cset    x2, eq
                                                                        /--*  t2     int    
                                                                        +--*  t18    int    
Generating: N017 ( 13,  9) [000024] -------N---                   t24 = *  AND       int    REG x0
IN0005:             and     w0, w0, w2
Generating: N019 (  1,  2) [000025] -c---------                   t25 =    CNS_INT   int    0 REG NA
                                                                        /--*  t24    int    
                                                                        +--*  t25    int    
Generating: N021 ( 18, 12) [000026] -----------                   t26 = *  EQ        int    REG x0
IN0006:             cmp     w0, #0
IN0007:             cset    x0, eq
Generating: N023 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V01 arg1         u:1 x1 REG x1 $81
Generating: N025 (  1,  2) [000005] -c---------                    t5 =    CNS_INT   int    4 REG NA $45
                                                                        /--*  t4     int    
                                                                        +--*  t5     int    
Generating: N027 (  6,  4) [000006] N------N-U-                    t6 = *  GT        int    REG x2 $102
IN0008:             cmp     w1, #4
IN0009:             cset    x2, hi
                                                                        /--*  t26    int    
                                                                        +--*  t6     int    
Generating: N029 ( 25, 17) [000027] -------N---                   t27 = *  AND       int    REG x0
IN000a:             and     w0, w0, w2
Generating: N031 (  1,  2) [000028] -c---------                   t28 =    CNS_INT   int    0 REG NA
                                                                        /--*  t27    int    
                                                                        +--*  t28    int    
Generating: N033 ( 30, 20) [000029] -----------                   t29 = *  NE        int    REG x0
IN000b:             cmp     w0, #0
IN000c:             cset    x0, ne
                                                                        /--*  t29    int    
Generating: N035 ( 32, 22) [000007] -----------                         *  JTRUE     void   REG NA $VN.Void
IN000d:             cbnz    (LARGEJMP)L_M6318_BB05

With the full function as:

G_M6318_IG01:
IN0016: 000000      stp     fp, lr, [sp, #-0x10]!
IN0017: 000004      mov     fp, sp
G_M6318_IG02:
IN0001: 000008      cmp     w0, #3
IN0002: 00000C      cset    x0, hi
IN0003: 000010      cmp     w1, #10
IN0004: 000014      cset    x2, eq
IN0005: 000018      and     w0, w0, w2
IN0006: 00001C      cmp     w0, #0
IN0007: 000020      cset    x0, eq
IN0008: 000024      cmp     w1, #4
IN0009: 000028      cset    x2, hi
IN000a: 00002C      and     w0, w0, w2
IN000b: 000030      cmp     w0, #0
IN000c: 000034      cset    x0, ne
IN000d: 000038      cbnz    w0, G_M6318_IG04
G_M6318_IG03:
IN000e: 00003C      mov     w0, #9
IN000f: 000040      b       G_M6318_IG05
G_M6318_IG04:
IN0010: 000044      mov     w0, #12
G_M6318_IG05:
IN0011: 000048      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN0012: 00004C      movk    x2, #0x45A8 LSL #16
IN0013: 000050      movk    x2, #0xFFFF LSL #32
IN0014: 000054      ldr     x2, [x2]
IN0015: 000058      blr     x2
G_M6318_IG06:
IN0018: 00005C      ldp     fp, lr, [sp], #0x10
IN0019: 000060      ret     lr
 

Look at BB02 - that is a lot of code. On the plus side we do have only a single branch, with all the conditions compared together in one lump. However, That is a lot of code to do those compares. Each compare node gets generated into two instructions. Firstly a CMP to do the comparison into flags, then a CSET to check a particular flag (for example, HI) and put a Boolean result into a register. For example, the first EQ becomes:

IN0003:             cmp     w1, #10
IN0004:             cset    x2, eq

Following through all the nodes, the first two compares in the if are generated into x0 and x2. These are ANDed together, compared against zero with the result going into x0. The final compare from the if is generated, then ANDed with the previous result (in x0) and compared against zero. This result also goes into x0. Note that the previous value in x0 is no longer required, which is why it can be reused. By default, RyuJIT uses the lowest available register. Finally, CNBZ compares this against zero and conditionally jumps.

Lowering JTRUE

There are multiple ways RyuJIT could improve on this code. Ideally this should be turned into a sequence of CMP and CCMP nodes. RyuJIT supports many different lowering optimizations and sometimes an optimization will change the code in such a way that it will prevent a later optimization. RyuJIT tries to ensure that that optimizations are applied in such an order that the best code is generated. Before looking at the compare chain optimization, let's look at some of the other optimizations that could have applied. We run the code as before, but now allow lowering of JTRUE.

export DOTNET_JitDump=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitDoCompareChainCondLowering=0
export DOTNET_JitDoLowerConstCompare=0
unset DOTNET_JitDoLowerJTrue

Now when we look at the lowering dumps, the following block:

N013 ( 25, 17) [000027] -------N---                   t27 = *  AND       int   
N014 (  1,  2) [000028] -c---------                   t28 =    CNS_INT   int    0
                                                            /--*  t27    int    
                                                            +--*  t28    int    
N015 ( 30, 20) [000029] -----------                   t29 = *  NE        int   
                                                            /--*  t29    int    
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void

becomes:

N013 ( 25, 17) [000027] -------N---                   t27 = *  AND       int   
N014 (  1,  2) [000028] -c---------                   t28 =    CNS_INT   int    0
                                                            /--*  t27    int    
                                                            +--*  t28    int    
N016 ( 32, 22) [000007] -----------                         *  JCMP      void  
 

The NE has been removed, and the JTRUE (which takes a single condition as input) is replaced by a JCMP (which takes in two values as input). Also note that the CNS_INT 0 has been marked with a flag "c". This means the node has been contained by its parent. There are many other flags which can be set on a node, but we will not be going into details on them.

A Side Note on Containing

At generation time, contained nodes are not generated. It is the job a parent to ensure it generates code for all it is contained children.

Consider: 

LCL_VAR   int    V01 arg1

CNS_INT   int    2

ADD

As it stands, this would generate code similar to:

MOV x3, 2

ADD x0, x0, x3

However, this can be improved by using a single

ADD x0, x0, #2

This can be done by containing the CNS_INT within the ADD. At generation time the CNS_INT is not generated. The generation of the ADD will see that one of its children is contained and instead of issuing a standard ADD, will issue the immediate version of ADD.

Only certain instructions support having contained children. The code must check it is safe to do so before containing.

Ok, back to the main code. Generation of the contained node:

Generating: N029 ( 25, 17) [000027] -------N---                   t27 = *  AND       int    REG x0
IN000a:             and     w0, w0, w2
Generating: N031 (  1,  2) [000028] -c---------                   t28 =    CNS_INT   int    0 REG NA
                                                                        /--*  t27    int    
                                                                        +--*  t28    int    
Generating: N033 ( 32, 22) [000007] -----------                         *  JCMP      void   REG NA
IN000b:             cbnz    (LARGEJMP)L_M6318_BB05

With the full function:

G_M6318_IG01:
IN0014: 000000      stp     fp, lr, [sp, #-0x10]!
IN0015: 000004      mov     fp, sp
G_M6318_IG02:
IN0001: 000008      cmp     w0, #3
IN0002: 00000C      cset    x0, hi
IN0003: 000010      cmp     w1, #10
IN0004: 000014      cset    x2, eq
IN0005: 000018      and     w0, w0, w2
IN0006: 00001C      cmp     w0, #0
IN0007: 000020      cset    x0, eq
IN0008: 000024      cmp     w1, #4
IN0009: 000028      cset    x2, hi
IN000a: 00002C      and     w0, w0, w2
IN000b: 000030      cbnz    w0, G_M6318_IG04
G_M6318_IG03:
IN000c: 000034      mov     w0, #9
IN000d: 000038      b       G_M6318_IG05
G_M6318_IG04:
IN000e: 00003C      mov     w0, #12
G_M6318_IG05:
IN000f: 000040      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN0010: 000044      movk    x2, #0x6DA8 LSL #16
IN0011: 000048      movk    x2, #0xFFFF LSL #32
IN0012: 00004C      ldr     x2, [x2]
IN0013: 000050      blr     x2
G_M6318_IG06:
IN0016: 000054      ldp     fp, lr, [sp], #0x10
IN0017: 000058      ret     lr

This has removed two instructions - the CMP and CSET following the final AND.

Lowering: optimizing Const Compare

Next let's re-enable constant compare optimizations within lowering:

export DOTNET_JitDump=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitDoCompareChainCondLowering=0
unset DOTNET_JitDoLowerConstCompare
 

Now the lowering phase will check the children of an EQ or NE node for AND and 0:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                    t1 =    CNS_INT   int    3 $43
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] N------N-U-                    t2 = *  GT        int   
N004 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000017] -----------                   t17 =    CNS_INT   int    10 $44
                                                            /--*  t16    int    
                                                            +--*  t17    int    
N006 (  6,  4) [000018] -------N---                   t18 = *  EQ        int    $101
                                                            /--*  t2     int    
                                                            +--*  t18    int    
N007 ( 13,  9) [000024] -------N---                   t24 = *  AND       int   
N008 (  1,  2) [000025] -----------                   t25 =    CNS_INT   int    0
                                                            /--*  t24    int    
                                                            +--*  t25    int    
N009 ( 18, 12) [000026] -----------                   t26 = *  EQ        int   

This is transformed into:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 $43
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] N------N-U-                    t2 = *  GT        int   
N004 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000017] -c---------                   t17 =    CNS_INT   int    10 $44
                                                            /--*  t16    int    
                                                            +--*  t17    int    
N006 (  6,  4) [000018] -------N---                   t18 = *  EQ        int    $101
                                                            /--*  t2     int    
                                                            +--*  t18    int    
N009 ( 18, 12) [000026] -----------                   t26 = *  TEST_EQ   int   

The EQ(AND(X,Y),0) is converted into TEST_EQ(X,Y), removing the AND and 0, replacing the EQ with a TEST_EQ. This will then map directly to the Arm64 instruction TST.

A similar change is made to optimize the NE to TEST_NE. By doing this JTRUE optimization can no longer happen. Instead, when the JTRUE is lowered to the following:

N015 ( 30, 20) [000029] -----------                   t29 = *  TEST_NE   int   
                                                            /--*  t29    int    
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void

is transformed into a generic TEST instruction, with the NE having moved down into the JCC node as an additional flag.

N015 ( 30, 20) [000029] -----------                         *  TEST      void  
N016 ( 32, 22) [000007] -----------                            JCC       void   cond=UNE

This is not for optimization purposes and more for rationalization of the IR. When generating a comparison node (for example, EQ, NE) or test node (for example, TEST_EQ), it will be generated as a single Arm64 CMP/TST instruction. It is the following instruction that does the specific flag checking (the EQ part). This transformation simply ensures the nodes are better scoped to the resulting instructions. Ideally an instruction should not depend on its children for how it should operate. The intention is that there should be no JTRUE nodes remaining in the IR at generation.

Finally, the IR is generated into:

*************** After end code gen, before unwindEmit()
G_M6318_IG01:        ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0013: 000000      stp     fp, lr, [sp, #-0x10]!
IN0014: 000004      mov     fp, sp

G_M6318_IG02:        ; offs=0x000008, size=0x0028, bbWeight=1, PerfScore 5.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz

IN0001: 000008      cmp     w0, #3
IN0002: 00000C      cset    x0, hi
IN0003: 000010      cmp     w1, #10
IN0004: 000014      cset    x2, eq
IN0005: 000018      tst     w0, w2
IN0006: 00001C      cset    x0, eq
IN0007: 000020      cmp     w1, #4
IN0008: 000024      cset    x2, hi
IN0009: 000028      tst     w0, w2
IN000a: 00002C      bne     G_M6318_IG04

G_M6318_IG03:        ; offs=0x000030, size=0x0008, bbWeight=0.50, PerfScore 0.75, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB04 [0003], byref

IN000b: 000030      mov     w0, #9
IN000c: 000034      b       G_M6318_IG05

G_M6318_IG04:        ; offs=0x000038, size=0x0004, bbWeight=0.50, PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB05 [0004], byref

IN000d: 000038      mov     w0, #12

G_M6318_IG05:        ; offs=0x00003C, size=0x0014, bbWeight=1, PerfScore 5.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0005], byref

IN000e: 00003C      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN000f: 000040      movk    x2, #0x45A6 LSL #16
IN0010: 000044      movk    x2, #0xFFFF LSL #32
IN0011: 000048      ldr     x2, [x2]
IN0012: 00004C      blr     x2

G_M6318_IG06:        ; offs=0x000050, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend

IN0015: 000050      ldp     fp, lr, [sp], #0x10
IN0016: 000054      ret     lr

Which is one instruction shorter than previously: both ANDs have been removed, whereas previously the JTRUE lowering could only remove the second AND.

Lowering: optimizing Compare chains

Finally, let's look at what happens when re-enabling the specific compare chain logic. This optimization was written alongside the optimize bool code and is a good example of how different optimizations work together. From the examples above, it is clear it would not be a good idea to have the optimize bool code without something to break them down.

export DOTNET_JitDump=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
unset DOTNET_JitDoCompareChainCondLowering

 

The lowering for compare chains starts at any AND (or OR) node. It checks that the input of the AND (the EQ and GT) are compare nodes and of integer type:

[000027] is a potential candidate for CCMP:
N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 $43
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] N------N-U-                    t2 = *  GT        int   
N004 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -c---------                   t19 =    CNS_INT   int    10 $44
                                                            /--*  t18    int    
                                                            +--*  t19    int    
N006 (  6,  4) [000020] -------N---                   t20 = *  EQ        int    $101
                                                            /--*  t2     int    
                                                            +--*  t20    int    
N007 ( 13,  9) [000027] -------N---                   t27 = *  AND       int   

 

First, the nodes need shuffling to first set up all the arguments to the if and then all the comparisons in a sequence. In the example, the GT, EQ, and AND are placed after the LCL_VAR and CNS_INT nodes. This is subject to several checks to ensure that there are no dependency issues between the reordered nodes. This could happen, say, if one of the nodes was writing to memory which was then used by another node.

Then the nodes can be transformed. As mentioned earlier with the TEST/JCC transformation, the compare nodes are a little odd as they do not match directly to a single instruction. They generate a CMP node and then it is the job of the following instruction to check the condition. The CMP node is used to model this behavior. A compare node can be transformed into a CMP node providing the parent node is able to hold the condition as an additional flag. A CCMP node is similar, but maps to a CCMP instruction. Only certain nodes can hold condition flags - namely CCMP, SETCC and JCC.

Working through the example, the AND is turned into a SETCC allowing it to hold a condition. The EQ is turned into a CCMP with the condition pushed down to the SETCC. Finally, the GT is turned into a CMP, with the condition pushed down to the EQ:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 $43
N004 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -c---------                   t19 =    CNS_INT   int    10 $44
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] -------N-U-                         *  CMP       void  
                                                            /--*  t18    int    
                                                            +--*  t19    int    
N006 (  6,  4) [000020] -------N---                         *  CCMP      void   cond=UGT flags=0
N007 ( 13,  9) [000027] -------N---                   t27 =    SETCC     int    cond=UEQ

The same optimization is then performed on the next AND. At the start of the optimization:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 $43
N004 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000017] -c---------                   t17 =    CNS_INT   int    10 $44
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] -------N-U-                         *  CMP       void  
                                                            /--*  t16    int    
                                                            +--*  t17    int    
N006 (  6,  4) [000018] -------N---                         *  CCMP      void   cond=UGT flags=0
N007 ( 13,  9) [000024] -------N---                   t24 =    SETCC     int    cond=UNE
N010 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V01 arg1         u:1 $81
N011 (  1,  2) [000005] -c---------                    t5 =    CNS_INT   int    4 $45
                                                            /--*  t4     int    
                                                            +--*  t5     int    
N012 (  6,  4) [000006] N------N-U-                    t6 = *  GT        int    $102
                                                            /--*  t24    int    
                                                            +--*  t6     int    
N013 ( 25, 17) [000027] -------N---                   t27 = *  AND       int   

The code checks the inputs to the AND. The GT is ok because it is a compare, the SETCC is also allowed as the result of the SETCC goes into flags. It then applies the same transformation. The AND becomes a SETCC, the GT becomes a CMP. The first input to the AND is the previous SETCC and it is removed. These are all moved to after the inputs to the conditions. This now gives:

N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 $43
N004 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000017] -c---------                   t17 =    CNS_INT   int    10 $44
N010 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V01 arg1         u:1 $81
N011 (  1,  2) [000005] -c---------                    t5 =    CNS_INT   int    4 $45
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] -------N-U-                         *  CMP       void  
                                                            /--*  t16    int    
                                                            +--*  t17    int    
N006 (  6,  4) [000018] -------N---                         *  CCMP      void   cond=UGT flags=0

(Note: The earlier SETCC is removed from here)
                                                            /--*  t4     int    
                                                            +--*  t5     int    
N012 (  6,  4) [000006] -------N-U-                         *  CCMP      void   cond=UNE flags=z
N013 ( 25, 17) [000027] -------N---                   t27 =    SETCC     int    cond=UGT

As mentioned above, the optimization will trigger if the first input of the AND is a SETCC. Potentially this allows any tree of nodes ending in a SETCC, not just a chain previously converted. In practice, only compare chains will be matched, but this is an example of good practice in RyuJIT- optimizations should apply to the widest range of cases possible. If in the future an earlier optimization creates SETCC nodes connected to an AND, then the compare chain code will automatically convert it to the best possible code.

Post the compare chain optimization, the JTRUE will be lowered, which will remove the SETCC child:

N013 ( 25, 17) [000027] -------N---                   t27 =    SETCC     int    cond=UGT
                                                            /--*  t27    int    
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void

This becomes:

N016 ( 32, 22) [000007] -----------                            JCC       void   cond=UGT

Eventually the function is generated, with BB01 to BB05 as follows:

=============== Generating BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} flags=0x00000000.00020011: i label LIR 
BB01 IN (2)={V00 V01} + ByrefExposed + GcHeap
     OUT(1)={    V01} + ByrefExposed + GcHeap

Recording Var Locations at start of BB01
  V00(x0)  V01(x1)
Change life 0000000000000000 {} -> 0000000000000003 {V00 V01}
                            V00 in reg x0 is becoming live  [------]
                            Live regs: 0000 {} => 0001 {x0}
New debug range: first
                            V01 in reg x1 is becoming live  [------]
                            Live regs: 0001 {x0} => 0003 {x0 x1}
New debug range: first
                            Live regs: (unchanged) 0003 {x0 x1}
                            GC regs: (unchanged) 0000 {}
                            Byref regs: (unchanged) 0000 {}

      L_M6318_BB01:
Label: G_M6318_IG02, GCvars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}

Scope info: begin block BB01, IL range [000..00D)
Added IP mapping: 0x0009 STACK_EMPTY (G_M6318_IG02,ins#0,ofs#0) label
Generating: N003 (???,???) [000030] -----------                            IL_OFFSET void   INLRT @ 0x009[E-] REG NA
Generating: N005 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 x0 (last use) REG x0 $80
Generating: N007 (  1,  2) [000001] -c---------                    t1 =    CNS_INT   int    3 REG NA $43
Generating: N009 (  1,  1) [000016] -----------                   t16 =    LCL_VAR   int    V01 arg1         u:1 x1 REG x1 $81
Generating: N011 (  1,  2) [000017] -c---------                   t17 =    CNS_INT   int    10 REG NA $44
Generating: N013 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V01 arg1         u:1 x1 REG x1 $81
Generating: N015 (  1,  2) [000005] -c---------                    t5 =    CNS_INT   int    4 REG NA $45
                                                                        /--*  t0     int    
                                                                        +--*  t1     int    
Generating: N017 (  6,  4) [000002] -------N-U-                         *  CMP       void   REG NA
                            V00 in reg x0 is becoming dead  [000000]
                            Live regs: 0003 {x0 x1} => 0002 {x1}
Closing debug range.
                Live vars after [000000]: {V00 V01} => {V01}
Mapped BB01 to G_M6318_IG02
IN0001:             cmp     w0, #3
                                                                        /--*  t16    int    
                                                                        +--*  t17    int    
Generating: N019 (  6,  4) [000018] -------N---                         *  CCMP      void   cond=UGT flags=0 REG NA
IN0002:             ccmp    w1, #10, 0, hi
                                                                        /--*  t4     int    
                                                                        +--*  t5     int    
Generating: N021 (  6,  4) [000006] -------N-U-                         *  CCMP      void   cond=UNE flags=z REG NA
IN0003:             ccmp    w1, #4, z, ne
Generating: N023 ( 32, 22) [000007] -----------                            JCC       void   cond=UGT REG NA
IN0004:             bhi     (LARGEJMP)L_M6318_BB05

Variable Live Range History Dump for BB01
V00 arg0: x0 [(G_M6318_IG02,ins#0,ofs#0), (G_M6318_IG02,ins#0,ofs#0)]
V01 arg1: x1 [(G_M6318_IG02,ins#0,ofs#0), ...]

Giving the final function:

G_M6318_IG01:
IN000d: 000000      stp     fp, lr, [sp, #-0x10]!
IN000e: 000004      mov     fp, sp
G_M6318_IG02:
IN0001: 000008      cmp     w0, #3
IN0002: 00000C      ccmp    w1, #10, 0, hi
IN0003: 000010      ccmp    w1, #4, z, ne
IN0004: 000014      bhi     G_M6318_IG04
G_M6318_IG03:
IN0005: 000018      mov     w0, #9
IN0006: 00001C      b       G_M6318_IG05
G_M6318_IG04:
IN0007: 000020      mov     w0, #12
G_M6318_IG05:
IN0008: 000024      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN0009: 000028      movk    x2, #0x49A6 LSL #16
IN000a: 00002C      movk    x2, #0xFFFF LSL #32
IN000b: 000030      ldr     x2, [x2]
IN000c: 000034      blr     x2
G_M6318_IG06:
IN000f: 000038      ldp     fp, lr, [sp], #0x10
IN0010: 00003C      ret     lr

 

This is much better and exactly what we were looking for. IG02 is 4 instructions, instead of 13 previously.

Admittedly this is still tricky for a human to parse. It's easy to spot that we have a op1>3, op2!=10 and op2>4, but hard to figure out how this chain connects with && and ||. For that we would have to parse the failure flags of the CCMPs, and invert conditions accordingly.

Downsides of CCMP

Let us look back at the original if statement:

if (op1 > 3 && op2 == 10 || op2 <= 4) 

In the original code (with all the branches), if the first two expressions pass (op1 > 3 && op2 == 10) then the final expression (op2 <= 4) will never be calculated. If the first expression (op1 > 3) fails, then the second expression (op2 == 10) will never be calculated, but the third (op2 <= 4) will. Writing all the combinations shows us that five out of eight times only two operations will be calculated:

Expression results

Expression is calculated?

Total calculations when using branches

Total calculations when using CCMP

1

2

3

1

2

3

false

false

false

yes

no

yes

2

3

true

false

false

yes

yes

yes

3

3

false

true

false

yes

no

yes

2

3

true

true

false

yes

yes

no

2

3

false

false

true

yes

no

yes

2

3

true

false

true

yes

yes

yes

3

3

false

true

true

yes

no

yes

2

3

true

true

true

yes

yes

no

3

3

In the new code, the chain of CMP/CCMP will ensure all three expressions are always calculated. In our case, these three expressions are always faster than the different combinations of branches.

Now consider the following statement:

if (op1 > 1 || op2 > 1 || op3 > 1 || op4 > 1 || op5 > 1 || op6 > 1 || o7 > 1 || op8 > 1) 

If only op8 is greater than 1, then both branch and CCMP versions of the compiled code would require 8 compares. If op1 is greater than 1, then in the branch version only 1 operation would be calculated, whereas in the CCMP version, all 8 compares would still be calculated. In that scenario, then the branch code would be quicker.

At compile time, RyuJIT will know the contents of op1 and op2 but does not know what their contents will be on future calls. Every call could use the same set of values, or sequentially increasing each call, or purely random. Therefore, a normal compiler cannot make any assumptions. This is where profile-guided optimizations (PGO) come into play. Usually, .NET runtime will operate with tiered mode compilation. The first time encountering a function, RyuJIT will quickly compile into reasonable code. Once the function has been called a certain number of times, RyuJIT will jump in and recompile the function into more optimal code but will take longer doing so. This ensures infrequently executed functions are accessible quickly, but hot functions run quickly. With PGO, RyuJIT also keeps track how often basic blocks are run. When recompiling it can optimize based on this if it assumes all future calls will follow the same patterns. This is used by various optimizations (for example, inlining, basic block layout, LSRA, guarded devirtualization, physical promotion), but isn’t currently used by If Conversion.

Without any runtime knowledge, RyuJIT cannot make any assumptions here and must optimize for all inputs equally. Let us see what RyuJIT does with this example:

[MethodImpl(MethodImplOptions.NoInlining)]
        private static void TestOr8WithElse(uint op1, uint op2, uint op3, uint op4, uint op5, uint op6, uint op7, uint op8) {
        if (op1 > 1 || op2 > 1 || op3 > 1 || op4 > 1 || op5 > 1 || op6 > 1 || op7 > 1 || op8 > 1 ) {
                op1 = 9;
            } else {
                op1 = 12;
            }
            Consume(op1, op2);
        }

If following along, switch the function to dump:

export DOTNET_JitDump=TestOr8WithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
unset DOTNET_JitDoCompareChainCondLowering

 

This gives:

G_M47932_IG01:
IN0014: 000000      stp     fp, lr, [sp, #-0x10]!
IN0015: 000004      mov     fp, sp
G_M47932_IG02:
IN0001: 000008      cmp     w0, #1
IN0002: 00000C      ccmp    w1, #1, c, ls
IN0003: 000010      ccmp    w2, #1, c, ls
IN0004: 000014      bhi     G_M47932_IG04
G_M47932_IG03:
IN0005: 000018      cmp     w3, #1
IN0006: 00001C      ccmp    w4, #1, c, ls
IN0007: 000020      ccmp    w5, #1, c, ls
IN0008: 000024      bhi     G_M47932_IG04
IN0009: 000028      cmp     w6, #1
IN000a: 00002C      ccmp    w7, #1, c, ls
IN000b: 000030      bls     G_M47932_IG05
G_M47932_IG04:
IN000c: 000034      mov     w0, #9
IN000d: 000038      b       G_M47932_IG06
G_M47932_IG05:
IN000e: 00003C      mov     w0, #12
G_M47932_IG06:
IN000f: 000040      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN0010: 000044      movk    x2, #0x69A8 LSL #16
IN0011: 000048      movk    x2, #0xFFFF LSL #32
IN0012: 00004C      ldr     x2, [x2]
IN0013: 000050      blr     x2
G_M47932_IG07:
IN0016: 000054      ldp     fp, lr, [sp], #0x10
IN0017: 000058      ret     lr

RyuJIT has broken this into two chains - two chains of three comparisons (IG02, IG03) and one chain of two comparisons (IG03).

This is checked in the optimize bools pass. In RyuJIT dump are instances of it trying to combine two JTRUE blocks but then exiting with:

Skipping CompareChainCond that will evaluate conditions unconditionally at costs 32,5 

This works by calculating static costs within the two blocks. With static costing, each node type is giving a fixed cost. For example, an ADD would be cost 1, but a floating point add or load from memory would be much higher. This cost represents the relative time a node, once turned into code, will take to execute. These costings are approximate and are (currently) the same across all target systems. The total cost of a tree node is the total cost of all the nodes in it, plus potential additional side costs.

If the cost is larger than a set constant, then the optimization fails. This costing has been set to allow chains of three, but no bigger. There may still be a better way of distributing this, but in practice it is good enough without making RyuJIT overly complicated.

Stress Modes

A single compare chain can be forced by enabling stress mode.

For testing purposes, .NET provides various stress modes. This is applied per optimization or functional phase and usually forces the code to apply to everything it can. For example, the register allocator will not reuse registers, or the scheduler will put every instruction into a new IG, or a pass will ignore all costings. The runtime CI will run instances of the test suites with all the stress modes enabled. The intention here to catch bugs that not show under normal circumstances, allowing a wider range of testing without having to write targeted esoteric C# test applications.

For optimize bools, enabling stress mode will skip all the costing checks. This can be done via:

export DOTNET_JitDump=TestOr8WithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitStressModeNames=STRESS_OPT_BOOLS_COMPARE_CHAIN_COST

Giving:

G_M47932_IG01:
IN0012: 000000      stp     fp, lr, [sp, #-0x10]!
IN0013: 000004      mov     fp, sp
G_M47932_IG02:
IN0001: 000008      cmp     w0, #1
IN0002: 00000C      ccmp    w1, #1, c, ls
IN0003: 000010      ccmp    w2, #1, c, ls
IN0004: 000014      ccmp    w3, #1, c, ls
IN0005: 000018      ccmp    w4, #1, c, ls
IN0006: 00001C      ccmp    w5, #1, c, ls
IN0007: 000020      ccmp    w6, #1, c, ls
IN0008: 000024      ccmp    w7, #1, c, ls
IN0009: 000028      bls     G_M47932_IG04
G_M47932_IG03:
IN000a: 00002C      mov     w0, #9
IN000b: 000030      b       G_M47932_IG05
G_M47932_IG04:
IN000c: 000034      mov     w0, #12
G_M47932_IG05:
IN000d: 000038      movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
IN000e: 00003C      movk    x2, #0x45A6 LSL #16
IN000f: 000040      movk    x2, #0xFFFF LSL #32
IN0010: 000044      ldr     x2, [x2]
IN0011: 000048      blr     x2
G_M47932_IG06:
IN0014: 00004C      ldp     fp, lr, [sp], #0x10
IN0015: 000050      ret     lr

Closing comments

That is everything I wanted to cover for the optimize bool phase. In Part 3, I will be finally moving onto the If Conversion phase to remove the remaining branches.

If Conversion Within .NET - Part 3

Anonymous
Architectures and Processors blog
  • Introducing GICv5: Scalable and secure interrupt management for Arm

    Christoffer Dall
    Christoffer Dall
    Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
    • April 28, 2025
  • Getting started with AARCHMRS Features.json using Python

    Joh
    Joh
    A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
    • April 8, 2025
  • Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

    Samer El-Haj-Mahmoud
    Samer El-Haj-Mahmoud
    Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
    • January 28, 2025