This is the second part of the blog series on If Conversion in .NET. In part 1 we walked through how .NET optimizes Boolean operations. This part walks through the details of various optimizations happening within the lowering phase. The examples continue to use the same example code from the first part.
If Conversion Within .NET - Part 1
Previously, I showed the result of the lowering phase. Now we will force RyuJIT to skip as much of this phase as possible. Some parts of lowering are required for later phases, so it cannot be turned off entirely without causing compilation errors. For anyone trying this, we can do this via turning off some additional optimizations via some our special config options:
export DOTNET_JitDump=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitDoCompareChainCondLowering=0 export DOTNET_JitDoLowerConstCompare=0 export DOTNET_JitDoLowerJTrue=0
Running this, lowering still needs to switch some of the nodes to target specific ones. Then, BB01 to BB05 will be generated as follows:
BB01
BB05
=============== Generating BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} flags=0x00000000.00020011: i label LIR BB01 IN (2)={V00 V01} + ByrefExposed + GcHeap OUT(1)={ V01} + ByrefExposed + GcHeap Recording Var Locations at start of BB01 V00(x0) V01(x1) Change life 0000000000000000 {} -> 0000000000000003 {V00 V01} V00 in reg x0 is becoming live [------] Live regs: 0000 {} => 0001 {x0} New debug range: first V01 in reg x1 is becoming live [------] Live regs: 0001 {x0} => 0003 {x0 x1} New debug range: first Live regs: (unchanged) 0003 {x0 x1} GC regs: (unchanged) 0000 {} Byref regs: (unchanged) 0000 {} L_M6318_BB01: Label: G_M6318_IG02, GCvars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {} Scope info: begin block BB01, IL range [000..00D) Added IP mapping: 0x0009 STACK_EMPTY (G_M6318_IG02,ins#0,ofs#0) label Generating: N003 (???,???) [000030] ----------- IL_OFFSET void INLRT @ 0x009[E-] REG NA Generating: N005 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 x0 (last use) REG x0 $80 Generating: N007 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 REG NA $43 /--* t0 int +--* t1 int Generating: N009 ( 6, 4) [000002] N------N-U- t2 = * GT int REG x0 V00 in reg x0 is becoming dead [000000] Live regs: 0003 {x0 x1} => 0002 {x1} Closing debug range. Live vars after [000000]: {V00 V01} => {V01} Mapped BB01 to G_M6318_IG02 IN0001: cmp w0, #3 IN0002: cset x0, hi Generating: N011 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 x1 REG x1 $81 Generating: N013 ( 1, 2) [000017] -c--------- t17 = CNS_INT int 10 REG NA $44 /--* t16 int +--* t17 int Generating: N015 ( 6, 4) [000018] -------N--- t18 = * EQ int REG x2 $101 IN0003: cmp w1, #10 IN0004: cset x2, eq /--* t2 int +--* t18 int Generating: N017 ( 13, 9) [000024] -------N--- t24 = * AND int REG x0 IN0005: and w0, w0, w2 Generating: N019 ( 1, 2) [000025] -c--------- t25 = CNS_INT int 0 REG NA /--* t24 int +--* t25 int Generating: N021 ( 18, 12) [000026] ----------- t26 = * EQ int REG x0 IN0006: cmp w0, #0 IN0007: cset x0, eq Generating: N023 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V01 arg1 u:1 x1 REG x1 $81 Generating: N025 ( 1, 2) [000005] -c--------- t5 = CNS_INT int 4 REG NA $45 /--* t4 int +--* t5 int Generating: N027 ( 6, 4) [000006] N------N-U- t6 = * GT int REG x2 $102 IN0008: cmp w1, #4 IN0009: cset x2, hi /--* t26 int +--* t6 int Generating: N029 ( 25, 17) [000027] -------N--- t27 = * AND int REG x0 IN000a: and w0, w0, w2 Generating: N031 ( 1, 2) [000028] -c--------- t28 = CNS_INT int 0 REG NA /--* t27 int +--* t28 int Generating: N033 ( 30, 20) [000029] ----------- t29 = * NE int REG x0 IN000b: cmp w0, #0 IN000c: cset x0, ne /--* t29 int Generating: N035 ( 32, 22) [000007] ----------- * JTRUE void REG NA $VN.Void IN000d: cbnz (LARGEJMP)L_M6318_BB05
With the full function as:
G_M6318_IG01: IN0016: 000000 stp fp, lr, [sp, #-0x10]! IN0017: 000004 mov fp, sp G_M6318_IG02: IN0001: 000008 cmp w0, #3 IN0002: 00000C cset x0, hi IN0003: 000010 cmp w1, #10 IN0004: 000014 cset x2, eq IN0005: 000018 and w0, w0, w2 IN0006: 00001C cmp w0, #0 IN0007: 000020 cset x0, eq IN0008: 000024 cmp w1, #4 IN0009: 000028 cset x2, hi IN000a: 00002C and w0, w0, w2 IN000b: 000030 cmp w0, #0 IN000c: 000034 cset x0, ne IN000d: 000038 cbnz w0, G_M6318_IG04 G_M6318_IG03: IN000e: 00003C mov w0, #9 IN000f: 000040 b G_M6318_IG05 G_M6318_IG04: IN0010: 000044 mov w0, #12 G_M6318_IG05: IN0011: 000048 movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN0012: 00004C movk x2, #0x45A8 LSL #16 IN0013: 000050 movk x2, #0xFFFF LSL #32 IN0014: 000054 ldr x2, [x2] IN0015: 000058 blr x2 G_M6318_IG06: IN0018: 00005C ldp fp, lr, [sp], #0x10 IN0019: 000060 ret lr
Look at BB02 - that is a lot of code. On the plus side we do have only a single branch, with all the conditions compared together in one lump. However, That is a lot of code to do those compares. Each compare node gets generated into two instructions. Firstly a CMP to do the comparison into flags, then a CSET to check a particular flag (for example, HI) and put a Boolean result into a register. For example, the first EQ becomes:
BB02
CMP
CSET
HI
EQ
IN0003: cmp w1, #10 IN0004: cset x2, eq
Following through all the nodes, the first two compares in the if are generated into x0 and x2. These are ANDed together, compared against zero with the result going into x0. The final compare from the if is generated, then ANDed with the previous result (in x0) and compared against zero. This result also goes into x0. Note that the previous value in x0 is no longer required, which is why it can be reused. By default, RyuJIT uses the lowest available register. Finally, CNBZ compares this against zero and conditionally jumps.
x0
x2
AND
if
CNBZ
There are multiple ways RyuJIT could improve on this code. Ideally this should be turned into a sequence of CMP and CCMP nodes. RyuJIT supports many different lowering optimizations and sometimes an optimization will change the code in such a way that it will prevent a later optimization. RyuJIT tries to ensure that that optimizations are applied in such an order that the best code is generated. Before looking at the compare chain optimization, let's look at some of the other optimizations that could have applied. We run the code as before, but now allow lowering of JTRUE.
CCMP
JTRUE
export DOTNET_JitDump=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitDoCompareChainCondLowering=0 export DOTNET_JitDoLowerConstCompare=0 unset DOTNET_JitDoLowerJTrue
Now when we look at the lowering dumps, the following block:
N013 ( 25, 17) [000027] -------N--- t27 = * AND int N014 ( 1, 2) [000028] -c--------- t28 = CNS_INT int 0 /--* t27 int +--* t28 int N015 ( 30, 20) [000029] ----------- t29 = * NE int /--* t29 int N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void becomes: N013 ( 25, 17) [000027] -------N--- t27 = * AND int N014 ( 1, 2) [000028] -c--------- t28 = CNS_INT int 0 /--* t27 int +--* t28 int N016 ( 32, 22) [000007] ----------- * JCMP void
The NE has been removed, and the JTRUE (which takes a single condition as input) is replaced by a JCMP (which takes in two values as input). Also note that the CNS_INT 0 has been marked with a flag "c". This means the node has been contained by its parent. There are many other flags which can be set on a node, but we will not be going into details on them.
NE
JCMP
CNS_INT 0
c
At generation time, contained nodes are not generated. It is the job a parent to ensure it generates code for all it is contained children.
Consider:
LCL_VAR int V01 arg1
CNS_INT int 2
ADD
As it stands, this would generate code similar to:
MOV x3, 2
ADD x0, x0, x3
However, this can be improved by using a single
ADD x0, x0, #2
This can be done by containing the CNS_INT within the ADD. At generation time the CNS_INT is not generated. The generation of the ADD will see that one of its children is contained and instead of issuing a standard ADD, will issue the immediate version of ADD.
CNS_INT
Only certain instructions support having contained children. The code must check it is safe to do so before containing.
Ok, back to the main code. Generation of the contained node:
Generating: N029 ( 25, 17) [000027] -------N--- t27 = * AND int REG x0 IN000a: and w0, w0, w2 Generating: N031 ( 1, 2) [000028] -c--------- t28 = CNS_INT int 0 REG NA /--* t27 int +--* t28 int Generating: N033 ( 32, 22) [000007] ----------- * JCMP void REG NA IN000b: cbnz (LARGEJMP)L_M6318_BB05
With the full function:
G_M6318_IG01: IN0014: 000000 stp fp, lr, [sp, #-0x10]! IN0015: 000004 mov fp, sp G_M6318_IG02: IN0001: 000008 cmp w0, #3 IN0002: 00000C cset x0, hi IN0003: 000010 cmp w1, #10 IN0004: 000014 cset x2, eq IN0005: 000018 and w0, w0, w2 IN0006: 00001C cmp w0, #0 IN0007: 000020 cset x0, eq IN0008: 000024 cmp w1, #4 IN0009: 000028 cset x2, hi IN000a: 00002C and w0, w0, w2 IN000b: 000030 cbnz w0, G_M6318_IG04 G_M6318_IG03: IN000c: 000034 mov w0, #9 IN000d: 000038 b G_M6318_IG05 G_M6318_IG04: IN000e: 00003C mov w0, #12 G_M6318_IG05: IN000f: 000040 movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN0010: 000044 movk x2, #0x6DA8 LSL #16 IN0011: 000048 movk x2, #0xFFFF LSL #32 IN0012: 00004C ldr x2, [x2] IN0013: 000050 blr x2 G_M6318_IG06: IN0016: 000054 ldp fp, lr, [sp], #0x10 IN0017: 000058 ret lr
This has removed two instructions - the CMP and CSET following the final AND.
Next let's re-enable constant compare optimizations within lowering:
export DOTNET_JitDump=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitDoCompareChainCondLowering=0 unset DOTNET_JitDoLowerConstCompare
Now the lowering phase will check the children of an EQ or NE node for AND and 0:
0
N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- t1 = CNS_INT int 3 $43 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] N------N-U- t2 = * GT int N004 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000017] ----------- t17 = CNS_INT int 10 $44 /--* t16 int +--* t17 int N006 ( 6, 4) [000018] -------N--- t18 = * EQ int $101 /--* t2 int +--* t18 int N007 ( 13, 9) [000024] -------N--- t24 = * AND int N008 ( 1, 2) [000025] ----------- t25 = CNS_INT int 0 /--* t24 int +--* t25 int N009 ( 18, 12) [000026] ----------- t26 = * EQ int
This is transformed into:
N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 $43 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] N------N-U- t2 = * GT int N004 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000017] -c--------- t17 = CNS_INT int 10 $44 /--* t16 int +--* t17 int N006 ( 6, 4) [000018] -------N--- t18 = * EQ int $101 /--* t2 int +--* t18 int N009 ( 18, 12) [000026] ----------- t26 = * TEST_EQ int
The EQ(AND(X,Y),0) is converted into TEST_EQ(X,Y), removing the AND and 0, replacing the EQ with a TEST_EQ. This will then map directly to the Arm64 instruction TST.
EQ(AND(X,Y),0)
TEST_EQ(X,Y)
TEST_EQ
TST
A similar change is made to optimize the NE to TEST_NE. By doing this JTRUE optimization can no longer happen. Instead, when the JTRUE is lowered to the following:
TEST_NE
N015 ( 30, 20) [000029] ----------- t29 = * TEST_NE int /--* t29 int N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void
is transformed into a generic TEST instruction, with the NE having moved down into the JCC node as an additional flag.
TEST
JCC
N015 ( 30, 20) [000029] ----------- * TEST void N016 ( 32, 22) [000007] ----------- JCC void cond=UNE
This is not for optimization purposes and more for rationalization of the IR. When generating a comparison node (for example, EQ, NE) or test node (for example, TEST_EQ), it will be generated as a single Arm64 CMP/TST instruction. It is the following instruction that does the specific flag checking (the EQ part). This transformation simply ensures the nodes are better scoped to the resulting instructions. Ideally an instruction should not depend on its children for how it should operate. The intention is that there should be no JTRUE nodes remaining in the IR at generation.
CMP/TST
Finally, the IR is generated into:
*************** After end code gen, before unwindEmit() G_M6318_IG01: ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG IN0013: 000000 stp fp, lr, [sp, #-0x10]! IN0014: 000004 mov fp, sp G_M6318_IG02: ; offs=0x000008, size=0x0028, bbWeight=1, PerfScore 5.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz IN0001: 000008 cmp w0, #3 IN0002: 00000C cset x0, hi IN0003: 000010 cmp w1, #10 IN0004: 000014 cset x2, eq IN0005: 000018 tst w0, w2 IN0006: 00001C cset x0, eq IN0007: 000020 cmp w1, #4 IN0008: 000024 cset x2, hi IN0009: 000028 tst w0, w2 IN000a: 00002C bne G_M6318_IG04 G_M6318_IG03: ; offs=0x000030, size=0x0008, bbWeight=0.50, PerfScore 0.75, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB04 [0003], byref IN000b: 000030 mov w0, #9 IN000c: 000034 b G_M6318_IG05 G_M6318_IG04: ; offs=0x000038, size=0x0004, bbWeight=0.50, PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB05 [0004], byref IN000d: 000038 mov w0, #12 G_M6318_IG05: ; offs=0x00003C, size=0x0014, bbWeight=1, PerfScore 5.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0005], byref IN000e: 00003C movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN000f: 000040 movk x2, #0x45A6 LSL #16 IN0010: 000044 movk x2, #0xFFFF LSL #32 IN0011: 000048 ldr x2, [x2] IN0012: 00004C blr x2 G_M6318_IG06: ; offs=0x000050, size=0x0008, bbWeight=1, PerfScore 2.00, epilog, nogc, extend IN0015: 000050 ldp fp, lr, [sp], #0x10 IN0016: 000054 ret lr
Which is one instruction shorter than previously: both ANDs have been removed, whereas previously the JTRUE lowering could only remove the second AND.
Finally, let's look at what happens when re-enabling the specific compare chain logic. This optimization was written alongside the optimize bool code and is a good example of how different optimizations work together. From the examples above, it is clear it would not be a good idea to have the optimize bool code without something to break them down.
export DOTNET_JitDump=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 unset DOTNET_JitDoCompareChainCondLowering
The lowering for compare chains starts at any AND (or OR) node. It checks that the input of the AND (the EQ and GT) are compare nodes and of integer type:
OR
GT
[000027] is a potential candidate for CCMP: N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 $43 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] N------N-U- t2 = * GT int N004 ( 1, 1) [000018] ----------- t18 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] -c--------- t19 = CNS_INT int 10 $44 /--* t18 int +--* t19 int N006 ( 6, 4) [000020] -------N--- t20 = * EQ int $101 /--* t2 int +--* t20 int N007 ( 13, 9) [000027] -------N--- t27 = * AND int
First, the nodes need shuffling to first set up all the arguments to the if and then all the comparisons in a sequence. In the example, the GT, EQ, and AND are placed after the LCL_VAR and CNS_INT nodes. This is subject to several checks to ensure that there are no dependency issues between the reordered nodes. This could happen, say, if one of the nodes was writing to memory which was then used by another node.
LCL_VAR
Then the nodes can be transformed. As mentioned earlier with the TEST/JCC transformation, the compare nodes are a little odd as they do not match directly to a single instruction. They generate a CMP node and then it is the job of the following instruction to check the condition. The CMP node is used to model this behavior. A compare node can be transformed into a CMP node providing the parent node is able to hold the condition as an additional flag. A CCMP node is similar, but maps to a CCMP instruction. Only certain nodes can hold condition flags - namely CCMP, SETCC and JCC.
TEST/JCC
SETCC
Working through the example, the AND is turned into a SETCC allowing it to hold a condition. The EQ is turned into a CCMP with the condition pushed down to the SETCC. Finally, the GT is turned into a CMP, with the condition pushed down to the EQ:
N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 $43 N004 ( 1, 1) [000018] ----------- t18 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] -c--------- t19 = CNS_INT int 10 $44 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] -------N-U- * CMP void /--* t18 int +--* t19 int N006 ( 6, 4) [000020] -------N--- * CCMP void cond=UGT flags=0 N007 ( 13, 9) [000027] -------N--- t27 = SETCC int cond=UEQ
The same optimization is then performed on the next AND. At the start of the optimization:
N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 $43 N004 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000017] -c--------- t17 = CNS_INT int 10 $44 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] -------N-U- * CMP void /--* t16 int +--* t17 int N006 ( 6, 4) [000018] -------N--- * CCMP void cond=UGT flags=0 N007 ( 13, 9) [000024] -------N--- t24 = SETCC int cond=UNE N010 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V01 arg1 u:1 $81 N011 ( 1, 2) [000005] -c--------- t5 = CNS_INT int 4 $45 /--* t4 int +--* t5 int N012 ( 6, 4) [000006] N------N-U- t6 = * GT int $102 /--* t24 int +--* t6 int N013 ( 25, 17) [000027] -------N--- t27 = * AND int
The code checks the inputs to the AND. The GT is ok because it is a compare, the SETCC is also allowed as the result of the SETCC goes into flags. It then applies the same transformation. The AND becomes a SETCC, the GT becomes a CMP. The first input to the AND is the previous SETCC and it is removed. These are all moved to after the inputs to the conditions. This now gives:
N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 $43 N004 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000017] -c--------- t17 = CNS_INT int 10 $44 N010 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V01 arg1 u:1 $81 N011 ( 1, 2) [000005] -c--------- t5 = CNS_INT int 4 $45 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] -------N-U- * CMP void /--* t16 int +--* t17 int N006 ( 6, 4) [000018] -------N--- * CCMP void cond=UGT flags=0 (Note: The earlier SETCC is removed from here) /--* t4 int +--* t5 int N012 ( 6, 4) [000006] -------N-U- * CCMP void cond=UNE flags=z N013 ( 25, 17) [000027] -------N--- t27 = SETCC int cond=UGT
As mentioned above, the optimization will trigger if the first input of the AND is a SETCC. Potentially this allows any tree of nodes ending in a SETCC, not just a chain previously converted. In practice, only compare chains will be matched, but this is an example of good practice in RyuJIT- optimizations should apply to the widest range of cases possible. If in the future an earlier optimization creates SETCC nodes connected to an AND, then the compare chain code will automatically convert it to the best possible code.
Post the compare chain optimization, the JTRUE will be lowered, which will remove the SETCC child:
N013 ( 25, 17) [000027] -------N--- t27 = SETCC int cond=UGT /--* t27 int N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void
This becomes:
N016 ( 32, 22) [000007] ----------- JCC void cond=UGT
Eventually the function is generated, with BB01 to BB05 as follows:
=============== Generating BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} flags=0x00000000.00020011: i label LIR BB01 IN (2)={V00 V01} + ByrefExposed + GcHeap OUT(1)={ V01} + ByrefExposed + GcHeap Recording Var Locations at start of BB01 V00(x0) V01(x1) Change life 0000000000000000 {} -> 0000000000000003 {V00 V01} V00 in reg x0 is becoming live [------] Live regs: 0000 {} => 0001 {x0} New debug range: first V01 in reg x1 is becoming live [------] Live regs: 0001 {x0} => 0003 {x0 x1} New debug range: first Live regs: (unchanged) 0003 {x0 x1} GC regs: (unchanged) 0000 {} Byref regs: (unchanged) 0000 {} L_M6318_BB01: Label: G_M6318_IG02, GCvars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {} Scope info: begin block BB01, IL range [000..00D) Added IP mapping: 0x0009 STACK_EMPTY (G_M6318_IG02,ins#0,ofs#0) label Generating: N003 (???,???) [000030] ----------- IL_OFFSET void INLRT @ 0x009[E-] REG NA Generating: N005 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 x0 (last use) REG x0 $80 Generating: N007 ( 1, 2) [000001] -c--------- t1 = CNS_INT int 3 REG NA $43 Generating: N009 ( 1, 1) [000016] ----------- t16 = LCL_VAR int V01 arg1 u:1 x1 REG x1 $81 Generating: N011 ( 1, 2) [000017] -c--------- t17 = CNS_INT int 10 REG NA $44 Generating: N013 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V01 arg1 u:1 x1 REG x1 $81 Generating: N015 ( 1, 2) [000005] -c--------- t5 = CNS_INT int 4 REG NA $45 /--* t0 int +--* t1 int Generating: N017 ( 6, 4) [000002] -------N-U- * CMP void REG NA V00 in reg x0 is becoming dead [000000] Live regs: 0003 {x0 x1} => 0002 {x1} Closing debug range. Live vars after [000000]: {V00 V01} => {V01} Mapped BB01 to G_M6318_IG02 IN0001: cmp w0, #3 /--* t16 int +--* t17 int Generating: N019 ( 6, 4) [000018] -------N--- * CCMP void cond=UGT flags=0 REG NA IN0002: ccmp w1, #10, 0, hi /--* t4 int +--* t5 int Generating: N021 ( 6, 4) [000006] -------N-U- * CCMP void cond=UNE flags=z REG NA IN0003: ccmp w1, #4, z, ne Generating: N023 ( 32, 22) [000007] ----------- JCC void cond=UGT REG NA IN0004: bhi (LARGEJMP)L_M6318_BB05 Variable Live Range History Dump for BB01 V00 arg0: x0 [(G_M6318_IG02,ins#0,ofs#0), (G_M6318_IG02,ins#0,ofs#0)] V01 arg1: x1 [(G_M6318_IG02,ins#0,ofs#0), ...]
Giving the final function:
G_M6318_IG01: IN000d: 000000 stp fp, lr, [sp, #-0x10]! IN000e: 000004 mov fp, sp G_M6318_IG02: IN0001: 000008 cmp w0, #3 IN0002: 00000C ccmp w1, #10, 0, hi IN0003: 000010 ccmp w1, #4, z, ne IN0004: 000014 bhi G_M6318_IG04 G_M6318_IG03: IN0005: 000018 mov w0, #9 IN0006: 00001C b G_M6318_IG05 G_M6318_IG04: IN0007: 000020 mov w0, #12 G_M6318_IG05: IN0008: 000024 movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN0009: 000028 movk x2, #0x49A6 LSL #16 IN000a: 00002C movk x2, #0xFFFF LSL #32 IN000b: 000030 ldr x2, [x2] IN000c: 000034 blr x2 G_M6318_IG06: IN000f: 000038 ldp fp, lr, [sp], #0x10 IN0010: 00003C ret lr
This is much better and exactly what we were looking for. IG02 is 4 instructions, instead of 13 previously.
IG02
Admittedly this is still tricky for a human to parse. It's easy to spot that we have a op1>3, op2!=10 and op2>4, but hard to figure out how this chain connects with && and ||. For that we would have to parse the failure flags of the CCMPs, and invert conditions accordingly.
op1>3
op2!=10
op2>4
&&
||
Let us look back at the original if statement:
if (op1 > 3 && op2 == 10 || op2 <= 4)
In the original code (with all the branches), if the first two expressions pass (op1 > 3 && op2 == 10) then the final expression (op2 <= 4) will never be calculated. If the first expression (op1 > 3) fails, then the second expression (op2 == 10) will never be calculated, but the third (op2 <= 4) will. Writing all the combinations shows us that five out of eight times only two operations will be calculated:
op1 > 3 && op2 == 10
op2 <= 4
op1 > 3
op2 == 10
Expression results
Expression is calculated?
Total calculations when using branches
Total calculations when using CCMP
1
2
3
false
yes
no
true
In the new code, the chain of CMP/CCMP will ensure all three expressions are always calculated. In our case, these three expressions are always faster than the different combinations of branches.
CMP/CCMP
Now consider the following statement:
if (op1 > 1 || op2 > 1 || op3 > 1 || op4 > 1 || op5 > 1 || op6 > 1 || o7 > 1 || op8 > 1)
If only op8 is greater than 1, then both branch and CCMP versions of the compiled code would require 8 compares. If op1 is greater than 1, then in the branch version only 1 operation would be calculated, whereas in the CCMP version, all 8 compares would still be calculated. In that scenario, then the branch code would be quicker.
op8
op1
At compile time, RyuJIT will know the contents of op1 and op2 but does not know what their contents will be on future calls. Every call could use the same set of values, or sequentially increasing each call, or purely random. Therefore, a normal compiler cannot make any assumptions. This is where profile-guided optimizations (PGO) come into play. Usually, .NET runtime will operate with tiered mode compilation. The first time encountering a function, RyuJIT will quickly compile into reasonable code. Once the function has been called a certain number of times, RyuJIT will jump in and recompile the function into more optimal code but will take longer doing so. This ensures infrequently executed functions are accessible quickly, but hot functions run quickly. With PGO, RyuJIT also keeps track how often basic blocks are run. When recompiling it can optimize based on this if it assumes all future calls will follow the same patterns. This is used by various optimizations (for example, inlining, basic block layout, LSRA, guarded devirtualization, physical promotion), but isn’t currently used by If Conversion.
op2
Without any runtime knowledge, RyuJIT cannot make any assumptions here and must optimize for all inputs equally. Let us see what RyuJIT does with this example:
[MethodImpl(MethodImplOptions.NoInlining)] private static void TestOr8WithElse(uint op1, uint op2, uint op3, uint op4, uint op5, uint op6, uint op7, uint op8) { if (op1 > 1 || op2 > 1 || op3 > 1 || op4 > 1 || op5 > 1 || op6 > 1 || op7 > 1 || op8 > 1 ) { op1 = 9; } else { op1 = 12; } Consume(op1, op2); }
If following along, switch the function to dump:
export DOTNET_JitDump=TestOr8WithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 unset DOTNET_JitDoCompareChainCondLowering
This gives:
G_M47932_IG01: IN0014: 000000 stp fp, lr, [sp, #-0x10]! IN0015: 000004 mov fp, sp G_M47932_IG02: IN0001: 000008 cmp w0, #1 IN0002: 00000C ccmp w1, #1, c, ls IN0003: 000010 ccmp w2, #1, c, ls IN0004: 000014 bhi G_M47932_IG04 G_M47932_IG03: IN0005: 000018 cmp w3, #1 IN0006: 00001C ccmp w4, #1, c, ls IN0007: 000020 ccmp w5, #1, c, ls IN0008: 000024 bhi G_M47932_IG04 IN0009: 000028 cmp w6, #1 IN000a: 00002C ccmp w7, #1, c, ls IN000b: 000030 bls G_M47932_IG05 G_M47932_IG04: IN000c: 000034 mov w0, #9 IN000d: 000038 b G_M47932_IG06 G_M47932_IG05: IN000e: 00003C mov w0, #12 G_M47932_IG06: IN000f: 000040 movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN0010: 000044 movk x2, #0x69A8 LSL #16 IN0011: 000048 movk x2, #0xFFFF LSL #32 IN0012: 00004C ldr x2, [x2] IN0013: 000050 blr x2 G_M47932_IG07: IN0016: 000054 ldp fp, lr, [sp], #0x10 IN0017: 000058 ret lr
RyuJIT has broken this into two chains - two chains of three comparisons (IG02, IG03) and one chain of two comparisons (IG03).
IG03
This is checked in the optimize bools pass. In RyuJIT dump are instances of it trying to combine two JTRUE blocks but then exiting with:
Skipping CompareChainCond that will evaluate conditions unconditionally at costs 32,5
This works by calculating static costs within the two blocks. With static costing, each node type is giving a fixed cost. For example, an ADD would be cost 1, but a floating point add or load from memory would be much higher. This cost represents the relative time a node, once turned into code, will take to execute. These costings are approximate and are (currently) the same across all target systems. The total cost of a tree node is the total cost of all the nodes in it, plus potential additional side costs.
If the cost is larger than a set constant, then the optimization fails. This costing has been set to allow chains of three, but no bigger. There may still be a better way of distributing this, but in practice it is good enough without making RyuJIT overly complicated.
A single compare chain can be forced by enabling stress mode.
For testing purposes, .NET provides various stress modes. This is applied per optimization or functional phase and usually forces the code to apply to everything it can. For example, the register allocator will not reuse registers, or the scheduler will put every instruction into a new IG, or a pass will ignore all costings. The runtime CI will run instances of the test suites with all the stress modes enabled. The intention here to catch bugs that not show under normal circumstances, allowing a wider range of testing without having to write targeted esoteric C# test applications.
For optimize bools, enabling stress mode will skip all the costing checks. This can be done via:
export DOTNET_JitDump=TestOr8WithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitStressModeNames=STRESS_OPT_BOOLS_COMPARE_CHAIN_COST
Giving:
G_M47932_IG01: IN0012: 000000 stp fp, lr, [sp, #-0x10]! IN0013: 000004 mov fp, sp G_M47932_IG02: IN0001: 000008 cmp w0, #1 IN0002: 00000C ccmp w1, #1, c, ls IN0003: 000010 ccmp w2, #1, c, ls IN0004: 000014 ccmp w3, #1, c, ls IN0005: 000018 ccmp w4, #1, c, ls IN0006: 00001C ccmp w5, #1, c, ls IN0007: 000020 ccmp w6, #1, c, ls IN0008: 000024 ccmp w7, #1, c, ls IN0009: 000028 bls G_M47932_IG04 G_M47932_IG03: IN000a: 00002C mov w0, #9 IN000b: 000030 b G_M47932_IG05 G_M47932_IG04: IN000c: 000034 mov w0, #12 G_M47932_IG05: IN000d: 000038 movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) IN000e: 00003C movk x2, #0x45A6 LSL #16 IN000f: 000040 movk x2, #0xFFFF LSL #32 IN0010: 000044 ldr x2, [x2] IN0011: 000048 blr x2 G_M47932_IG06: IN0014: 00004C ldp fp, lr, [sp], #0x10 IN0015: 000050 ret lr
That is everything I wanted to cover for the optimize bool phase. In Part 3, I will be finally moving onto the If Conversion phase to remove the remaining branches.
If Conversion Within .NET - Part 3