Over the past few releases, the Arm64 port of .NET has seen a lot of additions and improvements. In many cases making it on par with the X64 implementation. For more details on the exact changes, the official blog posts give a good overview – see here for .NET 5 and here for .NET 7.
Recently, some of those patches have come from Arm. We have a small team working on improving .NET’s code generation component known as RyuJIT. This blog series provides more details into one optimization we have been working on - If Conversion and the associated optimization, optimize bools. In theory this is a simple optimization, but adding it requires many considerations into the design of the RyuJIT. For example, how it will interact with other optimizations. Also, sometimes better-looking code is not always better in practice. Part one will focus on following through the optimize bools optimization.
Note, Arm refers to the architecture as AArch64, but here we follow the .NET naming, which is Arm64.
All the code and dumps generated below were produced using a modified version of coreclr from the .NET runtime repo, which had been slightly modified to allow for certain optimizations to be turned off via new config options. The code is available on GitHub. The SHA of the HEAD is:
182bee3189ebb38511e3963595bdb2234a310b09
If you want to follow along, then you will need a Arm64 (v8.0 or above) box running Linux. Install the standard build dependencies, then do a standard checked build and generate a CORE_ROOT folder:
./build.sh -rc Checked -lc Release -s clr+libs ./src/tests/build.sh generatelayoutonly Checked
The example test program will need building:
cp examples/ifconversion ~ cd ~/ifconversion ../dotnet/.dotnet/dotnet publish -c Release
And can be run with:
~/dotnet/runtime_bools/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun ./bin/Release/net7.0/ifconversion.dll
For each examples run below, it is assumed the test is run from a clean environment without any DOTNET_flags set. However, for convenience, any environment variables set in a previous stage and no longer required are explicitly unset. All the examples can also be reproduced on a Windows Arm64 machine; however, the commands will need adjusting.
DOTNET_
If you only have an X64 machine, then you will get different results. Most of the phases of RyuJIT are common across the two architectures. However due to the differences, some of the optimizations will not apply, and others will not give the exact same results. In addition, the config variables for this optimization have only been added to Arm64 code.
Consider the following snippet of C# from the example:
[MethodImpl(MethodImplOptions.NoInlining)] static void TestAndOrWithElse(uint op1, uint op2) { if (op1 > 3 && op2 == 10 || op2 <= 4) { op1 = 9; } else { op1 = 12; } Consume(op1, op2); }
op1
op2
Run this with:
export DOTNET_JitDisasm=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitDoCompareChainCond=0 ~/dotnet/runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun ./bin/Release/net7.0/linux-arm64/conditional.dll
This gives the following annotated Arm64 assembly:
G_M64092_IG01: stp fp, lr, [sp, #-0x10]! mov fp, sp G_M64092_IG02: cmp w0, #3 bls G_M64092_IG04 G_M64092_IG03: cmp w1, #10 beq G_M64092_IG05 G_M64092_IG04: cmp w1, #4 bhi G_M64092_IG06 G_M64092_IG05: mov w0, #9 b G_M64092_IG07 G_M64092_IG06: mov w1, #12 G_M64092_IG07: movz x2, #0x6EE0 // code for CSharpTutorials.Program:Consume[uint](uint,uint) movk x2, #0x55A7 LSL #16 movk x2, #0xFFFF LSL #32 ldr x2, [x2] blr x2 G_M64092_IG08: ldp fp, lr, [sp], #0x10 ret lr
There is some additional output shown by RyuJIT, for example, various instruction weighting throughout the assembly. This has been removed from the above for brevity.
The code has been broken down into several Instruction Groups (IG):
G_M64092_IG01: Set up the function stack. This is common boilerplate code.
G_M64092_IG01
G_M64092_IG02: Contains if (op1 > 3) . Compare op1 and 3 and set condition flags. If the result is less than, then (op1 > 3 && op2 == 10) has failed, so jump to IG04. Otherwise fall through to IG03.
G_M64092_IG02
if (op1 > 3)
3
(op1 > 3 && op2 == 10)
IG04
IG03
G_M64092_IG03: Contains if (op2 == 10) . Compare op2 and 10 and set condition flags. If the result is equals, then (op1 > 3 && op2 == 10) has passed, so jump to IG05.
G_M64092_IG03
if (op2 == 10)
10
IG05
G_M64092_IG04: Contains if (op2 <= 4). Compare op2 and 4 and set condition flags. If op2 is greater than 4, then (op2 <= 4) has failed, the code jumps to IG06.
G_M64092_IG04
if (op2 <= 4)
4
(op2 <= 4)
IG06
G_M64092_IG05: Contains the then clause op1 = 9. Set op1 then jump unconditionally to IG07.
G_M64092_IG05
then
op1 = 9.
IG07
G_M64092_IG06: Contains the else clause op1 = 12. Set op1 .
G_M64092_IG06
else
op1 = 12
G_M64092_IG07: Call the consume function.
G_M64092_IG07
G_M64092_IG08: Restore the function stack and return. Common boilerplate code.
G_M64092_IG08
Note how IG02, IG03 and IG04 handle the conditions of the if statement. Each one is a pair of compare + conditional branch instructions. The IGs work through the conditions sequentially with no reordering and each branch always moves forwards, never backwards. The && and || are implicitly encoded in the jump destinations of IG02 to IG04. For this to work, the conditions in IG02 and IG04 are inverted to test the failure case, whereas IG03 has no inverting and tests the pass case.
IG02
&&
||
There are two key issues with this code. First, for me at least, looking at the assembly it is tricky to determine the exact set of ANDed and ORed statements. For a CPU, this set of conditional jumps is not ideal as it cannot read ahead until the destination has been calculated. Over the years CPUs have become a lot better at this. Branch prediction is used to guess the likely outcome and continue decoding. If the prediction is wrong then the CPU must throw away the prediction and recalculate.
AND
OR
There are two main ways we can improve this code. The first is to roll all the compares from IG02 to IG04 into a single stream of instructions. Then calculate every step of the If, and make a single conditional jump to either IG05 or IG06. In the second way IG05 and IG06 are simply assigning a value to op1, so these two blocks can be merged into a conditional instruction, which removes the remaining jumps.
If
Let us look at the first one.
The AArch64 instruction CCMP is used to chain together successive comparisons. The CCMP is a conditional comparison - it will only compare based on the results of the previous compare. Consider:
CCMP
cmp w0, #3 ccmp w1, #10, 0, hi
The CMP compares w0 and 3 then sets the condition flags. The CCMP checks the condition flags for greater than unsigned (`HI`) are set and if so, then compares w1 and 10 and sets the condition flags. If the flags for HI were not set, then set the condition flags are set to the failure state 0. Through careful maintenance of the flags, we can chain together arbitrary AND and OR conditions. Working through the original example with the compare chain pass enabled gives:
CMP
w0
HI
w1
0
export DOTNET_JitDisasm=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0
G_M64092_IG01: A9BF7BFD stp fp, lr, [sp, #-0x10]! 910003FD mov fp, sp G_M64092_IG02: 71000C1F cmp w0, #3 7A4A8820 ccmp w1, #10, 0, hi 7A441824 ccmp w1, #4, z, ne 54000068 bhi G_M64092_IG04 G_M64092_IG03: 52800120 mov w0, #9 14000002 b G_M64092_IG05 G_M64092_IG04: ;; offset=0020H 52800181 mov w1, #12 G_M64092_IG05: D288F302 movz x2, #0x4798 // code for CSharpTutorials.Program:Consume[uint](uint,uint) F2AA3302 movk x2, #0x5198 LSL #16 F2DFFFE2 movk x2, #0xFFFF LSL #32 F9400042 ldr x2, [x2] D63F0040 blr x2 G_M64092_IG06: A8C17BFD ldp fp, lr, [sp], #0x10 D65F03C0 ret lr
All the compare and branches have been replaced with the single block of chained compares. What may not be immediately obvious is why all the conditions have inverted. Regardless, it is much easier now to see what is going on. In addition, the code size has reduced from 18 instructions to 16.
Note that CCMP does not care which instruction set the condition flags. A CCMP could instead follow a TST or ADDS instruction.
TST
ADDS
To illustrate this, Let us walk through what RyuJIT is doing. Set the following and rerun:
export DOTNET_JitDump=TestAndOrWithElse export DOTNET_TieredCompilation=0 export DOTNET_JitDoIfConversion=0 export DOTNET_JitDoCompareChainCond=0
This generates about 4500 lines of dump info from RyuJIT, starting with:
****** START compiling CSharpTutorials.Program:TestAndOrWithElse(uint,uint) (MethodHash=8530e751) Generating code for Unix arm64 OPTIONS: compCodeOpt = BLENDED_CODE OPTIONS: compDbgCode = false OPTIONS: compDbgInfo = true OPTIONS: compDbgEnC = false OPTIONS: compProcedureSplitting = false OPTIONS: compProcedureSplittingEH = false OPTIONS: optimiser should use profile data IL to import: IL_0000 02 ldarg.0 IL_0001 19 ldc.i4.3 IL_0002 36 05 ble.un.s 5 (IL_0009) IL_0004 03 ldarg.1 IL_0005 1f 0a ldc.i4.s 0xA IL_0007 2e 04 beq.s 4 (IL_000d) IL_0009 03 ldarg.1 IL_000a 1a ldc.i4.4 IL_000b 35 06 bgt.un.s 6 (IL_0013) IL_000d 1f 09 ldc.i4.s 0x9 IL_000f 10 00 starg.s 0x0 IL_0011 2b 04 br.s 4 (IL_0017) IL_0013 1f 0c ldc.i4.s 0xC IL_0015 10 00 starg.s 0x0 IL_0017 02 ldarg.0 IL_0018 03 ldarg.1 IL_0019 28 02 00 00 2b call 0x2B000002 IL_001e 2a ret Arg #0 passed in register(s) x0 Arg #1 passed in register(s) x1
First is the CIL bytecode, which is a stack-based machine. If you have been following the code so far, then this should be easily readable. At this point, the function has already been broken down into several compare and branches, including the inversion of some of the conditions. For example:
ldarg.0 - load arg 0 onto the stack ldc.i4.3 - load integer constant 3 onto the stack ble.un.s 5 (IL_0009) - compare and branch the top two entries on the stack. if less than or equals then jump to IL_0009.
ldarg.0
ldc.i4.3
IL_0009
The dump is next broken down into a section for each phase, each one beginning with
*************** Starting PHASE .....
And ending with
*************** Finishing PHASE ....
then either a dump of the current JIT state, or [no changes]if nothing has changed across the phase.
[no changes]
The first phase is the import phase, where RyuJIT reads through the CIL and converts it into IR nodes. At the end of that is:
*************** Finishing PHASE Importation Trees after Importation ----------------------------------------------------------------------------------------------------------------------------------------- BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags] ----------------------------------------------------------------------------------------------------------------------------------------- BB01 [0000] 1 1 [000..004)-> BB03 ( cond ) i BB02 [0001] 1 BB01 1 [004..009)-> BB04 ( cond ) i BB03 [0002] 2 BB01,BB02 1 [009..00D)-> BB05 ( cond ) i BB04 [0003] 2 BB02,BB03 1 [00D..013)-> BB06 (always) i BB05 [0004] 1 BB03 1 [013..017) i BB06 [0005] 2 BB04,BB05 1 [017..01F) (return) i ----------------------------------------------------------------------------------------------------------------------------------------- ------------ BB01 [000..004) -> BB03 (cond), preds={} succs={BB02,BB03} ***** BB01 STMT00000 ( 0x000[E-] ... 0x002 ) [000003] ----------- * JTRUE void [000002] N--------U- \--* LE int [000000] ----------- +--* LCL_VAR int V00 arg0 [000001] ----------- \--* CNS_INT int 3 ------------ BB02 [004..009) -> BB04 (cond), preds={BB01} succs={BB03,BB04} ***** BB02 STMT00006 ( 0x004[E-] ... 0x007 ) [000019] ----------- * JTRUE void [000018] ----------- \--* EQ int [000016] ----------- +--* LCL_VAR int V01 arg1 [000017] ----------- \--* CNS_INT int 10 ------------ BB03 [009..00D) -> BB05 (cond), preds={BB01,BB02} succs={BB04,BB05} ***** BB03 STMT00001 ( 0x009[E-] ... 0x00B ) [000007] ----------- * JTRUE void [000006] N--------U- \--* GT int [000004] ----------- +--* LCL_VAR int V01 arg1 [000005] ----------- \--* CNS_INT int 4 ------------ BB04 [00D..013) -> BB06 (always), preds={BB02,BB03} succs={BB06} ***** BB04 STMT00005 ( 0x00D[E-] ... 0x00F ) [000015] DA--------- * STORE_LCL_VAR int V00 arg0 [000014] ----------- \--* CNS_INT int 9 ------------ BB05 [013..017), preds={BB03} succs={BB06} ***** BB05 STMT00002 ( 0x013[E-] ... 0x015 ) [000009] DA--------- * STORE_LCL_VAR int V00 arg0 [000008] ----------- \--* CNS_INT int 12 ------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={} ***** BB06 STMT00003 ( 0x017[E-] ... 0x01E ) [000012] --C-G------ * CALL void CSharpTutorials.Program:Consume[uint](uint,uint) [000010] ----------- arg0 +--* LCL_VAR int V00 arg0 [000011] ----------- arg1 \--* LCL_VAR int V01 arg1 ***** BB06 STMT00004 ( 0x01E[E-] ... ??? ) [000013] ----------- * RETURN void
RyuJIT now has 6 basic blocks (BB). The flow is shown on the header for the block. By default, each block flows into the next one (for example, BB05 to BB06). Some are marked as conditional jumps (BB01 will conditionally jump to BB03, otherwise fall through to BB02) and one ends in a return (BB06). At this point it follows the same pattern as the CIL code.
BB05
BB06
BB01
BB03
BB02
Each block is broken down into multiple statements (STMT). Although here, only BB06 has more than one statement.
Each statement is a small tree of IR nodes, represented via an ASCII diagram. In STMT00000, LCL_VAR (local variable) arg0 and CNS_INT (constant int) 3 feed into LE (less than or equals comparison). This then feeds into the root node JTRUE (jump if true). Only certain node types can be the root of a statement, namely those that set state or change control flow. For JTRUE nodes, if the input of the node (the LE) returns true then the control flow goes down the conditional jump path. The other root nodes are STORE_LCL_VAR (store local variable) to store the result of an expression, CALL to call a functional and RETURN to return.
STMT00000
LCL_VAR
CNS_INT
LE
JTRUE
STORE_LCL_VAR
CALL
RETURN
RyuJIT goes through several phases (including Build SSA representation which adds a PHI statement into BB06), before hitting the optimize bools phase. A PHI statement is simply a way to represent a variable being set from one of two different sources depending on control flow into the block. This is required for full SSA notation and can be ignored for the purposes of these examples.
PHI
This phase walks through the blocks in order in a loop until there are no more changes. It starts by looking at BB01, and the following block (BB02):
------------ BB01 [000..004) -> BB03 (cond), preds={} succs={BB02,BB03} ***** BB01 STMT00000 ( 0x000[E-] ... 0x002 ) [000003] ----------- * JTRUE void [000002] N--------U- \--* LE int [000000] ----------- +--* LCL_VAR int V00 arg0 [000001] ----------- \--* CNS_INT int 3 ------------ BB02 [004..009) -> BB04 (cond), preds={BB01} succs={BB03,BB04} ***** BB02 STMT00006 ( 0x004[E-] ... 0x007 ) [000019] ----------- * JTRUE void [000018] ----------- \--* EQ int [000016] ----------- +--* LCL_VAR int V01 arg1 [000017] ----------- \--* CNS_INT int 10
Which represents the C# snippet:
if (op1 > 3 && op2 == 10)
First it performs some initial checks. The second block must contain a single statement of JTRUE. Plus, the child must be a compare node (EQ , NE, LT, LE), of integer type (no float compares allowed) that has no side effects. The final statement of the first block is checked for the same conditions.
EQ
NE
LT
The flow of the blocks is checked. If the condition in BB01 fails then it falls through to BB02. If the condition passes then it jumps to another block (BB03, the start of if op2 <= 4). This is the same block that BB02 falls through to if it fails. This flow is recognised as two statements being ANDed together, with the condition in the first block having been inverted.
op2 <= 4
From here it can create an AND statement of the two conditions, with the first condition inverted (from LE to GT), connected to a single JTRUE node. However, JTRUE can only be connected to certain nodes (namely those that set flags), so instead the result of the AND must be tested against 0. This looks like:
GT
------------ BB01 [000..009) -> BB04 (cond), preds={} succs={BB03,BB04} ***** BB01 STMT00006 ( 0x004[E-] ... 0x007 ) N010 ( 20, 14) [000021] ----------- * JTRUE void $VN.Void N009 ( 18, 12) [000029] ----------- \--* NE int N007 ( 13, 9) [000027] -------N--- +--* AND int N003 ( 6, 4) [000002] N------N-U- | +--* GT int N001 ( 1, 1) [000000] ----------- | | +--* LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- | | \--* CNS_INT int 3 $43 N006 ( 6, 4) [000020] -------N--- | \--* EQ int $101 N004 ( 1, 1) [000018] ----------- | +--* LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] ----------- | \--* CNS_INT int 10 $44 N008 ( 1, 2) [000028] ----------- \--* CNS_INT int 0
BB02 can now be deleted.
This new BB01 looks much closer to the original C# code, but with an ugly AND to glue it to the JTRUE. This is effectively:
if (AND((op1 > 3), (op2 == 10)) != 0)
RyuJIT node cycles through BB03, BB04, BB05, BB06, and does not find any additional opportunities. Due to the pass having found at least one optimization (combining BB02 into BB01), it starts again at the first block, BB01.
BB04
For second pass, it looks at BB01, and the following block (which is now BB03):
------------ BB01 [000..009) -> BB04 (cond), preds={} succs={BB03,BB04} ***** BB01 STMT00006 ( 0x004[E-] ... 0x007 ) N010 ( 20, 14) [000021] ----------- * JTRUE void $VN.Void N009 ( 18, 12) [000029] ----------- \--* NE int N007 ( 13, 9) [000027] -------N--- +--* AND int N003 ( 6, 4) [000002] N------N-U- | +--* GT int N001 ( 1, 1) [000000] ----------- | | +--* LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- | | \--* CNS_INT int 3 $43 N006 ( 6, 4) [000020] -------N--- | \--* EQ int $101 N004 ( 1, 1) [000018] ----------- | +--* LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] ----------- | \--* CNS_INT int 10 $44 N008 ( 1, 2) [000028] ----------- \--* CNS_INT int 0 ------------ BB03 [009..00D) -> BB05 (cond), preds={BB01} succs={BB04,BB05} ***** BB03 STMT00001 ( 0x009[E-] ... 0x00B ) N004 ( 5, 6) [000007] ----------- * JTRUE void $VN.Void N003 ( 3, 4) [000006] N------N-U- \--* GT int $102 N001 ( 1, 1) [000004] ----------- +--* LCL_VAR int V01 arg1 u:1 $81 N002 ( 1, 2) [000005] ----------- \--* CNS_INT int 4 $45
Representing the snippet:
if (op1 > 3 && op2 == 10 || op2 <= 4)
The same initial checks as before are performed (all JTRUEs, non-float conditions). Checking the control, it sees the exact same shape as when it previously combined BB01 and BB02. The same transformation can be performed again. The NE in BB01 is inverted (to EQ), then the two sets are connected by a new AND, giving:
JTRUEs
------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} ***** BB01 STMT00001 ( 0x009[E-] ... 0x00B ) N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void N015 ( 30, 20) [000032] ----------- \--* NE int N013 ( 25, 17) [000030] -------N--- +--* AND int N009 ( 18, 12) [000029] ----------- | +--* EQ int N007 ( 13, 9) [000027] -------N--- | | +--* AND int N003 ( 6, 4) [000002] N------N-U- | | | +--* GT int N001 ( 1, 1) [000000] ----------- | | | | +--* LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- | | | | \--* CNS_INT int 3 $43 N006 ( 6, 4) [000020] -------N--- | | | \--* EQ int $101 N004 ( 1, 1) [000018] ----------- | | | +--* LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] ----------- | | | \--* CNS_INT int 10 $44 N008 ( 1, 2) [000028] ----------- | | \--* CNS_INT int 0 N012 ( 6, 4) [000006] N------N-U- | \--* GT int $102 N010 ( 1, 1) [000004] ----------- | +--* LCL_VAR int V01 arg1 u:1 $81 N011 ( 1, 2) [000005] ----------- | \--* CNS_INT int 4 $45 N014 ( 1, 2) [000031] ----------- \--* CNS_INT int 0
The two uses of AND bloat the IR more than before, but this is:
if (AND((AND((op1 > 3), (op2 == 10)) == 0), (op2>4)) != 0))
Or more simply: if (op1 > 3 && op2 == 10) fails and (op2 > 4) passes, then jump to BB05 (the else case), otherwise fall through to BB04 (the then case).
(op2 > 4)
Or more simply again: if (op1 > 3 && op2 == 10) fails and (op2 <= 4) fails, then jump to BB05 (the else case), otherwise fall through to BB04 (the then case).
Note that via the use of inversion, the && and || conditions have been folded into this single statement.
The entire function is now:
*************** Finishing PHASE optimise bools Trees after optimise bools ----------------------------------------------------------------------------------------------------------------------------------------- BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags] ----------------------------------------------------------------------------------------------------------------------------------------- BB01 [0000] 1 1 [000..00D)-> BB05 ( cond ) i BB04 [0003] 1 BB01 0.50 [00D..013)-> BB06 (always) i BB05 [0004] 1 BB01 0.50 [013..017) i BB06 [0005] 2 BB04,BB05 1 [017..01F) (return) i hascall gcsafe ----------------------------------------------------------------------------------------------------------------------------------------- ------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} ***** BB01 STMT00001 ( 0x009[E-] ... 0x00B ) N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void N015 ( 30, 20) [000032] ----------- \--* NE int N013 ( 25, 17) [000030] -------N--- +--* AND int N009 ( 18, 12) [000029] ----------- | +--* EQ int N007 ( 13, 9) [000027] -------N--- | | +--* AND int N003 ( 6, 4) [000002] N------N-U- | | | +--* GT int N001 ( 1, 1) [000000] ----------- | | | | +--* LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- | | | | \--* CNS_INT int 3 $43 N006 ( 6, 4) [000020] -------N--- | | | \--* EQ int $101 N004 ( 1, 1) [000018] ----------- | | | +--* LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] ----------- | | | \--* CNS_INT int 10 $44 N008 ( 1, 2) [000028] ----------- | | \--* CNS_INT int 0 N012 ( 6, 4) [000006] N------N-U- | \--* GT int $102 N010 ( 1, 1) [000004] ----------- | +--* LCL_VAR int V01 arg1 u:1 $81 N011 ( 1, 2) [000005] ----------- | \--* CNS_INT int 4 $45 N014 ( 1, 2) [000031] ----------- \--* CNS_INT int 0 ------------ BB04 [00D..013) -> BB06 (always), preds={BB01} succs={BB06} ***** BB04 STMT00005 ( 0x00D[E-] ... 0x00F ) N003 ( 1, 3) [000017] -A------R-- * ASG int $VN.Void N002 ( 1, 1) [000016] D------N--- +--* LCL_VAR int V00 arg0 d:3 $VN.Void N001 ( 1, 2) [000015] ----------- \--* CNS_INT int 9 $47 ------------ BB05 [013..017), preds={BB01} succs={BB06} ***** BB05 STMT00002 ( 0x013[E-] ... 0x015 ) N003 ( 1, 3) [000010] -A------R-- * ASG int $VN.Void N002 ( 1, 1) [000009] D------N--- +--* LCL_VAR int V00 arg0 d:4 $VN.Void N001 ( 1, 2) [000008] ----------- \--* CNS_INT int 12 $46 ------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={} ***** BB06 STMT00007 ( ??? ... ??? ) N005 ( 0, 0) [000024] -A------R-- * ASG int $VN.Void N004 ( 0, 0) [000022] D------N--- +--* LCL_VAR int V00 arg0 d:2 $VN.Void N003 ( 0, 0) [000023] ----------- \--* PHI int $140 N001 ( 0, 0) [000026] ----------- pred BB05 +--* PHI_ARG int V00 arg0 u:4 $46 N002 ( 0, 0) [000025] ----------- pred BB04 \--* PHI_ARG int V00 arg0 u:3 $47 ***** BB06 STMT00003 ( 0x017[E-] ... 0x01E ) N003 ( 16, 6) [000013] --CXG------ * CALL void CSharpTutorials.Program:Consume[uint](uint,uint) $VN.Void N001 ( 1, 1) [000011] ----------- arg0 in x0 +--* LCL_VAR int V00 arg0 u:2 (last use) $140 N002 ( 1, 1) [000012] ----------- arg1 in x1 \--* LCL_VAR int V01 arg1 u:1 (last use) $81 ***** BB06 STMT00004 ( 0x01E[E-] ... ??? ) N001 ( 0, 0) [000014] ----------- * RETURN void $VN.Void
Continuing through RyuJIT dump, following the optimize bools phase, a few more passes happen, before finally the IR is rationalized into linear IR (LIR):
*************** Finishing PHASE Rationalize IR Trees after Rationalize IR ----------------------------------------------------------------------------------------------------------------------------------------- BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags] ----------------------------------------------------------------------------------------------------------------------------------------- BB01 [0000] 1 1 [000..00D)-> BB05 ( cond ) i LIR BB04 [0003] 1 BB01 0.50 [00D..013)-> BB06 (always) i LIR BB05 [0004] 1 BB01 0.50 [013..017) i LIR BB06 [0005] 2 BB04,BB05 1 [017..01F) (return) i hascall gcsafe LIR ----------------------------------------------------------------------------------------------------------------------------------------- ------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05} [000033] ----------- IL_OFFSET void INLRT @ 0x009[E-] N001 ( 1, 1) [000000] ----------- t0 = LCL_VAR int V00 arg0 u:1 (last use) $80 N002 ( 1, 2) [000001] ----------- t1 = CNS_INT int 3 $43 /--* t0 int +--* t1 int N003 ( 6, 4) [000002] N------N-U- t2 = * GT int N004 ( 1, 1) [000018] ----------- t18 = LCL_VAR int V01 arg1 u:1 $81 N005 ( 1, 2) [000019] ----------- t19 = CNS_INT int 10 $44 /--* t18 int +--* t19 int N006 ( 6, 4) [000020] -------N--- t20 = * EQ int $101 /--* t2 int +--* t20 int N007 ( 13, 9) [000027] -------N--- t27 = * AND int N008 ( 1, 2) [000028] ----------- t28 = CNS_INT int 0 /--* t27 int +--* t28 int N009 ( 18, 12) [000029] ----------- t29 = * EQ int N010 ( 1, 1) [000004] ----------- t4 = LCL_VAR int V01 arg1 u:1 $81 N011 ( 1, 2) [000005] ----------- t5 = CNS_INT int 4 $45 /--* t4 int +--* t5 int N012 ( 6, 4) [000006] N------N-U- t6 = * GT int $102 /--* t29 int +--* t6 int N013 ( 25, 17) [000030] -------N--- t30 = * AND int N014 ( 1, 2) [000031] ----------- t31 = CNS_INT int 0 /--* t30 int +--* t31 int N015 ( 30, 20) [000032] ----------- t32 = * NE int /--* t32 int N016 ( 32, 22) [000007] ----------- * JTRUE void $VN.Void ------------ BB04 [00D..013) -> BB06 (always), preds={BB01} succs={BB06} [000034] ----------- IL_OFFSET void INLRT @ 0x00D[E-] N001 ( 1, 2) [000015] ----------- t15 = CNS_INT int 9 $47 /--* t15 int N003 ( 1, 3) [000017] DA--------- * STORE_LCL_VAR int V00 arg0 d:3 ------------ BB05 [013..017), preds={BB01} succs={BB06} [000035] ----------- IL_OFFSET void INLRT @ 0x013[E-] N001 ( 1, 2) [000008] ----------- t8 = CNS_INT int 12 $46 /--* t8 int N003 ( 1, 3) [000010] DA--------- * STORE_LCL_VAR int V00 arg0 d:4 ------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={} [000036] ----------- IL_OFFSET void INLRT @ 0x017[E-] N001 ( 1, 1) [000011] ----------- t11 = LCL_VAR int V00 arg0 u:2 (last use) $140 N002 ( 1, 1) [000012] ----------- t12 = LCL_VAR int V01 arg1 u:1 (last use) $81 /--* t11 int arg0 in x0 +--* t12 int arg1 in x1 N003 ( 16, 6) [000013] --CXG------ * CALL void CSharpTutorials.Program:Consume[uint](uint,uint) $VN.Void [000037] ----------- IL_OFFSET void INLRT @ 0x01E[E-] N001 ( 0, 0) [000014] ----------- RETURN void $VN.Void
This is essentially the same IR as before but rewritten in sequential execution order. Each node produces a named result (for example, t29). This naming is then used as the inputs to each node (shown on the lines above a node). For example, N015 is the NE node which takes in t30 and t31 as inputs and produces the result t29, of type integer.
t29
N015
t30
t31
Next the lowering phase will attempt to lower each node in turn, producing nodes which are more suitable for the target machine. Up until this point the RyuJIT has (mostly) attempted to be agnostic to the machine it is running on. From here on in, any optimizations and changes will be specific to Arm64. Each node should be able to be represented by a single instruction (or a short sequence of instructions) on the target platform. After this, the register allocator will allocate target registers to each node. Finally, each node will be generated into assembly.
As part of lowering, RyuJIT will attempt to spot the compare chains and optimize them for Arm64. Part 2 of this blog series will cover this..
That is everything for this part. In part 2 I take a look at the various optimizations that are happening within the lowering phase.
If Conversion Within .NET - Part 2