Arm Community
Arm Community
  • Site
  • User
  • Site
  • Search
  • User
Arm Community blogs
Arm Community blogs
Architectures and Processors blog If Conversion Within .NET - Part 1
  • Blogs
  • Mentions
  • Sub-Groups
  • Tags
  • Jump...
  • Cancel
More blogs in Arm Community blogs
  • AI blog

  • Announcements

  • Architectures and Processors blog

  • Automotive blog

  • Embedded and Microcontrollers blog

  • Internet of Things (IoT) blog

  • Laptops and Desktops blog

  • Mobile, Graphics, and Gaming blog

  • Operating Systems blog

  • Servers and Cloud Computing blog

  • SoC Design and Simulation blog

  • Tools, Software and IDEs blog

Tell us what you think
Tags
  • Compilers
  • Runtime
Actions
  • RSS
  • More
  • Cancel
Related blog posts
Related forum threads

If Conversion Within .NET - Part 1

Alan Hayward
Alan Hayward
October 2, 2023
22 minute read time.

Over the past few releases, the Arm64 port of .NET has seen a lot of additions and improvements. In many cases making it on par with the X64 implementation. For more details on the exact changes, the official blog posts give a good overview – see here for .NET 5 and here for .NET 7.

Recently, some of those patches have come from Arm. We have a small team working on improving .NET’s code generation component known as RyuJIT. This blog series provides more details into one optimization we have been working on - If Conversion and the associated optimization, optimize bools. In theory this is a simple optimization, but adding it requires many considerations into the design of the RyuJIT. For example, how it will interact with other optimizations. Also, sometimes better-looking code is not always better in practice. Part one will focus on following through the optimize bools optimization.

Note, Arm refers to the architecture as AArch64, but here we follow the .NET naming, which is Arm64.

Reproducing the examples

All the code and dumps generated below were produced using a modified version of coreclr from the .NET runtime repo, which had been slightly modified to allow for certain optimizations to be turned off via new config options. The code is available on GitHub. The SHA of the HEAD is:

182bee3189ebb38511e3963595bdb2234a310b09

If you want to follow along, then you will need a Arm64 (v8.0 or above) box running Linux. Install the standard build dependencies, then do a standard checked build and generate a CORE_ROOT folder:

./build.sh -rc Checked -lc Release -s clr+libs
./src/tests/build.sh generatelayoutonly Checked
 

The example test program will need building:

cp examples/ifconversion ~
cd ~/ifconversion
../dotnet/.dotnet/dotnet publish -c Release
 

And can be run with:

~/dotnet/runtime_bools/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun ./bin/Release/net7.0/ifconversion.dll
 

For each examples run below, it is assumed the test is run from a clean environment without any DOTNET_flags set. However, for convenience, any environment variables set in a previous stage and no longer required are explicitly unset. All the examples can also be reproduced on a Windows Arm64 machine; however, the commands will need adjusting.

If you only have an X64 machine, then you will get different results. Most of the phases of RyuJIT are common across the two architectures. However due to the differences, some of the optimizations will not apply, and others will not give the exact same results. In addition, the config variables for this optimization have only been added to Arm64 code.

A simple example

Consider the following snippet of C# from the example:

[MethodImpl(MethodImplOptions.NoInlining)]
static void TestAndOrWithElse(uint op1, uint op2) {
  if (op1 > 3 && op2 == 10 || op2 <= 4) {
    op1 = 9;
  } else {
    op1 = 12;
  }
  Consume(op1, op2);
}


This is a simple If condition, with some additional code to prevent things getting optimised away. The NoInlining ensures the function does not get folded into the caller. The Consume() call ensures op1 and op2 are consumed and not treated as unused and eliminated away by the compiler.

Run this with:

  • Tiered compilation turned off (to ensure the highest optimization level)
  • Disassembly dumping turned on (to see what code gets generated)
  • If Conversion turned off (to replicate the state as of .NET 7)
  • Compare Chain Conditions turned off (this is one of the config options unique to our modified coreclr).

export DOTNET_JitDisasm=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitDoCompareChainCond=0

~/dotnet/runtime/artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/corerun ./bin/Release/net7.0/linux-arm64/conditional.dll
 

This gives the following annotated Arm64 assembly:

G_M64092_IG01:
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
G_M64092_IG02:
            cmp     w0, #3
            bls     G_M64092_IG04
G_M64092_IG03:
            cmp     w1, #10
            beq     G_M64092_IG05
G_M64092_IG04:
            cmp     w1, #4
            bhi     G_M64092_IG06
G_M64092_IG05:
            mov     w0, #9
            b       G_M64092_IG07
G_M64092_IG06:
            mov     w1, #12
G_M64092_IG07:
            movz    x2, #0x6EE0      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
            movk    x2, #0x55A7 LSL #16
            movk    x2, #0xFFFF LSL #32
            ldr     x2, [x2]
            blr     x2
G_M64092_IG08:
            ldp     fp, lr, [sp], #0x10
            ret     lr
 

There is some additional output shown by RyuJIT, for example, various instruction weighting throughout the assembly. This has been removed from the above for brevity.

The code has been broken down into several Instruction Groups (IG):

G_M64092_IG01: Set up the function stack. This is common boilerplate code.

G_M64092_IG02: Contains if (op1 > 3) . Compare op1 and 3 and set condition flags. If the result is less than, then (op1 > 3 && op2 == 10) has failed, so jump to IG04. Otherwise fall through to IG03.

G_M64092_IG03: Contains if (op2 == 10) . Compare op2 and 10 and set condition flags. If the result is equals, then (op1 > 3 && op2 == 10) has passed, so jump to IG05.

G_M64092_IG04: Contains if (op2 <= 4). Compare op2 and 4 and set condition flags. If op2 is greater than 4, then (op2 <= 4) has failed, the code jumps to IG06.

G_M64092_IG05: Contains the then clause op1 = 9. Set op1 then jump unconditionally to IG07.

G_M64092_IG06: Contains the else clause op1 = 12. Set op1 .

G_M64092_IG07: Call the consume function.

G_M64092_IG08: Restore the function stack and return. Common boilerplate code.

Note how IG02, IG03 and IG04 handle the conditions of the if statement. Each one is a pair of compare + conditional branch instructions. The IGs work through the conditions sequentially with no reordering and each branch always moves forwards, never backwards. The && and || are implicitly encoded in the jump destinations of IG02 to IG04. For this to work, the conditions in IG02 and IG04 are inverted to test the failure case, whereas IG03 has no inverting and tests the pass case.

There are two key issues with this code. First, for me at least, looking at the assembly it is tricky to determine the exact set of ANDed and ORed statements. For a CPU, this set of conditional jumps is not ideal as it cannot read ahead until the destination has been calculated. Over the years CPUs have become a lot better at this. Branch prediction is used to guess the likely outcome and continue decoding. If the prediction is wrong then the CPU must throw away the prediction and recalculate.

There are two main ways we can improve this code. The first is to roll all the compares from IG02 to IG04 into a single stream of instructions. Then calculate every step of the If, and make a single conditional jump to either IG05 or IG06. In the second way IG05 and IG06 are simply assigning a value to op1, so these two blocks can be merged into a conditional instruction, which removes the remaining jumps.

Let us look at the first one.

Combining Boolean operations

The AArch64 instruction CCMP is used to chain together successive comparisons. The CCMP is a conditional comparison - it will only compare based on the results of the previous compare. Consider:

            cmp     w0, #3
            ccmp    w1, #10, 0, hi

The CMP compares w0 and 3 then sets the condition flags. The CCMP checks the condition flags for greater than unsigned (`HI`) are set and if so, then compares w1 and 10 and sets the condition flags. If the flags for HI were not set, then set the condition flags are set to the failure state 0. Through careful maintenance of the flags, we can chain together arbitrary AND and OR conditions. Working through the original example with the compare chain pass enabled gives:

 

export DOTNET_JitDisasm=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0

G_M64092_IG01:
        A9BF7BFD          stp     fp, lr, [sp, #-0x10]!
        910003FD          mov     fp, sp
G_M64092_IG02:
        71000C1F          cmp     w0, #3
        7A4A8820          ccmp    w1, #10, 0, hi
        7A441824          ccmp    w1, #4, z, ne
        54000068          bhi     G_M64092_IG04
G_M64092_IG03:
        52800120          mov     w0, #9
        14000002          b       G_M64092_IG05
G_M64092_IG04:  ;; offset=0020H
        52800181          mov     w1, #12
G_M64092_IG05:
        D288F302          movz    x2, #0x4798      // code for CSharpTutorials.Program:Consume[uint](uint,uint)
        F2AA3302          movk    x2, #0x5198 LSL #16
        F2DFFFE2          movk    x2, #0xFFFF LSL #32
        F9400042          ldr     x2, [x2]
        D63F0040          blr     x2
G_M64092_IG06:
        A8C17BFD          ldp     fp, lr, [sp], #0x10
        D65F03C0          ret     lr

All the compare and branches have been replaced with the single block of chained compares. What may not be immediately obvious is why all the conditions have inverted. Regardless, it is much easier now to see what is going on. In addition, the code size has reduced from 18 instructions to 16. 

Note that CCMP does not care which instruction set the condition flags. A CCMP could instead follow a TST or ADDS instruction.

Reading Jit dumps

To illustrate this, Let us walk through what RyuJIT is doing. Set the following and rerun:

export DOTNET_JitDump=TestAndOrWithElse
export DOTNET_TieredCompilation=0
export DOTNET_JitDoIfConversion=0
export DOTNET_JitDoCompareChainCond=0

This generates about 4500 lines of dump info from RyuJIT, starting with:

****** START compiling CSharpTutorials.Program:TestAndOrWithElse(uint,uint) (MethodHash=8530e751)
Generating code for Unix arm64
OPTIONS: compCodeOpt = BLENDED_CODE
OPTIONS: compDbgCode = false
OPTIONS: compDbgInfo = true
OPTIONS: compDbgEnC  = false
OPTIONS: compProcedureSplitting   = false
OPTIONS: compProcedureSplittingEH = false
OPTIONS: optimiser should use profile data
IL to import:
IL_0000  02                ldarg.0     
IL_0001  19                ldc.i4.3    
IL_0002  36 05             ble.un.s     5 (IL_0009)
IL_0004  03                ldarg.1     
IL_0005  1f 0a             ldc.i4.s     0xA
IL_0007  2e 04             beq.s        4 (IL_000d)
IL_0009  03                ldarg.1     
IL_000a  1a                ldc.i4.4    
IL_000b  35 06             bgt.un.s     6 (IL_0013)
IL_000d  1f 09             ldc.i4.s     0x9
IL_000f  10 00             starg.s      0x0
IL_0011  2b 04             br.s         4 (IL_0017)
IL_0013  1f 0c             ldc.i4.s     0xC
IL_0015  10 00             starg.s      0x0
IL_0017  02                ldarg.0     
IL_0018  03                ldarg.1     
IL_0019  28 02 00 00 2b    call         0x2B000002
IL_001e  2a                ret         
Arg #0    passed in register(s) x0
Arg #1    passed in register(s) x1
 

First is the CIL bytecode, which is a stack-based machine. If you have been following the code so far, then this should be easily readable. At this point, the function has already been broken down into several compare and branches, including the inversion of some of the conditions. For example:

 ldarg.0                       - load arg 0 onto the stack
 ldc.i4.3                     - load integer constant 3 onto the stack
 ble.un.s    5 (IL_0009)   - compare and branch the top two entries on the stack. if less than or equals then jump to IL_0009.

The dump is next broken down into a section for each phase, each one beginning with 

*************** Starting PHASE .....

And ending with 

*************** Finishing PHASE ....

then either a dump of the current JIT state, or [no changes]if nothing has changed across the phase.

The first phase is the import phase, where RyuJIT reads through the CIL and converts it into IR nodes. At the end of that is:

 

*************** Finishing PHASE Importation
Trees after Importation

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..004)-> BB03 ( cond )                     i 
BB02 [0001]  1       BB01                  1       [004..009)-> BB04 ( cond )                     i 
BB03 [0002]  2       BB01,BB02             1       [009..00D)-> BB05 ( cond )                     i 
BB04 [0003]  2       BB02,BB03             1       [00D..013)-> BB06 (always)                     i 
BB05 [0004]  1       BB03                  1       [013..017)                                     i 
BB06 [0005]  2       BB04,BB05             1       [017..01F)        (return)                     i 
-----------------------------------------------------------------------------------------------------------------------------------------

------------ BB01 [000..004) -> BB03 (cond), preds={} succs={BB02,BB03}

***** BB01
STMT00000 ( 0x000[E-] ... 0x002 )
               [000003] -----------                         *  JTRUE     void  
               [000002] N--------U-                         \--*  LE        int   
               [000000] -----------                            +--*  LCL_VAR   int    V00 arg0         
               [000001] -----------                            \--*  CNS_INT   int    3

------------ BB02 [004..009) -> BB04 (cond), preds={BB01} succs={BB03,BB04}

***** BB02
STMT00006 ( 0x004[E-] ... 0x007 )
               [000019] -----------                         *  JTRUE     void  
               [000018] -----------                         \--*  EQ        int   
               [000016] -----------                            +--*  LCL_VAR   int    V01 arg1         
               [000017] -----------                            \--*  CNS_INT   int    10

------------ BB03 [009..00D) -> BB05 (cond), preds={BB01,BB02} succs={BB04,BB05}

***** BB03
STMT00001 ( 0x009[E-] ... 0x00B )
               [000007] -----------                         *  JTRUE     void  
               [000006] N--------U-                         \--*  GT        int   
               [000004] -----------                            +--*  LCL_VAR   int    V01 arg1         
               [000005] -----------                            \--*  CNS_INT   int    4

------------ BB04 [00D..013) -> BB06 (always), preds={BB02,BB03} succs={BB06}

***** BB04
STMT00005 ( 0x00D[E-] ... 0x00F )
               [000015] DA---------                         *  STORE_LCL_VAR int    V00 arg0         
               [000014] -----------                         \--*  CNS_INT   int    9

------------ BB05 [013..017), preds={BB03} succs={BB06}

***** BB05
STMT00002 ( 0x013[E-] ... 0x015 )
               [000009] DA---------                         *  STORE_LCL_VAR int    V00 arg0         
               [000008] -----------                         \--*  CNS_INT   int    12

------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={}

***** BB06
STMT00003 ( 0x017[E-] ... 0x01E )
               [000012] --C-G------                         *  CALL      void   CSharpTutorials.Program:Consume[uint](uint,uint)
               [000010] ----------- arg0                    +--*  LCL_VAR   int    V00 arg0         
               [000011] ----------- arg1                    \--*  LCL_VAR   int    V01 arg1         

***** BB06
STMT00004 ( 0x01E[E-] ... ??? )
               [000013] -----------                         *  RETURN    void  

 

RyuJIT now has 6 basic blocks (BB). The flow is shown on the header for the block. By default, each block flows into the next one (for example, BB05 to BB06). Some are marked as conditional jumps (BB01 will conditionally jump to BB03, otherwise fall through to BB02) and one ends in a return (BB06). At this point it follows the same pattern as the CIL code.

Each block is broken down into multiple statements (STMT). Although here, only BB06 has more than one statement.

Each statement is a small tree of IR nodes, represented via an ASCII diagram. In STMT00000, LCL_VAR (local variable) arg0 and CNS_INT (constant int) 3 feed into LE (less than or equals comparison). This then feeds into the root node JTRUE (jump if true). Only certain node types can be the root of a statement, namely those that set state or change control flow. For JTRUE nodes, if the input of the node (the LE) returns true then the control flow goes down the conditional jump path. The other root nodes are STORE_LCL_VAR (store local variable) to store the result of an expression, CALL to call a functional and RETURN to return.

RyuJIT goes through several phases (including Build SSA representation which adds a PHI statement into BB06), before hitting the optimize bools phase. A PHI statement is simply a way to represent a variable being set from one of two different sources depending on control flow into the block. This is required for full SSA notation and can be ignored for the purposes of these examples.

Optimize Bools phase

This phase walks through the blocks in order in a loop until there are no more changes. It starts by looking at BB01, and the following block (BB02):

------------ BB01 [000..004) -> BB03 (cond), preds={} succs={BB02,BB03}

***** BB01
STMT00000 ( 0x000[E-] ... 0x002 )
               [000003] -----------                         *  JTRUE     void  
               [000002] N--------U-                         \--*  LE        int   
               [000000] -----------                            +--*  LCL_VAR   int    V00 arg0         
               [000001] -----------                            \--*  CNS_INT   int    3

------------ BB02 [004..009) -> BB04 (cond), preds={BB01} succs={BB03,BB04}

***** BB02
STMT00006 ( 0x004[E-] ... 0x007 )
               [000019] -----------                         *  JTRUE     void  
               [000018] -----------                         \--*  EQ        int   
               [000016] -----------                            +--*  LCL_VAR   int    V01 arg1         
               [000017] -----------                            \--*  CNS_INT   int    10

Which represents the C# snippet:

if (op1 > 3 && op2 == 10) 

First it performs some initial checks. The second block must contain a single statement of JTRUE. Plus, the child must be a compare node (EQ , NE, LT, LE), of integer type (no float compares allowed) that has no side effects. The final statement of the first block is checked for the same conditions.

The flow of the blocks is checked. If the condition in BB01 fails then it falls through to BB02. If the condition passes then it jumps to another block (BB03, the start of if op2 <= 4). This is the same block that BB02 falls through to if it fails. This flow is recognised as two statements being ANDed together, with the condition in the first block having been inverted.

From here it can create an AND statement of the two conditions, with the first condition inverted (from LE to GT), connected to a single JTRUE node. However, JTRUE can only be connected to certain nodes (namely those that set flags), so instead the result of the AND must be tested against 0. This looks like:

------------ BB01 [000..009) -> BB04 (cond), preds={} succs={BB03,BB04}

***** BB01
STMT00006 ( 0x004[E-] ... 0x007 )
N010 ( 20, 14) [000021] -----------                         *  JTRUE     void   $VN.Void
N009 ( 18, 12) [000029] -----------                         \--*  NE        int   
N007 ( 13,  9) [000027] -------N---                            +--*  AND       int   
N003 (  6,  4) [000002] N------N-U-                            |  +--*  GT        int   
N001 (  1,  1) [000000] -----------                            |  |  +--*  LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                            |  |  \--*  CNS_INT   int    3 $43
N006 (  6,  4) [000020] -------N---                            |  \--*  EQ        int    $101
N004 (  1,  1) [000018] -----------                            |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -----------                            |     \--*  CNS_INT   int    10 $44
N008 (  1,  2) [000028] -----------                            \--*  CNS_INT   int    0

BB02 can now be deleted.

This new BB01 looks much closer to the original C# code, but with an ugly AND to glue it to the JTRUE. This is effectively:

if (AND((op1 > 3), (op2 == 10)) != 0)

RyuJIT node cycles through BB03, BB04, BB05, BB06, and does not find any additional opportunities. Due to the pass having found at least one optimization (combining BB02 into BB01), it starts again at the first block, BB01. 

For second pass, it looks at BB01, and the following block (which is now BB03):

------------ BB01 [000..009) -> BB04 (cond), preds={} succs={BB03,BB04}

***** BB01
STMT00006 ( 0x004[E-] ... 0x007 )
N010 ( 20, 14) [000021] -----------                         *  JTRUE     void   $VN.Void
N009 ( 18, 12) [000029] -----------                         \--*  NE        int   
N007 ( 13,  9) [000027] -------N---                            +--*  AND       int   
N003 (  6,  4) [000002] N------N-U-                            |  +--*  GT        int   
N001 (  1,  1) [000000] -----------                            |  |  +--*  LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                            |  |  \--*  CNS_INT   int    3 $43
N006 (  6,  4) [000020] -------N---                            |  \--*  EQ        int    $101
N004 (  1,  1) [000018] -----------                            |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -----------                            |     \--*  CNS_INT   int    10 $44
N008 (  1,  2) [000028] -----------                            \--*  CNS_INT   int    0

------------ BB03 [009..00D) -> BB05 (cond), preds={BB01} succs={BB04,BB05}

***** BB03
STMT00001 ( 0x009[E-] ... 0x00B )
N004 (  5,  6) [000007] -----------                         *  JTRUE     void   $VN.Void
N003 (  3,  4) [000006] N------N-U-                         \--*  GT        int    $102
N001 (  1,  1) [000004] -----------                            +--*  LCL_VAR   int    V01 arg1         u:1 $81
N002 (  1,  2) [000005] -----------                            \--*  CNS_INT   int    4 $45

Representing the snippet:

if (op1 > 3 && op2 == 10 || op2 <= 4) 

The same initial checks as before are performed (all JTRUEs, non-float conditions). Checking the control, it sees the exact same shape as when it previously combined BB01 and BB02. The same transformation can be performed again. The NE in BB01 is inverted (to EQ), then the two sets are connected by a new AND, giving:

------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05}

***** BB01
STMT00001 ( 0x009[E-] ... 0x00B )
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void
N015 ( 30, 20) [000032] -----------                         \--*  NE        int   
N013 ( 25, 17) [000030] -------N---                            +--*  AND       int   
N009 ( 18, 12) [000029] -----------                            |  +--*  EQ        int   
N007 ( 13,  9) [000027] -------N---                            |  |  +--*  AND       int   
N003 (  6,  4) [000002] N------N-U-                            |  |  |  +--*  GT        int   
N001 (  1,  1) [000000] -----------                            |  |  |  |  +--*  LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                            |  |  |  |  \--*  CNS_INT   int    3 $43
N006 (  6,  4) [000020] -------N---                            |  |  |  \--*  EQ        int    $101
N004 (  1,  1) [000018] -----------                            |  |  |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -----------                            |  |  |     \--*  CNS_INT   int    10 $44
N008 (  1,  2) [000028] -----------                            |  |  \--*  CNS_INT   int    0
N012 (  6,  4) [000006] N------N-U-                            |  \--*  GT        int    $102
N010 (  1,  1) [000004] -----------                            |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N011 (  1,  2) [000005] -----------                            |     \--*  CNS_INT   int    4 $45
N014 (  1,  2) [000031] -----------                            \--*  CNS_INT   int    0

The two uses of AND bloat the IR more than before, but this is:

if (AND((AND((op1 > 3), (op2 == 10)) == 0), (op2>4)) != 0)) 

Or more simply: if (op1 > 3 && op2 == 10) fails and (op2 > 4) passes, then jump to BB05 (the else case), otherwise fall through to BB04 (the then case).

Or more simply again: if (op1 > 3 && op2 == 10) fails and (op2 <= 4) fails, then jump to BB05 (the else case), otherwise fall through to BB04 (the then case).

Note that via the use of inversion, the && and || conditions have been folded into this single statement.

The entire function is now:

*************** Finishing PHASE optimise bools
Trees after optimise bools

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..00D)-> BB05 ( cond )                     i 
BB04 [0003]  1       BB01                  0.50    [00D..013)-> BB06 (always)                     i 
BB05 [0004]  1       BB01                  0.50    [013..017)                                     i 
BB06 [0005]  2       BB04,BB05             1       [017..01F)        (return)                     i hascall gcsafe 
-----------------------------------------------------------------------------------------------------------------------------------------

------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05}

***** BB01
STMT00001 ( 0x009[E-] ... 0x00B )
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void
N015 ( 30, 20) [000032] -----------                         \--*  NE        int   
N013 ( 25, 17) [000030] -------N---                            +--*  AND       int   
N009 ( 18, 12) [000029] -----------                            |  +--*  EQ        int   
N007 ( 13,  9) [000027] -------N---                            |  |  +--*  AND       int   
N003 (  6,  4) [000002] N------N-U-                            |  |  |  +--*  GT        int   
N001 (  1,  1) [000000] -----------                            |  |  |  |  +--*  LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                            |  |  |  |  \--*  CNS_INT   int    3 $43
N006 (  6,  4) [000020] -------N---                            |  |  |  \--*  EQ        int    $101
N004 (  1,  1) [000018] -----------                            |  |  |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -----------                            |  |  |     \--*  CNS_INT   int    10 $44
N008 (  1,  2) [000028] -----------                            |  |  \--*  CNS_INT   int    0
N012 (  6,  4) [000006] N------N-U-                            |  \--*  GT        int    $102
N010 (  1,  1) [000004] -----------                            |     +--*  LCL_VAR   int    V01 arg1         u:1 $81
N011 (  1,  2) [000005] -----------                            |     \--*  CNS_INT   int    4 $45
N014 (  1,  2) [000031] -----------                            \--*  CNS_INT   int    0

------------ BB04 [00D..013) -> BB06 (always), preds={BB01} succs={BB06}

***** BB04
STMT00005 ( 0x00D[E-] ... 0x00F )
N003 (  1,  3) [000017] -A------R--                         *  ASG       int    $VN.Void
N002 (  1,  1) [000016] D------N---                         +--*  LCL_VAR   int    V00 arg0         d:3 $VN.Void
N001 (  1,  2) [000015] -----------                         \--*  CNS_INT   int    9 $47

------------ BB05 [013..017), preds={BB01} succs={BB06}

***** BB05
STMT00002 ( 0x013[E-] ... 0x015 )
N003 (  1,  3) [000010] -A------R--                         *  ASG       int    $VN.Void
N002 (  1,  1) [000009] D------N---                         +--*  LCL_VAR   int    V00 arg0         d:4 $VN.Void
N001 (  1,  2) [000008] -----------                         \--*  CNS_INT   int    12 $46

------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={}

***** BB06
STMT00007 ( ??? ... ??? )
N005 (  0,  0) [000024] -A------R--                         *  ASG       int    $VN.Void
N004 (  0,  0) [000022] D------N---                         +--*  LCL_VAR   int    V00 arg0         d:2 $VN.Void
N003 (  0,  0) [000023] -----------                         \--*  PHI       int    $140
N001 (  0,  0) [000026] ----------- pred BB05                  +--*  PHI_ARG   int    V00 arg0         u:4 $46
N002 (  0,  0) [000025] ----------- pred BB04                  \--*  PHI_ARG   int    V00 arg0         u:3 $47

***** BB06
STMT00003 ( 0x017[E-] ... 0x01E )
N003 ( 16,  6) [000013] --CXG------                         *  CALL      void   CSharpTutorials.Program:Consume[uint](uint,uint) $VN.Void
N001 (  1,  1) [000011] ----------- arg0 in x0              +--*  LCL_VAR   int    V00 arg0         u:2 (last use) $140
N002 (  1,  1) [000012] ----------- arg1 in x1              \--*  LCL_VAR   int    V01 arg1         u:1 (last use) $81

***** BB06
STMT00004 ( 0x01E[E-] ... ??? )
N001 (  0,  0) [000014] -----------                         *  RETURN    void   $VN.Void

 

Lowering phase

Continuing through RyuJIT dump, following the optimize bools phase, a few more passes happen, before finally the IR is rationalized into linear IR (LIR):

*************** Finishing PHASE Rationalize IR
Trees after Rationalize IR

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight    lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1       [000..00D)-> BB05 ( cond )                     i LIR 
BB04 [0003]  1       BB01                  0.50    [00D..013)-> BB06 (always)                     i LIR 
BB05 [0004]  1       BB01                  0.50    [013..017)                                     i LIR 
BB06 [0005]  2       BB04,BB05             1       [017..01F)        (return)                     i hascall gcsafe LIR 
-----------------------------------------------------------------------------------------------------------------------------------------

------------ BB01 [000..00D) -> BB05 (cond), preds={} succs={BB04,BB05}
               [000033] -----------                            IL_OFFSET void   INLRT @ 0x009[E-]
N001 (  1,  1) [000000] -----------                    t0 =    LCL_VAR   int    V00 arg0         u:1 (last use) $80
N002 (  1,  2) [000001] -----------                    t1 =    CNS_INT   int    3 $43
                                                            /--*  t0     int    
                                                            +--*  t1     int    
N003 (  6,  4) [000002] N------N-U-                    t2 = *  GT        int   
N004 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   int    V01 arg1         u:1 $81
N005 (  1,  2) [000019] -----------                   t19 =    CNS_INT   int    10 $44
                                                            /--*  t18    int    
                                                            +--*  t19    int    
N006 (  6,  4) [000020] -------N---                   t20 = *  EQ        int    $101
                                                            /--*  t2     int    
                                                            +--*  t20    int    
N007 ( 13,  9) [000027] -------N---                   t27 = *  AND       int   
N008 (  1,  2) [000028] -----------                   t28 =    CNS_INT   int    0
                                                            /--*  t27    int    
                                                            +--*  t28    int    
N009 ( 18, 12) [000029] -----------                   t29 = *  EQ        int   
N010 (  1,  1) [000004] -----------                    t4 =    LCL_VAR   int    V01 arg1         u:1 $81
N011 (  1,  2) [000005] -----------                    t5 =    CNS_INT   int    4 $45
                                                            /--*  t4     int    
                                                            +--*  t5     int    
N012 (  6,  4) [000006] N------N-U-                    t6 = *  GT        int    $102
                                                            /--*  t29    int    
                                                            +--*  t6     int    
N013 ( 25, 17) [000030] -------N---                   t30 = *  AND       int   
N014 (  1,  2) [000031] -----------                   t31 =    CNS_INT   int    0
                                                            /--*  t30    int    
                                                            +--*  t31    int    
N015 ( 30, 20) [000032] -----------                   t32 = *  NE        int   
                                                            /--*  t32    int    
N016 ( 32, 22) [000007] -----------                         *  JTRUE     void   $VN.Void

------------ BB04 [00D..013) -> BB06 (always), preds={BB01} succs={BB06}
               [000034] -----------                            IL_OFFSET void   INLRT @ 0x00D[E-]
N001 (  1,  2) [000015] -----------                   t15 =    CNS_INT   int    9 $47
                                                            /--*  t15    int    
N003 (  1,  3) [000017] DA---------                         *  STORE_LCL_VAR int    V00 arg0         d:3

------------ BB05 [013..017), preds={BB01} succs={BB06}
               [000035] -----------                            IL_OFFSET void   INLRT @ 0x013[E-]
N001 (  1,  2) [000008] -----------                    t8 =    CNS_INT   int    12 $46
                                                            /--*  t8     int    
N003 (  1,  3) [000010] DA---------                         *  STORE_LCL_VAR int    V00 arg0         d:4

------------ BB06 [017..01F) (return), preds={BB04,BB05} succs={}
               [000036] -----------                            IL_OFFSET void   INLRT @ 0x017[E-]
N001 (  1,  1) [000011] -----------                   t11 =    LCL_VAR   int    V00 arg0         u:2 (last use) $140
N002 (  1,  1) [000012] -----------                   t12 =    LCL_VAR   int    V01 arg1         u:1 (last use) $81
                                                            /--*  t11    int    arg0 in x0
                                                            +--*  t12    int    arg1 in x1
N003 ( 16,  6) [000013] --CXG------                         *  CALL      void   CSharpTutorials.Program:Consume[uint](uint,uint) $VN.Void
               [000037] -----------                            IL_OFFSET void   INLRT @ 0x01E[E-]
N001 (  0,  0) [000014] -----------                            RETURN    void   $VN.Void

 

This is essentially the same IR as before but rewritten in sequential execution order. Each node produces a named result (for example, t29). This naming is then used as the inputs to each node (shown on the lines above a node). For example, N015 is the NE node which takes in t30 and t31 as inputs and produces the result t29, of type integer.

Next the lowering phase will attempt to lower each node in turn, producing nodes which are more suitable for the target machine. Up until this point the RyuJIT has (mostly) attempted to be agnostic to the machine it is running on. From here on in, any optimizations and changes will be specific to Arm64. Each node should be able to be represented by a single instruction (or a short sequence of instructions) on the target platform. After this, the register allocator will allocate target registers to each node. Finally, each node will be generated into assembly.

As part of lowering, RyuJIT will attempt to spot the compare chains and optimize them for Arm64. Part 2 of this blog series will cover this..

Closing comments

That is everything for this part. In part 2 I take a look at the various optimizations that are happening within the lowering phase.

If Conversion Within .NET - Part 2

Anonymous
Architectures and Processors blog
  • Introducing GICv5: Scalable and secure interrupt management for Arm

    Christoffer Dall
    Christoffer Dall
    Introducing Arm GICv5: a scalable, hypervisor-free interrupt controller for modern multi-core systems with improved virtualization and real-time support.
    • April 28, 2025
  • Getting started with AARCHMRS Features.json using Python

    Joh
    Joh
    A high-level introduction to the Arm Architecture Machine Readable Specification (AARCHMRS) Features.json with some examples to interpret and start to work with the available data using Python.
    • April 8, 2025
  • Advancing server manageability on Arm Neoverse Compute Subsystem (CSS) with OpenBMC

    Samer El-Haj-Mahmoud
    Samer El-Haj-Mahmoud
    Arm and 9elements Cyber Security have brought a prototype of OpenBMC to the Arm Neoverse Compute Subsystem (CSS) to advancing server manageability.
    • January 28, 2025