Simple loop example
Here we show the steps that Subzero takes in translating a simple loop example.
To translate this example to native code, we assume a fairly constrained platform: x86-32 with only three registers (eax, ecx, edx) available for local register allocation, and no global register allocation.
Start with the following C example, which computes the reduction of an integer array.
int simple_loop(int *a, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += a[i];
  }
  return sum;
}
The corresponding PNaCl bitcode after compiling with pnacl-clang -O2 is the following:
define internal i32 @simple_loop(i32 %a, i32 %n) {
entry:
  %cmp4 = icmp sgt i32 %n, 0
  br i1 %cmp4, label %for.body, label %for.end

for.body:                                ; preds = %for.body, %entry
  %i.06 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
  %sum.05 = phi i32 [ %add, %for.body ], [ 0, %entry ]
  %gep_array = mul i32 %i.06, 4
  %gep = add i32 %a, %gep_array
  %gep.asptr = inttoptr i32 %gep to i32*
  %0 = load i32* %gep.asptr, align 1
  %add = add i32 %0, %sum.05
  %inc = add i32 %i.06, 1
  %cmp = icmp slt i32 %inc, %n
  br i1 %cmp, label %for.body, label %for.end

for.end:                                 ; preds = %for.body, %entry
  %sum.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
  ret i32 %sum.0.lcssa
}
Generate high-level Vanilla ICE, including lowering of phi instructions. Where possible, the v_*_phi assignments are pushed up to just past their last use in the basic block, and above the compare/branch that ends the basic block.
simple_loop(v_a, v_n):
l_entry:
v_i_phi = 0
v_sum_phi = 0
v_sum2_phi = 0
v_c = icmp sgt v_n, 0
br v_c, l_body, l_end
l_body:
//v_i = phi [v_inc, l_body], [0, l_entry]
v_i = v_i_phi
//v_sum = phi [v_add, l_body], [0, l_entry]
v_sum = v_sum_phi
v_t1 = mul v_i, 4
v_t2 = add v_a, v_t1
v_elt = load v_t2
v_add = add v_elt, v_sum
v_sum_phi = v_add
v_sum2_phi = v_add
v_inc = add v_i, 1
v_i_phi = v_inc
v_c2 = icmp slt v_inc, v_n
br v_c2, l_body, l_end
l_end:
//v_sum2 = phi [0, l_entry], [v_add, l_body]
v_sum2 = v_sum2_phi
ret v_sum2
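The phi lowering shown above can be viewed as rewriting each phi into plain assignments: one assignment to a v_*_phi temporary in each predecessor block, plus one assignment from that temporary at the top of the phi's own block. A minimal sketch of that rewrite, using simplified stand-in types rather than Subzero's actual IR classes, might look like this:

#include <string>
#include <vector>

struct Block;

// Hypothetical, simplified IR: an instruction has one destination and some
// sources; a phi additionally records the predecessor for each source.
struct Inst {
  std::string Dest;
  std::vector<std::string> Srcs;
  std::vector<Block *> PhiPreds; // only used when this instruction is a phi
};

struct Block {
  std::vector<Inst> Phis;  // phi instructions at the top of the block
  std::vector<Inst> Insts; // remaining instructions, ending in the terminator
};

// Lower "Dest = phi [Src_k, Pred_k]" into "Dest_phi = Src_k" in each
// predecessor (placed just before the terminator here; the real pass also
// tries to push it further up, as described above) and "Dest = Dest_phi"
// at the top of this block.  Assumes each predecessor ends in a terminator.
void lowerPhis(Block &B) {
  std::vector<Inst> Copies;
  for (const Inst &Phi : B.Phis) {
    std::string PhiVar = Phi.Dest + "_phi";
    for (size_t K = 0; K < Phi.Srcs.size(); ++K) {
      Inst Assign{PhiVar, {Phi.Srcs[K]}, {}};
      Block *Pred = Phi.PhiPreds[K];
      Pred->Insts.insert(Pred->Insts.end() - 1, Assign);
    }
    Copies.push_back(Inst{Phi.Dest, {PhiVar}, {}});
  }
  B.Phis.clear();
  B.Insts.insert(B.Insts.begin(), Copies.begin(), Copies.end());
}

In the listing above, the v_i_phi, v_sum_phi, and v_sum2_phi assignments in l_entry and l_body are exactly the predecessor-side copies this rewrite produces.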
Liveness analysis determines which source operands in each instruction are the end of the operand's live range. As a bonus, it also determines the set of live operands at the beginning of each basic block. If liveness analysis is omitted, we assume all operands are live at the beginning of every basic block, and no operands' live ranges end in any instruction. The results are the LIVE annotations at the start of each basic block, and the LIVEEND annotations on each instruction.
simple_loop(v_a, v_n):
l_entry: // LIVE={v_a,v_n}
v_i_phi = 0
v_sum_phi = 0
v_sum2_phi = 0
v_c = icmp sgt v_n, 0
br v_c, l_body, l_end // LIVEEND={v_c}
l_body: // LIVE={v_i_phi,v_sum_phi,v_a,v_n}
v_i = v_i_phi // LIVEEND={v_i_phi}
v_sum = v_sum_phi // LIVEEND={v_sum_phi}
v_t1 = mul v_i, 4
v_t2 = add v_a, v_t1 // LIVEEND={v_t1}
v_elt = load v_t2 // LIVEEND={v_t2}
v_add = add v_elt, v_sum // LIVEEND={v_elt,v_sum}
v_sum_phi = v_add
v_sum2_phi = v_add // LIVEEND={v_add}
v_inc = add v_i, 1 // LIVEEND={v_i}
v_i_phi = v_inc
v_c2 = icmp slt v_inc, v_n // LIVEEND={v_inc}
br v_c2, l_body, l_end // LIVEEND={v_c2}
l_end: // LIVE={v_sum2_phi}
v_sum2 = v_sum2_phi // LIVEEND={v_sum2_phi}
ret v_sum2 // LIVEEND={v_sum2}
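A per-block view of how those annotations can be computed is a single backward scan, given the set of operands live on exit from the block. Below is a simplified sketch with stand-in types; the real analysis also iterates over the control-flow graph to a fixed point to obtain the block-exit sets:

#include <set>
#include <string>
#include <vector>

// Simplified stand-in for a Vanilla ICE instruction.
struct Inst {
  std::string Dest;              // empty if the instruction defines nothing
  std::vector<std::string> Srcs; // source operands
  std::set<std::string> LiveEnd; // filled in: sources whose live range ends here
};

// One backward scan over a basic block. Live starts as the set of operands
// live on exit; the return value is the set live on entry (the LIVE
// annotation), and each instruction gets its LIVEEND set filled in.
std::set<std::string> computeBlockLiveness(std::vector<Inst> &Insts,
                                           std::set<std::string> Live) {
  for (auto It = Insts.rbegin(); It != Insts.rend(); ++It) {
    // A source's live range ends here if it is not live after this instruction.
    for (const std::string &S : It->Srcs)
      if (!Live.count(S))
        It->LiveEnd.insert(S);
    // Live-before = (Live-after - defs) + uses.
    if (!It->Dest.empty())
      Live.erase(It->Dest);
    for (const std::string &S : It->Srcs)
      Live.insert(S);
  }
  return Live;
}

For example, in l_body above, v_t1 is used only by the add that produces v_t2, so that instruction is annotated with LIVEEND={v_t1}.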
Addressing mode inference (x86 only) identifies operands that pattern-match to an expression suitable for the special x86 addressing modes. The results are an annotation on the use of the calculated address, as well as annotations on the uses of the core operands as the subexpressions are built up into the calculated address. These annotations are labeled ADDROPT and ADDR, respectively. The code selector makes use of the ADDROPT annotation, and the local register manager takes the ADDR annotations into account when assigning registers.
simple_loop(v_a, v_n):
l_entry: // LIVE={v_a,v_n}
v_i_phi = 0
v_sum_phi = 0
v_sum2_phi = 0
v_c = icmp sgt v_n, 0
br v_c, l_body, l_end // LIVEEND={v_c}
l_body: // LIVE={v_i_phi, v_sum_phi, v_a, v_n}
v_i = v_i_phi // LIVEEND={v_i_phi}
v_sum = v_sum_phi // LIVEEND={v_sum_phi}
v_t1 = mul v_i, 4 // ADDR={v_i}
v_t2 = add v_a, v_t1 // LIVEEND={v_t1}, ADDR={v_a}
v_elt = load v_t2 // LIVEEND={v_t2}, ADDROPT=v_a+4*v_i+0
v_add = add v_elt, v_sum // LIVEEND={v_elt, v_sum}
v_sum_phi = v_add
v_sum2_phi = v_add // LIVEEND={v_add}
v_inc = add v_i, 1 // LIVEEND={v_i}
v_i_phi = v_inc
v_c2 = icmp slt v_inc, v_n // LIVEEND={v_inc}
br v_c2, l_body, l_end // LIVEEND={v_c2}
l_end: // LIVE={v_sum2_phi}
v_sum2 = v_sum2_phi // LIVEEND={v_sum2_phi}
ret v_sum2 // LIVEEND={v_sum2}
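The inference itself amounts to pattern-matching the address expression feeding a load or store against the base + scale*index + offset shape that x86 supports. A simplified sketch follows; the Def/AddrMode types and the map of defining instructions are made up for illustration:

#include <cstdint>
#include <map>
#include <string>

// Hypothetical, simplified record of the instruction defining an operand.
struct Def {
  enum Kind { Add, MulConst } K;
  std::string Lhs;      // Add: left operand; MulConst: the variable operand
  std::string Rhs;      // Add: right operand (unused for MulConst)
  int32_t Constant = 0; // MulConst: the constant multiplier
};

// The x86 addressing mode being assembled: [Base + Scale*Index + Offset].
struct AddrMode {
  std::string Base, Index;
  int32_t Scale = 1, Offset = 0;
};

// Try to fold the address computation feeding a load/store into an x86
// addressing mode; e.g. with v_t1 = mul v_i, 4 and v_t2 = add v_a, v_t1,
// the address v_t2 becomes [v_a + 4*v_i + 0].
bool matchAddress(const std::map<std::string, Def> &Defs,
                  const std::string &Addr, AddrMode &Mode) {
  auto It = Defs.find(Addr);
  if (It == Defs.end() || It->second.K != Def::Add)
    return false;
  Mode.Base = It->second.Lhs;            // e.g. v_a
  Mode.Index = It->second.Rhs;           // e.g. v_t1 (scale 1 by default)
  auto Idx = Defs.find(It->second.Rhs);  // does the index come from a mul?
  if (Idx != Defs.end() && Idx->second.K == Def::MulConst) {
    int32_t C = Idx->second.Constant;
    if (C == 1 || C == 2 || C == 4 || C == 8) {
      Mode.Index = Idx->second.Lhs;      // e.g. v_i
      Mode.Scale = C;                    // e.g. 4
    }
  }
  return true;
}

Applied to the listing above, this folding yields the ADDROPT annotation v_a+4*v_i+0 on the load, while the ADDR annotations on v_a and v_i record that those operands feed the folded address.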
Instruction selection includes local register allocation, using three registers but calling them r1, r2, and r3 so that the specific mappings to eax, ecx, and edx can be deferred. In the code below, for each instruction we retain the originating Vanilla ICE instruction in a comment for clarity, and each low-level ICE instruction is annotated with the local register manager state. This state includes the LRU order (the first register listed is the next register to be given out) and the set of operands and constants available in each register. When the instruction selection code works with the local register manager, the selector may take advantage of operator commutativity to minimize the instruction expansion (see the lowering of v_t2 = add v_a, v_t1 below).
simple_loop(v_a, v_n):
l_entry: // LIVE={v_a,v_n}, r1={}, r2={}, r3={}
// v_i_phi = 0
mov r1, 0
mov v_i_phi, r1 // r2={}, r3={}, r1={0,v_i_phi}
// v_sum_phi = 0
mov r2, r1
mov v_sum_phi, r2 // r3={}, r1={0,v_i_phi}, r2={0,v_sum_phi}
// v_sum2_phi = 0
mov r3, r1
mov v_sum2_phi, r3 // r1={0,v_i_phi}, r2={0,v_sum_phi}, r3={0,v_sum2_phi}
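// v_c = icmp sgt v_n, 0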
cmp v_n, 0
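// br v_c, l_body, l_end // LIVEEND={v_c}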
ble l_end
br l_body
l_body: // LIVE={v_i_phi, v_sum_phi, v_a, v_n}, r1={}, r2={}, r3={}
//v_i = v_i_phi // LIVEEND={v_i_phi}
mov r1, v_i_phi // r2={}, r3={}, r1={v_i_phi}
mov v_i, r1 // r2={}, r3={}, r1={v_i}
//v_sum = v_sum_phi // LIVEEND={v_sum_phi}
mov r2, v_sum_phi // r3={}, r1={v_i}, r2={v_sum_phi}
mov v_sum, r2 // r3={}, r1={v_i}, r2={v_sum}
//v_t1 = mul v_i, 4 // ADDR={v_i}
mov r3, r1
shl r3, 2
mov v_t1, r3 // r2={v_sum}, r3={v_t1}, r1={v_i}
//v_t2 = add v_a, v_t1 // LIVEEND={v_t1}, ADDR={v_a}
mov r2, v_a // r3={v_t1}, r1={v_i}, r2={v_a}
add r3, r2
mov v_t2, r3 // r3={v_t2}, r1={v_i}, r2={v_a}
//v_elt = load v_t2 // LIVEEND={v_t2}, ADDROPT=v_a+4*v_i+0
mov r3, [r2+4*r1]
mov v_elt, r3 // r1={v_i}, r2={v_a}, r3={v_elt}
//v_add = add v_elt, v_sum // LIVEEND={v_elt, v_sum}
add r3, v_sum
mov v_add, r3 // r1={v_i}, r2={v_a}, r3={v_add}
//v_sum_phi = v_add // PHI
mov v_sum_phi, r3 // r1={v_i}, r2={v_a}, r3={v_add,v_sum_phi}
//v_sum2_phi = v_add // PHI, LIVEEND={v_add}
mov v_sum2_phi, r3 // r1={v_i}, r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}
//v_inc = add v_i, 1 // LIVEEND={v_i}
add r1, 1
mov v_inc, r1 // r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}, r1={v_inc}
//v_i_phi = v_inc // PHI
mov v_i_phi, r1 // r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}, r1={v_inc,v_i_phi}
//v_c2 = icmp slt v_inc, v_n // LIVEEND={v_inc}
cmp r1, v_n
//br v_c2, l_body, l_end // LIVEEND={v_c2}
blt l_body
br l_end
l_end: // LIVE={v_sum2_phi}, r1={}, r2={}, r3={}
//v_sum2 = v_sum2_phi // LIVEEND={v_sum2_phi}
mov r1, v_sum2_phi
mov v_sum2, r1 // r1={v_sum2_phi,v_sum2}
//ret v_sum2 // LIVEEND={v_sum2}
mov eax, r1
ret
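The local register manager state shown in the comments can be modeled as an LRU list of registers, each with the set of operands and constants currently known to hold the same value. A rough sketch of that bookkeeping, with a hypothetical class that is not Subzero's actual interface:

#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Minimal model of a per-block local register manager: registers are kept in
// LRU order (front = next register to be given out), and each register tracks
// which operands/constants currently hold the same value as the register.
class LocalRegManager {
public:
  explicit LocalRegManager(const std::vector<std::string> &Regs) {
    for (const std::string &R : Regs)
      Lru.push_back({R, {}});
  }

  // Return a register already known to contain Operand, if any.
  bool findRegFor(const std::string &Operand, std::string &RegOut) {
    for (const Entry &E : Lru)
      if (E.Contents.count(Operand)) {
        RegOut = E.Reg;
        markUsed(RegOut);
        return true;
      }
    return false;
  }

  // Hand out the least-recently-used register, discarding its old contents.
  std::string allocate(const std::string &NewContents) {
    Entry E = Lru.front();
    Lru.erase(Lru.begin());
    E.Contents = {NewContents};
    Lru.push_back(E);
    return E.Reg;
  }

  // Record that Reg now also holds Operand (e.g. after "mov v_t1, r3").
  void addContents(const std::string &Reg, const std::string &Operand) {
    for (Entry &E : Lru)
      if (E.Reg == Reg)
        E.Contents.insert(Operand);
    markUsed(Reg);
  }

private:
  struct Entry {
    std::string Reg;
    std::set<std::string> Contents;
  };
  std::vector<Entry> Lru; // front = least recently used

  void markUsed(std::string Reg) {
    auto It = std::find_if(Lru.begin(), Lru.end(),
                           [&](const Entry &E) { return E.Reg == Reg; });
    if (It != Lru.end()) {
      Entry E = *It;
      Lru.erase(It);
      Lru.push_back(E); // most recently used moves to the back
    }
  }
};

The commutativity case mentioned above shows up in the lowering of v_t2 = add v_a, v_t1: v_t1 is already in r3 and its live range ends there, so the selector adds v_a (just loaded into r2) into r3 rather than first copying v_a into a fresh destination register.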
Multiblock register allocation starts by identifying register/operand candidates (CAND) for each basic block: pairs in which the operand is live on entry to the block and is the first value assigned to that register in the block. These candidates are used in conjunction with physical register preferences to decide on the mapping between r1/r2/r3 and eax/ecx/edx in each basic block. For example, mov eax, r1 creates a preference for r1 to be eax in the last basic block, and that preference propagates when possible to the predecessor basic blocks. Edges may need to be split to add compensation code when the preferences are overconstrained.
simple_loop(v_a, v_n):
l_entry: // LIVE={v_a,v_n}, r1={}, r2={}, r3={}
// CAND={}
// REG={r3:eax, r1:ecx,r2:edx}
mov r1, 0
mov v_i_phi, r1 // r2={}, r3={}, r1={0,v_i_phi}
mov r2, r1
mov v_sum_phi, r2 // r3={}, r1={0,v_i_phi}, r2={0,v_sum_phi}
mov r3, r1
mov v_sum2_phi, r3 // r1={0,v_i_phi}, r2={0,v_sum_phi}, r3={0,v_sum2_phi}
cmp v_n, 0
ble l_end
br l_body
l_body_split1:
mov edx, eax // mov r2:v_sum_phi, r3:v_sum_phi
l_body: // LIVE={v_i_phi, v_sum_phi, v_a, v_n}, r1={}, r2={}, r3={}
// CAND={v_i_phi:r1,v_sum_phi:r2}
// REG={r3:eax, r1:ecx,r2:edx}
////mov r1, v_i_phi // r2={}, r3={}, r1={v_i_phi}
mov v_i, r1 // r2={}, r3={}, r1={v_i}
////mov r2, v_sum_phi // r3={}, r1={v_i}, r2={v_sum_phi}
mov v_sum, r2 // r3={}, r1={v_i}, r2={v_sum}
mov r3, r1
shl r3, 2
mov v_t1, r3 // r2={v_sum}, r3={v_t1}, r1={v_i}
mov r2, v_a // r3={v_t1}, r1={v_i}, r2={v_a}
add r3, r2
mov v_t2, r3 // r3={v_t2}, r1={v_i}, r2={v_a}
mov r3, [r2+4*r1]
mov v_elt, r3 // r1={v_i}, r2={v_a}, r3={v_elt}
add r3, v_sum
mov v_add, r3 // r1={v_i}, r2={v_a}, r3={v_add}
mov v_sum_phi, r3 // r1={v_i}, r2={v_a}, r3={v_add,v_sum_phi}
mov v_sum2_phi, r3 // r1={v_i}, r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}
add r1, 1
mov v_inc, r1 // r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}, r1={v_inc}
mov v_i_phi, r1 // r2={v_a}, r3={v_add,v_sum_phi,v_sum2_phi}, r1={v_inc,v_i_phi}
cmp r1, v_n
blt l_body_split1
br l_end
l_end: // LIVE={v_sum2_phi}, r1={}, r2={}, r3={}
// CAND={v_sum2_phi:r1}
// REG={r1:eax, r2:ecx,r3:edx}
////mov r1, v_sum2_phi
mov v_sum2, r1 // r1={v_sum2_phi,v_sum2}
mov eax, r1
ret
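The preference propagation can be pictured as a backward flow over the control-flow graph: a block with no constraint of its own for some rN simply adopts the physical register its successor chose, so that no compensation code is needed on that edge. A deliberately simplified sketch (the BlockInfo representation is invented for illustration, and conflict resolution plus the actual edge splitting are omitted):

#include <map>
#include <string>
#include <vector>

// Hypothetical per-block record: the virtual-to-physical register mapping
// chosen so far (e.g. {"r1","eax"}), plus the indices of successor blocks.
struct BlockInfo {
  std::map<std::string, std::string> RegMap;
  std::vector<int> Succs;
};

// Propagate physical-register preferences from successors to predecessors.
// Edges whose two endpoints still disagree after propagation would be split
// and given compensating moves (like l_body_split1 above); not shown here.
void propagatePreferences(std::vector<BlockInfo> &Blocks) {
  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (BlockInfo &B : Blocks) {
      for (int S : B.Succs) {
        for (const auto &Pref : Blocks[S].RegMap) {
          if (!B.RegMap.count(Pref.first)) {
            B.RegMap[Pref.first] = Pref.second; // adopt the successor's choice
            Changed = true;
          }
        }
      }
    }
  }
}

In the listing above, the back edge into l_body is the overconstrained case: v_sum_phi ends up in r3 (eax) at the bottom of l_body but is expected in r2 (edx) at the top, so the edge is split as l_body_split1 with the compensating mov edx, eax.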
After physical register substitution:
simple_loop(v_a, v_n):
l_entry: // LIVE={v_a,v_n}, r1={}, r2={}, r3={}
// CAND={}
// REG={r3:eax, r1:ecx,r2:edx}
mov ecx, 0
mov v_i_phi, ecx // edx={}, eax={}, ecx={0,v_i_phi}
mov edx, ecx
mov v_sum_phi, edx // eax={}, ecx={0,v_i_phi}, edx={0,v_sum_phi}
mov eax, ecx
mov v_sum2_phi, eax // ecx={0,v_i_phi}, edx={0,v_sum_phi}, eax={0,v_sum2_phi}
cmp v_n, 0
ble l_end
br l_body
l_body_split1:
mov edx, eax
l_body: // LIVE={v_i_phi, v_sum_phi, v_a, v_n}, r1={}, r2={}, r3={}
// CAND={v_i_phi:r1,v_sum_phi:r2}
// REG={r3:eax, r1:ecx,r2:edx}
mov v_i, ecx // edx={}, eax={}, ecx={v_i}
mov v_sum, edx // eax={}, ecx={v_i}, edx={v_sum}
mov eax, ecx
shl eax, 2
mov v_t1, eax // edx={v_sum}, eax={v_t1}, ecx={v_i}
mov edx, v_a // eax={v_t1}, ecx={v_i}, edx={v_a}
add eax, edx
mov v_t2, eax // eax={v_t2}, ecx={v_i}, edx={v_a}
mov eax, [edx+4*ecx]
mov v_elt, eax // ecx={v_i}, edx={v_a}, eax={v_elt}
add eax, v_sum
mov v_add, eax // ecx={v_i}, edx={v_a}, eax={v_add}
mov v_sum_phi, eax // ecx={v_i}, edx={v_a}, eax={v_add,v_sum_phi}
mov v_sum2_phi, eax // ecx={v_i}, edx={v_a}, eax={v_add,v_sum_phi,v_sum2_phi}
add ecx, 1
mov v_inc, ecx // edx={v_a}, eax={v_add,v_sum_phi,v_sum2_phi}, ecx={v_inc}
mov v_i_phi, ecx // edx={v_a}, eax={v_add,v_sum_phi,v_sum2_phi}, ecx={v_inc,v_i_phi}
cmp ecx, v_n
blt l_body_split1
l_end: // LIVE={v_sum2_phi}, r1={}, r2={}, r3={}
// CAND={v_sum2_phi:r1}
// REG={r1:eax, r2:ecx,r3:edx}
mov v_sum2, eax // eax={v_sum2_phi,v_sum2}
mov eax, eax
ret
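The substitution step itself is mechanical: every rN in a block is renamed to the physical register chosen by that block's REG mapping. A trivial sketch, modeling instructions as bare operand-name lists (rewriting the pieces of composite memory operands such as [r2+4*r1] is glossed over):

#include <map>
#include <string>
#include <vector>

// Rename virtual registers (r1/r2/r3) to the physical registers chosen for
// one basic block, e.g. {"r1","ecx"}, {"r2","edx"}, {"r3","eax"} for l_body.
void substitutePhysRegs(std::vector<std::vector<std::string>> &Insts,
                        const std::map<std::string, std::string> &RegMap) {
  for (std::vector<std::string> &Operands : Insts)
    for (std::string &Op : Operands) {
      auto It = RegMap.find(Op);
      if (It != RegMap.end())
        Op = It->second;
    }
}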
Dead instruction elimination cleans up dead stores and no-longer-necessary calculations of intermediate components of addressing modes.
simple_loop(v_a, v_n):
l_entry:
mov ecx, 0
mov edx, ecx
mov eax, ecx
cmp v_n, 0
ble l_end
br l_body
l_body_split1:
mov edx, eax
l_body:
mov v_sum, edx
mov edx, v_a
mov eax, [edx+4*ecx]
add eax, v_sum
add ecx, 1
cmp ecx, v_n
blt l_body_split1
l_end:
ret
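Conceptually this pass is the liveness scan again, now run over the lowered code and actually deleting instructions: anything without side effects whose destination is not live is dropped, which removes both the dead stores to stack variables and the leftover pieces of the folded address computation. A simplified sketch with stand-in types:

#include <set>
#include <string>
#include <vector>

// Simplified lowered instruction: at most one destination, some sources, and
// a flag for instructions that must be kept regardless (branches, compares,
// stores to real memory, ret).
struct LoweredInst {
  std::string Dest; // empty if the instruction defines nothing
  std::vector<std::string> Srcs;
  bool HasSideEffects = false;
};

// Walk the block backwards, dropping side-effect-free instructions whose
// destination is not live; the sources of surviving instructions become live.
std::vector<LoweredInst> eliminateDead(const std::vector<LoweredInst> &Insts,
                                       std::set<std::string> Live) {
  std::vector<LoweredInst> Kept;
  for (auto It = Insts.rbegin(); It != Insts.rend(); ++It) {
    bool DestLive = It->Dest.empty() || Live.count(It->Dest);
    if (!It->HasSideEffects && !DestLive)
      continue; // dead: the destination is never read again
    if (!It->Dest.empty())
      Live.erase(It->Dest);
    for (const std::string &S : It->Srcs)
      Live.insert(S);
    Kept.push_back(*It);
  }
  return std::vector<LoweredInst>(Kept.rbegin(), Kept.rend()); // original order
}

For example, mov v_t1, eax and mov v_t2, eax from the previous listing disappear because v_t1 and v_t2 have no remaining uses once the load uses the [edx+4*ecx] addressing mode, and the stores to v_i_phi, v_sum_phi, and v_sum2_phi disappear because those values now travel between blocks in registers.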
In this example, there are no clear peephole optimizations available, and since only one stack operand is left, there is no need for local variable coalescing. The remaining differences between the code above and the code actually emitted would involve:
- Frame setup in the prolog
- Frame destruction in the epilog
- Substituting stack frame addressing modes where generic stack operands appear
- Loop alignment as desired
- SFI sequences
- Bundling