V8 Register allocation

“V8 Turbofan Register Allocation Design This is a high-level overview of the design of register allocation in the Turbofan compiler in V8. It focuses on those large picture concerns that span more than one type/method and are cumbersome to describe through code comments. It provides design rationale for some of the less obvious design choices. Overview Register allocation runs after instruction selection, and is followed by frame elision, jump threading1 and finally code generation. The key inputs to register allocation are the reverse postorder (RPO) control flow graph (CFG), in split-edge form, produced by instruction selection (isel), and the list of architecture-specific registers available for allocation. It consists of a number of phases, which are all coordinated from Pipeline::AllocateRegisters, in pipeline.cc. Some of the instructions coming from isel are not yet lowered to the architecture (calls, for example, are still represented as an abstract instruction) but they will all have architecture-specific operand information. For example: certain operations may have fixed, architecture-specific operands – for example, on x64, a variant of imul returns in rdx:rax, so the output of the multiplication operation is fixed to those registers function parameters are at specific stack locations certain operations require parameters be passed in some register, or the input and output register may need to match. Currently, we do not leverage register aliasing on architectures that feature it. Once register allocation completes, all instruction operands are assigned, either to an architecture-specific register, or to a stack slot, or a constant. In addition, moves (assignments from one operand to another) may be inserted between some instructions, to ensure correct data flow. Allocation Pipeline and Concepts We describe the main register allocation concepts as we go through the register allocation phases, which help explain the rationale behind some of the design decisions. The main concepts, specific to register allocation, are: LifetimePosition, UsePosition, UseInterval, LiveRange, SpillRange, and ParallelMove. Here is a high-level summary of the register allocation pipeline phases, for orientation: MeetRegisterConstraints connects fixed instruction operands to virtual registers. ResolvePhis lowers phis as move instructions. BuildLiveRanges performs liveness analysis, thus constructing LiveRanges. SplinterLiveRanges separates hot and cold sections of a live range. Cold sections are those spanning instructions in deferred blocks. Deferred blocks are blocks compiler generates for a certain class of exception-handling code, like stack overflows. RegisterAllocation runs a register allocation algorithm over live ranges, potentially splitting them into child live ranges. Assigns each live range either a register or a stack slot. MergeSplinters merges hot and cold constituents of an original live range back into one. AssignSpillSlots determines the number of spill slots and assigns them to virtual registers. CommitAssignment converts instruction operands covered by a live range to register or stack slot operands, depending on what that range has assigned to it. PopulateReferenceMaps populates the pointer maps (for GC). ConnectRanges ensures values flow from operands assigned to child ranges of the same virtual register, mainly when the connection is within a block. ResolveControlFlow complementary to ConnectRanges, between blocks with non-trivial control flow (i.e. many predecessors/successors) OptimizeMoves removes redundant moves. LocateSpillSlots identifies blocks that need a frame because they use stack operands. FrameElision (outside the register allocation pipeline) determines the optimal set of blocks where a frame should be constructed or deconstructed. MeetRegisterConstraints and ResolvePhis Phases Instructions coming from isel have either fixed operands, which are pre-assigned to a specific register or stack slot, or virtual register operands. Consider an instruction which takes a parameter that must be in a fixed register (say rax), and that value must correspond to the virtual register v4. We want to give as much freedom as possible to the register allocator in assigning hardware-specific storage to v4. In addition, other instructions where v4 is used may require the input slot for v4 in a different hardware register (or stack slot). For our case, just before the instruction executes, we must ensure rax takes the value of v4. This illustrates the concept of input constraints. Constraints (in our context) are move operations used either to ensure data flow between virtual registers and fixed registers, or to lower phis. A move is just an assignment. For cases like this instruction, we create a virtual register (let’s call it v200), indicate v200 is fixed to rax, and insert a move before the instruction of the form v200 = v4. We use v200 just here, for this instruction. This may seem convoluted – why create this short-lived v200 and not just directly write rax = v4? The reason is that modeling it as a virtual register helps later, when we calculate live ranges (BuildLiveRanges) because we can uniformly apply liveness analysis to all operands, and, in this case, determine the ranges of code where hardware registers are already occupied. The MeetRegisterConstraints phase inserts constraints, while ResolvePhis lowers phis. We’re describing them together because they serve to illustrate a particular design decision. Similar to input constraints, there are output constraints: if an instruction’s output operand is a specific register (say rdx), and that instruction must define a virtual register – say v10 – we must ensure v10 takes the value of rdx right after this instruction, and before the next one is performed. Similar to how input constraints work, and for the same rationales, we create a v201 that is pre-assigned to rdx and add a move after the instruction of the form v10 = v201. Phis are lowered by ending each block contributing a value to the phi with a move from the operand in that block to the phi operand. For instance, if we have: Block B1: v0 = … jump B3 Block B2: v1 = … jump B3 Block B3: v2 = phi(v0, v1) … We lower v2 as: Block B1: v0 = … v2 = v0 jump B3 Block B2: v1 = … v2 = v1 jump B3 Block B3: … Inserting constraints before register allocation means that the operands participating in the moves must be considered when performing liveness analysis. This means that we need to model the space between instructions, where moves may be inserted. We refer to this space as “gap”, and distinguish 2 positions – START and END. END is used to store input constraints and moves for phi lowering. The START gap position is used for output constraints. There may be more than one move in a gap position. We treat all the moves at one gap position as “”parallel moves””: they may be freely reordered, and the semantics of the assignment is that the RHS is captured before any of the moves is performed. Why 2 and not one gap positions? Consider this (we skipped the operand live ranges, writing directly the assigned operands, e.g. rax instead of v200, using the example before): rax = instr1 (v0 = rax, [stack:2] = v0) instr2([stack:2]) The first instruction defines v0 in rax. The second instruction has v0 as input, but it wants it on the stack. The constraints – listed in brackets – must happen between the 2 instructions. Since we treat the moves as parallel moves, [stack:2] is undefined when instr2 is executed. [stack:2] is only defined (or defined conforming to the semantics of the program) if v0=rax happened first. By having 2 gap positions, we can specify this ordering, by placing the [stack:2] = v0 assignment in the END gap position. An alternative would have been to not use the parallel move model and instead detect the data dependency between the operands, however, that would require an analysis of existing moves between instructions every time we added a move. The current design offers a simple, inexpensive alternative2. The only other source of moves is due to live range splits and due to spills, and we can always use START for those, we will explain why later. An effect is that we do not need more than 2 gap positions. BuildLiveRanges Phase With all necessary operands in place – either as provided by isel, or as introduced by constraints and phi lowering – we perform liveness analysis and construct LiveRanges. At this stage, we instantiate TopLevelLiveRanges. These are LiveRanges (TopLevelLiveRange is a subtype of LiveRange) that start at definition, or the beginning of the block where a phi is defined. We will not discuss the algorithm here, just some specifics of the V8 model. A LiveRange instance is associated with a specific virtual register (SSA variable). We store TopLevelLiveRanges in a list indexed by the virtual register they correspond to, in the RegisterAllocationData object that is passed between register allocation pipeline phases. LiveRanges are described by one or more UseIntervals. A UseInterval is a pair of positions [start, end) in the instruction sequence (including gaps) where the variable is live, for all positions between start and up to, but excluding, end. The design choice here was to model instruction positions as a START and END. Together with the 2 gap positions, each instruction position is modeled as 4 positions for register allocation. This is the purpose of LifetimePosition. UseInterval endpoints are given as LifetimePositions, for example. The UseIntervals in a LiveRange are chained in order: the end of one is before or same as the start of the next, see the next() member on UseIntervals. A LiveRange ends up with more than one UseInterval when the virtual register is used in multiple blocks, there is a branch, and the virtual register is used only on one side of the branch. For instance: Block B1: 1: v0 = … 2: branch (B2, B3) Block B2: 3: (no use of v0) 4: (no use of v0) 5: jump B4 Block B3: 6: use v0 7: jump B4 B4: … Here, the live range for v0 will contain the use intervals [1, 3) and [6,7). We did not use LifetimePositions here, for simplicity. Another situation where a live range has more that one interval is fixed ranges. Recall from the MeetRegistersConstraints phase that we create small live ranges around instructions, to model fixed operands. We coalesce all the intervals where a certain hardware register is occupied into a single fixed live range (one per hardware register). Fixed live ranges are accounted for separately from live ranges, in RegisterAllocationData (see fixed_{double_},live_ranges()), and they have negative virtual register numbers. When a LiveRange has more than a use interval, the use intervals may be assumed to be disconnected – the end of one is strictly before the start of the next. Because of later phases of splintering and merging, that assumption may be relaxed to allow intervals to touch. LiveRanges also chain UsePositions. These represent the association between a position (LifetimePosition) and the Operand corresponding to the virtual register of the owning LiveRange. The UsePositions chain, in order, via a next() member. The first UsePosition will be the definition, with the exception of phis. For phis, because we insert move operands at the end of blocks contributing values to the phi, there are UsePositions before the block where the phi is defined (corresponding to those moves). UsePositions also store a hint. Phis, for instance, have as hint one of the incoming operands. Virtual registers defined from a fixed register (in an output constraint) have as hint the operand giving them the value. The hint is used by the register allocator to try and assign the same register to the virtual register, thus avoiding the cost of a move. Virtually all register allocation algorithms resolve allocation conflicts arising from insufficient hardware registers by splitting and/or spilling live ranges. In support of that, LiveRanges expose a SplitAt member. When a LiveRange is split at a position pos, the new range starts at pos and ends at the old range’s end, while the current range ends at pos. UseIntervals and UsePositions are distributed accordingly to the new and old live range. The current range chains the new range via its (the current’s) next() member. Data flow from one range to the next is handled in later stages (ConnectLiveRanges and ResolveControlFlow). Initially, we only build TopLevelLiveRanges. Upon splitting, all subsequent child ranges refer back to the TopLevelLiveRange via the parent() member. That facilitates retrieval of information about the virtual register as a whole – its identity (the virtual register), the start of the whole live range chain, and the SpillRange. When we build TopLevelLiveRanges, if we determine the virtual register has or needs a stack operand, we build a SpillRange and associate it with the TopLevelLiveRange. SpillRanges are unaffected by range splitting, but may be merged – a SpillRange may be associated with more than one virtual register. If, during register allocation, we spill a LiveRange, and there is no SpillRange associated to the parent TopLevelLiveRange, we will create one at that stage, too. Spill ranges capture the entire span of UseIntervals of a TopLevelLiveRange and its children, if the spill range is created during register allocation, rather than the BuildLiveRanges phase. More on Spill ranges later. Live ranges for constants or operands that are defined from a memory location – like function parameters – do not need to be spilled. They do not have a spill range. SplinterLiveRanges Phase The incoming CFG labels certain basic blocks as “deferred”, to indicate they are on the cold path. Examples include compiler-generated code for runtime exception handling, such as stack overflow. When we need to spill live ranges defined in a register, we insert a spill move (stack=register) at the definition of the live range, so at the beginning of a TopLevelLiveRange. If a live range is only spilled in deferred blocks, and its definition is in a hot loop, the spill penalizes the hot path because of operands needed only on the cold path. To address this, we splinter live ranges: we separate a TopLevelLiveRange, fresh out of the BuildLiveRanges phase, into 2 independent TopLevelLiveRanges, one covering deferred blocks, and the other covering everything else. We will run register allocation on them independently; then merge them back under the original TopLevelLiveRange and continue the pipeline. This mechanism makes sure that, if the only reason for spilling is in the deferred blocks, the hot path is unaffected. Register Allocation Phase In production, we run a variant of Linear Scan. Experimentally, we also have an implementation of the backtracking algorithm used by LLVM (“greedy” register allocator). At the end of register allocation, the instruction operands are unchanged, and the allocation decision is embodied in the way the live ranges are split, which registers they get assigned, and whether they are spilled. No decision is made, yet, on which spill slot to use for the spilled cases. We run the allocation in 2 sub-phases, one for each type of register – general and double. This is possible because an operand may be of one kind or the other – so there are no cross-type conflicts. This is also beneficial, because the register allocator complexity is supralinear, and reducing the number of virtual registers processed by the algorithm (even when running it twice) reduces compile time. Linear Allocator This is a variant of Poletto & Sarkar’s algorithm. We perform a few heuristics: we use the first encountered hint on the live range to minimize unnecessary moves for phis with spilled incoming operands, an attempt is made to merge the spill ranges of the output and its operands. The goal is to avoid memory-to-memory moves. When conflicts arise, we split a range based on where conflicts start or end, with little regard on whether the split position is a gap start/end or instruction start/end. Splitting leads to a move being introduced at a later phase (see ConnectLiveRanges and ResolveControlFlow). Moves only happen in gap positions, so splitting at instruction start is actually handled as split at the gap position just above (END), and splitting at instruction end is equivalent to splitting at the next instruction’s gap START. Greedy Allocator This is an interpretation of the LLVM algorithm. It is an interpretation because the LLVM algorithm is quite poorly documented, when compared to linear scan or other register allocation algorithms. For that reason, we describe it in more detail, as implemented in V8. The main structure of this allocator is: a priority queue of live ranges that are yet to be allocated (or spilled). The priority is in reverse order of size, where size is defined as the sum of the size (end – start) of constituent UseIntervals. a data structure representing the current allocations (CoallescedLiveRanges) corresponding to a particular hardware register. The data structure must be efficient for insertions, removals, and queries, and is critical to the contribution of the register allocator to compile time. a weight measure for live ranges. We use the ratio (use positions requiring a register) / size. a packet of heuristics for what to do when a live range has conflicts. These must reduce (split and/or spill) the live range, and do it so based on what’s most beneficial for this range. This is different from Linear Scan, where we split based on where conflicts are positioned. The algorithm picks the top of the priority queue and tries to find a free register for that range, thus consulting the corresponding CoallescedLiveRanges. If a free register is found, we add the live range to CoallescedLiveRanges and continue to the next element in the queue. If there are conflicts, we try and find that register where the weight of none of the conflicts is higher than the weight of the candidate live range. If we find more than one such register, we pick the one where the highest weight of the conflicts is the lowest. We then evict all the conflicts: we remove them from the corresponding CoallescedLiveRanges structure (the one corresponding to the register) and add them unchanged back to the priority queue. If we can’t find a free or evictable register, we apply the heuristics, one by one, to the live range. The heuristics must reduce the live range: either spill it outright, if it is possible (there are no UsePositions requiring a register); or split it, and add the pieces back to the queue. In extreme, the pieces become so small that they are irreducible and have infinite weight, thus ensuring their allocation. When splitting, the greedy allocator always splits at gap START. The nature of the algorithm is to apply heuristics that do not consider the conflicting ranges. Instead, heuristics consider control flow and, at a high level, decide between which 2 instructions to split. This means we need to decide whether to split at START or at END. Splitting at END can result in a conflict when the position already contains a constraint where the operand of the virtual register appears on the RHS. Since the child range would begin at gap END, the RHS operand would be the new one, which hasn’t been assigned yet. Splitting at gap START is always safe. The only constraints introduced there are virtual register definitions, and splitting a range at its start position won’t happen – it doesn’t reduce the problem, so it wouldn’t be pursued by a well-behaved heuristic in the first place. The current heuristics are: split around calls and spill over the call instruction. Calls always clobber registers, so we proactively perform this segmentation; or split before loops; or split before the first use position, or, if that happens to be the start of the range, split after it; or spill The algorithm has a few enhancements over the basic structure described initially: we consider hints: we try to allocate first to the register that’s hinted. Failing that (there are conflicts and they are “heavier”), we try allocate elsewhere. we group together non-conflicting operands of a phi and the phi live range, and try to allocate them all to the same register. We do this by handling such a group as if it were one big live range (so its size is the sum of the components), and adding that to the priority queue. If we fail to allocate the whole group, we disperse it: we add the constituent live ranges to the queue and let the algorithm take its course. MergeSplinters Phase This phase merges splinters: it re-packages the 2 TopLevelLiveRanges obtained from the Splinter phase into the original one. It also detects if the range spills only in deferred blocks, and in that case, it marks the TopLevelLiveRange as such, (see the member IsSpilledOnlyInDeferredBlocks), so that when we add the spill moves later, we do not do so at definition, but rather inside the deferred blocks. AssignSpillSlots Phase So far we have not decided how many spill slots we need, and which virtual registers spill where. Recall that we spill at definition, and that spill ranges cover the entirety of the original live range. Because we spill at definition, we only need to spill once3, and then the slot is available for later reloads or uses. If 2 virtual registers’ spill ranges do not overlap, we may safely use the same spill slot for both of them, because the virtual registers are never simultaneously live. We merge their corresponding SpillRanges, and associate the resulting one to both live range chains. This is the role of the AssignSpillSlots phase: merge spill ranges, and then assign to the resulting ones a spill slot. That means, also, that at this point, virtual registers (i.e. live range chains) that need to be spilled have a spill operand. AssignSpillSlots does not handle slot layout on the stack, just assigning a stack operand (id and size). CommitAssignment Phase This phase updates all operands as per their assigned storage, register or slot. It also commits spills for live ranges that are not spilled only in deferred blocks. Special note here for live ranges defined by the last instruction in a block with multiple successors, presumably a branching instruction: the spill will be distributed to the first gap position of the first instruction of the block’s successors. For ranges spilled only in deferred blocks, we insert spills using a different mechanism, covered by subsequent phases. PopulateReferenceMaps Phase This phase populate pointer maps for this function. It does not contribute to register allocation, just needs to happen after ranges are computed, allocated, and operands are assigned. ConnectRanges Phase While the operands have been assigned according to the register allocator, nothing, so far, moves the value of a virtual register from one operand to the other. We do that in 2 stages. ConnectRanges flows the data between distinct operands belonging to consecutive live ranges, as long as the ranges do not connect at the beginning of a block that has more than a predecessor, or that is not consecutive after its sole predecessor. We handle these other cases in the next phase, which handles control flow. This is because, if the connection point is at the beginning of a block with more than one predecessor, there may be more than one path (and operand) that brings the value into this block. For live ranges spilled only in deferred blocks, if the connection is a reload (copy from slot to register) we record the block in a list – the block requires a spill operand be present. We tackle that at the end of the next phase. ResolveControlFlow Phase Here, we iterate through blocks with non-trivial control flow, identify the operands corresponding to the same virtual register, and insert moves. Recall the CFG is in split edge form. If we flow data from a node with multiple successors, the move is inserted at the end of the block (at the END gap position). If we flow data into a node with one predecessor, we’ll insert the move in the current block. Similarly to ConnectRanges, we take note of reloads for ranges spilled in deferred blocks. At the end, we use the list of such reloads to insert the spills for such ranges. We insert the spills only in deferred blocks dominating the locations the spill operands are necessary. We also mark these blocks as needing a frame, since they spill. OptimizeMoves Phase The last phase of register allocation. It compresses the moves in the 2 gap position, taking note of ordering, and rewriting some of the moves as a result. It then pushes downwards moves within a block, as long as that preserves semantics. Finally, it merges moves common to all last instructions of all predecessors of a block. LocateSpillSlots Phase This phase determines which blocks need a frame, based on whether the range is spilled, and excluding those spilled only in deferred blocks, which were handled in ResolveControlFlow, where we determined in which blocks they spill. For all the other ranges that spill, we know they spill at definition, and so we use the location of spills to mark the block as needing frame (blocks, when the defining instruction is the last in a block with many successors). 2 This is my speculation after reverse engineering the design. 3 Exception being ranges spilled only in deferred blocks, and spills for ranges defined in the last instruction of a block with many successors, where we’ll spill in each successor. 1 The name of the phase (“jump threading”) may be confusing, it suggests some concurrency concern. It is actually concerned with block to block jump optimization. “

Comments

Leave a Reply Cancel reply