Every compiler faces a fundamental constraint: an unbounded supply of virtual registers must map onto a finite set of physical hardware registers. This transformation determines whether generated code runs efficiently or drowns in memory access penalties. Modern processors offer roughly eight general-purpose registers on 32-bit x86, sixteen on x86-64 and 32-bit ARM, and thirty-two on MIPS. When programs demand more storage than hardware provides, the compiler must choose wisely.
LLVM's backend transforms intermediate representation into executable machine code through several critical phases, with register allocation standing among the most complex and performance-sensitive. The allocator decides which virtual registers live in physical registers and which get relegated to memory. These decisions ripple through execution time: a memory access can cost anywhere from a few cycles for a cache hit to hundreds of cycles for a trip to main memory, compared to a single cycle for a register.
The Interference Graph Foundation
Register allocation begins by understanding which values cannot coexist in the same physical register. Two virtual registers interfere when their live ranges overlap, meaning both hold distinct values simultaneously at some program point. Picture a function computing mathematical expressions:
define i32 @compute(i32 %a, i32 %b, i32 %c) {
entry:
  %v1 = add i32 %a, %b
  %v2 = mul i32 %b, 2
  %v3 = add i32 %v1, %v2
  %v4 = mul i32 %v3, %c
  %v5 = add i32 %v4, 1
  ret i32 %v5
}
At the point where %v3 is computed, both %v1 and %v2 must still hold their values, creating interference. The interference graph captures these relationships as nodes representing virtual registers connected by edges denoting conflicts. Building this graph requires liveness analysis, determining precisely where each value remains needed.
Liveness flows backward through the program. Each use of a variable makes it live, and that liveness propagates backward until the defining instruction is reached. The algorithm computes live-in and live-out sets for each basic block through iterative dataflow analysis:
// Simplified liveness analysis pseudocode
for each block B in postorder (successors visited before B):
    LiveOut[B] = union of LiveIn[S] over all successors S of B
    LiveIn[B]  = (LiveOut[B] - Defined[B]) union Used[B]
This pass repeats until reaching a fixed point where no sets change. Once liveness information exists, interference graph construction examines each instruction. For an operation like v1 = add v2, v3, the allocator adds an edge between v1 and every other virtual register that is live immediately after the instruction.
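A minimal sketch of this construction step, using plain standard containers and a hypothetical Instruction type rather than LLVM's actual classes, and assuming the live-after sets come from the liveness analysis above:

#include <map>
#include <set>
#include <vector>

// Hypothetical types for illustration only.
using VirtReg = unsigned;

struct Instruction {
  std::vector<VirtReg> defs;   // virtual registers defined here
  std::vector<VirtReg> uses;   // virtual registers read here
};

// Interference graph as an adjacency set per virtual register.
using InterferenceGraph = std::map<VirtReg, std::set<VirtReg>>;

// Each definition interferes with every other virtual register that is
// live immediately after the instruction.
void addInterferences(const Instruction &I,
                      const std::set<VirtReg> &liveAfter,
                      InterferenceGraph &G) {
  for (VirtReg def : I.defs)
    for (VirtReg other : liveAfter)
      if (other != def) {
        G[def].insert(other);
        G[other].insert(def);
      }
}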
Graph Coloring as Register Assignment
The classical approach treats register allocation as graph coloring. Assign each node a color such that no two connected nodes share the same color, where colors represent physical registers. For a machine with K registers, find a K-coloring of the interference graph.
This problem proves NP-complete for arbitrary graphs with K greater than or equal to three. Chaitin pioneered applying graph coloring to register allocation in the early 1980s, introducing heuristics that work remarkably well in practice despite theoretical complexity.
The algorithm proceeds through simplification and selection phases. During simplification, nodes with fewer than K neighbors can always be colored after removing them from consideration. The allocator builds a stack of such nodes:
def simplify(graph, K, stack):
    while True:
        # Find a node with degree < K
        node = find_node_with_degree_less_than(graph, K)
        if node is None:
            break
        stack.push(node)
        graph.remove_node(node)
Nodes with K or more neighbors present challenges. The allocator must choose which to spill, typically selecting those with lowest spill cost. Spill cost estimates the performance impact of storing a value in memory, considering usage frequency and loop nesting depth:
float calculateSpillCost(LiveInterval *LI) {
  float cost = 0.0f;
  for (auto &segment : LI->segments) {
    unsigned uses = countUses(segment);
    unsigned loopDepth = getLoopDepth(segment);
    // Weight each use by 10^loopDepth so uses in deep loops dominate.
    cost += uses * std::pow(10.0f, loopDepth);
  }
  // Normalize by interval length so short, hot intervals stay in registers.
  return cost / LI->getSize();
}
Selection reverses the process. Pop nodes from the stack and assign each the lowest-numbered color not used by already-colored neighbors. This greedy approach often succeeds when simplification removed enough nodes.
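A complementary sketch of the selection phase, again with plain containers rather than LLVM types; the graph passed in is the original, un-simplified interference graph, and the stack is the one built during simplification:

#include <map>
#include <set>
#include <stack>

using Node = unsigned;
using Graph = std::map<Node, std::set<Node>>;   // adjacency sets

// Pop nodes in reverse simplification order and give each the lowest
// color (physical register number) not used by an already-colored neighbor.
std::map<Node, unsigned> select(const Graph &G, std::stack<Node> &stack,
                                unsigned K) {
  std::map<Node, unsigned> color;
  while (!stack.empty()) {
    Node n = stack.top();
    stack.pop();
    std::set<unsigned> taken;
    auto adj = G.find(n);
    if (adj != G.end())
      for (Node neighbor : adj->second)
        if (auto it = color.find(neighbor); it != color.end())
          taken.insert(it->second);
    for (unsigned c = 0; c < K; ++c)
      if (!taken.count(c)) {
        color[n] = c;
        break;
      }
    // A node left uncolored here found all K colors taken and must spill.
  }
  return color;
}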
LLVM's Greedy Register Allocator
LLVM's production allocator moved beyond the classical formulations years ago: the linear scan allocator that served as the default for many releases was replaced by the current greedy allocator, the default since LLVM 3.0. The greedy allocator avoids building full interference graphs, which can grow quadratically with the number of live ranges. Instead, it processes live intervals in priority order, making allocation decisions incrementally.
Live intervals represent the ranges of the program over which virtual registers hold values. LLVM's SlotIndexes analysis numbers instructions, enabling precise interval tracking:
// Example live interval structure (simplified)
struct LiveInterval {
  // Segment represents a half-open [start, end) slot-index range
  struct Segment {
    SlotIndex start;
    SlotIndex end;
    VNInfo *valno;   // value number defined in this segment
  };
  unsigned reg;                       // the virtual register described
  SmallVector<Segment, 4> segments;   // sorted, non-overlapping segments
  float weight;                       // spill cost estimate
};
The greedy allocator maintains a priority queue of virtual registers ordered by spill weight. Higher-weight intervals receive allocation attempts first, as spilling them would harm performance most severely:
struct CompareSpillWeight {
  bool operator()(LiveInterval *A, LiveInterval *B) const {
    // std::priority_queue is a max-heap, so ordering by "less than"
    // surfaces the heaviest (most costly to spill) interval first.
    return A->weight < B->weight;
  }
};

std::priority_queue<LiveInterval *,
                    std::vector<LiveInterval *>,
                    CompareSpillWeight> Queue;
For each interval, the allocator attempts direct assignment to a physical register. If all registers are occupied by interfering intervals, it considers eviction. Evicting lower-priority intervals can free registers for more important values. The allocator compares eviction cost against the current interval's weight, choosing the strategy minimizing overall spilling.
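The overall decision for a single interval can be sketched as follows; every helper here (findFreeRegister, allocationOrder, interferingIntervals, and so on) is a hypothetical stand-in, not LLVM's real interface:

#include <vector>

struct LiveInterval;   // as sketched above

// Hypothetical helpers, declared only for illustration.
unsigned findFreeRegister(LiveInterval *LI);                 // 0 if none free
std::vector<unsigned> allocationOrder(LiveInterval *LI);
std::vector<LiveInterval *> interferingIntervals(LiveInterval *LI,
                                                 unsigned PhysReg);
float totalWeight(const std::vector<LiveInterval *> &Intervals);
float weightOf(LiveInterval *LI);
void assign(LiveInterval *LI, unsigned PhysReg);
void requeue(const std::vector<LiveInterval *> &Evicted);

enum class Decision { Assigned, AssignedAfterEviction, NeedsSplitOrSpill };

Decision allocateOne(LiveInterval *LI) {
  // 1. Prefer a physical register with no interference at all.
  if (unsigned PhysReg = findFreeRegister(LI)) {
    assign(LI, PhysReg);
    return Decision::Assigned;
  }
  // 2. Otherwise look for a register whose current occupants are all
  //    cheaper to spill than the interval being allocated.
  for (unsigned PhysReg : allocationOrder(LI)) {
    auto Occupants = interferingIntervals(LI, PhysReg);
    if (totalWeight(Occupants) < weightOf(LI)) {
      requeue(Occupants);   // evicted intervals get another chance later
      assign(LI, PhysReg);
      return Decision::AssignedAfterEviction;
    }
  }
  // 3. No assignment possible: fall back to splitting or spilling.
  return Decision::NeedsSplitOrSpill;
}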
Live Range Splitting Sophistication
The breakthrough enabling superior code generation is global live range splitting. Rather than spilling entire live ranges, the allocator divides them into smaller pieces assigned different physical registers or memory locations.
Consider a value computed before a loop but used only inside it. Splitting creates one interval for the computation, another for loop iterations:
entry:
  %v0 = load i32, i32* %ptr
  br label %loop

loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %result = phi i32 [ 0, %entry ], [ %sum, %loop ]
  %sum = add i32 %result, %v0
  %i.next = add i32 %i, 1
  %cond = icmp ult i32 %i.next, 100
  br i1 %cond, label %loop, label %exit
Without splitting, %v0 must either occupy a physical register across its entire live range, including the whole loop, or be spilled in its entirety. Splitting around the loop separates the definition from the loop body, inserting a store after the load and a reload at the loop boundary so each piece can be allocated independently:
// Split decision pseudocode
if (shouldSplitAroundLoop(interval, loop)) {
  SlotIndex splitPoint = loop->getHeader()->begin();
  LiveInterval *outside = split(interval, 0, splitPoint);
  LiveInterval *inside = split(interval, splitPoint, interval.end());
  insertSpill(outside, splitPoint);
  insertReload(inside, splitPoint);
}
The allocator tries multiple splitting strategies: local splits within single basic blocks, region splits around loop structures, and instruction splits at specific use points. Each creates smaller intervals potentially easier to assign than the original.
Spill Code Insertion and Placement
When no splitting strategy succeeds, spilling becomes necessary. The allocator reserves stack space and inserts load/store instructions moving values between memory and registers. Optimal placement of this spill code significantly impacts performance.
Naive approaches insert a store immediately after each definition and a load right before each use. The result is redundant memory traffic, especially when uses sit inside loops. Better strategies leverage program structure.
LLVM's SpillPlacement analysis uses constraint propagation to find optimal locations. Each basic block receives entry and exit constraints indicating whether a value should arrive in a register or memory:
enum Constraint {
  DontCare,   // No preference
  PrefReg,    // Prefer register
  MustReg,    // Must be in register
  PrefSpill,  // Prefer memory
  MustSpill   // Must be in memory
};

struct BlockConstraint {
  unsigned Number;        // basic block number
  Constraint Entry;       // constraint at block entry
  Constraint Exit;        // constraint at block exit
  float EntryFrequency;
  float ExitFrequency;
};
The algorithm builds a constraint network connecting basic blocks through control flow edges. It then solves for a consistent assignment minimizing expected execution frequency of spill code:
void SpillPlacement::addConstraints(
    ArrayRef<BlockConstraint> Constraints) {
  for (auto &BC : Constraints) {
    // Add a node for this block
    nodes[BC.Number].setConstraints(BC.Entry, BC.Exit);
    // Link to successor blocks
    for (auto *Succ : getSuccessors(BC.Number)) {
      addLink(BC.Number, Succ->getNumber(), BC.ExitFrequency);
    }
  }
  // Solve using iterative relaxation
  iterate();
}
This approach moves spill code outside loops when possible, dramatically reducing the number of spills actually executed. A value used once per iteration of a hundred-trip loop might execute one store and one hundred reloads with naive placement, versus one store after the definition and one reload hoisted into the loop preheader, each executed only once, with optimal placement.
Handling Complex Architectural Constraints
Real processors introduce complications beyond simple register counting. Some architectures have overlapping register classes, where certain operations require specific subsets of registers. x86 exemplifies this complexity with its register aliasing.
The 64-bit register RAX contains the 32-bit EAX, which contains the 16-bit AX, which divides into the 8-bit AL and AH. Operations on smaller registers implicitly affect larger ones:
mov eax, 42 ; Sets lower 32 bits, zeros upper 32 bits
mov al, 0xFF ; Modifies lowest byte, preserves others
The allocator must track these dependencies. Assigning one virtual register to AL prevents using AX, EAX, or RAX for interfering values. LLVM models this through register units, atomic components of the register file:
class TargetRegisterInfo {
public:
  // Returns all register units overlapping with Reg
  ArrayRef<MCRegUnit> getRegisterUnits(unsigned Reg) const;

  // Check if two registers share any units
  bool regsOverlap(unsigned RegA, unsigned RegB) const {
    auto UnitsA = getRegisterUnits(RegA);
    auto UnitsB = getRegisterUnits(RegB);
    return hasIntersection(UnitsA, UnitsB);
  }
};
Function calling conventions add another layer. Some registers must preserve values across calls (callee-saved), while others may be clobbered (caller-saved). The allocator models calls as implicitly defining all caller-saved registers, creating interferences with live values crossing calls.
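In LLVM's machine IR, a call carries a register-mask operand listing the registers the callee preserves; anything outside the mask is treated as clobbered. A small query along these lines (a simplified sketch using the MachineOperand API, not code from the allocator itself) tells the allocator whether a value in a physical register survives a given call:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/MC/MCRegister.h"

using namespace llvm;

// Returns true if the call's register mask reports PhysReg as clobbered,
// i.e. a value kept in PhysReg does not survive the call.
static bool callClobbersPhysReg(const MachineInstr &Call, MCRegister PhysReg) {
  for (const MachineOperand &MO : Call.operands())
    if (MO.isRegMask() &&
        MachineOperand::clobbersPhysReg(MO.getRegMask(), PhysReg))
      return true;
  return false;
}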
Rematerialization as an Alternative to Spilling
Not all values require storing to memory when evicted from registers. Constants and easily recomputable values can be rematerialized, regenerating them when needed rather than loading from memory.
Loading an immediate constant illustrates the concept:
%v1 = add i32 17, 0 ; Trivially rematerializable
; Instead of:
; store i32 %v1, i32* %stack.slot
; %v1.reload = load i32, i32* %stack.slot
; Regenerate it:
; %v1.remat = add i32 17, 0
The allocator identifies rematerializable operations through target-specific hooks. Simple constant loads, address calculations, and certain arithmetic operations qualify:
bool isRematerializable(MachineInstr *MI) {
  // Ask the target whether this instruction can be recomputed freely
  if (TII->isTriviallyReMaterializable(*MI))
    return true;
  // Additional checks for specific patterns
  if (MI->getOpcode() == TargetOpcode::COPY) {
    unsigned SrcReg = MI->getOperand(1).getReg();
    return isConstantReg(SrcReg);
  }
  return false;
}
When rematerialization applies, the allocator avoids spill costs entirely. This proves especially valuable for frequently used constants that would otherwise occupy registers throughout large code regions.
Measuring Allocation Quality
Register allocation quality manifests through multiple metrics. Code size increases when splitting inserts extra copy instructions or when expensive register encodings become necessary. Execution time suffers from memory access latency when values spill.
LLVM provides statistics tracking allocation decisions:
STATISTIC(NumSpills, "Number of spills inserted");
STATISTIC(NumReloads, "Number of reloads inserted");
STATISTIC(NumCopies, "Number of copies from splitting");
STATISTIC(NumRemats, "Number of rematerializations");

void recordAllocationStats(LiveInterval *LI,
                           AllocationDecision Decision) {
  switch (Decision) {
  case Assigned:
    break;
  case Spilled:
    ++NumSpills;
    NumReloads += countUses(LI);
    break;
  case Split:
    NumCopies += countSplitPoints(LI);
    break;
  case Rematerialized:
    ++NumRemats;
    break;
  }
}
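These counters become visible when the code generator runs with statistics enabled (they are compiled in for assertion-enabled builds), for example:

llc -O2 -stats input.ll -o /dev/null

where input.ll stands in for any LLVM IR input file.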
Benchmark suites like SPEC CPU reveal allocation impact. Improvements of one to two percent in code size and three to ten percent in execution time separate good allocators from mediocre ones. These gains compound across entire programs, making allocation one of the highest-impact compiler optimizations.
The Evolution Beyond Pure Graph Coloring
Modern allocators recognize that graph coloring, while elegant theoretically, misses critical practical concerns. Building interference graphs consumes quadratic time and space. More importantly, the coloring step assumes equal importance for all nodes, ignoring that some values live in tight loops while others execute rarely.
Spill code placement matters more than coloring sophistication. Placing a single spill poorly inside a hot loop causes more damage than suboptimal coloring of ten cold values. Live range splitting creates opportunities impossible in rigid graph coloring frameworks.
LLVM's greedy allocator embraces flexibility. It makes decisions incrementally, adjusting strategy based on what worked for previously allocated intervals. It freely modifies code and live ranges when beneficial, inserting copies, commuting instructions, or rearranging blocks.
This pragmatic approach yields better results than purist graph coloring while compiling faster through avoiding expensive graph construction. The algorithm scales to large functions that would overwhelm traditional approaches.
Looking Forward
Register allocation research continues evolving. Machine learning models train on thousands of programs, learning allocation heuristics surpassing hand-coded rules. Deep neural networks approximate optimal coloring, trading compilation time for code quality.
Integer linear programming formulations find provably optimal solutions for small functions. Constraint solvers balance multiple competing objectives: minimizing spills, maximizing rematerialization, reducing copy instructions, preferring cheap register encodings.
Yet the fundamental challenge remains: mapping infinite virtual registers onto finite hardware resources while preserving program semantics and maximizing performance. LLVM's greedy allocator with global live range splitting represents current state-of-the-art production technology, generating code approaching hand-optimized assembly quality.
The compiler backend transforms high-level intentions into low-level reality. Register allocation stands as the crucial bridge where abstract computation meets concrete silicon constraints. Understanding its mechanisms reveals how modern compilers extract maximum performance from minimal hardware resources, turning elegant algorithms into blazingly fast machine code.