Every time a Linux system runs a video call while compiling code in the background and streaming music in a third window, something has to decide who gets the CPU and for how long. That something is the scheduler, and for nearly two decades, the algorithm at the center of that decision was the Completely Fair Scheduler, known simply as CFS. It is not the most glamorous piece of the Linux kernel, but it is one of the most consequential. Understanding how it works means understanding how fairness can be encoded into a data structure at nanosecond precision.

The Problem CFS Was Built to Solve

Before CFS arrived in Linux 2.6.23 in October 2007, the dominant scheduler was the O(1) scheduler, named after its constant-time selection of the next process. On paper, O(1) was a triumph of algorithmic efficiency. In practice, it struggled on desktop workloads, where interactive processes like a text editor waiting for a keystroke needed to feel immediately responsive. The O(1) scheduler relied on complex heuristics to identify which tasks were "interactive" and which were not. Those heuristics were fragile, often mislabeling processes and causing sluggishness at the worst possible moments.

Ingo Molnar, the author of CFS, was inspired by Con Kolivas's Rotating Staircase Deadline scheduler, which introduced the concept of fair scheduling to Linux. Molnar took that concept and built something mathematically cleaner: a scheduler that would not guess whether a task was interactive, but would instead give every task precisely what it deserved based on a single central principle.

Virtual Runtime as the Unit of Fairness

The core idea behind CFS is deceptively elegant. Rather than assigning fixed time slices and switching tasks when the slice expires, CFS introduces the concept of virtual runtime, stored in a per-task field called vruntime. Virtual runtime tracks how much CPU time a task has consumed, scaled by the task's weight: at the default priority it advances in step with wall-clock time, while heavier, higher-priority tasks accumulate it more slowly. Think of it as a scorecard: a process with a low score has received less than its fair share, and a process with a high score has received more.

The sched_entity structure, embedded in every task_struct, is where this accounting lives:

struct sched_entity {
    struct load_weight  load;       /* weight derived from nice value */
    struct rb_node      run_node;   /* node in the red-black tree */
    u64                 exec_start; /* when the task last started running */
    u64                 vruntime;   /* accumulated virtual runtime (ns) */
    u64                 sum_exec_runtime;      /* total wall-clock CPU time */
    u64                 prev_sum_exec_runtime; /* snapshot taken when picked */
};

Every time the scheduler gets a chance to run, it calls update_curr(), which advances the vruntime of the currently executing task:

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    if (unlikely(!curr))  /* nothing running on this runqueue */
        return;

    delta_exec = now - curr->exec_start;  /* nanoseconds elapsed */
    curr->exec_start = now;
    curr->vruntime += calc_delta_fair(delta_exec, curr);
    update_min_vruntime(cfs_rq);
}

calc_delta_fair() is where priority enters the picture. For the same physical time elapsed, a higher-priority task's vruntime increases by less, and a lower-priority task's vruntime increases by more. The formula at the heart of it is:

weighted_delta = delta_exec * NICE_0_LOAD / curr->load.weight

NICE_0_LOAD is 1024, the default weight for a process at nice level 0. A process at nice -5 carries a weight of 3121, so its vruntime grows at roughly a third of the rate for the same CPU time, earning it roughly three times as many scheduling turns. No magic flags, no special queues: priority is just arithmetic applied to the same vruntime counter everyone else uses.
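The effect of that weight arithmetic is easy to check in userspace. Below is a toy model of calc_delta_fair(), not the kernel's implementation (which avoids the division with fixed-point arithmetic in __calc_delta()); the function name and plain integer math are illustrative, but the weights 1024 and 3121 come straight from the kernel's sched_prio_to_weight table.

```c
#include <stdint.h>

#define NICE_0_LOAD 1024ULL  /* weight of a nice-0 task */

/* Toy model of calc_delta_fair(): scale elapsed wall-clock time by
 * NICE_0_LOAD / weight to get the vruntime increment. */
static uint64_t calc_delta_fair_model(uint64_t delta_exec_ns, uint64_t weight)
{
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```

For one millisecond of CPU time, a nice-0 task's vruntime advances by the full 1000000 ns, while a nice -5 task (weight 3121) advances by only about 328000 ns, roughly a third of the rate.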

The Red-Black Tree as the Engine of Selection

Fairness requires a data structure capable of answering one question extremely fast: which task has the lowest vruntime right now? CFS answers this using a red-black tree, a self-balancing binary search tree where every node corresponds to a schedulable entity ordered by its vruntime value.

The CFS runqueue structure holds the tree and a cached pointer to its leftmost node:

struct cfs_rq {
    struct load_weight      load;
    unsigned int            nr_running;
    u64                     min_vruntime;
    struct rb_root_cached   tasks_timeline; /* red-black tree + leftmost cache */
    struct sched_entity     *curr;
    struct task_group       *tg;
};

The rb_root_cached variant of the red-black tree keeps a direct pointer to the leftmost node at all times. That means picking the next task to run requires no traversal whatsoever. Selection is effectively O(1), while insertion and removal of tasks cost O(log N), where N is the number of runnable processes. On a busy server with thousands of concurrent tasks, this asymmetry matters enormously.
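A toy model makes the ordering invariant concrete. The names below are illustrative, nothing here is kernel API, and a plain array with a linear scan stands in for the red-black tree; the point is only that "pick next" means "smallest vruntime", which the real tree answers in O(1) via its cached leftmost pointer instead of a scan.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_entity {
    int      pid;
    uint64_t vruntime;
};

/* Return the entity with the smallest vruntime -- the answer the
 * cached leftmost rb-tree node hands CFS without any traversal. */
static struct toy_entity *pick_next(struct toy_entity *rq, size_t n)
{
    struct toy_entity *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || rq[i].vruntime < best->vruntime)
            best = &rq[i];
    return best;
}

/* Three runnable tasks; pid 2 has consumed the least virtual time. */
static int demo_pick(void)
{
    struct toy_entity rq[] = { {1, 500000}, {2, 120000}, {3, 900000} };
    return pick_next(rq, sizeof(rq) / sizeof(rq[0]))->pid;
}
```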

When a task is enqueued, it is inserted into the tree through enqueue_entity(). When it is removed after sleeping or completing, dequeue_entity() pulls it out. The red-black tree rebalances itself automatically after every insertion and deletion, ensuring that the leftmost position always reflects the task with the genuinely smallest vruntime.

How Preemption Decisions Are Actually Made

After update_curr() advances the running task's vruntime, check_preempt_tick() decides whether it is time to switch:

static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;

    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    if (delta_exec > ideal_runtime) {
        resched_curr(rq_of(cfs_rq));
        clear_buddies(cfs_rq, curr);
        return;
    }

    if (delta_exec < sysctl_sched_min_granularity)
        return;

    se = __pick_first_entity(cfs_rq); /* leftmost node: smallest vruntime */
    delta = curr->vruntime - se->vruntime;

    if (delta < 0)  /* current task is still the furthest behind */
        return;

    if (delta > ideal_runtime)
        resched_curr(rq_of(cfs_rq));
}

sched_slice() calculates the task's proportional share of the current scheduling period, which defaults to 6 milliseconds. If the task has consumed more than its share, resched_curr() sets the TIF_NEED_RESCHED flag on the current thread, and the kernel will perform a context switch at the next safe opportunity. The minimum granularity guard of 0.75 milliseconds prevents pathological over-scheduling when hundreds of tasks compete for the same CPU.
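The proportional-share arithmetic behind sched_slice() can be sketched in a few lines. This is a simplified model: the real function also stretches the period when nr_running is large enough that equal slices would fall below the minimum granularity.

```c
#include <stdint.h>

#define SCHED_LATENCY_NS 6000000ULL  /* default scheduling period: 6 ms */

/* Simplified sched_slice(): a task's slice of the period is
 * proportional to its weight within the runqueue's total weight. */
static uint64_t sched_slice_model(uint64_t task_weight, uint64_t total_weight)
{
    return SCHED_LATENCY_NS * task_weight / total_weight;
}
```

Two nice-0 tasks split the 6 ms period into 3 ms slices; add a heavier task and the split shifts toward it while the period itself stays fixed.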

To observe the tunables live on a running Linux system:

# View current CFS scheduling tunable values (kernels before 5.13)
cat /proc/sys/kernel/sched_latency_ns         # target latency (ns)
cat /proc/sys/kernel/sched_min_granularity_ns # minimum granularity (ns)
cat /proc/sys/kernel/sched_wakeup_granularity_ns

# Kernel 5.13 and later moved these to debugfs
cat /sys/kernel/debug/sched/latency_ns
cat /sys/kernel/debug/sched/min_granularity_ns
cat /sys/kernel/debug/sched/wakeup_granularity_ns

On a typical desktop kernel these read 6000000, 750000, and 1000000 respectively. Adjusting them, via sysctl on older kernels or by writing to the debugfs files on newer ones, immediately affects scheduling behavior without a reboot.

How Sleeping Tasks Stay Fair

A scheduler that simply tracked accumulated CPU time would punish tasks that voluntarily sleep. A text editor waiting for user input accumulates no CPU time, and when the user finally types, the editor might find itself buried deep in the right side of the tree, far behind processes that never yielded. CFS handles this through min_vruntime, maintained in cfs_rq:

min_vruntime = MAX(min_vruntime,
                   MIN(curr->vruntime, leftmost_task->vruntime))

When a sleeping task wakes up, its vruntime is adjusted relative to the current min_vruntime before re-insertion into the tree. The task re-enters the competition near the left side without receiving an unfair windfall. In practice, this is why a terminal session feels snappy even when a CPU-hungry compile job is running in parallel. The terminal's vruntime stays close to min_vruntime during every sleep interval, so it is always near the front of the line the moment it needs the CPU.
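The wakeup adjustment amounts to a clamp. This sketch mirrors the shape of the kernel's place_entity() for waking tasks, which grants a bounded credit below min_vruntime (half the target latency when the GENTLE_FAIR_SLEEPERS feature is enabled); the names and the flat threshold here are simplifications.

```c
#include <stdint.h>

/* Sketch of wakeup placement: a waking task may land slightly to the
 * left of min_vruntime, but never keeps a stale vruntime from a long
 * sleep. Assumes min_vruntime > thresh_ns (no wraparound handling). */
static uint64_t place_on_wakeup(uint64_t old_vruntime,
                                uint64_t min_vruntime,
                                uint64_t thresh_ns)
{
    uint64_t lowest = min_vruntime - thresh_ns;
    return old_vruntime > lowest ? old_vruntime : lowest;
}
```

A task that slept through hours of runqueue progress re-enters just behind the leftmost tasks rather than monopolizing the CPU to "catch up"; a task that napped briefly keeps its own vruntime.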

To inspect the vruntime of running processes in real time:

# Show per-process scheduling statistics including vruntime
cat /proc/<pid>/sched

# Or watch all tasks sorted by vruntime via perf
perf sched record -- sleep 5
perf sched latency --sort max

Scheduling Entities and Group Scheduling

CFS does not schedule individual threads in isolation. It operates on the broader concept of scheduling entities, a design that lets the kernel treat a single thread, a group of threads, or all processes belonging to a particular cgroup as a single schedulable unit.

Without group scheduling, a user running 20 processes receives four times as much CPU as a user running 5, simply by having more entries in the runqueue. With CONFIG_FAIR_GROUP_SCHED enabled, each group is a single entity at the top level of the hierarchy. Inside the group, tasks compete among themselves. The cpu.shares file exposed through cgroup v1 (cgroup v2 expresses the same weight as cpu.weight) controls the relative weight of each group:

# Create two cgroups and assign different CPU shares
mkdir /sys/fs/cgroup/cpu/group_a
mkdir /sys/fs/cgroup/cpu/group_b
echo 1024 > /sys/fs/cgroup/cpu/group_a/cpu.shares  # default weight
echo 512  > /sys/fs/cgroup/cpu/group_b/cpu.shares  # half the weight

# Assign a process to group_a
echo <pid> > /sys/fs/cgroup/cpu/group_a/tasks

With these settings, group_a receives roughly twice the CPU time of group_b under contention, regardless of how many tasks each group contains. Android uses precisely this mechanism to isolate background processes from the foreground application, ensuring that a sync service cannot steal CPU cycles from the app currently on screen.
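The 2:1 split is just weight arithmetic, the same proportional formula that governs per-task slices applied one level up. A quick sanity check, with illustrative names:

```c
/* Under full contention, a group's share of the CPU is its shares
 * value divided by the sum over all sibling groups, regardless of
 * how many tasks run inside each group. */
static unsigned int cpu_percent(unsigned int shares, unsigned int total_shares)
{
    return 100u * shares / total_shares;
}
```

With the values above, group_a ends up with about two thirds of the CPU and group_b with about one third, whether group_b runs one task or a hundred.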

What CFS Reveals About Systems Design

The story of CFS is, in a way, the story of how a good abstraction outlasts its implementation details. The original pick-next policy at the core of CFS has already been augmented and, in Linux 6.6, superseded by EEVDF (Earliest Eligible Virtual Deadline First), which adds deadline-aware selection on top of the same red-black tree structure. The vruntime accounting persists. The fairness model persists. Only the selection algorithm at the top of the tree has been refined.

That durability is not accidental. By grounding every scheduling decision in a single monotonically increasing counter rather than a web of heuristics, CFS created a system where correctness is legible and behavior is predictable. A developer who understands vruntime understands why their process gets the CPU time it does. That transparency is rarer than it sounds in systems software, and it is ultimately what made CFS worth building on top of for nearly two decades.