How seccomp-bpf Uses Berkeley Packet Filter Logic to Shrink the Kernel Attack Surface

Every program running on Linux lives in a quiet arrangement with the kernel. It asks for services through system calls, the kernel decides whether to grant them, and execution continues. On a fully unrestricted system, roughly 400 such calls are available to every process, regardless of whether it needs more than a handful of them. A database engine does not need ptrace. A PDF renderer does not need mount. A network daemon has no business calling kexec_load. That gap between what a program needs and what it can call is not merely theoretical waste. It is exploitable surface, and every unnecessary syscall a compromised process can reach is one more potential path to privilege escalation.

seccomp-bpf closes that surface. It is a Linux kernel mechanism, stable since kernel 3.5, that lets a process install a filter written in Berkeley Packet Filter bytecode. From that point forward, every system call the process attempts passes through that filter before the kernel ever dispatches it. If the filter says no, the syscall never executes. No kernel module is required. No root privileges are needed. The mechanism is available to any unprivileged process that sets a single process attribute and loads a valid BPF program.

From Strict Mode to Programmable Filters: A Brief History of seccomp

The original seccomp, introduced in Linux 2.6.12 in 2005 by Andrea Arcangeli, was brutally minimal. After calling prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT), a process could use exactly four syscalls: read, write, _exit, and sigreturn. Anything else resulted in immediate termination with SIGKILL. The motivation was renting out CPU cycles for untrusted computation, where the goal was simply to prevent the rented code from doing anything beyond arithmetic.

Useful as that was for narrow compute workloads, it was far too coarse for real applications. A web browser rendering untrusted web content needs dozens of syscalls. A container runtime needs to constrain arbitrary workloads without knowing in advance exactly which syscalls they use. What seccomp needed was programmability, and that arrived in 2012 with the BPF-based filter mode, SECCOMP_MODE_FILTER. Rather than a fixed allowlist, the process supplies a small BPF program. The kernel runs that program for every syscall attempt and uses its return value to decide what to do. That design, the combination of seccomp's security model with BPF's programmable filtering, is what the community calls seccomp-bpf.

How the Kernel Evaluates a Filter at Syscall Time

Understanding where in the kernel the filter runs clarifies why seccomp-bpf is both effective and efficient. On x86-64, a syscall instruction saves the user-space instruction pointer, switches the CPU to ring 0, and jumps to the kernel's entry_SYSCALL_64 handler. The syscall number arrives in the rax register; arguments sit in rdi, rsi, rdx, r10, r8, and r9. The kernel builds a struct seccomp_data from those values, then runs any installed seccomp filters against it before ever consulting sys_call_table. If the filter says the call is forbidden, the handler never executes.

The struct seccomp_data that the BPF program receives looks like this in the kernel headers:

struct seccomp_data {
    int   nr;                   /* syscall number                  */
    __u32 arch;                 /* AUDIT_ARCH_* value              */
    __u64 instruction_pointer;  /* CPU instruction pointer         */
    __u64 args[6];              /* syscall arguments               */
};

The BPF program reads fields from this structure using load instructions, performs comparisons and conditional jumps, and terminates with a return instruction whose value tells the kernel what action to take. The main return actions, in descending order of severity, are: SECCOMP_RET_KILL_PROCESS terminates the entire process as though it had been killed by SIGSYS; SECCOMP_RET_KILL_THREAD (originally named SECCOMP_RET_KILL) kills only the calling thread; SECCOMP_RET_TRAP sends SIGSYS to the thread, which a signal handler may catch; SECCOMP_RET_ERRNO causes the syscall to return a specified error code without executing; SECCOMP_RET_LOG allows the syscall but records it to the audit log; and SECCOMP_RET_ALLOW permits the call unconditionally. Two further actions, SECCOMP_RET_USER_NOTIF and SECCOMP_RET_TRACE, rank between ERRNO and LOG and hand the decision to a user-space supervisor or a ptrace tracer respectively.
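The return value is a single 32-bit word: the upper 16 bits select the action and the lower 16 bits carry action-specific data, such as the errno for SECCOMP_RET_ERRNO. A minimal sketch of that layout follows; the helper names ret_errno, ret_action, and ret_data are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdint.h>

/* These masks mirror SECCOMP_RET_ACTION_FULL and SECCOMP_RET_DATA from
 * <linux/seccomp.h>; they are spelled out here so the layout is visible. */
#define RET_ACTION_MASK 0xffff0000U
#define RET_DATA_MASK   0x0000ffffU

/* Hypothetical helper: build "fail this syscall with errno = err". */
static uint32_t ret_errno(int err)
{
    return SECCOMP_RET_ERRNO | ((uint32_t)err & RET_DATA_MASK);
}

/* Hypothetical helpers: split a filter return value into its two parts. */
static uint32_t ret_action(uint32_t ret) { return ret & RET_ACTION_MASK; }
static uint32_t ret_data(uint32_t ret)   { return ret & RET_DATA_MASK; }
```

A filter that should refuse a call politely would end with BPF_STMT(BPF_RET | BPF_K, ret_errno(EPERM)) or the equivalent constant expression.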

When multiple filters are stacked on a process, each is evaluated and the most restrictive result wins. SECCOMP_RET_KILL_PROCESS always takes precedence over SECCOMP_RET_ALLOW, regardless of which filter returned which value.
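The precedence rule can be modelled in a few lines of user-space C. This is only a sketch of the selection logic, not kernel code, and combine_results is a hypothetical name. It assumes the usual two's-complement conversion when the action bits are reinterpreted as a signed value, which is what makes SECCOMP_RET_KILL_PROCESS (0x80000000) sort below every other action.

```c
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* User-space model only: given the values returned by each stacked filter
 * for one syscall, pick the value the kernel would act on. */
static uint32_t combine_results(const uint32_t *rets, size_t n)
{
    uint32_t winner = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++) {
        /* Compare only the action bits, as signed 32-bit values: the
         * smallest one is the most restrictive and therefore wins. */
        int32_t cur  = (int32_t)(rets[i] & 0xffff0000U);
        int32_t best = (int32_t)(winner  & 0xffff0000U);
        if (cur < best)
            winner = rets[i];
    }
    return winner;
}
```

With three filters returning ALLOW, ERRNO, and LOG, the model selects the ERRNO result, data bits included.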

seccomp-bpf uses classic BPF, known as cBPF, deliberately. Unlike extended BPF with its general-purpose capabilities, cBPF is simple enough that the kernel can verify it in bounded time, guarantee it terminates, and confirm it contains no loops. A cBPF program is limited to 4096 instructions and all branches must point forward. These constraints are not limitations to work around; they are the properties that make it safe to run untrusted filter code inside the kernel.
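Those structural rules are simple enough to express directly. The following is a simplified model of the load-time checks described above, not the kernel's actual verifier, and filter_shape_ok is a hypothetical name. It tests three things: the length bound, that every jump lands inside the program, and that the final instruction is a return.

```c
#include <linux/filter.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the load-time shape checks; the real kernel checker
 * validates considerably more (opcodes, load offsets, and so on). */
static bool filter_shape_ok(const struct sock_filter *f, size_t len)
{
    if (len == 0 || len > BPF_MAXINSNS)   /* BPF_MAXINSNS is 4096 */
        return false;

    for (size_t pc = 0; pc < len; pc++) {
        if (BPF_CLASS(f[pc].code) != BPF_JMP)
            continue;
        /* Jump offsets are unsigned and relative to the next instruction,
         * so they can only move forward; they must also stay in bounds. */
        size_t next = pc + 1;
        if (BPF_OP(f[pc].code) == BPF_JA) {
            if (next + f[pc].k >= len)
                return false;
        } else if (next + f[pc].jt >= len || next + f[pc].jf >= len) {
            return false;
        }
    }
    /* Execution must not be able to run off the end of the program. */
    return BPF_CLASS(f[len - 1].code) == BPF_RET;
}
```

Because offsets cannot be negative, a backward branch is not merely rejected; it cannot be encoded at all.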

Installing a Raw BPF Filter with prctl

Before a process can load a seccomp filter, it must either hold CAP_SYS_ADMIN in its namespace or set PR_SET_NO_NEW_PRIVS. The latter is the practical choice for unprivileged sandboxing: it prevents the process and all its descendants from gaining elevated privileges through setuid binaries or filesystem capabilities, making it safe for the kernel to accept filter installation from an unprivileged caller.
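The attribute is one-way: once set, it cannot be cleared. A small sketch of that behaviour is shown below, run in a forked child so the calling process is left untouched. The function name no_new_privs_sticks is hypothetical.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if, inside a child process, no_new_privs could be set, a later
 * attempt to clear it was refused, and the attribute still read back as 1. */
static int no_new_privs_sticks(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return 0;

    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
            _exit(2);
        /* The kernel accepts only the value 1 here, so this call fails. */
        int cleared = prctl(PR_SET_NO_NEW_PRIVS, 0, 0, 0, 0);
        int now = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
        _exit((cleared == -1 && now == 1) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```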

A minimal raw BPF filter that validates the calling architecture and blocks a specific syscall demonstrates the low-level structure directly. The following C example prevents any call to execve and kills the process if it is attempted:

#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

static void install_filter(void) {
    struct sock_filter filter[] = {
        /* Load the architecture field from seccomp_data */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, arch))),
        /* Kill the process if not running on x86-64 */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load the syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 (offsetof(struct seccomp_data, nr))),
        /* Block execve (__NR_execve is 59 on x86-64) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_execve, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Allow everything else */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Required before an unprivileged process may install a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_NO_NEW_PRIVS)");
        exit(EXIT_FAILURE);
    }
    /* A failure here means the process is running unfiltered */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        exit(EXIT_FAILURE);
    }
}

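The architecture check in the first two instructions is not optional paranoia. On x86-64 systems the kernel also supports 32-bit binaries through the int 0x80 entry point, and the two ABIs number their syscalls differently: open() is syscall 5 on i386 but 2 on x86-64, where number 5 is fstat(). A filter that checks only syscall numbers without verifying the architecture can be bypassed by switching to the 32-bit ABI and invoking the corresponding syscall number. Verifying AUDIT_ARCH_X86_64 first closes that gap. One caveat remains on kernels built with x32 support: x32 calls report the same AUDIT_ARCH_X86_64 value and are distinguished only by bit 30 (__X32_SYSCALL_BIT) in the syscall number, so a denylist such as this one should also reject numbers with that bit set.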
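A filter of this shape can be exercised without installing it. The sketch below pairs a deliberately tiny user-space evaluator with a variant of the execve filter. Both dry_run and the guarded array are hypothetical names, and the evaluator understands only the handful of opcodes used in this article. The variant adds one guard that the filter above does not have: it rejects syscall numbers with bit 30 set, which is how the x32 ABI marks its calls while still reporting AUDIT_ARCH_X86_64.

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X32_BIT 0x40000000U   /* same value as __X32_SYSCALL_BIT */

/* execve denylist with an extra x32 guard (an addition for illustration) */
static const struct sock_filter guarded[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* Any number at or above bit 30 is an x32 call: kill */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, X32_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* execve is 59 on x86-64 */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 59, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Hypothetical helper: evaluate a filter in user space. Supports 32-bit
 * absolute loads, JEQ/JGE against a constant, and constant returns;
 * anything else is treated as KILL_PROCESS. */
static uint32_t dry_run(const struct sock_filter *f, size_t len,
                        const struct seccomp_data *sd)
{
    uint32_t acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &f[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k > sizeof(*sd) - sizeof(acc))
                return SECCOMP_RET_KILL_PROCESS;
            memcpy(&acc, (const char *)sd + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_JMP | BPF_JGE | BPF_K:
            pc += (acc >= insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL_PROCESS;
        }
    }
    return SECCOMP_RET_KILL_PROCESS;   /* ran off the end */
}
```

Feeding the evaluator a seccomp_data with nr set to 59 yields SECCOMP_RET_KILL_PROCESS, while nr set to 0 (read) yields SECCOMP_RET_ALLOW. Changing arch to AUDIT_ARCH_I386 kills regardless of the number.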

libseccomp: Writing Filters Without Counting Bytes

Raw BPF filter construction is error-prone, architecture-dependent, and difficult to audit. libseccomp is the standard high-level library that generates correct BPF programs from a declarative API. It handles architecture normalization, instruction encoding, and filter optimization internally, exposing a straightforward three-phase interface: initialize a context with a default action, add rules for specific syscalls, load the filter into the kernel.

A practical libseccomp sandbox for a process that should only be able to read from stdin, write to stdout and stderr, and exit cleanly looks like this:

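#include <seccomp.h>
#include <unistd.h>

int main(void) {
    static const char msg[] = "Sandbox active.\n";
    scmp_filter_ctx ctx;
    int rc = 0;

    /* Default action: kill the process for any unlisted syscall */
    ctx = seccomp_init(SCMP_ACT_KILL_PROCESS);
    if (!ctx) return 1;

    /* Allow reading only from stdin (fd 0) */
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 1,
                           SCMP_A0(SCMP_CMP_EQ, 0));

    /* Allow writing only to stdout (fd 1) and stderr (fd 2).
       Separate rules for the same syscall are ORed together;
       a single "fd <= 2" comparison would also admit fd 0. */
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                           SCMP_A0(SCMP_CMP_EQ, 1));
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 1,
                           SCMP_A0(SCMP_CMP_EQ, 2));

    /* Allow clean termination */
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit),       0);
    rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* Load the filter into the kernel; from here on it is enforced */
    if (rc != 0 || seccomp_load(ctx) != 0) {
        seccomp_release(ctx);
        return 1;
    }
    seccomp_release(ctx);

    /* Any syscall not in the allowlist from this point kills the process.
       Use write(2) directly: stdio may issue calls such as fstat on first
       use, which are not on the allowlist. */
    if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0) return 1;
    return 0;
}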

Compile and link against libseccomp with gcc sandbox.c -o sandbox -lseccomp. The SCMP_SYS() macro resolves syscall names portably across architectures, and SCMP_A0() through SCMP_A5() construct argument comparisons. By default seccomp_load() also sets the no_new_privs attribute, so no separate prctl call is needed. The filter it installs is ordinary cBPF, the same kind of program a hand-written filter contains, but libseccomp generates the instructions from the rules and can order them for better runtime performance.

Verifying which syscalls a program actually uses before writing a filter is the sensible starting point. strace provides the necessary data:

# Record every syscall made by a program during a test run
strace -o syscalls.log -e trace=all ./myprogram

# Extract just the syscall names, sorted and deduplicated
grep -oP '^\w+' syscalls.log | sort -u

That list becomes the basis of the allowlist. Any syscall not observed during testing and not demonstrably required for correct operation is a candidate for blocking.

Audit Mode: Building Filters Safely in Production

The most common mistake when deploying seccomp filters is blocking a syscall that the program uses in a code path not exercised during testing. The result is a process that works under normal conditions and fails unexpectedly in production, often with an opaque crash rather than a meaningful error. SCMP_ACT_LOG exists specifically to prevent this failure mode.

Instead of starting with SCMP_ACT_KILL_PROCESS as the default action, a filter in development or audit mode uses SCMP_ACT_LOG. Every syscall that does not match an explicit allow rule is permitted to execute but recorded to the kernel audit log. The process continues running, and the log accumulates every syscall that would have been blocked under an enforcement policy:

/* Audit mode: log unmatched syscalls rather than killing the process */
ctx = seccomp_init(SCMP_ACT_LOG);

Reading the audit log after a representative workload run reveals which additional syscalls need to be added to the allowlist before switching to enforcement:

# View seccomp audit events from the kernel audit subsystem
ausearch -m SECCOMP | grep "syscall="

# Or via the kernel ring buffer if auditd is not running
# (seccomp records carry audit type 1326, AUDIT_SECCOMP)
dmesg | grep 'type=1326'

The kernel makes the actions available for logging configurable through the /proc/sys/kernel/seccomp/actions_logged interface, readable and writable as plain text. Actions listed there are eligible for audit logging; allow is never logged regardless of this setting, because logging every permitted syscall would be prohibitively expensive.
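A program that wants to confirm its chosen action will actually be logged can read that file and look for the action's name. The sketch below covers only the parsing step. The helper name action_is_logged is hypothetical, and it assumes the file content is a whitespace-separated list of lowercase action names such as kill_process, trap, errno, and log.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if `action` appears as a whole word in the
 * whitespace-separated text read from actions_logged. */
static bool action_is_logged(const char *list, const char *action)
{
    size_t want = strlen(action);
    const char *p = list;

    while (*p) {
        p += strspn(p, " \t\n");              /* skip separators     */
        size_t len = strcspn(p, " \t\n");     /* length of this word */
        if (len > 0 && len == want && strncmp(p, action, len) == 0)
            return true;
        p += len;
    }
    return false;
}
```

Matching whole words matters here: a plain substring search for "kill" would match both kill_process and kill_thread.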

Filter Inheritance, Stacking, and Thread Synchronization

A property of seccomp filters that carries significant security implications is inheritance. When a process that has installed a filter calls fork() or clone(), every child process inherits all of the parent's filters. Those filters cannot be removed or relaxed; a child can only add more restrictive filters on top of the inherited stack. The kernel's do_seccomp function enforces this at the system call level: filter installation always moves toward restriction, never away from it.

This inheritance model is what makes seccomp-bpf genuinely useful for sandboxing child processes. A process manager or container runtime installs a broad filter on itself, then forks a worker. The worker inherits that filter and may install a tighter one for its own operation. At no point can the worker escape the constraints imposed by its parent.
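Inheritance is easy to observe directly. In the sketch below a child process installs a filter that makes getppid() fail with EPERM, then forks a grandchild that tries the call. Everything runs in child processes so the caller is unaffected. The names filter_is_inherited and NATIVE_ARCH are hypothetical, getppid is an arbitrary choice of harmless syscall, and only x86-64 and aarch64 are handled.

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "set NATIVE_ARCH to this machine's AUDIT_ARCH_* value"
#endif

/* Returns 1 if a grandchild forked after filter installation still sees
 * getppid() fail with EPERM, i.e. the filter was inherited. */
static int filter_is_inherited(void)
{
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len    = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    pid_t child = fork();
    if (child < 0)
        return 0;
    if (child == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);

        pid_t grandchild = fork();   /* the new task receives the filter */
        if (grandchild < 0)
            _exit(3);
        if (grandchild == 0) {
            errno = 0;
            long r = syscall(__NR_getppid);
            _exit((r == -1 && errno == EPERM) ? 0 : 1);
        }
        int st = 0;
        if (waitpid(grandchild, &st, 0) < 0)
            _exit(4);
        _exit((WIFEXITED(st) && WEXITSTATUS(st) == 0) ? 0 : 1);
    }

    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```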

Multi-threaded programs introduce a complication. Seccomp filters apply per-thread, not per-process, and installing a filter in one thread does not automatically cover others. The SECCOMP_FILTER_FLAG_TSYNC flag addresses this: when passed to seccomp(2), it synchronizes the new filter to all threads of the process atomically. If any thread cannot accept the filter, the operation fails without modifying any thread.
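The flag is only reachable through seccomp(2), not through prctl. A minimal sketch of the call is shown below, issued through syscall(2). The wrapper name install_for_all_threads is hypothetical.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper. Returns 0 on success and -1 with errno set on
 * error. With TSYNC, a positive return value is the ID of a thread that
 * could not be synchronized; no thread was changed in that case. */
static long install_for_all_threads(const struct sock_fprog *prog)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
                   SECCOMP_FILTER_FLAG_TSYNC, prog);
}
```

Callers should treat any nonzero result as "not installed" and refuse to continue into code that relies on the sandbox.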

Where seccomp-bpf Is Already Doing Its Work

The technology is not experimental. It is active in production software that many administrators run daily, often without realizing it. Chromium has used seccomp-bpf to sandbox its renderer processes since version 23, confining the component most exposed to untrusted web content. OpenSSH has applied seccomp filtering to its unprivileged pre-authentication process since version 6.0. vsftpd uses it to lock down the FTP session handler. systemd can install a seccomp filter for any service it manages, configured through the SystemCallFilter= directive in service unit files:

# /etc/systemd/system/myservice.service
[Service]
ExecStart=/usr/bin/myprogram
SystemCallFilter=@system-service
SystemCallErrorNumber=EPERM

The @system-service group is a curated set of syscalls that systemd considers appropriate for a typical background service. Calls outside that set return EPERM rather than killing the process, which produces cleaner error handling in programs that do not expect SIGSYS. Systemd maintains several such named groups: @network-io, @file-system, @process, @basic-io, and others, making it straightforward to assemble a reasonable policy for a new service without enumerating hundreds of individual syscall names.

Container runtimes apply seccomp at a different layer. Docker's documentation describes its default profile as blocking around 44 syscalls out of more than 300, including mount, kexec_load, create_module, and (on kernels older than 4.8) ptrace, along with others that have no business in a typical application container. A custom profile in JSON format can be applied at container startup:

docker run --rm \
  --security-opt seccomp=/path/to/custom-profile.json \
  myimage

The profile format specifies a default action and a list of syscall-specific overrides with optional argument constraints, translating directly to the libseccomp filter that the container runtime installs before the workload process starts.

What seccomp-bpf Cannot Do and Why That Matters

The kernel documentation is direct on this point: seccomp-bpf is not a sandbox. It is a mechanism for reducing exposed kernel surface, and it is designed to be one layer among several. The BPF program operates on syscall numbers and their integer arguments, and it cannot dereference pointers. A filter can verify that open() is called, but it cannot inspect the filename string that the first argument points to. Filesystem access control remains the responsibility of the VFS permission system, mount namespaces, and tools like AppArmor or SELinux.
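What a filter can do with an argument is compare its raw 64-bit value, and because cBPF loads are 32 bits wide, that takes two loads per argument. The sketch below shows one way to lay that out. The helpers arg_lo, arg_hi, and emit_fd_equals are hypothetical names, and the fragment assumes the architecture and syscall number have already been checked by earlier instructions.

```c
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the low 32 bits of args[n], honouring host byte order. */
static uint32_t arg_lo(unsigned n)
{
    uint32_t base = (uint32_t)(offsetof(struct seccomp_data, args) + 8u * n);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base;
#else
    return base + 4;
#endif
}

/* Byte offset of the high 32 bits of args[n]. */
static uint32_t arg_hi(unsigned n)
{
    uint32_t base = (uint32_t)(offsetof(struct seccomp_data, args) + 8u * n);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base + 4;
#else
    return base;
#endif
}

/* Emits six instructions: ALLOW if args[0] == fd as a full 64-bit value,
 * otherwise fail the call with EPERM. Returns the instruction count. */
static size_t emit_fd_equals(struct sock_filter *out, uint32_t fd)
{
    size_t n = 0;
    out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, arg_hi(0));
    /* High half must be zero, else skip ahead to the EPERM return */
    out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 2);
    out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, arg_lo(0));
    /* Low half must equal fd: match jumps to ALLOW, mismatch falls through */
    out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, fd, 1, 0);
    out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM);
    out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    return n;
}
```

If args[0] were a pointer, the same two loads would yield only the address. Nothing in the instruction set can follow it into user memory.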

This constraint eliminates an entire class of time-of-check-time-of-use vulnerabilities that plagued earlier ptrace-based syscall interception systems. Because the filter only evaluates values present in the seccomp_data structure at the moment of the call, there is nothing to substitute between inspection and enforcement. The design accepts the limitation on what can be inspected in exchange for the guarantee that what is inspected is immutable.

Effective use of seccomp-bpf means understanding that boundary. It is the right tool for preventing a compromised process from calling ptrace on its parent, loading a kernel module, or remapping its own memory protections. It is not the right tool for restricting which files a process may open. Used alongside namespaces, capabilities, LSM policy, and careful privilege separation, it represents the kind of defense-in-depth that makes a compromised process an isolated problem rather than a system-wide one.