Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Context switching SPRs on PowerPC

Introduction

This post is a dive (well, more of a meander) through some of the PowerPC specific aspects of context switching, especially on the Special Purpose Register (SPR) handling. It was motivated by my recent work on adding kernel support for a hardware feature that interfaces with software through an SPR.

What's a context anyway?

The context we are concerned about in this post is the task context. That's all the state that makes up a 'task' in the eyes of the kernel. These are resources like registers, the thread ID, memory mappings, and so on. The kernel is able to save and restore this state into a task context data structure, allowing it to run arbitrarily many tasks concurrently despite the limited number of CPU cores available. It can simply save these resource values when switching out a task, and replace the CPU state with the values stored by the task being switched in.

Unless you're a long time kernel developer, chances are you haven't heard of or looked too closely at a 'task' in the kernel. The next section gives a rundown of tasks and processes, giving you a better frame of reference for the later context switching discussion.

Processes, threads, and tasks

To understand the difference between these three concepts, we'll start with how the kernel sees everything: tasks. Tasks are the kernel's view of an 'executable unit' (think single threaded process), a self contained thread of execution that has a beginning, performs some operations, then (maybe) ends. They are the indivisible building blocks upon which multitasking and multithreading can be built, where multiple tasks run independently, or optionally communicate in some manner to distribute work.

The kernel represents each task with a struct task_struct. This is an enormous struct (around 10KB) of all the pieces of data people have wanted to associate with a particular unit of execution over the decades. The architecture specific state of the task is stored in a one-to-one mapped struct thread_struct, available through the thread member in the task_struct. The name 'thread' when referring to this structure on a task should not be confused with the concept of a thread we'll visit shortly.

A task is highly flexible in terms of resource sharing. Many resources, such as the memory mappings and file descriptor tables, are held through reference counted handles to a backing data structure. This makes it easy to mix-and-match sharing of different components between other tasks.

Approaching tasks from the point of view of userspace, here we think of execution in terms of processes and threads. If you want an 'executable unit' in userspace, you are understood to be talking about a process or thread. These are implemented as tasks by the kernel though; a detail like running in userspace mode on the CPU is just another property stored in the task struct.

For an example of how tasks can either copy or share their parent's resources, consider what happens when creating a child process with the fork() syscall. The child will share memory and open files at the time of the fork, but further changes to these resources in either process are not visible to the other. At the time of the fork, the kernel simply duplicates the parent task and replaces relevant resources with copies of the parent's values1.

It is often useful to have multiple processes share things like memory and open files though: this is what threads provide. A process can be 'split' into multiple threads2, each backed by its own task. These threads can share resources that are normally isolated between processes.

This thread creation mechanism is very similar to process creation. The clone() family of syscalls allow creating a new thread that shares resources with the thread that cloned itself. Exactly what resources get shared between threads is highly configurable, thanks to the kernel's task representation. See the clone(2) manpage for all the options. Creating a process can be thought of as creating a thread where nothing is shared. In fact, that's how fork() is implemented under the hood: the fork() syscall is implemented as a thin wrapper around clone()'s implementation where nothing is shared, including the process group ID.

// kernel/fork.c  (approximately)

SYSCALL_DEFINE0(fork)
{
    struct kernel_clone_args args = {
        .exit_signal = SIGCHLD,
    };

    return kernel_clone(&args);
}

SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
{
    int err;

    struct kernel_clone_args kargs;
    pid_t set_tid[MAX_PID_NS_LEVEL];

    kargs.set_tid = set_tid;

    err = copy_clone_args_from_user(&kargs, uargs, size);
    if (err)
        return err;

    if (!clone3_args_valid(&kargs))
        return -EINVAL;

    return kernel_clone(&kargs);
}

pid_t kernel_clone(struct kernel_clone_args *args) {
    // do the clone
}

That's about it for the differences in processes, threads, and tasks from the point of view of the kernel. A key takeaway here is that, while you will often see processes and threads discussed with regards to userspace programs, there is very little difference under the hood. To the kernel, it's all just tasks with various degrees of shared resources.

Special Purpose Registers (SPRs)

As a final prerequisite, in this section we look at SPRs. The SPRs are CPU registers that, to put it simply, provide a catch-all set of functionalities for interacting with the CPU. They tend to affect or reflect the CPU state in various ways, though are very diverse in behaviour and purpose. Some SPRs, such as the Link Register (LR), are similar to GPRs in that you can read and write arbitrary data. They might have special interactions with certain instructions, such as the LR value being used as the branch target address of the blr (branch to LR) instruction. Others might provide a way to view or interact with more fundamental CPU state, such as the Authority Mask Register (AMR) which the kernel uses to disable accesses to userspace memory while in supervisor mode.

Most SPRs are per-CPU, much like GPRs. And, like GPRs, it makes sense for many of them to be tracked per-task, so that we can conceptually treat them as a per-task resource. But, depending on the particular SPR, writing to them can be slow. The highly used data-like SPRs such as LR, CTR (Count Register), etc., are possible to rename, making them comparable to GPRs in terms of read/write performance. But others that affect the state of large parts of the CPU core, such as the Data Stream Control Register (DSCR), can take a while for the effects of changing them to be applied. Well, not so slow that you will notice the occasional access, but there's one case in the kernel that occurs extremely often and needs to change a lot of these SPRs to support our per-task abstraction: context switches.

Anatomy of a context switch

Here we'll explore the actual function behind performing a context switch. We're interested in the SPR handling especially, because that's going to inform how we start tracking a new SPR on a per-task basis, so we'll be skimming over a lot of unrelated aspects.

We start our investigation in the aptly named context_switch() function in kernel/sched/core.c.

// kernel/sched/core.c

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)

Along with some scheduling metadata, we see it takes a previous task and a next task. As we discussed above, the struct task_struct type describes a unit of execution that defines (among other things) how to set up the state of the CPU.

This function starts off with some generic preparation and memory context changes3, before getting to the meat of the function with switch_to(prev, next,prev). This switch_to() call is actually a macro, which unwraps to a call to __switch_to(). It's also at this point that we enter the architecture specific implementation.

// arch/powerpc/include/asm/switch_to.h  (switch_to)
// arch/powerpc/kernel/process.c         (__switch_to)

struct task_struct *__switch_to(struct task_struct *prev,
                                struct task_struct *new)

Here we've only got our previous and next tasks to work with, focusing on just doing the switch.

Once again, we'll skip through most of the implementation. You'll see a few odds and ends being handled: asserting we won't be taking any interrupts, handling some TLB flushing, a copy-paste edge case, and some breakpoint handling on certain platforms. Then we reach what we were looking for: save_sprs(). The relevant section looks something like as follows

    /*
     * We need to save SPRs before treclaim/trecheckpoint as these will
     * change a number of them.
     */
    save_sprs(&prev->thread);

    /* Save FPU, Altivec, VSX and SPE state */
    giveup_all(prev);

    __switch_to_tm(prev, new);

    if (!radix_enabled()) {
        /*
         * We can't take a PMU exception inside _switch() since there
         * is a window where the kernel stack SLB and the kernel stack
         * are out of sync. Hard disable here.
         */
        hard_irq_disable();
    }

    /*
     * Call restore_sprs() and set_return_regs_changed() before calling
     * _switch(). If we move it after _switch() then we miss out on calling
     * it for new tasks. The reason for this is we manually create a stack
     * frame for new tasks that directly returns through ret_from_fork() or
     * ret_from_kernel_thread(). See copy_thread() for details.
     */
    restore_sprs(old_thread, new_thread);

The save_sprs() function itself does the following to its prev->thread argument.

// arch/powerpc/kernel/process.c

static inline void save_sprs(struct thread_struct *t)
{
#ifdef CONFIG_ALTIVEC
    if (cpu_has_feature(CPU_FTR_ALTIVEC))
        t->vrsave = mfspr(SPRN_VRSAVE);
#endif
#ifdef CONFIG_SPE
    if (cpu_has_feature(CPU_FTR_SPE))
        t->spefscr = mfspr(SPRN_SPEFSCR);
#endif
#ifdef CONFIG_PPC_BOOK3S_64
    if (cpu_has_feature(CPU_FTR_DSCR))
        t->dscr = mfspr(SPRN_DSCR);

    if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
        t->bescr = mfspr(SPRN_BESCR);
        t->ebbhr = mfspr(SPRN_EBBHR);
        t->ebbrr = mfspr(SPRN_EBBRR);

        t->fscr = mfspr(SPRN_FSCR);

        /*
         * Note that the TAR is not available for use in the kernel.
         * (To provide this, the TAR should be backed up/restored on
         * exception entry/exit instead, and be in pt_regs.  FIXME,
         * this should be in pt_regs anyway (for debug).)
         */
        t->tar = mfspr(SPRN_TAR);
    }

    if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
        t->hashkeyr = mfspr(SPRN_HASHKEYR);
#endif
}

Later, we set up the SPRs of the new task with restore_sprs():

// arch/powerpc/kernel/process.c

static inline void restore_sprs(struct thread_struct *old_thread,
                struct thread_struct *new_thread)
{
#ifdef CONFIG_ALTIVEC
    if (cpu_has_feature(CPU_FTR_ALTIVEC) &&
        old_thread->vrsave != new_thread->vrsave)
        mtspr(SPRN_VRSAVE, new_thread->vrsave);
#endif
#ifdef CONFIG_SPE
    if (cpu_has_feature(CPU_FTR_SPE) &&
        old_thread->spefscr != new_thread->spefscr)
        mtspr(SPRN_SPEFSCR, new_thread->spefscr);
#endif
#ifdef CONFIG_PPC_BOOK3S_64
    if (cpu_has_feature(CPU_FTR_DSCR)) {
        u64 dscr = get_paca()->dscr_default;
        if (new_thread->dscr_inherit)
            dscr = new_thread->dscr;

        if (old_thread->dscr != dscr)
            mtspr(SPRN_DSCR, dscr);
    }

    if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
        if (old_thread->bescr != new_thread->bescr)
            mtspr(SPRN_BESCR, new_thread->bescr);
        if (old_thread->ebbhr != new_thread->ebbhr)
            mtspr(SPRN_EBBHR, new_thread->ebbhr);
        if (old_thread->ebbrr != new_thread->ebbrr)
            mtspr(SPRN_EBBRR, new_thread->ebbrr);

        if (old_thread->fscr != new_thread->fscr)
            mtspr(SPRN_FSCR, new_thread->fscr);

        if (old_thread->tar != new_thread->tar)
            mtspr(SPRN_TAR, new_thread->tar);
    }

    if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
        old_thread->tidr != new_thread->tidr)
        mtspr(SPRN_TIDR, new_thread->tidr);

    if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) &&
        old_thread->hashkeyr != new_thread->hashkeyr)
        mtspr(SPRN_HASHKEYR, new_thread->hashkeyr);
#endif

}

The gist is we first perform a series of mfspr operations, saving the SPR values of the currently running task into its associated task_struct. Then we do a series of mtspr operations to restore the desired values of the new task back into the CPU.

This procedure has two interesting optimisations, as explained by the commit that introduces save_sprs() and restore_sprs():

powerpc: Create context switch helpers save_sprs() and restore_sprs()

Move all our context switch SPR save and restore code into two helpers. We do a few optimisations:

  • Group all mfsprs and all mtsprs. In many cases an mtspr sets a scoreboarding bit that an mfspr waits on, so the current practise of mfspr A; mtspr A; mfpsr B; mtspr B is the worst scheduling we can do.

  • SPR writes are slow, so check that the value is changing before writing it.

And that's basically it, as far as the implementation goes at least. When first investigating this one question that kept nagging me was: why do we read these values here, instead of tracking them as they are set? I can think of several reasons this might be done:

  1. It's more robust to check at the time of swap what the current value is. You otherwise risk that a single inline assembly mtspr in a completely different part of the codebase breaks context switching. It would also mean that every mtspr would have to disable interrupts, lest the context switch occurs between the mtspr and recording the change in the task struct.
  2. It might be faster. Changing an SPR would need an accompanying write to the task struct. This is pessimistic if there are many such changes to the SPR between context switches.
  3. Even if you were to track every mtspr correctly, certain SPRs can be changed by userspace without kernel assistance. Some of these SPRs are also unused by the kernel, so saving them with the GPRs would be pessimistic (a waste of time if the task ends up returning back to userspace without swapping). For example, VRSAVE is an unprivileged scratch register that the kernel doesn't make use of.

Did you catch the issue with restore_sprs()?

If you paid close attention, you might have noticed that the previous task being passed to restore_sprs() is not the same as the one being passed to save_sprs(). We have the following instead

struct task_struct *__switch_to(struct task_struct *prev,
    struct task_struct *new)
{
    // ...
    new_thread = &new->thread;
    old_thread = &current->thread;
    // ...
    save_sprs(&prev->thread);             // using prev->thread
    // ...
    restore_sprs(old_thread, new_thread); // using old_thread (current->thread)
    // ...
    last = _switch(old_thread, new_thread);
    // ...
    return last;
}

What gives? As far as I can determine, we require that the prev argument to __switch_to is always the currently running task (as opposed being in some dedicated handler or ill-defined task state during the switch). And on PowerPC, we can access the currently running task's thread struct through the current macro. So, in theory, current->thread is an alias for prev->thread. Anything else wouldn't make any sense here, as we are storing the SPR values into prev->thread, but making decisions about their values in restore_sprs() based on the current->thread saved values.

As for why we use both, it appears to be historical. We originally ran restore_sprs() after _switch(), which finishes swapping state from the original thread to the one being loaded in. This means our stack and registers are swapped out, so our prev variable we stored our current SPRs in is lost to us: it is now the prev of the task we just woke up. In fact, we've completely lost any handle to the task that just swapped itself out. Well, almost: that's where the last return value of _switch() comes in. This is a handle to the task that just went to sleep, and we were originally reloading old_thread based on this last value. However a future patch moved restore_sprs() to above the _switch() call thanks to an edge case with newly created tasks, but the use of old_thread apparently remained.

Conclusion

Congratulations, you are now an expert on several of the finer details of context switching on PowerPC. Well, hopefully you learned something new and/or interesting at least. I definitely didn't appreciate a lot of the finer details until I went down the rabbit hole of differentiating threads, processes, and tasks, and the whole situation with prev vs old_thread.

Bonus tidbit

This is completely unrelated, but the kernel's implementation of doubly-linked lists does not follow the classic implementation, where a list node contains a next, previous, and data pointer. No, if you look at the actual struct definition you will find

struct hlist_node {
    struct hlist_node *next, **pprev;
};

which decidedly does not contain any data component.

It turns out that the kernel expects you to embed the node as a field on the data struct, and the data-getter applies a mixture of macro and compiler builtin magic to do some pointer arithmetic to convert a node pointer into a pointer to the structure it belongs to. Naturally this is incredibly type-unsafe, but it's elegant in its own way.


  1. While this is conceptually what happens, the kernel can apply tricks to avoid the overhead of copying everything up front. For example, memory mappings apply copy-on-write (COW) to avoid duplicating all of the memory of the parent process. But from the point of view of the processes, it is no longer shared. 

  2. Or you could say that what we just called a 'process' is a thread, and the 'process' is really a process group initially containing a single thread. In the end it's semantics that don't really matter to the kernel though. Any thread/process can create more threads that can share resources with the parent. 

  3. Changing the active memory mapping has no immediate effect on the running code due to address space quadrants. In the hardware, the top two bits of a 64 bit effective address determine what memory mapping is applied to resolve it to a real address. If it is a userspace address (top two bits are 0) then the configured mapping is used. But if it is a kernel address (top two bits are 1) then the hardware always uses whatever mapping is in place for process ID 0 (the kernel knows this, so reserves process ID 0 for this purpose and does not allocate it to any userspace tasks). So our change to the memory mapping only applies once we return to userspace, or try to access memory through a userspace address (through get_user() and put_user()). The hypervisor has similar quadrant functionality, but different rules. 

Beginner's first kernel CTF with CVE-2017-5123

The other kind of kernel hacking

Instead of writing mitigations, memory protections and sanitisers all day, I figured it'd be fun to get the team to try playing for the other team. It's a fun set of skills to learn, and it's a very hands-on way to understand why kernel hardening is so important. To that end, I decided to concoct a simple kernel CTF and enforce some mandatory fun. Putting this together, I had a few rules:

  • it should be exploiting a real world vulnerability
  • it should be on Power (since that's what we do around here)
  • it should be conceptually simple to understand (and not require knowledge of network stacks or sandboxes etc)
  • it's more important to be educational than to be realistic

So I threw something together and I think it did a decent job of meeting those targets, so let's go through it!

Stage 1: the bug

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;
    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
    }

    if (!err) {
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

This is the implementation of the waitid syscall in Linux v4.13, released in September 2017. For our purposes it doesn't matter what the syscall is supposed to do - there's a serious bug here that will let us do very naughty things. Try and spot it yourself, though it may not be obvious unless you're familiar with the kernel's user access routines.

#define put_user(x, ptr)                        \
({                                  \
    __typeof__(*(ptr)) __user *_pu_addr = (ptr);            \
                                    \
    access_ok(_pu_addr, sizeof(*(ptr))) ?               \
          __put_user(x, _pu_addr) : -EFAULT;            \
})

This is put_user() from arch/powerpc/include/asm/uaccess.h. The implementation goes deeper, but this tells us that the normal way the kernel would write to user memory involves calling access_ok() and only performing the write if the access was indeed OK (meaning the address is in user memory, not kernel memory). As the name may suggest, unsafe_put_user() skips that part, and for good reason - sometimes you want to do multiple user accesses at once. With SMAP/PAN/KUAP etc enabled, every put_user() will enable user access, perform its operation then disable it again, which is very inefficient. Instead, patterns like in waitid above are rather common - enable user access, perform a bunch of "unsafe" operations and then disable user access again.

The bug in waitid is that access_ok() is never called, and thus there is no validation that the user provided pointer *infop is pointing to user memory instead of kernel memory. Calling waitid and pointing into kernel memory allows unprivileged users to write into whatever they're pointing at. Neat! This is CVE-2017-5123, summarised as "Insufficient data validation in waitid allowed an user to escape sandboxes on Linux". It's a primitive that can be used for more than that, but that's what its discoverer used it for, escaping the Chrome sandbox.

If you're curious, there's a handful of different writeups exploiting this bug for different things that you can search for. I suppose I'm now joining them!

A tangent: API design

Failing to enforce that a user-provided address to write to is actually in userspace is a hefty mistake, one that wasn't caught until after the code made it all the way to a tagged release (though the only distro release I could find with it was Ubuntu 17.10-beta2). Linux is big, complicated, fast-moving, and all that - there's always going to be bugs. It's not possible to prevent the entire developer base from ever making mistakes, but you can design better APIs so mistakes like this are much less likely to happen.

Let's have a look at the waitid syscall implementation as it is in upstream Linux at the time of writing.

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    if (!user_write_access_begin(infop, sizeof(*infop)))
        return -EFAULT;

    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_write_access_end();
    return err;
Efault:
    user_write_access_end();
    return -EFAULT;
}

Notice any differences? Not a lot has changed, but instead of an unconditional user_access_begin(), there's now a call to user_write_access_begin(). Not only have the user access functions been split into read and write (though whether there's actually read/write granularity under the hood depends on the MMU-specific implementation), but the _begin() function takes a pointer and the size of the write. And what do you think that's doing...

static __must_check inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
    if (unlikely(!access_ok(ptr, len)))
        return false;

    might_fault();

    allow_write_to_user((void __user *)ptr, len);
    return true;
}

That's right! The missing access_ok() check from v4.13 is now part of the API for enabling user access, so you can't forget it (without trying really hard). If there's something else you should be doing every time you call a function (i.e. access_ok() when calling user_access_begin()), it should probably just be part of the function, especially if there's a security implication.

This bug was fixed by adding in the missing access_ok() check, but it's very cool to see that bugs like this are now much less likely to get written.

Stage 2: the primitive

Before we do anything too interesting, we should figure out what we actually have here. We point our pointer at 0xc000000012345678 (an arbitrary kernel address) then take a look in gdb, revealing the following:

pwndbg> x/10 0xc000000012345678
0xc000000012345678:     17      0       1       0
0xc000000012345688:     2141    1001    1       0

So we know that we can at least set something to zero, and there's some potential for more mischief. We could fork() a lot to change the value of the PID to make our write a bit more arbitrary, but to not get too fancy I figured we should just see where we could get by setting something either to zero or to something non-zero.

A few targets came to mind. We could spray around where we think creds are located to try and overwrite the effective user ID of a process to 0, making it run as root. We could go after something like SELinux, aiming for flags like selinux_enabled and selinux_enforcing. I'm sure there's other sandbox-type controls we could try and escape from, too.

None of these were taking my CTF in the direction I wanted it to go (which was shellcode running in the kernel), so I decided to turn the realism down a notch and aim for exploiting a null pointer dereference. We'd map our shellcode to *0, induce a null pointer dereference in the kernel, and then our exploit would work. Right?

So we're just going to go for a classic privilege escalation. We start as an unprivileged user and end up as root. Easy.

Stage 3: the target

I found an existing exploit doing the same thing I wanted to do, so I just stole the target from that. It has some comments in French which don't really help me, but thankfully I found another version with some additional comments - in Chinese. Oh well. have_canfork_callback is a symbol that marks whether cgroup subsystems have a can_fork() callback that is checked when a fork is attempted. If we overwrite have_canfork_callback to be non-zero when can_fork is still NULL, then we win! We can reliably reproduce a null pointer dereference as soon as we fork().

I'm sure there's heaps of different symbols we could have hit, but this one has some nice properties. Any non-zero write is enough, we can trigger the dereference at a time in our control with fork(), and to cover our bases we can just set it back to 0 later.

In our case, we had debug info and a debugger, so finding where the symbol was located in memory is pretty easy. There's also /proc/kallsyms which is great if it's enabled. Linux on Power doesn't yet support KASLR which also saves us a headache or two here, and you can feel free to ask me why it's low on the priority list.

So now we have a null pointer dereference. Now let's get that doing something!

Stage 4: preparing the exploit

Virtual memory is one heck of a drug. If the kernel is going to execute from 0x0, we just need to mmap() to 0! Easy.

Well, it's not that easy. Turning any null pointer dereference into an easy attack vector is not ideal, so users aren't allowed to mmap to low address ranges, in our case, below PAGE_SIZE. Surely there's nothing in the kernel that would try to dereference a pointer + PAGE_SIZE? Maybe that's for a future CTF...

There's a sysctl for this, so in the actual CTF we just did sysctl -w vm.mmap_min_addr=0 and moved on for brevity. As I was writing this I decided to make sure it was possible to bypass this without cheating by making use of our kernel write primitive, and sure enough, it works! I had to zero out both mmap_min_addr and dac_mmap_min_addr symbols, the latter seemingly required for filesystem interactions to work post-exploit.

So now we can trigger a null pointer dereference in the kernel and we can mmap() our shellcode to 0x0, we should probably get some shellcode. We want to escalate our privileges, and the easiest way to do that is the iconic commit_creds(prepare_kernel_cred(0)).

prepare_kernel_cred() is intended to produce a credential for a kernel task. Passing 0/NULL gets you the same credential that init runs with, which is about as escalated as our privileges can get. commit_creds() applies the given credential to the currently running task - thus making our exploit run as root.

As of somewhat recently it's a bit more complex than that, but we're still back in v4.13, so we just need a way to execute that from a triggered null pointer dereference.

Stage 5: the shellcode

The blessing and curse of Power being a niche architecture is that it's hard to find existing exploits for. Perhaps lacking in grace and finesse, but effective nonetheless, is the shellcode I wrote myself:

    static const unsigned char shellcode[] = {
        0x00, 0x00, 0xc0, 0x3b, // li r30, 0
        0x20, 0x00, 0x9e, 0xe9, // ld r12,32(r30)
        0x00, 0x00, 0xcc, 0xfb, // std r30,0(r12)
        0x18, 0x00, 0x9e, 0xe9, // ld r12,24(r30)
        0xa6, 0x03, 0x89, 0x7d, // mtctr r12
        0x20, 0x04, 0x80, 0x4e, // bctr
    };

After the CTF I encouraged everyone to try writing their own shellcode and noone did, and I will take that as a sign that mine is flawlessly designed.

First we throw 0 into r30, which sounds like a register we'll get away with clobbering. We load an offset of 32 bytes from the value of r30 into r12 (and r30 is 0, so this is the address 32). Then, we store the value of r30 (which is 0) into the address in r12 - writing zero to the address found at *32.

Then, we replace the contents of r12 with the value contained at address 24. Then, we move that value into the count register, and branch to the count register - redirecting execution to the address found at *24.

I wrote it this way so participants would have to understand what the shellcode was trying to do to be able to get any use out of it. It expects two addresses to be placed immediately after it terminates and it's up to you to figure out what those addresses should be!

In our case, everyone figured out pretty quickly that *24 should point at our very classic privesc:

void get_root() {
    if (commit_creds && prepare_kernel_cred)
        commit_creds(prepare_kernel_cred(0));
}

Addresses for those kernel symbols need to be obtained first, but we're experts at that now. So we add in:

    *(unsigned long *)24 = (unsigned long)get_root;

And that part's sorted. How good is C?

Noone guessed what address we were zeroing, though, and the answer is have_canfork_callback. Without mending that, the kernel will keep attempting to execute from address 0, which we don't want. We only need it to do that once!

So we wrap up with

    *(unsigned long *)32 = have_canfork_callback;

and our shellcode's ready to go!

Stage 6: it doesn't work

We've had good progress so far - we needed a way to get the kernel to execute from address 0 and we found a way to do that, and we needed to mmap to 0 and we found a way to do that. And yet, running the exploit doesn't work. How come?

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#2]

The MMU has ended our fun. KUEP is enabled (SMEP on x86, PXN on ARM) so the MMU is enforcing that the kernel can't execute from user addresses. I gave everyone a bit of a trick question here - how can you get around this purely from the qemu command line?

The way I did it wasn't to parse nosmep (and I'm not even sure that was implemented for powerpc in v4.13 anyway), it was to change from -cpu POWER9 to -cpu POWER8. Userspace execution prevention wasn't implemented in the MMU until POWER9, so reverting to an older processor was a cheeky way to get around that.

Stage 7: victory!

Putting all of that together, we have a successful privilege escalation from attacking the kernel.

/ $ ./exploit 
Overwriting mmap_min_addr...
Overwriting dac_mmap_min_addr...
Overwriting have_canfork_callback...
Successfully acquired root shell!
/ # whoami
root

It's wild to think that even an exploit this simple would have been possible in the "real world" back in 2017, so it really highlights the value of kernel hardening! It made for a good introduction to kernel exploitation for me and my team and wasn't too contrived for the sake of simplicity.

Whether you're a beginner or an expert at kernel exploitation (or somewhere vaguely in the middle like me), I hope you found this interesting. There's lots of great PoCs, writeups and papers out there to learn from and CTFs to try if you want to learn more!

Going out on a Limb: Efficient Elliptic Curve Arithmetic in OpenSSL

So I've just managed to upstream some changes to OpenSSL for a new strategy I've developed for efficient arithmetic used in secp384r1, a curve prescribed by NIST for digital signatures and key exchange. In spite of its prevalence, its implementation in OpenSSL has remained somewhat unoptimised, even as less frequently used curves (P224, P256, P521) each have their own optimisations.

The strategy I have used could be called a 56-bit redundant limb implementation with Solinas reduction. Without too much micro-optimisation, we get ~5.5x speedup over the default (Montgomery Multiplication) implementation for creation of digital signatures.

How is this possible? Well first let's quickly explain some language:

Elliptic Curves

When it comes to cryptography, it's highly likely that those with a computer science background will be familiar with ideas such as key-exchange and private-key signing. The stand-in asymmetric cipher in a typical computer science curriculum is typically RSA. However, the heyday of Elliptic Curve ciphers has well and truly arrived, and their operation seems no less mystical than when they were just a toy for academia.

The word 'Elliptic' may seem to imply continuous mathematics. As a useful cryptographic problem, we fundamentally are just interested with the algebraic properties of these curves, whose points are elements of a finite field. Irrespective of the underlying finite field, the algebraic properties of the elliptic curve group can be shown to exist by an application of Bézout's Theorem. The group operator on points on an elliptic curve for a particular choice of field involves the intersection of lines intersecting either once, twice or thrice with the curve, granting notions of addition and doubling for the points of intersection, and giving the 'point at infinity' as the group identity. A closed form exists for computing a point double/addition in arbitrary fields (different closed forms can apply, but determined by the field's characteristic, and the same closed form applies for all large prime fields).

Our algorithm uses a field of the form \(\mathbb{F}_p\), that is the unique field with \(p\) (a prime) elements. The most straightforward construction of this field is arithmetic modulo \(p\). The other finite fields used in practise in ECC are of the form \(\mathbb{F}_{2^m}\) and are sometimes called 'binary fields' (representible as polynomials with binary coefficients). Their field structure is also used in AES through byte substitution, implemented by inversion modulo \(\mathbb{F}_{2^8}\).

From a performance perspective, great optimisations can be made by implementing efficient fixed-point arithmetic specialised to modulo by single prime constant, \(p\). From here on out, I'll be speaking from this abstraction layer alone.

Limbs

We wish to multiply two \(m\)-bit numbers, each of which represented with \(n\) 64-bit machine words in some way. Let's suppose just for now that \(n\) divides \(m\) neatly, then the quotient \(d\) is the minimum number of bits in each machine word that will be required for representing our number. Suppose we use the straightforward representation whereby the least significant \(d\) bits are used for storing parts of our number, which we better call \(x\) because this is crypto and descriptive variable names are considered harmful (apparently).

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

If we then drop the requirement for each of our \(n\) machine words (also referred to as a 'limb' from hereon out) to have no more than the least significant \(d\) bits populated, we say that such an implementation uses 'redundant limbs', meaning that the \(k\)-th limb has high bits which overlap with the place values represented in the \((k+1)\)-th limb.

Multiplication (mod p)

The fundamental difficulty with making modulo arithmetic fast is to do with the following property of multiplication.

Let \(a\) and \(b\) be \(m\)-bit numbers, then \(0 \leq a < 2^m\) and \(0 \leq b < 2^m\), but critically we cannot say the same about \(ab\). Instead, the best we can say is that \(0 \leq ab < 2^{2m}\). Multiplication can in the worst case double the number of bits that must be stored, unless we can reduce modulo our prime.

If we begin with non-redundant, 56-bit limbs, then for \(a\) and \(b\) not too much larger than \(2^{384} > p_{384}\) that are 'reduced sufficiently' then we can multiply our limbs in the following ladder, so long as we are capable of storing the following sums without overflow.

    /* and so on ... */

    out[5] = ((uint128_t) in1[0]) * in2[5]
           + ((uint128_t) in1[1]) * in2[4]
           + ((uint128_t) in1[2]) * in2[3]
           + ((uint128_t) in1[3]) * in2[2]
           + ((uint128_t) in1[4]) * in2[1]
           + ((uint128_t) in1[5]) * in2[0];

    out[6] = ((uint128_t) in1[0]) * in2[6]
           + ((uint128_t) in1[1]) * in2[5]
           + ((uint128_t) in1[2]) * in2[4]
           + ((uint128_t) in1[3]) * in2[3]
           + ((uint128_t) in1[4]) * in2[2]
           + ((uint128_t) in1[5]) * in2[1]
           + ((uint128_t) in1[6]) * in2[0];

    out[7] = ((uint128_t) in1[1]) * in2[6]
           + ((uint128_t) in1[2]) * in2[5]
           + ((uint128_t) in1[3]) * in2[4]
           + ((uint128_t) in1[4]) * in2[3]
           + ((uint128_t) in1[5]) * in2[2]
           + ((uint128_t) in1[6]) * in2[1];

    out[8] = ((uint128_t) in1[2]) * in2[6]
           + ((uint128_t) in1[3]) * in2[5]
           + ((uint128_t) in1[4]) * in2[4]
           + ((uint128_t) in1[5]) * in2[3]
           + ((uint128_t) in1[6]) * in2[2];

    /* ... and so forth */

This is possible, if we back each of the 56-bit limbs with a 64-bit machine word, with products being stored in 128-bit machine words. The numbers \(a\) and \(b\) were able to be stored with 7 limbs, whereas we use 13 limbs for storing the product. If \(a\) and \(b\) were stored non-redundantly, than each of the output (redundant) limbs must contain values less than \(6 \cdot 2^{56} \cdot 2^{56} < 2^{115}\), so there is no possibility of overflow in 128 bits. We even have room spare to do some additions/subtractions in cheap, redundant limb arithmetic.

But we can't keep doing our sums in redundant limb arithmetic forever, we must eventually reduce. Doing so may be expensive, and so we would rather reduce only when strictly necessary!

Solinas-ish Reduction

Our prime is a Solinas (Pseudo/Generalised-Mersenne) Prime. Mersenne Primes are primes expressible as \(2^m - 1\). This can be generalised to low-degree polynomials in \(2^m\). For example, another NIST curve uses \(p_{224} = 2^{224} - 2^{96} + 1\) (a 224-bit number) where \(p_{224} = f(2^{32})\) for \(f(t) = t^7 - t^3 + 1\). The simpler the choice of polynomial, the simpler the modular reduction logic.

Our choice of \(t\) is \(2^{56}\). Wikipedia the ideal case for Solinas reduction where the bitwidth of the prime is divisible by \(\log_2{t}\), but that is not our scenario. We choose 56-bits for some pretty simple realities of hardware. 56 is less than 64 (standard machine word size) but not by too much, and the difference is byte-addressible (\(64-56=8\)). Let me explain:

Just the Right Amount of Reduction (mod p)

Let's first describe the actual prime that is our modulus.

$$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$$

Yuck. This number is so yuck in fact, that noone has so far managed to upstream a Solinas' reduction method for it in OpenSSL, in spite of secp384r1 being the preferred curve for ECDH (Elliptic Curve Diffie-Hellman key exchange) and ECDSA (Elliptic Curve Digital Signature Algorithm) by NIST.

In 56-bit limbs, we would express this number so:

Let \(f(t) = 2^{48} t^6 - 2^{16} t^2 - 2^{40} t + (2^{32} - 1)\), then observe that all coefficients are smaller than \(2^{56}\), and that \(p_{384} = f(2^{56})\).

Now let \(\delta(t) = 2^{16} t^2 + 2^{40} t - 2^{32} + 1\), consider that \(p_{384} = 2^{384} - \delta(2^{56})\), and thus \(2^{384} \equiv \delta(2^{56}) \mod{p_{384}}\). From now on let's call \(\delta(2^{56})\) just \(\delta\). Thus, 'reduction' can be achieved as follows for suitable \(X\) and \(Y\):

$$ab = X + 2^{384} Y \equiv X + \delta Y \mod{p_{384}}$$

Calculating \(\delta Y\)

First Substitution

First make a choice of \(X\) and \(Y\). The first thing to observe here is that this can actually be made a large number of ways! We choose:

$$X_1 = \sum_{k=0}^8\texttt{in[k]} t^k$$
$$Y_1 = 2^8 t^2 \sum_{k=9}^{12}\texttt{in[k]} t^{k-9} = 2^8 \sum_{k=9}^{12}\texttt{in[k]} t^{k-7}$$

'Where does the \(2^8 t^{2}\) come from?' I hear you ask. See \(t^9 = t^2 \cdot t^7 = t^2 (2^8 \cdot 2^{384}) \equiv (2^8 t^2) \delta \mod{f(t)}\). It's clear to see that the place value of in[9] ... in[12] is greater than \(2^{384}\).

I'm using the subscripts here because we're in fact going to do a series of these reductions to reach a suitably small answer. That's because our equation for reducing \(t^7\) terms is as follows:

$$t^7 \equiv 2^8\delta \equiv 2^{24} t^2 + 2^{48} t + (-2^{40} + 2^8) \mod{f(t)}$$

Thus reducing in[12] involves computing:

$$\texttt{in[12]} t^{12} = \texttt{in[12]} (t^5)(t^7) \equiv 2^8\delta \cdot \texttt{in[12]} t^5 \mod{f(t)}$$

But \(\delta\) is a degree two polynomial, and so our numbers can still have two more limbs than we would want them to have. To be safe, let's store \(X_1 + \delta Y_1\) in accumulator limbs acc[0] ... acc[8] (this will at first appear to be one more limb than necessary), then we can eliminate in[12] with the following logic.

    /* assign accumulators to begin */
    for (int i = 0; i < 9; i++)
        acc[i] = in[i];

    /* X += 2^128 Y */
    acc[8] += in[12] >> 32;
    acc[7] += (in[12] & 0xffffffff) << 24;

    /* X += 2^96 Y */
    acc[7] += in[12] >> 8;
    acc[6] += (in[12] & 0xff) << 48;

    /* X += (-2^32 + 1) Y */
    acc[6] -= in[12] >> 16;
    acc[5] -= ((in[12] & 0xffff) << 40);
    acc[6] += in[12] >> 48;
    acc[5] += (in[12] & 0xffffffffffff) << 8;

Notice that for each term in \(\delta = 2^{128} + 2^{96} + (2^{32} - 1)\) we do two additions/subtractions. This is in order to split up operands in order to minimise the final size of numbers and prevent over/underflows. Consequently, we need an acc[8] to receive the high bits of our in[12] substitution given above.

Second Substitution

Let's try and now eliminate through substitution acc[7] and acc[8]. Let

$$X_2 = \sum^{6}_{k=0}\texttt{acc[k]}t^k $$
$$Y_2 = 2^8(\texttt{acc[7]} t^7 + \texttt{acc[8]} t^8)$$

But this time, \(\delta Y_2\) is a number that comfortably can take up just five limbs, so we can update acc[0], ..., acc[5] comfortably in-place.

Third Substitution

Finally, let's reduce all the high bits of in[6]. Since in[6] has place value \(t^6 = 2^{336}\), thus we wish to reduce all but the least significant \(384 - 336 = 48\) bits.

A goal in designing this algorithm is to ensure that acc[6] has as tight a bound as reasonably possible. Intuitively, if we can cause acc[6] to be as large as possible by absorbing the high bits of lower limbs, we reduce the number of bits that must be carried forward later on. As such, we perform a carry of the high-bits of acc[4], acc[5] into acc[6] before we begin our substitution.

Again, let

$$X_3 = \sum^{5}_{k=0}\texttt{acc[k]}t^k + (\texttt{acc[6]} \text{(low bits)})t^6$$
$$Y_3 = 2^{48}(\texttt{acc[6]} \text{(high bits, right shifted)}) t^6$$

The equation for eliminating \(2^{48}t^6\) is pretty straightforward:

$$2^{384} = 2^{48}t^6 \equiv 2^{16}t^2 + 2^{40}t + (-2^{32} + 1) \mod{f(t)}$$

Carries

Finally, as each of acc[0], ..., acc[6] can contain values larger than \(2^{56}\), we carry their respective high bits into acc[6] so as to remove any redundancy. Conveniently, our preemptive carrying before the third substitution has granted us a pretty tight bound on our final calculation - the final reduced number has the range \([0, 2^{384}]\).

Canonicalisation

This is 'just the right amount of reduction' but not canonicalisation. That is, since \(0 < p_{384} < 2^{384}\), there can be multiple possible reduced values for a given congruence class. felem_contract is a method which uses the fact that \(0 \leq x < 2 p_{384}\) to further reduce the output of felem_reduce into the range \([0, p_{384})\) in constant time.

This code has many more dragons I won't explain here, but the basic premise to the calculations performed there is as follows:

Given a 385 bit input, checking whether our input (expressed as a concatenation of bits) \(b_{384}b_{383} \ldots b_1b_0\) is greater than or equal to \(p_{384}\) whose bits we denote \(q_{384}, \ldots, q_0\) (\(q_{384} = 0\)) is determined by the following logical predicate (\(G(384)\)):

$$G(k) \equiv (b_k \land \lnot q_k) \lor ((b_k = q_k) \land G(k-1))$$
$$G(0) \equiv b_k = q_k$$

With \(p_{384}\) being a Solinas'/Pseudo-Mersenne Prime, it has a large number of contiguous runs of repeated bits, so we can of course use this to massively simplify our predicate. Doing this in constant time involves some interesting bit-shifting/masking schenanigans. Essentially, you want a bit vector of all ones/zeros depending on the value of \(G(384)\), we then logically 'and' with this bitmask to 'conditionally' subtract \(p_{384}\) from our result.

A Side Note about the Weird Constants

Okay so we're implementing our modular arithmetic with unsigned integer limbs that together represent a number of the following form:

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

How do we then do subtractions in a way which will make overflow impossible? Well computing \(a - b\) is really straightforward if every limb of \(a\) is larger than every limb of \(b\). We then add a suitable multiple of \(p_{384}\) to \(a\) that causes each limb of \(a\) to be sufficiently large.

Thankfully, with redundant-limb arithmetic, we can do this easily by means of telescopic sums. For example, in felem_reduce we wanted all limbs of our \(p_{384}\) multiple to be sufficiently large. We overshot any requirement and provided such a multiple which gives a lower bound \(2^{123}\). We first scale our prime accordingly so that its 'lead term' (speaking in the polynomial representation) is \(2^{124}\).

$$2^{76} f(t) = 2^{124} t^6 - 2^{92} t^2 - 2^{116} t + (2^{108} - 2^{76}) t^0$$

Notice that most limbs of this multiple (the limbs will be the coefficients) are either too small or negative. We then transform this expression into a suitable telescopic sum. Observe that when \(t = 2^{56}\), \(2^{124} t^k = 2^{124-56}t^{k+1} = 2^{68} t^{k+1}\), and so simply introduce into each limb where required a \(2^{124}\) term by means of addition, subtracting the same number from a higher limb.

$$ \begin{align*} 2^{76} f(t) &= (2^{124} - 2^{68}) t^6 \\ &+ (2^{124} - 2^{68}) t^5 \\ &+ (2^{124} - 2^{68}) t^4 \\ &+ (2^{124} - 2^{68}) t^3 \\ &+ (2^{124} - 2^{92} - 2^{68}) t^2 \\ &+ (2^{124} - 2^{116} - 2^{68}) t \\ &+ (2^{124} + 2^{108} - 2^{76}) \end{align*} $$

We can then subtract values whose limbs are no larger than the least of these limbs above without fear of underflows providing us with an incorrect result. In our case, that upper bound for limb value is \(2^{124} - 2^{116} - 2^{68} > 2^{123}\). Very comfortable.

Concerning Timing Side-Channels

Cryptographic routines must perform all of their calculations in constant time. More specifically, it is important that timing cryptography code should not reveal any private keys or random nonces used during computation. Ultimately, all of our work so far has been to speed up field arithmetic in the modulo field with prime \(p_{384}\). But this is done in order to facilitate calculations in the secp384r1 elliptic curve, and ECDSA/ECDH each depend on being able to perform scalar 'point multiplication' (repeat application of the group operator). Since such an operation is inherently iterative, it presents the greatest potential for timing attacks.

We implement constant-time multiplication with the wNAF ladder method. This relies on pre-computing a window of multiples of the group generator, and then scaling and selectively adding multiples when required. Wikipedia provides a helpful primer to this method by cumulatively building upon more naive approaches.

Conclusion

While the resulting code borrows from and uses common language of Solinas reduction, ultimately there are a number of implementation decisions that were guided by heuristic - going from theory to implementation was far from cut-and-dry. The limb size, carry order, choice of substitutions as well as pre and post conditions made here are ultimately arbitrary. You could easily imagine there being further refinements obtaining a better result. For now, I hope this post serves to demystify the inner workings of ECC implementations in OpenSSL. These algorithms, although particular and sophisticated, need not be immutable.

Quirks of parsing SSH configs

Introduction

I've been using the VSCodium Open Remote - SSH extension recently to great results. I can treat everything as a single environment, without any worry about syncing between my local development files and the remote. This is very different to mounting the remote as a network drive and opening a local instance of VSCodium on it: in addition to crippling latency on every action, a locally mounted drive doesn't bring the build context that tools like clangd require (e.g., system headers).

Instead, the remote extension runs a server on the remote that performs most actions, and the local VSCodium instance acts as a client that buffers and caches data seamlessly, so the experience is nearly as good as developing locally.

For example, a project wide file search on a network drive is unusably slow because every file and directory read requires a round trip back to the remote, and the latency is just too large to finish getting results back in a reasonable time. But with the client-server approach, the client just sends the search request to the server for it to fulfil, and all the server has to do is send the matches back. This eliminates nearly all the latency effects, except for the initial request and receiving any results.

However there has been one issue with using this for everything: the extension failed to connect when I wasn't on the same network as the host machine. So I wasn't able to use it when working from home over a VPN. In this post we find out why this happened, and in the process look at some of the weird quirks of parsing an SSH config.

The issue

As above, I wasn't able to connect to my remote machines when working from home. The extension would abort with the following error:

[Error  - 00:23:10.592] Error resolving authority
Error: getaddrinfo ENOTFOUND remotename.ozlabs.ibm.com
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:26)

So it's a DNS issue. This would make sense, as the remote machine is not exposed to the internet, and must instead be accessed through a proxy. What's weird is that the integrated terminal in VSCodium has no problem connecting to the remote. So the extension seems to be doing something different than just a plain SSH connection.

You might think that the extension is not reading the SSH config. But the extension panel lists all the host aliases I've declared in the config, so it's clearly aware of the config at least. Possibly it doesn't understand the proxy config correctly? If it was trying to connect directly from the host, it would make sense to fail a DNS lookup.

Investigating

Enough theorising, time to debug the extension as it tries to connect.

From the error above, the string "Error resolving authority" looks like something I can search for. This takes me to the catch case for a large try-catch block. It could be annoying to narrow down which part of the block throws the exception, but fortunately debugging is as easy as installing the dependencies and running the pre-configured 'Extension' debug target. This opens a new window with the local copy of the extension active, and I can debug it in the original window.

In this block, there is a conditional statement on whether the ProxyJump field is present in the config. This is a good place to break on and see what the computed config looks like. If it doesn't find a proxy then of course it's going to run everything on the host.

And indeed, it doesn't think there is a proxy. This is progress, but why does the extension's view of the config not match up with what SSH does? After all, invoking SSH directly connects properly. Tracing back the source of the config in the extension, it ultimately comes from manually reading in and parsing the SSH config. When resolving the host argument it manually computes the config as per ssh_config(5). Yet somewhere it makes a mistake, because it doesn't include the ProxyJump field.

Parsing SSH config

To get to the bottom of this, we need to know the rules behind parsing SSH configs. The ssh_config(5) manpage does a pretty decent job of explaining this, but I'm going to go over the relevant information here. I reckon most people have a vague idea of how it works, and can write enough to meet their needs, but have never looked deeper into the actual rules behind how SSH parses the config.

  1. For starters, the config is parsed line by line. Leading whitespace (i.e., indentation) is ignored. So, while indentation makes it look like you are configuring properties for a particular host, this isn't quite correct. Instead, the Host and Match lines are special statements that enable or disable all subsequent lines until the next Host or Match.

    There is no backtracking; previous conditions and lines are not re-evaluated after learning more about the config later on.

  2. When a config line is seen, and is active thanks to the most recent Host or Match succeeding, its value is selected if it is the first of that config to be selected. So the earliest place a value is set takes priority; this may be a little counterintuitive if you are used to having the latest value be picked, like enable/disable command line flags tend to work.

  3. When HostName is set, it replaces the host value in Match matches. It is also used as the Host value during a final pass (if requested).

  4. The last behaviour of interest is the Match final rule. There are several conditions a Match statement can have, and the final rule says make this active on the final pass over the config.

Wait, final pass? Multiple passes? Yes. If final is a condition on a Match, SSH will do another pass over the entire config, following all the rules above. Except this time all the configs we read on the first pass are still active (and can't be changed). But all the Host and Matches are re-evaluated, allowing other configs to potentially be set. I guess that means rule (1) ought to have a big asterisk next to it.

Together, these rules can lead to some quirky behaviours. Consider the following config

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

Host example
    HostName example.ozlabs.ibm.com

If I run ssh example on the command line, will it use the proxy?

By rule (1), no. When testing the first Match host condition, our host value is currently example. It is not until we reach the HostName config that we start using example.ozlabs.ibm.com for these matches.

But by rule (4), the answer turns into maybe. If we end up doing a second pass over the config thanks to a Match final that could be anywhere else, we would now be matching example.ozlabs.ibm.com against the first line on the second go around. This will pass, and, since nothing has set ProxyJump yet, we would gain the proxy.

You may think, yes, but we don't have a Match final in that example. But if you thought that, then you forgot about the system config.

The system config is effectively appended to the user config, to allow any system wide settings. Most of the time this isn't an issue because of the first-come-first-served rule with config matches (rule 2). But if the system config includes a Match final, it will trigger the entire config to be re-parsed, including the user section. And it so happens that, at least on Fedora with the openssh-clients package installed, the system config does contain a Match final (see /etc/ssh/ssh_config.d).

But wait, there's more! If we want to specify a custom SSH config file, then we can use -F path/to/config in the command line. But this disables loading a system config, so we would no longer get the proxy!

To sum up, for the above config:

  1. ssh example doesn't have a proxy
  2. ...unless a system config contains Match final
  3. ...but invoking it as ssh -F ~/.ssh/config example definitely won't have the proxy
  4. ...but if a subprocess invokes ssh example while trying to resolve another host, it'll probably not add the -F ~/.ssh/config, so we might get the proxy again (in the child process).

Wait, how did that last one slip in? Well, unlike environment variables, it's a lot harder for processes to propagate command line flags correctly. If resolving the config involves running a script that itself tries to run SSH, chances are the -F flag won't be propagated and you'll see some weird behaviour.

I swear that's all for now, you've probably learned more about SSH configs than you will ever need to care about.

Back to VSCodium

Alright, armed now with this knowledge on SSH config parsing, we can work out what's going on with the extension. It ends up being a simple issue: it doesn't apply rules (3) and (4), so all Host matches are done against the original host name.

In my case, there are several machines behind the proxy, but they all share a common suffix, so I had a Host *.ozlabs.ibm.com rule to apply the proxy. I also use aliases to refer to the machines without the .ozlabs.ibm.com suffix, so failing to follow rule (3) lead to the situation where the extension didn't think there was a proxy.

However, even if this were to be fixed, it still doesn't respect rule (4), or most complex match logic in general. If the hostname bug is fixed then my setup would work, but it's less than ideal to keep playing whack-a-mole with parsing bugs. It would be a lot easier if there was a way to just ask SSH for the config that a given host name resolves to.

Enter ssh -G. The -G flag asks SSH to dump the complete resolved config, without actually opening the connection (it may execute arbitrary code while resolving the config however!). So to fix the extension once and for all, we could swap the manual parser to just invoking ssh -G example, and parsing the output as the final config. No Host or Match or HostName or Match final quirks to worry about.

Sure enough, if we replace the config backend with this 'native' resolver, we can connect to all the machines with no problem. Hopefully the pull request to add this support will get accepted, and I can stop running my locally patched copy of the extension.

In general, I'd suggest avoiding any dependency on a second pass being done on the config. Resolve your aliases early, so that the rest of your matches work against the full hostname. If you later need to match against the name passed in the command line, you can use Match originalhost=example. The example above should always be written as

Host example
    HostName example.ozlabs.ibm.com

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

even if the reversed order might appear to work thanks to the weird interactions described above. And after learning these parser quirks, I find the idea of using Host match statements unreliable; that they may or may not be run against the HostName value allows for truely strange bugs to appear. Maybe you should remove this uncertainty by starting your config with Match final to at least always be parsed the same way.

Detecting rootless Docker

Trying to do some fuzzing...

The other day, for the first time in a while, I wanted to do something with syzkaller, a system call fuzzer that has been used to find literally thousands of kernel bugs. As it turns out, since the last time I had done any work on syzkaller, I switched to a new laptop, and so I needed to set up a few things in my development environment again.

While I was doing this, I took a look at the syzkaller source again and found a neat little script called syz-env, which uses a Docker image to provide you with a standardised environment that has all the necessary tools and dependencies preinstalled.

I decided to give it a go, and then realised I hadn't actually installed Docker since getting my new laptop. So I went to do that, and along the way I discovered rootless mode, and decided to give it a try.

What's rootless mode?

As of relatively recently, Docker supports rootless mode, which allows you to run your dockerd as a non-root user. This is helpful for security, as traditional "rootful" Docker can trivially be used to obtain root privileges outside of a container. Rootless Docker is implemented using RootlessKit (a fancy replacement for fakeroot that uses user namespaces) to create a new user namespace that maps the UID of the user running dockerd to 0.

You can find more information, including details of the various restrictions that apply to rootless setups, in the Docker documentation.

The problem

I ran tools/syz-env make to test things out. It pulled the container image, then gave me some strange errors:

ajd@jarvis-debian:~/syzkaller$ tools/syz-env make NCORES=1
gcr.io/syzkaller/env:latest
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path>

    ...

fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
go list -f '{{.Stale}}' ./sys/syz-sysgen | grep -q false || go install ./sys/syz-sysgen
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
make: *** [Makefile:155: descriptions] Error 1

After a bit of digging, I found that syz-env mounts the syzkaller source directory inside the container as a volume. make was running with UID 1000, while the files in the mounted volume appeared to be owned by root.

Reading the script, it turns out that syz-env invokes docker run with the --user option to set the UID inside the container to match the user's UID outside the container, to ensure that file ownership and permissions behave as expected.

This works in rootful Docker, where files appear inside the container to be owned by the same UID as they are outside the container. However, it breaks in rootless mode: due to the way RootlessKit sets up the namespaces, the user's UID is mapped to 0, causing the files to appear to be owned by root.

The workaround seemed pretty obvious: just skip the --user flag if running rootless.

How can you check whether your Docker daemon is running in rootless mode?

It took me quite a while, as a total Docker non-expert, to figure out how to definitively check whether the Docker daemon is running rootless or not. There's a variety of ways you could do this, such as checking the name of the current Docker context to see if it's called rootless (as used by the Docker rootless setup scripts), but I think the approach I settled on is the most correct one.

If you want to check whether your Docker daemon is running in rootless mode, use docker info to query the daemon's security options, and check for the rootless option.

docker info -f "{{println .SecurityOptions}}" | grep rootless

If this prints something like:

[name=seccomp,profile=builtin name=rootless name=cgroupns]

then you're running rootless.

If not, then you're running the traditional rootful.

Easy! (And I sent a fix which is now merged into syzkaller!)