Beginner's first kernel CTF with CVE-2017-5123 — Store Halfword Byte-Reverse Indexed

The other kind of kernel hacking

Instead of writing mitigations, memory protections and sanitisers all day, I figured it'd be fun to get the team to try playing for the other team. It's a fun set of skills to learn, and it's a very hands-on way to understand why kernel hardening is so important. To that end, I decided to concoct a simple kernel CTF and enforce some mandatory fun. Putting this together, I had a few rules:

it should be exploiting a real world vulnerability
it should be on Power (since that's what we do around here)
it should be conceptually simple to understand (and not require knowledge of network stacks or sandboxes etc)
it's more important to be educational than to be realistic

So I threw something together and I think it did a decent job of meeting those targets, so let's go through it!

Stage 1: the bug

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;
    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
    }

    if (!err) {
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    user_access_begin();
    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user((short)info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_access_end();
    return err;
Efault:
    user_access_end();
    return -EFAULT;
}

This is the implementation of the waitid syscall in Linux v4.13, released in September 2017. For our purposes it doesn't matter what the syscall is supposed to do - there's a serious bug here that will let us do very naughty things. Try and spot it yourself, though it may not be obvious unless you're familiar with the kernel's user access routines.

#define put_user(x, ptr)                        \
({                                  \
    __typeof__(*(ptr)) __user *_pu_addr = (ptr);            \
                                    \
    access_ok(_pu_addr, sizeof(*(ptr))) ?               \
          __put_user(x, _pu_addr) : -EFAULT;            \
})

This is put_user() from arch/powerpc/include/asm/uaccess.h. The implementation goes deeper, but this tells us that the normal way the kernel would write to user memory involves calling access_ok() and only performing the write if the access was indeed OK (meaning the address is in user memory, not kernel memory). As the name may suggest, unsafe_put_user() skips that part, and for good reason - sometimes you want to do multiple user accesses at once. With SMAP/PAN/KUAP etc enabled, every put_user() will enable user access, perform its operation then disable it again, which is very inefficient. Instead, patterns like in waitid above are rather common - enable user access, perform a bunch of "unsafe" operations and then disable user access again.

The bug in waitid is that access_ok() is never called, and thus there is no validation that the user provided pointer *infop is pointing to user memory instead of kernel memory. Calling waitid and pointing into kernel memory allows unprivileged users to write into whatever they're pointing at. Neat! This is CVE-2017-5123, summarised as "Insufficient data validation in waitid allowed an user to escape sandboxes on Linux". It's a primitive that can be used for more than that, but that's what its discoverer used it for, escaping the Chrome sandbox.

If you're curious, there's a handful of different writeups exploiting this bug for different things that you can search for. I suppose I'm now joining them!

A tangent: API design

Failing to enforce that a user-provided address to write to is actually in userspace is a hefty mistake, one that wasn't caught until after the code made it all the way to a tagged release (though the only distro release I could find with it was Ubuntu 17.10-beta2). Linux is big, complicated, fast-moving, and all that - there's always going to be bugs. It's not possible to prevent the entire developer base from ever making mistakes, but you can design better APIs so mistakes like this are much less likely to happen.

Let's have a look at the waitid syscall implementation as it is in upstream Linux at the time of writing.

SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
        infop, int, options, struct rusage __user *, ru)
{
    struct rusage r;
    struct waitid_info info = {.status = 0};
    long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
    int signo = 0;

    if (err > 0) {
        signo = SIGCHLD;
        err = 0;
        if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
            return -EFAULT;
    }
    if (!infop)
        return err;

    if (!user_write_access_begin(infop, sizeof(*infop)))
        return -EFAULT;

    unsafe_put_user(signo, &infop->si_signo, Efault);
    unsafe_put_user(0, &infop->si_errno, Efault);
    unsafe_put_user(info.cause, &infop->si_code, Efault);
    unsafe_put_user(info.pid, &infop->si_pid, Efault);
    unsafe_put_user(info.uid, &infop->si_uid, Efault);
    unsafe_put_user(info.status, &infop->si_status, Efault);
    user_write_access_end();
    return err;
Efault:
    user_write_access_end();
    return -EFAULT;
}

Notice any differences? Not a lot has changed, but instead of an unconditional user_access_begin(), there's now a call to user_write_access_begin(). Not only have the user access functions been split into read and write (though whether there's actually read/write granularity under the hood depends on the MMU-specific implementation), but the _begin() function takes a pointer and the size of the write. And what do you think that's doing...

static __must_check inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
    if (unlikely(!access_ok(ptr, len)))
        return false;

    might_fault();

    allow_write_to_user((void __user *)ptr, len);
    return true;
}

That's right! The missing access_ok() check from v4.13 is now part of the API for enabling user access, so you can't forget it (without trying really hard). If there's something else you should be doing every time you call a function (i.e. access_ok() when calling user_access_begin()), it should probably just be part of the function, especially if there's a security implication.

This bug was fixed by adding in the missing access_ok() check, but it's very cool to see that bugs like this are now much less likely to get written.

Stage 2: the primitive

Before we do anything too interesting, we should figure out what we actually have here. We point our pointer at 0xc000000012345678 (an arbitrary kernel address) then take a look in gdb, revealing the following:

pwndbg> x/10 0xc000000012345678
0xc000000012345678:     17      0       1       0
0xc000000012345688:     2141    1001    1       0

So we know that we can at least set something to zero, and there's some potential for more mischief. We could fork() a lot to change the value of the PID to make our write a bit more arbitrary, but to not get too fancy I figured we should just see where we could get by setting something either to zero or to something non-zero.

A few targets came to mind. We could spray around where we think creds are located to try and overwrite the effective user ID of a process to 0, making it run as root. We could go after something like SELinux, aiming for flags like selinux_enabled and selinux_enforcing. I'm sure there's other sandbox-type controls we could try and escape from, too.

None of these were taking my CTF in the direction I wanted it to go (which was shellcode running in the kernel), so I decided to turn the realism down a notch and aim for exploiting a null pointer dereference. We'd map our shellcode to *0, induce a null pointer dereference in the kernel, and then our exploit would work. Right?

So we're just going to go for a classic privilege escalation. We start as an unprivileged user and end up as root. Easy.

Stage 3: the target

I found an existing exploit doing the same thing I wanted to do, so I just stole the target from that. It has some comments in French which don't really help me, but thankfully I found another version with some additional comments - in Chinese. Oh well. have_canfork_callback is a symbol that marks whether cgroup subsystems have a can_fork() callback that is checked when a fork is attempted. If we overwrite have_canfork_callback to be non-zero when can_fork is still NULL, then we win! We can reliably reproduce a null pointer dereference as soon as we fork().

I'm sure there's heaps of different symbols we could have hit, but this one has some nice properties. Any non-zero write is enough, we can trigger the dereference at a time in our control with fork(), and to cover our bases we can just set it back to 0 later.

In our case, we had debug info and a debugger, so finding where the symbol was located in memory is pretty easy. There's also /proc/kallsyms which is great if it's enabled. Linux on Power doesn't yet support KASLR which also saves us a headache or two here, and you can feel free to ask me why it's low on the priority list.

So now we have a null pointer dereference. Now let's get that doing something!

Stage 4: preparing the exploit

Virtual memory is one heck of a drug. If the kernel is going to execute from 0x0, we just need to mmap() to 0! Easy.

Well, it's not that easy. Turning any null pointer dereference into an easy attack vector is not ideal, so users aren't allowed to mmap to low address ranges, in our case, below PAGE_SIZE. Surely there's nothing in the kernel that would try to dereference a pointer + PAGE_SIZE? Maybe that's for a future CTF...

There's a sysctl for this, so in the actual CTF we just did sysctl -w vm.mmap_min_addr=0 and moved on for brevity. As I was writing this I decided to make sure it was possible to bypass this without cheating by making use of our kernel write primitive, and sure enough, it works! I had to zero out both mmap_min_addr and dac_mmap_min_addr symbols, the latter seemingly required for filesystem interactions to work post-exploit.

So now we can trigger a null pointer dereference in the kernel and we can mmap() our shellcode to 0x0, we should probably get some shellcode. We want to escalate our privileges, and the easiest way to do that is the iconic commit_creds(prepare_kernel_cred(0)).

prepare_kernel_cred() is intended to produce a credential for a kernel task. Passing 0/NULL gets you the same credential that init runs with, which is about as escalated as our privileges can get. commit_creds() applies the given credential to the currently running task - thus making our exploit run as root.

As of somewhat recently it's a bit more complex than that, but we're still back in v4.13, so we just need a way to execute that from a triggered null pointer dereference.

Stage 5: the shellcode

The blessing and curse of Power being a niche architecture is that it's hard to find existing exploits for. Perhaps lacking in grace and finesse, but effective nonetheless, is the shellcode I wrote myself:

    static const unsigned char shellcode[] = {
        0x00, 0x00, 0xc0, 0x3b, // li r30, 0
        0x20, 0x00, 0x9e, 0xe9, // ld r12,32(r30)
        0x00, 0x00, 0xcc, 0xfb, // std r30,0(r12)
        0x18, 0x00, 0x9e, 0xe9, // ld r12,24(r30)
        0xa6, 0x03, 0x89, 0x7d, // mtctr r12
        0x20, 0x04, 0x80, 0x4e, // bctr
    };

After the CTF I encouraged everyone to try writing their own shellcode and noone did, and I will take that as a sign that mine is flawlessly designed.

First we throw 0 into r30, which sounds like a register we'll get away with clobbering. We load an offset of 32 bytes from the value of r30 into r12 (and r30 is 0, so this is the address 32). Then, we store the value of r30 (which is 0) into the address in r12 - writing zero to the address found at *32.

Then, we replace the contents of r12 with the value contained at address 24. Then, we move that value into the count register, and branch to the count register - redirecting execution to the address found at *24.

I wrote it this way so participants would have to understand what the shellcode was trying to do to be able to get any use out of it. It expects two addresses to be placed immediately after it terminates and it's up to you to figure out what those addresses should be!

In our case, everyone figured out pretty quickly that *24 should point at our very classic privesc:

void get_root() {
    if (commit_creds && prepare_kernel_cred)
        commit_creds(prepare_kernel_cred(0));
}

Addresses for those kernel symbols need to be obtained first, but we're experts at that now. So we add in:

    *(unsigned long *)24 = (unsigned long)get_root;

And that part's sorted. How good is C?

Noone guessed what address we were zeroing, though, and the answer is have_canfork_callback. Without mending that, the kernel will keep attempting to execute from address 0, which we don't want. We only need it to do that once!

So we wrap up with

    *(unsigned long *)32 = have_canfork_callback;

and our shellcode's ready to go!

Stage 6: it doesn't work

We've had good progress so far - we needed a way to get the kernel to execute from address 0 and we found a way to do that, and we needed to mmap to 0 and we found a way to do that. And yet, running the exploit doesn't work. How come?

Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#2]

The MMU has ended our fun. KUEP is enabled (SMEP on x86, PXN on ARM) so the MMU is enforcing that the kernel can't execute from user addresses. I gave everyone a bit of a trick question here - how can you get around this purely from the qemu command line?

The way I did it wasn't to parse nosmep (and I'm not even sure that was implemented for powerpc in v4.13 anyway), it was to change from -cpu POWER9 to -cpu POWER8. Userspace execution prevention wasn't implemented in the MMU until POWER9, so reverting to an older processor was a cheeky way to get around that.

Stage 7: victory!

Putting all of that together, we have a successful privilege escalation from attacking the kernel.

/ $ ./exploit 
Overwriting mmap_min_addr...
Overwriting dac_mmap_min_addr...
Overwriting have_canfork_callback...
Successfully acquired root shell!
/ # whoami
root

It's wild to think that even an exploit this simple would have been possible in the "real world" back in 2017, so it really highlights the value of kernel hardening! It made for a good introduction to kernel exploitation for me and my team and wasn't too contrived for the sake of simplicity.

Whether you're a beginner or an expert at kernel exploitation (or somewhere vaguely in the middle like me), I hope you found this interesting. There's lots of great PoCs, writeups and papers out there to learn from and CTFs to try if you want to learn more!