Detecting rootless Docker

Wed 05 April 2023

Posted by Andrew Donnellan

Trying to do some fuzzing...

The other day, for the first time in a while, I wanted to do something with syzkaller, a system call fuzzer that has been used to find literally thousands of kernel bugs. As it turns out, since the last time I had done any work on syzkaller, I switched to a new laptop, and so I needed to set up a few things in my development environment again.

While I was doing this, I took a look at the syzkaller source again and found a neat little script called syz-env, which uses a Docker image to provide you with a standardised environment that has all the necessary tools and dependencies preinstalled.

I decided to give it a go, and then realised I hadn't actually installed Docker since getting my new laptop. So I went to do that, and along the way I discovered rootless mode, and decided to give it a try.

What's rootless mode?

As of relatively recently, Docker supports rootless mode, which allows you to run your dockerd as a non-root user. This is helpful for security, as traditional "rootful" Docker can trivially be used to obtain root privileges outside of a container. Rootless Docker is implemented using RootlessKit (a fancy replacement for fakeroot that uses user namespaces) to create a new user namespace that maps the UID of the user running dockerd to 0.

You can find more information, including details of the various restrictions that apply to rootless setups, in the Docker documentation.

The problem

I ran tools/syz-env make to test things out. It pulled the container image, then gave me some strange errors:

ajd@jarvis-debian:~/syzkaller$ tools/syz-env make NCORES=1
gcr.io/syzkaller/env:latest
warning: Not a git repository. Use --no-index to compare two paths outside a working tree
usage: git diff --no-index [<options>] <path> <path>

    ...

fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
fatal: detected dubious ownership in repository at '/syzkaller/gopath/src/github.com/google/syzkaller'
To add an exception for this directory, call:

        git config --global --add safe.directory /syzkaller/gopath/src/github.com/google/syzkaller
go list -f '{{.Stale}}' ./sys/syz-sysgen | grep -q false || go install ./sys/syz-sysgen
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
error obtaining VCS status: exit status 128
        Use -buildvcs=false to disable VCS stamping.
make: *** [Makefile:155: descriptions] Error 1

After a bit of digging, I found that syz-env mounts the syzkaller source directory inside the container as a volume. make was running with UID 1000, while the files in the mounted volume appeared to be owned by root.

Reading the script, it turns out that syz-env invokes docker run with the --user option to set the UID inside the container to match the user's UID outside the container, to ensure that file ownership and permissions behave as expected.

This works in rootful Docker, where files appear inside the container to be owned by the same UID as they are outside the container. However, it breaks in rootless mode: due to the way RootlessKit sets up the namespaces, the user's UID is mapped to 0, causing the files to appear to be owned by root.
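To make the mapping concrete, here is a small Python sketch of how a user-namespace UID map translates between host UIDs and the UIDs seen inside the namespace. The entry layout follows /proc/<pid>/uid_map (first ID inside, first ID outside, length). The specific values are assumptions chosen for illustration (host UID 1000 mapped to 0, plus a subordinate range starting at 100000); a real setup may differ.

```python
# Each entry mirrors one line of /proc/<pid>/uid_map:
# (first ID inside the namespace, first ID outside, number of IDs).
# These values are illustrative assumptions, not read from a real system.
EXAMPLE_ROOTLESS_UID_MAP = [
    (0, 1000, 1),        # the user running dockerd becomes UID 0 inside
    (1, 100000, 65536),  # hypothetical subordinate UID range
]


def uid_inside_namespace(host_uid, uid_map):
    """Translate a host UID to the UID seen inside the namespace, or None if unmapped."""
    for inside_start, outside_start, length in uid_map:
        if outside_start <= host_uid < outside_start + length:
            return inside_start + (host_uid - outside_start)
    return None


def uid_on_host(inside_uid, uid_map):
    """Translate a UID inside the namespace back to the host UID, or None if unmapped."""
    for inside_start, outside_start, length in uid_map:
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None
```

Under these assumed values, files owned by host UID 1000 show up as owned by UID 0 inside the namespace, while a process started with --user 1000 runs as inside-UID 1000, which corresponds to host UID 100999. That is a different user from the owner of the files, which is consistent with the ownership complaints above.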

The workaround seemed pretty obvious: just skip the --user flag if running rootless.

How can you check whether your Docker daemon is running in rootless mode?

It took me quite a while, as a total Docker non-expert, to figure out how to definitively check whether the Docker daemon is running rootless or not. There are a variety of ways you could do this, such as checking whether the current Docker context is named rootless (the name used by the Docker rootless setup scripts), but I think the approach I settled on is the most correct one.

If you want to check whether your Docker daemon is running in rootless mode, use docker info to query the daemon's security options, and check for the rootless option.

docker info -f "{{println .SecurityOptions}}" | grep rootless

If this prints something like:

[name=seccomp,profile=builtin name=rootless name=cgroupns]

then you're running rootless.

If not, then you're running traditional rootful Docker.
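If you'd rather not rely on grep matching the word anywhere in the line, the same check can be sketched in Python. This is only an illustration: it assumes the output has the bracketed, space-separated shape shown above, and the function names are made up for this example. The string would come from the docker info command above; no Docker call is made here.

```python
def parse_security_options(output):
    """Parse '[name=a,profile=b name=c]' into a list of dicts, one per option."""
    text = output.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    options = []
    for entry in text.split():
        fields = {}
        for pair in entry.split(","):
            key, _, value = pair.partition("=")
            fields[key] = value
        options.append(fields)
    return options


def is_rootless(output):
    """True if one of the security options is named exactly 'rootless'."""
    return any(opt.get("name") == "rootless" for opt in parse_security_options(output))


def user_args(output, uid):
    """Hypothetical helper: skip --user when rootless, otherwise pass the UID."""
    return [] if is_rootless(output) else ["--user", str(uid)]
```

Matching on the option name rather than on a substring means that, for example, a seccomp profile that happened to have "rootless" in its name would not be mistaken for rootless mode.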

Easy! (And I sent a fix which is now merged into syzkaller!)

Dumb bugs: the PCI device that wasn't

Tue 04 April 2023

Posted by Russell Currey

I was happily minding my own business one fateful afternoon when I received the following kernel bug report:

BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c000000264c26fdc by task swapper/0/1

Call Trace:
dump_stack_lvl+0x1bc/0x2b8 (unreliable)
print_report+0x3f4/0xc60
kasan_report+0x244/0x698
__asan_load4+0xe8/0x250
vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c
__machine_initcall_pseries_vio_device_init+0x94/0xf0
do_one_initcall+0x12c/0xaa8
kernel_init_freeable+0xa48/0xba8
kernel_init+0x64/0x400
ret_from_kernel_thread+0x5c/0x64

OK, so KASAN has helpfully found an out-of-bounds access in vga_arbiter_add_pci_device(). What the heck is that?

Why does my VGA require arbitration?

I'd never heard of the VGA arbiter in the kernel (do kids these days know what VGA is?), or vgaarb as it's called. What it does is irrelevant to this bug, but I found the history pretty interesting! Benjamin Herrenschmidt proposed VGA arbitration back in 2005 as a way of resolving conflicts between multiple legacy VGA devices that want to use the same address assignments. This was previously handled in userspace by the X server, but issues arose with multiple X servers on the same machine. Plus, it's probably not a good idea for this kind of thing to be handled by userspace. You can read more about the VGA arbiter in the kernel docs, but it's probably not something anyone has thought much about in a long time.

The bad access

static bool vga_arbiter_add_pci_device(struct pci_dev *pdev)
{
        struct vga_device *vgadev;
        unsigned long flags;
        struct pci_bus *bus;
        struct pci_dev *bridge;
        u16 cmd;

        /* Only deal with VGA class devices */
        if ((pdev->class >> 8) != PCI_CLASS_DISPLAY_VGA)
                return false;

We're blowing up on the read to pdev->class, and it's not something like the data being uninitialised, it's out-of-bounds. If we look back at the call trace:

vga_arbiter_add_pci_device+0x60/0xe00
pci_notify+0x88/0x444
notifier_call_chain+0x104/0x320
blocking_notifier_call_chain+0xa0/0x140
device_add+0xac8/0x1d30
device_register+0x58/0x80
vio_register_device_node+0x9ac/0xce0
vio_bus_scan_register_devices+0xc4/0x13c

This thing is a VIO device, not a PCI device! Let's jump into the caller, pci_notify(), to find out how we got our pdev.

static int pci_notify(struct notifier_block *nb, unsigned long action,
                      void *data)
{
        struct device *dev = data;
        struct pci_dev *pdev = to_pci_dev(dev);

So pci_notify() gets called with our VIO device (somehow), and we're converting that struct device into a struct pci_dev with no error checking. We could solve this particular bug by just checking that our device is actually a PCI device before we proceed - but we're in a function called pci_notify, we're expecting a PCI device to come in, so this would just be a bandaid.

to_pci_dev() works like other struct containers in the kernel - struct pci_dev contains a struct device as a member, so the container_of() macro returns an address based on where a struct pci_dev would have to be if the given struct device were actually a PCI device. Since we know it's not actually a PCI device and this struct device does not actually sit inside a struct pci_dev, our pdev is now pointing to some random place in memory, hence our access to a member like class is caught by KASAN.
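The pointer arithmetic behind container_of() can be sketched with Python's ctypes. The struct layouts below are invented for illustration and are far smaller than the real kernel structs; only the offset calculation mirrors what the macro does. No out-of-bounds memory is actually read, the sketch only compares addresses.

```python
import ctypes


class Device(ctypes.Structure):
    # Stand-in for struct device (hypothetical, tiny layout).
    _fields_ = [("id", ctypes.c_uint32)]


class PciDev(ctypes.Structure):
    # Stand-in for struct pci_dev: the embedded Device sits well into the struct.
    _fields_ = [
        ("vendor", ctypes.c_uint16),
        ("device", ctypes.c_uint16),
        ("class_", ctypes.c_uint32),  # 'class' is a reserved word in Python
        ("padding", ctypes.c_uint8 * 64),
        ("dev", Device),
    ]


class VioDev(ctypes.Structure):
    # Stand-in for struct vio_dev: a different, smaller container.
    _fields_ = [
        ("name", ctypes.c_char * 8),
        ("dev", Device),
    ]


def container_of(member_addr, container_type, member_name):
    """Address the container would have if member_addr were its member_name field."""
    return member_addr - getattr(container_type, member_name).offset


def points_inside(addr, obj):
    """True if addr falls within the memory occupied by the ctypes object."""
    start = ctypes.addressof(obj)
    return start <= addr < start + ctypes.sizeof(obj)
```

For a genuine PciDev, container_of() on the address of its dev member gives back the address of the PciDev itself. For a VioDev, the same calculation subtracts the wrong offset, so the address of the supposed class field lands outside the VioDev entirely, which is the kind of access KASAN flagged.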

Now we know why and how we're blowing up, but we still don't understand how we got here, so let's back up further.

Notifiers

The kernel's device subsystem allows consumers to register callbacks so that they can be notified of a given event. I'm not going to go into a ton of detail on how they work, because I don't fully understand them myself, and there are a lot of device subsystem internals involved. The best references I could find for this are notifier.h, and for our purposes here, the register notifier functions in bus.h.

Something's clearly gone awry if we can end up in a function named pci_notify() without passing it a PCI device. We find where the notifier is registered in vgaarb.c here:

static struct notifier_block pci_notifier = {
        .notifier_call = pci_notify,
};

static int __init vga_arb_device_init(void)
{
        /* some stuff removed here... */

        bus_register_notifier(&pci_bus_type, &pci_notifier);

This all looks sane. A blocking notifier is registered so that pci_notify() gets called whenever there's a notification going out to PCI buses. Our VIO device is distinctly not on a PCI bus, and in my debugging I couldn't find any potential causes of such confusion, so how on earth is a notification for PCI buses being applied to our non-PCI device?

Deep in the guts of the device subsystem, if we have a look at device_add() we find the following:

int device_add(struct device *dev)
{
        /* lots of device init stuff... */

        if (dev->bus)
                blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
                                             BUS_NOTIFY_ADD_DEVICE, dev);

If the device we're initialising is attached to a bus, then we call the bus notifier of that bus with the BUS_NOTIFY_ADD_DEVICE notification, and the device in question. So we're going through the process of adding a VIO device, and somehow calling into a notifier that's only registered for PCI devices. I did a bunch of debugging to see if our VIO device was somehow malformed and pointing to a PCI bus, or the struct subsys_private (that's the bus->p above) was somehow pointing to the wrong place, but everything seemed sane. My thesis of there being confusion while matching devices to buses was getting harder to justify - everything still looked sane.

Debuggers

I do not like debuggers. I am an avid printk() enthusiast. There's no real justification for this, a bunch of my problems could almost certainly be solved more easily by using actual tools, but my brain seemingly enjoys the routine of printing and building and running until I figure out what's going on. It was becoming increasingly obvious, however, that printk could not save me here, and we needed to go deeper.

Very thankfully for me, even though this bug was discovered on real hardware, it reproduces easily in QEMU, making iteration easy. With GDB attached to QEMU, it's time to dive into the guts of this issue and figure out what's happening.

Somehow, VIO buses are ending up with pci_notify() in their bus_notifier list. Let's break down the data structures here with a look at struct notifier_block:

struct notifier_block {
        notifier_fn_t notifier_call;
        struct notifier_block __rcu *next;
        int priority;
};

So notifier chains are singly linked lists. Callbacks are registered through functions like bus_register_notifier(), then after a long chain of breadcrumbs we reach notifier_chain_register() which walks the list of ->next pointers until it reaches NULL, at which point it sets ->next of the tail node to the struct notifier_block that was passed in. It's very important to note that the data being appended to the list is not just the callback function (i.e. pci_notify()), but the struct notifier_block itself (i.e. struct notifier_block pci_notifier from earlier). There's no new data being initialised, just updating a pointer to the object that was passed by the caller.

If you've guessed what our bug is at this point, great job! If the same struct notifier_block gets registered to two different bus types, then both of their bus_notifier fields will point to the same memory, and any further notifiers registered to either bus will end up being referenced by both since they walk through the same node.
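A toy model in Python shows the effect. This is a simplified sketch, not the kernel's implementation: it ignores priorities, locking and RCU, and the class and function names are invented for illustration. It only models the behaviour described above, where registration links the caller's node object into the chain.

```python
class NotifierBlock:
    """Toy stand-in for struct notifier_block: a callback name plus a next pointer."""

    def __init__(self, name):
        self.name = name
        self.next = None


class Bus:
    """Toy stand-in for a bus type with a notifier chain head."""

    def __init__(self, name):
        self.name = name
        self.head = None


def register(bus, block):
    """Append the caller's node itself to the tail of the bus's chain."""
    block.next = None  # the new tail has no successor
    if bus.head is None:
        bus.head = block
        return
    node = bus.head
    while node.next is not None:
        node = node.next
    node.next = block


def callbacks(bus):
    """Walk the chain and list the callbacks that would be invoked for this bus."""
    names = []
    node = bus.head
    while node is not None:
        names.append(node.name)
        node = node.next
    return names
```

If one NotifierBlock is registered on two Bus objects, both heads point at the same node, so anything later appended via one bus is also reachable by walking the other. With a separate NotifierBlock per bus, the chains stay independent.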

So we bust out the debugger and start looking at what ends up in bus_notifier for PCI and VIO buses with breakpoints and watchpoints.

Candidates

Walking the bus_notifier list gave me the following:

__gcov_.perf_trace_module_free
fail_iommu_bus_notify
isa_bridge_notify
ppc_pci_unmap_irq_line
eeh_device_notifier
iommu_bus_notifier
tce_iommu_bus_notifier
pci_notify

Time to find out if our assumption is correct - the same struct notifier_block is being registered to both bus types. Let's start going through them!

First up, we have __gcov_.perf_trace_module_free. Thankfully, I recognised this as complete bait. Trying to figure out what gcov and perf are doing here is going to be its own giant rabbit hole, and unless building without gcov makes our problem disappear, we skip this one and keep on looking. Rabbit holes in the kernel never end, we have to be strategic with our time!

Next, we reach fail_iommu_bus_notify, so let's take a look at that.

static struct notifier_block fail_iommu_bus_notifier = {
        .notifier_call = fail_iommu_bus_notify
};

static int __init fail_iommu_setup(void)
{
#ifdef CONFIG_PCI
        bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
#endif
#ifdef CONFIG_IBMVIO
        bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
#endif

        return 0;
}

Sure enough, here's our bug. The same node is being registered to two different bus types:

+------------------+
| PCI bus_notifier \
+------------------+\
                     \+-------------------------+    +-----------------+    +------------+
                      | fail_iommu_bus_notifier |----| PCI + VIO stuff |----| pci_notify |
                     /+-------------------------+    +-----------------+    +------------+
+------------------+/
| VIO bus_notifier /
+------------------+

when it should be like:

+------------------+    +-----------------------------+    +-----------+    +------------+
| PCI bus_notifier |----| fail_iommu_pci_bus_notifier |----| PCI stuff |----| pci_notify |
+------------------+    +-----------------------------+    +-----------+    +------------+

+------------------+    +-----------------------------+    +-----------+
| VIO bus_notifier |----| fail_iommu_vio_bus_notifier |----| VIO stuff |
+------------------+    +-----------------------------+    +-----------+

The fix

Ultimately, the fix turned out to be pretty simple:

Author: Russell Currey <ruscur@russell.cc>
Date:   Wed Mar 22 14:37:42 2023 +1100

    powerpc/iommu: Fix notifiers being shared by PCI and VIO buses

    fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
    PCI and VIO buses.  struct notifier_block is a linked list node, so this
    causes any notifiers later registered to either bus type to also be
    registered to the other since they share the same node.

    This causes issues in (at least) the vgaarb code, which registers a
    notifier for PCI buses.  pci_notify() ends up being called on a vio
    device, converted with to_pci_dev() even though it's not a PCI device,
    and finally makes a bad access in vga_arbiter_add_pci_device() as
    discovered with KASAN:

    [stack trace redacted, see above]

    Fix this by creating separate notifier_block structs for each bus type.

    Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection")
    Reported-by: Nageswara R Sastry <rnsastry@linux.ibm.com>
    Signed-off-by: Russell Currey <ruscur@russell.cc>

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..6f1117fe3870 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -171,17 +171,26 @@ static int fail_iommu_bus_notify(struct notifier_block *nb,
         return 0;
 }

-static struct notifier_block fail_iommu_bus_notifier = {
+/*
+ * PCI and VIO buses need separate notifier_block structs, since they're linked
+ * list nodes.  Sharing a notifier_block would mean that any notifiers later
+ * registered for PCI buses would also get called by VIO buses and vice versa.
+ */
+static struct notifier_block fail_iommu_pci_bus_notifier = {
+        .notifier_call = fail_iommu_bus_notify
+};
+
+static struct notifier_block fail_iommu_vio_bus_notifier = {
         .notifier_call = fail_iommu_bus_notify
 };

 static int __init fail_iommu_setup(void)
 {
 #ifdef CONFIG_PCI
-        bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
+        bus_register_notifier(&pci_bus_type, &fail_iommu_pci_bus_notifier);
 #endif
 #ifdef CONFIG_IBMVIO
-        bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
+        bus_register_notifier(&vio_bus_type, &fail_iommu_vio_bus_notifier);
 #endif

         return 0;

Easy! Problem solved. The commit that introduced this bug back in 2012 was written by the legendary Anton Blanchard, so it's always a treat to discover an Anton bug. Ultimately this bug is of little consequence, but it's always fun to catch dormant issues with powerful tools like KASAN.

In conclusion

I think this bug provides a nice window into what kernel debugging can be like. Thankfully, things are made easier by not dealing with any specific hardware and being easily reproducible in QEMU.

Bugs like this have an absurd amount of underlying complexity, but you rarely need to understand all of it to comprehend the situation and discover the issue. I spent way too much time digging into device subsystem internals, when the odds of the issue lying within were quite low - the combination of IBM VIO devices and VGA arbitration isn't exactly common, so searching for potential issues within the guts of a heavily utilised subsystem isn't going to yield results very often.

Is there something haunted in the device subsystem? Is there something haunted inside the notifier handlers? It's possible, but assuming the core guts of the kernel have a baseline level of sanity helps to let you stay focused on the parts more likely to be relevant.

Finally, the process was made much easier by having good code navigation. A ludicrous number of kernel developers still use plain vim or Emacs, maybe with tags if you're lucky, and get by on git grep (not even ripgrep!) and memory. Sort yourselves out and get yourself an editor with LSP support. I personally use Doom Emacs with clangd, and with the amount of jumping around the kernel I had to do to solve this bug, it would've been a much bigger ordeal without that power.

If you enjoyed the read, why not follow me on Mastodon or check out Ben's account of another cursed bug! Thanks for stopping by.

Dumb bugs: When a date breaks booting the kernel

Fri 24 March 2023

Posted by Benjamin Gray

The setup

I've recently been working on internal CI infrastructure for testing kernels before sending them to the mailing list. As part of this effort, I became interested in reproducible builds. Minimising the changing parts outside of the source tree itself could improve consistency and ccache hits, which is great for trying to make the CI faster and more reproducible across different machines. This means removing 'external' factors like timestamps from the build process: the time changes on every build, so two builds of the same tree no longer produce identical binaries. A changing timestamp also prevents reuse of previously cached results, potentially slowing down builds (though it turns out the kernel does a good job of limiting the scope of where timestamps appear in the build).

As part of this effort, I came across the KBUILD_BUILD_TIMESTAMP environment variable. This variable is used to set the kernel timestamp, which is primarily for any users who want to know when their kernel was built. That's mostly irrelevant for our work, so an easy KBUILD_BUILD_TIMESTAMP=0 later and... it still uses the current date.

OK, checking the documentation, it says:

Setting this to a date string overrides the timestamp used in the UTS_VERSION definition (uname -v in the running kernel). The value has to be a string that can be passed to date -d. The default value is the output of the date command at one point during build.

So it looks like the timestamp variable is actually expected to be a date format. To make it obvious that it's not a 'real' date, let's set KBUILD_BUILD_TIMESTAMP=0000-01-01. A bunch of zeroes (and the ones to make it a valid month and day) should tip off anyone to the fact it's invalid.

As an aside, this is a different date to what I tried to set it to earlier; a 'timestamp' typically refers to the number of seconds since the UNIX epoch (1970), so my first attempt would have corresponded to 1970-01-01. But given we're passing a date, not a timestamp, there should be no problem setting it back to the year 0. And I like the aesthetics of 0000 over 1970.

Building and booting the kernel, we see #1 SMP 0000-01-01 printed as the build timestamp. Success! After confirming everything works, I set the environment variable in the CI jobs and call it a day.

An unexpected error

A few days later I need to run the CI to test my patches, and something strange happens. It builds fine, but the boot tests that load a root disk image fail inexplicably: there is a kernel panic saying "VFS: Unable to mount root fs on unknown-block(253,2)".

[    0.909648][    T1] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(253,2)
[    0.909797][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc2-g065ffaee7389 #8
[    0.909880][    T1] Hardware name: IBM pSeries (emulated by qemu) POWER8 (raw) 0x4d0200 0xf000004 of:SLOF,HEAD pSeries
[    0.910044][    T1] Call Trace:
[    0.910107][    T1] [c000000003643b00] [c000000000fb6f9c] dump_stack_lvl+0x70/0xa0 (unreliable)
[    0.910378][    T1] [c000000003643b30] [c000000000144e34] panic+0x178/0x424
[    0.910423][    T1] [c000000003643bd0] [c000000002005144] mount_block_root+0x1d0/0x2bc
[    0.910457][    T1] [c000000003643ca0] [c000000002005720] prepare_namespace+0x1d4/0x22c
[    0.910487][    T1] [c000000003643d20] [c000000002004b04] kernel_init_freeable+0x36c/0x3bc
[    0.910517][    T1] [c000000003643df0] [c000000000013830] kernel_init+0x30/0x1a0
[    0.910549][    T1] [c000000003643e50] [c00000000000df94] ret_from_kernel_thread+0x5c/0x64
[    0.910587][    T1] --- interrupt: 0 at 0x0
[    0.910794][    T1] NIP:  0000000000000000 LR: 0000000000000000 CTR: 0000000000000000
[    0.910828][    T1] REGS: c000000003643e80 TRAP: 0000   Not tainted  (6.3.0-rc2-g065ffaee7389)
[    0.910883][    T1] MSR:  0000000000000000 <>  CR: 00000000  XER: 00000000
[    0.910990][    T1] CFAR: 0000000000000000 IRQMASK: 0
[    0.910990][    T1] GPR00: 0000000000000000 c000000003644000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR12: 0000000000000000 0000000000000000 c000000000013808 0000000000000000
[    0.910990][    T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.910990][    T1] GPR28: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.911371][    T1] NIP [0000000000000000] 0x0
[    0.911397][    T1] LR [0000000000000000] 0x0
[    0.911427][    T1] --- interrupt: 0
qemu-system-ppc64: OS terminated: OS panic: VFS: Unable to mount root fs on unknown-block(253,2)

Above the panic was some more context, saying

[    0.906194][    T1] Warning: unable to open an initial console.
...
[    0.908321][    T1] VFS: Cannot open root device "vda2" or unknown-block(253,2): error -2
[    0.908356][    T1] Please append a correct "root=" boot option; here are the available partitions:
[    0.908528][    T1] 0100           65536 ram0
[    0.908657][    T1]  (driver?)
[    0.908735][    T1] 0101           65536 ram1
[    0.908744][    T1]  (driver?)
...
[    0.909216][    T1] 010f           65536 ram15
[    0.909226][    T1]  (driver?)
[    0.909265][    T1] fd00         5242880 vda
[    0.909282][    T1]  driver: virtio_blk
[    0.909335][    T1]   fd01            4096 vda1 d1f35394-01
[    0.909364][    T1]
[    0.909401][    T1]   fd02         5237760 vda2 d1f35394-02
[    0.909408][    T1]
[    0.909441][    T1] fd10             366 vdb
[    0.909446][    T1]  driver: virtio_blk
[    0.909479][    T1] 0b00         1048575 sr0
[    0.909486][    T1]  driver: sr

This is even more baffling: if it's unable to open a console, then what am I reading these messages on? And error -2, or ENOENT, on opening 'vda2' implies that no such file or directory exists. But it then lists vda2 as a present drive with a known driver? So is vda2 missing or not?

Living in denial

As you've read the title of this article, you can probably guess as to what changed to cause this error. But at the time I had no idea what could have been the cause. I'd already confirmed that a kernel with a set timestamp can boot to userspace, and there was another (seemingly) far more likely candidate for the failure: as part of the CI design, patches are extracted from the submitted branch and rebased onto the maintainer's tree. This is great from a convenience perspective, because you don't need to worry about forgetting to rebase your patches before testing and submission. But if the maintainer has synced their branch with Linus' tree it means there could be a lot of things changed in the source tree between runs, even if they were only a few days apart.

So, when you're faced with a working test on one commit and a broken test on another commit, it's time to break out the git bisect. Downloading the kernel images from the relevant CI jobs, I confirmed that indeed one was working while the other was broken. So I bisected the relevant commits, and... everything kept working. Each step I would build and boot the kernel, and each step would reach userspace just fine. I was getting suspicious at this point, so skipped ahead to the known bad commit and built and tested it locally. It also worked.

This was highly confusing, because it meant there was something fishy going on. Some kind of state outside of the kernel tree. Could it be... surely not...

Comparing the boot logs of the two CI kernels, I see that the working one indeed uses an actual timestamp, and the broken one uses the 0000-01-01 fixed date. Oh no. Setting the timestamp with a local build, I can now reproduce the boot panic with a kernel I built myself.

But... why?

OK, so it's obvious at this point that the timestamp is affecting loading a root disk somehow. But why? The obvious answer is that it's before the UNIX epoch. Something in the build process is turning the date into an actual timestamp, and going wrong when that timestamp gets used for something.
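To put a number on "before the UNIX epoch", here is a rough Python sketch that converts a calendar date to seconds since 1970-01-01. It assumes UTC and the proleptic Gregorian calendar, and it is only a model of what a date-to-timestamp conversion might produce, not the code the kernel build actually runs. Python's date type cannot represent year 0, so the sketch shifts the date forward by one full 400-year Gregorian cycle (146097 days) and subtracts that again.

```python
from datetime import date

SECONDS_PER_DAY = 86400
DAYS_PER_400_YEARS = 146097  # length of one full Gregorian cycle
EPOCH_ORDINAL = date(1970, 1, 1).toordinal()


def epoch_seconds_utc(year, month, day):
    """Seconds from 1970-01-01 to midnight of the given date, assuming UTC.

    Works for years 0 through 9599 by shifting into the range Python supports.
    """
    shifted = date(year + 400, month, day)
    ordinal = shifted.toordinal() - DAYS_PER_400_YEARS
    return (ordinal - EPOCH_ORDINAL) * SECONDS_PER_DAY
```

Under these assumptions, epoch_seconds_utc(0, 1, 1) is a large negative number (around minus 62 billion seconds), whereas any date from 1970 onwards gives a non-negative value. Anything downstream that expects a non-negative timestamp would be a candidate for going wrong.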

But it's not like there was a build error complaining about it. As best I could tell, the kernel doesn't try to parse the date anywhere, besides passing it to date during the build. And if date had an issue with it, it would have broken the build. Not booting the kernel. There's no date utility being invoked during kernel boot!

Regardless, I set about tracing the usage of KBUILD_BUILD_TIMESTAMP inside the kernel. The stacktrace in the panic gave the end point of the search; the function mount_block_root() wasn't happy. So all I had to do was work out at which point mount_block_root() tried to access the KBUILD_BUILD_TIMESTAMP value.

In short, that went nowhere.

mount_block_root() effectively just tries to open a file in the filesystem. There are massive amounts of code handling this, and any part could have had the undocumented dependency on KBUILD_BUILD_TIMESTAMP. Approaching from the other direction, KBUILD_BUILD_TIMESTAMP is turned into build-timestamp inside a Makefile, which is in turn related to a file include/generated/utsversion.h. This file #defines UTS_VERSION as a string containing the KBUILD_BUILD_TIMESTAMP value. Searching the kernel for UTS_VERSION, we hit init/version-timestamp.c which stores it in a struct with other build information:

struct uts_namespace init_uts_ns = {
    .ns.count = REFCOUNT_INIT(2),
    .name = {
        .sysname    = UTS_SYSNAME,
        .nodename   = UTS_NODENAME,
        .release    = UTS_RELEASE,
        .version    = UTS_VERSION,
        .machine    = UTS_MACHINE,
        .domainname = UTS_DOMAINNAME,
    },
    .user_ns = &init_user_ns,
    .ns.inum = PROC_UTS_INIT_INO,
#ifdef CONFIG_UTS_NS
    .ns.ops = &utsns_operations,
#endif
};

This is where the trail goes cold: I don't know if you've ever tried this, but searching for .version in the kernel's codebase is not a very fruitful endeavor when you're interested in a specific kind of version.

$ rg "(\.|\->)version\b" | wc -l
5718

I tried tracing the usage of init_uts_ns, but didn't get very far.

By now I'd already posted this in chat and another developer, Joel Stanley, was also investigating this bizarre bug. They had been testing different timestamp values and made the horrifying discovery that the bug sticks around after a rebuild. So you could start with a broken build, set the timestamp back to the correct value, rebuild, and the resulting kernel would still be broken. The boot log would report the correct time, but the root disk mounter panicked all the same.

Getting sidetracked

I wasn't prepared to investigate the boot panic directly until the persistence bug was fixed. Having to run make clean and rebuild everything would take an annoyingly long time, even with ccache. Fortunately, I had a plan. All I had to do was work out which generated files are different between a broken and working build, and binary search by deleting half of them until deleting only one made the difference between the bug persisting or not. We can use diff for this. Running the initial diff we get

$ diff -q --exclude System.map --exclude .tmp_vmlinux* --exclude tools broken/ working/
Common subdirectories: broken/arch and working/arch
Common subdirectories: broken/block and working/block
Files broken/built-in.a and working/built-in.a differ
Common subdirectories: broken/certs and working/certs
Common subdirectories: broken/crypto and working/crypto
Common subdirectories: broken/drivers and working/drivers
Common subdirectories: broken/fs and working/fs
Common subdirectories: broken/include and working/include
Common subdirectories: broken/init and working/init
Common subdirectories: broken/io_uring and working/io_uring
Common subdirectories: broken/ipc and working/ipc
Common subdirectories: broken/kernel and working/kernel
Common subdirectories: broken/lib and working/lib
Common subdirectories: broken/mm and working/mm
Common subdirectories: broken/net and working/net
Common subdirectories: broken/scripts and working/scripts
Common subdirectories: broken/security and working/security
Common subdirectories: broken/sound and working/sound
Common subdirectories: broken/usr and working/usr
Files broken/.version and working/.version differ
Common subdirectories: broken/virt and working/virt
Files broken/vmlinux and working/vmlinux differ
Files broken/vmlinux.a and working/vmlinux.a differ
Files broken/vmlinux.o and working/vmlinux.o differ
Files broken/vmlinux.strip.gz and working/vmlinux.strip.gz differ

Hmm, OK so only some top level files are different. Deleting all the different files doesn't fix the persistence bug though, and I know that a proper make clean does fix it, so what could possibly be the difference when all the remaining files are identical?

Oh wait. man diff reports that diff only compares the top level folder entries by default. So it was literally just telling me "yes, both the broken and working builds have a folder named X". How GNU of it. Re-running the diff command with actually useful options, we get a more promising story

$ diff -qr --exclude System.map --exclude .tmp_vmlinux* --exclude tools build/broken/ build/working/
Files build/broken/arch/powerpc/boot/zImage and build/working/arch/powerpc/boot/zImage differ
Files build/broken/arch/powerpc/boot/zImage.epapr and build/working/arch/powerpc/boot/zImage.epapr differ
Files build/broken/arch/powerpc/boot/zImage.pseries and build/working/arch/powerpc/boot/zImage.pseries differ
Files build/broken/built-in.a and build/working/built-in.a differ
Files build/broken/include/generated/utsversion.h and build/working/include/generated/utsversion.h differ
Files build/broken/init/built-in.a and build/working/init/built-in.a differ
Files build/broken/init/utsversion-tmp.h and build/working/init/utsversion-tmp.h differ
Files build/broken/init/version.o and build/working/init/version.o differ
Files build/broken/init/version-timestamp.o and build/working/init/version-timestamp.o differ
Files build/broken/usr/built-in.a and build/working/usr/built-in.a differ
Files build/broken/usr/initramfs_data.cpio and build/working/usr/initramfs_data.cpio differ
Files build/broken/usr/initramfs_data.o and build/working/usr/initramfs_data.o differ
Files build/broken/usr/initramfs_inc_data and build/working/usr/initramfs_inc_data differ
Files build/broken/.version and build/working/.version differ
Files build/broken/vmlinux and build/working/vmlinux differ
Files build/broken/vmlinux.a and build/working/vmlinux.a differ
Files build/broken/vmlinux.o and build/working/vmlinux.o differ
Files build/broken/vmlinux.strip.gz and build/working/vmlinux.strip.gz differ

There are some new entries here: notably init/version* and usr/initramfs*. Binary searching these files results in a single culprit: usr/initramfs_data.cpio. This is quite fitting, as the .cpio file is an archive defining a filesystem layout, much like .tar files. This file is actually embedded into the kernel image, and loaded as a bare-bones shim filesystem when the user doesn't provide their own initramfs¹.
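The binary search over the differing files can be sketched roughly as follows. This is only an illustration of the idea, not the procedure actually used: `bisect_culprit` and `culprit_in_range` are made-up names, and the predicate stands in for the slow manual step of deleting a subset of files, rebuilding and booting. It also assumes exactly one file is responsible.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical predicate: returns nonzero if the bug still reproduces when
 * only the candidate files in [lo, hi) are left over from the broken build.
 * In practice this would be a rebuild and a boot test.
 */
typedef int (*still_broken_fn)(size_t lo, size_t hi, void *ctx);

/* Narrow n candidate files down to one, assuming exactly one culprit. */
static size_t bisect_culprit(size_t n, still_broken_fn still_broken, void *ctx)
{
    size_t lo = 0, hi = n;

    assert(n > 0);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;

        if (still_broken(lo, mid, ctx))
            hi = mid; /* the culprit is in the first half */
        else
            lo = mid; /* otherwise it must be in the second half */
    }
    return lo;
}

/* Stand-in predicate for trying the sketch: ctx points at the culprit index. */
static int culprit_in_range(size_t lo, size_t hi, void *ctx)
{
    size_t culprit = *(size_t *)ctx;

    return lo <= culprit && culprit < hi;
}
```

Each round halves the candidate set, so even a list of a dozen or so differing files needs only four rebuilds.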

So it would make sense that if the CPIO archive wasn't being rebuilt, then the initial filesystem wouldn't change. And it would make sense for the initial filesystem to be what's breaking the mount of the proper root disk filesystem.

This just leaves the question of how KBUILD_BUILD_TIMESTAMP is breaking the CPIO archive. And it's around this time that a third developer, Andrew, who I'd roped into this bug hunt for having the (mis)fortune to sit next to me, pointed out that the generator script for this CPIO archive was passing the KBUILD_BUILD_TIMESTAMP to date. Whoop, we've found the murder weapon²!

The persistence bug could be explained now: because the script was only using KBUILD_BUILD_TIMESTAMP internally, make had no way of knowing that the archive generation depended on this variable. So even when I changed the variable to a valid value, make didn't know to rebuild the corrupt archive. Let's now get back to the main issue: why the boot panics.

Solving the case

Following along the CPIO generation script, the KBUILD_BUILD_TIMESTAMP variable is turned into a timestamp by date -d"$KBUILD_BUILD_TIMESTAMP" +%s. Testing this in the shell with 0000-01-01 we get this (somewhat amusing, but also painful) result

$ date -d"$KBUILD_BUILD_TIMESTAMP" +%s
-62167255492

This timestamp is then passed to a C program that assigns it to a variable default_mtime. Looking over the source, it seems this variable is used to set the mtime field on the files in the CPIO archive. The timestamp is stored as a time_t, which on 64-bit Linux is a 64 bit signed integer (effectively an int64_t). That's 64 bits of data, up to 16 hexadecimal characters. And yes, that's relevant: CPIO stores the mtime (and all other numerical fields) as 32 bit unsigned integers represented by ASCII hexadecimal characters. The sprintf() call that ultimately embeds the timestamp uses the %08lX format specifier. This formats an unsigned long as hexadecimal, padded to at least 8 characters. Hang on... at least 8 characters? What if our timestamp happens to be more?

It turns out that large timestamps are already guarded against. The program will error out during the build if the date is later than 2106-02-07 (the largest timestamp that fits in 8 unsigned hex digits).

/*
 * Timestamps after 2106-02-07 06:28:15 UTC have an ascii hex time_t
 * representation that exceeds 8 chars and breaks the cpio header
 * specification.
 */
if (default_mtime > 0xffffffff) {
    fprintf(stderr, "ERROR: Timestamp too large for cpio format\n");
    exit(1);
}

But we are using an int64_t. What would happen if one were to provide a negative timestamp?

Well, sprintf() happily spits out FFFFFFF1868AF63C when we pass in our negative timestamp representing 0000-01-01. That's 16 characters, 8 too many for the CPIO header³.
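To see the overflow in isolation, here is a small userspace sketch. The helper names are invented for illustration and are not from the kernel tree, and the lower-bound check is just one plausible guard, not necessarily what the eventual patch does. It uses %08llX with an explicit 64 bit cast so it behaves the same on platforms where long is 32 bits.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an mtime with a minimum width of 8 hex digits,
 * mirroring the "%08lX" call described above. Returns the number of
 * characters produced.
 */
static int format_mtime(char *buf, size_t len, int64_t mtime)
{
    assert(len >= 17); /* 16 hex digits plus the terminating NUL */
    return snprintf(buf, len, "%08llX", (unsigned long long)mtime);
}

/* Assumed guard: accept only values that fit the 8 hex digit header field. */
static int mtime_fits_cpio(int64_t mtime)
{
    return mtime >= 0 && mtime <= 0xffffffffLL;
}
```

Passing -62167255492 to format_mtime() yields the 16 character string FFFFFFF1868AF63C, and mtime_fits_cpio() rejects it, whereas the existing upper-bound check alone lets it through.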

So at last we've found the cause of the panic: the timestamp is formatted with too many characters, which breaks the CPIO header, so the kernel doesn't create the initial filesystem correctly. This includes the /dev folder (which surprisingly is not hardcoded into the kernel, but must be declared by the initramfs). So when the root disk mounter tries to open /dev/vda2, it correctly complains that it failed to create a device in the non-existent /dev.

Postmortem

After discovering all this, I sent in a couple of patches to fix the CPIO generation and rebuild logic. They were not complicated fixes, but wow were they time consuming to track down. I didn't see the error initially because I typically only boot with my own initramfs over the embedded one, and not with the intent to load a root disk. Then the panic itself was quite far away from the real issue, and there were many dead ends to explore.

I also got curious as to why the kernel didn't complain about a corrupt initramfs earlier. A brief investigation showed a streaming parser that is extremely fault tolerant, silently skipping invalid entries (like ones with a missing or overly long name). The corrupted header was being interpreted as an entry with an empty name and a 2 gigabyte body, which meant that (1) the kernel skipped inserting it due to the empty name, and (2) the kernel skipped the rest of the initramfs because it thought that up to 2 GB of the remaining content was part of that first entry.
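The misinterpretation can be reproduced with a toy fixed-offset reader. This sketch assumes the "newc" header layout (a 6 byte magic followed by 13 fields of 8 hex characters each); the entry values are made up, and the functions are hypothetical helpers rather than the kernel's parser.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Field order of the newc header, after the 6 byte magic. */
enum newc_field {
    F_INO, F_MODE, F_UID, F_GID, F_NLINK, F_MTIME, F_FILESIZE,
    F_DEVMAJOR, F_DEVMINOR, F_RDEVMAJOR, F_RDEVMINOR, F_NAMESIZE, F_CHECK
};

/* Read one field at its fixed offset, as a fixed-width parser would. */
static uint32_t newc_get(const char *hdr, enum newc_field f)
{
    char tmp[9];

    memcpy(tmp, hdr + 6 + 8 * (size_t)f, 8);
    tmp[8] = '\0';
    return (uint32_t)strtoul(tmp, NULL, 16);
}

/*
 * Write a header for a made-up directory entry. Every value except mtime is
 * illustrative. mtime only has a minimum width of 8, so a negative value
 * expands to 16 characters and shifts every later field by 8.
 */
static int newc_put(char *buf, size_t len, int64_t mtime, unsigned namesize)
{
    assert(len >= 119); /* 110 byte header + 8 overflow characters + NUL */
    return snprintf(buf, len,
                    "070701%08X%08X%08X%08X%08X%08llX%08X%08X%08X%08X%08X%08X%08X",
                    1u, 040755u, 0u, 0u, 2u, (unsigned long long)mtime,
                    0u, 3u, 1u, 0u, 0u, namesize, 0u);
}
```

With an mtime of -62167255492 the header grows from 110 to 118 characters. Reading it back at the fixed offsets gives a file size of 0x868AF63C (the low half of the timestamp, a bit over 2 GB) and a name size of 0 (what was really the rdevminor field), which lines up with the empty-name, 2 gigabyte entry described above.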

Perhaps this could be improved to require that all input is consumed without unexpected EOF, as the userspace cpio tool does (which, by the way, recognises the corrupt archive as such and refuses to extract it). The parsing logic is mostly from the before-times though (i.e., pre initial git commit), so it's difficult to distinguish between intentional leniency and bugs.

Afterword

Incidentally, in investigating this I came across another bug. There is a helper function panic_show_mem() in the initramfs code that's meant to dump memory information and then call panic(). It takes in a standard printf() style format string and arguments, and tries to forward them to panic(), which ultimately prints them.

static void panic_show_mem(const char *fmt, ...)
{
    va_list args;

    show_mem(0, NULL);
    va_start(args, fmt);
    panic(fmt, args);
    va_end(args);
}

void panic(const char *fmt, ...);

But variadic arguments don't quite work this way: instead of forwarding the list args as intended, panic() will instead interpret args as a single argument for the format string fmt. Standard library functions address this by providing v* variants of printf() and friends. For example,

int printf(const char *fmt, ...);

int vprintf(const char *fmt, va_list args);
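As a userspace illustration of that convention, the sketch below pairs a variadic wrapper with a v* variant that accepts an already started va_list. The names format_msg and vformat_msg are made up for this example and are not kernel functions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* The v* variant: takes a va_list instead of "..." so callers can forward. */
static int vformat_msg(char *buf, size_t len, const char *fmt, va_list args)
{
    return vsnprintf(buf, len, fmt, args);
}

/* The variadic wrapper: starts the va_list and hands it to the v* variant. */
static int format_msg(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    assert(buf != NULL && fmt != NULL);
    va_start(args, fmt);
    n = vformat_msg(buf, len, fmt, args); /* forward the list, not "args" as one value */
    va_end(args);
    return n;
}
```

Calling format_msg(buf, sizeof buf, "%s: %d", "rc", -22) fills buf with "rc: -22". Passing args straight to another "..." function, as panic_show_mem() does, would instead hand the va_list object over as if it were the first formatted argument.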

We might create a vpanic() function in the kernel that follows this style, but it seems easier to just make panic_show_mem() a macro and 'forward' the arguments in the source code

#define panic_show_mem(fmt, ...) \
    ({ show_mem(0, NULL); panic(fmt, ##__VA_ARGS__); })

Patch sent.

And that's where I've left things. Big thanks to Joel and Andrew for helping me with this bug. It was certainly a trip.

¹ initramfs, or initrd for the older format, are specific kinds of CPIO archives. The initramfs is intended to be loaded as the initial filesystem of a booted kernel, typically in preparation for loading your normal root filesystem. It might contain modules necessary to mount the disk, for example.
² Hindsight again would suggest it was obvious to look here because it shows up when searching for KBUILD_BUILD_TIMESTAMP. I unfortunately wasn't familiar with the usr/ source folder initially, and focused too much on the core kernel components early on. Oh well, we found it eventually.
³ I almost missed this initially. Thanks to the ASCII header format, strings was able to print the headers without any CPIO specific tooling. I did a double take when I noticed the headers for the broken CPIO were a little longer than the headers in the working one.

What distro options are there for POWER8 in 2022?

Wed 16 November 2022

Posted by Russell Currey Wed 16 November 2022

If you have POWER8 systems that you want to keep alive, what are your options in 2022? You can keep using your current legacy distribution as long as it's still supported, but if you want some modernisation, that might not be the best option for you. Here's the current landscape of POWER8 support in major distributions, and hopefully it helps you out!

Please note that I am entirely focused on what runs and keeps getting new packages, not what companies will officially support. IBM provides documentation for that. I'm also mostly focused on OpenPOWER and not what's supported under IBM PowerVM.

RHEL-compatible

Things aren't too great on the RHEL-compatible side. RHEL 9 is compiled with P9 instructions, removing support for P8. This includes compatible distributions, like CentOS Stream and Rocky Linux.

You can continue to use RHEL 8 for a long time. Unfortunately, Rocky Linux only has a Power release for EL9 and not EL8, and CentOS Stream 8 hits EOL May 31st, 2024 - a bit too soon for my liking. If you're a RHEL customer though, you're set.

Fedora

Fedora seems like a great option - the latest versions still support P8 and there are no immediate signs of that changing. The issue is that Fedora could change this with relatively little warning (and their big brother RHEL already has), Fedora doesn't provide LTS versions that will stay supported if this happens, and any options you could migrate to would be very different from what you're using.

For that reason, I don't recommend using Fedora on POWER8 if you intend to keep it around for a while. If you want something modern for a short-term project, go right ahead! Otherwise, I'd avoid it. If you're still keeping POWER8 systems alive, you probably want something more set-and-forget than Fedora anyway.

Ubuntu

Ubuntu is a mixed bag. The good news is that Ubuntu 20.04 LTS is supported until mid-2025, and if you give Canonical money, that support can extend through 2030. Ubuntu 20.04 LTS is my personal pick for the best distro to install on POWER8 systems where you want somewhat modern software without the risk of future issues.

The bad news is that POWER8 support went away in Ubuntu 22.04, which is extremely unfortunate. Missing an LTS cycle is one thing, but not having a pathway from 21.10 is another. If you were on 20.10/21.04/21.10, you are completely boned, because they're all out of support and 22.04 and later don't support POWER8. You're going to have to reinstall 20.04.

If I sound salty, it's because I had to do this for a few machines. Hopefully you're not in that situation. 20.04 is going to be around for a good while longer, with a lot of modern creature comforts you'd miss on an EL8-compatible distro, so it's my pick for now.

OpenSUSE

I'm pretty ignorant when it comes to chameleon-flavoured distros, so take this with a grain of salt as most of it is from some quick searching. OpenSUSE Leap follows SLES, but without extended support lifetimes for older major versions. From what I can tell, the latest release (15.4) still includes POWER8 support (and adds Power10 support!), but similar to Fedora, that looks rather prone to a new version dropping P8 support to me.

If the 15.x series stayed alive after 16 came out, you might be good, but it doesn't seem like there's a history of that happening.

Debian

Debian 11 "bullseye" came out in 2021, supports POWER8, and is likely to be supported until around 2026. I can't really chime in on more than that because I am a certified Debian hater (even newer releases feel outdated to me), but that looks like a pretty good deal.

Other options

Those are just some major distros; there are plenty of others, including some Power-specific ones from the community.

Conclusion

POWER8's getting old, but is still plenty capable. Make sure your distro still remembers to send your POWER8 a birthday card each year and you'll have plenty more good times to come.

Power kernel hardening features in Linux 6.1

Wed 26 October 2022

Posted by Russell Currey Wed 26 October 2022

Linux 6.1-rc1 was tagged on October 16th, 2022 and includes a bunch of nice things from my team that I want to highlight. Our goal is to make the Linux kernel running on IBM's Power CPUs more secure, and we landed a few goodies upstream in 6.1 to that end.

Specifically, Linux 6.1 on Power will include a complete system call infrastructure rework with security and performance benefits, support for KFENCE (a low-overhead memory safety error detector), and execute-only memory (XOM) support on the Radix MMU.

The syscall work from Rohan McLure and Andrew Donnellan replaces arch/powerpc's legacy infrastructure with the syscall wrapper shared between architectures. This was a significant overhaul of a lot of legacy code impacting all of powerpc's many platforms, including multiple different ABIs and 32/64bit compatibility infrastructure. Rohan's series started at v1 with 6 patches and ended at v6 with 25 patches, and he's done an incredible job at adopting community feedback and handling new problems.

Big thanks to Christophe Leroy, Arnd Bergmann, Nick Piggin, Michael Ellerman and others for their reviews, and of course Andrew for providing a lot of review and feedback (and prototyping the syscall wrapper in the first place). Our syscalls have entered the modern era: we can zeroise registers to improve security (but don't yet, due to some ongoing discussion around compatibility and making it optional; look out for Linux 6.2), and we gain a nice little performance boost by avoiding the allocation of a kernel stack frame. For more detail, see Rohan's cover letter.

Next, we have Nicholas Miehlbradt's implementation of Kernel Electric Fence (KFENCE) (and DEBUG_PAGEALLOC) for 64-bit Power, including the Hash and Radix MMUs. Christophe Leroy has already implemented KFENCE for 32-bit powerpc upstream and a series adding support for 64-bit was posted by Jordan Niethe last year, but couldn't proceed due to locking issues. Those issues have since been resolved, and after fixing a previously unknown and very obscure MM issue, Nick's KFENCE patches have been merged.

KFENCE is a low-overhead alternative to memory error detectors like KASAN (which we implemented for Radix earlier this year, thanks to Daniel Axtens and Paul Mackerras), which you probably wouldn't want to run in production. If you're chasing a memory corruption bug that doesn't like to present itself, KFENCE can help you catch out-of-bounds accesses, use-after-frees, double frees etc. without significantly impacting performance.

Finally, I wired up execute-only memory (XOM) for the Radix MMU. XOM is a niche feature that lets users map pages with PROT_EXEC only, creating a page that can't be read or written to, but can still be executed. This is primarily useful for defending against code reuse attacks like ROP, but has other uses such as JIT/sandbox environments. Power8 and later CPUs running the Hash MMU already had this capability through protection keys (pkeys); my implementation for Radix uses the native execute permission bit of the Radix MMU instead.

This basically took me an afternoon to wire up after I had the idea and I roped in Nicholas Miehlbradt to contribute a selftest, which ended up being a more significant engineering effort than the feature implementation itself. We now have a comprehensive test for XOM that runs on both Hash and Radix for all possible combinations of R/W/X upstream.
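For a sense of what requesting such a mapping looks like from userspace, here is a minimal sketch. The helper name map_exec_only is made up, this is not the upstream selftest, and whether the resulting page is genuinely unreadable depends on the hardware and kernel it runs on.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: ask for an anonymous mapping with PROT_EXEC only.
 * Returns NULL if the kernel refuses the request.
 */
static void *map_exec_only(size_t len)
{
    void *p;

    assert(len > 0);
    p = mmap(NULL, len, PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

A real user would more likely map the region read-write first, copy code into it, and then switch it to PROT_EXEC with mprotect(), since a fresh anonymous page contains only zeroes.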

Anyway, that's all I have - this is my first time writing a post like this, so let me know what you think! A lot of our work doesn't result in upstream patches so we're not always going to have kernel releases as eventful as this, but we can post summaries every once in a while if there's interest. Thanks for reading!


Store Halfword Byte-Reverse Indexed

A Power Technical Blog