Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Fuzzing grub, part 2: going faster

Recently a set of 8 vulnerabilities were disclosed for the grub bootloader. I found 2 of them (CVE-2021-20225 and CVE-2021-20233), and contributed a number of other fixes for crashing bugs which we don't believe are exploitable. I found them by applying fuzz testing to grub. Here's how.

This is a multi-part series: I think it will end up being 4 posts. I'm hoping to cover:

We've been looking at fuzzing grub-emu, which is basically most parts of grub built into a standard userspace program. This includes all the script parsing logic, fonts, graphics, partition tables, filesystems and so on - just not platform specific driver code or the ability to actually load and boot a kernel.

Previously, we talked about some issues building grub with AFL++'s instrumentation:

:::text
./configure --with-platform=emu --disable-grub-emu-sdl CC=$AFL_PATH/afl-cc
...
checking whether target compiler is working... no
configure: error: cannot compile for the target

It also doesn't work with afl-gcc.

We tried to trick configure:

:::shell
./configure --with-platform=emu --disable-grub-emu-sdl CC=clang CXX=clang++
make CC="$AFL_PATH/afl-cc" 

Sadly, things still break:

:::text
/usr/bin/ld: disk.module:(.bss+0x20): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here
/usr/bin/ld: regexp.module:(.bss+0x70): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here
/usr/bin/ld: blocklist.module:(.bss+0x28): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here

The problem is the module linkage that I talked about in part 1. There is a link stage of sorts for the kernel (kernel.exec) and each module (e.g. disk.module), so some AFL support code gets linked into each of those. Then there's another link stage for grub-emu itself, which also tries to bring in the same support code. The linker doesn't like the symbols being in multiple places, which is fair enough.

There are (at least) 3 ways you could solve this. I'm going to call them the hard way, and the ugly way and the easy way.

The hard way: messing with makefiles

We've been looking at fuzzing grub-emu. Building grub-emu links kernel.exec and almost every .module file that grub produces into the final binary. Maybe we could avoid our duplicate symbol problems entirely by changing how we build things?

I didn't do this in my early work because, to be honest, I don't like working with build systems and I'm not especially good at it. grub's build system is based on autotools but is even more quirky than usual: rather than just having a Makefile.am, we have Makefile.core.def which is used along with other things to generate Makefile.am. It's a pretty cool system for making modules, but it's not my idea of fun to work with.

But, for the sake of completeness, I tried again.

It gets unpleasant quickly. The generated grub-core/Makefile.core.am adds each module to platform_PROGRAMS, and then each is built with LDFLAGS_MODULE = $(LDFLAGS_PLATFORM) -nostdlib $(TARGET_LDFLAGS_OLDMAGIC) -Wl,-r,-d.

Basically, in the makefile this ends up being (e.g.):

:::make
tar.module$(EXEEXT): $(tar_module_OBJECTS) $(tar_module_DEPENDENCIES) $(EXTRA_tar_module_DEPENDENCIES) 
    @rm -f tar.module$(EXEEXT)
    $(AM_V_CCLD)$(tar_module_LINK) $(tar_module_OBJECTS) $(tar_module_LDADD) $(LIBS)

Ideally I don't want them to be linked at all; there's no benefit if they're just going to be linked again.

You can't just collect the sources and build them into grub-emu - they all have to built with different CFLAGS! So instead I spent some hours messing around with the build system. Given some changes to the python script that converts the Makefile.*.def files into Makefile.am files, plus some other bits and pieces, we can build grub-emu by linking the object files rather than the more-processed modules.

The build dies immediately after linking grub-emu in other components, and it requires a bit of manual intervention to get the right things built in the right order, but with all of those caveats, it's enough. It works, and you can turn on things like ASAN, but getting there was hard, unrewarding and unpleasant. Let's consider alternative ways to solve this problem.

The ugly way: patching AFL

What I did when finding the bugs was to observe that we only wanted AFL to link in its extra instrumentation at certain points of the build process. So I patched AFL to add an environment variable AFL_DEFER_LIB - which prevented AFL adding its own instrumentation library when being called as a linker. I combined this with the older CFG instrumentation, as the PCGUARD instrumentation brought in a bunch of symbols from LLVM which I didn't want to also figure out how to guard.

I then wrapped this in a horrifying script that basically built bits and pieces of grub with the environment variable on or off, in order to at least get the userspace tools and grub-emu built. Basically it set AFL_DEFER_LIB when building all the modules and turned it off when building the userspace tools and grub-emu.

This worked and it's what I used to find most of my bugs. But I'd probably not recommend it, and I'm not sharing the source: it's extremely fragile and brittle, the hard way is more generally applicable, and the easy way is nicer.

The easy way: adjusting linker flags

After posting part 1 of this series, I had a fascinating twitter DM conversation with @hackerschoice, who pointed me to some new work that had been done in AFL++ between when I started and when I published part 1.

AFL++ now has the ability to dynamically detect some duplicate symbols, allowing it to support plugins and modules better. This isn't directly applicable because we link all the modules in at build time, but in the conversation I was pointed to a linker flag which instructs the linker to ignore the symbol duplication rather than error out. This provides a significantly simpler way to instrument grub-emu, avoiding all the issues I'd previously been fighting so hard to address.

So, with a modern AFL++, and the patch from part 1, you can sort out this entire process like this:

:::shell
./bootstrap
./configure --with-platform=emu CC=clang CXX=clang++ --disable-grub-emu-sdl
make CC=/path/to/afl-clang-fast LDFLAGS="-Wl,--allow-multiple-definition"

Eventually it will error out, but ./grub-core/grub-emu should be successfully built first.

(Why not just build grub-emu directly? It gets built by grub-core/Makefile, but depends on a bunch of things made by the top-level makefile and doesn't express its dependencies well. So you can try to build all the things that you need separately and then cd grub-core; make ...flags... grub-emu if you want - but it's way more complicated to do it that way!)

Going extra fast: __AFL_INIT

Now that we can compile with instrumentation, we can use __AFL_INIT. I'll leave the precise details of how this works to the AFL docs, but in short it allows us to do a bunch of early setup only once, and just fork the process after the setup is done.

There's a patch that inserts a call to __AFL_INIT in the grub-emu start path in my GitHub repo.

All up, this can lead to a 2x-3x speedup over the figures I saw in part 1. In part 1 we saw around 244 executions per second at this point - now we're over 500:

afl-fuzz fuzzing grub, showing fuzzing happening

Finding more bugs with sanitisers

A 'sanitizer' refers to a set of checks inserted by a compiler at build time to detect various runtime issues that might not cause a crash or otherwise be detected. A particularly common and useful sanitizer is ASAN, the AddressSanitizer, which detects out-of-bounds memory accesses, use-after-frees and other assorted memory bugs. Other sanitisers can check for undefined behaviour, uninitialised memory reads or even breaches of control flow integrity.

ASAN is particularly popular for fuzzing. In theory, compiling with AFL++ and LLVM makes it really easy to compile with ASAN. Setting AFL_USE_ASAN=1 should be sufficient.

However, in practice, it's quite fragile for grub. I believe I had it all working, and then I upgraded my distro, LLVM and AFL++ versions, and everything stopped working. (It's possible that I hadn't sufficiently cleaned my source tree and I was still building based on the hard way? Who knows.)

I spent quite a while fighting "truncated relocations". ASAN instrumentation was bloating the binaries, and the size of all the *.module files was over 512MB, which I suspect was causing the issues. (Without ASAN, it only comes to 35MB.)

I tried afl-clang-lto: I installed lld, rebuilt AFL++, and managed to segfault the linker while building grub. So I wrote that off. Changing the instrumentation type to classic didn't help either.

Some googling suggested GCC's -mmodel, which in Clang seems to be -mcmodel, but CFLAGS="-mcmodel=large" didn't get me any further either: it's already added in a few different links.

My default llvm is llvm-12, so I tried building with llvm-9 and llvm-11 in case that helped. Both built a binary, but it would fail to start:

:::text
==375638==AddressSanitizer CHECK failed: /build/llvm-toolchain-9-8fovFY/llvm-toolchain-9-9.0.1/compiler-rt/lib/sanitizer_common/sanitizer_common_libcdep.cc:23 "((SoftRssLimitExceededCallback)) == ((nullptr))" (0x423660, 0x0)

The same happens if I build with llvm-12 and afl-clang, the old-style instrumentation.

I spun up a Ubuntu 20.04 VM and build there with LLVM 10 and the latest stable AFL++. That didn't work either.

I had much better luck using GCC's and GCC's ASAN implementation, either with the old-school afl-gcc or the newer GCC plugin-based afl-gcc-fast. (I have some hypotheses around shared library vs static library ASAN, but having spent way more work time on this than was reasonable, I was disinclined to debug it further.) Here's what worked for me:

:::shell
./configure --with-platform=emu --disable-grub-emu-sdl
# the ASAN option is required because one of the tools leaks memory and
# that breaks the generation of documentation.
# -Wno-nested-extern makes __AFL_INIT work on gcc
ASAN_OPTIONS=detect_leaks=0 AFL_USE_ASAN=1 make CC=/path/to/afl-gcc-fast LDFLAGS="-Wl,--allow-multiple-definition" CFLAGS="-Wno-nested-extern"

GCC doesn't support as many sanitisers as LLVM, but happily it does support ASAN. AFL++'s GCC plugin mode should get us most of the speed we would get from LLVM, and indeed the speed - even with ASAN - is quite acceptable.

If you persist, you should be able to find some more bugs: for example there's a very boring global array out-of-bounds read when parsing config files.

That's all for part 2. In part 3 we'll look at fuzzing filesystems and more. Hopefully there will be a quicker turnaround between part 2 and part 3 than there was between part 1 and part 2!

Fuzzing grub: part 1

Recently a set of 8 vulnerabilities were disclosed for the grub bootloader. I found 2 of them (CVE-2021-20225 and CVE-2021-20233), and contributed a number of other fixes for crashing bugs which we don't believe are exploitable. I found them by applying fuzz testing to grub. Here's how.

This is a multi-part series: I think it will end up being 4 posts. I'm hoping to cover:

  • Part 1 (this post): getting started with fuzzing grub
  • Part 2: going faster by doing lots more work
  • Part 3: fuzzing filesystems and more
  • Part 4: potential next steps and avenues for further work

Fuzz testing

Let's begin with part one: getting started with fuzzing grub.

One of my all-time favourite techniques for testing programs, especially programs that handle untrusted input, and especially-especially programs written in C that parse untrusted input, is fuzz testing. Fuzz testing (or fuzzing) is the process of repeatedly throwing randomised data at your program under test and seeing what it does.

(For the secure boot threat model, untrusted input is anything not validated by a cryptographic signature - so config files are untrusted for our purposes, but grub modules can only be loaded if they are signed, so they are trusted.)

Fuzzing has a long history and has recently received a new lease on life with coverage-guided fuzzing tools like AFL and more recently AFL++.

Building grub for AFL++

AFL++ is extremely easy to use ... if your program:

  1. is built as a single binary with a regular tool-chain
  2. runs as a regular user-space program on Linux
  3. reads a small input files from disk and then exits
  4. doesn't do anything fancy with threads or signals

Beyond that, it gets a bit more complex.

On the face of it, grub fails 3 of these 4 criteria:

  • grub is a highly modular program: it loads almost all of its functionality as modules which are linked as separate ELF relocatable files. (Not runnable programs, but not shared libraries either.)

  • grub usually runs as a bootloader, not as a regular app.

  • grub reads all sorts of things, ranging in size from small files to full disks. After loading most things, it returns to a command prompt rather than exiting.

Fortunately, these problems are not insurmountable.

We'll start with the 'running as a bootloader' problem. Here, grub helps us out a bit, because it provides an 'emulator' target, which runs most of grub functionality as a userspace program. It doesn't support actually booting anything (unsurprisingly) but it does support most other modules, including things like the config file parser.

We can configure grub to build the emulator. We disable the graphical frontend for now.

:::shell
./bootstrap
./configure --with-platform=emu --disable-grub-emu-sdl

At this point in building a fuzzing target, we'd normally try to configure with afl-cc to get the instrumentation that makes AFL(++) so powerful. However, the grub configure script is not a fan:

:::text
./configure --with-platform=emu --disable-grub-emu-sdl CC=$AFL_PATH/afl-cc
...
checking whether target compiler is working... no
configure: error: cannot compile for the target

It also doesn't work with afl-gcc.

Hmm, ok, so what if we just... lie a bit?

:::shell
./configure --with-platform=emu --disable-grub-emu-sdl
make CC="$AFL_PATH/afl-gcc" 

(Normally I'd use CC=clang and afl-cc, but clang support is slightly broken upstream at the moment.)

After a small fix for gcc-10 compatibility, we get the userspace tools (potentially handy!) but a bunch of link errors for grub-emu:

:::text
/usr/bin/ld: disk.module:(.bss+0x20): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here
/usr/bin/ld: regexp.module:(.bss+0x70): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here
/usr/bin/ld: blocklist.module:(.bss+0x28): multiple definition of `__afl_global_area_ptr'; kernel.exec:(.bss+0xe078): first defined here

The problem is the module linkage that I talked about earlier: because there is a link stage of sorts for each module, some AFL support code gets linked in to both the grub kernel (kernel.exec) and each module (here disk.module, regexp.module, ...). The linker doesn't like it being in both, which is fair enough.

To get started, let's instead take advantage of the smarts of AFL++ using Qemu mode instead. This builds a specially instrumented qemu user-mode emulator that's capable of doing coverage-guided fuzzing on uninstrumented binaries at the cost of a significant performance penalty.

:::shell
make clean
make

Now we have a grub-emu binary. If you run it directly, you'll pick up your system boot configuration, but the -d option can point it to a directory of your choosing. Let's set up one for fuzzing:

:::shell
mkdir stage
echo "echo Hello sthbrx readers" > stage/grub.cfg
cd stage
../grub-core/grub-emu -d .

You probably won't see the message because the screen gets blanked at the end of running the config file, but if you pipe it through less or something you'll see it.

Running the fuzzer

So, that seems to work - let's create a test input and try fuzzing:

:::shell
cd ..
mkdir in
echo "echo hi" > in/echo-hi

cd stage
# -Q qemu mode
# -M main fuzzer
# -d don't do deterministic steps (too slow for a text format)
# -f create file grub.cfg
$AFL_PATH/afl-fuzz -Q -i ../in -o ../out -M main -d -- ../grub-core/grub-emu -d .

Sadly:

:::text
[-] The program took more than 1000 ms to process one of the initial test cases.
    This is bad news; raising the limit with the -t option is possible, but
    will probably make the fuzzing process extremely slow.

    If this test case is just a fluke, the other option is to just avoid it
    altogether, and find one that is less of a CPU hog.

[-] PROGRAM ABORT : Test case 'id:000000,time:0,orig:echo-hi' results in a timeout
         Location : perform_dry_run(), src/afl-fuzz-init.c:866

What we're seeing here (and indeed what you can observe if you run grub-emu directly) is that grub-emu isn't exiting when it's done. It's waiting for more input, and will keep waiting for input until it's killed by afl-fuzz.

We need to patch grub to sort that out. It's on my GitHub.

Apply that, rebuild with FUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION, and voila:

:::shell
cd ..
make CFLAGS="-DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION"
cd stage
$AFL_PATH/afl-fuzz -Q -i ../in -o ../out -M main -d -f grub.cfg -- ../grub-core/grub-emu -d .

And fuzzing is happening!

afl-fuzz fuzzing grub, showing fuzzing happening

This is enough to find some of the (now-fixed) bugs in the grub config file parsing!

Fuzzing beyond the config file

You can also extend this to fuzzing other things that don't require the graphical UI, such as grub's transparent decompression support:

:::shell
cd ..
rm -rf in out stage
mkdir in stage
echo hi > in/hi
gzip in/hi
cd stage
echo "cat thefile" > grub.cfg
$AFL_PATH/afl-fuzz -Q -i ../in -o ../out -M main -f thefile -- ../grub-core/grub-emu -d .

You should be able to find a hang pretty quickly with this, an as-yet-unfixed bug where grub will print output forever from a corrupt file: (your mileage may vary, as will the paths.)

:::shell
cp ../out/main/hangs/id:000000,src:000000,time:43383,op:havoc,rep:16 thefile
../grub-core/grub-emu -d . | less # observe this going on forever

zcat, on the other hand, reports it as simply corrupt:

:::text
$ zcat thefile

gzip: thefile: invalid compressed data--format violated

(Feel free to fix that and send a patch to the list!)

That wraps up part 1. Eventually I'll be back with part 2, where I explain the hoops to jump through to go faster with the afl-cc instrumentation.

linux.conf.au 2020 recap

It's that time of year again. Most of OzLabs headed up to the Gold Coast for linux.conf.au 2020.

linux.conf.au is one of the longest-running community-led Linux and Free Software events in the world, and attracts a crowd from Australia, New Zealand and much further afield. OzLabbers have been involved in LCA since the very beginning and this year was no exception with myself running the Kernel Miniconf and several others speaking.

The list below contains some of our highlights that we think you should check out. This is just a few of the talks that we managed to make it to - there's plenty more worthwhile stuff on the linux.conf.au YouTube channel.

We'll see you all at LCA2021 right here in Canberra...

Keynotes

A couple of the keynotes really stood out:

Sean is a forensic structural engineer who shows us a variety of examples, from structural collapses and firefighting disasters, where trained professionals were blinded by their expertise and couldn't bring themselves to do things that were obvious.

There's nothing quite like cryptography proofs presented to a keynote audience at 9:30 in the morning. Vanessa goes over the issues with electronic voting systems in Australia, and especially internet voting as used in NSW, including flaws in their implementation of cryptographic algorithms. There continues to be no good way to do internet voting, but with developments in methodologies like risk-limiting audits there may be reasonably safe ways to do in-person electronic voting.

OpenPOWER

There was an OpenISA miniconf, co-organised by none other than Hugh Blemings of the OpenPOWER Foundation.

Anton (on Mikey's behalf) introduces the Power OpenISA and the Microwatt FPGA core which has been released to go with it.

Anton live demos Microwatt in simulation, and also tries to synthesise it for his FPGA but runs out of time...

Paul presents an in-depth overview of the design of the Microwatt core.

Kernel

There were quite a few kernel talks, both in the Kernel Miniconf and throughout the main conference. These are just some of them:

There's been many cases where we've introduced a syscall only to find out later on that we need to add some new parameters - how do we make our syscalls extensible so we can add new parameters later on without needing to define a whole new syscall, while maintaining both forward and backward compatibility? It turns out it's pretty simple but needs a few more kernel helpers.

There are a bunch of tools out there which you can use to make your kernel hacking experience much more pleasant. You should use them.

Among other security issues with container runtimes, using procfs to setup security controls during the startup of a container is fraught with hilarious problems, because procfs and the Linux filesystem API aren't really designed to do this safely, and also have a bunch of amusing bugs.

Control Flow Integrity is a technique for restricting exploit techniques that hijack a program's control flow (e.g. by overwriting a return address on the stack (ROP), or overwriting a function pointer that's used in an indirect jump). Kees goes through the current state of CFI supporting features in hardware and what is currently available to enable CFI in the kernel.

Linux has supported huge pages for many years, which has significantly improved CPU performance. However, the huge page mechanism was driven by hardware advancements and is somewhat inflexible, and it's just as important to consider software overhead. Matthew has been working on supporting more flexible "large pages" in the page cache to do just that.

Spoiler: the magical fantasy land is a trap.

Community

Lots of community and ethics discussion this year - one talk which stood out to me:

Bradley and Karen argue that while open source has "won", software freedom has regressed in recent years, and present their vision for what modern, pragmatic Free Software activism should look like.

Other

Among the variety of other technical talks at LCA...

Quantum compilers are not really like regular classical compilers (indeed, they're really closer to FPGA synthesis tools). Matthew talks through how quantum compilers map a program on to IBM's quantum hardware and the types of optimisations they apply.

Clevis and Tang provide an implementation of "network bound encryption", allowing you to magically decrypt your secrets when you are on a secure network with access to the appropriate Tang servers. This talk outlines use cases and provides a demonstration.

Christoph discusses how to deal with the hardware and software limitations that make it difficult to capture traffic at wire speed on fast fibre networks.

rfid and hrfid

I was staring at some assembly recently, and for not the first time encountered rfid and hrfid, two instructions that we use when doing things like returning to userspace, returning from OPAL to the kernel, or from a host kernel into a guest.

rfid copies various bits from the register SRR1 (Machine Status Save/Restore Register 1) into the MSR (Machine State Register), and then jumps to an address given in SRR0 (Machine Status Save/Restore Register 0). hrfid does something similar, using HSRR0 and HSRR1 (Hypervisor Machine Status Save/Restore Registers 0/1), and slightly different handling of MSR bits.

The various Save/Restore Registers are used to preserve the state of the CPU before jumping to an interrupt handler, entering the kernel, etc, and are set up as part of instructions like sc (System Call), by the interrupt mechanism, or manually (using instructions like mtsrr1).

Anyway, the way in which rfid and hrfid restores MSR bits is documented somewhat obtusely in the ISA (if you don't believe me, look it up), and I was annoyed by this, so here, have a more useful definition. Leave a comment if I got something wrong.

rfid - Return From Interrupt Doubleword

Machine State Register

Copy all bits (except some reserved bits) from SRR1 into the MSR, with the following exceptions:

  • MSR_3 (HV, Hypervisor State) = MSR_3 & SRR1_3
    [We won't put the thread into hypervisor state if we're not already in hypervisor state]

  • If MSR_29:31 != 0b010 [Transaction State Suspended, TM not available], or SRR1_29:31 != 0b000 [Transaction State Non-transactional, TM not available] then:

    • MSR_29:30 (TS, Transaction State) = SRR1_29:30
    • MSR_31 (TM, Transactional Memory Available) = SRR1_31

    [See the ISA description for explanation on how rfid interacts with TM and resulting interrupts]

  • MSR_48 (EE, External Interrupt Enable) = SRR1_48 | SRR1_49 (PR, Problem State)
    [If going into problem state, external interrupts will be enabled]

  • MSR_51 (ME, Machine Check Interrupt Enable) = (MSR_3 (HV, Hypervisor State) & SRR1_51) | ((! MSR_3) & MSR_51)
    [If we're not already in hypervisor state, we won't alter ME]

  • MSR_58 (IR, Instruction Relocate) = SRR1_58 | SRR1_49 (PR, Problem State)
    [If going into problem state, relocation will be enabled]

  • MSR_59 (DR, Data Relocate) = SRR1_59 | SRR1_49 (PR, Problem State)
    [If going into problem state, relocation will be enabled]

Next Instruction Address

  • NIA = SRR0_0:61 || 0b00
    [Jump to SRR0, set last 2 bits to zero to ensure address is aligned to 4 bytes]

hrfid - Hypervisor Return From Interrupt Doubleword

Machine State Register

Copy all bits (except some reserved bits) from HSRR1 into the MSR, with the following exceptions:

  • If MSR_29:31 != 0b010 [Transaction State Suspended, TM not available], or HSRR1_29:31 != 0b000 [Transaction State Non-transactional, TM not available] then:

    • MSR_29:30 (TS, Transaction State) = HSRR1_29:30
    • MSR_31 (TM, Transactional Memory Available) = HSRR1_31

    [See the ISA description for explanation on how rfid interacts with TM and resulting interrupts]

  • MSR_48 (EE, External Interrupt Enable) = HSRR1_48 | HSRR1_49 (PR, Problem State)
    [If going into problem state, external interrupts will be enabled]

  • MSR_58 (IR, Instruction Relocate) = HSRR1_58 | HSRR1_49 (PR, Problem State)
    [If going into problem state, relocation will be enabled]

  • MSR_59 (DR, Data Relocate) = HSRR1_59 | HSRR1_49 (PR, Problem State)
    [If going into problem state, relocation will be enabled]

Next Instruction Address

  • NIA = HSRR0_0:61 || 0b00
    [Jump to HSRR0, set last 2 bits to zero to ensure address is aligned to 4 bytes]

TEN THOUSAND DISKS

In OpenPOWER land we have a project called op-test-framework which (for all its strengths and weaknesses) allows us to test firmware on a variety of different hardware platforms and even emulators like Qemu.

Qemu is a fantasic tool allowing us to relatively quickly test against an emulated POWER model, and of course is a critical part of KVM virtual machines running natively on POWER hardware. However the default POWER model in Qemu is based on the "pseries" machine type, which models something closer to a virtual machine or a PowerVM partition rather than a "bare metal" machine.

Luckily we have Cédric Le Goater who is developing and maintaining a Qemu "powernv" machine type which more accurately models running directly on an OpenPOWER machine. It's an unwritten rule that if you're using Qemu in op-test, you've compiled this version of Qemu!

Teething Problems

Because the "powernv" type does more accurately model the physical system some extra care needs to be taken when setting it up. In particular at one point we noticed that the pretend CDROM and disk drive we attached to the model were.. not being attached. This commit took care of that; the problem was that the PCI topology defined by the layout required us to be more exact about where PCI devices were to be added. By default only three spare PCI "slots" are available but as the commit says, "This can be expanded by adding bridges"...

More Slots!

Never one to stop at a just-enough solution, I wondered how easy it would be to add an extra PCI bridge or two to give the Qemu model more available slots for PCI devices. It turns out, easy enough once you know the correct invocation. For example, adding a PCI bridge in the first slot of the first default PHB is:

-device pcie-pci-bridge,id=pcie.3,bus=pcie.0,addr=0x0

And inserting a device in that bridge just requires us to specify the bus and slot:

-device virtio-blk-pci,drive=cdrom01,id=virtio02,bus=pcie.4,addr=3

Great! Each bridge provides 31 slots, so now we have plenty of room for extra devices.

Why Stop There?

We have three free slots, and we don't have a strict requirement on where devices are plugged in, so lets just plug a bridge into each of those slots while we're here:

-device pcie-pci-bridge,id=pcie.3,bus=pcie.0,addr=0x0 \
-device pcie-pci-bridge,id=pcie.4,bus=pcie.1,addr=0x0 \
-device pcie-pci-bridge,id=pcie.5,bus=pcie.2,addr=0x0

What happens if we insert a new PCI bridge into another PCI bridge? Aside from stressing out our PCI developers, a bunch of extra slots! And then we could plug bridges into those bridges and then..


Thus was born "OpTestQemu: Add PCI bridges to support more devices." and the testcase "Petitboot10000Disks". The changes to the Qemu model setup fill up each PCI bridge as long as we have devices to add, but reserve the first slot to add another bridge if we run out of room... and so on..

Officially this is to support adding interesting disk topologies to test Pettiboot use cases, stress test device handling, and so on, but while we're here... what happens with 10,000 temporary disks?

======================================================================
ERROR: testListDisks (testcases.Petitboot10000Disks.ConfigEditorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sam/git/op-test-framework/testcases/Petitboot10000Disks.py", line 27, in setUp
    self.system.goto_state(OpSystemState.PETITBOOT_SHELL)
  File "/home/sam/git/op-test-framework/common/OpTestSystem.py", line 366, in goto_state
    self.state = self.stateHandlers[self.state](state)
  File "/home/sam/git/op-test-framework/common/OpTestSystem.py", line 695, in run_IPLing
    raise my_exception
UnknownStateTransition: Something happened system state="2" and we transitioned to UNKNOWN state.  Review the following for more details
Message="OpTestSystem in run_IPLing and the Exception=
"filedescriptor out of range in select()"
 caused the system to go to UNKNOWN_BAD and the system will be stopping."

Yeah that's probably to be expected without some more massaging. What about a more modest 512?

I: Resetting PHBs and training links...
[   55.293343496,5] PCI: Probing slots...
[   56.364337089,3] PHB#0000:02:01.0 pci_find_ecap hit a loop !
[   56.364973775,3] PHB#0000:02:01.0 pci_find_ecap hit a loop !
[   57.127964432,3] PHB#0000:03:01.0 pci_find_ecap hit a loop !
[   57.128545637,3] PHB#0000:03:01.0 pci_find_ecap hit a loop !
[   57.395489618,3] PHB#0000:04:01.0 pci_find_ecap hit a loop !
[   57.396048285,3] PHB#0000:04:01.0 pci_find_ecap hit a loop !
[   58.145944205,3] PHB#0000:05:01.0 pci_find_ecap hit a loop !
[   58.146465795,3] PHB#0000:05:01.0 pci_find_ecap hit a loop !
[   58.404954853,3] PHB#0000:06:01.0 pci_find_ecap hit a loop !
[   58.405485438,3] PHB#0000:06:01.0 pci_find_ecap hit a loop !
[   60.178957315,3] PHB#0001:02:01.0 pci_find_ecap hit a loop !
[   60.179524173,3] PHB#0001:02:01.0 pci_find_ecap hit a loop !
[   60.198502097,3] PHB#0001:02:02.0 pci_find_ecap hit a loop !
[   60.198982582,3] PHB#0001:02:02.0 pci_find_ecap hit a loop !
[   60.435096197,3] PHB#0001:03:01.0 pci_find_ecap hit a loop !
[   60.435634380,3] PHB#0001:03:01.0 pci_find_ecap hit a loop !
[   61.171512439,3] PHB#0001:04:01.0 pci_find_ecap hit a loop !
[   61.172029071,3] PHB#0001:04:01.0 pci_find_ecap hit a loop !
[   61.425416049,3] PHB#0001:05:01.0 pci_find_ecap hit a loop !
[   61.425934524,3] PHB#0001:05:01.0 pci_find_ecap hit a loop !
[   62.172664549,3] PHB#0001:06:01.0 pci_find_ecap hit a loop !
[   62.173186458,3] PHB#0001:06:01.0 pci_find_ecap hit a loop !
[   63.434516732,3] PHB#0002:02:01.0 pci_find_ecap hit a loop !
[   63.435062124,3] PHB#0002:02:01.0 pci_find_ecap hit a loop !
[   64.177567772,3] PHB#0002:03:01.0 pci_find_ecap hit a loop !
[   64.178099773,3] PHB#0002:03:01.0 pci_find_ecap hit a loop !
[   64.431763989,3] PHB#0002:04:01.0 pci_find_ecap hit a loop !
[   64.432285000,3] PHB#0002:04:01.0 pci_find_ecap hit a loop !
[   65.180506790,3] PHB#0002:05:01.0 pci_find_ecap hit a loop !
[   65.181049905,3] PHB#0002:05:01.0 pci_find_ecap hit a loop !
[   65.432105600,3] PHB#0002:06:01.0 pci_find_ecap hit a loop !
[   65.432654326,3] PHB#0002:06:01.0 pci_find_ecap hit a loop !

(That isn't good)

[   66.177240655,5] PCI Summary:
[   66.177906083,5] PHB#0000:00:00.0 [ROOT] 1014 03dc R:00 C:060400 B:01..07 
[   66.178760724,5] PHB#0000:01:00.0 [ETOX] 1b36 000e R:00 C:060400 B:02..07 
[   66.179501494,5] PHB#0000:02:01.0 [ETOX] 1b36 000e R:00 C:060400 B:03..07 
[   66.180227773,5] PHB#0000:03:01.0 [ETOX] 1b36 000e R:00 C:060400 B:04..07 
[   66.180953149,5] PHB#0000:04:01.0 [ETOX] 1b36 000e R:00 C:060400 B:05..07 
[   66.181673576,5] PHB#0000:05:01.0 [ETOX] 1b36 000e R:00 C:060400 B:06..07 
[   66.182395253,5] PHB#0000:06:01.0 [ETOX] 1b36 000e R:00 C:060400 B:07..07 
[   66.183207399,5] PHB#0000:07:02.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   66.183969138,5] PHB#0000:07:03.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 

(a lot more of this)

[   67.055196945,5] PHB#0002:02:1e.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   67.055926264,5] PHB#0002:02:1f.0 [PCID] 1af4 1001 R:00 C:010000 (          scsi) 
[   67.094591773,5] INIT: Waiting for kernel...
[   67.095105901,5] INIT: 64-bit LE kernel discovered
[   68.095749915,5] INIT: Starting kernel at 0x20010000, fdt at 0x3075d270 168365 bytes

zImage starting: loaded at 0x0000000020010000 (sp: 0x0000000020d30ee8)
Allocating 0x1dc5098 bytes for kernel...
Decompressing (0x0000000000000000 <- 0x000000002001f000:0x0000000020d2e578)...
Done! Decompressed 0x1c22900 bytes

Linux/PowerPC load: 
Finalizing device tree... flat tree at 0x20d320a0
[   10.120562] watchdog: CPU 0 self-detected hard LOCKUP @ pnv_pci_cfg_write+0x88/0xa4
[   10.120746] watchdog: CPU 0 TB:50402010473, last heartbeat TB:45261673150 (10039ms ago)
[   10.120808] Modules linked in:
[   10.120906] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.5-openpower1 #2
[   10.120956] NIP:  c000000000058544 LR: c00000000004d458 CTR: 0000000030052768
[   10.121006] REGS: c0000000fff5bd70 TRAP: 0900   Not tainted  (5.0.5-openpower1)
[   10.121030] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 48002482  XER: 20000000
[   10.121215] CFAR: c00000000004d454 IRQMASK: 1 
[   10.121260] GPR00: 00000000300051ec c0000000fd7c3130 c000000001bcaf00 0000000000000000 
[   10.121368] GPR04: 0000000048002482 c000000000058544 9000000002009033 0000000031c40060 
[   10.121476] GPR08: 0000000000000000 0000000031c40060 c00000000004d46c 9000000002001003 
[   10.121584] GPR12: 0000000031c40000 c000000001dd0000 c00000000000f560 0000000000000000 
[   10.121692] GPR16: 0000000000000000 0000000000000000 0000000000000001 0000000000000000 
[   10.121800] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
[   10.121908] GPR24: 0000000000000005 0000000000000000 0000000000000000 0000000000000104 
[   10.122016] GPR28: 0000000000000002 0000000000000004 0000000000000086 c0000000fd9fba00 
[   10.122150] NIP [c000000000058544] pnv_pci_cfg_write+0x88/0xa4
[   10.122187] LR [c00000000004d458] opal_return+0x14/0x48
[   10.122204] Call Trace:
[   10.122251] [c0000000fd7c3130] [c000000000058544] pnv_pci_cfg_write+0x88/0xa4 (unreliable)
[   10.122332] [c0000000fd7c3150] [c0000000000585d0] pnv_pci_write_config+0x70/0x9c
[   10.122398] [c0000000fd7c31a0] [c000000000234fec] pci_bus_write_config_word+0x74/0x98
[   10.122458] [c0000000fd7c31f0] [c00000000023764c] __pci_read_base+0x88/0x3a4
[   10.122518] [c0000000fd7c32c0] [c000000000237a18] pci_read_bases+0xb0/0xc8
[   10.122605] [c0000000fd7c3300] [c0000000002384bc] pci_setup_device+0x4f8/0x5b0
[   10.122670] [c0000000fd7c33a0] [c000000000238d9c] pci_scan_single_device+0x9c/0xd4
[   10.122729] [c0000000fd7c33f0] [c000000000238e2c] pci_scan_slot+0x58/0xf4
[   10.122796] [c0000000fd7c3430] [c000000000239eb8] pci_scan_child_bus_extend+0x40/0x2a8
[   10.122861] [c0000000fd7c34a0] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.122928] [c0000000fd7c3580] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.122993] [c0000000fd7c35f0] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.123059] [c0000000fd7c36d0] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.123124] [c0000000fd7c3740] [c000000000239e34] pci_scan_bridge_extend+0x4d4/0x504
[   10.123191] [c0000000fd7c3820] [c00000000023a0f8] pci_scan_child_bus_extend+0x280/0x2a8
[   10.123256] [c0000000fd7c3890] [c000000000239b5c] pci_scan_bridge_extend+0x1fc/0x504
[   10.123322] [c0000000fd7c3970] [c00000000023a064] pci_scan_child_bus_extend+0x1ec/0x2a8
[   10.123388] [c0000000fd7c39e0] [c000000000239b5c] pci_scan_bridge_extend+0x1fc/0x504
[   10.123454] [c0000000fd7c3ac0] [c00000000023a064] pci_scan_child_bus_extend+0x1ec/0x2a8
[   10.123516] [c0000000fd7c3b30] [c000000000030dcc] pcibios_scan_phb+0x134/0x1f4
[   10.123574] [c0000000fd7c3bd0] [c00000000100a800] pcibios_init+0x9c/0xbc
[   10.123635] [c0000000fd7c3c50] [c00000000000f398] do_one_initcall+0x80/0x15c
[   10.123698] [c0000000fd7c3d10] [c000000001000e94] kernel_init_freeable+0x248/0x24c
[   10.123756] [c0000000fd7c3db0] [c00000000000f574] kernel_init+0x1c/0x150
[   10.123820] [c0000000fd7c3e20] [c00000000000b72c] ret_from_kernel_thread+0x5c/0x70
[   10.123854] Instruction dump:
[   10.123885] 7d054378 4bff56f5 60000000 38600000 38210020 e8010010 7c0803a6 4e800020 
[   10.124022] e86a0018 54c6043e 7d054378 4bff5731 <60000000> 4bffffd8 e86a0018 7d054378 
[   10.124180] Kernel panic - not syncing: Hard LOCKUP
[   10.124232] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.5-openpower1 #2
[   10.124251] Call Trace:

I wonder if I can submit that bug without someone throwing something at my desk.