Store Half Byte-Reverse Indexed

A Power Technical Blog

Getting In Sync

Since at least v1.0.0 Petitboot has used device-mapper snapshots to avoid mounting block devices directly. Primarily this is so Petitboot can mount disks and potentially perform filesystem recovery without worrying about messing them up and corrupting a host's boot partition - all changes happen to the snapshot in memory without affecting the actual device.

This of course gets in the way if you actually do want to make changes to a block device. Petitboot will allow certain bootloader scripts to make changes to disks if configured (eg, grubenv updates), but if you make changes manually you would need to know the special sequence of dmsetup commands to merge the snapshots back to disk. This is particularly annoying if you're trying to copy logs to a USB device!

Depending on how recent a version of Petitboot you're running, there are two ways of making sure your changes persist:

Before v1.2.2

If you really need to save changes from within Petitboot, the most straightforward way is to disable snapshots. Drop to the shell and enter

nvram --update-config petitboot,snapshots?=false

Once you have rebooted you can remount the device as read-write and modify it as normal.

After v1.2.2

To make this easier while keeping the benefit of snapshots, v1.2.2 introduces a new user-event that will merge snapshots on demand. For example:

mount -o remount,rw /var/petitboot/mnt/dev/sda2
cp /var/log/messages /var/petitboot/mnt/dev/sda2/
pb-event sync@sda2

After calling pb-event sync@yourdevice, Petitboot will remount the device back to read-only and merge the current snapshot differences back to disk. You can also run pb-event sync@all to sync all existing snapshots if desired.

Get off my lawn: separating Docker workloads using cgroups

On my team, we do two different things in our Continuous Integration setup: build/functional tests, and performance tests. Build tests simply test whether a project builds, and, if the project provides a functional test suite, that the tests pass. We do a lot of MySQL/MariaDB testing this way. The other type of testing we do is performance tests: we build a project and then run a set of benchmarks against it. Python is a good example here.

Build tests want as much grunt as possible. Performance tests, on the other hand, want a stable, isolated environment. Initially, we set up Jenkins so that performance and build tests never ran at the same time. Builds would get the entire machine, and performance tests would never have to share with anyone.

This, while simple and effective, has some downsides. In POWER land, our machines are quite beefy. For example, one of the boxes I use - an S822L - has 4 sockets, each with 4 cores. At SMT-8 (an 8-way split of each core) that gives us 4 x 4 x 8 = 128 threads. It seems wasteful to lock up this entire machine - all 128 threads - just to isolate a single-threaded test.1

So, can we partition our machine so that we can be running two different sorts of processes in a sufficiently isolated way?

What counts as 'sufficiently isolated'? Well, my performance tests are CPU bound, so I want CPU isolation. I also want memory, and in particular memory bandwidth, to be isolated. I don't particularly care about IO isolation as my tests aren't IO heavy. Lastly, I have a couple of tests that are very multithreaded, so I'd like to have enough of a machine for those test results to be interesting.

For CPU isolation we have CPU affinity. We can also do something similar with memory. On a POWER8 system, memory is connected to individual P8s, not to some central point. This is a 'Non-Uniform Memory Architecture' (NUMA) setup: the directly attached memory will be very fast for a processor to access, and memory attached to other processors will be slower to access. An accessible guide (with very helpful diagrams!) is the relevant IBM Redbook (PDF), chapter 2.

We could achieve the isolation we want by dividing up CPUs and NUMA nodes between the competing workloads. Fortunately, all of the hardware NUMA information is plumbed nicely into Linux. Each P8 socket gets a corresponding NUMA node. lscpu will tell you what CPUs correspond to which NUMA nodes (although what it calls a CPU we would call a hardware thread). If you install numactl, you can use numactl -H to get even more details.

In our case, the relevant lscpu output is thus:

NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     96-127
NUMA node16 CPU(s):    32-63
NUMA node17 CPU(s):    64-95
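As a quick sanity check on those ranges, a tiny shell helper (purely illustrative - not part of lscpu or numactl) can count how many hardware threads a cpuset-style range string covers:

```shell
# Count the hardware threads covered by a cpuset-style range string,
# e.g. "0-31" or "0-31,96-127". Purely illustrative.
count_cpus() {
    total=0
    old_ifs=$IFS; IFS=','
    for part in $1; do
        lo=${part%-*}
        hi=${part#*-}
        total=$(( total + hi - lo + 1 ))
    done
    IFS=$old_ifs
    echo "$total"
}

count_cpus 0-31      # node 0: 32 threads
count_cpus 32-127    # nodes 16, 17 and 1 together: 96 threads
```

Each node contributes 32 threads, consistent with the four 32-thread nodes lscpu reports.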

Now all we have to do is find some way to tell Linux to restrict a group of processes to a particular NUMA node and the corresponding CPUs. How? Enter control groups, or cgroups for short. Processes can be put into a cgroup, and then a cgroup controller can control the resources allocated to the cgroup. Cgroups are hierarchical, and there are controllers for a number of different ways you could control a group of processes. Most helpfully for us, there's one called cpuset, which can control CPU affinity, and restrict memory allocation to a NUMA node.

We then just have to get the processes into the relevant cgroup. Fortunately, Docker is incredibly helpful for this!2 Docker containers are put in the docker cgroup, and each container gets its own cgroup under the docker cgroup. Docker also deals well with the somewhat broken state of cpuset inheritance.3 So it suffices to create a cpuset cgroup for docker and allocate some resources to it, and Docker will do the rest. Here we'll allocate the last 3 sockets and NUMA nodes to Docker containers:

cgcreate -g cpuset:docker
echo 32-127 > /sys/fs/cgroup/cpuset/docker/cpuset.cpus
echo 1,16-17 > /sys/fs/cgroup/cpuset/docker/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/docker/cpuset.mem_hardwall

mem_hardwall prevents memory allocations under docker from spilling over into the one remaining NUMA node.
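One way to verify the restrictions from inside a container is to ask the kernel directly - every process's effective cpuset shows up in /proc. In a shell running under the docker cgroup above, you'd expect Cpus_allowed_list to read 32-127 and Mems_allowed_list to read 1,16-17:

```shell
# The kernel reports the calling process's cpuset restrictions in
# /proc/self/status; outside any cpuset these show the whole machine.
grep -E 'Cpus_allowed_list|Mems_allowed_list' /proc/self/status
```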

So, does this work? I created a container with sysbench and then ran the following:

root@0d3f339d4181:/# sysbench --test=cpu --num-threads=128 --max-requests=10000000 run

Now I've asked for 128 threads, but the cgroup only has CPUs/hwthreads 32-127 allocated. So if I run htop, I shouldn't see any load on CPUs 0-31. What do I actually see?

htop screenshot, showing load only on CPUs 32-127

It works! Now, we create a cgroup for performance tests using the first socket and NUMA node:

cgcreate -g cpuset:perf-cgroup
echo 0-31 > /sys/fs/cgroup/cpuset/perf-cgroup/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/perf-cgroup/cpuset.mems
echo 1 > /sys/fs/cgroup/cpuset/perf-cgroup/cpuset.mem_hardwall

Docker conveniently lets us put new containers under a different cgroup, which means we can simply do:

dja@p88 ~> docker run -it --rm --cgroup-parent=/perf-cgroup/ ppc64le/ubuntu bash
root@b037049f94de:/# # ... install sysbench
root@b037049f94de:/# sysbench --test=cpu --num-threads=128 --max-requests=10000000 run

And the result?

htop screenshot, showing load only on CPUs 0-31

It works! My benchmark results also suggest this is sufficient isolation, and the rest of the team is happy to have more build resources to play with.

There are some boring loose ends to tie up: if a build job does anything outside of docker (like clone a git repo), that doesn't come under the docker cgroup, so we have to interact with systemd. Because systemd doesn't know about cpuset, this is quite fiddly. We also want this in a systemd unit so it runs on start up, and we want some code to tear it down. But I'll spare you the gory details.
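As a flavour of the start-up half, a small oneshot unit that recreates the cgroup before Docker starts could look something like this. This is only a sketch under the cgroup-v1 cpuset layout used above - the unit name and paths are hypothetical, not the exact unit we run:

```
# /etc/systemd/system/docker-cpuset.service (hypothetical example)
[Unit]
Description=Restrict Docker containers to NUMA nodes 1, 16 and 17
Before=docker.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/cgcreate -g cpuset:docker
ExecStart=/bin/sh -c 'echo 32-127 > /sys/fs/cgroup/cpuset/docker/cpuset.cpus'
ExecStart=/bin/sh -c 'echo 1,16-17 > /sys/fs/cgroup/cpuset/docker/cpuset.mems'
ExecStart=/bin/sh -c 'echo 1 > /sys/fs/cgroup/cpuset/docker/cpuset.mem_hardwall'

[Install]
WantedBy=multi-user.target
```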

In summary, cgroups are surprisingly powerful and simple to work with, especially in conjunction with Docker and NUMA on Power!

  1. It gets worse! Before the performance test starts, all the running build jobs must drain. If we have 8 Jenkins executors running on the box, and a performance test job is the next in the queue, we have to wait for 8 running jobs to clear. If they all started at different times and have different runtimes, we will inevitably spend a fair chunk of time with the machine at less than full utilisation while we're waiting. 

  2. At least, on Ubuntu 16.04. I haven't tested if this is true anywhere else. 

  3. I hear this is getting better. It is also why systemd hasn't done cpuset inheritance yet. 

Where to Get a POWER8 Development VM

POWER8 sounds great, but where the heck can I get a Power VM so I can test my code?

This is a common question we get at OzLabs from other open source developers looking to port their software to the Power Architecture. Unfortunately, most developers don't have one of our amazing servers just sitting around under their desk.

Thankfully, there are a few IBM partners who offer free VMs for development use. If you're in need of a development VM, check out:

So, next time you wonder how you can test your project on POWER8, request a VM and get to it!

Optical Action at a Distance

Generally when someone wants to install a Linux distro they start with an ISO file. Now we could burn that to a DVD, walk into the server room, and put it in our machine, but that's a pain. Instead let's look at how to do this over the network with Petitboot!

At the moment Petitboot won't be able to handle an ISO file unless it's mounted in an expected place (eg. as a mounted DVD), so we need to unpack it somewhere. Choose somewhere to host the result and unpack the ISO via whatever method you prefer. (For example bsdtar -xf /path/to/image.iso).
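If you don't have a web server handy, one quick way to host the unpacked files is python3's built-in HTTP server - the port is an arbitrary choice, and it serves whatever directory you run it from:

```shell
# Serve the current directory (i.e. the unpacked ISO tree) over HTTP so
# Petitboot can fetch the kernel, initrd and config. Port 8000 is arbitrary.
python3 -m http.server 8000 &
HTTPD_PID=$!    # remember the PID so the server can be stopped later
```

Any web server works here; this is just the lowest-effort option for a one-off install.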

You'll get a bunch of files, but for our purposes we only care about three: the kernel, the initrd, and the bootloader configuration file. Using the Ubuntu 16.04 ppc64el ISO as an example, these are install/vmlinux, install/initrd.gz, and boot/grub/grub.cfg respectively.

In grub.cfg we can see that the boot arguments are actually quite simple:

set timeout=-1

menuentry "Install" {
    linux   /install/vmlinux tasks=standard pkgsel/language-pack-patterns= pkgsel/install-language-support=false --- quiet
    initrd  /install/initrd.gz
}

menuentry "Rescue mode" {
    linux   /install/vmlinux rescue/enable=true --- quiet
    initrd  /install/initrd.gz
}

So all we need to do is create a PXE config file that points Petitboot towards the correct files.

We're going to create a PXE config file which you could serve from your DHCP server, but that does not mean we need to use PXE - if you just want a quick install, you need only make these files accessible to Petitboot, and then we can use the 'Retrieve config from URL' option to download the files.

Create a petitboot.conf file somewhere accessible that contains (for Ubuntu):

label Install Ubuntu 16.04 Xenial Xerus
    kernel http://myaccessibleserver/path/to/vmlinux
    initrd http://myaccessibleserver/path/to/initrd.gz
    append tasks=standard pkgsel/language-pack-patterns= pkgsel/install-language-support=false --- quiet

Then in Petitboot, select 'Retrieve config from URL' and enter http://myaccessibleserver/path/to/petitboot.conf. In the main menu your new option should appear - select it and away you go!
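If you'd rather have the config picked up automatically over DHCP, Petitboot understands the pxelinux-style DHCP options for a config file (209) and a path prefix (210). A sketch for dnsmasq - the values are examples, and dhcp-option-force is used because clients don't normally request these options:

```
# dnsmasq.conf fragment (example values)
dhcp-option-force=209,"petitboot.conf"
dhcp-option-force=210,"http://myaccessibleserver/path/to/"
```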

A Taste of IBM

As a hobbyist programmer and Linux user, I was pretty stoked to be able to experience real work in the IT field that interests me most, Linux. With a mainly disconnected understanding of computer hardware and software, I braced myself to entirely relearn everything and anything I thought I knew. Furthermore, I worried that my usefulness in a world of maintainers, developers and testers would not be enough to provide any real contribution to the company. In actual fact, however, the employees at OzLabs (IBM ADL) put a really great effort into making use of my existing skills, were attentive to my current knowledge and just filled in the gaps! The knowledge they've given me is practical, interlinked with hardware and provided me with the foot-up that I'd been itching for to establish my own portfolio as a programmer. I was both honoured and astonished by their dedication to helping me make a truly meaningful contribution!

On applying for the placement, I listed my skills and interests. Coming from a Mathematics and Science background, I listed among my greatest interests the development of scientific simulations and graphics using libraries such as Python's matplotlib and R. By the first day they had put me to work, researching and implementing a routine in R that would qualitatively model the ability of a system to perform common tasks - a benchmark. A series of these microbenchmarks was made; I was in my element and actually able to contribute to a corporation much larger than I could imagine. The team at IBM reinforced my knowledge from the ground up, introducing the rigorous hardware and corporate elements at a level I was comfortable with.

I would say that my greatest single piece of take-home knowledge over the two weeks was knowledge of the Linux Kernel project, Git and GitHub. Having met the arch/powerpc and linux-next maintainers in person placed the Linux and Open Source development cycle in an entirely new perspective. I was introduced to the world of GitHub, and thanks to a few rigorous lessons of Git, I now have access to tools that empower me to safely and efficiently write code, and to build a public portfolio I can be proud of. Most members of the office donated their time to instruct me on all fronts, whether to do with career paths, programming expertise or conceptual knowledge, and the rest were all very good for a chat.

Approaching the tail-end of Year Twelve, I was blessed with some really good feedback and recommendations regarding further study. If during the two weeks I had any query about anything, ranging from work-life to programming expertise, even to which code editor I should use (a source of much contention), the people in the office were very happy to help me. Several employees donated their time to teach me long, intensive lessons on software development concepts, including (but not limited to!) a thorough and helpful lesson on Git that was just on my level of understanding.

Working at IBM these past two weeks has not only bridged the gap between my hobby and my professional prospects, but more importantly established friendships with professionals in the field of Software Development. Without a doubt this really great experience of an environment that rewards my enthusiasm will fondly stay in my mind as I enter the next chapter of my life!