Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Erasure Coding for Programmers, Part 1

Erasure coding is an increasingly popular storage technology - allowing the same level of fault tolerance as replication with a significantly reduced storage footprint.

Increasingly, erasure coding is available 'out of the box' on storage solutions such as Ceph and OpenStack Swift. Normally, you'd just pull in a library like ISA-L or jerasure, set some config options, and you'd be done.

This post is not about that. This post is about how I went from knowing nothing about erasure coding to writing POWER optimised routines to make it go fast. (These are in the process of being polished for upstream at the moment.) If you want to understand how erasure coding works under the hood - and in particular if you're interested in writing optimised routines to make it run quickly on your platform - this is for you.

What are erasure codes anyway?

I think the easiest way to begin thinking about erasure codes is "RAID 6 on steroids". RAID 6 allows you to have up to 255 data disks and 2 parity disks (called P and Q), thus allowing you to tolerate the failure of up to 2 arbitrary disks without data loss.

Erasure codes allow you to have k data disks and m 'parity' or coding disks. You then have a total of m + k disks, and you can tolerate the failure of up to m without losing data.

The downside of erasure coding is that computing what to put on those parity disks is CPU intensive. Let's look at what we put on them.

RAID 6

RAID 6 is the easiest way to get started on understanding erasure codes for a number of reasons. H. Peter Anvin's paper on RAID 6 in the Linux kernel is an excellent start, but it dives into the underlying mathematics a bit quickly. So before reading that, read on!

Rings and Fields

As programmers we're pretty comfortable with modular arithmetic - the idea that if you have:

unsigned char a = 255;
a++;

the new value of a will be 0, not 256.

This is an example of an algebraic structure called a ring.

Rings obey certain laws. For our purposes, we'll consider the following incomplete and somewhat simplified list:

  • There is an addition operation.
  • There is an additive identity (normally called 0), such that 'a + 0 = a'.
  • Every element has an additive inverse, that is, for every element 'a', there is an element -a such that 'a + (-a) = 0'
  • There is a multiplication operation.
  • There is a multiplicative identity (normally called 1), such that 'a * 1 = a'.

These operations aren't necessarily addition or multiplication as we might expect from the integers or real numbers. For example, in our modular arithmetic example, we have 'wrap around'. (There are also certain rules that addition and multiplication must satisfy - we are glossing over them here.)

One thing a ring doesn't necessarily have is a 'multiplicative inverse'. The multiplicative inverse of some non-zero element of the ring (call it a) is the value b such that a * b = 1. (Often instead of b we write 'a^-1', but that looks bad in plain text, so we shall stick to b for now.)

We do have some inverses in 'mod 256': the inverse of 3 is 171 as 3 * 171 = 513, and 513 = 1 mod 256, but there is no b such that 2 * b = 1 mod 256.
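
If you'd like to convince yourself of that, a brute-force search is enough. Here's a minimal C sketch (the function name is just for illustration):

#include <stdio.h>

/* Brute-force search for a multiplicative inverse modulo 256.
 * Prints the inverse if one exists, or reports that none does. */
static void find_inverse_mod256(unsigned int a)
{
    for (unsigned int b = 0; b < 256; b++) {
        if ((a * b) % 256 == 1) {
            printf("%u * %u = 1 mod 256\n", a, b);
            return;
        }
    }
    printf("%u has no inverse mod 256\n", a);
}

int main(void)
{
    find_inverse_mod256(3);   /* finds 171, as above */
    find_inverse_mod256(2);   /* no inverse: 2 * b is always even */
    return 0;
}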

If every non-zero element of our ring had a multiplicative inverse, we would have what is called a field.

Now, let's look at the integers modulo 2, that is, 0 and 1.

We have this for addition:

+ 0 1
0 0 1
1 1 0

Eagle-eyed readers will notice that this is the same as XOR.

For multiplication:

* 0 1
0 0 0
1 0 1

As we said, a field is a ring where every non-zero element has a multiplicative inverse. As we can see, the integers modulo 2 shown above is a field: it's a ring, and 1 is its own multiplicative inverse.

So this is all well and good, but you can't really do very much in a field with 2 elements. This is sad, so we make bigger fields. For this application, we consider the Galois Field with 256 elements - GF(2^8). This field has some surprising and useful properties.

Remember how we said that integers modulo 256 weren't a field because they didn't have multiplicative inverses? I also just said that GF(2^8) also has 256 elements, but is a field - i.e., it does have inverses! How does that work?

Consider an element in GF(2^8). There are two ways to look at it. The first is to consider it as an 8-bit number. So, for example, let's take 100. We can express that as an 8-bit binary number: 0b01100100.

We can write that more explicitly as a sum of powers of 2:

0 * 2^7 + 1 * 2^6 + 1 * 2^5 + 0 * 2^4 + 0 * 2^3 + 1 * 2^2 + 0 * 2 + 0 * 1
= 2^6 + 2^5 + 2^2

Now the other way we can look at elements in GF(2^8) is to replace the '2's with 'x's, and consider them as polynomials. Each of our bits then represents the coefficient of a term of a polynomial, that is:

0 x^7 + 1 x^6 + 1 x^5 + 0 x^4 + 0 x^3 + 1 x^2 + 0 x + 0 * 1

or more simply

x^6 + x^5 + x^2

Now, and this is important: each of the coefficients is an element of the integers modulo 2: x + x = 2x = 0, as 2 mod 2 = 0. There is no concept of 'carrying' in this addition.

Let's try: what's 100 + 79 in GF(2^8)?

100 = 0b01100100 => x^6 + x^5 +       x^2
 79 = 0b01001111 => x^6 +       x^3 + x^2 + x + 1

100 + 79         =>   0 + x^5 + x^3 +   0 + x + 1
                 =    0b00101011 = 43

So, 100 + 79 = 43 in GF(2^8)

You may notice we could have done that much more efficiently: we can add numbers in GF(2^8) by just XORing their binary representations together. Subtraction, amusingly, is the same as addition: 0 + x = x = 0 - x, as -1 is congruent to 1 modulo 2.
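
In C, that observation makes addition (and subtraction) in GF(2^8) a one-liner. A minimal sketch (the function name is just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Addition in GF(2^8) is just XOR of the byte representations;
 * subtraction is identical. */
static uint8_t gf256_add(uint8_t a, uint8_t b)
{
    return a ^ b;
}

int main(void)
{
    printf("%d\n", gf256_add(100, 79));   /* prints 43, as worked out above */
    return 0;
}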

So at this point you might be wanting to explore a few additions yourself. Fortunately there's a lovely tool that will allow you to do that:

sudo apt install gf-complete-tools
gf_add $A $B 8

This will give you A + B in GF(2^8).

> gf_add 100 79 8
43

Excellent!

So, hold on to your hats, as this is where things get really weird. In the modular arithmetic example, we considered the elements of our ring to be numbers, and we performed our addition and multiplication modulo 256. In GF(2^8), we consider our elements as polynomials and we perform our addition and multiplication modulo a polynomial. There is one conventional polynomial used in applications:

0x11d => 0b1 0001 1101 => x^8 + x^4 + x^3 + x^2 + 1

It is possible to use other polynomials if they satisfy particular requirements, but for our applications we don't need to worry as we will always use 0x11d. I am not going to attempt to explain anything about this polynomial - take it as an article of faith.

So when we multiply two numbers, we multiply their polynomial representations. Then, to find out what that is modulo 0x11d, we do polynomial long division by 0x11d, and take the remainder.

Some examples will help.

Let's multiply 100 by 3.

100 = 0b01100100 => x^6 + x^5 + x^2
  3 = 0b00000011 => x + 1

(x^6 + x^5 + x^2)(x + 1) = x^7 + x^6 + x^3 + x^6 + x^5 + x^2
                         = x^7 + x^5 + x^3 + x^2

Notice that some of the terms have disappeared: x^6 + x^6 = 0.

The degree (the largest power of a term) is 7. 7 is less than the degree of 0x11d, which is 8, so we don't need to do anything: the remainder modulo 0x11d is simply x^7 + x^5 + x^3 + x^2.

In binary form, that is 0b10101100 = 172, so 100 * 3 = 172 in GF(2^8).

Fortunately gf-complete-tools also allows us to check multiplications:

> gf_mult 100 3 8
172

Excellent!

Now let's see what happens if we multiply by a larger number. Let's multiply 100 by 5.

100 = 0b01100100 => x^6 + x^5 + x^2
  5 = 0b00000101 => x^2 + 1

(x^6 + x^5 + x^2)(x^2 + 1) = x^8 + x^7 + x^4 + x^6 + x^5 + x^2
                           = x^8 + x^7 + x^6 + x^5 + x^4 + x^2

Here we have an x^8 term, so we have a degree of 8. This means we will get a different remainder when we divide by our polynomial. We do this with polynomial long division, which you will hopefully remember if you did some solid algebra in high school.

                              1
                           ---------------------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^8 + x^7 + x^6 + x^5 + x^4       + x^2
                          - x^8                   + x^4 + x^3 + x^2 + 1
                            -------------------------------------------
                          =       x^7 + x^6 + x^5       + x^3       + 1

So we have that our original polynomial (x^8 + x^7 + x^6 + x^5 + x^4 + x^2) is congruent to (x^7 + x^6 + x^5 + x^3 + 1) modulo the polynomial 0x11d. Looking at the binary representation of that new polynomial, we have 0b11101001 = 233.

Sure enough:

> gf_mult 100 5 8
233

Just to solidify the polynomial long division a bit, let's try a slightly larger example, 100 * 9:

100 = 0b01100100 => x^6 + x^5 + x^2
  9 = 0b00001001 => x^3 + 1

(x^6 + x^5 + x^2)(x^3 + 1) = x^9 + x^8 + x^5 + x^6 + x^5 + x^2
                           = x^9 + x^8 + x^6 + x^2

Doing long division to reduce our result:

                              x
                           -----------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^9 + x^8       + x^6                   + x^2
                          - x^9                   + x^5 + x^4 + x^3       + x
                            -------------------------------------------------
                          =       x^8       + x^6 + x^5 + x^4 + x^3 + x^2 + x

We still have a polynomial of degree 8, so we can do another step:

                              x +   1
                           -----------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^9 + x^8       + x^6                   + x^2
                          - x^9                   + x^5 + x^4 + x^3       + x
                            -------------------------------------------------
                          =       x^8       + x^6 + x^5 + x^4 + x^3 + x^2 + x
                          -       x^8                   + x^4 + x^3 + x^2     + 1
                                  -----------------------------------------------
                          =                   x^6 + x^5                   + x + 1

We now have a polynomial of degree less than 8 that is congruent to our original polynomial modulo 0x11d, and the binary form is 0b01100011 = 99.

> gf_mult 100 9 8
99

This process can be done much more efficiently, of course - but understanding what is going on under the hood will make you much more comfortable with the optimised routines!
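
For the curious, here's a minimal C sketch of that by-hand process: an unreduced polynomial multiply into 16 bits, followed by the long division by 0x11d (the function name is just for illustration, not from any particular library):

#include <stdint.h>
#include <stdio.h>

/* Multiply two elements of GF(2^8) modulo the polynomial 0x11d, mirroring
 * the by-hand process above: multiply the polynomials without reduction,
 * then do long division to knock out every term of degree >= 8. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint16_t product = 0;

    /* polynomial multiplication: coefficients are mod 2, so use XOR */
    for (int i = 0; i < 8; i++)
        if (b & (1 << i))
            product ^= (uint16_t)a << i;

    /* long division by x^8 + x^4 + x^3 + x^2 + 1 (0x11d) */
    for (int i = 14; i >= 8; i--)
        if (product & (1 << i))
            product ^= 0x11d << (i - 8);

    return (uint8_t)product;
}

int main(void)
{
    printf("%d %d %d\n", gf256_mul(100, 3), gf256_mul(100, 5), gf256_mul(100, 9));
    return 0;
}

Running it prints 172 233 99, matching the worked examples and the gf_mult output above.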

I will not try to convince you that all multiplicative inverses exist in this magic shadow land of GF(2^8), but it's important for the rest of the algorithms to work that they do exist. Trust me on this.

Back to RAID 6

Equipped with this knowledge, you are ready to take on RAID6 in the kernel (PDF) sections 1 - 2.

Pause when you get to section 3 - this snippet is a bit magic and benefits from some explanation:

Multiplication by {02} for a single byte can be implemented using the C code:

uint8_t c, cc;
cc = (c << 1) ^ ((c & 0x80) ? 0x1d : 0);

How does this work? Well:

Say you have a binary number 0bNMMMMMMM. Multiplication by 2 gives you 0bNMMMMMMM0, which is 9 bits. Now, there are two cases to consider.

If your leading bit (N) is 0, your product doesn't have an x^8 term, so we don't need to reduce it modulo the irreducible polynomial.

If your leading bit is 1 however, your product is x^8 + something, which does need to be reduced. Fortunately, because we took an 8 bit number and multiplied it by 2, the largest term is x^8, so we only need to reduce it once. So we xor our number with our polynomial to subtract it.

We implement this by letting the top bit overflow out and then XORing the lower 8 bits with the low 8 bits of the polynomial (0x1d).

So, back to the original statement:

(c << 1) ^ ((c & 0x80) ? 0x1d : 0)
    |          |          |     |
    > multiply by 2       |     |
               |          |     |
               > is the high bit set - will the product have an x^8 term?
                          |     |
                          > if so, reduce by the polynomial
                                |
                                > otherwise, leave alone

Hopefully that makes sense.
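
To see where this multiply-by-{02} primitive gets used, here's a rough scalar sketch of generating the RAID 6 P and Q syndromes over a set of data buffers, loosely following the Horner's-rule formulation in the paper. The function and variable names are made up for illustration - the kernel's real implementations are heavily unrolled and vectorised:

#include <stdint.h>
#include <stddef.h>

/* Multiply a byte by {02} in GF(2^8), exactly as in the snippet above. */
static uint8_t gf256_mul2(uint8_t c)
{
    return (c << 1) ^ ((c & 0x80) ? 0x1d : 0);
}

/* Generate the P and Q syndromes over 'disks' data buffers of 'len' bytes.
 * P is plain XOR parity; Q is built with Horner's rule, multiplying the
 * running total by {02} as we walk from the highest-numbered disk down. */
static void gen_syndromes(int disks, size_t len, uint8_t **data,
                          uint8_t *p, uint8_t *q)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t pv = 0, qv = 0;

        for (int d = disks - 1; d >= 0; d--) {
            pv ^= data[d][i];
            qv = gf256_mul2(qv) ^ data[d][i];
        }
        p[i] = pv;
        q[i] = qv;
    }
}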

Key points

It's critical you understand the section on Altivec (the vperm stuff), so let's cover it in a bit more detail.

Say you want to do A * V, where A is a constant and V is an 8-bit variable. We can express V as V_a + V_b, where V_a is the bottom 4 bits of V, and V_b is the top 4 bits. Then A * V = A * V_a + A * V_b.

We can then make lookup tables for multiplication by A.

If we did this in the most obvious way, we would need a 256 entry lookup table. But by splitting things into the top and bottom halves, we can reduce that to two 16 entry tables. For example, say A = 02.

V_a   A * V_a
00    00
01    02
02    04
...   ...
0f    1e

V_b   A * V_b
00    00
10    20
20    40
...   ...
f0    fd

We then use vperm to look up entries in these tables and vxor to combine our results.

So - and this is a key point - for each A value we wish to multiply by, we need to generate a new lookup table.

So if we wanted A = 03:

V_a   A * V_a
00    00
01    03
02    06
...   ...
0f    11

V_b   A * V_b
00    00
10    30
20    60
...   ...
f0    0d
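
Here's a rough C sketch of how such a pair of tables might be generated for an arbitrary A, and how a lookup-and-XOR multiply of a single byte then works. The names (gf256_mul, build_nibble_tables) are purely illustrative, not the kernel's:

#include <stdint.h>
#include <stdio.h>

/* GF(2^8) multiply by shift-and-reduce, built on the multiply-by-{02}
 * trick from the paper (equivalent to the long-division sketch earlier,
 * just done one bit at a time). */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;

    while (b) {
        if (b & 1)
            product ^= a;
        a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);   /* a *= {02} */
        b >>= 1;
    }
    return product;
}

/* Build the two 16-entry tables that vperm indexes for a constant A:
 * one for the bottom nibble of V, one for the top nibble. */
static void build_nibble_tables(uint8_t a, uint8_t lo[16], uint8_t hi[16])
{
    for (int i = 0; i < 16; i++) {
        lo[i] = gf256_mul(a, (uint8_t)i);
        hi[i] = gf256_mul(a, (uint8_t)(i << 4));
    }
}

int main(void)
{
    uint8_t lo[16], hi[16], v = 100;

    build_nibble_tables(0x03, lo, hi);
    /* A * V = (A * V_bottom) xor (A * V_top): expect 172 for 3 * 100 */
    printf("%d\n", lo[v & 0x0f] ^ hi[v >> 4]);
    return 0;
}

The vperm/vxor sequence below does exactly this lookup-and-combine, 16 bytes at a time.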

One final thing is that POWER8 adds a vpermxor instruction, so we can reduce the entire 4-instruction sequence in the paper:

vsrb v1, v0, v14
vperm v2, v12, v12, v0
vperm v1, v13, v13, v1
vxor v1, v2, v1

to 1 vpermxor:

vpermxor v1, v12, v13, v0

Isn't POWER grand?

OK, but how does this relate to erasure codes?

I'm glad you asked.

Galois Field arithmetic, and its application in RAID 6, is the basis for erasure coding. (It's also the basis for CRCs - two for the price of one!)

But, that's all to come in part 2, which will definitely be published before 7 April!

Many thanks to Sarah Axtens who reviewed the mathematical content of this post and suggested significant improvements. All errors and gross oversimplifications remain my own. Thanks also to the OzLabs crew for their feedback and comments.

High Power Lustre

(Most of the hard work here was done by fellow blogger Rashmica - I just verified her instructions and wrote up this post.)

Lustre is a high-performance clustered file system. Traditionally the Lustre client and server have run on x86, but both the server and client will also work on Power. Here's how to get them running.

Server

Lustre normally requires a patched 'enterprise' kernel - typically an older RHEL, CentOS or SUSE kernel. We tested with a CentOS 7.3 kernel. We tried to follow the Intel instructions for building the kernel as much as possible - any deviations we had to make are listed below.

Setup quirks

We are told to edit ~/kernel/rpmbuild/SPEC/kernel.spec. This doesn't exist because the directory is SPECS not SPEC: you need to edit ~/kernel/rpmbuild/SPECS/kernel.spec.

I also found there was an extra quote mark in the supplied patch script after -lustre.patch. I removed that and ran this instead:

for patch in $(<"3.10-rhel7.series"); do
    patch_file="$HOME/lustre-release/lustre/kernel_patches/patches/${patch}"
    cat "${patch_file}" >> "$HOME/lustre-kernel-x86_64-lustre.patch"
done

The fact that there is 'x86_64' in the patch name doesn't matter as you're about to copy it under a different name to a place where it will be included by the spec file.

Building for ppc64le

Building for ppc64le was reasonably straight-forward. I had one small issue:

[build@dja-centos-guest rpmbuild]$ rpmbuild -bp --target=`uname -m` ./SPECS/kernel.spec
Building target platforms: ppc64le
Building for target ppc64le
error: Failed build dependencies:
       net-tools is needed by kernel-3.10.0-327.36.3.el7.ppc64le

Fixing this was as simple as a yum install net-tools.

This was sufficient to build the kernel RPMs. I installed them and booted to my patched kernel - so far so good!

Building the client packages: CentOS

I then tried to build and install the RPMs from lustre-release. This repository provides the sources required to build the client and utility binaries.

./configure and make succeeded, but when I went to install the packages with rpm, I found I was missing some dependencies:

error: Failed dependencies:
        ldiskfsprogs >= 1.42.7.wc1 is needed by kmod-lustre-osd-ldiskfs-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
    sg3_utils is needed by lustre-iokit-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
        attr is needed by lustre-tests-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
        lsof is needed by lustre-tests-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le

I was able to install sg3_utils, attr and lsof, but I was still missing ldiskfsprogs.

It seems we need the lustre-patched version of e2fsprogs - I found a mailing list post to that effect.

So, following the instructions on the walkthrough, I grabbed the SRPM and installed the dependencies: yum install -y texinfo libblkid-devel libuuid-devel

I then tried rpmbuild -ba SPECS/e2fsprogs-RHEL-7.spec. This built but failed tests. Some failed because I ran out of disk space - they were using 10s of gigabytes. I found that there were some comments in the spec file about this with suggested tests to disable, so I did that. Even with that fix, I was still failing two tests:

  • f_pgsize_gt_blksize: Intel added this to their fork, and no equivalent exists in the master e2fsprogs branches. This relates to Intel specific assumptions about page sizes which don't hold on Power.
  • f_eofblocks: This may need fixing for large page sizes, see this bug.

I disabled the tests by adding the following two lines to the spec file, just before make %{?_smp_mflags} check.

rm -rf tests/f_pgsize_gt_blksize
rm -rf tests/f_eofblocks

With those tests disabled I was able to build the packages successfully. I installed them with yum localinstall *1.42.13.wc5* (I needed that rather weird pattern to pick up important RPMs that didn't fit the e2fs* pattern - things like libcom_err and libss).

Following that I went back to the lustre-release build products and was able to successfully run yum localinstall *ppc64le.rpm!

Testing the server

After disabling SELinux and rebooting, I ran the test script:

sudo /usr/lib64/lustre/tests/llmount.sh

This spat out one scary warning:

mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'

The test did seem to succeed overall, and it would seem that is a known problem, so I pressed on undeterred.

I then attached a couple of virtual harddrives for the metadata and object store volumes, and having set them up, proceeded to try to mount my freshly minted lustre volume from some clients.

Testing with a ppc64le client

My first step was to test whether another ppc64le machine would work as a client.

I tried with an existing Ubuntu 16.04 VM that I use for much of my day to day development.

A quick google suggested that I could grab the lustre-release repository and run make debs to get Debian packages for my system.

I needed the following dependencies:

sudo apt install module-assistant debhelper dpatch libsnmp-dev quilt

With those the packages built successfully, and could be easily installed:

dpkg -i lustre-client-modules-4.4.0-57-generic_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deb lustre-utils_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deb

I tried to connect to the server:

sudo mount -t lustre $SERVER_IP@tcp:/lustre /lustre/

Initially I wasn't able to connect to the server at all. I remembered that (unlike Ubuntu), CentOS comes with quite an aggressive firewall by default. I ran the following on the server:

systemctl stop firewalld

And voila! I was able to connect, mount the lustre volume, and successfully read and write to it. This is very much an over-the-top hack - I should have poked holes in the firewall to allow just the ports lustre needed. This is left as an exercise for the reader.

Testing with an x86_64 client

I then tried to run make debs on my Ubuntu 16.10 x86_64 laptop.

This did not go well - I got the following error:

liblustreapi.c: In function ‘llapi_get_poollist’:
liblustreapi.c:1201:3: error: ‘readdir_r’ is deprecated [-Werror=deprecated-declarations]

This looks like one of the new errors introduced in recent GCC versions, and is a known bug. To work around it, I found the following stanza in lustre/autoconf/lustre-core.m4, and removed the -Werror:

AS_IF([test $target_cpu == "i686" -o $target_cpu == "x86_64"],
        [CFLAGS="$CFLAGS -Wall -Werror"])

Even this wasn't enough: I got the following errors:

/home/dja/dev/lustre-release/debian/tmp/modules-deb/usr_src/modules/lustre/lustre/llite/dcache.c:387:22: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
         .d_compare = ll_dcompare,
                  ^~~~~~~~~~~
/home/dja/dev/lustre-release/debian/tmp/modules-deb/usr_src/modules/lustre/lustre/llite/dcache.c:387:22: note: (near initialization for ll_d_ops.d_compare)

I figured this was probably because Ubuntu 16.10 has a 4.8 kernel, and Ubuntu 16.04 has a 4.4 kernel. Work on supporting 4.8 is ongoing.

Sure enough, when I fired up a 16.04 x86_64 VM with a 4.4 kernel, I was able to build and install fine.

Connecting didn't work first time - the guest failed to mount, but I did get the following helpful error on the server:

LNetError: 2595:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 10.61.2.227: insecure port 1024

Refusing insecure port 1024 made me think that perhaps the NAT that qemu was performing for me was interfering - perhaps the server expected to get a connection where the source port was privileged, and qemu wouldn't be able to do that with NAT.

Sure enough, switching NAT to bridging was enough to get the x86 VM to talk to the ppc64le server. I verified that ls, reading and writing all succeeded.

Next steps

The obvious next steps are following up the disabled tests in e2fsprogs, and doing a lot of internal performance and functionality testing.

Happily, it looks like Lustre might be in the mainline kernel before too long - parts have already started to go in to staging. This will make our lives a lot easier: for example, the breakage between 4.4 and 4.8 would probably have already been picked up and fixed if it were in the main kernel tree rather than an out-of-tree patch set.

In the long run, we'd like to make Lustre on Power just as easy as Lustre on x86. (And, of course, more performant!) We'll keep you up to date!

(Thanks to fellow bloggers Daniel Black and Andrew Donnellan for useful feedback on this post.)

NAMD on NVLink

NAMD is a molecular dynamics program that can use GPU acceleration to speed up its calculations. Recent OpenPOWER machines like the IBM Power Systems S822LC for High Performance Computing (Minsky) come with a new interconnect for GPUs called NVLink, which offers extremely high bandwidth to a number of very powerful Nvidia Pascal P100 GPUs. So they're ideal machines for this sort of workload.

Here's how to set up NAMD 2.12 on your Minsky, and how to debug some common issues. We've targeted this script for CentOS, but we've successfully compiled NAMD on Ubuntu as well.

Prerequisites

GPU Drivers and CUDA

Firstly, you'll need CUDA and the NVidia drivers.

You can install CUDA by following the instructions on NVidia's CUDA Downloads page.

yum install epel-release
yum install dkms
# download the rpm from the NVidia website
rpm -i cuda-repo-rhel7-8-0-local-ga2-8.0.54-1.ppc64le.rpm
yum clean expire-cache
yum install cuda
# this will take a while...

Then, we set up a profile file to automatically load CUDA into our path:

cat >  /etc/profile.d/cuda_path.sh <<EOF
# From http://developer.download.nvidia.com/compute/cuda/8.0/secure/prod/docs/sidebar/CUDA_Quick_Start_Guide.pdf - 4.4.2.1
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF

Now, open a new terminal session and check to see if it works:

cuda-install-samples-8.0.sh ~
cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest
make && ./bandwidthTest

If you see a figure of ~32GB/s, that means NVLink is working as expected. A figure of ~7-8GB/s indicates that only PCI is working, and more debugging is required.

Compilers

You need a C++ compiler:

yum install gcc-c++

Building NAMD

Once CUDA and the compilers are installed, building NAMD is reasonably straightforward. The one hitch is that because we're using CUDA 8.0, and the NAMD build scripts assume CUDA 7.5, we need to supply an updated Linux-POWER.cuda file. (We also enable code generation for the Pascal in this file.)

We've documented the entire process as a script which you can download. We'd recommend executing the commands one by one, but if you're brave you can run the script directly.

The script will fetch NAMD 2.12 and build it for you, but won't install it. It will look for the CUDA override file in the directory you are running the script from, and will automatically move it into the correct place so it is picked up by the build system.

The script compiles for a single multicore machine setup, rather than for a cluster. However, it should be a good start for an Ethernet or Infiniband setup.

If you're doing things by hand, you may see some errors during the compilation of charm - as long as charm++ builds successfully at the end, you should be OK.

Testing NAMD

We have been testing NAMD using the STMV files available from the NAMD website:

cd NAMD_2.12_Source/Linux-POWER-g++
wget http://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar -xf stmv.tar.gz
sudo ./charmrun +p80 ./namd2 +pemap 0-159:2 +idlepoll +commthread stmv/stmv.namd

This binds a namd worker thread to every second hardware thread. This is because hardware threads share resources, so using every hardware thread costs overhead and doesn't give us access to any more physical resources.

You should see messages about finding and using GPUs:

Pe 0 physical rank 0 binding to CUDA device 0 on <hostname>: 'Graphics Device'  Mem: 4042MB  Rev: 6.0

This should be significantly faster than on non-NVLink machines - we saw a gain of about 2x in speed going from a machine with Nvidia K80s to a Minsky. If things aren't faster for you, let us know!


Other notes

NAMD requires some libraries, some of which they supply as binary downloads on their website. Make sure you get the ppc64le versions, not the ppc64 versions, otherwise you'll get errors like:

/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regfree.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(regerror.o): compiled for a big endian system and target is little endian
/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regerror.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(tclAlloc.o): compiled for a big endian system and target is little endian

The script we supply should get these right automatically.

linux.conf.au 2017 review

I recently attended LCA 2017, where I gave a talk at the Linux Kernel miniconf (run by fellow sthbrx blogger Andrew Donnellan!) and a talk at the main conference.

I received some really interesting feedback so I've taken the opportunity to write some of it down to complement the talk videos and slides that are online. (And to remind me to follow up on it!)

Miniconf talk: Sparse Warnings

My kernel miniconf talk was on sparse warnings (pdf slides, 23m video).

The abstract read (in part):

sparse is a semantic parser for C, and is one of the static analysis tools available to kernel devs.

Sparse is a powerful tool with good integration into the kernel build system. However, we suffer from warning overload - there are too many sparse warnings to spot the serious issues amongst the trivial. This makes it difficult to use, both for developers and maintainers.

Happily, I received some feedback that suggests it's not all doom and gloom like I had thought!

  • Dave Chinner told me that the xfs team uses sparse regularly to make sure that the file system is endian-safe. This is good news - we really would like that to be endian-safe!

  • Paul McKenney let me know that the 0day bot does do some sparse checking - it would just seem that it's not done on PowerPC.

Main talk: 400,000 Ephemeral Containers

My main talk was entitled "400,000 Ephemeral Containers: testing entire ecosystems with Docker". You can read the abstract for full details, but it boils down to:

What if you want to test how all the packages in a given ecosystem work in a given situation?

My main example was testing how many of the Ruby packages successfully install on Power, but I also talk about other languages and other cool tests you could run.

The 44m video is online. I haven't put the slides up yet but they should be available on GitHub soonish.

Unlike with the kernel talk, I didn't catch the names of most of the people with feedback.

Docker memory issues

One of the questions I received during the talk was about running into memory issues in Docker. I attempted to answer that during the Q&A. The person who asked the question then had a chat with me afterwards, and it turns out I had completely misunderstood the question. I thought it was about memory usage of running containers in parallel. It was actually about memory usage in the docker daemon when running lots of containers in serial. Apparently the docker daemon doesn't free memory during the life of the process, and the question was whether or not I had observed that during my runs.

I didn't have a good answer for this at the time other than "it worked for me", so I have gone back and looked at the docker daemon memory usage.

After a full Ruby run, the daemon is using about 13.9G of virtual memory, and 1.975G of resident memory. If I restart it, the memory usage drops to 1.6G of virtual and 43M of resident memory. So it would appear that the person asking the question was right, and I'm just not seeing it have an effect.

Other interesting feedback

  • Someone was quite interested in testing on Sparc, once they got their Go runtime nailed down.

  • A Rackspacer was quite interested in Python testing for OpenStack - this has some intricacies around Py2/Py3, but we had an interesting discussion around just testing to see if packages that claim Py3 support provide Py3 support.

  • A large jobs site mentioned using this technique to help them migrate their dependencies between versions of Go.

  • I was 'gently encouraged' to try to do better with how long the process takes to run - if for no other reason than to avoid burning more coal. This is a fair point. I did not explain very well what I meant by diminishing returns in the talk: there's lots you could do to make the process faster, it just comes at the cost of the simplicity that I really wanted when I first started the project. I am working (on and off) on better ways to deal with this by considering the dependency graph.

Extracting Early Boot Messages in QEMU

Be me, you're a kernel hacker, you make some changes to your kernel, you boot test it in QEMU, and it fails to boot. Even worse is the fact that it just hangs without any failure message, no stack trace, no nothing. "Now what?" you think to yourself.

You probably do the first thing you learnt in debugging101 and add abundant print statements all over the place to try and make some sense of what's happening and where it is that you're actually crashing. So you do this, you recompile your kernel, boot it in QEMU and lo and behold, nothing... What happened? You added all these shiny new print statements, where did the output go? The kernel still failed to boot (obviously), but where you were hoping to get some clue to go on, you were again left with an empty screen. "Maybe I didn't print early enough" or "maybe I got the code paths wrong" you think, "maybe I just need more prints" even. So let's delve a bit deeper: why didn't you see those prints, where did they go, and how can you get at them?

__log_buf

So what happens when you call printk()? Well, what normally happens is, depending on the log level you set, the output is sent to the console or logged so you can see it in dmesg. But what happens if we haven't registered a console yet? Well, then we can't print the message, can we? So it's logged in a buffer - the kernel log buffer to be exact, helpfully named __log_buf.

Console Registration

So how come I eventually see print statements on my screen? Well at some point during the boot process a console is registered with the printk system, and any buffered output can now be displayed. On ppc it happens that this occurs in register_early_udbg_console() called in setup_arch() from start_kernel(), which is the generic kernel entry point. From this point forward when you print something it will be displayed on the console, but what if you crash before this? What are you supposed to do then?

Extracting Early Boot Messages in QEMU

And now the moment you've all been waiting for: how do I extract those early boot messages in QEMU if my kernel crashes before the console is registered? Well, it's quite simple really. QEMU is nice enough to allow us to dump guest memory, and we know the log buffer is in there somewhere, so we just need to dump the part of memory which corresponds to the log buffer.

Locating __log_buf

Before we can dump the log buffer we need to know where it is. Luckily for us this is fairly simple, we just need to dump all the kernel symbols and look for the right one.

> nm vmlinux > tmp; grep __log_buf tmp;
c000000000f5e3dc b __log_buf

We use the nm tool to list all the kernel symbols and output this into a temporary file, which we can then grep for the log buffer (which we know to be named __log_buf), and presto, we are told that it lives at kernel virtual address 0xc000000000f5e3dc.
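
One wrinkle: nm reports a kernel virtual address, but the QEMU memory dump we're about to do wants a guest physical address. On ppc64 the kernel's linear mapping starts at 0xc000000000000000 and maps physical address 0 upwards, so we can just strip that offset. A trivial sketch, assuming the symbol does live in the linear mapping:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* __log_buf's virtual address, as reported by nm above */
    uint64_t vaddr = 0xc000000000f5e3dcULL;
    /* start of the ppc64 kernel linear mapping (assumption: the symbol
     * lives in the linear map, which starts at physical address 0) */
    uint64_t page_offset = 0xc000000000000000ULL;

    printf("guest physical address: 0x%llx\n",
           (unsigned long long)(vaddr - page_offset));
    return 0;
}

That gives us 0xf5e3dc, which is the address we'll feed to QEMU below.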

Dumping Guest Memory

It's then simply a case of dumping guest memory from the QEMU console. So first we press ^a+c to get us to the QEMU console, then we can use the aptly named dump-guest-memory.

> help dump-guest-memory
dump-guest-memory [-p] [-d] [-z|-l|-s] filename [begin length] -- dump guest memory into file 'filename'.
            -p: do paging to get guest's memory mapping.
            -d: return immediately (do not wait for completion).
            -z: dump in kdump-compressed format, with zlib compression.
            -l: dump in kdump-compressed format, with lzo compression.
            -s: dump in kdump-compressed format, with snappy compression.
            begin: the starting physical address.
            length: the memory size, in bytes.

We just give it a filename for where we want our output to go, we know the starting address, we just don't know the length. We could choose some arbitrary length, but inspection of the kernel code shows us that:

#define __LOG_BUF_LEN (1 << CONFIG_LOG_BUF_SHIFT)
static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

Looking at the pseries_defconfig file shows us that LOG_BUF_SHIFT is set to 18, and thus we know that the buffer is 2^18 bytes, or 256KB. So now we run:

> dump-guest-memory tmp 0xf5e3dc 262144

And we now get our log buffer in the file tmp. This can simply be viewed with:

> hexdump -C tmp

This gives a readable, if poorly formatted output. I'm sure you can find something better but I'll leave that as an exercise for the reader.

Conclusion

So if, like me, your kernel hangs somewhere early in the boot process and you're left without your console output, you are now fully equipped to extract the log buffer in QEMU - and hopefully therein lies the answer to why you failed to boot.