Store Half Byte-Reverse Indexed

A Power Technical Blog

Evaluating CephFS on Power

Methodology

To evaluate CephFS, we will create a ppc64le virtual machine, with sufficient space to compile the software, as well as 3 sparse 1TB disks to create the object store.

We will then build & install the Ceph packages, after adding the PowerPC optimisations to the code. We do this because, if the packages are not already installed, ceph-deploy will fetch prebuilt packages that lack the performance patches.

Finally, we will use ceph-deploy to deploy the instance. We will install ceph-deploy via pip, to avoid file conflicts with the packages that we built.

For more information on what each command does, visit the following tutorial, upon which this is based: http://palmerville.github.io/2016/04/30/single-node-ceph-install.html

Virtual Machine Config

Create a virtual machine with at least the following:

  • 16GB of memory
  • 16 CPUs
  • 64GB disk for the root filesystem
  • 3 x 1TB disks for the Ceph object store
  • Ubuntu 16.04 default install (only use the 64GB disk, leave the others unpartitioned)

Initial config

  • Enable ssh
    sudo apt install openssh-server
    sudo apt update
    sudo apt upgrade
    sudo reboot
  • Install build tools
    sudo apt install git debhelper

Build Ceph

    mkdir $HOME/src
    cd $HOME/src
    git clone --recursive https://github.com/ceph/ceph.git  # This may take a while
    cd ceph
    git checkout master
    git submodule update --force --init --recursive
  • Cherry-pick the Power performance patches:
    git remote add kestrels https://github.com/kestrels/ceph.git
    git fetch --all
    git cherry-pick 59bed55a676ebbe3ad97d8ec005c2088553e4e11
  • Install prerequisites
    ./install-deps.sh
    sudo apt install python-requests python-flask resource-agents curl python-cherrypy python3-pip python-django python-dateutil python-djangorestframework
    sudo pip3 install ceph-deploy
    cd $HOME/src/ceph
    sudo dpkg-buildpackage -J$(nproc) # This will take a couple of hours with 16 CPUs
  • Install the packages (note that python3-ceph-argparse will fail, but is safe to ignore)
    cd $HOME/src
    sudo dpkg -i *.deb

Create the ceph-deploy user

    sudo adduser ceph-deploy
    echo "ceph-deploy ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph-deploy
    sudo chmod 0440 /etc/sudoers.d/ceph-deploy

Configure the ceph-deploy user environment

    su - ceph-deploy
    ssh-keygen
    node=`hostname`
    ssh-copy-id ceph-deploy@$node
    mkdir $HOME/ceph-cluster
    cd $HOME/ceph-cluster
    ceph-deploy new $node # If this fails, remove the bogus 127.0.1.1 entry from /etc/hosts
    echo 'osd pool default size = 2' >> ceph.conf
    echo 'osd crush chooseleaf type = 0' >> ceph.conf

Complete the Ceph deployment

    ceph-deploy install $node
    ceph-deploy mon create-initial
    drives="vda vdb vdc"  # the 1TB drives - check that these are correct for your system
    for drive in $drives; do ceph-deploy disk zap $node:$drive; ceph-deploy osd prepare $node:$drive; done
    for drive in $drives; do ceph-deploy osd activate $node:/dev/${drive}1; done
    ceph-deploy admin $node
    sudo chmod +r /etc/ceph/ceph.client.admin.keyring
    ceph -s # Check the state of the cluster

Configure CephFS

    ceph-deploy mds create $node
    ceph osd pool create cephfs_data 128
    ceph osd pool create cephfs_metadata 128
    ceph fs new cephfs cephfs_metadata cephfs_data
    sudo systemctl status ceph\*.service ceph\*.target # Ensure the ceph-osd, ceph-mon & ceph-mds daemons are running
    sudo mkdir /mnt/cephfs
    key=`grep key ~/ceph-cluster/ceph.client.admin.keyring | cut -d ' ' -f 3`
    sudo mount -t ceph $node:6789:/ /mnt/cephfs -o name=admin,secret=$key

References

  1. http://docs.ceph.com/docs/master/install/clone-source/
  2. http://docs.ceph.com/docs/master/install/build-ceph/
  3. http://palmerville.github.io/2016/04/30/single-node-ceph-install.html

Erasure Coding for Programmers, Part 2

We left part 1 having explored GF(2^8) and RAID 6, and asking the question "what does all this have to do with Erasure Codes?"

Basically, the thinking goes "RAID 6 is cool, but what if, instead of two parity disks, we had an arbitrary number of parity disks?"

How would we do that? Well, let's introduce our new best friend: Coding Theory!

Say we want to transmit some data across an error-prone medium. We don't know where the errors might occur, so we add some extra information to allow us to detect and possibly correct for errors. This is a code. Codes are a largish field of engineering, but rather than show off my knowledge about systematic linear block codes, let's press on.

Today, our error-prone medium is an array of inexpensive disks. Now we make this really nice assumption about disks, namely that they are either perfectly reliable or completely missing. In other words, we consider that a disk will either be present or 'erased'. We come up with 'erasure codes' that are able to reconstruct data when it is known to be missing. (This is a slightly different problem to being able to verify and correct data that might or might not be subtly corrupted. Disks also have to deal with this problem, but it is not something erasure codes address!)

The particular code we use is a Reed-Solomon code. The specific details are unimportant, but there's a really good graphical outline of the broad concepts in sections 1 and 3 of the Jerasure paper/manual. (Don't go on to section 4.)

That should give you some background on how this works at a pretty basic mathematical level. Implementation is a matter of mapping that maths (matrix multiplication) onto hardware primitives, and making it go fast.

Scope

I'm deliberately not covering some pretty vast areas of what would be required to write your own erasure coding library from scratch. I'm not going to talk about how to compose the matrices, how to invert them, or anything like that. I'm not sure how that would be a helpful exercise - ISA-L and jerasure already exist and do that for you.

What I want to cover is an efficient implementation of some of the algorithms, once you have the matrices nailed down.

I'm also going to assume your library already provides a generic multiplication function in GF(2^8). That's required to construct the matrices, so it's a pretty safe assumption.
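Concretely, the sketches in this post will be written against a prototype like the following (the name gf_mul is just the one I'll use below; your library's spelling will differ):

/* Assumed library primitive: scalar multiplication in GF(2^8). */
unsigned char gf_mul(unsigned char a, unsigned char b);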

The beginnings of an API

Let's make this a bit more concrete.

This will be heavily based on the ISA-L API but you probably want to plug into ISA-L anyway, so that shouldn't be a problem.

What I want to do is build up from very basic algorithmic components into something useful.

The first thing we want to be able to do is Galois Field multiplication of an entire region of bytes by an arbitrary constant.

We basically want gf_vect_mul(size_t len, <something representing the constant>, unsigned char * src, unsigned char * dest)

Simple and slow approach

The simplest way is to do something like this:

void gf_vect_mul_simple(size_t len, unsigned char c, unsigned char * src, unsigned char * dest) {
    size_t i;
    for (i=0; i<len; i++) {
        dest[i] = gf_mul(c, src[i]);
    }
}

That does multiplication element by element using the library's supplied gf_mul function, which - as the name suggests - does GF(2^8) multiplication of a scalar by a scalar.

This works. The problem is that it is very, painfully, slow - in the order of a few hundred megabytes per second.

Going faster

How can we make this faster?

There are a few things we can try: if you want to explore a whole range of different ways to do this, check out the gf-complete project. I'm going to assume we want to skip right to the end and use the fastest approach we've found.

Cast your mind back to the RAID 6 paper (PDF) I talked about in part 1. That had a way of doing an efficient multiplication in GF(2^8) using vector instructions.

To refresh your memory, we split the multiplication into two parts - low bits and high bits, looked them up separately in a lookup table, and joined them with XOR. We then discovered that on modern Power chips, we could do that in one instruction with vpermxor.

So, a very simple way to do this would be:

  • generate the table for a
  • for each 16-byte chunk of our input:
    • load the input
    • do the vpermxor with the table
    • save it out

Generating the tables is reasonably straightforward, in theory. Recall that the tables are a * {{00},{01},...,{0f}} and a * {{00},{10},...,{f0}} - a couple of loops in C will generate them without difficulty. ISA-L has a function to do this, as does gf-complete in split-table mode, so I won't repeat them here.

So, let's recast our function to take the tables as an input rather than the constant a. Assume we're provided the two tables concatenated into one 32-byte chunk. That would give us:

void gf_vect_mul_v2(size_t len, unsigned char * table, unsigned char * src, unsigned char * dest)

Here's how you would do it in C:

void gf_vect_mul_v2(size_t len, unsigned char * table, unsigned char * src, unsigned char * dest) {
        vector unsigned char tbl1, tbl2, in, out;
        size_t i;

        /* Assume table, src, dest are aligned and len is a multiple of 16 */

        tbl1 = vec_ld(16, table);
        tbl2 = vec_ld(0, table);
        for (i=0; i<len; i+=16) {
            in = vec_ld(i, (unsigned char *)src);
            __asm__("vpermxor %0, %1, %2, %3" : "=v"(out) : "v"(tbl1), "v"(tbl2), "v"(in));
            vec_st(out, i, (unsigned char *)dest);
        }
}

There are a few quirks to iron out - making sure the table is laid out in the vector register in the way you expect, etc. - but that generally works and is quite fast: my Power 8 VM does about 17-18 GB/s with non-cache-contained data with this implementation.

We can go a bit faster by doing larger chunks at a time:

    for (i=0; i<len; i+=64) {
            in1 = vec_ld(i, (unsigned char *)src);
            in2 = vec_ld(i+16, (unsigned char *)src);
            in3 = vec_ld(i+32, (unsigned char *)src);
            in4 = vec_ld(i+48, (unsigned char *)src);
            __asm__("vpermxor %0, %1, %2, %3" : "=v"(out1) : "v"(tbl1), "v"(tbl2), "v"(in1));
            __asm__("vpermxor %0, %1, %2, %3" : "=v"(out2) : "v"(tbl1), "v"(tbl2), "v"(in2));
            __asm__("vpermxor %0, %1, %2, %3" : "=v"(out3) : "v"(tbl1), "v"(tbl2), "v"(in3));
            __asm__("vpermxor %0, %1, %2, %3" : "=v"(out4) : "v"(tbl1), "v"(tbl2), "v"(in4));
            vec_st(out1, i, (unsigned char *)dest);
            vec_st(out2, i+16, (unsigned char *)dest);
            vec_st(out3, i+32, (unsigned char *)dest);
            vec_st(out4, i+48, (unsigned char *)dest);
    }

This goes at about 23.5 GB/s.

We can go one step further and do the core loop in assembler - that means we control the instruction layout and so on. I tried this: it turns out that for the basic vector multiply loop, if we turn off ASLR and pin to a particular CPU, we can see an improvement of a few percent (and a decrease in variability) over C code.

Building from vector multiplication

Once you're comfortable with the core vector multiplication, you can start to build more interesting routines.

A particularly useful one on Power turned out to be the multiply and add routine: like gf_vect_mul, except that rather than overwriting the output, it loads the output and xors the product in. This is a simple extension of the gf_vect_mul function so is left as an exercise to the reader.
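For reference, here's a minimal scalar sketch of that routine - the vectorised version, which is where the fun is, really is the exercise:

/* Multiply and add: dest[i] ^= c * src[i] for every byte.
 * A scalar sketch only - the vector version has the same shape as
 * gf_vect_mul_v2, with an extra load of dest and a vec_xor. */
void gf_vect_mad_simple(size_t len, unsigned char c,
                        unsigned char *src, unsigned char *dest)
{
    size_t i;
    for (i = 0; i < len; i++) {
        dest[i] ^= gf_mul(c, src[i]);
    }
}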

The next step would be to start building erasure coding proper. Recall that to get an element of our output, we take a dot product: we take the corresponding input element of each disk, multiply it with the corresponding GF(2^8) coding matrix element and sum all those products. So all we need now is a dot product algorithm.

One approach is the conventional dot product:

  • for each element
    • zero accumulator
    • for each source
      • load input[source][element]
      • do GF(2^8) multiplication
      • xor into accumulator
    • save accumulator to output[element]

The other approach is multiply and add:

  • for each source
    • for each element
      • load input[source][element]
      • do GF(2^8) multiplication
      • load output[element]
      • xor in product
      • save output[element]

The dot product approach has the advantage of fewer writes. The multiply and add approach has the advantage of better cache/prefetch performance. The approach you ultimately go with will probably depend on the characteristics of your machine and the length of data you are dealing with.
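To make the difference concrete, here are scalar sketches of both approaches. The names and the coeffs parameter (the relevant row of the coding matrix) are mine; gf_mul is the assumed scalar primitive from earlier:

/* Approach 1 - dot product: one pass per output element, with the
 * accumulator living in a register, so each output byte is written once. */
void gf_dot_prod_sketch(size_t len, int k, unsigned char *coeffs,
                        unsigned char **input, unsigned char *output)
{
    size_t i;
    int j;
    for (i = 0; i < len; i++) {
        unsigned char acc = 0;
        for (j = 0; j < k; j++)
            acc ^= gf_mul(coeffs[j], input[j][i]);
        output[i] = acc;
    }
}

/* Approach 2 - multiply and add: one sequential pass per source,
 * reading and rewriting the whole output each time. */
void gf_mad_sketch(size_t len, int k, unsigned char *coeffs,
                   unsigned char **input, unsigned char *output)
{
    size_t i;
    int j;
    for (i = 0; i < len; i++)
        output[i] = 0;
    for (j = 0; j < k; j++)
        for (i = 0; i < len; i++)
            output[i] ^= gf_mul(coeffs[j], input[j][i]);
}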

For what it's worth, ISA-L ships with only the first approach in x86 assembler, and Jerasure leans heavily towards the second approach.

Once you have a vector dot product sorted, you can build a full erasure coding setup: build your tables with your library, then do a dot product to generate each of your outputs!

In ISA-L, this is implemented something like this:

/*
 * ec_encode_data_simple(length of each data input, number of inputs,
 *                       number of outputs, pre-generated GF(2^8) tables,
 *                       input data pointers, output code pointers)
 */
void ec_encode_data_simple(int len, int k, int rows, unsigned char *g_tbls,
                           unsigned char **data, unsigned char **coding)
{
        while (rows) {
                gf_vect_dot_prod(len, k, g_tbls, data, *coding);
                g_tbls += k * 32;
                coding++;
                rows--;
        }
}

Going faster still

Eagle-eyed readers will notice that however we generate an output, we have to read all the input elements. This means that if we're doing a code with 10 data disks and 4 coding disks, we have to read each of the 10 inputs 4 times.

We could do better if we could calculate multiple outputs for each pass through the inputs. This is a little fiddly to implement, but does lead to a speed improvement.

ISA-L is an excellent example here. Intel goes up to 6 outputs at once: the number of outputs you can do is only limited by how many vector registers you have to put the various operands and results in.
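As a scalar sketch of the idea (two outputs per pass, hypothetical names again - the vector version simply dedicates a table pair and an accumulator register to each output):

/* Two outputs per pass: each input byte is loaded once and
 * multiplied into both accumulators. */
void gf_2vect_dot_prod_sketch(size_t len, int k,
                              unsigned char *coeffs1, unsigned char *coeffs2,
                              unsigned char **input,
                              unsigned char *out1, unsigned char *out2)
{
    size_t i;
    int j;
    for (i = 0; i < len; i++) {
        unsigned char a1 = 0, a2 = 0;
        for (j = 0; j < k; j++) {
            unsigned char in = input[j][i];
            a1 ^= gf_mul(coeffs1[j], in);
            a2 ^= gf_mul(coeffs2[j], in);
        }
        out1[i] = a1;
        out2[i] = a2;
    }
}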

Tips and tricks

  • Benchmarking is tricky. I do the following on a bare-metal, idle machine, with ASLR off and pinned to an arbitrary hardware thread. (Code is for the fish shell)

    for x in (seq 1 50)
        setarch ppc64le -R taskset -c 24 erasure_code/gf_vect_mul_perf
    end | awk '/MB/ {sum+=$13} END {print sum/50, "MB/s"}'
    
  • Debugging is tricky; the more you can do in C and the less you do in assembly, the easier your life will be.

  • Vector code is notoriously alignment-sensitive - if you can't figure out why something is wrong, check alignment. (Pro-tip: ISA-L does not guarantee the alignment of the gftbls parameter, and many of the tests supply an unaligned table from the stack. For testing __attribute__((aligned(16))) is your friend!)

  • Related: GCC is moving towards assignment over vector intrinsics, at least on Power:

    vector unsigned char a;
    unsigned char * data;
    // good, also handles word-aligned data with VSX
    a = *(vector unsigned char *)data;
    // bad, requires special handling of non-16-byte aligned data
    a = vec_ld(0, (unsigned char *) data);
    

Conclusion

Hopefully by this point you're equipped to figure out how your erasure coding library of choice works, and write your own optimised implementation (or maintain an implementation written by someone else).

I've referred to a number of resources throughout this series:

If you want to go deeper, I also read the following and found them quite helpful in understanding Galois Fields and Reed-Solomon coding:

For a more rigorous mathematical approach to rings and fields, a university mathematics course may be of interest. For more on coding theory, a university course in electronics engineering may be helpful.

Erasure Coding for Programmers, Part 1

Erasure coding is an increasingly popular storage technology - allowing the same level of fault tolerance as replication with a significantly reduced storage footprint.

Increasingly, erasure coding is available 'out of the box' on storage solutions such as Ceph and OpenStack Swift. Normally, you'd just pull in a library like ISA-L or jerasure, set some config options, and be done.

This post is not about that. This post is about how I went from knowing nothing about erasure coding to writing POWER-optimised routines to make it go fast. (These are in the process of being polished for upstream at the moment.) If you want to understand how erasure coding works under the hood - and in particular if you're interested in writing optimised routines to make it run quickly on your platform - this is for you.

What are erasure codes anyway?

I think the easiest way to begin thinking about erasure codes is "RAID 6 on steroids". RAID 6 allows you to have up to 255 data disks and 2 parity disks (called P and Q), thus allowing you to tolerate the failure of up to 2 arbitrary disks without data loss.

Erasure codes allow you to have k data disks and m 'parity' or coding disks. You then have a total of m + k disks, and you can tolerate the failure of up to m without losing data.

The downside of erasure coding is that computing what to put on those parity disks is CPU intensive. Let's look at what we put on them.

RAID 6

RAID 6 is the easiest way to get started on understanding erasure codes, for a number of reasons. H Peter Anvin's paper on RAID 6 in the Linux kernel is an excellent start, but dives into the underlying mathematics a bit quickly. So before reading that, read on!

Rings and Fields

As programmers we're pretty comfortable with modular arithmetic - the idea that if you have:

unsigned char a = 255;
a++;

the new value of a will be 0, not 256.

This is an example of an algebraic structure called a ring.

Rings obey certain laws. For our purposes, we'll consider the following incomplete and somewhat simplified list:

  • There is an addition operation.
  • There is an additive identity (normally called 0), such that 'a + 0 = a'.
  • Every element has an additive inverse, that is, for every element 'a', there is an element -a such that 'a + (-a) = 0'
  • There is a multiplication operation.
  • There is a multiplicative identity (normally called 1), such that 'a * 1 = a'.

These operations aren't necessarily addition or multiplication as we might expect from the integers or real numbers. For example, in our modular arithmetic example, we have 'wrap around'. (There are also certain laws the addition and multiplication operations must satisfy - we are glossing over them here.)

One thing a ring doesn't necessarily have is a 'multiplicative inverse'. The multiplicative inverse of some non-zero element of the ring (call it a) is the value b such that a * b = 1. (Often instead of b we write 'a^-1', but that looks bad in plain text, so we shall stick to b for now.)

We do have some inverses in 'mod 256': the inverse of 3 is 171 as 3 * 171 = 513, and 513 = 1 mod 256, but there is no b such that 2 * b = 1 mod 256.

If every non-zero element of our ring had a multiplicative inverse, we would have what is called a field.

Now, let's look at the integers modulo 2, that is, 0 and 1.

We have this for addition:

+ 0 1
0 0 1
1 1 0

Eagle-eyed readers will notice that this is the same as XOR.

For multiplication:

* 0 1
0 0 0
1 0 1

As we said, a field is a ring where every non-zero element has a multiplicative inverse. As we can see, the integers modulo 2 shown above is a field: it's a ring, and 1 is its own multiplicative inverse.

So this is all well and good, but you can't really do very much in a field with 2 elements. This is sad, so we make bigger fields. For this application, we consider the Galois Field with 256 elements - GF(2^8). This field has some surprising and useful properties.

Remember how we said that the integers modulo 256 weren't a field because they didn't have multiplicative inverses? I also just said that GF(2^8) has 256 elements, but is a field - i.e., it does have inverses! How does that work?

Consider an element in GF(2^8). There are two ways to look at it. The first is to consider it as an 8-bit number. So, for example, let's take 100. We can express that as an 8-bit binary number: 0b01100100.

We can write that more explicitly as a sum of powers of 2:

0 * 2^7 + 1 * 2^6 + 1 * 2^5 + 0 * 2^4 + 0 * 2^3 + 1 * 2^2 + 0 * 2 + 0 * 1
= 2^6 + 2^5 + 2^2

Now the other way we can look at elements in GF(2^8) is to replace the '2's with 'x's, and consider them as polynomials. Each of our bits then represents the coefficient of a term of a polynomial, that is:

0 x^7 + 1 x^6 + 1 x^5 + 0 x^4 + 0 x^3 + 1 x^2 + 0 x + 0 * 1

or more simply

x^6 + x^5 + x^2

Now, and this is important: the coefficients are elements of the integers modulo 2: x + x = 2x = 0, as 2 mod 2 = 0. There is no concept of 'carrying' in this addition.

Let's try: what's 100 + 79 in GF(2^8)?

100 = 0b01100100 => x^6 + x^5 +       x^2
 79 = 0b01001111 => x^6 +       x^3 + x^2 + x + 1

100 + 79         =>   0 + x^5 + x^3 +   0 + x + 1
                 =    0b00101011 = 43

So, 100 + 79 = 43 in GF(2^8)

You may notice we could have done that much more efficiently: we can add numbers in GF(2^8) by just XORing their binary representations together. Subtraction, amusingly, is the same as addition: 0 + x = x = 0 - x, as -1 is congruent to 1 modulo 2.
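In C, then, the entire addition (and subtraction!) routine is a one-liner:

/* Addition and subtraction in GF(2^8) are both just XOR. */
unsigned char gf_add(unsigned char a, unsigned char b)
{
    return a ^ b;
}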

So at this point you might want to explore a few additions yourself. Fortunately there's a lovely tool that will let you do that:

sudo apt install gf-complete-tools
gf_add $A $B 8

This will give you A + B in GF(2^8).

> gf_add 100 79 8
43

Excellent!

So, hold on to your hats, as this is where things get really weird. In our modular arithmetic example, we considered the elements of our ring to be numbers, and we performed our addition and multiplication modulo 256. In GF(2^8), we consider our elements as polynomials, and we perform our addition and multiplication modulo a polynomial. There is one polynomial conventionally used in applications:

0x11d => 0b1 0001 1101 => x^8 + x^4 + x^3 + x^2 + 1

It is possible to use other polynomials if they satisfy particular requirements, but for our applications we don't need to worry as we will always use 0x11d. I am not going to attempt to explain anything about this polynomial - take it as an article of faith.

So when we multiply two numbers, we multiply their polynomial representations. Then, to find out what that is modulo 0x11d, we do polynomial long division by 0x11d, and take the remainder.

Some examples will help.

Let's multiply 100 by 3.

100 = 0b01100100 => x^6 + x^5 + x^2
  3 = 0b00000011 => x + 1

(x^6 + x^5 + x^2)(x + 1) = x^7 + x^6 + x^3 + x^6 + x^5 + x^2
                         = x^7 + x^5 + x^3 + x^2

Notice that some of the terms have disappeared: x^6 + x^6 = 0.

The degree (the largest power of a term) is 7. 7 is less than the degree of 0x11d, which is 8, so we don't need to do anything: the remainder modulo 0x11d is simply x^7 + x^5 + x^3 + x^2.

In binary form, that is 0b10101100 = 172, so 100 * 3 = 172 in GF(2^8).

Fortunately gf-complete-tools also allows us to check multiplications:

> gf_mult 100 3 8
172

Excellent!

Now let's see what happens if we multiply by a larger number. Let's multiply 100 by 5.

100 = 0b01100100 => x^6 + x^5 + x^2
  5 = 0b00000101 => x^2 + 1

(x^6 + x^5 + x^2)(x^2 + 1) = x^8 + x^7 + x^4 + x^6 + x^5 + x^2
                           = x^8 + x^7 + x^6 + x^5 + x^4 + x^2

Here we have an x^8 term, so we have a degree of 8. This means we will get a different remainder when we divide by our polynomial. We do this with polynomial long division, which you will hopefully remember if you did some solid algebra in high school.

                              1
                           ---------------------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^8 + x^7 + x^6 + x^5 + x^4       + x^2
                          - x^8                   + x^4 + x^3 + x^2 + 1
                            -------------------------------------------
                          =       x^7 + x^6 + x^5       + x^3       + 1

So we have that our original polynomial (x^8 + x^4 + x^3 + x^2 + 1) is congruent to (x^7 + x^6 + x^5 + x^3 + 1) modulo the polynomial 0x11d. Looking at the binary representation of that new polynomial, we have 0b11101001 = 233.

Sure enough:

> gf_mult 100 5 8
233

Just to solidify the polynomial long division a bit, let's try a slightly larger example, 100 * 9:

100 = 0b01100100 => x^6 + x^5 + x^2
  9 = 0b00001001 => x^3 + 1

(x^6 + x^5 + x^2)(x^3 + 1) = x^9 + x^8 + x^5 + x^6 + x^5 + x^2
                           = x^9 + x^8 + x^6 + x^2

Doing long division to reduce our result:

                              x
                           -----------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^9 + x^8       + x^6                   + x^2
                          - x^9                   + x^5 + x^4 + x^3       + x
                            -------------------------------------------------
                          =       x^8       + x^6 + x^5 + x^4 + x^3 + x^2 + x

We still have a polynomial of degree 8, so we can do another step:

                              x +   1
                           -----------------------------------
x^8 + x^4 + x^3 + x^2 + 1 | x^9 + x^8       + x^6                   + x^2
                          - x^9                   + x^5 + x^4 + x^3       + x
                            -------------------------------------------------
                          =       x^8       + x^6 + x^5 + x^4 + x^3 + x^2 + x
                          -       x^8                   + x^4 + x^3 + x^2     + 1
                                  -----------------------------------------------
                          =                   x^6 + x^5                   + x + 1

We now have a polynomial of degree less than 8 that is congruent to our original polynomial modulo 0x11d, and the binary form is 0b01100011 = 99.

> gf_mult 100 9 8
99

This process can be done more efficiently, of course - but understanding the long division will make you much more comfortable with what the faster implementations are doing!
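If you're curious what 'more efficiently' looks like, here is a sketch of the usual bit-at-a-time shift-and-reduce multiply in C. (It leans on the multiply-by-{02} trick explained in the next section, so you may want to revisit it after reading that.)

/* GF(2^8) multiplication by shift-and-reduce. Each iteration
 * conditionally XORs (adds) the current a into the product, then
 * multiplies a by x (i.e. by {02}), reducing by the low bits of the
 * polynomial (0x1d) whenever an x^8 term would otherwise appear. */
unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char p = 0;
    int i;
    for (i = 0; i < 8; i++) {
        if (b & 1)
            p ^= a;                                  /* add a into the product */
        b >>= 1;
        a = (a << 1) ^ ((a & 0x80) ? 0x1d : 0);      /* a *= {02}, reduced */
    }
    return p;
}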

I will not try to convince you that all multiplicative inverses exist in this magic shadow land of GF(2^8), but it's important for the rest of the algorithms to work that they do exist. Trust me on this.

Back to RAID 6

Equipped with this knowledge, you are ready to take on RAID6 in the kernel (PDF) sections 1 - 2.

Pause when you get to section 3 - this snippet is a bit magic and benefits from some explanation:

Multiplication by {02} for a single byte can be implemented using the C code:

uint8_t c, cc;
cc = (c << 1) ^ ((c & 0x80) ? 0x1d : 0);

How does this work? Well:

Say you have a binary number 0bNMMMMMMM. Multiplication by 2 gives you 0bNMMMMMMM0, which is 9 bits. Now, there are two cases to consider.

If your leading bit (N) is 0, your product doesn't have an x^8 term, so we don't need to reduce it modulo the irreducible polynomial.

If your leading bit is 1 however, your product is x^8 + something, which does need to be reduced. Fortunately, because we took an 8 bit number and multiplied it by 2, the largest term is x^8, so we only need to reduce it once. So we xor our number with our polynomial to subtract it.

We implement this by letting the top bit overflow out and then XORing the lower 8 bits with the low 8 bits of the polynomial (0x1d).

So, back to the original statement:

(c << 1) ^ ((c & 0x80) ? 0x1d : 0)
    |          |          |     |
    > multiply by 2       |     |
               |          |     |
               > is the high bit set - will the product have an x^8 term?
                          |     |
                          > if so, reduce by the polynomial
                                |
                                > otherwise, leave alone

Hopefully that makes sense.

Key points

It's critical you understand the section on Altivec (the vperm stuff), so let's cover it in a bit more detail.

Say you want to do A * V, where A is a constant and V is an 8-bit variable. We can express V as V_a + V_b, where V_a is the bottom 4 bits of V and V_b is the top 4 bits. Then A * V = A * V_a + A * V_b.

We can then make lookup tables for multiplication by A.

If we did this in the most obvious way, we would need a 256-entry lookup table. But by splitting things into the top and bottom halves, we can reduce that to two 16-entry tables. For example, say A = 02.

V_a   A * V_a
00    00
01    02
02    04
...   ...
0f    1e

V_b   A * V_b
00    00
10    20
20    40
...   ...
f0    fd

We then use vperm to look up entries in these tables and vxor to combine our results.

So - and this is a key point - for each A value we wish to multiply by, we need to generate a new lookup table.

So if we wanted A = 03:

V_a   A * V_a
00    00
01    03
02    06
...   ...
0f    11

V_b   A * V_b
00    00
10    30
20    60
...   ...
f0    0d
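Generating these per-A tables is just a couple of loops over a scalar multiply. A sketch (the fiddly part in practice is getting the bytes into the lanes that vperm/vpermxor expect):

/* Build the two 16-entry tables for multiplication by A:
 * low[i] holds A * i (bottom nibble), high[i] holds A * (i << 4)
 * (top nibble). A full product is low[V & 0xf] ^ high[V >> 4]. */
void make_nibble_tables(unsigned char A, unsigned char *low,
                        unsigned char *high)
{
    int i;
    for (i = 0; i < 16; i++) {
        low[i]  = gf_mul(A, i);
        high[i] = gf_mul(A, i << 4);
    }
}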

One final thing is that Power8 adds a vpermxor instruction, so we can reduce the entire 4-instruction sequence in the paper:

vsrb v1, v0, v14
vperm v2, v12, v12, v0
vperm v1, v13, v13, v1
vxor v1, v2, v1

to 1 vpermxor:

vpermxor v1, v12, v13, v0

Isn't POWER grand?

OK, but how does this relate to erasure codes?

I'm glad you asked.

Galois Field arithmetic, and its application in RAID 6, is the basis for erasure coding. (It's also the basis for CRCs - two for the price of one!)

But, that's all to come in part 2, which will definitely be published before 7 April!

Many thanks to Sarah Axtens who reviewed the mathematical content of this post and suggested significant improvements. All errors and gross oversimplifications remain my own. Thanks also to the OzLabs crew for their feedback and comments.

High Power Lustre

(Most of the hard work here was done by fellow blogger Rashmica - I just verified her instructions and wrote up this post.)

Lustre is a high-performance clustered file system. Traditionally the Lustre client and server have run on x86, but both the server and client will also work on Power. Here's how to get them running.

Server

Lustre normally requires a patched 'enterprise' kernel - normally an old RHEL, CentOS or SUSE kernel. We tested with a CentOS 7.3 kernel. We tried to follow the Intel instructions for building the kernel as much as possible - any deviations we had to make are listed below.

Setup quirks

We are told to edit ~/kernel/rpmbuild/SPEC/kernel.spec. This doesn't exist because the directory is SPECS not SPEC: you need to edit ~/kernel/rpmbuild/SPECS/kernel.spec.

I also found there was an extra quote mark in the supplied patch script after -lustre.patch. I removed that and ran this instead:

for patch in $(<"3.10-rhel7.series"); do
      patch_file="$HOME/lustre-release/lustre/kernel_patches/patches/${patch}"
      cat "${patch_file}" >> "$HOME/lustre-kernel-x86_64-lustre.patch"
done

The fact that there is 'x86_64' in the patch name doesn't matter as you're about to copy it under a different name to a place where it will be included by the spec file.

Building for ppc64le

Building for ppc64le was reasonably straightforward. I had one small issue:

[build@dja-centos-guest rpmbuild]$ rpmbuild -bp --target=`uname -m` ./SPECS/kernel.spec
Building target platforms: ppc64le
Building for target ppc64le
error: Failed build dependencies:
       net-tools is needed by kernel-3.10.0-327.36.3.el7.ppc64le

Fixing this was as simple as a yum install net-tools.

This was sufficient to build the kernel RPMs. I installed them and booted to my patched kernel - so far so good!

Building the client packages: CentOS

I then tried to build and install the RPMs from lustre-release. This repository provides the sources required to build the client and utility binaries.

./configure and make succeeded, but when I went to install the packages with rpm, I found I was missing some dependencies:

error: Failed dependencies:
        ldiskfsprogs >= 1.42.7.wc1 is needed by kmod-lustre-osd-ldiskfs-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
        sg3_utils is needed by lustre-iokit-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
        attr is needed by lustre-tests-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le
        lsof is needed by lustre-tests-2.9.52_60_g1d2fbad_dirty-1.el7.centos.ppc64le

I was able to install sg3_utils, attr and lsof, but I was still missing ldiskfsprogs.

It seems we need the lustre-patched version of e2fsprogs - I found a mailing list post to that effect.

So, following the instructions on the walkthrough, I grabbed the SRPM and installed the dependencies: yum install -y texinfo libblkid-devel libuuid-devel

I then tried rpmbuild -ba SPECS/e2fsprogs-RHEL-7.spec. This built but failed tests. Some failed because I ran out of disk space - they were using tens of gigabytes. I found some comments in the spec file about this, with suggested tests to disable, so I did that. Even with that fix, I was still failing two tests:

  • f_pgsize_gt_blksize: Intel added this to their fork, and no equivalent exists in the master e2fsprogs branches. This relates to Intel-specific assumptions about page sizes which don't hold on Power.
  • f_eofblocks: This may need fixing for large page sizes, see this bug.

I disabled the tests by adding the following two lines to the spec file, just before make %{?_smp_mflags} check.

rm -rf tests/f_pgsize_gt_blksize
rm -rf tests/f_eofblocks

With those tests disabled I was able to build the packages successfully. I installed them with yum localinstall *1.42.13.wc5* (I needed that rather weird pattern to pick up important RPMs that didn't fit the e2fs* pattern - things like libcom_err and libss).

Following that I went back to the lustre-release build products and was able to successfully run yum localinstall *ppc64le.rpm!

Testing the server

After disabling SELinux and rebooting, I ran the test script:

sudo /usr/lib64/lustre/tests/llmount.sh

This spat out one scary warning:

mount.lustre FATAL: unhandled/unloaded fs type 0 'ext3'

The test did seem to succeed overall, and it would seem that this is a known problem, so I pressed on undeterred.

I then attached a couple of virtual hard drives for the metadata and object store volumes, and having set them up, proceeded to try to mount my freshly minted lustre volume from some clients.

Testing with a ppc64le client

My first step was to test whether another ppc64le machine would work as a client.

I tried with an existing Ubuntu 16.04 VM that I use for much of my day to day development.

A quick google suggested that I could grab the lustre-release repository and run make debs to get Debian packages for my system.

I needed the following dependencies:

sudo apt install module-assistant debhelper dpatch libsnmp-dev quilt

With those the packages built successfully, and could be easily installed:

dpkg -i lustre-client-modules-4.4.0-57-generic_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deb lustre-utils_2.9.52-60-g1d2fbad-dirty-1_ppc64el.deb

I tried to connect to the server:

sudo mount -t lustre $SERVER_IP@tcp:/lustre /lustre/

Initially I wasn't able to connect to the server at all. I remembered that (unlike Ubuntu), CentOS comes with quite an aggressive firewall by default. I ran the following on the server:

systemctl stop firewalld

And voila! I was able to connect, mount the lustre volume, and successfully read and write to it. This is very much an over-the-top hack - I should have poked holes in the firewall to allow just the ports Lustre needs (by default, LNET uses TCP port 988). This is left as an exercise for the reader.

Testing with an x86_64 client

I then tried to run make debs on my Ubuntu 16.10 x86_64 laptop.

This did not go well - I got the following error:

liblustreapi.c: In function ‘llapi_get_poollist’:
liblustreapi.c:1201:3: error: ‘readdir_r’ is deprecated [-Werror=deprecated-declarations]

This looks like one of the new errors introduced in recent GCC versions, and is a known bug. To work around it, I found the following stanza in lustre/autoconf/lustre-core.m4, and removed the -Werror:

AS_IF([test $target_cpu == "i686" -o $target_cpu == "x86_64"],
        [CFLAGS="$CFLAGS -Wall -Werror"])

Even this wasn't enough: I got the following errors:

/home/dja/dev/lustre-release/debian/tmp/modules-deb/usr_src/modules/lustre/lustre/llite/dcache.c:387:22: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
         .d_compare = ll_dcompare,
                  ^~~~~~~~~~~
/home/dja/dev/lustre-release/debian/tmp/modules-deb/usr_src/modules/lustre/lustre/llite/dcache.c:387:22: note: (near initialization for ‘ll_d_ops.d_compare’)

I figured this was probably because Ubuntu 16.10 has a 4.8 kernel, and Ubuntu 16.04 has a 4.4 kernel. Work on supporting 4.8 is ongoing.

Sure enough, when I fired up a 16.04 x86_64 VM with a 4.4 kernel, I was able to build and install fine.

Connecting didn't work first time - the guest failed to mount, but I did get the following helpful error on the server:

LNetError: 2595:0:(acceptor.c:406:lnet_acceptor()) Refusing connection from 10.61.2.227: insecure port 1024

Refusing insecure port 1024 made me think that perhaps the NAT that qemu was performing for me was interfering - perhaps the server expected the connection to come from a privileged source port, and qemu wouldn't be able to arrange that with NAT.

Sure enough, switching NAT to bridging was enough to get the x86 VM to talk to the ppc64le server. I verified that ls, reading and writing all succeeded.

Next steps

The obvious next steps are following up the disabled tests in e2fsprogs, and doing a lot of internal performance and functionality testing.

Happily, it looks like Lustre might be in the mainline kernel before too long - parts have already started to go in to staging. This will make our lives a lot easier: for example, the breakage between 4.4 and 4.8 would probably already have been picked up and fixed if it were in the main kernel tree rather than an out-of-tree patch set.

In the long run, we'd like to make Lustre on Power just as easy as Lustre on x86. (And, of course, more performant!) We'll keep you up to date!

(Thanks to fellow bloggers Daniel Black and Andrew Donnellan for useful feedback on this post.)

NAMD on NVLink

NAMD is a molecular dynamics program that can use GPU acceleration to speed up its calculations. Recent OpenPOWER machines like the IBM Power Systems S822LC for High Performance Computing (Minsky) come with a new interconnect for GPUs called NVLink, which offers extremely high bandwidth to a number of very powerful Nvidia Pascal P100 GPUs. So they're ideal machines for this sort of workload.

Here's how to set up NAMD 2.12 on your Minsky, and how to debug some common issues. We've targeted this script for CentOS, but we've successfully compiled NAMD on Ubuntu as well.

Prerequisites

GPU Drivers and CUDA

Firstly, you'll need CUDA and the NVidia drivers.

You can install CUDA by following the instructions on NVidia's CUDA Downloads page.

yum install epel-release
yum install dkms
# download the rpm from the NVidia website
rpm -i cuda-repo-rhel7-8-0-local-ga2-8.0.54-1.ppc64le.rpm
yum clean expire-cache
yum install cuda
# this will take a while...

Then, we set up a profile file to automatically load CUDA into our path:

cat >  /etc/profile.d/cuda_path.sh <<EOF
# From http://developer.download.nvidia.com/compute/cuda/8.0/secure/prod/docs/sidebar/CUDA_Quick_Start_Guide.pdf - 4.4.2.1
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF

Now, open a new terminal session and check to see if it works:

cuda-install-samples-8.0.sh ~
cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest
make && ./bandwidthTest

If you see a figure of ~32GB/s, that means NVLink is working as expected. A figure of ~7-8GB/s indicates that only PCIe is working, and more debugging is required.

Compilers

You need a C++ compiler:

yum install gcc-c++

Building NAMD

Once CUDA and the compilers are installed, building NAMD is reasonably straightforward. The one hitch is that because we're using CUDA 8.0, and the NAMD build scripts assume CUDA 7.5, we need to supply an updated Linux-POWER.cuda file. (We also enable code generation for Pascal in this file.)

We've documented the entire process as a script which you can download. We'd recommend executing the commands one by one, but if you're brave you can run the script directly.

The script will fetch NAMD 2.12 and build it for you, but won't install it. It will look for the CUDA override file in the directory you are running the script from, and will automatically move it into the correct place so it is picked up by the build system.

The script compiles for a single multicore machine setup, rather than for a cluster. However, it should be a good start for an Ethernet or Infiniband setup.

If you're doing things by hand, you may see some errors during the compilation of charm - as long as charm++ builds successfully at the end, you should be OK.

Testing NAMD

We have been testing NAMD using the STMV files available from the NAMD website:

cd NAMD_2.12_Source/Linux-POWER-g++
wget http://www.ks.uiuc.edu/Research/namd/utilities/stmv.tar.gz
tar -xf stmv.tar.gz
sudo ./charmrun +p80 ./namd2 +pemap 0-159:2 +idlepoll +commthread stmv/stmv.namd

This binds a namd worker thread to every second hardware thread. This is because hardware threads share resources, so using every hardware thread costs overhead and doesn't give us access to any more physical resources.

You should see messages about finding and using GPUs:

Pe 0 physical rank 0 binding to CUDA device 0 on <hostname>: 'Graphics Device'  Mem: 4042MB  Rev: 6.0

This should be significantly faster than on non-NVLink machines - we saw a gain of about 2x in speed going from a machine with Nvidia K80s to a Minsky. If things aren't faster for you, let us know!

Downloads

Other notes

NAMD requires some libraries, some of which are supplied as binary downloads on the NAMD website. Make sure you get the ppc64le versions, not the ppc64 versions, otherwise you'll get errors like:

/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regfree.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(regerror.o): compiled for a big endian system and target is little endian
/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regerror.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(tclAlloc.o): compiled for a big endian system and target is little endian

The script we supply should get these right automatically.