Store Halfword Byte-Reverse Indexed

A Power Technical Blog

What Do You Mean "No"?

Quite often when building small Linux images having separate user accounts isn't always at the top of the list of things to include. Petitboot is no different; the most common operations like mounting disks, configuring interfaces, and calling kexec all require root and Petitboot generally only exists long enough to boot into the next thing, so why not run it all as root?

The picture is less clear when we start to think about what is possible to do in Petitboot by default. If someone comes across an open Petitboot console they're only a few keystrokes away from wiping disks, changing files, or even flashing firmware. Depending on how your system is used that may or may not be something you care about, but over time there have been a few requests to "add a password screen to Petitboot" to at least make it so that the system isn't open season for whoever sees it.

Enter Password:

The most direct way to avoid this would be to slap a password prompt onto Petitboot before any changes can be made. There are two immediate drawbacks to this:

  • The Petitboot UI still runs as root, and
  • Exiting to the shell gives the user root permissions as well.

There is already a mechanism to prevent the user exiting to the shell, but this puts all of our eggs in the basket of petitboot-nc being a secure program. If a user can accidentally or otherwise find a way to exit or crash the UI then they're immediately in a root shell, and while petitboot-nc is a good UI it was never designed to be a hardened program protecting the system.

You Have No Power Here

The idea instead as of Petitboot v1.10.0 is not to care if the user drops to the shell; because now it's completely unprivileged.

Normal shell

The only process now that runs as root is pb-discover itself; the console, UI, and helper scripts run as a new 'petituser'. For the server and clients to still communicate the "petitiboot.ui" socket permissions are modified to allow processes that are part of the 'petitgroup' to connect. However now if pb-discover notices that a client in the petitgroup is connecting (or more accurately the client isn't running as root) by default it ignores any commands from it that would configure or boot the system.

A new command, PB_PROTOCOL_ACTION_AUTHENTICATE, lets a client send a password to the server to then be allowed to send all the usual commands like updating the config or booting a specific option. This keeps all the authentication on the server side, avoiding writing any "secure" ncurses code. In the UI the biggest difference is that when trying to change something the user will hit a password field:

Denied

Then the password is sent to the server, checked, and if correct the action goes ahead.

Whose Passwords?

But where does this password come from? Technically it's just the root password. The server computes a hash of the supplied password and compares it against the system's root password. Similarly in the shell the user can run sudo with the root password to enter a full shell if needed:

Oops

Petitboot of course runs in memory, and writing a root password into the image itself would mean recompiling to change the password, so instead Petitboot pulls the root password from NVRAM. On startup Petitboot reads the petitboot,password parameter which is the hash of the root password and updates /etc/shadow with it. This happens before any clients are up or can connect to the server.

Don't Panic

By default no password is set. After all we don't want people upgrading and then being somehow locked out of their system. For ease of use, and for testing-purposes, if no password is configured and the user drops to the shell it is automatically upgraded to a root shell:

Elevated

To set a password there is a new subscreen in System Configuration:

New Password

This sends an authentication command to the server, and assuming the client is authenticated with the current password as well pb-discover updates the shadow file, and writes the hash back to NVRAM.


User support exists in Petitboot v1.10.0 onwards. You will also need some support in your build system to set up the users, see how op-build did it for an example.

There are a few items on the TODO list still which would be good to have. For example storing the password hash in an attached TPM if available, as well as splitting out more of what runs as root; for example the bootloader parsers in pb-discover preferably wouldn't run with root privileges, but they are all part of the one binary.

As always, comments, suggestions, and patches welcome on the list!

IPMI: Initiating Better Overrides

On platforms that support it Petitboot can interact with the inband IPMI interface to pull information from the BMC. One particularly useful example of this is the "Get System Boot Options" command which we use to implement boot "overrides". By setting parameter 5 of the command a user can remotely force Petitboot to boot from only one class of device or disable autoboot completely. This is great for automation or debug purposes, but since it can only specify device types like "disk" or "network" it can't be used to boot from specific devices.

Introducing..

The Boot Initiator Mailbox

Alexander Amelkin pointed out that the "Get System Boot Options" command also specifies parameter 7, "Boot Initiator Mailbox". This parameter just defines a region of vendor-defined data that can be used to influence the booting behaviour of the system. The parameter description specifies that a BMC must support at least 80 bytes of data in that mailbox so as Alex pointed out we could easily use it to set a partition UUID. But why stop there? Let's go further and use the mailbox to provide an alterate "petitboot,bootdevs=.." parameter and let a user set a full substitute boot order!

The Mailbox Format

Parameter 7 has two fields, 1 byte for the "set selector", and up to 16 bytes of "block data". The spec sets the minimum amount of data to support at 80 bytes, which means a BMC must support at least 5 of these 16-byte "blocks" which can be individually accessed via the set selector. Aside from the first 3 bytes which must be an IANA ID number, the rest of the data is defined by us.

So if we want to set an alternate Petitboot boot order such as "network, usb, disk", the format of the mailbox would be:

Block # |       Block Data              |
-----------------------------------------
0       |2|0|0|p|e|t|i|t|b|o|o|t|,|b|o|o|
1       |t|d|e|v|s|=|n|e|t|w|o|r|k| |u|s|
2       |b| |d|i|s|k| | | | | | | | | | |
3       | | | | | | | | | | | | | | | | |
4       | | | | | | | | | | | | | | | | |

Where the string is null-terminated, 2,0,0 is the IBM IANA ID, and the contents of any remaining data is not important. The ipmi-mailbox-config.py script constructs and sends the required IPMI commands from a given parameter string to make this easier, eg:

./utils/ipmi-mailbox-config.py -b bmc-ip -u user -p pass -m 5 \
        -c "petitboot,bootdevs=uuid:c6e4c4f9-a9a2-4c30-b0db-6fa00f433b3b"

Active Mailbox Override


That is basically all there is to it. Setting a boot order this way overrides the existing order from NVRAM if there is one. Parameter 7 doesn't have a 'persistent' flag so the contents need to either be manually cleared from the BMC or cleared via the "Clear" button in the System Configuration screen.

From the machines I've been able to test on at least AMI BMCs support the mailbox, and hopefully OpenBMC will be able to add it to their IPMI implementation. This is supported in Petitboot as of v1.10.0 so go ahead and try it out!

OpenPOWER Summit Europe 2018: A Software Developer's Introduction to OpenCAPI

Last month, I was in Amsterdam at OpenPOWER Summit Europe. It was great to see so much interest in OpenPOWER, with a particularly strong contingent of researchers sharing how they're exploiting the unique advantages of OpenPOWER platforms, and a number of OpenPOWER hardware partners announcing products.

(It was also my first time visiting Europe, so I had a lot of fun exploring Amsterdam, taking a few days off in Vienna, then meeting some of my IBM Linux Technology Centre colleagues in Toulouse. I also now appreciate just what ~50 hours on planes does to you!)

One particular area which got a lot of attention at the Summit was OpenCAPI, an open coherent high-performance bus interface designed for accelerators, which is supported on POWER9. We had plenty of talks about OpenCAPI and the interesting work that is already happening with OpenCAPI accelerators.

I was invited to present on the Linux Technology Centre's work on enabling OpenCAPI from the software side. In this talk, I outline the OpenCAPI software stack and how you can interface with an OpenCAPI device through the ocxl kernel driver and the libocxl userspace library.

My slides are available, though you'll want to watch the presentation for context.

Apart from myself, the OzLabs team were well represented at the Summit:

Unfortunately none of their videos are up yet, but they'll be there over the next few weeks. Keep an eye on the Summit website and the Summit YouTube playlist, where you'll find all the rest of the Summit content.

If you've got any questions about OpenCAPI feel free to leave a comment!

Open Source Firmware Conference 2018

I recently had the pleasure of attending the 2018 Open Source Firmware Conference in Erlangen, Germany. Compared to other more general conferences I've attended in the past, the laser focus of OSFC on firmware and especially firmware security was fascinating. Seeing developers from across the world coming together to discuss how they are improving their corner of the stack was great, and I've walked away with plenty of new knowledge and ideas (and several kilos of German food and drink..).


What was especially exciting though is that I had the chance to talk about my work on Petitboot and what's happened from the POWER8 launch until now. If you're interested in that, or seeing how I talk after 36 hours of travel, check it out here:


OSFC have made all the talks from the first two days available in a playlist on Youtube
If you're after a few suggestions there was, in no particular order:

Ryan O'Leary giving an update on Linuxboot - also known as NERF, Google's approach to a Linux bootloader all written in Go.

Subrate Banik talking about porting Coreboot on top of Intel's FSP

Ron Minnich describing his work with "rompayloads" on Coreboot

Vadmin Bendebury describing Google's "Secure Microcontroller" Chip

Facebook presenting their use of Linuxboot and "systemboot"

And heaps more, check out the full playlist!

Improving performance of Phoronix benchmarks on POWER9

Recently Phoronix ran a range of benchmarks comparing the performance of our POWER9 processor against the Intel Xeon and AMD EPYC processors.

We did well in the Stockfish, LLVM Compilation, Zstd compression, and the Tinymembench benchmarks. A few of my colleagues did a bit of investigating into some the benchmarks where we didn't perform quite so well.

LBM / Parboil

The Parboil benchmarks are a collection of programs from various scientific and commercial fields that are useful for examining the performance and development of different architectures and tools. In this round of benchmarks Phoronix used the lbm benchmark: a fluid dynamics simulation using the Lattice-Boltzmann Method.

lbm is an iterative algorithm - the problem is broken down into discrete time steps, and at each time step a bunch of calculations are done to simulate the change in the system. Each time step relies on the results of the previous one.

The benchmark uses OpenMP to parallelise the workload, spreading the calculations done in each time step across many CPUs. The number of calculations scales with the resolution of the simulation.

Unfortunately, the resolution (and therefore the work done in each time step) is too small for modern CPUs with large numbers of SMT (simultaneous multi-threading) threads. OpenMP doesn't have enough work to parallelise and the system stays relatively idle. This means the benchmark scales relatively poorly, and is definitely not making use of the large POWER9 system

Also this benchmark is compiled without any optimisation. Recompiling with -O3 improves the results 3.2x on POWER9.

x264 Video Encoding

x264 is a library that encodes videos into the H.264/MPEG-4 format. x264 encoding requires a lot of integer kernels doing operations on image elements. The math and vectorisation optimisations are quite complex, so Nick only had a quick look at the basics. The systems and environments (e.g. gcc version 8.1 for Skylake, 8.0 for POWER9) are not completely apples to apples so for now patterns are more important than the absolute results. Interestingly the output video files between architectures are not the same, particularly with different asm routines and compiler options used, which makes it difficult to verify the correctness of any changes.

All tests were run single threaded to avoid any SMT effects.

With the default upstream build of x264, Skylake is significantly faster than POWER9 on this benchmark (Skylake: 9.20 fps, POWER9: 3.39 fps). POWER9 contains some vectorised routines, so an initial suspicion is that Skylake's larger vector size may be responsible for its higher throughput.

Let's test our vector size suspicion by restricting Skylake to SSE4.2 code (with 128 bit vectors, the same width as POWER9). This hardly slows down the x86 CPU at all (Skylake: 8.37 fps, POWER9: 3.39 fps), which indicates it's not taking much advantage of the larger vectors.

So the next guess would be that x86 just has more and better optimized versions of costly functions (in the version of x264 that Phoronix used there are only six powerpc specific files compared with 21 x86 specific files). Without the time or expertise to dig into the complex task of writing vector code, we'll see if the compiler can help, and turn on autovectorisation (x264 compiles with -fno-tree-vectorize by default, which disables auto vectorization). Looking at a perf profile of the benchmark we can see that one costly function, quant_4x4x4, is not autovectorised. With a small change to the code, gcc does vectorise it, giving a slight speedup with the output file checksum unchanged (Skylake: 9.20 fps, POWER9: 3.83 fps).

We got a small improvement with the compiler, but it looks like we may have gains left on the table with our vector code. If you're interested in looking into this, we do have some active bounties for x264 (lu-zero/x264).

Test Skylake POWER9
Original - AVX256 9.20 fps 3.39 fps
Original - SSE4.2 8.37 fps 3.39 fps
Autovectorisation enabled, quant_4x4x4 vectorised 9.20 fps 3.83 fps

Nick also investigated running this benchmark with SMT enabled and across multiple cores, and it looks like the code is not scalable enough to feed 176 threads on a 44 core system. Disabling SMT in parallel runs actually helped, but there was still idle time. That may be another thing to look at, although it may not be such a problem for smaller systems.

Primesieve

Primesieve is a program and C/C++ library that generates all the prime numbers below a given number. It uses an optimised Sieve of Eratosthenes implementation.

The algorithm uses the L1 cache size as the sieve size for the core loop. This is an issue when we are running in SMT mode (aka more than one thread per core) as all threads on a core share the same L1 cache and so will constantly be invalidating each others cache-lines. As you can see in the table below, running the benchmark in single threaded mode is 30% faster than in SMT4 mode!

This means in SMT-4 mode the workload is about 4x too large for the L1 cache. A better sieve size to use would be the L1 cache size / number of threads per core. Anton posted a pull request to update the sieve size.

It is interesting that the best overall performance on POWER9 is with the patch applied and in SMT2 mode:

SMT level baseline patched
1 14.728s 14.899s
2 15.362s 14.040s
4 19.489s 17.458s

LAME

Despite its name, a recursive acronym for "LAME Ain't an MP3 Encoder", LAME is indeed an MP3 encoder.

Due to configure options not being parsed correctly this benchmark is built without any optimisation regardless of architecture. We see a massive speedup by turning optimisations on, and a further 6-8% speedup by enabling USE_FAST_LOG (which is already enabled for Intel).

LAME Duration Speedup
Default 82.1s n/a
With optimisation flags 16.3s 5.0x
With optimisation flags and USE_FAST_LOG set 15.6s 5.3x

For more detail see Joel's writeup.

FLAC

FLAC is an alternative encoding format to MP3. But unlike MP3 encoding it is lossless! The benchmark here was encoding audio files into the FLAC format.

The key part of this workload is missing vector support for POWER8 and POWER9. Anton and Amitay submitted this patch series that adds in POWER specific vector instructions. It also fixes the configuration options to correctly detect the POWER8 and POWER9 platforms. With this patch series we get see about a 3x improvement in this benchmark.

OpenSSL

OpenSSL is among other things a cryptographic library. The Phoronix benchmark measures the number of RSA 4096 signs per second:

$ openssl speed -multi $(nproc) rsa4096

Phoronix used OpenSSL-1.1.0f, which is almost half as slow for this benchmark (on POWER9) than mainline OpenSSL. Mainline OpenSSL has some powerpc multiplication and squaring assembly code which seems to be responsible for most of this speedup.

To see this for yourself, add these four powerpc specific commits on top of OpenSSL-1.1.0f:

  1. perlasm/ppc-xlate.pl: recognize .type directive
  2. bn/asm/ppc-mont.pl: prepare for extension
  3. bn/asm/ppc-mont.pl: add optimized multiplication and squaring subroutines
  4. ppccap.c: engage new multipplication and squaring subroutines

The following results were from a dual 16-core POWER9:

Version of OpenSSL Signs/s Speedup
1.1.0f 1921 n/a
1.1.0f with 4 patches 3353 1.74x
1.1.1-pre1 3383 1.76x

SciKit-Learn

SciKit-Learn is a bunch of python tools for data mining and analysis (aka machine learning).

Joel noticed that the benchmark spent 92% of the time in libblas. Libblas is a very basic BLAS (basic linear algebra subprograms) library that python-numpy uses to do vector and matrix operations. The default libblas on Ubuntu is only compiled with -O2. Compiling with -Ofast and using alternative BLAS's that have powerpc optimisations (such as libatlas or libopenblas) we see big improvements in this benchmark:

BLAS used Duration Speedup
libblas -O2 64.2s n/a
libblas -Ofast 36.1s 1.8x
libatlas 8.3s 7.7x
libopenblas 4.2s 15.3x

You can read more details about this here.

Blender

Blender is a 3D graphics suite that supports image rendering, animation, simulation and game creation. On the surface it appears that Blender 2.79b (the distro package version that Phoronix used by system/blender-1.0.2) failed to use more than 15 threads, even when "-t 128" was added to the Blender command line.

It turns out that even though this benchmark was supposed to be run on CPUs only (you can choose to render on CPUs or GPUs), the GPU file was always being used. The GPU file is configured with a very large tile size (256x256) - which is fine for GPUs but not great for CPUs. The image size (1280x720) to tile size ratio limits the number of jobs created and therefore the number threads used.

To obtain a realistic CPU measurement with more that 15 threads you can force the use of the CPU file by overwriting the GPU file with the CPU one:

$ cp
~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_cpu.blend
~/.phoronix-test-suite/installed-tests/system/blender-1.0.2/benchmark/pabellon_barcelona/pavillon_barcelone_gpu.blend

As you can see in the image below, now all of the cores are being utilised! Blender with CPU Blend file

Fortunately this has already been fixed in pts/blender-1.1.1. Thanks to the report by Daniel it has also been fixed in system/blender-1.1.0.

Pinning the pts/bender-1.0.2, Pabellon Barcelona, CPU-Only test to a single 22-core POWER9 chip (sudo ppc64_cpu --cores-on=22) and two POWER9 chips (sudo ppc64_cpu --cores-on=44) show a huge speedup:

Benchmark Duration (deviation over 3 runs) Speedup
Baseline (GPU blend file) 1509.97s (0.30%) n/a
Single 22-core POWER9 chip (CPU blend file) 458.64s (0.19%) 3.29x
Two 22-core POWER9 chips (CPU blend file) 241.33s (0.25%) 6.25x

tl;dr

Some of the benchmarks where we don't perform as well as Intel are where the benchmark has inline assembly for x86 but uses generic C compiler generated assembly for POWER9. We could probably benefit with some more powerpc optimsed functions.

We also found a couple of things that should result in better performance for all three architectures, not just POWER.

A summary of the performance improvements we found:

Benchmark Approximate Improvement
Parboil 3x
x264 1.1x
Primesieve 1.1x
LAME 5x
FLAC 3x
OpenSSL 2x
SciKit-Learn 7-15x
Blender 3x

There is obviously room for more improvements, especially with the Primesieve and x264 benchmarks, but it would be interesting to see a re-run of the Phoronix benchmarks with these changes.

Thanks to Anton, Daniel, Joel and Nick for the analysis of the above benchmarks.