Store Half Byte-Reverse Indexed

A Power Technical Blog

NAMD on NVLink

NAMD is a molecular dynamics program that can use GPU acceleration to speed up its calculations. Recent OpenPOWER machines like the IBM Power Systems S822LC for High Performance Computing (Minsky) come with a new interconnect for GPUs called NVLink, which offers extremely high bandwidth to a number of very powerful Nvidia Pascal P100 GPUs. So they're ideal machines for this sort of workload.

Here's how to set up NAMD 2.12 on your Minsky, and how to debug some common issues. We've targeted this script for CentOS, but we've successfully compiled NAMD on Ubuntu as well.


GPU Drivers and CUDA

Firstly, you'll need CUDA and the NVidia drivers.

You can install CUDA by following the instructions on NVidia's CUDA Downloads page.

yum install epel-release
yum install dkms
# download the rpm from the NVidia website
rpm -i cuda-repo-rhel7-8-0-local-ga2-8.0.54-1.ppc64le.rpm
yum clean expire-cache
yum install cuda
# this will take a while...

Then, we set up a profile file to automatically load CUDA into our path:

cat >  /etc/profile.d/ <<EOF
# From -
export PATH=/usr/local/cuda-8.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Now, open a new terminal session and check to see if it works: ~
cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest
make && ./bandwidthTest

If you see a figure of ~32GB/s, that means NVLink is working as expected. A figure of ~7-8GB indicates that only PCI is working, and more debugging is required.


You need a c++ compiler:

yum install gcc-c++

Building NAMD

Once CUDA and the compilers are installed, building NAMD is reasonably straightforward. The one hitch is that because we're using CUDA 8.0, and the NAMD build scripts assume CUDA 7.5, we need to supply an updated Linux-POWER.cuda file. (We also enable code generation for the Pascal in this file.)

We've documented the entire process as a script which you can download. We'd recommend executing the commands one by one, but if you're brave you can run the script directly.

The script will fetch NAMD 2.12 and build it for you, but won't install it. It will look for the CUDA override file in the directory you are running the script from, and will automatically move it into the correct place so it is picked up by the build system..

The script compiles for a single multicore machine setup, rather than for a cluster. However, it should be a good start for an Ethernet or Infiniband setup.

If you're doing things by hand, you may see some errors during the compilation of charm - as long as you get charm++ built successfully. at the end, you should be OK.

Testing NAMD

We have been testing NAMD using the STMV files available from the NAMD website:

cd NAMD_2.12_Source/Linux-POWER-g++
tar -xf stmv.tar.gz
sudo ./charmrun +p80 ./namd2 +pemap 0-159:2 +idlepoll +commthread stmv/stmv.namd

This binds a namd worker thread to every second hardware thread. This is because hardware threads share resources, so using every hardware thread costs overhead and doesn't give us access to any more physical resources.

You should see messages about finding and using GPUs:

Pe 0 physical rank 0 binding to CUDA device 0 on <hostname>: 'Graphics Device'  Mem: 4042MB  Rev: 6.0

This should be significantly faster than on non-NVLink machines - we saw a gain of about 2x in speed going from a machine with Nvidia K80s to a Minsky. If things aren't faster for you, let us know!


Other notes

Namd requires some libraries, some of which they supply as binary downloads on their website. Make sure you get the ppc64le versions, not the ppc64 versions, otherwise you'll get errors like:

/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regfree.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(regerror.o): compiled for a big endian system and target is little endian
/bin/ld: failed to merge target specific data of file .rootdir/tcl/lib/libtcl8.5.a(regerror.o)
/bin/ld: .rootdir/tcl/lib/libtcl8.5.a(tclAlloc.o): compiled for a big endian system and target is little endian

The script we supply should get these right automatically. 2017 review

I recently attended LCA 2017, where I gave a talk at the Linux Kernel miniconf (run by fellow sthbrx blogger Andrew Donnellan!) and a talk at the main conference.

I received some really interesting feedback so I've taken the opportunity to write some of it down to complement the talk videos and slides that are online. (And to remind me to follow up on it!)

Miniconf talk: Sparse Warnings

My kernel miniconf talk was on sparse warnings (pdf slides, 23m video).

The abstract read (in part):

sparse is a semantic parser for C, and is one of the static analysis tools available to kernel devs.

Sparse is a powerful tool with good integration into the kernel build system. However, we suffer from warning overload - there are too many sparse warnings to spot the serious issues amongst the trivial. This makes it difficult to use, both for developers and maintainers.

Happily, I received some feedback that suggests it's not all doom and gloom like I had thought!

  • Dave Chinner told me that the xfs team uses sparse regularly to make sure that the file system is endian-safe. This is good news - we really would like that to be endian-safe!

  • Paul McKenney let me know that the 0day bot does do some sparse checking - it would just seem that it's not done on PowerPC.

Main talk: 400,000 Ephemeral Containers

My main talk was entitled "400,000 Ephemeral Containers: testing entire ecosystems with Docker". You can read the abstract for full details, but it boils down to:

What if you want to test how all the packages in a given ecosystem work in a given situation?

My main example was testing how many of the Ruby packages successfully install on Power, but I also talk about other languages and other cool tests you could run.

The 44m video is online. I haven't put the slides up yet but they should be available on GitHub soonish.

Unlike with the kernel talk, I didn't catch the names of most of the people with feedback.

Docker memory issues

One of the questions I received during the talk was about running into memory issues in Docker. I attempted to answer that during the Q&A. The person who asked the question then had a chat with me afterwards, and it turns out I had completely misunderstood the question. I thought it was about memory usage of running containers in parallel. It was actually about memory usage in the docker daemon when running lots of containers in serial. Apparently the docker daemon doesn't free memory during the life of the process, and the question was whether or not I had observed that during my runs.

I didn't have a good answer for this at the time other than "it worked for me", so I have gone back and looked at the docker daemon memory usage.

After a full Ruby run, the daemon is using about 13.9G of virtual memory, and 1.975G of resident memory. If I restart it, the memory usage drops to 1.6G of virtual and 43M of resident memory. So it would appear that the person asking the question was right, and I'm just not seeing it have an effect.

Other interesting feedback

  • Someone was quite interested in testing on Sparc, once they got their Go runtime nailed down.

  • A Rackspacer was quite interested in Python testing for OpenStack - this has some intricacies around Py2/Py3, but we had an interesting discussion around just testing to see if packages that claim Py3 support provide Py3 support.

  • A large jobs site mentioned using this technique to help them migrate their dependencies between versions of Go.

  • I was 'gently encouraged' to try to do better with how long the process takes to run - if for no other reason than to avoid burning more coal. This is a fair point. I did not explain very well what I meant with diminishing returns in the talk: there's lots you could do to make the process faster, it's just comes at the cost of the simplicity that I really wanted when I first started the project. I am working (on and off) on better ways to deal with this by considering the dependency graph.

Extracting Early Boot Messages in QEMU

Be me, you're a kernel hacker, you make some changes to your kernel, you boot test it in QEMU, and it fails to boot. Even worse is the fact that it just hangs without any failure message, no stack trace, no nothing. "Now what?" you think to yourself.

You probably do the first thing you learnt in debugging101 and add abundant print statements all over the place to try and make some sense of what's happening and where it is that you're actually crashing. So you do this, you recompile your kernel, boot it in QEMU and lo and behold, nothing... What happened? You added all these shiny new print statements, where did the output go? The kernel still failed to boot (obviously), but where you were hoping to get some clue to go on you were again left with an empty screen. "Maybe I didn't print early enough" or "maybe I got the code paths wrong" you think, "maybe I just need more prints" even. So lets delve a bit deeper, why didn't you see those prints, where did they go, and how can you get at them?


So what happens when you call printk()? Well what normally happens is, depending on the log level you set, the output is sent to the console or logged so you can see it in dmesg. But what happens if we haven't registered a console yet? Well then we can't print the message can we, so its logged in a buffer, kernel log buffer to be exact helpfully named __log_buf.

Console Registration

So how come I eventually see print statements on my screen? Well at some point during the boot process a console is registered with the printk system, and any buffered output can now be displayed. On ppc it happens that this occurs in register_early_udbg_console() called in setup_arch() from start_kernel(), which is the generic kernel entry point. From this point forward when you print something it will be displayed on the console, but what if you crash before this? What are you supposed to do then?

Extracting Early Boot Messages in QEMU

And now the moment you've all been waiting for, how do I extract those early boot messages in QEMU if my kernel crashes before the console is registered? Well it's quite simple really, QEMU is nice enough to allow us to dump guest memory, and we know the log buffer is in there some where, so we just need to dump the correct part of memory which corresponds to the log buffer.

Locating __log_buf

Before we can dump the log buffer we need to know where it is. Luckily for us this is fairly simple, we just need to dump all the kernel symbols and look for the right one.

> nm vmlinux > tmp; grep __log_buf tmp;
c000000000f5e3dc b __log_buf

We use the nm tool to list all the kernel symbols and output this into some temporary file, we can then grep this for the log buffer (which we know to be named __log_buf), and presto we are told that it's at kernel address 0xf5e3dc.

Dumping Guest Memory

It's then simply a case of dumping guest memory from the QEMU console. So first we press ^a+c to get us to the QEMU console, then we can use the aptly named dump-guest-memory.

> help dump-guest-memory
dump-guest-memory [-p] [-d] [-z|-l|-s] filename [begin length] -- dump guest memory into file 'filename'.
            -p: do paging to get guest's memory mapping.
            -d: return immediately (do not wait for completion).
            -z: dump in kdump-compressed format, with zlib compression.
            -l: dump in kdump-compressed format, with lzo compression.
            -s: dump in kdump-compressed format, with snappy compression.
            begin: the starting physical address.
            length: the memory size, in bytes.

We just give it a filename for where we want our output to go, we know the starting address, we just don't know the length. We could choose some arbitrary length, but inspection of the kernel code shows us that:

static char __log_buf[__LOG_BUF_LEN] __aligned(LOG_ALIGN);

Looking at the pseries_defconfig file shows us that the LOG_BUF_SHIFT is set to 18, and thus we know that the buffer is 2^18 bytes or 256kb. So now we run:

> dump-guest-memory tmp 0xf5e3dc 262144

And we now get our log buffer in the file tmp. This can simply be viewed with:

> hexdump -C tmp

This gives a readable, if poorly formatted output. I'm sure you can find something better but I'll leave that as an exercise for the reader.


So if like me your kernel hangs somewhere early in the boot process and you're left without your console output you are now fully equipped to extract the log buffer in QEMU and hopefully therein lies the answer to why you failed to boot.

Installing Centos 7.2 on IBM Power System's S822LC for High Performance Computing (Minksy) with USB device


If you are installing Linux on your IBM Power System's S822LC server then the instructions in this article will help you to start and run your system. These instructions are specific to installing CentOS 7 on an IBM Power System S822LC for High Performance Computing (Minsky), but also work for RHEL 7 - just swap CentOS for RHEL.


Before you power on the system, ensure that you have the following items:

  • Ethernet cables;
  • USB storage device of 7G or greater;
  • An installed ethernet network with a DHCP server;
  • Access to the DHCP server's logs;
  • Power cords and outlet for your system;
  • PC or notebook that has IPMItool level 1.8.15 or greater; and
  • a VNC client.

Download CentOS ISO file from the Centos Mirror. Select the "Everything" ISO file.

Note: You must use the 1611 release (dated 2016-12-22) or later due to Linux Kernel support for the server hardware.

Step 1: Preparing to power on your system

Follow these steps to prepare your system:

  1. If your system belongs in a rack, install your system into that rack. For instructions, see IBM POWER8 Systems information.
  2. Connect an Ethernet cable to the left embedded Ethernet port next to the serial port on the back of your system and the other end to your network. This Ethernet port is used for the BMC/IPMI interface.
  3. Connect another Enternet cable to the right Ethernet port for network connection for the operating system.
  4. Connect the power cords to the system and plug them into the outlets.

At this point, your firmware is booting.

Step 2: Determining the BMC firmware IP address

To determine the IP address of the BMC, examine the latest DHCP server logs for the network connected to the server. The IP address will be requested approximately 2 minutes after being powered on.

It is possible to set the BMC to a static IP address by following the IBM documentation on IPMI.

Step 3: Connecting to the BMC firmware with IPMItool

After you have a network connection set up for your BMC firmware, you can connect using Intelligent Platform Management Interface (IPMI). IPMI is the default console to use when connecting to the Open Power Abstraction Layer (OPAL) firmware.

Use the default authentication for servers over IPMI is:

  • Default user: ADMIN
  • Default password: admin

To power on your server from a PC or notebook that is running Linux®, follow these steps:

Open a terminal program on your PC or notebook with Activate Serial-Over-Lan using IPMI. Use other steps here as needed.

For the following impitool commands, server_ip_address is the IP address of the BMC from Step 2, and ipmi_user and ipmi_password are the default user ID and password for IPMI.

Power On using IPMI

If your server is not powered on, run the following command to power the server on:

ipmitool -I lanplus -H server_ip_address -U ipmi_user -P ipmi_password chassis power on

Activate Serial-Over-Lan using IPMI

Activate your IPMI console by running this command:

ipmitool -I lanplus -H server_ip_address -U ipmi_user -P ipmi_password sol activate

After powering on your system, the Petitboot interface loads. If you do not interrupt the boot process by pressing any key within 10 seconds, Petitboot automatically boots the first option. At this point the IPMI console will be connected to the Operating Systems serial. If you get to this stage accidently you can deactivate and reboot as per the following two commands.

Deactivate Serial-Over-Lan using IPMI

If you need to power off or reboot your system, deactivate the console by running this command:

ipmitool -I lanplus -H server_ip_address -U user-name -P ipmi_password sol deactivate

Reboot using IPMI

If you need to reboot the system, run this command:

ipmitool -I lanplus -H server_ip_address -U user-name -P ipmi_password chassis power reset

Step 4: Creating a USB device and booting

At this point, your IPMI console should be contain a Petitboot bootloader menu as illustrated below and you are ready to install Centos 7 on your server.

Petitboot menu over IPMI

Use one of the following USB devices:

  • USB attached DVD player with a single USB cable to stay under 1.0 Amps, or
  • 7 GB (or more) 2.0 (or later) USB flash drive.

Follow the following instructions:

  1. To create the bootable USB device, follow the instructions in the CentOS wiki Host to Set Up a USB to Install CentOS.
  2. Insert your bootable USB device into the front USB port. CentOS AltArch installer will automatically appear as a boot option on the Petitboot main screen. If the USB device does not appear select Rescan devices. If your device is not detected, you might have to try a different type.
  3. Arrow up to select the CentOS boot option. Press e (Edit) to open the Petitboot Option Editor window
  4. Move the cursor to the Boot arguments section and to include the following information: ro inst.stage2=hd:LABEL=CentOS_7_ppc64le:/ console=hvc0 ip=dhcp (if using RHEL the LABEL will be similar to RHEL-7.3\x20Server.ppc64le:/)

Petitboot edited "Install CentOS AltArch 7 (64-bit kernel)

Notes about the boot arguments:

  • ip=dhcp to ensure network is started for VNC installation.
  • console hvc0 is needed as this is not the default.
  • inst.stage2 is needed as the boot process won't automatically find the stage2 install on the install disk.
  • append inst.proxy=URL where URL is the proxy URL if installing in a network that requires a proxy to connect externally.

You can find additional options at Anaconda Boot Options.

  1. Select OK to save your options and return to the Main menu
  2. On the Petitboot main screen, select the CentOS AltArch option and then press Enter.

Step 5: Complete your installation

After you select to boot the CentOS installer, the installer wizard walks you through the steps.

  1. If the CentOS installer was able to obtain a network address via DHCP, it will present an option to enable the VNC. If no option is presented check your network cables. VNC option
  2. Select the Start VNC option and it will provide an OS server IP adress. Note that this will be different to the BMC address previously optained. VNC option selected
  3. Run a VNC client program on your PC or notebook and connect to the OS server IP address.

VNC of Installer

During the install over VNC, there are a couple of consoles active. To switch between them in the ipmitool terminal, press ctrl-b and then between 1-4 as indicated.

Using the VNC client program:

  1. Select "Install Destination"
  2. Select a device from "Local Standard Disks"
  3. Select "Full disk summary and boot device"
  4. Select the device again from "Selected Disks" with the Boot enabled
  5. Select "Do not install boot loader" from device. Disabling install of boot loader which results in Result after disabling boot loader install.

Without disabling boot loader, the installer complains about an invalid stage1 device. I suspect it needs a manual Prep partition of 10M to make the installer happy.

If you have a local Centos repository you can set this by selecting "Install Source" - the directories at this url should look like CentOS's Install Source for ppc64le.

Step 6: Before reboot and using the IPMI Serial-Over-LAN

Before reboot, generate the grub.cfg file as Petitboot uses this to generate its boot menu:

  1. Using the ipmitool's shell (ctrl-b 2):
  2. Enter the following commands to generate a grub.cfg file
chroot /mnt/sysimage
rm /etc/grub.d/30_os-prober
grub2-mkconfig -o /boot/grub2/grub.cfg

/etc/grub.d/30_os-prober is removed as Petitboot probes the other devices anyway so including it would create lots of duplicate menu items.

The last step is to restart your system.

Note: While your system is restarting, remove the USB device.

After the system restarts, Petitboot displays the option to boot CentOS 7.2. Select this option and press Enter.


After you have booted CentOS, your server is ready to go! For more information, see the following resources:

Getting In Sync

Since at least v1.0.0 Petitboot has used device-mapper snapshots to avoid mounting block devices directly. Primarily this is so Petitboot can mount disks and potentially perform filesystem recovery without worrying about messing it up and corrupting a host's boot partition - all changes happen to the snapshot in memory without affecting the actual device.

This of course gets in the way if you actually do want to make changes to a block device. Petitboot will allow certain bootloader scripts to make changes to disks if configured (eg, grubenv updates), but if you manually make changes you would need to know the special sequence of dmsetup commands to merge the snapshots back to disk. This is particulary annoying if you're trying to copy logs to a USB device!

Depending on how recent a version of Petitboot you're running, there are two ways of making sure your changes persist:

Before v1.2.2

If you really need to save changes from within Petitboot, the most straightforward way is to disable snapshots. Drop to the shell and enter

nvram --update-config petitboot,snapshots?=false

Once you have rebooted you can remount the device as read-write and modify it as normal.

After v1.2.2

To make this easier while keeping the benefit of snapshots, v1.2.2 introduces a new user-event that will merge snapshots on demand. For example:

mount -o remount,rw /var/petitboot/mnt/dev/sda2
cp /var/log/messages /var/petitboot/mnt/dev/sda2/
pb-event sync@sda2

After calling pb-event sync@yourdevice, Petitboot will remount the device back to read-only and merge the current snapshot differences back to disk. You can also run pb-event sync@all to sync all existing snapshots if desired.