Recently Phoronix ran a range of
comparing the performance of our POWER9 processor against the Intel Xeon and AMD
We did well in the Stockfish, LLVM Compilation, Zstd compression, and the
Tinymembench benchmarks. A few of my colleagues did a bit of investigating into
some the benchmarks where we didn't perform quite so well.
LBM / Parboil
The Parboil benchmarks are a
collection of programs from various scientific and commercial fields that are
useful for examining the performance and development of different architectures
and tools. In this round of benchmarks Phoronix used the lbm
benchmark: a fluid dynamics
simulation using the Lattice-Boltzmann Method.
lbm is an iterative algorithm - the problem is broken down into discrete
time steps, and at each time step a bunch of calculations are done to
simulate the change in the system. Each time step relies on the results
of the previous one.
The benchmark uses OpenMP to parallelise the workload, spreading the
calculations done in each time step across many CPUs. The number of
calculations scales with the resolution of the simulation.
Unfortunately, the resolution (and therefore the work done in each time
step) is too small for modern CPUs with large numbers of SMT (simultaneous multi-threading) threads. OpenMP
doesn't have enough work to parallelise and the system stays relatively idle. This
means the benchmark scales relatively poorly, and is definitely
not making use of the large POWER9 system
Also this benchmark is compiled without any optimisation. Recompiling with -O3 improves the
results 3.2x on POWER9.
x264 Video Encoding
x264 is a library that encodes videos into the H.264/MPEG-4 format. x264 encoding
requires a lot of integer kernels doing operations on image elements. The math
and vectorisation optimisations are quite complex, so Nick only had a quick look at
the basics. The systems and environments (e.g. gcc version 8.1 for Skylake, 8.0
for POWER9) are not completely apples to apples so for now patterns are more
important than the absolute results. Interestingly the output video files between
architectures are not the same, particularly with different asm routines and
compiler options used, which makes it difficult to verify the correctness of any changes.
All tests were run single threaded to avoid any SMT effects.
With the default upstream build of x264, Skylake is significantly faster than POWER9 on this benchmark
(Skylake: 9.20 fps, POWER9: 3.39 fps). POWER9 contains some vectorised routines, so an
initial suspicion is that Skylake's larger vector size may be responsible for its higher throughput.
Let's test our vector size suspicion by restricting
Skylake to SSE4.2 code (with 128 bit vectors, the same width as POWER9). This hardly
slows down the x86 CPU at all (Skylake: 8.37 fps, POWER9: 3.39 fps), which indicates it's
not taking much advantage of the larger vectors.
So the next guess would be that x86 just has more and better optimized versions of costly
functions (in the version of x264 that Phoronix used there are only six powerpc specific
files compared with 21 x86 specific files). Without the time or expertise to dig into the
complex task of writing vector code, we'll see if the compiler can help, and turn
on autovectorisation (x264 compiles with -fno-tree-vectorize by default, which disables
auto vectorization). Looking at a perf profile of the benchmark we can see
that one costly function, quant_4x4x4, is not autovectorised. With a small change to the
code, gcc does vectorise it, giving a slight speedup with the output file checksum unchanged
(Skylake: 9.20 fps, POWER9: 3.83 fps).
We got a small improvement with the compiler, but it looks like we may have gains left on the
table with our vector code. If you're interested in looking into this, we do have some
active bounties for x264 (lu-zero/x264).
|Original - AVX256
|Original - SSE4.2
|Autovectorisation enabled, quant_4x4x4 vectorised
Nick also investigated running this benchmark with SMT enabled and across multiple cores, and it looks like the code is
not scalable enough to feed 176 threads on a 44 core system. Disabling SMT in parallel runs
actually helped, but there was still idle time. That may be another thing to look at,
although it may not be such a problem for smaller systems.
Primesieve is a program and C/C++
library that generates all the prime numbers below a given number. It uses an
optimised Sieve of Eratosthenes
The algorithm uses the L1 cache size as the sieve size for the core loop. This
is an issue when we are running in SMT mode (aka more than one thread per core)
as all threads on a core share the same L1 cache and so will constantly be
invalidating each others cache-lines. As you can see
in the table below, running the benchmark in single threaded mode is 30% faster
than in SMT4 mode!
This means in SMT-4 mode the workload is about 4x too large for the L1 cache. A
better sieve size to use would be the L1 cache size / number of
threads per core. Anton posted a pull request
to update the sieve size.
It is interesting that the best overall performance on POWER9 is with the patch applied and in
Despite its name, a recursive acronym for "LAME Ain't an MP3 Encoder",
LAME is indeed an MP3 encoder.
Due to configure options not being parsed correctly this
benchmark is built without any optimisation regardless of architecture. We see a
massive speedup by turning optimisations on, and a further 6-8% speedup by
is already enabled for Intel).
|With optimisation flags
|With optimisation flags and USE_FAST_LOG set
For more detail see Joel's
FLAC is an alternative encoding format to
MP3. But unlike MP3 encoding it is lossless! The benchmark here was encoding
audio files into the FLAC format.
The key part of this workload is missing
vector support for POWER8 and POWER9. Anton and Amitay submitted this
patch series that
adds in POWER specific vector instructions. It also fixes the configuration options
to correctly detect the POWER8 and POWER9 platforms. With this patch series we get see about a 3x
improvement in this benchmark.
OpenSSL is among other things a cryptographic library. The Phoronix benchmark
measures the number of RSA 4096 signs per second:
$ openssl speed -multi $(nproc) rsa4096
Phoronix used OpenSSL-1.1.0f, which is almost half as slow for this benchmark (on POWER9) than mainline OpenSSL.
Mainline OpenSSL has some powerpc multiplication and squaring assembly code which seems
to be responsible for most of this speedup.
To see this for yourself, add these four powerpc specific commits on top of OpenSSL-1.1.0f:
- perlasm/ppc-xlate.pl: recognize .type directive
- bn/asm/ppc-mont.pl: prepare for extension
- bn/asm/ppc-mont.pl: add optimized multiplication and squaring subroutines
- ppccap.c: engage new multipplication and squaring subroutines
The following results were from a dual 16-core POWER9:
|Version of OpenSSL
|1.1.0f with 4 patches
SciKit-Learn is a bunch of python tools for data mining and
analysis (aka machine learning).
Joel noticed that the benchmark spent 92% of the time in libblas. Libblas is a
very basic BLAS (basic linear algebra subprograms) library that python-numpy
uses to do vector and matrix operations. The default libblas on Ubuntu is only
compiled with -O2. Compiling with -Ofast and using alternative BLAS's that have
powerpc optimisations (such as libatlas or libopenblas) we see big improvements
in this benchmark:
You can read more details about this
Blender is a 3D graphics suite that supports image rendering,
animation, simulation and game creation. On the surface it appears that Blender
2.79b (the distro package version that Phoronix used by system/blender-1.0.2)
failed to use more than 15 threads, even when "-t 128" was added to the Blender
It turns out that even though this benchmark was supposed to be run on CPUs only
(you can choose to render on CPUs or GPUs), the GPU file was always being used.
The GPU file is configured with a very large tile size (256x256) -
which is fine for
but not great for CPUs. The image size (1280x720) to tile size ratio limits the
number of jobs created and therefore the number threads used.
To obtain a realistic CPU measurement with more that 15 threads you can force
the use of the CPU file by overwriting the GPU file with the CPU one:
As you can see in the image below, now all of the cores are being utilised!
Fortunately this has already been fixed in
Thanks to the report by Daniel it
has also been fixed in system/blender-1.1.0.
Pinning the pts/bender-1.0.2, Pabellon Barcelona, CPU-Only test to a single
22-core POWER9 chip (
sudo ppc64_cpu --cores-on=22) and two POWER9 chips
sudo ppc64_cpu --cores-on=44) show a huge speedup:
||Duration (deviation over 3 runs)
|Baseline (GPU blend file)
|Single 22-core POWER9 chip (CPU blend file)
|Two 22-core POWER9 chips (CPU blend file)
Some of the benchmarks where we don't perform as well as Intel are where the
benchmark has inline assembly for x86 but uses generic C compiler generated
assembly for POWER9. We could probably benefit with some more powerpc optimsed functions.
We also found a couple of things that should result in better performance for all three architectures,
not just POWER.
A summary of the performance improvements we found:
There is obviously room for more improvements, especially with the Primesieve and x264 benchmarks,
but it would be interesting to see a re-run of the Phoronix benchmarks with these changes.
Thanks to Anton, Daniel, Joel and Nick for the analysis of the above benchmarks.