Store Halfword Byte-Reverse Indexed

A Power Technical Blog

Panic, flushing and compromise

This is a tale of a simple problem, with a relatively simple solution, that ended up being pretty complicated.

The BMC of OpenPOWER machines exposes a serial console. It's pretty useful for getting information as the system is booting, or when it's having issues and the network is down. OpenPOWER machines also have runtime firmware, namely skiboot, which the Linux kernel calls to make certain things happen. One of those is writing to the serial console. There's a function that skiboot exposes, opal_poll_events() (which then calls opal_run_pollers()), which the kernel calls frequently. Among other things, it performs a partial flush of the serial console. And that all works fine...until the kernel panics.

Well, the kernel is in panic. Who cares if it flushes the console? It's dead. It doesn't need to do anything else.

Oh, right. It prints the reason it panicked. Turns out that's pretty useful.

There's a pretty simple fix here that we can push into the firmware. Most kernels are configured to reboot after panic, typically with some delay. In OpenPOWER, the kernel reboots by calling into skiboot with the opal_cec_reboot() function. So all we need to do is flush out the console buffer:

static int64_t opal_cec_reboot(void)
{
    printf("OPAL: Reboot request...\n");

    console_complete_flush(); // <-- what I added

    // rebooting stuff happens here...

    return OPAL_SUCCESS;
}

Writing a complete flushing function was pretty easy; then it was just a matter of calling it from the power down and reboot functions. Easy, all nicely contained in firmware.
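Conceptually, the complete flush is just the existing partial flush run in a loop until the buffer is empty. A minimal sketch, assuming a hypothetical helper console_flush_partial() that returns true while output remains buffered (skiboot's real function names differ):

/* Sketch only: drain the console buffer by looping the partial flush. */
void console_complete_flush(void)
{
    while (console_flush_partial())
        ;  /* keep draining until nothing is left */
}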

Now, what if the kernel isn't configured to reboot after panic? Or, what if the reboot timer is really long? Do you want to wait 3 minutes to see your panic output? Probably not. We need to call the pollers after panic.

First, I had to figure out what the kernel actually does when it panics. Let's have a look at the panic function itself to figure out where we could work some code in.

In the panic() function, the easiest place I found to put in some code was panic_blink(). This is supposed to be a function to blink the LEDs on your keyboard when the kernel is panicking, but we could set it to opal_poll_events() and it'd work fine. There, problem solved!

Oh, wait. That will never get accepted upstream, ever. Let's try again.

Well, there are #ifdefs in the code that are architecture specific, for s390 and SPARC. I could add an #ifdef to check if we're an OpenPOWER machine, and if so, run the pollers a bunch of times. That would also involve including architecture specific code from arch/powerpc, and that's somewhat gross. Maybe I could upstream this, but it'd be difficult. There must be a better way.

As a kernel noob, I found myself digging into what every function called by panic() actually did, to see if there was a way I could use one of them. I overlooked it at first, but eventually I started looking harder at this line:

    kmsg_dump(KMSG_DUMP_PANIC);

It turns out kmsg_dump() does what it says: dumps messages from the kernel. Different parts of the kernel can register their own dumpers, so the kernel can have a variety of dumpers for different purposes. One existing example in OpenPOWER is a kmsg dumper that stores messages in nvram (non-volatile RAM), so you can find it after you reboot.

Well, we don't really want to dump any output; it's already been sent to the output buffer. We just need to flush it. Pretty simple: just call opal_poll_events() a whole bunch of times, right? That would work, though it'd be nice to have a better way than just calling the pollers. Instead, we can add a new API call to skiboot specifically for console flushing, and call it from the kmsg dumper.

Initially, I wired up the skiboot complete console flushing function to a new OPAL API call, and called that from the kernel. After some feedback, this was refactored into a partial, incremental flush so it was more generic. I also had to consider what happened if the machine was running a newer kernel and an older skiboot: if the skiboot version didn't have my new flushing call, the kernel would fall back to calling the pollers an arbitrary number of times.

In the end, it looks like this:

/*
 * Console output is controlled by OPAL firmware.  The kernel regularly calls
 * OPAL_POLL_EVENTS, which flushes some console output.  In a panic state,
 * however, the kernel no longer calls OPAL_POLL_EVENTS and the panic message
 * may not be completely printed.  This function does not actually dump the
 * message, it just ensures that OPAL completely flushes the console buffer.
 */
static void force_opal_console_flush(struct kmsg_dumper *dumper,
                                     enum kmsg_dump_reason reason)
{
    int i;
    int64_t ret;

    /*
     * Outside of a panic context the pollers will continue to run,
     * so we don't need to do any special flushing.
     */
    if (reason != KMSG_DUMP_PANIC)
        return;

    if (opal_check_token(OPAL_CONSOLE_FLUSH)) {
        ret = opal_console_flush(0);

        if (ret == OPAL_UNSUPPORTED || ret == OPAL_PARAMETER)
            return;

        /* Incrementally flush until there's nothing left */
        while (opal_console_flush(0) != OPAL_SUCCESS);
    } else {
        /*
         * If OPAL_CONSOLE_FLUSH is not implemented in the firmware,
         * the console can still be flushed by calling the polling
         * function enough times to flush the buffer.  We don't know
         * how much output still needs to be flushed, but we can be
         * generous since the kernel is in panic and doesn't need
         * to do much else.
         */
        printk(KERN_NOTICE "opal: OPAL_CONSOLE_FLUSH missing.\n");
        for (i = 0; i < 1024; i++) {
            opal_poll_events(NULL);
        }
    }
}
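For this callback to run at panic time it also has to be registered with the kernel's kmsg dump machinery. The registration side looks roughly like this (a simplified sketch; the exact wiring in the powernv platform code differs a little):

static struct kmsg_dumper opal_kmsg_dumper = {
    .dump = force_opal_console_flush
};

void __init opal_kmsg_init(void)
{
    int rc;

    /* Add our dumper to the list run by kmsg_dump() on panic */
    rc = kmsg_dump_register(&opal_kmsg_dumper);
    if (rc != 0)
        pr_err("opal: kmsg_dump_register failed; returned %d\n", rc);
}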

You can find the full code in-tree here.

And thus, panic messages now roam free 'cross the countryside, causing developer frustration around the world. At least now they know why they're frustrated.

Evolving into a systems programmer

In a previous life I tutored first year computing. The university I attended had a policy of using C to introduce first years to programming. One of the most rewarding aspects of teaching is opening doors of possibility to people by sharing my knowledge.

Over the years I had a mixture of computer science and computer engineering students, as well as students from other engineering disciplines who were required to learn the basics (notably electrical and mechanical). Each class was different and the initial knowledge always varied greatly. The beauty of teaching C meant that there was never someone who truly knew it all; heck, I didn't and still don't. The other advantage of teaching C is that I could very quickly spot the hackers: the shy person at the back of the room whose eyes light up when you know you've correctly explained pointers (to them anyway), or the smile they make when they ask "What happens if you use a negative index into an array?" and hear "What do you think happens?" in reply.

Right there I would see the makings of a hacker, and this post is dedicated to you, or to anyone who wants to be a hacker. I've been asked "What did you do to get where you are?" and "How do I get into Linux?" (vague much?) at careers fairs. I never quite know what to say; here goes a braindump.

Start with the basics. One of the easiest ways we tested the first years was to tell them they couldn't use parts of libc. That was a great exam; setting aside those who didn't read the question and used strlen() when they were explicitly told they couldn't #include <string.h>, a true hacker doesn't need libc, and understands it won't always be there. I thought of this example because only two weeks ago I was writing code in an environment where I didn't have libc. OK sure, if you've got it, use it, just don't crumble when you don't. Oh how I wish I could have told those students who argued that it was a pointless question that they were objectively wrong.
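To make that concrete: strlen() is a handful of lines once you stop reaching for the header. A rough libc-free version (a sketch of the sort of answer we were after, not glibc's optimised one):

/* Walk the string until the NUL terminator and count the steps. */
static unsigned long my_strlen(const char *s)
{
    const char *p = s;

    while (*p)
        p++;

    return (unsigned long)(p - s);
}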

Be a fan of assembly. Don't be afraid of it; it doesn't bite and it can be a lot of fun. I wouldn't encourage you to dive right into the PowerISA, it's intense, but perhaps understand the beauty of GCC and know what it's doing for you. There is a variety of little 8-bit processors you can play with these days.

At all levels of my teaching I saw almost everyone get something which 'worked', and that's fine, it probably does, but I'm here to tell you that it doesn't work until you know why it works. I'm all for the 'try it and see' approach, but once you've tried it you have to explain why the behaviour changed, otherwise you didn't fix it. As an extension to that, know how your tools work. I don't think anyone would expect you to be able to write tools of the complexity of GCC or GDB or Valgrind, but have a rough idea as to how they achieve their goals.

A hacker is paranoid. Yes, malloc() fails. Linux might just decide now isn't a good time for you to open(), and your fopen()-calling function had better be cool with that. A hacker also doesn't rely on the kindness of the operating system; there's an munmap() for a reason. Nor should you even completely trust it: what are you leaving around in memory?
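In code, that paranoia is boringly simple, which is exactly why it gets skipped. A small hypothetical sketch of the habit:

#include <stdio.h>
#include <stdlib.h>

/* Paranoid read: assume malloc() and fopen() can and will fail. */
static char *read_some(const char *path, size_t len)
{
    char *buf = malloc(len + 1);
    FILE *f;

    if (!buf)                     /* malloc() fails; deal with it */
        return NULL;

    f = fopen(path, "r");
    if (!f) {                     /* the OS said "not now"; be cool with that */
        free(buf);
        return NULL;
    }

    buf[fread(buf, 1, len, f)] = '\0';
    fclose(f);
    return buf;
}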

Above all, do it for the fun of it. So many of my students asked how I knew everything I knew (I was only a year ahead of them in my first year of teaching), and put simply: write code on a Saturday night.

None of these things do or don't make you a hacker, being a hacker is a frame of mind and a way of thinking but all of the above helps.

Unfortunately there isn't a single path; I might even say it is a path that chooses you. Odds are you're here because you approached me at some point and asked me one of those questions I never quite know how to answer. Perhaps this is the path; at the very least you're asking questions and approaching people. I hope I did on the day, but once again, all the very best with your endeavours into the future.

What the HILE is this?

One of the cool features of POWER8 processors is the ability to run in either big- or little-endian mode. Several distros are already available in little-endian, but up until recently Petitboot has remained big-endian. While it has no effect on the OS, building Petitboot little-endian has its advantages, such as making support for vendor tools easier. So it should just be a matter of compiling Petitboot LE right? Well...

Switching Endianness

Endianness, along with several other things, is controlled by the Machine State Register (MSR). Each processor in a machine has an MSR, and each bit of the MSR controls some aspect of the processor, such as 64-bit mode or enabling interrupts. To switch endianness we set the LE bit (bit 63) to 1.

When a processor first starts up it defaults to big-endian (bit 63 = 0). However, the processor doesn't actually know the endianness of the kernel code it is about to execute - either it is big-endian and everything is fine, or it isn't and the processor will very quickly try to execute an illegal instruction.

The solution to this is an amazing little snippet of code in arch/powerpc/boot/ppc_asm.h (reproduced here with its helpful comments):

#define FIXUP_ENDIAN                                                    \
    tdi   0, 0, 0x48; /* reads as "b .+8" when fetched byte-swapped */  \
    b     $+36;       /* endian matches: skip the trampoline */         \
    .long 0x05009f42; /* bcl 20,31,$+4   */                             \
    .long 0xa602487d; /* mflr r10        */                             \
    .long 0x1c004a39; /* addi r10,r10,28 */                             \
    .long 0xa600607d; /* mfmsr r11       */                             \
    .long 0x01006b69; /* xori r11,r11,1  */                             \
    .long 0xa6035a7d; /* mtsrr0 r10      */                             \
    .long 0xa6037b7d; /* mtsrr1 r11      */                             \
    .long 0x2400004c  /* rfid            */

By some amazing coincidence, if you take the opcode for tdi 0, 0, 0x48 and flip the order of its bytes, it forms the opcode for b .+8. So if the kernel is big-endian (matching the processor), the tdi is a harmless no-op and the following branch skips over the trampoline. However, if the kernel is little-endian, the processor reads that first word byte-swapped as b .+8 and falls into the next 8 instructions. These are stored byte-reversed too, so the wrong-endian processor decodes them as the intended instructions (see the comments above), finishing with an rfid that sets MSR_LE to 1.
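You can check the coincidence yourself. Assuming the standard encodings (tdi 0,0,0x48 assembles to 0x08000048, and b .+8 to 0x48000008), a few lines of C confirm it:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t tdi_insn = 0x08000048;  /* tdi 0,0,0x48 */

    /* Byte-swapped it becomes 0x48000008, the encoding of "b .+8" */
    printf("0x%08x\n", __builtin_bswap32(tdi_insn));
    return 0;
}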

When booting a little-endian kernel all of the above works fine - but there is a problem for Petitboot that will become apparent a little further down...

Petitboot's Secret Sauce

The main feature of Petitboot is that it is a full (but small!) Linux kernel and userspace which scans all available devices and presents possible boot options. To boot an available operating system Petitboot needs to start executing the OS's kernel, which it accomplishes via kexec. Simply speaking kexec loads the target kernel into memory, shuts the current system down most of the way, and at the last moment sets the instruction pointer to the start of the target kernel. From there it's like booting any other kernel, including the FIXUP_ENDIAN section above.

We've Booted! Wait...

So our LE Petitboot kernel boots fine thanks to FIXUP_ENDIAN, we kexec into some other kernel... and everything falls to pieces.
The problem is that we've unwittingly changed one of the assumptions of booting a kernel: namely, that MSR_LE defaults to zero. When kexec-ing from an LE kernel we start executing the next kernel in LE mode. This in itself is OK; the FIXUP_ENDIAN macro will handle the switch if needed. The problem is that the FIXUP_ENDIAN macro is relatively recent, having first entered the kernel in early 2014. So if we're booting, say, an old Fedora 19 install with a v3.9 kernel, things go very bad, very quickly.

Fix #1

The solution seems pretty straightforward: find where we jump into the next kernel, and just before that make sure we reset the LE bit in the MSR. That's exactly what this patch to kexec-lite does.
That worked up until I tested on a machine with more than one CPU. Remembering that the MSR is processor-specific, we also have to reset the endianness of each secondary CPU.
Now things are looking good! All the CPUs are reset to big-endian, the target kernel boots fine, and then... 'recursive interrupts?!'

HILE

Skipping the debugging process that led to this (hint: mambo is actually a pretty cool tool), this was the sequence of steps leading up to the problem:

  • Little-endian Petitboot kexecs into a big-endian kernel
  • All CPUs are reset to big-endian
  • The big-endian kernel begins to boot successfully
  • Somewhere in the device-tree parsing code we take an exception
  • Execution jumps to the exception handler at 0x300
  • I notice that MSR_LE is set to 1
  • WHAT WHY IS THE LE BIT IN THE MSR SET TO 1
  • We fail to execute the first instruction at 0x300 because it's written in big-endian, so we take another exception and jump to the exception handler at 0x300... oh no.

And then we very busily execute nothing until the machine is killed. I spend some time staring incredulously at my screen, then appeal to a higher authority who replies with "What is the HILE set to?"

..the WHAT?
Cracking open the PowerISA reveals this tidbit:

The Hypervisor Interrupt Little-Endian (HILE) bit is a bit in an implementation-dependent register or similar mechanism. The contents of the HILE bit are copied into MSR_LE by interrupts that set MSR_HV to 1 (see Section 6.5), to establish the Endian mode for the interrupt handler. The HILE bit is set, by an implementation-dependent method, during system initialization, and cannot be modified after system initialization.

To be fair, there are use cases for taking exceptions in a different endianness. The problem is that while HILE gets switched on when setting MSR_LE to 1, it doesn't get turned off when MSR_LE is set back to zero. In particular the line "...cannot be modified after system initialization." led to a fair amount of hand-wringing from myself and whoever would listen; if we can't reset the HILE bit, we simply can't use little-endian kernels for Petitboot.

Luckily while on some other systems the machinations of the firmware might be a complete black box, Petitboot runs on OPAL systems - which means the firmware source is right here. In particular we can see here the OPAL call to opal_reinit_cpus which among other things resets the HILE bit.
This is actually what turns on the HILE bit in the first place, and is meant to be called early on in boot since it also clobbers a large amount of state. Luckily for us we don't need to hold onto any state since we're about to jump into a new kernel. We just need to choose an appropriate place where we can be sure we won't take an exception before we get into the next kernel: thus the final patch to support PowerNV machines.
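The shape of the fix, roughly: just before kexec jumps into the next kernel, ask OPAL to reinitialise the CPUs with HILE pointing back at big-endian. A simplified sketch of the idea (OPAL_REINIT_CPUS_HILE_BE comes from the OPAL API; the real call site and error handling in the kernel's PowerNV kexec path differ):

/* Sketch: restore the power-on interrupt endianness (big-endian)
 * before handing control to the next kernel. */
static void reset_hile_for_kexec(void)
{
    int64_t rc;

    rc = opal_reinit_cpus(OPAL_REINIT_CPUS_HILE_BE);
    if (rc != OPAL_SUCCESS)
        pr_warn("opal: failed to reset HILE, rc = %lld\n", rc);
}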

Docker: Just Stop Using AUFS

Docker's default storage driver on most Ubuntu installs is AUFS.

Don't use it. Use Overlay instead. Here's why.

First, some background. I'm testing the performance of the basic LAMP stack on POWER. (LAMP is Linux + Apache + MySQL/MariaDB + PHP, by the way.) To do more reliable and repeatable tests, I do my builds and tests in Docker containers. (See my previous post for more info.)

Each test downloads the source of Apache, MariaDB and PHP, and builds them. This should be quick: the POWER8 system I'm building on has 160 hardware threads and 128 GB of memory. But I was finding that it was only just keeping pace with a 2 core Intel VM on BlueMix.

Why? Well, my first port of call was to observe a compilation under top. The header is below.

top header, showing over 70 percent of CPU time spent in the kernel

Over 70% of CPU time is spent in the kernel?! That's weird. Let's dig deeper.

My next port of call for analysis of CPU-bound workloads is perf. perf top reports astounding quantities of time in spin-locks:

display from perf top, showing 80 percent of time in a spinlock

perf top -g gives us some more information: the time is in system calls. open() and stat() are the key culprits, and we can see a number of file system functions are in play in the call-chains of the spinlocks.

display from perf top -g, showing syscalls and file ops

Why are open and stat slow? Well, I know that the files are on an AUFS mount. (docker info will tell you what you're using if you're not sure.) So, being something of a kernel hacker, I set out to find out why. This did not go well. AUFS isn't upstream; it's a separate patch set. Distros have been trying to deprecate it for years. Indeed, RHEL doesn't ship it. (To its credit, Docker seems to be trying to move away from it.)

Wanting to avoid the minor nightmare that is an out-of-tree patchset, I looked at other storage drivers for Docker. This presentation is particularly good. My choices are pretty simple: AUFS, btrfs, device-mapper or Overlay. Overlay was an obvious choice: it doesn't need me to set up device mapper on a cloud VM, or reformat things as btrfs.

It's also easy to set up on Ubuntu:

  • Export/save any Docker containers you care about.

  • Add the --storage-driver=overlay option to DOCKER_OPTS in /etc/default/docker, and restart Docker (service docker restart).

  • Import/load the containers you exported.

  • Verify that things work, then clear away your old storage directory (/var/lib/docker/aufs).

Having moved my base container across, I set off another build.

The first thing I noticed is that images are much slower to create with Overlay. But once that finishes, and a compile starts, things run much better:

top, showing close to zero system time, and around 90 percent user time

The compiles went from taking painfully long to astonishingly fast. Winning.

So in conclusion:

  • If you use Docker for something that involves open()ing or stat()ing files

  • If you want your machine to do real work, rather than spin in spinlocks

  • If you want to use code that's upstream and thus much better supported

  • If you want something less disruptive than the btrfs or dm storage drivers

...then drop AUFS and switch to Overlay today.

A tale of two Dockers

(This was published in an internal technical journal last week, and is now being published here. If you already know what Docker is, feel free to skim the first half.)

Docker seems to be the flavour of the month in IT. Most attention is focussed on using Docker for the deployment of production services. But that's not all Docker is good for. Let's explore Docker, and two ways I use it as a software developer.

Docker: what is it?

Docker is essentially a set of tools to deal with containers and images.

To make up an artificial example, say you are developing a web app. You first build an image: a file system which contains the app, and some associated metadata. The app has to run on something, so you also install things like Python or Ruby and all the necessary libraries, usually by installing a minimal Ubuntu and any necessary packages.[1] You then run the image inside an isolated environment called a container.

You can have multiple containers running the same image (for example, your web app running across a fleet of servers), and the containers don't affect each other. Why? Because Docker is designed around the concept of immutability. Containers can write to the image they are running, but the changes are specific to that container, and aren't preserved beyond the life of the container.[2] Indeed, once built, images can't be changed at all, only rebuilt from scratch.

However, as well as enabling you to easily run multiple copies, another upshot of immutability is that if your web app allows you to upload photos, and you restart the container, your photos will be gone. Your web app needs to be designed to store all of the data outside of the container, sending it to a dedicated database or object store of some sort.

Making your application Docker friendly is significantly more work than just spinning up a virtual machine and installing stuff. So what does all this extra work get you? Three main things: isolation, control and, as mentioned, immutability.

Isolation makes containers easy to migrate and deploy, and easy to update. Once an image is built, it can be copied to another system and launched. Isolation also makes it easy to update software your app depends on: you rebuild the image with software updates, and then just deploy it. You don't have to worry about service A relying on version X of a library while service B depends on version Y; it's all self contained.

Immutability also helps with upgrades, especially when deploying them across multiple servers. Normally, you would upgrade your app on each server, and have to make sure that every server gets all the same sets of updates. With Docker, you don't upgrade a running container. Instead, you rebuild your Docker image and re-deploy it, and you then know that the same version of everything is running everywhere. This immutability also guards against the situation where you have a number of different servers that are all special snowflakes with their own little tweaks, and you end up with a fractal of complexity.

Finally, Docker offers a lot of control over containers, at a low performance penalty. Docker containers can have their CPU, memory and network controlled easily, without the overhead of a full virtual machine. This makes it an attractive solution for running untrusted executables.[3]

As an aside: despite the hype, very little of this is actually particularly new. Isolation and control are not new problems. All Unixes, including Linux, support 'chroots'. The name comes from "change root": the system call changes the process's idea of what the file system root is, making it impossible for it to access things outside of the new designated root directory. FreeBSD has jails, which are more powerful, Solaris has Zones, and AIX has WPARs. Chroots are fast and low overhead, but they offer much less ability to control the use of system resources. At the other end of the scale, virtual machines (which have been around since ancient IBM mainframes) offer much better isolation than Docker, but with a greater performance hit.
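For a sense of how small that mechanism is, here is a minimal (and necessarily root-only) use of the chroot system call; a sketch with a made-up path, not a container runtime:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* /srv/jail is a placeholder: any directory tree with a shell in it */
    if (chroot("/srv/jail") != 0 || chdir("/") != 0) {
        perror("chroot");
        return EXIT_FAILURE;
    }

    /* From here on, "/" means /srv/jail for this process and its children */
    execl("/bin/sh", "sh", (char *)NULL);
    perror("execl");
    return EXIT_FAILURE;
}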

Similarly, immutability isn't really new: Heroku and AWS Spot Instances are both built around the model that you get resources in a known, consistent state when you start, but in both cases your changes won't persist. In the development world, modern CI systems like Travis CI also have this immutable or disposable model – and this was originally built on VMs. Indeed, with a little bit of extra work, both chroots and VMs can give the same immutability properties that Docker gives.

The control properties that Docker provides are largely as a result of leveraging some Linux kernel concepts, most notably something called namespaces.

What Docker does well is not something novel, but the engineering feat of bringing together fine-grained control, isolation and immutability, and – importantly – a tool-chain that is easier to use than any of the alternatives. Docker's tool-chain eases a lot of pain points with regards to building containers: it's vastly simpler than chroots, and easier to customise than most VM setups. Docker also has a number of engineering tricks to reduce the disk space overhead of isolation.

So, to summarise: Docker provides a toolkit for isolated, immutable, finely controlled containers to run executables and services.

Docker in development: why?

I don't run network services at work; I do performance work. So how do I use Docker?

There are two things I do with Docker: I build PHP 5, and do performance regression testing on PHP 7. They're good case studies of how isolation and immutability provide real benefits in development and testing, and how the Docker tool chain makes life a lot nicer than previous solutions.

PHP 5 builds

I use the isolation that Docker provides to make building PHP 5 easier. PHP 5 depends on an old version of Bison, version 2. Ubuntu and Debian have long since moved to version 3. There are a few ways I could have solved this:

  • I could just install the old version directly on my system in /usr/local/, and hope everything still works and nothing else picks up Bison 2 when it needs Bison 3. Or I could install it somewhere else and remember to change my path correctly before I build PHP 5.
  • I could roll a chroot by hand. Even with tools like debootstrap and schroot, working in chroots is a painful process.
  • I could spin up a virtual machine on one of our development boxes and install the old version on that. That feels like overkill: why should I need to run an entire operating system? Why should I need to copy my source tree over the network to build it?

Docker makes it easy to have a self-contained environment that has Bison 2 built from source, and to build my latest source tree in that environment. Why is Docker so much easier?

Firstly, Docker allows me to base my container on an existing image, and there's an online library of images to build from.[4] This means I don't have to roll a base image with debootstrap or the RHEL/CentOS/Fedora equivalent.

Secondly, unlike a chroot build process, which is ultimately just copying files around, a Docker build process includes the ability both to copy files from the host and to run commands in the context of the image. This is defined in a file called a Dockerfile, and is kicked off by a single command: docker build.

So, my PHP 5 build container loads an Ubuntu Vivid base image, uses apt-get to install the compiler, tool-chain and headers required to build PHP 5, then installs old Bison from source, copies in the PHP source tree, and builds it. The vast majority of this process (the installation of the compiler, headers and Bison) can be cached, so it doesn't have to be downloaded each time. And once the container finishes building, I have a fully built PHP interpreter ready for me to interact with.

I do, at the moment, rebuild PHP 5 from scratch each time. This is a bit sub-optimal from a performance point of view. I could alleviate this with a Docker volume, which is a way of sharing data persistently between a host and a guest, but I haven't been sufficiently bothered by the speed yet. However, Docker volumes are also quite fiddly, leading to the development of tools like docker compose to deal with them. They also are prone to subtle and difficult to debug permission issues.

PHP 7 performance regression testing

The second thing I use Docker for takes advantage of the throwaway nature of Docker environments to prevent cross-contamination.

PHP 7 is the next big version of PHP, slated to be released quite soon. I care about how that runs on POWER, and I preferably want to know if it suddenly deteriorates (or improves!). I use Docker to build a container with a daily build of PHP 7, and then I run a benchmark in it. This doesn't give me a particularly meaningful absolute number, but it allows me to track progress over time. Building it inside of Docker means that I can be sure that nothing from old runs persists into new runs, thus giving me more reliable data. However, because I do want the timing data I collect to persist, I send it out of the container over the network.

I've now been collecting this data for almost 4 months, and it's plotted below, along with a 5-point moving average. The most notable feature of the graph is the drop in benchmark time at about the middle. Sure enough, if you look at the PHP repository, you will see that a set of changes to improve PHP performance were merged on July 29: changes submitted by our very own Anton Blanchard.[5]

Graph of PHP 7 performance over time

Docker pain points

Docker provides a vastly improved experience over previous solutions, but there are still a few pain points. For example:

  1. Docker was apparently written by people who had no concept that platforms other than x86 exist. This leads to major issues for cross-architectural setups. For instance, Docker identifies images by a name and a revision. For example, ubuntu is the name of an image, and 15.04 is a revision. There's no ability to specify an architecture. So, how do you specify that you want, say, a 64-bit, little-endian PowerPC build of an image versus an x86 build? There have been a couple of approaches, both of which are pretty bad. You could name the image differently: say, ubuntu_ppc64le. You can also just cheat and override the ubuntu name with an architecture-specific version. Both of these break some assumptions in the Docker ecosystem and are a pain to work with.

  2. Image building is incredibly inflexible. If you have one system that requires a proxy, and one that does not, you need different Dockerfiles. As far as I can tell, there are no simple ways to hook in any changes between systems into a generic Dockerfile. This is largely by design, but it's still really annoying when you have one system behind a firewall and one system out on the public cloud (as I do in the PHP 7 setup).

  3. Visibility into a Docker server is poor. You end up with lots of different, anonymous images and dead containers, and you need scripts to clean them up. It's not clear what Docker puts on your file system, or where, or how to interact with it.

  4. Docker is still using reasonably new technologies. This leads to occasional weird, obscure and difficult to debug issues.[6]

Final words

Docker provides me with a lot of useful tools in software development: both in terms of building and testing. Making use of it requires a certain amount of careful design thought, but when applied thoughtfully it can make life significantly easier.


  1. There's some debate about how much stuff from the OS installation you should be using. You need to have key dynamic libraries available, but I would argue that you shouldn't be running long running processes other than your application. You shouldn't, for example, be running a SSH daemon in your container. (The one exception is that you must handle orphaned child processes appropriately: see https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/) Considerations like debugging and monitoring the health of docker containers mean that this point of view is not universally shared. 

  2. Why not simply make them read only? You may be surprised at how many things break when running on a read-only file system. Things like logs and temporary files are common issues. 

  3. It is, however, easier to escape a Docker container than a VM. In Docker, an untrusted executable only needs a kernel exploit to get to root on the host, whereas in a VM you need a guest-to-host vulnerability, which are much rarer. 

  4. Anyone can upload an image, so this does require running untrusted code from the Internet. Sadly, this is a distinctly retrograde step when compared to the process of installing binary packages in distros, which are all signed by a distro's private key. 

  5. See https://github.com/php/php-src/pull/1326 

  6. I hit this last week: https://github.com/docker/docker/issues/16256, although maybe that's my fault for running systemd on my laptop.