Store Half Byte-Reverse Indexed

A Power Technical Blog

Kernel interfaces and vDSO test

Getting Suckered

Last week a colleague of mine came up to me and showed me some of the vDSO on PowerPC and asked why on earth does it fail vdsotest. I should come clean at this point and admit that I knew very little about the vDSO and hadn't heard of vdsotest. I had to admit to this colleague that I had no idea everything looked super sane.

Unfortunately (for me) I got hooked, vdsotest was saying it was getting '22' instead of '-1' and it was the case where the vDSO would call into the kernel. It plagued me all night, 22 is so suspicious. Right before I got to work the next morning I had an epiphany, "I bet 22 is EINVAL".

Virtual Dynamically linked Shared Objects

The vDSO is a mechanism to expose some kernel functionality into userspace to avoid the cost of a context switch into kernel mode. This is a great feat of engineering, avoiding the context switch can have a dramatic speedup for userspace code. Obviously not all kernel functionality can be placed into userspace and even for the functionality which can, there may be edge cases in which the vDSO needs to ask the kernel.

Who tests the vDSO? For the portion that lies exclusively in userspace it will escape all testing of the syscall interface which is really what kernel developers are so focused on not breaking. Enter Nathan Lynch with vdsotest who has done some great work!

The Kernel

When the vDSO can't get the correct value without the kernel, it simply calls into the kernel because the kernel is the definitive reference for every syscall. On PowerPC something like this happens (sorry, our vDSO is 100% asm): 1

/*
 * Exact prototype of clock_gettime()
 *
 * int __kernel_clock_gettime(clockid_t clock_id, struct timespec *tp);
 *
 */
V_FUNCTION_BEGIN(__kernel_clock_gettime)
  .cfi_startproc
    /* Check for supported clock IDs */
    cmpwi   cr0,r3,CLOCK_REALTIME
    cmpwi   cr1,r3,CLOCK_MONOTONIC
    cror    cr0*4+eq,cr0*4+eq,cr1*4+eq
    bne cr0,99f

    /* [snip] */

    /*
     * syscall fallback
     */
99:
    li  r0,__NR_clock_gettime
    sc
    blr

For those not familiar, this couldn't be more simple. The start checks to see if it is a clock id that the vDSO can handle and if not it jumps to the 99 label. From here simply load the syscall number, jump to the kernel and branch to link register aka 'return'. In this case the 'return' statement would return to the userspace code which called the vDSO function.

Wait, having the vDSO calling into the kernel call gets us the wrong result? Or course it should, vdsotest is assuming a C ABI with return values and errno but the kernel doesn't do that, the kernel ABI is different. How does this even work on x86? Ohhhhh vdsotest does 2

static inline void record_syscall_result(struct syscall_result *res,
                     int sr_ret, int sr_errno)
{
    /* Calling the vDSO directly instead of through libc can lead to:
     * - The vDSO code punts to the kernel (e.g. unrecognized clock id).
     * - The kernel returns an error (e.g. -22 (-EINVAL))
     * So we need to recognize this situation and fix things up.
     * Fortunately we're dealing only with syscalls that return -ve values
     * on error.
     */
    if (sr_ret < 0 && sr_errno == 0) {
        sr_errno = -sr_ret;
        sr_ret = -1;
    }

    *res = (struct syscall_result) {
        .sr_ret = sr_ret,
        .sr_errno = sr_errno,
    };
}

That little hack isn't working on PowerPC and here's why:

The kernel puts the return value in the ABI specified return register (r3) and uses a condition register bit (condition register field 0, SO bit), so unlike x86 on error the return value isn't negative. To make matters worse, the condition register is very difficult to access from C. Depending on your definition of 'access from C' you might consider it impossible, a fixup like that would be impossible.

Lessons learnt

  • vDSO supplied functions aren't quite the same as their libc counterparts. Unless you have very good reason, and to be fair, vdsotest does have a very good reason, always access the vDSO through libc
  • Kernel interfaces aren't C interfaces, yep, they're close but they aren't the same
  • 22 is in fact EINVAL
  • Different architectures are... Different!
  • Variety is the spice of life

P.S I have a hacky patch waiting review


  1. arch/powerpc/kernel/vdso64/gettimeofday.S 

  2. src/vdsotest.h 

Comments