This entry is a follow-up to part I, which you should absolutely read here before continuing on.
Where we left off
We concluded that while a vectorised memcmp() is a win, there are some cases where it won't quite perform.
The overhead of enabling ALTIVEC
In the kernel we explicitly don't touch ALTIVEC unless we need to. This means that in the general case we can leave the userspace registers in place and don't have to do anything to service a syscall for a process.
This means that if we do want to use ALTIVEC in the kernel, there is some setup that must be done. Notably, we must enable the facility (a potentially time-consuming move to the MSR), save off the registers (if userspace was using them) and inevitably restore them later on.
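To make that cost concrete, here is a rough C sketch of what using the vector unit from kernel code involves, built around the kernel's enable_kernel_altivec()/disable_kernel_altivec() helpers. The actual patch does all of this in assembly, and do_altivec_compare() here is a hypothetical stand-in for the vector loop; this is only meant to show the shape of the overhead, not the real implementation.

```c
#include <linux/types.h>
#include <linux/preempt.h>
#include <asm/switch_to.h>	/* enable_kernel_altivec() and friends */

/* Hypothetical vector compare loop, stand-in for the real assembly. */
static int do_altivec_compare(const void *a, const void *b, size_t n);

static int vector_memcmp(const void *a, const void *b, size_t n)
{
	int ret;

	/* Kernel VMX use must not be preempted while we own the registers. */
	preempt_disable();

	/* Sets MSR[VEC] and saves the user VMX state if it was live:
	 * this is the potentially expensive setup described above. */
	enable_kernel_altivec();

	ret = do_altivec_compare(a, b, n);

	/* Hand the facility back; the user state is restored later on. */
	disable_kernel_altivec();
	preempt_enable();

	return ret;
}
```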
If all this needs to be done for a memcmp() on the order of tens of bytes then it really wasn't worth it.
There are two reasons that memcmp() might go for a small number of bytes. The first, which is trivially detectable, is simply that the parameter n is small. The other is harder to detect: if the memcmp() is going to fail (return non-zero) early, then it also wasn't worth enabling ALTIVEC.
Detecting early failures
Right at the start of memcmp(), before enabling ALTIVEC, the first 64 bytes are checked using general purpose registers. Why the first 64 bytes? Well, why not? In a strange twist of fate, 64 bytes happens to be the number of bytes in four ALTIVEC registers (128 bits per register, so 16 bytes multiplied by 4), and by utter coincidence that happens to be the stride of the ALTIVEC compare loop.
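As a rough illustration of both bail-out cases (small n and an early mismatch), here is a C sketch of that structure. The real implementation is PowerPC assembly; memcmp_word() and memcmp_vector() below are hypothetical stand-ins, backed by the libc memcmp() just so the sketch is self-contained and runnable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for the real GPR loop and ALTIVEC loop respectively. */
static int memcmp_word(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n);
}

static int memcmp_vector(const void *a, const void *b, size_t n)
{
	return memcmp(a, b, n);
}

int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *a = s1, *b = s2;
	size_t i;

	/* Case 1: n is small, ALTIVEC is never worth it. */
	if (n < 64)
		return memcmp_word(a, b, n);

	/* Case 2: probe the first 64 bytes (one vector-loop stride) with
	 * ordinary 64-bit loads; an early failure returns here without
	 * ever paying the MSR write and VMX save/restore. */
	for (i = 0; i < 64; i += 8) {
		uint64_t x, y;

		memcpy(&x, a + i, sizeof(x));
		memcpy(&y, b + i, sizeof(y));
		if (x != y)
			return memcmp_word(a + i, b + i, sizeof(x));
	}

	/* Survived the prefix: a long matching run, so enable the vector
	 * unit and compare the rest 64 bytes per iteration. */
	return memcmp_vector(a + 64, b + 64, n - 64);
}
```

Since 64 bytes is exactly one stride of the vector loop, a compare that survives the probe has merely done its first iteration with GPRs instead of vector registers.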
What does this all look like
Well, unlike part I, the results appear slightly less consistent across the three runs of measurement, but there are some key differences from part I. The trends do appear to be the same across all three runs, just less pronounced; why this is so is unclear.
The difference between run two and run three, clipped at deltas of 1000ns, is interesting:
vs
The results are similar except for a spike in the number of deltas in the unpatched kernel at around 600ns, which is not present in the first sample (deltas1) of data. There are a number of reasons why this spike could have appeared here: it is possible that the kernel or hardware did something under the hood, or that prefetch brought deltas for a memcmp() that would otherwise have yielded a greater delta into the 600ns range.
What these two graphs both demonstrate quite clearly is that optimisations down at the sub-100ns end have resulted in more sub-100ns deltas for the patched kernel, a significant win over the original data. Zooming out and looking at a graph which includes deltas up to 5000ns shows that the sub-100ns optimisations haven't noticeably slowed the performance of long duration memcmp() calls.
Conclusion
The small amount of extra development effort has yielded tangible results in reducing the low end memcmp() times. This second round of data collection and performance analysis only confirms that for any significant amount of comparison, a vectorised loop is significantly quicker.
The results obtained here show no downside to adopting this approach for all POWER8 and onwards chips, as this new version of the patch solves the performance regression for small compares.