Toward a faster and thinner XSIMD

5 min readOct 29, 2021

xsimd, the SIMD library by QuantStack, got more and more adoption in recent years. While it helped improve the library, adoption also brought new requirements. Among them was the abitility to dynamically chose which instruction set to use, and the architecture of xsimd was not suitable for that. To support this new feature, we basically had to… reimplement the library from scratch ;)

The main change from this huge refactoring is the batch class. In xsimd 7.x, a batch is a parametric type, denoted batch<T, N>, that represents a vector of N elements of type T. In xsimd 8.x, a batch is a parametric type, denoted batch<T, A>, that represents a vector of elements of type T as represented on (sub) architecture A. Why this change?

Dynamic dispatch

When one compiles a binary on the machine that’s going to run the binary, she can use the -march=native flag from GCC or Clang to state Use the best instruction set available natively. That’s unfortunately not the case of many packaging system that provide pre-compiled binaries. In that case the machine used to compile the binary is not the same as the one that runs the code. Thus a minimal architecture must be chosen, say compile all binaries with -msse4.2, so that they can run correctly on any machine that supports this instruction set, but may crash on others. As a consequence, package maintainers tend to make very conservative choices at the expense of performance.

An alternate approach is to bundle code for different architectures within the same binary, and choose the right version at runtime, based on the running system capabilities. This scenario is used in many performance-critical codebase, including Numpy.

Unfortunately, modelling a vector as a batch<T, N> doesn’t capture the current architecture: it assumes there’s one architecture available, generally the widest / most recent one based on compiler flags. On the other hand batch<T, A> explicitly depends on an architecture A, so it’s possible to write code where A==avx2 and code where A==avx512 within the same compilation unit with xsimd 8.x.

This all sounds very abstract… Fortunately xsimd documentation provides a good example of generic dispatching using this new paradigm:

Let’s dive a bit into this code sample: sum is a function object parametrized by a type T and an architecture A. It loops over an array data of size elements, using xsimd::batch to process data in chunks. Note that the number of elements stored in this batch is totally abstracted here, and can be retrieved through the batch::size constexpr member.

It is possible to instanciate sum for any supported type and architecture, and then pick the right version at runtime. The boilerplate to do so is hidden within the xsimd::dispatch function adaptator, but the abstraction it provides is easy to grab: it inspects the running system to identify which architecture is supported, picks the best version and returns it. Then the specialized function can be called repeateadly with no overhead.

Changing the core

xsimd really consists in a bunch of operations on batch: memory operations, arithmetic operations, math operations etc. Changing the representation of its core batch data type really means rewritting the whole library. That’s what we did, and we took the opportunity to do a few engineering, say enhancement. Let the digits talk:

$ git diff --shortstat 7.6.0  8.0.0
226 files changed, 20233 insertions(+), 38842 deletions(-)

That’s close to twice as less code in 8.0.0 than in 7.6.0!

Compilation time, the bane of many C++ developers, is also impacted. It takes roughly 0.35s to generate object code for the following example with g++ test.cpp -msse4.2 -02 -c :

while the same code adapted for xsimd 7.6.0 takes 0.46s to compile. And the generated code is exactly the same.

Migration Guide

The basics of xsimd remain unchanged: one loads data from an array to a batch, performs operations on the batch, then stores the result back to memory. However, as hinted above, we took advantage of the major release to cleanup the API. Most changes are listed in the migration guide, let us comment on the most notable ones:

the batch::load method is now static. Having it as a regular method led to a pattern where user were creating an uninitialized batch to fill it afterward. With a static method, user don’t need this (transient) inconsistent state.
the batch::operator[] method is no longer supported. It is a costly operation (as it implies a store) which is not what users expect from operator[]. batch::get is used instead and is more explicit about the additional cost.
batches of complex are solely represented by the batch<std::complex<T>> type, xtl::xcomplex type is no longer involved. It is still possible to load xtl::xcomplex values into batch<std::complex<T>>.
some small quirks that are not even in the migration guide. We don’t want to spoil all the joy of a migration ;-)

Acknowledgements

The authors would like to thank Joël Falcou for sharing many of the ideas behind the EVE SIMD library. EVE has a very nice design, but it requires C++20, while xsimd humbly sticks to C++11.

The authors would like to thank Apache Arrow developers, who have been contributing to xsimd since version 7, and helped a lot to port features from branch 7.x to the new implementation.

About the authors

Serge is a scientific computing hobbyist and the main author of Pythran. Serge is also a maintainer of xsimd. He holds a PhD in compilation, and enjoys chopping wood and bugs, indifferently.

Johan Mabille is a scientific software developer at QuantStack.

Johan is very active in the Jupyter ecosystem, as the creator of xeus, a C++ implementation of the Jupyter protocol, and several language kernels, such as xeus-python, xeus-cling, and xeus-robot. Johan also made important contributions to the Jupyter widgets ecosystem and to JupyterLab.