How we wrote xtensor 5/N: Expression Templates

6 min readSep 5, 2019

xtensor is a comprehensive framework for N-D arrays processing, including an extensible expression system, lazy evaluation, and many other features that cannot be covered in a single article. In this post, we focus on expression templates.

In the previous articles, we started to implement an N-D container, xarray, which has value semantics and provides all the methods required to access its data. The next step is to give it computation capabilities so that it becomes a modelization of a mathematical tensor rather than a simple container.

Traditional operator overloading

The traditional way to give basic computation capabitilities to a class is to overload arithmetic operators and assign operators:

template <class T, layout_type L, class A>
class xarray
{
public:    using self_type = xarray<T, L, A>;    self_type& operator+=(const self_type& rhs)
    {
        std::transform(data().cbegin(),
                       data().cend(),
                       rhs.data().cbegin(),
                       data().begin(),
                       std::plus<T>());
        return *this;
    }
};template <class T, layout_type L, class A>
xarray<T, L, A> operator+(const xarray<T, L, A>& lhs,
                          const xarray<T, L, A>& rhs)
{
    xarray<T, L, A> tmp = lhs;
    tmp += rhs;
    return tmp;
}

The problem with this approach is when you chain operations:

xarray<double> a, b, c, d;
// ... initializes a, b, c and d
xarray<double> res = a + b + c + d;

With the previous definitions of operator+ and operator+=, the last line is equivalent to:

xarray<double> tmp1 = c + d;
xarray<double> tmp2 = b + tmp1;
xarray<double> res = a + tmp2;

Therefore we allocate several temporaries (which means several dynamic memory allocation) and we traverse the data of these temporaries twice, when we compute them, and when we use them for the next computation. Depending on how much data the arrays contain, this can results in a lot of cache misses.

A better solution would be to generate a single loop:

for(size_t i = 0; i < res.size(); ++i)
{
    res(i) = a(i) + b(i) + c(i) + d(i);
}

This is where expression templates come into play.

Abstract syntax tree

The idea is to build in-memory structures that represent the abstract syntax tree of the expression:

Arithmetic operators are overloaded to return a representation of the corresponding operation as a node of the syntax tree. Each node of the tree provides an API similar to that of xarray. When the access operator of a node is called, it recursively calls the access operator of its operands, performs the computation and returns the result as illustrated below:

template <class E1, class E2>
class addition
{
public:    using const_reference =
        std::common_type_t<typename E1::const_reference,
                           typename E2::const_reference>;    addition(const E1& e1, const E2& e2)
        : m_e1(e1), m_e2(e2)
    {
    }    template <class... Args>
    const_reference operator()(Args... i) const
    {
        return m_e1(i...) + m_e2(i...);
    }private:    const E1& m_e1;
    const E2& m_e2;
};template <class E1, class E2>
addition<E1, E2> operator+(const E1& e1, const E2& e2)
{
    return addition<E1, E2>(e1, e2);
}

When the expression is assigned to an xarray, the access operator of the root node is called in the assignment loop:

template <class T, layout_type L, class A>
xarray<T, L, A>& xarray<T, L, A>::operator=(const expression& e)
{
    resize(e.shape());
    for(size_t i = 0; i < e.size(); ++i)
    {
        (*this)(i) = e(i);
    }
    return *this;
}

Each element is computed by recursively calling the access operator of the nodes in the tree, avoiding the allocation of temporary tensor objects. Besides, the expression is only computed upon assignment. This is called lazy evaluation.

However, implementing one structure per operation or mathematical function is going to be cumbersome. We can add a functor parameter to describe the operation, but we still need different structures depending on the arity of the functions. Besides, we should support functions that take parameters by reference, not only const reference. Therefore, a class representing an operation should have three axes of genericity:

the functor that performs the operation
the number of arguments of the function
how we store the arguments (reference, const reference or value)

The class xfunction described in the next section addresses the first two points. The latter will be delegated to other structures that will be covered in a future article. For now, we assume that the xfunction class is given template arguments of the exact types that we want to store. These types are known as the closure types in the xtensor terminology.

The xfunction class

As mentioned above, a class describing an operation should accept the functor of the operation as a template parameter. The other parameters should be the operands of the operation. Prior to C++11, it was necessary to create one class per function’s arity (e.g xunary_function, xbinary_function, etc…). Now that C++ provides variadic templates, we can use a single class:

template <class F, class... CT>
class xfunction
{
public:    template <class... CTA>
    xfunction(F f, CTA&&... e)
        : m_e(std::forward<CTA>(e)), m_f(f)
    {
    }private:    std::tuple<CT...> m_e;
    F m_f;
};

The constructor has to be a template if we want to achieve perfect forwarding of the operands. Indeed, && is a universal reference only in a deduced context, otherwise it is a classical rvalue reference. That is, the following constructor could only be invoked with rvalues:

template <class F, class... CT>
xfunction<F, CT...>::xfunction(F f, CT&& e)
    : m_e(std::forward<CT>(e)), m_f(f)
{
}

Now that we can instantiate an xfunction object, it is time to implement the access operator. Since xfunction represents a node in the abstract xyntax tree of an expression, it is by nature immutable. Therefore, it should only provide a constant access operator:

template <class F, class... CT>
class xfunction
{
public:    template <class... Args>
    const_reference operator()(Args... args) const
    {
        /* return m_f(std::get<0>(m_e)(args...),
                      std::get<1>(m_e)(args...),
                      std::get<2>(m_e)(args...), ...);*/
    }private:    std::tuple<CT...> m_e;
    F m_f;
};

The const_reference typedef is computed from the const_reference member of the operands, we will get back to it in the article about the closure types. For now, just accept that is exists.

The implementation commented out above is invalid C++ code, however, it gives a good idea of what we want to achieve: retrieve each element of the operands at the given indices, pass them to the functor, and return the result.

The problem is we don’t know the number of operands stored in the tuple. We can get it at compile time with the tuple_size metafunction or with the sizeof operator, but we cannot explicitly write all the std::get that we need. Since the compiler knows how many items the tuple holds, it would be nice to make it generate the list of calls to std::get. And the best way to achieve this in modern C++ is to rely on pack expansion:

template <std::size_t... I, class... Args)
const_reference access_impl(std::index_sequence<I...>,
                            Args... args) const
{
    return m_f(std::get<I>(m_e)(args...)...);
}template <class... Args>
const_reference operator()(Args... args) const
{
    return access_impl(std::make_index_sequence<sizeof...(CT)>(),
                       args...);}

The access_impl method accepts the same arguments as operator() and an additional index_sequence wrapping the parameter pack that we expand to generate the calls of std::get and retrieve the values of all the operands at index args.... Notice that we have two nested pack expansions here, one for the multi-index args..., and one for the integer sequence I....

The access operator builds the index_sequence required for access_impl and simply forwards its argument.

Although the xfunction class now has all that is required to build an abstract syntax tree, it is far from being complete. In the next articles, we will:

define what closure types are and how they are passed to xfunction
add inner types similar to the ones defined in the xarray class
define the overloads of arithmetic operators and mathematical functions

More about the Series

This post is just one episode of a long series of articles:

How We Wrote Xtensor

How we wrote xtensor 5/N: Expression Templates

Traditional operator overloading

Abstract syntax tree

The xfunction class

More about the Series

Written by Johan Mabille