How we wrote xtensor 7/N: broadcasting

Johan Mabille
9 min read · Nov 24, 2019

xtensor is a comprehensive framework for N-D array processing, including an extensible expression system, lazy evaluation, and many other features that cannot be covered in a single article. In this post, we focus on broadcasting.

In the previous article, we implemented operator and mathematical function overloads so that we can build arbitrarily complex expression trees and access their elements. Before we can assign an expression tree to an xarray object, we need to compute its shape. This is not trivial when the tree is made of expressions with different shapes. This article focuses on the rules of such a computation and how they are implemented in xtensor.

Broadcasting rules

Like NumPy, xtensor allows operating on arrays with different shapes. When doing so, it follows the same rules as NumPy to compute the shape of the result. These rules, known as the broadcasting rules, are described below.

When operating on two arrays, xtensor compares their shapes element-wise. If one array has fewer dimensions than the other, its shape is virtually prepended with ones until its length matches that of the other shape. Two dimensions are compatible when they are equal or when one of them is one. If these conditions are not met, an exception is raised. Otherwise, each dimension of the result is the maximum of the corresponding dimensions of the operands.

Here are some examples that illustrate these rules:

A      (3d array): 8 x 4 x 1
B      (3d array): 8 x 1 x 6
Result (3d array): 8 x 4 x 6

A      (3d array): 8 x 4 x 3
B      (1d array):         3
Result (3d array): 8 x 4 x 3

A      (3d array): 8 x 4 x 3
B      (2d array):     3 x 1
Result (3d array): 8 x 4 x 3
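For completeness, here is a pair of shapes that cannot be broadcast together: the trailing dimensions differ and neither of them is one, so an exception is raised.

A      (2d array): 4 x 3
B      (1d array):     4
Result: error, 3 and 4 are incompatible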

When arrays of different shapes are compatible, the smaller one is virtually stretched into a bigger array with the same shape as the result. Assume we want to compute the following addition:

xarray<int> a = {{0, 1, 2},
                 {3, 4, 5}};
xarray<int> b = {2, 4, 6};
xarray<int> res = a + b;

According to the broadcasting rules, the shape of res is (2, 3). The elements of b are repeated across the first dimension so that the computation becomes

{{0, 1, 2},  +  {{2, 4, 6},
 {3, 4, 5}}      {2, 4, 6}}

so that res evaluates to {{2, 5, 8}, {5, 8, 11}}.

The broadcasting is said to be trivial when all the involved arrays have the same shape.

Implementing the broadcasting rules

Since xfunction models operations on multi-dimensional arrays, it feels natural to have it implement the broadcasting rules. We want it to provide the same API as xarray, that is:

size_type size() const;
size_type dimension() const;
const shape_type& shape() const;
size_type shape(size_type i) const;

Let’s start with size and shape(size_type i), which are straightforward to implement:

template <class F, class... CT>
inline auto xfunction<F, CT...>::size() const -> size_type
{
    return std::accumulate(shape().begin(),
                           shape().end(),
                           size_type(1),
                           std::multiplies<size_type>());
}

template <class F, class... CT>
inline auto xfunction<F, CT...>::shape(size_type i) const
    -> size_type
{
    return shape()[i];
}

Both methods rely on the shape() overload, which may trigger some heavy computation. Therefore, we should cache the result to avoid computing the same shape many times. We first add three members to xfunction: one storing the shape, one storing whether the broadcasting is trivial, and one storing whether the cache has been initialized. Indeed, we do not want to compute the shape until it is explicitly requested.

template <class F, class... CT>
class xfunction
{
public:

    // ....

private:

    mutable shape_type m_cached_shape;
    mutable bool m_trivial_broadcast;
    mutable bool m_cache_initialized;
};

A few words about the usage of mutable here. Since the shape method is const, we cannot modify members of xfunction inside its implementation. So if we want to store the result of the first computation in a data member, we have two options: either turn shape into a non-const method, or make the cache data members mutable. Since shape does not change the logical state of xfunction, we opt for mutable data members. This is the difference between logical constness (the observable state of the object does not change, as seen from the outside) and physical constness (no bit of the object changes at all).
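To make this pattern concrete, here is a minimal sketch outside of xtensor (the class and its members are invented for illustration): a getter that is const for its callers, yet lazily computes and caches its result on first access.

class cached_value
{
public:

    // Logically const: callers see an unchanged object, even though
    // the first call mutates the cache members.
    int value() const
    {
        if (!m_initialized)
        {
            m_value = 42;          // stands in for an expensive computation
            m_initialized = true;  // legal because the members are mutable
        }
        return m_value;
    }

private:

    mutable int m_value = 0;
    mutable bool m_initialized = false;
};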

Implementation of shape

We can now start to implement the shape method:

template <class F, class... CT>
inline auto xfunction<F, CT...>::shape() const -> const shape_type&
{
    if (!m_cache_initialized)
    {
        compute_cached_shape();
    }
    return m_cached_shape;
}

We will get back to shape_type in a future article, when we introduce other kinds of tensors. The shape method only triggers the computation if it is required; the computation itself is delegated to the new private method compute_cached_shape.

A naive approach would be to recursively ask for the shape of the operands and apply the broadcasting rules to compute the resulting shape. However, this would be highly inefficient: if we have a complex expression tree, each intermediate xfunction object will have to allocate a temporary shape object, and each of these objects will be traversed at least once when applying the broadcasting rules. Besides, we may be notified of incompatible operand shapes late in the computation, while we would prefer the shape method to fail as soon as possible.

A better approach is to initialize a shape for broadcasting and ask each operand to broadcast its shape into it. This way we avoid temporaries, and the computation fails as soon as an operand has an incompatible shape. We should fill the initial resulting shape with the highest possible integral value to avoid wrong results regarding the triviality of the broadcasting (we will get back to this in the last section). But before initializing the shape object, we need to compute its size. That is the purpose of the dimension method.

Implementation of dimension

If the operands of an xfunction were stored in an std::vector object or any similar container, computing the dimension of the xfunction would be trivial:

size_t dim = std::accumulate(m_e.begin(), m_e.end(), size_t(0),
                             [](size_t d, auto&& e)
                             { return std::max(d, e.dimension()); });

However, the operands are stored in an std::tuple object, for which element access is resolved at compile time. Therefore, we need an equivalent of std::accumulate that operates on a tuple and unrolls the loop at compile time. Since we are going to indirectly operate on a parameter pack, the solution is to use recursion. Let’s start with the “public” function:

template <class F, class R, class... T>
inline R accumulate(F&& f, R init, const std::tuple<T...>& t)
{
    return detail::accumulate_impl<0, F, R, T...>(
        std::forward<F>(f), init, t
    );
}

It simply forwards the call to the recursive implementation, with an additional template parameter for handling the recursion:

namespace detail
{
    template <size_t I, class F, class R, class... T>
    inline std::enable_if_t<I < sizeof...(T), R>
    accumulate_impl(F&& f, R init, const std::tuple<T...>& t)
    {
        R res = f(init, std::get<I>(t));
        return accumulate_impl<I + 1, F, R, T...>(
            std::forward<F>(f), res, t
        );
    }
}

Even if the notation is a bit cumbersome due to the template syntax, the implementation is easy to understand: we apply the functor to the accumulated value (init) and the current element (std::get<I>(t)), and call accumulate_impl with the recursion parameter increased.

The last piece that is missing is the stop condition, which must be handled at compile time. That is the purpose of std::enable_if_t in the return type: if the condition I < sizeof...(T) is not satisfied, the return type is invalid and this overload of accumulate_impl cannot be instantiated. At this point, this is not an error: the compiler tries the other overloads until one can be instantiated, and a compilation error occurs only if all of them fail. This principle is known as Substitution Failure Is Not An Error (SFINAE).
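As a standalone illustration of SFINAE (a hypothetical snippet, not part of xtensor), consider two overloads whose return types are valid for disjoint conditions; the compiler silently discards the one that fails to instantiate:

#include <cstddef>
#include <type_traits>

// Selected when I is even; for odd I the return type is invalid
// and this overload is discarded without an error.
template <std::size_t I>
std::enable_if_t<I % 2 == 0, bool> is_even() { return true; }

// Selected when I is odd.
template <std::size_t I>
std::enable_if_t<I % 2 != 0, bool> is_even() { return false; }

// is_even<4>() returns true, is_even<3>() returns false.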

In our case, everything is fine as long as I is smaller than the number of elements in the tuple, so our stop condition is:

namespace detail
{
    template <size_t I, class F, class R, class... T>
    inline std::enable_if_t<I == sizeof...(T), R>
    accumulate_impl(F&&, R init, const std::tuple<T...>&)
    {
        return init;
    }
}

This simply returns the accumulated value. We can now use this function to compute the dimension of an xfunction object. Remember from the previous section that we created a cache for the shape? Let’s make use of it too so that we can avoid unnecessary computation:

template <class F, class... CT>
inline auto xfunction<F, CT...>::dimension() const -> size_type
{
    size_type dimension = m_cache_initialized ?
        m_cached_shape.size() : compute_dimension();
    return dimension;
}

// New method added in the private section of xfunction
template <class F, class... CT>
inline auto xfunction<F, CT...>::compute_dimension() const
    -> size_type
{
    auto func = [](size_type d, auto&& e)
                { return std::max(d, e.dimension()); };
    return accumulate(func, size_type(0), m_e);
}
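To see accumulate at work outside of xfunction, here is a hypothetical snippet that sums the sizes of heterogeneous containers stored in a tuple (the containers are arbitrary; the point is the compile-time unrolling):

#include <array>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

auto t = std::make_tuple(std::vector<int>{1, 2, 3},
                         std::array<double, 4>{},
                         std::string("hello"));

// Unrolled at compile time: 3 + 4 + 5 = 12
std::size_t total = accumulate(
    [](std::size_t acc, const auto& c) { return acc + c.size(); },
    std::size_t(0), t);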

Implementation of compute_cached_shape

As explained in a previous section, the idea is to initialize a shape with the highest integral value possible and recursively ask the operands to broadcast their shapes to it:

template <class F, class... CT>
inline void xfunction<F, CT...>::compute_cached_shape() const
{
    using value_type = typename shape_type::value_type;
    m_cached_shape = shape_type(dimension(),
                                std::numeric_limits<value_type>::max());
    m_trivial_broadcast = broadcast_shape(m_cached_shape);
    m_cache_initialized = true;
}

compute_cached_shape is responsible for initializing the cache members only; the recursion itself is done in the new broadcast_shape method:

template <class F, class... CT>
template <class S>
inline bool xfunction<F, CT...>::broadcast_shape(S& shape) const
{
    auto func = [&shape](bool b, auto&& e)
                { return e.broadcast_shape(shape) && b; };
    return accumulate(func, true, m_e);
}

Notice how we reuse the previous accumulate function to compute the resulting shape: we go through all operands, ask them to broadcast their shape, and accumulate the “trivial broadcast” property as the result. Since we want an incremental computation, each operand must broadcast to the previously computed shape. This is the reason for capturing the shape parameter in the lambda.

Also, notice the implementation of the lambda: the accumulated result is read after the broadcasting of the operand’s shape. This avoids the boolean short-circuit that would skip the broadcasting of the remaining operands once the trivial broadcast property becomes false.
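The pitfall is easy to reproduce in isolation (a hypothetical snippet): with &&, the right-hand side is not evaluated when the left-hand side is already false.

bool called = false;
auto touch = [&called]() { called = true; return true; };

bool r1 = false && touch();  // touch() never runs: called is still false
called = false;
bool r2 = touch() && false;  // touch() runs: called is now true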

The last remark is about broadcast_shape being a template method. This is because in the future we will have different expression types with different shape types. Therefore, this method must be able to operate on arbitrary types that provide an API similar to that of std::vector.

The last missing piece in the broadcast implementation is the broadcast_shape method in xarray.

Implementation of broadcast_shape

There is a chance that other parts of the library need to broadcast a shape to another one, so we should provide a free function for that. The broadcast_shape method of xarray simply forwards the call to this free function:

template <class T, layout_type L, class A>
template <class S>
inline bool xarray<T, L, A>::broadcast_shape(S& shape) const
{
    // The call must be qualified: the member function would
    // otherwise hide the free function of the same name.
    return ::broadcast_shape(this->shape(), shape);
}

Again, the additional template parameter S allows accepting any type of shape.

The free broadcast_shape function accepts two parameters: the shape to broadcast and the result. For each dimension of the input shape, it compares it with the corresponding dimension of the output shape:

  • if the output dimension is the highest integral value, then it has not been involved in a broadcasting computation yet; we can simply copy the input dimension, and the broadcasting is trivial.
  • if the output dimension is 1, then we can replace it with the input dimension, but the broadcasting is not trivial. This is why we decided to initialize the resulting shape with the highest integral value instead of 1: this avoids wrongly flagging a trivial broadcast as non-trivial.
  • for other values of the output dimension: if the input dimension is 1, the output is not changed and the broadcasting is not trivial; if the input dimension is equal to the output dimension, the broadcasting is trivial; otherwise, the shapes are incompatible and the broadcasting fails.

If the number of input dimensions does not match that of output dimensions, the broadcast is not trivial.

Let’s turn these ideas into code, with the help of some coffee:

template <class S1, class S2>
inline bool broadcast_shape(const S1& input, S2& output)
{
    bool trivial_broadcast = (input.size() == output.size());
    // Indices are faster than reverse iterators
    using value_type = typename S2::value_type;
    auto output_index = output.size();
    auto input_index = input.size();
    if (output_index < input_index)
    {
        throw_broadcast_error(output, input);
    }
    for (; input_index != 0; --input_index, --output_index)
    {
        // First case: output = (MAX, MAX, ..., MAX)
        // output is a new shape that has not been through
        // the broadcast process yet; broadcast is trivial
        if (output[output_index - 1] ==
            std::numeric_limits<value_type>::max())
        {
            output[output_index - 1] =
                static_cast<value_type>(input[input_index - 1]);
        }
        // Second case: output has been initialized to 1.
        // Broadcast is trivial only if input is 1 too.
        else if (output[output_index - 1] == 1)
        {
            output[output_index - 1] =
                static_cast<value_type>(input[input_index - 1]);
            trivial_broadcast = trivial_broadcast &&
                (input[input_index - 1] == 1);
        }
        // Third case: output has been initialized to something
        // different from 1. If input is 1, then the broadcast is
        // not trivial
        else if (input[input_index - 1] == 1)
        {
            trivial_broadcast = false;
        }
        // Last case: input and output must have the same value,
        // else shapes are not compatible and an exception is thrown
        else if (static_cast<value_type>(input[input_index - 1]) !=
                 output[output_index - 1])
        {
            throw_broadcast_error(output, input);
        }
    }
    return trivial_broadcast;
}
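As a quick sanity check, here is a hypothetical call of the free function, starting from an output shape freshly filled with the highest integral value:

#include <cstddef>
#include <limits>
#include <vector>

std::vector<std::size_t> output(3, std::numeric_limits<std::size_t>::max());
std::vector<std::size_t> input = {4, 1};

bool trivial = broadcast_shape(input, output);
// output is now {MAX, 4, 1}: the leading dimension is left untouched
// because input has fewer dimensions, and trivial is false since the
// numbers of dimensions differ.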

And we are done with broadcasting! We are now able to compute the shape of an arbitrary expression, which is required if we want to assign an expression to another.

In the next article, we will focus on how to iterate on the data when broadcasting is involved.

More about the Series

This post is just one episode of a long series of articles:

How We Wrote Xtensor
