Sparrow

A Modern C++ Implementation of the Apache Arrow Columnar Format

Johan Mabille
Photography of four sparrows on a branch. Caption is “Sparrow: Modern C++ implementation of the Apache Arrow Format”.

We are thrilled to introduce Sparrow, a new library designed to simplify the integration of Apache Arrow’s columnar format into C++ applications.

Why Sparrow?

Apache Arrow is a universal columnar format and a multi-language toolbox for fast data interchange and in-memory analytics. It offers accessible and well-tested building blocks for others to build upon. As a de facto standard, the Arrow tabular data format is widely used by many open-source libraries, frameworks, and commercial services in the data ecosystem.

ArcticDB is a DataFrame database engine tailored for the Python Data Science ecosystem. Written in modern idiomatic C++, ArcticDB needed a way to operate on Arrow tabular data that aligns with its programming paradigm. However, the reference implementation of the Arrow format, Arrow-cpp, was not an ideal fit: its batteries-included approach made it an intricate dependency for ArcticDB, which only required read and write functionality in the Arrow format. This need is not unique to ArcticDB; other projects have also expressed a desire for a minimal implementation of the Arrow tabular data format.

To address this gap in the ecosystem, we initiated the development of Sparrow with the following goal: to make it effortless for C++ libraries to work with data in the Apache Arrow columnar format. Sparrow has a more focused scope than the reference Arrow-cpp library, concentrating on reading and writing the Arrow data specification with an idiomatic C++ API. By leveraging key constructs of the C++ standard library such as iterators, ranges, and concepts, Sparrow is easy to adopt in C++ projects and interoperates naturally with the C++ standard library.

Key Features

  • Lightweight and Modern: Sparrow is designed to be a lightweight, modern implementation, ensuring that it is both efficient and easy to use.
  • Idiomatic APIs: The library provides array structures with idiomatic APIs, making it intuitive for C++ developers.
  • Convenient Conversions: Sparrow offers convenient conversions from and to the C interface, simplifying the process of integrating data in the Apache Arrow format into your applications.
  • Permissive License: Sparrow is licensed under the Apache License, Version 2.0, so it is open and accessible for a wide range of uses.

Getting Started

Data can be directly initialized from the Sparrow API, or read from data provided in the Arrow C data interface.

Initialize data with sparrow and retrieve the C data structures:

#include "sparrow/sparrow.hpp"
namespace sp = sparrow;
sp::primitive_array<int> ar = { 1, 3, 5, 7, 9 };
// Caution: get_arrow_structures returns pointers, not values
auto [arrow_array, arrow_schema] = sp::get_arrow_structures(ar);
// Use arrow_array and arrow_schema as you need (serialization,
// passing it to a third party library)
// ...
// do NOT release the C structures in the end, the "ar" variable will do it
// for you

Populate Arrow data (in the form of the C data interface) with a third-party library and initialize Sparrow data structures from it:

#include "sparrow/sparrow.hpp"
#include "third-party-lib.hpp"
namespace sp = sparrow;
namespace tpl = third_party_library;

ArrowArray array;
ArrowSchema schema;
tpl::read_arrow_structures(&array, &schema);
sp::array ar(std::move(array), std::move(schema));
// Use ar as you need
// ...
// do NOT release the C structures in the end, the "ar" variable will do it
// for you
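
To make the ownership comments above more concrete, here is a minimal, self-contained sketch of the release protocol of the Arrow C data interface. The ArrowArray struct follows the layout given in the Arrow specification. Everything else (toy_private, release_toy, make_toy_array, owned_arrow_array, value_at) is a hypothetical name invented for this illustration and is not part of Sparrow's API; Sparrow's own classes may manage the structures differently.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Layout of ArrowArray as defined by the Arrow C data interface.
struct ArrowArray
{
    std::int64_t length;
    std::int64_t null_count;
    std::int64_t offset;
    std::int64_t n_buffers;
    std::int64_t n_children;
    const void** buffers;
    ArrowArray** children;
    ArrowArray* dictionary;
    void (*release)(ArrowArray*);
    void* private_data;
};

// Hypothetical producer-side state: owns the memory the ArrowArray points to.
struct toy_private
{
    std::vector<std::int32_t> values;
    const void* buffers[2];
};

// Release callback: frees the producer memory and marks the struct as released.
inline void release_toy(ArrowArray* array)
{
    delete static_cast<toy_private*>(array->private_data);
    array->release = nullptr;
}

// Hypothetical producer: exports an int32 array without null values.
inline ArrowArray make_toy_array(std::vector<std::int32_t> values)
{
    auto* priv = new toy_private{std::move(values), {nullptr, nullptr}};
    priv->buffers[1] = priv->values.data();  // buffers[0]: absent validity bitmap

    ArrowArray array{};
    array.length = static_cast<std::int64_t>(priv->values.size());
    array.n_buffers = 2;
    array.buffers = priv->buffers;
    array.release = &release_toy;
    array.private_data = priv;
    return array;
}

// Hypothetical RAII owner: takes over the struct and releases it exactly once.
class owned_arrow_array
{
public:
    explicit owned_arrow_array(ArrowArray&& source) noexcept
        : m_array(source)
    {
        source.release = nullptr;  // the source is now in a "moved-from" state
    }
    owned_arrow_array(const owned_arrow_array&) = delete;
    owned_arrow_array& operator=(const owned_arrow_array&) = delete;
    ~owned_arrow_array()
    {
        if (m_array.release != nullptr)
        {
            m_array.release(&m_array);
        }
    }

    const ArrowArray& get() const { return m_array; }

private:
    ArrowArray m_array;
};

// Reads element i of an int32 array held by the owner.
inline std::int32_t value_at(const owned_arrow_array& owner, std::int64_t i)
{
    const ArrowArray& a = owner.get();
    assert(i >= 0 && i < a.length);
    return static_cast<const std::int32_t*>(a.buffers[1])[a.offset + i];
}
```

Once a structure has been moved into such an owner, the original variable has a null release callback, so calling release on it again would be a mistake. This is the reason behind the "do NOT release" comments in the examples.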

Typed arrays

Sparrow provides an array class for each Array layout. These arrays are commonly referred to as typed arrays because the exact data type is known at compile time. There are two major kinds of arrays: non-nested and nested arrays. Despite differences in memory layout, all these arrays share a consistent API for reading data. This API is designed to resemble that of the standard container std::vector:

#include "sparrow/sparrow.hpp"
#include <algorithm>
#include <iostream>

namespace sp = sparrow;

sp::primitive_array<double> arr = { 1, 2, 3, 4, 5 };
std::cout << arr.size() << std::endl;
std::for_each(arr.cbegin(), arr.cend(),
    [](auto n) { std::cout << n.value() << ' '; });
std::cout << arr[2].value() << std::endl;
std::cout << arr.front().value() << std::endl;
std::cout << arr.back().value() << std::endl;

In addition to the capacity, element access, and iterators APIs, typed arrays offer convenient constructors tailored to their specific types. They also implement full value semantics and can therefore be copied and moved. Contrary to the standard containers, most of the typed arrays provide an immutable API only.

Support for null values

The element access methods of typed arrays do not return values of the array’s data type, but thin wrappers allowing for null values: nullable objects.

The nullable’s API is very similar to that of std::optional, but the semantics of these types have two major differences:

  • nullable can hold references and acts as a reference proxy: assigning a value to a nullable on a reference will assign the value to the underlying reference.
  • Assigning nullval (similar to std::nullopt) to a nullable object does not trigger the destruction of the underlying object.
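
The two differences listed above can be pictured with a small, self-contained sketch. It is not Sparrow's nullable class: nullable_ref and toy_column are hypothetical names made up for this illustration, std::nullopt stands in for nullval, and a byte per element stands in for the validity bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical reference proxy, loosely modeled on the behavior described above.
template <class T>
class nullable_ref
{
public:
    nullable_ref(T& value, std::uint8_t& flag) : m_value(value), m_flag(flag) {}

    bool has_value() const { return m_flag != 0; }

    T& value() const
    {
        assert(has_value());
        return m_value;
    }

    // Assigning a value writes through to the referenced storage.
    nullable_ref& operator=(const T& v)
    {
        m_value = v;
        m_flag = 1;
        return *this;
    }

    // Assigning "null" only clears the flag; the referenced object is
    // neither destroyed nor modified.
    nullable_ref& operator=(std::nullopt_t)
    {
        m_flag = 0;
        return *this;
    }

private:
    T& m_value;
    std::uint8_t& m_flag;
};

// Toy column: values and validity flags live in separate buffers.
struct toy_column
{
    std::vector<int> values;
    std::vector<std::uint8_t> validity;

    nullable_ref<int> operator[](std::size_t i)
    {
        return nullable_ref<int>(values[i], validity[i]);
    }
};
```

With this sketch, col[1] = 42 changes the value stored in the column, whereas col[1] = std::nullopt only flips the validity flag and leaves the stored value in place. An std::optional<int> would instead own its value and reset it on assignment of std::nullopt.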

Untyped array

Some arrays do not hold their data directly; instead, they reference a child array that contains the actual values. These are known as nested arrays. Since the nesting depth can be arbitrary, storing typed arrays into typed arrays would result in a combinatorial explosion of types. To avoid that, Sparrow relies on a well-known pattern: type erasure. Typed arrays are wrapped into lightweight template containers which all inherit from the same base class, and a holder class stores a pointer to the base class and implements value semantics:

class array_wrapper
{
public:
    // Virtual destructor so that deleting through the base pointer is safe
    virtual ~array_wrapper() = default;
    virtual void api() = 0;
};

template <class T>
class array_wrapper_impl : public array_wrapper
{
public:
    void api() override { p_typed_array->api(); }
private:
    T* p_typed_array;
};

class array
{
public:
    void api() { p_wrapper->api(); }
private:
    array_wrapper* p_wrapper;
};

In Sparrow, the holder is the untyped array class. It makes it easy for typed arrays to reference other arrays. With type erasure applied, both the array type and the data type are lost. Therefore, the element access methods of the untyped array cannot return a nullable of a single, statically known type. Instead, they return a variant of all possible nullable types returned by the various typed arrays.
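
The listing above leaves out ownership and copying. As a rough, self-contained sketch of how a holder can implement value semantics on top of type erasure, one common approach is a virtual clone method combined with std::unique_ptr. This is an assumption about the general pattern, not a description of Sparrow's actual code; erased_array is a hypothetical name, and std::vector is used as a stand-in for typed arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Common interface; the concrete array type is erased behind it.
class array_wrapper
{
public:
    virtual ~array_wrapper() = default;
    virtual std::size_t size() const = 0;
    virtual std::unique_ptr<array_wrapper> clone() const = 0;
};

// One wrapper per concrete type; it owns the wrapped array.
template <class T>
class array_wrapper_impl : public array_wrapper
{
public:
    explicit array_wrapper_impl(T typed_array)
        : m_typed_array(std::move(typed_array))
    {
    }

    std::size_t size() const override { return m_typed_array.size(); }

    std::unique_ptr<array_wrapper> clone() const override
    {
        return std::make_unique<array_wrapper_impl<T>>(m_typed_array);
    }

private:
    T m_typed_array;
};

// Hypothetical holder: copyable and movable although the type is erased.
class erased_array
{
public:
    template <class T>
    erased_array(T typed_array)
        : p_wrapper(std::make_unique<array_wrapper_impl<T>>(std::move(typed_array)))
    {
    }

    // Deep copy through the virtual clone.
    erased_array(const erased_array& rhs) : p_wrapper(rhs.p_wrapper->clone()) {}
    erased_array(erased_array&&) noexcept = default;

    // Copy-and-move assignment: rhs is already a copy or a moved value.
    erased_array& operator=(erased_array rhs) noexcept
    {
        p_wrapper = std::move(rhs.p_wrapper);
        return *this;
    }

    std::size_t size() const
    {
        assert(p_wrapper != nullptr);  // a moved-from holder is empty in this sketch
        return p_wrapper->size();
    }

private:
    std::unique_ptr<array_wrapper> p_wrapper;
};
```

A holder built from a std::vector<int> and one built from a std::vector<std::string> have the same static type here, so they can be stored side by side or nested without any growth in the number of types.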

The array class provides convenient constructors from typed arrays, so that the wrapper layer can be hidden from the user:

sp::primitive_array<double> parr = { 1, 2, sp::nullval, 4, 5 };
sp::array arr(std::move(parr));

The array class can also be constructed from the Arrow C data interface, when getting data from a third-party library and dynamically discovering the data type, as demonstrated in the second example of this article.

Dynamic dispatch

Iterating over an untyped array to operate on its values may induce a performance cost: for each value, one needs to visit the returned variant to get the actual type of the data. It is more efficient to retrieve the type of the wrapped array and then iterate over it. The untyped array provides the visit method for this purpose, which accepts an arbitrary functor. With sparrow traits and the new template lambda introduced in C++20, it is easy to efficiently iterate over an array:

array arr = init_my_array();
arr.visit([]<class T>(const T& typed_array)
{
    if constexpr (is_primitive_array_v<T>)
    {
        std::for_each(typed_array.begin(), typed_array.end(), [](const auto& u)
        {
            // ...
        });
    }
    else if constexpr (is_string_array_v<T>)
    {
        // ...
    }
    // ...
});
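
The same "dispatch once, then loop on the concrete type" idea can be tried out without Sparrow. The sketch below swaps in std::variant and std::visit for the untyped array and its visit method, so it only illustrates the principle; int_column, string_column, any_column, and total_payload are hypothetical names. It uses a C++17 generic lambda with decltype rather than a C++20 template lambda so that it builds with older compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

using int_column = std::vector<int>;
using string_column = std::vector<std::string>;
using any_column = std::variant<int_column, string_column>;

// Resolves the column type once, then runs a plain typed loop over it.
inline std::size_t total_payload(const any_column& col)
{
    return std::visit([](const auto& typed) -> std::size_t
    {
        using T = std::decay_t<decltype(typed)>;
        std::size_t total = 0;
        if constexpr (std::is_same_v<T, int_column>)
        {
            // Integer column: add up the values.
            for (int v : typed)
            {
                total += static_cast<std::size_t>(v);
            }
        }
        else
        {
            // String column: add up the string lengths.
            for (const auto& s : typed)
            {
                total += s.size();
            }
        }
        return total;
    }, col);
}

// Usage: each call dispatches exactly once, whatever the column length.
inline void dispatch_example()
{
    assert(total_payload(any_column{int_column{1, 2, 3}}) == 6);
    assert(total_payload(any_column{string_column{"ab", "cde"}}) == 5);
}
```

The alternative, visiting a variant for every element, would repeat the type test inside the loop. Hoisting it out leaves an inner loop that the compiler can optimize like any loop over a concrete container.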

Builders

Arrow data structures are “structs of arrays” (SoA) which are all composed of flat arrays. While this is a very efficient way to store data, it might be cumbersome to create such data structures.

For instance, consider the list of lists of integers [[1, 2], [3, 4, 5], [6, 7]]; a common way to store and think about this structure in C++ is std::vector<std::vector<int>>. In Arrow format, such a structure is stored as two flat arrays:

  • A flat array of integers (the values): [1, 2, 3, 4, 5, 6, 7]
  • An array of offsets indicating where each list starts (the last entry marks the end of the final list): [0, 2, 5, 7]
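
As a rough illustration of the layout above, the following sketch flattens a std::vector<std::vector<int>> into a values array and an offsets array by hand. The names flat_list and flatten are hypothetical and not part of Sparrow; the sketch uses 32-bit offsets, as the Arrow list layout does, and ignores validity bitmaps and deeper nesting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Flat representation of a list of lists of integers.
struct flat_list
{
    std::vector<int> values;
    std::vector<std::int32_t> offsets;  // one more entry than the number of lists
};

// List i spans values[offsets[i]] up to, but excluding, values[offsets[i + 1]].
inline flat_list flatten(const std::vector<std::vector<int>>& nested)
{
    flat_list out;
    out.offsets.reserve(nested.size() + 1);
    out.offsets.push_back(0);
    for (const auto& inner : nested)
    {
        out.values.insert(out.values.end(), inner.begin(), inner.end());
        // Offsets are 32-bit in this sketch, so the values must fit.
        assert(out.values.size()
               <= static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
        out.offsets.push_back(static_cast<std::int32_t>(out.values.size()));
    }
    return out;
}
```

Applied to [[1, 2], [3, 4, 5], [6, 7]], flatten produces exactly the two arrays listed above. Each extra level of nesting would need another offsets array, and null handling would add validity bitmaps on top.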

Building these data structures from nested standard containers is not trivial. Sparrow provides a convenient mechanism to solve this problem: the build function. It accepts arbitrary nested standard containers and returns the appropriate typed array, hiding all the complexity of the build:

// [["Hello", "world", "!"], ["Another", "sentence"]]
std::vector<std::vector<std::string>> v
{
    { "Hello", "world", "!" },
    { "Another", "sentence" }
};
auto arr = sp::build(v);

An exhaustive mapping between standard containers and Arrow layouts can be found in the sparrow documentation.

What is coming into Sparrow

Looking ahead, our roadmap includes language bindings, support for new platforms, and integration into third-party technologies.

Acknowledgements

This development has been funded through a collaboration between Man Group, Bloomberg, and QuantStack, as part of the ArcticDB project.

About the authors

Photo of Johan Mabille

Johan Mabille

Johan Mabille is a Technical Director specialized in high-performance computing in C++. He holds a master’s degree in computer science from Centrale-Supelec. As an open source developer, Johan coauthored xtensor, xeus, and xsimd.

He leads the C++ team at QuantStack, where he oversees the development and maintenance of mamba and the Jupyter Xeus project.

Johan has also made significant contributions to JupyterLab.

Prior to joining QuantStack, Johan worked as a quant developer at HSBC.

Photo of Alexis Placet

Alexis Placet

Alexis Placet is a C++ scientific software developer at QuantStack. In the past year, Alexis has mostly been active in the development of Sparrow.

Photo of Thorsten Beier

Thorsten Beier

Thorsten Beier is a senior scientific software developer at QuantStack. Prior to working on the Sparrow project, Thorsten worked on numerous open-source projects, including Xtensor and Xeus. Thorsten is the creator of the Emscripten-Forge project, a distribution of software packages for WebAssembly for the conda packaging system.

Photo of Joel Lamotte

Joel Lamotte

Joel Lamotte is a senior software developer at QuantStack and a C++ expert. In his time at QuantStack, Joel has worked on several C++ projects, including Mamba, ArcticDB, and Sparrow.
