xframe, towards a C++ dataframe

Johan Mabille
4 min read · Jan 4, 2019

For a very long time, the C++ programming language lacked a high-level toolset for scientific computing. Data structures such as N-dimensional arrays and data frames, by contrast, are the bread and butter of the R and Python scientific stacks.

The QuantStack team, already behind the xtensor N-dimensional tensor algebra library, is now working on a data frame system for the C++ programming language, xframe.

The user-facing API of xframe is inspired by the xarray project in that the base variable type is a labeled N-dimensional array. While the two projects share the same concepts, such as broadcasting and reindexing, xframe is also a lazy expression system: we can symbolically manipulate large and complex expressions, deferring evaluation until access or assignment.

In xframe and xarray, the base variable type is a labeled N-dimensional array (Illustration from the xarray documentation)

Such variable types are more amenable to data that is intrinsically N-dimensional, which is often the case in physical sciences. It is also the base data model for the NetCDF file format used in climate science.

Just as xarray is built upon NumPy, xframe is built upon the xtensor C++ library.

Labeled arrays

xframe variables are tensors with dimension names and labeled coordinates. The following snippet creates a two-dimensional floating-point variable with string labels:

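// Header locations as found in the xtensor and xframe source trees
#include "xtensor/xrandom.hpp"
#include "xframe/xvariable.hpp"

using coordinate_type = xf::xcoordinate<xf::fstring>;
using variable_type = xf::xvariable<double, coordinate_type>;

// 6 x 3 variable filled with random values in [15, 25)
auto v = variable_type
(
    xt::eval(xt::random::rand({6, 3}, 15., 25.)),
    {
        {"group", xf::axis({"a", "b", "c", "d", "e", "f"})},
        {"city", xf::axis({"NYC", "London", "Paris"})}
    }
);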

Accessing and selecting data can be done in multiple ways:

// Dimension and index lookup by position
v(0, 2);
// Dimension lookup by position, index lookup by label
v.locate("a", "Paris");
// Dimension lookup by name, index lookup by position
v.iselect({{"city", 2},
           {"group", 0}});
// Dimension lookup by name, index lookup by label
v.select({{"city", "Paris"},
          {"group", "a"}});
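
To make the difference between these access modes concrete, here is a minimal standard-library sketch of how label-based and name-based lookup can be layered on top of positional indexing. It is not xframe's implementation: toy_axis, toy_variable and make_example are hypothetical names introduced only for this illustration, and the sketch is limited to two dimensions with unique labels.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not an xframe type): maps the labels of one
// dimension to their positions.
struct toy_axis
{
    std::string name;                                  // dimension name
    std::unordered_map<std::string, std::size_t> pos;  // label -> position

    toy_axis(std::string n, const std::vector<std::string>& labels)
        : name(std::move(n))
    {
        for (std::size_t i = 0; i < labels.size(); ++i)
        {
            pos[labels[i]] = i;
        }
    }
};

// Hypothetical 2-D labeled array with row-major storage.
struct toy_variable
{
    toy_axis rows;
    toy_axis cols;
    std::vector<double> data;

    // Dimension and index lookup by position.
    double& at(std::size_t i, std::size_t j)
    {
        assert(i < rows.pos.size() && j < cols.pos.size());
        return data[i * cols.pos.size() + j];
    }

    // Dimension lookup by position, index lookup by label.
    double& locate(const std::string& r, const std::string& c)
    {
        return at(rows.pos.at(r), cols.pos.at(c));
    }

    // Dimension lookup by name, index lookup by label: the order of the
    // entries in `sel` does not matter.
    double& select(const std::map<std::string, std::string>& sel)
    {
        return locate(sel.at(rows.name), sel.at(cols.name));
    }
};

// Builds a 2 x 3 example holding the values 0..5.
inline toy_variable make_example()
{
    return toy_variable{
        toy_axis("group", {"a", "b"}),
        toy_axis("city", {"NYC", "London", "Paris"}),
        {0., 1., 2., 3., 4., 5.}
    };
}
```

With this sketch, at(0, 2), locate("a", "Paris") and select({{"city", "Paris"}, {"group", "a"}}) all refer to the same element; the name-based form simply resolves each dimension name before resolving the label.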

Variables can be used for computation. Variables are broadcast according to dimension names. In the following example, we perform an operation on a 2-D and a 1-D variable sharing a common axis:

auto v2 = variable_type
(
    {0.5, 0.7, 0.6, 0.3, 0.2, 0.6},
    {{"group", xf::axis({"a", "b", "c", "d", "e", "f"})}}
);
// v2 has no "city" dimension, so it is broadcast along it
variable_type res = v + v2;
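
As a rough illustration of what broadcasting by dimension name means, the standard-library sketch below adds a 1-D variable to a 2-D one by matching the name of the shared dimension rather than its position. The types toy_var2d and toy_var1d and the function add_broadcast are hypothetical. The sketch also assumes that both operands carry identical labels in the same order and that a common dimension exists, simplifications that a real labeled-array library does not need.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct toy_var2d
{
    std::string dims[2];       // dimension names, e.g. {"group", "city"}
    std::size_t shape[2];      // extent of each dimension
    std::vector<double> data;  // row-major values
};

struct toy_var1d
{
    std::string dim;           // single dimension name
    std::vector<double> data;
};

// Adds `b` to `a`, repeating `b` along the dimension of `a` that `b` lacks.
inline toy_var2d add_broadcast(const toy_var2d& a, const toy_var1d& b)
{
    // Find which axis of `a` carries the same name as `b`.
    std::size_t axis = 0;
    if (a.dims[0] == b.dim)
    {
        axis = 0;
    }
    else if (a.dims[1] == b.dim)
    {
        axis = 1;
    }
    else
    {
        throw std::invalid_argument("no common dimension name");
    }
    assert(a.shape[axis] == b.data.size());

    toy_var2d res = a;
    for (std::size_t i = 0; i < a.shape[0]; ++i)
    {
        for (std::size_t j = 0; j < a.shape[1]; ++j)
        {
            // Pick the index along the shared axis, ignore the other one.
            res.data[i * a.shape[1] + j] += b.data[axis == 0 ? i : j];
        }
    }
    return res;
}
```

Because the match is made on names, the same function handles a 1-D operand along "group" or along "city" without the caller having to reshape anything.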

Zero-copy operations

Popular in-memory data frame systems tend to perform unnecessary copies when selecting a subset of data or performing simple operations. One of the main goals of the design of xframe is to prevent copies of data as much as possible.

Selections

In xframe, a view on a variable can be created using the free functions select, locate, iselect and ilocate. We can then operate upon the returned view just like on a variable. Any change to the view is reflected in the underlying variable:

auto view = xf::select(v, {{"city", xf::keep("Paris")},
                           {"group", xf::drop("a", "f")}});
view.locate("b") = 0.;
std::cout << v.locate("b", "Paris") << std::endl;
// Prints 0.
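
The write-through behavior can be sketched with the standard library alone. In the hypothetical toy_view below, the view stores a pointer to the owner's buffer plus the positions it keeps, so no element is copied and assignments land in the original storage. The names toy_view and drop_positions are assumptions made for this example and say nothing about how xframe implements its views.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical non-owning view over a 1-D buffer.
class toy_view
{
public:
    toy_view(std::vector<double>& owner, std::vector<std::size_t> kept)
        : m_owner(&owner), m_kept(std::move(kept))
    {
    }

    std::size_t size() const
    {
        return m_kept.size();
    }

    // Element access forwards to the owner's storage.
    double& operator[](std::size_t i)
    {
        assert(i < m_kept.size());
        return (*m_owner)[m_kept[i]];
    }

private:
    std::vector<double>* m_owner;     // not owned, must outlive the view
    std::vector<std::size_t> m_kept;  // positions visible through the view
};

// Builds a view that hides the given positions, loosely in the spirit of
// a "drop" selection.
inline toy_view drop_positions(std::vector<double>& owner,
                               const std::vector<std::size_t>& dropped)
{
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < owner.size(); ++i)
    {
        if (std::find(dropped.begin(), dropped.end(), i) == dropped.end())
        {
            kept.push_back(i);
        }
    }
    return toy_view(owner, std::move(kept));
}
```

The only per-view cost in this sketch is the list of kept positions; the data itself is shared, which is why the owner must stay alive for as long as the view is used.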

Expressions

Like xtensor, xframe is more than a library of labeled N-dimensional arrays: it is an expression engine that allows numerical computation on any object that implements the variable interface. Thus, if a and b are variables, a + b is not evaluated and does not allocate any memory. This also applies to complex nested expressions.

This avoids the evaluation of intermediate results and their storage in temporary variables for complex expressions.
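
The general technique behind this kind of laziness is usually called expression templates. The standard-library sketch below shows the idea under simplifying assumptions: toy_leaf, lazy_sum, lazy_add and evaluate are hypothetical names, only element-wise addition of equally sized 1-D operands is supported, and a named function is used instead of operator+ to keep the example short. It should be read as an illustration of the principle, not as a description of xframe's or xtensor's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical leaf node: a cheap, non-owning handle on existing data.
class toy_leaf
{
public:
    explicit toy_leaf(const std::vector<double>& v)
        : m_data(&v)
    {
    }

    std::size_t size() const
    {
        return m_data->size();
    }

    double operator[](std::size_t i) const
    {
        return (*m_data)[i];
    }

private:
    const std::vector<double>* m_data;
};

// Hypothetical lazy node: remembers its operands and computes element i
// only when asked for it. Operands are small handles, stored by value.
template <class L, class R>
class lazy_sum
{
public:
    lazy_sum(const L& lhs, const R& rhs)
        : m_lhs(lhs), m_rhs(rhs)
    {
        assert(lhs.size() == rhs.size());
    }

    std::size_t size() const
    {
        return m_lhs.size();
    }

    double operator[](std::size_t i) const
    {
        return m_lhs[i] + m_rhs[i];
    }

private:
    L m_lhs;
    R m_rhs;
};

// Builds the expression; no element is computed and no result is allocated.
template <class L, class R>
lazy_sum<L, R> lazy_add(const L& lhs, const R& rhs)
{
    return lazy_sum<L, R>(lhs, rhs);
}

// Evaluation happens here, in a single pass without intermediate arrays.
template <class E>
std::vector<double> evaluate(const E& expr)
{
    std::vector<double> out(expr.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i] = expr[i];
    }
    return out;
}
```

For three vectors a, b and c, lazy_add(lazy_add(toy_leaf(a), toy_leaf(b)), toy_leaf(c)) only builds a small tree of handles. The sums are computed when evaluate is called or when a single element is read, so a change made to a in between is visible in the result.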

Roadmap

We have big plans for xframe. Just as xtensor has bindings for Python, Julia, and R, allowing it to operate on NumPy, Julia, and R arrays, we plan on releasing xframe bindings for the main languages of data science.

Support for standard file formats such as NetCDF and HDF5 is also on the roadmap.

Various notions of data frames built upon xvariable will be implemented in the coming months.

Resources

You can find the documentation here, and if you are familiar with xarray, the cheat sheet should make you feel at home.

xframe has been packaged for the conda package manager. You can also try out xframe interactively in your web browser thanks to mybinder and Project Jupyter, in a Jupyter notebook running the xeus-cling kernel.

xframe and xtensor are open source software released under the BSD-3-Clause license.

Acknowledgments

The development of xframe, xtensor and related packages is led by QuantStack.

This development is sponsored by Bloomberg.

Johan Mabille

Scientific computing software engineer at QuantStack, C++ teacher, template metamage