xframe, towards a C++ dataframe
For a very long time, the C++ programming language lacked a high-level toolset for scientific computing. Data structures such as N-dimensional arrays and data frames are the bread and butter of the R and Python scientific stacks.
The QuantStack team, already behind the xtensor N-dimensional tensor algebra library, is now working on a data frame system for the C++ programming language, xframe.
The user-facing API of xframe is inspired by the xarray project in that the base variable type is a labeled N-dimensional array. While we share the same concepts, such as broadcasting and reindexing, xframe is also a lazy expression system. We can symbolically manipulate large and complex expressions, deferring evaluation until access or assignment.
Such variable types are more amenable to data that is intrinsically N-dimensional, which is often the case in physical sciences. It is also the base data model for the NetCDF file format used in climate science.
Just as xarray is built upon NumPy, xframe is built upon the xtensor C++ library.
Labeled arrays
xframe variables are tensors with dimension names and labeled coordinates. Creating a two-dimensional floating point variable with string labels:
using coordinate_type = xf::xcoordinate<xf::fstring>;
using variable_type = xf::xvariable<double, coordinate_type>;

auto v = variable_type
(
    xt::eval(xt::random::rand({6, 3}, 15., 25.)),
    {
        {"group", xf::axis({"a", "b", "c", "d", "e", "f"})},
        {"city", xf::axis({"NYC", "London", "Paris"})}
    }
);
Accessing and selecting data can be done in multiple ways:
// Dimension and index lookup by position
v(0, 2);

// Dimension lookup by position, index lookup by label
v.locate("a", "Paris");

// Dimension lookup by name, index lookup by position
v.iselect({{"city", 2},
           {"group", 0}});

// Dimension lookup by name, index lookup by label
v.select({{"city", "Paris"},
          {"group", "a"}});
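To make the difference between positional and label-based lookup concrete, here is a minimal sketch in standard C++ only. The names toy_axis, toy_variable and make_toy_variable are hypothetical and exist only for this illustration; this is not how xframe is implemented. The sketch assumes that each axis keeps a map from label to position, so that a label lookup reduces to a positional one:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration; none of these names exist in xframe.
class toy_axis
{
public:
    explicit toy_axis(std::vector<std::string> labels)
        : m_labels(std::move(labels))
    {
        // Remember the position of every label.
        for (std::size_t i = 0; i < m_labels.size(); ++i)
        {
            m_position[m_labels[i]] = i;
        }
    }

    std::size_t size() const { return m_labels.size(); }

    // Label -> position; throws std::out_of_range for an unknown label.
    std::size_t position(const std::string& label) const
    {
        return m_position.at(label);
    }

private:
    std::vector<std::string> m_labels;
    std::map<std::string, std::size_t> m_position;
};

struct toy_variable
{
    toy_axis group;
    toy_axis city;
    std::vector<double> data;  // row-major, group.size() * city.size() values

    // Lookup by position.
    double& at(std::size_t i, std::size_t j)
    {
        assert(i < group.size() && j < city.size());
        return data[i * city.size() + j];
    }

    // Lookup by label: translate labels to positions, then reuse at().
    double& locate(const std::string& g, const std::string& c)
    {
        return at(group.position(g), city.position(c));
    }
};

// Builds a 2 x 3 toy variable holding the values 1 to 6.
inline toy_variable make_toy_variable()
{
    return toy_variable{
        toy_axis(std::vector<std::string>{"a", "b"}),
        toy_axis(std::vector<std::string>{"NYC", "London", "Paris"}),
        std::vector<double>{1., 2., 3., 4., 5., 6.}
    };
}
```

With this toy type, locate("a", "Paris") and at(0, 2) refer to the same element, which mirrors the relationship between v.locate("a", "Paris") and v(0, 2) above.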
Variables can be used for computation. Variables are broadcast according to dimension names. In the following example, we perform an operation on a 2-D and a 1-D variable sharing a common axis:
auto v2 = variable_type
(
    {0.5, 0.7, 0.6, 0.3, 0.2, 0.6},
    {{"group", xf::axis({"a", "b", "c", "d", "e", "f"})}}
);

// v is the 2-D variable defined above; v2 is matched on the "group" dimension
variable_type res = v + v2;
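What "broadcast according to dimension names" means can be sketched with a toy example in standard C++. The types toy_2d and toy_1d and the function add_broadcast are hypothetical names made up for this illustration, and the sketch deliberately ignores label alignment: it assumes both operands list their labels in the same order along the shared dimension. It only shows the idea that the 1-D operand is matched by the name of its dimension, not by its position:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical toy types for illustration; none of these names exist in xframe.
struct toy_2d
{
    std::string dims[2];       // dimension names, e.g. {"group", "city"}
    std::size_t shape[2];      // length of each dimension
    std::vector<double> data;  // row-major, shape[0] * shape[1] values
};

struct toy_1d
{
    std::string dim;           // name of the single dimension
    std::vector<double> data;
};

// Adds b to a, matching b's dimension to the dimension of a with the same name.
// Assumption: labels along the shared dimension are in the same order.
inline toy_2d add_broadcast(const toy_2d& a, const toy_1d& b)
{
    assert(a.data.size() == a.shape[0] * a.shape[1]);

    std::size_t axis = 0;
    if (a.dims[0] == b.dim)
    {
        axis = 0;
    }
    else if (a.dims[1] == b.dim)
    {
        axis = 1;
    }
    else
    {
        throw std::invalid_argument("no common dimension name");
    }

    if (a.shape[axis] != b.data.size())
    {
        throw std::invalid_argument("axis length mismatch");
    }

    toy_2d res = a;
    for (std::size_t i = 0; i < a.shape[0]; ++i)
    {
        for (std::size_t j = 0; j < a.shape[1]; ++j)
        {
            // b is repeated along the dimension it does not have.
            res.data[i * a.shape[1] + j] += b.data[axis == 0 ? i : j];
        }
    }
    return res;
}
```

In this sketch, a 1-D operand named "group" is added row by row, while one named "city" is added column by column, whatever the order in which the 2-D operand declares its dimensions.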
Zero-copy operations
Popular in-memory data frame systems tend to perform unnecessary copies when selecting a subset of data or performing simple operations. One of the main goals of the design of xframe is to prevent copies of data as much as possible.
Selections
In xframe, a view on a variable can be created using the free functions select, locate, iselect and ilocate. We can then operate upon the returned view just like on a variable. Any change to the view is reflected in the underlying variable:
auto view = xf::select(v, {{"city", xf::keep("Paris")},
                           {"group", xf::drop("a", "f")}});

view.locate("b") = 0.;
std::cout << v.locate("b", "Paris") << std::endl;
// Prints 0.
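The reason a write to a view shows up in the variable is that the view holds no data of its own. The following is a minimal sketch of that idea in standard C++; toy_view is a hypothetical name for this illustration and is far simpler than the views returned by xframe. It assumes a view is just a reference to the underlying storage plus the list of positions that were kept:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical toy type for illustration; this name does not exist in xframe.
class toy_view
{
public:
    // data is referenced, never copied; kept lists the selected positions.
    toy_view(std::vector<double>& data, std::vector<std::size_t> kept)
        : m_data(data), m_kept(std::move(kept))
    {
    }

    std::size_t size() const { return m_kept.size(); }

    // Element i of the view is element m_kept[i] of the underlying data.
    double& operator[](std::size_t i)
    {
        assert(i < m_kept.size() && m_kept[i] < m_data.size());
        return m_data[m_kept[i]];
    }

private:
    std::vector<double>& m_data;
    std::vector<std::size_t> m_kept;
};
```

Assigning through such a view, for example toy_view(data, {1, 3})[0] = 0., modifies data[1] directly. The view must not outlive the data it refers to.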
Expressions
Like xtensor, xframe is more than a library of labeled N-dimensional arrays: it is an expression engine that allows numerical computation on any object that implements the variable interface. Thus, if a and b are variables, a + b is not evaluated and does not allocate any memory. This also applies to complex nested expressions.
This avoids the evaluation of intermediate results and their storage in temporary variables for complex expressions.
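For readers unfamiliar with lazy expression systems, here is a minimal sketch of the general technique (expression templates) in standard C++. The names toy_ref, toy_sum, lazy, lazy_add and evaluate are hypothetical and chosen for this illustration only; the actual machinery in xtensor and xframe is considerably more elaborate. The sketch assumes that building an expression only records its operands, and that values are computed element by element when the expression is accessed or evaluated:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy types for illustration; none of these names exist in xframe or xtensor.

// Leaf node: points to existing data without copying it.
struct toy_ref
{
    const std::vector<double>* data;

    std::size_t size() const { return data->size(); }
    double operator[](std::size_t i) const { return (*data)[i]; }
};

// Sum node: stores its two operands (cheap handles), not a result array.
template <class L, class R>
struct toy_sum
{
    L lhs;
    R rhs;

    std::size_t size() const { return lhs.size(); }
    // The addition happens here, on access, one element at a time.
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// Wraps existing data in a leaf node. The data must outlive the expression.
inline toy_ref lazy(const std::vector<double>& v)
{
    return toy_ref{&v};
}

// Builds a sum node; no element is computed and no array is allocated.
template <class L, class R>
toy_sum<L, R> lazy_add(L lhs, R rhs)
{
    assert(lhs.size() == rhs.size());
    return toy_sum<L, R>{lhs, rhs};
}

// Materializes an expression into a single output array.
template <class E>
std::vector<double> evaluate(const E& expr)
{
    std::vector<double> out(expr.size());
    for (std::size_t i = 0; i < out.size(); ++i)
    {
        out[i] = expr[i];
    }
    return out;
}
```

With three vectors a, b and c, the nested expression lazy_add(lazy_add(lazy(a), lazy(b)), lazy(c)) allocates nothing until evaluate is called, and evaluate fills a single output array in one pass without a temporary for a + b.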
Roadmap
We have big plans for xframe. Just as xtensor has bindings for Python, Julia, and R that allow it to operate on NumPy, Julia, and R arrays, we plan on releasing xframe bindings for the main languages of data science.
Support for standard file formats such as NetCDF and HDF5 is also in the roadmap.
Various notions of data frames built upon xvariable will be implemented in the coming months.
Resources
You can find the documentation here, and if you are familiar with xarray, the cheat sheet should make you feel at home.
xframe has been packaged for the conda package manager. You can also try out xframe interactively in your web browser, in a Jupyter notebook running the xeus-cling kernel, thanks to mybinder and Project Jupyter.
xframe and xtensor are open source software released under the BSD-3-Clause license.
Acknowledgments
The development of xframe, xtensor and related packages is led by QuantStack.
This development is sponsored by Bloomberg.