The description of the object I have
I have several N-dimensional containers in my code, representing tensors, whose types are defined as
std::vector<std::vector<std::vector<...<double>...>>>
These types of data structures occur in several different sizes and dimensions, and they only contain scalar numbers. The number of dimensions is known for every vector and can be accessed as e.g. tensor::dimension. Since they represent tensors, they're never "irregular": at the bottom level, vectors always contain the same number of elements, like this:
// THIS IS WHAT THEY ALWAYS LOOK LIKE
T tensor = {{1,2,3,4}, {1,2,3,4}, {1,2,3,4}}
// THIS IS WHAT NEVER HAPPENS
T tensor = {{1,2,3}, {1,2,3,4}, {1,2}}
What I want to do with this object
I want to save each of these multidimensional vectors (tensors, basically) into different files, which I can then easily load/read, e.g. in Python, into a numpy.array for further analysis and visualization. How can I save any of these N-dimensional std::vectors in modern C++ without explicitly defining a basic write-to-txt function with N nested loops for each vector of different dimensions?
(Note: Solutions/advice that require/mention only standard libraries are preferred, but I'm happy to hear any other answers too!)
The only way to iterate over something in C++ is a loop, in some shape or form. So no matter what, you're going to have loops; there are no workarounds or alternatives. But that doesn't mean you actually have to write all these loops yourself, one at a time. This is why we have templates in C++. What you are looking for is a recursive template that peels away one dimension at a time, until the last one, which gets implemented for real, basically letting your compiler write every loop for you. Mission accomplished. Start with a simplistic example of writing out a plain vector:
void write_vec(const std::vector<double> &v)
{
for (const auto &value : v)
std::cout << value << std::endl;
}
The actual details of how you want to save each value, and to which files, are irrelevant here; you can adjust the above code to make it work in whichever way you see fit. The point is that you want to make it work for some arbitrary number of dimensions. Simply add a template with the same name, then let overload resolution do all the work for you:
template<typename T>
void write_vec(const std::vector<std::vector<T>> &v)
{
for (const auto &value : v)
write_vec(value);
}
Now, a write_vec(anything), where anything is any N-"deep" vector that bottoms out in a std::vector<double>, will walk its way downhill on its own and write out every double.
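Putting the two overloads together, a minimal self-contained sketch might look like this (writing to a std::ostream so the same code can target std::cout or a std::ofstream; the one-value-per-line format is just one choice of many):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <vector>

// Base case: a flat vector of doubles, written one value per line.
void write_vec(std::ostream &os, const std::vector<double> &v)
{
    for (const auto &value : v)
        os << value << '\n';
}

// Recursive case: peel off one dimension and recurse into each sub-vector.
template <typename T>
void write_vec(std::ostream &os, const std::vector<std::vector<T>> &v)
{
    for (const auto &inner : v)
        write_vec(os, inner);
}
```

Calling write_vec(file, tensor) on an N-deep vector flattens it to one value per line; to rebuild it in Python you would then need something like numpy.loadtxt(...).reshape(shape), with the shape recorded separately (how you store the shape is up to you and not shown here).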
Related
I have a collection of vectors of different types like this:
std::vector<int> a_values;
std::vector<float> b_values;
std::vector<std::string> c_values;
Every time I get a new value for a, b and c, I want to push those to their respective vectors a_values, b_values and c_values.
I want to do this in the most generic way possible, ideally in a way I can iterate over the vectors. So I want a function addValue(...) which automatically calls the respective push_back() on each vector. If I add a new vector d_values I only want to have to specify it in one place.
The first answer to this post https://softwareengineering.stackexchange.com/questions/311415/designing-an-in-memory-table-in-c seems relevant, but I want to easily get the vector out for a given name, without having to manually cast to a particular type, i.e. I want to call getValues("d") which will give me the underlying std::vector.
Does anyone have a basic example of a collection class that does this?
This idea can be achieved with the heterogeneous container std::tuple, which allows for storage of vectors containing elements of different types.
In particular, we can define a simple data structure as follows
template <typename ...Ts>
using vector_tuple = std::tuple<std::vector<Ts>...>;
In the case of the provided example, the three vectors a_values, b_values, c_values simply correspond to the type vector_tuple<int, float, std::string>. Adding an additional vector simply requires adding an additional type to the collection.
Indexing into our new collection is simple too, given the collection
vector_tuple<int, float, std::string> my_vec_tup;
we have the following methods for extracting a_values, b_values and c_values
auto const &a_values = std::get<0>(my_vec_tup);
auto const &b_values = std::get<1>(my_vec_tup);
auto const &c_values = std::get<2>(my_vec_tup);
Note that the tuple container is indexed at compile-time, which is excellent if you know the intended size at compile-time, but will be unsuitable otherwise.
In the description of the problem that has been provided, the number of vectors does appear to be decided at compile-time and the naming convention appears to be arbitrary and constant (i.e., the index of the vectors won't need to be changed at runtime). Hence, a tuple of vectors seems to be a suitable solution, if you associate each name with an integer index.
As for the second part of the question (i.e., iterating over each vector), if you're using C++17, the great news is that you can simply use the std::apply function to do so: https://en.cppreference.com/w/cpp/utility/apply
All that is required is to pass a function which takes a vector (you may wish to define appropriate overloading to handle each container separately), and your tuple to std::apply.
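For instance, a sketch of the addValue function from the question on top of std::apply (addValue is the hypothetical name the question asked for; the comma fold expands the vecs and vals packs in lockstep):

```cpp
#include <cassert>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

template <typename... Ts>
using vector_tuple = std::tuple<std::vector<Ts>...>;

// Push one new value onto each vector in the tuple, in declaration order.
// Requires exactly one value per vector; the fold expands both packs together.
template <typename... Ts, typename... Vals>
void addValue(vector_tuple<Ts...> &tup, Vals &&...vals)
{
    std::apply([&](auto &...vecs) {
        (vecs.push_back(std::forward<Vals>(vals)), ...);
    }, tup);
}
```

The same std::apply pattern works for iteration: pass a lambda taking `const auto &...vecs` and visit each vector in a fold expression.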
However, for earlier versions of C++, you'll need to implement your own for_each function. This problem has fortunately already been solved: How can you iterate over the elements of an std::tuple?
C++ newbie here! I would like to simulate a population containing patches, containing individuals, containing chromosomes, containing genes.
What are the pros and cons of using a series of simple classes versus a highly dimensional matrix in C++? Typically, does the time to access a memory slot vary between the two techniques?
Highly dimensional Matrix
One could make "a vector of vectors of vectors of vectors" (or a C-style highly dimensional arrays of integers) and access any gene in memory with
for (int patch=0;patch<POP.size();patch++)
{
for (int ind=0;ind<POP[patch].size();ind++)
{
for (int chrom=0;chrom<POP[patch][ind].size();chrom++)
{
for (int gene=0;gene<POP[patch][ind][chrom].size();gene++)
{
POP[patch][ind][chrom][gene];
}
}
}
}
Series of Simple Classes
One could use a series of simple classes and access any gene in memory with
for (int patch=0;patch<POP->PATCHES.size();patch++)
{
for (int ind=0;ind<POP->PATCHES[patch]->INDIVIDUALS.size();ind++)
{
for (int chrom=0;chrom<POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES.size();chrom++)
{
for (int gene=0;gene<POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES.size();gene++)
{
POP->PATCHES[patch]->INDIVIDUALS[ind]->CHROMOSOMES[chrom]->GENES[gene];
}
}
}
}
While a high-dimensional matrix would work, consider that you might want to add more information to an individual. It might not just have chromosomes, but also an age, siblings, parents, phenotypes, et cetera. It is then natural to have a class Individual, which can contain all that information along with its list of chromosomes. Using classes will group relevant information together.
While I in general agree with @g-sliepen's answer, there is an additional point you should know about:
C++ gives you the ability to make a distinction between interface and type. You can leave the type abstract for the users of your code (even if that is only you) and provide a finite set of operations on it.
Using this pattern allows you to change the implementation completely (e.g. back to vectors for parallel computation etc.) later without having to change the code using it (e.g. a concrete simulation).
I won't cover what's already been suggested, as it is generally a good idea to store your individual entities as a class with all relevant fields associated with it, but I'll just address your first suggestion:
The issue with using something like a std::vector<std::vector<std::vector<std::vector<type>>>> (apart from the fact that it's a pain to handle generically) is that while the overall std::vector enclosing the structure has contiguous storage (so long as you aren't storing bools in your std::vector, that is), the inner vectors are not contiguous with each other or with the other elements.
Due to this, if you are storing a large amount of data in your structure and need access and iteration to be as fast as possible, this method of storage is not ideal; it also complicates iterating over the entire structure.
A good solution for storing a large multi-dimensional "matrix" (technically a rank 4 tensor in this case I suppose) when you require fast iteration and random access is to write a wrapper around a single std::vector in some row-major / column-major configuration such that all your data is stored as a contiguous block and you can iterate over it all via a single loop or call to std::for_each (for example). Then each index by which you access the structure would correspond to patch, ind, chrom and gene in order.
An example of a pre-built data structure which could handle this is boost::multi_array if you'd rather not code the wrapper yourself.
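As a sketch of what such a wrapper might look like (the class and member names here are made up for illustration; boost::multi_array, or std::mdspan in C++23, provide polished versions of the same idea):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major wrapper mapping (patch, individual, chromosome,
// gene) onto one contiguous buffer.
class GeneStore
{
    std::size_t ni_, nc_, ng_;   // extents of the three inner dimensions
    std::vector<double> data_;   // single contiguous block
public:
    GeneStore(std::size_t np, std::size_t ni, std::size_t nc, std::size_t ng)
        : ni_(ni), nc_(nc), ng_(ng), data_(np * ni * nc * ng) {}

    double &operator()(std::size_t p, std::size_t i,
                       std::size_t c, std::size_t g)
    {
        // Row-major linearisation of the four indices.
        return data_[((p * ni_ + i) * nc_ + c) * ng_ + g];
    }
};
```

Iterating every gene is then one loop over the underlying vector instead of four nested ones.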
There are two major ways to do multidimensional arrays: the vector of vectors (a.k.a. jagged array) and the truly multidimensional array, an n-dimensional cube. Using the latter means, for example, that all chromosomes have the same number of genes and every individual has the same number of chromosomes. If you can accept those restrictions, you get some advantages, like contiguous memory storage.
I'm new to C++ and I think a good way for me to jump in is to build some basic models that I've built in other languages. I want to start with just Linear Regression solved using first order methods. So here's how I want things to be organized (in pseudocode).
class LinearRegression
LinearRegression:
tol = <a supplied tolerance or defaulted to 1e-5>
max_ite = <a supplied max iter or default to 1k>
fit(X, y):
// model learns weights specific to this data set
_gradient(X, y):
// compute the gradient
score(X,y):
// model uses weights learned from fit to compute accuracy of
// y_predicted to actual y
My question is: when I use the fit, score and gradient methods, I don't actually need to pass around the arrays (X and y) or even store them anywhere, so I want to use a reference or a pointer to those structures. My problem is that if a method accepts a pointer to a 2D array, I need to supply the second dimension's size ahead of time or use templating. If I use templating, I now have something like this for every method that accepts a 2D array:
template<std::size_t rows, std::size_t cols>
void fit(double (&X)[rows][cols], double &y){...}
It seems there's likely a better way. I want my regression class to work with input of any size. How is this done in industry? I know that in some situations the array is just flattened into row- or column-major format, where just a pointer to the first element is passed, but I don't have enough experience to know what people use in C++.
You raised quite a few points in your question, so here are some points addressing them:
Contemporary C++ discourages working directly with heap-allocated data that you need to manually allocate or deallocate. You can use, e.g., std::vector<double> to represent vectors, and std::vector<std::vector<double>> to represent matrices. Even better would be to use a matrix class, preferably one that is already in mainstream use.
Once you use such a class, you can easily get the dimension at runtime. With std::vector, for example, you can use the size() method. Other classes have other methods. Check the documentation for the one you choose.
You probably really don't want to use templates for the dimensions.
a. If you do so, you will need to recompile each time you get a different input. Your code will be duplicated (by the compiler) to the number of different dimensions you simultaneously use. Lots of bad stuff, with little gain (in this case). There's no real drawback to getting the dimension at runtime from the class.
b. Templates (in your setting) are fitting for the type of the matrix (e.g., is it a matrix of doubles or floats), or possibly the number of dimensions (e.g., for specifying tensors).
Your regressor doesn't need to store the matrix and/or vector. Pass them by const reference. Your interface looks like that of sklearn; if you like, check the source code there. Calling fit just causes the class object to store the fitted parameter vector β used for prediction. It doesn't copy or store the input matrix and/or vector.
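A sketch of what that interface could look like (the names are illustrative and the gradient-descent body is elided; the point is the const-reference parameters and that only the weight vector is stored):

```cpp
#include <cassert>
#include <vector>

class LinearRegression
{
    std::vector<double> beta_;   // learned weights - the only stored state
public:
    // X and y are borrowed by const reference; nothing is copied or kept.
    void fit(const std::vector<std::vector<double>> &X,
             const std::vector<double> &y)
    {
        (void)y;   // unused in this stub
        // ... run gradient descent here; only the result lands in beta_.
        beta_.assign(X.empty() ? 0 : X[0].size(), 0.0);
    }

    const std::vector<double> &coefficients() const { return beta_; }
};
```

Dimensions come from X.size() and X[0].size() at runtime, so no size template parameters are needed.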
I am working on a clustering problem where I have something called a distance matrix. This distance matrix is something like:
the number of nodes (g) is N (dynamic)
This matrix is Symmetric (dist[i,j]==dist[j,i])
g1,g2,.... are objects (they contain strings, integers and maybe even more...)
I want to be able to reach any value in a simple way like dist[4][3], or an even clearer way like dist(g1,g5) (here g1 and g5 may be some kind of pointer or reference)
many std algorithms will be applied on this distance matrix, like min, max, accumulate, etc.
preferably but not mandatory, I would like not to use boost or other 3rd party libraries
What is the best standard way to declare this matrix?
You can create a two-dimensional vector like so:
std::vector<std::vector<float> > table(N, std::vector<float>(N));
Don't forget to initialize it like this: it reserves memory for N members, so it does not need to reallocate all the members when you add more. And it does not fragment the memory.
You can access its members like so:
table[1][2] = 2.01;
It does not use copy constructors all the time, because the vector's index operator returns a reference to a member, so it is pretty efficient if N does not need to change.
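Since the matrix is symmetric, a small wrapper over a single std::vector can store only the upper triangle and map dist(i,j) and dist(j,i) onto the same element. A sketch (the class and member names are made up for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Symmetric N x N matrix storing only the upper triangle (diagonal included).
class DistanceMatrix
{
    std::size_t n_;
    std::vector<float> data_;   // n*(n+1)/2 contiguous entries
public:
    explicit DistanceMatrix(std::size_t n) : n_(n), data_(n * (n + 1) / 2) {}

    float &operator()(std::size_t i, std::size_t j)
    {
        if (i > j) std::swap(i, j);   // (i,j) and (j,i) hit the same slot
        // Offset of packed row i, plus the column offset within that row.
        return data_[i * n_ - i * (i - 1) / 2 + (j - i)];
    }
};
```

Because the storage is one flat vector, std algorithms such as std::min_element or std::accumulate can run directly over it if you expose the underlying container.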
This is my little big question about containers, in particular, arrays.
I am writing a physics code that mainly manipulates a big (> 1 000 000) set of "particles" (with 6 double coordinates each). I am looking for the best way (in terms of performance) to implement a class that will contain a container for these data and provide manipulation primitives for them (e.g. instantiation, operator[], etc.).
There are a few restrictions on how this set is used:
its size is read from a configuration file and won't change during execution
it can be viewed as a big two dimensional array of N (e.g. 1 000 000) lines and 6 columns (each one storing the coordinate in one dimension)
the array is manipulated in a big loop: each "particle / line" is accessed, computation takes place with its coordinates, the results are stored back for this particle, and so on for each particle, and so on for each iteration of the big loop.
no new elements are added or deleted during the execution
First conclusion: as the access to the elements is essentially done by accessing each element one by one with [], I think that I should use a normal dynamic array.
I have explored a few things, and I would like to have your opinion on the one that can give me the best performances.
As I understand there is no advantage to use a dynamically allocated array instead of a std::vector, so things like double** array2d = new ..., loop of new, etc are ruled out.
So is it a good idea to use std::vector<double> ?
If I use a std::vector, should I create a two-dimensional array like std::vector<std::vector<double> > my_array that can be indexed like my_array[i][j], or is that a bad idea, and would it be better to use std::vector<double> other_array and access it with other_array[6*i+j]?
Maybe this can give better performance, especially as the number of columns is fixed and known from the beginning.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
Another option, the one that I am using so far is to use Blitz, in particular blitz::Array:
typedef blitz::Array<double,TWO_DIMENSIONS> store_t;
store_t my_store;
Where my elements are accessed like that: my_store(line, column);.
I think there are not much advantage to use Blitz in my case because I am accessing each element one by one and that Blitz would be interesting if I was using operations directly on array (like matrix multiplication) which I am not.
Do you think that Blitz is OK, or is it useless in my case ?
These are the possibilities I have considered so far, but maybe the best one is still another one, so don't hesitate to suggest other things.
Thanks a lot for your help on this problem!
Edit:
From the very interesting answers and comments below, a good solution seems to be the following:
Use a particle structure (containing 6 doubles) or a static array of 6 doubles (this avoids the use of two-dimensional dynamic arrays)
Use a vector or a deque of this particle structure or array. It is then good to traverse them with iterators, which will make it possible to switch from one to the other later.
In addition I can also use a Blitz::TinyVector<double,6> instead of a structure.
So is it a good idea to use std::vector<double> ?
Usually, a std::vector should be the first choice of container. You could use either std::vector<>::reserve() or std::vector<>::resize() to avoid reallocations while populating the vector. Whether any other container is better can be found by measuring. And only by measuring. But first measure whether anything the container is involved in (populating, accessing elements) is worth optimizing at all.
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > [...]?
No. IIUC, you are accessing your data per particle, not per row. If that's the case, why not use a std::vector<particle>, where particle is a struct holding six values? And even if I understood incorrectly, you should rather write a two-dimensional wrapper around a one-dimensional container. Then align your data either in rows or columns - whatever is faster with your access patterns.
Do you think that Blitz is OK, or is it useless in my case?
I have no practical knowledge about blitz++ and the areas it is used in. But isn't blitz++ all about expression templates that unroll loop operations and optimize away temporaries when doing matrix manipulations? ICBWT.
First of all, you don't want to scatter the coordinates of one given particle all over the place, so I would begin by writing a simple struct:
struct Particle { /* coords */ };
Then we can make a simple one dimensional array of these Particles.
I would probably use a deque, because that's the default container, but you may wish to try a vector; it's just that 1 000 000 particles means a single chunk of a few MBs. It should hold, but it might strain your system if this ever grows, while the deque will allocate several chunks.
WARNING:
As Alexandre C remarked, if you go the deque road, refrain from using operator[] and prefer to use iteration style. If you really need random access and it's performance sensitive, the vector should prove faster.
The first rule when choosing from containers is to use std::vector. Then, only after your code is complete and you can actually measure performance, you can try other containers. But stick to vector first. (And use reserve() from the start)
Then, you shouldn't use a std::vector<std::vector<double> >. You know the size of your data: it's 6 doubles. There is no need for it to be dynamic; it is constant and fixed. You can define a struct to hold your particle members (the six doubles), or you can simply typedef it: typedef double particle[6]. Then, use a vector of particles: std::vector<particle>.
Furthermore, as your program uses the particle data contained in the vector sequentially, you will take advantage of the modern CPU cache read-ahead feature at its best performance.
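A minimal sketch of that layout (the struct and function names are illustrative):

```cpp
#include <cassert>
#include <vector>

// One struct per particle; std::vector<Particle> keeps them contiguous.
struct Particle { double coord[6]; };

// Sequential traversal over contiguous storage - cache-friendly.
double sum_first_coordinate(const std::vector<Particle> &particles)
{
    double s = 0.0;
    for (const auto &p : particles)
        s += p.coord[0];
    return s;
}
```

All 6 million doubles live in one allocation, so the big per-iteration loop walks memory linearly.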
You could go several ways. But in your case, don't declare a std::vector<std::vector<double> >. You'd be allocating a vector (and copying it around) for every 6 doubles. That's way too costly.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
(other_array[i,j] won't work too well, as i,j employs the comma operator to evaluate the value of "i", then discards that and evaluates and returns "j", so it's equivalent to other_array[i]).
You will need to use one of:
other_array[i][j]
other_array(i, j) // if other_array implements operator()(int, int),
// but std::vector<> et al don't.
other_array[i].identifier // identifier is a member variable
other_array[i].identifier() // member function getting value
other_array[i].identifier(double) // member function setting value
You may or may not prefer to put get_ and set_ or similar on the last two functions should you find them useful, but from your question I think you won't: functions are preferred in APIs between parts of large systems involving many developers, or when the data items may vary and you want the algorithms working on the data to be independent thereof.
So, a good test: if you find yourself writing code like other_array[i][3], where you've decided "3" is the double with the speed in it, and other_array[i][5] because "5" is the acceleration, then stop doing that and give them proper identifiers so you can say other_array[i].speed and .acceleration. Then other developers can read and understand it, and you're much less likely to make accidental mistakes. On the other hand, if you are iterating over those 6 elements doing exactly the same things to each, then you probably do want Particle to hold a double[6], or to provide an operator[](int). There's no problem doing both:
struct Particle
{
double x[6];
double& speed() { return x[3]; }
double speed() const { return x[3]; }
double& acceleration() { return x[5]; }
...
};
BTW, the reason that vector<vector<double> > may be too costly is that each set of 6 doubles will be allocated on the heap, and for fast allocation and deallocation many heap implementations use fixed-size buckets, so your small request will be rounded up to the next bucket size: that may be a significant overhead. The outer vector will also need to record an extra pointer to that memory. Further, heap allocation and deallocation are relatively slow - in your case, you'd only be doing it at startup and shutdown, but there's no particular point in making your program slower for no reason. Even more importantly, the areas on the heap may jump around in memory, so your operator[] may have cache misses pulling in more distinct memory pages than necessary, slowing the entire program. Put another way, vectors store their elements contiguously, but the pointed-to vectors may not be contiguous.