How to use arrays in machine learning classes? - c++

I'm new to C++ and I think a good way for me to jump in is to build some basic models that I've built in other languages. I want to start with just Linear Regression solved using first order methods. So here's how I want things to be organized (in pseudocode).
class LinearRegression
LinearRegression:
tol = <a supplied tolerance or defaulted to 1e-5>
max_ite = <a supplied max iter or default to 1k>
fit(X, y):
// model learns weights specific to this data set
_gradient(X, y):
// compute the gradient
score(X,y):
// model uses weights learned from fit to compute accuracy of
// y_predicted to actual y
My question is when I use fit, score and gradient methods I don't actually need to pass around the arrays (X and y) or even store them anywhere so I want to use a reference or a pointer to those structures. My problem is that if the method accepts a pointer to a 2D array I need to supply the second dimension size ahead of time or use templating. If I use templating I now have something like this for every method that accepts a 2D array
template<std::size_t rows, std::size_t cols>
void fit(double (&X)[rows][cols], double &y){...}
It seems there likely a better way. I want my regression class to work with any size input. How is this done in industry? I know in some situations the array is just flattened into row or column major format where just a pointer to the first element is passed but I don't have enough experience to know what people use in C++.

You wrote a quite a few points in your question, so here are some points addressing them:
Contemporary C++ discourages working directly with heap-allocated data that you need to manually allocate or deallocate. You can use, e.g., std::vector<double> to represent vectors, and std::vector<std::vector<double>> to represent matrices. Even better would be to use a matrix class, preferably one that is already in mainstream use.
Once you use such a class, you can easily get the dimension at runtime. With std::vector, for example, you can use the size() method. Other classes have other methods. Check the documentation for the one you choose.
You probably really don't want to use templates for the dimensions.
a. If you do so, you will need to recompile each time you get a different input. Your code will be duplicated (by the compiler) to the number of different dimensions you simultaneously use. Lots of bad stuff, with little gain (in this case). There's no real drawback to getting the dimension at runtime from the class.
b. Templates (in your setting) are fitting for the type of the matrix (e.g., is it a matrix of doubles or floats), or possibly the number of dimesions (e.g., for specifying tensors).
Your regressor doesn't need to store the matrix and/or vector. Pass them by const reference. Your interface looks like that of sklearn. If you like, check the source code there. The result of calling fit just causes the class object to store the parameter corresponding to the prediction vector β. It doesn't copy or store the input matrix and/or vector.

Related

How to build a constructor for a matrix type in C++

I'm a beginner in C++, and as a learning exercise I'm trying to write a library for doing matrix math (matrix multiplication, invertion and the like).
The first thing I want to do is to define a class "Matrix", which members are "rows"- the number of rows in the matrix, "cols"-the number of columns in the matrix, and "_matrix"-an array containing the elements of the matrix.
The problem is I don't have any idea how to build the constructor.
Can I write something like "Matrix(m,n,array)"? how do I make sure the array actually contains m*n elements?
I would love some guidence on how to procceed (well.. how to begin if I'm being honest :) )
thanks!
Another answer provides a typical solution one would expect a Matrix class constructor to have (i.e. Matrix(unsigned, unsigned)).
If you are doing it as an exercise and you are serious about learnig C++, I would also suggest implementing the following contructor:
Matrix(std::initializer_list<std::initializer_list<T>> init_list);
Therefore you could build your object like that:
Matrix m({{1,2,3},{4,5,6},{7,8,9}});
Note that you could take size of the constructed matrix straight from the std::initializer_lists provided and you could easily build templated matrices that way.
If I were using a matrix, I would expect a constructor like:
Matrix(unsigned int maximum_rows, unsigned int maximum_columns);
I don't care if the matrix is implemented by array, list or other data structure. I told it the size, so construct one.
Edit 1:
You want to hide the implementation of the Matrix from the user. The implementation of the constructor depends on your implementation.
The implementation may be different for a lower triangular matrix than a generic one. You may decide on a vector of vectors, a 2 dimensional array, a one dimensional array or a linked list.
I as the user don't really care how it's implemented. All I care is that the expected Matrix functionalities are implemented correctly, and in some cases, efficiently. So I may expect an overloaded operator + or an add method or both.
Again, search the internet to see examples of how other people have implemented a matrix.
Edit 2:
There may be cases where you want to have one class for the functionality and another class for the implementation. In that case, you may want to pass the implementation to the Matrix's constructor. (I would suggest using a reference to a base class the describes the implementation interface). But that may be overkill for what you need.

Data structure for handling a list of 3 integers

I'm currently coding a physical simulation on a lattice, I'm interested in describing loops in this lattice, they are closed curved composed by the edges of the lattice cells. I'm storing the information on this lattice cells (by information I mean a Boolean variable saying if the edge is valuable or no for composing a loop) in a 3 dimensional Boolean array.
I'm now thinking about a good structure to handle this loops. they are basically a list of edges, so I would need something like an array of 3d integer vectors, each edge being defined by 3 coordinates in my current parameterization. I'm already thinking about building a class around this "list" object as I'll need methods computing the loop diameter and probably more in the future.
But, I'm definitely not so aware of the choice of structure I have to do that, my physics background hasn't taught me enough in C++. And for so, I'd like to hear your suggestion for shaping this piece of code. I would really enjoy discovering some new ways of coding this kid of things.
You want two separate things. One is keeping track of all edges and allowing fast lookup of edge objects by an (int,int,int) index (you probably don't want int there but something like size_t or so). This is entirely independent from your second goal crating ordered subsets of these.
General Collection (1)
Since your edge database is going to be sparse (i.e. only a few of the possible indices will actually identify as a particular edge), my prior suggestion of using a 3d matrix is unsuitable. Instead, you probably want to lookup edges with a hash map.
How easy this is, depends on the expected size of the individual integers. That is, can you manage to have no more than 21 bit per integer (for instance if your integers are short int values, which have only 16 bit), then you can concatenate them to one 64 bit value, which already has an std::hash implementation. Otherwise, you will have to implement your own hash specialisation for, e.g., std::hash<std::array<uint32_t,3>> (which is also quite easy, and highly stackable).
Once you can hash your key, you can throw it into an std::unordered_map and be done with it. That thing is fast.
Loop detection (2)
Then you want to have short-lived data structures for identifying loops, so you want a data structure that extends on one end but never on the other. That means you're probably fine with an std::vector or possibly with an std::deque if you have very large instances (but try the vector first!).
I'd suggest simply keeping the index to an edge in the local vector. You can always lookup the edge object in your unordered_map. Then the question is how to represent the index. If Int represents your integer type (e.g. int, size_t, short, ...) it's probably the most consistent to use an std::array<Int,3> --- if the types of the integers differ, you'll want an std::tuple<...>.

Armadillo C++:- Efficient access of columns in a cube structure

Using Armadillo matrix library I am aware that the efficient way of accessing a column in a 2d matrix is via a simply call to .col(i).
I am wondering is there an efficient way of extracting a column stored in a "cube", without first having to call the slice command?
I need the most efficient possible way of accessing the data stored in for instance (using matlab notation) A(:,i,j) . I will be doing this millions of times on a very large dataset, so speed and efficiency is of a high priority.
I think you want
B = A.subcube( span:all, span(i), span(j) );
or equivalently
B = A.subcube( span(), span(i), span(j) );
where B will be a row or column vector of the same type as A (e.g. containing double by default, or a number of other available types).
.slice() should be pretty quick. It simply provides a reference to the underlying Mat class. You could try something along these lines:
cube C(4,3,2);
double* mem = C.slice(1).colptr(2);
Also, bear in mind that Armadillo has range checks enabled by default. If you want to avoid the range checks, use the .at() element accessors:
cube C(4,3,2);
C.at(3,2,1) = 456;
Alternatively, you can store your matrices in a field class:
field<mat> F(100);
F(0).ones(12,34);
Corresponding element access:
F(0)(1,2); // with range checks
F.at(0).at(1,2); // without range checks
You can also compile your code with ARMA_NO_DEBUG defined, which will remove all run-time debugging (such as range checks). This will give you a speedup, but it is only recommended once you have debugged all your code (ie. verified that your algorithm is working correctly). The debugging checks are very useful in picking up mistakes.

Making an Eigen::Vector look like a vector of points

I want to represent a 2D shape in such a way that it can be interacted with as if it were a vector of points, in particular I want to be able to call operator[] and at() on it and return references to things that act like 2D points. Currently I just use a class whose only member variable is a vector of points and that has various arithmetic and geometric operations defined pointwise on its elements.
However, in other parts of my code I need to treat a vector of n points as an element of 2n dimensional space and perform basic linear algebra on it (e.g. projecting the vector onto a given subspace of R^2n). Currently I'm creating an Eigen::VectorXd object every time I want to do this, and then converting back after performing these operations. I don't want to do this, as I make the conversion often enough that all the copying is a noticeable source of inefficiency.
If I was storing the data as a flat array of doubles/floats/ints, I could cast a pointer to its nth element to a pointer to a Point (whose members would just be a pair of doubles/floats/ints). However, as I don't know the internal representation that Eigen uses for vectors (and it may well change), this isn't possible.
Is there a sensible way of solving this? I could just use Eigen::Vectors everywhere, but I really want most of the code to be able to pretend that it is dealing with a set of points.
However, as I don't know the internal representation that Eigen uses for vectors (and it may well change), this isn't possible.
Eigen offers the Map classes that allow mapping plain arrays to Eigen structures. For example:
double numbers[2];
Eigen::Vector2f::Map( numbers ).dot( Eigen::Vector2f::Constant(1) );

Choice of the most performant container (array)

This is my little big question about containers, in particular, arrays.
I am writing a physics code that mainly manipulates a big (> 1 000 000) set of "particles" (with 6 double coordinates each). I am looking for the best way (in term of performance) to implement a class that will contain a container for these data and that will provide manipulation primitives for these data (e.g. instantiation, operator[], etc.).
There are a few restrictions on how this set is used:
its size is read from a configuration file and won't change during execution
it can be viewed as a big two dimensional array of N (e.g. 1 000 000) lines and 6 columns (each one storing the coordinate in one dimension)
the array is manipulated in a big loop, each "particle / line" is accessed and computation takes place with its coordinates, and the results are stored back for this particle, and so on for each particle, and so on for each iteration of the big loop.
no new elements are added or deleted during the execution
First conclusion, as the access on the elements is essentially done by accessing each element one by one with [], I think that I should use a normal dynamic array.
I have explored a few things, and I would like to have your opinion on the one that can give me the best performances.
As I understand there is no advantage to use a dynamically allocated array instead of a std::vector, so things like double** array2d = new ..., loop of new, etc are ruled out.
So is it a good idea to use std::vector<double> ?
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > my_array that can be indexed like my_array[i][j], or is it a bad idea and it would be better to use std::vector<double> other_array and acces it with other_array[6*i+j].
Maybe this can gives better performance, especially as the number of columns is fixed and known from the beginning.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
Another option, the one that I am using so far is to use Blitz, in particular blitz::Array:
typedef blitz::Array<double,TWO_DIMENSIONS> store_t;
store_t my_store;
Where my elements are accessed like that: my_store(line, column);.
I think there are not much advantage to use Blitz in my case because I am accessing each element one by one and that Blitz would be interesting if I was using operations directly on array (like matrix multiplication) which I am not.
Do you think that Blitz is OK, or is it useless in my case ?
These are the possibilities I have considered so far, but maybe the best one I still another one, so don't hesitate to suggest me other things.
Thanks a lot for your help on this problem !
Edit:
From the very interesting answers and comments bellow a good solution seems to be the following:
Use a structure particle (containing 6 doubles) or a static array of 6 doubles (this avoid the use of two dimensional dynamic arrays)
Use a vector or a deque of this particle structure or array. It is then good to traverse them with iterators, and that will allow to change from one to another later.
In addition I can also use a Blitz::TinyVector<double,6> instead of a structure.
So is it a good idea to use std::vector<double> ?
Usually, a std::vector should be the first choice of container. You could use either std::vector<>::reserve() or std::vector<>::resize() to avoid reallocations while populating the vector. Whether any other container is better can be found by measuring. And only by measuring. But first measure whether anything the container is involved in (populating, accessing elements) is worth optimizing at all.
If I use a std::vector, should I create a two dimensional array like std::vector<std::vector<double> > [...]?
No. IIUC, you are accessing your data per particle, not per row. If that's the case, why not use a std::vector<particle>, where particle is a struct holding six values? And even if I understood incorrectly, you should rather write a two-dimensional wrapper around a one-dimensional container. Then align your data either in rows or columns - what ever is faster with your access patterns.
Do you think that Blitz is OK, or is it useless in my case?
I have no practical knowledge about blitz++ and the areas it is used in. But isn't blitz++ all about expression templates to unroll loop operations and optimizing away temporaries when doing matrix manipulations? ICBWT.
First of all, you don't want to scatter the coordinates of one given particle all over the place, so I would begin by writing a simple struct:
struct Particle { /* coords */ };
Then we can make a simple one dimensional array of these Particles.
I would probably use a deque, because that's the default container, but you may wish to try a vector, it's just that 1.000.000 of particles means about a single chunk of a few MBs. It should hold but it might strain your system if this ever grows, while the deque will allocate several chunks.
WARNING:
As Alexandre C remarked, if you go the deque road, refrain from using operator[] and prefer to use iteration style. If you really need random access and it's performance sensitive, the vector should prove faster.
The first rule when choosing from containers is to use std::vector. Then, only after your code is complete and you can actually measure performance, you can try other containers. But stick to vector first. (And use reserve() from the start)
Then, you shouldn't use an std::vector<std::vector<double> >. You know the size of your data: it's 6 doubles. No need for it to be dynamic. It is constant and fixed. You can define a struct to hold you particle members (the six doubles), or you can simply typedef it: typedef double particle[6]. Then, use a vector of particles: std::vector<particle>.
Furthermore, as your program uses the particle data contained in the vector sequentially, you will take advantage of the modern CPU cache read-ahead feature at its best performance.
You could go several ways. But in your case, don't declare astd::vector<std::vector<double> >. You're allocating a vector (and you copy it around) for every 6 doubles. Thats way too costly.
If you think that this is the best option, would it be possible to wrap this vector in a way that it can be accessed with a index operator defined as other_array[i,j] // same as other_array[6*i+j] without overhead (like function call at each access) ?
(other_array[i,j] won't work too well, as i,j employs the comma operator to evaluate the value of "i", then discards that and evaluates and returns "j", so it's equivalent to other_array[i]).
You will need to use one of:
other_array[i][j]
other_array(i, j) // if other_array implements operator()(int, int),
// but std::vector<> et al don't.
other_array[i].identifier // identifier is a member variable
other_array[i].identifier() // member function getting value
other_array[i].identifier(double) // member function setting value
You may or may not prefer to put get_ and set_ or similar on the last two functions should you find them useful, but from your question I think you won't: functions are prefered in APIs between parts of large systems involving many developers, or when the data items may vary and you want the algorithms working on the data to be independent thereof.
So, a good test: if you find yourself writing code like other_array[i][3] where you've decided "3" is the double with the speed in it, and other_array[i][5] because "5" is the the acceleration, then stop doing that and give them proper identifiers so you can say other_array[i].speed and .acceleration. Then other developers can read and understand it, and you're much less likely to make accidental mistakes. On the other hand, if you are iterating over those 6 elements doing exactly the same things to each, then you probably do want Particle to hold a double[6], or to provide an operator[](int). There's no problem doing both:
struct Particle
{
double x[6];
double& speed() { return x[3]; }
double speed() const { return x[3]; }
double& acceleration() { return x[5]; }
...
};
BTW / the reason that vector<vector<double> > may be too costly is that each set of 6 doubles will be allocated on the heap, and for fast allocation and deallocation many heap implementations use fixed-size buckets, so your small request will be rounded up t the next size: that may be a significant overhead. The outside vector will also need to record a extra pointer to that memory. Further, heap allocation and deallocation is relatively slow - in you're case, you'd only be doing it at startup and shutdown, but there's no particular point in making your program slower for no reason. Even more importantly, the areas on the heap may just around in memory, so your operator[] may have cache-faults pulling in more distinct memory pages than necessary, slowing the entire program. Put another way, vectors store elements contiguously, but the pointed-to-vectors may not be contiguous.