performance of thrust vs. cublas - c++

I have an std::vector of matrices of different sizes and I am going to calculate the square of every matrix. I have two solutions:
1/ Flatten all my matrices, and store them in the device as a huge flat array (float *), with indices of beginning and end of each matrix in that array, and use cublas for example to do the squaring.
2/ store the matrices in a thrust::device_vector<float *> and use thrust::for_each to square them.
Clearly the second solution gives more readable code, but does it impact performance?

I think this is (now) just a repeat of a question you already asked.
Assuming the elementwise operation you want to do is something simple like squaring of each element, there should be little difference in performance or efficiency between the two cases.
This is because such an operation will be memory-bound, meaning its performance will be limited by (GPU) memory bandwidth. Therefore both realizations will have approximately the same limiter, and approximately the same performance.
Note that in both of your proposals, the data will ultimately need to be effectively "flattened" in the same way (thrust operations cannot be constructed in a typical or simple fashion to operate on a thrust::device_vector<float *>).
If you already have a mix of thrust and CUBLAS, for example, then you could probably use whichever approach suited you. If, on the other hand, your module used only CUBLAS, and you could realize your operation using either CUBLAS or thrust, I'm not sure I would inject thrust just for this one operation. But that's just a matter of opinion.
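For concreteness, here is a minimal sketch of what the flattened version could look like with thrust, assuming the operation really is an elementwise square (the functor and function names are illustrative only):

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Elementwise square; matrix boundaries don't matter for this operation,
// so one transform over the whole flattened buffer is enough.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

void square_all(thrust::device_vector<float>& flat)
{
    // Single pass over the data: performance is bounded by memory bandwidth,
    // regardless of whether thrust or a CUBLAS call does the work.
    thrust::transform(flat.begin(), flat.end(), flat.begin(), square());
}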

Related

Traverse of multidimensional Array in any axis

I have a (kind of) performance problem in my code, that roots in the chosen architecture.
I will use multidimensional tensors (basically matrices with more dimensions) in the form of cubes to store my data.
Since the dimension is not known at compile time, I can't use Boost's MultiArray (IIRC), but have to come up with my own solution.
Right now, I save each dimension on its own. I have a tensor of dimension (let's say) 3 that holds a lot of tensors of dimension 2 (in an std::vector), each of which has a std::vector of tensors of dimension 1, each of which holds a std::vector of (numerical) data. I use an abstract base class for my tensors, so everything in there is a pointer to the abstract class, while (secretly) being multi- or one-dimensional.
I extract a single numerical data point by giving a std::list of indices to a tensor, which takes the first index, finds the corresponding sub-tensor, and passes the rest of the list to that tensor in a (kind of) recursive call.
I now have to do a multi-dimensional Fast Fourier Transform on that data. I use a thread pool and job objects that copy data out of a tensor along one dimension, do the FFT, and write the data back.
I already have logic to implement ThreadPool and organize the dimensions to FFT along, but there is one problem:
My data structure is the most cache-unfriendly beast one can think of... Copying data along the first dimension (the one whose data sits in a single 1D tensor) is reasonably fast, but in other directions I need to copy my data from all over the place.
Since there are no race conditions (I make sure every concurrent FFT is on distinct data points), I thought I would not use a mutex guard and would just let everybody copy at the same time. However, this heavily slows down the process ("I copy my data now!" - "No, I copy my data now!" - "But it's my turn now!"...).
Guarding the copy process with a mutex does not increase speed either. The FFT of a vector with 1024 elements is way faster than the copy process needed to get those elements, resulting in nearly all of my threads waiting while one is copying.
Long story short:
Is there any kind of multi-dimensional data structure that does not need the dimension set at compile time and that allows me to traverse quickly along every axis? I have searched for a while now, but nothing came up besides Boost.MultiArray. Vectorization also does not work, since the indices would grow too large to hold in the usual int types.
I can't think of how to present code examples here, since most of that code is rather simple, but if needed, I can add some.
Eigen has multi-dimensional tensor support (nominally unsupported, but written by the DeepMind people, so "somewhat" supported?), and FFTW has 1D to 3D FFTs. Using external libraries with a set of 1D to 3D FFTs would outsource most of the hard work.
Edit: Actually, FFTW has support for threaded n-dimensional FFTs
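For illustration, a rough sketch of such a threaded n-dimensional FFTW call (in-place complex transform; the function name, dimensions, and thread count are placeholders, and you need to link against fftw3_threads as well as fftw3):

#include <fftw3.h>

// Threaded n-dimensional complex-to-complex FFT, in place.
void fft_nd(fftw_complex* data, int rank, const int* dims, int nthreads)
{
    fftw_init_threads();                 // once per program
    fftw_plan_with_nthreads(nthreads);   // applies to plans created afterwards

    fftw_plan plan = fftw_plan_dft(rank, dims, data, data,
                                   FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plan);
    fftw_destroy_plan(plan);
}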

Eigen: Efficient equivalent to MATLAB's changem()?

I need to perform an operation on an Eigen VectorXi which is equivalent to MATLAB's changem():
http://www.mathworks.com/help/map/ref/changem.html
At the moment, the way I am doing this is looping over the values in the array and performing the remapping with a switch/case block. I am guessing this is not particularly efficient.
Is there a fast way to do this with Eigen? Speed is critical for my application.
Switch / case will be particularly slow and inflexible.
changem takes a matrix and two vectors of values, new and old. If an entry is found in the old list, it is replaced by the corresponding entry in the new list. So it's inherently going to be rather slow: you need to pass over the entire matrix, search the old list, and, if an entry is found, replace it with the corresponding entry from the new list.
How can you speed it up? First, don't hardcode as a switch / case. A modern compiler will possibly optimise to a loop rather than lots of jumps, but I wouldn't guarantee it. And the approach is inflexible.
Secondly, you can sort the "old" vector and use a binary search rather than a linear one. That will only help significantly if the old vector is long.
Thirdly, you can take advantage of what you know about the matrix. Are the old values constrained to lie in certain regions? Is there one value which is overwhelmingly likely and can be tested for first? Can you quickly exclude some values as not allowed in the old list (too big, too small, not integral)?
Are the old values integers and can you use indexing? Or generalise that to hashing. That would be even faster than a binary search, though with more overhead for hashing.
Can you solve the problem another way and keep an index of matrix xy co-ordinates by value?
There are lots of approaches. But simply implement the Matlab function naively in C as the first step. It might well be fast enough.
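To make that concrete, here is a naive sketch of such a remap for an Eigen VectorXi, with the hashing idea from above layered on top (the function name and signature are made up for illustration):

#include <Eigen/Dense>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Replace every entry equal to oldvals[k] with newvals[k]; other entries
// are left untouched, as changem does.
void changem(Eigen::VectorXi& v,
             const std::vector<int>& oldvals,
             const std::vector<int>& newvals)
{
    // Build the old -> new lookup once: O(1) expected cost per element afterwards.
    std::unordered_map<int, int> remap;
    for (std::size_t k = 0; k < oldvals.size(); ++k)
        remap[oldvals[k]] = newvals[k];

    for (Eigen::Index i = 0; i < v.size(); ++i) {
        auto it = remap.find(v[i]);
        if (it != remap.end())
            v[i] = it->second;
    }
}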

Handling large matrices in C++

I am using large matrices of doubles in C++. I need to get rows or columns from these matrices and pass them to a function. What is the fastest way I can do this?
One way is to write a function that returns a copy of the desired row or column as an std::vector.
Another way is to pass the whole thing as a reference and modify the function to be able to read the desired values.
Are there any other options? Which one do you recommend?
BTW, how do you recommend I store the data in the matrix class? I am using std::vector< std::vector< double > > right now.
EDIT
I should have mentioned that the matrices might have more than two dimensions, so using boost or arma::mat here is out of the question. Although I am using Armadillo in other parts of the library.
If a variable number of dimensions above 2 is a key requirement, take a look at boost's multidimensional array library. It has efficient (copy-free) "views" you can use to reference lower-dimensional "slices" of the full matrix.
The details of what's "fastest" for this sort of thing depend an awful lot on what exactly you're doing, and how the access patterns/working set "footprint" fit your HW's various levels of cache and memory latency; in practice it can be worth copying to more compact representations to get more cache-coherent access, in preference to making sparse strided accesses which just waste a lot of a cache line. Alternatives are Morton-order accessing schemes, which can at least amortize "bad axis" effects over all axes. Only your own benchmarking on your own code and use cases on your HW can really answer that though.
(Note that I wouldn't use Boost.MultiArray for 2 dimensional arrays - there are faster, better options for linear algebra / image processing applications - but for 3+ it's worth considering.)
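A small sketch of such a view, to show the idea (note that Boost.MultiArray still fixes the number of dimensions as a compile-time template parameter; only the sizes are set at run time):

#include <boost/multi_array.hpp>

int main()
{
    typedef boost::multi_array<double, 3> array3;
    typedef boost::multi_array_types::index_range range;

    array3 A(boost::extents[4][5][6]);

    // A 2D view of A with the middle index fixed to 2: no data is copied,
    // reads and writes go straight through to A.
    array3::array_view<2>::type slice =
        A[boost::indices[range()][2][range()]];

    slice[0][0] = 1.0;   // same element as A[0][2][0]
    return 0;
}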
I would use a library like Armadillo (http://arma.sourceforge.net/), because you not only get a way to store the matrix, you also get functions that can do operations on it.
Efficient (multi)linear algebra is a surprisingly deep subject; there are no easy one-size-fits-all answers. The principal challenge is data locality: the memory hardware of your computer is optimized for accessing contiguous regions of memory, and probably cannot operate on anything other than a cache line at a time (and even if it could, the efficiency would go down).
The size of a cache line varies, but think 64 or 128 bytes.
Because of this, it is a non-trivial challenge to lay out the data in a matrix so that it can be accessed efficiently in multiple directions; even more so for higher rank tensors.
And furthermore, the best choices will probably depend heavily on exactly what you're doing with the matrix.
Your question really isn't one that can be satisfactorily answered in a Q&A format like this.
But to at least get you started on researching, here are two keyphrases that may be worth looking into:
block matrix
fast transpose algorithm
You may well do better to use a library rather than trying to roll your own; e.g. blitz++. (disclaimer: I have not used blitz++)
vector<vector<...>> will be slow to allocate, slow to free, and slow to access because it will have more than one dereference (not cache-friendly).
I would recommend it only if your rows (or columns) don't all have the same size (jagged arrays).
For a "normal" matrix, you could go for something like:
template <class T, size_t nDim> struct tensor {
    size_t dims[nDim];
    std::vector<T> vect;
};
and overload operator()(size_t i, size_t j, etc.) to access elements.
operator() will have to do index calculations (you have to choose between row-major or column-major order). For nDim > 2, it becomes somewhat complicated, and it could benefit from caching some indexing computations.
To return a row or a column, you could then define sub types.
template <class T, size_t nDim> struct row /*or column*/ {
    tensor<T, nDim>& tensor;
    size_t iStart;
    size_t stride;
};
Then define an operator()(size_t i) that will return tensor.vect[iStart + i*stride].
stride value will depend on whether it is a row or a column, and your (row-major or column-major) ordering choice.
stride will be 1 for one of your sub type. Note that for this sub type, iterating will probably be much faster, because it will be cache-friendly. For other sub types, unfortunately, it will probably be rather slow, and there is not much you can do about it.
See other SO questions about why iterating over the rows and then the columns will probably perform very differently from iterating over the columns and then the rows.
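Putting the pieces above together, here is one possible way the 2D case could look with row-major storage (generalising the index arithmetic to nDim > 2 follows the same pattern); the names and the layout choice are just for illustration:

#include <cstddef>
#include <vector>

// Flat row-major 2D tensor: elements of a row are contiguous in memory.
template <class T>
struct tensor2
{
    std::size_t rows, cols;
    std::vector<T> vect;

    tensor2(std::size_t r, std::size_t c) : rows(r), cols(c), vect(r * c) {}

    T& operator()(std::size_t i, std::size_t j) { return vect[i * cols + j]; }
};

// Non-owning row/column view: stride 1 for a row (cache-friendly),
// stride == cols for a column (one cache line touched per element).
template <class T>
struct slice
{
    tensor2<T>& t;
    std::size_t iStart;
    std::size_t stride;

    T& operator()(std::size_t i) { return t.vect[iStart + i * stride]; }
};

// Usage: row i of m is slice<T>{m, i * m.cols, 1},
//        column j of m is slice<T>{m, j, m.cols}.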
I recommend you pass it by reference as copying might be a slow process depending on the size. std::vector is fine if you want the ability to expand and contract the container.

Storing Matrix information in STL vector. Which is better vector or vector of vectors?

I've created my own Matrix class where, inside the class, the information regarding the matrix is stored in an STL vector. I've noticed that while searching the web some people work with a vector of vectors to represent the matrix information. My best guess tells me that as long as the matrix is small or skinny (row_num >> column_num) the difference should be small, but what about if the matrix is square or fat (row_num << column_num)? If I were to create a very large matrix, would I see a difference at run time? Are there other factors that need to be considered?
Thanks
Have you considered using an off-the-shelf matrix representation such as boost's instead of reinventing the wheel?
If you have a lot of empty rows for example, using the nested representation could save a lot of space. Unless you have specific information in actual use cases showing one way isn't meeting your requirements, code the way that's easiest to maintain and implement properly.
There are too many variables to answer your question.
Create an abstraction so that your code does not care how the matrix is represented. Then write your code using any implementation. Then profile it.
If your matrix is dense, the "vector of vectors" is very unlikely to be faster than a single big memory block and could be slower. (Chasing two pointers for random access + worse locality.)
If your matrices are large and sparse, the right answer to your question is probably "neither".
So create an abstract interface, code something up, and profile it. (And as #Mark says, there are lots of third-party libraries you should probably consider.)
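As a minimal sketch of that suggestion (all names here are illustrative), the callers only see a small interface while the storage behind it can be swapped and profiled; in real code you would likely make this a template rather than paying a virtual call per element:

#include <cstddef>
#include <vector>

// Minimal storage interface so callers don't care about the representation.
struct MatrixStorage
{
    virtual double get(std::size_t r, std::size_t c) const = 0;
    virtual void   set(std::size_t r, std::size_t c, double v) = 0;
    virtual ~MatrixStorage() {}
};

// Dense storage in one contiguous block (row-major), typically the faster
// choice for a dense matrix compared to a vector of vectors.
class FlatMatrix : public MatrixStorage
{
    std::size_t cols_;
    std::vector<double> data_;
public:
    FlatMatrix(std::size_t rows, std::size_t cols) : cols_(cols), data_(rows * cols) {}
    double get(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }
    void   set(std::size_t r, std::size_t c, double v) { data_[r * cols_ + c] = v; }
};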
If you store everything in a single vector, an iterator will traverse the entire matrix. If you use a vector of vectors, an iterator will only traverse a single dimension.

Performance question: Inverting an array of pointers in-place vs array of values

The background for asking this question is that I am solving a linearized equation system (Ax=b), where A is a matrix (typically of dimension less than 100x100) and x and b are vectors. I am using a direct method, meaning that I first invert A, then find the solution by x=A^(-1)b. This step is repeated in an iterative process until convergence.
The way I'm doing it now, using a matrix library (MTL4):
For every iteration I copy all coefficients of A (values) into the matrix object, then invert. This is the easiest and safest option.
Using an array of pointers instead:
For my particular case, the coefficients of A happen to be updated between each iteration. These coefficients are stored in different variables (some are arrays, some are not). Would there be a potential for performance gain if I set up A as an array containing pointers to these coefficient variables, then inverting A in-place?
The nice thing about the last option is that once I have set up the pointers in A before the first iteration, I would not need to copy any values between successive iterations. The values which are pointed to in A would automatically be updated between iterations.
So the performance question boils down to this, as I see it:
- The matrix inversion process takes roughly the same amount of time, assuming de-referencing of pointers is non-expensive.
- The array of pointers does not need the extra memory for matrix A containing values.
- The array of pointers option does not have to copy all NxN values of A between each iteration.
- The values that are pointed to in the array-of-pointers option are generally NOT ordered in memory. Hopefully, all values lie relatively close in memory, but *A[0][1] is generally not next to *A[0][0] etc.
Any comments on this? Will the last remark affect performance negatively, thus outweighing the positive performance effects?
Test, test, test.
Especially in the field of Numerical Linear Algebra. There are many effects in play, which is why there are a number of optimized libraries that have taken that burden off you.
Some effects to consider:
Memory locality and cache effects
Multithreading effects (some algorithms that are optimal while running single-core cause memory collisions/cache evictions when more than one core is utilized).
There is no substitute for testing.
Here are some comments:
Is the function you use for the inversion capable of handling a matrix of pointers instead of values? If it does not realise it has to do an indirection, all kinds of strange effects could happen.
When doing an in-place matrix inversion (meaning the inverted matrix overwrites the input matrix), all input coefficients will get overwritten with new values, because matrix inversion cannot be done by re-ordering the elements of the matrix.
During the inversion process, none of the input coefficients may be changed by an outside process. All such updates have to be performed between iterations.
So, you get the following set of trade-offs when you chose the pointer solution:
The coefficients making up matrix A can no longer be calculated asynchronously with the matrix inversion.
Either all coefficients must be recalculated for each iteration (when you use in-place inversion, meaning the inverted matrix uses the same memory as the input matrix), or you still have to use a matrix of N x N values to hold the result of the inversion.
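A small sketch of what that trade-off looks like in code (names are illustrative): with the pointer layout you still end up copying the scattered coefficients into a contiguous N x N buffer before each in-place inversion, which is essentially the copy you were hoping to avoid:

#include <cstddef>
#include <vector>

// Gather the current coefficient values (scattered all over memory) into a
// contiguous row-major buffer that the inversion routine can safely overwrite.
void gather(const std::vector<std::vector<const double*> >& ptrs,  // N x N pointers
            std::vector<double>& dense,                            // N * N values
            std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            dense[i * N + j] = *ptrs[i][j];   // scattered reads, contiguous writes
}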
You're getting good answers here. The only thing I would add is some general experience with performance.
You are thinking about performance a-priori. That's reasonable, but the real payoff is a-posteriori. In other words, you don't know for certain where the real optimization opportunities are, until the running code tells you.
You don't know if the bulk of the time will be spent in matrix inversion, multiplication, copying the matrix, dereferencing, or what. People can guess. If I had to guess, it would be matrix inversion, because it's 100x100.
However, something else I can't guess might be even bigger.
Guessing has a very poor track record, especially when you can just find out.
Here's an example of what I mean.