Filling only half of a matrix using OpenMP in C++

I have a quite big matrix. I would like to fill half of the matrix in parallel.
m_matrix is a 2D std::vector. Any suggestions for a better container type are appreciated as well. The work done by _fill(i, j) is light relative to the size of the matrix.
//i: row
//j: column
for (size_t i=1; i<num_row; ++i)
{
for (size_t j=0; j<i; ++j)
{
m_matrix[i][j] = _fill(i, j);
}
}
What would be a good OpenMP structure for this? I tried a dynamic schedule, but the run time actually increased compared to the sequential version.

Related

How to efficiently initialize a SparseVector in Eigen

In the Eigen docs for filling a sparse matrix it is recommended to use the triplet filling method as it can be much more efficient than making calls to coeffRef, which involves a binary search.
For filling SparseVectors however, there is no clear recommendation on how to do it efficiently.
The suggested method in this SO answer uses coeffRef which means that a binary search is performed for every insertion.
Is there a recommended, efficient way to build sparse vectors? Should I try to create a single row SparseMatrix and then store that as a SparseVector?
My use case is reading in LibSVM files, in which there can be millions of very sparse features and billions of data points. I'm currently representing these as an std::vector<Eigen::SparseVector>. Perhaps I should just use SparseMatrix instead?
Edit: One thing I've tried is this:
// for every data point in a batch do the following:
Eigen::SparseMatrix<float> features(1, num_features);
// copy the data over
typedef Eigen::Triplet<float> T;
std::vector<T> tripletList;
for (int j = 0; j < num_batch_instances; ++j) {
for (size_t i = batch.offset[j]; i < batch.offset[j + 1]; ++i) {
uint32_t index = batch.index[i];
float fvalue = batch.value[i];
if (index < num_features) {
tripletList.emplace_back(T(0, index, fvalue));
}
}
features.setFromTriplets(tripletList.begin(), tripletList.end());
samples->emplace_back(Eigen::SparseVector<float>(features));
}
This creates a SparseMatrix using the triplet list approach, then creates a SparseVector from that object. In my experiments with ~1.4M features and very high sparsity this is 2 orders of magnitude slower than using SparseVector and coeffRef, which I definitely did not expect.

C++ AMP nested loop

I'm working on a project that requires massive parallel computing. However, the tricky problem is that, the project contains a nested loop, like this:
for(int i=0; i<19; ++i){
for(int j=0; j<57; ++j){
//the computing section
}
}
To achieve the highest gain, I need to parallelise those two levels of loops. Like this:
parallel_for_each{
parallel_for_each{
//computing section
}
}
I tested and found that AMP doesn't support nested for loops. Anyone have any idea on this problem? Thanks
You could, as @High Performance Mark suggests, collapse the two loops into one. However, you don't need to do this with C++ AMP because it supports 2- and 3-dimensional extents on arrays and array_views. You can then use an index as a multi-dimensional index.
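// Declared before use so the kernel below can call it.
float func(const float v) restrict(amp) { return v * v; }

array<float, 2> x(19,57);
// An array must be captured by reference; an array_view would be captured by value.
parallel_for_each(x.extent, [&x](index<2> idx) restrict(amp)
{
    x[idx] = func(x[idx]);
});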
You can access the individual sub-indices in idx using:
int row = idx[0];
int col = idx[1];
You should also consider the amount of work being done by computing section. If it is relatively small you may want to have each thread process more than one element of the array, x.
The following article is also worth reading because, just as on the CPU, loops that do not access memory efficiently can have a big impact on performance: Arrays are Row Major in C++ AMP
So collapse the loops:
for(int ij=0; ij<19*57; ++ij){
//if required extract i and j from ij: i = ij / 57, j = ij % 57
//the computing section
}

Fastest way to calculate distance between all rows in a dense eigen::matrix

I am trying to calculate the euclidean distance between every pair of rows in a 1000x1000 matrix using Eigen. What I have so far is something along these lines:
for (int i = 0; i < matrix.rows(); ++i){
VectorXd refRow = matrix.row(i);
for (int j = i+1; j < matrix.rows(); ++j){
VectorXd eleRow = matrix.row(j);
euclid_distance = (refRow - eleRow).lpNorm<2>();
...
}
}
My code includes other code here replaced with "..." but for testing the performance I have removed it.
Now I don't expect this to run at the speed of light, but it is taking a lot longer than I expected. Am I doing something wrong in my use of C++ or the Eigen library that might be slowing this down?
Is there any other preferred method?

How to parallelize a loop?

I'm using OpenMP with C++ and I want to parallelize a very simple loop, but I can't get it right: I always get the wrong result.
for(i=2;i<N;i++)
for(j=2;j<N;j++)
A[i,j] =A[i-2,j] +A[i,j-2];
Code:
int const N = 10;
int arr[N][N];
#pragma omp parallel for
for (int i = 0; i < N; i++)
for (int j = 0; j < N; j++)
arr[i][j] = 1;
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
printf_s("%d ",arr[i][j]);
printf("\n");
}
Do you have any suggestions how I can do it? Thank you!
Serial and parallel runs will give different results because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
you update arr while other threads are reading it, so one thread changes data that another thread depends on. This leads to a data race between reads and writes!
This
#pragma omp parallel for
for (int i = 2; i < N; i++)
for (int j = 2; j < N; j++)
{
arr[i][j] = arr[i-2][j] +arr[i][j-2];
}
is always going to be a source of grief and unpredictable output. The OpenMP runtime hands each thread a range of values for i and leaves them to it. There is no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,104,...,200, the program does not determine whether thread 1 updates row i = 100 before or after thread 2 wants to read the updated values in that row. You have written code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
{
old = 0
new = 1
arr[i][j][new] = arr[i-2][j][old] +arr[i][j-2][old];
old = 1
new = 0
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
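The colouring idea can be sketched roughly as follows. One observation that follows directly from the recurrence: the dependences preserve the parity of i and the parity of j separately, so for this particular loop nest the array splits into four independent sub-grids, one per (i mod 2, j mod 2) pair, rather than two. The names Grid, fill_serial and fill_by_parity are hypothetical, and the sketch assumes a square array held in a std::vector of rows.

```cpp
#include <vector>

using Grid = std::vector<std::vector<long long>>;

// Serial reference: the loop nest from the question.
void fill_serial(Grid& a) {
    const int n = static_cast<int>(a.size());
    for (int i = 2; i < n; ++i)
        for (int j = 2; j < n; ++j)
            a[i][j] = a[i - 2][j] + a[i][j - 2];
}

// Hypothetical parallel variant: each (i % 2, j % 2) class only ever
// reads elements of its own class (or untouched boundary elements),
// so the four classes can be processed by different threads.
void fill_by_parity(Grid& a) {
    const int n = static_cast<int>(a.size());
    #pragma omp parallel for schedule(static)
    for (int c = 0; c < 4; ++c) {
        const int i0 = 2 + (c / 2);  // first row of this class
        const int j0 = 2 + (c % 2);  // first column of this class
        // Within a class the original sequential order is kept.
        for (int i = i0; i < n; i += 2)
            for (int j = j0; j < n; j += 2)
                a[i][j] = a[i - 2][j] + a[i][j - 2];
    }
}
```

This caps the parallelism at four threads, and neighbouring elements of a row belong to different classes, so the cache-locality concern raised above still applies.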
If any other options occur to me I'll edit them in.
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
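A minimal, untiled sketch of the wavefront idea, assuming a square array stored as a std::vector of rows (the names Grid and fill_wavefront are hypothetical). Only the k = i + j coordinate is used here: every element on anti-diagonal k reads elements on anti-diagonal k - 2, so all elements that share the same k can be updated concurrently.

```cpp
#include <algorithm>
#include <vector>

using Grid = std::vector<std::vector<long long>>;

// Sweep the anti-diagonals k = i + j in increasing order.
// Elements on one diagonal are mutually independent.
void fill_wavefront(Grid& a) {
    const int n = static_cast<int>(a.size());
    for (int k = 4; k <= 2 * (n - 1); ++k) {
        // Keep both i and j = k - i inside [2, n - 1].
        const int lo = std::max(2, k - (n - 1));
        const int hi = std::min(n - 1, k - 2);
        #pragma omp parallel for schedule(static)
        for (int i = lo; i <= hi; ++i) {
            const int j = k - i;
            a[i][j] = a[i - 2][j] + a[i][j - 2];
        }
    }
}
```

Each diagonal opens a new parallel loop and walks memory with a large stride, which is part of why tiling is likely needed before this pays off.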
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if( too small to be worth parallelizing ) {
do serially
} else {
// Recursively:
Do upper left quadrant
Do lower left and upper right quadrants in parallel
Do lower right quadrant
}
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
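The recursive scheme above might look like this with OpenMP tasks. This is a sketch under assumptions: the names Grid, fill_block and fill_tasks are hypothetical, the array is a square std::vector of rows, and cutoff (assumed to be at least 1) is a tuning parameter that would have to be chosen by measurement.

```cpp
#include <vector>

using Grid = std::vector<std::vector<long long>>;

// Update the half-open block [i0, i1) x [j0, j1) in place.
void fill_block(Grid& a, int i0, int i1, int j0, int j1, int cutoff) {
    if (i1 - i0 <= cutoff || j1 - j0 <= cutoff) {
        // Too small to be worth parallelizing: do it serially.
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                a[i][j] = a[i - 2][j] + a[i][j - 2];
        return;
    }
    const int im = i0 + (i1 - i0) / 2;
    const int jm = j0 + (j1 - j0) / 2;

    fill_block(a, i0, im, j0, jm, cutoff);  // upper left first

    // Lower left and upper right only read from the upper left
    // quadrant and from themselves, so they may run concurrently.
    #pragma omp task shared(a)
    fill_block(a, im, i1, j0, jm, cutoff);
    #pragma omp task shared(a)
    fill_block(a, i0, im, jm, j1, cutoff);
    #pragma omp taskwait

    fill_block(a, im, i1, jm, j1, cutoff);  // lower right last
}

void fill_tasks(Grid& a, int cutoff) {
    const int n = static_cast<int>(a.size());
    #pragma omp parallel
    #pragma omp single
    fill_block(a, 2, n, 2, n, cutoff);
}
```

If the code is compiled without OpenMP support the pragmas are ignored and the recursion simply runs serially, which is a convenient way to check it against the original loop nest.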
If you are asking about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively modest vector parallelism (something like a 2x speedup or so) without changing the data structure or the codebase. This is possible using the OpenMP 4.0 or Cilk Plus pragma simd or similar (with safelen/vectorlength(2)).
You really do have a data dependence (in both the inner and outer loops), but it is a loop-carried read-after-write dependence with a distance of 2. That is a blocker for using «omp parallel for» «as is», but not necessarily a problem for «pragma omp simd» loops.
To make this work you will need an x86 compiler that supports pragma simd, either via OpenMP 4 or via Cilk Plus (a very recent gcc or the Intel compiler).
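As a rough illustration of the simd suggestion, assuming OpenMP 4.0 or later and an array stored as a std::vector of rows (Grid and fill_simd are hypothetical names): safelen(2) tells the compiler that iterations two or more apart must not be executed in the same SIMD chunk, which matches the dependence distance of 2 in the inner loop.

```cpp
#include <vector>

using Grid = std::vector<std::vector<long long>>;

void fill_simd(Grid& a) {
    const int n = static_cast<int>(a.size());
    for (int i = 2; i < n; ++i) {
        long long* row = a[i].data();
        const long long* above = a[i - 2].data();  // a different row, so no overlap
        // Iteration j reads row[j - 2], written two iterations earlier,
        // so only iterations j and j + 1 may be processed together.
        #pragma omp simd safelen(2)
        for (int j = 2; j < n; ++j)
            row[j] = above[j] + row[j - 2];
    }
}
```

Whether this gives any measurable speedup depends on the compiler and the element type; it is worth timing against the plain loop.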

Matrix Multiplication optimization via matrix transpose

I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get 8x improvement. However, I am only getting 2x ... what might I be doing wrong?
Full Source on GitHub
void transpose(int size, matrix m) {
int i, j;
for (i = 0; i < size; i++)
for (j = 0; j < size; j++)
std::swap(m.element[i][j], m.element[j][i]);
}
void mm(matrix a, matrix b, matrix result) {
int i, j, k;
int size = a.size;
long long before, after;
before = wall_clock_time();
// Do the multiplication
transpose(size, b); // transpose the matrix to reduce cache miss
for (i = 0; i < size; i++)
for (j = 0; j < size; j++) {
int tmp = 0; // save memory writes
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
}
after = wall_clock_time();
fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}
Am I doing things right so far?
FYI: The next optimization I need to do is use SIMD/Intel SSE3
Am I doing things right so far?
No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations it is always a good idea to use the naive but suboptimal implementation as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
Yet another optimization that will help the speedup is to cache some pointers. This is still quite slow:
for(k = 0; k < size; k++)
tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;
An optimizer might see a way around the pointer problems, but probably not. At least not if you don't use the nonstandard __restrict__ keyword to tell the compiler that your matrices don't overlap. Cache pointers so you don't have to do a.element[i], b.element[j], and result.element[i]. And it still might help to tell the compiler that these arrays don't overlap with the __restrict__ keyword.
Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using struct rather than class, malloc rather than new, typedef struct rather than just struct, C headers rather than C++ headers.
Because of your implementation of your struct matrix, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls mm(a, a, a_squared) to get the square of matrix a. You will get burned even worse if someone expects mm(a, a, a) to do an in-place computation of a^2.
Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.
Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
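One possible shape for such a contiguous representation, sketched with a hypothetical Matrix class that is not taken from the asker's code: the elements live in a single std::vector in row-major order, and operator() maps (row, column) to an offset.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal matrix type: one contiguous row-major buffer.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Element access as if this were a rows x cols 2D array.
    double& operator()(std::size_t r, std::size_t c) {
        return data_[r * cols_ + c];
    }
    double operator()(std::size_t r, std::size_t c) const {
        return data_[r * cols_ + c];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;  // owns the storage; copies are deep
};
```

Because std::vector owns the storage, the implicitly-defined copy constructor makes a real copy, which also avoids the naked-pointer problem described above. Storing rows and columns separately lets non-square matrices be represented and their dimensions checked.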
If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it does the transpose TWO times, resulting in no transpose at all. The j-loop should not read
j=0; j<size; j++
but
j=0; j<i; j++
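With that loop bound, a corrected in-place transpose could look like the sketch below. It assumes a square matrix stored as a std::vector of rows rather than the asker's matrix struct, and transpose_in_place is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Swap each element strictly below the diagonal with its mirror image.
// Visiting every (i, j) pair instead would swap each pair twice and
// leave the matrix unchanged.
void transpose_in_place(std::vector<std::vector<int>>& m) {
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = 0; j < i; ++j)
            std::swap(m[i][j], m[j][i]);
}
```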
Transposing is not necessary to avoid processing the elements of one of the factor-matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance-tuning, the basic loop-structure should be:
for (int i=0; i<size; i++)
{
for (int k=0; k<size; k++)
{
double tmp = a[i][k];
for (int j=0; j<size; j++)
{
result[i][j] += tmp * b[k][j];
}
}
}
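A self-contained version of that loop structure, as a sketch: Mat and multiply_ikj are hypothetical names, the matrices are plain vectors of rows to match the a[i][k] indexing above, and every row of a is assumed to have b.size() elements. The result must start at zero because the inner loop accumulates into it.

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// result = a * b with the loops ordered i, k, j, so that both b and
// result are traversed along rows in the innermost loop.
Mat multiply_ikj(const Mat& a, const Mat& b) {
    const std::size_t n = a.size();                      // rows of a
    const std::size_t m = b.size();                      // rows of b (= columns of a)
    const std::size_t p = b.empty() ? 0 : b[0].size();   // columns of b

    Mat result(n, std::vector<double>(p, 0.0));          // zero-initialized
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < m; ++k) {
            const double tmp = a[i][k];
            for (std::size_t j = 0; j < p; ++j)
                result[i][j] += tmp * b[k][j];
        }
    }
    return result;
}
```

Written this way the function also handles non-square operands, for example a 2x3 matrix times a 3x2 matrix.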