C++ AMP nested loop - c++

I'm working on a project that requires massive parallel computing. However, the tricky problem is that the project contains a nested loop, like this:
for (int i = 0; i < 19; ++i) {
    for (int j = 0; j < 57; ++j) {
        // the computing section
    }
}
To achieve the highest gain, I need to parallelise both levels of the loop, like this:
parallel_for_each {
    parallel_for_each {
        // computing section
    }
}
I tested and found that AMP doesn't support nested parallel_for_each calls. Does anyone have any ideas on this problem? Thanks.

You could, as @High Performance Mark suggests, collapse the two loops into one. However, you don't need to do this with C++ AMP because it supports 2- and 3-dimensional extents on arrays and array_views. You can then use an index as a multi-dimensional index.
float func(const float v) restrict(amp) { return v * v; }

array<float, 2> x(19, 57);
parallel_for_each(x.extent, [&x](index<2> idx) restrict(amp)
{
    x[idx] = func(x[idx]);
});
(Note that an array, unlike an array_view, has to be captured by reference in the lambda.)
You can access the individual sub-indices in idx using:
int row = idx[0];
int col = idx[1];
You should also consider the amount of work being done by the computing section. If it is relatively small, you may want to have each thread process more than one element of the array x.
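As an illustration, a minimal sketch of that idea (not part of the original answer), assuming a hypothetical std::vector<float> xdata holding the 19x57 values; each thread handles one whole row of 57 elements, applying the same squaring as func above:
std::vector<float> xdata(19 * 57);                  // hypothetical backing store
array_view<float, 2> xv(19, 57, xdata);
parallel_for_each(extent<1>(19), [=](index<1> idx) restrict(amp)
{
    for (int j = 0; j < 57; ++j)                    // each thread walks a whole row
        xv(idx[0], j) = xv(idx[0], j) * xv(idx[0], j);
});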
The following article is also worth reading: just as on the CPU, loops that do not access memory efficiently can have a big impact on performance. Arrays are Row Major in C++ AMP

So collapse the loops:
for (int ij = 0; ij < 19*57; ++ij) {
    int i = ij / 57;   // if required, recover i and j from ij
    int j = ij % 57;
    // the computing section
}

Related

How to efficiently initialize a SparseVector in Eigen

In the Eigen docs for filling a sparse matrix it is recommended to use the triplet filling method as it can be much more efficient than making calls to coeffRef, which involves a binary search.
For filling SparseVectors however, there is no clear recommendation on how to do it efficiently.
The suggested method in this SO answer uses coeffRef which means that a binary search is performed for every insertion.
Is there a recommended, efficient way to build sparse vectors? Should I try to create a single row SparseMatrix and then store that as a SparseVector?
My use case is reading in LibSVM files, in which there can be millions of very sparse features and billions of data points. I'm currently representing these as an std::vector<Eigen::SparseVector>. Perhaps I should just use SparseMatrix instead?
Edit: One thing I've tried is this:
// for every data point in a batch do the following:
Eigen::SparseMatrix<float> features(1, num_features);
// copy the data over
typedef Eigen::Triplet<float> T;
std::vector<T> tripletList;
for (int j = 0; j < num_batch_instances; ++j) {
    for (size_t i = batch.offset[j]; i < batch.offset[j + 1]; ++i) {
        uint32_t index = batch.index[i];
        float fvalue = batch.value[i];
        if (index < num_features) {
            tripletList.emplace_back(T(0, index, fvalue));
        }
    }
    features.setFromTriplets(tripletList.begin(), tripletList.end());
    samples->emplace_back(Eigen::SparseVector<float>(features));
}
This creates a SparseMatrix using the triplet list approach, then creates a SparseVector from that object. In my experiments with ~1.4M features and very high sparsity this is 2 orders of magnitude slower than using SparseVector and coeffRef, which I definitely did not expect.
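For reference, a minimal sketch of the single large SparseMatrix layout the question considers (one row per data point, one column per feature, filled with one setFromTriplets call); the offset/index/value parameters mirror the batch fields used above and are assumptions, not an established API:
#include <Eigen/Sparse>
#include <cstdint>
#include <vector>

Eigen::SparseMatrix<float, Eigen::RowMajor> build_batch(
    const std::vector<size_t>& offset,     // assumed, mirrors batch.offset
    const std::vector<uint32_t>& index,    // assumed, mirrors batch.index
    const std::vector<float>& value,       // assumed, mirrors batch.value
    int num_instances, int num_features)
{
    std::vector<Eigen::Triplet<float>> triplets;
    triplets.reserve(index.size());
    for (int row = 0; row < num_instances; ++row)
        for (size_t k = offset[row]; k < offset[row + 1]; ++k)
            if (index[k] < static_cast<uint32_t>(num_features))
                triplets.emplace_back(row, static_cast<int>(index[k]), value[k]);

    Eigen::SparseMatrix<float, Eigen::RowMajor> m(num_instances, num_features);
    m.setFromTriplets(triplets.begin(), triplets.end());   // single bulk insertion
    return m;
}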

filling only half of Matrix using OpenMp in C++

I have quite a big matrix. I would like to fill half of the matrix in parallel.
m_matrix is a 2D std::vector. Any suggestion for the type of container is appreciated as well. What the _fill(i,j) function does is not considered heavy compared to the size of the matrix.
// i: row
// j: column
for (size_t i = 1; i < num_row; ++i)
{
    for (size_t j = 0; j < i; ++j)
    {
        m_matrix[i][j] = _fill(i, j);
    }
}
What would be a nice OpenMP structure for that? I tried the dynamic schedule but the time even increased compared to the sequential mode.
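A minimal sketch of one possible structure (not from an answer in this thread), assuming m_matrix is already sized and an OpenMP 3.0+ compiler for the unsigned loop variable; since row i has i elements, a cyclic distribution of rows balances the load without the per-iteration overhead of schedule(dynamic):
// interleave rows across threads so the longer rows near the bottom are spread evenly
#pragma omp parallel for schedule(static, 1)
for (size_t i = 1; i < num_row; ++i)
    for (size_t j = 0; j < i; ++j)
        m_matrix[i][j] = _fill(i, j);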

Fastest way to calculate the abs()-values of a complex array

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.
Given that all loop iterations are independent, you can use the following code for parallelization:
#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
Of course, to use this you should enable OpenMP support when compiling your code (usually with the -fopenmp flag for GCC/Clang or /openmp for MSVC, or by setting the project options).
You can find several examples of OpenMP usage in wiki.
Or use Concurrency::parallel_for (from the PPL; #include <ppl.h>) like this:
Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});
Use vector operations.
If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.
Libmvec is vector math library added in Glibc 2.22.
Vector math library was added to support SIMD constructs of OpenMP4.0
(#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding
vector implementations of vector math functions.
Vector math functions are vector variants of corresponding scalar math
operations implemented using SIMD ISA extensions (e.g. SSE or AVX for
x86_64). They take packed vector arguments, perform the operation on
each element of the packed vector argument, and return a packed vector
result. Using vector math functions is faster than repeatedly calling
the scalar math routines.
Also, see Parallel for vs omp simd: when to use each?
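As a minimal illustration of the #pragma omp simd construct mentioned above, assuming the real and imaginary parts are stored in separate arrays; whether a given libm call is actually replaced by a libmvec vector variant depends on the glibc and compiler versions and on flags such as -ffast-math, so this sketch sticks to arithmetic and sqrtf, which vectorize directly:
#include <math.h>

// computes the magnitude of each complex value, split into separate re/im arrays
void cabs_simd(const float *re, const float *im, float *out, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; i++)
        out[i] = sqrtf(re[i] * re[i] + im[i] * im[i]);
}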
If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:
Description
These functions evaluate the function hypot(x, y) for an entire vector
of values at once. ...
The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c I don't recall if Sun Microsystems ever provided a Linux version of libmvec.so or not.
Using #pragma simd (even with -Ofast), or relying on the compiler's auto-vectorization, are more examples of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. In order to use SIMD efficiently for this you need an array of structs of arrays. For example, for single-precision floats with a SIMD width of 4 you could use:
// struct of arrays of four complex numbers
struct c4 {
    float x[4]; // real values of four complex numbers
    float y[4]; // imaginary values of four complex numbers
};
Here is code showing how you could do this with SSE for the x86 instruction set.
#include <stdio.h>
#include <x86intrin.h>

#define N 10

struct c4 {
    float x[4];
    float y[4];
};

static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4, x4), _mm_mul_ps(y4, y4)));
    _mm_storeu_ps(b, b4);
}

int main(void)
{
    int n4 = ((N+3)&-4)/4; // round N up to the next multiple of 4 and divide by 4
    printf("%d\n", n4);
    struct c4 a[n4]; // array of struct of arrays
    for(int i=0; i<n4; i++) {
        for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1; }
    }
    float b[4*n4];
    for(int i=0; i<n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for(int i = 0; i<N; i++) printf("%.2f ", b[i]);
    puts("");
}
It may help to unroll the loop a few times. In any case all this is moot for large N because the operation is memory bandwidth bound. For large N (meaning when the memory usage is much larger than the last level cache), although #pragma omp parallel may help some, the best solution is not to do this for large N. Instead do this in chunks which fit in the lowest level cache along with other compute operations. I mean something like this
for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive
}
I did not implement an array of structs of arrays here but it should be easy to adjust the code for that.
If you are using a modern compiler (GCC 5, for example), you can use Cilk Plus, which will give you a nice array notation, automatic usage of SIMD instructions, and parallelisation.
So, if you want to run them in parallel you would do:
#include <cilk/cilk.h>

cilk_for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
or if you want to test SIMD:
#pragma simd
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}
But, the nicest part of Cilk is that you can just do:
b[:] = cabs(a[:]);
In this case, the compiler and the runtime environment will decide at which level it should be SIMDed and what should be parallelised (the optimal way is applying SIMD on large-ish chunks in parallel).
Since this is decided by a work scheduler at runtime, Intel claims it is capable of providing near-optimal scheduling and of making optimal use of the cache.
Also, you can use std::future and std::async (they are part of C++11); maybe it's a clearer way of achieving what you want to do:
#include <future>
...
int main()
{
    ...
    // Create async calculations
    std::future<void> *futures = new std::future<void>[N];
    for (int i = 0; i < N; ++i)
    {
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    }
    // Wait for calculation of all async procedures
    for (int i = 0; i < N; ++i)
    {
        futures[i].get();
    }
    delete[] futures; // release the futures once all results are in
    ...
    return 0;
}
IdeOne live code
We first create asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is; I'm sure it doesn't matter.
Also, maybe you'll find this link useful: cplusplus.com

How to parallelize a loop?

I'm using OpenMP with C++ and I want to parallelize a very simple loop, but I can't do it correctly. Every time I get a wrong result.
for (i = 2; i < N; i++)
    for (j = 2; j < N; j++)
        A[i][j] = A[i-2][j] + A[i][j-2];
Code:
int const N = 10;
int arr[N][N];

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        arr[i][j] = 1;

#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }

for (int i = 0; i < N; i++)
{
    for (int j = 0; j < N; j++)
        printf_s("%d ", arr[i][j]);
    printf("\n");
}
Do you have any suggestions how I can do it? Thank you!
Serial and parallel runs will give different results, because in
#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
.....
you update arr[i][j], so you change data that is used by the other threads. This leads to a read-over-write data race!
This
#pragma omp parallel for
for (int i = 2; i < N; i++)
    for (int j = 2; j < N; j++)
    {
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
is always going to be a source of grief and unpredictable output. The OpenMP run time is going to hand each thread a range of values for i and leave them to it. There will be no determinism in the relative order in which threads update arr. For example, while thread 1 is updating elements with i = 2,3,4,5,...,100 (or whatever) and thread 2 is updating elements with i = 102,103,104,...,200, the program does not determine whether thread 1 updates the elements with i = 100 before or after thread 2 wants to use those updated values. You have written a code with a classic data race.
You have a number of options to fix this:
You could tie yourself in knots trying to ensure that the threads update arr in the right (ie sequential) order. The end result would be an OpenMP program that runs more slowly than the sequential program. DO NOT TAKE THIS OPTION.
You could make 2 copies of arr and always update from one to the other, then from the other to the one. Something like (very pseudo-code)
for ...
{
    old = 0
    new = 1
    arr[i][j][new] = arr[i-2][j][old] + arr[i][j-2][old];
    old = 1
    new = 0
}
Of course, this second approach trades space for time but that's often a reasonable trade-off.
You may find that adding an extra plane to arr doesn't immediately speed things up because it wrecks the spatial locality of values pulled into cache. Experiment a bit with this, possibly make [old] the first index element rather than the last.
Since updating each element in the array depends on the values found in elements 2 rows/columns away you're effectively splitting the array up like a chess-board, into white and black elements. You could use 2 threads, one on each 'colour', without the threads racing for access to the same data. Again, though, the disruption of spatial locality in the cache might have a bad impact on speed.
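A rough sketch of that chess-board idea (not part of the original answer), using the arr and N from the question: cells with even i+j depend only on even cells, and odd only on odd, since both i-2+j and i+j-2 keep the parity of i+j, so two OpenMP sections can each sweep their own colour in the original serial order:
#pragma omp parallel sections
{
    #pragma omp section
    for (int i = 2; i < N; i++)                      // "white" cells
        for (int j = 2; j < N; j++)
            if ((i + j) % 2 == 0) arr[i][j] = arr[i-2][j] + arr[i][j-2];

    #pragma omp section
    for (int i = 2; i < N; i++)                      // "black" cells
        for (int j = 2; j < N; j++)
            if ((i + j) % 2 == 1) arr[i][j] = arr[i-2][j] + arr[i][j-2];
}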
If any other options occur to me I'll edit them in.
To parallelize the loop nest in the question is tricky, but doable. Lamport's paper "The Parallel Execution of DO Loops" covers the technique. Basically you have to rotate your (i,j) coordinates by 45 degrees into a new coordinate system (k,l), where k=i+j and l=i-j.
Though to actually get speedup, the iterations likely have to be grouped into tiles, which makes the code even uglier (four nested loops).
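A rough sketch of the skewed (wavefront) loop without the tiling, again using the arr and N from the question: every element on anti-diagonal k = i + j depends only on elements of diagonal k - 2, so the elements within one diagonal can be updated in parallel while the diagonals themselves are processed in order:
for (int k = 4; k <= 2 * (N - 1); ++k)   // first updatable diagonal is i = j = 2
{
    #pragma omp parallel for
    for (int i = 2; i < N; ++i)
    {
        int j = k - i;
        if (j >= 2 && j < N)
            arr[i][j] = arr[i-2][j] + arr[i][j-2];
    }
}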
A completely different approach is to solve the problem recursively, using OpenMP tasking. The recursion is:
if( too small to be worth parallelizing ) {
    do serially
} else {
    // Recursively:
    Do upper left quadrant
    Do lower left and upper right quadrants in parallel
    Do lower right quadrant
}
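A possible OpenMP-task rendering of that recursion (a sketch only; the global arr, the value of N, and the 64-element cutoff are illustrative, not prescribed by the answer):
#include <cstdio>
#include <omp.h>

const int N = 512;
int arr[N][N];

// Fill arr[r0..r1)[c0..c1); the quadrant ordering guarantees the i-2 / j-2 sources are ready.
void compute(int r0, int r1, int c0, int c1)
{
    if (r1 - r0 <= 64 || c1 - c0 <= 64)              // too small to be worth parallelizing
    {
        for (int i = r0; i < r1; ++i)
            for (int j = c0; j < c1; ++j)
                if (i >= 2 && j >= 2)
                    arr[i][j] = arr[i-2][j] + arr[i][j-2];
        return;
    }
    int rm = (r0 + r1) / 2, cm = (c0 + c1) / 2;
    compute(r0, rm, c0, cm);                         // upper left quadrant
    #pragma omp task
    compute(rm, r1, c0, cm);                         // lower left ...
    #pragma omp task
    compute(r0, rm, cm, c1);                         // ... and upper right, in parallel
    #pragma omp taskwait
    compute(rm, r1, cm, c1);                         // lower right quadrant
}

int main()
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            arr[i][j] = 1;

    #pragma omp parallel
    #pragma omp single
    compute(0, N, 0, N);

    printf("%d\n", arr[N-1][N-1]);
}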
As a practical matter, the ratio of arithmetic operations to memory accesses is so low that it is going to be difficult to get speedup out of the example.
If you are asking about parallelism in general, then one more possible answer is vectorization. You could achieve some relatively poor vector parallelism (something like a 2x speedup or so) without changing the data structure or codebase. This is possible using OpenMP 4.0 or CilkPlus pragma simd or similar (with safelen/vectorlength(2)).
Well, you really do have data dependences (in both the inner and outer loops), but they are loop-carried dependences with a distance of 2: that blocks using "omp parallel for" as-is, but it is not necessarily a problem for "pragma omp simd" loops.
To make this work you will need an x86 compiler supporting pragma simd, either via OpenMP 4.0 or via CilkPlus (a very recent gcc or the Intel compiler).
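As a minimal illustration of the safelen/vectorlength(2) suggestion applied to the inner loop of the question (my sketch; it relies on the dependence distance of 2, so the compiler may combine at most two iterations per SIMD chunk):
for (int i = 2; i < N; i++)
{
    // OpenMP 4.0 form; the CilkPlus/ICC equivalent would be: #pragma simd vectorlength(2)
    #pragma omp simd safelen(2)
    for (int j = 2; j < N; j++)
        arr[i][j] = arr[i-2][j] + arr[i][j-2];
}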

Controlling the index variables in C++ AMP

I have just started trying C++ AMP and I decided to give it a shot with the current project I am working on. At some point, I have to build a distance matrix for the vectors I have and I have written the code below for this
unsigned int samplesize = samplelist.size();
unsigned int vs = samplelist.front().size();
vector<double> samplevec(samplesize*vs);
vector<double> distancevec(samplesize*samplesize, 0);
it1 = samplelist.begin();
for(int i = 0; i < samplesize; ++i){
    for(int j = 0; j < vs; ++j){
        samplevec[j + i*vs] = (*it1)[j];
    }
    ++it1;
}
array_view<const double, 2> samplearray(samplesize, vs, samplevec);
array_view<writeonly<double>, 2> distances(samplesize, samplesize, distancevec);
parallel_for_each(distances.grid, [=](index<2> idx) restrict(direct3d){
    double sqrsum = 0;
    double tempd = 0;
    for (unsigned int i = 0; i < vs; ++i)
    {
        tempd = samplearray(idx.x, i) - samplearray(idx.y, i);
        sqrsum += tempd*tempd;
    }
    distances[idx] = sqrsum;
});
However, as you can see, this does not take into account the symmetry of distance matrices. When I calculate sqrsum for the pair (i, j), I don't want to do the same calculation again when i and j are reversed. Is there any way to accomplish this? I came up with the following trick, but I don't know if it would bump up the performance significantly:
for (unsigned int i = 0; i < vs; ++i)
{
    if(idx.x <= idx.y){
        break;
    }
    tempd = samplearray(idx.x, i) - samplearray(idx.y, i);
    sqrsum += tempd*tempd;
}
Can the if-condition do the job? Or do you think the if statement would hurt performance unnecessarily? I couldn't come up with any alternative to it.
BTW, I just noticed that the above written code does not work on my machine, whose gpu only supports single precision. Is there anything to do to get around that problem? Error message is as follows:
"runtime_exception: Concurrency;;parallel_for_each uses features unsupported by the selected accelerator.
ID3D11Device::CreateComputeShader: Shader uses double precision float ops which are not supported on the current device."
I think you can eliminate the if-condition if you schedule only as many threads as you need, instead of scheduling the entire rectangle that covers your output matrix. What you need is the upper or lower triangle without the diagonal, whose size and index mapping you can calculate using an arithmetic sequence.
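A rough sketch of that first suggestion (not the answerer's code, and written against the released C++ AMP API with extent and restrict(amp) rather than the pre-release grid/restrict(direct3d) used in the question): one thread is launched per strict lower-triangle element, and (row, col) is recovered from the flat index with the arithmetic-sequence formula. samplevecf and distancevecf are assumed float copies of the question's vectors, which also sidesteps the double-precision issue; the diagonal simply keeps its initial value of 0.
unsigned int tri = samplesize * (samplesize - 1) / 2;   // elements strictly below the diagonal
array_view<const float, 2> samples(samplesize, vs, samplevecf);
array_view<float, 2> dist(samplesize, samplesize, distancevecf);
parallel_for_each(extent<1>(tri), [=](index<1> t) restrict(amp)
{
    // invert t = row*(row-1)/2 + col (with col < row) via the quadratic formula;
    // fine for moderate sizes, but float sqrt precision limits very large matrices
    int row = (int)((1.0f + fast_math::sqrtf(1.0f + 8.0f * t[0])) / 2.0f);
    int col = t[0] - row * (row - 1) / 2;
    float sqrsum = 0.0f;
    for (unsigned int i = 0; i < vs; ++i)
    {
        float d = samples(row, i) - samples(col, i);
        sqrsum += d * d;
    }
    dist(row, col) = sqrsum;
    dist(col, row) = sqrsum;   // mirror into the upper triangle
});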
The alternative would be to organize the input data into two 1D vectors; each thread would read a value from vector 1, then vector 2, calculate the distance, and store it in one of the input vectors.
Finally, the double precision error shows up because the card you are using does not support double precision operations. Please check your card's specification to confirm that. You can work around it by switching to the single precision type, i.e. "float", in the array_view template.