I have a 2D column major array on the host with padding, for example:
      |1 4 7|
      |2 5 8|
A_h = |3 6 9|
      |x x x|
      |x x x|
and I want to copy the data to device memory as 1D array:
{1, 2, 3, 4, 5, 6, 7, 8, 9} //preferred
or
{1, 2, 3, 4, 5, 6, 7, 8, 9, x, x, x, x, x, x}
What is the fastest and most effective way to achieve that using CUDA and/or Thrust?
Edit: I followed Robert's comment to remove the loop when using Thrust, but the code is only able to copy the first column. How can I make it work for the whole array without using a loop?
thrust::counting_iterator<int> first(0);
thrust::counting_iterator<int> last = first + rows;
thrust::device_vector<real_type> A_d(rows * cols);
thrust::copy(thrust::make_permutation_iterator(A_h, first),
thrust::make_permutation_iterator(A_h, last), A_d.begin());
If the use case is only copying a subset of a larger source into a smaller destination which isn't strided (so contiguous), then a conditional copy with a predicate is probably the simplest approach (I guess gather would also work). Something like this:
#include <vector>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
struct indexer
{
int lda0;
int lda1;
indexer() = default;
__device__ __host__
indexer(int l0, int l1) : lda0(l0), lda1(l1) {};
__device__ __host__
bool operator()(int x) {
int r = x % lda0;
return (r < lda1);
};
};
int main()
{
const int M0 = 5, N=3;
const int M1 = 3;
const int len1 = M1*N;
{
std::vector<int> data{ 1, 2, 3, -1, -1, 4, 5, 6, -1, -1, 7, 8, 9, -1, -1 };
thrust::device_vector<int> ddata = data;
thrust::device_vector<int> doutput(len1);
indexer pred(M0, M1);
thrust::counting_iterator<int> idx(0);
thrust::copy_if(ddata.begin(), ddata.end(), idx, doutput.begin(), pred);
for(int i=0; i<len1; i++) {
int val = doutput[i];
std::cout << i << " " << val << std::endl;
}
}
return 0;
}
Here the predicate selects only the first M1 entries of each column (of padded length M0) and copies them into a contiguous output range:
$ nvcc -arch=sm_52 -std=c++11 -o subset subset.cu
$ ./subset
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
If you want something more general (so strided input and output) then you could probably use the same idea with scatter_if. As noted in comments, this is trivially done with cudaMemcpy2D or a copy kernel.
I'm trying to use the MKL routine mkl_dcsradd to add an upper-triangular matrix to its transpose. In this case, the upper triangular matrix stores part of the adjacency matrix of a graph, and I need the full version for implementing another algorithm.
In this simplified example, I start with a list of (11) edges, and build an upper-triangular CSR matrix from it. I have checked that this much works. However, when I try to add it to its transpose, dcsradd stops on the final row, saying it's run out of space. This shouldn't be the case: an upper-triangular matrix with n non-zero entries and nothing on the diagonal, when added to its transpose, should result in a matrix with 2n (22) non-zeros.
When I supply dcsradd with a maximum non-zeros of 22, it fails, but when I supply it with 23 (an excessive value), it works correctly. Why is this?
I've simplified my code down to a minimal example demonstrating the error:
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <mkl.h>
int main()
{
int nnz = 11;
int numVertices = 10;
int32_t u[] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 1 };
int32_t v[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 8 };
double w[] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 };
int fullNnz = nnz * 2;
int dim = numVertices;
double triData[nnz];
int triCols[nnz];
int triRows[dim];
// COO to upper-triangular CSR
int info = -1;
int job [] = { 2, 1, 0, 0, nnz, 0 };
mkl_dcsrcoo(job, &dim,
triData, triCols, triRows,
&nnz, w, u, v,
&info);
printf("info = %d\n", info);
// Allocate final memory
double data[fullNnz];
int cols[fullNnz];
int rows[dim];
// B = A + A^T (to make a full adjacency matrix)
int request = 0, sort = 0;
double beta = 1.0;
int WRONG_NNZ = fullNnz + 1; // What is happening here?
mkl_dcsradd("t", &request, &sort, &dim, &dim,
triData, triCols, triRows,
&beta, triData, triCols, triRows,
data, cols, rows,
&WRONG_NNZ, &info);
printf("info = %d\n", info);
// Convert back to 0-based indexing (via Cilk)
cols[:]--;
rows[:]--;
printf("data:");
for (double d : data) printf("%.0f ", d);
printf("\ncols:");
for (int c : cols) printf("%d ", c);
printf("\nrows:");
for (int r : rows) printf("%d ", r);
printf("\n");
return 0;
}
I compile with:
icc -O3 -std=c++11 -xHost main.cpp -o main -openmp -L/opt/intel/composerxe/mkl/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm
When I give 22, the output is:
info = 0
info = 10
data:1 10 1 2 11 2 3 3 4 4 5 10 5 6 6 7 7 8 11 8 9 0
cols:1 5 0 2 8 1 3 2 4 3 5 0 4 6 5 7 6 8 1 7 9 -1
rows:0 2 5 7 9 11 14 16 18 21
But, when I give 23, the output is:
info = 0
info = 0
data:1 10 1 2 11 2 3 3 4 4 5 10 5 6 6 7 7 8 11 8 9 9
cols:1 5 0 2 8 1 3 2 4 3 5 0 4 6 5 7 6 8 1 7 9 8
rows:0 2 5 7 9 11 14 16 18 21
I want to loop over an array and, during each iteration, loop backwards over the previous 5 elements.
So given this array
int arr[24]={3, 1, 4, 1, 7, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4}
and this nested loop
for(int i=0;i<arr.size;i++)
{
for(int h=i-5; h<i; h++)
{
//things happen
}
}
So, if i=0, the second loop would go over the last few elements: 4,6,2,6,5.
How could you handle this?
I'm assuming that:
You only want to go over previous values (i.e. no wrap around).
You don't actually want arr to be a multi-dimensional array, as suggested by your choice of tags.
You want to include the current i in your five values.
This is just a small modification to your code that will do (what I think) you are asking:
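#include <algorithm> // std::max
int main()
{
    // Size is deduced from the initializer list
    int arr[] = {3, 1, 4, 1, 7, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6, 2, 6, 4};
    const int size = sizeof(arr) / sizeof(arr[0]);
    for(int i=0;i<size;i++)
    {
        for(int h = std::max(i-4, 0); h < i+1; h++)
        {
            //things happen with arr[h]
        }
    }
}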
Note the h = max(i-4, 0) and h < i+1. This shortens the inner loop near the start of the array so that it begins at index 0 at the earliest and runs up to and including i (at most four previous values plus i). h will always be within bounds.
The case where i==arr.size won't be a problem in the inner loop as the outer loop will terminate before that happens (i is always within bounds).
Edit: I saw this comment:
I want the first element to consider the last final 5 elements of the array though.
in which case, your loops should look like:
for(int i=0;i<arr.size;i++)
{
for(int h=0; h<5; h++)
{
int index = (i + arr.size - h) % arr.size;
//things happen
//access array with arr[index];
}
}
This should do what you want:
When i=0, h=0 index=(0+24-0)%24 which is 0. For h=1 we go one less, index=(0+24-1)%24 = 23 and so on for the next values of h.
The code gets the last 5 values, wrapping round, inclusive of the current value. (so will get 20,21,22,23,0 when i=0, 21,22,23,0,1 when i=1)
If you want the five before, non-inclusive, then inner loop should be:
for(int h=1; h<=5; h++)
Here is the output of the inclusive loop (h from 0 to 4) as it stands:
i 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 ... 22 22 22 22 22 23 23 23 23 23
h 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 ... 0 1 2 3 4 0 1 2 3 4
index 0 23 22 21 20 1 0 23 22 21 2 1 0 23 22 3 2 1 0 23 ... 22 21 20 19 18 23 22 21 20 19
I assume you want it to loop around (I don't know why). If so, use modulo:
int index = (h + arr.size) % arr.size;
Use the modulo operator:
const int array_length = sizeof(arr) / sizeof(arr[0]); // number of elements in arr
for (int i = 0; i < array_length; i++)
{
    for (int h = 5; h > 0; h--)
    {
        // Adding array_length keeps the left operand non-negative before taking the remainder
        int index = (i - h + array_length) % array_length;
        //things happen
    }
}
Is using an if statement not an option?
const int array_size = 24;
int arr[array_size] = { 1,3,4,5,...,2 };
for(int i=0;i<array_size;i++)
{
for(int h=i-5; h<i; h++)
{
int arr_index = (h >= 0) ? h : (array_size + h);
//do your things with arr[arr_index]
}
}
You may also start the nested loop with something like:
for(int h=i-min(i,5);h<i;++h)
{
}
which lets you process the first 5 cells as well, using only the elements that actually precede them. Also, if you are dealing with some kind of signal or image processing, consider extending arr to 29 elements, with 5 leading zeros (or whatever value is suitable), and starting the outer for-loop at index 5.
Just add an if statement in the nested loop. Something like this:
for( int h = i-5; h < i; h++ )
{
    if( h < 0 )
        continue; // skip indices before the start of the array
    // do stuff
}
Here is a code snippet below.
Input to program is
dimension d[] = {{4, 6, 7}, {1, 2, 3}, {4, 5, 6}, {10, 12, 32}};
PVecDim vecdim(new VecDim());
for (int i=0;i<sizeof(d)/sizeof(d[0]); ++i) {
vecdim->push_back(&d[i]);
}
getModList(vecdim);
Program:
class dimension;
typedef shared_ptr<vector<dimension*> > PVecDim;
typedef vector<dimension*> VecDim;
typedef vector<dimension*>::iterator VecDimIter;
struct dimension {
int height, width, length;
dimension(int h, int w, int l) : height(h), width(w), length(l) {
}
};
PVecDim getModList(PVecDim inList) {
PVecDim modList(new VecDim());
VecDimIter it;
for(it = inList->begin(); it!=inList->end(); ++it) {
dimension rot1((*it)->length, (*it)->width, (*it)->height);
dimension rot2((*it)->width, (*it)->height, (*it)->length);
cout<<"rot1 "<<rot1.height<<" "<<rot1.length<<" "<<rot1.width<<endl;
cout<<"rot2 "<<rot2.height<<" "<<rot2.length<<" "<<rot2.width<<endl;
modList->push_back(*it);
modList->push_back(&rot1);
modList->push_back(&rot2);
for(int i=0;i < 3;++i) {
cout<<(*modList)[i]->height<<" "<<(*modList)[i]->length<<" "<<(*modList)[i]->width<<" "<<endl;
}
}
return modList;
}
What I see is that the values rot1 and rot2 actually overwrite previous values.
For example that cout statement prints as below for input values defined at top. Can someone tell me why are these values being overwritten?
rot1 7 4 6
rot2 6 7 4
4 7 6
7 4 6
6 7 4
rot1 3 1 2
rot2 2 3 1
4 7 6
3 1 2
2 3 1
You are storing pointers to local variables when you do this kind of thing:
modList->push_back(&rot1);
These get invalidated every loop cycle. You could save yourself a lot of trouble by not storing pointers in the first place.
I'm stuck at one point and need some help.
I have a STL vector with the following values:
[1, 17, 2, 18, 3, 19, 1, 17, 2, 18, 3, 19, 1, 17, 2, 18, 3, 19].
Note that the first six values in the vector (i.e. 1, 17, 2, 18, 3, 19) can be considered as one block. So this vector has 3 blocks, each with the values described above.
Now, I want to organize this vector in the following way:
[1, 17, 1, 17, 1, 17, 2, 18, 2, 18, 2, 18, 3, 19, 3, 19, 3, 19].
So essentially I pick the first two values (i.e. 1, 17) from each block and store them sequentially 3 times (once per block, and there are 3 blocks in this case). I then go on to pick the next two values (i.e. 2, 18) and continue.
How do I achieve this?
Any help will be greatly appreciated.
Thanks
Sounds quite easy once you figure out the exact mapping. The outer loop runs over the chunks within a block (since that is the number of final groups), the middle loop goes over each original block, and the inner loop goes through every element of a chunk. The final result should be something like (untested):
std::vector<int> organized;
organized.reserve(data.size());
const int blockSize = 6;
const int subBlockSize = 2;
assert(data.size()%blockSize == 0 && blockSize%subBlockSize == 0);
const int blockCount = data.size()/blockSize;
const int subBlockCount = blockSize/subBlockSize;
for (int i = 0; i < subBlockCount; ++i)
    for (int j = 0; j < blockCount; ++j)
        for (int k = 0; k < subBlockSize; ++k)
            // copy the element itself, not its index
            organized.push_back(data[subBlockSize*i + blockSize*j + k]);
Just create a function shuffle(i) that takes an index into the new array, and returns an index from the original array:
#include <iostream>
#include <cstdlib>
using namespace std;
int list[] = { 1, 17, 2, 18, 3, 19, 1, 17, 2, 18, 3, 19, 1, 17, 2, 18, 3, 19 };
int shuffle( int i )
{
div_t a = div( i, 6 );
div_t b = div( a.rem, 2 );
return 2*a.quot + 6*b.quot + b.rem;
}
int main()
{
for( int i=0 ; i<18 ; ++i ) cout << list[shuffle(i)] << ' ';
cout << endl;
return 0;
}
This outputs:
1 17 1 17 1 17 2 18 2 18 2 18 3 19 3 19 3 19
Just allocate the new vector, and fill it from the old one:
new_vector[i] = list[shuffle(i)];