Assigning a vector to a matrix column in Eigen - c++
This question was asked in haste. The error in my original program was not the typo in the code shown here; it was that v was never populated because of some conditions in my program.
The more useful takeaway from this thread is the accepted answer's demonstration of copying a std::vector to all rows or columns of an Eigen matrix.
I want to copy vectors into the columns of a matrix, like the following:
#include <Eigen/Dense>
#include <cstdlib>   // rand
#include <vector>
#include <iostream>
int main() {
    int m = 10;
    std::vector<Eigen::VectorXd> v(m);
    Eigen::MatrixXd S(m,m);
    for (int i = 0; i != m; ++i) {
        v[i].resize(m);
        for (int j = 0; j != m; ++j) {
            v[i](j) = rand() % m;
        }
        //S.cols(i) = v[i]; //needed something like this
    }
    return 0;
}
S is of type Eigen::MatrixXd and dimension mxm. v is a std::vector of Eigen::VectorXd, where each Eigen::VectorXd is of size m and there are m of them in v.
Regarding the original question, you need to wrap the std::vector with an Eigen::Map. You could (and probably should) also make the operation a one-liner, as in the rowwise() and colwise() assignments in the code that follows.
The reworded question reduces to a typo: S.cols(i) should be S.col(i).
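#include <Eigen/Dense>
#include <vector>
#include <iostream>
int main()
{
    const int sz = 6;
    Eigen::MatrixXd S(sz, sz);
    std::vector<double> v(sz);
    std::vector<Eigen::VectorXd> vv(sz);
    for (int i = 0; i < sz; i++)
    {
        v[i] = i * 2;
        vv[i] = Eigen::VectorXd::LinSpaced(sz, i + sz, (i + sz) * 2);
    }
    // Copy each Eigen vector into the matching column (col, not cols).
    for (int i = 0; i != sz; ++i)
        S.col(i) = vv[i];
    std::cout << S << "\n\n";
    // Broadcast the std::vector to every row of S.
    S.rowwise() = Eigen::Map<Eigen::RowVectorXd>(v.data(), sz);
    std::cout << S << "\n\n";
    // Broadcast the std::vector to every column of S.
    S.colwise() = Eigen::Map<Eigen::VectorXd>(v.data(), sz);
    std::cout << S << "\n\n";
    return 0;
}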
which would output

6 7 8 9 10 11
7.2 8.4 9.6 10.8 12 13.2
8.4 9.8 11.2 12.6 14 15.4
9.6 11.2 12.8 14.4 16 17.6
10.8 12.6 14.4 16.2 18 19.8
12 14 16 18 20 22

0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10

0 0 0 0 0 0
2 2 2 2 2 2
4 4 4 4 4 4
6 6 6 6 6 6
8 8 8 8 8 8
10 10 10 10 10 10
Related
take one column from a 2D array and store in 1D
I am trying to take this 9 x 3 array and store only its 3rd column in its own 1D array:
3 5 8
6 3 9
7 5 12
0 5 5
1 2 3
8 2 10
8 3 11
9 3 12
4 1 5
This is what I have for the conversion:
int index = 0;
// 2D to 1D conversion
for (int r = 0; r < N; r++) {
    for (int c = 0; c < 3; c++) {
        end[index++] = start[r][c];
    }
}
But it gives me the first 9 numbers of the whole matrix: 3 5 8 6 3 9 7 5 12 (but vertically). I need the 3rd column only and I don't know what I am doing wrong.
You can try this, which drops the inner loop and reads only column index 2 (the 3rd column) of each row:
int index = 0;
// 2D to 1D conversion
for (int r = 0; r < N; r++) {
    end[index++] = start[r][2];
}
How to relocate an element in one array in C++
I took this interview question and I failed, so I'm here to not fail again! I have an array of int with size 16 and a givenIndex with 5 < givenIndex < 10. I have to take the element at this index and print every possible array (there are 16) obtained by moving the element at givenIndex through every position in the array and shifting the rest of the elements. For example:
int array[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
int givenIndex = 6;
Since array[givenIndex] = 7, I need to move 7 to every possible position and print that array.
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
[7,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16]
[1,7,2,3,4,5,6,8,9,10,11,12,13,14,15,16]
[1,2,7,3,4,5,6,8,9,10,11,12,13,14,15,16]
And that's for 16 cases. What I was trying was:
for(int i = 0;i<16;i++){
    array[i] = array[indexInsercion]
    if (i<indexInsert){
        //right shift
        array[i] = array[i+1]
    }else if(i == indexInsert){
        //no shift
    }else{
        //left shift
        array[i] = array[i-1]
    }
}
Can I get some help?
We can only guess what the interviewer expected to see. If I were the interviewer, I would like to see that you keep things simple. This is code I think one can expect to be written from scratch in an interview situation:
#include <iostream>
#include <array>
template <size_t size>
void print_replaced(const std::array<int,size>& x,size_t index){
    for (int i=0;i<size;++i){
        for (int j=0;j<i;++j) {
            if (j == index) continue;
            std::cout << x[j] << " ";
        }
        std::cout << x[index] << " ";
        for (int j=i;j<size;++j) {
            if (j == index) continue;
            std::cout << x[j] << " ";
        }
        std::cout << "\n";
    }
}
int main() {
    std::array<int,16> x{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    print_replaced(x,6);
}
It is a first approach at the problem, with a loop that prints 16 different arrangements of the array elements. Printing each line follows simple logic: we print all elements before the position where the moved element should go, then the moved element, then the remaining elements. It is simple, but wrong. Its output is:
7 1 2 3 4 5 6 8 9 10 11 12 13 14 15 16
1 7 2 3 4 5 6 8 9 10 11 12 13 14 15 16
1 2 7 3 4 5 6 8 9 10 11 12 13 14 15 16
1 2 3 7 4 5 6 8 9 10 11 12 13 14 15 16
1 2 3 4 7 5 6 8 9 10 11 12 13 14 15 16
1 2 3 4 5 7 6 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 8 7 9 10 11 12 13 14 15 16
1 2 3 4 5 6 8 9 7 10 11 12 13 14 15 16
1 2 3 4 5 6 8 9 10 7 11 12 13 14 15 16
1 2 3 4 5 6 8 9 10 11 7 12 13 14 15 16
1 2 3 4 5 6 8 9 10 11 12 7 13 14 15 16
1 2 3 4 5 6 8 9 10 11 12 13 7 14 15 16
1 2 3 4 5 6 8 9 10 11 12 13 14 7 15 16
1 2 3 4 5 6 8 9 10 11 12 13 14 15 7 16
There is one line that appears twice and the last line is missing. As an interviewer I would not be surprised that the first attempt does not produce correct output. I don't care about that. That's not a minus. What I would care about is how you react to that. Do you know the next steps? Do you have a strategy to fix the wrong output? Or do you just panic because you didn't manage to write the correct code on the first attempt? This is what I would like to check in an interview, and then that's the end of the exercise. I would rather ask more, different questions than give you the time to fix all mistakes and write correct, well-tested code, because I know that this takes more time than we have in the interview. I'll leave it to you to fix the above code ;)
Here's a quick stab at it. Basically just keep track of where the given index should go and print it there as well as skip the original position it would be in.
#include <iostream>
int main()
{
    int array[16] = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 };
    int givenIndex = 6;
    for (int p = 0; p <= 16; ++p)
    {
        if (p != givenIndex)
        {
            std::cout << "[";
            for (int i = 0; i < 16; ++i)
            {
                if (i == p)
                {
                    if (i > 0) { std::cout << ","; }
                    std::cout << array[givenIndex];
                }
                if (array[i] != array[givenIndex])
                {
                    if (i > 0 || p == 0) { std::cout << ","; }
                    std::cout << array[i];
                }
            }
            if (p == 16) { std::cout << "," << array[givenIndex]; }
            std::cout << "]\n";
        }
    }
}
Output:
[7,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16]
[1,7,2,3,4,5,6,8,9,10,11,12,13,14,15,16]
[1,2,7,3,4,5,6,8,9,10,11,12,13,14,15,16]
[1,2,3,7,4,5,6,8,9,10,11,12,13,14,15,16]
[1,2,3,4,7,5,6,8,9,10,11,12,13,14,15,16]
[1,2,3,4,5,7,6,8,9,10,11,12,13,14,15,16]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
[1,2,3,4,5,6,8,7,9,10,11,12,13,14,15,16]
[1,2,3,4,5,6,8,9,7,10,11,12,13,14,15,16]
[1,2,3,4,5,6,8,9,10,7,11,12,13,14,15,16]
[1,2,3,4,5,6,8,9,10,11,7,12,13,14,15,16]
[1,2,3,4,5,6,8,9,10,11,12,7,13,14,15,16]
[1,2,3,4,5,6,8,9,10,11,12,13,7,14,15,16]
[1,2,3,4,5,6,8,9,10,11,12,13,14,7,15,16]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,7,16]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,7]
If the expectation is to just print the elements of the array in the given order:
Keep track of the current index of the array element to be printed, say indx.
- If the position of the element being processed is equal to the row number, print the element at givenIndex.
- If indx is equal to givenIndex, skip it and print the element at indx + 1; otherwise print the element at indx and increase indx by 1.
Implementation:
#include <iostream>
#include <array>
int main() {
    std::array<int, 16> array = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    std::size_t givenIndex = 6;
    for (std::size_t i = 0, indx = 0; i < array.size(); indx = 0, ++i) {
        std::cout << '[';
        for (std::size_t j = 0; j < array.size(); ++j) {
            if (j == i) {
                std::cout << array[givenIndex] << ',';
                continue;
            }
            if (indx == givenIndex) {
                ++indx;
            }
            std::cout << array[indx++] << ',';
        }
        std::cout << ']';
        std::cout << '\n';
    }
    return 0;
}
Output:
# ./a.out
[7,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,7,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,7,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,7,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,7,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,7,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,7,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,7,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,7,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,7,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,7,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,7,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,7,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,7,]
If the expectation is to alter the order of elements in the array and then print the array:
First move the element at givenIndex to the 0th index of the array, and then:
- Print the array.
- In every iteration, swap the current element with its next element in the array and print it.
Implementation:
#include <iostream>
#include <array>
#include <utility>
void print_array (std::array<int, 16>& array) {
    std::cout << '[';
    for (std::size_t indx = 0; indx < array.size(); ++indx) {
        std::cout << array[indx] << ',';
    }
    std::cout << ']';
    std::cout << '\n';
}
void rearrange_array_elem (std::array<int, 16>& array, std::size_t givenIndx) {
    // move the element at givenIndx to first position in array
    for (std::size_t j = givenIndx; j > 0; --j) {
        std::swap (array[j], array[j - 1]);
    }
    // print array
    print_array (array);
    for (std::size_t indx = 0; indx < array.size() - 1; ++indx) {
        // swap current element with its next element
        std::swap (array[indx], array[indx + 1]);
        print_array (array);
    }
}
int main() {
    std::array<int, 16> array = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    std::size_t givenIndex = 6;
    rearrange_array_elem (array, givenIndex);
    return 0;
}
Output:
# ./a.out
[7,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,7,2,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,7,3,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,7,4,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,7,5,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,7,6,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,7,9,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,7,10,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,7,11,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,7,12,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,7,13,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,7,14,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,7,15,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,7,16,]
[1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,7,]
g++ optimization breaks for loop
So, I have this code snippet here:
for (int t = WILSON_TEMPORAL_START, t_i = 0; t <= WILSON_TEMPORAL_END; t++, t_i++) {
    for (int r = WILSON_SPATIAL_START, r_i = 0; r <= WILSON_SPATIAL_END; r++, r_i++) {
        std::cout << r << " " << WILSON_SPATIAL_END << std::endl;
        wilson_loop[conf_id][t_i][r_i] = compute_wilson_loop(t, r, iu1, iu2);
    }
}
I run my g++ compiler with two different optimization levels, -O1 and -O2, but the terminal output differs.
With -O1:
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12
12 12
2 12
3 12
4 12
...
With -O2:
2 12
3 12
4 12
5 12
6 12
7 12
8 12
9 12
10 12
11 12
12 12
13 12
14 12
15 12
...
The code works fine if I change the inner loop to:
for (int t = WILSON_TEMPORAL_START, t_i = 0; t <= WILSON_TEMPORAL_END; t++, t_i++) {
    for (int r = WILSON_SPATIAL_START, r_i = 0; r <= WILSON_SPATIAL_END; r++, r_i++) {
        std::cout << r << " " << WILSON_SPATIAL_END << std::endl;
        // wilson_loop[conf_id][t_i][r_i] = compute_wilson_loop(t, r, iu1, iu2);
    }
}
Some useful definitions are:
static double compute_wilson_loop(int time_extend, int space_extend, SUN_mat iu1[VOL][DIM], SUN_mat iu2[VOL][DIM])
void wilson::measure_wilson_loop(SUN_mat pu[VOL][DIM], double wilson_loop[CLEN][TLEN][SLEN], int conf_id) { ... }
I know that I can find a workaround here, but I really want to understand why this happens.
I just solved the problem: I made an error in defining the preprocessor constants. I'm still not sure why the output changes with a different optimization level, but it works now :)
Why is this code outputting so many numbers? [duplicate]
This question already has answers here:
ARRAYSIZE C++ macro: how does it work? (7 answers)
c++ sizeof(array) return twice the array's declared length (6 answers)
Closed 3 years ago.
Starting with two arrays a and b, I am trying to output a matrix c with dimensions sizeof(a) and sizeof(b), whose entries are the products of every pair in the Cartesian product of a and b. These products are also stored in a two-dimensional array c. My code is below.
#include <iostream>
#include <string>
int main()
{
    int a[]= { 1,2,3,4,5,5 };
    int b[]= { 1,23,2,32,42,4 };
    int c[sizeof(a)][sizeof(b)];
    for (int i = 0; i < sizeof(a); i++) {
        for (int j = 0; j < sizeof(b); j++) {
            c[i][j] = a[i]* b[j] ;
            std::cout << c[i][j] << " ";
        }
        std::cout << "\n";
    }
    return 0;
}
My output is:
1 23 2 32 42 4 -858993460 -858993460 1 2 3 4 5 5 -858993460 16710224 15543422 1 2161328 2122464 16710312 15543008 196436084 15536213
2 46 4 64 84 8 -1717986920 -1717986920 2 4 6 8 10 10 -1717986920 33420448 31086844 2 4322656 4244928 33420624 31086016 392872168 31072426
3 69 6 96 126 12 1717986916 1717986916 3 6 9 12 15 15 1717986916 50130672 46630266 3 6483984 6367392 50130936 46629024 589308252 46608639
...
This is just a small part of the output.
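sizeof(a) is not the length of the array; it is the number of bytes required to store it. Since each element of the array occupies more than one byte (an int is typically 4 bytes), the two numbers differ. To get the element count, divide by the size of one element: sizeof(a) / sizeof(a[0]).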
Row-wise/column-wise operations on matrices with CUDA
I'm relatively new to CUDA programming. I have understood the programming model and have already written a few basic kernels. I know how to apply a kernel to each element of a matrix (stored as a 1D array), but now I'm trying to figure out how to apply the same operation to the same row/column of the input matrix.
Let's say I have an MxN matrix and a vector of length N. I would like to sum (but it can be any other math operation) the vector to each row of the matrix. The serial code of such an operation is:
for (int c = 0; c < columns; c++) {
    for (int r = 0; r < rows; r++) {
        M[r * rows + c] += V[c];
    }
}
Now the CUDA code for doing this operation should be quite straightforward: I should spawn as many CUDA threads as there are elements and apply this kernel:
__global__ void kernel(const unsigned int size, float* matrix, const float* vector) {
    // get the current element index for the thread
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        // sum the current element with the
        matrix[idx] += vector[threadIdx.x];
    }
}
It runs but the result is not correct. Actually, it's correct if I transpose the matrix after the kernel completes its work. Unfortunately, I have no clue why it works in this way. Could you help me to figure out this problem? Thanks in advance.
EDIT #1
I launch the kernel using:
int block_size = 64;
int grid_size = (M * N + block_size - 1) / block_size;
kernel<<<grid_size, block_size>>>(M * N, matrix, vector);
EDIT #2
I solved the problem by fixing the CPU code as suggested by @RobertCrovella:
M[r * columns + c] += V[c];
It should match the outer for, that is, over the columns.
The kernel shown in the question could be used without modification to sum a vector to each of the rows of a matrix (assuming c-style row-major storage), subject to certain limitations. A demonstration is here. The main limitation of that approach is that the maximum vector length and therefore matrix width that can be handled is equal to the maximum number of threads per block, which on current CUDA 7-supported GPUs is 1024. We can eliminate that limitation with a slight modification to the vector indexing, and passing the row width (number of columns) as a parameter to the matrix. With this modification, we should be able to handle arbitrary matrix (and vector) sizes. EDIT: based on discussion/comments, OP wants to know how to handle row-major or column major underlying storage. The following example uses a templated kernel to select either row-major or column major underlying storage, and also shows one possible CUBLAS method for doing a add-vector-to-each-matrix-row operation using rank-1 update function: $ cat t712.cu #include <iostream> #include <cublas_v2.h> #define ROWS 20 #define COLS 10 #define nTPB 64 #define ROW_MAJOR 0 #define COL_MAJOR 1 template <int select, typename T> __global__ void vec_mat_row_add(const unsigned int height, const unsigned int width, T* matrix, const T* vector) { // get the current element index for the thread unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < height*width) { // sum the current element with the if (select == ROW_MAJOR) matrix[idx] += vector[idx%width]; else // COL_MAJOR matrix[idx] += vector[idx/height]; } } int main(){ float *h_mat, *d_mat, *h_vec, *d_vec; const unsigned int msz = ROWS*COLS*sizeof(float); const unsigned int vsz = COLS*sizeof(float); h_mat = (float *)malloc(msz); h_vec = (float *)malloc(vsz); cudaMalloc(&d_mat, msz); cudaMalloc(&d_vec, vsz); for (int i=0; i<COLS; i++) h_vec[i] = i; // set vector to 0,1,2, ... 
cudaMemcpy(d_vec, h_vec, vsz, cudaMemcpyHostToDevice); // test row-major case cudaMemset(d_mat, 0, msz); // set matrix to zero vec_mat_row_add<ROW_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec); cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost); std::cout << "Row-major result: " << std::endl; for (int i = 0; i < ROWS; i++){ for (int j = 0; j < COLS; j++) std::cout << h_mat[i*COLS+j] << " "; std::cout << std::endl;} // test column-major case cudaMemset(d_mat, 0, msz); // set matrix to zero vec_mat_row_add<COL_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec); cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost); std::cout << "Column-major result: " << std::endl; for (int i = 0; i < ROWS; i++){ for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " "; std::cout << std::endl;} // test CUBLAS, doing matrix-vector add using <T>ger cudaMemset(d_mat, 0, msz); // set matrix to zero float *d_ones, *h_ones; h_ones = (float *)malloc(ROWS*sizeof(float)); for (int i =0; i<ROWS; i++) h_ones[i] = 1.0f; cudaMalloc(&d_ones, ROWS*sizeof(float)); cudaMemcpy(d_ones, h_ones, ROWS*sizeof(float), cudaMemcpyHostToDevice); cublasHandle_t ch; cublasCreate(&ch); float alpha = 1.0f; cublasStatus_t stat = cublasSger(ch, ROWS, COLS, &alpha, d_ones, 1, d_vec, 1, d_mat, ROWS); if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;} cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost); std::cout << "CUBLAS Column-major result: " << std::endl; for (int i = 0; i < ROWS; i++){ for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " "; std::cout << std::endl;} return 0; } $ nvcc -o t712 t712.cu -lcublas $ ./t712 Row-major result: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 Column-major result: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 CUBLAS Column-major result: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 $ For brevity of presentation, I've not included proper cuda error checking, but that is always a good idea any time you are having trouble with a CUDA code. As a proxy/shortcut, you can run your code with cuda-memcheck as a quick check to see if there are any CUDA errors. Note that we expect all 3 printouts to be identical because that is actually the correct way to display the matrix, regardless of whether the underlying storage is row-major or column-major. The difference in underlying storage is accounted for in the for-loops handling the display output.
Robert Crovella has already answered this question, providing examples using explicit CUDA kernels and cuBLAS. For future reference, I find it useful to also show an example of how to perform row-wise or column-wise operations using CUDA Thrust. In particular, I'm focusing on two problems:
Summing a column vector to all matrix columns;
Summing a row vector to all matrix rows.
The generality of thrust::transform makes it possible to generalize the example below to elementwise operations other than the sum (e.g., multiplications, divisions, subtractions etc.).
#include <cstdio>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/random.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/equal.h>
using namespace thrust::placeholders;
/*************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX */
/*************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
    T Ncols; // --- Number of columns
    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
    __host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
    /**************************/
    /* SETTING UP THE PROBLEM */
    /**************************/
    const int Nrows = 10; // --- Number of rows
    const int Ncols = 3; // --- Number of columns
    // --- Random uniform integer distribution between 0 and 100
    thrust::default_random_engine rng;
    thrust::uniform_int_distribution<int> dist1(0, 100);
    // --- Random uniform integer distribution between 1 and 4
    thrust::uniform_int_distribution<int> dist2(1, 4);
    // --- Matrix allocation and initialization
    thrust::device_vector<float> d_matrix(Nrows * Ncols);
    for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist1(rng);
    // --- Column vector allocation and initialization
    thrust::device_vector<float> d_column(Nrows);
    for (size_t i = 0; i < d_column.size(); i++) d_column[i] = (float)dist2(rng);
    // --- Row vector allocation and initialization
    thrust::device_vector<float> d_row(Ncols);
    for (size_t i = 0; i < d_row.size(); i++) d_row[i] = (float)dist2(rng);
    printf("\n\nOriginal matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix[i * Ncols + j] << " ";
        std::cout << "]\n";
    }
    printf("\n\nColumn vector\n");
    for(int i = 0; i < Nrows; i++) std::cout << d_column[i] << "\n";
    printf("\n\nRow vector\n");
    for(int i = 0; i < Ncols; i++) std::cout << d_row[i] << " ";
    /*******************************************************/
    /* ADDING THE SAME COLUMN VECTOR TO ALL MATRIX COLUMNS */
    /*******************************************************/
    thrust::device_vector<float> d_matrix2(d_matrix);
    thrust::transform(d_matrix.begin(), d_matrix.end(),
                      thrust::make_permutation_iterator(
                          d_column.begin(),
                          thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols))),
                      d_matrix2.begin(),
                      thrust::plus<float>());
    printf("\n\nColumn + Matrix -> Result matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix2[i * Ncols + j] << " ";
        std::cout << "]\n";
    }
    /*************************************************/
    /* ADDING THE SAME ROW VECTOR TO ALL MATRIX ROWS */
    /*************************************************/
    thrust::device_vector<float> d_matrix3(d_matrix);
    thrust::transform(thrust::make_permutation_iterator(
                          d_matrix.begin(),
                          thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
                      thrust::make_permutation_iterator(
                          d_matrix.begin(),
                          thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)) + Nrows * Ncols,
                      thrust::make_permutation_iterator(
                          d_row.begin(),
                          thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows))),
                      thrust::make_permutation_iterator(
                          d_matrix3.begin(),
                          thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
                      thrust::plus<float>());
    printf("\n\nRow + Matrix -> Result matrix\n");
    for(int i = 0; i < Nrows; i++) {
        std::cout << "[ ";
        for(int j = 0; j < Ncols; j++)
            std::cout << d_matrix3[i * Ncols + j] << " ";
        std::cout << "]\n";
    }
    return 0;
}