Take one column from a 2D array and store it in a 1D array - C++

I am trying to take this 9 x 3 array and store only its 3rd column in a separate 1D array:
3 5 8
6 3 9
7 5 12
0 5 5
1 2 3
8 2 10
8 3 11
9 3 12
4 1 5
This is what I have for a conversion:
int index = 0;
// 2D to 1D conversion
for (int r = 0; r < N; r++)
{
for (int c = 0; c < 3; c++)
{
end[index++] = start[r][c];
}
}
But it is giving me the first 9 numbers in the whole matrix:
3 5 8
6 3 9
7 5 12 (but vertically)
I need the 3rd column only and I don't know what I am doing wrong.

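You can try this. Drop the inner loop and fix the column index at 2 (the 3rd column, counting from zero), so each row contributes exactly one element: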
int index = 0;
// 2D to 1D conversion
for (int r = 0; r < N; r++)
{
end[index++] = start[r][2];
}

Related

Outputting Values from a 2D Array to a Text File

I am writing some C++ code and I currently have a function that reads some numbers from a text file and stores them in a 2D array. I now need to write the same numbers from the 2D array to a new text file. I have a function that writes the numbers, but the output is not in the same format as the input file, as you can see below.
Input File Format (space between each number)
1 2 3 4 5 6 7 8 9
2 3 4 5 6 7 8 9 1
3 4 5 6 7 8 9 1 2
4 5 6 7 8 9 1 2 3
5 6 7 8 9 1 2 3 4
6 7 8 9 1 2 3 4 5
7 8 9 1 2 3 4 5 6
8 9 1 2 3 4 5 6 7
9 1 2 3 4 5 6 7 8
Output File Format Currently
123456789234567891345678912456789123567891234678912345789123456891234567912345678 (this is all on one line)
My function to read in from the text file.
void Grid::LoadGrid(const char filename[])
{
ifstream file(filename);
for (int y = 0; y < 9; y++) {
for (int x = 0; x < 9; x++)
{
file >> m_grid[x][y];
}
}
file.close();
}
My function to write out to the text file. (m_grid is the 2D array)
void Grid::SaveGrid(const char filename[])
{
ofstream file(filename);
for (int y = 0; y < 9; y++) {
for (int x = 0; x < 9; x++)
{
file << m_grid[x][y];
}
}
file.close();
}
If anyone can help me output it to the text file so it appears the same as the input, I'd be very grateful.
Edit: Question has been answered.
After your inner loop completes, write a newline:
file << endl;
Also, in your inner loop you might want to write a space after each value:
file << m_grid[x][y] << " ";
Update the way you write to the file like this:
void Grid::SaveGrid(const char filename[])
{
ofstream file(filename);
for (int y = 0; y < 9; y++) {
for (int x = 0; x < 9; x++)
{
file << m_grid[x][y]; // write to the file stream, not to cout
file << " ";
}
file << endl; // end the row after its last value
}
file.close();
}
In your output function you should write a space after each value, and a newline character when you reach the end of a row.
I believe this would work:
void Grid::SaveGrid(const char filename[])
{
ofstream file(filename);
for (int y = 0; y < 9; y++)
{
for (int x = 0; x < 9; x++)
{
file << m_grid[x][y] << " "; // write a space after the value
if (x == 8) // x == 8 is the last column of the row
{
file << "\n";
}
}
}
file.close();
}
The output you are seeing is exactly what you told the program to do: each matrix element is written immediately after the previous one, and no whitespace is ever added to the output stream. The fix is simple. Adjust your function as follows:
void Grid::SaveGrid(const char filename[])
{
ofstream file(filename);
for (int y = 0; y < 9; y++) {
for (int x = 0; x < 9; x++)
{
file << m_grid[x][y] << " "; // and a space after each element
}
file << '\n'; // add a new line character after each row has been printed.
}
file.close();
}
Here's a simple working example that uses a std::vector<int> of your numbers and displays it to the console. The only difference here is that I'm using a 2D to 1D mapping for the indexing into the vector. Once you understand how this affects the formatting it should be trivial to convert it from an std::vector to a multidimensional array and to convert the iostream to a fstream output.
#include <iostream>
#include <vector>
int main() {
const std::vector<int> values{ 1,2,3,4,5,6,7,8,9,
2,3,4,5,6,7,8,9,1,
3,4,5,6,7,8,9,1,2,
4,5,6,7,8,9,1,2,3,
5,6,7,8,9,1,2,3,4,
6,7,8,9,1,2,3,4,5,
7,8,9,1,2,3,4,5,6,
8,9,1,2,3,4,5,6,7,
9,1,2,3,4,5,6,7,8
};
const int size = 9;
for (int row = 0; row < 9; row++) {
for (int col = 0; col < 9; col++) {
std::cout << values[col + size*row] << " ";
}
std::cout << '\n';
}
return 0;
}
-Output-
1 2 3 4 5 6 7 8 9
2 3 4 5 6 7 8 9 1
3 4 5 6 7 8 9 1 2
4 5 6 7 8 9 1 2 3
5 6 7 8 9 1 2 3 4
6 7 8 9 1 2 3 4 5
7 8 9 1 2 3 4 5 6
8 9 1 2 3 4 5 6 7
9 1 2 3 4 5 6 7 8

How do I save all the numbers from a string into a multi-dimensional array in c++?

I have to write a program that takes a completed sudoku board and saves only the numbers into a two-dimensional array (meaning the symbols used to separate the numbers, such as '-', '|', etc., can't be saved).
#include <iostream>
#include <string>
using namespace std;
int main()
{
int input[11] = { 0 };
int sudoku[9][9] = { 0 };
for (int line = 0; line <= 10; line++)
{
cin >> input[line];
}
system("PAUSE");
return 0;
}
This is the only working code I've got so far. I've tried different kinds of for loops to get this done but I can't figure why it doesn't work.
So I wanted to ask, is it even possible to save all the numbers of a string into a multi-dimensional array? And if it's not, where is my approach wrong or how could I solve this task?
One example of the input would be:
.5.1.4.|.8.6.9.|.7.2.3
.8.7.2.|.3.4.5.|.6.1.9
.9.6.3.|.2.1.7.|.5.4.8
-------|-------|-------
.6.2.8.|.1.3.4.|.9.5.7
.1.9.7.|.6.5.2.|.8.3.4
.4.3.5.|.7.9.8.|.1.6.2
-------|-------|-------
.2.4.6.|.9.7.1.|.3.8.5
.7.5.1.|.4.8.3.|.2.9.6
.3.8.9.|.5.2.6.|.4.7.1
One approach is to use regular expressions. This way the formatting of the sudoku board can change but you will still be able to parse out the numbers.
The reason I broke it into two for loops was to easily ignore the row that has no numbers in it.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
int main()
{
std::string line;
// this regular expression matches a single digit
std::regex exp("(\\d)");
std::smatch res;
int sudoku[9][9] = {{0}};
int row = 0;
for (int i = 0; i < 3; ++i)
{
for (int j = 0; j < 3; ++j)
{
// get a line of the board
std::getline(std::cin, line);
// search for the next digit in the line
for (int k = 0; std::regex_search(line, res, exp); ++k)
{
// convert the digit into an integer and store it in the board
sudoku[row][k] = std::stoi(res[0]);
// the rest of the line after the first match becomes the new
// line so that we can search for the next digit
line = res.suffix();
}
row += 1;
}
// skip the separator line that follows each group of three rows
std::getline(std::cin, line);
}
for (int i = 0; i < 9; ++i)
{
for (int j = 0; j < 9; ++j)
{
std::cout << sudoku[i][j] << " ";
}
std::cout << std::endl;
}
return 0;
}
For your example board, it produces this output:
5 1 4 8 6 9 7 2 3
8 7 2 3 4 5 6 1 9
9 6 3 2 1 7 5 4 8
6 2 8 1 3 4 9 5 7
1 9 7 6 5 2 8 3 4
4 3 5 7 9 8 1 6 2
2 4 6 9 7 1 3 8 5
7 5 1 4 8 3 2 9 6
3 8 9 5 2 6 4 7 1

Why is this code outputting so many numbers? [duplicate]

Starting with two arrays a and b, I am trying to output a matrix c with dimensions sizeof(a) and sizeof(b), whose entries are the product of every pair of the Cartesian product of a and b.
These products are also stored in a two-dimensional array c.
My code is below.
#include <iostream>
#include <string>
int main()
{
int a[]= { 1,2,3,4,5,5 };
int b[]= { 1,23,2,32,42,4 };
int c[sizeof(a)][sizeof(b)];
for (int i = 0; i < sizeof(a); i++) {
for (int j = 0; j < sizeof(b); j++) {
c[i][j] = a[i]* b[j] ;
std::cout << c[i][j] << " ";
}
std::cout << "\n";
}
return 0;
}
My output is:
1 23 2 32 42 4 -858993460 -858993460 1 2 3 4 5 5 -858993460 16710224 15543422 1 2161328 2122464 16710312 15543008 196436084 15536213
2 46 4 64 84 8 -1717986920 -1717986920 2 4 6 8 10 10 -1717986920 33420448 31086844 2 4322656 4244928 33420624 31086016 392872168 31072426
3 69 6 96 126 12 1717986916 1717986916 3 6 9 12 15 15 1717986916 50130672 46630266 3 6483984 6367392 50130936 46629024 589308252 46608639
...
This is just a small part of the output.
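sizeof(a) is not the number of elements in the array; it is the number of bytes required to store it.
Since each element (an int) is larger than one byte, sizeof(a) is larger than the element count, so the loops run past the end of a and b. To get the length, divide by the size of one element: sizeof(a) / sizeof(a[0]).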

Assigning a vector to a matrix column in Eigen

This question was asked in haste. The error in my original program was not the typo in the code displayed here. The error was that in my program v was not getting populated due to some conditions.
The more useful takeaway from this thread is the demonstration of copying a std::vector to all rows or columns of an Eigen Matrix, in the accepted answer.
I want to copy vectors into the columns of a matrix, like the following:
#include <Eigen/Dense>
#include <vector>
#include <iostream>
int main() {
int m = 10;
std::vector<Eigen::VectorXd> v(m);
Eigen::MatrixXd S(m,m);
for (int i = 0; i != m; ++i) {
v[i].resize(m);
for (int j = 0; j != m; ++j) {
v[i](j) = rand() % m;
}
//S.cols(i) = v[i]; //needed something like this
}
return 0;
}
S is of type Eigen::MatrixXd and dimension mxm. v is a std::vector of Eigen::VectorXd, where each Eigen::VectorXd is of size m and there are m of them in v.
Regarding the original question, you need to wrap the std::vector with an Eigen::Map. You could/should also make the operation a one-liner.
The reworded question is reduced to a typo. S.cols(i) should be S.col(i).
#include <Eigen/Dense>
#include <iostream>
#include <vector>
int main()
{
size_t sz = 6;
Eigen::MatrixXd S(sz, sz);
std::vector<double> v(sz);
std::vector<Eigen::VectorXd> vv(sz);
for(int i = 0; i < sz; i++)
{
v[i] = i*2;
vv[i] = Eigen::VectorXd::LinSpaced(sz, (i+sz), (i+sz)*2);
}
for (int i = 0; i != sz; ++i)
S.col(i) = vv[i];
std::cout << S << "\n\n";
S.rowwise() = Eigen::Map<Eigen::RowVectorXd>(v.data(), sz);
std::cout << S << "\n\n";
S.colwise() = Eigen::Map<Eigen::VectorXd>(v.data(), sz);
std::cout << S << "\n\n";
return 0;
}
which would output
6 7 8 9 10 11
7.2 8.4 9.6 10.8 12 13.2
8.4 9.8 11.2 12.6 14 15.4
9.6 11.2 12.8 14.4 16 17.6
10.8 12.6 14.4 16.2 18 19.8
12 14 16 18 20 22
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 2 4 6 8 10
0 0 0 0 0 0
2 2 2 2 2 2
4 4 4 4 4 4
6 6 6 6 6 6
8 8 8 8 8 8
10 10 10 10 10 10

Row-wise/column-wise operations on matrices with CUDA

I'm relatively new to CUDA programming. I have understood the programming model and have already written a few basic kernels. I know how to apply a kernel to each element of a matrix (stored as a 1D array), but now I'm trying to figure out how to apply the same operation to every row/column of the input matrix.
Let's say I have an MxN matrix and a vector of length N. I would like to add the vector to each row of the matrix (but it could be any other math operation).
The serial code of such operation is:
for (int c = 0; c < columns; c++)
{
for (int r = 0; r < rows; r++)
{
M[r * rows + c] += V[c];
}
}
Now the CUDA code for doing this operation should be quite straightforward: I should spawn as many cuda threads as the elements and apply this kernel:
__global__ void kernel(const unsigned int size, float* matrix, const float* vector)
{
// get the current element index for the thread
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size)
{
// add the vector element to the current matrix element
matrix[idx] += vector[threadIdx.x];
}
}
It runs but the result is not correct. Actually, it's correct if I transpose the matrix after the kernel completes its work. Unfortunately, I have no clue why it works in this way. Could you help me to figure out this problem? Thanks in advance.
EDIT #1
I launch the kernel using:
int block_size = 64;
int grid_size = (M * N + block_size - 1) / block_size;
kernel<<<grid_size, block_size>>>(M * N, matrix, vector);
EDIT #2
I solved the problem by fixing the CPU code as suggested by @RobertCrovella:
M[r * columns + c] += V[c];
With row-major storage, the row index has to be multiplied by the number of columns, not the number of rows.
The kernel shown in the question could be used without modification to sum a vector to each of the rows of a matrix (assuming c-style row-major storage), subject to certain limitations. A demonstration is here.
The main limitation of that approach is that the maximum vector length and therefore matrix width that can be handled is equal to the maximum number of threads per block, which on current CUDA 7-supported GPUs is 1024.
We can eliminate that limitation with a slight modification to the vector indexing, and by passing the row width (number of columns) as a parameter to the kernel. With this modification, we should be able to handle arbitrary matrix (and vector) sizes.
EDIT: based on discussion/comments, OP wants to know how to handle row-major or column-major underlying storage. The following example uses a templated kernel to select either row-major or column-major underlying storage, and also shows one possible CUBLAS method for doing an add-vector-to-each-matrix-row operation using the rank-1 update function:
$ cat t712.cu
#include <iostream>
#include <cublas_v2.h>
#define ROWS 20
#define COLS 10
#define nTPB 64
#define ROW_MAJOR 0
#define COL_MAJOR 1
template <int select, typename T>
__global__ void vec_mat_row_add(const unsigned int height, const unsigned int width, T* matrix, const T* vector)
{
// get the current element index for the thread
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < height*width)
{
// add the vector element for this column to the current matrix element
if (select == ROW_MAJOR)
matrix[idx] += vector[idx%width];
else // COL_MAJOR
matrix[idx] += vector[idx/height];
}
}
int main(){
float *h_mat, *d_mat, *h_vec, *d_vec;
const unsigned int msz = ROWS*COLS*sizeof(float);
const unsigned int vsz = COLS*sizeof(float);
h_mat = (float *)malloc(msz);
h_vec = (float *)malloc(vsz);
cudaMalloc(&d_mat, msz);
cudaMalloc(&d_vec, vsz);
for (int i=0; i<COLS; i++) h_vec[i] = i; // set vector to 0,1,2, ...
cudaMemcpy(d_vec, h_vec, vsz, cudaMemcpyHostToDevice);
// test row-major case
cudaMemset(d_mat, 0, msz); // set matrix to zero
vec_mat_row_add<ROW_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "Row-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[i*COLS+j] << " ";
std::cout << std::endl;}
// test column-major case
cudaMemset(d_mat, 0, msz); // set matrix to zero
vec_mat_row_add<COL_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "Column-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
std::cout << std::endl;}
// test CUBLAS, doing matrix-vector add using <T>ger
cudaMemset(d_mat, 0, msz); // set matrix to zero
float *d_ones, *h_ones;
h_ones = (float *)malloc(ROWS*sizeof(float));
for (int i =0; i<ROWS; i++) h_ones[i] = 1.0f;
cudaMalloc(&d_ones, ROWS*sizeof(float));
cudaMemcpy(d_ones, h_ones, ROWS*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t ch;
cublasCreate(&ch);
float alpha = 1.0f;
cublasStatus_t stat = cublasSger(ch, ROWS, COLS, &alpha, d_ones, 1, d_vec, 1, d_mat, ROWS);
if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;}
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "CUBLAS Column-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
std::cout << std::endl;}
return 0;
}
$ nvcc -o t712 t712.cu -lcublas
$ ./t712
Row-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
CUBLAS Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
$
For brevity of presentation, I've not included proper cuda error checking, but that is always a good idea any time you are having trouble with a CUDA code. As a proxy/shortcut, you can run your code with cuda-memcheck as a quick check to see if there are any CUDA errors.
Note that we expect all 3 printouts to be identical because that is actually the correct way to display the matrix, regardless of whether the underlying storage is row-major or column-major. The difference in underlying storage is accounted for in the for-loops handling the display output.
Robert Crovella has already answered this question, providing examples using explicit CUDA kernels and cuBLAS.
For future reference, I find it useful to also show an example of how to perform row-wise or column-wise operations using CUDA Thrust. In particular, I'm focusing on two problems:
Adding a column vector to all matrix columns;
Adding a row vector to all matrix rows.
The generality of thrust::transform makes it possible to extend the example below to elementwise operations other than the sum (e.g., multiplication, division, subtraction).
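#include <cstdio>
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/random.h>
#include <thrust/transform.h>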
using namespace thrust::placeholders;
/*************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX */
/*************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int Nrows = 10; // --- Number of rows
const int Ncols = 3; // --- Number of columns
// --- Random uniform integer distribution between 0 and 100
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist1(0, 100);
// --- Random uniform integer distribution between 1 and 4
thrust::uniform_int_distribution<int> dist2(1, 4);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist1(rng);
// --- Column vector allocation and initialization
thrust::device_vector<float> d_column(Nrows);
for (size_t i = 0; i < d_column.size(); i++) d_column[i] = (float)dist2(rng);
// --- Row vector allocation and initialization
thrust::device_vector<float> d_row(Ncols);
for (size_t i = 0; i < d_row.size(); i++) d_row[i] = (float)dist2(rng);
printf("\n\nOriginal matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
printf("\n\nColumn vector\n");
for(int i = 0; i < Nrows; i++) std::cout << d_column[i] << "\n";
printf("\n\nRow vector\n");
for(int i = 0; i < Ncols; i++) std::cout << d_row[i] << " ";
/*******************************************************/
/* ADDING THE SAME COLUMN VECTOR TO ALL MATRIX COLUMNS */
/*******************************************************/
thrust::device_vector<float> d_matrix2(d_matrix);
thrust::transform(d_matrix.begin(), d_matrix.end(),
thrust::make_permutation_iterator(
d_column.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols))),
d_matrix2.begin(),
thrust::plus<float>());
printf("\n\nColumn + Matrix -> Result matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix2[i * Ncols + j] << " ";
std::cout << "]\n";
}
/*************************************************/
/* ADDING THE SAME ROW VECTOR TO ALL MATRIX ROWS */
/*************************************************/
thrust::device_vector<float> d_matrix3(d_matrix);
thrust::transform(thrust::make_permutation_iterator(
d_matrix.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
thrust::make_permutation_iterator(
d_matrix.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)) + Nrows * Ncols,
thrust::make_permutation_iterator(
d_row.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows))),
thrust::make_permutation_iterator(
d_matrix3.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
thrust::plus<float>());
printf("\n\nRow + Matrix -> Result matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix3[i * Ncols + j] << " ";
std::cout << "]\n";
}
return 0;
}