void vectorDeduplicator(std::vector<std::string>& inputVector){
for(int i = 0; i < inputVector.size() - 1; i++){
for(int x = 1; x <= inputVector.size() - 1; x++)
if(inputVector.at(i) == inputVector.at(x) && i != x){
inputVector.erase(inputVector.begin() + x);
}
}
}
Input: 1 1 2 2 4 4 3 3 1 1 3 3 3 2 2
Output: [1,2,4,1,3,2]
Above is the function I'm trying to use to remove duplicates from a vector, along with a sample input and the output I get. It works when duplicates are adjacent. I'd rather not use a faster, more efficient method that already exists in the standard library or elsewhere without understanding how it works. This is for learning purposes, so I'd like to learn the algorithm behind it.
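The problem is that you skip one value each time you erase: erase shifts the following elements one position to the left, so the element that moves into index x is never compared. You need to decrement x after erasing: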
#include <vector>
#include <iostream>
void vectorDeduplicator(std::vector<int>& inputVector)
{
for(std::size_t i = 0; i + 1 < inputVector.size(); i++) // i + 1 avoids unsigned wrap-around on an empty vector
{
for(std::size_t x = 1; x < inputVector.size(); x++)
{
if(inputVector.at(i) == inputVector.at(x) && i != x)
{
inputVector.erase(inputVector.begin() + x);
x--; // go one back because you erased one value
}
}
// to debug
for(const auto& x : inputVector)
std::cout << x << " ";
std::cout << std::endl;
}
}
int main(){
std::vector<int> vector{1, 1, 2, 2, 4, 4, 3, 3, 1, 1, 3, 3, 3, 2, 2};
vectorDeduplicator(vector);
// output
for(const auto& x : vector)
std::cout << x << " ";
return 0;
}
The output then is:
1 2 2 4 4 3 3 3 3 3 2 2
1 2 4 4 3 3 3 3 3
1 2 4 3 3 3 3 3
1 2 4 3
1 2 4 3
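For comparison, here is a minimal sketch of a different technique from the nested-loop fix above: a single pass that remembers which values have already been kept. It assumes int elements and uses std::unordered_set as the "seen" store. The name dedupKeepFirst is hypothetical and does not come from the question or the answer. The idea is to copy each first occurrence forward and cut off the tail at the end, so no erase in the middle of the vector is needed.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Removes later duplicates in place, keeping the first occurrence of each value.
// Hypothetical helper name; not part of the original answer.
void dedupKeepFirst(std::vector<int>& values)
{
    std::unordered_set<int> seen;   // values that have already been kept
    std::size_t write = 0;          // next slot for a kept value
    for (std::size_t read = 0; read < values.size(); ++read)
    {
        // insert().second is true only the first time a value is inserted
        if (seen.insert(values[read]).second)
        {
            values[write] = values[read];
            ++write;
        }
    }
    values.resize(write);           // drop the leftover tail
}
```

Calling dedupKeepFirst on the vector from main above should leave the same elements as the nested-loop version, in the same order.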
I am trying to generate a list of subsets from a set. For example, if I had n = 6, and r = 4, I would have 15 possible combinations which would be the following:
0 1 2 3
0 1 2 4
0 1 2 5
0 1 3 4
0 1 3 5
0 1 4 5
0 2 3 4
0 2 3 5
0 2 4 5
0 3 4 5
1 2 3 4
1 2 3 5
1 2 4 5
1 3 4 5
2 3 4 5
My current code works for the subsets above (n = 6, r = 4) and for any other case where n - r = 2. It does not work for anything else, and I'm having trouble debugging it because the code makes perfect sense to me. The code I have is the following:
int array[r];
int difference = n-r;
for(int i = 0; i < r; i++){
array[i] = i;
}
while (array[0] < difference){
print (array, r);
for(int i = r-1; i >= 0; i--){
if ((array[i] - i) == 0){
array[i] = array[i] + 1;
for (int j = i+1; j < r; j++){
array[j] = j + 1;
}
i = r;
}
else{
array[i] = array[i] + 1;
}
print (array, r);
}
}
}
To give some context, when I plug in n=6 and r=3, I am supposed to have 20 combinations as the output. Only 14 are printed, however:
0 1 2
0 1 3
0 1 4
0 2 3
0 2 4
0 3 4
1 2 3
1 2 4
1 3 4
2 3 4
2 3 4
2 3 5
2 4 5
3 4 5
It prints the first and last combinations correctly, but I need all of them printed correctly. After the third iteration the code starts failing: it goes from 0 1 4 to 0 2 3 when it should go to 0 1 5 instead. Any suggestions as to what I'm doing wrong?
Here's what I think you are trying to do. As far as I can tell, your main problem is that the main for loop should start over after incrementing an array element to a valid value, rather than continuing.
So this version only calls print in one place and uses break to get out of the main for loop. It also counts the combinations found.
#include <iostream>
void print(int array[], int r) {
for(int i=0; i<r; ++i) {
std::cout << array[i] << ' ';
}
std::cout << '\n';
}
int main() {
static const int n = 6;
static const int r = 3;
static const int difference = n-r;
int array[r];
for(int i = 0; i < r; i++) {
array[i] = i;
}
int count = 0;
while(array[0] <= difference) {
++count;
print(array, r);
for(int i=r-1; i>=0; --i) {
++array[i];
if(array[i] <= difference + i) {
for(int j=i+1; j<r; ++j) {
array[j] = array[j-1] + 1;
}
break;
}
}
}
std::cout << "count: " << count << '\n';
}
Outputs
0 1 2
0 1 3
0 1 4
0 1 5
0 2 3
0 2 4
0 2 5
0 3 4
0 3 5
0 4 5
1 2 3
1 2 4
1 2 5
1 3 4
1 3 5
1 4 5
2 3 4
2 3 5
2 4 5
3 4 5
count: 20
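As a possible follow-up, the same increment-and-reset loop can be wrapped in a function that takes n and r at run time and collects the results instead of printing them. This is only a sketch under assumptions: the function name combinations and the choice to return a vector of vectors are hypothetical, and it assumes 1 <= r <= n.

```cpp
#include <cassert>
#include <vector>

// Returns every r-element subset of {0, ..., n-1} in lexicographic order.
// Hypothetical helper; it reuses the increment-and-reset idea shown above.
std::vector<std::vector<int>> combinations(int n, int r)
{
    assert(r >= 1 && r <= n);
    const int difference = n - r;
    std::vector<int> current(r);
    for (int i = 0; i < r; ++i) current[i] = i;

    std::vector<std::vector<int>> result;
    while (current[0] <= difference)
    {
        result.push_back(current);
        // Walk from the last position towards the first, looking for one
        // that can still be incremented without exceeding its maximum.
        for (int i = r - 1; i >= 0; --i)
        {
            ++current[i];
            if (current[i] <= difference + i)
            {
                // Reset everything to the right to the smallest valid values.
                for (int j = i + 1; j < r; ++j) current[j] = current[j - 1] + 1;
                break;
            }
        }
    }
    return result;
}
```

For example, combinations(6, 3).size() gives the count that the program above prints, and each inner vector is one line of its output.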
I'm relatively new to CUDA programming. I understand the programming model and have already written a few basic kernels. I know how to apply a kernel to each element of a matrix (stored as a 1D array), but now I'm trying to figure out how to apply the same operation to each row or column of the input matrix.
Let's say I have an MxN matrix and a vector of length N. I would like to add the vector to each row of the matrix (addition is just an example; it could be any other math operation).
The serial code of such operation is:
for (int c = 0; c < columns; c++)
{
for (int r = 0; r < rows; r++)
{
M[r * rows + c] += V[c];
}
}
Now the CUDA code for doing this operation should be quite straightforward: I should spawn as many cuda threads as the elements and apply this kernel:
__global__ void kernel(const unsigned int size, float* matrix, const float* vector)
{
// get the current element index for the thread
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size)
{
// add the vector element to the current matrix element
matrix[idx] += vector[threadIdx.x];
}
}
It runs but the result is not correct. Actually, it's correct if I transpose the matrix after the kernel completes its work. Unfortunately, I have no clue why it works in this way. Could you help me to figure out this problem? Thanks in advance.
EDIT #1
I launch the kernel using:
int block_size = 64;
int grid_size = (M * N + block_size - 1) / block_size;
kernel<<<grid_size, block_size>>>(M * N, matrix, vector);
EDIT #2
I solved the problem by fixing the CPU code as suggested by @RobertCrovella:
M[r * columns + c] += V[c];
The index multiplier should match the outer for loop, that is, it should be the number of columns.
The kernel shown in the question could be used without modification to sum a vector to each of the rows of a matrix (assuming c-style row-major storage), subject to certain limitations. A demonstration is here.
The main limitation of that approach is that the maximum vector length and therefore matrix width that can be handled is equal to the maximum number of threads per block, which on current CUDA 7-supported GPUs is 1024.
We can eliminate that limitation with a slight modification to the vector indexing, and by passing the row width (number of columns) as a parameter to the kernel. With this modification, we should be able to handle arbitrary matrix (and vector) sizes.
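To make the index arithmetic concrete, here is a host-side sketch in plain C++ (not CUDA) in which each loop iteration stands in for one GPU thread. It assumes the vector index is the column of the element at flat position idx: idx % width for row-major storage and idx / height for column-major storage. The names addVectorToRows and Layout are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Layout { RowMajor, ColMajor };

// Adds vec[c] to every element in column c of a height x width matrix that is
// stored as a flat array. Each loop iteration plays the role of one thread.
void addVectorToRows(std::vector<float>& matrix, const std::vector<float>& vec,
                     std::size_t height, std::size_t width, Layout layout)
{
    assert(matrix.size() == height * width && vec.size() == width);
    for (std::size_t idx = 0; idx < height * width; ++idx)
    {
        // Column of the element stored at flat position idx.
        const std::size_t col = (layout == Layout::RowMajor) ? idx % width
                                                             : idx / height;
        matrix[idx] += vec[col];
    }
}
```

Because the column is derived from the global index rather than from threadIdx.x, the mapping does not depend on the block size, which is what removes the 1024-column limit.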
EDIT: based on discussion/comments, OP wants to know how to handle row-major or column-major underlying storage. The following example uses a templated kernel to select either row-major or column-major underlying storage, and also shows one possible CUBLAS method for doing an add-vector-to-each-matrix-row operation using the rank-1 update function:
$ cat t712.cu
#include <cstdlib>   // malloc
#include <iostream>
#include <cublas_v2.h>
#define ROWS 20
#define COLS 10
#define nTPB 64
#define ROW_MAJOR 0
#define COL_MAJOR 1
template <int select, typename T>
__global__ void vec_mat_row_add(const unsigned int height, const unsigned int width, T* matrix, const T* vector)
{
// get the current element index for the thread
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < height*width)
{
// add the matching vector element to the current matrix element
if (select == ROW_MAJOR)
matrix[idx] += vector[idx%width];
else // COL_MAJOR
matrix[idx] += vector[idx/height];
}
}
int main(){
float *h_mat, *d_mat, *h_vec, *d_vec;
const unsigned int msz = ROWS*COLS*sizeof(float);
const unsigned int vsz = COLS*sizeof(float);
h_mat = (float *)malloc(msz);
h_vec = (float *)malloc(vsz);
cudaMalloc(&d_mat, msz);
cudaMalloc(&d_vec, vsz);
for (int i=0; i<COLS; i++) h_vec[i] = i; // set vector to 0,1,2, ...
cudaMemcpy(d_vec, h_vec, vsz, cudaMemcpyHostToDevice);
// test row-major case
cudaMemset(d_mat, 0, msz); // set matrix to zero
vec_mat_row_add<ROW_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "Row-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[i*COLS+j] << " ";
std::cout << std::endl;}
// test column-major case
cudaMemset(d_mat, 0, msz); // set matrix to zero
vec_mat_row_add<COL_MAJOR><<<(ROWS*COLS + nTPB -1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "Column-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
std::cout << std::endl;}
// test CUBLAS, doing matrix-vector add using <T>ger
cudaMemset(d_mat, 0, msz); // set matrix to zero
float *d_ones, *h_ones;
h_ones = (float *)malloc(ROWS*sizeof(float));
for (int i =0; i<ROWS; i++) h_ones[i] = 1.0f;
cudaMalloc(&d_ones, ROWS*sizeof(float));
cudaMemcpy(d_ones, h_ones, ROWS*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t ch;
cublasCreate(&ch);
float alpha = 1.0f;
cublasStatus_t stat = cublasSger(ch, ROWS, COLS, &alpha, d_ones, 1, d_vec, 1, d_mat, ROWS);
if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;}
cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
std::cout << "CUBLAS Column-major result: " << std::endl;
for (int i = 0; i < ROWS; i++){
for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
std::cout << std::endl;}
return 0;
}
$ nvcc -o t712 t712.cu -lcublas
$ ./t712
Row-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
CUBLAS Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
$
For brevity of presentation, I've not included proper CUDA error checking, but that is always a good idea any time you are having trouble with CUDA code. As a proxy/shortcut, you can run your code with cuda-memcheck as a quick check to see if there are any CUDA errors.
Note that we expect all 3 printouts to be identical because that is actually the correct way to display the matrix, regardless of whether the underlying storage is row-major or column-major. The difference in underlying storage is accounted for in the for-loops handling the display output.
Robert Crovella has already answered this question providing examples using explicit CUDA kernels and cuBLAS.
For future reference, I find it useful to also show an example of how to perform row-wise or column-wise operations using CUDA Thrust. In particular, I'm focusing on two problems:
Summing a column vector to all matrix columns;
Summing a row vector to all matrix rows.
The generality of thrust::transform makes it possible to extend the example below to elementwise operations other than the sum (e.g., multiplication, division, subtraction).
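#include <cstdio>     // printf
#include <iostream>   // std::cout
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>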
#include <thrust/reduce.h>
#include <thrust/random.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/equal.h>
using namespace thrust::placeholders;
/*************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX */
/*************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int Nrows = 10; // --- Number of rows
const int Ncols = 3; // --- Number of columns
// --- Random uniform integer distribution between 0 and 100
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist1(0, 100);
// --- Random uniform integer distribution between 1 and 4
thrust::uniform_int_distribution<int> dist2(1, 4);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_matrix(Nrows * Ncols);
for (size_t i = 0; i < d_matrix.size(); i++) d_matrix[i] = (float)dist1(rng);
// --- Column vector allocation and initialization
thrust::device_vector<float> d_column(Nrows);
for (size_t i = 0; i < d_column.size(); i++) d_column[i] = (float)dist2(rng);
// --- Row vector allocation and initialization
thrust::device_vector<float> d_row(Ncols);
for (size_t i = 0; i < d_row.size(); i++) d_row[i] = (float)dist2(rng);
printf("\n\nOriginal matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix[i * Ncols + j] << " ";
std::cout << "]\n";
}
printf("\n\nColumn vector\n");
for(int i = 0; i < Nrows; i++) std::cout << d_column[i] << "\n";
printf("\n\nRow vector\n");
for(int i = 0; i < Ncols; i++) std::cout << d_row[i] << " ";
/*******************************************************/
/* ADDING THE SAME COLUMN VECTOR TO ALL MATRIX COLUMNS */
/*******************************************************/
thrust::device_vector<float> d_matrix2(d_matrix);
thrust::transform(d_matrix.begin(), d_matrix.end(),
thrust::make_permutation_iterator(
d_column.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Ncols))),
d_matrix2.begin(),
thrust::plus<float>());
printf("\n\nColumn + Matrix -> Result matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix2[i * Ncols + j] << " ";
std::cout << "]\n";
}
/*************************************************/
/* ADDING THE SAME ROW VECTOR TO ALL MATRIX ROWS */
/*************************************************/
thrust::device_vector<float> d_matrix3(d_matrix);
thrust::transform(thrust::make_permutation_iterator(
d_matrix.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
thrust::make_permutation_iterator(
d_matrix.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)) + Nrows * Ncols,
thrust::make_permutation_iterator(
d_row.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nrows))),
thrust::make_permutation_iterator(
d_matrix3.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0),(_1 % Nrows) * Ncols + _1 / Nrows)),
thrust::plus<float>());
printf("\n\nRow + Matrix -> Result matrix\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << d_matrix3[i * Ncols + j] << " ";
std::cout << "]\n";
}
return 0;
}
#include <iostream>
using namespace std;
int main() {
const int SIZE = 5;
double x[SIZE];
for(int i = 2; i <= SIZE; i++) {
x[i] = 0.0;
cout << i << endl;
}
}
Output:
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
0
1
2
3
4
...
If SIZE is initialized to a different value, the loop counter runs until it is one short of that value and then resets to zero. If the array x is changed to type int, the loop does not get stuck. If the value assigned to x[i] is changed to any non-zero number, the value of i is changed to garbage during the last run of the loop.
#include <iostream>
using namespace std;
int main() {
const int SIZE = 5;
double x[SIZE];
for(int i = 2; i <= SIZE; i++) {
x[i] = 1;
cout << i << endl;
}
}
Output:
2
3
4
1072693248
#include <iostream>
using namespace std;
int main() {
const int SIZE = 5;
int x[SIZE];
for(int i = 2; i <= SIZE; i++) {
x[i] = 1;
cout << i << endl;
}
}
Output:
2
3
4
5
You are writing past the end of the x array. Valid indices for x range from 0 to SIZE - 1 (that is, 4), and you let your index i reach SIZE.
So the behavior is undefined, and coincidentally you are overwriting i when you write x[5].
Use a debugger. It's your friend.
for(int i = 2; i < SIZE; i++) // i <= SIZE will write beyond the array
Your current array is of size 5. Arrays are 0 indexed:
1st element         last element
0    1    2    3    4
You're iterating past the end of your array (i <= 5), which is undefined behavior.
Your end condition is wrong. Use i < SIZE
#include <iostream>
using namespace std;
int main() {
const int SIZE = 5;
double x[SIZE];
for(int i = 2; i < SIZE; i++) {
x[i] = 0.0;
cout << i << endl;
}
}
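As an aside, a bounds-checked container can turn this kind of silent overwrite into a visible error. The sketch below is an illustration rather than part of the answers above: it swaps the raw array for std::array and uses at(), which throws std::out_of_range for an invalid index, while deliberately keeping the faulty i <= SIZE condition. The function name firstBadIndex is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Returns the first index at which the faulty "i <= SIZE" loop would write
// out of bounds, or SIZE + 1 if no out-of-bounds write happens.
// Hypothetical helper written for illustration.
std::size_t firstBadIndex()
{
    constexpr std::size_t SIZE = 5;
    std::array<double, SIZE> x{};
    for (std::size_t i = 2; i <= SIZE; i++)   // same faulty condition as the question
    {
        try
        {
            x.at(i) = 0.0;                    // at() checks the index, operator[] does not
        }
        catch (const std::out_of_range&)
        {
            return i;                         // reached when i == SIZE
        }
    }
    return SIZE + 1;
}
```

With the corrected condition i < SIZE, at() never throws and the function would fall through to the final return.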
I looked up in many places and tried to understand how to get arbitrary number of nested for loops via recursion. But what I have understood is clearly wrong.
I need to generate coordinates in an n-dimensional space, in a grid-pattern. The actual problem has different coordinates with different ranges, but to get simpler things right first, I have used the same, integer-stepped coordinate ranges in the code below.
#include <iostream>
using namespace std;
void recursion(int n);
int main(){
recursion(3);
return 0;
}
void recursion(int n)
{
if(n!=0){
for(int x=1; x<4; x++){
cout<<x<<" ";
recursion(n-1);
}
}
else cout<<endl;
}
I want, and was expecting the output to be:
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 3 2
1 3 3
2 1 1
2 1 2
2 1 3
2 2 1
2 2 2
2 2 3
2 3 1
2 3 2
2 3 3
3 1 1
3 1 2
3 1 3
3 2 1
3 2 2
3 2 3
3 3 1
3 3 2
3 3 3
Instead, the output I'm getting is
1 1 1
2
3
2 1
2
3
3 1
2
3
2 1 1
2
3
2 1
2
3
3 1
2
3
3 1 1
2
3
2 1
2
3
3 1
2
3
I just can't figure out what's wrong. Any help figuring out the mistake, or even another way to generate the coordinates, will be greatly appreciated. Thanks!
Non-recursive solution based on add-with-carry:
#include <iostream>
using namespace std;
bool addOne(int* indices, int n, int ceiling) {
for (int i = 0; i < n; ++i) {
if (++indices[i] <= ceiling) {
return true;
}
indices[i] = 1;
}
return false;
}
void printIndices(int* indices, int n) {
for (int i = n-1; i >= 0; --i) {
cout << indices[i] << ' ';
}
cout << '\n';
}
int main() {
int indices[3];
for (int i=0; i < 3; ++i) {
indices[i] = 1;
}
do {
printIndices(indices, 3);
} while (addOne(indices, 3, 3));
return 0;
}
Recursive solution, salvaged from your original code:
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
void recursion(int n, const string& prefix);
int main(){
recursion(3, "");
return 0;
}
void recursion(int n, const string& prefix)
{
if (n!=0) {
for(int x=1; x<4; x++){
ostringstream os;
os << prefix << x << ' ';
recursion(n-1, os.str());
}
}
else cout << prefix << endl;
}
Per Igor's comment, you need an increment function.
Let's use a std::vector to hold the coordinates, one element per dimension. That is, vector[0] is the first dimension, vector[1] is the second dimension, and so on.
Using a vector allows us to determine the number of dimensions without any hard coded numbers. The vector.size() will be the number of dimensions.
Here is a function to get you started:
void Increment_Coordinate(std::vector<int>& coordinates,
int max_digit_value,
int min_digit_value)
{
unsigned int digit_index = 0;
bool apply_carry = false;
do
{
apply_carry = false;
coordinates[digit_index]++; // Increment the value in a dimension.
if (coordinates[digit_index] > max_digit_value)
{
// Reset the present dimension value
coordinates[digit_index] = min_digit_value;
// Apply carry to next column by moving to the next dimension.
++digit_index;
apply_carry = true;
}
} while (apply_carry);
return;
}
Edit 1
This is only a foundation. The function needs bounds checking: as written, digit_index can run past the end of the vector when the last dimension overflows.
This function does not support dimensions of varying sizes. That is left as an exercise for the reader or OP.
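One possible way to approach that exercise is sketched below. It is an assumption-laden variant, not the function above: it takes per-dimension inclusive ranges in two extra vectors (the hypothetical parameters mins and maxs), stops at the end of the vector instead of running past it, and returns false once every dimension has wrapped around so the caller knows when to stop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Advances `coordinates` to the next grid point, treating index 0 as the
// fastest-changing dimension. mins[d] and maxs[d] are the inclusive range of
// dimension d. Returns false once every dimension has wrapped around, that is,
// after the last grid point. All names here are hypothetical.
bool incrementCoordinate(std::vector<int>& coordinates,
                         const std::vector<int>& mins,
                         const std::vector<int>& maxs)
{
    assert(coordinates.size() == mins.size() && mins.size() == maxs.size());
    for (std::size_t d = 0; d < coordinates.size(); ++d)
    {
        if (coordinates[d] < maxs[d])
        {
            ++coordinates[d];          // no carry needed, done
            return true;
        }
        coordinates[d] = mins[d];      // reset and carry into the next dimension
    }
    return false;                      // carried out of the last dimension
}
```

A caller would typically start from a copy of mins and loop with do { ... } while (incrementCoordinate(coordinates, mins, maxs)); to visit every grid point once.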