I wrote some pretty simple GPU code here in CUDA C to copy an array, nums, into an array, vals. Nums is [4,7,1,9,2]. This is how I wanted to copy each element over:
__global__ void makeArray(int*);

int main()
{
    int* d_nums;
    int nums[5];
    nums[0] = 4;
    nums[1] = 7;
    nums[2] = 1;
    nums[3] = 9;
    nums[4] = 2;
    cudaMalloc(&d_nums, sizeof(int)*5);
    makeArray<<<2,16>>>(d_nums);
    cudaMemcpy(nums, d_nums, sizeof(int)*5, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; i++)
        cout << i << " " << nums[i] << endl;
    return 0;
}

__global__ void makeArray(int* nums)
{
    int vals[5];
    int threadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    vals[threadIndex%5] = nums[threadIndex%5];
    __syncthreads();
    if (threadIndex < 5)
        nums[threadIndex] = vals[threadIndex];
}
In the long run, I want to transfer an array from the CPU to the GPU shared memory using this method, but I can't even get this simple practice file to work. I'm expecting the output to look something like this:
0 4
1 7
2 1
3 9
4 2
But I'm getting this:
0 219545856
1 219546112
2 219546368
3 219546624
4 219546880
My thought process is that by taking the thread index modulo the array length (since the number of threads is greater than the number of elements in this array), I can cover all 5 data points without worrying about over-reading the array. I can also assign each array slot at the same time, one per thread, and then __syncthreads() at the end to make sure every thread is done copying. Clearly, that isn't working. Help!
After your edit, we can see d_nums points to uninitialised memory. You just allocated it and didn't fill it with anything. If you want data accessible to the GPU, you have to copy it:
cudaMemcpy(d_nums, nums, sizeof(nums), cudaMemcpyHostToDevice);
before you run the kernel.
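For reference, a minimal sketch of the host side with that copy in place (everything else kept as you posted it, with your makeArray kernel unchanged in the same file):
#include <iostream>

__global__ void makeArray(int*);

int main()
{
    int* d_nums;
    int nums[5] = {4, 7, 1, 9, 2};

    cudaMalloc(&d_nums, sizeof(int) * 5);
    // Copy the host data to the device *before* the kernel reads it.
    cudaMemcpy(d_nums, nums, sizeof(nums), cudaMemcpyHostToDevice);

    makeArray<<<2, 16>>>(d_nums);

    cudaMemcpy(nums, d_nums, sizeof(int) * 5, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 5; i++)
        std::cout << i << " " << nums[i] << std::endl;
    return 0;
}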
I am new to CUDA.
I have two arrays:
int* AA = new int[5]{1,2,3,4,5};
int* BB = new int[5]{ 2,2,2,4,4 };
and for each element of BB I want to find the index of the equal element in AA, which in this case is
{1,1,1,3,3}
here is My code:
__global__ void findIndex(int* A, int* B, int* C)
{
    int i = threadIdx.x;
    for (int j = 0; j < 5; j++)
    {
        if (B[i] == A[j])
        {
            C[i] = j;
        }
    }
}

int main() {
    int* AA = new int[5]{1,2,3,4,5};
    int* BB = new int[5]{ 2,2,2,4,4 };
    int* CC = new int[5]{ 0,0,0,0,0 };
    int(*ppA), (*ppB), (*ppC);
    cudaMalloc((void**)&ppA, (5) * sizeof(int));
    cudaMalloc((void**)&ppB, (5) * sizeof(int));
    cudaMalloc((void**)&ppC, (5) * sizeof(int));
    cudaMemcpy(ppA, AA, 5 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ppB, BB, 5 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ppC, CC, 5 * sizeof(int), cudaMemcpyHostToDevice);
    int numBlocks = 1;
    dim3 threadsPerBlock(5);
    findIndex<<<numBlocks, threadsPerBlock>>>(ppA, ppB, ppC);
    cudaMemcpy(CC, ppC, 5 * sizeof(int), cudaMemcpyDeviceToHost);
    for (int m = 0; m < 5; m++) {
        printf("%d ", CC[m]);
    }
}
My output is:
{1,2,3,0,0}
Can anyone help?
The simplest non-stable single-GPU solution would be to use atomics, something like this:
__global__ void find(int * arr, int * counter, int * result)
{
    int id = blockIdx.x*blockDim.x+threadIdx.x;
    if(arr[id] == 4)
    {
        int ctr = atomicAdd(counter,1);
        result[ctr] = id;
    }
}
This way you get an array of results in the "result" array, and if the wanted number is sparse (only a few occurrences in the whole source array) it won't slow things down much. This is not an optimal way for multi-GPU systems, though; it requires host-side coordination between GPUs, unless a special CUDA feature from the newest toolkit is used (system-level atomics).
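For completeness, a small host-side harness I would put around that kernel (my own sketch, not part of the original answer; the kernel is repeated so the snippet compiles on its own, and the detail that is easy to miss is zeroing the counter before the launch):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void find(int* arr, int* counter, int* result)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (arr[id] == 4)
    {
        int ctr = atomicAdd(counter, 1);
        result[ctr] = id;
    }
}

int main()
{
    const int n = 1024;                          // hypothetical size, a multiple of the block size
    int h_arr[n];
    for (int i = 0; i < n; i++)
        h_arr[i] = (i % 100 == 0) ? 4 : 1;       // sparse 4s

    int *d_arr, *d_counter, *d_result;
    cudaMalloc(&d_arr, n * sizeof(int));
    cudaMalloc(&d_counter, sizeof(int));
    cudaMalloc(&d_result, n * sizeof(int));      // worst case: every element matches
    cudaMemcpy(d_arr, h_arr, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_counter, 0, sizeof(int));       // the counter must start at zero

    find<<<n / 128, 128>>>(d_arr, d_counter, d_result);

    int found = 0;
    cudaMemcpy(&found, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("found %d matches\n", found);         // the first `found` entries of d_result hold the indices

    cudaFree(d_arr); cudaFree(d_counter); cudaFree(d_result);
    return 0;
}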
If the number of 4s makes the arr array "dense", or if you have multiple GPUs, then you should look at other solutions such as stream compaction: first select the cells containing 4 as a mask, then do the compaction. Some Nvidia blogs and tutorials cover this algorithm.
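If you take the compaction route, a library usually does the heavy lifting. Here is a minimal sketch with Thrust (my own illustration, again collecting the indices whose value equals 4, assuming the data already sits in a device_vector):
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: "this element is the value we are looking for".
struct is_four {
    __host__ __device__ bool operator()(int x) const { return x == 4; }
};

// Collect, in order, the indices i where arr[i] == 4.
thrust::device_vector<int> indices_of_fours(const thrust::device_vector<int>& arr)
{
    thrust::device_vector<int> indices(arr.size());
    auto end = thrust::copy_if(
        thrust::counting_iterator<int>(0),                  // candidate indices 0..n-1
        thrust::counting_iterator<int>((int)arr.size()),
        arr.begin(),                                        // stencil: the values themselves
        indices.begin(),
        is_four());
    indices.resize(end - indices.begin());                  // keep only the matches
    return indices;
}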
For the "atomic" solution (especially on "shared" memory atomics), Maxwell (and onwards) architecture is much better than Kepler, just in case you still use a Kepler. Also using atomics is not exactly reproducible as the order of atomic operations can not be known. You will get a differently-ordered result array most of the time. But the stream compaction preserves the result order. This may save you from writing a sorting algorithm (like bitonic-sort, shear-sort, etc) on top of it.
I built a simple CUDA kernel that performs a sum on elements. Each thread adds an input value to an output buffer. Each thread calculates one value. 2432 threads are being used (19 blocks * 128 threads).
The output buffer remains the same, the input buffer pointer is shifted by threadcount after each kernel execution. So in total, we have a loop invoking the add kernel until we computed all input data.
Example:
All my input values are set to 1. The output buffer size is 2432. The input buffer size is 2432 *2000.
2000 times the add kernel is called to add 1 to each field of output. The end result in output is 2000 in every field. I call the function aggregate, which contains a for loop that calls the kernel as often as needed to pass over the complete input data.
This works so far, unless I call the kernel too often.
However, if I call the kernel 2500 times, I get an illegal memory access CUDA error.
As you can see, the runtime of the last successful kernel increases by 3 orders of magnitude. Afterwards my pointers are invalidated and the following invocations result in cudaErrorIllegalAddress.
I cleaned up the code to get a minimal working example:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <vector>
#include <stdio.h>
#include <iostream>
using namespace std;
template <class T> __global__ void addKernel_2432(int *in, int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = out[i] + in[i];
}

static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorCount = size / 2432;
    cout << "ITERATIONS: " << vectorCount << endl;
    for (size_t i = 0; i < vectorCount-1; i++)
    {
        addKernel_2432<int><<<19,128>>>(array, out);
        array += vectorCount;
    }
    addKernel_2432<int><<<19, 128>>>(array, out);
    return 1;
}

int main()
{
    int* dev_in1 = 0;
    size_t vectorCount = 2432;
    int * dev_out = 0;
    size_t datacount = 2432*2500;
    std::vector<int> hostvec(datacount);
    //create input buffer, filled with 1
    std::fill(hostvec.begin(), hostvec.end(), 1);
    //allocate input buffer and output buffer
    cudaMalloc(&dev_in1, datacount*sizeof(int));
    cudaMalloc(&dev_out, vectorCount * sizeof(int));
    //set output buffer to 0
    cudaMemset(dev_out, 0, vectorCount * sizeof(int));
    //copy input buffer to GPU
    cudaMemcpy(dev_in1, hostvec.data(), datacount * sizeof(int), cudaMemcpyHostToDevice);
    //call kernel datacount / vectorcount times
    aggregate(dev_in1, datacount, dev_out);
    //return data to check for correctness
    cudaMemcpy(hostvec.data(), dev_out, vectorCount*sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaSuccess != cudaMemcpy(hostvec.data(), dev_out, vectorCount * sizeof(int), cudaMemcpyDeviceToHost))
    {
        cudaError err = cudaGetLastError();
        cout << " CUDA ERROR: " << cudaGetErrorString(err) << endl;
    }
    else
    {
        cout << "NO CUDA ERROR" << endl;
        cout << "RETURNED SUM DATA" << endl;
        for (int i = 0; i < 2432; i++)
        {
            cout << hostvec[i] << " ";
        }
    }
    cudaDeviceReset();
    return 0;
}
If you compile and run it, you get an error.
Change:
size_t datacount = 2432 * 2500;
to
size_t datacount = 2432 * 2400;
and it gives the correct results.
I am looking for any ideas why it breaks after 2432 kernel invocations.
What I have found so far googling around:
Wrong target architecture set? I use a 1070 Ti, and my target is set to compute_61,sm_61 in the Visual Studio project properties. That does not change anything.
Did I miss something? Is there a limit on how many times a kernel can be called before CUDA invalidates pointers? Thank you for your help. I use Windows, Visual Studio 2019 and CUDA runtime 11.
This is the output in both cases, success and failure:
[screenshot of the successful run omitted]
[screenshot of the error run omitted]
static int aggregate(int* array, size_t size, int* out) {
    size_t const vectorCount = size / 2432;
    for (size_t i = 0; i < vectorCount-1; i++)
    {
        array += vectorCount;
    }
}
That vectorCount is not the vector length but the number of iterations, and that is what you have accidentally been incrementing the pointer by. It works fine while vectorCount <= 2432 (though it reads the wrong elements), and results in a buffer overflow above that.
array += 2432 is what you intended to write.
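Spelled out, a sketch of the fixed loop (the only real change is advancing the pointer by 2432 each iteration):
static int aggregate(int* array, size_t size, int* out) {
    size_t const stride = 2432;                  // elements consumed by one kernel launch
    size_t const iterations = size / stride;
    for (size_t i = 0; i < iterations; i++)
    {
        addKernel_2432<int><<<19, 128>>>(array, out);
        array += stride;                         // advance by 2432, not by the iteration count
    }
    return 1;
}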
In my algorithm, I need to keep all the combinations of (3 bytes of) extended ASCII characters. Following is my code, but when I run it, the program gets killed in the terminal when the last step (BigVector.push_back) occurs. Why is this so, and what is the alternative in my case?
vector<set<vector<int> > > BigVector;
set<vector<int> > SmallSet;
for(int k=0; k <256; k++)
{
    for(int j=0; j <256; j++)
    {
        for(int m=0; m <256; m++)
        {
            vector<int> temp;
            temp.push_back(k);
            temp.push_back(j);
            temp.push_back(m);
            SmallSet.insert(temp);
        }
    }
}
BigVector.push_back(SmallSet);
P.S.: I have to keep the ASCII characters like this:
{ {(a,b,c) ,(a,b,d),...... (z,z,z)} }
Please note that 256^3 = 16,777,216. This is huge, especially when you use vector and set!
Because each value only needs to represent 256 = 2^8 possibilities, you can store it in a char (one byte), so each combination fits in a tuple of three chars. The memory is now 3 × 16,777,216 bytes ≈ 48 MB. On my computer, it finishes in 1 second.
If you accept C++11, I would suggest using std::array, instead of writing a helper struct like Info in my old code.
C++11 code using std::array.
vector<array<char,3>> bs;
for (int k = 0; k < 256; k++)
    for (int j = 0; j < 256; j++)
        for (int m = 0; m < 256; m++)
        {
            array<char,3> temp = { (char)k, (char)j, (char)m };
            bs.push_back(temp);
        }
C++98 code using home-made struct.
struct Info{
    char chrs[3];
    Info(char c1, char c2, char c3) { chrs[0] = c1; chrs[1] = c2; chrs[2] = c3; }
};

int main() {
    vector<Info> bs;
    for (int k = 0; k < 256; k++) {
        for (int j = 0; j < 256; j++) {
            for (int m = 0; m < 256; m++) {
                bs.push_back(Info(k,j,m));
            }
        }
    }
    return 0;
}
Ways to use the combinations (you could write a wrapper method for Info):
// Suppose s[256] contains the 256 extended chars.
// Cast to unsigned char so values above 127 index correctly even if char is signed.
for( auto b : bs){
    cout << s[(unsigned char)b.chrs[0]] << " " << s[(unsigned char)b.chrs[1]] << " " << s[(unsigned char)b.chrs[2]] << endl;
}
First: your example doesn't correspond with the actual code.
You are creating ( { (a,a,a), ..., (z,z,z) } )
As already mentioned, you will have 16'777'216 different vectors. Every vector holds the 3 values and typically ~20 bytes[1] of overhead for the vector object itself.
In addition a typical vector implementation will reserve memory for future push_backs.
You can avoid this by specifying the correct size during initialization or using reserve():
vector<int> temp(3);
(capacity() tells you the "real" size of the vector)
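To spell out those two options (a tiny sketch using the loop variables k, j, m from the question; with the sized constructor you assign by index, while reserve() lets you keep using push_back without reallocations):
// Option 1: construct with the final size, then assign by index.
vector<int> temp1(3);
temp1[0] = k; temp1[1] = j; temp1[2] = m;

// Option 2: reserve the capacity up front, then push_back as before.
vector<int> temp2;
temp2.reserve(3);
temp2.push_back(k); temp2.push_back(j); temp2.push_back(m);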
push_back makes a copy of the object you are pushing [2], which might be too much memory and therefore crash your program.
16'777'216 * (3 characters + 20 overhead) * 2 copy = ~736MiB.
(This assumes that the vectors are already initialized with the correct size!)
See [2] for a possible solution to the copying problem.
I do agree with Potatoswatter: your data structure is very inefficient.
[1] What is the overhead cost of an empty vector?
[2] Is std::vector copying the objects with a push_back?
I have a pointer to a list of pointers (each pointer in the list points to a row).
I need to "scatter" the list of pointers so that each processor has a certain number of rows.
Here is an example of how I want to assign the pointers.
If the list is composed of 5 pointers and there are 2 processors, I want processor0 to have pointers 4 0 1 2 3 and processor1 to have 2 3 4 0 (this means that each processor also gets the last pointer of the previous processor and the first pointer of the following processor).
This is part of the code:
int **vptr = NULL;
if(rank==0){
vptr = m.ptr();
}
//this definition comes from one of my class methods
Then I have this part of the code that decides how to assign the rows to each process (assuming at the outset that each processor has only the rows that the others do not have):
int *elem;
elem = new int[p]; //number of rows for process
int *disp;
disp = new int[p]; //index first row of the process
int split = N / p;
int extra = N % p;
for(unsigned i = 0; i < extra; i++){
    elem[i] = split + 1;
}
for(unsigned i = extra; i < p; i++){
    elem[i] = split;
}
disp[0] = 0;
for(unsigned i = 1; i < p; i++){
    disp[i] = disp[i-1] + elem[i-1];
}
int local_n = elem[rank]; //number of rows for this process
int local_f = disp[rank]; //index first row for this process
int *local_v;
local_v = new int[local_n + 2]; //+2 because now I consider that I also need the row above and the row below
Here I need to use MPI_Send and MPI_Recv; I suppose I am making an error with the pointers:
if(rank==0){
    for(unsigned j = 0; j < local_n + 2; j++){
        local_v[j] = *vptr[j];
    }
    for(unsigned i = 1; i < p; i++){
        MPI_Send(&vptr[disp[i]-1], elem[i] + 2, MPI_INT, i, 1, MPI_COMM_WORLD);
    }
}else{
    MPI_Recv(&local_v[0], local_n + 2, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
Comment converted to answer in the pursuit of vainglorious reputation ... (and the slightly more noble pursuit of providing an acceptable answer for future generations)
I'm not sure I entirely understand your code, but there is no point sending pointers from one process to another. Pointers point to locations in the local address space of a process and cannot be expected to point to a specific location in the local address space of another process. Indeed, they cannot be expected to point to any valid location in the address space of another process.
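What has to travel is the data the pointers refer to. A minimal sketch under assumed names (ROW_LEN and the example values are made up, and the distribution arrays are computed as in the question): rank 0 packs each destination's rows, plus the wrap-around halo rows, into one contiguous buffer of values and sends that.
#include <mpi.h>
#include <cstring>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    const int N = 5, ROW_LEN = 4;                 // hypothetical sizes
    // Row distribution, computed the same way on every rank (as in the question).
    std::vector<int> elem(p), disp(p);
    int split = N / p, extra = N % p;
    for (int i = 0; i < p; ++i) elem[i] = split + (i < extra ? 1 : 0);
    disp[0] = 0;
    for (int i = 1; i < p; ++i) disp[i] = disp[i - 1] + elem[i - 1];

    int local_n = elem[rank];
    std::vector<int> local_rows((local_n + 2) * ROW_LEN);   // own rows plus 2 halo rows

    if (rank == 0) {
        // Rank 0's rows, stored however it likes; here one small array per row.
        std::vector<std::vector<int>> rows(N, std::vector<int>(ROW_LEN));
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < ROW_LEN; ++c) rows[r][c] = r * 100 + c;

        for (int dest = 0; dest < p; ++dest) {
            // Pack the destination's rows plus the wrap-around halo rows
            // into one contiguous buffer of values.
            std::vector<int> buf((elem[dest] + 2) * ROW_LEN);
            for (int k = 0; k < elem[dest] + 2; ++k) {
                int r = (disp[dest] - 1 + k + N) % N;        // halo wraps around
                std::memcpy(&buf[k * ROW_LEN], rows[r].data(), ROW_LEN * sizeof(int));
            }
            if (dest == 0)
                local_rows = buf;                            // rank 0 keeps its own copy
            else
                MPI_Send(buf.data(), (int)buf.size(), MPI_INT, dest, 1, MPI_COMM_WORLD);
        }
    } else {
        MPI_Recv(local_rows.data(), (int)local_rows.size(), MPI_INT,
                 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}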
I got an assignment to reverse a dynamic array in C++. So far, from my logic, I am thinking of looping through the array to reverse it. Here is my code:
int main ()
{
    const int size = 10;
    int num_array[size];
    srand (time(NULL));
    for (int count = 0; count< sizeof(num_array)/sizeof(num_array[0]) ; count++){
        /* generate secret number between 1 and 100: */
        num_array[count] = rand() % 100 + 1;
        cout << num_array[count] << " " ;
    }
    reverse(num_array[size],size);
    cout << endl;
    system("PAUSE");
    return 0;
}

void reverse(int num_array[], int size)
{
    for (int count =0; count< sizeof(num_array)/sizeof(num_array[0]); count++){
        cout << num_array[sizeof(num_array)/sizeof(num_array[0])-1-count] << " " ;
    }
    return;
}
Somehow I think my logic was there, but this code doesn't work; there's some error. However, my teacher told me that this isn't what the question wants. Here is the question:
Write a function reverse that reverses the sequence of elements in an array. For example, if reverse is called with an array containing 1 4 9 16 9 7 4 9 11,
then the array is changed to 11 9 4 7 9 16 9 4 1.
So far, she told us that in the reverse method you need to swap the array elements. So here's my question: how do I swap the array elements so that the array entered is reversed?
Thanks in advance.
Updated portion
int main ()
{
    const int size = 10;
    int num_array[size];
    srand (time(NULL));
    for (int count = 0; count< size ; count++){
        /* generate secret number between 1 and 100: */
        num_array[count] = rand() % 100 + 1;
        cout << num_array[count] << " " ;
    }
    reverse(num_array,size);
    cout << endl;
    system("PAUSE");
    return 0;
}

void reverse(int num_array[], const int& size)
{
    for (int count =0; count< size/2; count++){
        int first = num_array[0];
        int last = num_array[count-1];
        int temp = first;
        first = last;
        last = temp;
    }
}
Your reverse function should look like this:
void reverse(int* array, const size_t size)
{
    for (size_t i = 0; i < size / 2; i++)
    {
        // Do stuff...
    }
}
And call it like:
reverse(num_array, size);
I am no C++ programmer, however I do see an easy solution to this problem. By simply using a for loop and an extra array (of the same size) you should be able to reverse the array with ease.
By using a for loop, starting at the last element of the array, and adding them in sequence to the new array, it should be fairly simple to end up with a reversed array. It would be something like this:
Declare two arrays of the same size (10 it seems)
Array1 contains your random numbers
Array2 is empty, but can consist of 10 elements
Also declare an integer which will keep track of the progression of the for loop, but in the opposite direction, i.e. not from the end but from the start.
Counter = 0
Next you will need a for loop that starts from the end of the first array and adds the values to the start of the second array. The for loop will be something like this:
for (int i = lengthOfArray1 - 1; i >= 0; i--) {
    Array2[Counter] = Array1[i];
    Counter++;
}
If you only wish to print the reversed order, you would not need the counter or the second array; you would simply use Array1's elements and print them out with that style of for loop.
That's it. You could copy Array2 back into Array1 afterward if you wanted the original array to hold the reversed sequence. Hope this helps a bit; changing it to C++ is your job on this one, unfortunately.
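In C++, that approach might look like the sketch below (my own translation; the random fill is replaced with fixed values to keep it short):
#include <iostream>

int main()
{
    const int size = 10;
    int Array1[size] = {4, 7, 1, 9, 2, 8, 6, 3, 5, 0};   // stand-in for the random values
    int Array2[size];

    int Counter = 0;
    for (int i = size - 1; i >= 0; i--) {   // walk Array1 backwards
        Array2[Counter] = Array1[i];        // fill Array2 from the front
        Counter++;
    }

    for (int i = 0; i < size; i++)
        std::cout << Array2[i] << " ";
    std::cout << std::endl;
    return 0;
}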
You're not actually swapping the elements in the array, you're just printing them out. I assume she wants you to actually change what is stored in the array.
As a hint, go through the array swapping the first and last element, then the 2nd and 2nd last element, etc. You only need to loop for size/2 too. As you have the size variable, just use that instead of all the sizeof stuff you're doing.
I would implement the function like the following:
void reverse(int A[], int N)
{
    for (int i=0, j=N-1; i<j; i++, j--){
        int t = A[i];
        A[i] = A[j];
        A[j] = t;
    }
}