I am new to CUDA programming.
I am curious what happens if the number of elements is larger than the number of threads. In this simple vector_add example:
__global__
void add(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}
Say the number of array elements is 10,000,000, and we call this function using 64 blocks and 256 threads per block:
int n = 1e7;
int grid_size = 64;
int block_size = 256;
Then only 64*256 = 16384 threads are assigned. What would happen to the rest of the array elements?
what would happen to the rest of the array elements?
Nothing at all. They wouldn't be touched and would remain unchanged. Of course, your x array elements don't change anyway. So we are referring to y here. The values of y[0..16383] would reflect the result of the vector add. The values of y[16384..9999999] would be unchanged.
For this reason (to conveniently handle arbitrary data set sizes independent of the chosen grid size), people sometimes suggest a grid-stride-loop kernel design.
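For illustration, a minimal sketch of the same add kernel rewritten as a grid-stride loop, so the 16384 threads between them cover all n elements regardless of the launch configuration:
__global__
void add(int n, float *x, float *y)
{
    // Each thread starts at its global index and strides by the total
    // number of threads in the grid until the whole array is covered.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        y[i] = x[i] + y[i];
    }
}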
I managed to reduce the problem to the following code, which uses almost 500MB of memory when it runs on my laptop - which in turn causes a std::bad_alloc in the full program. What is the problem here? As far as I can see, the unordered map only uses something like (32+32)*4096*4096 bits = 134.2MB, which is not even close to what the program uses.
#include <iostream>
#include <unordered_map>

using namespace std;

int main()
{
    unordered_map<int, int> a;
    long long z = 0;

    for (int x = 0; x < 4096; x++)
    {
        for (int y = 0; y < 4096; y++)
        {
            z = 0;
            for (int j = 0; j < 4; j++)
            {
                z ^= ((x >> (3 * j)) % 8) << (3 * j);
                z ^= ((y >> (3 * j)) % 8) << (3 * j + 12);
            }
            a[z]++;
        }
    }
    return 0;
}
EDIT: I'm aware that some of the bit shifting here can cause undefined behaviour, but I'm 99% sure that's not the problem.
EDIT2: What I need is essentially to count the number of x in a given set that some function maps to each y in a second set (of size 4096*4096). Would it be better to store these numbers in an array? I.e. I have a function f: A to B, and I need to know the size of the set {x in A : f(x) = y} for each y in B. In this case A and B are both the set of non-negative integers less than 2^12 = 4096. (Ideally I would like to extend this to 2^32.)
... which uses almost 500MB of memory ... What is the problem here?
There isn't really a problem, per se, regarding the memory usage you are observing. std::unordered_map is built to run fast for large numbers of elements. As such, memory isn't a top priority. For example, in order to optimize for resizing, it often allocates memory upon creation for pre-calculated hash chains. Also, your measure of the count of elements multiplied by the element size does not take into account the actual memory footprint, data-structure-wise, of each node in this map -- which involves at least a few pointers to adjacent elements in its bucket's list.
Having said that, it isn't clear you even need std::unordered_map in this scenario. Instead, given that the mapping you're trying to store is defined as
{x in A : f(x) = y} for each y in B
you could have one fixed-size array (std::array, or std::vector if it is too large for the stack) that simply holds, for each index i representing an element of set B, the number of elements from set A that meet the criterion.
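A minimal sketch of that idea applied to the loop from the question (using std::vector rather than std::array, so the roughly 64 MB table of 2^24 counters lives on the heap instead of the stack):
#include <cstdint>
#include <vector>

int main()
{
    // One counter per possible 24-bit key: 12 bits built from x, 12 from y.
    std::vector<std::uint32_t> counts(1u << 24, 0);

    for (int x = 0; x < 4096; x++)
    {
        for (int y = 0; y < 4096; y++)
        {
            std::uint32_t z = 0;
            for (int j = 0; j < 4; j++)
            {
                z ^= ((x >> (3 * j)) % 8) << (3 * j);
                z ^= ((y >> (3 * j)) % 8) << (3 * j + 12);
            }
            counts[z]++;   // direct indexing: no hashing, no per-node allocation
        }
    }
    return 0;
}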
Say I have this toy code:
#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel1(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    buf[i] = 1.0f * i / N;
    for (int j = 0; j < M; j++)
        buf[i] *= buf[i];
}

__global__ void cudakernel2(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    for (int j = 0; j < M; j++)
        buf[i] += buf[i];
}

int main()
{
    float data[N];
    float *d_data;

    cudaMalloc(&d_data, N * sizeof(float));
    cudakernel1<<<N/256, 256>>>(d_data);
    cudakernel2<<<N/256, 256>>>(d_data);
    cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
Can I merge the two kernels like so:
#define N (1024*1024)
#define M (1000000)

__global__ void cudakernel1_plus_2(float *buf)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    buf[i] = 1.0f * i / N;
    for (int j = 0; j < M; j++)
        buf[i] *= buf[i];

    __syncthreads();

    for (int j = 0; j < M; j++)
        buf[i] += buf[i];
}

int main()
{
    float data[N];
    float *d_data;

    cudaMalloc(&d_data, N * sizeof(float));
    cudakernel1_plus_2<<<N/256, 256>>>(d_data);
    cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
Is it true in general that two consecutive kernels launched with the same block and thread parameters can be merged with an intermediate __syncthreads()?
(My real case is 6 consecutive non-trivial kernels that have a lot of set-up and tear-down overhead).
The simplest, most general answer is no. I only need to find one example for which the paradigm breaks to support that. Let's remind ourselves that:
__syncthreads() is a block level execution barrier, but not a device-wide execution barrier. The only defined device-wide execution barrier is the kernel launch (assuming we're talking about issuing kernels into the same stream, for sequential execution).
threadblocks of a particular kernel launch can execute in any order.
Let's say we have 2 functions:
Reverse the elements of a vector
Sum the elements of the vector
Let's assume the vector reversal is not an in-place operation (the output is distinct from the input), and that each threadblock handles a block-sized chunk of the vector, reading the elements and storing to the appropriate location in the output vector.
To keep it really simple, we'll imagine we only have (need) two threadblocks. For the first step, block 0 copies the left hand side of the vector to the right hand side (reversing the order) and block 1 copies right-to-left:
1 2 3 4 5 6 7 8
|blk 0 |blk 1 |
\ | /
X
/| \
v | v
8 7 6 5 4 3 2 1
For the second step, in classical parallel reduction fashion, block zero sums the left hand elements of the output vector, and block 1 sums the right hand elements:
8 7 6 5 4 3 2 1
\ / \ /
blk0 blk1
26 10
As long as the first function is issued in kernel1 and the second function is issued in kernel2, into the same stream after kernel1, this all just works. For each kernel, it does not matter if block 0 executes before block 1, or vice-versa.
If we combine the operations so that we have a single kernel, and block 0 copies/reverses the first half of the vector to the second half of the output vector, then executes a __syncthreads(), then sums the first half of the output vector, things are likely to break. If block 0 executes before block 1, the first step will be fine (copy/reversal of the vector half), but the second step will be operating on an output-array half that has not been populated yet, because block 1 has not begun executing. The computed sum will be wrong.
Without trying to give formal proofs, we can see that in the above case where there is data movement from one block's "domain" to another block's "domain", we run the risk of breaking things, because the previous device-wide sync (kernel launch) was necessary for correctness. However, if we can limit the "domain" of a block so that any data consumed by subsequent operations is produced only by previous operations in that block, then a __syncthreads() may be sufficient to allow this strategy with correctness. (The previous silly example could easily be reworked to allow this, simply by having block 0 be responsible for the first half of the output vector, thus copying from the second half of the input vector, and vice versa for the other block.)
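A rough sketch of that reworked combination (not code from the original answer; it assumes n is a multiple of the block size, a power-of-two block size of 256, and a hypothetical block_sums output array for the per-block partial sums):
__global__ void reverse_then_sum(const float *in, float *out, float *block_sums, int n)
{
    __shared__ float sdata[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;   // index this block owns in the output

    // Reverse: each block writes only to its own chunk of 'out', reading the
    // mirrored chunk of 'in'. The input is never written, so nothing produced
    // by another block is consumed here.
    out[i] = in[n - 1 - i];
    __syncthreads();

    // Classical shared-memory reduction over this block's own chunk of 'out'.
    sdata[tid] = out[i];
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        block_sums[blockIdx.x] = sdata[0];
}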
Finally, if we limit data scope to a single thread, then we can make such combinations without even using __syncthreads(). The cudakernel1_plus_2 example in the question is exactly this case: each thread only ever reads and writes its own buf[i], so the intermediate __syncthreads() is not even required for correctness. These last two cases have the characteristics of "embarrassingly parallel" problems, which exhibit a high degree of independence.
How do you fill a dynamic matrix with 0 in C++? I mean, without:
for(int i=0;i<n;i++)for(int j=0;j<n;j++)a[i][j]=0;
I need it in O(n), not O(n*m) or O(n^2).
Thanks.
For the specific case where your array is going to be large and sparse and you want to zero it at allocation time, you can get some benefit from using calloc - on most platforms this will result in lazy allocation of zero pages, e.g.
int **a = (int **)malloc(n * sizeof(a[0]));   // allocate row pointers
int *b = (int *)calloc(n * n, sizeof(b[0]));  // allocate n x n array (zeroed)

a[0] = b;                                     // initialise row pointers
for (int i = 1; i < n; ++i)
{
    a[i] = a[i - 1] + n;
}
Note that this is, of course, premature optimisation. It is also C-style coding rather than C++. You should only use this optimisation if you have established that performance is a bottleneck in your application and there is no better solution.
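For comparison, a minimal sketch of the plain C++ alternative (note that value-initialization still touches every element under the hood, just like the loop in the question):
#include <vector>

int main()
{
    const int n = 1000, m = 1000;

    // Flat n x m matrix; every element is value-initialized to 0.
    // Element (i, j) lives at flat[i * m + j].
    std::vector<int> flat(n * m, 0);

    // Or a nested layout, if a[i][j] syntax is required.
    std::vector<std::vector<int>> nested(n, std::vector<int>(m, 0));

    return 0;
}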
From your code:
for(int i=0;i<n;i++)for(int j=0;j<n;j++)a[i][j]=0;
I assume that your matrix is a two-dimensional array declared as either
int matrix[a][b];
or
int** matrix;
In the first case, change this for loop to a single call to memset():
memset(matrix, 0, sizeof(int) * a * b);
In the second case, you will have to do it this way:
for (int n = 0; n < a; ++n)
    memset(matrix[n], 0, sizeof(int) * b);
On most platforms, the call to memset() will be replaced with a proper compiler intrinsic.
Not every nested loop is O(n^2). The following code (No. 1) is O(n) in the total number of cells:
No. 1
for(int i=0;i<n;i++)for(int j=0;j<n;j++)a[i][j]=0;
Imagine that all of the cells of matrix a were copied into a one-dimensional flat array b and zeroed with a single loop. What would the order be then? Of course you would say it is O(n):
No. 2
for(int i=0;i<n*m;i++) b[i]=0;
Now compare No. 2 with No. 1 and ask yourself the following questions:
Does either code touch the cells of matrix a more than once?
If you measured the running time, would there be a difference?
Both answers are no.
Both pieces of code are O(n) in the same sense: a nested loop over a multi-dimensional array does work linear in the total number of elements it touches.
In my program, I need to run the kernel once on every item of a large 2D array. The program works correctly for small ranges - up to around 50x50, sometimes up to 100x100.
For bigger datasets however, calling the kernel causes the video card driver to crash.
I have tested this program on two computers with different AMD cards, and they exhibit the exact same behaviour. Other, one-dimensional kernels work properly, even for huge datasets of ~10 000 x 10 000 items.
Also, removing the i variable from the matrix[i + (N + 1) * j] expression causes the kernel to work without errors.
Am I setting the range incorrectly, making a mistake in the kernel, or does the problem lie elsewhere?
enqueued range:
cl::EnqueueArgs args(queue,cl::NDRange(offset, offset+1),cl::NDRange(N+1, N),cl::NullRange);
kernel:
void kernel sub(global float* matrix, global const float* vec, int N, int offset) {
    int i = get_global_id(0);
    int j = get_global_id(1);

    matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}
One possible reason: if your kernel runs for too long, the driver may kill it. Dice the problem area up into smaller blocks.
Consider this: for a 100x100 input array you will use N=100, hence the maximum value of i in your kernel will be 100 because of the N+1 used in the enqueue args, while the maximum for j will be 99 (I have assumed that offset = 0). Therefore i + (N + 1) * j = 100 + 101*99 = 10099, which is outside of your 2D array.
When offset = 1, the minimums for i and j will be 1 and 2 respectively, while the maximums will be 101 and 100. Therefore i + (N + 1) * j = 101 + 101*100 = 10201.
In my experience, GPUs are not very good at catching segmentation faults when accessing global memory. Your attempt at purposefully creating one may work on some cards sometimes but no guarantees.
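If out-of-range indices are indeed the cause, one common workaround is to guard the access inside the kernel - a minimal sketch, assuming the matrix buffer holds (N+1) x (N+1) floats:
void kernel sub(global float* matrix, global const float* vec, int N, int offset) {
    int i = get_global_id(0);
    int j = get_global_id(1);

    // Skip work-items whose indices fall outside the (N+1) x (N+1) buffer.
    if (i > N || j > N)
        return;

    matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}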
The problem could be caused by the local work size and global work size. It is important to calculate them properly when using two-dimensional arrays. It could be that for big values your global_id(0) is bigger than what you specified in clEnqueueNDRangeKernel().
What parallel algorithms could I use to generate random permutations from a given set?
Especially proposals or links to papers suitable for CUDA would be helpful.
A sequential version of this would be the Fisher-Yates shuffle.
Example:
Let S={1, 2, ..., 7} be the set of source indices.
The goal is to generate n random permutations in parallel.
Each of the n permutations contains each of the source indices exactly once,
e.g. {7, 6, ..., 1}.
The Fisher-Yates shuffle can be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7.
This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):
const int id = threadIdx.x;

__shared__ int perm_shared[2 * BLOCK_SIZE];
perm_shared[2 * id] = 2 * id;
perm_shared[2 * id + 1] = 2 * id + 1;
__syncthreads();

unsigned int shift = 1;
unsigned int pos = id * 2;
while (shift <= BLOCK_SIZE)
{
    if (curand(&curand_state) & 1)
        swap(perm_shared, pos, pos + shift);
    shift = shift << 1;
    pos = (pos & ~shift) | ((pos & shift) >> 1);
    __syncthreads();
}
Here the curand initialization code is omitted, and the function swap(int *p, int i, int j) exchanges the values p[i] and p[j].
Note that the code above has the following assumptions:
The length of permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
2 * BLOCK_SIZE integers fit into __shared__ memory of CUDA device
BLOCK_SIZE is a valid size of CUDA block (usually something between 32 and 512)
To generate more than one permutation I would suggest utilizing different CUDA blocks. If the goal is to permute 7 elements (as mentioned in the original question), then I believe it will be faster to do it in a single thread.
If the length of S is s_L, a very crude way of doing this could be implemented in Thrust:
http://thrust.github.com.
First, create a vector val of length s_L x n that repeats S n times.
Create a vector val_keys that associates n unique keys, each repeated s_L times, with the elements of val, e.g.,
val = {1,2,...,7,1,2,...,7,....,1,2,...7}
val_keys = {0,0,0,0,0,0,0, 1,1,1,1,1,1,1, 2,2,2,..., n-1,n-1,n-1}
Now the fun part: create a vector of length s_L x n of uniformly distributed random variables,
U = {0.24, 0.1, .... , 0.83}
then you can make a zip iterator over val and val_keys and sort them according to U:
http://codeyarns.com/2011/04/04/thrust-zip_iterator/
After this sort, both val and val_keys will be all over the place, so you have to put them back together again using thrust::stable_sort_by_key() on val_keys, to make sure that if val[i] and val[j] both belong to key[k] and val[i] precedes val[j] after the random sort, then in the final version val[i] still precedes val[j]. If all goes according to plan, val_keys should look just as it did before, but val should reflect the shuffling within each key.
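A rough sketch of that recipe (the rand01 functor and the small s_L and n values here are illustrative assumptions, not part of the original answer):
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// One uniform random float per element index, generated from a host-supplied seed.
struct rand01
{
    unsigned int seed;
    rand01(unsigned int s) : seed(s) {}
    __host__ __device__ float operator()(unsigned int i) const
    {
        thrust::default_random_engine rng(seed);
        rng.discard(i);
        return thrust::uniform_real_distribution<float>(0.0f, 1.0f)(rng);
    }
};

int main()
{
    const int s_L = 7;   // length of S = {1, ..., 7}
    const int n   = 4;   // number of permutations to generate

    // val = {1..7, 1..7, ...}, val_keys = {0,...,0, 1,...,1, ..., n-1,...,n-1}
    thrust::host_vector<int> h_val(s_L * n), h_keys(s_L * n);
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < s_L; ++i)
        {
            h_val[k * s_L + i]  = i + 1;
            h_keys[k * s_L + i] = k;
        }
    thrust::device_vector<int> val = h_val, val_keys = h_keys;

    // U = one uniform random draw per element.
    thrust::device_vector<float> U(s_L * n);
    thrust::transform(thrust::counting_iterator<unsigned int>(0),
                      thrust::counting_iterator<unsigned int>(s_L * n),
                      U.begin(), rand01(12345));

    // Shuffle (val, val_keys) by the random draws, then regroup by permutation id;
    // the stable sort preserves the random order within each key.
    thrust::sort_by_key(U.begin(), U.end(),
                        thrust::make_zip_iterator(thrust::make_tuple(val.begin(), val_keys.begin())));
    thrust::stable_sort_by_key(val_keys.begin(), val_keys.end(), val.begin());

    return 0;
}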
For large sets, using a sort primitive on a vector of randomized keys might be efficient enough for your needs. First, set up some vectors:
const int N = 65535;
thrust::device_vector<uint16_t> d_cards(N);
thrust::device_vector<uint16_t> d_keys(N);
thrust::sequence(d_cards.begin(), d_cards.end());
Then, each time you want to shuffle d_cards, call the pair:
thrust::tabulate(d_keys.begin(), d_keys.end(), PRNFunc(rand() * rand()));
thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_cards.begin());
// d_cards now freshly shuffled
The random keys are generated by a functor that uses a seed (evaluated in host code and copied to the kernel at launch time) and a key number (which tabulate passes in at thread-creation time):
struct PRNFunc
{
    uint32_t seed;
    PRNFunc(uint32_t s) { seed = s; }

    __device__ __host__ uint32_t operator()(uint32_t kn) const
    {
        thrust::minstd_rand randEng(seed);
        randEng.discard(kn);
        return randEng();
    }
};
I have found that performance could be improved (by probably 30%) if I could figure out how to cache the allocations that thrust::sort_by_key does internally.
Any corrections or suggestions welcome.