OpenCL crash on big 2D range - C++

In my program, I need to run a kernel once on every item of a large 2D array. The program works correctly for small ranges - up to around 50x50, and sometimes up to 100x100.
For bigger datasets, however, calling the kernel causes the video card driver to crash.
I have tested this program on two computers with different AMD cards, and they exhibit exactly the same behaviour. Other, one-dimensional kernels work properly, even for huge datasets of ~10,000 x 10,000 items.
Also, removing the i variable from the matrix[i + (N + 1) * j] expression causes the kernel to work without errors.
Am I setting the range incorrectly, making a mistake in the kernel, or does the problem lie elsewhere?
enqueued range:
cl::EnqueueArgs args(queue,cl::NDRange(offset, offset+1),cl::NDRange(N+1, N),cl::NullRange);
kernel:
void kernel sub(global float* matrix, global const float* vec, int N, int offset) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}

One possible reason: if your kernel runs for too long, the driver may kill it (a watchdog timeout). Dice the problem area up into smaller blocks.

Consider this: for a 100x100 input array you will use N = 100. With offset = 0, the enqueue uses a global offset of (0, 1) and a global size of (N+1, N) = (101, 100), so the maximum value of i in your kernel will be 100, while the maximum for j will also be 100 (j starts at 1 because of the offset+1). Therefore i + (N + 1) * j = 100 + 101*100 = 10200, which is outside a 100x100 array of 10,000 elements.
When offset = 1, the minimums for i and j will be 1 and 2 respectively, while the maximums will be 101 and 101. Therefore i + (N + 1) * j reaches 101 + 101*101 = 10302.
In my experience, GPUs are not very good at catching out-of-bounds accesses to global memory (there is no segmentation fault to rely on). Code that reads or writes past the end of a buffer may appear to work on some cards some of the time, but there are no guarantees.
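One way to make the kernel safe regardless of how the range is set up is an explicit guard. A minimal sketch, assuming the matrix has rows rows of N+1 floats each; the extra rows parameter is hypothetical, not part of the original code:
__global__
void kernel sub(global float* matrix, global const float* vec,
                int N, int offset, int rows) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    // Skip work-items that would index past the end of the buffer.
    if (i > N || j >= rows)
        return;
    matrix[i + (N + 1) * j] -= matrix[i + (N + 1) * offset] * vec[j];
}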

The problem could also be caused by the local work size and global work size. It is important, when using two-dimensional arrays, to calculate them properly: the global work size in each dimension generally has to be a multiple of the local work size. It could be that for big ranges get_global_id(0) returns values beyond the range you intended when enqueuing the kernel.
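For reference, a common host-side pattern is to round the global size up to a multiple of the local size and let the kernel guard against the padding work-items (see the guard sketch above). A sketch using the same C++ bindings as the question; the tile size of 16 is made up, and the header name varies by bindings version:
#include <CL/cl.hpp>
#include <cstddef>

// Round each global dimension up to a multiple of the work-group size so
// the driver accepts the range; the kernel must then skip the padding.
cl::NDRange padded_range(std::size_t width, std::size_t height,
                         std::size_t tile = 16) {
    auto round_up = [tile](std::size_t n) {
        return ((n + tile - 1) / tile) * tile;
    };
    return cl::NDRange(round_up(width), round_up(height));
}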

Related

bit shift operation in parallel prefix sum

The code below computes a prefix sum in parallel and comes from the OpenGL SuperBible, chapter 10.
The shader shown has a local workgroup size of 1024, which means it will process arrays of 2048 elements, as each invocation computes two elements of the output array. The shared variable shared_data is used to store the data that is in flight. When execution starts, the shader loads two adjacent elements from the input arrays into the shared array. Next, it executes the barrier() function. This step ensures that all of the shader invocations have loaded their data into the shared array before the inner loop begins.
#version 450 core

layout (local_size_x = 1024) in;

layout (binding = 0) coherent buffer block1
{
    // Two elements per invocation, so twice the workgroup size.
    float input_data[gl_WorkGroupSize.x * 2];
};

layout (binding = 1) coherent buffer block2
{
    float output_data[gl_WorkGroupSize.x * 2];
};

shared float shared_data[gl_WorkGroupSize.x * 2];

void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;

    // The number of steps is the log base 2 of the
    // work group size, which should be a power of 2
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;

    // Each invocation is responsible for the content of
    // two elements of the output array
    shared_data[id * 2] = input_data[id * 2];
    shared_data[id * 2 + 1] = input_data[id * 2 + 1];

    // Synchronize to make sure that everyone has initialized
    // their elements of shared_data[] with data loaded from
    // the input arrays
    barrier();
    memoryBarrierShared();

    // For each step...
    for (step = 0; step < steps; step++)
    {
        // Calculate the read and write index in the
        // shared array
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);

        // Accumulate the read data into our element
        shared_data[wr_id] += shared_data[rd_id];

        // Synchronize again to make sure that everyone
        // has caught up with us
        barrier();
        memoryBarrierShared();
    }

    // Finally write our data back to the output buffer
    output_data[id * 2] = shared_data[id * 2];
    output_data[id * 2 + 1] = shared_data[id * 2 + 1];
}
How can one understand the bit shift operations for rd_id and wr_id intuitively? Why do they work?
When we say something is "intuitive" we usually mean that our understanding is deep enough that we are not aware of our own thought processes, and "know the answer" without consciously thinking about it. Here the author is using the binary representation of integers within a CPU/GPU to make the code shorter and (probably) slightly faster. The code will only be "intuitive" for someone who is very familiar with such encodings and binary operations on integers. I'm not, so had to think about what is going on.
I would recommend working through this code since these kind of operations do occur in high performance graphics and other programming. If you find it interesting, it will eventually become intuitive. If not, that's OK as long as you can figure things out when necessary.
One approach is to just copy this code into a C/C++ program and print out the mask, rd_id, wr_id, etc. You wouldn't actually need the data arrays, or the calls to barrier() and memoryBarrierShared(). Make up values for invocation ID and workgroup size based on what the SuperBible example does. That might be enough for "Aha! I see."
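For example, a minimal sketch of that experiment in C++; the workgroup size of 8 is made up to keep the output readable, and the GLSL built-ins are replaced by plain loop variables:
#include <cstdio>

int main() {
    const unsigned wg_size = 8;   // stands in for gl_WorkGroupSize.x
    const unsigned steps = 4;     // log2(8) + 1, as in the shader
    for (unsigned step = 0; step < steps; step++) {
        std::printf("step %u:\n", step);
        for (unsigned id = 0; id < wg_size; id++) {  // gl_LocalInvocationID.x
            // Same index math as the shader's inner loop.
            unsigned mask = (1u << step) - 1u;
            unsigned rd_id = ((id >> step) << (step + 1)) + mask;
            unsigned wr_id = rd_id + 1 + (id & mask);
            std::printf("  id=%u mask=%u rd_id=%2u wr_id=%2u\n",
                        id, mask, rd_id, wr_id);
        }
    }
}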
If you aren't familiar with the << and >> shifts, I suggest writing some tiny programs and printing out the numbers that result. Python might actually be slightly easier, since
print("{:016b}".format(mask))
will show you the actual bits, whereas C has no standard format specifier for binary (hex is the closest).
To get you started, log2 returns the power-of-two exponent of an integer: log2(256) is 8, log2(4096) is 12, etc. (Don't take my word for it, write some code.)
x << n multiplies x by 2 to the power n, so x << 1 is x * 2, x << 2 is x * 4, and so on. x >> n divides x by 2 to the power n instead.
(Very important: only for non-negative integers! Again, write some code to find out what happens.)
The mask calculation is interesting. Try
mask = (1 << step);
first and see what values come out. This is a common pattern for selecting an individual bit. The extra -1 instead sets all the bits to the right of that bit.
ANDing (the & operator) with a mask that has zeroes on the left and ones on the right is a faster way of computing an integer modulo a power of 2.
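A tiny check of that identity, with the step value chosen arbitrarily:
#include <cstdio>

int main() {
    const unsigned step = 3;
    const unsigned bit  = 1u << step;   // 0b1000 = 8: selects bit 3
    const unsigned mask = bit - 1u;     // 0b0111 = 7: the bits to its right
    // For non-negative integers, x & mask equals x % 8.
    for (unsigned x = 0; x < 20; x++)
        std::printf("x=%2u  x & mask=%u  x %% 8=%u\n", x, x & mask, x % 8);
}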
Finally, the rd_id and wr_id array indexes start from base positions derived from the invocation ID and workgroup size, and increment according to the pattern explained in the SuperBible text.

Cuda number of elements is larger than assigned threads

I am new to CUDA programming.
I am curious what happens if the number of elements is larger than the number of threads.
In this simple vector_add example
__global__
void add(int n, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = x[i] + y[i];
}
Say the number of array elements is 10,000,000. And we call this function using 64 blocks and 256 threads per block:
int n = 1e7;
int grid_size = 64;
int block_size = 256;
Then only 64*256 = 16384 threads are launched; what would happen to the rest of the array elements?
what would happen to the rest of the array elements?
Nothing at all. They wouldn't be touched and would remain unchanged. Of course, your x array elements don't change anyway. So we are referring to y here. The values of y[0..16383] would reflect the result of the vector add. The values of y[16384..9999999] would be unchanged.
For this reason (to conveniently handle arbitrary data set sizes independent of the chosen grid size), people sometimes suggest a grid-stride-loop kernel design.
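A minimal sketch of that pattern, reworking the same kernel so each thread strides through the array:
__global__
void add(int n, float *x, float *y)
{
    // Each thread starts at its global index and advances by the total
    // number of threads in the grid until the whole array is covered.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = x[i] + y[i];
}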

Sparse matrix-dense vector multiplication with matrix known at compile time

I have a sparse matrix with only zeros and ones as entries (and, for example, with shape 32k x 64k and 0.01% non-zero entries and no patterns to exploit in terms of where the non-zero entries are). The matrix is known at compile time. I want to perform matrix-vector multiplication (modulo 2) with non-sparse vectors (not known at compile time) containing 50% ones and zeros. I want this to be efficient, in particular, I'm trying to make use of the fact that the matrix is known at compile time.
Storing the matrix in an efficient format (saving only the indices of the "ones") will always take a few MBytes of memory, and directly embedding the matrix into the executable seems like a good idea to me. My first idea was to automatically generate C++ code that assigns each result vector entry the sum of the correct input entries. This looks like this:
constexpr std::size_t N = 64'000;
constexpr std::size_t M = 32'000;

template<typename Bit>
void multiply(const std::array<Bit, N> &in, std::array<Bit, M> &out) {
    out[0] = (in[11200] + in[21960] + in[29430] + in[36850] + in[44352] + in[49019] + in[52014] + in[54585] + in[57077] + in[59238] + in[60360] + in[61120] + in[61867] + in[62608] + in[63352]) % 2;
    out[1] = (in[1] + in[11201] + in[21961] + in[29431] + in[36851] + in[44353] + in[49020] + in[52015] + in[54586] + in[57078] + in[59239] + in[60361] + in[61121] + in[61868] + in[62609] + in[63353]) % 2;
    out[2] = (in[11202] + in[21962] + in[29432] + in[36852] + in[44354] + in[49021] + in[52016] + in[54587] + in[57079] + in[59240] + in[60362] + in[61122] + in[61869] + in[62610] + in[63354]) % 2;
    out[3] = (in[56836] + in[11203] + in[21963] + in[29433] + in[36853] + in[44355] + in[49022] + in[52017] + in[54588] + in[57080] + in[59241] + in[60110] + in[61123] + in[61870] + in[62588] + in[63355]) % 2;
    // LOTS more of this...
    out[31999] = (in[10208] + in[21245] + in[29208] + in[36797] + in[40359] + in[48193] + in[52009] + in[54545] + in[56941] + in[59093] + in[60255] + in[61025] + in[61779] + in[62309] + in[62616] + in[63858]) % 2;
}
This does in fact work (it takes ages to compile). However, it turns out to be very slow (more than 10x slower than the same sparse matrix-vector multiplication in Julia) and it also blows up the executable size significantly more than I would have thought necessary. I tried this with both std::array and std::vector, and with the individual entries (represented as Bit) being bool, std::uint8_t and int, with no progress worth mentioning. I also tried replacing the modulo and addition by XOR. In conclusion, this is a terrible idea. I'm not sure why, though - is the sheer code size slowing it down that much? Does this kind of code rule out compiler optimization?
I haven't tried any alternatives yet. The next idea I have is storing the indices as compile-time constant arrays (still giving me huge .cpp files) and looping over them. Initially, I expected this would lead the compiler to generate the same binary as my automatically generated C++ code. Do you think this is worth trying (I guess I will try anyway on Monday)?
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that. I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
Do you have any other ideas on how this might be done?
I'm not sure why though - is the sheer codesize slowing it down that much?
The problem is that when the executable is big, the OS will fetch a lot of pages from your storage device. This process is very slow. The processor will often stall, waiting for data to be loaded. And even if the code were already loaded in RAM (OS caching), it would be inefficient because the speed of RAM (latency + throughput) is quite bad. The main issue here is that all the instructions are executed only once. If you reuse the function, the code needs to be reloaded from the cache, and if it is too big to fit in the cache, it will be loaded from the slow RAM. Thus, the overhead of loading the code is very high compared to its actual execution. To overcome this problem, you need to use a fairly small piece of code with loops iterating over a fairly small amount of data.
Does this kind of code rule out compiler optimization?
This depends on the compiler, but most mainstream compilers (e.g. GCC or Clang) will optimize the code the same way (hence the slow compilation time).
Do you think this is worth trying (I guess I will try anyway on monday)?
Yes, this solution is clearly better, especially if the indices are stored in a compact way. In your case, you can store them using a uint16_t type. All the indices can be put in one big buffer. The starting/ending positions of the indices for each row can be stored in another buffer referencing the first one (or using pointers). This buffer can be loaded into memory once at the beginning of your application from a dedicated file, to reduce the size of the resulting program (and avoid fetches from the storage device in a critical loop). With a probability of 0.01% of having non-zero values, the resulting data structure will take less than 500 KiB of RAM. On an average mainstream desktop processor, it can fit in the L3 cache (which is rather fast), and I think your computation should not take more than 1 ms, assuming the code of multiply is carefully optimized.
Another idea would be to try storing the input (and maybe also output?) vector as packed bits and perform the calculation like that.
Bit-packing is good only if the data is not too sparse. For the input vector, which is filled with 50% non-zero values, bit-packing is great. For the matrix, with 0.01% non-zero values, bit-packing is clearly bad, as it would take too much space.
I would expect one can't get around a lot of bit-shifting or and-operations and this would end up being slower and worse overall.
As previously said, loading data from the storage device or the RAM is very slow. Doing some bit-shifts is very fast on any modern mainstream processor (and much much faster than loading data).
Here are the approximate timings for various operations that a computer can do:
[timings table not preserved]

I implemented the second method (constexpr arrays storing the matrix in compressed column storage format) and it is a lot better. It takes (for a 64'000 x 22'000 binary matrix containing 35'000 ones) <1min to compile with -O3 and performs one multiplication in <300 microseconds on my laptop (Julia takes around 350 microseconds for the same calculation). The total executable size is ~1 Mbyte.
Probably one can still do a lot better. If anyone has an idea, let me know!
Below is a code example (showing a 5x10 matrix) illustrating what I did.
#include <iostream>
#include <array>
#include <cstddef>
#include <cstdint>

// Compressed sparse column storage for a binary matrix
constexpr std::size_t M = 5;
constexpr std::size_t N = 10;
constexpr std::size_t num_nz = 5;

constexpr std::array<std::uint16_t, N + 1> colptr = {
    0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5
};

constexpr std::array<std::uint16_t, num_nz> row_idx = {
    0x0, 0x1, 0x2, 0x3, 0x4
};

template<typename Bit>
constexpr void encode(const std::array<Bit, N>& in, std::array<Bit, M>& out) {
    for (std::size_t col = 0; col < N; col++) {
        // XOR (addition modulo 2) the input bit into every row that has
        // a one in this column.
        for (std::size_t j = colptr[col]; j < colptr[col + 1]; j++) {
            out[row_idx[j]] = (static_cast<bool>(out[row_idx[j]]) != static_cast<bool>(in[col]));
        }
    }
}

int main() {
    using Bit = bool;
    std::array<Bit, N> input{1, 0, 1, 0, 1, 1, 0, 1, 0, 1};
    std::array<Bit, M> output{};
    for (auto i : input) std::cout << i;
    std::cout << std::endl;
    encode(input, output);
    for (auto i : output) std::cout << i;
}

Problem with initialising 2D vector in C++

I was implementing a solution for this problem to get a feel for the language. My reasoning is as follows:
Notice that the pattern on the diagonal is 2*n+1.
Moving from a diagonal element out to the boundary, the elements to the left and upwards form alternating arithmetic progressions: additions and subtractions relative to the diagonal element.
Create a 2D vector and instantiate all the diagonal elements. Then use a dummy variable to fill in the remaining parts by adding to or subtracting from the diagonal elements.
My code is as follows:
#include <iostream>
#include <vector>
using namespace std;

const long value = 1e9;
vector<vector<long>> spiral(value, vector<long> (value));
long temp;

void build(){
    spiral[0][0] = 1;
    for(int i = 1; i < 5e8; i++){
        spiral[i][i] = 2*i+1;
        temp = i;
        long counter = temp;
        while(counter){
            if(temp % 2 == 0){
                spiral[i][counter]++;
                spiral[counter][i]--;
                counter--;
                temp--;
            }else{
                spiral[i][counter]--;
                spiral[counter][i]++;
                counter--;
                temp--;
            }
        }
    }
}

int main(){
    spiral[0][0] = 1;
    build();
    int y, x;
    cin >> y >> x;
    cout << spiral[y][x] << endl;
}
The problem is that the program doesn't output anything. I can't figure out why my vector won't print any elements. I've tested it with spiral[1][1] and all I get is some obscure assembler message after waiting 5 or 10 minutes. What's wrong with my reasoning?
A long is probably 4 or 8 bytes for you (e.g. commonly 4 bytes on Windows, 4 bytes on x86 Linux, and 8 bytes on x64 Linux), so let's assume 4. 1e9 * 4 is 4 gigabytes of contiguous memory for each vector<long>(value).
Then the outer vector creates another 1e9 copies of that, which is 4 exabytes (or 4 million terabytes) given a 32-bit long, or double that for a 64-bit long, ignoring the overhead size of each std::vector. It is highly unlikely that you have that much memory and swap file, and since spiral is a global, this allocation is attempted before main() is called.
So you are not going to be able to store all this data directly, you will need to think about what data actually needs to be stored to get the result you desire.
If you run under a debugger set to stop on exceptions, you might see a std::bad_alloc getting thrown, with the call stack indicating the cause (e.g. Visual Studio will display something like "dynamic initializer for 'spiral'" in the call stack). On Linux, however, the OS may simply kill the process first, since Linux can over-commit memory (so new etc. succeeds); when a program then actually reads or writes that memory and nothing is free, the kernel SIGKILLs something to free memory. This isn't entirely predictable: I copy-pasted your code onto Ubuntu 18 and on the command line got "terminate called after throwing an instance of 'std::bad_alloc'".
The problem actually asks you to find an analytical formula for the solution, not to simulate the pattern. All you need to do is to carefully analyze the pattern:
#include <algorithm>
#include <cassert>
#include <utility>

unsigned int get_n(unsigned int row, unsigned int col) {
    assert(row >= 1 && col >= 1);
    const auto n = std::max(row, col);
    if (n % 2 == 0)
        std::swap(row, col);
    if (col == n)
        return n * n + 1 - row;
    else
        return (n - 1) * (n - 1) + col;
}
Math is your friend, here, not std::vector. One of the constraints of this puzzle is a memory limit of 512MB, but a vector big enough for all the tests would require several GB of memory.
Consider how the square is filled. If you take the maximum of the given x and y (call it w), you have "delimited" a square of size w². Now you only have to consider the outer edge of this square to find the actual index.
E.g. take x = 6 and y = 3. The maximum is 6 (even, so remember the zigzag pattern), and the number is (6 - 1)² + 3 = 28:
* * * * * 26
* * * * * 27
* * * * * [28]
* * * * * 29
* * * * * 30
36 35 34 33 32 31
Here is a proof of concept.
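As a quick sanity check, a minimal harness exercising get_n against the worked example above; the 1-based row/col convention follows the assert in the function:
#include <algorithm>
#include <cassert>
#include <iostream>
#include <utility>

unsigned int get_n(unsigned int row, unsigned int col) {
    assert(row >= 1 && col >= 1);
    const auto n = std::max(row, col);
    if (n % 2 == 0)
        std::swap(row, col);
    if (col == n)
        return n * n + 1 - row;
    else
        return (n - 1) * (n - 1) + col;
}

int main() {
    std::cout << get_n(3, 6) << '\n';  // prints 28, matching the diagram
}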

CUDA: reindexing arrays

In this example, I have three float arrays: query_points[], initial_array[], and result_array[]. Values in query_points[] are rounded down to become index values, and I want to copy the data at those indexes in initial_array[] to result_array[].
The problem I'm having is that every few hundred values I get different values compared to properly working C++ code. I am new to CUDA and not sure what is happening. Please let me know if you can point me towards a solution. Thanks!
CUDA Code:
int w = blockIdx.x * blockDim.x + threadIdx.x; // Col // width
int h = blockIdx.y * blockDim.y + threadIdx.y; // Row // height
int index = h * width + w;

if ((w < width) && (h < height)) {
    int piece = floor(query_points[index]) - 1;
    int piece_index = h * width + piece;
    result_array[index] = initial_array[piece_index];
}
You gave the answer in your own comment: "I also think it may have had to do with the fact that I was passing the same input and output array into the function, trying to do an in place operation."
Your description of the symptom (it only happens occasionally and it only repros on large arrays) also fits the explanation.
Note that it isn't always possible to guard against race conditions if you want full concurrency - you may have to use separate input and output arrays. Merge Sort and Radix Sort both ping-pong between intermediate arrays while processing. I don't think anyone has figured out how to implement those algorithms without O(N) auxiliary space.
I didn't write the code to test it but there are two problems that I can see:
If you are floor-ing a float, then use the floorf() function. I do not think this is the cause, but it is obviously the better way to do it.
The main issue that I can see is subtler, or maybe I am just speculating: floor() and floorf() return double and float respectively. So, when you do:
floor(query_points[index]) - 1;
what you have is still a floating-point value, and it might be smaller than the actual integral value you are supposed to get due to precision loss. When you implicitly cast it to integer by
int piece = floor(query_points[index]) - 1;
you basically truncate the decimal part and get n-1 where you think you are getting n.
Even without this analysis
int piece = floor(query_points[index]) - 1;
In this line, you are flooring and then truncating, which are essentially the same thing for non-negative values, so you do not even need to use floor() or floorf().
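A quick illustration of when flooring and truncation agree (they only diverge for negative inputs, which may or may not occur in query_points[]):
#include <cmath>
#include <cstdio>

int main() {
    // int conversion truncates toward zero; floor rounds toward
    // negative infinity. They match for non-negative values only.
    const float samples[] = {3.7f, 4.0f, -3.7f};
    for (float x : samples)
        std::printf("x=%5.2f  (int)x=%2d  (int)floor(x)=%2d\n",
                    x, (int)x, (int)std::floor(x));
}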