I am trying to implement the QuickHull algorithm (for convex hull) in parallel in CUDA. It works correctly for input sizes <= 1 million points. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all the data structures in my algorithm collectively require no more than 600 MB for this input size, which is less than 50% of the available space.
By commenting out lines of my kernels, I found out that the crash occurs when I try to access an array element, even though the index of the element I am trying to access is not out of bounds (double checked). The following is the kernel code where it crashes.
for (unsigned int i = old_setIndex; i < old_setIndex + old_setS[tid]; i++)
{
    int pI = old_set[i];
    if (pI <= -1 || pI > pts.size())
    {
        printf("Thread %d: i = %d, pI = %d\n", tid, i, pI);
        continue;
    }
    p = pts[pI];
    double d = distance(A, B, p);
    if (d > dist)
    {
        dist = d;
        furthestPoint = i;
        fpi = pI;
    }
}
//fpi = old_set[furthestPoint];
//printf("Thread %d: Furthestpoint = %d\n", tid, furthestPoint);
My code crashes when I uncomment the statements (the array access and the printf) after the for loop. I am unable to explain the error, as furthestPoint is always within the bounds of the old_set array. old_setS stores the size of the smaller array that each thread operates on. It crashes even if I just try to print the value of furthestPoint (last line) without the array access statement above it.
There's no problem with the above code for input sizes <= 1 million. Am I overflowing some buffer on the device in the case of 10 million?
Please help me in finding the source of the crash.
There is no out of bounds memory access in your code (or at least not one which is causing the symptoms you are seeing).
What is happening is that your kernel is being killed by the display driver because it is taking too long to execute on your display GPU. All CUDA platform display drivers include a time limit for any operation on the GPU. This exists to prevent the display from freezing for long enough that either the OS kernel panics or the user panics and thinks the machine has crashed. On the Windows platform you are using, the time limit is about 2 seconds.
What has partly misled you into thinking the source of the problem is array addressing is that commenting out code makes the problem disappear. But what really happens there is an artifact of compiler optimization. When you comment out a global memory write, the compiler recognizes that the calculations which lead to the value being stored are unused, and it removes all of that code from the assembly it emits (google "nvcc dead code removal" for more information). That has the effect of making the code run much faster, which puts it under the display driver time limit.
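To illustrate the effect, here is a minimal sketch (a hypothetical kernel, not your code): if the final store to global memory is commented out, nvcc can prove the loop result is never used and removes the whole loop, so the kernel returns almost instantly.

// Hypothetical kernel used only to illustrate dead code removal.
__global__ void sum_rows(const double *in, double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += in[tid * n + i];
    out[tid] = acc;   // comment this line out and the whole loop is optimized away
}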
For workarounds, see this recent Stack Overflow question and answer.
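One common workaround (a sketch under the assumption that the per-point work is independent; process_chunk and d_points are hypothetical names, not the linked answer's code) is to split the work across several shorter kernel launches so that each launch finishes well under the watchdog limit:

const unsigned int total = 10000000;  // hypothetical problem size (10 million points)
const unsigned int chunk = 1000000;   // chosen so a single launch finishes well under ~2 s
for (unsigned int offset = 0; offset < total; offset += chunk)
{
    const unsigned int count = (total - offset < chunk) ? (total - offset) : chunk;
    process_chunk<<<(count + 255) / 256, 256>>>(d_points, offset, count);
    cudaDeviceSynchronize();          // return control to the display driver between launches
}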
Related
The program
I have a C++ program that looks something like the following:
<load data from disk, etc.>
// Get some buffers aligned to 4 KiB
double* const x_a = static_cast<double*>(std::aligned_alloc(......));
double* const p = static_cast<double*>(std::aligned_alloc(......));
double* const m = static_cast<double*>(std::aligned_alloc(......));
double sum = 0.0;
const auto timerstart = std::chrono::steady_clock::now();
for (uint32_t i = 0; i < reps; i++) {
    uint32_t pos = 0;
    double factor;
    if ((i % 2) == 0) factor = 1.0; else factor = -1.0;
    for (uint32_t j = 0; j < xyzvec.size(); j++) {
        pos = j*basis::ndist; // ndist is a compile-time constant == 36
        for (uint32_t k = 0; k < basis::ndist; k++) x_a[k] = distvec[k+pos];
        sum += factor*basis::energy(x_a, &coeff[0], p, m);
    }
}
const auto timerstop = std::chrono::steady_clock::now();
<free memory, print stats, etc.>
where reps is a single-digit number, xyzvec has ~15k elements, and a single call to basis::energy(...) takes about 100 µs to return. The energy function is huge in terms of code size (~5 MiB of source code that looks something like this; it comes from a code generator).
Edit: The m array is somewhat large, ~270 KiB for this test case.
Edit 2: Source code of the two functions responsible for ~90% of execution time
All of the pointers entering energy are __restrict__-qualified and declared to be aligned via __assume_aligned(...), and the object files are generated with -Ofast -march=haswell to allow the compiler to optimize and vectorize at will. Profiling suggests the function is currently frontend-bound (L1i cache misses and fetch/decode).
energy does no dynamic memory allocation or I/O, and mostly reads/writes x_a, m, and p (x_a is const), all of which are aligned to 4 KiB page boundaries. Its execution time ought to be pretty consistent.
The strange timing behaviour
Running the program many times, and looking at the time elapsed between the timer start/stop calls above, I have found it to have a strange bimodal distribution.
Calls to energy are either "fast" or "slow", fast ones take ~91 µs, slow ones take ~106 µs on an Intel Skylake-X 7820X.
All calls to energy in a given process are either fast or slow, the metaphorical coin is flipped once, when the process starts.
The outcome is not quite random, and can be heavily biased towards the "fast" case by purging all kernel caches via echo 3 | sudo tee /proc/sys/vm/drop_caches immediately before execution.
The random effect may be CPU dependent. Running the same executable on a Ryzen 1700X yields both faster and much more consistent execution. The "slow" runs either don't happen or their prominence is much reduced. Both machines are running the same OS. (Ubuntu 20.04 LTS, 5.11.0-41-generic kernel, mitigations=off)
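One way to confirm the per-process fast/slow split is to time each call individually; below is a minimal sketch (reusing the names from the program above; needs <chrono> and <cstdio>), with the 98 µs threshold chosen arbitrarily between the two observed modes:

std::size_t fast_calls = 0, slow_calls = 0;
for (uint32_t j = 0; j < xyzvec.size(); j++) {
    const uint32_t pos = j * basis::ndist;
    for (uint32_t k = 0; k < basis::ndist; k++) x_a[k] = distvec[k + pos];

    const auto t0 = std::chrono::steady_clock::now();
    sum += basis::energy(x_a, &coeff[0], p, m);
    const auto t1 = std::chrono::steady_clock::now();

    // Bucket the call using an arbitrary threshold between ~91 us and ~106 us.
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    if (us < 98.0) ++fast_calls; else ++slow_calls;
}
std::printf("fast calls: %zu, slow calls: %zu\n", fast_calls, slow_calls);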
What could be the cause?
Data alignment (dubious, the arrays intensively used are aligned)
Code alignment (maybe, but I have tried printing the function pointer of energy, no correlation with speed)
Cache aliasing?
JCC erratum?
Interrupts, scheduler activity?
Some cores turbo boosting higher? (probably not, tried launching it bound to a core with taskset and tried all cores one by one, could not find one that was always "fast")
???
Edit
Zero-filling x_a, p and m before first use appears to make no difference to the timing pattern.
Replacing (i % 2) with factor *= -1.0 appears to make no difference to the timing pattern.
I have implemented a pixel mask class used for checking for pixel-perfect collisions. I am using SFML, so the implementation is fairly straightforward:
Loop through each pixel of the image and decide whether it is true or false based on its transparency value. Here is the code I have used:
// Create an Image from the given texture
sf::Image image(texture.copyToImage());

// Measure the time this function takes
sf::Clock clock;
sf::Time time = sf::Time::Zero;
clock.restart();

// Reserve memory for the pixelMask vector to avoid repeated allocation
pixelMask.reserve(image.getSize().x);

// Loop through every pixel of the texture
for (unsigned int i = 0; i < image.getSize().x; i++)
{
    // Create the mask for one line
    std::vector<bool> tempMask;
    // Reserve memory for the line mask to avoid repeated allocation
    tempMask.reserve(image.getSize().y);
    for (unsigned int j = 0; j < image.getSize().y; j++)
    {
        // If the pixel is not transparent
        if (image.getPixel(i, j).a > 0)
            // Some part of the texture is there --> push back true
            tempMask.push_back(true);
        else
            // The user can't see this part of the texture --> push back false
            tempMask.push_back(false);
    }
    pixelMask.push_back(tempMask);
}

time = clock.restart();
std::cout << std::endl << "The creation of the pixel mask took: " << time.asMicroseconds() << " microseconds (" << time.asSeconds() << ")";
I have used an instance of sf::Clock to measure the time.
My problem is that this function takes ages (e.g. 15 seconds) for larger images (e.g. 1280x720). Interestingly, only in debug mode. When compiling the release version, the same texture/image only takes 0.1 seconds or less.
I have tried to reduce memory allocations by using the resize() method, but it didn't change much. I know that looping through almost 1 million pixels is slow, but it should not be 15 seconds slow, should it?
Since I want to test my code in debug mode (for obvious reasons), and I don't want to wait five minutes until all the pixel masks have been created, what I am looking for is basically a way to:
Either optimise the code (have I missed something obvious?)
Or get something similar to the release performance in debug mode.
Thanks for your help!
Optimizing For Debug
Optimizing for debug builds is generally a very counter-productive idea. It can even push you to optimize for debug in a way that not only makes the code harder to maintain, but may even slow down release builds. Debug builds are in general going to be much slower to run. Even with the flattest kind of C code I write, which doesn't pose much for an optimizer to do beyond reasonable register allocation and instruction selection, it's normal for the debug build to take 20 times longer to finish an operation. That's just something to accept rather than change too much.
That said, I can understand the temptation to do so at times. Sometimes you want to debug a certain part of the code, only for the other operations in the software to take ages, requiring you to wait a long time before you can even get to the code you are interested in tracing through. I find in those cases that it helps, if you can, to separate debug mode input sizes from release mode (e.g. having the debug mode only work with an input that is 1/10th of the original size). That does cause discrepancies between release and debug, which is a negative, but the positives sometimes outweigh the negatives from a productivity standpoint. Another strategy is to build parts of your code in release and just debug the parts you're interested in, like building a plugin in debug against a host application built in release.
Approach at Your Own Peril
With that aside, if you really want to make your debug builds run faster and accept all the risks associated with that, then the main way is to give your compiler less to optimize away. That's going to mean flatter code, typically with more plain old data types, fewer function calls, and so forth.
First and foremost, you might be spending a lot of time on debug mode assertions for safety. See things like checked iterators and how to disable them:
https://msdn.microsoft.com/en-us/library/aa985965.aspx
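For example (MSVC only, and only if you accept losing the checks), a sketch of forcing the iterator debug level down in a debug build; the macro must be defined consistently across every translation unit and linked library, or you get link errors:

// Define this before including any standard library header (or pass
// /D_ITERATOR_DEBUG_LEVEL=0 project-wide). 0 disables checked/debug
// iterators; the default in MSVC debug builds is 2.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>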
For your case, you can easily flatten your nested loop into a single loop. There's no need to create these pixel masks with separate containers per scanline, since you can always get at your scanline data with some basic arithmetic (y*image_width or y*image_stride). So initially I'd flatten the loop. That might even help modestly for release mode. I don't know the SFML API so I'll illustrate with pseudocode.
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
for (int j=0; j < num_pixels; ++j)
    pixelMask[j] = image.pixelAlpha(j) > 0;
Just that already might help a lot. Hopefully SFML lets you access pixels with a single index without having to specify column and row (x and y). If you want to go even further, it might help to grab the pointer to the array of pixels from SFML (also hopefully possible) and use that:
const int num_pixels = image.w * image.h;
vector<bool> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
for (int j=0; j < num_pixels; ++j)
{
    // Assuming 32-bit pixels (should probably use uint32_t).
    // Note that no right shift is necessary when you just want
    // to check for non-zero values.
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelMask[j] = alpha > 0;
}
Also, vector<bool> stores each boolean as a single bit. That saves memory but translates into some extra instructions for random access. Sometimes you can get a speedup, even in release, by just using more memory. I'd test both release and debug and time carefully, but you can try this:
const int num_pixels = image.w * image.h;
vector<char> pixelMask(num_pixels);
const unsigned int* pixels = image.getPixels();
char* pixelUsed = &pixelMask[0];
for (int j=0; j < num_pixels; ++j)
{
    const unsigned int alpha = pixels[j] & 0xff000000;
    pixelUsed[j] = alpha > 0;
}
Loops are faster if they work with constants:
1. In for (unsigned int i = 0; i < image.getSize().x; i++), fetch image.getSize() into a local variable before the loop.
2. Move the mask for one line (std::vector<bool> tempMask;) out of the loop and reuse it; lines are all of the same length, I assume.
This should speed you up a bit. A sketch of both changes follows below.
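A minimal sketch of both points applied to the original loop (using the same SFML calls as in the question):

// 1. Query the image size once, outside the loops.
const auto size = image.getSize();
pixelMask.reserve(size.x);

// 2. Create the per-line mask once and reuse it for every column.
std::vector<bool> tempMask(size.y);
for (unsigned int i = 0; i < size.x; i++)
{
    for (unsigned int j = 0; j < size.y; j++)
        tempMask[j] = image.getPixel(i, j).a > 0;
    pixelMask.push_back(tempMask);   // copies the reused line mask into the 2D mask
}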
Note that compiling for debugging produces very different machine code.
I am writing some code and recently I found an error. The simplified version is shown below.
#include <stdio.h>
#include <stdlib.h> // for exit()
#include <cuda.h>

#define DEBUG 1

inline void check_cuda_errors(const char *filename, const int line_number)
{
#ifdef DEBUG
    cudaThreadSynchronize();
    cudaError_t error = cudaGetLastError();
    if (error != cudaSuccess)
    {
        printf("CUDA error at %s:%i: %s\n", filename, line_number, cudaGetErrorString(error));
        exit(-1);
    }
#endif
}

__global__ void make_input_matrix_zp()
{
    unsigned int row = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int col = blockIdx.x*blockDim.x + threadIdx.x;
    printf("col: %d (%d*%d+%d) row: %d (%d*%d+%d)\n", col, blockIdx.x, blockDim.x, threadIdx.x, row, blockIdx.y, blockDim.y, threadIdx.y);
}

int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();
    //check_cuda_errors(__FILE__, __LINE__);
    return 0;
}
The first (inline) function is for checking for CUDA errors.
The second, the kernel function, simply calculates the current thread's indices, stores them in 'row' and 'col', and prints these values. I assume there is no problem with the inline function, since it comes from a reliable source.
The problem is that when I run the program, it does not seem to execute the kernel function even though it is called in main. However, if I delete the comment notation '//' in front of
check_cuda_errors(__FILE__, __LINE__);
the program seems to enter the kernel function and shows some values printed by printf. But it does not show the full combination of 'col' and 'row' indexes. In particular, 'blockIdx.y' does not vary much: it only shows the values 4 and 5, but not 0, 1, 2 or 3.
The first thing I do not understand:
As far as I know, 'gridDim' gives the dimensions of the grid of blocks. That means the block indexes run through the combinations (0,0)(0,1)(0,2)(0,3)(0,4)(0,5)(1,0)(1,1)(1,2)(1,3)... and so on, and the size of each block is 16 by 16. However, if you run this program, it does not show the full combination; it just shows several combinations and then ends.
The second thing I do not understand:
Why does the kernel function depend on the function named 'check_cuda_errors'? When this function is present, the program at least runs, although imperfectly. However, when the error-checking function is commented out, the kernel function does not show any printed values.
This is very simple code, but I couldn't find the problem for several days. Is there anything that I missed, or am I misunderstanding something?
My working environment is like this.
"GeForce GT 630"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 2.1
Ubuntu 14.04
The CUDA GPU printf subsystem relies on a FIFO buffer to store printed output. If your output exceeds the size of the buffer, some or all of the previous content of the FIFO buffer will be overwritten by subsequent output. This is what will be happening in this case.
You can query and change the size of the buffer using the runtime API functions cudaDeviceGetLimit and cudaDeviceSetLimit. If your device has the resources available to expand the limit, you should be able to see all the output your code emits.
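A minimal sketch of doing that from the host, before the kernel launch (the 16 MB figure is just an example):

// Query the current printf FIFO size, then enlarge it before launching the kernel.
size_t fifoSize = 0;
cudaDeviceGetLimit(&fifoSize, cudaLimitPrintfFifoSize);
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);   // e.g. 16 MB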
As an aside, relying on the kernel printf feature for anything other than simple diagnostics or lightweight debugging is a terrible idea, and you have probably just proven to yourself that you should be looking at other methods of verifying the correctness of your code.
Regarding your second question: the printf buffer is flushed to the output only when the host synchronizes with the device, for example via a call to cudaDeviceSynchronize, cudaThreadSynchronize, cudaMemcpy, and others (see B.17.2, Limitations, in the Formatted Output appendix).
When check_cuda_errors is uncommented, its call to cudaThreadSynchronize is what triggers the buffer to be printed. When it is commented out, the main thread simply terminates before the kernel runs to completion, and nothing else happens.
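So even with the error-checking helper left commented out, an explicit synchronization after the launch is enough to flush the output; a minimal sketch of the adjusted main:

int main()
{
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(6, 6, 1);
    make_input_matrix_zp<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();   // wait for the kernel to finish and flush its printf buffer
    return 0;
}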
In a C++ dynamic library I solve a least squares problem using the Eigen library. This DLL is called from a Python program where the problem configuration is set up. For a small problem the code works properly and returns the correct solution. If the number of points increases, the library throws std::bad_alloc.
More precisely, the code which creates the error, simplified as much as possible, is
try {
    matrixA = new Eigen::MatrixXd(sizeX,NvalidBtuple); // initialize A
    for (int i=0;i<sizeX;++i) {
        int secondIndex = 0;
        for (int k=0;k<btermSize;++k) {
            if (bterm[k] == 1) { // select btuple that are validated by density exclusion
                // product of terms
                (*matrixA)(i,secondIndex) = 1.0;
                secondIndex += 1;
            }
        }
    }
} catch (std::bad_alloc& e) {
    errorString = "Error 3: bad allocation in computation of coefficients!";
    std::cout<<errorString<<" "<<e.what()<<std::endl;
    return;
} catch (...) {
    errorString = "Error 4: construction of matrix A failed! Unknown error.";
    std::cout<<errorString<<std::endl;
    return;
}
where matrixA is defined in the header file with Eigen::MatrixXd *matrixA;.
If sizeX and NvalidBtuple are smaller than about 20'000 x 3'000, the matrix definition works. If the size is bigger, it crashes.
The computer on which I did the tests has enough memory available, about 15 GB of free RAM.
Is this a heap/stack problem?
How can I make the library accept bigger matrices?
Any comment is welcome. Thanks.
Edit:
As remarked in an answer below, I was not clear about the NvalidBtuple definition:
NvalidBtuple = 0;
for (int i=0;i<btermSize;++i) {NvalidBtuple += bterm[i];}
where bterm is a boolean vector. Thus, since in the loop we do the check if (bterm[k] == 1), the secondIndex is always smaller than NvalidBtuple.
From the details of your question, the matrix takes 480 MB of RAM. A 32-bit application can only access 2 GB of RAM (see e.g. How much memory can a 32 bit process access on a 64 bit operating system?); the allocation fails because there is no free contiguous 480 MB block in the address space of the application.
The best way to solve the problem is to recompile the application as 64-bit. You won't be able to run it on a 32-bit system, but that shouldn't be a problem, since you couldn't run your algorithm on such a system anyway due to the limited memory.
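For reference, the 480 MB figure comes from the sizes quoted in the question; a quick sanity-check sketch that also reveals whether the build is 32-bit or 64-bit could look like this:

#include <cstddef>
#include <cstdio>

int main()
{
    // 20,000 rows * 3,000 columns * 8 bytes per double = 480,000,000 bytes (~458 MiB).
    const std::size_t rows = 20000, cols = 3000;
    std::printf("matrix A needs %zu bytes\n", rows * cols * sizeof(double));

    // 4 means the binary was built as 32-bit, 8 means 64-bit.
    std::printf("pointer size: %zu bytes\n", sizeof(void*));
    return 0;
}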
The basic problem was as follows:
When I run the kernel below with N threads and don't include the four lines that instantiate and populate the ScaledLLA variable, everything works fine.
When I run the kernel below with N threads and do include the four lines that instantiate and populate the ScaledLLA variable, the GPU locks up and Windows throws a "display driver not responding" error.
If I reduce the number of threads running by reducing the grid size, everything works fine.
I'm new to CUDA and have been incrementally building out some GIS functionality.
My host code looks like this at the kernel call:
MapperKernel<<<g_CUDAControl->aGetGridSize(), g_CUDAControl->aGetBlockSize()>>>(
    g_Deltas.lat, g_Deltas.lon, 32.2,
    g_DataReader->aGetMapper().aGetRPCBoundingBox()[0], g_DataReader->aGetMapper().aGetRPCBoundingBox()[1],
    g_CUDAControl->aGetBlockSize().x,
    g_CUDAControl->aGetThreadPitch(),
    LLA_Offset,
    LLA_ScaleFactor,
    RPC_XN, RPC_XD, RPC_YN, RPC_YD,
    Pixel_Offset, Pixel_ScaleFactor,
    device_array);
cudaDeviceSynchronize(); //code crashes here

host_array = (point3D*)malloc(num_bytes);
cudaMemcpy(host_array, device_array, num_bytes, cudaMemcpyDeviceToHost);
The kernel that is being called looks like this:
__global__ void MapperKernel(double deltaLat, double deltaLon, double passedAlt,
                             double minLat, double minLon,
                             int threadsperblock,
                             int threadPitch,
                             point3D LLA_Offset,
                             point3D LLA_ScaleFactor,
                             double * RPC_XN, double * RPC_XD, double * RPC_YN, double * RPC_YD,
                             point2D pixelOffset, point2D pixelScaleFactor,
                             point3D * rValue)
{
    //calculate thread's LLA
    int latindex = threadIdx.x + blockIdx.x*threadsperblock;
    int lonindex = threadIdx.y + blockIdx.y*threadsperblock;
    point3D LLA;
    LLA.lat = ((double)(latindex))*deltaLat + minLat;
    LLA.lon = ((double)(lonindex))*deltaLon + minLon;
    LLA.alt = passedAlt;

    //scale thread's LLA - adding these four lines is what causes the problem
    point3D ScaledLLA;
    ScaledLLA.lat = (LLA.lat - LLA_Offset.lat) * LLA_ScaleFactor.lat;
    ScaledLLA.lon = (LLA.lon - LLA_Offset.lon) * LLA_ScaleFactor.lon;
    ScaledLLA.alt = (LLA.alt - LLA_Offset.alt) * LLA_ScaleFactor.alt;

    rValue[lonindex*threadPitch + latindex] = ScaledLLA; //if I assign LLA without calculating ScaledLLA everything works fine
}
If I assign LLA to rValue, then everything executes quickly and I get the expected behavior; however, when I add those four lines for ScaledLLA and try to assign it to rValue, CUDA takes too long for Windows's liking at the cudaDeviceSynchronize() call and I get a
"display driver not responding" error that then proceeds to reset the GPU. From looking around, the error appears to be a Windows thing that occurs when Windows believes the GPU isn't being responsive. I am certain that the kernel is running and performing the right calculations, because I have stepped through it with the NSIGHT debugger.
Does anybody have a good explanation for why adding those four lines to the kernel would cause the execution time to spike?
I'm running Win7 with VS 2013 and have Nsight 4.5 installed.
For those who get here later via a search engine: it turned out the problem was the card running out of memory.
That should probably have been one of the first things to think of, since the problem only occurred after the instantiation was added.
The card only had so much memory (~2 GB), and my rValue buffer was taking up most of it (~1.5 GB). With every thread trying to instantiate its own point3D variable, the card simply ran out of memory.
For those interested, NSight's profiler reported it as an unknown CUDA error.
The fix was to lower the number of threads running the kernel.
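For anyone debugging something similar, a hedged sketch of a check that would have pointed at the cause: query the free device memory before allocating the big output buffer (num_bytes here stands for the size of device_array from the question):

size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);
printf("device memory: %zu MB free of %zu MB\n", freeBytes >> 20, totalBytes >> 20);
if (freeBytes < num_bytes)   // num_bytes: size of device_array, as in the question
{
    // Not enough room: shrink the grid or process the area in tiles.
}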