Why does this cuda program freeze my display? - c++

I am new to CUDA and I was just trying to make a program which utilized a lot of my GPU. The only issue was that I am also using the card for my display and this froze my screen and required me to reboot.
__global__ void cuda_burn(int* sum)
{
    int x = 0;
    for(int i = 0; i < 1000000000; i++)
    {
        x += i;
    }
    atomicAdd(sum, x);
}
I originally launched it like cuda_burn<<<1024, 1024>>>(sum_d); which killed my display. This makes sense to me because I have enough blocks and threads to fully utilize my gpu which leaves no time for the graphics.
Next I tried to launch it like this cuda_burn<<<1, 1024>>>(sum_d); I thought that since I was only using one block that it would not be able to fully utilize the GPU resources and not freeze my display. Unfortunately it still did. Why?
What is also strange is that the mouse does not freeze.
Also is there a better way of unfreezing the display than rebooting?

Currently, CUDA and display tasks cannot run at the same time. While a CUDA kernel is running, regardless of how much or how little it uses the GPU, display tasks will be "frozen".

Related

lowest latency/high resolution, highest timing guarantee, timing/timer on windows? [duplicate]

I am making a program using the Sleep command via Windows.h, and am experiencing a frustrating difference between running my program on Windows 10 instead of Windows 7. I simplified my program to the program below which exhibits the same behavior as my more complicated program.
On Windows 7 this 5000 count loop runs with the Sleep function at 1ms. This takes 5 seconds to complete.
On Windows 10 when I run the exact same program (exact same binary executable file), this program takes almost a minute to complete.
For my application this is completely unacceptable as I need to have the 1ms timing delay in order to interact with hardware I am using.
I also tried a suggestion from another post to use the select() command (via winsock2), but that command did not give a 1ms delay either. I have tried this program on multiple Windows 7 and Windows 10 PCs, and the root cause always points to Windows 10 rather than Windows 7. The program always runs within ~5 seconds on numerous Windows 7 PCs, while on the multiple Windows 10 PCs that I have tested the duration has been much longer, ~60 seconds.
I have been using Microsoft Visual Studio Express 2010 (C/C++) as well as Microsoft Visual Studio Express 2017 (C/C++) to compile the programs. The version of visual studio does not influence the results.
I have also changed the build configuration from 'Debug' to 'Release' and enabled compiler optimizations, but this did not help either.
Any suggestions would be greatly appreciated.
#include <stdio.h>
#include <Windows.h>
#define LOOP_COUNT 5000
int main()
{
    int i = 0;
    for (i = 0; i < LOOP_COUNT; i++){
        Sleep(1);
    }
    return 0;
}
"I need to have the 1ms timing delay in order to interact with hardware I am using"
Windows is the wrong tool for this job.
If you insist on using this wrong tool, you are going to have to make compromises (such as using a busy-wait and accepting the corresponding poor battery life).
You can make Sleep() more accurate using timeBeginPeriod(1) but depending on your hardware peripheral's limits on the "one millisecond" delay -- is that a minimum, maximum, or the middle of some range? -- it still will fail to meet your timing requirement with some non-zero probability.
The timeBeginPeriod function requests a minimum resolution for periodic timers.
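A minimal sketch of the timeBeginPeriod(1) suggestion, assuming a C++11 compiler. The names TimerResolutionGuard and timed_sleep_loop are hypothetical and were introduced for this illustration. On non-Windows builds the guard does nothing, so the measurement still compiles and runs. It only shows how to check what a loop of 1 ms sleeps really costs; it does not guarantee any timing.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

#ifdef _WIN32
#include <windows.h>
#include <mmsystem.h>               // timeBeginPeriod / timeEndPeriod
#pragma comment(lib, "winmm.lib")   // MSVC-specific: link the multimedia timer library
#endif

// Hypothetical RAII helper: requests 1 ms timer resolution while it is alive.
struct TimerResolutionGuard {
    TimerResolutionGuard() {
#ifdef _WIN32
        timeBeginPeriod(1);
#endif
    }
    ~TimerResolutionGuard() {
#ifdef _WIN32
        timeEndPeriod(1);   // each timeBeginPeriod call is matched by timeEndPeriod
#endif
    }
};

// Sleeps `count` times for 1 ms and returns the total elapsed time in milliseconds.
double timed_sleep_loop(int count) {
    using clock = std::chrono::steady_clock;
    assert(count >= 0);
    TimerResolutionGuard guard;
    const auto start = clock::now();
    for (int i = 0; i < count; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count();
}
```

Calling timed_sleep_loop(5000) with and without the guard is one way to compare against the ~5 second and ~60 second figures from the question. On Windows, Sleep(1) could be called directly in place of sleep_for.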
The right solution for talking to hardware with tight timing tolerances is an embedded microcontroller which talks to the Windows PC through some very flexible interface such as UART or Ethernet, buffers data, and uses hardware timers to generate signals with very well-defined timing.
In some cases, you might be able to use embedded circuitry already existing within your Windows PC, such as "sound card" functionality.
@BenVoigt & @mzimmers thank you for your responses and suggestions. I did find a unique solution to this question and the solution was inspired by the post I have linked directly below.
Units of QueryPerformanceFrequency
In this post BrianP007 writes a function to measure how long a Sleep(1000) call takes. While playing around, I realized that Sleep() accepts 0, so I used a similar structure to the linked post to find how many Sleep(0) iterations it takes to reach a delta t of 1ms.
For my purposes I increased i by 100, however it can be increased by 10 or by 1 in order to get a more accurate estimate as to what i should be.
Once you have a value for i, you can use it to get an approximate 1ms delay on your machine. Running this function in a loop (I ran it 100 times) gave me anywhere from i = 3000 to i = 6000, although my machine averages around 5500. The spread is probably due to jitter and processor clock frequency changes over time.
The processor_check() function below only finds out what value should be returned for the for loop argument; the actual 'timer' needs to just have the for loop with Sleep(0) inside of it to run a timer with ~1ms resolution on the machine.
While this method is not perfect, it is much closer and works a ton better than using Sleep(1). I have to test this more thoroughly, but please let me know if this works for you as well. Please feel free to use the code below if you need it for your own applications. This code should work when copied and pasted directly into an empty console C program in Visual Studio, without modification.
/*ZKR Sleep_ZR()*/
#include <stdio.h>
#include <windows.h>
/*Gets for loop value: grows the Sleep(0) count in steps of 100 until one pass takes >= 1ms*/
int processor_check()
{
    double delta_time = 0;
    int i = 0;
    int n = 0;
    while(delta_time < 0.001){
        LARGE_INTEGER sklick, eklick, cpu_khz;
        /*Counter frequency in ticks per second (despite the variable name)*/
        QueryPerformanceFrequency(&cpu_khz);
        QueryPerformanceCounter(&sklick);
        for(n = 0; n < i; n++){
            Sleep(0);
        }
        QueryPerformanceCounter(&eklick);
        /*Elapsed time in seconds*/
        delta_time = (eklick.QuadPart-sklick.QuadPart) / (double)cpu_khz.QuadPart;
        i = i + 100;
    }
    /*Note: i is one step (100) past the count that was actually timed*/
    return i;
}
/*Timer: yields the time slice cnt times*/
void Sleep_ZR(int cnt)
{
    int i = 0;
    for(i = 0; i < cnt; i++){
        Sleep(0);
    }
}
/*Main*/
int main(int argc, char** argv)
{
    double average = 0;
    int i = 0;
    /*Single use*/
    int loop_count = processor_check();
    Sleep_ZR(loop_count);
    /*Average based on processor to get more accurate Sleep_ZR*/
    for(i = 0; i < 100; i++){
        loop_count = processor_check();
        average = average + loop_count;
    }
    average = average / 100;
    printf("Average: %f\n", average);
    /*10 second test*/
    for (i = 0; i < 10000; i++){
        Sleep_ZR((int)average);
    }
    return 0;
}
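As a variation on the calibrated loop above, the clock can be checked on every pass instead of estimating an iteration count in advance, which avoids the dependence on processor frequency changes. The following is only a sketch under assumptions: spin_wait is a hypothetical name, and std::this_thread::yield() is used as a rough portable stand-in for Sleep(0). Like any busy-wait, it keeps a core occupied for the whole delay and can overshoot if the thread is preempted.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: waits until `delay` has elapsed on the monotonic clock,
// yielding the rest of the time slice on each pass.
// Returns the time actually waited, in microseconds.
long long spin_wait(std::chrono::microseconds delay) {
    using clock = std::chrono::steady_clock;
    assert(delay.count() >= 0);
    const auto start = clock::now();
    const auto deadline = start + delay;
    while (clock::now() < deadline) {
        std::this_thread::yield();
    }
    return std::chrono::duration_cast<std::chrono::microseconds>(
               clock::now() - start).count();
}
```

A 1 ms delay would then be spin_wait(std::chrono::microseconds(1000)); the return value shows how far the wait overshot on a given call.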

How could just loading a dll lead to 100% CPU load in my main application?

I have a perfectly working program which connects to a video camera (an IDS uEye camera) and continuously grabs frames from it and displays them.
However, when loading a specific dll before connecting to the camera, the program runs with 100% CPU load. If I load the dll after connecting to the camera, the program runs fine.
int main()
{
    INT nRet = IS_NO_SUCCESS;
    // init camera (open next available camera)
    m_hCam = (HIDS)0;
    // (A) Uncomment this for 100% CPU load:
    // HMODULE handle = LoadLibrary(L"myInnocentDll.dll");
    // This is the call to the 3rdparty camera vendor's library:
    nRet = is_InitCamera(&m_hCam, 0);
    // (B) Uncomment this instead of (A) and the CPU load won't change
    // HMODULE handle = LoadLibrary(L"myInnocentDll.dll");
    if (nRet == IS_SUCCESS)
    {
        /*
         * Please note: I have removed all lines which are not necessary for the exploit.
         * Therefore this is NOT a full example of how to properly initialize an IDS camera!
         */
        is_GetSensorInfo(m_hCam, &m_sInfo);
        GetMaxImageSize(m_hCam, &m_s32ImageWidth, &m_s32ImageHeight);
        m_nColorMode = IS_CM_BGR8_PACKED;// IS_CM_BGRA8_PACKED;
        m_nBitsPerPixel = 24; // 32;
        nRet |= is_SetColorMode(m_hCam, m_nColorMode);
        // allocate image memory.
        if (is_AllocImageMem(m_hCam, m_s32ImageWidth, m_s32ImageHeight, m_nBitsPerPixel, &m_pcImageMemory, &m_lMemoryId) != IS_SUCCESS)
        {
            return 1;
        }
        else
        {
            is_SetImageMem(m_hCam, m_pcImageMemory, m_lMemoryId);
        }
    }
    else
    {
        return 1;
    }
    std::thread([&]() {
        while (true) {
            is_FreezeVideo(m_hCam, IS_WAIT);
            /*
             * Usually, the image memory would now be grabbed via is_GetImageMem().
             * but as it is not needed for the exploit, I removed it as well
             */
        }
    }).detach();
    cv::waitKey(0);
}
Independently of the actually used camera driver, in what way could loading a dll change the performance of it, occupying 100% of all available CPU cores? When using the Visual Studio Diagnostic Tools, the excess CPU time is attributed to "[External Call] SwitchToThread" and not to the myInnocentDll.
Loading just the dll without the camera initialization does not result in 100% CPU load.
I was first thinking of some static initializers in myInnocentDll.dll configuring some threading behavior, but I did not find anything pointing in this direction. Which aspects should I look for in the code of myInnocentDll.dll?
After a lot of digging I found the answer and it is both frustratingly simple and frustrating by itself:
It is Microsoft's poor support of OpenMP. When I disabled OpenMP in my project, the camera driver runs just fine.
The reason seems to be that Microsoft's OpenMP runtime uses busy waiting. It is possible to configure OMP_WAIT_POLICY manually, but as I was not depending on OpenMP anyway, disabling it was the easiest solution for me.
https://developercommunity.visualstudio.com/content/problem/589564/how-to-control-omp-wait-policy-for-openmp.html
https://support.microsoft.com/en-us/help/2689322/redistributable-package-fix-high-cpu-usage-when-you-run-a-visual-c-201
I still don't understand why the CPU only went up high when using the camera and not when running the rest of my solution, even though the camera library is pre-built and my disabling/enabling of OpenMP compilation cannot have any effect on it. And I also don't understand why they bothered to make a hotfix for VS2010 but have no real fix as of VS2019, which I am using. But the problem is averted.
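To illustrate why a busy-waiting runtime can show up as CPU time inside a yield call such as SwitchToThread, here is a small sketch contrasting an active wait with a passive one. It is not the Microsoft OpenMP runtime's actual implementation; wait_active, PassiveGate and demo_both_policies are hypothetical names, and std::this_thread::yield() merely stands in for SwitchToThread.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Active wait: polls the flag and yields between polls. The thread stays
// runnable, so an otherwise idle core is kept busy for the whole wait.
void wait_active(const std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}

// Passive wait: blocks on a condition variable, so the thread is descheduled
// until another thread opens the gate.
struct PassiveGate {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;

    void wait() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return ready; });
    }
    void open() {
        {
            std::lock_guard<std::mutex> lock(m);
            ready = true;
        }
        cv.notify_all();
    }
};

// Runs one worker with each policy and releases it from the calling thread.
bool demo_both_policies() {
    std::atomic<bool> flag{false};
    std::thread active([&] { wait_active(flag); });
    flag.store(true, std::memory_order_release);
    active.join();

    PassiveGate gate;
    std::thread passive([&] { gate.wait(); });
    gate.open();
    passive.join();
    return true;
}
```

With one actively waiting thread per core, the process reports full CPU load even though no useful work is done, which matches the symptom described in the question.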
You can disable the CPU idle states in the IDS Camera Manager; doing so sets the minimum processor state in the Windows power plan to 100%.
I think this is worth mentioning here, even though you have already solved your problem.

How does one update the GTK+ GUI in C++ with time consuming operations?

I am using OpenMP to perform a time consuming operation. I am unable to update a ProgressBar from GTK+ from within the time consuming loop while the operations are carried out. The code I have updates the ProgressBar, but only after everything is done, not as the code progresses.
This is my dummy code that doesn't update the ProgressBar until everything is done:
void largeTimeConsumingFunction (GtkProgressBar** progressBar) {
    int extensiveOperationSize = 1000000;
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < extensiveOperationSize; i++) {
        // Do something that will take a lot of time with data
        #pragma omp ordered
        {
            // Update the progress bar
            gtk_progress_bar_set_fraction(*progressBar, i/(double)extensiveOperationSize);
        }
    }
}
When I do the same, but without using OpenMP, the same happens. It doesn't get updated until the end.
How could I get that GTK+ Widget to update at the same time the loop is working?
Edit: This is just dummy code to keep it short and readable. It has the same structure as my actual code, but in my actual code I don't know beforehand how many items I will be processing. It could be 10 or more than 1 million items, and I will have to perform some action for each of them.
There are two potential issues here:
First, if you are performing long running computations that might block main thread, you have to call
while (gtk_events_pending ())
    gtk_main_iteration ();
every now and then to keep UI responsive (which includes redrawing itself).
Second, you should call GTK+ functions only from main thread.
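The two points above can be combined into one pattern: do the heavy work on a worker thread, publish progress through an atomic counter, and touch the widget only from the main thread. The sketch below is toolkit-independent so that it stays self-contained; run_with_progress and its callbacks are hypothetical names, and the 10 ms polling interval is an arbitrary choice. In a GTK+ program, the on_progress callback is where one would call gtk_progress_bar_set_fraction() followed by the gtk_events_pending() loop.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs work(i) for i in [0, total) on a worker thread and
// calls on_progress only from the calling (UI) thread.
// Returns the number of items processed.
int run_with_progress(int total,
                      const std::function<void(int)>& work,
                      const std::function<void(double)>& on_progress) {
    assert(work && on_progress);
    if (total <= 0) {
        on_progress(1.0);
        return 0;
    }
    std::atomic<int> done{0};
    std::thread worker([&] {
        for (int i = 0; i < total; ++i) {
            work(i);
            done.fetch_add(1, std::memory_order_release);
        }
    });
    // UI-thread loop: report the current fraction, then wait a little.
    while (true) {
        const int n = done.load(std::memory_order_acquire);
        if (n >= total) break;
        on_progress(static_cast<double>(n) / total);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    worker.join();
    on_progress(1.0);
    return done.load();
}
```

Because the item count is only read through the counter, the same structure works whether there are 10 items or more than a million; the worker body could itself contain the OpenMP loop, as long as it never calls GTK+ functions.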

What is the difference between Pause(), Sleep() and Wait() in C++?

I have been working through the CS106B course from Stanford, and while completing the Boggle assignment, I have noticed that the Sleep() function on Windows behaves differently from the Pause() function. For testing purposes, I have simply set up the board and used the provided gboggle.h file to highlight the Boggle cubes, then remove the highlighting. The following is the relevant code:
for(int row = 0; row < board.numRows(); row++)
{
    for(int col = 0; col < board.numCols(); col++)
    {
        HighlightCube(row, col, true);
    }
}
Pause(0.5);
for(int row = 0; row < board.numRows(); row++)
{
    for(int col = 0; col < board.numCols(); col++)
    {
        HighlightCube(row, col, false);
    }
}
If I use Pause(), the cubes highlight, then return to normal. If I use Sleep() or Wait(), the cubes never highlight, and the delay in the program occurs before the board is even drawn rather than between the for loops. The relevant Wait() function:
void wait ( int seconds )
{
    clock_t endwait;
    endwait = clock () + seconds * CLOCKS_PER_SEC ;
    while (clock() < endwait) {}
}
taken from here. I am using Visual Studio 2005 on Windows XP.
What difference between these functions causes them to act this way?
Edit: I am aware that Sleep and wait require integers. I have tested them using integers and see a delay, but it occurs before the squares are written. Sorry I was not clear about that previously.
Edit2: After looking through some of the other libraries I used, I found that Pause is, in fact, part of the graphics library that simply pauses the graphics buffer.
Sleep wants an integer number of milliseconds and you give it 0.5, so you wait for 0 milliseconds. Your wait function also takes ints, so it has the same problem.
Also your wait function is blocking. As long as you are waiting, your application is busy and uses the CPU for, well waiting. Whereas the windows Sleep function suspends the current thread, meaning your application really does nothing (does not use any CPU resources), until the time is over.
EDIT: I don't know what Pause does, as it is not a WinAPI function.
EDIT: It could be that the results of HighlightCube are only seen once the application gets back to its event loop, which is when the cubes are drawn. That way you highlight them, wait, then un-highlight them; only then does your function return and the application finally gets to draw them. This follows because Sleep (and also your wait) block the application from processing any events (including window paint events). I suppose Pause prevents that by checking back with the event loop. Actually, that's what Greg Domjan already wrote.
I've never seen the Pause command before; perhaps you could provide some code for it?
Windows apps work on the idea of a message pump, and that painting is a low priority.
If you sleep or wait in the message pump thread then you block it from doing any further handling of messages such as drawing the screen.
You need to yield to the message pump so it can do its work.
You might look at usage of Wait for multiple and running a second message pump. (guessing this is the body of Pause).
Since wait takes an int parameter, calling it with 0.5 (as your example uses for Pause) will result in the 0.5 being truncated to 0, so you'll get no delay.
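The truncation is easy to reproduce in isolation. In this small sketch, truncated mirrors the implicit double-to-int conversion that happens at the call site, and seconds_to_ms shows one way to convert fractional seconds into the whole milliseconds that Sleep expects; both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Mirrors what happens implicitly when a double is passed to a function taking int:
// the fractional part is discarded.
int truncated(double value) {
    return static_cast<int>(value);
}

// Hypothetical helper: converts fractional seconds to whole milliseconds,
// rounding to the nearest millisecond.
int seconds_to_ms(double seconds) {
    assert(seconds >= 0.0);
    return static_cast<int>(std::lround(seconds * 1000.0));
}
```

A half-second delay would then be requested as Sleep(seconds_to_ms(0.5)) rather than Sleep(0.5). This only fixes the length of the delay; the drawing problem described above remains as long as the event loop is blocked.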