I'm attempting to run two applications simultaneously on Windows 7, but whichever one has focus runs at normal speed while the other clearly runs far slower. (For reference, one is a Unity application and the other is a C++ DirectX application.) Has anyone encountered something like this? Is there a way to allow both applications to run at full speed? The system ought to have the resources to run both; neither is very complex, and when I monitor system resources everything looks fine.
Windows automatically gives fewer system resources to unfocused programs, no matter their complexity or requirements. I don't believe you can disable that.
That makes sense. I looked into it a bit more deeply and found that the Desktop Window Manager was the one causing the headache. I stopped the service, set the processor affinity for each application, and everything was golden after that.
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they fail and exit. I realize it's ideal not to have a CUDA application run that long, but assuming that CUDA is the correct choice and that the amount of sequential work per thread requires it to run that long, is there any way to extend this time limit or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious.
To disable it, open Regedit, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD value named DisableBugCheck, and set it to 1.
You may also need to change something in the NVIDIA Control Panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel work into multiple passes over your data, so that each pass runs within the time limit.
Alternatively, you can divide the problem domain so that each command computes fewer output pixels. That is, instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU that compute 100,000 each.
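To make that concrete, here is a rough sketch of the chunked-dispatch idea in CUDA (the kernel body and all names are invented placeholders, not code from any particular application):

    #include <cuda_runtime.h>

    // Stand-in kernel: each thread computes one output pixel in [offset, offset + count).
    __global__ void computePixels(float *out, int offset, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            out[offset + i] = float(offset + i) * 0.5f;  // placeholder for the real per-pixel work
    }

    // Issue many small launches instead of one huge one; each launch is a
    // separately scheduled command, so each one stays under the watchdog limit.
    void computeAllPixels(float *d_out, int total)
    {
        const int chunk   = 100000;  // output pixels per launch
        const int threads = 256;
        for (int offset = 0; offset < total; offset += chunk) {
            int count  = (total - offset < chunk) ? total - offset : chunk;
            int blocks = (count + threads - 1) / threads;
            computePixels<<<blocks, threads>>>(d_out, offset, count);
            cudaDeviceSynchronize();  // let this command buffer complete before the next
        }
    }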
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIe bus on every time slice; you can leave your textures, etc., in GPU local memory. You just need some command buffers to complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again; see the updates below for the most recent information) The registry key above is out of date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a higher amount, so that Windows allows a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the registry area where the new key needs to be created: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers. There will probably already be one value in there, DxgKrnlVersion, as a DWORD.
Right-click and create a new REG_DWORD value named TdrDelay. The value assigned to it is the number of seconds before TDR kicks in; it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC; the value won't take effect until you do.
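If you prefer, the same value can be created from an elevated command prompt instead of clicking through Regedit (shown here with the 4-second delay used above; the restart is still required):

    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 4 /f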
Source: Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this myself, and it works fine.
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with can complete in time; save all the state information there and stop; then start again from that point.
Update:
For Linux: exiting X will allow you to run CUDA applications as long as you want. No Tesla is required (a 9600 was used in testing this).
One thing to note, however, is that if X is never started, the drivers probably won't be loaded, and it won't work.
It also seems that, for Linux, simply not having any X display up at the time will also work, so X does not need to be exited as long as you switch to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent buggy calculations from tying up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to the device.
2. Run iterative versions of the algorithm, where each iteration invokes the kernel on the memory already stored on the device.
3. Finally, transfer the memory back to the host only after all iterations have ended.
This enables control over the iterations from the CPU (including the option to abort), without the costly device<-->host memory transfers between iterations.
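A minimal sketch of this pattern (the kernel and all names are invented for illustration):

    #include <cuda_runtime.h>

    // Stand-in kernel: one short, bounded step of the algorithm per launch.
    __global__ void iterateStep(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = 0.5f * (data[i] + 1.0f);  // placeholder for the real step
    }

    void runIterations(float *host, int n, int iterations, volatile bool *abortRequested)
    {
        float *d = 0;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, host, n * sizeof(float), cudaMemcpyHostToDevice);  // 1. transfer once

        int threads = 256, blocks = (n + threads - 1) / threads;
        for (int it = 0; it < iterations && !*abortRequested; ++it) {
            iterateStep<<<blocks, threads>>>(d, n);                      // 2. iterate on device memory
            cudaDeviceSynchronize();  // each launch stays short; the CPU keeps control
        }

        cudaMemcpy(host, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // 3. transfer back once
        cudaFree(d);
    }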
The watchdog timer only applies to GPUs with a display attached.
On Windows the timer is part of the WDDM. It is possible to modify the settings (timeout, behaviour on reaching timeout, etc.) with some registry keys; see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X configuration (likely /etc/X11/xorg.conf).
Adding Option "Interactive" "0" to the Device section for your GPU does the job.
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.
I am developing a cross-platform fractal explorer using Qt. I am experiencing a performance problem specifically when running on a single-core CPU under Windows XP (the program is compiled with MSVC Express 2010); I haven't tried other versions of Windows. With two cores the program runs fine. It also runs fine under Linux with either one core or two (compiled with GCC).
The performance problem has something to do with calling a slot in the widget via a signal from the calculation thread. The widget contains a QImage, and I pass a pointer to its pixels to the calculation thread. The thread calculates the fractal and plots the pixels into the image. At the end of each row, the thread emits a signal to the widget telling it to update the display in the main thread. As I understand it, this is a queued connection.
With Windows and a single CPU the update is very slow, much slower than the calculation. It makes the program unusable.
The relevant code is similar to the Mandelbrot example in the Qt docs, except that my signal has no arguments, because the QImage is located in the widget, not the thread, and I do not convert the QImage to a QPixmap.
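In outline, the arrangement looks like this (a simplified sketch; the identifiers are invented for illustration):

    #include <QThread>

    class RenderThread : public QThread
    {
        Q_OBJECT
    public:
        RenderThread(uchar *pixels, int width, int height)
            : m_pixels(pixels), m_width(width), m_height(height) {}

    signals:
        void rowFinished();  // no arguments; the QImage lives in the widget

    protected:
        void run()
        {
            for (int y = 0; y < m_height; ++y) {
                for (int x = 0; x < m_width; ++x)
                    m_pixels[y * m_width + x] = iterate(x, y);
                emit rowFinished();  // delivered to the GUI thread via a queued connection
            }
        }

    private:
        uchar iterate(int x, int y) { return uchar((x ^ y) & 0xFF); }  // stand-in for the fractal math

        uchar *m_pixels;
        int m_width, m_height;
    };

    // In the widget that owns the QImage:
    //     connect(m_thread, SIGNAL(rowFinished()), this, SLOT(update()));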
Does anybody have any idea what the problem could be and how to go about solving it? Is it something to do with scheduling or time-slice allocation? Is there a compiler flag in MSVC that I need to set? Or do I need to modify my program somehow?
Thanks very much!
You say the update is slower than the calculation: how much slower? Have you done any comprehensive profiling to see where exactly the bottleneck occurs? A cursory Google search finds this profiler, which may help you.
Remember that on older CPUs, thread context switching is very expensive. This may be part of your problem, though again I don't know the specifics.
I have ported an app from the Visual Studio compiler to MinGW and hit a problem: performance degradation. CPU usage increased from 30% to 100%.
There is one interesting thing: if I run Windows Media Player before or while running my app, my app's performance becomes fine again. CPU usage drops to 30% and it works faster (about 10 times faster).
I googled this and found that it relates to a service called the Multimedia Class Scheduler Service (MMCSS). The main problem is that this service only exists under Windows Vista and later, while I tested and ported my app under Windows XP.
So, does anyone know how to use this feature under XP? And how does Windows Media Player increase the performance of my app?
Windows Media Player changes the resolution of the system multimedia timer. Basically, this matters when your application really should be using something like the high-performance timer but is using the multimedia timer instead, which simply doesn't have (and isn't intended to have) the accuracy or resolution of a high-performance timer. As a result, any timings in your program don't work as they should, which is especially bad if you're trying to sleep or block for a fixed time.
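For what it's worth, the timer-resolution change that Windows Media Player triggers can be reproduced directly with the winmm multimedia timer API, which exists on XP as well; a minimal sketch:

    // Request 1 ms system timer resolution for the lifetime of the app,
    // which is essentially the side effect Windows Media Player has.
    #include <windows.h>
    #include <mmsystem.h>  // timeBeginPeriod / timeEndPeriod
    // MSVC: link winmm.lib; MinGW: add -lwinmm to the link line.

    int main()
    {
        timeBeginPeriod(1);  // ask for 1 ms timer granularity
        // ... run the application ...
        timeEndPeriod(1);    // always pair with timeBeginPeriod
        return 0;
    }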
I'm considering writing a new Windows GUI app, where one of the requirements is that the app must be very responsive, quick to load, and have a light memory footprint.
I've used WTL for previous apps I've built with this type of requirement, but as I use .NET all the time in my day job, WTL is getting more and more painful to go back to. I'm not interested in using .NET for this app, as I still find the performance of larger .NET UIs lacking, but I am interested in using a better C++ framework for the UI, like Qt.
What I want to be sure of before starting is that I'm not going to regret this on the performance front.
So: Is Qt fast?
I'll try to qualify the question with examples of what I'd like to come close to matching: my current WTL app is Programmer's Notepad. The current version I'm working on weighs in at about 4 MB of code for a 32-bit, release-compiled version with a single language translation. On a modern, fast PC it takes 1-3 seconds to load, which is important because people fire it up often to avoid IDEs, etc. The memory footprint is usually 12-20 MB on 64-bit Win7 once you've been editing for a while. You can run the app non-stop, leave it minimized, whatever, and it always jumps to attention instantly when you switch to it.
For the sake of argument let's say I want to port my WTL app to Qt for potential future cross-platform support and/or the much easier UI framework. I want to come close to if not match this level of performance with Qt.
Just chiming in with my experience, in case you still haven't solved it or anyone else is looking for more data points. I've recently developed a pretty heavy (regular QGraphicsView, OpenGL QGraphicsView, QtSQL database access, ...) application with Qt 4.7, and I'm also a stickler for performance. That includes startup performance, of course; I like my applications to show up nearly instantly, so I spend quite a bit of time on that.
Speed: fantastic, I have no complaints. My heavy app, which needs to instantiate at least 100 widgets on startup alone (granted, a lot of those are QLabels), starts up in a split second (I don't notice any delay between double-clicking and the window appearing).
Memory: this is the bad part. In my experience, Qt with many subsystems in use does take a noticeable amount of memory. Then again, that does count the many subsystems used: QtXml, QtOpenGL, QtSql, QtSvg, you name it, I use it. My current application manages to use about 50 MB at startup, but it starts up lightning fast and responds swiftly as well.
Ease of programming / API: Qt is an absolute joy to use, from its containers to its widget classes to its modules, all the while making memory management easy (the QObject system) and maintaining super performance. I've always written pure Win32 before this, and I will never go back. For example, with the QtConcurrent classes I was able to change a method invocation from myMethod(arguments) to QtConcurrent::run(this, &MyClass::myMethod, arguments), and with one single line a non-GUI heavy-processing method was threaded. With a QFuture and QFutureWatcher I could monitor when the thread had ended (either with signals or just method checking). What ease of use! Very elegant design all around.
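Spelled out, that pattern looks roughly like this (class and slot names are placeholders, and the snippet lives inside a QObject-derived class):

    #include <QtConcurrentRun>
    #include <QFutureWatcher>

    // Run a heavy method off the GUI thread and get a signal when it finishes.
    QFuture<void> future = QtConcurrent::run(this, &MyClass::myMethod, arguments);

    QFutureWatcher<void> *watcher = new QFutureWatcher<void>(this);
    connect(watcher, SIGNAL(finished()), this, SLOT(onMethodDone()));
    watcher->setFuture(future);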
So, in retrospect: very good performance (including app startup), quite high memory usage if many submodules are used, a fantastic API and great possibilities, and cross-platform support.
Going with the native API is the most performant choice by definition: anything other than that is a wrapper around the native API.
What exactly do you expect to be the performance bottleneck? Do you have any hard numbers? Honestly, a vague "very responsive, quick to load, and light memory footprint" sounds like a requirements-gathering bug to me. Performance is often overspecified.
To the point:
Qt's signal-slot mechanism is really fast. It's statically typed and translates, via MOC, into quite simple slot method calls.
Qt offers nice multithreading support, so you can have a responsive GUI in one thread and whatever else in other threads without much hassle. That might work for you.
Programmer's Notepad is a text editor that uses Scintilla as its text-editing core component and WTL as its UI library.
JuffEd is a text editor that uses QScintilla as its text-editing core component and Qt as its UI library.
I have installed the latest versions of Programmer's Notepad and JuffEd and studied the memory footprint of both editors using Process Explorer.
Empty file:
- juffed.exe Private Bytes: 4,532K Virtual Size: 56,288K
- pn.exe Private Bytes: 6,316K Virtual Size: 57,268K
"wtl\Include\atlctrls.h" (264K, ~10.000 lines, scrolled from beginning to end a few times):
- juffed.exe Private Bytes: 7,964K Virtual Size: 62,640K
- pn.exe Private Bytes: 7,480K Virtual Size: 63,180K
After a select all (Ctrl-A), cut (Ctrl-X) and paste (Ctrl-V):
- juffed.exe Private Bytes: 8,488K Virtual Size: 66,700K
- pn.exe Private Bytes: 8,580K Virtual Size: 63,712K
Note that while scrolling (Pg Down / Pg Up held down), JuffEd seemed to eat more CPU than Programmer's Notepad.
Combined exe and dll sizes:
- juffed.exe + QtXml4.dll + QtGui4.dll + QtCore4.dll + qscintilla2.dll + mingwm10.dll + libjuff.dll: 14 MB
- pn.exe + SciLexer.dll + msvcr80.dll + msvcp80.dll + msvcm80.dll + libexpat.dll + ctagsnavigator.dll + pnse.dll: 4.77 MB
The above comparison is not entirely fair, because JuffEd was not compiled with Visual Studio 2005, which should generate smaller binaries.
We have been using Qt for multiple years now, developing a good-sized UI application with various elements in the UI, including a 3D window. Whenever we hit a major slowdown in app performance, it is usually our fault (we do a lot of database access) and not the UI's.
They have done a lot of work over the last few years to speed up drawing (which is where most of the time is spent). In general, unless you are really implementing some kind of editor, not much time is usually spent executing code inside the UI. It mostly waits on input from the user.
Qt is a very nice framework, but there is a performance penalty. This has mostly to do with painting. Qt uses its own renderer for painting everything: text, rectangles, you name it. To the underlying window system, every Qt application looks like a single window with a big bitmap inside. No nested windows, no nothing. This is good for flicker-free rendering and maximum control over the painting, but it comes at the price of completely forgoing any possibility of hardware acceleration. Hardware acceleration is still noticeable nowadays, e.g. when filling large rectangles with a single color, as is often the case in windowing systems.
That said, Qt is "fast enough" in almost all cases.
I mostly notice slowness when running on a MacBook whose CPU fan is very sensitive and comes to life after only a few seconds of moderate CPU activity. Using the mouse to scroll around in a Qt application loads the CPU a lot more than scrolling around in a native application. The same goes for resizing windows.
As I said, Qt is fast enough, but if increased battery drain matters to you, or if you care about very smooth window resizing, then you don't have much choice besides going native.
Since you seem to consider a 3-second application startup "fast", it doesn't sound like you would care much about Qt's performance, though. I would consider a 3-second startup dog-slow, but opinions on that naturally vary.
The overall program performance will of course be up to you, but I don't think that you have to worry about the UI. Thanks to the graphics scene and OpenGL support you can do fast 2D/3D graphics too.
Last but not least, an example from my own experience: an application running on an embedded machine with 128 MB of RAM, under both Linux and XP. The Windows version uses MFC, the Linux version uses Qt. It has a custom user GUI with lots of painting, and a regular admin GUI with controls/widgets. From a user's point of view, Qt is as fast as MFC. Note: it was a full-screen program that could not be minimized.
Edited after you added more info:
You can expect a larger executable size (especially with Qt under MinGW) and higher memory usage. In your case, try playing with one of the IDEs (e.g. Qt Creator) or text editors written in Qt and see what you think.
I personally would choose Qt, as I've never seen a performance hit from using it. That said, you can get a little closer to native with wxWidgets and still have a cross-platform app. You'll never be quite as fast as straight Win32 or MFC (and family), but you gain a multi-platform audience. So the question for you is: is this worth a small trade-off?
My experience is mostly with MFC, and more recently with C#. MFC is pretty close to the bare metal, so unless you define a ton of data structures, it should be pretty quick.
For graphics painting, I always find it useful to render to a memory bitmap, and then blt that to the screen. It looks faster, and it may even be faster, because it's not worrying about clipping.
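In bare Win32/GDI terms, the pattern is roughly this sketch (MFC's CDC/CBitmap wrappers are analogous):

    // Paint into a memory bitmap, then BitBlt the finished frame to the
    // screen in one go: no flicker, and no per-primitive work on screen.
    #include <windows.h>

    void OnPaint(HWND hwnd)
    {
        PAINTSTRUCT ps;
        HDC screen = BeginPaint(hwnd, &ps);

        RECT rc;
        GetClientRect(hwnd, &rc);

        HDC     mem = CreateCompatibleDC(screen);
        HBITMAP bmp = CreateCompatibleBitmap(screen, rc.right, rc.bottom);
        HBITMAP old = (HBITMAP)SelectObject(mem, bmp);

        FillRect(mem, &rc, (HBRUSH)(COLOR_WINDOW + 1));
        // ... draw the rest of the scene into 'mem' here ...

        BitBlt(screen, 0, 0, rc.right, rc.bottom, mem, 0, 0, SRCCOPY);

        SelectObject(mem, old);
        DeleteObject(bmp);
        DeleteDC(mem);
        EndPaint(hwnd, &ps);
    }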
There is usually some kind of performance problem that creeps in, in spite of my trying to avoid it. I use a very simple way to find these problems: just wait until it's subjectively slow, pause it, and examine the call stack. I do this a number of times; 10 is usually more than enough. It's a poor man's profiler, but it works well, with no fuss and no bother. The problem is always something no one could have guessed, and usually easy to fix. This is why the technique works.
If there are dialogs of any complexity, I use my own technique, Dynamic Dialogs, because I'm spoiled. They are not for the faint-of-heart, but are very flexible and perform nicely.
I once made an app to determine the "primeness" of a number (whether it was prime or composite).
I first attempted a Qt GUI, and it took 5 hours to return the answer for 1,299,827 on a computer with 8 GB of RAM and an AMD 1090T @ 4 GHz, running no other foreground processes under Linux.
My second attempt used a QProcess running a console application built from the exact same code. On a laptop with 1.3 GB of RAM and a 1.4 GHz CPU, the answer came back with no perceivable delay.
I will not deny, though, that Qt is far easier than GTK+ or Win32, and it handles things quite nicely; but separate intensive processing ENTIRELY from the GUI if you use it.