Limiting processor count for multi-threaded applications - C++

I am developing a multi-threaded application which ran fine on my development system, which has 8 cores. When I ran it on a PC with 2 cores, I encountered some synchronization issues.
Apart from turning off hyper-threading, is there any way of limiting the number of cores an application can use, so that I can emulate single- and dual-core environments for testing and debugging?
My application is written in C++ using Visual Studio 2010.

We always test in virtual machines nowadays since it's so easy to set up specific environments with given limitations.
For example, VMware easily allows you to limit the number of processors in use, how much memory there is, hard disk sizes, the presence of USB devices, floppies or printers, and all sorts of other wondrous things.
In fact, we have scripts which do all the work at the push of a button, from restoring the VM to a known initial state, then booting it up, installing the code over the network, running a test cycle then moving the results to an analysis machine on the network as well.
It greatly speeds up and simplifies the testing regime.

You want the SetProcessAffinityMask function or the SetThreadAffinityMask function.
The former works on the whole process and the latter on a specific thread.
You can also limit the active cores via the Windows Task Manager: right-click the process name and select "Set Affinity".
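As a rough sketch of the API approach, assuming you want to emulate a dual-core machine, the affinity can be set at process startup; the mask value here is just an example:

#include <windows.h>
#include <iostream>

int main()
{
    // Bits 0 and 1 set: schedule this process on logical processors 0 and 1 only.
    DWORD_PTR mask = 0x3;
    if (!SetProcessAffinityMask(GetCurrentProcess(), mask))
    {
        std::cerr << "SetProcessAffinityMask failed: " << GetLastError() << '\n';
        return 1;
    }
    // ... create worker threads here; they will only be scheduled on cores 0-1 ...
    return 0;
}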

Related

Windows 7 applications run slower when not focused

I'm attempting to run two applications simultaneously on Windows 7; however, I'm finding that when I do this, whichever has focus runs at a normal speed but the other is clearly running at a far slower speed. (For reference, one is a Unity application and the other is a C++ DirectX application.) Has anyone ever encountered something like this? Is there a way to allow both applications to run at full speed? The system ought to have the resources to run both; neither is very complex. When I monitor the system resources, etc., everything looks good.
Windows automatically allocates fewer system resources to unfocused programs, regardless of their complexity or requirements. I don't believe you can disable that.
That makes sense. I looked into it a bit deeper and found that the Desktop Window Manager was the one causing the headache. I stopped the service, set the processor affinity for each application, and everything was golden after that.

Gaussian blur of a 3D image in CUDA, sometimes it works, sometimes it does not [duplicate]

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they fail and exit out. I realize it's ideal not to have a CUDA application run that long, but assuming CUDA is the correct choice and that, due to the amount of sequential work per thread, it must run that long, is there any way to extend this time limit or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious.
To disable it, open regedit, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD value named DisableBugCheck, and set it to 1.
You may also need to do something in the NVIDIA control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the GPU to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIe bus on every time slice; you can leave your textures, etc. in GPU local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
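A rough sketch of that chunking approach as CUDA host code; the kernel and buffer names are placeholders, not the poster's actual blur:

#include <cuda_runtime.h>
#include <algorithm>

__global__ void blurKernel(const float* in, float* out, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        out[offset + i] = in[offset + i];   // stand-in for the real per-pixel work
}

int main()
{
    const int totalPixels = 1000000;
    const int chunkSize   = 100000;         // output pixels per launch
    const int threads     = 256;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  totalPixels * sizeof(float));
    cudaMalloc(&d_out, totalPixels * sizeof(float));

    for (int offset = 0; offset < totalPixels; offset += chunkSize) {
        int count  = std::min(chunkSize, totalPixels - offset);
        int blocks = (count + threads - 1) / threads;
        blurKernel<<<blocks, threads>>>(d_in, d_out, offset, count);
        cudaDeviceSynchronize();            // let each chunk complete well inside the watchdog limit
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}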
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for CUDA programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - Windows 7 (32/64 bit)
Create a registry value in Windows to change the TDR settings to a higher amount, so that Windows will allow a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the correct registry key area to create the new value: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers.
There will probably already be one value in there called DxgKrnlVersion, stored as a DWORD.
Right-click and create a new REG_DWORD value named TdrDelay. The value assigned to it is the number of seconds before TDR kicks in; it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC; you need to restart before the new value will take effect.
Source: Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and it works fine.
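The same TdrDelay value can also be set programmatically; a rough sketch with the Win32 registry API (it must run elevated, and a reboot is still required afterwards):

#include <windows.h>
#include <iostream>

int main()
{
    HKEY key;
    LONG rc = RegCreateKeyExA(HKEY_LOCAL_MACHINE,
                              "SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                              0, NULL, 0, KEY_SET_VALUE, NULL, &key, NULL);
    if (rc != ERROR_SUCCESS) { std::cerr << "Open failed: " << rc << '\n'; return 1; }

    DWORD delaySeconds = 4;   // seconds before TDR kicks in (Windows defaults to 2)
    rc = RegSetValueExA(key, "TdrDelay", 0, REG_DWORD,
                        reinterpret_cast<const BYTE*>(&delaySeconds), sizeof(delaySeconds));
    if (rc != ERROR_SUCCESS)
        std::cerr << "Write failed: " << rc << '\n';

    RegCloseKey(key);
    return 0;
}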
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with is able to complete in time, save all the state information, stop, and then start again.
Update:
For Linux: exiting X will allow you to run CUDA applications as long as you want. No Tesla required (a GeForce 9600 was used in testing this).
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited as long as you switch to a non-X full-screen terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
The solution I use is:
1. Pass all information to the device.
2. Run iterative versions of the algorithm, where each iteration invokes the kernel on the memory already stored on the device.
3. Finally, transfer the memory back to the host only after all iterations have ended.
This enables control over the iterations from the CPU (including the option to abort), without the costly device<->host memory transfers between iterations.
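A minimal sketch of that host-side pattern; the kernel and variable names are illustrative, not the poster's code:

#include <cuda_runtime.h>
#include <vector>

__global__ void stepKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;   // stand-in for one iteration of the real algorithm
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 0.0f);

    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    // 1. Pass everything to the device once.
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Iterate with short kernel launches; each one is its own short command buffer.
    const int iterations = 1000;
    for (int it = 0; it < iterations; ++it) {
        stepKernel<<<(n + 255) / 256, 256>>>(dev, n);
        cudaDeviceSynchronize();
        // A host-side abort condition could break out of the loop here.
    }

    // 3. Transfer the result back only once, at the end.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}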
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM; it is possible to modify the settings (timeout, behaviour on reaching timeout, etc.) with some registry keys; see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X configuration (likely /etc/X11/xorg.conf): adding Option "Interactive" "0" to the Device section of your GPU does the job.
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.

Isis2 in ns-3 and bridge tap

So I need to simulate Isis2 in ns-3. (I am also to modify Isis2 slightly, wrapping it with some C/C++ code, since I need at least quasi-real-time, mission-critical behavior.)
Since I am far from having any of that implemented, it would be interesting to know whether this is a suitable approach. I need to specifically monitor the performance of the consensus during sporadic wifi (ad hoc) behavior.
Would it make sense to virtualize a machine for each instance of Isis2 and then use the tap bridge model to analyze the traffic in the ns-3 channel?
(I am also to log the events on each instance, composing the various data into a unified presentation.)
You need to start by building an Isis2 application program, and this would have to be done using C/CLI or C++/CLI. C++/CLI will be easier because the match with the Isis2 type system is closer. But as I type these words, I'm trying to remember whether Mono actually supports C++/CLI. If there isn't a Mono compiler for C++/CLI, you might be forced to use C# or IronPython. Basically, you have to work with what the compiler will support.
You'll build this and the library on your mono platform and should test it out, which you can do on any Linux system. Once you have it working, that's the thing you'll experiment with on NS/3. Notice that if you work on Windows, you would be able to use C++/CLI (for sure) and then can just make a Windows VM for NS3. So this would mean working on Windows, but not needing to learn C#.
This is because Isis2 is a library for group communication, multicast, file replication and sharing, DHTs and so forth and to access any particular functionality you need an application program to "drive" it. I wouldn't expect performance issues if you follow the recommendations in the video tutorials and the user manual; even for real-time uses the system is probably both fast enough and steady enough in its behavior.
Then yes, I would take a virtual machine with the needed binaries for Mono (Mono is loaded from DLLs so they need to be available at the right virtual file system locations) and your Isis2 test program and run that within NS3. I haven't tried this but don't see any reason it wouldn't work.
Keep in mind that the default timer settings for timeout and retransmission are very slow and tuned for running on Amazon AWS, inside a data center. So once you have this working, but before simulating your wifi setup, you may want to experiment with tuning the system to be more responsive in that setting. I'm thinking that ISIS_DEFAULTTIMEOUT will probably be way too long for you, and the RTDELAY setting may also be too long for you. Amazon AWS is a peculiar environment and what makes Isis2 stable in AWS might not be ideal in a wifi setting with very different goals... but all of those parameters can be tuned by just setting the desired values in the environment, which can be done in bash on the line that launches your test program, or using the bash export command.

Is there a way for C++ application function to turn on the computer?

I need to find a way to turn on the PC from a C++ application.
Is there any way to do this?
Thanks
If the computer is off, it can't be executing code, and therefore can't turn itself on programmatically.
ACPI changes that somewhat, but for us to be able to help, you have to be more specific about your exact requirements.
If you need to turn on a different computer, take a look at Wake-on-LAN.
You will not be able to write a program that turns on the same computer it is installed on.
If you need to write an application that will turn on a different computer, Wake-on-LAN is the tool for you. Modern desktops have NICs that are always receiving power, even if the computer is in an S5 state, assuming the BIOS supports it and the feature is enabled.
Wake-on-LAN works by sending a Magic Packet to the NIC. The details of what the payload consists of are outlined in the article.
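A hedged Winsock sketch of sending such a magic packet (6 bytes of 0xFF followed by the target MAC repeated 16 times, broadcast over UDP); the MAC address shown is a placeholder:

#include <winsock2.h>
#include <cstring>
#pragma comment(lib, "ws2_32.lib")

int main()
{
    WSADATA wsa;
    WSAStartup(MAKEWORD(2, 2), &wsa);

    const unsigned char mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };   // target NIC's MAC

    unsigned char packet[102];
    std::memset(packet, 0xFF, 6);                // synchronization stream
    for (int i = 1; i <= 16; ++i)
        std::memcpy(packet + i * 6, mac, 6);     // MAC repeated 16 times

    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    BOOL broadcast = TRUE;
    setsockopt(s, SOL_SOCKET, SO_BROADCAST, (char*)&broadcast, sizeof(broadcast));

    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(9);             // WoL is conventionally sent to UDP port 9 (or 7)
    addr.sin_addr.s_addr = INADDR_BROADCAST;

    sendto(s, (char*)packet, sizeof(packet), 0, (sockaddr*)&addr, sizeof(addr));

    closesocket(s);
    WSACleanup();
    return 0;
}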
This is possibly a duplicate of C#: How to wake up system which has been shutdown? (although that is C#).
One way to do it under Windows is to create a timer with CreateWaitableTimer(), set the time with SetWaitableTimer(), and then call WaitForSingleObject(). Your code will pause, and you can put the computer into standby (maybe also hibernation, but not shutdown). When the timer is reached, the PC will resume and so will your program.
See here for a complete example in C. The example shows how to calculate the time difference for the timer, and how to do the waiting in a thread (if you are writing a graphical application).
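A minimal sketch of that waitable-timer approach (the 60-second delay and the mention of SetSuspendState are assumptions added for illustration):

#include <windows.h>
#include <iostream>

int main()
{
    HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);
    if (!timer) return 1;

    LARGE_INTEGER due;
    due.QuadPart = -600000000LL;   // 60 seconds from now, in 100-ns units (negative = relative)

    // The last parameter (fResume = TRUE) asks the system to wake from standby
    // when the timer fires.
    if (!SetWaitableTimer(timer, &due, 0, NULL, NULL, TRUE))
    {
        std::cerr << "SetWaitableTimer failed: " << GetLastError() << '\n';
        return 1;
    }

    // Put the machine into standby here (e.g. SetSuspendState from powrprof.h),
    // then block until the timer fires and the machine resumes.
    WaitForSingleObject(timer, INFINITE);
    std::cout << "Timer fired - machine should have resumed.\n";

    CloseHandle(timer);
    return 0;
}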
I have to add, you can also schedule the computer to wake up using the Windows Task Scheduler ('Wake the computer to run this task'). This possibly also works when the computer is shut down. There is also an option in some computers BIOS to set a wake time.
Under Linux, you can set the computer to wake up by writing to a special file:
echo 2006-02-09 23:05:00 > /proc/acpi/alarm
Note that I haven't tested all of this, and it is highly dependent on the hardware (mainboard), but some kind of wake-up should be available on all modern PCs.
See also: http://en.wikipedia.org/wiki/Real-time_clock_alarm
and here is a program that claims to do it on Windows: http://www.dennisbabkin.com/wosb/
Use strip. If you require a Windows computer to be turned on, the cross-tools i686-w64-mingw32-strip or x86_64-w64-mingw32-strip should be used. These command-line programs modify an executable, and the result is able to turn on a computer.
How could you turn on a computer from an application, when no processes are running on it while it's shut down? You can turn on another computer (Wake-on-LAN), but not the one you are running on.
It is possible.
First thing to do is configure Wake On Lan. Check out this post on Lifehacker on how to do it: http://lifehacker.com/348197/access-your-computer-anytime-and-save-energy-with-wake+on+lan.
(Or this link: http://hblg.info/2011/08/21/Wake-on-LAN-and-remote-login.html)
Then you need to send a magic packet from your C++ application. There are several web services that already do this from JavaScript (wakeonlan.me), but it can be done from within a C++ application as well.
Chances are that if you want to do this, you are working with servers.
In such a case, your mainboard may have an IPMI baseboard management controller.
IPMI may be used to cycle the chassis power remotely.
Generally, the BMC will have its own IP address, to which you may connect to send control messages.

What is a good hardware setup for programming concurrent and distributed applications?

I don't have the money to build my own uber Blade system but I would like to get into concurrent and distributed programming (think CCR/DSS, Hadoop, Project Voldemort etc.).
I currently have a Q6600 with 4GB with some separate hdds but that's about it. While I can write multi-threaded programs I can not properly test distributed filesystems / key-value stores and look for associated bottlenecks (disk access, network, etc.).
Does anyone have some recommendations? Buying some small cheap boxes and setting up a mini network? Or maybe a single box with two i7's and ESX and a simulated network?
edit:
I'm currently using VirtualBox and VMware, and this does not look good enough for me; correct me if I'm wrong: the hard drives could lock, for instance, either because two virtualized machines run on them, or because all hard drive access is channeled through the same HDD controller. The network is entirely virtual, so no real-case test here either.
If I go the virtualization route, what would you recommend so I can get as near to 'real-life' as possible?
Virtualise for your distributed system tests. It's much easier to 'pull the plug' on a machine, disconnect the network cable etc.
Sun VirtualBox is an excellent free virtual machine which I've found extremely convenient for development purposes. Most of it is also Open Source, if you're into that.
As for the multi-threaded part, it's actually easier: always test with more software threads than your number of hardware threads. And then, just for fun, do something like writing a 10 GB file to your hard disk, or plugging/unplugging hardware, to throw the scheduler off. You'll get surprising results.
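A small sketch of that oversubscription idea using std::thread; the 4x factor and the worker body are just illustrative:

#include <thread>
#include <vector>
#include <iostream>

void worker()
{
    // ... exercise the code under test here ...
    std::this_thread::yield();   // encourage extra context switches
}

int main()
{
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 2;         // hardware_concurrency() may return 0 if unknown
    unsigned n = hw * 4;         // deliberately more software threads than hardware threads

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();

    std::cout << "Ran " << n << " threads on " << hw << " hardware threads\n";
    return 0;
}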