CUDA app on part of the cards - C++

I've got an NVIDIA Tesla S2050 and a host with an NVIDIA Quadro card, running CentOS 5.5 with CUDA 3.1.
When I run a CUDA app I want to use the four Tesla C2050s, but not the Quadro in the host, so that splitting the job equally five ways (with the slower Quadro included) doesn't drag down overall performance. Is there any way to implement this?

I'm assuming you have four processes and four devices, although your question suggests you have five processes and four devices, which means that manual scheduling may be preferable (with the Tesla devices in "shared" mode).
The easiest approach is to use nvidia-smi to mark the Quadro device as "compute prohibited". You would also mark the Teslas as "compute exclusive", meaning only one context can attach to each of them at any given time.
Run man nvidia-smi for more information.
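If you go that route, the application side can simply skip any device that reports the prohibited compute mode. A minimal sketch (the device names are whatever your system reports, and the per-device work dispatch is left out):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);

            // A device marked "compute prohibited" via nvidia-smi reports this
            // mode; the Quadro would be skipped here.
            if (prop.computeMode == cudaComputeModeProhibited) {
                printf("Skipping device %d (%s): compute prohibited\n", dev, prop.name);
                continue;
            }

            printf("Device %d (%s) is available for compute\n", dev, prop.name);
            // Each worker process would call cudaSetDevice(dev) on one of the
            // remaining Teslas and launch its share of the job there.
        }
        return 0;
    }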

Yes. Check CUDA Support/Choosing a GPU:
Problem
Running your code on a machine with multiple GPUs may result in your code executing on an older and slower GPU.
Solution
If you know the device number of the GPU you want to use, call cudaSetDevice(N). For a more robust solution, include the code shown below at the beginning of your program to automatically select the best GPU on any machine.
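The snippet that quote refers to is not reproduced here; as a rough sketch of the same idea, assuming "best" simply means the device with the most multiprocessors, the selection could look like this:

    #include <cuda_runtime.h>

    // Pick the device with the largest multiprocessor count as a rough
    // "best GPU" heuristic, then make it the current device.
    void selectBestGpu()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        int bestDevice  = 0;
        int bestSmCount = -1;

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            if (prop.multiProcessorCount > bestSmCount) {
                bestSmCount = prop.multiProcessorCount;
                bestDevice  = dev;
            }
        }
        cudaSetDevice(bestDevice);
    }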
Check their website for further explanation.
You may also find this post very interesting.

Related

Why isn't my colab notebook using the GPU?

When I run code on my colab notebook after having selected the GPU, I get a message saying "You are connected to a GPU runtime, but not utilizing the GPU". Now I understand similar questions have been asked before, but I still don't understand why. I am running PCA on a dataset over hundreds of iterations, for multiple trials. Without a GPU it takes about as long as it does on my laptop, which can be >12 hours, resulting in a time out on colab. Is colab's GPU restricted to machine learning libraries like tensorflow only? Is there a way around this so I can take advantage of the GPU to speed up my analysis?
Colab is not restricted to Tensorflow only.
Colab offers three kinds of runtimes: a standard runtime (with a CPU), a GPU runtime (which includes a GPU) and a TPU runtime (which includes a TPU).
"You are connected to a GPU runtime, but not utilizing the GPU" indicates that the user is conneted to a GPU runtime, but not utilizing the GPU, and so a less costly CPU runtime would be more suitable.
Therefore, you have to use a package that utilizes the GPU, such as Tensorflow or Jax. GPU runtimes also have a CPU, and unless you are specifically using packages that exercise the GPU, it will sit idle.

Gauss Blur 3d image in cuda, sometimes it works sometimes it does not [duplicate]

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?
I'm not a CUDA expert; I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.
You can disable the Windows watchdog timer, but that is strongly discouraged, for reasons that should be obvious.
To disable it, open regedit, navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display, create a REG_DWORD named DisableBugCheck and set it to 1.
You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.
Ideally, you should be able to break your kernel operations up into multiple passes over your data, so that each pass runs within the time limit.
Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.
The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?
You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.
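As a sketch of that decomposition in CUDA terms (the kernel body and the names processChunk/processInChunks are placeholders; the point is the per-launch offset and chunk size, with the buffers staying in device memory between launches):

    #include <cuda_runtime.h>

    // Stand-in kernel: each launch processes only a slice of the output,
    // starting at 'offset', so no single launch has to run for long.
    __global__ void processChunk(const float* in, float* out,
                                 int offset, int chunkSize, int total)
    {
        int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
        if (i < offset + chunkSize && i < total)
            out[i] = in[i] * 2.0f;   // placeholder for the real per-pixel work
    }

    // d_in and d_out are assumed to already live in device memory.
    void processInChunks(const float* d_in, float* d_out, int total)
    {
        const int chunkSize = 100000;   // e.g. 10 launches for 1,000,000 outputs
        const int threads   = 256;

        for (int offset = 0; offset < total; offset += chunkSize) {
            int thisChunk = (total - offset < chunkSize) ? (total - offset) : chunkSize;
            int blocks    = (thisChunk + threads - 1) / threads;
            processChunk<<<blocks, threads>>>(d_in, d_out, offset, thisChunk, total);
            // Waiting here keeps each slice's work bounded; the data itself
            // never leaves GPU memory between launches.
            cudaDeviceSynchronize();
        }
    }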
Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.
[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx
[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.
[EDIT August 2018 to update:]
Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys
On Windows, the graphics driver has a watchdog timer that kills any shader programs that run for more than 5 seconds. Note that the Xorg/XFree86 drivers don't do this, so one possible workaround is to run the CUDA apps on Linux.
AFAIK it is not possible to disable the watchdog timer on Windows. The only way to get around this on Windows is to use a second card that has no displayed screens on it. It doesn't have to be a Tesla but it must have no active screens.
Resolve Timeout Detection and Recovery - WINDOWS 7 (32/64 bit)
Create a registry key in Windows to change the TDR settings to a higher amount, so that Windows will allow a longer delay before the TDR process starts.
Open Regedit from Run or a command prompt.
In Windows 7, navigate to the registry area where the new key is created: HKEY_LOCAL_MACHINE > SYSTEM > CurrentControlSet > Control > GraphicsDrivers.
There will probably be one key in there already, called DxgKrnlVersion, as a DWord.
Right-click and create a new REG_DWORD value named TdrDelay. The value assigned to it is the number of seconds before TDR kicks in; it is currently 2 by default in Windows (even though the registry value doesn't exist until you create it). Assign it a new value (I tried 4 seconds), which doubles the time before TDR. Then restart the PC; the new value does not take effect until you restart.
Source from Win7 TDR (Driver Timeout Detection & Recovery)
I have also verified this and it works fine.
The most basic solution is to pick a point in the calculation, some percentage of the way through, that I am sure the GPU I am working with can complete in time, save all the state information and stop, then start again from there.
Update:
For Linux: Exiting X will allow you to run CUDA applications for as long as you want. No Tesla is required (a 9600 was used in testing this).
One thing to note, however, is that if X is never entered, the drivers probably won't be loaded, and it won't work.
It also seems that for Linux, simply not having any X displays up at the time will also work, so X does not need to be exited entirely as long as you switch to a non-X full-screen virtual terminal.
This isn't possible. The time-out is there to prevent bugs in calculations from taking up the GPU for long periods of time.
If you use a dedicated card for CUDA work, the time limit is lifted. I'm not sure if this requires a Tesla card, or if a GeForce with no monitor connected can be used.
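One way to check from code whether a given card is subject to the time limit is the kernelExecTimeoutEnabled field of cudaDeviceProp; a small sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            // kernelExecTimeoutEnabled is nonzero when the watchdog applies
            // to kernels running on this device.
            printf("Device %d (%s): run-time limit on kernels %s\n",
                   dev, prop.name,
                   prop.kernelExecTimeoutEnabled ? "enabled (watchdog applies)"
                                                 : "disabled");
        }
        return 0;
    }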
The solution I use is:
1. Pass all information to device.
2. Run iterative versions of algorithms, where each iteration invokes the kernel on the memory already stored within the device.
3. Finally transfer memory to host only after all iterations have ended.
This enables control over iterations from CPU (including option to abort), without the costly device<-->host memory transfers between iterations.
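A sketch of that pattern, where the kernel body and the stopping condition are placeholders rather than any particular algorithm:

    #include <cuda_runtime.h>

    // Placeholder kernel: one iteration of the algorithm, operating entirely
    // on data already resident in device memory.
    __global__ void iterationStep(float* d_state, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            d_state[i] += 1.0f;   // stand-in for the real update rule
    }

    void runIterations(float* h_state, int n, int maxIters)
    {
        float* d_state = 0;
        cudaMalloc(&d_state, n * sizeof(float));

        // 1. Pass all information to the device once, up front.
        cudaMemcpy(d_state, h_state, n * sizeof(float), cudaMemcpyHostToDevice);

        const int threads = 256;
        const int blocks  = (n + threads - 1) / threads;

        // 2. Each iteration is a separate, short kernel launch, so the CPU
        //    keeps control between iterations and can abort early if needed.
        for (int iter = 0; iter < maxIters; ++iter) {
            iterationStep<<<blocks, threads>>>(d_state, n);
            cudaDeviceSynchronize();
            // An abort or convergence check could go here.
        }

        // 3. Transfer results back to the host only after all iterations end.
        cudaMemcpy(h_state, d_state, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_state);
    }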
The watchdog timer only applies on GPUs with a display attached.
On Windows the timer is part of the WDDM; it is possible to modify the settings (timeout, behaviour on reaching the timeout, etc.) with some registry keys; see this Microsoft article for more information.
It is possible to disable this behavior in Linux. Although the "watchdog" has an obvious purpose, it may cause some very unexpected results when doing extensive computations using shaders / CUDA.
The option can be toggled in your X configuration (likely /etc/X11/xorg.conf).
Adding Option "Interactive" "0" to the Device section for your GPU does the job.
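For illustration, such a Device section might look like the following (the Identifier is just a placeholder for whatever your existing configuration uses):

    Section "Device"
        Identifier  "Device0"          # placeholder identifier
        Driver      "nvidia"
        Option      "Interactive" "0"  # disable the watchdog for this GPU
    EndSection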
See CUDA Visual Profiler 'Interactive' X config option? for details on the config, and ftp://download.nvidia.com/XFree86/Linux-x86/270.41.06/README/xconfigoptions.html#Interactive for a description of the parameter.

How to setup a dedicated GPU in order to benchmark a CUDA kernel?

I want to use a second GPU device as a dedicated device under Linux, in order to benchmark a kernel.
The kernel that I am testing is a SIMD computing kernel without reductions, and no X server is attached to the GPU. The device is a GeForce GTX 480, so I suppose the compute capability is 2.0; therefore advanced features such as dynamic parallelism are not available.
Using the nvidia-smi utility, there are various modes in which to set up the GPU:
"Default" means multiple contexts are allowed per device.
"Exclusive Process" means only one context is allowed per device, usable from multiple threads at a time.
"Prohibited" means no contexts are allowed per device (no compute apps).
Which is the best mode to set the GPU to in order to obtain a benchmark that is as faithful as possible?
What is the command that I should use in order to make such a setup permanent?
I am compiling the kernel using the following flags:
nvcc --ptxas-options=-v -O3 -w -arch=sm_20 -use_fast_math -c -o
Is there a better combination of flags to get more help from the compiler toward faster execution times?
Any suggestion will be greatly appreciated.
My question is about what is more appropriate: setting the GPU to a compute-exclusive mode or not.
It should not matter whether you set the GPU to exclusive-process or Default, as long as there is only one process attempting to use that GPU.
You generally would not want to use exclusive-thread except in specific situations, because exclusive-thread could prevent multi-threaded GPU apps from running correctly, and may also interfere with other functions such as profiler functions.
What is the command that I should use in order to make such a setup permanent?
If you refer to the nvidia-smi command line help (nvidia-smi --help) or the nvidia-smi man page (man nvidia-smi), you can determine the command to make the change. Any changes you make will be permanent until they are explicitly changed again.

I have two GPUs, how can I just let one to do certain CUDA task?

New to CUDA, but I have spent some time on computing, and I have GeForces at home and a Tesla (same generation) in the office.
At home I have two GPUs installed in the same computer: one is a GK110 (compute capability 3.5), the other a GF110 (compute capability 2.0). I prefer to use the GK110 for computation tasks ONLY and the GF110 for display, UNLESS I tell it to do computation. Is there a way to do this through a driver setting, or do I still need to rewrite some of my code?
Also, if I understand correctly, if the display port of the GK110 is not connected, then the annoying Windows timeout detection will not try to reset it even if the computation time is very long?
By the way, my CUDA code is compiled for both compute_35 and compute_20 so it can run on both GPUs; however, I plan to use features exclusive to the GK110, so in the future the code may not be able to run on the GF110 at all. The OS is Windows 7.
With a GeForce GTX Titan (or any GeForce product) on Windows, I don't believe there is a way to prevent the GPU from appearing in the system in WDDM mode, which means that windows will build a display driver stack on it, even if the card has no physical display attached to it. So you may be stuck with the windows TDR mechanism. You could try experimenting with it to confirm that. (The windows TDR behavior can be modified via registry hacking).
Regarding steering CUDA tasks to the GTX Titan, the display driver control panel should have a selectable setting for this. It may be in the "Manage 3D settings" area or some other area depending on which driver you have. When you find the appropriate settings area, there will be a selection entitled something like CUDA - GPUs which will probably be set to "All". If you change the "Global Presets" selection to "Base Profile" you should be able to change this CUDA-GPUs setting. Clicking on it should give you a selection of "All" or a set of checkboxes for each GPU detected. If you uncheck the GF110 device and check the GK110 device, then CUDA programs that do not select a particular GPU via cudaSetDevice() should be steered to the GK110 device based on this checkbox selection. You may want to experiment with this as well to confirm.
Other than that, as mentioned in the comments, using a programmatic method, you can always query device properties and then select the device that reports itself as a cc3.5 device.
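A minimal sketch of that programmatic approach, assuming you simply want the first device that reports compute capability 3.5:

    #include <cuda_runtime.h>

    // Find the first device reporting compute capability 3.5 (the GK110)
    // and make it the current device; returns the device id, or -1 if none.
    int selectCc35Device()
    {
        int count = 0;
        cudaGetDeviceCount(&count);

        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            if (prop.major == 3 && prop.minor == 5) {
                cudaSetDevice(dev);
                return dev;
            }
        }
        return -1;   // no cc 3.5 device found
    }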

Is there any good way to get an indication of whether a computer can run a specific program/software?

Is there any good way to get an indication of whether a computer is capable of running a program/software without any performance problems, using pure JavaScript (Google V8) or C++ (Windows, Mac OS & Linux), while requiring as little information as possible from the software creator (like a CPU score or a GPU score)?
That way I can give my users a good indication of whether their computer is good enough to run the software, so the user doesn't need to download and install it in the first place if she/he will not be able to run it anyway.
I'm thinking of something like score-based indications:
CPU: 230 000 (generic processor score)
GPU: 40 000 (generic GPU score)
+ Network/File I/O read/write requirements
That way I only need to calculate those scores on the user's computer and then compare them, as long as I'm using the same algorithm, but I have no clue about any such algorithm that would be sufficient for real-world desktop software.
I would suggest testing for the existence of specific libraries and environment features (OS version, video card presence, working sound drivers, DirectX, OpenGL, Gnome, KDE). Assign priorities to these libraries and make the comparison using those priorities, e.g. video card presence is more important than KDE presence.
The problem is, even outdated hardware can run most software without issues (just slower), but newest hardware cannot run some software without installing requirements.
For example, I can run Firefox 11 on my Pentium III coppermine (using FreeBSD and X server), but if you install windows XP on the newest hardware with six-core i7 and nVidia GTX 640 it still cannot run DirectX 11 games.
This method requires no assistance from the software creator, but is not 100% accurate.
If you want 90+% accurate information, make the software creator check 5-6 checkboxes before uploading. Example:
My application requires DirectX/OpenGL/3D acceleration
My application requires sound
My application requires Windows Vista or later
My application requires a [high bandwidth] network connection
then you can test specific applications using information from these checkboxes.
Edit:
I think additional checks could be:
video/audio codecs
pixel/vertex/geometry shader version, GPU physics acceleration (may be crucial for games)
processor extensions (SSE2, MMX, etc.), though these are not so relevant anymore
third-party software such as PDF readers, Flash, etc.
system libraries (libpng, libjpeg, svg)
system version (Service Pack number, OS edition (Premium, Professional, etc.))
window manager (some apps on OS X require X11 to function, some apps on Linux work only on KDE, etc.)
These are actual requirements I (and many others) have seen when installing different software.
As for old hardware, if the computer satisfies hardware requirements (pixel shader version, processor extensions, etc), then there's a strong reason to believe the software will run on the system (possibly slower, but that's what benchmarks are for if you need them).
For GPUs, I do not think a score is usable or even obtainable without running some code on the machine to test whether it is up to spec.
With GPUs this typically means checking which Shader Models the card supports, and either defaulting to a lower shader model (so the application runs with less "quality"/complexity) or telling the user they have no hope of running the code and quitting.