Why isn't my colab notebook using the GPU? - google-cloud-platform

When I run code in my Colab notebook after selecting the GPU runtime, I get a message saying "You are connected to a GPU runtime, but not utilizing the GPU". I understand similar questions have been asked before, but I still don't understand why this happens. I am running PCA on a dataset over hundreds of iterations, for multiple trials. Without a GPU it takes about as long as it does on my laptop, which can be >12 hours, resulting in a timeout on Colab. Is Colab's GPU restricted to machine learning libraries like TensorFlow only? Is there a way around this so I can take advantage of the GPU to speed up my analysis?

Colab is not restricted to TensorFlow only.
Colab offers three kinds of runtimes: a standard runtime (CPU only), a GPU runtime (which includes a GPU), and a TPU runtime (which includes a TPU).
"You are connected to a GPU runtime, but not utilizing the GPU" means exactly what it says: your code is running on the runtime's CPU while the attached GPU sits idle, so a less costly standard runtime would be more suitable.
To benefit from the GPU, you therefore have to use a package that actually executes work on it, such as TensorFlow or JAX. GPU runtimes also have a CPU, and unless your code goes through such a package, everything runs there and the GPU stays idle.
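As a concrete illustration, here is a minimal sketch of a GPU-backed PCA step using JAX, one of the packages mentioned above; the data shape is made up and this is not your actual analysis, but when a GPU is attached, the matrix operations and the eigendecomposition run on it rather than on the CPU.

    import numpy as np
    import jax.numpy as jnp
    from jax import device_put

    # Hypothetical stand-in for the real dataset: 10,000 samples x 500 features.
    X_host = np.random.rand(10_000, 500).astype(np.float32)

    X = device_put(X_host)                   # copy the array onto the GPU (if one is attached)
    X = X - jnp.mean(X, axis=0)              # center the columns
    cov = (X.T @ X) / (X.shape[0] - 1)       # covariance matrix, computed on the GPU
    eigvals, eigvecs = jnp.linalg.eigh(cov)  # eigendecomposition also runs on the GPU
    top = eigvecs[:, ::-1][:, :10]           # leading 10 principal components
    print(top.shape)

Plain NumPy and scikit-learn only ever use the CPU, so switching the runtime type without changing the code has no effect.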

Related

Batch processing mode in Caffe - no performance gains

Following on from this thread, I reimplemented my image processing code to send in 10 images at once (i.e. I now have the num property of the input blob set to 10 instead of 1).
However, the time required to process this batch is 10 times longer than before, which means I did not get any performance increase.
Is that to be expected, or did I do something wrong?
I am running Caffe in CPU mode. Unfortunately GPU mode is not an option for me.
Update: Caffe now natively supports parallel processing of multiple images when using multiple GPUs. Though it seems relatively simple to implement based on the current implementation of GPU parallelism, at the moment there is no similar support for parallel processing on multiple CPUs.
Considering that the main problem with implementing parallelism is the synchronization you need during training, if you just want to process your images in parallel (as opposed to training the model), you can load several copies of the same network into memory (whether through Python with multiprocessing or C++ with multithreading) and process each image on a different network, as in the sketch below. It is simple and quite effective, especially if you load the networks once and then process a large number of images. Nevertheless, GPUs are much faster :)
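A rough sketch of that "several copies of the same network" idea with pycaffe and Python's multiprocessing; the model files, image list, and preprocessing below are placeholders and must match your own deploy prototxt:

    import multiprocessing as mp

    def worker(image_paths):
        import caffe                      # imported in the worker so each process owns its copy
        caffe.set_mode_cpu()
        net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)  # placeholder files
        outputs = []
        for path in image_paths:
            img = caffe.io.load_image(path)          # H x W x C, floats in [0, 1]
            # ...resize / mean-subtract here so the shape matches the input blob...
            net.blobs['data'].data[...] = img.transpose(2, 0, 1)[None, ...]
            outputs.append({k: v.copy() for k, v in net.forward().items()})
        return outputs

    if __name__ == '__main__':
        images = ['img_%04d.jpg' % i for i in range(1000)]   # placeholder file names
        chunks = [images[i::4] for i in range(4)]            # one chunk per process
        with mp.Pool(4) as pool:
            results = pool.map(worker, chunks)

Each process holds its own copy of the weights, so memory use grows with the number of processes; the win comes from keeping every core busy on a different image.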
Caffe doesn't process multiple images in parallel; the only saving you get from batching several images is in the time it takes to transfer the image data into and out of Caffe's framework, which can be significant when dealing with the GPU.
IIRC there are several attempts to make Caffe process images in parallel, but most focus on the GPU implementation (cuDNN, CUDA streams, etc.), with few attempts to add parallelism to the CPU code (OpenBLAS's multithreaded mode, or simply running on multiple threads). Of those, I believe only the cuDNN option is currently part of the stable version of Caffe, but it obviously requires a GPU. You can look at the pull requests about this matter on Caffe's GitHub page and see if one works for you, but note that it might cause compatibility issues with your current version.
This is one such version that I've used in the past, though it's no longer maintained: https://github.com/BVLC/caffe/pull/439
I've also noticed in the last comment of the issue above that there is some speed-up to the CPU code in this pull request as well, though I've never tried it myself: https://github.com/BVLC/caffe/pull/2610

How can I run code directly on a processor with a file system?

I have a simple anisotropic filter C/C++ program that processes a .pgm image, which is a text file with greyscale information for each pixel, and after processing generates an output image with the filter applied.
The program takes a few seconds to do about 10 iterations on an x86 CPU running Windows.
Together with an academic finishing his master's degree in applied computing, I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running it directly on the soft-core processor (Nios II).
We have successfully booted the uClinux OS on the FPGA, but there are some hardware errors, so we cannot access the SD card or Ethernet, and therefore cannot get the code and image onto the FPGA to test its performance.
So I am asking for an alternative way to test our code's performance directly on a CPU with a file system, so the code can read the image and generate another one.
The alternative could be either a low-cost, easy-to-use product (I was thinking of a Raspberry Pi), or somewhere I could upload the code to have it run automatically and report the results.
Thanks in advance.
What you're trying to do is benchmark some software on a multi-GHz x86 processor versus a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even an embedded Linux), it has threading overhead and so on; this cannot be considered running it "directly" on the CPU (whatever you mean by that).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible; I don't know how it's done with an Altera board, but Xilinx has libraries for accessing data from an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must include a memory controller. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II Software Build Tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP project.
Once the project is created, go to the BSP properties, enter the offset of the external memory in the Linker tab, and generate the BSP.
Compile the project and run it as Nios II hardware.
This will run your application from external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed to the console.

Is there any good way to get an indication of whether a computer can run a specific program/software?

Is there any good way to get an indication of whether a computer is capable of running a program/software without performance problems, using pure JavaScript (Google V8) or C++ (Windows, Mac OS & Linux), while requiring as little information as possible from the software creator (like a CPU score or GPU score)?
That way I can give my users a good indication of whether their computer is good enough to run the software, so they don't need to download and install it in the first place if they won't be able to run it anyway.
I'm thinking of something like score-based indications:
CPU: 230 000 (generic processor score)
GPU: 40 000 (generic GPU score)
+ Network/File I/O read/write requirements
That way I would only have to calculate those scores on the user's computer and then compare them, as long as I use the same algorithm, but I have no clue about any such algorithm that would be sufficient for real-world desktop software.
I would suggest testing for the existence of specific libraries and the environment (OS version, video card presence, working sound drivers, DirectX, OpenGL, Gnome, KDE). Assign priorities to these checks and compare using the priorities, e.g. video card presence is more important than KDE presence.
The problem is that even outdated hardware can run most software without issues (just more slowly), while the newest hardware cannot run some software without installing its requirements.
For example, I can run Firefox 11 on my Pentium III Coppermine (using FreeBSD and an X server), but if you install Windows XP on the newest hardware with a six-core i7 and an NVIDIA GTX 640, it still cannot run DirectX 11 games.
This method requires no assistance from the software creator, but is not 100% accurate.
If you want 90+% accurate information, make the software creator check 5-6 checkboxes before uploading. Example:
My application requires DirectX/OpenGL/3D acceleration
My application requires sound
My application requires Windows Vista or later
My application requires a [high-bandwidth] network connection
Then you can test specific applications using information from these checkboxes, as in the sketch below.
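For illustration, here is a hypothetical sketch in Python of how such a checkbox declaration could be matched against probed capabilities; the keys, values, and the probing stub are all made up, and the real client code would be JavaScript or C++ as you mention.

    # The creator ships a small requirements declaration; the installer/launcher
    # compares it against capabilities probed on the user's machine. The probe is
    # a stub here; real detection would query DirectX/OpenGL, audio devices, etc.
    REQUIREMENTS = {
        "needs_3d_acceleration": True,
        "needs_sound": True,
        "min_windows_version": (6, 0),   # Windows Vista
        "needs_network": True,
    }

    def probe_capabilities():
        # Placeholder values; replace with real platform-specific checks.
        return {
            "has_3d_acceleration": True,
            "has_sound": True,
            "windows_version": (6, 1),   # e.g. Windows 7
            "has_network": True,
        }

    def can_run(req, caps):
        return (
            (not req["needs_3d_acceleration"] or caps["has_3d_acceleration"])
            and (not req["needs_sound"] or caps["has_sound"])
            and (not req["needs_network"] or caps["has_network"])
            and caps["windows_version"] >= req["min_windows_version"]
        )

    print("looks runnable" if can_run(REQUIREMENTS, probe_capabilities()) else "likely not runnable")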
Edit:
I think additional checks could be:
video/audio codecs
pixel/vertex/geometry shader version, GPU physics acceleration (may be crucial for games)
not so much related anymore: processor extensions (SSE2, MMX, etc.)
third-party software such as PDF readers, Flash, etc.
system libraries (libpng, libjpeg, svg)
system version (Service Pack number, OS edition (Premium, Professional, etc.))
window manager (some apps on OSX require X11 for functioning, some apps on Linux work only on KDE, etc)
These are actual requirements I (and many others) have seen when installing different software.
As for old hardware, if the computer satisfies hardware requirements (pixel shader version, processor extensions, etc), then there's a strong reason to believe the software will run on the system (possibly slower, but that's what benchmarks are for if you need them).
For GPUs, I do not think getting a score is usable/possible without running some code on the machine to test whether it is up to spec.
With GPUs this typically means checking which shader models the card can use, and then either defaulting to a lower shader model (so the application runs with reduced quality) or telling the user they have no hope of running the code and quitting.
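As a hedged illustration of that "run something on the machine" idea, here is a small Python/pyopencl probe that reports what each device claims to support; the answer above is about Direct3D shader models, so treat OpenCL here merely as a portable stand-in for capability querying.

    import pyopencl as cl

    # Enumerate every OpenCL platform/device and print a few capability fields.
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            print(dev.name,
                  "| type:", cl.device_type.to_string(dev.type),
                  "| compute units:", dev.max_compute_units,
                  "| global mem (MB):", dev.global_mem_size // (1024 * 1024),
                  "|", dev.version)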

Profiling OpenCL kernels

I am trying to optimize my OpenCL kernels and all I have right now is the NVIDIA Visual Profiler, which seems rather limited. I would like to see a line-by-line profile of the kernels to better understand issues with coalescing, etc. Is there a way to get more thorough profiling data than what Visual Profiler provides?
I think that AMD CodeXL is what you are looking for. It's a free set of tools that contains an OpenCL debugger and a GPU profiler.
The OpenCL debugger allows you to do line-by-line debugging of your OpenCL kernels and host code, view all variables across different workgroups, view special events and errors that occur, etc..
The GPU profiler has a nice feature that generates a timeline displaying how long your program spends on tasks like data transfer and kernel execution.
For more info and download links, check out http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/
No, there is no such tool, but you can profile your code changes: measure the speed of your code, change something, and then measure it again. clEnqueueNDRangeKernel takes an event argument that can be used with clGetEventProfilingInfo afterwards; the timer is very fine-grained, with accuracy on the order of microseconds. This is the only way to measure the performance of a separate part of the code.
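The same event-based timing is also available from Python via pyopencl; here is a minimal sketch, where the kernel and array size are placeholders.

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    # Trivial placeholder kernel: double every element of a float buffer.
    prg = cl.Program(ctx, """
        __kernel void scale(__global float *a) {
            int gid = get_global_id(0);
            a[gid] *= 2.0f;
        }
    """).build()

    a = np.arange(1 << 20, dtype=np.float32)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                    hostbuf=a)

    event = prg.scale(queue, a.shape, None, buf)   # enqueue the kernel
    event.wait()
    elapsed_ns = event.profile.end - event.profile.start
    print("kernel took %.3f ms" % (elapsed_ns * 1e-6))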
I haven't tested it, but I just found this program: http://www.gremedy.com/gDEBuggerCL.php.
The description is: "This new product brings gDEBugger's advanced Debugging, Profiling and Memory Analysis abilities to the OpenCL developer's world..."
LTPV is an open-source OpenCL profiler that may fit your requirements. For now, it only works under Linux.
(disclosure: I am the developer of this tool)

CUDA app on part of the cards

I've got an NVIDIA Tesla S2050 and a host with an NVIDIA Quadro card, running CentOS 5.5 with CUDA 3.1.
When I run my CUDA app, I want to use the four Tesla C2050 devices but not the Quadro in the host, so that the job is split four ways instead of five and the Quadro doesn't drag down the overall performance. Is there any way to implement this?
I'm assuming you have four processes and four devices, although your question suggests you have five processes and four devices, in which case manual scheduling may be preferable (with the Tesla devices in "shared" mode).
The easiest approach is to use nvidia-smi to set the Quadro device to "compute prohibited" mode. You would also set the Teslas to "compute exclusive" mode, meaning only one context can attach to each of them at any given time.
Run man nvidia-smi for more information.
Yes. Check CUDA Support/Choosing a GPU
Problem: Running your code on a machine with multiple GPUs may result in your code executing on an older and slower GPU.
Solution: If you know the device number of the GPU you want to use, call cudaSetDevice(N). For a more robust solution, include the code shown below at the beginning of your program to automatically select the best GPU on any machine.
Check their website for further explanation.
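The code from that page isn't reproduced here, but the idea can be sketched like this; this version uses PyCUDA rather than calling cudaSetDevice() from C, and picking the highest compute capability as "best" is only a heuristic.

    import pycuda.driver as cuda

    cuda.init()
    best, best_cc = 0, (-1, -1)
    for i in range(cuda.Device.count()):
        dev = cuda.Device(i)
        cc = dev.compute_capability()            # e.g. (2, 0) for a Tesla C2050
        print(i, dev.name(), cc)
        if cc > best_cc:
            best, best_cc = i, cc

    ctx = cuda.Device(best).make_context()       # roughly what cudaSetDevice(best) does
    # ...launch kernels / run your app here...
    ctx.pop()

In your setup you could also simply skip any device whose name() contains "Quadro" when deciding which devices to use.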
You may also find this post very interesting.