I am trying to implement some neural networks for training and inference using C++. It should work on GPU (if available, with cuDNN) and CPU (if GPU is not available).
All modern frameworks support this. However, I wonder how this is achieved. Are all functions, such as backprop etc., implemented twice - once with cuDNN, once with a CPU BLAS? How do they ensure that identical results are achieved?
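One common way to structure this, sketched here under assumptions rather than taken from any particular framework, is a shared kernel interface with one implementation per device, plus tests that compare the backends within a tolerance instead of expecting bit-identical output (floating-point sums can differ when the accumulation order differs). `Backend`, `CpuBackend`, `GpuStubBackend`, `gpu_available`, `make_backend` and `all_close` are hypothetical names; the "GPU" class is only a CPU stand-in that sums in a different order.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical interface: every backend implements the same operations.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
    // y = W * x for a row-major (rows x cols) matrix W.
    virtual std::vector<float> matvec(const std::vector<float>& W,
                                      const std::vector<float>& x,
                                      std::size_t rows, std::size_t cols) const = 0;
};

// Plain CPU loop; a real backend would likely call a BLAS routine here.
struct CpuBackend : Backend {
    std::string name() const override { return "cpu"; }
    std::vector<float> matvec(const std::vector<float>& W,
                              const std::vector<float>& x,
                              std::size_t rows, std::size_t cols) const override {
        assert(W.size() == rows * cols && x.size() == cols);
        std::vector<float> y(rows, 0.0f);
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                y[r] += W[r * cols + c] * x[c];
        return y;
    }
};

// Stand-in for a GPU backend (assumption: no real device code here).
// It accumulates in reverse order to mimic a different summation order.
struct GpuStubBackend : Backend {
    std::string name() const override { return "gpu-stub"; }
    std::vector<float> matvec(const std::vector<float>& W,
                              const std::vector<float>& x,
                              std::size_t rows, std::size_t cols) const override {
        assert(W.size() == rows * cols && x.size() == cols);
        std::vector<float> y(rows, 0.0f);
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = cols; c-- > 0;)
                y[r] += W[r * cols + c] * x[c];
        return y;
    }
};

// Hypothetical runtime probe; this sketch always reports "no GPU".
bool gpu_available() { return false; }

// Picks the GPU backend only if requested and available, else the CPU one.
std::unique_ptr<Backend> make_backend(bool want_gpu) {
    if (want_gpu && gpu_available()) return std::make_unique<GpuStubBackend>();
    return std::make_unique<CpuBackend>();
}

// Element-wise comparison with relative and absolute tolerance.
bool all_close(const std::vector<float>& a, const std::vector<float>& b,
               float rtol = 1e-5f, float atol = 1e-6f) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > atol + rtol * std::fabs(b[i])) return false;
    return true;
}
```

Calling code would obtain a backend once with `make_backend(true)` and use only the `Backend` interface afterwards; a test suite would run the same inputs through both implementations and check `all_close` on the outputs.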
Is it possible to configure a C++ Tensorflow application to use one pool of CPU cores (threads) for feeding a GPU but also use a separate pool of CPU cores (threads) for processing operations?
I have a C++ application that is highly optimized to run Tensorflow predictions on my 32-core workstation. I also have an RTX 3060. If I configure the application to use the same number of threads but to have Tensorflow use the GPU, the application's throughput is essentially the same, though the CPU load drops from >3000% to about 500%.
I can reduce the number of threads significantly and still obtain the same throughput, indicating that I might be able to nearly double the throughput if I could get Tensorflow to use most of the cores for performing operations as if there was no GPU available.
I have ideas for how I could implement this with my own scheduling algorithm, but given that Tensorflow is already very capable of distributing operations over multiple GPUs I would prefer to let Tensorflow do the distribution.
Can this be done? The API looks like it might support this use case, but if it does I would expect to find examples of it being done.
I am an HPC student working on a project written in C++ with OpenCV functions. I have to parallelize the code for high performance, so I decided to use CUDA acceleration. I am confused about the following…
Is using CUDA alone enough to get high performance?
Can I use either OpenCV::GPU or OpenCV::CUDA with a CUDA GPU?
What is the difference between OpenCV::GPU and OpenCV::CUDA?
CUDA programming can only be used with NVIDIA cards. The power of general-purpose GPU hardware is utilized only if your processing is parallel.
For example, if you are working with images and every pixel of the image is processed by an independent operation, GPU programming helps save computation time.
If, in your application, the result for the second pixel depends on the result for the first pixel, then it is better to run the application on the CPU. Data transfer from CPU to GPU and from GPU to CPU also affects performance, so take care with it when you write the code.
2 & 3. OpenCV 2 versions use the cv::gpu namespace, whereas OpenCV 3 versions use cv::cuda. Which one applies depends on the OpenCV version you use.
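The distinction between independent and dependent pixel operations can be sketched in plain C++ as follows. This is only an illustration with hypothetical function names (`invert`, `running_sum`), not OpenCV or CUDA code: the first loop has no dependency between iterations, so each pixel could be handed to its own GPU thread, while the second carries a value from one iteration to the next and cannot be split up in this naive form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Independent: each output pixel depends only on the matching input pixel,
// so the iterations could run in any order or all at once.
std::vector<std::uint8_t> invert(const std::vector<std::uint8_t>& img) {
    std::vector<std::uint8_t> out(img.size());
    for (std::size_t i = 0; i < img.size(); ++i)
        out[i] = static_cast<std::uint8_t>(255 - img[i]);
    return out;
}

// Dependent: out[i] needs the accumulated value from pixel i-1
// (a loop-carried dependency), so this loop must run in order.
std::vector<std::uint32_t> running_sum(const std::vector<std::uint8_t>& img) {
    std::vector<std::uint32_t> out(img.size());
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < img.size(); ++i) {
        acc += img[i];
        out[i] = acc;
    }
    return out;
}
```

For instance, `invert({0, 255, 10})` yields `{255, 0, 245}` and `running_sum({1, 2, 3})` yields `{1, 3, 6}`.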
From this https://software.intel.com/en-us/videos/purpose-of-the-mic-architecture I understand that applications with complex or numerous random memory accesses are not well suited to the Intel Xeon Phi. This is because the architecture uses 61 cores and 8 memory controllers. In case of L1 and L2 cache misses, it can take hundreds of cycles to fetch the line into memory and get it ready for use by the CPU. Such applications are called latency-bound.
Then, the tutorial mentions that many-core architectures (Xeon Phi coprocessor only) are well suited to highly parallel homogeneous code. Two questions from there:
What is referred to as homogeneous code?
What are real-world applications that can fully benefit from the MIC architecture?
I see the Intel MIC architecture as an "x86-based GPGPU", and if you are familiar with the concept of GPGPU you will find yourself at home with the Intel MIC.
A homogeneous cluster is a system infrastructure with multiple execution units (e.g. CPUs) that all have the same features. For example, a multicore system with four Intel Xeon processors is homogeneous.
A heterogeneous cluster is a system infrastructure with multiple execution units that have different features (e.g. a CPU and a GPU). For example, my Lenovo z510 with its Intel i7 Haswell (4 CPUs), its Nvidia GT740M (GPU) and its Intel HD Graphics 4600 (GPU) is a heterogeneous system.
An example of heterogeneous code could be a video game.
A video game has control code, executed by one core of one CPU, that controls what the other agents do: it sends shaders to execute on the GPUs, physics computations to be performed on other cores or GPUs, and so on.
In this example you need to write code that runs on the CPU (so it is "CPU aware") and code that runs on the GPU (so it is "GPU aware"). This is actually done by using different tools, different programming languages and different programming models!
Homogeneous code is code that doesn't need to be aware of n different programming models, one for each kind of agent. It uses just one programming model, language and tool.
Take a look at this very simple sample code for the MPI library.
The code is all written in C; it is the same program that just takes a different flow.
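The "same program, different flow" idea can be sketched without MPI installed. The following is not the linked sample; it is a minimal simulation in which a plain loop plays the role of the processes, and `run_rank` and `simulate` are hypothetical names. In a real MPI program the rank and size would come from `MPI_Comm_rank` and `MPI_Comm_size`, and each rank would be a separate process running the same executable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The same function runs for every rank; only the rank value
// decides which branch is taken (single program, multiple data).
std::string run_rank(int rank, int size) {
    if (rank == 0)
        return "rank 0: coordinating " + std::to_string(size - 1) + " workers";
    return "rank " + std::to_string(rank) + ": working";
}

// Simulation only: a sequential loop stands in for `size` processes.
std::vector<std::string> simulate(int size) {
    assert(size > 0);
    std::vector<std::string> log;
    for (int rank = 0; rank < size; ++rank)
        log.push_back(run_rank(rank, size));
    return log;
}
```

With `simulate(3)`, the first entry reads "rank 0: coordinating 2 workers" and the remaining entries read "rank 1: working" and "rank 2: working".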
About the applications, Well that's really a broad question...
As said above I see the Intel MIC as a GPGPU based on x86 ISA (part of it at least).
An SDK particularly useful (and listed in the video you linked) for working with clustered systems is OpenCL; it can be used for fast image processing and computer vision, and basically for anything that needs the same algorithm to be run billions of times with different inputs (like cryptography applications/brute forcing).
If you search for some OpenCL based project on the web you will get an idea.
To answer your second question, it is better to ask ourselves "What could not take advantage of the MIC architecture?" We will soon find that the further an algorithm is from the concept of Stream Processing and the related topics, including that of the Kernel, the less suitable it is for the MIC.
First, a straightforward answer to your direct question: to get the most out of the coprocessor, your code should be able to use a large number of threads and should vectorize. How many threads? Well, you have 60 cores (+/- depending on which version you get) and 4 threads per core, with a sweet spot around 2 threads per core on many codes. Sometimes you can get good performance even if you don't use every single core. But vectorization is extremely important; the long (512-bit) vectors are a big source of speed on the coprocessor.
Now, on to programming. The Intel Xeon Phi coprocessor uses two different kinds of programming - offload and native.
In the offload model, you write a program, determine which parts of that code have enough parallelism to make use of the large number of cores on the coprocessor and mark those sections with offload directives. Then inside those offloaded sections, you write the code using some form of parallelism, like OpenMP. (Heterogeneous)
In native code, you do not use any offload directives but, instead, use the -mmic compiler option. Then you run the code directly on the coprocessor. The code you write will use some form of parallelism, like OpenMP, to make use of the large number of cores the coprocessor has. (Homogeneous)
Another variation on these programming models is to use MPI, often in addition to OpenMP. You can use the offload programming model, in which case the nodes in your MPI system will be the host nodes in your system. (Hybrid) Alternatively, you can use the native programming model, in which case you treat the coprocessor as just another node in your system. (Heterogeneous if host and coprocessors are nodes; homogeneous if only coprocessors are used.)
You may have noticed that nothing I have said implies a separate programming style for the host and coprocessor. There are some optimizations you can make that will keep code written for the coprocessor from running on the processor as well but, in general, the code you write for the coprocessor can also be compiled for and run on the host by just changing the compiler options.
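As a rough illustration of that last point, here is a small loop written once with an OpenMP directive that asks for both threading and vectorization. It is only a sketch under assumptions: `saxpy` is a hypothetical example function, and nothing in it is specific to the coprocessor. Built for the host with OpenMP enabled (for example `-fopenmp` with GCC or Clang) it uses the host cores; the same source could, per the description above, be built natively for the coprocessor with the -mmic option. Without OpenMP the pragma is skipped and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = a * x + y, element by element. Iterations are independent,
// so the loop can be split across threads and vectorized.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const float* xp = x.data();
    float* yp = y.data();
    const long n = static_cast<long>(x.size());
#ifdef _OPENMP
    // Only active when the compiler has OpenMP enabled.
    #pragma omp parallel for simd
#endif
    for (long i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

For example, with `x = {1, 2, 3}` and `y = {1, 1, 1}`, calling `saxpy(2.0f, x, y)` leaves `y` as `{3, 5, 7}` regardless of how many threads run the loop.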
As far as real world apps, see https://software.intel.com/en-us/mic-developer/app-catalogs
Why is there no OpenCL (ocl) module in OpenCV 3.0 beta?
I heard that the new OpenCV uses OpenCL transparently, but when I test this on Windows running on an Intel Core i5 (GPU HD400), I cannot see any speed improvement from running on the GPU.
Am I missing something here?
The ocl module of OpenCV was intentionally removed. Developers are no longer expected to use invocations like ocl::Canny. These methods are invoked internally by OpenCV. Developers are expected to use the UMat structure, as explained in the presentation. UMat wraps cl_mem when OpenCL is available; otherwise it defaults to the CPU. See ocl.cpp.
Regarding speed, I would check the following:
In cvconfig.h in the build directory, check whether the OpenCL flag is ON or OFF.
In code, call ocl::setUseOpenCL(true).
In code, use UMat in place of Mat.
Then check the FPS with and without the call to ocl::setUseOpenCL(true).
I would not expect to see a drastic FPS increase. Even assuming the GPU is used, there can be cases where data has to be copied back and forth between CPU and GPU memory, which might affect end performance. I would expect to see processing offloaded to the GPU and less of a burden on the CPU, not necessarily a speed increase.
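The FPS comparison in the last step can be done with a small timing helper like the one below. This is a generic sketch using only the standard library; `measure_fps` is a hypothetical name, and the callable passed in is assumed to process one frame (in the OpenCV case, the UMat pipeline, measured once with ocl::setUseOpenCL(true) and once with ocl::setUseOpenCL(false)). The warm-up call is there on the assumption that the first frame may include one-time setup cost that should not be counted.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Runs `frame` once as warm-up, then `iterations` more times,
// and returns the measured frames per second.
double measure_fps(const std::function<void()>& frame, int iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    frame();  // warm-up: excluded from the timing
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        frame();
    const std::chrono::duration<double> elapsed = clock::now() - start;
    // Guard against a zero reading for extremely short workloads.
    const double seconds = std::max(elapsed.count(), 1e-9);
    return iterations / seconds;
}
```

Comparing the two returned values, together with the CPU load observed during each run, gives a more complete picture than the FPS figure alone.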
With regard to tools, you could use AMD's CodeXL to observe the behavior of OpenCV/OpenCL. You can see the sequence of OpenCL API calls, the kernels used, their performance and source code, data buffers and their contents, etc. Of course, all this works only on AMD hardware. I think Parallel Nsight can do the same for NVIDIA. For Intel, I do not know which tool can help.
What methods exist to verify that work is indeed being parallelized by OpenCL? (How can I verify that work is being distributed to all the processing elements for execution?) Or at least a method to monitor which cores/processors of the GPU or CPU are being used?
I would simply like a way to verify that OpenCL is actually doing what its specification claims it is supposedly doing. To do this, I need to collect hard evidence that OpenCL / the OS / the drivers are indeed scheduling kernels and work items to be executed in parallel (as opposed to serially).
I have written an OpenCL program conforming to the OpenCL API 1.2 specification, along with a simple OpenCL C kernel which simply squares the input integer.
In my program, work_group_size = MAX_WORK_GROUP_SIZE (so that they will fit on the compute units and so that OpenCL won't throw a fit).
The total amount_of_work is a scalar multiple of (MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE). Since amount_of_work > MAX_COMPUTE_UNITS * MAX_WORK_GROUP_SIZE, hopefully OpenCL
Hopefully this would be enough to force the schedulers to execute the maximum number of kernels + work items efficiently as possible, making use of the available cores / processors.
For a CPU, you can use cpuid, sched_getcpu, or GetCurrentProcessorNumber to check which core / processor the current thread is executing on.
Is there a method on the OpenCL API which provides this information? (I have yet to find any.)
Is there an OpenCL C language built in function... or perhaps do the vendor's compilers understand some form of assembly language which I could use to obtain this information?
Is there an equivalent to cpuid, sched_getcpu, or GetCurrentProcessorNumber for GPUs for core usage monitoring, etc.? Perhaps something vendor or architecture specific?
Is there an external program which I could use as a monitor for this information? I have tried Process Monitor and AMD's CodeXL, both of which are not useful for what I'm looking for. Intel has VTune, but I doubt that works on an AMD GPU.
Perhaps I could take a look at the compiled kernel code as generated from the AMD and Intel Compilers for some hints?
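For reference, the CPU-side check mentioned above can be sketched as follows. This covers only the host CPU and says nothing about GPU compute units; `current_cpu` and `sample_cpus` are hypothetical helper names, and only the Linux path (via `sched_getcpu`) is implemented, with a placeholder value of -1 on other platforms.

```cpp
#include <cassert>
#include <set>
#if defined(__linux__)
#include <sched.h>
#endif

// Logical CPU the calling thread is running on right now.
// Linux only in this sketch; other platforms return -1.
int current_cpu() {
#if defined(__linux__)
    return sched_getcpu();
#else
    return -1;
#endif
}

// Does a little busy work between samples and records every
// distinct CPU index observed for the calling thread.
std::set<int> sample_cpus(int samples) {
    std::set<int> seen;
    volatile unsigned long sink = 0;
    for (int s = 0; s < samples; ++s) {
        for (int i = 0; i < 10000; ++i)
            sink = sink + static_cast<unsigned long>(i);
        seen.insert(current_cpu());
    }
    return seen;
}
```

Calling `sample_cpus` from each worker thread of a CPU-side program and merging the sets shows which logical CPUs the scheduler actually used during the run.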
Hardware Details:
GPU: AMD FirePro, using the AMD Cape Verde architecture, 7700M Series chipset. I don't know exactly which one in the series it is. If there is an AMD instruction set manual for this architecture (as there are manuals for x86), that would possibly be a start.
CPU: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Development Environment Details:
OS: Win 7 64-bit, will also eventually need to run on Linux, but that's beside the point.
Compiling with MinGW GNU GCC 4.8.1 -std=c++11
Intel OpenCL SDK (OpenCL header, libraries, and runtime)
According to Process Manager, Intel's OpenCL compiler is a clang variant.
AMD APP OpenCL SDK (OpenCL header, libraries, and runtime)
OpenCL 1.2
I am trying to keep the source code as portable as possible.
Instead of relying on speculation, you can comment out a program's buffer copies and visualisations and leave only the kernel executions intact. Then put it in a tight loop and watch for the heat rising. If it heats up like FurMark, it is using the cores. If it is not heating up, you can also disable the serial operations in the kernels (gid==0) and try again. For example, a simple n-body simulator pushes a well-cooled HD7000 series GPU to over 70°C in minutes, and to 90°C with a poor cooler. Compare it to a known benchmark's temperature limits.
A similar thing exists for the CPU. Using float4 heats it more than simple floats, which shows that even the instruction type matters for using all the ALUs (let alone threads).
If the GPU has a really good cooler, you can watch its Vdroop. More load means more voltage drop: more cores give more drop, and more load per core also gives more drop.
Whatever you do, it is up to the compiler's and the hardware's abilities, and you don't have explicit control over the ALUs, because OpenCL hides hardware complexity from the developer.
Using MSI Afterburner or similar software is not useful, because it shows 100% usage even when you use 1% of the card's true potential.
Simply look at the temperature difference of the computer case between the starting state and the equilibrium state. If delta-T is around 50 with OpenCL and 5 without, OpenCL is parallelising the work, though you can't know by how much.