Will the CUDA API affect the CPU's RAM access performance? - c++

* Update: more testing shows the CPU's RAM slowness is not related to CUDA. It turns out Func2(CPU) is CPU-intensive but not memory-intensive, so in program1 the pressure on memory is lower because it is Func2 that occupies the CPU. In program2 (GPU), Func2 becomes very fast on the GPU, so Func1 occupies the CPU and puts a lot of pressure on memory, which is what slows Func1 down. *
Short version: if I run 20 processes concurrently on the same server, I notice the CPU code runs much more slowly when the GPU is involved (versus pure CPU processes).
Long version:
My server: Windows Server 2012, 48 logical cores (hyperthreaded from 24), 192 GB RAM (the 20 processes only use ~40 GB), 4 K40 cards
My program1 (CPU Version):
For 30 iterations:
Func1(CPU): 6s (lots of CPU memory access)
Func2(CPU): 60s (lots of CPU memory access)
My program2 (GPU Version, to use CPU cores for Func1, and K40s for Func2):
//1 K40 holds 5 contexts at the same time, until the end of the 30 iterations
cudaSetDevice //very slow, can be a few seconds
cudaMalloc ~1GB //can be another few seconds
For 30 iterations:
Func1(CPU): 6s
Func2(GPU): 1s (60X speedup) //GPU shared via a named_mutex
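To be concrete, here is roughly what each program2 process does around Func2 (a simplified sketch assuming Boost.Interprocess for the named mutex; the mutex name, the kernel, and the buffer size are placeholders, not my real code):

#include <boost/interprocess/sync/named_mutex.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <cuda_runtime.h>

// Placeholder standing in for Func2's real GPU kernel.
__global__ void func2_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void func2_gpu(float* host_in, float* host_out, float* dev_buf, int n)
{
    using namespace boost::interprocess;
    // One mutex per K40, shared by the 5 processes using that card.
    named_mutex gpu_mutex(open_or_create, "k40_0_mutex");
    scoped_lock<named_mutex> lock(gpu_mutex);  // serialize access to the GPU

    cudaMemcpy(dev_buf, host_in, n * sizeof(float), cudaMemcpyHostToDevice);
    func2_kernel<<<(n + 255) / 256, 256>>>(dev_buf, n);
    cudaMemcpy(host_out, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
}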
If I run 20 instances of program1 (CPU) together, Func1's 6s becomes 12s on average.
But for 20 instances of program2 (GPU), Func1 takes ~42s to complete, while Func2(GPU) still takes ~1s (that 1s includes locking the GPU, some cudaMemcpy calls, and the kernel call; I presume it includes GPU context switching as well). So the GPU's own performance is not affected much, while the CPU's is (by the GPU).
So I suspect cudaSetDevice/cudaMalloc/cudaMemcpy is affecting the CPU's RAM access? If that's true, parallelization using both multi-core CPU and GPU will be affected.
Thanks.

This is almost certainly caused by resource contention.
When, using the standard API, you run 20 processes, you have 20 separate GPU contexts. Every time one of those processes wishes to perform an API call, there must be a context switch to that process's context. Context switching is expensive and has a lot of latency. That is the source of the slowdown you are seeing, and it has nothing to do with memory performance.
NVIDIA has released a system called MPS (Multi-Process Service) which reimplements the CUDA API as a service and internally exploits the Hyper-Q facility of modern Tesla cards to push operations onto the wide command queue which Hyper-Q supports. This removes all the context-switching latency. It sounds like you might want to investigate it, if performance is important to you and your code requires a large number of processes sharing a single device.

Related

OpenCL - multiple threads on a GPU

After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently looking for examples that show how to implement multicore CPU - GPU interaction.
Here is what I want to achieve. Suppose you have a fixed short array, say {1,2,3,4,5}, and that, as an exercise, you want to compute all of its possible "right shifts", i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
The corresponding OpenCL code is quite straightforward.
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant each CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into GPU memory, run the kernel, and wait for the result. My question is: during this operation, could the other CPU cores submit a similar request without waiting for the completion of the task submitted by core 21?
Also, can core 21 perform in parallel another task while waiting for the completion of the GPU task?
Would you feel like suggesting some examples to look at?
Thanks!
The GPU works with a queue of kernel calls and (PCIe) memory transfers. Within one queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You can create several queues (one per CPU core); kernels from different queues can then execute in parallel, provided that each kernel only takes up a fraction of the GPU's resources. While a queue is being executed on the GPU, the CPU core can perform a different task, and with queue.finish() the CPU will wait until the GPU is done.
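As a rough sketch of the multi-queue idea, using the OpenCL C++ wrapper (the shift_right kernel and all sizes are made up for illustration, and error handling is omitted):

#include <CL/cl.hpp>
#include <vector>

int main()
{
    // Pick the first GPU of the first platform.
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<cl::Device> devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
    cl::Context context(devices[0]);

    // Toy kernel: rotate a small array right by one position.
    const char* src =
        "kernel void shift_right(global const int* in, global int* out, int n) {"
        "  int i = get_global_id(0);"
        "  out[(i + 1) % n] = in[i];"
        "}";
    cl::Program program(context, src, true);

    // One queue per producer; kernels from different queues may overlap on the GPU.
    cl::CommandQueue queueA(context, devices[0]);
    cl::CommandQueue queueB(context, devices[0]);

    std::vector<int> a = {1, 2, 3, 4, 5}, b = {6, 7, 8, 9, 10}, outA(5), outB(5);
    cl::Buffer inA(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 5 * sizeof(int), a.data());
    cl::Buffer inB(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 5 * sizeof(int), b.data());
    cl::Buffer oA(context, CL_MEM_WRITE_ONLY, 5 * sizeof(int));
    cl::Buffer oB(context, CL_MEM_WRITE_ONLY, 5 * sizeof(int));

    // Separate kernel objects so the two queues don't race on kernel arguments.
    cl::Kernel kA(program, "shift_right"), kB(program, "shift_right");
    kA.setArg(0, inA); kA.setArg(1, oA); kA.setArg(2, 5);
    kB.setArg(0, inB); kB.setArg(1, oB); kB.setArg(2, 5);

    queueA.enqueueNDRangeKernel(kA, cl::NullRange, cl::NDRange(5));
    queueB.enqueueNDRangeKernel(kB, cl::NullRange, cl::NDRange(5));
    // The host is free to do other work here.
    queueA.enqueueReadBuffer(oA, CL_FALSE, 0, 5 * sizeof(int), outA.data());
    queueB.enqueueReadBuffer(oB, CL_FALSE, 0, 5 * sizeof(int), outB.data());
    queueA.finish();  // blocks until queue A (kernel + read) is done
    queueB.finish();
}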
However, letting multiple CPU cores send tasks to a single GPU is bad practice and will not give you any performance advantage, while making your code over-complicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU perform badly.
The multi-CPU-core approach is only useful if each core feeds its own dedicated GPU, and even then I would only recommend it if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU from a single CPU core and - if there is some processing to do on the CPU side - only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and a single large kernel, you will saturate the hardware and get the best possible performance.
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992

GFLOPS vendor specification vs practical results

I have a question that I can't figure out.
I have an NVIDIA GT 750M GPU, and according to the specification it should deliver 722.7 GFLOP/s (GPU specification), but when I run the test from the CUDA samples it gives me about 67.64 GFLOP/s.
Why such a big difference?
Thanks.
The peak performance can only be achieved when every core is busy executing an FMA on every cycle, which is impossible in a real task.
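For reference, the spec-sheet number itself assumes exactly that: taking the GT 750M's 384 CUDA cores and a boost clock of roughly 941 MHz, 384 cores x 2 FLOPs per FMA x 0.941 GHz ≈ 722.7 GFLOP/s. In other words, the vendor figure already counts an FMA issued by every core on every single cycle.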
Apart from the fact that no operation other than FMA counts as two floating-point operations, there is another issue: stalls.
For a single kernel launch, if you do some sampling in the Visual Profiler you will notice something called a stall. Each operation takes time to finish, and if another operation depends on the result of a previous one, it has to wait. This creates "gaps" in which a core sits idle, waiting for a new operation to be ready to execute. Among these, device memory operations have HUGE latencies. If you don't do it right, your code will end up busy waiting for memory operations all the time.
Some tasks can be well optimized. If you test gemm in cuBLAS, it can reach over 80% of the peak FLOPS, on some devices even 90%. Other tasks simply cannot be optimized for FLOPS. For example, if you add one vector to another, the performance is always limited by the memory bandwidth, and you will never see high FLOPS.
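To make the vector-add point concrete, here is a minimal sketch (a generic kernel, not taken from the question): each output element costs one floating-point addition but 12 bytes of DRAM traffic (two 4-byte loads and one store), so the achievable FLOPS is roughly bandwidth / 12 bytes. On a card with ~80 GB/s of memory bandwidth that caps out at only ~6-7 GFLOP/s, no matter how many cores sit idle.

#include <cuda_runtime.h>

// Memory-bound kernel: 1 FLOP per 12 bytes of DRAM traffic.
__global__ void vec_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // 2 loads + 1 store for a single add
}

int main()
{
    const int n = 1 << 24;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    // FLOPS = n adds / elapsed time; bytes moved = 3 * n * 4,
    // so FLOPS <= memory_bandwidth / 12 regardless of core count.

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}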

Performance Decrease When Increasing Thread Count Beyond 4 on a 24-Core CPU

I have an Intel Xeon E5-2620 system which has 24 cores across 2 CPUs. I have written an application which creates 24 threads for AES decryption using OpenSSL. When I increase the thread count from 1 to 24 on 1 million data decryptions, I get results like the following image.
The problem is that when I increase the thread count, all of the cores I assigned go to 100%, yet with 32 GB of RAM at least half of the RAM is always free, which indicates the problem is neither core usage nor a RAM limit.
I would like to know whether I should set a special parameter at the OS level to increase performance, or whether it is a process limitation that cannot reach maximum performance with more than 4 threads.
I should mention that when I execute "openssl evp ..." to test AES encryption/decryption, it forks processes and the performance increases to about 20 times the single-core performance.
Does anyone have any idea?
I finally found the reason. On servers, the CPUs have their own RAM at different distances (NUMA). Threads were created on a single CPU up to 4 threads, but the fifth thread was placed on the second CPU, which decreased performance because the OS was not handling NUMA placement.
So when I disabled the cores of the second CPU, the performance with 6 threads increased as expected.
You can disable the 7th logical core (cpu6) with the following commands:
cd /sys/devices/system/cpu/
echo 0 > cpu6/online
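An alternative to taking cores offline is to pin the threads explicitly. Here is a minimal sketch using Linux's pthread affinity API (it assumes cores 0-5 belong to the first socket; check with lscpu, since the real numbering depends on the machine):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin the calling thread to one core; 'core' is assumed to be a core id
// on socket 0 (e.g. 0..5 on this machine).
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

Each AES worker thread would call pin_to_core(thread_index % 6) right after it starts, so all workers stay on the memory attached to the first socket.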
If multiprocessing gives a 20x speedup and equivalent multithreading only gives 2.5x, there's clearly a bottleneck in the multithreaded code. Furthermore, this bottleneck is unrelated to the hardware architecture.
It could be something in your code, or in the underlying library. It's really impossible to tell without studying both in some detail.
I'd start by looking at lock contention in your multithreaded application.
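One common source of such contention is sharing a single cipher context (or other mutable OpenSSL state) across threads. A minimal sketch of the lock-free pattern, where each thread owns its own EVP_CIPHER_CTX (the key, IV, and cipher choice are placeholders, not taken from the question):

#include <openssl/evp.h>

// Each thread gets its own context, so no locking is needed between threads.
void decrypt_worker(const unsigned char* key, const unsigned char* iv,
                    const unsigned char* in, int in_len, unsigned char* out)
{
    EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();
    EVP_DecryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv);

    int len = 0, total = 0;
    EVP_DecryptUpdate(ctx, out, &len, in, in_len);
    total += len;
    EVP_DecryptFinal_ex(ctx, out + total, &len);

    EVP_CIPHER_CTX_free(ctx);
}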

CPU vs GPU number of cores

I need some help understanding the concept of cores on a GPU vs. cores in a CPU for the purpose of doing parallel calculations.
When it comes to cores in a CPU, it seems pretty simple. I have a super intensive "for" loop that iterates four times. I have four cores in my Intel i5 2.26GHz CPU. I give one iteration to each core. Each of the four iterations is independent of the others. Boom - I now have four threads created and 100% CPU usage (instead of 25% CPU usage with only one core). My "for" loop now runs almost four times faster than it would have if I did not parallelize it.
In contrast, I don't even know the number of cores in my laptop's GPU (Intel Graphics Media Accelerator HD, or Intel HD Graphics, with 1696MB shared memory) that I can use for parallel calculations. I don't even know a valid way of comparing the GPU to the CPU. When I see compute unit = 6 on my graphics card description, I wonder if that means the graphics card has 6 cores for parallelization that can work kinda like the 4 cores in a CPU, except that the GPU cores run at 500MHz [slow] instead of 2.26GHz [fast]?
So, would you please fill some of the gaps or mistakes in my knowledge or help me compare the two? I don't need a super complicated answer, something as simple as "You can't compare a CPU core with a GPU core because of blankity blank" or "a GPU core isn't really a core like a CPU core is" would be very much appreciated.
A GPU core is technically different from a CPU core in its design. GPU cores are optimized for executing vectorized code, unlike CPU cores. Hence, the speedup you would get with a GPU compared to a CPU depends not only on the number of cores but also on the extent to which the code can be vectorized. You can check the specifications of your computer's GPU to find the number of cores, and you can use CUDA or OpenCL depending on the GPU in your machine.

CUDA performance improves when running more threads than there are cores

Why does performance improve when I run more than 32 threads per block?
My graphics card has 480 CUDA cores (15 SMs * 32 SPs).
Each SM has 1-4 warp schedulers (Tesla = 1, Fermi = 2, Kepler = 4). Each warp scheduler is responsible for executing a subset of the warps allocated to the SM. Each warp scheduler maintains a list of eligible warps. A warp is eligible if it can issue an instruction on the next cycle. A warp is not eligible if it is stalled on a data dependency, is waiting to fetch an instruction, or the execution unit for the next instruction is busy. On each cycle, each warp scheduler picks a warp from the list of eligible warps and issues 1 or 2 instructions.
The more active warps per SM, the larger the number of warps each warp scheduler has to pick from on each cycle. In most cases, optimal performance is achieved when there are enough active warps per SM to have 1 eligible warp per warp scheduler per cycle. Increasing occupancy beyond this point does not increase performance and may decrease it.
A typical target for active warps is 50-66% of the maximum warps for the SM. The ratio of warps in a launch configuration to the maximum warps supported is called Theoretical Occupancy. The runtime ratio of active warps per cycle to maximum warps per cycle is Achieved Occupancy. For a GTX 480 (a CC 2.0 device), a good starting point when designing a kernel is 50-66% Theoretical Occupancy. A CC 2.0 SM can have a maximum of 48 warps, so 50% occupancy means 24 warps, or 768 threads per SM.
The CUDA Profiling Activity in Nsight Visual Studio Edition can show the theoretical occupancy, achieved occupancy, active warps per SM, eligible warps per SM, and stall reasons.
The CUDA Visual Profiler, nvprof, and the command line profiler can show theoretical occupancy, active warps, and achieved occupancy.
NOTE: The count of CUDA cores should only be used to compare cards of similar architectures, to calculate theoretical FLOPS, and to potentially compare differences between architectures. Do not use the count when designing algorithms.
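If you want these numbers programmatically rather than from a profiler, newer CUDA toolkits (6.5+) expose an occupancy calculator in the runtime API. A minimal sketch (my_kernel is just a stand-in for your own kernel):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void my_kernel(float* data) { /* stand-in kernel */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int blocksPerSM = 0;
    // Active blocks of my_kernel that fit on one SM for this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("theoretical occupancy: %d / %d warps = %.0f%%\n",
           activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
    return 0;
}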
Welcome to Stack Overflow. The reason is that CUDA cores are pipelined. On Fermi, the pipeline is around 20 clocks long. This means that to saturate the GPU, you may need up to 20 threads per core.
The primary reason is the memory latency hiding model of CUDA. Most modern CPUs use cache to hide the latency of main memory. This results in a large percentage of chip resources being devoted to cache: most desktop and server processors have several megabytes of cache on the die, which actually accounts for most of the die space. In order to pack on more cores with the same energy usage and heat dissipation characteristics, CUDA-based chips instead devote their chip space to huge numbers of CUDA cores (which are mostly just floating-point ALUs). Since there is very little cache, they instead rely on having more threads ready to run while other threads are waiting on memory accesses to return, in order to hide that latency. This gives the cores something productive to work on while some warps are waiting on memory accesses. The more warps per SM, the more chance one of them will be runnable at any given time.
CUDA also has zero-cost thread switching in order to aid in this memory-latency-hiding scheme. A normal CPU incurs a large overhead when switching from one thread to the next, because it needs to store all of the register values of the thread it is switching away from onto the stack and then load all of the ones for the thread it is switching to. CUDA SMs instead have tons and tons of registers, so each thread has its own set of physical registers assigned to it through the life of the thread. Since there is no need to store and load register values, an SM can execute threads from one warp on one clock cycle and threads from a different warp on the very next clock cycle.