I have a question that I can't figure out.
I have an NVIDIA GeForce GT 750M, and according to the specification it should deliver 722.7 GFlop/s (GPU specification), but when I run the test from the CUDA samples it reports about 67.64 GFlop/s.
Why such a big difference?
Thanks.
The peak performance can only be achieved when every core is busy executing FMA on every cycle, which is impossible in a real task.
Also, no operation other than FMA is counted as two floating-point operations, so any instruction that is not an FMA already drops you below the peak rate.
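For reference, the advertised figure is just cores x clock x 2, because one FMA counts as two floating-point operations. Assuming the GT 750M variant with 384 CUDA cores and a boost clock of about 941 MHz:
384 cores x 0.941 GHz x 2 FLOP/cycle ≈ 722.7 GFLOP/s
so the spec sheet already assumes the absolute best case of one FMA per core on every cycle.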
For a single kernel launch, if you do some sampling in Visual Profiler you will notice something called a stall. Each operation takes time to finish, and if another operation depends on the result of the previous one, it has to wait. This eventually creates "gaps" in which a core sits idle, waiting for a new operation to be ready to execute. Among these, device memory operations have HUGE latencies. If you don't handle them carefully, your code will end up waiting on memory operations all the time.
Some tasks can be optimized very well. If you benchmark GEMM in cuBLAS, it can reach over 80% of the peak FLOPS, on some devices even 90%. Other tasks simply cannot be optimized for FLOPS: for example, if you add one vector to another, the performance is always limited by memory bandwidth, and you will never see high FLOPS.
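To make the bandwidth point concrete, here is a minimal CUDA vector-add sketch (the bandwidth figure below is an assumption, not your card's measured value). Each element costs 1 FLOP but moves 12 bytes (two 4-byte loads and one 4-byte store), so at, say, ~80 GB/s of device bandwidth the kernel cannot exceed roughly 80 / 12 ≈ 7 GFLOP/s no matter how well it is tuned.

// one addition per element, but three float accesses: 1 FLOP per 12 bytes of traffic
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // bandwidth bound, never compute bound
}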
After having parallelized a C++ code via OpenMP, I am now considering using the GPU (a Radeon Pro Vega II) to speed up specific parts of my code. Being an OpenCL neophyte, I am currently searching for examples that can show me how to implement a multicore CPU - GPU interaction.
Here is what I want to achieve. Suppose you have a fixed short-length array, say {1,2,3,4,5}, and that, as an exercise, you want to compute all of its possible "right shifts", i.e.,
{5,1,2,3,4}
{4,5,1,2,3}
{3,4,5,1,2}
{2,3,4,5,1}
{1,2,3,4,5}
.
The corresponding OpenCL code is quite straightforward.
Now, suppose that your CPU has many cores, say 56, that each core has a different starting array, and that at any random instant of time any CPU core may ask the GPU to compute the right shifts of its own array. This core, say core 21, should copy its own array into the GPU memory, run the kernel, and wait for the result. My question is: during this operation, could the other CPU cores submit a similar request without waiting for the completion of the task submitted by core 21?
Also, can core 21 perform in parallel another task while waiting for the completion of the GPU task?
Would you feel like suggesting some examples to look at?
Thanks!
The GPU works with a queue of kernel calls and (PCIe) memory transfers. Within this queue, it can work on non-blocking memory transfers and a kernel at the same time, but not on two consecutive kernels. You could create several queues (one per CPU core); then kernels from different queues can be executed in parallel, provided that each kernel only takes up a fraction of the GPU resources. While the queue is being executed on the GPU, the CPU core can perform a different task, and with the command queue.finish() the CPU will wait until the GPU is done.
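For illustration only, here is a minimal sketch of the per-core-queue mechanics using the OpenCL C++ wrapper (the kernel name right_shift, the sizes, and the absence of error handling are all assumptions):

#include <CL/cl.hpp>   // or <CL/opencl.hpp> on newer SDKs
#include <vector>

void workerCore(cl::Context &context, cl::Device &device,
                cl::Program &program, std::vector<int> data)
{
    // each CPU thread owns its own queue and its own kernel object
    cl::CommandQueue queue(context, device);
    cl::Kernel kernel(program, "right_shift");   // kernel name is an assumption
    cl::Buffer buffer(context, CL_MEM_READ_WRITE, data.size() * sizeof(int));

    queue.enqueueWriteBuffer(buffer, CL_FALSE, 0, data.size() * sizeof(int), data.data());
    kernel.setArg(0, buffer);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(data.size()), cl::NullRange);
    queue.enqueueReadBuffer(buffer, CL_FALSE, 0, data.size() * sizeof(int), data.data());

    // ... the CPU core can do unrelated work here while the GPU drains its queue ...

    queue.finish();   // block only when the result is actually needed
}

Each std::thread (e.g. one per CPU core) would call workerCore with its own starting array; because the queues are independent, core 21 submitting work does not block the other cores. That said, as explained next, this is usually not the fastest way to drive a single GPU.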
However, letting multiple CPU cores send tasks to a single GPU is bad practice and will not give you any performance advantage, while making your code over-complicated. Each small PCIe memory transfer has a large latency overhead, and small kernels that do not sufficiently saturate the GPU have bad performance.
The multi-CPU approach is only useful if each CPU sends tasks to its own dedicated GPU, and even then I would only recommend it if the VRAM of a single GPU is not enough or if you want more parallel throughput than a single GPU allows.
A better strategy is to feed the GPU with a single CPU core and - if there is some processing to do on the CPU side - only then parallelize across multiple CPU cores. By combining small data packets into a single large PCIe memory transfer and large kernel, you will saturate the hardware and get the best possible performance.
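Reusing the objects from the sketch above, the batched version would look roughly like this (packing 56 arrays of 5 ints each is an assumption):

// pack all 56 starting arrays (5 ints each) contiguously into one host vector
std::vector<int> batch(56 * 5);
cl::Buffer batchBuf(context, CL_MEM_READ_WRITE, batch.size() * sizeof(int));
queue.enqueueWriteBuffer(batchBuf, CL_FALSE, 0, batch.size() * sizeof(int), batch.data()); // one large upload
kernel.setArg(0, batchBuf);
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(batch.size()), cl::NullRange); // one large kernel
queue.enqueueReadBuffer(batchBuf, CL_TRUE, 0, batch.size() * sizeof(int), batch.data());     // one large download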
For more details on how the parallelization on the GPU works, see https://stackoverflow.com/a/61652001/9178992
I know that you should generally have at least 32 threads running per block in CUDA, since threads are executed in groups of 32. However, I was wondering if it is considered acceptable practice to have only one block with a bunch of threads (I know there is a limit on the number of threads). I am asking because I have some problems that require shared memory and synchronization across every element of the computation. I want to launch my kernel like
computeSomething<<< 1, 256 >>>(...)
and just use the threads to do the computation.
Is it efficient to have just one block, or would I be better off doing the computation on the CPU?
If you care about performance, it's a bad idea.
The principal reason is that a given threadblock can only occupy the resources of a single SM on a GPU. Since most GPUs have 2 or more SMs, this means you're leaving somewhere between 50% and over 90% of the GPU performance untouched.
For performance, both of these kernel configurations are bad:
kernel<<<1, N>>>(...);
and
kernel<<<N, 1>>>(...);
The first is the case you're asking about. The second is the case of a single thread per threadblock; this leaves about 97% of the GPU horsepower untouched.
In addition to the above considerations, GPUs are latency-hiding machines: they like to have lots of threads, warps, and threadblocks available to select work from. Having lots of available threads helps the GPU hide latency, which generally results in higher efficiency (work accomplished per unit time).
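As a hedged sketch of what a launch that spreads across all SMs could look like (the kernel body, the factor of 4 blocks per SM, and the block size of 256 are placeholders, not tuned values):

#include <cuda_runtime.h>

__global__ void computeSomething(float *data, int n)
{
    // grid-stride loop: correct for any grid size, so the grid can be sized to the GPU
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        data[i] *= 2.0f;   // placeholder work
}

void launch(float *d_data, int n)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // several blocks per SM so the scheduler always has spare warps to hide latency
    int blocks = prop.multiProcessorCount * 4;
    computeSomething<<<blocks, 256>>>(d_data, n);
}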
It's impossible to tell if it would be faster on the CPU. You would have to benchmark and compare. If all of the data is already on the GPU, and you would have to move it back to the CPU to do the work, and then move the results back to the GPU, then it might still be faster to use the GPU in a relatively inefficient way, in order to avoid the overhead of moving data around.
As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to stay as far ahead of the game as possible on grunt work, so that every available CPU cycle can go toward "demanding frames" (like "many collisions during a single frame") without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync.
The analysis above presumes that swapping back/front buffers is a fundamental requirement for assuring a constant frame rate. I've seen claims that this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of rendering, just call the swap-buffer function and let it block if needed. Actually, you should have a thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads, and the operating system will keep those threads running while the rendering thread is waiting for vsync.
Furthermore, some people disable vsync because they want to see how many frames per second they can achieve. But with your approach, it seems that disabling vsync would just leave the fps around 60 anyway.
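A minimal sketch of that thread layout (renderFrame and stepPhysicsAndGameLogic are placeholder hooks, and the Display/GLXDrawable handles come from your existing GLX setup):

#include <GL/glx.h>
#include <atomic>
#include <thread>

// hypothetical engine hooks (assumptions)
void renderFrame();               // issues all OpenGL draw calls for one frame
void stepPhysicsAndGameLogic();   // one step of CPU-side simulation work

std::atomic<bool> engineRunning{true};

void renderThreadMain(Display *dpy, GLXDrawable win)
{
    while (engineRunning) {
        renderFrame();             // every OpenGL call happens on this thread, and only here
        glXSwapBuffers(dpy, win);  // let this block until vsync if it needs to
    }
}

void workerThreadMain()
{
    while (engineRunning)
        stepPhysicsAndGameLogic(); // keeps running while the render thread waits on vsync
}

// e.g. std::thread render(renderThreadMain, dpy, win); std::thread worker(workerThreadMain);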
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given that T is the time you have at your disposal before a vsync event happens, you want your frame to take about 1xT (or something near 1).
However, even if you are able to code your tasks so that they exploit cache locality and achieve fully deterministic timing behaviour (you know in advance how much time each task requires and how much time you have at your disposal), and so you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function may require fewer or more cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetching, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before it and on the prefetch algorithm!).
The operating system could interfere at any time by pausing your application to run some background process, which would increase the time of your "filling" tasks and effectively make you miss the vsync event (even if your "predicted" time is reasonable, like 0.85xT).
Sometimes you will still end up with a time of
1.3xT
while at the same time not having used all the available CPU power (when you miss a vsync event you have basically wasted your frame time, so it becomes wasted CPU power).
You can still work around this ;)
Buffer frames: queue up rendering calls for 2-3 frames ahead (no more! you are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that, use the game loop to idle or to do late work.
With that approach it is reasonable to occasionally exceed 1xT, because you have some "buffer frames".
Let's look at a simple example:
You scheduled tasks for 0.95xT, but since the program is running on a machine with a different CPU than the one you developed on, due to the different architecture your frame takes 1.3xT.
No problem: you know there are some buffered frames behind you, so you can still be fine, but now you have to schedule only 1xT - 0.3xT of work; better to use a safety margin too, so you schedule tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame again took 1.3xT, and now you have exhausted your reserve of frames. You just do a simple update and submit the GL calls; your program predicts 0.4xT.
Surprise: your program takes only 0.3xT for the following frames, even though you had scheduled work for more than 2xT; you again have 3 frames queued in the rendering thread.
Since you have some buffered frames and also some late work pending, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power; of course, if you measure that your queue usually has more than 2 frames buffered, you can cut the pool down to 2 instead of 3 to save some latency.
Of course this assumes you do all the work synchronously (apart from deferring the GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance if required.
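A rough sketch of the budgeting side of this idea (the 0.85 margin and all the hook functions are assumptions):

#include <chrono>

// hypothetical hooks (assumptions)
bool haveDeferrableWork();
void doOneDeferrableTask();
void updateAndRenderFrame();   // mandatory per-frame work, ends with submitting the GL calls
void swapOrQueueFrame();       // swap buffers, or hand the frame to the buffered queue

void frameLoop(double T)       // T = frame budget in seconds, e.g. 1.0 / 60.0
{
    using clock = std::chrono::steady_clock;
    const double margin = 0.85;   // only fill up to 0.85xT, keep a safety margin
    for (;;) {
        auto start = clock::now();
        updateAndRenderFrame();
        double used = std::chrono::duration<double>(clock::now() - start).count();
        // spend whatever is left of the budget on grunt work that could be deferred
        while (used < margin * T && haveDeferrableWork()) {
            doOneDeferrableTask();
            used = std::chrono::duration<double>(clock::now() - start).count();
        }
        swapOrQueueFrame();
    }
}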
I just made some benchmarks for this super question/answer Why is my program slow when looping over exactly 8192 elements?
I want to benchmark on one core, so the program is single-threaded. But it doesn't reach 100% usage of one core; it uses 60% at most, so my tests are not accurate.
I'm using Qt Creator, compiling using MinGW release mode.
Are there any parameters to set up for better performance? Is it normal that I can't use the full power of one core? Is it Qt related? Are there interruptions or something else preventing the code from running at 100%?
Here is the main loop
// horizontal sums for first two lines
for (i = 1; i < SIZE * 2; i++) {
    hsumPointer[i] = imgPointer[i - 1] + imgPointer[i] + imgPointer[i + 1];
}
// rest of the computation
for (; i < totalSize; i++) {
    // compute horizontal sum for next line
    hsumPointer[i] = imgPointer[i - 1] + imgPointer[i] + imgPointer[i + 1];
    // final result
    resPointer[i - SIZE] = (hsumPointer[i - SIZE - SIZE] + hsumPointer[i - SIZE] + hsumPointer[i]) / 9;
}
This is run 10 times on an array of SIZE*SIZE floats with SIZE=8193; the array is on the heap.
There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity; a pinning sketch follows below this list);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
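Regarding the affinity point in the first bullet, since this is MinGW on Windows, pinning the benchmark to one core could look roughly like this (pinning to core 0 is an arbitrary choice):

#include <windows.h>

int main()
{
    // pin this thread to CPU core 0 (bit 0 of the mask) so the load stays on one core
    SetThreadAffinityMask(GetCurrentThread(), 1);
    // ... run the single-threaded benchmark here ...
    return 0;
}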
Also, Let_Me_Be's comment on your question is right -- nothing here is Qt's fault, since no Qt functions are being called (assuming that the objects being read and written are just simple numeric data types, not fancy C++ objects with an overloaded operator=() or similar). The only activity taking place in this region of the code is purely CPU-based (well, the CPU will spend some time waiting for data to be sent to/from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.
At my company, we often test the performance of our USB and FireWire devices under CPU strain.
There is a test code we run that loads the CPU, and it is often used in really simple informal tests to see what happens to our device's performance.
I took a look at the code for this, and it's a simple loop that increments a counter and does a calculation based on the new value, storing the result in another variable.
Running a single instance will use 1/X of the CPU, where X is the number of cores.
So, for instance, if we're on an 8-core PC and we want to see how our device runs under 50% CPU usage, we can open four instances of this at once, and so forth...
I'm wondering:
What decides how much of the CPU gets used up? Does a single-threaded application just run everything as fast as it can on a single thread?
Is there a way to voluntarily limit the maximum CPU usage of your program? I can think of some "sloppy" ways (adding sleep commands or something), but is there a way to limit it to, say, a specified percentage of the available CPU?
CPU quotas on Windows 7 and on Linux.
Also on QNX (i.e. Blackberry Tablet OS) and LynuxWorks
In case of broken links, the articles are named:
Windows -- "CPU rate limits in Windows Server 2008 R2 and Windows 7"
Linux -- "CPU Usage Limiter for Linux"
QNX -- "Adaptive Partitioning"
LynuxWorks - "Partitioning Operating Systems" and "ARINC 653"
The OS usually decides how to schedule processes and on which CPUs they should run. It basically keeps a ready queue of processes which are ready to run (not marked for termination and not blocked waiting for some I/O, event, etc.). Whenever a process uses up its timeslice or blocks, it frees a processing core and the OS selects another process to run. Now, if you have a process which is always ready to run and never blocks, then this process essentially runs whenever it can, pushing the utilization of a processing unit to 100%. Of course this is a somewhat simplified description (there are things like process priorities, for example).
There is usually no generic way to achieve this. The OS you are using might offer some mechanism for it (some kind of CPU quota). You could also measure how much wall-clock time has passed versus how much CPU time your process has used, and then put your process to sleep for certain periods to approximate the desired CPU utilization.
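As a rough illustration of that last idea (the 20 ms / 100 ms duty cycle is an arbitrary assumption, and the result is only an approximation of 20% of one core):

#include <chrono>
#include <thread>

// burn for `busy` out of every `period`, then sleep for the rest
void limitedBurn(std::chrono::milliseconds busy, std::chrono::milliseconds period)
{
    volatile double sink = 0.0;                      // volatile so the loop isn't optimized away
    for (;;) {
        auto start = std::chrono::steady_clock::now();
        while (std::chrono::steady_clock::now() - start < busy)
            sink = sink + 1.0;                       // the actual CPU-burning work
        std::this_thread::sleep_for(period - busy);  // yield the rest of the period
    }
}

// e.g. limitedBurn(std::chrono::milliseconds(20), std::chrono::milliseconds(100)); // ~20% of one core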
You've essentially answered your own questions!
The key trait of code that burns a lot of CPU is that it never does anything that blocks (e.g. waiting for network or file I/O), and never voluntarily yields its time slice (e.g. sleep(), etc.).
The other trick is that the code must do something that the compiler cannot optimize away. So, most likely your CPU burn code outputs something based on the loop calculation at the end, or is simply compiled without optimization so that the optimizer isn't tempted to remove the useless loop. Since you're trying to load the CPU, there's no sense in optimizing anyways.
As you hypothesized, single threaded code that matches this description will saturate a CPU core unless the OS has more of these processes than it has cores to run them--then it will round-robin schedule them and the utilization of each will be some fraction of 100%.
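For example, a minimal burner of that kind (hypothetical, not your company's actual code) keeps the result in a volatile variable so the optimizer cannot delete the loop:

#include <cstdint>

int main()
{
    volatile double result = 0.0;            // volatile: the compiler cannot remove the stores
    for (uint64_t counter = 0; ; ++counter)  // never blocks, never sleeps
        result = result + counter * 0.5;     // a calculation based on the incremented counter
}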
The issue isn't how much time the CPU spends idle, but rather how long it takes for your code to start executing. Who cares if it's idle or doing low-priority busywork, as long as the latency is low?
Your problem is fundamentally a consequence of using a synthetic benchmark, presumably in an attempt to obtain reproducible results. But synthetic benchmarks tend to produce meaningless results, so reproducibility is moot.
Look at your bug database, find actual customer complaints, and use actual software and test hardware to reproduce a situation that actually made someone dissatisfied. Develop the performance test in parallel with hard, meaningful performance requirements.