C++ AMP with fast GPUs slower than CPU - c++

I'm just starting to learn C++ AMP and I've obtained a few examples that I've built with the VS 2012 RC, but I'm finding that the performance of the GPU is slower than the CPU. For instance, the examples by Kate Gregory: http://ampbook.codeplex.com/releases/view/90595 (relevant to her upcoming book http://www.gregcons.com/cppamp/). They were demonstrated by her in a lecture I watched where she obtained a ~5x performance improvement for the chapter 4 example by using her laptop's GPU (I believe she said it was a 6650) compared to CPU (not sure what CPU she had). I've tried testing the example myself and on a couple of system configurations (as below) I've always found the CPU to be faster. I've also tested other examples and found the same. Am I doing something wrong? Is there a reason for the slower than expected performance? Does anyone have an example that would definitely show the GPU being faster?
System 1: Intel i7 2600K with onboard graphics (I expect this to be slower)
System 2: Intel i7 2630QM with Intel HD switchable with AMD 6770 (I have it running in performance mode, so it should be using the 6770)
System 3: Intel i5 750 with 2x Crossfire AMD HD 5850
Example of results: chapter4 project results in 1.15ms CPU, 2.57ms GPU, 2.55ms GPU tiled.
Edit:
Doh, I think I just found the reason why - the values for the size of the matrices she used in the lecture were different. The sample on the website uses M=N=W=64. If I use 64, 512 and 256 as she did in the lecture then I get the corresponding ~5x increase in performance.

It seems like your overarching question is WHY moving things to the GPU doesn't always get you a benefit. The answer is copy time. Imagine a calculation that takes time proportional to n squared, while copying takes time proportional to n. You might need quite a large n before the time spent copying to and from the GPU is outweighed by the time saved doing the calculation there.
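To make that concrete, here is a minimal C++ AMP sketch of an n x n matrix multiply (my own illustration, not the book's chapter 4 code). The two input copies and the one output copy grow with n squared, while the multiply itself grows with n cubed, so the GPU only starts to win once n is large enough:

#include <amp.h>
#include <vector>
using namespace concurrency;

// Multiply two n x n matrices (row-major, size n*n) on the default accelerator.
// Copying a and b in and c out costs O(n^2); the multiply costs O(n^3).
void multiply_gpu(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, int n)
{
    array_view<const float, 2> av(n, n, a);
    array_view<const float, 2> bv(n, n, b);
    array_view<float, 2> cv(n, n, c);
    cv.discard_data();                              // no need to copy c to the GPU

    parallel_for_each(cv.extent, [=](index<2> idx) restrict(amp)
    {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += av(idx[0], k) * bv(k, idx[1]);
        cv[idx] = sum;
    });

    cv.synchronize();                               // copy the result back and wait
}

Timed against a plain triple loop on the CPU, small sizes like M=N=W=64 are dominated by the copies, which matches the results in the question; the larger Tech Ed sizes are where the GPU pulls ahead.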
The book mentions this briefly in the early chapters, and Chapters 7 and 8 are all about performance and optimization. Chapter 7 is on Rough Cuts now; Chapter 8 should be there shortly. (Its code is already on Codeplex - the Reduction case study.)
I've just checked in an update to the Chapter 4 code that uses the Tech Ed starting numbers instead of the ones that were there before. Smaller matrices lose too much time to the copy to/from the GPU; larger ones take too long to be a good demo. But do feel free to play around with the sizes - make them even larger if you don't mind a minute or two of "dead air" - and see what happens.

Related

Sycl kernel-call very slow

I am new to Stack Overflow, SYCL and GPU programming. I have a project with a working basic SYCL kernel. The logic works, so I'm skipping it in the question. There are also no errors during compilation or execution.
The big problem is that the call into the SYCL code is very slow. At first I thought it was some memory copying or similar, so I stripped it down to the bare minimum you can see below (the comments mark where code would be in the full kernel).
My measured times: (Release x64)
measured with the Visual Studio debugger, total time of the function with an empty kernel call: ~100 ms
measured with NVIDIA Nsight, OpenCL kernel execution time: ~5 µs
The kernel GPU time of ~5 µs is as fast as expected for an empty kernel.
But the total time of the C++ function in my code, ~100 ms, is slow.
What could be the problem here? Or is the SYCL overhead expected to be this slow? (I really doubt that.)
My efforts:
I changed my compute++.exe flags from -O2 to -O3, which improved the total time by about 5 to 10 ms.
I reduced the kernel to the bare minimum.
The code inside a DLL function:
// needs: #include <CL/sycl.hpp>, and a forward declaration "class someName;" at namespace scope for the kernel name
{ // scope
    cl::sycl::gpu_selector gpuSel;
    cl::sycl::queue myQueue(gpuSel);

    // ....buffers

    auto ra = cl::sycl::range<1>(size);

    myQueue.submit([&](cl::sycl::handler& hd)
    {
        // ....get_access<access::mode::read>

        auto kernel = [=](cl::sycl::id<1> id)
        {
            // ...some vector math
        };

        hd.parallel_for<someName>(ra, kernel);
    });

    myQueue.wait();
}
I am using:
Visual Studio 2019
ComputeCpp Community 2.0.0
Latest Cuda Drivers
NVIDIA GTX 980, ptx64 (experimental ComputeCpp support)
compute++ call:
"..\compute++.exe" -sycl -D_ALLOW_COMPILER_AND_STL_VERSION_MISMATCH -O3 -mllvm -inline-threshold=1000 -intelspirmetadata -sycl-target ptx64 -std=c++14 -I"../Codeplay/ComputeCpp/include" -I"../NVIDIA GPU Computing Toolkit/CUDA/v10.2/include" -sycl-ih something.cpp.sycl -c something.cpp
Summarized:
The total execution time of a SYCL kernel is slow.
Can I do something here to improve it, or is this overhead inherent to the SYCL/ComputeCpp implementation on NVIDIA GPUs and expected to be this slow?
First I would point out that this is a very simple piece of SYCL code, so if you are looking to measure performance it's probably not a very relevant example. Here's a research paper showing comparable performance between ComputeCpp and CUDA on a reduction algorithm benchmark; see slide 40 for the chart. You'll also see in the presentation that the performance benefit grows quickly with the size of the data set being worked on. That is generally true for HPC programming: the benefits of a GPU usually only show up when processing larger data sets.
The difference you are seeing is because ComputeCpp uses OpenCL callbacks, and the NVIDIA OpenCL driver does seem to introduce an overhead when using these callbacks. Here's a relevant post about this from a while back.
If you were to write a simple OpenCL kernel that uses callbacks it would exhibit the same sort of behaviour.
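For reference, a bare OpenCL launch with a completion callback looks roughly like the sketch below - not ComputeCpp's internals, just the general pattern, with error checking omitted and the queue/kernel setup assumed to exist elsewhere:

#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

// Called by the OpenCL runtime (on a driver thread) once the kernel finishes.
void CL_CALLBACK on_kernel_done(cl_event /*event*/, cl_int status, void* /*user_data*/)
{
    std::printf("kernel finished with status %d\n", status);
}

void launch_with_callback(cl_command_queue queue, cl_kernel kernel, size_t global_size)
{
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size, nullptr, 0, nullptr, &ev);
    clSetEventCallback(ev, CL_COMPLETE, on_kernel_done, nullptr); // the callback path in question
    clWaitForEvents(1, &ev);
    clReleaseEvent(ev);
}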
I'd also add that we've implemented NVIDIA support for the DPC++ compiler that uses CUDA directly and does not see the same level of overhead. You can find out more about that in our blog post, it would be worth giving that a try if you want to run SYCL code on NVIDIA hardware.
GPUs are terrible when you want to add or multiply 3 or 4 numbers. For that you are better off using the CPU: it is optimized for that, and you may have an AVX extension, which is built for vector math. So in that case you should replace cl::sycl::gpu_selector with cl::sycl::cpu_selector. I'm not sure whether SYCL uses AVX when you have it, but it will definitely use multithreading.
But when you're trying to add 500,000 numbers, the GPU will be much faster than the CPU.
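For example, a minimal way to switch between the two (a hypothetical helper, not the asker's code, assuming the SYCL 1.2.1 interface that ComputeCpp 2.0.0 exposes):

#include <CL/sycl.hpp>

// Pick the CPU for small workloads and the GPU for large ones;
// everything downstream (submit, parallel_for, wait) stays the same.
cl::sycl::queue make_queue(bool small_workload)
{
    if (small_workload)
    {
        cl::sycl::cpu_selector cpuSel;   // a few numbers: stay on the CPU
        return cl::sycl::queue(cpuSel);
    }
    cl::sycl::gpu_selector gpuSel;       // hundreds of thousands of numbers: offload
    return cl::sycl::queue(gpuSel);
}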
This video explains it very well.

Extremely slow ffmpeg/sws_scale() - only on heavy duty

I am writing a video player using ffmpeg (Windows only, Visual Studio 2015, 64 bit compile).
With common videos (up to 4K @ 30 FPS), it works pretty well. But with my maximum target - 4K @ 60 FPS - it fails. Decoding is still fast enough, but when it comes to the YUV/BGRA conversion it is simply not fast enough, even though it's done in 16 threads (one thread per frame on a 16/32 core machine).
So as a first countermeasure I skipped the conversion of some frames and got a stable frame rate of ~40 that way. Comparing the two versions in Concurrency Visualizer, I found a strange issue whose cause I don't understand.
Here's an image of the frameskip version:
You see that the conversion is pretty quick (average roughly ~35ms)
Thus, as multiple threads are used, it should also be quick enough for 60 FPS - but it isn't!
The image of the non-frameskip version shows why:
The conversion of a single frame has become ten times slower than before (average roughly ~350ms). Now a heavy workload on many cores would of course cause a minor slowdown per core due to reduced turbo - let's say 10 or 20%. But never an extreme slowdown of ~1000%.
Interesting detail is, that the stack trace of the non-frameskip version shows some system activity I don't really understand - beginning with ntoskrnl.exe!KiPageFault+0x373. There are no exceptions, other error messages or such - it just becomes extremely slow.
Edit: A colleague just told me that at first glance this looks like a memory problem with paged-out memory - but my memory utilization is low (below 1 GB, with more than 20 GB free).
Can anyone tell me what could be causing this?
This is probably too old to be useful, but just for the record:
What's probably happening is that you're allocating 4K frames over and over again in multiple threads. The Windows allocator really doesn't like that access pattern.
The malloc itself will not show up in the profiler, because the OS only faults the pages in when the memory is actually accessed. That shows up as ntoskrnl.exe!KiPageFault and gets attributed to the function that first touches the new memory.
Solutions include:
Using a different allocator (e.g. tbb_malloc, mimalloc, etc.)
Using your own per-thread or per-process frame pool (a rough sketch is below). ffmpeg does something similar internally; maybe you can just use that.
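Here is a hypothetical sketch of such a pool (FramePool and its members are made-up names, not an ffmpeg API; ffmpeg's own AVBufferPool is probably the "something similar internally" mentioned above). Each thread keeps one instance and reuses already-touched buffers instead of allocating a fresh 4K frame for every conversion:

extern "C" {
#include <libavutil/frame.h>
}
#include <vector>

class FramePool {
public:
    FramePool(int width, int height, AVPixelFormat fmt)
        : width_(width), height_(height), fmt_(fmt) {}

    ~FramePool() {
        for (AVFrame* f : free_) av_frame_free(&f);
    }

    AVFrame* acquire() {
        if (!free_.empty()) {                // reuse a buffer that is already paged in
            AVFrame* f = free_.back();
            free_.pop_back();
            return f;
        }
        AVFrame* f = av_frame_alloc();       // first use: allocate once, keep forever
        f->width  = width_;
        f->height = height_;
        f->format = fmt_;
        av_frame_get_buffer(f, 32);          // 32-byte alignment suits SIMD conversion
        return f;
    }

    void release(AVFrame* f) { free_.push_back(f); }

private:
    int width_, height_;
    AVPixelFormat fmt_;
    std::vector<AVFrame*> free_;
};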

Want to improve the computing speed of matrix calculation, OpenMP or CUDA?

My program has a bunch of matrix multiplication and inversion, which is time consuming.
My computer: CPU: Intel i7; GPU: 512 MB NVIDIA Quadro NVS 3100M
Which one is better for improving computing speed? OpenMP or CUDA?
(P.S. I think that, in general, a GPU has more cores than a CPU, so CUDA could give a much larger improvement than OpenMP?)
From my experience (I've worked with both for a school project): in most conditions the calculation time for a medium-size array - I would say less than 2000 x 2000 - is almost the same. The actual calculation time depends on the load on your machine (when working with OpenMP you usually share a cluster with other people, so make sure you are running your application alone to get a better result).
But if you are good at CUDA, the GPU is very powerful for this kind of calculation. When I was working on my CUDA project there were lots of good materials on the official website. OpenMP is only a library, and if you are good at C or C++ it should not be any problem for you to use it (but OpenMP compiler support can be buggy, so don't trust it blindly - try to log everything).
And I assume you have experience with CUDA; it is not hard to find some good examples, I think. But CUDA is really awkward - you can't easily debug it - so I recommend you try OpenMP first; it should be easier.
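For example, a plain matrix multiply parallelized with OpenMP is just one pragma (a minimal sketch, not from either answer; compile with -fopenmp on GCC/Clang or /openmp on MSVC):

// a, b, c are row-major n x n matrices; the outer loop's rows are split across the CPU cores.
void matmul_omp(const double* a, const double* b, double* c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}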
I'd guess it depends on what your application is and how you go about implementing the improvements. Keep in mind that every optimization has tradeoffs. For instance, GPUs are typically much faster at single-precision floating point than at double precision, and there are compiler options that let you bypass some aspects of the IEEE standard, which buys you some extra speed at the expense of precision, etc.

How to optimize large data manipulation in parallel

I'm developing a C/C++ application to manipulate large quantities of data in a generic way (aggregation/selection/transformation).
I'm using an AMD Phenom II X4 965 Black Edition, so I have a decent amount of cache at the different levels.
I've developed both ST and MT versions of the functions that perform all the individual operations and, not surprisingly, in the best case the MT versions are only 2x faster than the ST ones, even when using 4 cores.
Given that I'm a fan of using 100% of the available resources, I was annoyed that it's just 2x - I'd want 4x.
For this reason I've already spent quite a considerable amount of time with -pg and valgrind, using the cache simulator and call graph. The program is working as expected, the cores are sharing the input process data (i.e. the operations to apply to the data), and cache misses are reported (as expected) when the different threads load the data to be processed (millions of entities, or rows, if by now you have an idea of what I'm trying to do :-) ).
Eventually I've tried different compilers, g++ and clang++, both with -O3, and performance is identical.
My conclusion is that, due to the large amount of data (GBs) to process, and given that the data eventually has to be loaded into the CPU, this is real wait time.
Can I further improve my software? Have I hit a limit?
I'm using C/C++ on Linux x86-64, Ubuntu 11.10.
I'm all ears! :-)
What kind of application is it? Could you show us some code?
As I commented, you might have reached some hardware limit like RAM bandwidth. If you did, no software trick could improve it.
You might investigate using MPI, OpenMP, or OpenCL (on GPUs) but without an idea of your application we cannot help.
If you compile with GCC and want to help the processor's cache prefetching, consider using __builtin_prefetch, with care and parsimony (using it too much or badly would decrease performance).
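Something along these lines, for example (a hedged sketch; PREFETCH_DISTANCE is a made-up tuning constant, and any benefit has to be measured, since the hardware prefetcher often already handles sequential scans like this):

#include <cstddef>

constexpr std::size_t PREFETCH_DISTANCE = 16;   // elements ahead; tune by measuring

double sum_array(const double* data, std::size_t n)
{
    double total = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Hint the cache to start fetching data we will need a few iterations from now.
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], /*rw=*/0, /*locality=*/1);
        total += data[i];
    }
    return total;
}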

Death of the Cell processor

Lately I have heard lots of people claiming that the Cell processor is dead, mainly for the following reasons:
Lack of support in the new PlayStation 3, as the user cannot install Linux
The increasing processing power of GPUs and their sinking costs
The existence of a unified programming approach (OpenCL) for different GPUs but not for the CBE (well, today one was announced for the Cell!)
Scarcity of real-world examples of use of the Cell (apart from academic circles)
A general feeling of failure
What do you think? If you started programming the Cell two or three years ago, will you continue with it, or are you considering switching to GPUs? Is a new version of the Cell coming?
Thanks
I'd say the reasons for the lack of popularity of Cell development are closer to:
The lack of success in the PS3 (due to many mistakes on Sony's part and strong competition from the XBOX 360)
Low manufacturing yield, high cost (partly due to low yield), and lack of affordable hardware systems other than the PS3
Development difficulty (the Cell is an unusual processor to design for and the tooling is lacking)
Failure to achieve a significant performance difference compared to existing x86-based commodity hardware. Even the XBOX 360's several-year-old triple-core Power architecture processor has proven competitive; compared to a modern Core 2 Quad processor, the Cell's advantages just aren't evident.
Increasing competition from GPU general purpose computing platforms such as CUDA
It's easier to write parallel programs for 1000s of threads than it is for 10s of threads. GPUs have 1000s of threads, with hardware thread scheduling and load balancing. Although current GPUs are suited mainly for small data-parallel kernels, they have tools that make such programming trivial. Cell has only a few processors - on the order of tens - in consumer configurations. (The Cell derivatives used in supercomputers cross that line and have 100s of processors.)
IMHO one of the biggest problems with Cell was the lack of an instruction cache. (I argued this vociferously with the Cell architects on a plane back from the MICRO conference in Barcelona in 2005. Although they disagreed with me, I have heard the same from big supercomputer users of Cell.) People can cope with fitting into fixed-size data memories - GPUs have the same problem, although they complain. But fitting code into a fixed-size instruction memory is a pain. Add an IF statement, and performance may fall off a cliff because you have to start using overlays. It's a lot easier to control your data structures than it is to avoid having to add code to fix bugs late in the development cycle.
GPUs originally had the same problems as Cell - no caches, neither I$ nor D$.
But GPUs did more threads and data parallelism so much better than Cell that they ate up that market, leaving Cell only its locked-in console customers and codes that were more complicated than GPU code but less complicated than CPU code - squeezed in the middle.
And, in the meantime, GPUs are adding I$ and D$, so they are becoming easier to program.
Why did Cell die?
1) The SDK was horrid. I saw some very bright developers all but scratch their eyes out poring through IBM mailing lists trying to figure out this problem or that with the Cell SDK.
2) The bus between compute units was starting to show scaling problems and never would have made it to 32 cores.
3) OpenCL was about 3-4 years too late to be of any use.
If you started two or three years ago to program the cell, will you continue on this or are you considering switching to GPU's?
I would have thought that 90% of the people who program for the Cell processor are not in a position where they can arbitrarily decide to stop programming for it. Are you aiming this question at a very specific development community?