Clueless on how to execute big tasks on C++ AMP - c++

I have a task to see if an algorithm I developed can be ran faster using computing on GPU rather than CPU. I'm new to computing on accelerators, I was given a book "C++ AMP" which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past but nowadays its mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have a task to compute some complicated function that takes a huge matrix input (like 50000 x 50000) and some other data and outputs matrix of same size. Total calculation for the whole matrix takes several hours.
On CPU, I'd just cut tasks into several pieces (number of pieces being something like 100 or so) and execute them using Parralel.For or just a simple task managing loop I wrote myself. Basically, keep several threads running (num of threads = num of cores), start new part when thread finishes, until all parts are done. And it worked well!
However, on GPU, I cannot use the same approach, not only because of memory constraints (that's ok, can partition into several parts) but because of the fact that if something runs for over 2 seconds it's considered a "timeout" and GPU gets reset! So, I must ensure that every part of my calculation takes less than 2 seconds to run.
But that's not every task (like, partition a hour-long work into 60 tasks of 1sec each), which would be easy enough, thats every bunch of tasks, because no matter what queue mode I choose (immediate or automatic), if I run (via parralel_for_each) anything that takes in total more than 2s to execute, GPU will get reset.
Not only that, but if my CPU program hogs all CPU resource, as long as it is kept in lower priority, UI stays interactive - system is responsive, however, when executing code on GPU, it seems that screen is frozen until execution is finished!
So, what do I do? In the demonstrations to the book (N-Body problem), it shows that it is supposed to be like 100x as effective (multicore calculations give 2 gflops, or w/e amount of flops that was, while amp give 200 gflops), but in real application, I just don't see how to do it!
Do I have to partition my big task into like, into billions of pieces, like, partition into pieces that each take 10ms to execute and run 100 of them in parralel_for_each at a time?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!

TDRs (the 2s timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects your application from completely locking up the display by enforcing a timeout. This will also impact applications which try and render to the screen. Moving your AMP code to a separate CPU thread will not help, this will free up your UI thread on the CPU but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low power system. The maximum value of N is actually limited in the application to prevent you running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break up your work into chunks that fit into sub 2s chunks or smaller ones if you want to hit a particular frame rate. You should also consider how your work is being queued. Remember that all AMP work is queued and in automatic mode you have no control over when it runs. Using immediate mode is the way to have better control over how commands are batched.
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.

Related

Is there any way to maximize my application's timeslice on Windows?

I have a 3D application that needs to generate a new frame roughly every 6ms or so. This frame-rate needs to be constant in order to not result in stuttering. To make matters worse, the application has to perform several moderately-heavy calculations (mostly preparing the 3D scene and copying data to the VRAM) so there is it consumes a fairly large amount of that ~6ms doing it's own stuff.
This has been a problem because Windows causes my application to stutter a bit when it tries to use the CPU for other things. Is there any way I could make Windows not "give away" timeslices to other processes? I'm not concerned about it negatively impacting background processes.
Windows will allow you to raise your application's priority. A process will normally only lose CPU time to other processes with the same or higher priority, so raising your priority can prevent CPU time from being "stolen".
Be aware, however, that if you go too far, you can render the system unstable, so if you're going to do this, you generally only want to raise priority a little bit, so it higher than other "normal" applications.
Also note that this won't make a huge difference. If you're running into small problem once in a while, increasing the priority may take care of the problem. If it's a constant problem, chances are that a priority boost won't be sufficient to fix it.
If you decide to try this, see SetPriorityClass and SetThreadPriority.
It normally depends on the scheduling algorithm used by your OS. Windows 7,8,XP,VISTA use the multilevel queue scheduling with round robin so increasing the priority of your application or thread or process will do what you want. Which version of Windows are u currently using?.. I can help u accordingly once i get to know that
You can raise your process priority, but I don’t think it will help much. Instead, you should optimize your code.
For a start, use a VS built-in profiler (Debug/Performance profiler menu) to find out where your app spends most time, optimize that.
Also, all modern CPUs are at least dual core (the last single-core Celeron is from 2013). Therefore, “there it consumes a fairly large amount of that ~6ms doing it's own stuff” shouldn’t be the case. Your own stuff should be running in a separate thread, not on the same thread you use to render. See this article for an idea how to achieve that. You probably don’t need that level of complexity, just 2 threads + 2 tasks (compute and render) will probably be enough, but that article should give you some ideas how to re-design your app. This approach however will bring 1 extra frame of input latency (for 1 frame the background thread will compute stuff, only the next frame the renderer thread will show the result of that), but with your 165Hz rendering you can probably live with that.

Why does my logging library cause performance tests to run faster?

I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many venues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem or that the cause will be obvious to someone who has more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lockless queue (Boost.Lockless) and a back-end thread that performs string formatting and writes the log entries to disk.
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Run it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret them differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but results were still weird and anyway I couldn't do the change in any of the multithreaded variants.
Compare the generated assembly code. There were differences but the logging build was clearly doing more things. Nothing stood out as being an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to, std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.

how to compute in game loop until the last possible moment

As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to keep as far ahead of the game as possible on grunt work, so every possible CPU cycle is available to process "demanding frames" (like "many collisions during a single frame") can be processed without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync).
The analysis above presumes swapping back/front buffers is fundamental requirement to assure a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of the rendering just call the swap buffer function and let it block if needed. Actually, you should have a thread that perform all your OpenGL API calls, and only that. If there is another computation to perform (e.g. physics, game logic), use other threads and the operating system will let these threads running while the rendering thread is waiting for vsync.
Furthermore, if some people disable vsync, they would like to see how many frames per seconds they can achieve. But with your approach, it seems that disabling vsync would just let the fps around 60 anyway.
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given T is the time you have at your disposal before a Vsync event happens, you want to make your frame using 1xT seconds (or something near to 1).
However, even if you are so able to code tasks so that they can exploit cache locality to achieve fully deterministic time behaviour (you know in advance how much time each tasks require and how much time you have at your disposal) and so you can theorically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to mesure it and it seems to hic-cup: those are drivers dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function but on another CPU that function requires less or more cycles due to better/worse prefeteching or pipelining.
Even when running on the same CPU, another task may pollute the prefeteching algorithm so the same function does not necessarily results in same CPU cycles (depends on functions called before and prefetech algorihtm!)
Operative system could interfere at any time by pausing your application to run some background process, that would increase the time of your "filling" tasks effectively making you miss the Vsync event (even if your "predicted" time is reasonable like 0.85xT)
At some times you can still get a time of
1.3xT
while at the same time you didn't used all the possible CPU power (When you miss a Vsync event you basically wasted your frame time so it becomes wasted CPU power)
You can still workaround ;)
Buffering frames: you store Rendering calls up to 2/3 frames (no more! You already adding some latency, and certain GPU drivers will do a similiar thing to improve parallelism and reduce power consumption!), after that you use the game loop to idle or to do late works.
With that approach it is reasonable to exceed 1xT. because you have some "buffer frames".
Let's see a simple example
You scheduled tasks for 0.95xT but since the program is running on a machine with a different CPU than the one you used to develop the program due to different architecture your frame takes 1.3xT.
No problem you know there are some frames behind so you can still be happy, but now you have to launch a 1xT - 0.3xT task, better using also some security margin so you launch tasks for 0.6xT instead of 0.7xT.
Ops something really went wrong, the frame took again 1.3xT now you exausted your reserve of frames, you just do a simple update and submit GL calls, your program predict 0.4xT
surprise your program took 0.3xT for the following frames even if you scheduled work for more than 2xT, you have again 3 frames queued in the rendering thread.
Since you have some frames and also have late works you schedule a update for 1,5xT
By introducing a little latency you can exploit full CPU power, of course if you measure that most times your queue have more than 2 frames buffered you can just cut down the pool to 2 instead of 3 so that you save some latency.
Of course this assumes you do all work in a sync way (apart deferring GL cals). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).

Running a process multiple times at the same time

I have a c++ program with opencv library which takes an image as input and perform pose estimation,color detection,phog. When I run this program from the command line it takes around 4-5sec to complete. It takes around 60%cpu. When I try to run the same program from two different command line windows at the same time the process takes around 10-15 sec to finish and both the process finish in almost the same time. The CPU Usage reaches upto 100%.
I have a website which calls this c++ exe using exec() command. So when two users try to upload an image and run it takes more time as I explained above in the command line. Is this because the c++ program involves high computation and the CPU reaches 100% it slows down? But I read that the CPU reaching 100% is not a bad thing as the computer is using its full capacity to run the program. So is this because of my c++ program or is it something to do with my server(computer) settings? This is probably not the apache server problem because when I try to run it from the command line also it slows down. I am using a quad core processor and all the 4 CPU reaches 100% when I try to run the same process at the same time so I think that its distributed among all the processor. So I have few more questions:
1) Can this be solved by using multithreading in my c++ code?As for now I am not using it but will multithreading make the c++ code more computationally expensive and increase the CPU usage(if this is the problem).
2) What can be the reason of it slowing down? Is the process in a queue and each process is ran only a certain amount of time and it switches between the two process?
3) If this is because it involves high computation will it help if I change some functions to opencv gpu functions?
4) Is there a way I can solve this problems any ideas or tips?
I have inserted the result of top when running one process and running the same process twice at the same time:
Version5 is the process,running it once
Two Version5 running at the same time
The CPU info:
Thanks in advance.
After zooming so that your picture fills almost my entire 22" screen, I can make out that the CPU flags show "ht", which means "hyperthreading", so you actually only have two genuine cores, that are shared between two hyperthreads. So running on all four CPU cores at once will not give the same performance as running on two genuine cores.
In other words, the "loss of performance" is entirely as you'd expect, because you have four threads fighting for the actual computational resources of two CPU cores. Hyperthreading helps if the code has a lot of memory interaction that can be "hidden" by running a second thread. But if you have a CPU intensive code, that isn't "missing in the cache" much, then the gain is much less, and in extreme cases, hyperthreading will actually cause slow-downs (because the code in one thread disrupts the caches and otherwise "gets in the way" of the first thread). You may want to experiment by going into the BIOS settings and turn off the hyperthreading, and compare the results. Sure, running two instances of the code will clearly still take longer, but the question is "is it longer than running with hyperthreading" - unfortunately, it's impossible to say for sure which is better from a theoretical standpoint (even if I could see the assembly code and understood the memory access patterns - without that level of detail, it's completely impossible to judge).
When running only one process reaches 60% of CPU usage it would be possible that using multithreading speeds up the execution. However, the CPU usage is likely to be higher
That's true. There might be an additional overhead for context switching (multitasking)
Changing functions can bring some improvements, but without having your code it is hard say.
Since the computational effort is that high, I think you have to decide whether you accept a high CPU usage or a longer execution time (of course after optimizing the code itself)
Greetz

Why this C++ code don't reach 100% usage of one core?

I just made some benchmarks for this super question/answer Why is my program slow when looping over exactly 8192 elements?
I want to do benchmark on one core so the program is single threaded. But it doesn't reach 100% usage of one core, it uses 60% at most. So my tests are not acurate.
I'm using Qt Creator, compiling using MinGW release mode.
Are there any parameters to setup for better performance ? Is it normal that I can't leverage CPU power ? Is it Qt related ? Is there some interruptions or something preventing code to run at 100%...
Here is the main loop
// horizontal sums for first two lines
for(i=1;i<SIZE*2;i++){
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
}
// rest of the computation
for(;i<totalSize;i++){
// compute horizontal sum for next line
hsumPointer[i]=imgPointer[i-1]+imgPointer[i]+imgPointer[i+1];
// final result
resPointer[i-SIZE]=(hsumPointer[i-SIZE-SIZE]+hsumPointer[i-SIZE]+hsumPointer[i])/9;
}
This is run 10 times on an array of SIZE*SIZE float with SIZE=8193, the array is on the heap.
There could be several reasons why Task Manager isn't showing 100% CPU usage on 1 core:
You have a multiprocessor system and the load is getting spread across multiple CPUs (most OSes will do this unless you specify a more restrictive CPU affinity);
The run isn't long enough to span a complete Task Manager sampling period;
You have run out of RAM and are swapping heavily, meaning lots of time is spent waiting for disk I/O when reading/writing memory.
Or it could be a combination of all three.
Also Let_Me_Be's comment on your question is right -- nothing here is QT's fault, since no QT functions are being called (assuming that the objects being read and written to are just simple numeric data types, not fancy C++ objects with overloaded operator=() or something). The only activities taking place in this region of the code are purely CPU-based (well, the CPU will spend some time waiting for data to be sent to/from RAM, but that is counted as CPU-in-use time), so you would expect to see 100% CPU utilisation except under the conditions given above.