how to compute in game loop until the last possible moment - opengl

As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to stay as far ahead of the game as possible on grunt work, so that every available CPU cycle can go toward "demanding frames" (like "many collisions during a single frame") without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync.
The analysis above presumes that swapping back/front buffers is a fundamental requirement for assuring a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic behind them.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be because glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
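For reference, the kind of measurement described above can be sketched like this (assuming an x86-64 CPU so __rdtscp is available, and an already-created GLX window/context; end_frame is just an illustrative name):

```cpp
#include <GL/glx.h>       // glXSwapBuffers, glFinish (via GL/gl.h)
#include <x86intrin.h>    // __rdtscp (x86-64)
#include <cstdio>

// Capture a 64-bit cycle count just before and just after the end-of-frame
// sync + swap. 'dpy' and 'win' are assumed to already exist; window and
// context creation are omitted here.
void end_frame(Display* dpy, GLXDrawable win)
{
    unsigned aux = 0;
    unsigned long long t0 = __rdtscp(&aux);

    glFinish();                 // wait for the GPU to finish this frame
    glXSwapBuffers(dpy, win);   // request the back/front buffer exchange

    unsigned long long t1 = __rdtscp(&aux);
    std::printf("glFinish + glXSwapBuffers: %llu cycles\n", t1 - t0);
}
```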
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.

This is my advice: at the end of rendering, just call the swap-buffer function and let it block if needed. In fact, you should have a thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads; the operating system will keep those threads running while the rendering thread is waiting for vsync.
Furthermore, some people disable vsync because they want to see how many frames per second they can achieve. With your approach, it seems that disabling vsync would just leave the fps at around 60 anyway.
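A minimal sketch of that threading layout, with placeholder stubs standing in for the real subsystems (none of these function names come from your engine):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};

// Placeholder stubs for the real subsystems.
void update_physics()    { /* integrate, resolve collisions, ... */ }
void update_game_logic() { /* AI, scripting, ... */ }
void render_and_swap()   // all GL calls + the (possibly blocking) buffer swap
{
    std::this_thread::sleep_for(std::chrono::milliseconds(16)); // stand-in for the vsync wait
}

int main()
{
    // One thread owns the GL context and does nothing but rendering;
    // blocking inside the swap only parks this thread.
    std::thread render_thread([] { while (running) render_and_swap(); });

    // Simulation work keeps running while the render thread waits for vsync.
    std::thread sim_thread([] {
        while (running) { update_physics(); update_game_logic(); }
    });

    std::this_thread::sleep_for(std::chrono::seconds(5));  // run briefly, then stop
    running = false;
    render_thread.join();
    sim_thread.join();
}
```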

I'll try to re-interpret your problem (if I missed something, tell me and I can update the answer):
Given that T is the time you have at your disposal before a vsync event happens, you want to produce your frame in 1xT seconds (or something close to 1xT).
However, even if you are able to code tasks so that they exploit cache locality and achieve fully deterministic timing behaviour (you know in advance how much time each task requires and how much time you have at your disposal), so that you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function requires fewer or more cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetcher, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before it and on the prefetch algorithm!)
The operating system could interfere at any time by pausing your application to run some background process; that would increase the time of your "filling" tasks and effectively make you miss the vsync event (even if your "predicted" time was reasonable, like 0.85xT)
Sometimes you will still get a time of
1.3xT
while at the same time you didn't use all of the available CPU power (when you miss a vsync event you have basically wasted your frame time, so it becomes wasted CPU power)
You can still work around this ;)
Buffering frames: you store rendering calls for up to 2-3 frames (no more! you are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that, you use the game loop to idle or to do late work.
With that approach it is reasonable to exceed 1xT, because you have some "buffer frames".
Let's see a simple example
You scheduled tasks for 0.95xT, but because the program is running on a machine with a different CPU architecture than the one you developed on, your frame takes 1.3xT.
No problem: you know there are some frames already buffered, so you can still be happy, but now you have to launch a task of 1xT - 0.3xT; better to also keep some safety margin, so you launch tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame again took 1.3xT and now you have exhausted your reserve of frames, so you just do a simple update and submit the GL calls; your program predicts 0.4xT.
Surprise: your program took only 0.3xT for the following frames, even though you had scheduled work for more than 2xT, and you again have 3 frames queued in the rendering thread.
Since you have some frames buffered and also have late work pending, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power; of course, if you measure that most of the time your queue has more than 2 frames buffered, you can cut the pool down to 2 instead of 3 and save some latency.
Of course this assumes you do all the work synchronously (apart from deferring the GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).
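A rough sketch of the buffering idea, assuming the update side records each frame's rendering work as a callable and a queue 2-3 frames deep feeds the dedicated GL thread (the FrameQueue class and its depth() policy are illustrative, not a fixed recipe):

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>

// A "frame" is modelled here as a bundle of deferred rendering work.
using Frame = std::function<void()>;

class FrameQueue {
public:
    explicit FrameQueue(std::size_t max_depth) : max_depth_(max_depth) {}

    // Called by the game/update thread. Blocks only if the queue is full.
    void push(Frame f) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < max_depth_; });
        q_.push_back(std::move(f));
        not_empty_.notify_one();
    }

    // Called by the rendering thread; submits one buffered frame (GL calls + swap).
    void pop_and_render() {
        Frame f;
        {
            std::unique_lock<std::mutex> lk(m_);
            not_empty_.wait(lk, [&] { return !q_.empty(); });
            f = std::move(q_.front());
            q_.pop_front();
            not_full_.notify_one();
        }
        f();
    }

    // How many frames are currently buffered ahead of the renderer.
    std::size_t depth() {
        std::lock_guard<std::mutex> lk(m_);
        return q_.size();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<Frame> q_;
    std::size_t max_depth_;
};
```

The update thread would check depth() once per frame: with 2 or more frames buffered it spends the slack on late work; when the queue drains, it sticks to the bare minimum update.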

Related

Is there any way to maximize my application's timeslice on Windows?

I have a 3D application that needs to generate a new frame roughly every 6 ms or so. This frame rate needs to be constant in order to avoid stuttering. To make matters worse, the application has to perform several moderately heavy calculations (mostly preparing the 3D scene and copying data to the VRAM), so it consumes a fairly large amount of that ~6 ms doing its own stuff.
This has been a problem because Windows causes my application to stutter a bit when it tries to use the CPU for other things. Is there any way I could make Windows not "give away" timeslices to other processes? I'm not concerned about it negatively impacting background processes.
Windows will allow you to raise your application's priority. A process will normally only lose CPU time to other processes with the same or higher priority, so raising your priority can prevent CPU time from being "stolen".
Be aware, however, that if you go too far, you can render the system unstable, so if you're going to do this, you generally only want to raise the priority a little bit, so it is higher than other "normal" applications.
Also note that this won't make a huge difference. If you're running into a small problem once in a while, increasing the priority may take care of it. If it's a constant problem, chances are that a priority boost won't be sufficient to fix it.
If you decide to try this, see SetPriorityClass and SetThreadPriority.
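For example, a minimal sketch of a modest priority bump (one step above normal for both the process and the time-critical thread):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // One step above normal for the whole process; avoid REALTIME_PRIORITY_CLASS,
    // which can starve the rest of the system.
    if (!SetPriorityClass(GetCurrentProcess(), ABOVE_NORMAL_PRIORITY_CLASS))
        std::printf("SetPriorityClass failed: %lu\n", GetLastError());

    // Optionally raise just the time-critical thread one step as well.
    if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_ABOVE_NORMAL))
        std::printf("SetThreadPriority failed: %lu\n", GetLastError());

    // ... frame loop goes here ...
    return 0;
}
```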
It normally depends on the scheduling algorithm used by your OS. Windows XP, Vista, 7 and 8 use multilevel queue scheduling with round robin, so increasing the priority of your application, thread or process will do what you want. Which version of Windows are you currently using? I can help you accordingly once I know that.
You can raise your process priority, but I don’t think it will help much. Instead, you should optimize your code.
For a start, use the VS built-in profiler (Debug / Performance Profiler menu) to find out where your app spends most of its time, and optimize that.
Also, all modern CPUs are at least dual core (the last single-core Celeron is from 2013). Therefore, "it consumes a fairly large amount of that ~6 ms doing its own stuff" shouldn't be the case: your own stuff should be running on a separate thread, not on the same thread you use to render. See this article for an idea of how to achieve that. You probably don't need that level of complexity; just 2 threads + 2 tasks (compute and render) will probably be enough, but that article should give you some ideas on how to re-design your app. This approach will, however, add 1 extra frame of input latency (for one frame the background thread computes, and only on the next frame does the renderer thread show the result), but with your 165 Hz rendering you can probably live with that.
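A bare-bones sketch of that 2-thread/2-task pattern (the SceneData struct and the stage functions are placeholders for illustration): frame N is computed while frame N-1 is rendered.

```cpp
#include <functional>
#include <thread>

struct SceneData { /* matrices, vertex data staged for VRAM upload, ... */ };

// Placeholder stages, not your actual code.
void compute_frame(SceneData& out)     { /* simulation + scene preparation */ (void)out; }
void render_frame(const SceneData& in) { /* draw calls + present */ (void)in; }

int main()
{
    SceneData buffers[2] = {};
    compute_frame(buffers[0]);                        // prime the pipeline

    for (int frame = 1; frame < 1000; ++frame) {
        SceneData& front = buffers[(frame - 1) % 2];  // computed last frame
        SceneData& back  = buffers[frame % 2];        // being computed now

        // Render frame N-1 while computing frame N: this is the one extra
        // frame of input latency mentioned above.
        std::thread compute(compute_frame, std::ref(back));
        render_frame(front);
        compute.join();
    }
}
```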

Gflop vendor specification vs practical results

I have a question that I can't figure out.
I have an Nvidia GPU, a 750M, which according to its specification should deliver 722.7 GFlop/s (GPU specification), but when I run the test from the CUDA samples it gives me about 67.64 GFlop/s.
Why such a big difference?
Thanks.
The peak performance can only be achieved when every core is busy executing FMA on every cycle, which is impossible in a real task.
Apart from the fact that no operation other than FMA is counted as 2 operations, there is another issue.
For a single kernel launch, if you do some sampling in the Visual Profiler you will notice something called a stall. Each operation takes time to finish, and if an operation relies on the result of a previous one, it has to wait. This eventually creates "gaps" in which a core sits idle, waiting for the next operation to be ready to execute. Among these, device memory operations have HUGE latencies. If you don't do it right, your code will end up busy waiting for memory operations all the time.
Some tasks can be optimized well. If you test gemm in cuBLAS, it can reach over 80% of the peak FLOPS, on some devices even 90%. Other tasks simply cannot be optimized for FLOPS: for example, if you add one vector to another, the performance is always limited by the memory bandwidth, and you will never see high FLOPS.
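To put rough numbers on that (assuming the common 384-core GK107 variant of the GT 750M clocked near 0.94 GHz): the advertised peak is 384 cores × ~0.94 GHz × 2 FLOPs per FMA per cycle ≈ 722.7 GFlop/s, i.e. it already assumes every core retires one FMA on every cycle. Your measured 67.64 GFlop/s is therefore roughly 9% of peak, which is plausible for a memory-bound sample.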

Is there any usage for letting a process "warm up"?

I recently did some digging into memory and how to use it properly. Of course, I also stumbled upon prefetching and how I can make life easier for the CPU.
I ran some benchmarks to see the actual benefits of proper storage/access of data and instructions. These benchmarks showed not only the expected benefit of helping the CPU prefetch, but also that the process speeds up further at runtime: after about 100 program cycles, the CPU seems to have figured things out and optimized the cache accordingly. This saves me up to 200,000 ticks per cycle; the number drops from around 750,000 to 550,000. I got these numbers using QTestLib.
Now to the question: is there a safe way to use this runtime speedup, letting the process warm up, so to speak? Or should one not count on this at all and just build faster code from the start?
First of all, there is generally no gain in trying to warm up a process prior to normal execution: that would only speed up the first 100 program cycles in your case, gaining a total of less than 20,000,000 ticks (100 cycles × 200,000 ticks saved). That's much less than the roughly 75,000,000 ticks (100 cycles × ~750,000 ticks) you would have to invest in the warming up.
Second, all these gains from warming up a process/cache/whatever are rather brittle. There are a number of events that destroy the warming effect, and you generally do not control them. Mostly these come from your process not being alone on the system. A process switch can behave pretty much like an asynchronous cache flush, and whenever the kernel needs a page of memory, it may drop a page from the disk cache.
Since these factors make computing time pretty unpredictable, they need to be controlled when running benchmarks that are supposed to produce reliable results. Apart from that, these effects are mostly ignored.
It is important to note that keeping the CPU busy isn't necessarily a bad thing. Ideally you want your CPU to run anywhere from 60% to 100% because that means that your computer is actually doing "work". Granted, if there is a process that you are unaware of and that process is taking up CPU cycles, that isn't good.
In answer to your question, the machine usually takes care of this.

Clueless on how to execute big tasks on C++ AMP

I have a task to see whether an algorithm I developed can run faster using GPU computing rather than the CPU. I'm new to computing on accelerators; I was given the book "C++ AMP", which I've read thoroughly, and I thought I understood it reasonably well (I coded in C and C++ in the past, but nowadays it's mostly C#).
However, when going into real application, I seem to just not get it. So please, help me if you can.
Let's say I have a task to compute some complicated function that takes a huge matrix input (like 50000 x 50000) and some other data and outputs a matrix of the same size. Total calculation for the whole matrix takes several hours.
On the CPU, I'd just cut the task into several pieces (something like 100 or so) and execute them using Parallel.For or just a simple task-managing loop I wrote myself. Basically, keep several threads running (number of threads = number of cores), start a new part whenever a thread finishes, until all parts are done. And it worked well!
However, on the GPU I cannot use the same approach, not only because of memory constraints (that's OK, I can partition into several parts) but because of the fact that if something runs for over 2 seconds it's considered a "timeout" and the GPU gets reset! So I must ensure that every part of my calculation takes less than 2 seconds to run.
But that limit isn't per task (partitioning an hour-long job into 60 tasks of 1 second each would be easy enough); it's per batch of tasks, because no matter which queuing mode I choose (immediate or automatic), if I run (via parallel_for_each) anything that takes more than 2 s in total to execute, the GPU gets reset.
Not only that, but if my CPU program hogs all the CPU resources, the UI stays interactive and the system responsive as long as it is kept at a lower priority; however, when executing code on the GPU, it seems the screen is frozen until execution is finished!
So, what do I do? In the demonstrations in the book (the N-body problem), it is supposed to be something like 100x as effective (multicore calculations give 2 GFLOPS, or whatever amount of FLOPS it was, while AMP gives 200 GFLOPS), but in a real application I just don't see how to do it!
Do I have to partition my big task into billions of pieces, say pieces that each take 10 ms to execute, and run 100 of them per parallel_for_each call at a time?
Or am I just doing it wrong, and there is a better solution I just don't get?
Help please!
TDRs (the 2 s timeouts you see) are a reality of using a resource that is shared between rendering the display and executing your compute work. The OS protects your application from completely locking up the display by enforcing a timeout. This will also impact applications which try to render to the screen. Moving your AMP code to a separate CPU thread will not help; this will free up your UI thread on the CPU, but rendering will still be blocked on the GPU.
You can actually see this behavior in the n-body example when you set N to be very large on a low power system. The maximum value of N is actually limited in the application to prevent you running into these types of issues in typical scenarios.
You are actually on the right track. You do indeed need to break your work up into chunks that fit into sub-2 s windows (or smaller ones if you want to hit a particular frame rate). You should also consider how your work is being queued. Remember that all AMP work is queued, and in automatic mode you have no control over when it runs. Using immediate mode is the way to get better control over how commands are batched.
Note: TDRs are not an issue on dedicated compute GPU hardware (like Tesla) and Windows 8 offers more flexibility when dealing with TDR timeout limits if the underlying GPU supports it.
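As one illustration of that chunking (a sketch only; the band size, the per-element math and the helper name are made up for the example), you could launch one band of rows at a time and synchronize between launches, so work never piles up into a single long submission:

```cpp
#include <amp.h>
#include <cstddef>
#include <vector>
using namespace concurrency;

// Process a big rows x cols matrix in horizontal bands so that each
// parallel_for_each launch finishes well inside the ~2 s TDR window.
// 'band_rows' must be tuned empirically for the target GPU.
void process_in_bands(std::vector<float>& data, int rows, int cols, int band_rows)
{
    for (int r0 = 0; r0 < rows; r0 += band_rows)
    {
        int r_count = (rows - r0 < band_rows) ? (rows - r0) : band_rows;

        // A view over just this band of rows; only this slice is copied to
        // the accelerator and touched by the kernel.
        array_view<float, 2> band(r_count, cols,
                                  data.data() + std::size_t(r0) * cols);

        parallel_for_each(band.extent, [=](index<2> idx) restrict(amp)
        {
            // Placeholder for the real per-element computation.
            band[idx] = band[idx] * 2.0f + 1.0f;
        });

        // Force this band to finish (and copy back) before the next launch,
        // so a long run of work cannot accumulate into one TDR-triggering submission.
        band.synchronize();
    }
}
```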

CPU throttling in C++

I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time consuming loop in the thread (it does only compression) and use GetTickCount() and Sleep() with hardcoded values. It makes sure that the loop continues for a certain period and then sleeps for a certain minimum time. It more or less does the job, i.e. guarantees that the thread will not use more than 50% of CPU. However, behavior is dependent on the number of CPU cores (huge disadvantage) and simply ugly (smaller disadvantage :)). Any ideas?
I am not aware of any API to get the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time; if it is lower, decrease the sleep time.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is then the percentage of total CPU time your thread is using.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running at fairly close to a determined percent of CPU.
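A rough sketch of how those steps could fit together (the initial sleep and the ±1 ms adjustment are the "magic numbers" mentioned above; this version measures usage against a single CPU):

```cpp
#include <windows.h>

// Throttle a busy loop to roughly 'target' of one CPU (e.g. 0.5 for 50%).
class Throttle {
public:
    explicit Throttle(double target) : target_(target) { Sample(last_cpu_, last_wall_); }

    // Call once per iteration (or every nth iteration) of the hot loop.
    void MaybeSleep() {
        ULONGLONG cpu, wall;
        Sample(cpu, wall);
        double dCPU   = double(cpu  - last_cpu_);    // thread CPU time, 100 ns units
        double dClock = double(wall - last_wall_);   // wall-clock time, 100 ns units
        last_cpu_ = cpu;
        last_wall_ = wall;
        if (dClock <= 0.0) return;

        // dCPU / dClock = fraction of one CPU used since the last call.
        if (dCPU / dClock > target_) { if (sleep_ms_ < 100) ++sleep_ms_; }
        else                         { if (sleep_ms_ > 0)   --sleep_ms_; }

        if (sleep_ms_ > 0) Sleep(sleep_ms_);
    }

private:
    static void Sample(ULONGLONG& cpu, ULONGLONG& wall) {
        FILETIME creation, exit_time, kernel, user, now;
        GetThreadTimes(GetCurrentThread(), &creation, &exit_time, &kernel, &user);
        GetSystemTimeAsFileTime(&now);
        cpu  = ToU64(kernel) + ToU64(user);
        wall = ToU64(now);
    }
    static ULONGLONG ToU64(const FILETIME& ft) {
        ULARGE_INTEGER u;
        u.LowPart  = ft.dwLowDateTime;
        u.HighPart = ft.dwHighDateTime;
        return u.QuadPart;
    }

    double    target_;
    DWORD     sleep_ms_  = 10;   // initial guess; adjusted up/down at runtime
    ULONGLONG last_cpu_  = 0;
    ULONGLONG last_wall_ = 0;
};
```

In the compression loop you would keep one Throttle instance (e.g. Throttle throttle(0.5);) and call throttle.MaybeSleep() every iteration, or every nth iteration.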
On Linux, you can change the scheduling priority of a thread with nice().
I can't think of a cross-platform way to do what you want (or any guaranteed way, full stop), but since you are using GetTickCount, perhaps you aren't interested in cross-platform :)
I'd use interprocess communication and set the intensive process's nice level to get what you require, but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard which is why I think a process rather than a thread might be more appropriate but it just might not suit your purposes.
The problem is that it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority and let the OS schedule it onto all the CPU time that isn't used by interactive tasks.
It sounds to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?
You may be able to change the priority of a thread, but changing the maximum utilization would either require polling and hacks to limit how many things are occurring, or using OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.