When code is waiting for some condition whose delay time is not deterministic, it seems many people choose to use exponential backoff, i.e. wait N seconds, check whether the condition is satisfied; if not, wait 2N seconds, check the condition again, and so on. What is the benefit of this over checking at a constant or linearly increasing interval?
Exponential back-off is useful in cases where simultaneous attempts to do something will interfere with each other such that none succeed. In such cases, having devices randomly attempt an operation in a window which is too small will result in most attempts failing and having to be retried. Only once the window has grown large enough will attempts have any significant likelihood of success.
If one knew in advance that 16 devices would be wanting to communicate, one could select the size of window that would be optimal for that level of loading. In practice, though, the number of competing devices is generally unknown. The advantage of an exponential back-off where the window size doubles on each retry is that regardless of the number of competing entities:
The window size where most operations succeed will generally be within a factor of two of the smallest window size where most operations would succeed,
Most of the operations which fail at that window size will succeed on the next attempt (since most of the earlier operations will have succeeded, that will leave less than half of them competing for a window which is twice as big), and
The total time required for all attempts will end up only being about twice what was required for the last one.
If, instead of doubling each time, the window were simply increased by a constant amount, then the time spent retrying an operation until the window reached a usable size would be proportional to the square of whatever window size was required. While the final window size might be smaller than would have been used with exponential back-off, the total cost of all the attempts would be much greater.
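For illustration, here is a minimal sketch of a capped exponential backoff with random jitter within each window; the operation, the initial 100 ms window, and the 30 s cap are placeholder assumptions rather than anything prescribed above:

#include <algorithm>
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Retry tryOperation() with exponential backoff; returns true on success.
bool retryWithBackoff(const std::function<bool()>& tryOperation, int maxAttempts = 8)
{
    std::mt19937 rng(std::random_device{}());
    auto window = std::chrono::milliseconds(100);             // initial window (assumed)
    const auto maxWindow = std::chrono::milliseconds(30000);  // cap on the window (assumed)

    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (tryOperation())
            return true;
        // Wait a random time within the current window so that competing
        // clients don't all retry at the same instant, then double the window.
        std::uniform_int_distribution<long long> jitter(0, window.count());
        std::this_thread::sleep_for(std::chrono::milliseconds(jitter(rng)));
        window = std::min(window * 2, maxWindow);
    }
    return false;
}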
This is the behavior of TCP congestion control. If the network is extremely congested, effectively no traffic gets through. If every node waits a constant time before checking, the traffic just for checking will continue to clog the network, and the congestion never resolves. Similarly, with a linearly increasing time between checks, it may take a long time before the congestion resolves.
Assuming you are referring to testing a condition before performing an action:
Exponential backoff is beneficial when the cost of testing the condition is comparable to the cost of performing the action (such as in network congestion).
If the cost of testing the condition is much smaller (or negligible), then a linear or constant wait can work better, provided the time it takes for the condition to change is negligible as well.
For example, if your condition is a complex (slow) query against a database, and the action is an update of the same database, then every check of the condition will negatively impact database performance, and at some point, without exponential backoff, condition checks by multiple actors could be enough to consume all the database's resources.
But if the condition is just a lightweight in-memory check (e.g. a critical section), and the action is still a database update (at best tens of thousands of times slower than the check), and if the condition is flipped in negligible time at the very start of the action (by entering the critical section), then a constant or linear backoff would be fine. In fact, under this particular scenario an exponential backoff would be detrimental: it would introduce delays in situations of low load, and it is more likely to result in time-outs in situations of high load (even when the processing bandwidth is sufficient).
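As a rough illustration of that lightweight-check case, here is a sketch that polls a cheap condition (a try_lock) at a constant interval before doing the expensive work; the 1 ms interval and updateDatabase() are placeholders, not anything from the scenario above:

#include <chrono>
#include <mutex>
#include <thread>

std::mutex sectionLock;   // stands in for the critical section
void updateDatabase();    // the expensive action, assumed to exist elsewhere

void runWhenSectionFree()
{
    // Constant backoff: the check is orders of magnitude cheaper than the
    // action, so re-checking every millisecond costs essentially nothing.
    while (!sectionLock.try_lock())
        std::this_thread::sleep_for(std::chrono::milliseconds(1));

    std::lock_guard<std::mutex> guard(sectionLock, std::adopt_lock);
    updateDatabase();
}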
So to summarize, exponential backoff is a hammer: it works great for nails, not so much for screws :)
I have a while loop that executes a program, with a sleep every so often. The while loop is meant to simulate a real-time program that executes at a certain frequency. The current logic calculates a number of cycles to execute per sleep to achieve a desired frequency. This has proven to be inaccurate. I think a timer would be a better implementation, but due to the complexity of the refactor I am trying to keep a while-loop solution. I am looking for advice on a scheme that might more tightly achieve a desired frequency of execution in a while loop. Pseudo-code below:
MaxCounts = DELAY_TIME_SEC / DESIRED_FREQUENCY;
counts = 0;
while (running)
{
    DoProgram();
    counts++;
    if (counts > MaxCounts)
    {
        Sleep(DELAY_TIME_SEC);
        counts = 0;
    }
}
You cannot reliably schedule an operation to occur at specific times on a non-realtime OS.
As C++ runs on non-realtime OSes, it cannot provide what cannot be provided.
The amount of error you are willing to accept, in both typical and extreme cases, will matter. If you want something running every minute or so, and you don't want drift on the level of days, you can just set up a starting time, then do math to determine when the nth event should happen.
Then wait until that nth time.
This fixes "cumulative drift" issues, so over 24 hours you get 1440+/-1 events with 1 minute between them. The time between the events will vary and not be 60 seconds exactly, but over the day it will work out.
If your issue is time on the ms level, and you are OK with a loaded system sometimes screwing up, you can sleep, aiming for a point half a second (or whatever margin makes it reliable enough for you) short of the next event. Then busy-wait until the time arrives. You may also have to tweak process/thread priority; be careful, as this can easily break things badly if you set the priority too high.
Combining the two can work as well.
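A sketch of combining the two, assuming C++ and std::chrono (DoProgram() and the 500 µs busy-wait margin are placeholders): compute the nth event time from a fixed start so there is no cumulative drift, sleep until shortly before it, then busy-wait the last stretch.

#include <chrono>
#include <thread>

void DoProgram();  // the work to run at the desired frequency, assumed to exist

void runAtFrequency(double hz)
{
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::duration_cast<clock::duration>(
        std::chrono::duration<double>(1.0 / hz));
    const auto margin = std::chrono::microseconds(500);   // busy-wait margin (assumed)

    const auto start = clock::now();
    for (long long n = 1; ; ++n) {
        DoProgram();
        const auto target = start + n * period;   // nth event time: no cumulative drift
        std::this_thread::sleep_until(target - margin);
        while (clock::now() < target)
            ;                                     // busy-wait the final stretch
    }
}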
I'm trying to do some calculations where it starts off with ~10-20 objects, but doing calculations on these objects creates 20-40 more, and so on and so forth; it's somewhat recursive, but not forever, and eventually the number of calculations will reach zero. I have considered using a different tool, but it's kind of too late for that for me. It's kind of an odd request, which is probably why no results came up.
In short, I'm wondering how it is possible to set the global work size to as many threads as are available. For example, if the GPU can have X different processes running in parallel, the global work size would be set to X.
Edit: it would also work if I could call more kernels from the GPU, but that doesn't look possible in version 1.2.
There is not really a limit on global work size (only above 2^32 threads do you have to use 64-bit ulong indexing to avoid integer overflow), and the hard limit of 2^64 threads is so large that you can never come close to it.
If you need a billion threads, then set the global work size to a billion threads. The GPU scheduler and hardware will handle that just fine, even if the GPU only has a few thousand physical cores. In fact, you should always launch many more threads than there are cores on the GPU; otherwise the hardware won't be fully saturated and you lose performance.
The only issue could be running out of GPU memory.
Launching kernels from within kernels is only possible in OpenCL 2.0-2.2, on AMD or Intel GPUs.
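For what it's worth, launching a huge number of work-items is just a matter of passing a large global size; in this sketch, queue and kernel are assumed to have been created elsewhere, and error checking is omitted:

#include <CL/cl.h>

void launchBig(cl_command_queue queue, cl_kernel kernel)
{
    size_t global = 1u << 30;                    // on the order of a billion work-items
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr,
                           &global, nullptr,     // let the driver choose the local size
                           0, nullptr, nullptr);
}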
It sounds like each iteration depends on the result of the previous one. In that case, your limiting factor is not the number of available threads. You cannot cause some work-items to "wait" for others submitted by the same kernel enqueueing API call (except to a limited extent within a work group).
If you have an OpenCL 2.0+ implementation at your disposal, you can queue subsequent iterations dynamically from within the kernel. If not, and you have established that your bottleneck is checking whether another iteration is required and the subsequent kernel submission, you could try the following:
Assuming a work-item can trivially determine how many threads are actually needed for an iteration based on the output of the previous iteration, you could speculatively enqueue multiple batches of the kernel, each of which depends on the completion event of the previous batch. Inside the kernel, you can exit early if the thread ID is greater than or equal to the number of threads required in that iteration.
This only works if you either have a hard upper bound or can make a reasonable guess that will yield sensible results (with acceptable perf characteristics if the guess is wrong) for:
The maximum number of iterations.
The number of work-items required on each iteration.
Submitting, say, UINT32_MAX work-items for each iteration will likely not make sense in terms of performance, as the work-items that fail the check for whether they are needed will dominate.
You can work around incorrect guesses for the latter number by surrounding the calculation with a loop, so that work item N will calculate both item N and M+N if the number of items on an iteration exceeds M, where M is the enqueued work size for that iteration.
Incorrect guesses for the number of iterations would need to be detected on the host, and more iterations enqueued.
So it becomes a case of performing a large number of runs with different guesses and gathering statistics on how good the guesses are and what overall performance they yielded.
I can't say whether this will yield acceptable performance in general - it really depends on the calculations you are performing and whether they are a good fit for GPU-style parallelism, and whether the overhead of the early-out for a potentially large number of work items becomes a problem.
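Here is a rough sketch of the speculative enqueueing described above: maxIters batches are enqueued up front, each gated on the previous batch's completion event, and surplus work-items exit early inside the kernel. The names (queue, kernel, guessedItems, maxIters) are placeholder assumptions, and error handling is omitted:

#include <CL/cl.h>
#include <vector>

void enqueueSpeculative(cl_command_queue queue, cl_kernel kernel,
                        size_t guessedItems, unsigned maxIters)
{
    std::vector<cl_event> done(maxIters);
    for (unsigned i = 0; i < maxIters; ++i) {
        size_t global = guessedItems;            // guessed upper bound per iteration
        const cl_event* wait = (i == 0) ? nullptr : &done[i - 1];
        cl_uint nwait = (i == 0) ? 0 : 1;
        // Batch i starts only after batch i-1 has completed.
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr,
                               nwait, wait, &done[i]);
    }
    clFinish(queue);
    for (cl_event e : done)
        clReleaseEvent(e);
}
// Inside the kernel (OpenCL C), surplus threads would bail out immediately, e.g.:
//   if (get_global_id(0) >= requiredCount) return;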
Usually profile data is gathered by randomly sampling the stack of the running program to see which function is executing; over a period of execution it is possible to be statistically confident about which methods/function calls eat the most time and need intervention in case of bottlenecks.
However, this concerns overall application/game performance. Sometimes there are singular, isolated hiccups in performance that cause usability trouble anyway (the user notices them, or they introduce lag into some internal mechanism, etc.). With regular profiling over a few seconds of execution it is not possible to tell which code is responsible. Even if a hiccup lasts long enough (say 30 ms, which is still not really enough) to catch some method that is called too often, we will still miss the execution of many other methods that are simply "skipped" by the random sampling.
So are there any techniques for profiling hiccups, in order to keep the frame rate more stable after fixing those kinds of "rare bottlenecks"? I'm assuming the use of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
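Since most environments don't expose a timer interrupt directly, here is a minimal sketch of the same idea using a watchdog thread: if DrawFrame() overruns the 40 ms threshold, the watchdog fires and you can break there in a debugger. DrawFrame() is assumed to exist elsewhere.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex frameMutex;
std::condition_variable frameCv;
std::atomic<bool> frameDone{false};

void DrawFrame();  // assumed to exist elsewhere

void DrawFrameWithWatchdog()
{
    frameDone = false;
    std::thread watchdog([] {
        std::unique_lock<std::mutex> lk(frameMutex);
        // Wake up after 40 ms unless the frame finished first.
        if (!frameCv.wait_for(lk, std::chrono::milliseconds(40),
                              [] { return frameDone.load(); }))
            std::puts("DrawFrame overran 40 ms");   // put a breakpoint here
    });

    DrawFrame();

    { std::lock_guard<std::mutex> lk(frameMutex); frameDone = true; }
    frameCv.notify_one();
    watchdog.join();
}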
As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to keep as far ahead of the game as possible on grunt work, so that every possible CPU cycle is available for "demanding frames" (like "many collisions during a single frame"), which can then be processed without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync.
The analysis above presumes that swapping back/front buffers is a fundamental requirement for assuring a constant frame rate. I've seen claims that this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
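For reference, the measurement described above is roughly the following (a sketch only; dpy and win stand for the existing Display* and GLXDrawable, and __rdtsc() is the compiler intrinsic):

#include <GL/glx.h>
#include <x86intrin.h>
#include <cstdio>

void swapAndMeasure(Display* dpy, GLXDrawable win)
{
    unsigned long long t0 = __rdtsc();
    glFinish();                    // wait for the driver/GPU to finish the frame
    unsigned long long t1 = __rdtsc();
    glXSwapBuffers(dpy, win);      // may return immediately rather than block until vsync
    unsigned long long t2 = __rdtsc();
    std::printf("glFinish: %llu cycles, glXSwapBuffers: %llu cycles\n",
                t1 - t0, t2 - t1);
}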
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of rendering, just call the swap-buffer function and let it block if needed. In fact, you should have one thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads, and the operating system will let those threads run while the rendering thread is waiting for vsync.
Furthermore, some people disable vsync because they want to see how many frames per second they can achieve. With your approach, though, it seems that disabling vsync would just leave the fps around 60 anyway.
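A rough sketch of that split, with renderFrame(), swapBuffers(), and updateGame() standing in for the engine's own functions (and the GL context assumed to be made current on the render thread):

#include <atomic>
#include <thread>

std::atomic<bool> running{true};

void renderFrame();   // issues all GL calls for the latest game state (assumed)
void swapBuffers();   // wraps glXSwapBuffers and may block until vsync (assumed)
void updateGame();    // physics / game logic (assumed)

void renderThread()   // the only thread that touches OpenGL
{
    while (running) {
        renderFrame();
        swapBuffers();            // let it block; the OS schedules other threads meanwhile
    }
}

void logicThread()
{
    while (running)
        updateGame();
}

int main()
{
    std::thread r(renderThread), l(logicThread);
    r.join();
    l.join();
}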
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given that T is the time you have at your disposal before a Vsync event happens, you want to produce your frame in 1xT (or something near 1).
However, even if you are able to code your tasks so that they exploit cache locality and achieve fully deterministic timing behaviour (you know in advance how much time each task requires and how much time you have at your disposal), and so you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver-dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function may take fewer or more cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetcher, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before it and on the prefetch algorithm!)
The operating system could interfere at any time by pausing your application to run some background process; that would increase the time of your "filling" tasks, effectively making you miss the Vsync event (even if your "predicted" time is reasonable, like 0.85xT)
At times you can still end up with a time of
1.3xT
while at the same time you did not use all the available CPU power (when you miss a Vsync event you have basically wasted your frame time, so it becomes wasted CPU power).
You can still work around this ;)
Buffer frames: you store rendering calls for up to 2-3 frames (no more! You are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that, you use the game loop to idle or to do late work.
With that approach it is reasonable to exceed 1xT, because you have some "buffer frames".
Let's see a simple example:
You scheduled tasks for 0.95xT, but since the program is running on a machine with a different CPU than the one you developed on, the different architecture means your frame takes 1.3xT.
No problem: you know there are some frames buffered, so you can still be happy, but now you have to schedule a 1xT - 0.3xT workload; better to use some safety margin as well, so you schedule tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame again took 1.3xT, and now you have exhausted your reserve of frames. You just do a simple update and submit GL calls; your program predicts 0.4xT.
Surprise: your program took only 0.3xT for the following frames, and even though you scheduled work for more than 2xT, you again have 3 frames queued in the rendering thread.
Since you have some buffered frames and also some late work, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power. Of course, if you measure that your queue has more than 2 frames buffered most of the time, you can just cut the pool down to 2 instead of 3, so that you save some latency.
Of course this assumes you do all the work synchronously (apart from deferring GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).
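A very rough sketch of the budgeting idea above; framesQueued(), doOptionalWork(), and the 10% safety margin are placeholder assumptions, not something prescribed by this answer:

#include <chrono>

int  framesQueued();    // how many frames are currently buffered (assumed)
bool doOptionalWork();  // does one chunk of deferrable work, false when none is left (assumed)

void fillFrameBudget(double T)   // T = seconds available per frame
{
    using clock = std::chrono::steady_clock;
    // With frames buffered we can risk using the whole budget; with an empty
    // buffer, stay well below it. Either way, keep a safety margin.
    const double budget = (framesQueued() > 0 ? 1.0 : 0.6) * T - 0.1 * T;

    const auto deadline = clock::now() + std::chrono::duration_cast<clock::duration>(
                              std::chrono::duration<double>(budget));
    while (clock::now() < deadline && doOptionalWork())
        ;   // keep doing deferred "grunt work" until the budget runs out
}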
I have a process that is blocked on a socket. When input becomes available on the socket, the process decodes it and most of the time does nothing but update an in-memory structure. Periodically the input is such that a more complex analysis is triggered, ultimately resulting in an outgoing message on another connection. I would like to minimize the latency in this latter case, i.e. minimize the time between receiving and sending. What I have noticed is that the latency numbers are 2x worse when the time between interesting events increases. What could this be attributed to, and how could I improve on it? I have tried reserving a CPU for my process, but I haven't seen much of an improvement.
You should try to "nice" the process to a negative value. I don't know the Linux scheduler in detail, but normal policy is to reduce the time slice (sometimes a quantum) when a process fails to use its slice up and vice versa. This is called multi-level feedback policy. In your case getting a bunch of quickly handled events probably gives the process a very short time slice. When a "significant" event occurs, it would have to work its way up to a longer slice through several context swaps. Setting the "nice" value high enough is likely to give it whatever time slice it needs.
Unfortunately "negative niceness" requires superuser privilege in most systems.