Previously, my game's main loop was capped at 60 FPS, with a corresponding Delay call providing the frame delay.
The sprite sequence was animated as follows:
<pre>
// advance to the next sprite roughly every 10 iterations of the 60 FPS loop
if(++ciclos > 10){
    siguienteSprite++;
    ciclos = 0;
}
</pre>
Since I am now using smooth motion with delta time, I have removed the Delay from the main loop. As a result the animation cycles through the sprites much faster, and on top of that the time between each step of the sequence varies.
Could someone give me a hand, just with the logic of this problem? Thanks in advance. :)
A delay in the main loop is not really a good way to do this (it does not account for the time the other stuff in your main loop takes). When you removed the delay, the animation got faster and varies more because the timing of the other stuff in your main loop is now more significant, and it is usually not constant for many reasons, such as:
OS granularity
synchronization with gfx card/driver
non constant processing times
There are several ways to handle this:
Measure time
<pre>
t1 = get_actual_time();            // current time
while (t1 - t0 >= animation_T)     // enough time elapsed for one (or more) steps?
{
    siguienteSprite++;
    t0 += animation_T;
}
// t0 = t1; // optional: changes the timing properties a bit
</pre>
where t0 is some global variable holding the "last" measured time of a sprite change, t1 is the actual time, and animation_T is the time constant between animation changes. To measure time you need to use an OS API like PerformanceCounter on Windows, or RDTSC in asm, or any other you have at hand, as long as its resolution is small enough.
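For example, on Windows a minimal get_actual_time() returning milliseconds could be built on the performance counter (a sketch only; the millisecond unit is an assumption, pick whatever unit animation_T uses):
<pre>
#include <windows.h>

// Sketch: milliseconds elapsed since the first call, based on
// QueryPerformanceCounter / QueryPerformanceFrequency.
double get_actual_time()
{
    static LARGE_INTEGER freq  = {};   // counter ticks per second
    static LARGE_INTEGER start = {};
    if (freq.QuadPart == 0)
    {
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);
    }
    LARGE_INTEGER now;
    QueryPerformanceCounter(&now);
    return double(now.QuadPart - start.QuadPart) * 1000.0 / double(freq.QuadPart);
}
</pre>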
OS timer
Simply increment siguienteSprite in some OS timer callback firing at an animation_T interval. This is simple, but OS timers are not precise: their resolution is usually around 1 ms or more, plus OS granularity (similar to Sleep accuracy).
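On Windows that could be the Win32 SetTimer API, roughly as below (a sketch only; it assumes a window with a running message loop, and animation_T_ms is my name for the interval in milliseconds):
<pre>
#include <windows.h>

volatile int siguienteSprite = 0;
const UINT animation_T_ms = 100;            // assumed interval between sprite changes

VOID CALLBACK OnAnimTimer(HWND, UINT, UINT_PTR, DWORD)
{
    siguienteSprite++;                      // advance the animation on every tick
}

// during initialization, with a valid window handle and message loop:
// SetTimer(hwnd, 1, animation_T_ms, OnAnimTimer);
</pre>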
Thread timer
You can create a single thread just for timing purposes, for example something like this:
<pre>
for (; !threads_stop; )
{
    Delay(animation_T); // or Sleep()
    siguienteSprite++;
}
</pre>
Do not forget that siguienteSprite must be volatile and buffered during rendering to avoid flickering and/or access violation errors. This approach is a bit more precise (unless you have a single-core CPU).
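A minimal sketch of such a timing thread and the buffered read during rendering (threads_stop, animation_T and siguienteSprite follow the answer above; the rendering side is a placeholder):
<pre>
#include <windows.h>

volatile bool threads_stop    = false;
volatile int  siguienteSprite = 0;      // written only by the timing thread
const DWORD   animation_T     = 100;    // ms between sprite changes (assumed value)

DWORD WINAPI animation_thread(LPVOID)   // thread used only for timing
{
    for (; !threads_stop; )
    {
        Sleep(animation_T);             // or your own Delay()
        siguienteSprite++;
    }
    return 0;
}

void render_frame()
{
    int sprite = siguienteSprite;       // buffer the value once per frame
    // ... draw using 'sprite' only, so it cannot change mid-frame ...
}
</pre>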
You can also increment some time variable instead and use that as the actual time in your app, with any resolution you want. But beware: if the delay does not return CPU control to the OS, this approach will utilize your CPU at 100%/CPU_cores. The remedy for this is to replace your delay with this:
<pre>
Sleep(0.9*animation_T);        // sleep most of the interval, giving the CPU back to the OS
for (;;)                       // busy-wait the remaining ~10% for precision
{
    t1=get_actual_time();
    if (t1-t0>=animation_T)
    {
        siguienteSprite++;
        t0=t1;
        break;
    }
}
</pre>
If you are using measured time, then you should handle overflows (t1 < t0), because any counter will overflow after some time. For example, using the 32-bit part of RDTSC on a 3.2 GHz CPU core will overflow every 2^32/3.2e9 = 1.342 s, so it is a real possibility. If my memory serves well, the performance counters in Windows usually run at around 3.5 MHz on older OS versions and around 60-120 MHz on newer ones (at least the last time I checked), and they are 64-bit, so overflows are not that big of a problem (unless you run 24/7). Also, in case you use RDTSC, you should set the process/thread affinity to a single CPU core to avoid timing problems on multi-core CPUs.
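If you are stuck with a 32-bit counter, doing the subtraction in an unsigned type of the same width handles a single wrap-around automatically; a sketch (read_counter32() stands for whatever 32-bit time source you use, animation_T_ticks is the interval converted to counter ticks):
<pre>
#include <stdint.h>

extern volatile int siguienteSprite;
uint32_t read_counter32(void);          // placeholder for the 32-bit time source
uint32_t t0 = 0;                        // last sprite-change timestamp, in ticks

void update_animation(uint32_t animation_T_ticks)
{
    uint32_t t1 = read_counter32();
    // unsigned subtraction wraps modulo 2^32, so t1 - t0 is still the
    // correct elapsed tick count across a single counter overflow
    while (t1 - t0 >= animation_T_ticks)
    {
        siguienteSprite++;
        t0 += animation_T_ticks;
    }
}
</pre>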
I did my share of benchmarking and advanced high-resolution timing at low level over the years, so here are a few related QAs of mine:
wrong clock cycle measurements with rdtsc - OS Granularity
Measuring Cache Latencies - measuring CPU frequency
Cache size estimation on your system? - PerformanceCounter example
Questions on Measuring Time Using the CPU Clock - PIT as alternative timing source
Related
Background: synchronization of an emulated microcontroller (Motorola MC68331) at 20 MHz, running in thread A, with an emulated DSP (Motorola 56300) at 120 MHz, running in thread B.
I need synchronization at audio rate, which means synchronizing 44100 times per second.
The current approach is to use a std::condition_variable, but the overhead of wait() is too high. At the moment I'm profiling on a Windows system; however, this has to work on Windows, Mac and possibly Linux/Android, too.
In particular, the issue is a jmp instruction inside SleepConditionVariableSRW, which is very costly.
I already tried some other options. Various variants of sleep are too imprecise and usually take way too long; the best one can get out of Windows is roughly one millisecond, but the maximum sleep time should not be more than 1/44100 seconds, i.e. about 22 µs.
The closest one can get on Windows is to use CreateWaitableTimerEx with the high-precision flag, but even in that case the overhead of these functions is even higher than that of the std::condition_variable.
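For reference, the waitable-timer attempt looked roughly like this (a simplified sketch; the 220 x 100 ns due time is the ~22 µs budget from above, and CREATE_WAITABLE_TIMER_HIGH_RESOLUTION requires Windows 10 1803+ and a recent SDK):
<pre>
#include <windows.h>

// Sketch: wait ~22 us using a high-resolution waitable timer.
void wait_one_audio_tick()
{
    static HANDLE timer = CreateWaitableTimerExW(
        nullptr, nullptr, CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);

    LARGE_INTEGER due;
    due.QuadPart = -220;                  // negative => relative time, 100 ns units
    SetWaitableTimer(timer, &due, 0, nullptr, nullptr, FALSE);
    WaitForSingleObject(timer, INFINITE); // the overhead of this call is the problem
}
</pre>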
What I also tried: a spinloop with std::this_thread::yield, which results in more CPU usage in general.
Is there anything I am missing / could try? More than 50% of CPU time is wasted in the wait code, which I'd like to eliminate.
Thanks in advance!
As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to stay as far ahead of the game as possible on grunt work, so that every possible CPU cycle is available for "demanding frames" (like "many collisions during a single frame"), which can then be processed without failing to call glXSwapBuffers() in time to exchange back/front buffers before the latest possible moment for vsync.
The analysis above presumes that swapping back/front buffers is a fundamental requirement for assuring a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact that glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
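For reference, the measurement itself is essentially this (simplified sketch; dpy and drawable are placeholders for the engine's existing Display* and GLXDrawable):
<pre>
#include <GL/glx.h>
#include <x86intrin.h>                  // __rdtsc()

// Sketch: CPU clock cycles spent finishing the frame and swapping buffers.
unsigned long long measure_swap(Display *dpy, GLXDrawable drawable)
{
    unsigned long long before = __rdtsc();
    glFinish();                         // wait for queued GL work to complete
    glXSwapBuffers(dpy, drawable);
    unsigned long long after = __rdtsc();
    return after - before;
}
</pre>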
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of the rendering, just call the swap-buffer function and let it block if needed. Actually, you should have a thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads and the operating system will let those threads run while the rendering thread is waiting for vsync.
Furthermore, some people disable vsync because they would like to see how many frames per second they can achieve. But with your approach, it seems that disabling vsync would just leave the fps around 60 anyway.
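A bare-bones sketch of that thread split using std::thread (renderFrame, swapBuffers and updateLogic are placeholders for the engine's real work, not functions from the question):
<pre>
#include <atomic>
#include <thread>

std::atomic<bool> running{true};

// Placeholders for the engine's actual work:
void renderFrame() { /* issue all OpenGL calls for this frame */ }
void swapBuffers() { /* e.g. glXSwapBuffers(...); may block until vsync */ }
void updateLogic() { /* physics, game logic, ... */ }

void render_thread()                  // owns the GL context, does only GL calls
{
    while (running) { renderFrame(); swapBuffers(); }
}

void logic_thread()                   // keeps running while the render thread blocks
{
    while (running) updateLogic();
}

int main()
{
    std::thread r(render_thread), l(logic_thread);
    // ... run until it is time to quit, then:
    running = false;
    r.join();
    l.join();
}
</pre>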
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given that T is the time you have at your disposal before a vsync event happens, you want your frame to take 1xT seconds (or something close to 1).
However, even if you are able to code tasks so that they exploit cache locality and achieve fully deterministic timing behaviour (you know in advance how much time each task requires and how much time you have at your disposal), and so you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function may require fewer or more cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetching state, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before it and on the prefetch algorithm!)
The operating system could interfere at any time by pausing your application to run some background process; that would increase the time of your "filler" tasks, effectively making you miss the vsync event (even if your "predicted" time was reasonable, like 0.85xT).
So at times you can still get a time of
1.3xT
while at the same time you didn't use all the available CPU power (when you miss a vsync event you have basically wasted your frame time, so it becomes wasted CPU power).
You can still work around it ;)
Buffering frames: you queue rendering calls for up to 2 or 3 frames (no more! You are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that, you use the game loop to idle or to do late work.
With that approach it is reasonable to exceed 1xT, because you have some "buffer frames".
Let's see a simple example:
You scheduled tasks for 0.95xT, but since the program is running on a machine with a different CPU than the one you developed on, the different architecture makes your frame take 1.3xT.
No problem: you know there are some buffered frames behind, so you can still be happy, but now you have to launch a 1xT - 0.3xT workload; better to add some safety margin too, so you launch tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame again took 1.3xT, and now you have exhausted your reserve of frames. You just do a simple update and submit the GL calls; your program predicts 0.4xT.
Surprise: your program took 0.3xT for the following frames even though you scheduled work for more than 2xT, and you again have 3 frames queued in the rendering thread.
Since you have some buffered frames and also have late work, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power. Of course, if you measure that most of the time your queue has more than 2 frames buffered, you can just cut the pool down to 2 instead of 3 so that you save some latency.
Of course this assumes you do all the work synchronously (apart from deferring the GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance (if required).
I use pthread lib 2.8 and the OS kernel is Linux 2.6.37 on ARM. In my program, thread A uses the pthread interfaces to set its scheduling priority to the halfway point between sched_get_priority_min(policy) and sched_get_priority_max(policy).
In the thread function loop:
<pre>
{
    // do my work
    pthread_cond_timedwait(..., ..., 15 ms)
}
</pre>
I find this thread consumes about 3% CPU. If I change the timeout to 30 ms, it drops to 1.3%. However, I cannot increase the timeout. Is there a way to reduce the CPU consumption while keeping the 15 ms timeout? It seems the cost is due to thread switching.
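For completeness, the loop spelled out with an absolute deadline looks roughly like this (a sketch; the 15 ms timeout is the one from above, the mutex/condition names are placeholders):
<pre>
#include <pthread.h>
#include <time.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (;;)
    {
        /* do my work */

        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);  /* default clock for the cond */
        deadline.tv_nsec += 15 * 1000000L;         /* + 15 ms */
        if (deadline.tv_nsec >= 1000000000L) {     /* normalize the timespec */
            deadline.tv_nsec -= 1000000000L;
            deadline.tv_sec  += 1;
        }

        pthread_mutex_lock(&lock);
        pthread_cond_timedwait(&cond, &lock, &deadline);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}
</pre>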
Using this sort of construct will cause approximately 67 task switches per second, and most likely each one switches to a different process, which means a complete context switch including page tables. It's been a while since I looked at what that involves on ARM, but I'm sure it's not a "lightweight" operation. If we count backwards, 1.75% of the CPU works out to about 210k clock cycles per task switch, which seems quite a lot. But I'm not sure how much work is involved in scrubbing TLBs, caches and such.
I'm programming on a Keil board and am trying to count the number of clock periods taken to execute a code block inside a C function.
Is there a way to get the time with microsecond precision before and after the code block, so that I can take the difference and multiply it by the number of clock periods per microsecond to compute the clock periods consumed by the block?
The clock() function in time.h gives the time in seconds, which would give a diff of 0, since the code block I'm trying to get the clock periods for is small.
If this is not a good way to solve this problem are there alternatives?
Read up on the timers in the chip, find one that the operating system/environment you are using has not consumed, and use it directly. This takes some practice; you need to use volatiles so the compiler does not re-arrange your code or fail to re-read the timer. And you need to adjust the prescaler on the timer so that it gets the most practical resolution without rolling over. So start with a big prescale divisor, convince yourself it is not rolling over, then make that prescale divisor smaller until you reach a divide-by-one or reach the desired accuracy. If divide-by-one doesn't give you enough, then you have to call the function many times in a loop and time around that loop. Remember that any time you change your code to add these measurements you can and will change the performance of your code, sometimes by too little to notice, sometimes by 10% - 20% or more. If you are using a cache, then any line of code you add or remove can change the performance by double-digit percentages, and you have to understand more about timing your code at that point.
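A minimal sketch of reading such a free-running timer around the code block (the register address is hypothetical and chip-specific; the prescaler setup is assumed to have been done elsewhere as described above):
<pre>
#include <stdint.h>

/* Hypothetical address of a free-running 32-bit timer count register. */
#define TIMER_COUNT_REG  (*(volatile uint32_t *)0x40000024u)

uint32_t time_code_block(void)
{
    uint32_t start = TIMER_COUNT_REG;   /* volatile read: not cached/reordered away */

    /* ... code block under test ... */

    uint32_t end = TIMER_COUNT_REG;
    return end - start;                 /* timer ticks; scale by the prescaler to
                                           get CPU clock periods */
}
</pre>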
The best way to count the number of clock cycles in the embedded world is to use an oscilloscope. Toggle a GPIO pin before and after your code block and measure the time with the oscilloscope. The measured time multiplied by the CPU frequency is the number of CPU clock cycles spent.
You have omitted to say what processor is on the board (far more important than the brand of board!). If the processor includes ETM and you have a ULINK-Pro or another trace-capable debugger, then uVision can unintrusively profile the executing code directly at the instruction-cycle level.
Similarly if you run the code in the uVision simulator rather than real hardware, you can get cycle accurate profiling and timing, without the need for hardware trace support.
Even without the trace capability, uVision's "stopwatch" feature can perform timing between two break-points directly. The stopwatch is at the bottom of the IDE in the status bar. You do need to set the clock frequency in the debugger trace configuration to get "real-time" from the stop-watch.
A simple approach that requires no special debug or simulator capability is to use an available timer peripheral (or, in the case of Cortex-M devices, the SysTick timer) to timestamp the start and end of execution of a code section; if you have no available timing resource, you could toggle a GPIO pin and monitor it on an oscilloscope. These methods have some level of software overhead that is not present in hardware or simulator trace, which may make them unsuitable for very short code sections.
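On a Cortex-M part, for example, the SysTick down-counter can provide that timestamp; a sketch assuming CMSIS headers for the specific device (the device header name is a placeholder):
<pre>
#include <stdint.h>
#include "device.h"     /* placeholder for the part's CMSIS device header */

uint32_t cycles_for_block(void)
{
    /* Configure SysTick as a free-running 24-bit down-counter on the core clock. */
    SysTick->LOAD = 0x00FFFFFF;                 /* maximum 24-bit reload value */
    SysTick->VAL  = 0;                          /* clear the current count */
    SysTick->CTRL = SysTick_CTRL_CLKSOURCE_Msk  /* clock source = core clock */
                  | SysTick_CTRL_ENABLE_Msk;    /* start counting */

    uint32_t start = SysTick->VAL;

    /* ... code block under test (must finish within 2^24 cycles) ... */

    uint32_t end = SysTick->VAL;
    return (start - end) & 0x00FFFFFF;          /* down-counter, so start - end */
}
</pre>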
Hi, I am using QueryPerformanceFrequency to get the number of cycles, i.e. the processor speed.
But it is showing me the wrong value. The specification says the processor is about 400 MHz, but what we are getting through code is around 16 MHz.
Please provide any pointers:
The code for the WinCE device is:
<pre>
LARGE_INTEGER FrequencyCounter;
QueryPerformanceFrequency(&FrequencyCounter);
CString temp;
temp.Format(L"%lld", FrequencyCounter.QuadPart);
AfxMessageBox(temp);
</pre>
Thanks,
Mukesh
QueryPerformanceFrequency returns the frequency of the counter peripheral, not of the processor. These peripherals typically run at the original crystal clock frequency. 16 MHz should be good enough resolution for you to measure fine-grained intervals.
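For example, measuring an interval with that counter (rather than trying to read the CPU speed from it) might look like this sketch:
<pre>
#include <windows.h>

double elapsed_microseconds()
{
    LARGE_INTEGER freq, start, stop;
    QueryPerformanceFrequency(&freq);   // counter ticks per second (e.g. ~16 MHz)

    QueryPerformanceCounter(&start);
    // ... code to be timed ...
    QueryPerformanceCounter(&stop);

    return double(stop.QuadPart - start.QuadPart) * 1000000.0 / double(freq.QuadPart);
}
</pre>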
QPF doesn't return the CPU clock speed. It returns the frequency of a high-performance timer. On a few systems, it might actually measure CPU cycles. On other systems, it might use a separate timer running at the same frequency (but which is unaffected by things like SpeedStep, which can change the clock speed of the CPU). Often it uses a separate timer entirely, one which may not even be on the CPU itself, but may be part of the motherboard.
QueryPerformanceCounter/QueryPerformanceFrequency only promise that they use the best timer available on the system. They make no promises about what that timer might be.