I'm programming on a Keil board and am trying to count the number of clock periods taken for execution by a code block inside a C function.
Is there a way to get the time with microsecond precision before and after the code block, so that I can take the diff and divide it by the clock period (in microseconds) to compute the clock periods consumed by the block?
The clock() function in time.h gives time in seconds, which makes the diff 0 since the code block I'm measuring is small.
If this is not a good way to solve this problem are there alternatives?
Read up on the timers in the chip, find one that the operating system/environment you are using has not consumed, and use it directly. This takes some practice: you need volatiles so the compiler doesn't re-arrange your code or skip re-reading the timer, and you need to adjust the prescaler on the timer so that it gives the most practical resolution without rolling over. Start with a big prescale divisor, convince yourself it is not rolling over, then shorten that divisor until you reach divide-by-one or reach the desired accuracy. If divide-by-one doesn't give you enough resolution, call the function many times in a loop and time around that loop. Remember that any time you change your code to add these measurements you can and will change the performance of your code, sometimes by too little to notice, sometimes by 10% - 20% or more. If you are using a cache then any line of code you add or remove can change the performance by double-digit percentages, and at that point you have to understand more about timing your code.
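As a rough illustration of reading a free timer directly (the register address and code_block_under_test() below are hypothetical placeholders; substitute whatever free-running timer your chip's reference manual describes):
<pre>
#define TIMER_COUNT (*(volatile unsigned int *)0x40001004u) /* hypothetical count register */

void code_block_under_test(void);          /* the code you want to measure */

unsigned int measure_block(void)
{
    volatile unsigned int start, stop;

    start = TIMER_COUNT;                   /* read the timer just before  */
    code_block_under_test();
    stop  = TIMER_COUNT;                   /* ...and again just after     */

    return stop - start;                   /* elapsed timer ticks; multiply by the
                                              prescale divisor to get CPU clocks */
}
</pre>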
The best way to count the number of clock cycles in the embedded world is to use an oscilloscope. Toggle a GPIO pin before and after your code block and measure the time with the oscilloscope. The measured time multiplied by the CPU frequency is the number of CPU clock cycles spent.
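A minimal sketch of the toggle, assuming hypothetical probe_pin_high()/probe_pin_low() helpers that wrap your vendor's GPIO driver:
<pre>
void probe_pin_high(void);           /* hypothetical: drive the probe pin high */
void probe_pin_low(void);            /* hypothetical: drive the probe pin low  */
void code_block_under_test(void);    /* the code you want to measure           */

void scope_measure(void)
{
    probe_pin_high();                /* rising edge marks the start  */
    code_block_under_test();
    probe_pin_low();                 /* falling edge marks the end   */
    /* cycles = pulse_width_in_seconds * cpu_frequency_in_hz */
}
</pre>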
You have omitted to say what processor is on the board (far more important than the brand of board!). If the processor includes ETM and you have a ULINK-Pro or other trace-capable debugger, then uVision can unintrusively profile the executing code directly at the instruction-cycle level.
Similarly if you run the code in the uVision simulator rather than real hardware, you can get cycle accurate profiling and timing, without the need for hardware trace support.
Even without the trace capability, uVision's "stopwatch" feature can perform timing between two breakpoints directly. The stopwatch is shown at the bottom of the IDE in the status bar. You do need to set the clock frequency in the debugger trace configuration to get "real time" from the stopwatch.
A simple approach that requires no special debug or simulator capability is to use an available timer peripheral (or, in the case of Cortex-M devices, the SysTick timer) to timestamp the start and end of execution of a code section; or, if you have no available timing resource, you could toggle a GPIO pin and monitor it on an oscilloscope. These methods have some level of software overhead that is not present in hardware or simulator trace, which may make them unsuitable for very short code sections.
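Alternatively, on Cortex-M3/M4 parts the DWT cycle counter can timestamp a section with only a couple of instructions of overhead. A minimal sketch, assuming a CMSIS device header is available, the part implements the DWT unit, and code_block_under_test() stands in for the code being measured (the header name is a placeholder):
<pre>
#include "stm32f4xx.h"   /* assumption: substitute your device's CMSIS header */

void code_block_under_test(void);                    /* the code to time */

static void cycle_counter_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
    DWT->CYCCNT = 0;                                 /* reset the counter      */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting cycles  */
}

uint32_t measure_block(void)
{
    uint32_t start = DWT->CYCCNT;
    code_block_under_test();
    return DWT->CYCCNT - start;                      /* wrap-safe 32-bit difference */
}
</pre>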
Related
Previously, in the main loop of my game, the timing was managed at 60 FPS with the corresponding Delay for the frame time.
The sprite sequence was animated as follows:
<pre>
if (++ciclos > 10) {
    siguienteSprite++;
    ciclos = 0;
}
</pre>
Since I am now using smooth motion with DeltaTime, I have eliminated the Delay from the main loop. Because of this, the animation's sprite cycles run faster, and on top of that the time between each step of the sequence varies.
Could someone give me a hand with just the logic of this problem? Thanks in advance. :)
A delay in the main loop is not really a good way to do this (it does not account for the time the other stuff in your main loop takes). When you removed the delay, the speed became higher and more variable, because the timing of the other stuff in your main loop is now more significant and usually not constant, for many reasons like:
OS granularity
synchronization with gfx card/driver
non constant processing times
There are several ways to handle this:
Measure time
<pre>
t1=get_actual_time();
while (t1-t0>=animation_T)
{
    siguienteSprite++;
    t0+=animation_T;
}
// t0=t1; // this is optional and changes the timing properties a bit
</pre>
where t0 is a global variable holding the "last" measured time of a sprite change, t1 is the actual time, and animation_T is the time constant between animation changes. To measure time you need to use an OS API like PerformanceCounter on Windows, or RDTSC in asm, or whatever else you have at hand with a small enough resolution.
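For example, on Windows get_actual_time() could be a thin wrapper over the performance counter; a sketch returning milliseconds as a double:
<pre>
#include <windows.h>

// Sketch of get_actual_time() using the Windows performance counter,
// returning milliseconds as a double.
double get_actual_time()
{
    static LARGE_INTEGER f;                          // ticks per second
    LARGE_INTEGER c;
    if (f.QuadPart == 0) QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&c);                     // current tick count
    return 1000.0 * (double)c.QuadPart / (double)f.QuadPart;
}
</pre>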
OS timer
simply increment siguienteSprite in some timer with an animation_T interval. This is simple, but OS timers are not precise, usually around 1 ms or more plus OS granularity (similar to Sleep accuracy).
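On Windows, for instance, this could be a SetTimer callback; a sketch (the timer only fires while the owning thread pumps messages, and the animation_T value here is just an example):
<pre>
#include <windows.h>

volatile int siguienteSprite = 0;    // shared with the render code
const UINT animation_T = 100;        // ms between sprite changes (example value)

// WM_TIMER callback that advances the sprite; it fires only while the owning
// thread pumps messages (GetMessage/DispatchMessage).
VOID CALLBACK OnAnimTimer(HWND hwnd, UINT msg, UINT_PTR id, DWORD tick)
{
    (void)hwnd; (void)msg; (void)id; (void)tick;
    siguienteSprite++;
}

void start_animation_timer()
{
    SetTimer(NULL, 0, animation_T, OnAnimTimer);
}
</pre>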
Thread timer
you can create a single thread just for timing purposes, for example something like this:
<pre>
for (;!threads_stop;)
{
    Delay(animation_T); // or Sleep()
    siguienteSprite++;
}
</pre>
Do not forget that siguienteSprite must be volatile and buffered during rendering to avoid flickering and/or access-violation errors. This approach is a bit more precise (unless you have a single-core CPU).
You can also increment some time variable instead and use that as the actual time in your app, with any resolution you want. But beware: if the delay does not return CPU control to the OS, this approach will utilize your CPU at 100%/CPU_cores. The remedy for this is to replace your delay with this:
<pre>
Sleep(0.9*animation_T);
for (;;)
{
    t1=get_actual_time();
    if (t1-t0>=animation_T)
    {
        siguienteSprite++;
        t0=t1;
        break;
    }
}
</pre>
If you are using measured time then you should handle overflows (t1 < t0), because any counter will overflow after some time. For example, using the 32-bit part of RDTSC on a 3.2 GHz CPU core will overflow every 2^32/3.2e9 = 1.342 s, so it is a real possibility. If my memory serves well, the performance counters on Windows usually run at around 3.5 MHz on older OS versions and around 60-120 MHz on newer ones (at least the last time I checked) and are 64-bit, so overflows are not that big of a problem (unless you run 24/7). Also, in the case of RDTSC you should set the process/thread affinity to a single CPU core to avoid timing problems on multi-core CPUs.
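A sketch of the RDTSC route with the affinity pinned, assuming MSVC's __rdtsc intrinsic on x86/x64; keeping the full 64-bit value sidesteps the 32-bit overflow discussed above:
<pre>
#include <windows.h>
#include <intrin.h>          // __rdtsc (MSVC intrinsic; assumes x86/x64)

// Pin the calling thread to one core (as noted above) and time a call with
// the full 64-bit time stamp counter.
unsigned __int64 measure_with_tsc(void (*work)())
{
    SetThreadAffinityMask(GetCurrentThread(), 1);   // run on core 0 only
    unsigned __int64 t0 = __rdtsc();
    work();                                         // the code being timed
    return __rdtsc() - t0;                          // elapsed TSC ticks
}
</pre>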
I did my share of benchmarking and advanced high-resolution timing at a low level over the years, so here are a few related QAs of mine:
wrong clock cycle measurements with rdtsc - OS Granularity
Measuring Cache Latencies - measuring CPU frequency
Cache size estimation on your system? - PerformanceCounter example
Questions on Measuring Time Using the CPU Clock - PIT as alternative timing source
As part of optimizing my 3D game/simulation engine, I'm trying to make the engine self-optimizing.
Essentially, my plan is this. First, get the engine to measure the number of CPU cycles per frame. Then measure how many CPU cycles the various subsystems consume (min, average, max).
Given this information, at just a few specific points in the frame loop, the engine could estimate how many "extra CPU cycles" it has available to perform "optional processing" that is efficient to perform now (the relevant data is in the cache now), but could otherwise be delayed until some subsequent frame if the current frame is in danger of running short of CPU cycles.
The idea is to stay as far ahead on grunt work as possible, so that "demanding frames" (like "many collisions during a single frame") can be processed without failing to call glXSwapBuffers() in time to exchange the back/front buffers before the latest possible moment for vsync.
The analysis above presumes that swapping the back/front buffers is a fundamental requirement for assuring a constant frame rate. I've seen claims this is not the only approach, but I don't understand the logic.
I captured 64-bit CPU clock cycle times just before and after glXSwapBuffers(), and found frames vary by about 2,000,000 clock cycles! This appears to be due to the fact glXSwapBuffers() doesn't block until vsync (when it can exchange buffers), but instead returns immediately.
Then I added glFinish() immediately before glXSwapBuffers(), which reduced the variation to about 100,000 CPU clock cycles... but then glFinish() blocked for anywhere from 100,000 to 900,000 CPU clock cycles (presumably depending on how much work the nvidia driver had to complete before it could swap buffers). With that kind of variation in how long glXSwapBuffers() may take to complete processing and swap buffers, I wonder whether any "smart approach" has any hope.
The bottom line is, I'm not sure how to achieve my goal, which seems rather straightforward, and does not seem to ask too much of the underlying subsystems (the OpenGL driver for instance). However, I'm still seeing about 1,600,000 cycles variation in "frame time", even with glFinish() immediately before glXSwapBuffers(). I can average the measured "CPU clock cycles per frame" rates and assume the average yields the actual frame rate, but with that much variation my computations might actually cause my engine to skip frames by falsely assuming it can depend on these values.
I will appreciate any insight into the specifics of the various GLX/OpenGL functions involved, or in general approaches that might work better in practice than what I am attempting.
PS: The CPU clock rate of my CPU does not vary when cores are slowed-down or sped-up. Therefore, that's not the source of my problem.
This is my advice: at the end of the rendering, just call the swap buffer function and let it block if needed. Actually, you should have a thread that performs all your OpenGL API calls, and only that. If there is other computation to perform (e.g. physics, game logic), use other threads and the operating system will let those threads run while the rendering thread is waiting for vsync.
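A rough sketch of that split; renderFrame/swapBuffers/updatePhysics/updateGameLogic are hypothetical placeholders, and the point is only that the render thread is the one that blocks on the swap:
<pre>
#include <atomic>

// Hypothetical hooks: only the render thread calls the GL/GLX wrappers.
void renderFrame();      // issues all GL calls for one frame
void swapBuffers();      // wraps glXSwapBuffers(); allowed to block on vsync
void updatePhysics();    // CPU-side work, no GL calls
void updateGameLogic();  // CPU-side work, no GL calls

std::atomic<bool> running{true};

void renderThreadMain()          // the only thread that touches OpenGL
{
    while (running) {
        renderFrame();
        swapBuffers();           // blocks until the swap completes
    }
}

void logicThreadMain()           // runs while the render thread waits on vsync
{
    while (running) {
        updatePhysics();
        updateGameLogic();
    }
}
</pre>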
Furthermore, if some people disable vsync, they would like to see how many frames per second they can achieve. But with your approach, it seems that disabling vsync would just leave the fps around 60 anyway.
I'll try to re-interpret your problem (so that if I missed something you could tell me and I can update the answer):
Given that T is the time you have at your disposal before a Vsync event happens, you want to produce your frame using 1xT seconds (or something close to 1).
However, even if you are able to code tasks so that they exploit cache locality to achieve fully deterministic timing behaviour (you know in advance how much time each task requires and how much time you have at your disposal), and so you can theoretically achieve times like:
0.96xT
0.84xT
0.99xT
You have to deal with some facts:
You don't know T (you tried to measure it and it seems to hiccup: that is driver dependent!)
Timings have errors
Different CPU architectures: you measure CPU cycles for a function, but on another CPU that function requires more or fewer cycles due to better/worse prefetching or pipelining.
Even when running on the same CPU, another task may pollute the prefetching algorithm, so the same function does not necessarily take the same number of CPU cycles (it depends on the functions called before and on the prefetch algorithm!)
The operating system could interfere at any time by pausing your application to run some background process; that would increase the time of your "filling" tasks, effectively making you miss the Vsync event (even if your "predicted" time is reasonable, like 0.85xT)
Sometimes you can still get a time of
1.3xT
while at the same time you didn't use all the possible CPU power (when you miss a Vsync event you have basically wasted your frame time, so it becomes wasted CPU power)
You can still work around this ;)
Buffering frames: you store rendering calls for up to 2-3 frames (no more! You are already adding some latency, and certain GPU drivers will do a similar thing to improve parallelism and reduce power consumption!); after that you use the game loop to idle or to do late work.
With that approach it is acceptable to exceed 1xT, because you have some "buffer frames".
Let's see a simple example:
You scheduled tasks for 0.95xT, but since the program is running on a machine with a different CPU than the one you used to develop it, your frame takes 1.3xT due to the different architecture.
No problem: you know there are some frames buffered behind, so you can still be happy, but now you have to launch a 1xT - 0.3xT task; better to also use some safety margin, so you launch tasks for 0.6xT instead of 0.7xT.
Oops, something really went wrong: the frame took 1.3xT again and now you have exhausted your reserve of frames, so you just do a simple update and submit the GL calls; your program predicts 0.4xT.
Surprise: your program took 0.3xT for the following frames even though you scheduled work for more than 2xT, so you again have 3 frames queued in the rendering thread.
Since you have some buffered frames and also have late work pending, you schedule an update for 1.5xT.
By introducing a little latency you can exploit the full CPU power. Of course, if you measure that most of the time your queue has more than 2 frames buffered, you can just cut the pool down to 2 instead of 3 so that you save some latency.
Of course this assumes you do all the work in a synchronous way (apart from deferring the GL calls). You can still use some extra threads where necessary (file loading or other heavy tasks) to improve performance if required.
As the title suggests I'm interested in obtaining CPU clock cycles used by a process in kernel mode only. I know there is an API called "QueryProcessCycleTime" which returns the CPU clock
cycles used by the threads of the process. But this value includes cycles spent in both user mode and kernel mode. How can I obtain cycles spent in kernel mode only? Do I need to get this using performance counters? If yes, which one should I use?
Thanks in advance for your answers.
I've just found an interesting article that describes almost what you ask for. It's on MSDN Internals.
They write there that if you were using C# or C++/CLI, you could easily get that information from an instance of the System.Diagnostics.Process class pointed at the right PID. But it would give you a TimeSpan from PrivilegedProcessorTime, so a "pretty time" instead of 'cycles'.
However, they also point out that all that .NET code is actually a thin wrapper around unmanaged APIs, so you should be able to easily get it from native C++ too. They were ILDASM'ing that class to show what it calls, but the image is missing. I've just done the same, and it uses GetProcessTimes from kernel32.dll.
So, again, MSDN'ing it - it returns LPFILETIME structures. So, the 'pretty time', not 'cycles', again.
The description of that method points out that if you want the clock cycles, you should use the QueryProcessCycleTime function. This does return the number of clock cycles... but with user and kernel mode counted together.
Now, summing up:
you can read userTIME
you can read kernelTIME
you can read (user+kernel)CYCLES
So you have almost everything needed. By some simple math:
<pre>
u_cycles = utime * allcycles / (utime + ktime)
k_cycles = ktime * allcycles / (utime + ktime)
</pre>
Of course this will be some approximation due to rounding etc.
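Putting that together, a sketch in C++ using GetProcessTimes and QueryProcessCycleTime (FILETIME values are in 100 ns units, and the result is the approximation discussed above):
<pre>
#include <windows.h>
#include <stdint.h>

// Convert a FILETIME (100 ns units) to a plain 64-bit tick count.
static uint64_t ToTicks(const FILETIME &ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

// Approximate kernel-mode cycles by splitting the total cycle count in
// proportion to kernel vs. user time, as in the formulas above.
bool KernelCycles(HANDLE process, uint64_t &kCycles)
{
    FILETIME createTime, exitTime, kernelTime, userTime;
    ULONG64  allCycles = 0;

    if (!GetProcessTimes(process, &createTime, &exitTime, &kernelTime, &userTime) ||
        !QueryProcessCycleTime(process, &allCycles))
        return false;

    uint64_t ktime = ToTicks(kernelTime);
    uint64_t utime = ToTicks(userTime);
    if (ktime + utime == 0)
        return false;

    kCycles = (uint64_t)((double)allCycles * ktime / (ktime + utime));
    return true;
}
</pre>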
Also, this has a gotcha: you have to invoke two functions (GetProcessTimes, QueryProcessCycleTime) to get all the information, so there will be a slight delay between their readings, and therefore your calculation will probably slip a little, since the target process is still running and burning time.
If you cannot allow for this (small?) noise in the measurement, I think you can circumvent it by temporarily suspending the process:
suspend the target
wait a little and ensure it is suspended
read first stats
read second stats
then resume the process and calculate values
I think this will ensure the two readings are consistent, but in turn, each such reading will impact the overall performance of the measured process - so things like "wall time" will no longer be measurable, unless you apply some correction for the time spent in suspension.
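Since there is no documented SuspendProcess call, one way to do the suspend/resume steps is to walk the target's threads with the Toolhelp API; a sketch with minimal error handling:
<pre>
#include <windows.h>
#include <tlhelp32.h>

// Suspend (or resume) every thread of a target process so the two readings
// discussed above see a frozen target.
void SetProcessSuspended(DWORD pid, bool suspend)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return;

    THREADENTRY32 te = { sizeof(te) };
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;
        HANDLE th = OpenThread(THREAD_SUSPEND_RESUME, FALSE, te.th32ThreadID);
        if (!th) continue;
        if (suspend) SuspendThread(th); else ResumeThread(th);
        CloseHandle(th);
    }
    CloseHandle(snap);
}
</pre>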
There may be a better way to get the separate clock cycles, but I have not found one, sorry. You could try looking inside QueryProcessCycleTime to see what source it reads the data from - maybe you are lucky and it reads A and B and returns A+B, and you could peek at what the sources are. I have not checked.
Take a look at GetProcessTimes. It'll give you the amount of kernel and user time your process has used.
How can I limit CPU usage to 10%, for example, for a specific process in Windows using C++?
You could use Sleep(x) - it will slow down your program's execution but it will free up CPU cycles,
where x is the time in milliseconds.
This is rarely needed, and maybe thread priorities are a better solution, but since you asked, what you should do is:
do a small fraction of your "solid" work i.e. calculations
measure how much time step 1) took, let's say it's twork milliseconds
Sleep() for (100/percent - 1)*twork milliseconds where percent is your desired load
go back to 1.
In order for this to work well, you have to be really careful in selecting how big a "fraction" of a calculation is, and some tasks are hard to split up. A single fraction should take somewhere between 40 and 250 milliseconds or so; if it takes less, the overhead from sleeping and measuring might become significant, and if it takes more, the illusion of using 10% CPU will disappear and it will seem like your thread is oscillating between 0 and 100% CPU (which is what happens anyway, but if you do it fast enough it looks like you only take whatever percentage you chose). Two additional things to note: first, as mentioned before me, this works at the thread level and not the process level; second, your work must be real CPU work - disk/device/network I/O usually involves a lot of waiting and doesn't take as much CPU. A sketch of this loop is shown below.
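A sketch of that loop on Windows; work_chunk() is a placeholder for one fraction of your real calculation, sized as discussed above:
<pre>
#include <windows.h>

// Placeholder for one "fraction" of the real calculation, sized so that it
// takes roughly 40-250 ms.
void work_chunk();

void run_throttled(int percent)                    // e.g. percent = 10
{
    for (;;) {
        DWORD start = GetTickCount();
        work_chunk();                              // step 1: do some work
        DWORD twork = GetTickCount() - start;      // step 2: measure it
        DWORD tsleep = twork * (100 - percent) / percent;
        Sleep(tsleep);                             // step 3: idle the rest
    }                                              // step 4: repeat
}
</pre>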
That is the job of the OS; you cannot control it.
You can't limit it to exactly 10%, but you can reduce its priority and restrict it to use only one CPU core.
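The priority and affinity part is straightforward; a sketch, assuming you already have a process handle with the required access rights:
<pre>
#include <windows.h>

// Drop a process to the idle priority class and confine it to one core.
// 'process' needs PROCESS_SET_INFORMATION access.
void Deprioritize(HANDLE process)
{
    SetPriorityClass(process, IDLE_PRIORITY_CLASS);   // lowest priority class
    SetProcessAffinityMask(process, 1);               // run on core 0 only
}
</pre>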
I was just wondering if there is an elegant way to set the maximum CPU load for a particular thread doing intensive calculations.
Right now I have located the most time consuming loop in the thread (it does only compression) and use GetTickCount() and Sleep() with hardcoded values. It makes sure that the loop continues for a certain period and then sleeps for a certain minimum time. It more or less does the job, i.e. guarantees that the thread will not use more than 50% of CPU. However, behavior is dependent on the number of CPU cores (huge disadvantage) and simply ugly (smaller disadvantage :)). Any ideas?
I am not aware of any API to get the OS's scheduler to do what you want (even if your thread is idle-priority, if there are no higher-priority ready threads, yours will run). However, I think you can improvise a fairly elegant throttling function based on what you are already doing. Essentially (I don't have a Windows dev machine handy):
Pick a default amount of time the thread will sleep each iteration. Then, on each iteration (or on every nth iteration, such that the throttling function doesn't itself become a significant CPU load),
Compute the amount of CPU time your thread used since the last time your throttling function was called (I'll call this dCPU). You can use the GetThreadTimes() API to get the amount of time your thread has been executing.
Compute the amount of real time elapsed since the last time your throttling function was called (I'll call this dClock).
dCPU / dClock is the percent CPU usage (of one CPU). If it is higher than you want, increase your sleep time, if lower, decrease the sleep time.
Have your thread sleep for the computed time.
Depending on how your watchdog computes CPU usage, you might want to use GetProcessAffinityMask() to find out how many CPUs the system has. dCPU / (dClock * CPUs) is the percentage of total CPU time available.
You will still have to pick some magic numbers for the initial sleep time and the increment/decrement amount, but I think this algorithm could be tuned to keep a thread running at fairly close to a determined percent of CPU.
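A sketch of such a throttling function, roughly as described above (the initial sleep time and the +/- 1 ms step are the magic numbers to tune; GetThreadTimes reports 100 ns units):
<pre>
#include <windows.h>
#include <stdint.h>

// Adaptive throttle: call once per iteration (or every nth iteration) from
// the worker thread itself. Not thread-safe: the statics assume a single
// throttled thread.
void Throttle(double targetPercent)                  // e.g. 50.0
{
    static DWORD    sleepMs   = 10;                  // initial magic number
    static uint64_t lastCpu   = 0;                   // in 100 ns units
    static DWORD    lastClock = GetTickCount();

    FILETIME createT, exitT, kernelT, userT;
    if (GetThreadTimes(GetCurrentThread(), &createT, &exitT, &kernelT, &userT)) {
        ULARGE_INTEGER k, u;
        k.LowPart = kernelT.dwLowDateTime; k.HighPart = kernelT.dwHighDateTime;
        u.LowPart = userT.dwLowDateTime;   u.HighPart = userT.dwHighDateTime;

        uint64_t cpu   = k.QuadPart + u.QuadPart;    // total CPU time so far
        DWORD    clock = GetTickCount();

        double dCPU   = (double)(cpu - lastCpu) / 10000.0;   // 100 ns -> ms
        double dClock = (double)(clock - lastClock);

        if (dClock > 0.0) {
            double usage = 100.0 * dCPU / dClock;    // percent of one CPU
            if (usage > targetPercent && sleepMs < 1000) sleepMs += 1;
            else if (usage < targetPercent && sleepMs > 0) sleepMs -= 1;
        }
        lastCpu   = cpu;
        lastClock = clock;
    }
    Sleep(sleepMs);                                  // the throttling sleep
}
</pre>
Calling Throttle(50.0) at the end of each work iteration should settle the thread near 50% of one CPU, within the limits of Sleep's granularity.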
On Linux, you can change the scheduling priority of a thread with nice().
I can't think of any cross-platform way to do what you want (or any guaranteed way, full stop), but since you are using GetTickCount perhaps you aren't interested in cross-platform :)
I'd use interprocess communication and set the intensive process's nice level to get what you require, but I'm not sure that's appropriate for your situation.
EDIT:
I agree with Bernard, which is why I think a process rather than a thread might be more appropriate, but it might not suit your purposes.
The problem is that it's not normal to want to leave the CPU idle while you have work to do. Normally you set a background task to IDLE priority and let the OS schedule it into all the CPU time that isn't used by interactive tasks.
It sounds to me like the problem is the watchdog process.
If your background task is CPU-bound then you want it to take all the unused CPU time for its task.
Maybe you should look at fixing the watchdog program?
You may be able to change the priority of a thread, but changing the maximum utilization would either require polling and hacks to limit how many things are occurring, or using OS tools that can set the maximum utilization of a process.
However, I don't see any circumstance where you would want to do this.