EDIT:
I would like to thank you all for the swift replies ^^ Sleep() works as intended and my CPU is not being viciously devoured by this program anymore! I will keep this question as is, but to let everybody know that the CPU problem has been answered expediently and professionally :D
As an aside to the aside, I'll certainly make sure that micro-optimizations are kept to a minimum in the face of larger, more important problems!
================================================================================
For some reason my program, a console alarm clock I made for laughs and practice, is extremely CPU intensive. It consumes about 2mB RAM, which is already quite a bit for such a small program, but it devastates my CPU with over 50% resources at times.
Most of the time my program is doing nothing except counting down the seconds, so I guess this part of my program is the one that's causing so much strain on my CPU, though I don't know why. If it is so, could you please recommend a way of making it less, or perhaps a library to use instead if the problem can't be easily solved?
/* The wait function waits exactly one second before returning to the *
* called function. */
void wait( const int &seconds )
{
clock_t endwait; // Type needed to compare with clock()
endwait = clock() + ( seconds * CLOCKS_PER_SEC );
while( clock() < endwait ) {} // Nothing need be done here.
}
In case anybody browses CPlusPlus.com, this is a genuine copy/paste of the clock() function they have written as an example for clock(). Much why the comment //Nothing need be done here is so lackluster. I'm not entirely sure what exactly clock() does yet.
The rest of the program calls two other functions that only activate every sixty seconds, otherwise returning to the caller and counting down another second, so I don't think that's too CPU intensive- though I wouldn't know, this is my first attempt at optimizing code.
The first function is a console clear using system("cls") which, I know, is really, really slow and not a good idea. I will be changing that post-haste, but, since it only activates every 60 seconds and there is a noticeable lag-spike, I know this isn't the problem most of the time.
The second function re-writes the content of the screen with the updated remaining time also only every sixty seconds.
I will edit in the function that calls wait, clearScreen and display if it's clear that this function is not the problem. I already tried to reference most variables so they are not copied, as well as avoid endl as I heard that it's a little slow compared to \n.
This:
while( clock() < endwait ) {}
Is not "doing nothing". Certainly nothing is being done inside the while loop, but the test of clock() < endwait is not free. In fact, it is being executed over and over again as fast as your system can possibly handle doing it, which is what is driving up your load (probably 50% because you have a dual core processor, and this is a single-threaded program that can only use one core).
The correct way to do this is just to trash this entire wait function, and instead just use:
sleep(seconds);
Which will actually stop your program from executing for the specified number of seconds, and not consume any processor time while doing so.
Depending on your platform, you will need to include either <unistd.h> (UNIX and Linux) or <windows.h> (Windows) to access this function.
This is called a busy-wait. The CPU is spinning its wheels at full throttle in the while loop. You should replace the while loop with a simple call to sleep or usleep.
I don't know about the 2 MB, especially without knowing anything about the overall program, but that's really not something to stress out over. It could be that the C runtime libraries suck up that much on start-up for efficiency reasons.
The CPU issue has been answered well. As for the memory issue, it's not clear what 2 MB is actually measuring. It might be the total size of all the libraries mapped into your application's address space.
Run and inspect a program that simply contains
int main() { for (;;) }
to gauge the baseline memory usage on your platform.
You're spinning without yielding here, so it's no surprise that you burn CPU cycles.
Drop a
Sleep(50);
in the while loop.
The while loop is keeping the processor busy whenever your thread gets a timeslice to execute. If all you wish is to wait for a determined amount of time, you don't need a loop. You can replace it by a single call to sleep, usleep or nanosleep (depending on platform and granularity). They suspend the thread execution until the amount of time you specified has elapsed.
Alternatively, you can just give up (yield) on the remaining timeslice, calling Sleep(0) (Windows) or sched_yield() (Unix/Linux/etc).
If you want to understand the exact reason for this problem, read about scheduling.
while( clock() < endwait ) { Sleep(0); } // yield to equal priority threads
Related
Usually profile data is gathered by randomly sampling the stack of the running program to see which function is in execution, over a running period it is possible to be statistically sure which methods/function calls eats most time and need intervention in case of bottlenecks.
However this has to do with overall application/game performance. Sometime happens that there are singular and isolated hiccups in performance that are causing usability troubles anyway (user notice it / introduced lag in some internal mechanism etc). With regular profiling over few seconds of execution is not possible to know which. Even if the hiccup lasts long enough (says 30 ms, which are not enough anyway), to detect some method that is called too often, we will still miss to see execution of many other methods that are just "skipped" because of the random sampling.
So are there any tecniques to profile hiccups in order to keep framerate more stable after fixing those kind of "rare bottlenecks"? I'm assuming usage of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
I understand that a preemptive multitasking OS can interrupt a process at any "code position".
Given the following code:
int main() {
while( true ) {
doSthImportant(); // needs to be executed at least each 20 msec
// start of critical section
int start_usec = getTime_usec();
doSthElse();
int timeDiff_usec = getTime_usec() - start_usec;
// end of critical section
evalUsedTime( timeDiff_usec );
sleep_msec( 10 );
}
}
I would expect this code to usually produce proper results for timeDiff_usec, especially in case that doSthElse() and getTime_usec() don't take much time so they get interrupted rarely by the OS scheduler.
But the program would get interrupted from time to time somewhere in the "critical section". The context switch will do what it is supposed to do, and still in such a case the program would produce wrong results for the timeDiff_usec.
This is the only example I have in mind right now but I'm sure there would be other scenarios where multitasking might get a program(mer) into trouble (as time is not the only state that might be changed at re-entry).
Is there a way to ensure that measuring the time for a certain action works fine?
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Edit:
I changed the sample code to make it more precise.
I want to check the time being spent to make sure that doSthElse() doesn't take like 50 msec or so, and if it does I would look for a better solution.
Is there a way to ensure that measuring the time for a certain action works fine?
That depends on your operating system and your privilege level. On some systems, for some privilege levels, you can set a process or thread to have a priority that prevents it from being preempted by anything at lower priority. For example, on Linux, you might use sched_setscheduler to give a thread real-time priority. (If you're really serious, you can also set the thread affinity and SMP affinities to prevent any interrupts from being handled on the CPU that's running your thread.)
Your system may also provide time tracking that accounts for time spent preempted. For example, POSIX defines the getrusage function, which returns a struct containing ru_utime (the amount of time spent in “user mode” by the process) and ru_stime (the amount of time spent in “kernel mode” by the process). These should sum to the total time the CPU spent on the process, excluding intervals during which the process was suspended. Note that if the kernel needs to, for example, spend time paging on behalf of your process, it's not defined how much (if any) of that time is charged to your process.
Anyway, the common way to measure time spent on some critical action is to time it (essentially the way your question presents) repeatedly, on an otherwise idle system, throw out outlier measurements, and take the mean (after eliminating outliers), or take the median or 95th percentile of the measurements, depending on why you need the measurement.
Which other common issues are critical with multitasking and need to be considered? (I'm not thinking of thread safety - but there might be common issues).
Too broad. There are whole books written about this subject.
I encountered a problem where my game-loop stuttered approximating once a second (variable intervals). A single frame then takes over 60ms whereas all others require less than 1ms.
After simplifying a lot I ended with the following program which reproduces the bug. It only measures the frame time and reports it.
#include <iostream>
#include "windows.h"
int main()
{
unsigned long long frequency, tic, toc;
QueryPerformanceFrequency((LARGE_INTEGER*)&frequency);
QueryPerformanceCounter((LARGE_INTEGER*)&tic);
double deltaTime = 0.0;
while( true )
{
//if(deltaTime > 0.01)
std::cerr << deltaTime << std::endl;
QueryPerformanceCounter((LARGE_INTEGER*)&toc);
deltaTime = (toc - tic) / double(frequency);
tic = toc;
if(deltaTime < 0.01) deltaTime = 0.01;
}
}
Again one frame in many is much slower than the others. Adding the if let the error vanish (cerr is never called then). My original problem didn't contain any cerr/cout. However, I consider this as a reproduction of the same error.
cerr is flushed in every iteration, so this is not what happens to create single slow frames. I know from a profiler (Very Sleepy) that the stream internally uses a lock/critical section, but this shouldn't change anything because the program is singlethreaded.
What causes single iterations to stall that much?
Edit: I did some more tests:
Adding std::this_thread::sleep_for( std::chrono::milliseconds(7) ); and therefore reducing the process CPU utilization does not change anything.
With printf("%f\n", deltaTime); the problem vanishes (maybe because it doesn't use a mutex and memory allocation in contrast to the stream)
The design of windows does not guarantee an upper limit on any execution time, since it dynamically allocates runtime resources to all programs using some logic - for example, the scheduler will allocate resources to a process with high priority, and starve out lower priority processes in some circumstances. Programs are statistically more likely to - eventually - be affected by such things if they run tight loops and consume a lot of CPU resources. Because - again eventually - the scheduler will temporarily boost the priority of programs that are being starved and/or reduce the priority of programs that are starving others (in your case, by running a tight loop).
Making the output to std::cerr conditional doesn't change the fact of this happening - it just changes the likelihood that it will happen in a specified time interval, because it changes how the program uses system resources in the loop, and therefore changes how it interacts with the system scheduler, policies, etc.
This sort of thing affects programs running in all non-realtime operating systems, although the precise impact depends on how each OS is implemented (e.g. scheduling strategies, other policies controlling access by programs to resources, etc). There is always a non-zero probability (even if it is small) of such stalls occurring.
If you want absolute guarantees of no stalls on such things, you will need a realtime operating system. These systems are designed to do things more predictably in a timing sense, but that comes with trade-offs, since it also requires your programs to be designed with the knowledge that they MUST complete execution of specified functions within specified time intervals. Realtime operating systems use different strategies, but their enforcing timing of constraints can cause the program to malfunction if the program is not designed with such things in mind.
I'm not sure about it, but it could be that the system is interrupting your main thread to let others run, and since it takes some time (I remember on my Windows XP pc the quantum was 10ms), it will stall a frame.
This is very visible because it is a single-threaded application, if you use several thread they are usually dispatched on several cores of the processor (if available), and the stalls will still be here but less important (if you implemented your application logic right).
Edit: here you can have more information about windows and linux schedulers. Basically, windows use quantums (varying from a handful of milliseconds to 120 ms on Windows Server).
Edit 2: you can see a more detailed explanation on the windows scheduler here.
Could you please tell me why the value of timediff printed by the following program is often 4 microseconds (in the range 90 to 1000 times for different runs), but sometimes 70 or more microseconds for a few cases (in the range of 2 to 10 times for different runs):
#include <iostream>
using namespace std;
#include<sys/time.h>
#define MAXQ 1000000
#define THRDS 3
double GetMicroSecond()
{
timeval tv;
gettimeofday (&tv, NULL);
return (double) (((double)tv.tv_sec * 1000000) + (double)tv.tv_usec);
}
int main()
{
double timew, timer, timediff;
bool flagarray[MAXQ];
int x=0, y=0;
for(int i=0; i<MAXQ; ++i)
flagarray[i] = false;
while(y <MAXQ)
{
x++;
if(x%1000 == 0)
{
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
timediff = timer - timew;
if(timediff > THRDS) cout << timer-timew << endl;
}
}
}
Compiled using: g++ testlatency.cpp -o testlatency
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
timew = GetMicroSecond();
flagarray[y++]=true;
timer = GetMicroSecond();
The statement flagarray[y++]=true; will take much less than a microsecond to execute on a modern computer if flagarray[y++] happens to be in the level 1 cache. The statement will take longer to execute if that location is in level 2 cache but not in level 1 cache, much longer if it is in level 3 cache but not in level 1 or level 2 cache, and much, much longer yet if it isn't in any of the caches.
Another thing that can make timer-timew exceed three milliseconds is when your program yields to the OS. Cache misses can result in a yield. So can system calls. The function gettimeofday is a system call. As a general rule, you should expect any system call to yield.
Note: In my system there are 12 cores. The performance is checked with only this program running in the system.
This is not true. There are always many other programs, and many, many other threads running on your 12 core computer. These include the operating system itself (which comprises many threads in and of itself), plus lots and lots of little daemons. Whenever your program yields, the OS can decide to temporarily suspend your program so that one of the myriad other threads that are temporarily suspended but are asking for use of the CPU.
One of those daemons is the Network Time Protocol daemon (ntpd). This does all kinds of funky little things to your system clock to keep it close to in sync with atomic clocks. With a tiny little instruction such as flagarray[y++]=true being the only thing between successive calls to gettimeofday, you might even see time occasionally go backwards.
When testing for timing, its a good idea to do the timing at a coarse level. Don't time an individual statement that doesn't involve any function calls. It's much better to time a loop than it is to time than it is to time individual executions of the loop body. Even then, you should expect some variability in timing because of cache misses and because the OS temporarily suspends execution of your program.
Modern Unix-based systems have better timers (e.g., clock_gettime) than gettimeofday that are not subject to changes made by the Network Time Protocol daemon. You should use one of these rather than gettimeofday.
Generally, there are many threads sharing a small number of cores. Unless you take steps to ensure that your thread has uninterrupted use of a core, you can't guarantee that the OS won't decide to preempt your thread between the two calls GetMicroSecond() calls, and let some other thread use the core for a bit.
Even if your code runs uninterrupted, the line you're trying to time:
flagarray[y++]=true;
likely takes much less time to execute than the measurement code itself.
There are many things happening inside of modern OS at the same time as Your program executes. Some of them may may "steal" CPU from Your program as it is stated in NPE's answer. A few more examples of what can influence timing:
interrups from devices (timer, HDD, network interfaces a few to mention);
access to RAM (caching)
None of these are easily predictable.
You can expect consistency if You run Your code on some microcontroller, or maybe using real time OS.
There are a lot of variables that might explain different time values seen. I would focus more on
Cache miss/fill
Scheduler Events
Interrupts
bool flagarray[MAXQ];
Since you defined MAXQ to 1000000, let's assume that flagarray takes up 1MB of space.
You can compute how many cache-misses can occur, based on your L1/L2 D-cache sizes. Then you can correlate with how many iterations it takes to fill all of L1 and start missing and same with L2. OS may deschedule your process and reschedule it - but, that I am hoping is less likely due to the number of cores you have. Same is the case with interrupts. An idle system is never completely idle. You may choose to affine your process to a core number, say N by doing
taskset 0x<MASK> ./exe and control its execution.
If you are really curious, I would suggest that you use "perf" tool available on most Linux distros.
You may do
perf stat -e L1-dcache-loadmisses
or
perf stat -e LLC-load-misses
Once you have these numbers and the number of iterations you start building a picture of the activity that causes the noticed lag. You may also monitor OS scheduler events using "perf stat".
We have a multi-threaded C++ application running on Solaris (5.10, sparc platform). As per "pstack" most of the threads seem to be waiting on the below call often for little too long. This corresponds to "time_t currentTime = time(NULL) ;" function in the application code to get the current time in seconds.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff76c63508, ffffffff76e3e000, 2000) + 8
The timezone is "Asia/Riyadh". I tried setting the TZ variable to both "Asia/Riyadh" as well as '<GMT+3>-3'. But there is no obvious improvement with either option. Changing the server code (even if there is an alternative) is rather difficult at this point. A test program (single thread, compiled without -O2) having 1 million "time(NULL)" invocations came out rather quickly. The application & test program are compiled using gcc 4.5.1.
Is there anything else that I can try out?
I agree that it is a rather broad question. I will try out the valid suggestions and close this as soon as there is adequate improvement to handle current load.
Edit 1 :
Please ignore the reference to time(NULL) above, as a possible cause for __time stack. I made the inference based on the signature, and finding the same invocation in the source method.
Following is another stack leading to __time.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff773e5cc0, ffffffff76e3e000, 2000) + 8
ffffffff76c9c7f0 getnow (ffffffff704fb180, ffffffff773c6384, 1a311c, 2, ffffffff76e4eae8, fffc00) + 4
ffffffff76c9af0c strptime_recurse (ffffffff76e4cea0, 1, 104980178, ffffffff704fb938, ffffffff704fb180, ffffffff704fb1a4) + 34
ffffffff76c9dce8 __strptime_std (ffffffff76e4cea0, 10458b2d8, 104980178, ffffffff704fb938, 2400, 1a38ec) + 2c
You (and we) are not going to be able to make time faster.
From your message, I gather that you are calling it from many
different threads at once. This may be a problem; it's quite
possible that Solaris serializes these calls, so you end up with
a lot of threads waiting for the others to complete.
How much accuracy do you need? A possible solution might be to
have one thread loop on reading the time, sleeping maybe 10 ms
between each read, and putting the results in a global variable,
which the other threads read. (Don't forget that you'll need to
synchronize all accesses to the variable, unless you have some
sort of atomic variables, like std::atomic<time_t> in C++11.)
Keep in mind that pstack doesn't just immediately interrupt your program and generate a stack. It has to grab debug-level control and if time calls are sufficiently frequent it may drastically over-indicate calls to time as it utilizes those syscalls to take control of your application to print the stack.
Most likely the time calls are not the source of your real performance problem. I suspect you'll want to utilize a profiler such as gprof (with g++ -p). Alternately you could utilize some of the dtrace kits and use the hotuser dtrace script which will do basic statistical profiling on your running application's user code.
time returns UTC time so any changes to TZ should have no effect on its call time whatsoever.
If, after profiling, it turns out that time really is the culprit you may be able to cache the value from the time call since it won't change more than once a second.