C++ time() function performance on Solaris

We have a multi-threaded C++ application running on Solaris (5.10, sparc platform). According to "pstack", most of the threads often seem to be waiting on the call below for a little too long. It corresponds to the "time_t currentTime = time(NULL);" call in the application code that gets the current time in seconds.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff76c63508, ffffffff76e3e000, 2000) + 8
The timezone is "Asia/Riyadh". I tried setting the TZ variable to both "Asia/Riyadh" and '<GMT+3>-3', but there is no obvious improvement with either option. Changing the server code (even if there were an alternative) is rather difficult at this point. A test program (single-threaded, compiled without -O2) making 1 million "time(NULL)" invocations completed rather quickly. Both the application and the test program are compiled with gcc 4.5.1.
Is there anything else that I can try out?
I agree that it is a rather broad question. I will try out the valid suggestions and close this as soon as there is adequate improvement to handle the current load.
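For reference, the single-threaded test was essentially of this shape (a sketch on my part; only the 1 million time(NULL) calls come from the description above):

#include <ctime>
#include <cstdio>

int main()
{
    volatile time_t last = 0;            // volatile so the loop is not optimized away
    for (int i = 0; i < 1000000; ++i)
        last = time(NULL);
    std::printf("%ld\n", (long)last);
    return 0;
}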
Edit 1 :
Please ignore the reference to time(NULL) above as the possible cause of the __time stack. I made that inference based on the signature and on finding the same invocation in the source method.
Following is another stack leading to __time.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff773e5cc0, ffffffff76e3e000, 2000) + 8
ffffffff76c9c7f0 getnow (ffffffff704fb180, ffffffff773c6384, 1a311c, 2, ffffffff76e4eae8, fffc00) + 4
ffffffff76c9af0c strptime_recurse (ffffffff76e4cea0, 1, 104980178, ffffffff704fb938, ffffffff704fb180, ffffffff704fb1a4) + 34
ffffffff76c9dce8 __strptime_std (ffffffff76e4cea0, 10458b2d8, 104980178, ffffffff704fb938, 2400, 1a38ec) + 2c

You (and we) are not going to be able to make time faster. From your message, I gather that you are calling it from many different threads at once. This may be a problem; it's quite possible that Solaris serializes these calls, so you end up with a lot of threads waiting for the others to complete.
How much accuracy do you need? A possible solution might be to have one thread loop on reading the time, sleeping maybe 10 ms between each read, and putting the result in a global variable which the other threads read. (Don't forget that you'll need to synchronize all accesses to that variable, unless you have some sort of atomic variable, like std::atomic<time_t> in C++11.)
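A minimal sketch of that idea, assuming a C++11 tool chain (the gcc 4.5.1 mentioned in the question would need its own thread and atomic primitives instead); the names are illustrative:

#include <atomic>
#include <chrono>
#include <ctime>
#include <thread>

std::atomic<time_t> g_cached_time(time(NULL));   // written by the updater, read by workers

// Runs in its own thread: refreshes the cached time roughly every 10 ms.
void time_updater()
{
    for (;;) {
        g_cached_time.store(time(NULL), std::memory_order_relaxed);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

// Worker threads call this instead of time(NULL); it never enters the kernel.
time_t current_time()
{
    return g_cached_time.load(std::memory_order_relaxed);
}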

Keep in mind that pstack doesn't just immediately interrupt your program and generate a stack. It has to grab debug-level control of the process, and if time calls are sufficiently frequent it may drastically over-indicate them, because pstack itself uses syscalls to take control of your application and print the stack.
Most likely the time calls are not the source of your real performance problem. I suspect you'll want to use a profiler such as gprof (compile with g++ -pg). Alternatively, you could use one of the dtrace kits and run the hotuser dtrace script, which does basic statistical profiling of your running application's user code.
time returns UTC time so any changes to TZ should have no effect on its call time whatsoever.
If, after profiling, it turns out that time really is the culprit you may be able to cache the value from the time call since it won't change more than once a second.

Related

Measuring execution time - in program code or in shell?

I have a program and I want to measure its execution (wallclock) time for different input sizes.
In some other similar questions I read that using clock_gettime in the source code wouldn't be reliable because of the CPU's branch predictor, register renaming, speculative execution, out-of-order execution, etc., and because sometimes even the optimizer can move the clock_gettime call somewhere other than where I put it.
But these questions I read were about measuring the time of a specific function. Would these problems still exist if I'm measuring for the whole program (i.e. the main function)? I'm looking for relative measurements, how the execution time changes for different input sizes, not the absolute value.
How would I get better results? Using timing functions in the code:
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);   // CLOCK_MONOTONIC: unaffected by clock adjustments
do_stuff();
clock_gettime(CLOCK_MONOTONIC, &end);
double execution_time = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
or with the time command in bash:
time ./program
Measuring in the program will give you a more accurate answer. Sure, in theory, in some cases the clock_gettime calls can get moved where you don't expect them. In practice, it will not happen if you have only a function call in between. (If in doubt, check the resulting assembler code.)
Calling time in shell will include things you don't care about, like the time it takes to load your executable and get to the interesting point. On the other hand, if your do_stuff takes a few seconds, then it doesn't really matter.
I'd go with the following recommendation:
If you can isolate your function easily and make it take a few seconds (you can also loop it, but measure an empty loop for comparison as well), then either clock_gettime or time will do just fine.
If you cannot isolate easily, but your function consistently takes hundreds of milliseconds, use clock_gettime.
If you cannot isolate and you're optimising something tiny, have a look at rdtsc timing for measuring a function, which talks about measuring actual executed cycles.
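For illustration, a bare-bones rdtsc sketch on x86 with GCC/Clang (an assumption on my part; TSC readings are affected by frequency scaling and out-of-order execution, so treat the result as a rough figure):

#include <x86intrin.h>    // __rdtsc on x86 with GCC/Clang

void do_stuff();           // the code under test, assumed to exist elsewhere

unsigned long long measure_cycles()
{
    unsigned long long t0 = __rdtsc();
    do_stuff();
    unsigned long long t1 = __rdtsc();
    return t1 - t0;        // approximate TSC ticks elapsed, not exact core cycles
}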

How do I profile Hiccups in performance?

Usually profile data is gathered by randomly sampling the stack of the running program to see which function is executing; over a running period it is possible to be statistically sure which methods/function calls eat the most time and need intervention in case of bottlenecks.
However, this has to do with overall application/game performance. Sometimes there are singular, isolated hiccups in performance that cause usability trouble anyway (the user notices them, lag is introduced in some internal mechanism, etc.). With regular profiling over a few seconds of execution it is not possible to know which ones. Even if the hiccup lasts long enough (say 30 ms, which is still not much) to detect some method that is called too often, we will still miss the execution of many other methods that are simply "skipped" because of the random sampling.
So are there any techniques to profile hiccups, in order to keep the framerate more stable after fixing those kinds of "rare bottlenecks"? I'm assuming usage of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
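For example, a POSIX-flavoured sketch of that watchdog (the 40 ms budget, the SIGALRM/SIGTRAP choice, and the wrapper name are my assumptions; a Windows build would need a different timer mechanism):

#include <csignal>
#include <sys/time.h>

extern void DrawFrame();                  // the routine being watched

static void on_slow_frame(int)
{
    std::raise(SIGTRAP);                  // break into the debugger; inspect the stack here
}

void draw_frame_with_watchdog()
{
    std::signal(SIGALRM, on_slow_frame);

    itimerval budget = {};                // one-shot timer: fires after 40 ms
    budget.it_value.tv_usec = 40 * 1000;
    setitimer(ITIMER_REAL, &budget, NULL);

    DrawFrame();

    itimerval cancel = {};                // zeroed it_value disarms the timer
    setitimer(ITIMER_REAL, &cancel, NULL);
}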

How to get CPU clock cycles used by process in kernel mode on windows?

As the title suggests I'm interested in obtaining CPU clock cycles used by a process in kernel mode only. I know there is an API called "QueryProcessCycleTime" which returns the CPU clock
cycles used by the threads of the process. But this value includes cycles spent in both user mode and kernel mode. How can I obtain the cycles spent in kernel mode only? Do I need to get this using performance counters? If yes, which one should I use?
Thanks in advance for your answers.
I've just found an interesting article that describes almost what you ask for. It's on MSDN Internals.
They write there that if you were using C# or C++/CLI, you could easily get that information from an instance of the System.Diagnostics.Process class pointed at the right PID. But it would give you a TimeSpan from the PrivilegedProcessorTime property, so "pretty time" instead of cycles.
However, they also point out that all that .Net code is really a thin wrapper over unmanaged APIs, so you should be able to get the same data easily from native C++ too. They ILDASM'ed that class to show what it calls, but the image is missing. I've just done the same, and it uses GetProcessTimes from kernel32.dll.
So, again, MSDN'ing it - it returns LPFILETIME structures. So, "pretty time" again, not cycles.
The description of that function points out that if you want clock cycles, you should use the QueryProcessCycleTime function. That one actually returns the number of clock cycles... but with user and kernel mode counted together.
Now, summing up:
you can read userTIME
you can read kernelTIME
you can read (user+kernel)CYCLES
So you have almost everything needed. By some simple math:
u_cycles = u_time * all_cycles / (u_time + k_time)
k_cycles = k_time * all_cycles / (u_time + k_time)
Of course this will be some approximation due to rounding etc.
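A rough C++ sketch of that arithmetic using the GetProcessTimes and QueryProcessCycleTime calls discussed above (self-measuring for simplicity; the proportional split is only the approximation described, not an exact per-mode figure):

#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE process = GetCurrentProcess();       // or OpenProcess(...) for another PID

    FILETIME creation, exitTime, kernel, user;
    ULONG64 totalCycles = 0;
    if (!GetProcessTimes(process, &creation, &exitTime, &kernel, &user) ||
        !QueryProcessCycleTime(process, &totalCycles))
        return 1;

    // FILETIME is in 100-ns units; merge the two 32-bit halves.
    ULARGE_INTEGER k, u;
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;

    ULONG64 total = k.QuadPart + u.QuadPart;
    if (total == 0)
        return 1;

    // Proportional split: only an approximation, as discussed above.
    double kernelFraction = (double)k.QuadPart / (double)total;
    ULONG64 kernelCycles = (ULONG64)(totalCycles * kernelFraction);
    ULONG64 userCycles   = totalCycles - kernelCycles;
    std::printf("kernel cycles ~ %llu, user cycles ~ %llu\n",
                (unsigned long long)kernelCycles, (unsigned long long)userCycles);
    return 0;
}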
Also, this has a gotcha: you have to invoke two functions (GetProcessTimes, QueryProcessCycleTime) to get all the information, so there will be a slight delay between their readings, and therefore the calculation will probably drift a little, since the target process keeps running and burning time in between.
If you cannot allow for this (small?) noise in the measurement, I think you can circumvent it by temporarily suspending the process:
suspend the target
wait a little and ensure it is suspended
read first stats
read second stats
then resume the process and calculate values
I think this will ensure the two readings are consistent, but in turn each such reading will impact the overall performance of the measured process - so, for example, things like "wall time" will no longer be measurable, unless you correct for the time spent in suspension.
There may be some better way to get the separate clock cycles, but I have not found one, sorry. You could try looking inside QueryProcessCycleTime to see where it reads its data from - maybe you are lucky, it reads A and B and returns A+B, and you can peek at those sources directly. I have not checked.
Take a look at GetProcessTimes. It'll give you the amount of kernel and user time your process has used.

Gprof: specific function time [duplicate]

This question already has answers here: Function execution time (2 answers). Closed 9 years ago.
I want to find out the time spent by a particular function in my program. For that purpose, I am making use of gprof. I used the following command to get the time for that specific function, but the log file still displays the results for all the functions in the program. Please help me in this regard.
gprof -F FunctionName Executable gmon.out>log
You are nearly repeating another question about function execution time.
As I answered there, it is difficult (due to the hardware!) to reliably get the execution time of a particular function, especially if that function takes little time (e.g. less than a millisecond). Your original question pointed to these methods.
I would suggest using clock_gettime(2) with CLOCK_REALTIME, or perhaps CLOCK_THREAD_CPUTIME_ID.
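For instance, a small sketch around the function of interest (the function name is hypothetical; CLOCK_THREAD_CPUTIME_ID counts only the CPU time consumed by the calling thread):

#include <time.h>     // clock_gettime; may need -lrt on older glibc
#include <cstdio>

extern void function_of_interest();   // hypothetical function to be timed

int main()
{
    timespec t0, t1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);
    function_of_interest();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    std::printf("CPU time in this thread: %.9f s\n", seconds);
    return 0;
}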
gprof(1) (after compilation with -pg) works with profil(3) and uses a sampling technique based upon sending a SIGPROF signal (see signal(7)) at periodic intervals (e.g. every 10 milliseconds) from a timer set with setitimer(2) and ITIMER_PROF; so the program counter is sampled periodically. Read the wikipage on gprof and notice that profiling may significantly degrade the running time.
If your function gets executed in a short time (less than a millisecond) the profiling gives an imprecise measurement (read about heisenbugs).
In other words, profiling and measuring the time of a short-running function alters the behavior of the program (and this would happen with other OSes too!). You might have to give up the goal of measuring the timing of your function precisely, reliably, and accurately without disturbing it. It might not even make precise sense, e.g. because of the CPU cache.
You could use gprof without any -F argument and, if needed, post-process the textual profile output (e.g. with GNU awk) to extract the information you want.
BTW, the precise timing of a particular function might not be important. What is important is the benchmarking of the entire application.
You could also ask the compiler to optimize your program even more; if you are using link-time optimization, i.e. compiling and linking with g++ -flto -O2, the notion of the timing of a small function may even cease to exist (because the compiler and the linker could have inlined it without you knowing).
Consider also that current superscalar processors have such a complex micro-architecture (instruction pipelines, caches, branch prediction, register renaming, speculative execution, out-of-order execution, and so on) that the very notion of timing a short function is ill-defined. You cannot predict or measure it.

Using gprof with sockets

I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of the profiler, but under a profiler it's unavoidable. This means having code like:
if ( errno == EINTR ) { ...
after each system call.
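For example, a retry wrapper around select() (the wrapper name and parameters are illustrative; the point is simply to restart the call when it is interrupted by gprof's SIGPROF):

#include <sys/select.h>
#include <cerrno>

// nfds, readfds, and timeout are assumed to be prepared by the caller.
int select_restarting(int nfds, fd_set readfds, timeval timeout)
{
    int rc;
    do {
        fd_set rfds = readfds;        // select() clobbers its fd_set arguments
        timeval tv = timeout;         // and, on Linux, the remaining timeout too
        rc = select(nfds, &rfds, NULL, NULL, &tv);
    } while (rc == -1 && errno == EINTR);
    return rc;                        // -1 here means a real error, not just EINTR
}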
Take a look, for example, here for the background.
gprof (here's the paper) is reliable, but it was only ever intended to measure changes, and even for that it only measures CPU-bound issues. It was never advertised as being useful for locating problems; that is an idea other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: if I can just give you an example. Suppose you have a call hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O on a socket or file, and that's basically all the program does. Now, further suppose that the number of times each routine calls the next one down is 25% more than it really needs to be. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O, gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.