I have a program and I want to measure its execution (wallclock) time for different input sizes.
In some other similar questions I read that using clock_gettime in the source code wouldn't be reliable because of the CPU's branch predictor, register renaming, speculative execution, out-of-order execution, etc., and because sometimes the optimizer can even move the clock_gettime call somewhere other than where I put it.
But these questions I read were about measuring the time of a specific function. Would these problems still exist if I'm measuring for the whole program (i.e. the main function)? I'm looking for relative measurements, how the execution time changes for different input sizes, not the absolute value.
How would I get better results? Using timing functions in the code:
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_stuff();
clock_gettime(CLOCK_MONOTONIC, &end);
double execution_time = (end.tv_sec - start.tv_sec)
                      + (end.tv_nsec - start.tv_nsec) / 1e9;
or with the time command in bash:
time ./program
Measuring in the program will give you a more accurate answer. Sure, in theory, in some cases the clock_gettime calls can get moved to where you don't expect them. In practice, that will not happen if you have only a function call in between. (If in doubt, check the resulting assembler code.)
Calling time in the shell will include things you don't care about, like the time it takes to load your executable and reach the interesting point. On the other hand, if your do_stuff takes a few seconds, that doesn't really matter.
I'd go with the following recommendation:
If you can isolate your function easily and make it take a few seconds (you can also loop it, but measure an empty loop for comparison as well; see the sketch after this list), then either clock_gettime or time will do just fine.
If you cannot isolate easily, but your function consistently takes hundreds of milliseconds, use clock_gettime.
If you cannot isolate and you're optimising something tiny, have a look at rdtsc timing for measuring a function, which talks about measuring actual executed cycles.
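If you go the looping route, a minimal sketch might look like the following. It assumes a hypothetical do_stuff() and uses CLOCK_MONOTONIC, which is usually the right clock for wall-clock intervals; the empty loop gives a baseline for the loop overhead itself.

#include <time.h>
#include <cstdio>

void do_stuff() {
    // Stand-in for the real work being measured.
}

static double elapsed_seconds(const timespec& a, const timespec& b) {
    // Difference b - a, converted to seconds.
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main() {
    timespec start, end;
    const int iterations = 1000;

    // Baseline: time an empty loop of the same length.
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; ++i) {
        __asm__ __volatile__("");  // keep the compiler from deleting the loop (gcc/clang)
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    const double baseline = elapsed_seconds(start, end);

    // The real measurement.
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < iterations; ++i) {
        do_stuff();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    const double total = elapsed_seconds(start, end);

    std::printf("per-call wall time: %g s\n", (total - baseline) / iterations);
}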
Related
In my current project, one part of the code takes more than 30 minutes to complete. I found that the clock function would be a good choice for getting a method's execution time, but is there any other way to find the line of code that takes the most time? Otherwise I would have to log every method with the clock function, which would be a complex process for me because it is a really gigantic project and would take forever.
The proper way to do it is profiling. This will give you pretty useful information on a per-function basis: where the code spends most of its time, which functions are called most often, etc. Some profilers are available at the compiler level (gcc, for example, has an option to enable one), or you can use third-party tools. Unfortunately, profiling itself affects the performance of the program, and you may see different timings with the profiler enabled than in the real program, but it is usually a good starting point.
As for measuring the execution time of every line of code, that is not practical. First of all, not every line produces executable code, especially after the optimizer has run. On the other hand, it is pretty pointless to optimize code that is not compiled with optimization enabled.
This question already has answers here: Function execution time (2 answers). Closed 9 years ago.
I want to find out the time spent by a particular function in my program. For that purpose, I am making use of gprof. I used the following command to get the time for the specific function, but the log file still displays the results for all the functions present in the program. Please help me in this regard.
gprof -F FunctionName Executable gmon.out > log
You are nearly repeating another question about function execution time.
As I answered there, it is difficult (due to the hardware!) to reliably measure the execution time of a particular function, especially if that function takes little time (e.g. less than a millisecond). Your original question pointed to these methods.
I would suggest using clock_gettime(2) with CLOCK_REALTIME, or perhaps CLOCK_THREAD_CPUTIME_ID.
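As a minimal sketch (compute() here is just a stand-in for whatever you want to measure), the thread-CPU-time variant might look like this:

#include <time.h>
#include <cstdio>

void compute() {
    // Stand-in for the function being measured.
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x += i * 0.5;
}

int main() {
    timespec start, end;

    // CLOCK_THREAD_CPUTIME_ID counts only CPU time consumed by this thread,
    // so time spent blocked (I/O, sleeping) is not included.
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    compute();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);

    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    std::printf("compute() used %.6f s of CPU time\n", seconds);
}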
gprof(1) (after compilation with -pg) works with profil(3) and uses a sampling technique, based upon sending a SIGPROF signal (see signal(7)) at periodic intervals (e.g. every 10 milliseconds) from a timer set with setitimer(2) and ITIMER_PROF; so the program counter is sampled periodically. Read the wikipage on gprof and notice that profiling may significantly degrade the running time.
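To make that mechanism concrete (this is only an illustration of the sampling idea, not what gprof itself installs; the 10 ms interval and the counting handler are just for demonstration), a SIGPROF handler driven by setitimer(2) with ITIMER_PROF might look like this:

#include <csignal>
#include <cstdio>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

// The handler fires periodically while the process consumes CPU time;
// a real profiler would record the program counter here, we just count.
static void on_sigprof(int) { ++samples; }

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_sigprof;
    sigaction(SIGPROF, &sa, nullptr);

    // Request a SIGPROF roughly every 10 ms of consumed CPU time.
    struct itimerval timer = {};
    timer.it_interval.tv_usec = 10000;
    timer.it_value.tv_usec = 10000;
    setitimer(ITIMER_PROF, &timer, nullptr);

    // Burn some CPU so the timer has something to sample.
    volatile double x = 0;
    for (long i = 0; i < 100000000L; ++i) x += i;

    std::printf("took %d profiling samples\n", (int)samples);
}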
If your function gets executed in a short time (less than a millisecond), the profiling gives an imprecise measurement (read about heisenbugs).
In other words, profiling and measuring the time of a short-running function alters the behavior of the program (and this would happen on other OSes too!). You might have to give up the goal of measuring the timing of your function precisely, reliably, and accurately without disturbing it. The measurement might not even make precise sense, e.g. because of the CPU cache.
You could use gprof without any -F argument and, if needed, post-process the textual profile output (e.g. with GNU awk) to extract the information you want.
BTW, the precise timing of a particular function might not be important. What is important is the benchmarking of the entire application.
You could also ask the compiler to optimize your program even more; if you are using link-time optimization, i.e. compiling and linking with g++ -flto -O2, the notion of the timing of a small function may even cease to exist (because the compiler and the linker could have inlined it without you knowing).
Consider also that current superscalar processors have such a complex micro-architecture, with instruction pipelines, caches, branch prediction, register renaming, speculative execution, out-of-order execution, etc., that the very notion of timing a short function is not well defined. You cannot predict or measure it.
We have a multi-threaded C++ application running on Solaris (5.10, SPARC platform). According to pstack, most of the threads seem to be waiting on the call below, often for a little too long. It corresponds to the time_t currentTime = time(NULL); call in the application code, which gets the current time in seconds.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff76c63508, ffffffff76e3e000, 2000) + 8
The timezone is "Asia/Riyadh". I tried setting the TZ variable to both "Asia/Riyadh" and '<GMT+3>-3', but there is no obvious improvement with either option. Changing the server code (even if there were an alternative) is rather difficult at this point. A test program (single-threaded, compiled without -O2) making 1 million time(NULL) calls finishes rather quickly. Both the application and the test program are compiled with gcc 4.5.1.
Is there anything else that I can try out?
I agree that it is a rather broad question. I will try out the valid suggestions and close this as soon as there is adequate improvement to handle the current load.
Edit 1 :
Please ignore the reference to time(NULL) above as a possible cause of the __time stack. I made that inference based on the signature and on finding the same invocation in the source method.
Following is another stack leading to __time.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff773e5cc0, ffffffff76e3e000, 2000) + 8
ffffffff76c9c7f0 getnow (ffffffff704fb180, ffffffff773c6384, 1a311c, 2, ffffffff76e4eae8, fffc00) + 4
ffffffff76c9af0c strptime_recurse (ffffffff76e4cea0, 1, 104980178, ffffffff704fb938, ffffffff704fb180, ffffffff704fb1a4) + 34
ffffffff76c9dce8 __strptime_std (ffffffff76e4cea0, 10458b2d8, 104980178, ffffffff704fb938, 2400, 1a38ec) + 2c
You (and we) are not going to be able to make time faster. From your message, I gather that you are calling it from many different threads at once. This may be a problem; it's quite possible that Solaris serializes these calls, so you end up with a lot of threads waiting for the others to complete.
How much accuracy do you need? A possible solution might be to have one thread loop on reading the time, sleeping maybe 10 ms between each read, and putting the results in a global variable which the other threads read. (Don't forget that you'll need to synchronize all accesses to the variable, unless you have some sort of atomic variable, like std::atomic<time_t> in C++11.)
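A rough C++11 sketch of that suggestion (the names are illustrative; the 10 ms interval and std::atomic<time_t> are exactly the idea described above):

#include <atomic>
#include <chrono>
#include <ctime>
#include <thread>

// Shared cached time, refreshed by one dedicated thread.
std::atomic<std::time_t> cached_now(std::time(nullptr));

void time_updater() {
    for (;;) {
        cached_now.store(std::time(nullptr), std::memory_order_relaxed);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}

// What the worker threads call instead of time(NULL).
std::time_t current_time() {
    return cached_now.load(std::memory_order_relaxed);
}

int main() {
    std::thread updater(time_updater);
    updater.detach();  // runs for the lifetime of the process
    // ... worker threads call current_time() instead of time(NULL) ...
}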
Keep in mind that pstack doesn't just immediately interrupt your program and generate a stack. It has to grab debug-level control, and if time calls are sufficiently frequent it may drastically over-report calls to time, because it uses those syscalls to take control of your application and print the stack.
Most likely the time calls are not the source of your real performance problem. I suspect you'll want to use a profiler such as gprof (compiling with g++ -pg). Alternatively, you could use one of the dtrace kits and the hotuser dtrace script, which does basic statistical profiling of your running application's user code.
time returns UTC time, so any changes to TZ should have no effect on how long the call takes.
If, after profiling, it turns out that time really is the culprit, you may be able to cache the value from the time call, since it won't change more than once a second.
The output of a typical profiler is a list of the functions in your code, sorted by the amount of time each function took while the program ran.
This is very good, but sometimes I'm more interested in what the program was doing most of the time than in where the EIP was most of the time.
An example output of my hypothetical profiler is:
Waiting for file IO - 19% of execution time.
Waiting for network - 4% of execution time.
Cache misses - 70% of execution time.
Actual computation - 7% of execution time.
Is there such a profiler? Is it possible to derive such an output from a "standard" profiler?
I'm using Linux, but I'll be glad to hear any solutions for other systems.
This is Solaris-only, but dtrace can monitor almost every kind of I/O, on/off-CPU time, time in specific functions, sleep time, etc. I'm not sure whether it can determine cache misses, though, assuming you mean CPU cache misses; I'm not sure that information is made available by the CPU.
Please take a look at this and this.
Consider any thread. At any instant of time it is doing something, and it is doing it for a reason, and slowness can be defined as the time it spends for poor reasons - it doesn't need to be spending that time.
Take a snapshot of the thread at a point in time. Maybe it's in a cache miss, in an instruction, in a statement, in a function, called from a call instruction in another function, called from another, and so on, up to call _main. Every one of those steps has a reason, that an examination of the code reveals.
If any one of those steps is not a very good reason and could be avoided, that instant of time does not need to be spent.
Maybe at that time the disk is coming around to certain sector, so some data streaming can be started, so a buffer can be filled, so a read statement can be satisfied, in a function, and that function is called from a call site in another function, and that from another, and so on, up to call _main, or whatever happens to be the top of the thread.
The previous point applies here as well: if any step in that chain is there for a poor reason, that instant of time does not need to be spent.
So, the way to find bottlenecks is to find when the code is spending time for poor reasons, and the best way to find that is to take snapshots of its state. The EIP, or any other tiny piece of the state, is not going to do it, because it won't tell you why.
Very few profilers "get it". The ones that do are the wall-clock-time stack samplers that report, by line of code (not by function), the percentage of time the line was active (not the amount of time, and especially not "self" or "exclusive" time). One that does is Zoom, and there are others.
Looking at where the EIP hangs out is like trying to tell time on a clock with only a second hand. Measuring functions is like trying to tell time on a clock with some of the digits missing. Profiling only during CPU time, not during blocked time, is like trying to tell time on a clock that randomly stops running for long stretches. Being concerned about measurement precision is like trying to time your lunch hour to the second.
This is not a mysterious subject.
I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of the profiler, but under a profiler it's unavoidable. This means having code like
if ( errno == EINTR ) { ...
after each system call.
Take a look, for example, here for the background.
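A minimal sketch of that retry pattern around select(2) might look like this (error handling other than EINTR, and the fact that Linux updates the timeout argument, are glossed over):

#include <cerrno>
#include <sys/select.h>

// Retry select() when it is interrupted by a signal
// (e.g. the SIGPROF that gprof's profiling timer delivers).
int select_retry(int nfds, fd_set* readfds, fd_set* writefds,
                 fd_set* exceptfds, timeval* timeout) {
    int rc;
    do {
        rc = select(nfds, readfds, writefds, exceptfds, timeout);
    } while (rc == -1 && errno == EINTR);
    return rc;
}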
gprof (here's the paper) is reliable, but it was only ever intended to measure changes, and even for that it only measures CPU-bound issues. It was never advertised as being useful for locating problems. That is an idea other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: If I may just give you an example. Suppose you have a call hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O on a socket or file, and that's basically all the program does. Now, further suppose that the number of times each routine calls the next one down is 25% more than it really needs to be. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O, gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.