Gprof: specific function time [duplicate] - c++

I want to find out the time spent by a particular function in my program. For that purpose, I am using gprof. I used the following command to get the time for that specific function, but the log file still displays results for all the functions in the program. Please help me in this regard.
gprof -F FunctionName Executable gmon.out>log

You are nearly repeating another question about function execution time.
As I answered there, it is difficult (because of the hardware!) to reliably measure the execution time of a particular function, especially if that function runs for a very short time (e.g. less than a millisecond). Your original question pointed to these methods.
I would suggest using clock_gettime(2) with CLOCK_REALTIME or perhaps CLOCK_THREAD_CPUTIME_ID.
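For illustration, here is a minimal sketch of timing a call that way; some_work is a made-up placeholder for the function you care about, and you could use CLOCK_REALTIME instead if you want wall-clock time (on older glibc you may need to link with -lrt):
#include <cstdio>
#include <ctime>
// Made-up workload standing in for the function to be measured.
static volatile double sink;
static void some_work() {
    double s = 0.0;
    for (int i = 0; i < 1000000; ++i) s += i * 0.5;
    sink = s;
}
int main() {
    timespec t0, t1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);   // CPU time of this thread
    some_work();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("some_work took %.9f s\n", secs);
    return 0;
}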
gprof(1) (after compilation with -pg) works with profil(3) and uses a sampling technique, based on a SIGPROF signal (see signal(7)) delivered at periodic intervals (e.g. every 10 milliseconds) from a timer set with setitimer(2) and ITIMER_PROF; so the program counter is sampled periodically. Read the Wikipedia page on gprof and notice that profiling may significantly degrade the running time.
If your function executes in a short time (less than a millisecond), the profiling gives an imprecise measurement (read about heisenbugs).
In other words, profiling and measuring the time of a short-running function alters the behavior of the program (and this would happen with other OSes too!). You might have to give up the goal of measuring the timing of your function precisely, reliably, and accurately without disturbing it. It might not even make precise sense, e.g. because of the CPU cache.
You could use gprof without any -F argument and, if needed, post-process the textual profile output (e.g. with GNU awk) to extract the information you want.
BTW, the precise timing of a particular function might not be important. What is important is the benchmarking of the entire application.
You could also ask the compiler to optimize your program even more; if you are using link-time optimization, i.e. compiling and linking with g++ -flto -O2, the notion of the timing of a small function may even cease to exist (because the compiler and the linker could have inlined it without you knowing).
Consider also that current superscalar processors have such a complex micro-architecture (with instruction pipelines, caches, branch predictors, register renaming, speculative execution, out-of-order execution, etc.) that the very notion of timing a short function is ill-defined. You cannot predict or measure it reliably.

Related

Get the execution time of each line of code in c++

In my current project, some part of the code is taking more than 30 minutes to complete. I found that the clock function would be a good choice for getting a method's execution time, but is there any other way to find the line of code that takes the most time? Otherwise I would have to instrument every method with the clock function, which would be a complex process for me because it is a really gigantic project and would take forever.
The proper way to do it is profiling. This will give you pretty useful information per function: where the code spends most of its time, which function is called most often, etc. There are profilers available at the compiler level (gcc has an option to enable one) or you can use 3rd-party ones. Unfortunately, profiling itself affects the performance of the program, and you may see different timings with the profiler enabled than in the real program, but usually it is a good starting point.
As for measuring the execution time of every line of code, that is not practical. First of all, not every line produces executable code, especially after the optimizer has run. On the other hand, it is pretty useless to optimize code that was not compiled with optimization enabled.

LD_BIND_NOW Can Make the Executable run Slower?

I am curious: if an executable is so poorly written that it has a lot of dead code, referring to thousands of functions in external .so files while only hundreds of those functions are actually called at runtime, will LD_BIND_NOW=1 be worse than leaving LD_BIND_NOW unset? Because the Procedure Linkage Table will contain 900 useless function addresses? Worse in the sense of memory footprint and performance (as I don't know whether the lookup is O(n)).
I am trying to see whether setting LD_BIND_NOW to 1 will help (by comparing to LD_BIND_NOW not set):
1. a program that runs 24 x 5, in terms of latency;
2. saving 1 microsecond is considered big in my case, as the code paths executed during the lifetime of the program mainly process incoming messages from TCP/UDP/shared memory and then do some computations on them; all these code paths take very little time (e.g. < 10 microseconds) and are run millions of times.
Whether LD_BIND_NOW=1 helps the startup time doesn't matter to me.
saving 1 microsecond is considered big in my case as the executions by the program are all short (e.g. <10 micro)
This is unlikely (or you mean something else). A typical call to execve(2), the system call used to start programs, usually takes several milliseconds. So it is rare (and practically impossible) for a program to execute (from execve to _exit(2)) in microseconds.
I suggest that your program should not be started more than a few times per second. If indeed the entire program is very short lived (so its process lasts only a fraction of a second), you could consider some other approach (perhaps making a server running those functions).
LD_BIND_NOW will affect (and slow down) the start-up time (e.g. in the dynamic linker ld-linux(8)). It should not affect (except for cache effects) the steady-state execution time of some event loop.
See also the references in this related answer (to a different question); they contain detailed explanations relevant to your question.
In short, the setting of LD_BIND_NOW will not affect significantly the time needed to handle each incoming message in a tight event loop.
Calling functions in shared libraries (containing position-independent code) might be slightly slower (by a few percent at most, and probably less on x86-64) in some cases. You could try static linking, and you might even consider link-time optimization (i.e. compiling and linking all your code, main programs and static libraries, with -flto -O2 if using GCC).
You might have accumulated technical debt, and you could need major code refactoring (which takes a lot of time and effort, that you should budget).

Measuring execution time - in program code or in shell?

I have a program and I want to measure its execution (wallclock) time for different input sizes.
In some other similar questions I read that using clock_gettime in the source code wouldn't be reliable because of the CPU's branch predictor, register renaming, speculative execution, out-of-order execution etc., and that sometimes the optimizer can even move the clock_gettime call somewhere other than where I put it.
But these questions I read were about measuring the time of a specific function. Would these problems still exist if I'm measuring for the whole program (i.e. the main function)? I'm looking for relative measurements, how the execution time changes for different input sizes, not the absolute value.
How would I get better results? Using timing functions in the code:
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
do_stuff();
clock_gettime(CLOCK_MONOTONIC, &end);
double execution_time = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
or with the time command in bash:
time ./program
Measuring in the program will give you a more accurate answer. Sure, in theory, in some cases the clock_gettime calls can get moved where you don't expect them. In practice, it will not happen if there is only a function call in between. (If in doubt, check the resulting assembler code.)
Calling time in shell will include things you don't care about, like the time it takes to load your executable and get to the interesting point. On the other hand, if your do_stuff takes a few seconds, then it doesn't really matter.
I'd go with the following recommendation:
If you can isolate your function easily and make it take a few seconds (you can also loop it, but measure empty loop for comparison as well), then either clock_gettime or time will do just fine.
If you cannot isolate easily, but your function consistently takes hundreds of milliseconds, use clock_gettime.
If you cannot isolate it and you're optimising something tiny, have a look at rdtsc timing for measuring a function, which talks about measuring actual executed cycles.
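As a rough sketch of that last approach (assuming x86-64 with GCC or Clang, where the __rdtsc intrinsic lives in <x86intrin.h>; do_stuff here is just a placeholder), keeping in mind that on an out-of-order CPU the raw counter readings are only approximate unless you add serialization:
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc() intrinsic (GCC/Clang, x86 only)
static volatile uint64_t sink;
static void do_stuff() {           // placeholder for the code under test
    uint64_t s = 0;
    for (int i = 0; i < 100000; ++i) s += i;
    sink = s;
}
int main() {
    uint64_t c0 = __rdtsc();       // cycle counter before
    do_stuff();
    uint64_t c1 = __rdtsc();       // cycle counter after
    printf("do_stuff took about %llu cycles\n", (unsigned long long)(c1 - c0));
    return 0;
}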

CUDA architecture -sm_11 compile issue in NSight

As my GPU device (Quadro FX 3700) doesn't support arch > sm_11, I was not able to use relocatable device code (rdc). Hence I combined all the utilities needed into one large file (say x.cu).
To give an overview of x.cu, it contains 2 classes with 5 member functions each, 20 device functions, 1 global kernel, and 1 kernel caller function.
Now, when I try to compile via Nsight it just hangs, showing Build% as 3.
When i try compiling using
nvcc x.cu -o output -I"."
It shows the following messages and compiles after a long time:
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: Olimit was exceeded on function _Z18optimalOrderKernelPdP18PrepositioningCUDAdi; will not perform function-scope optimization.
To still perform function-scope optimization, use -OPT:Olimit=0 (no limit) or -OPT:Olimit=45022
/tmp/tmpxft_0000236a_00000000-9_Kernel.cpp3.i(0): Warning: To override Olimit for all functions in file, use -OPT:Olimit=45022
(Compiler may run out of memory or run very slowly for large Olimit values)
where optimalOrderKernel is the global kernel. Compiling shouldn't take this much time, so I want to understand the reason behind these messages, particularly Olimit.
Olimit is pretty clear, I think. It is a limit the compiler places on the amount of effort it will expend on optimizing code.
Most codes compile just fine using nvcc. However, no compiler is perfect, and some seemingly innocuous codes can cause the compiler to spend a long time at an optimization process that would normally be quick.
Since you haven't provided any code, I'm speaking in generalities.
Since there is the occasional case where the compiler spends a disproportionately long time in certain optimization phases, the Olimit acts as a convenient watchdog on an optimization process that is taking too long, and it gives you some idea of why the compile is slow. When the limit is exceeded, certain optimization steps are aborted and a "less optimized" version of your code is generated instead.
I think the compiler messages you received are quite clear on how to modify the Olimit depending on your intentions. You can override it to increase the watchdog period, or disable it entirely (by setting it to zero). In that case, the compile process could take an arbitrarily long period of time, and/or run out of memory, as the messages indicate.

c++ time() function performance in solaris

We have a multi-threaded C++ application running on Solaris (5.10, SPARC platform). According to "pstack", most of the threads seem to be waiting on the call below, often for a little too long. It corresponds to the "time_t currentTime = time(NULL);" call in the application code that gets the current time in seconds.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff76c63508, ffffffff76e3e000, 2000) + 8
The timezone is "Asia/Riyadh". I tried setting the TZ variable to both "Asia/Riyadh" and '<GMT+3>-3', but there is no obvious improvement with either option. Changing the server code (even if there were an alternative) is rather difficult at this point. A test program (single-threaded, compiled without -O2) making 1 million time(NULL) calls completed rather quickly. The application and the test program are compiled with gcc 4.5.1.
Is there anything else that I can try out?
I agree that it is a rather broad question. I will try out the valid suggestions and close this as soon as there is adequate improvement to handle the current load.
Edit 1 :
Please ignore the reference to time(NULL) above as a possible cause for the __time stack. I made the inference based on the signature and on finding the same invocation in the source method.
Following is another stack leading to __time.
ffffffff76cdbe1c __time (0, 23e8, 1dab58, ffffffff773e5cc0, ffffffff76e3e000, 2000) + 8
ffffffff76c9c7f0 getnow (ffffffff704fb180, ffffffff773c6384, 1a311c, 2, ffffffff76e4eae8, fffc00) + 4
ffffffff76c9af0c strptime_recurse (ffffffff76e4cea0, 1, 104980178, ffffffff704fb938, ffffffff704fb180, ffffffff704fb1a4) + 34
ffffffff76c9dce8 __strptime_std (ffffffff76e4cea0, 10458b2d8, 104980178, ffffffff704fb938, 2400, 1a38ec) + 2c
You (and we) are not going to be able to make time faster. From your message, I gather that you are calling it from many different threads at once. This may be a problem; it's quite possible that Solaris serializes these calls, so you end up with a lot of threads waiting for the others to complete.
How much accuracy do you need? A possible solution might be to have one thread loop on reading the time, sleeping maybe 10 ms between each read, and putting the results in a global variable, which the other threads read. (Don't forget that you'll need to synchronize all accesses to the variable, unless you have some sort of atomic variable, like std::atomic<time_t> in C++11.)
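A minimal C++11 sketch of that idea (the names and the 10 ms refresh interval are illustrative, not taken from the original code):
#include <atomic>
#include <chrono>
#include <ctime>
#include <thread>
std::atomic<time_t> cached_now(time(nullptr));
// A single background thread refreshes the cached time periodically.
void time_updater() {
    for (;;) {
        cached_now.store(time(nullptr), std::memory_order_relaxed);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
// Worker threads read the cached value instead of calling time(NULL) themselves.
time_t current_time() {
    return cached_now.load(std::memory_order_relaxed);
}
int main() {
    std::thread updater(time_updater);
    updater.detach();
    // ... the rest of the application calls current_time() where it used time(NULL) ...
    return 0;
}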
Keep in mind that pstack doesn't just immediately interrupt your program and generate a stack. It has to grab debug-level control, and if time calls are sufficiently frequent it may drastically over-represent time in the captured stacks, since pstack itself uses syscalls to take control of your application and print the stack.
Most likely the time calls are not the source of your real performance problem. I suspect you'll want to use a profiler such as gprof (compiling with g++ -pg). Alternatively, you could use some of the DTrace kits and run the hotuser DTrace script, which does basic statistical profiling on your running application's user code.
time returns UTC time so any changes to TZ should have no effect on its call time whatsoever.
If, after profiling, it turns out that time really is the culprit you may be able to cache the value from the time call since it won't change more than once a second.