I am curious: if an executable is poorly written, in the sense that it has a lot of dead code, referencing 1000s of functions in external shared libraries (.so files) while only 100s of those functions are actually called at runtime, will LD_BIND_NOW=1 be worse than leaving LD_BIND_NOW unset? Because the Procedure Linkage Table will contain 900 useless function addresses? Worse in the sense of memory footprint and performance (I don't know whether the lookup is O(n)).
I am trying to see whether setting LD_BIND_NOW to 1 will help (compared to leaving LD_BIND_NOW unset):
1. a program that runs 24 x 5, in terms of latency
2. saving 1 microsecond is considered big in my case, as the code paths executed during the lifetime of the program mainly process incoming messages from TCP/UDP/shared memory and then do some computations on them; all these code paths take very little time (e.g. < 10 microseconds) and are run millions of times
Whether LD_BIND_NOW=1 helps the startup time doesn't matter to me.
saving 1 microsecond is considered big in my case as the executions by the program are all short (e.g. <10 micro)
This is unlikely (or you mean something else). A typical call to execve(2) - the system call used to start programs - usually lasts several milliseconds. So it is rare (and practically impossible) for a program to execute (from execve to _exit(2)) in microseconds.
I guess that your program is not started more than a few times per second. If the entire program really is very short-lived (so its process lasts only a fraction of a second), you could consider some other approach (perhaps making a server that runs those functions).
LD_BIND_NOW will affect (and slow down) the start-up time (e.g. in the dynamic linker ld-linux(8)). It should not affect (except for cache effects) the steady-state execution time of some event loop.
See also the references in this related answer (to a different question); they contain detailed explanations relevant to your question.
In short, the setting of LD_BIND_NOW will not significantly affect the time needed to handle each incoming message in a tight event loop.
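If you want to convince yourself, here is a rough sketch (not a rigorous benchmark: the numbers will be noisy and depend on your toolchain and libc) that times the first call into a shared library against a later one. With lazy binding, the first call pays the symbol-resolution cost; run it with and without LD_BIND_NOW=1 to compare.

// Build (assumption): g++ -O2 plt_demo.cpp -lm -o plt_demo
// Then run:  ./plt_demo        and        LD_BIND_NOW=1 ./plt_demo
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    using clk = std::chrono::steady_clock;
    volatile double x = 1.2345;          // volatile so the calls are not folded away

    auto t0 = clk::now();
    volatile double a = std::cos(x);     // first call: may trigger lazy PLT resolution
    auto t1 = clk::now();
    volatile double b = std::cos(x + 1); // later call: plain jump through the resolved PLT slot
    auto t2 = clk::now();

    std::printf("first call: %lld ns, second call: %lld ns\n",
                (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    (void)a; (void)b;
    return 0;
}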
Calling functions in shared libraries (containing position-independent code) might be slightly slower (by a few percent at most, and probably less on x86-64) in some cases. You could try static linking, and you might even consider link-time optimization (i.e. compiling and linking all your code, main programs and static libraries, with -flto -O2 if using GCC).
You might have accumulated technical debt, and you could need major code refactoring (which takes a lot of time and effort, that you should budget).
I'd like to profile my application using callgrind. Now, since it takes a very long time, in the meantime I go on with web browsing, compiling and other intensive tasks on the same machine.
Am I biasing the profiling results? I'm expecting that, since valgrind uses a simulated CPU, other external processes should not interfere with valgrind execution. Am I right?
By default, Callgrind does not record anything related to time, so you can expect all collected metrics to (mostly) be independent of other processes on the machine. As the Callgrind manual states,
By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
As such, the metrics Callgrind reports should only depend on what instructions the program executes on the (simulated) CPU, not on how much time such instructions take. Indeed, the output of Callgrind can often be somewhat misleading, as the simulated CPU might operate differently from the real one (particularly when it comes to branch prediction).
The Callgrind paper presented at ICCS 2004 is very clear about this as well:
We note that the simulation is not able to predict consumed wall clock time, as this would need a detailed simulation of the microarchitecture.
In any case, however, the simulated CPU is unaffected by what the real CPU is doing.
The reason is straightforward.
Like you said, your program is not executed on your machine at all.
Instead, at runtime, Valgrind dynamically translates your program: it disassembles the binary into "UCode" for a simulated machine, adds analysis code (called instrumentation), then generates binary code that executes the simulation.
The addition of analysis code is what makes instruction counting (in Callgrind), memory checking (in Memcheck), and all other plugins possible.
Therein lies the twist, however.
Naturally there are limits to how isolated the program can run in such a dynamic simulation.
First, your program might interact with other programs.
While the time spent for doing so is irrelevant (as it is not accounted for), the return codes of inter-process communication can certainly change, depending on what else is going on in the system.
Second, most system calls need to be run untranslated, and their return codes can change as well, leading to different execution paths in your program and, thus, slightly different metrics being collected. (As an aside, Callgrind offers an option to record the wall clock time spent during syscalls, which will always be affected by whatever else goes on in the system.)
More details about these restrictions can be found in the PhD Dissertation of Nicholas Nethercote ("Dynamic Binary Analysis and Instrumentation").
In our project we're trying to automatically monitor the performance of test runs, to make sure that we don't have any significant changes in the performance of the program over time.
The problem is that there seems to be a consistent 5% variability in the measures we get. That is, on the same machine with the same program (no recompilation) running the same test we get values that differ by around 5% from run to run. This is way too much for what we want to use the numbers for.
We're already excluding setup costs from the timing considerations - that is, from within the C++ code itself we grab the time immediately before and after running the time-critical portions, rather than timing the whole program at the OS level. We are also doing averaging and outlier exclusion. The problem is that the variability also appears to have long-term trends, so we get tight clustering of times for replicates run right after each other, but an hour or two later the times are substantially different. (Unfortunately, spreading the test out over several hours is not feasible.) The tests are also being run on a dedicated machine while "nothing else" is being run on it.
We're not quite sure where the timing variation is coming from, but it may have to do with the processor and the system - there's indications that the size of the variability depends on what machine the program is running on.
Does anyone have an idea where this variation is likely to be coming from, and how to remove it? The tests are running on a dedicated machine, so changing the operating system settings would be possible.
(As indicated by the tags, this is a C++ program running on a x86 Linux system, if that helps clarify things.)
Edit: Response to comments
Our current timing scheme is to use the clock() function from the C standard library, looking at the difference in the return value from before/after the functions we want to test.
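In outline, the scheme looks like this (a minimal sketch; code_under_test is just a stand-in name for one of our time-critical portions):

#include <ctime>
#include <cstdio>

// Stand-in for the time-critical portion being measured.
static void code_under_test() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x += i;
}

int main() {
    std::clock_t start = std::clock();
    code_under_test();
    std::clock_t end = std::clock();
    // clock() reports CPU time for the whole process; its effective
    // resolution can be coarse, which matters for short measurements.
    std::printf("CPU time: %f s\n", double(end - start) / CLOCKS_PER_SEC);
    return 0;
}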
The code we're testing should be deterministic, and shouldn't involve heavy IO.
I realize that the situation is a little hazy for a "silver bullet" answer. I guess I'm more looking for a "these are the factors that are important to consider, this is the order you probably should check them in, and here's how you go about checking each of them" type answer.
I'm amazed you got down to 5% variation.
Unless you can get rid of all the unnecessary things running on your system, you will get high variation. That is the top-level issue.
Your OS needs to be deterministic. You need to know what other tasks and threads are running and their durations. For example, there is the clock interrupt. Now, how many other functions are chained to this interrupt? Do these other functions vary?
Is your system isolated? For example, your measurements may vary if your system is connected to a network.
Does your program use external resources? For example a hard drive. If the program writes to the hard drive, the drive will not be deterministic. Files and parts of files may move on the drive. The drive may become fragmented. This fragmentation may cause variance in your measurements.
The operating system memory may get fragmented. Also, the executable's memory may become fragmented. Fragmentation may add to the variance.
I'm working on a project made of some separate processes (services). Some services are called every second, some others every minute, and some services may not be called for days. (There are also some services that are called randomly, with no exact information about their call times.)
I have two approaches for developing the project: make the services always-running processes that use inter-process messaging, or write separate C++ programs and run the executable files when I need them.
I have two questions that I couldn't find a suitable answer to.
Is there any way I could calculate an approximate threshold that can help me decide when to use which approach?
How much faster are always-running processes? (I mean compared with initializing and running executable files in the OS.)
Edit 1: As mentioned in comments and Mats Petersson's answer, answer to my questions is heavily related to environment. Then I explain more about these conditions.
OS: CentOS 6.3
services are small (normally smaller than 1000 lines of code) and use no additional resources (such as a database)
I don't think anyone can answer your two direct questions, as it depends on many factors, such as "what OS", "what secondary storage", "how large the application is", and "what your application does" (loading the contents of a database with a million entries takes much longer than int x = 73; as the entirety of the initialization done outside main).
There is overhead with both approaches. Assuming there isn't enough memory to hold EVERYTHING in RAM at all times (and modern OS's will try to use the RAM as disk cache or for other caching, rather than keep old crusty application code that doesn't run, so eventually your application code will be swapped out if it isn't being run), you are going to have approximately the same amount of disk I/O for both solutions.
For me, "having memory available" trumps other things, so executing a process when it's needed is better than leaving it running in the expectation that it will be reused some time later. The only exceptions are if the executable takes a long time to start (in other words, it's large and has a complex starting procedure) AND it's run fairly frequently (at the very least several times per minute). Or you have high real-time requirements, so the extra delay of starting the process is significantly worse than the "we're holding it in memory" penalty (but bear in mind that holding it in memory isn't REALLY holding it in memory, since the content will be swapped out to disk anyway if it isn't being used).
Starting a process that was recently run is typically done from cache, so it's less of an issue. Also, if the application uses shared libraries (.so, .dll or .dynlib depending on OS) that are genuinely shared, then it will normally shorten the load time if that shared library is in memory already.
Both Linux and Windows (and I expect OS X) are optimised to load a program much faster the second time it executes in short succession - because it caches things, etc. So for the frequent calling of the executable, this will definitely work in your favour.
I would start by "execute every time", and if you find that this is causing a problem, redesign the programs to stay around.
I want to find out the time spent by a particular function in my program. For that purpose, I am making use of gprof. I used the following command to get the time for the specific function, but the log file still displays the results for all the functions present in the program. Please help me in this regard.
gprof -F FunctionName Executable gmon.out > log
You are nearly repeating another question about function execution time.
As I answered there, it is difficult (due to hardware!) to reliably get the execution time of some particular function, especially if that function takes little time (e.g. less than a millisecond). Your original question pointed to these methods.
I would suggest using clock_gettime(2) with CLOCK_REALTIME or perhaps CLOCK_THREAD_CPUTIME_ID.
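For example, a minimal sketch (my_function is a placeholder for the function you want to time; on older glibc you may need to link with -lrt):

#include <time.h>
#include <cstdio>

// Placeholder for the function whose execution time you want.
static void my_function() {
    volatile double x = 0;
    for (int i = 0; i < 1000000; ++i) x += i;
}

int main() {
    struct timespec t0, t1;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t0);   // or CLOCK_REALTIME
    my_function();
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // Keep in mind that a single sub-millisecond measurement is noisy.
    std::printf("elapsed: %.0f ns\n", ns);
    return 0;
}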
gprof(1) (after compilation with -pg) works with profil(3) and uses a sampling technique, based upon sending a SIGPROF signal (see signal(7)) at periodic intervals (e.g. every 10 milliseconds) from a timer set with setitimer(2) and ITIMER_PROF; so the program counter is sampled periodically. Read the wikipage on gprof and notice that profiling may significantly degrade the running time.
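To illustrate the mechanism (not gprof itself), here is a hedged sketch of such a periodic SIGPROF sampler built on setitimer(2); the handler name and the busy loop are made up for illustration, and a real profiler would record the interrupted program counter rather than just count signals:

#include <signal.h>
#include <sys/time.h>
#include <cstdio>

static volatile sig_atomic_t sample_count = 0;

static void on_sigprof(int) {
    // gprof would record where the program counter was at this moment.
    sample_count = sample_count + 1;
}

int main() {
    struct sigaction sa = {};
    sa.sa_handler = on_sigprof;
    sigaction(SIGPROF, &sa, nullptr);

    // Deliver SIGPROF every 10 ms of CPU time consumed by the process,
    // similar to the sampling interval mentioned above.
    struct itimerval tv = {};
    tv.it_interval.tv_usec = 10000;
    tv.it_value.tv_usec = 10000;
    setitimer(ITIMER_PROF, &tv, nullptr);

    volatile double x = 0;
    for (long i = 0; i < 200000000L; ++i) x += i * 0.5;  // busy work being sampled

    struct itimerval stop = {};
    setitimer(ITIMER_PROF, &stop, nullptr);   // stop sampling before printing
    std::printf("samples taken: %d\n", (int)sample_count);
    return 0;
}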
If your function gets executed in a short time (less than a millisecond) the profiling gives an imprecise measurement (read about heisenbugs).
In other words, profiling and measuring the time of a short-running function alters the behavior of the program (and this would happen with some other OS too!). You might have to give up the goal of measuring the timing of your function precisely, reliably, and accurately without disturbing it. It might not even make precise sense, e.g. because of the CPU cache.
You could use gprof without any -F argument and, if needed, post-process the textual profile output (e.g. with GNU awk) to extract the information you want.
BTW, the precise timing of a particular function might not be important. What is important is the benchmarking of the entire application.
You could also ask the compiler to optimize your program even more; if you are using link-time optimization, i.e. compiling and linking with g++ -flto -O2, the notion of the timing of a small function may even cease to exist (because the compiler and the linker could have inlined it without you knowing).
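For instance (a made-up two-file example; whether the call actually gets inlined depends on your GCC version and flags):

// helper.cpp
double small_helper(double v) { return v * v + 1.0; }  // a tiny function in another translation unit

// main.cpp
#include <cstdio>
double small_helper(double v);
int main() {
    double s = 0;
    for (int i = 0; i < 1000; ++i)
        s += small_helper(i);   // with g++ -flto -O2 this call may be inlined away
    std::printf("%f\n", s);
    return 0;
}

Built with g++ -flto -O2 main.cpp helper.cpp, there may be no call to small_helper left in the binary to time.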
Consider also that current superscalar processors have such a complex micro-architecture, with instruction pipelines, caches, branch predictors, register renaming, speculative execution, out-of-order execution, etc., that the very notion of timing a short function is undefined. You cannot predict or measure it.
I have a larger C++ program which starts out by reading thousands of small text files into memory and storing data in STL containers. This takes about a minute. Periodically, a compilation will exhibit behavior where that initial part of the program runs at about 22-23% CPU load. Once that step is over, it goes back to ~100% CPU. It is more likely to happen with the -O2 flag turned on, but not consistently. It happens even less often with the -p flag, which makes it almost impossible to profile. I did capture it once, but the gprof output wasn't helpful - everything runs with the same relative speed, just at low CPU usage.
I am quite certain that this has nothing to do with multiple cores. I do have a quad-core cpu, and most of the code is multi-threaded, but I tested this issue running a single thread. Also, when I run the problematic step in multiple threads, each thread only runs at ~20% CPU.
I apologize ahead of time for the vagueness of the question but I have run out of ideas as to how to troubleshoot it further, so any hints might be helpful.
UPDATE: Just to make sure it's clear, the problematic part of the code does sometimes (~30-40% of the compilations) run at 100% CPU, so it's hard to buy the (otherwise reasonable) argument that I/O is the bottleneck.
It's the buffer cache
My guess is that you are seeing the results of the Linux buffer cache in operation.
Those thousands of files will take a long time to read in from the disk, and the CPU will mostly be waiting on rotational and seek latencies. Reported CPU time used will be low when expressed as a percentage (but probably greater overall).
But once read, those small files are completely cached in memory and accessing each file (in subsequent runs) becomes a purely CPU-bound activity.
Whether the blocks remain in the cache depends on intervening activity, such as recompiles. When new programs are run and other files are read, the programs and the files will be cached and old blocks will be dropped; obviously, a memory-intensive workload will also clear out the buffer cache.
Since you're reading a ton of small files, your program is blocked waiting on disk I/O for the majority of the time. Since the CPU isn't busy while it's waiting for the disk to ship the data to it, you're seeing a load of significantly less than 100%. Once that's over, now you're CPU-bound, and your program will eat all available CPU time.
The fact that it works faster sometimes is because (as Jarryd & DigitalRoss mention) once you've read them into system memory, they're in the OS's cache, so subsequent loads will be an order of magnitude faster, unless they've been evicted by other disk I/O. So if you run the program back-to-back, the 2nd run will probably be much faster. If you wait a while (and do other stuff in the meantime), there may have been enough other disk I/O to evict those files from the cache, in which case it will take a long time to read them again.
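To see the effect for yourself, here is a small sketch (the file name is a placeholder; the first read is only "cold" if the file is not already cached, e.g. right after a reboot):

#include <chrono>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

// Read the whole file and return how long it took, in milliseconds.
static double read_all_ms(const std::string& path) {
    auto t0 = std::chrono::steady_clock::now();
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "read " << data.size() << " bytes, ";
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const std::string path = "some_input_file.txt";  // placeholder path
    std::cout << "first read:  " << read_all_ms(path) << " ms\n";
    std::cout << "second read: " << read_all_ms(path) << " ms\n";  // usually served from the buffer cache
    return 0;
}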
In addition to other answers mentioning the buffer cache, if you want to understand what is going on during a compilation, you could pass some of the flags below to GCC (i.e. to g++, probably as a CXXFLAGS setting in your Makefile):
-v to ask g++ to show the involved subprocesses (e.g. cc1plus for the C++ compiler proper)
-time to ask g++ to report the time of each sub-process
-ftime-report to ask g++ (actually cc1plus) to report the time of internal phases or passes inside the compiler.