Callgrind / kcachegrind why does running a program in valgrind increase sysCall time? - profiling

I've been profiling some code which likely spends a lot of it's execution time in system calls. Timing some functions manually and with callgrind, callgrind reports system call times of around 20, 30 or 40 times longer than simply timing the function (which would ofcourse include CPU time as well).
--collect-systime=ON was used to collect this sysCall time for each function.
As far as I know, callgrind works by counting the CPU instructions and for timing system calls simply lets the OS do the work and doesn't interfere. I don't understand why the time spent in sysCalls increases when profiling with callgrind, can anyone elaborate?
Is callgrind still a useful tool to profile time spent in sysCalls?

Can you try --collect-systime=usec and --collect-systime=nsec to see if they are significantly different? usec might be a bit faster.
When --collect-systime is specified, Valgrind will be calling one of the various time syscalls for each of the client application syscalls. I would expect that to add a substantial overhead, particularly if your client application is calling very many "fast" syscalls.

Related

Micro benchmarking C++ Linux

I am benchmarking a function of my c++ program using inline rdtsc(1st and last instruction in the function) My setup has isolated cores and hyper threading is off and the frequency is 3.5Mhz.
I cannot afford more than 1000 cpu cycles so i count the percentage of calls taking more than 1000 cpu cycles and that is approximately 2-3%. The structure being accessed in the code is huge and can certainly result in cache miss. But a cache miss is 300-400 cpu cycles.
Is there a problem with rdtsc benchmarking? If not, what else can cause a 2-3% of my cases going through the same set of instructions abruptly high number of cycles
I want help to understand what i should look for to understand this 2-3% of my WC(worst cases)
Often rare "performance noise" like you describe is caused by context switches in the timed region, where your process happened to exceed its scheduler quanta during your interval and some other process was scheduled to run on the core. Another possibility is a core migration by the kernel.
When you say "My setup has isolated cores", are you actually pinning your process to specific cores using a tool (e.g. hwloc)? This can greatly help to get reproducible results. Have you checked for other daemon processes that might also be eligible to run on your core?
Have you tried measuring your code using a sampling profiler like gprof or HPCToolkit? These tools provide alot more context and behavioral information that can be difficult to discover from manual instrumentation.

Analyzing spikes in performance measurement

I have a set of C++ functions which does some image processing related operation. Generally I see that the final output is delivered in 5-6ms time range. I am measuring the time taken using QueryPerformanceCounter Win32 API. But when running in a continuous loop with 100 images, I see that the performance spikes up to 20ms for some images. My question is how do I go about analyzing such issues. Basically I want to determine whether the spikes are caused due to some delay in this code or whether some other task started running inside the CPU because of which this operation took time. I have tried using GetThreadTimes API to see how much time my thread spent inside CPU but am unable to conclude based on those numbers. What is the standard way to go about troubleshooting these types of issues?
Reason behind sudden spikes during processing could be any of IO, interrupt, scheduled processes etc.
It is very common to see such spikes considering such low latency/processing time operations. IMO you can consider them because of any of the above mentioned reasons (There could be more). Simplest solution is run same experiment with more inputs multiple times and take the average for final consideration.
To answer your question about checking/confirming source of the spike you can try following,
Check variation in images - already ruled out as per your comment
Monitor resource utilization during processing. Check if any resource is choking (% util is simplest way to check and SAR/NMON utility on linux is best with minimal overhead)
Reserve few CPU's on system (CPU Affinity) for your experiment which are dedicated only for your program and no OS task will run on them. Taskset is simplest utility to try out. More details are here.
Run the experiment with this setting and check behavior.
That's a nasty thing you are trying to figure out, I wouldn'd even attempt to, since coming into concrete conlusions is hard.
In general, one should run a loop of many iterations (100 just seems too small I think), and then take the average time for an image to be processed.
That will rule out any unexpected exterior events that may have hurt performance of your program.
A typical way to check if "some other task started running inside the CPU" would be to run your program once and mark the images that produce that spike. Example, image 2, 4, 5, and 67 take too long to be processed. Run your program again some times, and mark again which images produce the spikes.
If the same images produce these spikes, then it's not something caused by another exterior task.
What is the standard way to go about troubleshooting these types of issues?
There are Real Time Operating Systems (RTOS) which guarantee those kind of delays. It is totally different class of operating systems than Windows or Linux.
But still, there are something you can do about your delays even on general purpose OS.
1. Avoid system calls
Once you ask your OS to read or write something to a disk -- there are no guarantees whatever about delays. So, avoid any system functions on you critical path:
even functions like gettimeofday() might cause unpredictable delays, so you should really avoid any system calls in time-critical code;
use another thread to perform IO and pass data via a shared buffer to your critical code.
If your code base is big, tools like strace on Linux or Dr Memory on Windows to trace system calls.
2. Avoid context switches
The multi threading on Windows is preemptive. It means, there is a system scheduler, which might stop your thread any time and schedule another thread on your CPU. As previously, there are RTOSes, which allow to avoid such context switches, but there is something you can do about it:
make sure there is at least one CPU core left for system and other tasks;
bind each of your threads to a dedicated CPU with SetThreadAffinityMask() (Windows) or sched_setaffinity() (Linux) -- this effectively hints system scheduler to avoid scheduling other threads on this CPU;
make sure hardware interrupts go to another CPU; usually interrupts go to CPU 0, so the easiest way would be to bind your thread with CPU 1+;
increase your thread priority, so scheduler less likely to switch your thread with another one.
There are tools like perf (Linux) and Intel VTune (Windows) to confirm there are context switches.
3. Avoid other non-deterministic features
Few more sources of unexpected delays:
disable swap, so you know for sure your thread memory will not be swapped on slow and unpredictable disk drive;
disable CPU turbo boost -- after a high-performance CPU boosts, there is always a slow down, so the CPU stays withing its thermal power (TDP);
disable hyper threading -- from scheduler point of view those are independent CPUs, but in fact performance of each hyper-thread CPU depend on what another thread is doing at the moment.
Hope this helps.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation as the server runs on two machines. One is a PC the other is an embedded device with an Arm processor. Even the embedded device only gets to about 2% CPU usage but it does far fewer transactions - about a tenth. Both systems only get up to about 2% even though we're trying to get data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there anyway to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (eg. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described eg. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of simple instructions counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather sufficient number of samples to get detailed results.

Running a process multiple times at the same time

I have a c++ program with opencv library which takes an image as input and perform pose estimation,color detection,phog. When I run this program from the command line it takes around 4-5sec to complete. It takes around 60%cpu. When I try to run the same program from two different command line windows at the same time the process takes around 10-15 sec to finish and both the process finish in almost the same time. The CPU Usage reaches upto 100%.
I have a website which calls this c++ exe using exec() command. So when two users try to upload an image and run it takes more time as I explained above in the command line. Is this because the c++ program involves high computation and the CPU reaches 100% it slows down? But I read that the CPU reaching 100% is not a bad thing as the computer is using its full capacity to run the program. So is this because of my c++ program or is it something to do with my server(computer) settings? This is probably not the apache server problem because when I try to run it from the command line also it slows down. I am using a quad core processor and all the 4 CPU reaches 100% when I try to run the same process at the same time so I think that its distributed among all the processor. So I have few more questions:
1) Can this be solved by using multithreading in my c++ code?As for now I am not using it but will multithreading make the c++ code more computationally expensive and increase the CPU usage(if this is the problem).
2) What can be the reason of it slowing down? Is the process in a queue and each process is ran only a certain amount of time and it switches between the two process?
3) If this is because it involves high computation will it help if I change some functions to opencv gpu functions?
4) Is there a way I can solve this problems any ideas or tips?
I have inserted the result of top when running one process and running the same process twice at the same time:
Version5 is the process,running it once
Two Version5 running at the same time
The CPU info:
Thanks in advance.
After zooming so that your picture fills almost my entire 22" screen, I can make out that the CPU flags show "ht", which means "hyperthreading", so you actually only have two genuine cores, that are shared between two hyperthreads. So running on all four CPU cores at once will not give the same performance as running on two genuine cores.
In other words, the "loss of performance" is entirely as you'd expect, because you have four threads fighting for the actual computational resources of two CPU cores. Hyperthreading helps if the code has a lot of memory interaction that can be "hidden" by running a second thread. But if you have a CPU intensive code, that isn't "missing in the cache" much, then the gain is much less, and in extreme cases, hyperthreading will actually cause slow-downs (because the code in one thread disrupts the caches and otherwise "gets in the way" of the first thread). You may want to experiment by going into the BIOS settings and turn off the hyperthreading, and compare the results. Sure, running two instances of the code will clearly still take longer, but the question is "is it longer than running with hyperthreading" - unfortunately, it's impossible to say for sure which is better from a theoretical standpoint (even if I could see the assembly code and understood the memory access patterns - without that level of detail, it's completely impossible to judge).
When running only one process reaches 60% of CPU usage it would be possible that using multithreading speeds up the execution. However, the CPU usage is likely to be higher
That's true. There might be an additional overhead for context switching (multitasking)
Changing functions can bring some improvements, but without having your code it is hard say.
Since the computational effort is that high, I think you have to decide whether you accept a high CPU usage or a longer execution time (of course after optimizing the code itself)
Greetz

Make a program run slowly

Is there any way to run a C++ program slower by changing any OS parameters in Linux? In this way I would like to simulate what will happen if that particular program happens to run on a real slower machine.
In other words, a faster machine should behave as a slower machine to that particular program.
Lower the priority using nice (and/or renice). You can also do it programmatically using nice() system call. This will not slow down the execution speed per se, but will make Linux scheduler allocate less (and possibly shorter) execution time frames, preempt more often, etc. See Process Scheduling (Chapter 10) of Understanding the Linux Kernel for more details on scheduling.
You may want to increase the timer interrupt frequency to put more load on the kernel, which will in turn slow everything down. This requires a kernel rebuild.
You can use CPU Frequency Scaling mechanism (requires kernel module) and control (slow down, speed up) the CPU using the cpufreq-set command.
Another possibility is to call sched_yield(), which will yield quantum to other processes, in performance critical parts of your program (requires code change).
You can hook common functions like malloc(), free(), clock_gettime() etc. using LD_PRELOAD, and do some silly stuff like burn a few million CPU cycles with rep; hop;, insert memory barriers etc. This will slow down the program for sure. (See this answer for an example of how to do some of this stuff).
As #Bill mentioned, you can always run Linux in a virtualization software which allows you to limit the amount of allocated CPU resources, memory, etc.
If you really want your program to be slow, run it under Valgrind (may also help you find some problems in your application like memory leaks, bad memory references, etc).
Some slowness can be achieved by recompiling your binary with disabled optimizations (i.e. -O0 and enable assertions (i.e. -DDEBUG).
You can always buy an old PC or a cheap netbook (like One Laptop Per Child, and don't forget to donate it to a child once you are done testing) with a slow CPU and run your program.
Hope it helps.
QEMU is a CPU emulator for Linux. Debian has packages for it (I imagine most distros will). You can run a program in an emulator and most of them should support slowing things down. For instance, Miroslav Novak has patches to slow down QEMU.
Alternatively, you could cross compile to another CPU-linux (arm-none-gnueabi-linux, etc) and then have QEMU translate that code to run.
The nice suggestion is simple and may work if you combine it with another process which will consume cpu.
nice -19 test &
while [ 1 ] ; do sha1sum /boot/vmlinuz*; done;
You did not say if you need graphics, file and/or network I/O? Do you know something about the class of error you are looking for? Is it a race condition, or does the code just perform poorly at a customer site?
Edit: You can also use signals like STOP and CONT to start and stop your program. A debugger can also do this. The issue is that the code runs a full speed and then gets stopped. Most solutions with the Linux scheduler will have this issue. There was some sort of thread analyzer from Intel afair. I see Vtune Release Notes. This is Vtune, but I was pretty sure there is another tool to analyze thread races. See: Intel Thread Checker, which can check for some thread race conditions. But we don't know if the app is multi-threaded?
Use cpulimit:
Cpulimit is a tool which limits the CPU usage of a process (expressed in percentage, not in CPU time). It is useful to control batch jobs, when you don't want them to eat too many CPU cycles. The goal is prevent a process from running for more than a specified time ratio. It does not change the nice value or other scheduling priority settings, but the real CPU usage. Also, it is able to adapt itself to the overall system load, dynamically and quickly.
The control of the used cpu amount is done sending SIGSTOP and SIGCONT POSIX signals to processes.
All the children processes and threads of the specified process will share the same percent of CPU.
It's in the Ubuntu repos. Just
apt-get install cpulimit
Here're some examples on how to use it on an already-running program:
Limit the process 'bigloop' by executable name to 40% CPU:
cpulimit --exe bigloop --limit 40
cpulimit --exe /usr/local/bin/bigloop --limit 40
Limit a process by PID to 55% CPU:
cpulimit --pid 2960 --limit 55
Get an old computer
VPS hosting packages tend to run slowly, have lots of interruptions, and wildly varying latencies. The cheaper you go the worse the hardware will be. Unlike truly old hardware, there is a good chance they will contain instruction sets (SSE4) that are not usually found on old hardware. Neverthless, if you want a system that walks slowly and shutters often, a cheap VPS host will be the quickest start.
If you just want to simulate your program to analyze its behavior on really slow machine, you can try making your whole program to run as a thread of some other main program.
In this manner you can prioritize same code with different priorities in few threads at once and collect data of your analysis. I have used this in game development for frame-processing analysis.
Use sleep or wait inside of your code. Its not the brightest way to do but acceptable in all kind of computer with different speeds.
The simplest possible way to do it would be to wrap your main runable code in a while loop with a sleep at the end of it.
For example:
void main()
{
while 1
{
// Logic
// ...
usleep(microseconds_to_sleep)
}
}
As people will mention, this isn't the most accurate way, since your logic code will still run at normal speed but with delays between runs. Also, it assumes that your logic code is something that runs in a loop.
But it is both simple and configurable.
You can increase load on CPUs via launching tools like stress or stress-ng