Is callgrind profiling influenced by other processes?

I'd like to profile my application using Callgrind. Since profiling takes a very long time, in the meantime I carry on with web browsing, compiling, and other intensive tasks on the same machine.
Am I biasing the profiling results? I expect that, since Valgrind uses a simulated CPU, external processes should not interfere with the Valgrind run. Am I right?

By default, Callgrind does not record anything related to time, so you can expect all collected metrics to (mostly) be independent of other processes on the machine. As the Callgrind manual states,
By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
As such, the metrics Callgrind reports should depend only on what instructions the program executes on the (simulated) CPU, not on how long those instructions take. Indeed, Callgrind's output can at times be somewhat misleading, as the simulated CPU might operate differently from the real one (particularly when it comes to branch prediction).
The Callgrind paper presented at ICCS 2004 is very clear about this as well:
We note that the simulation is not able to predict consumed wall clock time, as this would need a detailed simulation of the microarchitecture.
In any case, however, the simulated CPU is unaffected by what the real CPU is doing.
The reason is straightforward.
Like you said, your program is not executed on your machine at all.
Instead, at runtime, Valgrind dynamically translates your program: it disassembles the binary into "UCode" for a simulated machine, adds analysis code (called instrumentation), and then generates binary code that executes the simulation.
The addition of analysis code is what makes instruction counting (in Callgrind), memory checking (in Memcheck), and all other plugins possible.
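Incidentally, Callgrind exposes this instrumentation to the program itself through client-request macros, which lets you restrict collection to the region you actually care about. A minimal sketch, assuming the Valgrind development headers are installed (interesting_work is a hypothetical stand-in):

    #include <valgrind/callgrind.h>

    // Hypothetical stand-in for the code whose instruction counts we want.
    void interesting_work() {
        volatile long sum = 0;
        for (long i = 0; i < 1000000; ++i) sum += i;
    }

    int main() {
        // Run as: valgrind --tool=callgrind --instr-atstart=no ./prog
        CALLGRIND_START_INSTRUMENTATION;  // begin translating/instrumenting
        CALLGRIND_TOGGLE_COLLECT;         // begin collecting event counts
        interesting_work();
        CALLGRIND_TOGGLE_COLLECT;         // stop collecting
        CALLGRIND_STOP_INSTRUMENTATION;   // drop back to fast execution
        return 0;
    }

(The macros compile to cheap no-ops when the program runs outside Valgrind.)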
Therein lies the twist, however.
Naturally, there are limits to how isolated the program can be in such a dynamic simulation.
First, your program might interact with other programs.
While the time spent doing so is irrelevant (it is not accounted for), the return codes of inter-process communication can certainly change depending on what else is going on in the system.
Second, most system calls need to run untranslated, and their return codes can change as well -- leading to different execution paths through your program and, thus, slightly different metrics being collected. (As an aside, Callgrind offers the --collect-systime option to record the wall-clock time spent in syscalls, which will always be affected by whatever else goes on in the system.)
More details about these restrictions can be found in the PhD Dissertation of Nicholas Nethercote ("Dynamic Binary Analysis and Instrumentation").

Related

Performance counter for Halide?

Is there a performance counter available for code written in the Halide language? I would like to know how many loads, stores, and ALU operations are performed by my code.
The Halide tutorial for scheduling multi-stage pipelines compares different schedules by comparing the amount of allocated memory, loads, stores, and calls to Halide Funcs, but I don't see how this information was collected. I suppose it might be possible to use trace_stores, trace_loads, and trace_realizations to print to the console every time one of these operations occurs. This isn't a great option, though, because it would greatly slow down the program's execution and would require some kind of counting script to compile the long list of console outputs into the desired counts of loads, stores, and ALU operations.
I'm pretty sure they just used the trace_xxx output and ran some scripts/programs on it.
If you're looking for real performance numbers on an x86 platform, I would go with Intel VTune Amplifier. It's pretty expensive, but it may be free if you're in academia (student, teacher, researcher) or if it's for an open-source project.
Other than that, you can look at the lowered statement code by setting HL_DEBUG_CODEGEN=1 in the environment to get a better idea of the loop structure and data use. Note that this output goes to stderr, not stdout.
EDIT: For Linux, there's perf.
We do not have any perf-counter-based support at present. It is fairly difficult to make it portable. (And on mobile devices, the OS often simply doesn't allow access to the hardware.) The support in Profiling.cpp and src/profiling.cpp could likely be used to drive perf-counter operation. The profiling lowering pass adds code that calls routines in the runtime which update information about Func and Pipeline execution. This information is collected and aggregated by another thread.
If tracing is run to a file (e.g. using HL_TRACE_FILE) a binary format is used and it is a bit more efficient. See utils/HalideTraceViz for a tool to work with the binary format. This is generally how analyses are done within the team.
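For reference, a minimal sketch of turning tracing on for a Func (the pipeline itself is illustrative, and the exact realize() signature varies a bit between Halide versions):

    #include "Halide.h"
    using namespace Halide;

    int main() {
        Func f("f"), g("g");
        Var x("x"), y("y");

        f(x, y) = x + y;
        g(x, y) = 2 * f(x, y);
        f.compute_root();  // materialize f so its loads/stores actually occur

        f.trace_stores();  // emit a trace event for every store to f
        f.trace_loads();   // ...and for every load from f

        // With HL_TRACE_FILE=trace.bin in the environment, the events are
        // written in the binary format that utils/HalideTraceViz understands.
        g.realize({256, 256});
        return 0;
    }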
There was a small amount of investigation of OProfile, which looked promising but I don't think we ever got code working.

How to get an accurate performance measure?

In our project we're trying to automatically monitor the performance of test runs, to make sure that we don't have any significant changes in the performance of the program over time.
The problem is that there seems to be a consistent 5% variability in the measurements we get. That is, on the same machine, with the same program (no recompilation), running the same test, we get values that differ by around 5% from run to run. This is far too much for what we want to use the numbers for.
We're already excluding setup costs from the timing - that is, from within the C++ code itself we grab the time immediately before and after running the time-critical portions, rather than timing the whole program at the OS level. We also do averaging and outlier exclusion. The problem is that the variability appears to have long-term trends as well: we get tight clustering of times for replicates run right after each other, but an hour or two later the times are substantially different. (Unfortunately, spreading the test out over several hours is not feasible.) The tests are also run on a dedicated machine while "nothing else" is running on it.
We're not quite sure where the timing variation is coming from, but it may have to do with the processor and the system - there are indications that the size of the variability depends on which machine the program runs on.
Does anyone have an idea where this variation is likely to be coming from, and how to remove it? The tests are running on a dedicated machine, so changing the operating system settings would be possible.
(As indicated by the tags, this is a C++ program running on an x86 Linux system, if that helps clarify things.)
Edit: Response to comments
Our current timing scheme is to use the clock() function from the C standard library and look at the difference between its return values from before and after the functions we want to test.
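For reference, a minimal sketch of the scheme described above (run_time_critical_section is a hypothetical stand-in for the code under test):

    #include <cstdio>
    #include <ctime>

    void run_time_critical_section() { /* hypothetical code under test */ }

    int main() {
        // clock() returns CPU time consumed by the process (not wall-clock
        // time), in ticks of CLOCKS_PER_SEC.
        std::clock_t start = std::clock();
        run_time_critical_section();
        std::clock_t end = std::clock();
        std::printf("elapsed: %.6f s\n",
                    double(end - start) / CLOCKS_PER_SEC);
        return 0;
    }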
The code we're testing should be deterministic, and shouldn't involve heavy IO.
I realize that the situation is a little hazy for a "silver bullet" answer. I guess I'm more looking for a "these are the factors that are important to consider, this is the order you probably should check them in, and here's how you go about checking each of them" type answer.
I'm amazed you got down to 5% variation.
Unless you can get rid of all the unnecessary things running on your system, you will get high variation. That is the issue at the top level.
Your OS needs to be deterministic. You need to know what other tasks and threads are running and their durations. For example, there is the clock interrupt: how many other functions are chained to this interrupt, and do those functions vary?
Is your system isolated? For example, your measurements may vary if your system is connected to a network.
Does your program use external resources, for example a hard drive? If the program writes to the hard drive, the drive will not be deterministic: files and parts of files may move on the drive, and the drive may become fragmented. This fragmentation may cause variance in your measurements.
The operating system memory may get fragmented. Also, the executable's memory may become fragmented. Fragmentation may add to the variance.

Finding performance issue that may be due to thread locking (possibly)

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communication using many threads. After some time spent improving the performance, I realised that in this particular test scenario the process is not CPU-bound, so the performance "improvements" I'd been looking at were of no use.
In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation, as the server runs on two machines: one is a PC, the other an embedded device with an ARM processor. Even the embedded device only gets to about 2% CPU usage, but it does far fewer transactions - about a tenth as many. Both systems only get up to about 2% even though we're trying to move data as fast as possible.
My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?
Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.
We might have to try walking through the code or something but not sure that we have time.
Any tips appreciated.
Thanks.
Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.
This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:
time spent waiting for locks
time spent sleeping (e.g. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind; see the sketch after this list)
time spent waiting for disk I/O or network I/O
time spent waiting for data that was swapped out
influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
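As a concrete illustration of the sleeping case, a program like the following (a minimal sketch) burns a second of wall-clock time while executing almost no instructions, so Callgrind attributes almost no cost to it:

    #include <chrono>
    #include <thread>

    int main() {
        // Almost no instructions execute here, so Callgrind sees almost
        // nothing, yet the program takes about a second of wall-clock time.
        std::this_thread::sleep_for(std::chrono::seconds(1));
        return 0;
    }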
These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor man's sampling profiler" as described e.g. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
While statistical profilers are really useful because they provide "real-world" results instead of bare instruction counts, they have the possible drawback that you don't easily get reproducible results (the results vary a little with every run), and that you need to gather a sufficient number of samples to get detailed results.

Measuring performance/throughput of fast code ignoring processor speed?

Is there a way I could write a "tool" which could analyse the x86 assembly language produced from a C/C++ program and measure the performance in such a way that it wouldn't matter whether I ran it on a 1 GHz or a 3 GHz processor?
I am thinking more along the lines of instruction throughput. How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. If you want static analysis you'll be in "undecidable land", as the other answer pointed out (you can't even detect whether a given piece of code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example: all memory reads hit the L1 data cache with 4-cycle latency; all instructions are in the L1 instruction cache, or in the trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details of everything else differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they take (and the resources they occupy during that time) varies as well. The accuracy of branch prediction is hard to predict, and processors take different numbers of cycles to recover from mispredictions. Caches are different sizes, take different times to access, and have different algorithms for deciding what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targeting, and the more you constrain the code you are evaluating, the better you can predict how the code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly little-known) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to its interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
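For reference, IACA analyzes a region you delimit with marker macros; a minimal sketch of the usual loop pattern, assuming iacaMarks.h from the IACA distribution is on the include path:

    #include "iacaMarks.h"  // defines the IACA_START / IACA_END markers

    void saxpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; ++i) {
            IACA_START      // start of the analyzed instruction window
            y[i] += a * x[i];
        }
        IACA_END            // end of the analyzed instruction window
    }

You then compile this to an object file and point the iaca binary at it; the markers tell it which instruction window to model.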
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

Measuring running time of computational geometry algorithms

I am taking a course on computational geometry in the fall, where we will be implementing some algorithms in C or C++ and benchmarking them. Most of the students generate a few datasets and measure their programs with the time command, but I would like to be a bit more thorough.
I am thinking about writing a program to automatically generate different datasets, run my program with them and use R to test hypotheses and estimate parameters.
So... How do you measure program running time more accurately?
What might be relevant to measure?
What hypotheses might be interesting to test (variance, effects caused by caching, etc.)?
Should I test my code on more than one machine? How should these machines differ?
My overall goals are to learn how these algorithms perform in practice, which implementation techniques are better and how the hardware actually performs.
Profilers are great. Valgrind is pretty popular. Also, I'd suggest trying your code out on RISC machines if you can get access to some. Their performance characteristics differ from those of CISC machines in interesting ways.
You could use the Windows API timing functions (though they are not that exact), or you can use the RDTSC instruction, which is sub-nanosecond exact (don't forget that the instruction and the code around it add a small overhead of a few hundred cycles, but this is not a big issue).
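For example, a minimal sketch using the compiler's __rdtsc() intrinsic instead of hand-written inline assembly (algorithm_under_test is a hypothetical stand-in):

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // __rdtsc() on GCC/Clang; MSVC has it in <intrin.h>

    void algorithm_under_test() { /* hypothetical code being measured */ }

    int main() {
        uint64_t start = __rdtsc();
        algorithm_under_test();
        uint64_t end = __rdtsc();
        // The count is in TSC ticks; converting to seconds requires
        // knowing the TSC frequency of your particular CPU.
        std::printf("elapsed cycles: %llu\n",
                    static_cast<unsigned long long>(end - start));
        return 0;
    }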
In order to get better accuracy with program metrics, you will have to run your program many times, such as 100 or 1000.
For more details, on metrics, search the web for metrics and profiling.
Beware that programs may differ in performance (time) measurements due to things running in the background such as virus scanners, music players, and other programs with timers in them.
You could test your program on different machines. Processor clock rates, L1 and L2 cache sizes, RAM sizes, and disk speeds are all factors (as well as the number of other programs/tasks running concurrently). Floating-point performance may also be a factor.
If you want, you can challenge your compiler by printing the assembly listings for various optimization settings, and see which setting produces the fewest instructions or the most efficient assembly code.
Since you're processing data, look at data-driven design: http://www.gamearchitect.net/Articles/DataDrivenDesign.html
You can use the Windows High Performance Counter to get nanosecond accuracy. Technically, AFAIK, the counter can run at any speed, but you can query its counts per second, and as far as I know it ticks at a very high frequency on most CPUs.
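A minimal sketch of that approach (the code under test is elided):

    #include <windows.h>
    #include <cstdio>

    int main() {
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);  // counter ticks per second
        QueryPerformanceCounter(&start);
        // ... code under test ...
        QueryPerformanceCounter(&end);
        std::printf("elapsed: %.9f s\n",
                    double(end.QuadPart - start.QuadPart) / freq.QuadPart);
        return 0;
    }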
What you should do is just get a professional profiler; that's what they're for. More realistically, however: if you're only comparing algorithms, then as long as your machine doesn't happen to excel in one area (Pentium D, SSD, that sort of thing) it shouldn't matter too much to do it on just one machine. If you want to look at cache effects, try running the algorithm right after the machine starts up (make sure you get a copy of Windows 7; it should be free for CS students), then leave the machine doing something plenty cache-heavy, like image processing, for 24 hours or so to convince the OS to cache it. Then run the algorithm again and compare.
You didn't specify your platform. If you are on a POSIX system (e.g. Linux), have a look at clock_gettime. It lets you access different kinds of clocks, e.g. wall-clock time or CPU time, and also lets you query the precision of each clock.
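A minimal sketch (CLOCK_MONOTONIC is one of several clock IDs you can pass; CLOCK_PROCESS_CPUTIME_ID, for instance, measures per-process CPU time instead):

    #include <cstdio>
    #include <time.h>

    int main() {
        timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        // ... code under test ...
        clock_gettime(CLOCK_MONOTONIC, &end);
        double s = (end.tv_sec - start.tv_sec)
                 + (end.tv_nsec - start.tv_nsec) * 1e-9;
        std::printf("elapsed: %.9f s\n", s);
        return 0;
    }

clock_getres() reports the resolution of each clock.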
Since you are willing to do good statistics on your numbers, you should repeat your experiments often enough that the statistical tests give you enough confidence.
If your measurements are not too fine-grained and your variance is low, 10 or so probes are often quite good. But if you go down to a small scale, a short function or so, you might need to go much higher.
Also, you will have to ensure reproducible experimental conditions: no other load on the machine, enough memory available, and so on.