Interpreting GPerfTools sample count - profiling

I'm struggling a little with reading the textual output that GPerfTools generates. I think part of the problem is that I don't fully understand how the sampling method operates.
From Wikipedia I gather that sampling profilers usually work by having the OS periodically interrupt the program and query its current instruction pointer. Now my knowledge of assembly is a little rusty, so I'm wondering what it means if the instruction pointer points to method m at any given time. Does it mean the function is about to be called, or that it is currently being executed, or both?
There's a difference, if I'm not mistaken, because in the first case the sample count (i.e. the number of times m is seen while taking a sample) translates to the absolute call count of m, while in the latter case it merely translates to the number of times m was seen, i.e. an indication of the relative time spent in this method.
Can someone clarify?

Related

Detecting bad input for `boost::math::tools::brent_find_minima()`

This documentation page of boost::math::tools::brent_find_minima says about its first argument:
The function to minimise: a function object (or C++ lambda) ... with no maxima occurring in that interval.
But what happens if this is not the case? (After all, this condition is rather difficult to ensure in advance, especially since the function is usually expensive to evaluate at many points.) Best would be to detect violations of this condition on the fly.
If this condition is violated, does boost throw an exception, or does it exhibit undefined behavior?
A workaround I am thinking of is to build the checking into the lambda ("function to minimize"), by capturing and maintaining a std::map<double,double> holding all the points that have been evaluated, and comparing each new evaluation with its nearest neighbor in each direction, to check whether there may be a local maximum. But I don't want to do all that if it isn't necessary.
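For what it's worth, a minimal sketch of that workaround might look like the following (the objective f, the bracket [0, 4], the bit count, and the "higher than both recorded neighbours" test are all illustrative; the test is only a hint of an interior maximum, not a proof):

#include <boost/math/tools/minima.hpp>
#include <iterator>
#include <limits>
#include <map>
#include <stdexcept>
#include <utility>

// Illustrative objective; in practice this is the expensive function.
double f(double x) { return (x - 2.0) * (x - 2.0); }

int main() {
    std::map<double, double> seen;   // x -> f(x) for every evaluation so far

    auto checked_f = [&seen](double x) {
        double y = f(x);
        std::map<double, double>::iterator it = seen.insert(std::make_pair(x, y)).first;
        // Compare against the nearest recorded neighbour on each side; a point
        // higher than both of them is evidence of an interior maximum.
        if (it != seen.begin() && std::next(it) != seen.end()) {
            if (y > std::prev(it)->second && y > std::next(it)->second)
                throw std::runtime_error("possible local maximum in the bracket");
        }
        return y;
    };

    int bits = std::numeric_limits<double>::digits / 2;
    std::pair<double, double> r =
        boost::math::tools::brent_find_minima(checked_f, 0.0, 4.0, bits);
    // r.first is the abscissa of the minimum, r.second the minimum value.
    (void)r;
}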
There is no way for this to be done. If you read Corless's A Graduate Introduction to Numerical Methods, you'll find a very interesting point: all numerically defined functions are discontinuous halfway between representables, and have zero derivatives between representables. Basically they can be thought of as a sum of Heaviside functions.
So none of them are differentiable in the mathematical sense. Ok, maybe you think this is a bit unfair: the scale should be zoomed out. But how much? We know that |x-1| isn't differentiable at x=1, but how could a computer tell that? How does it know that there isn't some locally smooth mollifier that makes it differentiable between x=1-eps and x=1+eps? I don't think there's a good answer to this question.
One of the most difficult problems in this class arises in quadrature. Some of these methods work fast when the complex extension of the function has poles far from the real axis. Try to numerically determine that.
Function spaces are impossible to determine numerically. Users just have to get it right.

Implementing an Interrupt driven Sampling Profiler

I am trying to create a sampling profiler that works on Linux. I am unsure how to send an interrupt, or how to get the program counter (PC), so I can find out where the program is when I interrupt it.
I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack for the thread I am in when I call raise(SIGUSR1), rather than for the thread the program is being run on.
I am not really sure if this is even the right way of going about it...
Any advice?
If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof).
These are its characteristics.
There are two phases. First is the data-gathering phase:
When it takes a sample, it reads the whole call stack, not just the program counter.
It can take samples even when the process is blocked due to I/O, sleep, or anything else.
You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.
Second is the data-presentation phase.
What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses.
Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.
The key piece of useful information is residency fraction (usually expressed as a percent).
If there are a total of m stack samples, and line of code L is present anywhere on n of the samples, then its residency fraction is n/m.
This is true even if L appears more than once in a sample; it still counts as just one sample that L appears on.
The importance of residency fraction is that it directly indicates what fraction of time statement L is responsible for.
If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%.
This means that if L could somehow be removed, total time would decrease by roughly 30%.
It is typically known as inclusive percent.
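For illustration, computing residency fractions from a pile of stack samples takes only a few lines (the types and names here are made up):

#include <map>
#include <set>
#include <vector>

typedef std::vector<void*> StackSample;   // return addresses, innermost first

// For every address that occurs in the samples, return the fraction of
// samples it appears on (counting an address at most once per sample).
std::map<void*, double> residency_fractions(const std::vector<StackSample>& samples) {
    std::map<void*, double> fractions;
    if (samples.empty())
        return fractions;
    std::map<void*, int> counts;
    for (const StackSample& s : samples) {
        std::set<void*> unique(s.begin(), s.end());   // once per sample
        for (void* addr : unique)
            counts[addr] += 1;
    }
    for (const auto& kv : counts)
        fractions[kv.first] = double(kv.second) / double(samples.size());
    return fractions;
}

Mapping addresses back to lines of code would then go through the debug information, exactly as described above.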
You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F.
So you can determine the residency fraction for functions, as opposed to lines of code.
That would give you inclusive percent by function.
You could look at function pairs, for example: on what fraction of samples do you see function F calling function G?
That would give you the information that makes up call graphs.
There are all kinds of information you can get out of the stack samples.
One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines or functions immediately below it.
On each of these, you can show the residency fraction.
You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce.
That's how you speed up the code.
Whatever you do for output, I think it is very important to allow the user to actually examine a small number of the samples themselves, randomly selected.
They convey far more insight than can be gotten from any method that condenses the information.
As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:
Self time. A useless number. Look at some reasonable-size programs and you will see why.
Invocation counts. Of no help in finding code with high residency fraction, and you can't get them from samples alone anyway.
High-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. That is still pretty big, big enough to fix. What happens in most profilers is that people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.
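(For reference, those error bars are just the binomial standard error sqrt(p*(1-p)/n) for an observed fraction p over n samples: sqrt(0.3*0.7/1000) is about 0.014, i.e. 1.4 points, and sqrt(0.3*0.7/10) is about 0.14, i.e. 14 points.)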
You can send a signal to a specific thread using pthread_kill and the tid (gettid()) of the target thread.
The right way to create a simple profiler is to use setitimer, which can deliver a periodic signal (SIGALRM or SIGPROF), for example every 10 ms; or POSIX timers (timer_create, timer_settime, or timerfd), without needing a separate thread to send the profiling signals. Check the sources of google-perftools (gperftools): they use setitimer or POSIX timers and collect the profile with backtraces.
gprof also uses setitimer to implement CPU-time profiling (9.1 Implementation of Profiling: "Linux 2.0 ... arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").
For example, here is the result of a code search for setitimer in gperftools's sources: https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs
void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}
You should know that setitimer has problems with fork and clone, and it doesn't work with multithreaded applications. There is an attempt at a helper wrapper: http://sam.zoy.org/writings/programming/gprof.html (a flawed one), but I don't remember whether it works correctly (setitimer usually sends a process-wide signal, not a thread-wide one). UPD: it seems that since Linux kernel 2.6.12, setitimer's signal is directed to the process as a whole (any thread may get it).
To direct the signal from timer_create to a specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and the SIGEV_THREAD_ID flag. I haven't checked how to create a periodic POSIX timer with timer_create (probably with timer_settime and a non-zero it_interval).
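As a concrete illustration of the setitimer route (a sketch, not gperftools's actual code), the following installs a SIGPROF handler and grabs a backtrace on every tick. Note that backtrace() is not formally async-signal-safe, and the interrupted thread's program counter is also available, in a platform-specific way, from the ucontext argument; a production profiler collects raw addresses with its own unwinder and defers symbolization:

#include <execinfo.h>    /* backtrace, backtrace_symbols_fd */
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define MAX_DEPTH 64

static void prof_handler(int sig, siginfo_t *info, void *ucontext) {
    void *frames[MAX_DEPTH];
    int n = backtrace(frames, MAX_DEPTH);
    /* A real profiler would append the raw addresses to a buffer; printing
       symbol names here is only for demonstration. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    (void)sig; (void)info; (void)ucontext;
}

static void start_profiling(int hz) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000000 / hz;   /* e.g. hz = 100 -> a tick every 10 ms */
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, NULL);       /* ticks only while the process consumes CPU */
}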
PS: there is an overview of profiling in Wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So results may vary, but all in all we were told that we should get a result close to 1 cycle-length. The actual result doesn't matter as long as the programming is correct.
My question is: what is a good operation to profile?
There are two points I need to consider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instructions. This means I have to choose an operation that wouldn't be collapsed into a single one by compiler optimization (we can't use the -O0 flag, as the school staff doesn't).
Bad example: var = i; here the compiler would only perform the last assignment.
What is a real 'simple instruction'? How do I know how many operations are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough; any idea would be great.
Thanks anyway.
P.S. I don't know if it matters, but I write in C++.
1) This sounds (to me) like an impossible task if optimizations are (or might be) enabled. You can never be sure what the compiler will do during optimization. I'd definitely do something like reusing the previous result (see the sketch after this answer). If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it could still be optimized).
2) As for instructions: one assembler command is one instruction. E.g. a += i will, depending on the available instruction set and such, most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"): x86 assemblers (and those for most other common processors) prefer instruction target, source, while DSPs prefer instruction source, target. Just important to know: moving data has to happen through registers, so even a single assignment like a = b will result in two instructions (b to a register, and the register to a).
In general, if this answer goes in the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.
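For point 1, a sketch of such a measurement loop might look like this (it assumes GCC or Clang and C++11; the empty asm is only an optimization barrier that stops the ten dependent adds from being folded into a single add of 10):

#include <chrono>
#include <cstdio>

// Tells the compiler the value in x may have changed, without emitting any
// instruction, so consecutive "acc += 1" operations cannot be merged.
static inline void keep(int &x) { asm volatile("" : "+r"(x)); }

int main() {
    const long iterations = 100000;
    int acc = 0;

    auto start = std::chrono::high_resolution_clock::now();
    for (long i = 0; i < iterations; ++i) {
        // 10 unrolled, data-dependent adds per iteration.
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
        acc += 1; keep(acc);
    }
    auto stop = std::chrono::high_resolution_clock::now();

    double total_us = std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("acc=%d, average %f us per add\n", acc, total_us / (iterations * 10.0));
}

The loop counter itself costs an increment and a compare per iteration, so for a cleaner number you would time an empty loop the same way and subtract it.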

Which is faster (mask >> i & 1) or (mask & 1 << i)?

In my code I must choose one of these two expressions (where mask and i are non-constant integers, with -1 < i < (sizeof(int) << 3) + 1). I don't think this will make the performance of my program better or worse, but it is very interesting to me. Do you know which is better, and why?
First of all, whenever you find yourself asking "which is faster", your first reaction should be to profile, measure and find out for yourself.
Second of all, this is such a tiny calculation, that it almost certainly has no bearing on the performance of your application.
Third, the two are most likely identical in performance.
C expressions cannot be "faster" or "slower", because the CPU cannot evaluate them directly.
Which one is "faster" depends on the machine code your compiler will be able to generate for these two expressions. If your compiler is smart enough to realize that in your context both do the same thing (e.g. you simply compare the result with zero), it will probably generate the same code for both variants, meaning that they will be equally fast. In such case it is quite possible that the generated machine code will not even remotely resemble the sequence of operations in the original expression (i.e. no shift and/or no bitwise-and). If what you are trying to do here is just test the value of one bit, then there are other ways to do it besides the shift-and-bitwise-and combination. And many of those "other ways" are not expressible in C. You can't use them in C, while the compiler can use them in machine code.
For example, the x86 CPU has a dedicated bit-test instruction BT that extracts the value of a specific bit by its number. So a smart compiler might simply generate something like
MOV eax, i
BT mask, eax
...
for both of your expressions (assuming it is more efficient, of which I'm not sure).
Use either one and let your compiler optimize it however it likes.
If i is a compile-time constant, then the second would execute fewer instructions, since 1 << i would be computed at compile time. Otherwise I'd imagine they'd be the same.
Depends entirely on where the values mask and i come from, and the architecture on which the program is running. There's also nothing to stop the compiler from transforming one into the other in situations where they are actually equivalent.
In short, not worth worrying about unless you have a trace showing that this is an appreciable fraction of total execution time.
It is unlikely that either will be faster. If you are really curious, compile a simple program that does both, disassemble, and see what instructions are generated.
Here is how to do that:
gcc -O0 -g main.c -o main
objdump -d main | less
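For example, main.c could be as small as this (the noinline attribute is a GCC extension; it just keeps the two helpers visible as separate functions in the disassembly, and the inputs come from argc so nothing is constant-folded):

#include <stdio.h>

__attribute__((noinline)) int shift_then_and(int mask, int i) { return (mask >> i) & 1; }
__attribute__((noinline)) int and_with_shifted(int mask, int i) { return mask & (1 << i); }

int main(int argc, char **argv) {
    int mask = argc * 37;   /* runtime values, unknown to the compiler */
    int i = argc % 8;
    (void)argv;
    printf("%d %d\n", shift_then_and(mask, i), and_with_shifted(mask, i));
    return 0;
}

In the objdump output you can then compare the instruction sequences the compiler chose for each function.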
You could examine the assembly output and then look up how many clock cycles each instruction takes.
But in 99.9999999 percent of programs, it won't make a lick of difference.
The two expressions are not logically equivalent, so performance is not your concern!
If performance were your concern, write a loop that does 10 million of each and measure.
EDIT: You edited the question after my response ... so please ignore my answer, as the constraints change things.

Gentle introduction to JIT and dynamic compilation / code generation

The deceptively simple foundation of dynamic code generation within a C/C++ framework has already been covered in another question. Are there any gentle introductions to the topic, with code examples?
My eyes are starting to bleed staring at highly intricate open-source JIT compilers when my needs are much more modest.
Are there good texts on the subject that don't assume a doctorate in computer science? I'm looking for well-worn patterns, things to watch out for, performance considerations, etc. Electronic or tree-based resources can be equally valuable. You can assume a working knowledge of (not just x86) assembly language.
Well, a pattern I've used in emulators goes something like this:

#include <map>

typedef void (*code_ptr)();

unsigned long instruction_pointer = entry_point;
std::map<unsigned long, code_ptr> code_map;

void execute_block() {
    code_ptr f;
    // Look up the block for the current instruction pointer; generate it on a miss.
    std::map<unsigned long, code_ptr>::iterator it = code_map.find(instruction_pointer);
    if (it != code_map.end()) {
        f = it->second;
    } else {
        f = generate_code_block();
        code_map[instruction_pointer] = f;
    }
    f();
    instruction_pointer = update_instruction_pointer();
}

void execute() {
    while (true) {
        execute_block();
    }
}
This is a simplification, but the idea is there. Basically, every time the engine is asked to execute a "basic block" (usually everything up to the next flow-control op, or a whole function where possible), it will look it up to see if it has already been created. If so, execute it; otherwise create it, add it, and then execute it.
Rinse, repeat :)
As for the code generation, that gets a little complicated, but the idea is to emit a proper "function" which does the work of your basic block in the context of your VM.
EDIT: note that I haven't demonstrated any optimizations either, but you asked for a "gentle introduction"
EDIT 2: I forgot to mention one of the most immediately productive speed-ups you can implement with this pattern. Basically, if you never remove a block from your tree (you can work around it if you do, but it is way simpler if you never do), then you can "chain" blocks together to avoid lookups. Here's the concept: whenever you return from f() and are about to do the "update_instruction_pointer", if the block you just executed ended in either a call, an unconditional jump, or didn't end in flow control at all, then you can "fix up" its ret instruction with a direct jmp to the next block it'll execute (because it'll always be the same one), provided you have already emitted that block. This makes it so you are executing more and more in the VM and less and less in the "execute_block" function.
I'm not aware of any sources specifically related to JITs, but I imagine that it's pretty much like a normal compiler, only simpler if you aren't worried about performance.
The easiest way is to start with a VM interpreter. Then, for each VM instruction, generate the assembly code that the interpreter would have executed.
To go beyond that, I imagine that you would parse the VM byte codes and convert them into some sort of suitable intermediate form (three address code? SSA?) and then optimize and generate code as in any other compiler.
For a stack-based VM, it may help to keep track of the "current" stack depth as you translate the byte codes into intermediate form, and treat each stack location as a variable. For example, if you think that the current stack depth is 4, and you see a "push" instruction, you might generate an assignment to "stack_variable_5" and increment a compile-time stack counter, or something like that. An "add" when the stack depth is 5 might generate the code "stack_variable_4 = stack_variable_4 + stack_variable_5" and decrement the compile-time stack counter.
It is also possible to translate stack based code into syntax trees. Maintain a compile-time stack. Every "push" instruction causes a representation of the thing being pushed to be stored on the stack. Operators create syntax tree nodes that include their operands. For example, "X Y +" might cause the stack to contain "var(X)", then "var(X) var(Y)" and then the plus pops both var references off and pushes "plus(var(X), var(Y))".
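A tiny sketch of that last translation, with made-up types, might look like this:

#include <memory>
#include <stack>
#include <string>

struct Node {
    std::string op;                  // "var", "plus", ...
    std::shared_ptr<Node> lhs, rhs;  // null for leaves
};
typedef std::shared_ptr<Node> NodePtr;

NodePtr var(const std::string& name) {
    return NodePtr(new Node{name, nullptr, nullptr});
}

// Translate the postfix sequence "X Y +" into plus(var(X), var(Y)).
NodePtr translate_x_y_plus() {
    std::stack<NodePtr> s;                        // the compile-time stack
    s.push(var("X"));                             // "push X"
    s.push(var("Y"));                             // "push Y"
    NodePtr rhs = s.top(); s.pop();               // "+" pops both operands...
    NodePtr lhs = s.top(); s.pop();
    s.push(NodePtr(new Node{"plus", lhs, rhs}));  // ...and pushes the combined node
    return s.top();
}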
Get yourself a copy of Joel Pobar's book on Rotor (when it's out), and delve through the source to the SSCLI. Beware, insanity lies within :)