What is code branching? I've seen it mentioned in various places, especially with bit twiddling, but never really thought about it?
How does it slow a program down and what should I be thinking about while coding?
I see mention of if statements. I really don't understand how such code can slow down the code. If condition is true do following instructions, otherwise jump to another set of instructions? I see the other thread mentioning "branch prediction", maybe this is where I'm really lost. What is there to predict? The condition is right there and it can only be true or false.
I don't believe this to be a duplicate of this related question. The linked thread is talking about "Branch prediction" in reference to an unsorted array. I'm asking what is branching and why prediction is required.
The most simple example of a branch is an if statement:
if (condition)
doSomething();
Now if condition is true then doSomething() is executed. If not then the execution branches, by jumping to the statement that follows the end of the if.
In very simple machine pseudo code this might be compiled to something along these lines:
TEST condition
JZ label1 ; jump over the CALL if condition is 0
CALL doSomething
##label1
The branch point is the JZ instruction. The subsequent execution point depends on the outcome of the test of condition.
Branching affects performance because modern processors predict the outcome of branches and perform speculative execution, ahead of time. If the prediction turns out to be wrong then the speculative execution has to be unwound.
If you can arrange the code so that prediction success rates are higher, then performance is increased. That's because the speculatively executed code is now less of an overhead since it has already been executed before it was even needed. That this is possible is down to the fact that modern processors are highly parallel. Spare execution units can be put to use performing this speculative execution.
Now, there's one sort of code that never has branch prediction misses. And that is code with no branches. For branch free code, the results of speculative execution are always useful. So, all other things being equal, code without branches executes faster than code with branches.
Essentially imagine an assembly line in a factory. Imagine that, as each item passes through the assembly line, it will go to employee 1, then employee 2, on up to employee 5. After employee 5 is done with it, the item is finished and is ready to be packaged. Thus all five employees can be working on different items at the same time and not having to just wait around on each other. Unlike most assembly lines though, every single time employee 1 starts working on a new item, it's potentially a new type of item - not just the same type over and over.
Well, for whatever weird and imaginative reason, imagine the manager is standing at the very end of the assembly line. And he has a list saying, "Make this item first. Then make that type of item. Then that type of item." And so on. As he sees employee 5 finish each item and move on to the next, the manager then tells employee 1 which type of item to start working on, looking at where they are in the list at that time.
Now let's say there's a point in that list - that "sequence of computer instructions" - where it says, "Now start making a coffee cup. If it's nighttime when you finish making the cup, then start making a frozen dinner. If it's daytime, then start making a bag of coffee grounds." This is your if statement. Since the manager, in this kind of fake example, doesn't really know what time of day it's going to be until he actually sees the cup after it's finished, he could just wait until that time to call out the next item to make - either a frozen dinner or some coffee grounds.
The problem there is that if waits until the very last second like that - which he has to wait until to be absolutely sure what time of day it'll be when the cup is finished, and thus what the next item's going to be - then workers 1-4 are not going to be working on anything at all until worker 5 is finished. That completely defeats the purpose of an assembly line! So the manager takes a guess. The factory is open 7 hours in the day and only 1 hour at night. So it is much more likely that the cup will be finished in the daytime, thus warranting the coffee grounds.
So as soon as employee 2 starts working on the coffee cup, the manager calls out the coffee grounds to the employee 1. Then the assembly line just keeps moving along like it had been, until employee 5 is finished with the cup. At that time the manager finally sees what time of day it is. If it's daytime, that's great! If it's nighttime, everything started on after that coffee cup must be thrown away, and the frozen dinner must be started on. ...So essentially branch prediction is where the manager temporarily ventures a guess like that, and the line moves along faster when he's right.
Pseudo-Edit:
It is largely hardware-related. The main search phrase would probably be "computer pipeline cpu". But the list of instructions is already made up - it's just that that list of instructions has branches within it; it's not always 1, 2, 3, etc. But as stage 5 of the pipeline is finishing up instruction 10, stage 1 can already be working on instruction 14. Usually computer instructions can be broken up like that and worked on in segments. If stages 1-n are all working on something at the same time, and nothing gets trashed later, that's just faster than finishing one before starting another.
Any jump in your code is a branch. This happens in if statements function calls and loops.
Modern CPUs have long pipelines. This means the CPUs is processes various parts of multiple instructions at the same time. The problem with branches is that the pipeline might not have started processing the correct instructions. This means that the speculative instructions need to be thrown out and the processor will need to start processing the instructions from scratch.
When a branch is encountered, the CPU tries to predict which branch is going to be used. This is called branch prediction.
Most of the optimizations for branch prediction will be done by your compiler so you do not really need to worry about branching.
This probably falls into the category of only worry about branch optimizations if you have profiled the code and can see that this is a problem.
A branch is a deviation from normal control flow. Processors will execute instructions sequentially, but in a branch, the program counter is moved to another place in memory (for example, a branch depending on a condition, or a procedure call).
Related
Usually profile data is gathered by randomly sampling the stack of the running program to see which function is in execution, over a running period it is possible to be statistically sure which methods/function calls eats most time and need intervention in case of bottlenecks.
However this has to do with overall application/game performance. Sometime happens that there are singular and isolated hiccups in performance that are causing usability troubles anyway (user notice it / introduced lag in some internal mechanism etc). With regular profiling over few seconds of execution is not possible to know which. Even if the hiccup lasts long enough (says 30 ms, which are not enough anyway), to detect some method that is called too often, we will still miss to see execution of many other methods that are just "skipped" because of the random sampling.
So are there any tecniques to profile hiccups in order to keep framerate more stable after fixing those kind of "rare bottlenecks"? I'm assuming usage of languages like C# or C++.
This has been answered before, but I can't find it, so here goes...
The problem is that the DrawFrame routine sometimes takes too long.
Suppose it normally takes less than 1000/30 = 33ms, but once in a while it takes longer than 33ms.
At the beginning of DrawFrame, set a timer interrupt that will expire after, say, 40ms.
Then at the end of DrawFrame, disable the interrupt.
So if it triggers, you know DrawFrame is taking an unusually long time.
Put a breakpoint in the interrupt handler, and when it gets there, examine the stack.
Chances are pretty good that you have caught it in the process of doing the costly thing.
That's a variation on random pausing.
I am trying to use google perf tools CPU profiler for debugging performance issues on a multi-threaded program. With single thread it take 250 ms while 4 threads take around 900ms.
My program has a mmap'ed file which is shared across threads and all operations are read only. Also my program creates large number of objects which are not shared across threads. (Specifically my program uses CRF++ library to do some querying). I am trying to figure out how to make my program perform better with multi threads. Call graph produced by CPU profiler of gperf tools shows that my program spends a lot of time (around 50%) of time in _L_unlock_16.
Searching web for _L_unlock_16 pointed to some bug reports with canonical suggesting that its associated with libpthread. But other than that I was not able to find any useful information for debugging.
A brief description of what my program does. I have few words in a file (4). In my program I have a processWord() which processes a single word using CRF++. This processWord() is what each thread executes. My main() reads words from the file and each threads runs processWord() in parallel. If I process a single word(hence only 1 thread) it takes 250ms and so if I process all 4 words(and hence 4 threads) I expected it to finish by same time 250 ms, however as I mentioned above it's taking around 900ms.
This is the callgraph of the execution - https://www.dropbox.com/s/o1mkh477i7e9s4m/cgout_n2.png
I want to understand why my program is spending lot of time at _L_unlock_16 and what I can do to mitigate it.
Yet again the _L_unlock_16 is not a function of your code. Have you looked at the stracktraces above that function? What are the callers of it when the program waits? You've said that the program wastes 50% waiting inside. But, which part of the program ordered that operation? Is it again from memory alloc/dealloc ops?
The function seems to come from libpthread. Does CRF+ handle threads/libpthread in any way? If yes, then maybe the library is illconfigured? Or maybe it implements some 'basic threadsafety' by adding locks everywhere and simply is not built well for multithreading? What does the docs say about that?
Personally, I'd guess that it ignores threads and that you have added all the threading. I may be wrong, but if that's true, then the CRF++ probably will not call that 'unlock' function at all, and the 'unlock' is somwhow called from your code that orchestrates the threads/locks/queues/messages etc? Halt the program a few times and look at who called the unlock. If it really spends 50% sitting in the unlock, you will very quickly know who causes the lock to be used and you will be able to either eliminate it or at least perform a more refined research..
EDIT #1:
Eh.. when I said "stacktrace" I meant stacktrace, not callgraph. Callgraph may look nice in trivial cases, but in more complex ones, it will be mangled and unreadable and will hide the precious details into "compacted" form.. But, fortunatelly, here the case looks simple enough.
Please observe the beginning: "Process word, 99x". I assume that the "99x" is the call count. Then, look at "tagger-parse": 97x. From that:
61x into rebuildFeatures from which 41x goes right into unlock and 20(13) indirectly into it
23x goes to buildLattice fro which 21x goes into unlock
I'd guess that it was the CRF++ uses locking quite heavily. For me, it seems that you simply observe the effects of CRF's internal locking. It certainly is not lockless internally.
It seems to lock at least once per "processWord". It's hard to say without looking at code (is it opensource? I've not checked..), from stacktraces it would be more obvious, but IF it really locks once per "processWord" that it could even be a sort of a "global lock" that protects "everything" from "all threads" and causes all jobs to serialize. Whatever. Anyways, clearly, it's the CRF++'s internals that lock and wait.
If your CRF objects are really (really) not shared across threads, then remove threading configuration flags from CRF, pray that they were sane enough to not use any static variables nor global objects, add some own locking (if needed) at the topmost job/result level and retry. You should have it now much faster.
If the CRF objects are shared, unshare them and see above.
But, if they are shared behind the scenes, then there's little doable. Change your library to a one that has a better threading support, OR fix the library, OR ignore and use it with current performance.
The last advice may sound strange (it works slowly, right? so why to ignore it?), but in fact is the most important one, and you should try it first. If the parallel tasks have similar "data profile", then there is very probable that they will try to hit the same locks in the same approximate moment of time. Imagine a medium-sized cache that holds words sorted by its first letter. At the toplevel there's array of, say, 26 entries. Each entry has a lock and a list of words inside. If you run 100 threads that will each first check "mom" then "dad" then "son" - then all of that 100 threads will first hit and wait for each other at "M", then at "D" then at "S". Well, approximately/probably of course. But you get the idea. If the data profile were more random then they'd block each other far less. Mind that processing ONE word is a .. small task and that you try to process the same word(s). Even if the internal CRF's locking is smart, it is just bound to hit the same areas. Try again with a more dispersed data.
Add to that the fact that threading costs. If something was guarded against races with use of locks, then every lock/unlock costs because at least they have to "halt and check if the lock is open" (sorry very imprecise wording). If the data-to-process is small in relation to the-amount-of-lockchecks, then adding more threads will not help and will just waste the time. For checking one word, it may even happen that the sole handling of a single lock takes longer than processing the word! But, if the amount of data to be processed were larger, then the cost of flipping a lock compared to processing the data might start being neglible.
Prepare a set of 100 or more words. Run and measure it on one thread. Then partition the words at random and run it on 2 and 4 threads. And measure. If not better, try at 1000 and 10000 words. The more the better, of course, keeping in mind that the test should not last till your next birthday ;)
If you notice that 10k words split over 4 threads (2500w per th) works about 40%-30%-or even 25% faster than on one thread - here you go! You simply gave it a too small job. It was tailored and optimized for bigger ones!
But, on the other hand, it may happen that 10k words split over 4 threads does not work faster, or worse, works slower - then it might indicate that the library handles multithreading very wrong. Now try the other things like stripping threading from it or repairing it.
I'm a newbie with profiling. I'd like to optimize my code to satisfy timing constraints. I use Visual C++ 08 Express and thus had to download a profiler, for me it's Very Sleepy. I did some search but found no decent tutorial on Sleepy, and here my question:
How to use it properly? I grasped the general idea of profiling, so I sorted according to %exclusive to find my bottlenecks. Firstly, on the top of this list I have ZwWaitForSingleObject, RtlEnterCriticalSection, operator new, RtlLeaveCriticalSection, printf, some iterators ... and after they take some like 60% there comes my first function, first position with Child Calls. Can someone explain me why above mentioned come out, what do they mean and how can I optimize my code if I have no access to this critical 60%? (for "source file": unknown...).
Also, for my function I'd think I get time for each line, but it's not the case, e.g. arithmetics or some functions have no timing (not nested in unused "if" clauses).
AND last thing: how to find out that some line can execute superfast, but is called thousands times, being the actual bottleneck?
Finally, is Sleepy good? Or some free alternative for my platform?
Help very appreciated!
cheers!
UPDATE - - - - -
I have found another version of profiler, called plain Sleepy. It shows how many times some snippet was called plus the number of line (I guess it points to the critical one). So in my case.. KiFastSystemCallRet takes 50%! It means that it waits for some data right? How to improve that matter, is there maybe a decent approach to trace what causes these multiple calls and eventually remove/change it?
I'd like to optimize my code to satisfy timing constraints
You're running smack into a persistent issue in this business.
You want to find ways to make your code take less time, and you (and many people) assume (and have been taught) the only way to do that is by taking various sorts of measurements.
There's a minority view, and the only thing it has to recommend it is actual significant results (plus an ironclad theory behind it).
If you've got a "bottleneck" (and you do, probably several), it's taking some fraction of time, like 30%.
Just treat it as a bug to be found.
Randomly halt the program with the pause button, and look carefully to see what the program is doing and why it's doing it.
Ask if it's something that could be gotten rid of.
Do this 10 times. On average you will see the problem on 3 of the pauses.
Any activity you see more than once, if it's not truly necessary, is a speed bug.
This does not tell you precisely how much the problem costs, but it does tell you precisely what the problem is, and that it's worth fixing.
You'll see things this way that no profiler can find, because profilers
are only programs, and cannot be broad-minded about what constitutes an opportunity.
Some folks are risk-averse, thinking it might not give enough speedup to be worth it.
Granted, there is a small chance of a low payoff, but it's like investing.
The theory says on average it will be worthwhile, and there's also a small chance of a high payoff.
In any case, if you're worried about the risks, a few more samples will settle your fears.
After you fix the problem, the remaining bottlenecks each take a larger percent, because they didn't get smaller but the overall program did.
So they will be easier to find when you repeat the whole process.
There's lots of literature about profiling, but very little that actually says how much speedup it achieves in practice.
Here's a concrete example with almost 3 orders of magnitude speedup.
I've used GlowCode (commercial product, similar to Sleepy) for profiling native C++ code. You run the instrumenting process, then execute your program, then look at the data produced by the tool. The instrumenting step injects a little trace function at every methods' entrypoints and exitpoints, and simply measures how much time it takes for each function to run through to completion.
Using the call graph profiling tool, I listed the methods sorted from "most time used" to "least time used", and the tool also displays a call count. Simply drilling into the highest percentage routine showed me which methods were using the most time. I could see that some methods were very slow, but drilling into them I discovered they were waiting for user input, or for a service to respond. And some took a long time because they were calling some internal routines thousands of times each invocation. We found someone made a coding error and was walking a large linked list repeatedly for each item in the list, when they really only needed to walk it once.
If you sort by "most frequently called" to "least called", you can see some of the tiny functions that get called from everywhere (iterator methods like next(), etc.) Something to check for is to make sure the functions that are called the most often are really clean. Saving a millisecond in a routine called 500 times to paint a screen will speed that screen up by half a second. This helps you decide which are the most important places to spend your efforts.
I've seen two common approaches to using profiling. One is to do some "general" profiling, running through a suite of "normal" operations, and discovering which methods are slowing the app down the most. The other is to do specific profiling, focusing on specific user complaints about performance, and running through those functions to reveal their issues.
One thing I would caution you about is to limit your changes to those that will measurably impact the users' experience or system throughput. Shaving one millisecond off a mouse click won't make a difference to the average user, because human reaction time simply isn't that fast. Race car drivers have reaction times in the 8 millisecond range, some elite twitch gamers are even faster, but normal users like bank tellers will have reaction times in the 20-30 millisecond range. The benefits would be negligible.
Making twenty 1-millisecond improvements or one 20-millisecond change will make the system a lot more responsive. It's cheaper and better if you can do the single big improvement over the many small improvements.
Similarly, shaving one millisecond off a service that handles 100 users per second will make a 10% improvement, meaning you could improve the service to handle 110 users per second.
The reason for concern is that coding changes strictly to improve performance often negatively impact your code's structure by adding complexity. Let's say you decided to improve a call to a database by caching results. How do you know when the cache goes invalid? Do you add a cache cleaning mechanism? Consider a financial transaction where looping through all the line items to produce a running total is slow, so you decide to keep a runningTotal accumulator to answer faster. You now have to modify the runningTotal for all kinds of situations like line voids, reversals, deletions, modifications, quantity changes, etc. It makes the code more complex and more error-prone.
I need to implement execution time measuring functionality. I thought about two possibilities.
First - regular time() call, just remember time each execution step starts and time when each execution step completes. Unix time shell command works this way.
Second method is sampling. Every execution step set some sort of flag before execution begins(for example - creates some object in the stack frame), and destroy it when it's completes. Timer periodically scan all flags and generate execution time profile. If some execution step takes more time then the others - it will be scanned more times. Many profilers works like this.
I need to add some profiling functionality in my server application, what method is better and why? I think that second method is less accurate and first method add dependency to profiling library code.
The second method is essentially stack sampling.
You can try to do it yourself, by means of some kind of entry-exit event capture, or it's better if there is a utility to actually read the stack.
The latter has an advantage in that you get line-of-code resolution, rather than just method-level.
There's something that a lot of people don't get about this, which is that precision of timing measurement is far less important than precision of problem identification.
It is important to take samples even during I/O or other blocking, so you are not blind to needless I/O. If you are worried that competition with other processes could inflate the time, don't be, because what really matters is not absolute time measurements, but percentages.
For example, if one line of code is on the stack 50% of the wall-clock time, and thus responsible for that much, getting rid of it would double the speed of the app, regardless of whatever else is going on.
Profiling is about more than just getting samples.
Often people are pretty casual about what they do with them, but that's where the money is.
First, inclusive time is the fraction of time a method or line of code is on the stack. Forget "self" time - it's included in inclusive time.
Forget invocation counting - its relation to inclusive percent is, at best, very indirect.
If you are summarizing, the best way to do it is to have a "butterfly view" whose focus is on a single line of code.
To its left and right are the lines of code appearing immediately above it and below it on the stack samples.
Next to each line of code is a percent - the percent of stack samples containing that line of code.
(And don't worry about recursion. It's simply not an issue.)
Even better than any kind of summary is to just let the user see a small random selection of the stack samples themselves.
That way, the user can get the whole picture of why each snapshot in time was being spent.
Any avoidable activity appearing on more than one sample is a chance for some serious speedup, guaranteed.
People often think "Well, that could just be a fluke, not a real bottleneck".
Not so. Fixing it will pay off, maybe a little, maybe a lot, but on average - significant.
People should not be ruled by risk-aversion.
More on all that.
When boost is an option, you can use the timer library.
Make sure that you really know what you're looking for in the profiler you're writing, whenever you're collecting the total execution time of a certain piece of code, it will include time spent in all its children and it may be hard to really find what is the bottleneck in your system as the most top-level function will always bubble up as the most expensive one - like for instance main().
What I would suggest is to hook into every function's prologue and epilogue (if your application is a CLR application, you can use the ICorProfilerInfo::SetEnterLeaveFunctionHooks to do that, you can also use macros at the beginning of every method, or any other mechanism that would inject your code at the beginning and and of every function) and collect your times in a form of a tree for each thread that your profiling.
The algorithm for this would look somehow similar to this:
For each thread that you're monitoring create a stack-like data structure.
Whenever you're notified about a function that began execution, push something that would identify the function into that stack.
If that function is not the only function on the stack, then you know that the previous function that did not return yet was the one that called your function.
Keep track of those callee-called relationships in your favorite data structure.
Whenever a method returns, it's identifier will always be on top of its thread stack. It's total execution time is equal to (time when the last (it's) identifier was pushed on the stack - current time). Pop that identifier of the stack.
This way you'll have a tree-like breakdown of what eats up your execution time where you can see what child calls account for the total execution time of a function.
Have fun!
In my profiler I used an extended version of the 1-st approach mentioned by you.
I have a class which provides context objects. You can define them in your work code as automatic objects to be freed up as soon as execution flow leaves the context where they have been defined (for example, a function or a loop). The constructor calls GetTickCount (it was a Windows project, you can choose analogous function as appropriate to your target platform) and stores this value, while destructor calls GetTickCount again and calculates difference between this moment and start. Each object has unique context ID (can be autogenerated as a static object inside the same context), so profiler can sum up all timings with the same IDs, which means that the same context has been passed several times. Also number of executions is counted.
Here is a macro for preprocessor, which helps to profile a function:
#define _PROFILEFUNC_ static ProfilerLocator locator(__FUNC__); ProfilerObject obj(locator);
When I want to profile a function I just insert PROFILEFUNC at the beginning of the function. This generates a static object locator which identifies the context and stores a name of it as the function name (you may decide to choose another naming). Then automatic ProfilerObject is created on stack and "traces" own creation and deletion, reporting this to the profiler.
I am working on a (relatively complex) game. The game freezes in release mode. The freeze happens after 1-2 min. of game-play. The current configuration of the release mode that I have allows me to break (that is go into debug), which is good, but may give me wrong information but that is fine for this particular case (I can turn off the optimization for a single file/function/code).
Problem is, I (we, since we are a team) don't know where it is hanging. It is not as simple as one relatively small infinite loop that is hanging, as other things (Graphics, sound) are being updated, just that the game-play has stalled. The main game loop (an infinite loop) is always running and is very long/complex, so stepping through is going to be a pain (but it is one of the options).
The first thing I tried is Visual Studio's break all but it always breaks in code that is not mine and consequently shows me assembly output. Eventually, with enough persistence, SVN history checking and commenting out code I will be able to figure out where it is hanging, but there has to be a better way... hopefully?
Note: There is a Visual Studio option I am aware of that allows debugging user code only, but that is managed code only.
EDIT: Was able to solve the problem via stack trace and lots of hours of keeping track of various things to see where the game is hanging. I will select Sjoerd's answer as the correct one, however, if someone has a suggestion for a tool/technique that allows to automate such a task, by all means, add your answer!
If you break and you encounter native code that is not yours, check the call stack. The call stack is the list of functions that got called to reach the current point in the code. Go up some levels in the stack until you encounter the method which is currently running.
Hit the pause button in Visual Studio while the program is hung.
This should break the debugger at the current line. You can then step through and see what is happening.
As an alternative to debugging symbols and breaks (which is the tool of choice when possible), add logging: It is not uncommon for games (and other apps) to have a huge logging system they can turn on and off with a compiler flag so they can still do some kind of debugging/tracing in "release builds". If your logging works fine you should see what is and what is not happening and get at least some idea where things go wrong.
You might well never be able to catch the problem via an interrupt if the code that should be executing isn't executing. There are lots of ways this can happen. Just a few:
You have some parameter that indicates the time at which the next update is to be performed. If this somehow gets set to some big number, the code that does the update will happily see that nothing needs to be done. Next! This can give all the appearances of a hung program even though it isn't really hung at all. The state update and the graphics functions are still being called at their prescribed rate.
You may some counter that represents time and some rounding mechanism for incrementing time. If the counter is a 32 bit signed int and the granularity of your counter is 0.1 microseconds, you will hit INT32_MAX after just 3.6 minutes. Now time is frozen, so once again you have a situation where updates may not be performed.
You are using a single precision floating point number to represent time and update time via time += delta_t; This will stop working after a couple of minutes if your delta_t is 10 microseconds. This is yet another mechanism by which time can be frozen.
Edit
Have you looked at the CPU usage in your various threads? The above problems might cause the physics or game-playing thread to exhibit a drastic drop in CPU usage after a couple of minutes. You might also get this behavior if the game playing thread is perpetually locked, but here you might (with the right tool) get an indication that that thread is always asleep.