What happens if function inlining is too aggressive? - c++

Each time I read about inline keyword in C++ there's a long explanation that the compiler makes a "speed versus code volume" analysis and then decided whether to inline a function call in each specific case.
Now Visual C++ 9 has a __forceinline keyword that seems to make the compiler inline the call to the function unless such inlining is absolutely impossible (like a call is virtual).
Suppose I look through some project without understanding what goes inside it and decide myself that one third of functions are small enough and good for inlining and mark them with __forceinline and the compiler does inline them and now the executable has become say one hundred times bigger.
Will it really matter? What effect should I expect from having functions inlined overly aggressively and having one hundred times bigger executable?

The main impact will be to the cache. Inlining goes against the principal of locality; the CPU will have to fetch the instructions from the main memory far more often. So what was intended to make the code faster may actually make it slower.

Others have already mentioned the impact on cache. There's another penalty to pay. Modern CPU's are quite fast, but at a price. They have deep pipelines of instructions being processed. To keep these pipelines filled even in the presence of conditional branches, fast CPUs use branch prediction. They record how often a branch was taken and use that to predict whether a branch will be taken in the future.
Obviously, this history takes memory, and it's a fixed size table. It contains only a limited number of branch instructions. By increasing the number of instructions a hundredfold, you also increase the number of branches by that much. This means the number of branches with predictions decreases sharply. In addition, for the branches that are present in the prediction table, less data is available.

Having a bigger executable is its own punishment:
1) It takes more memory to store the program which matters more on some systems than others (cell phones for example can have very limited memory)
2) Having a larger program takes longer to load into memory
3) During execution, you will likely have more cache misses (you try to branch to part of your program which isn't in cache) because your program is spread out over more space. This slows down your program.

It will load and run more slowly, and may run out of virtual address space (100 times bigger is pretty dire).

Less of your program will fit in the CPU caches, disk caches etc. and therefore more time will be wasted as the CPU sits idle waiting for that code to become available. It's as simple as that really.
Ah - I hadn't looked at who'd posted the question - sharptooth hey ? :-) - you obviously won't have learned anything from the answer above. But, that's all there is too it - it's just a statistical balancing act, with defaults doubtless shaped by the compiler writers based on customer pressure to explain both larger executable sizes and slower execution speeds when compared to other compiler vendors.
Interesting if dated lectures notes here: http://www.cs.utexas.edu/users/djimenez/utsa/cs3343/lecture15.html

This is somewhat a complex topic, and I think you should have a look at this C++ faq lite about inline
It explains that there is no simple solution, and there are many things to consider (but it all boils down to a good intuition anyway!)

Related

Finding which code segment is faster than the other

Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example lets say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways(I do not want to alter the content of the array)
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i]
I assume the second one is faster. but I can't measure the time to make sure because it will be a negligible difference.
So in this type of a situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code or set of lines and view
how many lines of assembly instructions are there?
I would appreciate your thoughts on this matter.
The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Todays compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the surrounding looping code.
Not only the compiler makes optimizations, the processor has a cache, an instruction pipeline, and tries to predict branching code. A value which has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm, this also depends on your input sizes. For example, insertion sort can be faster than quicksort, if the input is very small. But don't write your own sorting code, if your input is not special, use the libraries available in general. And don't optimize prematurely.
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.

How do I segregate C++ code without impacting performance?

I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even, however, being an image processing affair, it loops a lot, and the roadblocks I'm encoutering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if(distance < threshold)
{
positiveRows.push_back(row);
positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp,
int ap,
int bp,
int Lq,
int aq,
int bq)
{
return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can take the calculation definition to a separate file, and have it perform the same as the original solution? Furthermore, should the fact that compilers are optimizing my program to run 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is because I don't know what is the name of this stuff. I've heard the terms "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code it its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger space in memory and then copying the old elements to the new vector. Finally deallocating the memory. The std::vector has the capability of reserving space upon construction. For large sizes of vectors, this can be a time saver, eliminating many reallocations.
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeve and can often spot locations to optimize that you or I miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by copying the content in a for loop many times. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data processing instructions. Unrolling a loop reduces the number of executed branch instructions. Again, the compiler may automatically perform this when you raise the optimization level.
Data cache optimization
Search the web for "Data cache optimization". Loading and unloading the data cache wastes time. If your data can fit into the processor's data cache, it doesn't have to keep loading an unloading (called cache misses). Also remember to perform all your operations on the data in one place before performing other operations. This reduces the likelihood of the processor reloading the cache.
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the Operating System delegate the tasks. The OS could delegate to different cores in your processor (if you have a multi-core processor) or it runs it in another thread. Again there is a cost: overhead of managing the task or thread and communications.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".

Measuring performance/throughput of fast code ignoring processor speed?

Is there a way I could write a "tool" which could analyse the produced x86 assembly language from a C/C++ program and measure the performance in such a way, that it wouldnt matter if I ran it on a 1GHz or 3GHz processor?
I am thinking more along the lines of instruction throughput? How could I write such a tool? Would it be possible?
I'm pretty sure this has to be equivalent to the halting problem, in which case it can't be done. Things such as branch prediction, memory accesses, and memory caching will all change performance irrespective of the speed of the CPU upon which the program is run.
Well, you could, but it would have very limited relevance. You can't tell the running time by just looking at the instructions.
What about cache usage? A "longer" code can be more cache-friendly, and thus faster.
Certain CPU instructions can be executed in parallel and out-of-order, but the final behaviour depends a lot on the hardware.
If you really want to try it, I would recommend writing a tool for valgrind. You would essentially run the program under a simulated environment, making sure you can replicate the behaviour of real-world CPUs (that's the challenging part).
EDIT: just to be clear, I'm assuming you want dynamic analysis, extracted from real inputs. IF you want static analysis you'll be in "undecidable land" as the other answer pointed out (you can't even detect if a given code loops forever).
EDIT 2: forgot to include the out-of-order case in the second point.
It's possible, but only if the tool knows all the internals of the processor for which it is projecting performance. Since knowing 'all' the internals is tantamount to building your own processor, you would correctly guess that this is not an easy task. So instead, you'll need to make a lot of assumptions, and hope that they don't affect your answer too much. Unfortunately, for anything longer than a few hundred instructions, these assumptions (for example, all memory reads are found in L1 data cache and have 4 cycle latency; all instructions are in L1 instruction cache but in trace cache thereafter) affect your answer a lot. Clock speed is probably the easiest variable to handle, but the details for all the rest that differ greatly from processor to processor.
Current processors are "speculative", "superscalar", and "out-of-order". Speculative means that they choose their code path before the correct choice is computed, and then go back and start over from the branch if their guess is wrong. Superscalar means that multiple instructions that don't depend on each other can sometimes be executed simultaneously -- but only in certain combinations. Out-of-order means that there is a pool of instructions waiting to be executed, and the processor chooses when to execute them based on when their inputs are ready.
Making things even worse, instructions don't execute instantaneously, and the number of cycles they do take (and the resources they occupy during this time) vary also. Accuracy of branch prediction is hard to predict, and it takes different numbers of cycles for processors to recover. Caches are different sizes, take different times to access, and have different algorithms for decided what to cache. There simply is no meaningful concept of 'how fast assembly executes' without reference to the processor it is executing on.
This doesn't mean you can't reason about it, though. And the more you can narrow down the processor you are targetting, and the more you constrain the code you are evaluating, the better you can predict how code will execute. Agner Fog has a good mid-level introduction to the differences and similarities of the current generation of x86 processors:
http://www.agner.org/optimize/microarchitecture.pdf
Additionally, Intel offers for free a very useful (and surprisingly unknown) tool that answers a lot of these questions for recent generations of their processors. If you are trying to measure the performance and interaction of a few dozen instructions in a tight loop, IACA may already do what you want. There are all sorts of improvements that could be made to the interface and presentation of data, but it's definitely worth checking out before trying to write your own:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer
To my knowledge, there isn't an AMD equivalent, but if there is I'd love to hear about it.

Pipeline optimzation, is there any point to do this?

Some very expencied programmer from another company told me about some low-level code-optimzation tips that targetting specific CPU, including pipeline-optimzation, which means, arrange the code (inlined assembly, obviously) in special orders such that it fit the pipeline better for the targetting hardware.
With the presence of out-of-order and speculative execuation, I just wonder is there any points to do this kind of low-level stuff? We are mostly invovled in high performance computing, so we can really focus on one very specific CPU type to do our optimzation, but I just dont know if there is any point to do this specific optimzation, anyone has any experience here, where to begin? are there any code examples for this kind of optimzation? many thanks!
I'll start by saying that the compiler will usually optimize code sufficiently (i.e. well enough) that you do not need to worry about this provided your high-level code and algorithms are optimized. In general, manual optimizing should only happen if you have hard evidence that there is an actual performance issue that you can quantify and have tracked down.
Now, with that said, it's always possible to improve things - sometimes a little, sometimes a lot.
If you are in the high-performance computing game, then this sort of optimization might make sense. There are all sorts of "tricks" that can be done, but they are best left to real experts and not for the faint of heart.
If you really want to know more about this topic, a good place to start is by reading Agner Fog's website.
Pipeline optimization will improve your programs performance:
Branches and jumps may force your processor to reload the instruction pipeline, which takes some time. This time could be devoted to data processing instructions.
Some platform independent methods for pipeline optimizations:
Reduce number of branches.
Use Boolean Arithmetic
Set up code to allow for conditional execution of instructions.
Unroll loops.
Make loops have short content (that can fit in a processor's cache
without loading).
Edit 1: Other optimizations
Reduce code by eliminating features and requirements.
Review and optimize the design.
Review implementation for more efficient implementations.
Revert to assembly language only when all other optimizations have
provided little performance improvement; optimize only the code that
is executed 80% of the time; find out by profiling.
Edit 2: Data Optimizations
You can also gain performance improvements by organizing your data. Search the web for "Data Driven Design" or "Optimize performance data".
One idea is that the most frequently used data should be close together and ultimately fit into the processor's data cache. This will reduce the frequency that the processor has to reload its data cache.
Another optimization is to: Load data (into registers), operate on data, then write all data back to memory. The idea here is to trigger the processor's data cache loading circuitry before it processes the data (or registers).
If you can, organize the data to fit in one "line" of your processor's cache. Sequential locations require less time than random access locations.
There are always things that "help" vs. "hinder" the execution in the pipeline, but for most general purpose code that isn't highly specialized, I would expect that performance from compiled code is about as good as the best you can get without highly specialized code for each model of processor. If you have a controlled system, where all of your machines are using the same (or a small number of similar) processor model, and you know that 99% of the time is spent in this particular function, then there may be a benefit to optimizing that particular function to become more efficient.
In your case, it being HPC, it may well be beneficial to handwrite some of the low-level code (e.g. matrix multiplication) to be optimized for the processor you are running on. This does take some reasonable amount of understanding of the processor however, so you need to study the optimization guides for that processor model, and if you can, talk to people who've worked on that processor before.
Some of the things you'd look at is "register to register dependencies" - where you need the result of c = a + b to calculate x = c + d - so you try to separate these with some other useful work, such that the calculation of x doesn't get held up by the c = a + b calculation.
Cache-prefetching and generally caring for how the caches are used is also a useful thing to look at - not kicking useful cached data out that you need 100 instructions later, when you are storing the resulting 1MB array that won't be used again for several seconds can be worth a lot of processor time.
It's hard(er) to control these things when compilers decide to shuffle it around in it's own optimisation, so handwritten assembler is pretty much the only way to go.

Function size vs execution speed

I remember hearing somewhere that "large functions might have higher execution times" because of code size, and CPU cache or something like that.
How can I tell if function size is imposing a performance hit for my application? How can I optimize against this? I have a CPU intensive computation that I have split into (as many threads as there are CPU cores). The main thread waits until all of the worker threads are finished before continuing.
I happen to be using C++ on Visual Studio 2010, but I'm not sure that's really important.
Edit:
I'm running a ray tracer that shoots about 5,000 rays per pixel. I create (cores-1) threads (1 per extra core), split the screen into rows, and give each row to a CPU thread. I run the trace function on each thread about 5,000 times per pixel.
I'm actually looking for ways to speed this up. It is possible for me to reduce the size of the main tracing function by refactoring, and I want to know if I should expect to see a performance gain.
A lot of people seem to be answering the wrong question here, I'm looking for an answer to this specific question, even if you think I can probably do better by optimizing the contents of the function, I want to know if there is a function size/performance relationship.
It's not really the size of the function, it's the total size of the code that gets cached when it runs. You aren't going to speed things up by splitting code into a greater number of smaller functions, unless some of those functions aren't called at all in your critical code path, and hence don't need to occupy any cache. Besides, any attempt you make to split code into multiple functions might get reversed by the compiler, if it decides to inline them.
So it's not really possible to say whether your current code is "imposing a performance hit". A hit compared with which of the many, many ways that you could have structured your code differently? And you can't reasonably expect changes of that kind to make any particular difference to performance.
I suppose that what you're looking for is instructions that are rarely executed (your profiler will tell you which they are), but are located in the close vicinity of instructions that are executed a lot (and hence will need to be in cache a lot, and will pull in the cache line around them). If you can cluster the commonly-executed code together, you'll get more out of your instruction cache.
Practically speaking though, this is not a very fruitful line of optimization. It's unlikely you'll make much difference. If nothing else, your commonly-executed code is probably quite small and adjacent already, it'll be some small number of tight loops somewhere (your profiler will tell you where). And cache lines at the lowest levels are typically small (of the order of 32 or 64 bytes), so you'd need some very fine re-arrangement of code. C++ puts a lot between you and the object code, that obstructs careful placement of instructions in memory.
Tools like perf can give you information on cache misses - most of those won't be for executable code, but on most systems it really doesn't matter which cache misses you're avoiding: if you can avoid some then you'll speed your code up. Perhaps not by a lot, unless it's a lot of misses, but some.
Anyway, what context did you hear this? The most common one I've heard it come up in, is the idea that function inlining is sometimes counter-productive, because sometimes the overhead of the code bloat is greater than the function call overhead avoided. I'm not sure, but profile-guided optimization might help with that, if your compiler supports it. A fairly plausible profile-guided optimization is to preferentially inline at call sites that are executed a larger number of times, leaving colder code smaller, with less overhead to load and fix up in the first place, and (hopefully) less disruptive to the instruction cache when it is pulled in. Somebody with far more knowledge of compilers than me, will have thought hard about whether that's a good profile-guided optimization, and therefore decided whether or not to implement it.
Unless you're going to hand-tune to the assembly level, to include locking specific lines of code in cache, you're not going to see a significant execution difference between one large function and multiple small functions. In both cases, you still have the same amount of work to perform and that's going to be your bottleneck.
Breaking things up into multiple smaller functions will, however, be easier to maintain and easier to read -- especially 6 months later when you've forgotten what you did in the first place.
Function size is unlikely to be a bottleneck in your application. What you do in the function is much more important that it's physical size. There are some things your compiler can do with small function that it cannot do with large functions (namely inlining), but usually this isn't a huge difference anyway.
You can profile the code to see where the real bottleneck is. I suspect calling a large function is not the problem.
You should, however, break up the function into smaller function for code readability reasons.
It's not really about function size, but about what you do in it. Depending on what you do, there is possibly some way to optimize it.