Function size vs execution speed - c++

I remember hearing somewhere that "large functions might have higher execution times" because of code size, and CPU cache or something like that.
How can I tell if function size is imposing a performance hit for my application? How can I optimize against this? I have a CPU intensive computation that I have split into (as many threads as there are CPU cores). The main thread waits until all of the worker threads are finished before continuing.
I happen to be using C++ on Visual Studio 2010, but I'm not sure that's really important.
Edit:
I'm running a ray tracer that shoots about 5,000 rays per pixel. I create (cores-1) threads (1 per extra core), split the screen into rows, and give each row to a CPU thread. I run the trace function on each thread about 5,000 times per pixel.
I'm actually looking for ways to speed this up. It is possible for me to reduce the size of the main tracing function by refactoring, and I want to know if I should expect to see a performance gain.
A lot of people seem to be answering the wrong question here, I'm looking for an answer to this specific question, even if you think I can probably do better by optimizing the contents of the function, I want to know if there is a function size/performance relationship.

It's not really the size of the function, it's the total size of the code that gets cached when it runs. You aren't going to speed things up by splitting code into a greater number of smaller functions, unless some of those functions aren't called at all in your critical code path, and hence don't need to occupy any cache. Besides, any attempt you make to split code into multiple functions might get reversed by the compiler, if it decides to inline them.
So it's not really possible to say whether your current code is "imposing a performance hit". A hit compared with which of the many, many ways that you could have structured your code differently? And you can't reasonably expect changes of that kind to make any particular difference to performance.
I suppose that what you're looking for is instructions that are rarely executed (your profiler will tell you which they are), but are located in the close vicinity of instructions that are executed a lot (and hence will need to be in cache a lot, and will pull in the cache line around them). If you can cluster the commonly-executed code together, you'll get more out of your instruction cache.
Practically speaking though, this is not a very fruitful line of optimization. It's unlikely you'll make much difference. If nothing else, your commonly-executed code is probably quite small and adjacent already, it'll be some small number of tight loops somewhere (your profiler will tell you where). And cache lines at the lowest levels are typically small (of the order of 32 or 64 bytes), so you'd need some very fine re-arrangement of code. C++ puts a lot between you and the object code, that obstructs careful placement of instructions in memory.
Tools like perf can give you information on cache misses - most of those won't be for executable code, but on most systems it really doesn't matter which cache misses you're avoiding: if you can avoid some then you'll speed your code up. Perhaps not by a lot, unless it's a lot of misses, but some.
Anyway, what context did you hear this? The most common one I've heard it come up in, is the idea that function inlining is sometimes counter-productive, because sometimes the overhead of the code bloat is greater than the function call overhead avoided. I'm not sure, but profile-guided optimization might help with that, if your compiler supports it. A fairly plausible profile-guided optimization is to preferentially inline at call sites that are executed a larger number of times, leaving colder code smaller, with less overhead to load and fix up in the first place, and (hopefully) less disruptive to the instruction cache when it is pulled in. Somebody with far more knowledge of compilers than me, will have thought hard about whether that's a good profile-guided optimization, and therefore decided whether or not to implement it.

Unless you're going to hand-tune to the assembly level, to include locking specific lines of code in cache, you're not going to see a significant execution difference between one large function and multiple small functions. In both cases, you still have the same amount of work to perform and that's going to be your bottleneck.
Breaking things up into multiple smaller functions will, however, be easier to maintain and easier to read -- especially 6 months later when you've forgotten what you did in the first place.

Function size is unlikely to be a bottleneck in your application. What you do in the function is much more important that it's physical size. There are some things your compiler can do with small function that it cannot do with large functions (namely inlining), but usually this isn't a huge difference anyway.
You can profile the code to see where the real bottleneck is. I suspect calling a large function is not the problem.
You should, however, break up the function into smaller function for code readability reasons.

It's not really about function size, but about what you do in it. Depending on what you do, there is possibly some way to optimize it.

Related

Is there any way to know how much space a function takes (or possibly can take) in CPU cache?

I'm just starting to learn about CPU cache in depth and I want to learn how to estimate a functions instruction size in CPU cache for curiosity reasons.
So far I learned it's not very easy to monitor L1 cache by surfing in SO and Google. But surprisingly I couldn't find any posts explaining my question.
If it's not possible, at least knowing when someone should worry about filling L1/L2 caches and not would be good to know.
Thanks.
Can you measure it?
Yes. Take a look at the output of a disassembler or measure the size increase of the library.
Should you worry about it?
Absolutely not. The executable code is usually tiny. If you're going through it once, even if we're talking GBs it's going to be fast. The usual way to make things slow is loops and recursion and usually such functions tend to be focused and small. On most system the several MBs of L1 cache should cover anything interesting code wise.
The usual source of cache dependent speedups is memory access patterns. If your tiny loop skips all over the memory, it will be a lot slower than a gigantic function that accesses things more or less linearly or is very predictable.
Another source of bad performance in the code tends to be branch predictions. Incorrectly predicting the outcome of a branch causes a stall in the CPU. Get enough of those the the performance will suffer.
Both of those are usually the last drops of performance to be squeezed out of a system. Make sure the code works correctly first, then go and try to find performance improvements, usually starting with algorithm and data structure optimizations around the hottest bits of code (the most executted).

Finding which code segment is faster than the other

Say that we have two C++ code segments, for doing the same task. How can we determine which code will run faster?
As an example lets say there is this global array "some_struct_type numbers[]". Inside a function, I can read a location of this array in two ways(I do not want to alter the content of the array)
some_struct_type val = numbers[i];
some_struct_type* val = &numbers[i]
I assume the second one is faster. but I can't measure the time to make sure because it will be a negligible difference.
So in this type of a situation, how do I figure out which code segment runs faster? Is there a way to compile a single line of code or set of lines and view
how many lines of assembly instructions are there?
I would appreciate your thoughts on this matter.
The basics are to run the piece of code so many times that it takes a few seconds at least to complete, and measure the time.
But it's hard, very hard, to get any meaningful figures this way, for many reasons:
Todays compilers are very good at optimizing code, but the optimizations depend on the context. It often does not make sense to look at a single line and try to optimize it. When the same line appears in a different context, the optimizations applied may be different.
Short pieces of code can be much faster than the surrounding looping code.
Not only the compiler makes optimizations, the processor has a cache, an instruction pipeline, and tries to predict branching code. A value which has been read before will be read much faster the next time, for example.
...
Because of this, it's usually better to leave the code in its place in your program, and use a profiling tool to see which parts of your code use the most processing resources. Then, you can change these parts and profile again.
While writing new code, prefer readable code to seemingly optimal code. Choose the right algorithm, this also depends on your input sizes. For example, insertion sort can be faster than quicksort, if the input is very small. But don't write your own sorting code, if your input is not special, use the libraries available in general. And don't optimize prematurely.
Eugene Sh. is correct that these two lines aren't doing the same thing - the first one copies the value of numbers[i] into a local variable, whereas the second one stores the address of numbers[i] into a pointer local variable. If you can do what you need using just the address of numbers[i] and referring back to numbers[i], it's likely that will be faster than doing a wholesale copy of the value, although it depends on a lot of factors like the size of the struct, etc.
Regarding the general optimization question, here are some things to consider...
Use a Profiler
The best way to measure the speed of your code is to use a profiling tool. There are a number of different tools available, depending on your target platform - see (for example) How can I profile C++ code running in Linux? and What's the best free C++ profiler for Windows?.
You really want to use a profiler for this because it's notoriously difficult to tell just from looking what the costliest parts of a program will be, for a number of reasons...
# of Instructions != # of Processor Cycles
One reason to use a profiler is that it's often difficult to tell from looking at two pieces of code which one will run faster. Even in assembly code, you can't simply count the number of instructions, because many instructions take multiple processor cycles to complete. This varies considerably by target platform. For example, on some platforms the fastest way to load the value 1 to a CPU register is something straightforward like this:
MOV r0, #1
Whereas on other platforms the fastest approach is actually to clear the register and then increment it, like this:
CLR r0
INC r0
The second case has more instruction lines, but that doesn't necessarily mean that it's slower.
Other Complications
Another reason that it's difficult to tell which pieces of code will most need optimizing is that most modern computers employ fairly sophisticated caches that can dramatically improve performance. Executing a cached loop several times is often less expensive than loading a single piece of data from a location that isn't cached. It can be very difficult to predict exactly what will cause a cache miss, but when using a profiler you don't have to predict - it makes the measurements for you.
Avoid Premature Optimization
For most projects, optimizing your code is best left until relatively late in the process. If you start optimizing too early, you may find that you spend a lot of time optimizing a feature that turns out to be relatively inexpensive compared to your program's other features. That said, there are some notable counterexamples - if you're building a large-scale database tool you might reasonably expect that performance is going to be an important selling point.

Runtime performance (speed) optimization -- Cache size consideration

What are the basic tips and tricks that a C++ programmer should know when trying to optimize his code in the context of Caching?
Here's something to think about:
For instance, I know that reducing a function's footprint would make the code run a bit faster since you would have less overall instructions on the processor's instruction register I.
When trying to allocate an std::array<char, <size>>, what would be the ideal size that could make your read and writes faster to the array?
How big can an object be to decide to put it on the heap instead of the stack?
In most cases, knowing the correct answer to your question will gain you less than 1% overall performance.
Some (data-)cache optimizations that come to my mind are:
For arrays: use less RAM. Try shorter data types or a simple compression algorithm like RLE. This can also save CPU at the same time, or in the opposite waste CPU cycles with data type conversions. Especially floating point to integer conversions can be quite expensive.
Avoid access to the same cacheline (usually around 64 bytes) from different threads, unless all access is read-only.
Group members that are often used together next to each other. Prefer sequential access to random access.
If you really want to know all about caches, read What Every Programmer Should Know About Memory. While I disagree with the title, it's a great in-depth document.
Because your question suggests that you actually expect gains from just following the tips above (in which case you will be disappointed), here are some general optimization tips:
Tip #1: About 90% of your code you should be optimized for readability, not performance. If you decide to attempt an optimization for performance, make sure you actually measure the gain. When it is below 5% I usually go back to the more readable version.
Tip #2: If you have an existing codebase, profile it first. If you don't profile it, you will miss some very effective optimizations. Usually there are some calls to time-consuming functions that can be completely eliminated, or the result cached.
If you don't want to use a profiler, at least print the current time in a couple of places, or interrupt the program with a debugger a couple of times to check where it is most often spending its time.

How do I segregate C++ code without impacting performance?

I'm having trouble refactoring my C++ code. The code itself is barely 200 lines, if even, however, being an image processing affair, it loops a lot, and the roadblocks I'm encoutering (I assume) deal with very gritty details (e.g. memory access).
The program produces a correct output, but is supposed to ultimately run in realtime. Initially, it took ~3 minutes per 320x240px frame, but it's at around 2 seconds now (running approximately as fast on mid-range workstation and low-end laptop hardware; red flag?). Still a far cry from 24 times per second, however. Basically, any change I make propagates through the millions of repetitions, and tracking my beginner mistakes has become exponentially more cumbersome as I approach the realtime mark.
At 2 points, the program calculates a less computationally expensive variant of Euclidean distance, called taxicab distance (the sum of absolute differences).
Now, the abridged version:
std::vector<int> positiveRows, positiveCols;
/* looping through pixels, reading values */
distance = (abs(pValues[0] - qValues[0]) + abs(pValues[1] - qValues[1]) + abs(pValues[2] - qValues[2]));
if(distance < threshold)
{
positiveRows.push_back(row);
positiveCols.push_back(col);
}
If I wrap the functionality, as follows:
int taxicab_dist(int Lp,
int ap,
int bp,
int Lq,
int aq,
int bq)
{
return (abs(Lp - Lq) + abs(ap - aq) + abs(bp - bq));
}
and call it from within the same .cpp file, there is no performance degradation. If I instead declare and define it in separate .hpp / .cpp files, I get a significant slowdown. This directly opposes what I've been told in my undergraduate courses ("including a file is the same as copy-pasting it"). The closest I've gotten to the original code's performance was by declaring the arguments const, but it still takes ~100ms longer, which my judgement says is not affordable for such a meager task. Then again, I don't see why it slows down (significantly) if I also make them const int&. Then, when I do the most sensible thing, and choose to take arrays as arguments, again I take a performance hit. I don't even dare attempt any templating shenanigans, or try making the function modify its behavior so that it accepts an arbitrary number of pairs, at least not until I understand what I've gotten myself into.
So my question is: how can take the calculation definition to a separate file, and have it perform the same as the original solution? Furthermore, should the fact that compilers are optimizing my program to run 2 seconds instead of 15 be a huge red flag (bad algorithm design, not using more exotic C++ keywords / features)?
I'm guessing the main reason why I've failed to find an answer is because I don't know what is the name of this stuff. I've heard the terms "vectorization" tossed around quite a bit in the HPC community. Would this be related to that?
If it helps in any way at all, the code it its entirety can be found here.
As Joachim Pileborg says, you should profile first. Find out where in your program most of the execution time occurs. This is the place where you should optimize.
Reserving space in vector
Vectors start out small and then reallocate as necessary. This involves allocating a larger space in memory and then copying the old elements to the new vector. Finally deallocating the memory. The std::vector has the capability of reserving space upon construction. For large sizes of vectors, this can be a time saver, eliminating many reallocations.
Compiling with speed optimizations
With modern compilers, you should set the optimizations for high speed and see what they can do. The compiler writers have many tricks up their sleeve and can often spot locations to optimize that you or I miss.
Truth is assembly language
You will need to view the assembly language listing. If the assembly language shows only two instructions in the area you think is the bottleneck, you really can't get faster.
Loop unrolling
You may be able to get more performance by copying the content in a for loop many times. This is called loop unrolling. In some processors, branch or jump instructions cost more execution time than data processing instructions. Unrolling a loop reduces the number of executed branch instructions. Again, the compiler may automatically perform this when you raise the optimization level.
Data cache optimization
Search the web for "Data cache optimization". Loading and unloading the data cache wastes time. If your data can fit into the processor's data cache, it doesn't have to keep loading an unloading (called cache misses). Also remember to perform all your operations on the data in one place before performing other operations. This reduces the likelihood of the processor reloading the cache.
Multi-processor computing
If your platform has more than one processor, such as a Graphics Processing Unit (GPU), you may be able to delegate some tasks to it. Be aware that you have also added time by communicating with the other processor. So for small tasks, the communications overhead may waste the time you gained by delegating.
Parallel computing
Similar to multi-processors, you can have the Operating System delegate the tasks. The OS could delegate to different cores in your processor (if you have a multi-core processor) or it runs it in another thread. Again there is a cost: overhead of managing the task or thread and communications.
Summary
The three rules of Optimization:
Don't
Don't
Profile
After you profile, review the area where the most execution takes place. This will gain you more time than optimizing a section that never gets called. Design optimizations will generally get you more time than code optimizations. Likewise, requirement changes (such as elimination) may gain you more time than design optimizations.
After your program is working correctly and is robust, you can optimize, only if warranted. If your UI is so slow that the User can go get a cup of coffee, it is a good place to optimize. If you gain 100 milliseconds by optimizing data transfer, but your program waits 1 second for the human response, you have not gained anything. Consider this as driving really fast to a stop sign. Regardless of your speed, you still have to stop.
If you still need performance gain, search the web for "Optimizations c++", or "data optimizations" or "performance optimization".

What happens if function inlining is too aggressive?

Each time I read about inline keyword in C++ there's a long explanation that the compiler makes a "speed versus code volume" analysis and then decided whether to inline a function call in each specific case.
Now Visual C++ 9 has a __forceinline keyword that seems to make the compiler inline the call to the function unless such inlining is absolutely impossible (like a call is virtual).
Suppose I look through some project without understanding what goes inside it and decide myself that one third of functions are small enough and good for inlining and mark them with __forceinline and the compiler does inline them and now the executable has become say one hundred times bigger.
Will it really matter? What effect should I expect from having functions inlined overly aggressively and having one hundred times bigger executable?
The main impact will be to the cache. Inlining goes against the principal of locality; the CPU will have to fetch the instructions from the main memory far more often. So what was intended to make the code faster may actually make it slower.
Others have already mentioned the impact on cache. There's another penalty to pay. Modern CPU's are quite fast, but at a price. They have deep pipelines of instructions being processed. To keep these pipelines filled even in the presence of conditional branches, fast CPUs use branch prediction. They record how often a branch was taken and use that to predict whether a branch will be taken in the future.
Obviously, this history takes memory, and it's a fixed size table. It contains only a limited number of branch instructions. By increasing the number of instructions a hundredfold, you also increase the number of branches by that much. This means the number of branches with predictions decreases sharply. In addition, for the branches that are present in the prediction table, less data is available.
Having a bigger executable is its own punishment:
1) It takes more memory to store the program which matters more on some systems than others (cell phones for example can have very limited memory)
2) Having a larger program takes longer to load into memory
3) During execution, you will likely have more cache misses (you try to branch to part of your program which isn't in cache) because your program is spread out over more space. This slows down your program.
It will load and run more slowly, and may run out of virtual address space (100 times bigger is pretty dire).
Less of your program will fit in the CPU caches, disk caches etc. and therefore more time will be wasted as the CPU sits idle waiting for that code to become available. It's as simple as that really.
Ah - I hadn't looked at who'd posted the question - sharptooth hey ? :-) - you obviously won't have learned anything from the answer above. But, that's all there is too it - it's just a statistical balancing act, with defaults doubtless shaped by the compiler writers based on customer pressure to explain both larger executable sizes and slower execution speeds when compared to other compiler vendors.
Interesting if dated lectures notes here: http://www.cs.utexas.edu/users/djimenez/utsa/cs3343/lecture15.html
This is somewhat a complex topic, and I think you should have a look at this C++ faq lite about inline
It explains that there is no simple solution, and there are many things to consider (but it all boils down to a good intuition anyway!)