function calling performance - c++

I have called snprintf a few times consecutively with different arguments, measuring the time taken by each call. I found that the first call to snprintf takes the longest; after that, the time needed to call the same function decreases until it converges. What is the reason for that? Other functions I have tried exhibit the same behavior.
I am asking because this relates to testing code performance. Normally, in the main program, the function would only be called periodically. However, when I test it separately, say in a loop, it runs faster, which makes the performance measurement inaccurate.
The first call takes over 4000 ns, the second about 1700 ns, the third about 800 ns, and after roughly ten calls the time settles at around 130 ns.
snprintf(buffer, 32, "%d", randomGeneratedNumber1);
snprintf(buffer, 32, "%d", randomGeneratedNumber2);
.
.
.

The most likely explanation is that the function's code ends up in the instruction cache after the first call, just as its input data (if there is any) ends up in the data cache. Furthermore, some branches may be predicted correctly the second time around.
So, all in all, "things have been cached".
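To see this warm-up in isolation, here is a minimal sketch (the loop count and buffer size are hypothetical) that times each call individually; the per-call time should converge as the code, data and branch-predictor state become hot:

#include <chrono>
#include <cstdio>

int main() {
    char buffer[32];
    for (int i = 0; i < 20; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        std::snprintf(buffer, sizeof buffer, "%d", i * 1234567);   // the call being measured
        auto t1 = std::chrono::steady_clock::now();
        std::printf("call %2d: %lld ns\n", i,
                    (long long)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    }
}

If you want your measurements to reflect the "cold" periodic calls of the real program, you have to measure them in that context; conversely, if you want steady-state numbers, discard the first few iterations as warm-up.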

Your program may be dynamically linked against the library containing snprintf(). The first-call delay would then be the time needed to load the library and resolve the symbol.

Search for TLB and cache. The short answer: in small code like this, the cache dominates the execution time. In large programs, besides the cache, memory pages may be swapped out and later swapped back in from disk to RAM. Code that is used often is not swapped out, so its execution time improves.

Related

Execution time overhead at index 2^21

What do I want to do?
I have written a program which reads data from binary files and does calculations based on the values read. Execution time is the most important factor for this program. To validate that my program operates within the specified time limits, I tried to log all the calculations by storing them in a std::vector<std::string>. After the time-critical execution is done, I write this vector to a file.
What is stored inside the vector?
In the vector I store the execution time (std::chrono::steady_clock::now()) and the current wall-clock time (std::chrono::system_clock::now(), formatted with Howard Hinnant's date.h).
What did I observe?
While analyzing the results I stumbled over the following pattern. Independent of the input data, the mean execution time of 0.003 ms per operation explodes to ~20 ms for a single operation at one specific, reproducible index. After that, the execution time of all operations goes back to 0.003 ms. The index of the spike is always 2097151. Since 2^21 equals 2097152, something happens at 2^21 that slows down the entire program. The same effect can be observed at 2^22 and 2^23, and even more interesting, the lag roughly doubles each time (2^21 = ~20 ms, 2^22 = ~43 ms, 2^23 = ~81 ms). I googled this specific number and the only thing I found was some node.js stuff which uses C++ under the hood.
What do I suspect?
At index 2^21 a memory area must be expanded, and that is why the delay occurs.
Questions
Is my assumption correct and the size of the vector is the problem?
How can I debug such a phenomenon? (To be certain that the vector really is the problem.)
Can I allocate enough memory beforehand to avoid the memory expansion?
What could I use instead of a std::vector to support more than 10,000,000,000 elements?
I was able to solve my problem by reserving memory with std::vector::reserve() before the time-critical part of my program. Thanks to all the comments.
Here the working code I used:
std::vector<std::string> myLogVector;
myLogVector.reserve(12000000);
//...do time critical stuff, without reallocating storage
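To confirm that reallocation is really what causes the spike (question 2), a small sketch (the element count and payload below are hypothetical) that logs every capacity change will show the jumps landing at the growth boundaries, because each reallocation has to move all existing elements:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> v;
    std::size_t lastCap = v.capacity();
    for (std::size_t i = 0; i < (1u << 22); ++i) {
        v.push_back("log entry");
        if (v.capacity() != lastCap) {              // a reallocation just happened
            lastCap = v.capacity();
            std::printf("reallocated at index %zu, new capacity %zu\n", i, lastCap);
        }
    }
}

With reserve() called up front, no capacity change (and hence no spike) should be reported inside the loop.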

Is there any workaround to "reserve" a cache fraction?

Assume I have to write a computationally intensive C or C++ function that has two arrays as input and one array as output. If the computation uses the two input arrays more often than it updates the output array, I'll end up in a situation where the output array seldom stays cached, because it is evicted in order to fetch the two input arrays.
I want to reserve one fraction of the cache for the output array and enforce somehow that those lines don't get evicted once they are fetched, in order to always write partial results in the cache.
Update1(output[]);   // Output gets cached
DoCompute1(input1[]); // Input 1 gets cached
DoCompute2(input2[]); // Input 2 gets cached
Update2(output[]); // Output is not in the cache anymore and has to get cached again
...
I know there are mechanisms to help eviction: clflush, clevict, _mm_clevict, etc. Are there any mechanisms for the opposite?
I am thinking of 3 possible solutions:
Using _mm_prefetch from time to time to fetch the data back if it has been evicted. However, this might generate unnecessary traffic, and I need to be very careful about when to introduce the prefetches;
Trying to do processing on smaller chunks of data. However this would work only if the problem allows it;
Disabling hardware prefetchers where that's possible to reduce the rate of unwanted evictions.
Other than that, is there any elegant solution?
Intel CPUs have something called No Eviction Mode (NEM) but I doubt this is what you need.
While you are attempting to optimise away the second (unnecessary) fetch of output[], have you considered using SSE2/3/4 registers to store your intermediate output values, updating them when necessary, and writing them back only when all updates related to that part of output[] are done?
I have done something similar while computing FFTs (Fast Fourier Transforms), where part of the output is kept in registers and moved out (to memory) only when it is known it will not be accessed anymore. Until then, all updates happen in the registers. You'll need inline assembly or compiler intrinsics to use the SSE* registers effectively. Of course, such optimisations are highly dependent on the nature of the algorithm and the data placement.
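The same idea can be expressed with compiler intrinsics rather than raw assembly. Below is a sketch of the pattern only (the placeholder multiply-add is not the FFT code, and the function name is made up): a block of four partial output values lives in an XMM register across all updates and is written to memory exactly once, assuming n is a multiple of 4:

#include <xmmintrin.h>  // SSE intrinsics

void accumulate_block(const float* in1, const float* in2, float* out4, int n) {
    __m128 acc = _mm_setzero_ps();                  // partial output stays in a register
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(in1 + i);
        __m128 b = _mm_loadu_ps(in2 + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));    // update happens in the register
    }
    _mm_storeu_ps(out4, acc);                       // single write-back when finished
}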
I am trying to get a better understanding of the question:
If it is true that the 'output' array is strictly for output, and you never do something like
output[i] = Foo(newVal, output[i]);
then all elements of output[] are write-only. If so, all you would ever need to 'reserve' is one cache line. Isn't that correct?
In this scenario, all writes to 'output' generate cache fills and could compete with the cache lines needed for the 'input' arrays.
Wouldn't you rather want a cap on the number of cache lines 'output' can consume, as opposed to reserving a certain number of lines?
I see two options, which may or may not work depending on the CPU you are targeting, and on your precise program flow:
If output is only written to and never read, you can use streaming stores, i.e., write instructions with a no-read hint, so the data will not be fetched into the cache.
You can use prefetching with the non-temporal (NTA) hint for the input. I don't know how this is implemented in general, but I know for sure that on some Intel CPUs (e.g., the Xeon Phi) each hardware thread uses one specific cache way for NTA data, i.e., with an 8-way cache, 1/8th of it per thread.
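On CPUs with SSE, both options can be sketched with intrinsics. The kernel below is hypothetical (the real computation is replaced by an add); it assumes out is 16-byte aligned and n is a multiple of 4, prefetches the inputs with the NTA hint, and writes the output with streaming stores so it never occupies cache lines:

#include <immintrin.h>

void compute(const float* in1, const float* in2, float* out, int n) {
    for (int i = 0; i < n; i += 4) {
        _mm_prefetch((const char*)(in1 + i + 16), _MM_HINT_NTA);  // non-temporal hint for input
        _mm_prefetch((const char*)(in2 + i + 16), _MM_HINT_NTA);
        __m128 a = _mm_loadu_ps(in1 + i);
        __m128 b = _mm_loadu_ps(in2 + i);
        __m128 r = _mm_add_ps(a, b);                // placeholder for the real computation
        _mm_stream_ps(out + i, r);                  // non-temporal store: bypasses the cache
    }
    _mm_sfence();                                   // make the streaming stores visible
}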
I guess the solution is hidden in the algorithm employed, the L1 cache size, and the cache-line size.
Though I am not sure how much performance improvement we will see with this.
We could probably introduce artificial reads that the compiler cannot optimize away and that do not hurt the computation during execution. A single artificial read fills one cache line, so enough of them are needed to cover one block of the output. The algorithm should therefore be modified to compute the output array in blocks, something like the blocked matrix multiplication used for huge matrices on GPUs: they use blocks of the matrices for computing and writing the result.
As pointed out earlier, the writes to the output array should happen as a stream.
To bring in the artificial reads, we could initialize the output array ahead of time at the right places, once in each block, probably with 0 or 1.

Execution time of functions decreases at runtime. (C++) Why?

For testing purposes I have written a piece of code for measuring the execution times of several fast operations in my real-time video processing code. Things are working fine and I am getting very realistic results, but I noticed one interesting peculiarity.
I am using the POSIX function clock_gettime with the CLOCK_MONOTONIC attribute, so I am getting timespecs with nanosecond precision (1/1000000000 sec), and it is said that getting a timespec value this way takes only a few processor ticks.
Here are the two functions that I am using for saving the timespecs. I also added the definitions of the data structures being used:
QVector<long> timeMemory;
QVector<std::string> procMemory;
timespec moment;

void VisionTime::markBegin(const std::string& action) {
    if (measure) {
        clock_gettime(CLOCK_MONOTONIC, &moment);
        procMemory.append(action + ";b");
        timeMemory.append(moment.tv_nsec);
    }
}

void VisionTime::markEnd(const std::string& action) {
    if (measure) {
        clock_gettime(CLOCK_MONOTONIC, &moment);
        procMemory.append(action + ";e");
        timeMemory.append(moment.tv_nsec);
    }
}
I am collecting the results into a couple of QVectors that are used later.
I noticed that when these two functions are executed for the first time (right after each other, with nothing between them), the difference between the two saved timespecs is ~34000 ns. The next time, the difference is about two times smaller, and so on. If I execute them hundreds of times, the average difference is ~2000 ns.
So on average a recurrent execution of these functions takes about 17x less time than the first one.
As I am taking hundreds of measurements in a row, it does not really matter to me that the first few executions last a little longer. But it still interests me: why is it that way?
I have various experience in Java, but I am quite new to C++ and do not know much about how things work here.
I am using the -O3 flag for the optimization level.
My QMake conf:
QMAKE_CXXFLAGS += -O3 -march=native
So, can anyone tell which part of this little code gets faster at runtime, how, and why? My suspicion falls on the appending to QVector. Does optimization affect this somehow?
It's my first question here on stackoverflow, hope it's not too long :) Many thanks for all your responses!
There are quite a few potential first-time costs in your measurement code; here are a couple and how you can test for them.
Memory allocation: Those QVectors won't have any memory allocated on the heap until the first time you use them.
Also, the vector will most likely start out by allocating a small amount of memory, then allocate exponentially more as you add more data (a standard compromise for containers like this). Therefore, you will have many memory allocations towards the beginning of your runtime, then the frequency will decrease over time.
You can verify that this is happening by looking at the return value of QVector::capacity(), and tune the behavior with QVector::reserve(int) - e.g. timeMemory.reserve(10000); and procMemory.reserve(10000); would reserve enough space for the first ten thousand measurements before your measurements begin.
Lazy symbol binding: the dynamic linker by default won't resolve symbols from Qt (or other shared libraries) until they are needed. So, if these measuring functions are the first place in your code where some QVector or std::string functions are called, the dynamic linker will need to do some one-time work to resolve those functions, which takes time.
If this is indeed the case, you can disable lazy binding by setting the environment variable LD_BIND_NOW=1 on Linux or DYLD_BIND_AT_LAUNCH=1 on Mac.
It is probably due to branch prediction. http://en.wikipedia.org/wiki/Branch_predictor

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck

gprof says that my compute-heavy app spends 53% of its time inside std::vector<...>::operator[](unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code's failure to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 52%. Should I try using dynamic arrays instead (the size remains constant in most cases)? Would that be likely to help with the possible bottlenecks?
Actually, my preferred solution would be to fix the bottleneck and keep the vectors as they are, for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?
Did you examine your memory access pattern itself? It might be inefficient - cache unfriendly.
Did you try using a raw pointer for the array access?
// regular place
for (int i = 0; i < arr.size(); ++i)
    wcout << arr[i];

// In bottleneck
int* pArr = &arr.front();
for (int i = 0; i < arr.size(); ++i)
    wcout << pArr[i];
I suspect that gprof prevents functions from being inlined. Try another profiling method. std::vector::operator[] cannot be the bottleneck because it doesn't differ much from raw array access. The SGI implementation is shown below:
reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }
You cannot trust gprof for profiling high-speed code; you should instead use a passive profiling method such as oprofile to get the real picture.
As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of once and checking how much the execution time increases). Note that this is, however, going to be influenced by cache issues, so YMMV.
The vector class is well liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.
If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.
EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:
Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).
The first cut showed that roughly 60% of the time was spent in vector operations, including indexing, appending, and removing. I replaced it with a similar vector class from MFC, and the time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)
Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.
Now roughly 60% of the time was spent adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists; the other was to pool used objects. The first reduced the time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).
Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).
Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).
I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
What I want you to take away is that if you need performance badly enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.

C++ 'small' optimization behaving strangely

I'm trying to optimize 'in the small' on a project of mine.
There's a series of array accesses that are individually tiny, but profiling has revealed that these array accesses are where the vast majority of my program is spending its time. So, time to make things faster, since the program takes about an hour to run.
I've moved the following type of access:
const float theValOld1 = image(x, y, z);
const float theValOld2 = image(x, y+1, z);
const float theValOld3 = image(x, y-1, z);
const float theValOld4 = image(x-1, y, z);
etc, for 28 accesses around the current pixel.
where image thunks down to
float image(const int x, const int y, const int z) const {
    return data[z*xsize*ysize + y*xsize + x];
}
and I've replaced it with
const int yindex = y*xsize;
const int zindex = z*xsize*ysize;
const float* thePtr = &(data[z*xsize*ysize + y*xsize + x]);
const float theVal1 = *(thePtr);
const float theVal2 = *(thePtr + yindex);
const float theVal3 = *(thePtr - yindex);
const float theVal4 = *(thePtr - 1);
etc, for the same number of operations.
I would expect that, if the compiler were totally awesome, this change would do nothing to the speed. If the compiler is not awesome, then I'd say the second version should be faster, if only because I'm avoiding the implicit pointer arithmetic that comes with the [] thunk, as well as removing the multiplications for the y and z indices.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
To measure performance, I'm using QueryPerformanceCounter.
What's odd to me is that the order of operations matters!
If I leave the operations as described and compare the timings (as well as the results, to make sure that the same value is calculated after optimization), then the older code takes about 45 ticks per pixel and the new code takes 10 ticks per pixel. If I reverse the operations, then the old code takes about 14 ticks per pixel and the new code takes about 30 ticks per pixel (with lots of noise in there, these are averages over about 100 pixels).
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
Corollary: To compare speed, I'm supposing that the right way is to run the two versions independently of one another and then compare the results from different runs. I'd like the two side-by-side comparisons to make sense, but there's obviously something happening here that prevents that. Is there a way to salvage this side-by-side run to get a reasonable speed comparison from a single run, so I can also (easily) make sure that the results are identical?
EDIT: To clarify.
I have both new and old code in the same function, so I can make sure that the results are identical.
If I run old code and then new code, new code runs faster than old.
If I run new code and then old code, old code runs faster than new.
The z hit is required by the math, and the if statement cannot be removed, and is present in both. For the new code, I've just moved more z-specific code into the z section, and the test code I'm using is 100% 2D. When I move to 3D testing, then I'm sure I'll see more of the effect of branching.
You may (possibly) be running into some sort of readahead or cacheline boundary issue. Generally speaking, when you load a single value and it isn't "hot" (in cache), the CPU will pull in a cache line (32, 64, or 128 bytes are pretty typical, depending on processor). Subsequent reads to the same line will be much faster.
If you change the order of operations, you may just be seeing stalls due to how the lines are being loaded and evicted.
The best way to figure something like this out is to open "Disassembly" view and spend some quality time with your processor's reference manual.
If you're lucky, the changes that the code reordering causes will be obvious (the compiler may be generating extra instructions or branches). Less lucky, it will be a stall somewhere in the processor -- during the decode pipeline or due to a memory fetch...
A good profiler that can count stalls and cache misses may help here too (AMD has CodeAnalyst, for example).
If you're not under a time crunch, it's really worthwhile to dive into the disasm -- at the very least, you'll probably end up learning something you didn't know before about how your CPU, machine architecture, compiler, libraries, etc work. (I almost always end up going "huh" when studying disasm.)
If both the new and old versions run on the same data array, then yes, the last run will almost certainly get a speed bump due to caching. Even if the code is different, it'll be accessing data that was already touched by the previous version, so depending on data size, it might be in L1 cache, will probably be in L2 cache, and if a L3 cache exists, almost certainly in that. There'll probably also be some overlap in the code, meaning that the instruction cache will further boost performance of the second version.
A common way to benchmark is to run the algorithm once first, without timing it, simply to ensure that everything that will be cached is cached, and then run it again a large number of times with timing enabled. (Don't trust a single execution unless it takes at least a second or two; otherwise small variations in system load, cache state, OS interrupts or page faults can make the measured time vary.) To eliminate the noise, measure the combined time taken for several runs of the algorithm, obviously with no output in between. The fact that you're seeing spikes of 3x the usual time means that you're measuring at far too fine-grained a level, which basically makes your timings useless.
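A minimal sketch of that pattern (the runAlgorithm() function and the iteration count are placeholders, not your actual code):

#include <chrono>
#include <cstdio>

void runAlgorithm();  // assumed to exist elsewhere

void benchmark() {
    runAlgorithm();   // untimed warm-up: fills caches, faults in pages, resolves symbols

    const int iterations = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        runAlgorithm();                       // no I/O inside the timed region
    auto t1 = std::chrono::steady_clock::now();

    auto totalUs = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("average: %lld us per run over %d runs\n",
                (long long)(totalUs / iterations), iterations);
}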
Why should the order matter? Is there caching or something happening? The variables are all named different things, so I wouldn't think that would matter. If there is some caching happening, is there any way I can take advantage of it from pixel to pixel?
The naming doesn't matter. When the code is compiled, variables are translated into memory addresses or register id's. But when you run through your image array, you're loading it all into CPU cache, so it can be read faster the next time you run through it.
And yes, you can and should take advantage of it.
The computer tries very hard to exploit spatial and temporal locality -- that is, if you access a memory address X at time T, it assumes that you're going to need address X+1 very soon (spatial locality), and that you'll probably also need X again, at time T+1 (temporal locality). It tries to speed up those cases in every way possible (primarily by caching), so you should try to exploit it.
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
I don't know where you placed that if statement, but if it's in a frequently evaluated block of code, the cost of the branch might hurt you more than you're saving. Branches can be expensive, and they inhibit the compiler's and CPU's ability to reorder and schedule instructions. So you may be better off without it. You should probably do this as a separate optimization that can be benchmarked in isolation.
I don't know which algorithm you're implementing, but I'm guessing you need to do this for every pixel?
If so, you should try to cache your lookups. Once you've got image(x, y, z), that'll be the next pixel's image(x+1, y, z), so cache it in the loop so the next pixel won't have to look it up from scratch. That would potentially allow you to reduce your 9 accesses in the X/Y plane down to three (use 3 cached values from the last iteration, 3 from the one before it, and 3 we just loaded in this iteration); see the sketch below.
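Here is one way to write that sliding window for a 3x3 neighbourhood within a single z slice. This is a sketch, not your code: it assumes row-major storage (data[y*xsize + x]), uses a placeholder averaging computation, and expects the caller to keep y in [1, ysize-2]. Each step along x loads only the three new values in the right-hand column and reuses the six values loaded on the previous iterations:

void stencil_row(const float* data, float* out, int xsize, int y) {
    const float* rowAbove = data + (y - 1) * xsize;
    const float* rowMid   = data + y * xsize;
    const float* rowBelow = data + (y + 1) * xsize;

    // Prime the first two columns (x = 0 and x = 1).
    float a0 = rowAbove[0], m0 = rowMid[0], b0 = rowBelow[0];
    float a1 = rowAbove[1], m1 = rowMid[1], b1 = rowBelow[1];

    for (int x = 1; x < xsize - 1; ++x) {
        // Only three new loads per pixel: the column at x + 1.
        float a2 = rowAbove[x + 1], m2 = rowMid[x + 1], b2 = rowBelow[x + 1];

        // Placeholder computation: average of the 3x3 neighbourhood.
        out[y * xsize + x] = (a0 + a1 + a2 + m0 + m1 + m2 + b0 + b1 + b2) / 9.0f;

        // Slide the window: the middle and right columns become left and middle.
        a0 = a1; m0 = m1; b0 = b1;
        a1 = a2; m1 = m2; b1 = b2;
    }
}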
If you're updating the value of each pixel as a result of its neighbors values, a better approach may be to run the algorithm in a checkerboard pattern. Update every other pixel in the first iteration, using only values from their neighbors (which you're not updating), and then run a second pass where you update the pixels you read from before, based on the values of the pixels you updated before. This allows you to eliminate dependencies between neighboring pixels, so their evaluation can be pipelined and parallelized efficiently.
In the loop that performs all the lookups, unroll it a few times, and try to place all the memory reads at the top, and all the computations further down, to give the CPU a chance to overlap the two (since data reads are a lot slower, get them started, and while they're running, the CPU will try to find other instructions it can evaluate).
For any constant values, try to precompute them as much as possible (rather than z*xsize*ysize, precompute xsize*ysize and multiply z by the result of that).
Another thing that may help is to prefer local variables over globals or class members. You may gain something simply by making local copies, at the start of the function, of the class members you're going to need. The compiler can always optimize the extra variables out again if it wants to, but you make it clear that it shouldn't worry about underlying changes to the object state (which might otherwise force it to reload the members every time you access them).
And finally, study the generated assembly in detail. See where it's performing unnecessary store/loads, where operations are being repeated even though they could be cached, and where the ordering of instructions is inefficient, or where the compiler fails to inline as much as you'd hoped.
I honestly wouldn't expect your changes to the lookup function to have much effect though. An array access with the operator[] is easily convertible to the equivalent pointer arithmetic, and the compiler can optimize that pretty efficiently, as long as the offsets you're adding don't change.
Usually, the key to low-level optimizations is, somewhat ironically, not to look at individual lines of code, but at whole functions, and at loops. You need a certain amount of instructions in a block so you have something to work with, since a lot of optimizations deal with breaking dependencies between chains of instructions, reordering to hide instruction latency, and with caching individual values to avoid memory load/stores. That's almost impossible to do on individual array lookups, but there's almost certainly a lot gained if you consider a couple of pixels at a time.
Of course, as with almost all microoptimizations, there are no always true answers. Some of the above might be useful to you, or they might not.
It would also help if you told us more about the access pattern: which pixels are you accessing, is there any required order, and are you just reading, or writing as well? If writing, when and where are the updated values used?
If you give us a bit more information, we'll be able to offer much more specific (and more likely to be effective) suggestions.
When optimising, examining the data access pattern is essential.
for example, assuming a width of 240 and a pixel at <x,y,z> = 10,10,0, the original access pattern gives you:
a. data[0 + 10*240 + 10] -> data[2410]
b. data[0 + 11*240 + 10] -> data[2650]
c. data[0 +  9*240 + 10] -> data[2170]
d. data[0 + 10*240 +  9] -> data[2409]
Notice that the indices are in arbitrary order.
The memory controller makes aligned accesses to main memory to fill the cache lines. If you order your operations so that the accesses are to ascending memory addresses (e.g. c, d, a, b), then the memory controller will be able to stream the data into the cache lines.
Missing the cache on a read is expensive, as the access has to search down the cache hierarchy all the way to main memory, and a main memory access can be 100x slower than the cache. Minimising main memory accesses will improve the speed of your operation.
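As a sketch of what "ascending order" means for the rewritten code in the question (reusing its thePtr/yindex naming, which is assumed here rather than taken from the real kernel), the in-plane reads could be issued like this:

// Read the neighbours and the centre in ascending address order so the
// memory system sees a forward-moving access stream.
float sum_neighbours(const float* thePtr, int yindex) {
    const float below = *(thePtr - yindex);  // lowest address:  (x, y-1, z)
    const float left  = *(thePtr - 1);       //                  (x-1, y, z)
    const float mid   = *(thePtr);           //                  (x,   y, z)
    const float right = *(thePtr + 1);       //                  (x+1, y, z)
    const float above = *(thePtr + yindex);  // highest address: (x, y+1, z)
    return below + left + mid + right + above;
}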
To make it even more lopsided, I've moved the z operations into their own section that only gets hit if zindex != 0, so effectively, the second version only has 9 accesses. So by that metric, the second version should definitely be faster.
Did you actually measure that? Because I'd be pretty surprised if that were true. An if statement in the inner loop of your program can add a surprising amount of overhead -- see Is "IF" expensive?. I'd be willing to bet that the overhead of the extra multiply is a lot less than the overhead of the branching, unless z happens to be zero 99% of the time.
What's odd to me is that the order of operations matters!
The order of what operations? It's not clear to me what you're reordering here. Please give some more snippets of what you're trying to do.