How to find the runtime efficiency of C++ code

I'm trying to find the efficiency of a program which I have recently posted on Stack Overflow:
How to efficiently delete elements from a vector given an another vector
To compare the efficiency of my code with the other answers, I'm using a chrono object.
Is it a correct way to check the runtime efficiency?
If not, then kindly suggest a way to do it, with an example.
Code on Coliru
#include <iostream>
#include <vector>
#include <algorithm>
#include <chrono>
#include <ctime>
using namespace std;
void remove_elements(vector<int>& vDestination, const vector<int>& vSource)
{
    if(!vDestination.empty() && !vSource.empty())
    {
        for(auto i : vSource) {
            vDestination.erase(std::remove(vDestination.begin(), vDestination.end(), i), vDestination.end());
        }
    }
}

int main() {
    vector<int> v1 = {1, 2, 3};
    vector<int> v2 = {4, 5, 6};
    vector<int> v3 = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    remove_elements(v3, v1);
    remove_elements(v3, v2);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() << std::endl;
    for(auto i : v3)
        cout << i << endl;
    return 0;
}
Output
Time difference = 1472
7
8
9

Is it a correct way to check the runtime efficiency?
It looks like this is not the best way to do it. I see the following flaws in your method (a sketch addressing several of them follows the list):
Values are sorted. Branch prediction may expose ridiculous effects when testing sorted vs. unsorted values with the same algorithm. Possible fix: test on both sorted and unsorted and compare results.
Values are hard-coded. The CPU cache is a tricky thing and it may introduce subtle differences between tests on hard-coded values and real-life ones. In the real world you are unlikely to perform these operations on hard-coded values, so you may either read them from a file or generate random ones.
Too few values. Your code's execution time is much smaller than the timer precision.
You run the code only once. If you fix all other issues and run the code twice, the second run may be much faster than the first one due to cache warm-up: subsequent runs tend to have fewer cache misses than the first one.
You run the code once on fixed-size data. It would be better to run otherwise correct tests at least four times to cover a Cartesian product of the following parameters:
sorted vs. unsorted data
v3 fits CPU cache vs. v3 size exceeds CPU cache. Also consider cases when (v1.length() + v3.length()) * sizeof(int) fits the cache or not, (v1.length() + v2.length() + v3.length()) * sizeof(int) fits the cache or not and so on for all combinations.
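For illustration only, here is a sketch of a harness along those lines. The sizes, the random values, and the repeat count are arbitrary choices of mine and would need tuning; it reuses the remove_elements function from the question.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

// remove_elements() as defined in the question goes here.

int main()
{
    std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 1000000);

    for (bool sorted : {false, true})
    {
        // Larger, randomly generated inputs instead of a few hard-coded values.
        std::vector<int> source(1000);
        std::vector<int> destination(100000);
        std::generate(source.begin(), source.end(), [&] { return dist(gen); });
        std::generate(destination.begin(), destination.end(), [&] { return dist(gen); });
        if (sorted)
        {
            std::sort(source.begin(), source.end());
            std::sort(destination.begin(), destination.end());
        }

        // Several runs: the first one warms the caches, later ones are more stable.
        for (int run = 0; run < 5; ++run)
        {
            std::vector<int> work = destination;   // fresh copy for every run
            auto begin = std::chrono::steady_clock::now();
            remove_elements(work, source);
            auto end = std::chrono::steady_clock::now();
            std::cout << (sorted ? "sorted" : "unsorted") << " run " << run << ": "
                      << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count()
                      << " us\n";
        }
    }
}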

The biggest issues with your approach are:
1) The code you're testing is too short and predictable. You need to run it at least a few thousand times so that there are at least a few hundred milliseconds between measurements. And you need to make the data set larger and less predictable. In general, CPU caches really make accurate measurements based on synthetic input data a PITA.
2) The compiler is free to reorder your code. In general it's quite difficult to ensure the code you're timing will execute between calls to check the time (and nothing else, for that matter). On the one hand, you could dial down optimization, but on the other you want to measure optimized code.
One solution is to turn off whole-program optimization and put timing calls in another compilation unit.
Another possible solution is to use a memory fence around your test, e.g.
std::atomic_thread_fence(std::memory_order_seq_cst);
(requires #include <atomic> and a C++11-capable compiler).
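A sketch of how that might look around the timed region from the question; placing the fences around the clock reads, as below, is my interpretation of the suggestion:
std::atomic_thread_fence(std::memory_order_seq_cst);
auto begin = std::chrono::steady_clock::now();
std::atomic_thread_fence(std::memory_order_seq_cst);

remove_elements(v3, v1);
remove_elements(v3, v2);

std::atomic_thread_fence(std::memory_order_seq_cst);
auto end = std::chrono::steady_clock::now();
std::atomic_thread_fence(std::memory_order_seq_cst);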
In addition, you might want to supplement your measurements with profiler data to see how efficiently the L1/2/3 caches are used, memory bottlenecks, the instruction retire rate, etc. Unfortunately the best tool for that on Intel x86 is commercial (VTune), but on AMD x86 a similar tool is free (CodeXL).

You might consider using a benchmarking library like Celero to do the measurements for you and deal with the tricky parts of performance measurements, while you remain focused on the code you're trying to optimize. More complex examples are available in the code I've linked in the answer to your previous question (How to efficiently delete elements from a vector given an another vector), but a simple use case would look like this:
BENCHMARK(VectorRemoval, OriginalQuestion, 100, 1000)
{
    std::vector<int> destination(10000);
    std::vector<int> source;
    std::generate(destination.begin(), destination.end(), std::rand);
    std::sample(destination.begin(), destination.end(), std::back_inserter(source),
                100, std::mt19937{std::random_device{}()});
    for (auto i : source)
        destination.erase(std::remove(destination.begin(), destination.end(), i),
                          destination.end());
}
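For completeness, and as an assumption on my part since the snippet above is only the benchmark body: std::sample is C++17, and if I remember the Celero API correctly you also need its header and a generated main, something like:
#include <celero/Celero.h>   // BENCHMARK, CELERO_MAIN
#include <algorithm>         // std::generate, std::remove, std::sample (C++17)
#include <cstdlib>           // std::rand
#include <iterator>          // std::back_inserter
#include <random>            // std::mt19937, std::random_device
#include <vector>

CELERO_MAIN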

Related

Using one loop vs two loops

I was reading this blog: https://developerinsider.co/why-is-one-loop-so-much-slower-than-two-loops/. I decided to check it out using C++ and Xcode, so I wrote the simple program given below, and when I executed it I was surprised by the result: the second function was actually slower than the first, contrary to what is stated in the article. Can anyone please help me figure out why this is the case?
#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;
void function1() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for(int j = 0; j < n; j++){
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for(int j = 0; j < n; j++){
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}

void function2() {
    const int n = 100000;
    int a1[n], b1[n], c1[n], d1[n];
    for(int j = 0; j < n; j++){
        a1[j] = 0;
        b1[j] = 0;
        c1[j] = 0;
        d1[j] = 0;
    }
    auto start = high_resolution_clock::now();
    for(int j = 0; j < n; j++){
        a1[j] += b1[j];
    }
    for(int j = 0; j < n; j++){
        c1[j] += d1[j];
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);
    std::cout << duration.count() << " Microseconds." << std::endl;
}

int main(int argc, const char * argv[]) {
    function1();
    function2();
    return 0;
}
TL;DR: The loops are basically the same, and if you are seeing differences, then your measurement is wrong. Performance measurement and more importantly, reasoning about performance requires a lot of computer knowledge, some scientific rigor, and much engineering acumen. Now for the long version...
Unfortunately, there is some very inaccurate information in the article to which you've linked, as well as in the answers and some comments here.
Let's start with the article. There won't be any disk caching that has any effect on the performance of these functions. It is true that virtual memory is paged to disk, when demand on physical memory exceeds what's available, but that's not a factor that you have to consider for programs that touch 1.6MB of memory (4 * 4 * 100K).
And if paging comes into play, the performance difference won't exactly be subtle either. If these arrays were paged to disk and back, the performance difference would be in order of 1000x for fastest disks, not 10% or 100%.
Paging and page faults and its effect on performance is neither trivial, nor intuitive. You need to read about it, and experiment with it seriously. What little information that article has is completely inaccurate to the point of being misleading.
The second is your profiling strategy and the micro-benchmark itself. Clearly, with such simple operations on the data (an add), the bottleneck will be memory bandwidth itself (maybe instruction retire limits or something like that with such a simple loop). And since you only read memory linearly, and use all you read, whether it's in 4 interleaving streams or 2, you are making use of all the bandwidth that is available.
However, if you call your function1 or function2 in a loop, you will be measuring the bandwidth of different parts of the memory hierarchy depending on N, from L1 all the way to L3 and main memory. (You should know the size of all levels of cache on your machine, and how they work.) This is obvious if you know how CPU caches work, and really mystifying otherwise. Do you want to know how fast this is when you do it the first time, when the arrays are cold, or do you want to measure the hot access?
Is your real use case copying the same mid-sized array over and over again?
If not, what is it? What are you benchmarking? Are you trying to measure something or just experimenting?
Shouldn't you be measuring the fastest run through a loop, rather than the average since that can be massively affected by a (basically random) context switch or an interrupt?
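For example, a minimal sketch of my own (not from the article) of keeping the fastest of many repetitions instead of a single measurement:
#include <chrono>
#include <cstdint>
#include <limits>

// Times f() `repeats` times and returns the fastest observed run in nanoseconds.
template <typename F>
std::int64_t best_of(int repeats, F&& f)
{
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r)
    {
        auto begin = std::chrono::steady_clock::now();
        f();
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
        if (ns < best)
            best = ns;
    }
    return best;
}

// Usage: best_of(1000, [] { /* code under test */ });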
Have you made sure you are using the correct compiler switches? Have you looked at the generated assembly code to make sure the compiler is not adding debug checks and what not, and is not optimizing stuff away that it shouldn't (after all, you are just executing useless loops, and an optimizing compiler wants nothing more than to avoid generating code that is not needed).
Have you looked at the theoretical memory/cache bandwidth numbers for your hardware? Your specific CPU and RAM combination will have theoretical limits. And be it 5, 50, or 500 GiB/s, it will give you an upper bound on how much data you can move around and work with. The same goes for the number of execution units, the IPC of your CPU, and a few dozen other numbers that will affect the performance of this kind of micro-benchmark.
If you are reading 4 integers (4 bytes each, from a, b, c, and d) and then doing two adds and writing the two results back, and doing it 100'000 times, then you are - roughly - looking at 2.4MB of memory read and write. If you do it 10 times in 300 micro-seconds, then your program's memory (well, store buffer/L1) throughput is about 80 GB/s. Is that low? Is that high? Do you know? (You should have a rough idea.)
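Spelling that estimate out: each iteration reads four 4-byte ints (16 bytes) and writes two back (8 bytes), roughly 24 bytes; 24 bytes x 100'000 iterations is about 2.4 MB per call, and 10 calls in 300 microseconds is 24 MB / 0.0003 s, i.e. roughly 80 GB/s.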
And let me tell you that the other two answers here at the time of this writing (namely this and this) do not make sense. I can't make heads nor tails of the first one, and the second one is almost completely wrong (conditional branches in a 100'000-iteration for loop are bad? allocating an additional iterator variable is costly? cold access to an array on the stack vs. on the heap has "serious performance implications"?)
And finally, as written, the two functions have very similar performance. It is really hard to separate the two, and unless you can measure a real difference in a real use case, I'd say write whichever one makes you happier.
If you really really want a theoretical difference between them, I'd say the one with two separate loops is very slightly better because it is usually not a good idea interleaving access to unrelated data.
This has nothing to do with caching or instruction efficiency. Simple iterations over long vectors are purely a matter of bandwidth. (Google: stream benchmark.) And modern CPUs have enough bandwidth to satisfy, if not all of their cores, then at least a good fraction of them.
So if you combine the two loops, executing them on a single core, there is probably enough bandwidth for all loads and stores at the rate that memory can sustain. But if you use two loops, you leave bandwidth unused, and the runtime will be a little less than double.
The reason the second function is faster in your case (I do not think this holds on every machine) is better CPU caching. At the point at which your CPU no longer has enough cache to store the arrays, plus whatever your OS requires and so on, the second function will probably be much slower than the first from a performance standpoint.
I doubt that the two-loop code will give better performance if there are enough other programs running as well, because the second function has obviously worse efficiency than the first, and if there is enough other data competing for the cache, the performance lead gained through caching will be eliminated.
I'll just chime in here with a little something to keep in mind when looking into performance - unless you are writing embedded software for a real-time device, the performance of such low level code as this should not be a concern.
In 99.9% of all other cases, they will be fast enough.

Fast hash function for long string key

I am using an extendible hash and I want to have strings as keys. The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad for the program's performance since the hash function is called multiple times especially when I am splitting buckets.
Current hash function
int hash(const string& key)
{
    int seed = 131;
    unsigned long hash = 0;
    for(unsigned i = 0; i < key.length(); i++)
    {
        hash = (hash * seed) + key[i];
    }
    return hash;
}
The keys could be as long as 40 characters.
Example of string/key
string key = "from-to condition"
I have searched over the internet for a better one but I didn't find anything to match my case. Any suggestions?
You should prefer to use std::hash unless measurement shows that you can do better. To limit the number of characters it uses, use something like:
const auto limit = std::min<std::size_t>(key.length(), 16);
for(std::size_t i = 0; i < limit; i++)
You will want to experiment to find the best value to use in place of the 16.
I would actually expect the performance to get worse (because you will have more collisions). If your strings were several KB long, then limiting the hash to the first 64 bytes might well be worthwhile.
Depending on your strings, it might be worth starting not at the beginning. For example, hashing filenames you would probably do better using the characters between 20 and 5 from the end (ignore the often constant pathname prefix, and the file extension). But you still have to measure.
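To make that concrete, here is a hypothetical sketch of my own (assuming C++17 for std::string_view) of hashing only a window near the end of the key; the offsets 20 and 5 are just the example values from above and would need measuring.
#include <cstddef>
#include <functional>
#include <string_view>

// Hash only the characters between 20 and 5 from the end of the key
// (useful when keys share a common prefix and a common suffix).
std::size_t window_hash(std::string_view key)
{
    const std::size_t len = key.length();
    const std::size_t first = len > 20 ? len - 20 : 0;    // start 20 characters from the end
    const std::size_t last  = len > 5  ? len - 5  : len;  // stop 5 characters from the end
    return std::hash<std::string_view>{}(key.substr(first, last - first));
}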
I am using an extendible hash and I want to have strings as keys.
As mentioned before, use std::hash until there is a good reason not to.
The problem is that the current hash function that I am using iterates over the whole string/key and I think that this is pretty bad...
It's an understandable thought, but is actually unlikely to be a real concern.
(anticipating) why?
A quick scan over stack overflow will reveal many experienced developers talking about caches and cache lines.
(forgive me if I'm teaching my grandmother to suck eggs)
A modern CPU is incredibly quick at processing instructions and performing (very complex) arithmetic. In almost all cases, what limits its performance is having to talk to memory across a bus, which is by comparison, horribly slow.
So chip designers build in memory caches - extremely fast memory that sits in the CPU (and therefore does not have to be communicated with over a slow bus). Unhappily, there is only so much space available [plus heat constraints - a topic for another day] for this cache memory so the CPU must treat it like an OS does a disk cache, flushing memory and reading in memory as and when it needs to.
As mentioned, communicating across the bus is slow - (simply put) it requires all the electronic components on the motherboard to stop and synchronise with each other. This wastes a horrible amount of time [This would be a fantastic point to go into a discussion about the propagation of electronic signals across the motherboard being constrained by approximately half the speed of light - it's fascinating but there's only so much space here and I have only so much time]. So rather than transfer one byte, word or longword at a time, the memory is accessed in chunks - called cache lines.
It turns out that this is a good decision by chip designers because they understand that most memory is accessed sequentially - because most programs spend most of their time accessing memory linearly (such as when calculating a hash, comparing strings or objects, transforming sequences, copying and initialising sequences and so on).
What's the upshot of all this?
Well, bizarrely, if your string is not already in cache, it turns out that reading one byte of it is almost exactly as expensive as reading all the bytes in the first (say) 128 bytes of it.
Plus, because the cache circuitry assumes that memory access is linear, it will begin a fetch of the next cache line as soon as it has fetched your first one. It will do this while your CPU is performing its hash computation.
I hope you can see that in this case, even if your string were many thousands of bytes long and you chose to hash only (say) every 128th byte, all you would be doing is computing a very much inferior hash while still causing the memory cache to stall the processor while it fetches large chunks of unused memory. It would take just as long - for a worse result!
Having said that, what are good reasons not to use the standard implementation?
Only when:
The users are complaining that your software is too slow to be useful, and
The program is verifiably CPU-bound (using 100% of CPU time), and
The program is not wasting any cycles by spinning, and
Careful profiling has revealed that the program's biggest bottleneck is the hash function, and
Independent analysis by another experienced developer confirms that there is no way to improve the algorithm (for example by calling hash less often).
In short, almost never.
You can directly use std::hash instead of implementing your own function.
#include <iostream>
#include <functional>
#include <string>
size_t hash(const std::string& key)
{
    std::hash<std::string> hasher;
    return hasher(key);
}

int main() {
    std::cout << hash("abc") << std::endl;
    return 0;
}
See this code here: https://ideone.com/4U89aU

How to demonstrate the impact of instruction cache limitations

My original idea was to give an elegant code example that would demonstrate the impact of instruction cache limitations. I wrote the following piece of code, which creates a large number of identical functions using template metaprogramming.
volatile int checksum;
void (*funcs[MAX_FUNCS])(void);

template <unsigned t>
__attribute__ ((noinline)) static void work(void) { ++checksum; }

template <unsigned t>
static void create(void) { funcs[t - 1] = &work<t - 1>; create<t - 1>(); }

template <> void create<0>(void) { }

int main()
{
    create<MAX_FUNCS>();
    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        for (unsigned i = 0; i < WORKLOAD; ++i)
        {
            funcs[i % range]();
        }
    }
    return 0;
}
The outer loop varies the amount of different functions to be called using a jump table. For each loop pass, the time taken to invoke WORKLOAD functions is then measured. Now what are the results? The following chart shows the average run time per function call in relation to the used range. The blue line shows the data measured on a Core i7 machine. The comparative measurement, depicted by the red line, was carried out on a Pentium 4 machine. Yet when it comes to interpreting these lines, I seem to be somehow struggling...
The only jumps of the piecewise-constant red curve occur exactly where the total memory consumption of all functions within the range exceeds the capacity of one cache level on the tested machine, which has no dedicated instruction cache. For very small ranges (below 4 in this case), however, run time still increases with the number of functions. This may be related to branch prediction efficiency, but since every function call reduces to an unconditional jump in this case, I'm not sure if there should be any branching penalty at all.
The blue curve behaves quite differently. Run time is constant for small ranges and increases logarithmically thereafter. Yet for larger ranges, the curve seems to be approaching a constant asymptote again. How exactly can the qualitative differences between the two curves be explained?
I am currently using GCC MinGW Win32 x86 v.4.8.1 with g++ -std=c++11 -ftemplate-depth=65536 and no compiler optimization.
Any help would be appreciated. I am also interested in any idea on how to improve the experiment itself. Thanks in advance!
First, let me say that I really like how you've approached this problem, this is a really neat solution for intentional code bloating. However, there might still be several possible issues with your test -
You also measure the warm-up time. You didn't show where you've placed your time checks, but if it's just around the internal loop, then until you reach range/2 for the first time you'd still enjoy the warm-up from the previous outer iteration. Instead, measure only warm performance: run each internal iteration several times (add another loop in the middle), and take the timestamp only after 1-2 rounds.
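As a sketch of that suggestion (ROUNDS and WARMUP_ROUNDS are constants I made up; funcs, MAX_FUNCS and WORKLOAD come from the code in the question):
#include <chrono>
#include <iostream>

int main()
{
    create<MAX_FUNCS>();
    for (unsigned range = 1; range <= MAX_FUNCS; range *= 2)
    {
        checksum = 0;
        std::chrono::steady_clock::time_point begin{};
        for (unsigned round = 0; round < ROUNDS; ++round)
        {
            if (round == WARMUP_ROUNDS)    // start timing only after the warm-up rounds
                begin = std::chrono::steady_clock::now();
            for (unsigned i = 0; i < WORKLOAD; ++i)
                funcs[i % range]();
        }
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
        std::cout << range << ": "
                  << double(ns) / double(ROUNDS - WARMUP_ROUNDS) / double(WORKLOAD)
                  << " ns per call\n";
    }
    return 0;
}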
You claim to have measured several cache levels, but your L1 cache is only 32k, which is where your graph ends. Even assuming this counts in terms of "range", each function is ~21 bytes (at least on my gcc 4.8.1), so you'll reach at most 256KB, which is only then scratching the size of your L2.
You didn't specify your CPU model (the i7 has at least 4 generations on the market now: Haswell, Ivy Bridge, Sandy Bridge and Nehalem). The differences are quite large, for example an additional uop cache since Sandy Bridge, with complicated storage rules and conditions. Your baseline is also complicating things: if I recall correctly, the P4 had a trace cache, which might also cause all sorts of performance impacts. You should check for an option to disable them if possible.
Don't forget the TLB - even though it probably doesn't play a role here in such a tightly organized code, the number of unique 4k pages should not exceed the ITLB (128 entries), and even before that you may start having collisions if your OS did not spread the physical code pages well enough to avoid ITLB collisions.

Why is memory access so slow?

I've got a very strange performance issue related to memory access.
The code snippet is:
#include <vector>
using namespace std;

vector<int> arrx(M,-1);
vector< vector<int> > arr(N,arrx);

...

for(i = 0; i < N; i++){
    for(j = 0; j < M; j++){
        //>>>>>>>>>>>> Part 1 <<<<<<<<<<<<<<
        // Simple arithmetic operations
        int n1 = 1 + 2; // does not matter what (actually more complicated)
        // Integer assignment, without access to array
        int n2 = n1;
        //>>>>>>>>>>>> Part 2 <<<<<<<<<<<<<<
        // This turns out to be most expensive part
        arr[i][j] = n1;
    }
}
N and M are some constants of the order of 1000-10000 or so.
When I compile this code (release version) it takes approximately 15 clocks to finish if Part 2 is commented out. With this part the execution time goes up to 100+ clocks, so it is almost 10 times slower. I expected the assignment operation to be much cheaper than even simple arithmetic operations. This is actually true if we do not use arrays, but with the array the assignment seems to be much more expensive.
I also tried a 1-D array instead of a 2-D one - the same result (2-D is obviously slower).
I also used int** or int* instead of vector< vector<int> > or vector<int> - again the result is the same.
Why do I get such poor performance in array assignment and can I fix it?
Edit:
One more observation: in the part 2 of the given code if we change the assignment from
arr[i][j] = n1; // 172 clocks
to
n1 = arr[i][j]; // 16 clocks
the speed (numbers in comments) goes up. More interestingly, if we change the line:
arr[i][j] = n1; // 172 clocks
to
arr[i][j] = arr[i][j] * arr[i][j]; // 110 clocks
speed is also higher than for simple assignment
Is there any difference in reading and writing from/to memory? Why do I get such strange performance?
Thanks in advance!
Your assumptions are really wrong... Writing to main memory is much slower than doing a simple addition. If you don't do the write, it is likely that the loop will be optimized away entirely.
Unless your actual "part 1" is significantly more complex than your example would lead us to believe, there's no surprise here -- memory access is slow compared to basic arithmetic. Additionally, if you're compiling with optimizations, most or all of "part 1" may be getting optimized away because the results are never used.
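For instance, a sketch of my own (not the original code): accumulating the results into a volatile sink is one way to keep "part 1" alive when the array write is commented out.
volatile int sink = 0;
for (int i = 0; i < N; i++){
    for (int j = 0; j < M; j++){
        // Part 1: stand-ins for the real arithmetic
        int n1 = 1 + 2;
        int n2 = n1;
        sink += n2;          // forces the compiler to keep the computation
        // arr[i][j] = n1;   // Part 2: the write whose cost is being measured
    }
}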
You have discovered a sad truth about modern computers: memory access is really slow compared to arithmetic. There's nothing you can do about it. It is because electric fields in a copper wire only travel at about two-thirds of the speed of light.
Your nested arrays are going to be somewhere in the region of 50–500MB of memory. It takes time to write that much memory, and no amount of clever hardware memory caching is going to help that much. Moreover, even a single memory write will take time as it has to make its way over some copper wires on a circuit board to a lump of silicon some distance away; physics wins.
But if you want to dig in more, try the cachegrind tool (assuming it's present on your platform). Just be aware that the code you're using above doesn't really permit a lot of optimization; its data access pattern doesn't have enormous amounts of reuse potential.
Let's make a simple estimate. A typical CPU clock speed nowadays is about 1-2 GHz (giga = 10 to the power nine). Simplifying a (really) great deal, it means that a single processor operation takes about 1 ns (nano = 10 to the power negative nine). A simple arithmetic operation like an int addition takes several CPU cycles, of order ten.
Now, memory: typical memory access time is about 50ns (again, it's not necessary just now to go into gory details, which are aplenty).
You see that even in the very best case scenario, the memory is slower than CPU by a factor of about 5 to 10.
In fact, I'm sweeping an enormous amount of detail under the rug, but I hope the basic idea is clear. If you're interested, there are books around (keywords: cache, cache misses, data locality, etc.). This one is dated, but still very good at explaining the general concepts.

C++ vector performance - non-intuitive result?

I am fiddling with the performance wizard in VS2010, the one that uses instrumentation (function call counts and timing).
After learning about vectors in the C++ STL, I just decided to see what info I can get about performance of filling a vector with 1 million integers:
#include <iostream>
#include <vector>

void generate_ints();

int main() {
    generate_ints();
    return 0;
}

void generate_ints() {
    typedef std::vector<int> Generator;
    typedef std::vector<int>::iterator iter;
    typedef std::vector<int>::size_type size;
    Generator generator;
    for (size i = 0; i != 1000000; ++i) {
        generator.push_back(i);
    }
}
What I get is: 2402.37 milliseconds of elapsed time for the above. But I learnt that vectors have to resize themselves when they run out of capacity, as they are contiguous in memory. So I thought I'd get better performance by making one addition to the above, which was:
generator.reserve(1000000);
However, this doubles the execution time of the program to around 5000 milliseconds. Here is a screenshot of the function calls, the left side without the above line of code and the right side with it. I really don't understand this result, and it doesn't make sense to me given what I learnt about how defining a vector's capacity up front, when you know you will fill it with a ton of elements, is a good thing. Specifying reserve basically doubled most of the function calls.
http://imagebin.org/179302
From the screenshot you posted, it looks like you're compiling without optimization, which invalidates any benchmarking you do.
You benchmark when you care about performance, and when you care about performance, you press the "go faster" button on the compiler, and enable optimizations.
Telling the compiler to go slow, and then worrying that it's slower than expected is pointless. I'm not sure why the code becomes slower when you insert a reserve call, but in debug builds, a lot of runtime checks are inserted to catch more errors, and it is quite possible that the reserve call causes more such checks to be performed, slowing down the code.
Enable optimizations and see what happens. :)
Did you perform all your benchmarks under Release configuration?
Also, try running outside of the profiler, in case you have hit some profiler-induced artifact (add manual time measurements to your code - you can use clock() for that).
Also, did you by any chance make a typo and actually call resize instead of reserve?
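To illustrate the manual-timing suggestion above, here is a minimal sketch of my own using clock(); build it in Release (with optimizations) before drawing any conclusions:
#include <cstddef>
#include <cstdio>
#include <ctime>
#include <vector>

// Fills a vector with a million ints, optionally reserving capacity first,
// and returns the size so the work cannot be optimized away.
static std::size_t generate_ints(bool use_reserve)
{
    std::vector<int> generator;
    if (use_reserve)
        generator.reserve(1000000);
    for (std::vector<int>::size_type i = 0; i != 1000000; ++i)
        generator.push_back(static_cast<int>(i));
    return generator.size();
}

int main()
{
    std::size_t total = 0;
    for (int use_reserve = 0; use_reserve <= 1; ++use_reserve)
    {
        std::clock_t start = std::clock();
        total += generate_ints(use_reserve != 0);
        std::clock_t stop = std::clock();
        std::printf("reserve=%d: %.2f ms\n", use_reserve,
                    1000.0 * double(stop - start) / CLOCKS_PER_SEC);
    }
    std::printf("total size: %lu\n", (unsigned long)total);
    return 0;
}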