Global time cost versus sum of local time costs -- "for" loop - c++

As silly as it seems, I would like to know whether there are pitfalls when trying to reconcile the time costs of a for loop, as measured
either from time points just outside the loop (the global or external time cost),
or from time points inside the loop, accumulated over the iterations (the local or internal time cost)?
The example below illustrates my difficulties getting two equal measurements:
#include <iostream>
#include <vector> // std::vector
#include <ctime>  // clock(), CLOCKS_PER_SEC

int main(){
    clock_t clockStartLoop;
    double timeInternal(0); // time cost of the loop, summing all time costs of commands within the "for" loop
    double timeExternal;    // time cost of the loop, as measured outside the boundaries of the "for" loop
    std::vector<int> vecInt; // will be [0,1,..,9999] after the loop below
    clock_t costExternal(clock());
    for(int i=0;i<10000;i++){
        clockStartLoop = clock();
        vecInt.push_back(i);
        timeInternal += clock() - clockStartLoop; // incrementing internal time cost
    }
    timeInternal /= CLOCKS_PER_SEC;
    timeExternal = (clock() - costExternal)/(double)CLOCKS_PER_SEC;
    std::cout << "timeExternal = "<< timeExternal << " s ";
    std::cout << "vs timeInternal = " << timeInternal << std::endl;
    std::cout << "We have a ratio of " << timeExternal/timeInternal << " between the two.." << std::endl;
}
I usually get a ratio around 2 as output e.g.
timeExternal = 0.008407 s vs timeInternal = 0.004287
We have a ratio of 1.96105 between the two..
whereas I was hoping for a ratio closer to 1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
I'm actually dealing with more complex code, and I would like to isolate time costs within a loop while being sure that all the time slices I consider add up to the complete pie (which I have never achieved so far..). Thanks a lot

timeExternal = 0.008407 s vs timeInternal = 0.004287 We have a ratio of 1.96105 between the two..
A ratio of ~2 is to be expected - by far the heaviest call in your loop is clock() itself (on most systems clock() is a syscall to the kernel).
Imagine that clock() implementation looks like the following pseudocode:
clock_t clock() {
    go_to_kernel(); // very long operation
    clock_t rc = query_process_clock();
    return_from_kernel(); // very long operation
    return rc;
}
Now going back to the loop, we can annotate the places where time is spent:
for(int i=0;i<10000;i++){
    // go_to_kernel - very long operation
    clockStartLoop = clock();
    // return_from_kernel - very long operation
    vecInt.push_back(i);
    // go_to_kernel - very long operation
    timeInternal += clock() - clockStartLoop;
    // return_from_kernel - very long operation
}
So between the two calls to clock() we have 2 long operations, with a total in the loop of 4. Hence the ratio of 2-to-1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
No, incrementing timeInternal is negligible.
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
No, i++ is also negligible. Remove the inner calls to clock() and you will see a much faster execution time. On my system it was 0.00003 s.
The next most expensive operation after clock() is vector::push_back(), because it occasionally needs to reallocate and grow the vector. The cost is amortized by the vector's geometric growth factor, and it can be eliminated entirely by calling vector::reserve() before entering the loop.
Conclusion: when benchmarking, make sure to time entire loops, not individual iterations. Better yet, use frameworks like Google Benchmark, which will help to avoid many other pitfalls (like compiler optimizations). There's also quick-bench.com for simple cases (based on Google Benchmark).
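As a rough sketch of both points (reserve() up front and a single pair of clock() calls around the whole loop), staying with the question's clock()-based timing rather than a benchmarking framework:
#include <iostream>
#include <vector>
#include <ctime>

int main(){
    std::vector<int> vecInt;
    vecInt.reserve(10000);                  // avoid reallocations inside the timed region
    clock_t start = clock();                // one clock() call before the loop ...
    for(int i = 0; i < 10000; i++){
        vecInt.push_back(i);                // only the work we actually want to measure
    }
    double seconds = (clock() - start) / (double)CLOCKS_PER_SEC; // ... and one after
    std::cout << "whole loop: " << seconds << " s" << std::endl;
}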

Related

Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter what operation I try to benchmark.
Below you can see the code that tries to benchmark the speed of a for loop; as such, I would expect it to scale (somewhat) linearly with the number of threads, but it doesn't (compiled on a dual-core laptop with and without the -O3 flag on g++ with C++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};
    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();
            //Force all threads to start at the same time
            #pragma omp barrier
            start = omp_get_wtime();
            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }
            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }
    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1 " << get_tp(1) << "ops/s\n";
    std::cout << "\n2 " << get_tp(2) << "ops/s\n";
    std::cout << "\n4 " << get_tp(4) << "ops/s\n";
    std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Output with -O3 on a dual-core machine, so we don't expect the throughput to increase beyond 2 threads; but it does not even increase when going from 1 to 2 threads, it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you are defining performance as the total number of operations per total CPU time and are then surprised that it decreases with the number of threads. This will almost always be the case, except when cache effects kick in. The true performance metric is the number of operations per unit of wall-clock time.
It is easy to show with simple mathematical reasoning. Given a total amount of work W and a per-core processing capability P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P). The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead; you can only amortise it over the lifetime of the threads by making sure that each one gets a lot of work to do. Therefore, T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches but the data for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function, so you ignore the value of H, and T_n should more or less equal T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is the sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will overclock or clock down the cores depending on how many of them are active. Therefore, while n clocked down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider -- timer resolution and measurement noise. Your run times are only a few microseconds. Unless your code is running on some specialised hardware that does nothing but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
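A minimal sketch of measuring against wall-clock time instead (this reuses the question's structure; the single omp_get_wtime() pair around the whole run and the volatile counter are my additions, and the code assumes compilation with -fopenmp):
#include <omp.h>
#include <iostream>

// Sketch: throughput = total operations / elapsed wall-clock time,
// measured once around the whole test instead of summed per thread.
double get_tp_wallclock(int THREADS, int OPS = 10000, int TIMES = 200)
{
    double wall_start = omp_get_wtime();
    for (int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            int loc_ops = OPS / THREADS;
            volatile int x = 0;            // volatile so the busy loop is not optimised away
            for (int i = 0; i < loc_ops; i++)
                for (int j = 0; j < 1000; j++)
                    x = x + 1;
        }
    }
    double wall_time = omp_get_wtime() - wall_start;
    return double(OPS) * TIMES / wall_time;   // operations per second of wall-clock time
}

int main()
{
    int threads[] = {1, 2, 4, 8};
    for (int t : threads)
        std::cout << t << " threads: " << get_tp_wallclock(t) << " ops/s\n";
}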
The loop is almost certainly still being optimized away, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>

int main()
{
    int x = 0;
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            ++x;
        }
    }
    std::cout << x << '\n';
    return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>

int main()
{
    std::mt19937 r;
    std::uniform_real_distribution<double> dist{0, 1};
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            dist(r);
        }
    }
    return 0;
}
Both loops and the PRNG invocation are left intact here.
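If you do reach for Google Benchmark (already mentioned in the first answer above), it also gives you an explicit way to keep a value alive without printing it or pulling in a PRNG; a minimal sketch (the benchmark name is arbitrary, the inner loop is the question's):
#include <benchmark/benchmark.h>

static void BM_InnerLoop(benchmark::State& state) {
    for (auto _ : state) {
        int x = 0;
        for (int j = 0; j < 1000; ++j)
            ++x;
        benchmark::DoNotOptimize(x);   // tells the optimiser the value is observed
    }
}
BENCHMARK(BM_InnerLoop);

BENCHMARK_MAIN();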

Chrono high_resolution_clock giving inconsistent times?

So I have a program that evaluates a polynomial in two different ways: Horner's method and a naive method. I'm trying to compare their run times, but depending on the order in which I place the function calls, the timings change. For example, if I place the Horner method first, it takes longer; if I put the naive method first, then that one takes longer. The Horner method should be much, much faster since it only has one loop, whereas the naive method has a nested loop. So I figured it must be the way I'm using the clocks from the chrono library. I tried both high_resolution_clock and system_clock, but the same thing happens. Any help/comments are welcome.
#include <cstdlib>
#include <iostream>
#include <chrono>
#include "Polynomial.h"

int main(int argc, char** argv) {
    double c[5] = {5, 0, -3, 1, -8};
    int degree = 4;
    Polynomial obj(c, degree);

    auto start = std::chrono::high_resolution_clock::now();
    std::cout << "Horner Evaluation: " << obj.hornerEval(-2) << ", ";
    auto elapsed = std::chrono::high_resolution_clock::now() - start;
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    std::cout << duration << " nanoseconds " << std::endl;

    auto start2 = std::chrono::high_resolution_clock::now();
    std::cout << "Naive Evaluation: " << obj.naiveEval(-2) << ", ";
    auto elapsed2 = std::chrono::high_resolution_clock::now() - start2;
    auto duration2 = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed2).count();
    std::cout << duration2 << " nanoseconds " << std::endl;
}
You didn't post all the code, but from the description it looks like a caching effect.
When the first method runs, the CPU cache is cold (the data has not yet been loaded from memory into the cache), so it takes more time to process (memory is slow compared to cache).
When the second method is called, it has all (or most, depending on data size) of the data already available in the cache - the cache is hot.
Solution: call both methods once outside the timed part first to warm up the cache, then do the measurements.
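A minimal sketch of that warm-up applied to the question's code (Polynomial, hornerEval and naiveEval come from the question; moving the std::cout calls out of the timed regions is an extra change of mine):
#include <chrono>
#include <iostream>
#include "Polynomial.h"   // the asker's class, as in the question

int main() {
    double c[5] = {5, 0, -3, 1, -8};
    Polynomial obj(c, 4);

    // Warm-up: run both evaluations once, untimed, so neither method pays
    // the cold-cache cost inside the measured region.
    obj.hornerEval(-2);
    obj.naiveEval(-2);

    auto t0 = std::chrono::high_resolution_clock::now();
    double horner = obj.hornerEval(-2);
    auto t1 = std::chrono::high_resolution_clock::now();
    double naive = obj.naiveEval(-2);
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << "Horner: " << horner << ", "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
              << " ns\n";
    std::cout << "Naive:  " << naive << ", "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << " ns\n";
}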
As one of the previous responses already said, it's probably something to do with the cache; the prefetcher can perhaps better predict which memory to load into the cache in the naiveEval method. Here is a talk about benchmarking C++ code with further information, for example on the effect of cold starts on benchmarking: https://www.youtube.com/watch?v=zWxSZcpeS8Q

Slow for loop with inconsistent timing

I have a for loop that is approximately this:
Timer timer1, timer2;
double inner_loop_time = 0;
timer1.Reset();
for (int i = 0; i < num_steps; i++) {
    timer2.Reset();
    sample_point += delta;
    // Find some points close to the sample_point.
    std::vector<int> new_keypoints;
    FindClosestPoints(sample_point, &new_keypoints);
    // Insert the keypoints into a global container.
    candidate_keypoints.insert(new_keypoints.begin(),
                               new_keypoints.end());
    inner_loop_time += timer2.ElapsedTimeInSeconds();
}
const double outer_loop_time = timer1.ElapsedTimeInSeconds();
std::cout << "Inner loop time: " << inner_loop_time
          << " vs outer loop time: " << outer_loop_time;
When I run this code, I get dramatic inconsistencies between the inner-loop and outer-loop timings. E.g. the outer loop reports 0.96s and the inner loop reports 0.51s. Why are these timings inconsistent?
Other notes:
The Timer class is a wrapper around the C++11 time library. It is efficient and not the reason for the difference in timings.
The function FindClosestPoints is self-contained and does not spawn any threads.
This weird timing behavior is consistent across thousands of runs.
num_steps is on the order of 1000
inner_loop_time does not include the time to destroy std::vector<int> new_keypoints or the time spent inside timer2.Reset().
It depends on num_steps. Every loop iteration adds a small difference between the two timers, because there is a gap between resetting the inner timer and reading its elapsed time: the work done outside that window (the timer calls themselves, destroying new_keypoints, the loop bookkeeping) is only seen by the outer timer.
The time difference should be proportional to the number of iterations. You can estimate it with
diff = num_steps * gap_time
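One way to sanity-check that explanation is to time an empty loop that contains nothing but the inner timer calls; a minimal sketch using std::chrono directly (the Timer class from the question isn't shown, so steady_clock stands in for it):
#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    const int num_steps = 1000;               // same order of magnitude as in the question

    double inner_seconds = 0;
    auto outer_start = clock::now();
    for (int i = 0; i < num_steps; ++i) {
        auto inner_start = clock::now();      // stands in for timer2.Reset()
        // no work at all inside the inner window
        inner_seconds +=
            std::chrono::duration<double>(clock::now() - inner_start).count();
    }
    double outer_seconds =
        std::chrono::duration<double>(clock::now() - outer_start).count();

    // Everything the inner window misses, averaged per iteration:
    std::cout << "per-iteration gap ~ "
              << (outer_seconds - inner_seconds) / num_steps << " s\n";
}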

How to zero a vector<bool>?

I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?
There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <vector>
#include <iostream>
#include <algorithm> // std::fill
#include <cstdlib>   // srand(), rand()
#include <time.h>    // clock(), CLOCKS_PER_SEC

int seed(std::vector<bool> &b) {
    srand(1);
    for (int i = 0; i < b.size(); i++)
        b[i] = ((rand() & 1) != 0);
    int count = 0;
    for (int i = 0; i < b.size(); i++)
        if (b[i])
            ++count;
    return count;
}

int main() {
    std::vector<bool> bools(1024 * 1024 * 32);
    int count1 = seed(bools);
    clock_t start = clock();
    bools.assign(bools.size(), false);
    double using_assign = double(clock() - start) / CLOCKS_PER_SEC;

    int count2 = seed(bools);
    start = clock();
    for (int i = 0; i < bools.size(); i++)
        bools[i] = false;
    double using_loop = double(clock() - start) / CLOCKS_PER_SEC;

    int count3 = seed(bools);
    start = clock();
    size_t size = bools.size();
    bools.clear();
    bools.resize(size);
    double using_clear = double(clock() - start) / CLOCKS_PER_SEC;

    int count4 = seed(bools);
    start = clock();
    std::fill(bools.begin(), bools.end(), false);
    double using_fill = double(clock() - start) / CLOCKS_PER_SEC;

    std::cout << "Time using assign: " << using_assign << "\n";
    std::cout << "Time using loop: " << using_loop << "\n";
    std::cout << "Time using clear: " << using_clear << "\n";
    std::cout << "Time using fill: " << using_fill << "\n";
    std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).
Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following
std::fill(v.begin(), v.end(), false);
You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation specific code, pray and use some standard zeroing technique, or do not use the type. I vote 3.
The received wisdom is that it was a mistake, and may become deprecated. Use a different container if possible. And definitely do not mess around with the internal guts, or rely on its packing. Check if you have a dynamic bitset available, mayhap (e.g. boost::dynamic_bitset), or roll your own wrapper around std::vector<unsigned char>.
I ran into this as a performance issue recently. I hadn't tried looking for answers on the web, but I did find that assignment via the constructor was 10x faster with g++ -O3 (Debian 4.7.2-5, gcc 4.7.2). I found this question because I was looking to avoid the additional malloc. It looks like assign is optimized as well as the constructor, and about twice as good in my benchmark.
unsigned sz = v.size();
for (unsigned ii = 0; ii != sz; ++ii) v[ii] = false; // baseline loop
v = std::vector<bool>(sz, false); // 10x faster
v.assign(sz, false);              // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.
Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specialised for bool, then assign is most likely also implemented appropriately.
If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to use integers per bit vector entry and a single 'rolling threshold' value that determines which entries actually then evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.
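As a rough sketch of that idea (my own illustration based on the description above, not the code the answer links to): store one integer per entry plus a current generation value, and treat an entry as true only when it matches the current generation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a clearable bit vector with a "rolling threshold" (generation counter).
class ClearableBitVector {
public:
    explicit ClearableBitVector(std::size_t n) : gen_(1), data_(n, 0) {}

    void set(std::size_t i)        { data_[i] = gen_; }
    bool test(std::size_t i) const { return data_[i] == gen_; }

    // Clearing is a single increment in the common case.
    void clear_all() {
        if (++gen_ == 0) {                              // generation counter overflowed:
            std::fill(data_.begin(), data_.end(), 0);   // fall back to a real wipe
            gen_ = 1;
        }
    }

private:
    std::uint32_t gen_;
    std::vector<std::uint32_t> data_;
};
Clearing is then O(1) except on overflow, at the cost of 32x (or more) memory compared to a packed bit vector -- the tradeoff alluded to above.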
It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.
As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero one (ie, set each element to false) out by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi
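A minimal usage sketch (boost::dynamic_bitset and its reset() member are the actual Boost API; the size here is arbitrary):
#include <boost/dynamic_bitset.hpp>
#include <iostream>

int main() {
    boost::dynamic_bitset<> bits(1024 * 1024); // about a million bits, all false
    bits.set(42);                              // set a single bit
    bits.reset();                              // zero every bit; the size stays the same
    std::cout << bits.count() << " bits set, size " << bits.size() << "\n";
}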

Interesting processing time results

I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5 but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the number of clock steps required for the application to run. When I first ran the script, it said that it took 3770000 clock steps, but every time that I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
#include <stdio.h> // printf()
#include <time.h>  // clock(), difftime()

double avg (int arr[], int beg, int end)
{
    int nums = end - beg + 1;
    double sum = 0.0;
    for(int i = beg; i <= end; i++)
    {
        sum += arr[i];
    }
    //for(int p = 0; p < nums*10000; p ++){}
    return sum/nums;
}

int main (int argc, char *argv[])
{
    int nums = 1000000;//atoi(argv[0]);
    int myarray[nums];
    double timediff;
    //printf("Arg is: %d\n",argv[0]);
    printf("Nums is: %d\n",nums);
    clock_t begin_time = clock();
    for(int i = 0; i < nums; i++)
    {
        myarray[i] = i+1;
    }
    double average = avg(myarray, 0, nums - 1);
    printf("%f\n",average);
    clock_t end_time = clock();
    timediff = (double) difftime(end_time, begin_time);
    printf("Time to Average: %f\n", timediff);
    return 0;
}
You are also measuring the I/O operation (printf), which depends on external factors and might affect the run time. Also, clock() might not be precise enough to measure such a small task - look into higher-resolution functions such as clock_get_time(). Even then, other processes might affect the run time by generating page-fault interrupts, occupying the memory bus, etc. So this kind of fluctuation is not abnormal at all.
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in .13 seconds, so I find it suspicious that you claim it takes 2 seconds to run it once on your computer.
Your code also uses difftime incorrectly (difftime expects time_t values from time(), not clock_t values from clock()), which may display incorrect output as well when clock says time did pass.
I'd guess that the first timing you got was with different code than that posted in this question, because I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string, and (4) make an OS call (slow) that prints said string in the right language/font/color/etc, which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.
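The code the answer refers to is not included above; the following is only a sketch of what such a rework might look like, based on the mistakes listed (timing only the averaging step with clock(), keeping printf outside the measured region, and using a heap-allocated std::vector are my assumptions, not the original answer's code):
#include <stdio.h>
#include <time.h>
#include <vector>

// Sketch: fill the array untimed, time only the averaging step,
// and keep all printf calls outside the measured region.
double avg(const std::vector<int>& arr)
{
    double sum = 0.0;
    for (int v : arr)
        sum += v;
    return sum / arr.size();
}

int main()
{
    const int nums = 1000000;
    std::vector<int> myarray(nums);          // heap allocation instead of a huge stack array
    for (int i = 0; i < nums; i++)
        myarray[i] = i + 1;

    clock_t begin_time = clock();
    double average = avg(myarray);
    clock_t end_time = clock();

    printf("Average: %f\n", average);
    printf("Time to average: %f s\n",
           (double)(end_time - begin_time) / CLOCKS_PER_SEC);
    return 0;
}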