Slow for loop with inconsistent timing - c++

I have a for loop that is approximately this:
Timer timer1, timer2;
double inner_loop_time = 0;
timer1.Reset();
for (int i = 0; i < num_steps; i++) {
    timer2.Reset();
    sample_point += delta;
    // Find some points close to the sample_point.
    std::vector<int> new_keypoints;
    FindClosestPoints(sample_point, &new_keypoints);
    // Insert the keypoints into a global container.
    candidate_keypoints.insert(new_keypoints.begin(),
                               new_keypoints.end());
    inner_loop_time += timer2.ElapsedTimeInSeconds();
}
const double outer_loop_time = timer1.ElapsedTimeInSeconds();
std::cout << "Inner loop time: " << inner_loop_time
          << " vs outer loop time: " << outer_loop_time;
When I run this code, I get dramatic inconsistencies between the timing of the inner loop and the outer loop. E.g. the outer loop reports 0.96s and the inner loop reports 0.51s. Why are these timings inconsistent?
Other notes:
The Timer class is a wrapper around the C++11 <chrono> library. It is efficient and not the reason for the difference in timings.
The function FindClosestPoints is self-contained and does not spawn any threads.
This weird timing behavior is consistent across thousands of runs.
num_steps is on the order of 1000

inner_loop_time does not include the time to destroy std::vector<int> new_keypoints or the time spent inside timer2.Reset().

It depends on num_steps. Every loop iteration adds a small difference between the two timers, because there is a gap between resetting the inner timer and reading its elapsed time: whatever happens in that gap (destroying the local vector, the loop increment, the timer calls themselves) is seen only by the outer timer.
The time difference should be proportional to the number of iterations. You can estimate it as
diff = num_steps * gap_time
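To see the effect in isolation, here is a minimal self-contained sketch using std::chrono directly (the Timer class and FindClosestPoints are not shown in the question, so a std::set insert stands in for the loop body):
#include <chrono>
#include <iostream>
#include <set>
#include <vector>

int main() {
    using Clock = std::chrono::steady_clock;
    const int num_steps = 1000;
    std::set<int> candidate_keypoints;

    double inner = 0.0;
    const auto outer_start = Clock::now();
    for (int i = 0; i < num_steps; ++i) {
        const auto inner_start = Clock::now();
        std::vector<int> new_keypoints{i, i + 1, i + 2}; // stand-in for FindClosestPoints
        candidate_keypoints.insert(new_keypoints.begin(), new_keypoints.end());
        inner += std::chrono::duration<double>(Clock::now() - inner_start).count();
        // Everything from here to the next inner_start (vector destruction, the
        // loop increment, the clock calls themselves) is charged only to the outer timer.
    }
    const double outer = std::chrono::duration<double>(Clock::now() - outer_start).count();
    std::cout << "inner: " << inner << " s, outer: " << outer
              << " s, gap: " << outer - inner << " s\n";
}
The reported gap grows linearly with num_steps, which is exactly the per-iteration overhead described above.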


Throughput of trivially concurrent code does not increase with the number of threads

I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter what operation I try to benchmark.
Below you can see the code that tries to benchmark the speed of a for loop, so I would expect it to scale (somewhat) linearly with the number of threads; it doesn't (compiled on a dual-core laptop with and without the -O3 flag on g++ with C++11).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>
thread_local const int OPS = 10000;
thread_local const int TIMES = 200;
double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};
    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
#pragma omp parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();
            //Force all threads to start at the same time
#pragma omp barrier
            start = omp_get_wtime();
            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }
            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }
    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}
int main()
{
    std::cout << "\n1 " << get_tp(1) << "ops/s\n";
    std::cout << "\n2 " << get_tp(2) << "ops/s\n";
    std::cout << "\n4 " << get_tp(4) << "ops/s\n";
    std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Outputs with -O3 on a dual-core machine; we don't expect the throughput to increase beyond 2 threads, but it does not even increase when going from 1 to 2 threads - it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting "x" after measuring the time and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you are defining performance as the total number of operations per total CPU time and are then surprised that it decreases with the number of threads. This will almost always and universally be the case, except when cache effects kick in. The true performance metric is the number of operations per unit of wall-clock time.
It is easy to show with simple mathematical reasoning. Given a total amount of work W and a processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P). The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead; you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on. Therefore, T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches but that for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function, so you ignore the value of H, and T_n should more or less be equal to T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but it becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth that one can completely saturate the memory bus with as few as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will overclock or clock down the cores depending on how many of them are active. Therefore, while n clocked-down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider -- timer resolution and measurement noise. Your run times are on the order of a couple of microseconds. Unless your code is running on some specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
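To illustrate the suggested metric, here is a minimal sketch (a hypothetical rework of the question's get_tp(), reusing its OPS and TIMES constants and the <omp.h> header already included there) that reports operations per wall-clock second by timing the whole set of parallel regions once:
double get_tp_wallclock(int THREADS)
{
    const double t0 = omp_get_wtime();
    for (int iteration = 0; iteration < TIMES; iteration++)
    {
#pragma omp parallel num_threads(THREADS)
        {
            int loc_ops = OPS / THREADS;
            for (int i = 0; i < loc_ops; i++)
            {
                // operations to benchmark go here
            }
        }
    }
    const double wall = omp_get_wtime() - t0; // elapsed real time in seconds
    return double(OPS) * TIMES / wall;        // operations per wall-clock second
}
With this metric the thread-launch overhead H still shows up, but adding cores can at least make the number go up instead of down.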
The loop is almost certainly still getting optimized away, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction, since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>
int main()
{
    int x = 0;
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            ++x;
        }
    }
    std::cout << x << '\n';
    return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>
int main()
{
    std::mt19937 r;
    std::uniform_real_distribution<double> dist{0, 1};
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            dist(r);
        }
    }
    return 0;
}
Both loops and the PRNG invocation are left intact here.

Global time cost versus sum of local time costs -- "for" loop

As silly as it seems, I would like to know whether there may be pitfalls when trying to reconcile the time costs for a for loop, as measured
either from time points just outside the for loop (global or external time cost),
or from time points inside the loop, accumulated over the iterations (local or internal time cost)?
The example below illustrates my difficulties getting two equal measurements:
#include <iostream>
#include <vector> // std::vector
#include <ctime> // clock(), ..
int main(){
    clock_t clockStartLoop;
    double timeInternal(0) // time cost of the loop, summing the time costs of all commands within the "for" loop
         , timeExternal;   // time cost of the loop, as measured outside the boundaries of the "for" loop
    std::vector<int> vecInt; // will be [0,1,..,9999] after the loop below
    clock_t costExternal(clock());
    for(int i=0;i<10000;i++){
        clockStartLoop = clock();
        vecInt.push_back(i);
        timeInternal += clock() - clockStartLoop; // incrementing internal time cost
    }
    timeInternal /= CLOCKS_PER_SEC;
    timeExternal = (clock() - costExternal)/(double)CLOCKS_PER_SEC;
    std::cout << "timeExternal = " << timeExternal << " s ";
    std::cout << "vs timeInternal = " << timeInternal << std::endl;
    std::cout << "We have a ratio of " << timeExternal/timeInternal << " between the two.." << std::endl;
}
I usually get a ratio around 2 as output e.g.
timeExternal = 0.008407 s vs timeInternal = 0.004287
We have a ratio of 1.96105 between the two..
whereas I was hoping for a ratio closer to 1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
I'm actually dealing with more complex code, and I would like to isolate time costs within a loop while being sure that all the time slices I consider make up a complete pie (which I have never achieved so far..). Thanks a lot
timeExternal = 0.008407 s vs timeInternal = 0.004287 We have a ratio of 1.96105 between the two..
A ratio of ~2 is to be expected - by far the heaviest call in your loop is clock() itself (on most systems clock() is a syscall to the kernel).
Imagine that clock() implementation looks like the following pseudocode:
clock_t clock() {
    go_to_kernel();       // very long operation
    clock_t rc = query_process_clock();
    return_from_kernel(); // very long operation
    return rc;
}
Now going back to the loop, we can annotate the places where time is spent:
for(int i=0;i<10000;i++){
    // go_to_kernel - very long operation
    clockStartLoop = clock();
    // return_from_kernel - very long operation
    vecInt.push_back(i);
    // go_to_kernel - very long operation
    timeInternal += clock() - clockStartLoop;
    // return_from_kernel - very long operation
}
So between the two calls to clock() the internal measurement covers 2 of these long operations, while each loop iteration contains 4 in total. Hence the ratio of 2-to-1.
Is it just because there are operations internal to the loop which are not measured by the clock() difference (such as incrementing timeInternal) ?
No, incrementing timeInternal is negligible.
Could the i++ operation in the for(..) be non-negligible in the external measurement and also explain the difference with the internal one ?
No, i++ is also negligible. Remove the inner calls to clock() and you will see a much faster execution time. On my system it was 0.00003 s.
The next most expensive operation after clock() is vector::push_back(), because it occasionally needs to reallocate and grow the vector. This cost is amortized by geometric capacity growth and can be eliminated entirely by calling vector::reserve() before entering the loop.
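For example, a minimal variant of the question's loop with the reallocations removed:
vecInt.reserve(10000); // one allocation up front, so push_back never reallocates below
for(int i=0;i<10000;i++){
    vecInt.push_back(i);
}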
Conclusion: when benchmarking, make sure to time entire loops, not individual iterations. Better yet, use frameworks like Google Benchmark, which will help to avoid many other pitfalls (like compiler optimizations). There's also quick-bench.com for simple cases (based on Google Benchmark).
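For illustration, a minimal Google Benchmark sketch of the push_back loop (assuming the library is installed and the program is linked against it, e.g. with -lbenchmark -lpthread):
#include <benchmark/benchmark.h>
#include <vector>

static void BM_PushBack(benchmark::State& state) {
    for (auto _ : state) {                  // the framework picks the iteration count
        std::vector<int> v;
        v.reserve(10000);
        for (int i = 0; i < 10000; i++) {
            v.push_back(i);
        }
        benchmark::DoNotOptimize(v.data()); // keep the work from being optimized away
    }
}
BENCHMARK(BM_PushBack);
BENCHMARK_MAIN();
The framework times whole batches of iterations and reports a per-iteration average, which sidesteps both the clock() overhead and the resolution problems discussed above.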

For loop performance and multithreaded performance questions

I was kind of bored, so I wanted to try using std::thread and eventually measure the performance of a single-threaded and a multithreaded console application. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
    sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added range based and iterator based for loop and measured the same way with chrono::high_resolution_clock.
for (auto& val : ints)
    sum += val;
for (auto it = ints.begin(); it != ints.end(); ++it)
    sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug build, so I changed to release and the difference was ~1ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference of 1ms in release mode?
I moved on to thread creation and tried to do a parallel sum of the array with this lambda (capturing everything by reference so I could use the vector of ints and a mutex declared previously), using an index-based for.
auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;
    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;
    m.lock();
    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
    sum += partial_sum;
    m.unlock();
};
for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? A solution with 2 threads ended up at ~30ms as well. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, I forgot to mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid all of the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code, processors are complicated. Operating systems are complicated. Your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, or switch into heat-protection mode. And rounding the times to milliseconds isn't really helpful. One thread could take 6.49 milliseconds and another 6.51, and they get rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls, and function calls carry a certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
sum += val;
end = ...
diff = end - start;
sum = 0;
Doesn't use the result of the loop at all, so when optimizing, the compiler can simply throw the loop away, resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by the coarse granularity of high_resolution_clock in the standard library implementation used, and by differences in process scheduling during execution. I measured the index-based for being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to draw your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS, admittedly most of which are asleep, that need time and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check whether Visual Studio has iterator checking enabled in debug builds. When it's enabled, every time an iterator is used it is bounds-checked, and so on. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the /openmp switch. If you enable that and apply the OpenMP #pragmas to your for loops, you can do away with all the manual thread creation; see the sketch below. I toyed around with similar threading tests (because it's cool. :) ) and OpenMP's performance is pretty damn good.
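For instance, a minimal sketch of the question's summation with an OpenMP reduction (assuming ints is the 800,000-element vector; compile with /openmp in Visual Studio or -fopenmp with g++):
int sum = 0;
// Each thread accumulates a private partial sum; OpenMP combines them at the end,
// replacing the manual partial_sum/mutex bookkeeping from the lambda version.
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 800000; ++i)
    sum += ints[i];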
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much of what would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
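(An aside that is not part of the original answer: in MSVC the checking level can be lowered with the _ITERATOR_DEBUG_LEVEL macro, which must be defined consistently across the whole project and before any standard library header; check the MSVC documentation for your toolset before relying on it.)
// Hedged sketch: disable checked/debug iterators in a Debug build.
#define _ITERATOR_DEBUG_LEVEL 0
#include <vector>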
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing, and the 8-thread time totals being so much greater, which as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this is demonstrating just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, while the first thread instance becomes active, the main thread might (or might not) be preempted. There is also no guarantee that the threads are spawned on individual cores (thread affinity).
If you have a closer look at the POSIX APIs, there is the notion of "application context" and "system context", which basically implies that there might be an OS policy in place which does not use all cores for one application.
On Windows (which is where you were testing), maybe the threads are not spawned directly but via a thread pool, possibly with some extra std::thread functionality, which could produce overhead/delay (completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to yield significant times. But on the upside, this reminded me to point out that, typically, it starts to pay off to go parallel when the computation times are well beyond the length of a time slice (rule of thumb).
Here my "native" Windows implementation, which - for a large enough array finally makes the threads win over a single threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
    Range( const int32_t *p, size_t l)
        : data(p)
        , length(l)
        , result(0)
    {}
    const int32_t *data;
    size_t length;
    int32_t result;
};

static int32_t Sum(const int32_t * data, size_t length)
{
    int32_t sum = 0;
    const int32_t *end = data + length;
    for (; data != end; data++)
    {
        sum += *data;
    }
    return sum;
}

static int32_t TestSingleThreaded(const Range& range)
{
    return Sum(range.data, range.length);
}

DWORD WINAPI CalcThread(_In_ LPVOID lpParameter)
{
    Range * myRange = reinterpret_cast<Range*>(lpParameter);
    myRange->result = Sum(myRange->data, myRange->length);
    return 0;
}

static int32_t TestWithNCores(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    while (nextStart < range.length)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);
    std::vector<HANDLE> threadHandles;
    threadHandles.reserve(ncores);
    for (size_t i = 0; i < ncores; ++i)
    {
        threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
    }
    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
    DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
    if (WAIT_OBJECT_0 == waitResult)
    {
        for (auto& r : ranges)
        {
            result += r.result;
        }
        result += remainderResult;
    }
    else
    {
        throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
    }
    for (auto& h : threadHandles)
    {
        ::CloseHandle(h);
    }
    return result;
}

static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
    int32_t result = 0;
    std::vector<Range> ranges;
    size_t nextStart = 0;
    size_t chunkLength = range.length / ncores;
    size_t remainder = range.length - chunkLength * ncores;
    while (nextStart < range.length)
    {
        ranges.push_back(Range(&range.data[nextStart], chunkLength));
        nextStart += chunkLength;
    }
    Range remainderRange(&range.data[range.length - remainder], remainder);
    std::vector<std::thread> threads;
    for (size_t i = 0; i < ncores; ++i)
    {
        threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
    }
    int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
    for (auto& t : threads)
    {
        t.join();
    }
    for (auto& r : ranges)
    {
        result += r.result;
    }
    result += remainderResult;
    return result;
}

void TestNativeThreads()
{
    const size_t DATA_SIZE = 800000000ULL;
    typedef std::vector<int32_t> DataVector;
    DataVector data;
    data.reserve(DATA_SIZE);
    for (size_t i = 0; i < DATA_SIZE; ++i)
    {
        data.push_back(static_cast<int32_t>(i));
    }
    Range r = { data.data(), data.size() };
    std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
    int32_t result = TestSingleThreaded(r);
    std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "Single threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;
    std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithNCores(r, 8);
    std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "Multi threaded sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;
    std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
    result = TestWithSTLThreads(r, 8);
    std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
    std::cout
        << "std::thread sum: "
        << std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
        << "ms." << " Result = " << result << std::endl;
}
Here is the output of this code on my machine:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel compelled to mention that, the way this code is written, it is more of a memory I/O performance benchmark than a core CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into CPU caches, etc.
Maybe it would be interesting to experiment with how the data is split into ranges. What if each thread were "jumping" over the data from start to end with a stride of ncores? Thread 1: 0 8 16..., thread 2: 1 9 17..., etc.? Maybe then the "locality" of the memory accesses could gain extra speed; see the sketch below.
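A hedged sketch of that strided (interleaved) partitioning, in the style of the Sum() helper above (purely an experiment to try, not a recommendation):
// Thread t of ncores sums elements t, t + ncores, t + 2*ncores, ...
static int32_t SumStrided(const int32_t *data, size_t length, size_t t, size_t ncores)
{
    int32_t sum = 0;
    for (size_t i = t; i < length; i += ncores)
    {
        sum += data[i];
    }
    return sum;
}
Note that with a stride of ncores every thread touches every cache line of the array while using only part of it, so in practice contiguous chunks are usually at least as fast; measuring both is the only way to tell.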

Would a pre-calculated variable be faster than calculating it every time in a loop?

In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that has passed.
The product _decayRate * deltaTime gets calculated on every iteration, so with 10000 particles that wouldn't be very efficient, because it doesn't need to be recalculated (it doesn't change anyway).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates it once and stores it in a variable that gets read every iteration, so the CPU doesn't have to recalculate it each time, which should theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>
const unsigned int MAX = 1000000000;
int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;
    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";
    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";
    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't settle it. I know it doesn't do exactly the same thing, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2 (especially with fewer iterations), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop-invariant code motion, and it is universally implemented (and has been for about 40 years).
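As an illustration, here is a hand-written sketch of the transformation (not actual compiler output), which turns the first version into essentially the second one:
// Before loop-invariant code motion: the invariant product appears inside the loop.
for (int i = 0; i < _maxParticles; i++)
{
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= _decayRate * deltaTime;
}

// After loop-invariant code motion (conceptually): the product is hoisted out once.
const float step = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= step;
}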
On the other hand, it may make sense to use the separate variable
anyway, to make the code clearer. This depends on the application, but
in many cases, giving a name to the results of an expression can make
code clearer. (In other cases, of course, throwing in a lot of extra
variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible
first, and then, if (and only if) there is a performance problem,
profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release-mode runtime might be caused by the compiler computing the result statically at compile time.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced;
You can make MAX depend on some kind of input (e.g. a command-line parameter) to avoid that.
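A minimal sketch of that idea (a hypothetical adaptation of the benchmark above: the bound and the operands come from user input and foo is printed afterwards, so neither loop can be folded away at compile time):
#include <ctime>
#include <iostream>

int main()
{
    unsigned int max;
    float deltaTime, decayRate;
    std::cin >> max >> deltaTime >> decayRate; // values unknown at compile time

    float foo = 2041.234f;
    std::clock_t start = std::clock();
    for (unsigned int i = 0; i < max; i++)
        foo -= decayRate * deltaTime;
    std::cout << "Method 1 took " << (std::clock() - start) << " clock ticks\n";

    start = std::clock();
    const float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < max; i++)
        foo -= calced;
    std::cout << "Method 2 took " << (std::clock() - start) << " clock ticks\n";

    std::cout << foo << '\n'; // use the result so the loops cannot be removed
    return 0;
}
Even then, expect the two methods to time about the same with optimization on, because the compiler hoists decayRate * deltaTime out of the first loop anyway.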

Using bools in calculations to avoid branches

Here's a little micro-optimization curiosity that I came up with:
struct Timer {
    bool running{false};
    int ticks{0};
    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }
    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};
It seems the two methods practically do the same thing. Does the second version avoid a branch (and is it consequently faster than the first version), or is any compiler able to do this kind of optimization with -O3?
Yes, your trick allows you to avoid the branch, and it makes the code faster... sometimes.
I wrote a benchmark that compares these solutions in various situations, along with my own:
ticks += mStepSize & -static_cast<int>(running)
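(A brief aside that is not part of the original answer: the trick works because static_cast<int>(running) is 0 or 1, so its negation is either an all-zero or an all-one bit mask:)
// running == false:  mStepSize & -0  ->  mStepSize & 0x00000000  ->  0
// running == true:   mStepSize & -1  ->  mStepSize & 0xFFFFFFFF  ->  mStepSize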
My results are the following:
Off:
branch: 399949150
mul: 399940271
andneg: 277546678
On:
branch: 204035423
mul: 399937142
andneg: 277581853
Pattern:
branch: 327724860
mul: 400010363
andneg: 277551446
Random:
branch: 915235440
mul: 399916440
andneg: 277537411
Off is when the timers are turned off. In this case the solutions take about the same time.
On is when they are turned on. The branching solution is two times faster.
Pattern is when they follow a 100110 pattern. Performance is similar, but branching is a bit faster.
Random is when the branch is unpredictable. In this case the multiplication is more than 2 times faster.
In all cases my bit-hacking trick is fastest, except for On, where branching wins.
Note that this benchmark is not necessarily representative for all compiler versions, processors, etc. Even small changes to the benchmark can turn the results upside down (for example, if the compiler can inline knowing that mStepSize is 1, then the multiplication can actually be fastest).
Code of the benchmark:
#include<array>
#include<chrono>
#include<cstdlib> // rand()
#include<iostream>

struct Timer {
    bool running{false};
    int ticks{0};
    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }
    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;
    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);
    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);
    for(auto& t : timers)
        t.running = rand()&1;
    std::cout << "Random:\n";
    run(timers, step);
    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}
Does the second version avoid a branch
If you compile your code to assembler output (g++ -S test.cpp -o test.s), you'll find that a branch is indeed avoided in the second function.
and consequently, is faster than the first version
I ran each of your functions 2147483647 (INT_MAX) times, where in each iteration I randomly assigned a boolean value to the running member of your Timer struct, using this code:
// Includes and the timestamp helper were not shown in the original post; a
// microsecond helper based on <chrono> is assumed here. Timer is the struct
// from the question.
#include <chrono>
#include <cstdlib> // rand(), srand()
#include <ctime>   // time()
#include <iostream>
#include <limits>

typedef unsigned long long timestamp_t;

static timestamp_t get_timestamp() {
    return std::chrono::duration_cast<std::chrono::microseconds>(
               std::chrono::steady_clock::now().time_since_epoch()).count();
}

int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;
    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;
    std::cout << "step_versionOne = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;
    percent = (one - two) / static_cast<double>(one) * 100.0;
    std::cout << "step_versionTwo = " << two << std::endl;
    std::cout << "step_one - step_two = " << one - two << std::endl;
    std::cout << "one fast than two by = " << percent << std::endl;
}
and these are the results I got:
step_versionOne = 39738380
step_versionTwo = 26047337
step_one - step_two = 13691043
one fast than two by = 34.4529%
So yes, the second function is clearly faster, by around 35%. Note that the percentage difference in timed performance varied between 30 and 55 percent for smaller numbers of iterations, whereas it seems to plateau at around 35% the longer the test runs. This might be due to sporadic execution of system tasks while the simulation is running, which becomes less sporadic, i.e. more consistent, the longer you run the sim (although this is just my assumption; I have no idea whether it's actually true).
All in all, nice question; I learned something today!
MORE:
Of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. However, if we decide not to alter running during the loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, and it will actually be faster by almost 20%, as these results suggest:
step_versionOne = 6273942
step_versionTwo = 7809508
step_two - step_one = 1535566
two fast than one by = 19.6628
Because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running - likely the result of a compiler optimization.
Why is the second function slower in this case? Well, the branch predictor will quickly realize that the condition in the first function is never met and so will effectively stop checking it (as though if(running) ticks += mStepSize; weren't even there). On the other hand, the second function still has to perform ticks += mStepSize * static_cast<int>(running); in every iteration, thus making the first function more efficient.
But what if we set running to true? Well, branch prediction will kick in again; however, this time the first function will have to evaluate ticks += mStepSize; in every iteration. Here are the results with running{true}:
step_versionOne = 7522095
step_versionTwo = 7891948
step_two - step_one = 369853
two fast than one by = 4.68646
Notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false, but it still takes marginally longer than step_versionOne. Well, this might be because I was too lazy to run it enough times to determine whether it is consistently slower or whether that was a one-time fluke (results vary slightly every time you run it, since the OS has to run in the background and won't always do the same thing). But if it is consistently slower, it might be because function two (ticks += mStepSize * static_cast<int>(running);) performs one more arithmetic op than function one (ticks += mStepSize;).
Finally, let's compile with optimization - g++ -o test test.cpp -std=c++11 -O1 - revert running back to false, and check the results:
step_versionOne = 704973
step_versionTwo = 695052
More or less the same. The compiler will do its optimization pass, realize that running is always false and thus, for all intents and purposes, remove the body of step_versionOne, so when you call it from the loop in main, it'll just call the function and return.
On the other hand, when optimizing the second function, it will realize that ticks += mStepSize * static_cast<int>(running); will always generate the same result, i.e. 0, so it won't bother executing that either.
All in all, if I'm correct (and if not, please correct me; I'm pretty new to this), all you'll get when calling both functions from the main loop is their call overhead.
P.S. Here is the result for the first case (running is randomly generated in every iteration) when compiled with optimization:
step_versionOne = 18868782
step_versionTwo = 18812315
step_two - step_one = 56467
one fast than two by = 0.299261