Here's a little micro-optimization curiosity that I came up with:
struct Timer {
    bool running{false};
    int ticks{0};

    void step_versionOne(int mStepSize) {
        if(running) ticks += mStepSize;
    }
    void step_versionTwo(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
};
It seems the two methods practically do the same thing. Does the second version avoid a branch (and consequently, is faster than the first version), or is any compiler able to do this kind of optimization with -O3?
Yes, your trick avoids the branch, and it does make the code faster... sometimes.
I wrote a benchmark that compares these solutions in various situations, along with my own:
ticks += mStepSize & -static_cast<int>(running)
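For readers wondering why the AND trick works: converting running to int yields 0 or 1, and negating that yields either 0 or -1 (all bits set in two's complement), so the AND either clears the step or passes it through unchanged. A tiny compile-time check of that claim (my own illustration, not part of the benchmark):

int main() {
    // -int(true) == -1 has every bit set, so step & -1 == step;
    // -int(false) == 0 has no bit set,   so step & 0  == 0.
    static_assert(-static_cast<int>(true) == -1, "mask keeps the step");
    static_assert(-static_cast<int>(false) == 0, "mask clears the step");
}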
My results are as follows:
Off:
branch: 399949150
mul: 399940271
andneg: 277546678
On:
branch: 204035423
mul: 399937142
andneg: 277581853
Pattern:
branch: 327724860
mul: 400010363
andneg: 277551446
Random:
branch: 915235440
mul: 399916440
andneg: 277537411
Off is when the timers are turned off. In this case the solutions take about the same time.
On is when they are turned on. The branching solution is about two times faster.
Pattern is when they follow a 100110 pattern. Performance is similar, but branching is a bit faster.
Random is when the branch is unpredictable. In this case the multiplication is more than 2 times faster.
In all cases my bit-hacking trick is fastest, except for On, where branching wins.
Note that this benchmark is not necessarily representative of all compiler versions, processors, etc. Even small changes to the benchmark can turn the results upside down (for example, if the compiler can inline the call knowing that mStepSize is 1, then the multiplication can actually be fastest).
Code of the benchmark:
#include <array>
#include <chrono>
#include <cstdlib>   // rand()
#include <iostream>

struct Timer {
    bool running{false};
    int ticks{0};

    void branch(int mStepSize) {
        if(running) ticks += mStepSize;
    }
    void mul(int mStepSize) {
        ticks += mStepSize * static_cast<int>(running);
    }
    void andneg(int mStepSize) {
        ticks += mStepSize & -static_cast<int>(running);
    }
};

void run(std::array<Timer, 256>& timers, int step) {
    auto start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.branch(step);
    auto end = std::chrono::steady_clock::now();
    std::cout << "branch: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.mul(step);
    end = std::chrono::steady_clock::now();
    std::cout << "mul: " << (end - start).count() << std::endl;

    start = std::chrono::steady_clock::now();
    for(int i = 0; i < 1000000; i++)
        for(auto& t : timers)
            t.andneg(step);
    end = std::chrono::steady_clock::now();
    std::cout << "andneg: " << (end - start).count() << std::endl;
}

int main() {
    std::array<Timer, 256> timers;
    int step = rand() % 256;

    run(timers, step); // warm up
    std::cout << "Off:\n";
    run(timers, step);

    for(auto& t : timers)
        t.running = true;
    std::cout << "On:\n";
    run(timers, step);

    std::array<bool, 6> pattern = {1, 0, 0, 1, 1, 0};
    for(int i = 0; i < 256; i++)
        timers[i].running = pattern[i % 6];
    std::cout << "Pattern:\n";
    run(timers, step);

    for(auto& t : timers)
        t.running = rand() & 1;
    std::cout << "Random:\n";
    run(timers, step);

    for(auto& t : timers)
        std::cout << t.ticks << ' ';
    return 0;
}
Does the second version avoid a branch
If you compile your code to get assembler output, g++ -o test.s test.cpp -S, you'll find that a branch is indeed avoided in the second function.
and consequently, is faster than the first version
I ran each of your functions 2147483647 (INT_MAX) times, randomly assigning a boolean value to the running member of your Timer struct in each iteration, using this code:
#include <cstdlib>   // rand(), srand()
#include <ctime>     // time()
#include <iostream>
#include <limits>

// Timer is the struct from the question.
// timestamp_t / get_timestamp() are microsecond-timer helpers (not shown in the
// original post; a possible definition follows below).

int main() {
    const int max = std::numeric_limits<int>::max();
    timestamp_t start, end, one, two;
    Timer t_one, t_two;
    double percent;

    srand(time(NULL));

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_one.running = rand() % 2;
        t_one.step_versionOne(1);
    }
    end = get_timestamp();
    one = end - start;
    std::cout << "step_versionOne = " << one << std::endl;

    start = get_timestamp();
    for(int i = 0; i < max; ++i) {
        t_two.running = rand() % 2;
        t_two.step_versionTwo(1);
    }
    end = get_timestamp();
    two = end - start;

    percent = (one - two) / static_cast<double>(one) * 100.0;
    std::cout << "step_versionTwo = " << two << std::endl;
    std::cout << "step_one - step_two = " << one - two << std::endl;
    std::cout << "one faster than two by = " << percent << std::endl;
}
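For completeness, timestamp_t and get_timestamp() could look like this (my own sketch of a microsecond timer based on POSIX gettimeofday; the original answer does not show them):

#include <sys/time.h>
#include <cstdint>

typedef std::uint64_t timestamp_t;

// Microsecond wall-clock timestamp (assumed helper, not the original author's code).
static timestamp_t get_timestamp() {
    timeval now;
    gettimeofday(&now, nullptr);
    return static_cast<timestamp_t>(now.tv_sec) * 1000000ULL + now.tv_usec;
}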
These are the results I got:
step_versionOne = 39738380
step_versionTwo = 26047337
step_one - step_two = 13691043
one faster than two by = 34.4529%
So yes, the second function is clearly faster, by around 35%. Note that the percentage difference varied between 30 and 55 percent for smaller numbers of iterations, whereas it seems to plateau at around 35% the longer the test runs. This might be due to sporadic execution of system tasks while the simulation is running, which becomes a lot less sporadic, i.e. more consistent, the longer you run it (although this is just my assumption, I have no idea if it's actually true).
All in all, nice question, I learned something today!
MORE:
Of course, by randomly generating running, we are essentially rendering branch prediction useless in the first function, so the results above are not too surprising. However, if we decide not to alter running during the loop iterations and instead leave it at its default value, in this case false, branch prediction will do its magic in the first function, which will actually be faster by almost 20%, as these results suggest:
step_versionOne = 6273942
step_versionTwo = 7809508
step_two - step_one = 1535566
two faster than one by = 19.6628
Because running is constant throughout execution, notice that the simulation time is much shorter than it was with a randomly changing running - likely the result of a compiler optimization.
Why is the second function slower in this case? Well, branch prediction will quickly realize that the condition in the first function is never met and so will stop checking it in the first place (as though if(running) ticks += mStepSize; weren't even there). On the other hand, the second function still has to perform ticks += mStepSize * static_cast<int>(running); in every iteration, thus making the first function more efficient.
But what if we set running to true? Well, branch prediction will kick in again; however, this time the first function has to evaluate ticks += mStepSize; in every iteration. Here are the results with running{true}:
step_versionOne = 7522095
step_versionTwo = 7891948
step_two - step_one = 369853
two faster than one by = 4.68646
Notice that step_versionTwo takes a consistent amount of time whether running is constantly true or false, but it still takes marginally longer than step_versionOne. Well, this might be because I was too lazy to run it enough times to determine whether it's consistently slower or whether that was a one-time fluke (results vary slightly every time you run it, since the OS has to run in the background and it's not always going to do the same thing). But if it is consistently slower, it might be because function two (ticks += mStepSize * static_cast<int>(running);) has one more arithmetic op than function one (ticks += mStepSize;).
Finally, let's compile with optimization - g++ -o test test.cpp -std=c++11 -O1 - revert running back to false, and check the results:
step_versionOne = 704973
step_versionTwo = 695052
More or less the same. The compiler will do its optimization pass, realize running is always false, and thus, for all intents and purposes, remove the body of step_versionOne, so when you call it from the loop in main, it'll just call the function and return.
On the other hand, when optimizing the second function, it will realize that ticks += mStepSize * static_cast<int>(running); will always generate the same result, i.e. 0, so it won't bother executing that either.
All in all, if I'm correct (and if not, please correct me, I'm pretty new to this), all you'll get when calling both functions from the main loop is their overhead.
P.S. Here's the result for the first case (running is randomly generated in every iteration) when compiled with optimization:
step_versionOne = 18868782
step_versionTwo = 18812315
step_two - step_one = 56467
one faster than two by = 0.299261
Related
I'm trying to optimize the performance of a C++ program by using the TBB library.
My program only contains a couple of small for loops, so I know it can be a challenge to optimize time complexity in this case, but I have to use TBB.
As such, I tried to use a partitioner, which made the program 2 times faster with TBB than without the partitioner, but it's still slower than the original program without any parallelism.
In my code, I print when a loop iteration starts and ends, along with its id, to see if there is parallelism. The output shows that the loop is in fact executed sequentially, for example: start 1 end 1, start 2 end 2, etc. (it's a list of size 200). The output of the ids isn't out of order like you would expect from a parallelized program.
Here is an example of how I used the library:
tbb::global_control c(tbb::global_control::max_allowed_parallelism, 1000);
size_t grainsize = 1000;
size_t changes = 0;
tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, list.size(), grainsize),
    [&](const tbb::blocked_range<std::size_t>& r) {
        for (size_t id = r.begin(); id < r.end(); ++id) {
            std::cout << "start:" << id << std::endl;
            double disto = std::numeric_limits<double>::max();
            size_t cluster_id = 0;
            const Point& point = points.at(id);
            for (size_t i = 0; i < short_list.size(); i++) {
                const Point& origin = originss[i];
                double disto2 = point.dist(origin);
                if (disto2 < disto) {
                    disto = disto2;
                    cluster_id = i;
                }
            }
            if (m[id] != cluster_id) {
                m[id] = cluster_id;
                changes++;
            }
            disto_list[id] = disto;
            std::cout << "end:" << id << std::endl;
        }
    }
);
Is there a way to improve the performance of a C++ program composed of multiple small for loops with the TBB library? And why are the loops not parallelized?
If you are using task_scheduler_init in your program, then TBB uses the same thread throughout the program until task_scheduler_init objects are destroyed.
As you are passing max_allowed_parallelism as a parameter to global_control, if it is set to 1 it will make your application run sequentially.
You can refer to the below link:
https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html
It would be helpful if you could provide a complete reproducer so we can figure out where exactly the issue occurs.
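For reference, here is a minimal, self-contained sketch of global_control with parallel_for (my own example using the oneTBB headers, not the asker's code); max_allowed_parallelism caps the number of worker threads, so a value of 1 forces sequential execution:

#include <tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>
#include <atomic>
#include <iostream>
#include <vector>

int main() {
    // Cap TBB at 4 worker threads; with 1 instead, the loop runs sequentially.
    tbb::global_control gc(tbb::global_control::max_allowed_parallelism, 4);

    std::vector<int> data(200, 1);
    std::atomic<long> sum{0};

    // Default chunking: TBB is free to split the range across the workers.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, data.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            long local = 0;
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                local += data[i];
            sum += local;
        });

    std::cout << sum << '\n';
}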
I'm trying to use steady_clock to benchmark parts of my code, and I'm pulling some hair out here. It seems that sometimes it returns the difference between two times, and sometimes it just returns 0.
I have the following code. This is not the real code in my program, but it illustrates the problem:
typedef std::chrono::steady_clock::time_point clock_point;

for ( int i = 0; i < 2; i++ ) {
    clock_point start_overall = std::chrono::steady_clock::now();
    for ( int j = 0; j < 10000000; j++ ) {
        int q = 4;
    }
    clock_point end_phase_1 = std::chrono::steady_clock::now();
    std::cout << "DIFF=" << std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count() << "\n";
}
This gives me the following output from running the program 4 times:
DIFF=15622
DIFF=0
DIFF=12968
DIFF=13001
DIFF=12966
DIFF=13997
DIFF=0
DIFF=0
Very frustrating!! I need consistent times here. And it's not like the time needed to loop 10,000,000 times is completely irrelevant. In my actual program there's much more going on in the loop and it takes significantly longer, but I still sometimes get 0 values for the time differences.
What's going on? How can I fix this so I get reliable time differences? Thanks.
EDIT: OK, because the explanation I'm getting is that the compiler optimizes away the loop since nothing actually happens in it, I'm going to show the actual code in the actual loop that runs between the two clock points:
// need to reset some variables with each situation
// these are global vars so can access throughout (ewww)
this_sit_phase_1_complete = dataVars.phase_1_complete;
this_sit_on_the_play = dataVars.on_the_play;
this_sit_start_drawing_cards = dataVars.start_drawing_cards;
this_sit_current_turn = dataVars.current_turn;
this_sit_max_turn = dataVars.max_turn;
// note: do i want a separate split count for each scenario?
// mmm yeah.. THIS IS WHAT I SHOULD DO INSTEAD OF GLOBAL VARS....
dataVars.scen_active_index = i;
// point to the split count we want to use
// dataVars.use_scen_split_count = &dataVars.scen_phase_1and2_split_counts[i];
dataVars.split_count[i] += 1;
// PHASE 1:
// if we're on the play, we execute first turn without drawing a card
// just a single split to start in a single que
// phase 1 won't be complete until we draw a card tho
// create the all_splits_phase_1 for each situation
all_splits_que all_splits_phase_1;
// SPLIT STRUCT
// create the first split in the scenario
split_struct first_split_struct;
// set vars to track splits
first_split_struct.split_id = dataVars.split_count[i];
// first_split_struct.split_trail = std::to_string(dataVars.split_count[i]);
// set remaining vars
first_split_struct.cards_in_hand_numbs = dataVars.scen_hand_card_numbs[i];
first_split_struct.cards_in_deck_numbs = dataVars.scen_initial_decks[i];
first_split_struct.cards_bf_numbs = dataVars.scen_bf_card_numbs[i];
first_split_struct.played_a_land = false;
// store the split struct as the initial split
all_splits_phase_1 = { first_split_struct };
// if we're on the play, execute first turn without
// drawing any cards
if ( this_sit_on_the_play ) {
    // execute the turn on the play before drawing anything
    execute_turn(all_splits_phase_1);
    // move to next turn
    this_sit_current_turn += 1;
}
// ok so now, regardless of if we were on the play or not, we have to draw
// a card for every remaining card in each split, and then execute a turn
// once these splits are done, we can convert over to phase 2
do_draw_every_card( all_splits_phase_1 );
// execute another turn after drawing one of everything,
// we wont actually draw anything within the turn
execute_turn( all_splits_phase_1 );
// next turn
this_sit_current_turn += 1;
clock_point end_phase_1 = std::chrono::steady_clock::now();
benchmarker[dataVars.scen_active_index].phase_1_t = std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count();
There is LOTS happening here, lots and lots; the compiler would never simplify out this block. And yet I'm getting 0s, as I explained.
From the OP's code:
for ( int j = 0; j < 10000000; j++ ) {
    int q = 4;
}
This is a repeated assignment to a local variable which isn't used anywhere.
I strongly suspect that the compiler is clever enough to recognize that the loop causes no side effects. Hence it doesn't emit any code for the loop - a proper (and legal) optimization.
To check this, I completed the OP's code snippet into the following MCVE:
#include <chrono>
#include <iostream>

typedef std::chrono::steady_clock::time_point clock_point;

int main()
{
    for ( int i = 0; i < 2; i++ ) {
        clock_point start_overall = std::chrono::steady_clock::now();
        for ( int j = 0; j < 10000000; j++ ) {
            int q = 4;
        }
        clock_point end_phase_1 = std::chrono::steady_clock::now();
        std::cout << "DIFF=" << std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count() << "\n";
    }
}
and compiled it with -O2 -Wall -std=c++17 on CompilerExplorer:
Live Demo on CompilerExplorer
Please, note that the lines for the loop are not colored.
The reason is (as I assumed): there is no code emitted for the for-loop.
So the OP measures two consecutive calls of std::chrono::steady_clock::now(), which may (or may not) happen within a sub-clock-tick interval. Thus it looks as if no time has passed between these calls.
To prevent such optimizations, the code has to contain something that causes side effects the compiler cannot foresee at compile time. Input/output operations are an option: the loop could read from a variable determined by input and write results to a container used for output.
Marking variables as volatile could be an option as well because it forces the compiler to assign the variable in any case even if it cannot "see" side-effects.
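To illustrate the volatile option, here is a minimal variant of the snippet above (my own sketch, not the OP's code); with the store going to a volatile, the loop can no longer be dropped as dead code:

#include <chrono>
#include <iostream>

int main()
{
    volatile int q = 0;   // volatile: every store must actually be performed
    auto start = std::chrono::steady_clock::now();
    for (int j = 0; j < 10000000; j++) {
        q = 4;            // no longer removable by the optimizer
    }
    auto end = std::chrono::steady_clock::now();
    std::cout << "DIFF="
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << "\n";
}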
I ran your code. In debug I see the correct difference; in release it's all 0s. My assumption is that the assignment gets optimized away. Try it in debug, or change int q to volatile int q.
I am trying to use OpenMP to benchmark the speed of a data structure that I implemented. However, I seem to be making a fundamental mistake: the throughput decreases instead of increasing with the number of threads, no matter what operation I try to benchmark.
Below you can see code that tries to benchmark the speed of a for loop; as such, I would expect it to scale (somewhat) linearly with the number of threads, but it doesn't (compiled on a dual-core laptop with g++ and C++11, both with and without the -O3 flag).
#include <omp.h>
#include <atomic>
#include <chrono>
#include <iostream>

thread_local const int OPS = 10000;
thread_local const int TIMES = 200;

double get_tp(int THREADS)
{
    double threadtime[THREADS] = {0};
    //Repeat the test many times
    for(int iteration = 0; iteration < TIMES; iteration++)
    {
        #pragma omp parallel num_threads(THREADS)
        {
            double start, stop;
            int loc_ops = OPS/float(THREADS);
            int t = omp_get_thread_num();
            //Force all threads to start at the same time
            #pragma omp barrier
            start = omp_get_wtime();
            //Do a certain kind of operations loc_ops times
            for(int i = 0; i < loc_ops; i++)
            {
                //Here I would put the operations to benchmark
                //in this case a boring for loop
                int x = 0;
                for(int j = 0; j < 1000; j++)
                    x++;
            }
            stop = omp_get_wtime();
            threadtime[t] += stop-start;
        }
    }
    double total_time = 0;
    std::cout << "\nThread times: ";
    for(int i = 0; i < THREADS; i++)
    {
        total_time += threadtime[i];
        std::cout << threadtime[i] << ", ";
    }
    std::cout << "\nTotal time: " << total_time << "\n";
    double mopss = float(OPS)*TIMES/total_time;
    return mopss;
}

int main()
{
    std::cout << "\n1 " << get_tp(1) << "ops/s\n";
    std::cout << "\n2 " << get_tp(2) << "ops/s\n";
    std::cout << "\n4 " << get_tp(4) << "ops/s\n";
    std::cout << "\n8 " << get_tp(8) << "ops/s\n";
}
Output with -O3 on a dual-core machine, so we don't expect the throughput to increase beyond 2 threads; but it does not even increase when going from 1 to 2 threads - it decreases by 50%:
1 Thread
Thread times: 7.411e-06,
Total time: 7.411e-06
2.69869e+11 ops/s
2 Threads
Thread times: 7.36701e-06, 7.38301e-06,
Total time: 1.475e-05
1.35593e+11ops/s
4 Threads
Thread times: 7.44301e-06, 8.31901e-06, 8.34001e-06, 7.498e-06,
Total time: 3.16e-05
6.32911e+10ops/s
8 Threads
Thread times: 7.885e-06, 8.18899e-06, 9.001e-06, 7.838e-06, 7.75799e-06, 7.783e-06, 8.349e-06, 8.855e-06,
Total time: 6.5658e-05
3.04609e+10ops/s
To make sure that the compiler does not remove the loop, I also tried outputting x after measuring the time, and to the best of my knowledge the problem persists. I also tried the code on a machine with more cores, and it behaved very similarly. Without -O3 the throughput also does not scale. So there is clearly something wrong with the way I benchmark. I hope you can help me.
I'm not sure why you define performance as the total number of operations per total CPU time and are then surprised that it is a decreasing function of the number of threads. This will almost always and universally be the case, except when cache effects kick in. The true performance metric is the number of operations per wall-clock time.
It is easy to show this with simple mathematical reasoning. Given a total amount of work W and the processing capability of each core P, the time on a single core is T_1 = W / P. Dividing the work evenly among n cores means each of them works for T_1,n = (W / n + H) / P, where H is the overhead per thread induced by the parallelisation itself. The sum of those is T_n = n * T_1,n = W / P + n * (H / P) = T_1 + n * (H / P).
The overhead is always a positive value, even in the trivial case of so-called embarrassing parallelism, where no two threads need to communicate or synchronise. For example, launching the OpenMP threads takes time. You cannot get rid of the overhead, you can only amortise it over the lifetime of the threads by making sure that each one gets a lot to work on. Therefore, T_n > T_1, and with a fixed number of operations in both cases the performance on n cores will always be lower than on a single core. The only exception to this rule is the case when the data for work of size W doesn't fit in the lower-level caches but that for work of size W / n does. This results in a massive speed-up that exceeds the number of cores, known as superlinear speed-up. You are measuring inside the thread function, so you ignore the value of H and T_n should more or less be equal to T_1 within the timer precision, but...
With multiple threads running on multiple CPU cores, they all compete for limited shared CPU resources, namely last-level cache (if any), memory bandwidth, and thermal envelope.
The memory bandwidth is not a problem when you are simply incrementing a scalar variable, but becomes the bottleneck when the code starts actually moving data in and out of the CPU. A canonical example from numerical computing is the sparse matrix-vector multiplication (spMVM) -- a properly optimised spMVM routine working with double non-zero values and long indices eats so much memory bandwidth, that one can completely saturate the memory bus with as low as two threads per CPU socket, making an expensive 64-core CPU a very poor choice in that case. This is true for all algorithms with low arithmetic intensity (operations per unit of data volume).
When it comes to the thermal envelope, most modern CPUs employ dynamic power management and will overclock or clock down the cores depending on how many of them are active. Therefore, while n clocked down cores perform more work in total per unit of time than a single core, a single core outperforms n cores in terms of work per total CPU time, which is the metric you are using.
With all this in mind, there is one last (but not least) thing to consider - timer resolution and measurement noise. Your run times are a couple of microseconds. Unless your code runs on specialised hardware that does nothing else but run your code (i.e., no time sharing with daemons, kernel threads, and other processes, and no interrupt handling), you need benchmarks that run several orders of magnitude longer, preferably for at least a couple of seconds.
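As an illustration of the wall-clock alternative (my own sketch, not part of the original answer): take omp_get_wtime() outside the parallel region, so the per-thread overhead H is included in the measurement, and give the benchmark enough work to run for a second or more.

#include <omp.h>
#include <cstdio>

int main() {
    const long long OPS = 2000000000LL;  // enough work to run for a noticeable time
    long long x = 0;

    double start = omp_get_wtime();      // wall-clock time, taken outside the region
#pragma omp parallel for reduction(+ : x)
    for (long long i = 0; i < OPS; ++i)
        x += i & 1;                      // placeholder work; put the operations to benchmark here
    double stop = omp_get_wtime();

    // x is printed so the loop has an observable result and cannot be discarded.
    std::printf("x = %lld, %.0f Mops/s wall-clock\n", x, OPS / (stop - start) / 1e6);
    return 0;
}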
The loop is almost certainly still getting optimized, even if you output the value of x after the outer loop. The compiler can trivially replace the entire loop with a single instruction since the loop bounds are constant at compile time. Indeed, in this example:
#include <iostream>
int main()
{
int x = 0;
for (int i = 0; i < 10000; ++i) {
for (int j = 0; j < 1000; ++j) {
++x;
}
}
std::cout << x << '\n';
return 0;
}
The loop is replaced with the single assembly instruction mov esi, 10000000.
Always inspect the assembly output when benchmarking to make sure that you're measuring what you think you are; in this case you are just measuring the overhead of creating threads, which of course will be higher the more threads you create.
Consider having the innermost loop do something that can't be optimized away. Random number generation is a good candidate because it should perform in constant time, and it has the side-effect of permuting the PRNG state (making it ineligible to be removed entirely, unless the seed is known in advance and the compiler is able to unravel all of the mutation in the PRNG).
For example:
#include <iostream>
#include <random>

int main()
{
    std::mt19937 r;
    std::uniform_real_distribution<double> dist{0, 1};
    for (int i = 0; i < 10000; ++i) {
        for (int j = 0; j < 1000; ++j) {
            dist(r);
        }
    }
    return 0;
}
Both loops and the PRNG invocation are left intact here.
I was kind of bored, so I wanted to try using std::thread and eventually measure the performance of single- and multithreaded console applications. This is a two-part question. I started with a single-threaded sum of a massive vector of ints (800,000 ints).
int sum = 0;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 800000; ++i)
    sum += ints[i];
auto end = chrono::high_resolution_clock::now();
auto diff = end - start;
Then I added a range-based and an iterator-based for loop and measured them the same way with chrono::high_resolution_clock.
for (auto& val : ints)
    sum += val;

for (auto it = ints.begin(); it != ints.end(); ++it)
    sum += *it;
At this point console output looked like:
index loop: 30.0017ms
range loop: 221.013ms
iterator loop: 442.025ms
This was a debug build, so I changed to release, and the difference was ~1ms in favor of the index-based for. No big deal, but just out of curiosity: should there be a difference this big in debug mode between these three for loops? Or even a difference of 1ms in release mode?
I moved on to thread creation and tried to do a parallel sum of the array with this lambda (everything captured by reference so I could use the vector of ints and a previously declared mutex), using an index-based for:
auto func = [&](int start, int total, int index)
{
    int partial_sum = 0;
    auto s = chrono::high_resolution_clock::now();
    for (int i = start; i < start + total; ++i)
        partial_sum += ints[i];
    auto e = chrono::high_resolution_clock::now();
    auto d = e - s;
    m.lock();
    cout << "thread " + to_string(index) + ": " << chrono::duration<double, milli>(d).count() << "ms" << endl;
    sum += partial_sum;
    m.unlock();
};

for (int i = 0; i < 8; ++i)
    threads.push_back(thread(func, i * 100000, 100000, i));
Basically every thread was summing 1/8 of the total array, and the final console output was:
thread 0: 6.0004ms
thread 3: 6.0004ms
thread 2: 6.0004ms
thread 5: 7.0004ms
thread 4: 7.0004ms
thread 1: 7.0004ms
thread 6: 7.0004ms
thread 7: 7.0004ms
8 threads total: 53.0032ms
So I guess the second part of this question is: what's going on here? The solution with 2 threads also ended at ~30ms. Cache ping-pong? Something else? If I'm doing something wrong, what would be the correct way to do it? Also, if it's relevant, I was trying this on an i7 with 8 threads, so yes, I know I didn't count the main thread, but I tried it with 7 separate threads and got pretty much the same result.
EDIT: Sorry, forgot to mention this was on Windows 7 with Visual Studio 2013 and Visual Studio's v120 compiler, or whatever it's called.
EDIT2: Here's the whole main function:
http://pastebin.com/HyZUYxSY
With optimisation not turned on, all the method calls that are performed behind the scenes are likely real method calls. Inline functions are likely not inlined but really called. For template code, you really need to turn on optimisation to avoid all the code being taken literally. For example, it's likely that your iterator code will call ints.end() 800,000 times, and operator!= for the comparison 800,000 times, which calls operator== and so on and so on.
For the multithreaded code: processors are complicated, operating systems are complicated, and your code isn't alone on the computer. Your computer can change its clock speed, switch into turbo mode, or switch into heat-protection mode. And rounding the times to milliseconds isn't really helpful: it could be that one thread took 6.49 milliseconds and another 6.51, and they got rounded differently.
should there be a difference this big in debug mode between these three for loops?
Yes. If allowed, a decent compiler can produce identical output for each of the 3 different loops, but if optimizations are not enabled, the iterator version has more function calls and function calls have certain overhead.
Or even a difference in 1ms in release mode?
Your test code:
start = ...
for (auto& val : ints)
    sum += val;
end = ...
diff = end - start;
sum = 0;
doesn't use the result of the loop at all, so when optimizing, the compiler may simply throw the code away, resulting in something like:
start = ...
// do nothing...
end = ...
diff = end - start;
For all your loops.
The difference of 1ms may be produced by the coarse granularity of high_resolution_clock in the standard library implementation you are using, and by differences in process scheduling during execution. I measured the index-based for being 0.04 ms slower, but that result is meaningless.
Aside from how std::thread is implemented on Windows, I would like to point your attention to your available execution units and context switching.
An i7 does not have 8 real execution units. It's a quad-core processor with hyper-threading. And HT does not magically double the available number of threads, no matter how it's advertised. It's a really clever system which tries to fit in instructions from an extra pipeline whenever possible. But in the end all instructions go through only four execution units.
So running 8 (or 7) threads is still more than your CPU can really handle simultaneously. That means your CPU has to switch a lot between 8 hot threads clamouring for calculation time. Top that off with several hundred more threads from the OS (admittedly most of them asleep) that also need time, and you're left with a high degree of uncertainty in your measurements.
With a single threaded for-loop the OS can dedicate a single core to that task and spread the half-sleeping threads across the other three. This is why you're seeing such a difference between 1 thread and 8 threads.
As for your debugging questions: you should check if Visual Studio has Iterator checking enabled in debugging. When it's enabled every time an iterator is used it is bounds-checked and such. See: https://msdn.microsoft.com/en-us/library/aa985965.aspx
Lastly: have a look at the -openmp switch. If you enable it and apply the OpenMP #pragmas to your for loops, you can do away with all the manual thread creation. I toyed around with similar threading tests (because it's cool :) ) and OpenMP's performance is pretty damn good.
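A minimal sketch of that approach (my own example, not the answerer's code), assuming /openmp is enabled in Visual Studio (or -fopenmp with g++): the reduction clause hands each thread its own partial sum and combines them at the end of the loop.

#include <iostream>
#include <vector>

int main()
{
    std::vector<int> ints(800000, 1);
    long long sum = 0;

    // OpenMP splits the iterations across the worker threads and adds the
    // per-thread partial sums together when the loop finishes.
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < static_cast<int>(ints.size()); ++i)
        sum += ints[i];

    std::cout << "sum = " << sum << '\n';
    return 0;
}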
For the first question, regarding the difference in performance between the range, iterator and index implementations, others have pointed out that in a non-optimized build, much which would normally be inlined may not be.
However there is an additional wrinkle: by default, in Debug builds, Visual Studio will use checked iterators. Access through a checked iterator is checked for safety (does the iterator refer to a valid element?), and consequently operations which use them, including the range-based iteration, are heavily penalized.
For the second part, I have to say that those durations seem abnormally long. When I run the code locally, compiled with g++ -O3 on a core i7-4770 (Linux), I get sub-millisecond timings for each method, less in fact than the jitter between runs. Altering the code to iterate each test 1000 times gives more stable results, with the per test times being 0.33 ms for the index and range loops with no extra tweaking, and about 0.15 ms for the parallel test.
The parallel threads are doing in total the same number of operations, and what's more, using all four cores limits the CPU's ability to dynamically increase its clock speed. So how can it take less total time?
I'd wager that the gains result from better utilization of the per-core L2 caches, four in total. Indeed, using four threads instead of eight threads reduces the total parallel time to 0.11 ms, consistent with better L2 cache use.
Browsing the Intel processor documentation, all the Core i7 processors, including the mobile ones, have at least 4 MB of L3 cache, which will happily accommodate 800 thousand 4-byte ints. So I'm surprised both by the raw times being 100 times larger than I'm seeing and by the 8-thread totals being so much greater, which, as you surmise, is a strong hint that they are thrashing the cache. I'm presuming this demonstrates just how suboptimal the Debug build code is. Could you post results from an optimised build?
Not knowing how those std::thread classes are implemented, one possible explanation for the 53ms could be:
The threads are started right away when they get instantiated (I see no thread.start() or threads.StartAll() or the like). So, during the time the first thread instance gets active, the main thread might (or might not) be preempted. There is no guarantee that the threads are spawned on individual cores, after all (thread affinity).
If you have a closer look at POSIX APIs, there is the notion of "application context" and "system context", which basically implies, that there might be an OS policy in place which would not use all cores for 1 application.
On Windows (this is where you were testing), maybe the threads are not spawned directly but via a thread pool, maybe with some extra std::thread functionality, which could produce overhead/delay (such as completion ports etc.).
Unfortunately my machine is pretty fast, so I had to increase the amount of data processed to get significant times. But on the upside, this reminded me to point out that, typically, it starts to pay off to go parallel when the computation time is well beyond the length of a time slice (rule of thumb).
Here is my "native" Windows implementation, which - for a large enough array - finally makes the threads win over a single-threaded computation.
#include <stdafx.h>
#include <nativethreadTest.h>
#include <vector>
#include <cstdint>
#include <Windows.h>
#include <chrono>
#include <iostream>
#include <thread>
struct Range
{
Range( const int32_t *p, size_t l)
: data(p)
, length(l)
, result(0)
{}
const int32_t *data;
size_t length;
int32_t result;
};
static int32_t Sum(const int32_t * data, size_t length)
{
int32_t sum = 0;
const int32_t *end = data + length;
for (; data != end; data++)
{
sum += *data;
}
return sum;
}
static int32_t TestSingleThreaded(const Range& range)
{
return Sum(range.data, range.length);
}
DWORD
WINAPI
CalcThread
(_In_ LPVOID lpParameter
)
{
Range * myRange = reinterpret_cast<Range*>(lpParameter);
myRange->result = Sum(myRange->data, myRange->length);
return 0;
}
static int32_t TestWithNCores(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<HANDLE> threadHandles;
threadHandles.reserve(ncores);
for (size_t i = 0; i < ncores; ++i)
{
threadHandles.push_back(::CreateThread(NULL, 0, CalcThread, &ranges[i], 0, NULL));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
DWORD waitResult = ::WaitForMultipleObjects((DWORD)threadHandles.size(), &threadHandles[0], TRUE, INFINITE);
if (WAIT_OBJECT_0 == waitResult)
{
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
}
else
{
throw std::runtime_error("Something went horribly - HORRIBLY wrong!");
}
for (auto& h : threadHandles)
{
::CloseHandle(h);
}
return result;
}
static int32_t TestWithSTLThreads(const Range& range, size_t ncores)
{
int32_t result = 0;
std::vector<Range> ranges;
size_t nextStart = 0;
size_t chunkLength = range.length / ncores;
size_t remainder = range.length - chunkLength * ncores;
while (nextStart < range.length)
{
ranges.push_back(Range(&range.data[nextStart], chunkLength));
nextStart += chunkLength;
}
Range remainderRange(&range.data[range.length - remainder], remainder);
std::vector<std::thread> threads;
for (size_t i = 0; i < ncores; ++i)
{
threads.push_back(std::thread([](Range* range){ range->result = Sum(range->data, range->length); }, &ranges[i]));
}
int32_t remainderResult = Sum(remainderRange.data, remainderRange.length);
for (auto& t : threads)
{
t.join();
}
for (auto& r : ranges)
{
result += r.result;
}
result += remainderResult;
return result;
}
void TestNativeThreads()
{
const size_t DATA_SIZE = 800000000ULL;
typedef std::vector<int32_t> DataVector;
DataVector data;
data.reserve(DATA_SIZE);
for (size_t i = 0; i < DATA_SIZE; ++i)
{
data.push_back(static_cast<int32_t>(i));
}
Range r = { data.data(), data.size() };
std::chrono::system_clock::time_point singleThreadedStart = std::chrono::high_resolution_clock::now();
int32_t result = TestSingleThreaded(r);
std::chrono::system_clock::time_point singleThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Single threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(singleThreadedEnd - singleThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point multiThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithNCores(r, 8);
std::chrono::system_clock::time_point multiThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "Multi threaded sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(multiThreadedEnd - multiThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
std::chrono::system_clock::time_point stdThreadedStart = std::chrono::high_resolution_clock::now();
result = TestWithSTLThreads(r, 8);
std::chrono::system_clock::time_point stdThreadedEnd = std::chrono::high_resolution_clock::now();
std::cout
<< "std::thread sum: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(stdThreadedEnd - stdThreadedStart).count()
<< "ms." << " Result = " << result << std::endl;
}
Here is the output of this code on my machine:
Single threaded sum: 382ms. Result = -532120576
Multi threaded sum: 234ms. Result = -532120576
std::thread sum: 245ms. Result = -532120576
Press any key to continue . . ..
Last but not least, I feel urged to mention that, the way this code is written, it is more of a memory I/O performance benchmark than a CPU computation benchmark.
Better computation benchmarks would use small amounts of data that are local, fit into CPU caches, etc.
Maybe it would be interesting to experiment with the splitting of the data into ranges. What if each thread "jumped" over the data from start to end with a stride of ncores? Thread 1: 0, 8, 16...; thread 2: 1, 9, 17...; etc. Maybe then the "locality" of the memory could gain extra speed.
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that passed.
It gets calculated every iteration, so if I have 10,000 particles that wouldn't be very efficient, because it doesn't need to be recalculated (it doesn't change anyway).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates it once and stores it in a variable that gets used every iteration, so the CPU doesn't have to calculate it every loop, which would theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;
    unsigned int start = clock();

    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";

    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";

    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't really settle it. I know it doesn't do exactly the same thing, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2 (especially at lower iteration counts), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop-invariant code motion, and it is universally implemented (and has been for about 40 years).
On the other hand, it may make sense to use the separate variable anyway, to make the code clearer. This depends on the application, but in many cases, giving a name to the result of an expression can make code clearer. (In other cases, of course, throwing in a lot of extra variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible first, and then, if (and only if) there is a performance problem, profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
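To make the loop-invariant code motion concrete, here is a small sketch (my own illustration, not part of the answer): with optimization on, the compiler effectively turns the first function into the second one by itself, so hoisting the multiplication by hand changes nothing for performance.

#include <vector>

// What you write:
void decay_a(std::vector<float>& lifeTimes, float decayRate, float deltaTime) {
    for (float& lt : lifeTimes)
        if (lt > 0.0f)
            lt -= decayRate * deltaTime;      // loop-invariant sub-expression
}

// What the optimizer effectively produces (and what the hand-hoisted version spells out):
void decay_b(std::vector<float>& lifeTimes, float decayRate, float deltaTime) {
    const float step = decayRate * deltaTime; // computed once, outside the loop
    for (float& lt : lifeTimes)
        if (lt > 0.0f)
            lt -= step;
}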
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release runtime might be caused by the compiler statically computing the result.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced;
You can make MAX depend on some kind of input (e.g. a command-line parameter) to avoid that.
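A sketch of that idea (mine, not the answerer's code): reading the iteration count and the factors from input, and printing foo afterwards, leaves the compiler with nothing it can precompute or discard.

#include <time.h>
#include <iostream>

int main()
{
    unsigned int max = 0;
    float deltaTime = 0, decayRate = 0, foo = 2041.234f;
    std::cin >> max >> deltaTime >> decayRate;   // values unknown at compile time

    unsigned int start = clock();
    for (unsigned int i = 0; i < max; i++)
    {
        foo -= decayRate * deltaTime;
    }
    // foo is printed, so the loop's result is observable and cannot be dropped.
    std::cout << "took " << clock() - start << " ticks, foo = " << foo << "\n";
    return 0;
}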