I am working on a project that uses multiple threads to parallelize tasks. During development, I noticed that operations on a std::list container, e.g. pop_front() or push_back(), are significantly slower when executed inside a thread than in single-threaded execution. See the code snippet below:
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <thread>
#include <list>
using namespace std;
void SingleThreadedList()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; i++)
    {
        list<char> l;
        for (int j = 0; j < 10000; j++)
        {
            l.push_back('c');
            l.pop_front();
        }
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "duration single thread: " << duration << endl;
}

void MultiThreadedList()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    auto lambda_fkt = []() {
        list<char> l;
        for (int i = 0; i < 10000; i++)
        {
            l.push_back('c');
            l.pop_front();
        }
    };
    vector<thread*> thread_array;
    for (int i = 0; i < 2; ++i)
    {
        thread *th = new thread(lambda_fkt);
        thread_array.push_back(th);
    }
    for (auto t : thread_array)
    {
        t->join();
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "duration multi thread: " << duration << endl;
    for (auto t : thread_array)
    {
        delete t;
    }
}

int main() {
    SingleThreadedList();
    MultiThreadedList();
}
The code produces the following output:
duration single thread: 4589
duration multi thread: 245483
The single-threaded variant takes about 4 ms, but as soon as the threads are created, execution takes more than 200 ms! I cannot imagine that standard library containers show such a performance difference depending on the execution context. Can someone explain to me what happens in this code and how I can avoid the performance decrease within the threads? Thank you!
PS: When I remove the list operations from this sample and add e.g. some simple math instead, the code quickly shows the expected behavior: the multi-threaded variant gets faster, because the computation is split among multiple threads and thus uses multiple cores.
I am trying to write a multi-threaded program to produce a vector of N*NumPerThread uniform random integers, where N is the return value of std::thread::hardware_concurrency() and NumPerThread is the amount of random numbers I want each thread to generate.
I created a multi-threaded version:
#include <iostream>
#include <thread>
#include <mutex>
#include <vector>
#include <random>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

namespace Vars
{
    const unsigned int N = std::thread::hardware_concurrency(); // number of threads on device
    const unsigned int NumPerThread = 5e5; // number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread * N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    int sz = 0;
}
using namespace Vars;

void AddN(int start)
{
    static std::mutex mtx;
    std::lock_guard<std::mutex> lock(mtx);
    for (unsigned int i = start; i < start + NumPerThread; i++)
    {
        RandNums[i] = dis(gen);
        ++sz;
    }
}

int main()
{
    auto start_time = Clock::now();
    std::vector<std::thread> threads;
    threads.reserve(N);
    for (unsigned int i = 0; i < N; i++)
    {
        threads.emplace_back(AddN, i * NumPerThread);
    }
    for (auto &t : threads)
    {
        t.join();
    }
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
              << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << sz << '\n';
}
and a single-threaded version
#include <iostream>
#include <thread>
#include <vector>
#include <random>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

namespace Vars
{
    const unsigned int N = std::thread::hardware_concurrency(); // number of threads on device
    const unsigned int NumPerThread = 5e5; // number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread * N);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    int sz = 0;
}
using namespace Vars;

void AddN()
{
    for (unsigned int i = 0; i < NumPerThread * N; i++)
    {
        RandNums[i] = dis(gen);
        ++sz;
    }
}

int main()
{
    auto start_time = Clock::now();
    AddN();
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
              << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << sz << '\n';
}
The execution times are more or less the same. I am assuming there is a problem with the multi-threaded version?
P.S. I looked at all of the other similar questions here, I don't see how they directly apply to this task...
Threading is not a magical salve you can rub onto any code that makes it go faster. Like any tool, you have to use it correctly.
In particular, if you want performance out of threading, among the most important questions you need to ask is what data needs to be shared across threads. Your algorithm decided that the data which needs to be shared is the entire std::vector<int> result object. And since different threads cannot manipulate the object at the same time, each thread has to wait its turn to do the manipulation.
Your code is the equivalent of expecting 10 chefs to cook 10 meals in the same time as 1 chef, but you only provide them a single stove.
Threading works out best when nobody has to wait on anybody else to get any work done. Arrange your algorithms accordingly. For example, each thread could build its own array and return them, with the receiving code concatenating all of the arrays together.
You can do this without any mutex.
Create your vector.
Use a mutex just to create an iterator pointing at v.begin() + itsThreadIndex*NumPerThread (technically even this probably isn't necessary).
Then each thread can freely increment that iterator and write to a part of the vector not touched by the other threads.
Be sure each thread has its own copy of
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(1, 1000);
That should run much faster.
UNTESTED code, but this should make my suggestion above clearer:
#include <chrono>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

using Clock = std::chrono::high_resolution_clock;

namespace SharedVars
{
    const unsigned int N = std::thread::hardware_concurrency(); // number of threads on device
    const unsigned int NumPerThread = 5e5; // number of random numbers to generate per thread
    std::vector<int> RandNums(NumPerThread * N);
    std::mutex mtx;
}

void PerThread_AddN(int threadNumber)
{
    using namespace SharedVars;
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    std::vector<int>::iterator from;
    std::vector<int>::iterator to;
    {
        std::lock_guard<std::mutex> lock(mtx); // hold the lock only while accessing the shared vector, not while accessing its contents
        from = RandNums.begin() + threadNumber * NumPerThread;
        to = from + NumPerThread;
    }
    for (auto i = from; i < to; ++i)
    {
        *i = dis(gen);
    }
}

int main()
{
    using namespace SharedVars;
    auto start_time = Clock::now();
    std::vector<std::thread> threads;
    threads.reserve(N);
    for (unsigned int i = 0; i < N; i++)
    {
        threads.emplace_back(PerThread_AddN, i);
    }
    for (auto &t : threads)
    {
        t.join();
    }
    auto end_time = Clock::now();
    std::cout << "\nTime difference = "
              << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
    std::cout << "size = " << RandNums.size() << '\n';
}
Nicol Bolas was right on the money. I reimplemented it using std::packaged_task, and it's around 4-5 times faster now.
#include <iostream>
#include <vector>
#include <random>
#include <future>
#include <thread>
#include <chrono>

using Clock = std::chrono::high_resolution_clock;

const unsigned int N = std::thread::hardware_concurrency(); // number of threads on device
const unsigned int NumPerThread = 5e5; // number of random numbers to generate per thread
std::vector<int> x(NumPerThread);

std::vector<int> createVec()
{
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000);
    for (unsigned int i = 0; i < NumPerThread; i++)
    {
        x[i] = dis(gen);
    }
    return x;
}

int main()
{
    auto start_time = Clock::now();
    std::vector<int> RandNums;
    RandNums.reserve(N * NumPerThread);
    std::vector<std::future<std::vector<int>>> results;
    results.reserve(N);
    std::vector<int> crap;
    crap.reserve(NumPerThread);
    for (unsigned int i = 0; i < N; i++)
    {
        std::packaged_task<std::vector<int>()> temp(createVec);
        results.push_back(temp.get_future()); // push_back, not results[i]: reserve() doesn't create elements
        temp();
        crap = std::move(results[i].get());
        RandNums.insert(RandNums.begin() + (0 * NumPerThread), crap.begin(), crap.end());
    }
    std::cout << RandNums.size() << '\n';
    auto end_time = Clock::now();
    std::cout << "Time difference = "
              << std::chrono::duration<double, std::nano>(end_time - start_time).count() << " nanoseconds\n";
}
But is there a way to make this one better? lewis's version is way faster than this, so there must be something else missing...
I'm building a numa-aware processor that binds to a given socket and accepts lambdas. Here is what I've done:
#include <numa.h>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;
unsigned nodes = numa_num_configured_nodes();
unsigned cores = numa_num_configured_cpus();
unsigned cores_per_node = cores / nodes;
int main(int argc, char* argv[]) {
    putenv("OMP_PLACES=sockets(1)");
    cout << numa_available() << endl; // returns 0
    numa_set_interleave_mask(numa_all_nodes_ptr);
    int size = 200000000;
    for (auto i = 0; i < nodes; ++i) {
        auto t = thread([&]() {
            // binding to given socket
            numa_bind(numa_parse_nodestring(to_string(i).c_str()));
            vector<int> v(size, 0);
            cout << "node #" << i << ": on CPU " << sched_getcpu() << endl;
            #pragma omp parallel for num_threads(cores_per_node) proc_bind(master)
            for (auto i = 0; i < 200000000; ++i) {
                for (auto j = 0; j < 10; ++j) {
                    v[i]++;
                    v[i] *= v[i];
                    v[i] *= v[i];
                }
            }
        });
        t.join();
    }
}
However, all threads are running on socket 0. It seems numa_bind doesn't bind the current thread to the given socket: the second NUMA pass (node #1) outputs node #1: on CPU 0, when it should be on a CPU of node 1. So what's going wrong?
This works for me exactly as I expected:
#include <cassert>
#include <iostream>
#include <numa.h>
#include <omp.h>
#include <sched.h>
int main() {
    assert(numa_available() != -1);
    auto nodes = numa_num_configured_nodes();
    auto cores = numa_num_configured_cpus();
    auto cores_per_node = cores / nodes;
    omp_set_nested(1);
    #pragma omp parallel num_threads(nodes)
    {
        auto outer_thread_id = omp_get_thread_num();
        numa_run_on_node(outer_thread_id);
        #pragma omp parallel num_threads(cores_per_node)
        {
            auto inner_thread_id = omp_get_thread_num();
            #pragma omp critical
            std::cout
                << "Thread " << outer_thread_id << ":" << inner_thread_id
                << " core: " << sched_getcpu() << std::endl;
            assert(outer_thread_id == numa_node_of_cpu(sched_getcpu()));
        }
    }
}
The program first creates two (outer) threads on my dual-socket server. Then it binds them to different sockets (NUMA nodes). Finally, it splits each outer thread into 20 (inner) threads, since each CPU has 10 physical cores and hyperthreading enabled.
All inner threads run on the same socket as their parent thread: on cores 0-9 and 20-29 for outer thread 0, and on cores 10-19 and 30-39 for outer thread 1. (sched_getcpu() returned the virtual core number from the range 0-39 in my case.)
Note that there is no C++11 threading, just pure OpenMP.
I want to use several functions that declare the same array but in different ways (statically, on the stack, and on the heap), display the execution time of each function, and finally call those functions several times.
I think I've managed to do everything, but for the execution time of the functions I'm constantly getting 0, and I don't know if that's supposed to be normal. If somebody could confirm it for me. Thanks
Here's my code
#include "stdafx.h"
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>

#define size 100000
using namespace std;

void prem(){
    auto start = std::chrono::high_resolution_clock::now();
    static int array[size];
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time for static: " << elapsed.count() << " s\n";
}

void first(){
    auto start = std::chrono::high_resolution_clock::now();
    int array[size];
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time on the stack: " << elapsed.count() << " s\n";
}

void secon(){
    auto start = std::chrono::high_resolution_clock::now();
    int *array = new int[size];
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time dynamic: " << elapsed.count() << " s\n";
    delete[] array;
}

int main()
{
    for (int i = 0; i <= 1000; i++){
        prem();
        first();
        secon();
    }
    return 0;
}
prem() - the static array is allocated outside of the function, before your timing code ever runs
first() - the stack array is allocated on function entry, before your code gets to it
You are looping over all 3 functions in a single loop. Why? Didn't you mean to loop 1000 times over each one separately, so that they (hopefully) don't affect each other? In practice that last statement is not true, though.
My suggestions:
Loop over each function separately
Do a single pair of now() calls for the entire 1000 iterations: call now() before you enter the loop and after you exit it, then take the difference and divide it by the number of iterations (1000)
Dynamic allocation can be (trivially) reduced to just grabbing a block of memory in the vast available address space (I assume you are running on a 64-bit platform); unless you actually use that memory, the OS doesn't even need to make sure it is in RAM. That would certainly skew your results significantly
Write a "driver" function that takes a pointer to the function under test
Possible implementation of that driver() function:
void driver(void (*_f)(), int _iter, std::string _name){
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < _iter; ++i){
        (*_f)(); // call through the function pointer
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time " << _name << ": " << elapsed.count() / _iter << " s" << std::endl;
}
That way your main() looks like this:
int main(){
    const int iterations = 1000;
    driver(prem, iterations, "static allocation");
    driver(first, iterations, "stack allocation");
    driver(secon, iterations, "dynamic allocation");
}
Do not do such synthetic tests, because the compiler will optimize out everything that is not used.
As another answer suggests, you need to measure the time for the entire 1000 loops. And even then, I do not think you will get reasonable results.
Let's run not 1000 iterations, but 1000000. And let's add another case, where we just make two subsequent calls to chrono::high_resolution_clock::now() as a baseline:
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>
#include <string>
#include <functional>

#define size 100000
using namespace std;

void prem() {
    static int array[size];
}

void first() {
    int array[size];
}

void second() {
    int *array = new int[size];
    delete[] array;
}

void PrintTime(std::chrono::duration<double> elapsed, int count, std::string msg)
{
    std::cout << msg << elapsed.count() / count << " s\n";
}

int main()
{
    int iterations = 1000000;
    {
        auto start = std::chrono::high_resolution_clock::now();
        auto finish = std::chrono::high_resolution_clock::now();
        PrintTime(finish - start, iterations, "Elapsed time for nothing: ");
    }
    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i <= iterations; i++)
        {
            prem();
        }
        auto finish = std::chrono::high_resolution_clock::now();
        PrintTime(finish - start, iterations, "Elapsed time for static: ");
    }
    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i <= iterations; i++)
        {
            first();
        }
        auto finish = std::chrono::high_resolution_clock::now();
        PrintTime(finish - start, iterations, "Elapsed time on the stack: ");
    }
    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i <= iterations; i++)
        {
            second();
        }
        auto finish = std::chrono::high_resolution_clock::now();
        PrintTime(finish - start, iterations, "Elapsed time dynamic: ");
    }
    return 0;
}
With all optimisations on, I get this result:
Elapsed time for nothing: 3.11e-13 s
Elapsed time for static: 3.11e-13 s
Elapsed time on the stack: 3.11e-13 s
Elapsed time dynamic: 1.88703e-07 s
That basically means the compiler actually optimized out prem() and first() entirely. Not just the calls, but the whole loops, because they have no side effects.
The code takes 866678 clock ticks when run multithreaded. When I comment out the for loops inside the threads (each FOR loop runs 10000 times) and instead run the whole FOR loop (20000 times) without threads, the run time is the same in both cases. But ideally it should have been halved, right?
// thread example
#include <iostream>   // std::cout
#include <thread>     // std::thread
#include <time.h>
#include <cmath>
#include <unistd.h>

int K = 20000;
long int a[20000];

void makeArray(){
    for(int i = 0; i < K; i++){
        a[i] = i;
    }
}

void foo()
{
    // do stuff...
    std::cout << "FOOOO Running...\n";
    for(int i = K/2; i < K; i++){
        // a[i] = a[i]*a[i]*10;
        // a[i] = exp(2/5);
        int j = i*i;
        usleep(2000);
    }
}

void bar(int x)
{
    // do stuff...
    std::cout << "BARRRR Running...\n";
    for(int i = 0; i < K/2; i++){
        //a[i] = a[i]*a[i];
        int j = i*i;
        usleep(2000);
    }
}

void show(){
    std::cout << "The array is:" << "\n";
    for(int i = 0; i < K; i++){
        std::cout << a[i] << "\n";
    }
}

int main()
{
    clock_t t1, t2;
    t1 = clock();
    makeArray();
    // show();
    std::thread first(foo);     // spawn new thread that calls foo()
    std::thread second(bar, 0); // spawn new thread that calls bar(0)
    //show();
    std::cout << "main, foo and bar now execute concurrently...\n";
    // synchronize threads:
    first.join();  // pauses until first finishes
    second.join(); // pauses until second finishes
    //show();
    // for(int i = 0; i < K; i++){
    //     int j = i*i;
    //     //a[i] = a[i]*a[i];
    //     usleep(2000);
    // }
    std::cout << "foo and bar completed.\n";
    //show();
    t2 = clock();
    std::cout << "Runtime: " << (float)t2 - (float)t1 << "\n";
    return 0;
}
The problem is in your use of clock(). That function returns the total amount of CPU time consumed by your program, summed across all cores/CPUs, not elapsed time.
What you are actually interested in is the wall-clock time it took for your program to complete.
Replace clock() with time(), gettimeofday(), or something similar to get what you want.
EDIT - Here's the C++ way to do timers the way you want: http://www.cplusplus.com/reference/chrono/high_resolution_clock/now/
In the following example the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only 5 seconds. Any ideas why? (I can assure you it still holds true if you do real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>
using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();
    for(int j = 1; j < 100000; ++j)
    {
        if(algorithmToRun == 1)
        {
            vector<thread> threads;
            for(int i = 0; i < 16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for(auto& thread : threads) thread.join();
        }
        else if(algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for(unsigned i = 0; i < 16; i++)
            {
                doNothing();
            }
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);
    cout << cppt << endl;
    cout << ompt << endl;
    return 0;
}
OpenMP uses thread pools for its pragmas. Spinning up and tearing down threads is expensive; OpenMP avoids this overhead, so all it is doing is the actual work plus the minimal shared-memory shuttling of the execution state. In your threads code, you are spinning up and tearing down a new set of 16 threads every iteration.
I tried the code from Choosing the right threading framework with a 100-iteration loop, and it took
0.0727 ms with OpenMP, 0.6759 ms with Intel TBB, and 0.5962 ms with the C++ thread library.
I also applied what AruisDante suggested:
void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}
This code seems to take less time than your original C++11 thread section.