Boost: Creating objects and populating a vector with threads - c++

Using this boost asio based thread pool (the class is named ThreadPool), I want to parallelize populating a vector of type std::vector<boost::shared_ptr<T>>, where T is a struct containing a std::vector<int> whose content and size are determined dynamically after the struct is initialized.
Unfortunately, I am a newbie at both C++ and multithreading, so my attempts at solving this problem have failed spectacularly. Here's an over-simplified sample program that times the non-threaded and threaded versions of the task. The threaded version's performance is horrendous...
#include "thread_pool.hpp"
#include <ctime>
#include <iostream>
#include <vector>
using namespace boost;
using namespace std;
struct T {
    vector<int> nums;
};
typedef boost::shared_ptr<T> Tptr;
typedef vector<Tptr> TptrVector;

void create_T(const int i, TptrVector& v) {
    v[i] = Tptr(new T());
    T& t = *v[i];
    for (int n = 0; n < 100; n++) {
        t.nums.push_back(n);
    }
}
int main(int argc, char* argv[]) {
    clock_t begin, end;
    double elapsed;
    // define and parse program options
    if (argc != 3) {
        cout << argv[0] << " <num iterations> <num threads>" << endl;
        return 1;
    }
    int iterations = stoi(argv[1]),
        threads = stoi(argv[2]);
    // create thread pool
    ThreadPool tp(threads);
    // non-threaded
    cout << "non-threaded" << endl;
    begin = clock();
    TptrVector v(iterations);
    for (int i = 0; i < iterations; i++) {
        create_T(i, v);
    }
    end = clock();
    elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << " seconds" << endl;
    // threaded
    cout << "threaded" << endl;
    begin = clock();
    TptrVector v2(iterations);
    for (int i = 0; i < iterations; i++) {
        // boost::ref is needed: bind copies its arguments by value otherwise,
        // so the tasks would populate a private copy of the vector
        tp.submit(boost::bind(create_T, i, boost::ref(v2)));
    }
    tp.stop();
    end = clock();
    elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << " seconds" << endl;
    return 0;
}
After doing some digging, I think the poor performance may be due to the threads vying for memory access, but my newbie status is keeping me from exploiting this insight. How can I efficiently populate the pointer vector using multiple threads, ideally with a thread pool?

You haven't provided enough details or a Minimal, Complete, and Verifiable example, so expect a lot of guessing.
create_T is a "cheap" function; scheduling a task and the overhead of executing it are much more expensive. That is why your performance is bad. To get a speed-up from parallelism you need proper work granularity and a sufficient amount of work. Granularity means that each task (in your case, one call to create_T) should be big enough to pay for the multithreading overhead. The simplest approach is to group create_T calls into bigger tasks.

Related

Why is the C++ thread/future overhead so big

I have a worker routine (code below) which runs slower when I run it in a separate thread. As far as I can tell, the worker code and data are completely independent of other threads. All the worker does is append nodes to a tree. The goal is to have multiple workers growing trees in parallel.
Can someone help me understand why there is (significant) overhead when running the worker in a separate thread?
Edit:
Initially I was testing WorkerFuture twice; I corrected that, and I now get the same (better) performance in the no-thread and deferred-async cases, and considerable overhead when an extra thread is involved.
The command to compile (linux): g++ -std=c++11 main.cpp -o main -O3 -pthread
Here is the output (time in milliseconds):
Thread : 4000001 size in 1861 ms
Async : 4000001 size in 1836 ms
Defer async: 4000001 size in 1423 ms
No thread : 4000001 size in 1455 ms
Code:
#include <iostream>
#include <vector>
#include <random>
#include <chrono>
#include <thread>
#include <future>
struct Data
{
    int data;
};

struct Tree
{
    Data data;
    long long total;
    std::vector<Tree *> children;

    long long Size()
    {
        long long size = 1;
        for (auto c : children)
            size += c->Size();
        return size;
    }

    ~Tree()
    {
        for (auto c : children)
            delete c;
    }
};

int
GetRandom(long long size)
{
    static long long counter = 0;
    return counter++ % size;
}

void
Worker_(Tree *root)
{
    std::vector<Tree *> nodes = {root};
    Tree *it = root;
    while (!it->children.empty())
    {
        it = it->children[GetRandom(it->children.size())];
        nodes.push_back(it);
    }
    for (int i = 0; i < 100; ++i)
        nodes.back()->children.push_back(new Tree{{10}, 1, {}});
    for (auto t : nodes)
        ++t->total;
}

long long
Worker(long long iterations)
{
    Tree root = {};
    for (long long i = 0; i < iterations; ++i)
        Worker_(&root);
    return root.Size();
}

void ThreadFn(long long iterations, long long &result)
{
    result = Worker(iterations);
}

long long
WorkerThread(long long iterations)
{
    long long result = 0;
    std::thread t(ThreadFn, iterations, std::ref(result));
    t.join();
    return result;
}

long long
WorkerFuture(long long iterations)
{
    std::future<long long> f = std::async(std::launch::async, [iterations] {
        return Worker(iterations);
    });
    return f.get();
}

long long
WorkerFutureSameThread(long long iterations)
{
    std::future<long long> f = std::async(std::launch::deferred, [iterations] {
        return Worker(iterations);
    });
    return f.get();
}

int main()
{
    long long iterations = 40000;
    auto t1 = std::chrono::high_resolution_clock::now();
    auto total = WorkerThread(iterations);
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Thread : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = WorkerFuture(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Async : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = WorkerFutureSameThread(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Defer async: " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    t1 = std::chrono::high_resolution_clock::now();
    total = Worker(iterations);
    t2 = std::chrono::high_resolution_clock::now();
    std::cout << "No thread : " << total << " size in " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
}
It seems that the problem is caused by dynamic memory management. When multiple threads are involved (even if the main thread does nothing), the C++ runtime must synchronize access to dynamic memory (the heap), which generates some overhead. I did some experiments with GCC, and the solution to your problem is to use a scalable memory allocator library. For instance, when I used tbbmalloc, e.g.,
export LD_LIBRARY_PATH=$TBB_ROOT/lib/intel64/gcc4.7:$LD_LIBRARY_PATH
export LD_PRELOAD=libtbbmalloc_proxy.so.2
the whole problem disappeared.
The reason is simple: you do not do anything in parallel.
While the extra thread is doing something, the main thread does nothing (it just waits for the thread's job to complete).
In the thread case you also have extra work to do (managing the thread and synchronization), so you have a trade-off.
To see any gain you have to do at least two things at the same time.

Can iterating over unsorted data structure (like array, tree), with multiple thread make iteration faster?

Can iterating over an unsorted data structure (like an array or a tree) with multiple threads make the iteration faster?
For example, I have a big array with unsorted data.
int array[1000];
I'm searching for array[i] == 8.
Can running:
Thread 1:
for (auto i = 0; i < 500; i++)
{
    if (array[i] == 8)
        std::cout << "found" << std::endl;
}
Thread 2:
for (auto i = 500; i < 1000; i++)
{
    if (array[i] == 8)
        std::cout << "found" << std::endl;
}
be faster than normal iteration?
#update
I've written a simple test which describes the problem better:
it searches int* array = new int[100000000];
and repeats this 1000 times.
I got the following result:
Number of threads = 2
End of multithread iteration
End of normal iteration
Time with 2 threads 73581
Time with 1 thread 154070
Bool values:0
0
0
Process returned 0 (0x0) execution time : 256.216 s
Press any key to continue.
What's more, when the program was running with 2 threads the CPU usage of the process was around ~90%, and when iterating with 1 thread it was never more than 50%.
So Smeeheey and erip are right that it can make iteration faster.
Of course it can be trickier for less trivial problems.
What I've learned from this test is that the compiler can optimize away work in the main thread (when I was not printing the booleans, the search loop in the main thread was ignored), but it will not do that for other threads.
This is the code I have used:
#include <cstdlib>
#include <thread>
#include <ctime>
#include <iostream>

#define SIZE_OF_ARRAY 100000000
#define REPEAT 1000

inline bool threadSearch(int* array){
    for(auto i = 0; i < SIZE_OF_ARRAY/2; i++)
        if(array[i] == 101) // there is no array[i]==101
            return true;
    return false;
}

int main(){
    int i;
    std::cin >> i; // stops program enabling to set real time priority of the process
    clock_t with_multi_thread;
    clock_t normal;
    srand(time(NULL));
    std::cout << "Number of threads = "
              << std::thread::hardware_concurrency() << std::endl;
    int* array = new int[SIZE_OF_ARRAY];
    bool true_if_found_t1 = false;
    bool true_if_found_t2 = false;
    bool true_if_found_normal = false;
    for(auto i = 0; i < SIZE_OF_ARRAY; i++)
        array[i] = rand()%100;
    with_multi_thread = clock();
    for(auto j = 0; j < REPEAT; j++){
        std::thread t([&](){
            if(threadSearch(array))
                true_if_found_t1 = true;
        });
        std::thread u([&](){
            if(threadSearch(array+SIZE_OF_ARRAY/2))
                true_if_found_t2 = true;
        });
        if(t.joinable())
            t.join();
        if(u.joinable())
            u.join();
    }
    with_multi_thread = (clock()-with_multi_thread);
    std::cout << "End of multithread iteration" << std::endl;
    for(auto i = 0; i < SIZE_OF_ARRAY; i++)
        array[i] = rand()%100;
    normal = clock();
    for(auto j = 0; j < REPEAT; j++)
        for(auto i = 0; i < SIZE_OF_ARRAY; i++)
            if(array[i] == 101) // there is no array[i]==101
                true_if_found_normal = true;
    normal = (clock()-normal);
    std::cout << "End of normal iteration" << std::endl;
    std::cout << "Time with 2 threads " << with_multi_thread << std::endl;
    std::cout << "Time with 1 thread " << normal << std::endl;
    std::cout << "Bool values:" << true_if_found_t1 << std::endl
              << true_if_found_t2 << std::endl
              << true_if_found_normal << std::endl; // showing bool values to prevent compiler from optimization
    return 0;
}
The answer is yes, it can make it faster, but not necessarily. In your case, where you're iterating over a pretty small array, it is likely that the overhead of launching a new thread will be much higher than the benefit gained. If your array were much bigger, that overhead would shrink as a proportion of the overall runtime and eventually become worth paying. Note that you will only get a speed-up if your system has more than one physical core available.
Additionally, note that while the code that reads the array is perfectly thread-safe, writing to std::cout is not (you will get very strange-looking output if you try this). Instead, each thread should do something like return an integer indicating the number of instances found.

better way to stop program execution after certain time

I have the following, which stops execution of the program after a certain time.
#include <iostream>
#include <ctime>
using namespace std;

int main()
{
    time_t timer1;
    time(&timer1);
    time_t timer2;
    double second;
    while (1)
    {
        time(&timer2);
        second = difftime(timer2, timer1);
        // check if the time difference has crossed 3 seconds
        if (second > 3)
        {
            return 0;
        }
    }
    return 0;
}
Would the above program still work if the time goes from 23:59 to 00:01?
Is there any better way to do this?
Provided you have C++11, you can have a look at this example:
#include <thread>
#include <chrono>

int main() {
    std::this_thread::sleep_for(std::chrono::seconds(3));
    return 0;
}
Alternatively, I'd go with a threading library of your choice and use its thread-sleep function. In most cases it is better to send your thread to sleep than to busy-wait.
time() returns the time since the Epoch (00:00:00 UTC, January 1, 1970), measured in seconds. Thus, the time of day does not matter.
You can use std::chrono::steady_clock in C++11. Here is an example based on its static now() method:
using namespace std::chrono;
steady_clock::time_point clock_begin = steady_clock::now();
std::cout << "printing out 1000 stars...\n";
for (int i=0; i<1000; ++i) std::cout << "*";
std::cout << std::endl;
steady_clock::time_point clock_end = steady_clock::now();
steady_clock::duration time_span = clock_end - clock_begin;
double nseconds = double(time_span.count()) * steady_clock::period::num / steady_clock::period::den;
std::cout << "It took me " << nseconds << " seconds.";
std::cout << std::endl;

Populating a list of integers is 100 times faster than populating a vector of integers

I am comparing the time taken to populate a list of integers against a vector of integers.
Each vector and list is populated with 10 million random integers, and the experiment is repeated 100 times to find the average.
To my amazement, populating a list is about 100 times quicker than populating a vector of integers. I would expect populating a vector of integers to be much faster, as vectors are contiguous in memory and insertions should be quicker.
How can populating a list be 100 times, not just 10 times, quicker than populating a vector? I am sure I am missing some concept or idea which is causing this.
This is my code used to generate the results
#include <iostream>
#include <sstream>
#include <list>
#include <vector>
#include <ctime>
#include <time.h>
using namespace std;
int main()
{
    list<int> mylist;
    vector<int> myvector;
    srand(time(NULL));
    int num;
    clock_t list_start;
    clock_t list_end;
    clock_t list_totaltime;
    for (int i = 0; i < 100; i++)
    {
        list_start = clock();
        for (int i = 0; i < 10000000; i++) // 10 million
        {
            num = rand() % 10000000;
            mylist.push_back(num);
        }
        list_end = clock();
        list_totaltime += difftime(list_end, list_start);
        mylist.clear();
    }
    cout << list_totaltime/CLOCKS_PER_SEC/100;
    cout << " List is done ";
    cout << endl
         << endl;
    clock_t vector_start;
    clock_t vector_end;
    clock_t vector_totaltime;
    for (int i = 0; i < 100; i++)
    {
        vector_start = clock();
        for (int i = 0; i < 10000000; i++) // 10 million times
        {
            num = rand() % 10000000;
            myvector.push_back(num);
        }
        vector_end = clock();
        vector_totaltime += difftime(vector_end, vector_start);
        myvector.clear();
    }
    cout << vector_totaltime/CLOCKS_PER_SEC/100;
    cout << " Vector is done ";
}
Can someone explain why this is happening?
I tried with VS2013 C++ compiler, and std::vector is much faster than std::list (as I expected).
I got the following results:
Testing STL vector vs. list push_back() time
--------------------------------------------
Testing std::vector...done.
std::vector::push_back(): 89.1318 ms
Testing std::list...done.
std::list::push_back(): 781.214 ms
I used the Windows high-resolution performance counter to measure times.
Of course, I did the tests in an optimized release build.
I also refactored the random number generation out of the push_back loop, and used a more serious random number technique than rand().
Is your method of using clock() good for measuring execution times?
Which C++ compiler did you use? Did you test an optimized build?
The compilable test code follows:
// Testing push_back performance: std::vector vs. std::list
#include <algorithm>
#include <exception>
#include <iostream>
#include <list>
#include <random>
#include <vector>
#include <Windows.h>
using namespace std;
long long Counter() {
    LARGE_INTEGER li;
    QueryPerformanceCounter(&li);
    return li.QuadPart;
}

long long Frequency() {
    LARGE_INTEGER li;
    QueryPerformanceFrequency(&li);
    return li.QuadPart;
}

void PrintTime(const long long start, const long long finish,
               const char * const s) {
    cout << s << ": " << (finish - start) * 1000.0 / Frequency() << " ms" << endl;
}

int main() {
    try {
        cout << endl
             << "Testing STL vector vs. list push_back() time\n"
             << "--------------------------------------------\n"
             << endl;

        const auto shuffled = []() -> vector<int> {
            static const int kCount = 10 * 1000 * 1000;
            vector<int> v;
            v.reserve(kCount);
            for (int i = 1; i <= kCount; ++i) {
                v.push_back((i % 100));
            }
            mt19937 prng(1995);
            shuffle(v.begin(), v.end(), prng);
            return v;
        }();

        long long start = 0;
        long long finish = 0;

        cout << "Testing std::vector...";
        start = Counter();
        vector<int> v;
        for (size_t i = 0; i < shuffled.size(); ++i) {
            v.push_back(shuffled[i]);
        }
        finish = Counter();
        cout << "done.\n";
        PrintTime(start, finish, "std::vector::push_back()");
        cout << endl;

        cout << "Testing std::list...";
        start = Counter();
        list<int> l;
        for (size_t i = 0; i < shuffled.size(); ++i) {
            l.push_back(shuffled[i]);
        }
        finish = Counter();
        cout << "done.\n";
        PrintTime(start, finish, "std::list::push_back()");
    } catch (const exception& ex) {
        cout << "\n*** ERROR: " << ex.what() << endl;
    }
}
Can someone explain to me why this is happening?
The result is not real; i.e., populating a list of integers is not 100 times faster than populating a vector of integers. What you see is a benchmark artifact caused by the errors in your code pointed out in the comments.
If you initialize the variables and avoid the integer division, you should see a different result; e.g., on my machine populating a vector is 3 times faster than populating a list (btw, calling vector::reserve() has no effect on the result).
Related: Fun with uninitialized variables and compiler (GCC).
Also, you shouldn't use difftime(time_t, time_t) with clock_t values.
For the record, after fixing the uninitialized-variable problems and the integer division, running an optimized build compiled with gcc 4.7.3 on x86_64 with the following flags
g++ -Wall -Wextra -pedantic-errors -pthread -std=c++11 -O3
I get
0.2 List is done
0.07 Vector is done
So the vector is faster, as I would have expected. This is without any further change to the code.
Elements of a vector are stored sequentially in consecutive memory locations. Initially some memory is allotted to the vector, and as you keep adding elements, a lot of memory operations (reallocations and copies) are needed to preserve this property. These memory operations take considerable time.
In a list, by contrast, the head stores the address of the second element, the second stores the address of the third, and so on. There is no need for any memory reallocation.
But lists take more storage memory than vectors.
If the size of the vector exceeds its capacity, the vector needs to be reallocated, and all the existing values have to be copied to the new storage.
Try increasing the capacity with vector::reserve to reduce the number of reallocations needed.

Forcing race between threads using C++11 threads

I just got started with multithreading (new to it in general) using the C++11 threading library, and wrote this small snippet of code.
#include <iostream>
#include <thread>
int x = 5; // variable to be affected by the race

// This function will be called from a thread
void call_from_thread1() {
    for (int i = 0; i < 5; i++) {
        x++;
        std::cout << "In Thread 1 :" << x << std::endl;
    }
}

int main() {
    // Launch a thread
    std::thread t1(call_from_thread1);
    for (int j = 0; j < 5; j++) {
        x--;
        std::cout << "In Thread 0 :" << x << std::endl;
    }
    // Join the thread with the main thread
    t1.join();
    std::cout << x << std::endl;
    return 0;
}
I was expecting to get different results every time (or nearly every time) I ran this program, due to the race between the two threads. However, the output is always 0, i.e. the two threads run as if they ran sequentially. Why am I getting the same result, and is there any way to simulate or force a race between the two threads?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <algorithm> // for std::generate_n and std::for_each
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>

void racer(std::atomic_int& cnt, int& val)
{
    for (int i = 0; i < 1000000; ++i)
    {
        ++val;
        ++cnt;
    }
}

int main(int argc, char *argv[])
{
    unsigned int N = std::thread::hardware_concurrency();
    std::atomic_int cnt = ATOMIC_VAR_INIT(0);
    int val = 0;
    std::vector<std::thread> thrds;
    std::generate_n(std::back_inserter(thrds), N,
                    [&cnt, &val]() { return std::thread(racer, std::ref(cnt), std::ref(val)); });
    std::for_each(thrds.begin(), thrds.end(),
                  [](std::thread& thrd) { thrd.join(); });
    std::cout << "cnt = " << cnt << std::endl;
    std::cout << "val = " << val << std::endl;
    return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a dual-core i7 MacBook Air with hyper-threading, so 4 hardware threads, hence 4 million). The same cannot be said for the non-atomic counter.
There will be significant startup overhead to get the second thread going, so its execution will almost always begin after the first thread has finished the for loop, which by comparison will take almost no time at all. To see a race condition you will need to run a computation that takes much longer, or includes i/o or other operations that take significant time, so that the execution of the two computations actually overlap.