I have a very simple function in C++:
#include <boost/timer.hpp>

double testSpeed()
{
    using namespace boost;
    int temp = 0;
    timer aTimer;
    // 1 billion iterations
    for (int i = 0; i < 1000000000; i++) {
        temp = temp + i;
    }
    double elapsedSec = aTimer.elapsed();
    double speed = 1.0 / elapsedSec;
    return speed;
}
I want to run this function with multiple threads. I saw examples online suggesting I can
do it as follows:
// start two new threads that each call the testSpeed function
boost::thread my_thread1(&testSpeed);
boost::thread my_thread2(&testSpeed);
// wait for both threads to finish
my_thread1.join();
my_thread2.join();
However, this will run two threads that each iterate a billion times, right? I want the
two threads to split the job and do it concurrently so the entire thing runs faster. I don't care
about synchronization; it's just a speed test.
Thank you!
There may be a nicer way, but this should work. It passes the range to iterate over into each thread, starts a single timer before the threads are launched, and stops the timer after they're both done. It should be pretty obvious how to scale this up to more threads.
void testSpeed(int start, int end)
{
    // note: temp is never read, so an optimizing compiler may drop this loop entirely
    int temp = 0;
    for (int i = start; i < end; i++)
    {
        temp = temp + i;
    }
}
#include <boost/thread.hpp>
#include <boost/timer.hpp>

int main()
{
    using namespace boost;
    timer aTimer;
    // start two new threads, each covering half the range
    boost::thread my_thread1(&testSpeed, 0, 500000000);
    boost::thread my_thread2(&testSpeed, 500000000, 1000000000);
    // wait for both threads to finish
    my_thread1.join();
    my_thread2.join();
    double elapsedSec = aTimer.elapsed();
    double speed = 1.0 / elapsedSec;
}
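In case it helps, the scale-up doesn't have to hard-code the ranges. Here is a sketch of how the body of main might pick the thread count at runtime; boost::thread_group, boost::bind, and the chunk arithmetic are my additions, not part of the original answer (they need <boost/thread.hpp> and <boost/bind.hpp>):
const int total = 1000000000;
// hardware_concurrency reports the number of hardware threads (0 if unknown)
unsigned n = boost::thread::hardware_concurrency();
if (n == 0) n = 2;
timer aTimer;
boost::thread_group workers;
for (unsigned t = 0; t < n; ++t) {
    // give each thread an equal slice of [0, total)
    int begin = (int)((long long)total * t / n);
    int end = (int)((long long)total * (t + 1) / n);
    workers.create_thread(boost::bind(&testSpeed, begin, end));
}
workers.join_all();
double elapsedSec = aTimer.elapsed();
double speed = 1.0 / elapsedSec;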
The program I am implementing involves iterating over a medium amount of independent data, performing some computation, collecting the result, and then looping back over again. This loop needs to execute very quickly. To solve this, I am trying to implement the following thread pattern.
Rather than spawn threads in setup and join them in collect, I would like to spawn all threads initially, and keep them synchronized throughout their loops. This question regarding thread barriers had initially seemed to point me in the right direction, but my implementation of them is not working. Below is my example
#include <barrier>
#include <thread>
#include <vector>

int main() {
    int counter = 0;
    int threadcount = 10;
    auto on_completion = [&]() noexcept {
        ++counter; // increment counter
    };
    std::barrier sync_point(threadcount, on_completion);
    auto work = [&]() {
        while (true)
            sync_point.arrive_and_wait(); // keep cycling the sync point
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < threadcount; ++i)
        threads.emplace_back(work); // start every thread
    for (auto& thread : threads)
        thread.join();
}
To keep things as simple as possible, there is no computation being done in the worker threads, and I have done away with the setup thread. I am simply cycling the threads, syncing them after each cycle, and keeping a count of how many times they have looped. However, this code is deadlocking very quickly. More threads = faster deadlock. Adding work/delay inside the compute threads slows down the deadlock, but does not stop it.
Am I abusing the thread barrier? Is this unexpected behavior? Is there a cleaner way to implement this pattern?
Edit
It looks like removing the on_completion gets rid of the deadlock. I tried a different approach to meet the synchronization requirements without using the function, but it still deadlocks fairly quickly.
#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    int threadcount = 10;
    std::barrier start_point(threadcount + 1);
    std::barrier stop_point(threadcount + 1);
    auto work = [&](int i) {
        while (true) {
            start_point.arrive_and_wait();
            stop_point.arrive_and_wait();
        }
    };
    std::vector<std::thread> threads;
    for (int i = 0; i < threadcount; ++i) {
        threads.emplace_back(work, i);
    }
    while (true) {
        std::cout << "Setup" << std::endl;
        start_point.arrive_and_wait(); // sync to start
        // workers do work here
        stop_point.arrive_and_wait(); // sync to end
        std::cout << "Collect" << std::endl;
    }
}
I want to learn how to adapt some multithreading pseudocode I have, line by line, to C++. I understand the pseudocode, but I am not very experienced with C++ nor with std::thread.
This is the pseudocode I have and that I've often used:
myFunction
{
    int threadNr = previous;
    int numberProcs = countProcessors();
    // Every thread calculates a different line
    for (y = y_start + threadNr; y < y_end; y += numberProcs) {
        // Horizontal lines
        for (int x = x_start; x < x_end; x++) {
            psetp(x, y, RGB(255, 128, 0));
        }
    }
}
int numberProcs = countProcessors();
// Launch threads: e.g. for 1 processor launch no other thread,
// for 2 processors launch 1 thread, for 4 processors launch 3 threads
for (i = 0; i < numberProcs - 1; i++)
    triggerThread(50, FME_CUSTOMEVENT, i);          // the last parameter is the thread number
triggerEvent(50, FME_CUSTOMEVENT, numberProcs - 1); // the last thread is used for progress
// Wait for all threads to finish
waitForThread(0, 0xffffffff, -1);
I know I can call my current function using one thread via std::thread like this:
std::thread t1(FilterImage,&size_param, cdepth, in_data, input_worldP, output_worldP);
t1.join();
But this is not efficient, as it calls the entire function per thread instead of splitting the work.
I would expect every processor to tackle a horizontal line on its own.
Any example code would be highly appreciated, as I tend to learn best through example.
Invoking thread::join() forces the calling thread to wait for the child thread to finish executing. For example, if I use it to create a number of threads in a loop, and call join() on each one, it'll be the same as though everything happened in sequence.
Here's an example. I have two methods that print out the numbers 0 through n-1. The first one does it single-threaded, and the second one joins each thread as it's created. Both have the same output, but the threaded one is slower because you're waiting for each thread to finish before starting the next one.
#include <iostream>
#include <thread>

void printN_nothreads(int n) {
    for (int i = 0; i < n; i++) {
        std::cout << i << "\n";
    }
}

void printN_threaded(int n) {
    for (int i = 0; i < n; i++) {
        std::thread t([=]() { std::cout << i << "\n"; });
        t.join(); // this forces synchronization
    }
}
Doing threading better.
To gain benefit from using threads, you have to start all the threads before joining them. In addition, to avoid false sharing, each thread should work on a separate region of the image (ideally a section that's far away in memory).
Let's look at how this would work. I don't know what library you're using, so instead I'm going to show you how to write a multi-threaded transform on a vector.
auto transform_section = [](auto func, auto begin, auto end) {
    for (; begin != end; ++begin) {
        func(*begin);
    }
};
This transform_section function will be called once per thread, each on a different section of the vector. Let's write transform so it's multithreaded.
#include <thread>
#include <vector>

template<class Func, class T>
void transform(Func func, std::vector<T>& data, int num_threads) {
    size_t size = data.size();
    auto section_start = [size, num_threads](int thread_index) {
        return size * thread_index / num_threads;
    };
    auto section_end = [size, num_threads](int thread_index) {
        return size * (thread_index + 1) / num_threads;
    };
    std::vector<std::thread> threads(num_threads);
    // Each thread works on a different section
    for (int i = 0; i < num_threads; i++) {
        // data.data() + size is a valid past-the-end pointer; &data[size] is not
        T* start = data.data() + section_start(i);
        T* end = data.data() + section_end(i);
        threads[i] = std::thread(transform_section, func, start, end);
    }
    // We only join AFTER all the threads are started
    for (std::thread& t : threads) {
        t.join();
    }
}
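Usage might then look like this (a minimal sketch; the element type, the doubling function, and the thread count are arbitrary choices of mine):
#include <iostream>

int main() {
    std::vector<int> data(1000000, 1);
    // double every element using 4 threads
    transform([](int& x) { x *= 2; }, data, 4);
    std::cout << data.front() << "\n"; // prints 2
}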
First of all, I think it is important to say that I am new to multithreading and know very little about it. I was trying to write some programs in C++ using threads and ran into a problem that I will try to explain now:
I wanted to use several threads to fill an array, here is my code:
#include <iostream>
#include <thread>
using namespace std;

static const int num_threads = 5;
int A[50], n;
//------------------------------------------------------------
void ThreadFunc(int tid)
{
    for (int q = 0; q < 5; q++)
    {
        A[n] = tid;
        n++;
    }
}
//------------------------------------------------------------
int main()
{
    thread t[num_threads];
    n = 0;
    for (int i = 0; i < num_threads; i++)
    {
        t[i] = thread(ThreadFunc, i);
    }
    for (int i = 0; i < num_threads; i++)
    {
        t[i].join();
    }
    for (int i = 0; i < n; i++)
        cout << A[i] << endl;
    return 0;
}
As a result of this program I get:
0
0
0
0
0
1
1
1
1
1
2
2
2
2
2
and so on.
As I understand it, the second thread starts writing elements to the array only when the first thread finishes writing all of its elements.
The question is: why don't the threads work concurrently? I mean, why don't I get something like this:
0
1
2
0
3
1
4
and so on.
Is there any way to solve this problem?
Thank you in advance.
Since n is accessed from more than one thread, those accesses need to be synchronized so that changes made in one thread don't conflict with changes made in another. There are (at least) two ways to do this.
First, you can make n an atomic variable. Just change its definition, and do the increment where the value is used:
std::atomic<int> n;
...
A[n++] = tid;
Or you can wrap all the accesses inside a critical section:
#include <mutex>

std::mutex mtx;

int next_n() {
    std::unique_lock<std::mutex> lock(mtx);
    return n++;
}
And in each thread, instead of directly incrementing n, call that function:
A[next_n()] = tid;
This is much slower than the atomic access, so it's not appropriate here, but in more complex situations it will be the right solution.
The worker function is so short, i.e., finishes executing so quickly, that it's possible that each thread is completing before the next one even starts. Also, you may need to link with a thread library to get real threads, e.g., -lpthread. Even with that, the results you're getting are purely by chance and could appear in any order.
There are two corrections you need to make for your program to be properly synchronized. Change:
int n;
// ...
A[n] = tid; n++;
to
std::atomic_int n;
// ...
A[n++] = tid;
Often it's preferable to avoid synchronization issues altogether and split the workload across threads. Since the work done per iteration is the same here, it's as easy as dividing the work evenly:
void ThreadFunc(int tid, int first, int last)
{
    for (int i = first; i < last; i++)
        A[i] = tid;
}
Inside main, modify the thread-creation loop:
for (int first = 0, i = 0; i < num_threads; i++) {
    // num_threads may not evenly divide the array size
    int last = (i != num_threads - 1) ? std::size(A) / num_threads * (i + 1) : std::size(A);
    t[i] = thread(ThreadFunc, i, first, last);
    first = last;
}
Of course by doing this, even though the array may be written out of order, the values will be stored to the same locations every time.
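Putting the pieces together, a complete program might look like this (a sketch; std::size requires C++17, otherwise use sizeof(A) / sizeof(A[0])):
#include <iostream>
#include <iterator>
#include <thread>

static const int num_threads = 5;
int A[50];

void ThreadFunc(int tid, int first, int last)
{
    for (int i = first; i < last; i++)
        A[i] = tid;
}

int main()
{
    std::thread t[num_threads];
    const int size = (int)std::size(A);
    for (int first = 0, i = 0; i < num_threads; i++) {
        // the last thread picks up any remainder
        int last = (i != num_threads - 1) ? size / num_threads * (i + 1) : size;
        t[i] = std::thread(ThreadFunc, i, first, last);
        first = last;
    }
    for (int i = 0; i < num_threads; i++)
        t[i].join();
    for (int i = 0; i < size; i++)
        std::cout << A[i] << "\n";
}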
I'm experimenting with the new C++11 threads, but my simple test has abysmal multicore performance. As a simple example, this program adds up the square roots of some random numbers.
#include <iostream>
#include <thread>
#include <vector>
#include <cstdlib>
#include <chrono>
#include <cmath>
#include <ctime>
double add_single(int N) {
    double sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand() / RAND_MAX);
    }
    return sum / N;
}

void add_multi(int N, double& result) {
    double sum = 0;
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand() / RAND_MAX);
    }
    result = sum / N;
}
int main() {
    srand(time(NULL));
    int N = 1000000;
    // single-threaded
    auto t1 = std::chrono::high_resolution_clock::now();
    double result1 = add_single(N);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << "time single: " << time_elapsed << std::endl;
    // multi-threaded
    std::vector<std::thread> th;
    int nr_threads = 3;
    double partial_results[] = {0, 0, 0};
    t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < nr_threads; ++i)
        th.push_back(std::thread(add_multi, N / nr_threads, std::ref(partial_results[i])));
    for (auto& a : th)
        a.join();
    double result_multicore = 0;
    for (double result : partial_results)
        result_multicore += result;
    result_multicore /= nr_threads;
    t2 = std::chrono::high_resolution_clock::now();
    time_elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << "time multi: " << time_elapsed << std::endl;
    return 0;
}
Compiled with 'g++ -std=c++11 -pthread test.cpp' on Linux on a 3-core machine, a typical result is
time single: 33
time multi: 565
So the multi-threaded version is more than an order of magnitude slower. I've used random numbers and a sqrt to make the example less trivial and less prone to compiler optimizations, so I'm out of ideas.
edit:
The problem persists for larger N, so the short runtime is not the cause
The time for creating the threads is not the problem. Excluding it does not change the result significantly
Wow, I found the problem. It was indeed rand(). I replaced it with a C++11 equivalent and now the runtime scales perfectly. Thanks everyone!
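For anyone curious, the C++11 replacement can be as simple as giving each thread its own generator. A sketch (the seeding choice via std::random_device is mine; any per-thread seed works):
#include <cmath>
#include <random>

void add_multi(int N, double& result) {
    // each thread owns its generator, so no lock is shared between threads
    std::mt19937 gen(std::random_device{}());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double sum = 0;
    for (int i = 0; i < N; ++i)
        sum += std::sqrt(dist(gen));
    result = sum / N;
}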
On my system the behavior is the same, but as Maxim mentioned, rand is not thread-safe. When I change rand to rand_r, the multi-threaded code is faster, as expected.
void add_multi(int N, double& result) {
    double sum = 0;
    // rand_r keeps its state in a caller-supplied seed, so each thread has its own
    unsigned int seed = time(NULL);
    for (int i = 0; i < N; ++i) {
        sum += sqrt(1.0 * rand_r(&seed) / RAND_MAX);
    }
    result = sum / N;
}
As you discovered, rand is the culprit here.
For those who are curious, it's possible that this behavior comes from your implementation of rand using a mutex for thread safety.
For example, eglibc defines rand in terms of __random, which is defined as:
long int
__random ()
{
    int32_t retval;
    __libc_lock_lock (lock);
    (void) __random_r (&unsafe_state, &retval);
    __libc_lock_unlock (lock);
    return retval;
}
This kind of locking would force multiple threads to run serially, resulting in lower performance.
The time needed to execute the program is very small (33msec). This means that the overhead to create and handle several threads may be more than the real benefit. Try using programs that need longer times for the execution (e.g., 10 sec).
To make this faster, use a thread pool pattern.
This will let you enqueue tasks in other threads without the overhead of creating a std::thread each time you want to use more than one thread.
Don't count the overhead of setting up the queue in your performance metrics, just the time to enqueue and extract the results.
Create a set of threads and a queue of tasks (a structure containing a std::function<void()>) to feed them. The threads wait on the queue for new tasks, run them, then wait for more.
The tasks are responsible for communicating their "done-ness" back to the calling context, such as via a std::future<>. The code that lets you enqueue functions into the task queue might do this wrapping for you, i.e., with this signature:
template<typename R = void>
std::future<R> enqueue(std::function<R()> f) {
    std::packaged_task<R()> task(f);
    std::future<R> retval = task.get_future();
    this->add_to_queue(std::move(task)); // packaged_task is move-only, so it must be moved in
    return retval;
}
which turns a naked std::function returning R into a nullary packaged_task, then adds that to the task queue. Note that the task queue needs to be move-aware, because packaged_task is move-only.
Note 1: I am not all that familiar with std::future, so the above could be in error.
Note 2: If tasks put into the above described queue are dependent on each other for intermediate results, the queue could deadlock, because no provision to "reclaim" threads that are blocked and execute new code is described. However, "naked computation" non-blocking tasks should work fine with the above model.
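For illustration only, here is a minimal sketch of what the queue behind add_to_queue might look like. std::function must be copyable while packaged_task is move-only, so one common workaround is to hold the task in a shared_ptr; TaskQueue and its member names are my inventions, and a real pool also needs shutdown handling:
#include <condition_variable>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>

class TaskQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::function<void()>> q;
public:
    template<typename R>
    void add_to_queue(std::packaged_task<R()> task) {
        // wrap the move-only packaged_task in a shared_ptr so the
        // copyable std::function can hold it
        auto sp = std::make_shared<std::packaged_task<R()>>(std::move(task));
        {
            std::lock_guard<std::mutex> lock(m);
            q.push_back([sp] { (*sp)(); });
        }
        cv.notify_one();
    }
    // worker threads loop on this: block until a task arrives, then run it
    std::function<void()> pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        std::function<void()> task = std::move(q.front());
        q.pop_front();
        return task;
    }
};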
I'm working on Linux, but multithreading and single threading both take 340 ms. Can someone tell me what's wrong with what I'm doing?
Here is my code
#include <fstream>
#include <iostream>
#include <pthread.h>
#include <time.h>
#define SIZE_OF_ARRAY 1000000
using namespace std;
struct parameter
{
    int *data;
    int left;
    int right;
};

void readData(int *data)
{
    fstream iFile("Data.txt");
    for (int i = 0; i < SIZE_OF_ARRAY; i++)
        iFile >> data[i];
}
int threadCount = 4;

int Partition(int *data, int left, int right)
{
    int i = left, j = right, temp;
    int pivot = data[(left + right) / 2];
    while (i <= j)
    {
        while (data[i] < pivot)
            i++;
        while (data[j] > pivot)
            j--;
        if (i <= j)
        {
            temp = data[i];
            data[i] = data[j];
            data[j] = temp;
            i++;
            j--;
        }
    }
    return i;
}

void QuickSort(int *data, int left, int right)
{
    int index = Partition(data, left, right);
    if (left < index - 1)
        QuickSort(data, left, index - 1);
    if (index < right)
        QuickSort(data, index + 1, right);
}
//Multi threading code starts from here
void *Sort(void *param)
{
    parameter *param1 = (parameter *)param;
    QuickSort(param1->data, param1->left, param1->right);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    pthread_t threadID, threadID1;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    parameter param, param1;
    readData(data);
    start = clock();
    int index = Partition(data, 0, SIZE_OF_ARRAY - 1);
    if (0 < index - 1)
    {
        param.data = data;
        param.left = 0;
        param.right = index - 1;
        pthread_create(&threadID, NULL, Sort, (void *)&param);
    }
    if (index < SIZE_OF_ARRAY - 1)
    {
        param1.data = data;
        param1.left = index + 1;
        param1.right = SIZE_OF_ARRAY - 1; // right is an inclusive index
        pthread_create(&threadID1, NULL, Sort, (void *)&param1);
    }
    pthread_attr_destroy(&attr);
    pthread_join(threadID, NULL);
    pthread_join(threadID1, NULL);
    diff = clock() - start;
    cout << "Sorting Time = " << diff * 1000 / CLOCKS_PER_SEC << "\n";
    delete[] data;
    return 0;
}
//Multithreading Ends here
Single-thread main function:
int main(int argc, char *argv[])
{
    clock_t start, diff;
    int *data = new int[SIZE_OF_ARRAY];
    readData(data);
    start = clock();
    QuickSort(data, 0, SIZE_OF_ARRAY - 1);
    diff = clock() - start;
    cout << "Sorting Time = " << diff * 1000 / CLOCKS_PER_SEC << "\n";
    delete[] data;
    return 0;
}
//Single thread code ends here
Some of the functions are shared between the single-threaded and multi-threaded versions.
clock returns total CPU time, not wall time.
If you have 2 CPUs and 2 threads, then after one second of running both threads simultaneously, clock will return a CPU time of 2 seconds (the sum of the CPU times of each thread).
So the result is totally expected. It does not matter how many CPUs you have, the total running time summed over all CPUs will be the same.
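If C++11 is available, here is a sketch of measuring wall time instead with std::chrono::steady_clock:
#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();
    // ... run the sort here ...
    auto end = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Sorting Time = " << ms << " ms\n";
}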
Note that you call Partition once from the main thread...
The code works on the same memory block, which stalls one CPU while the other accesses that block. Unless your data is really large, you're likely to have many such hits.
Finally, if your algorithm works at memory speed when you run it with one thread, adding more threads doesn't help. I did such tests a while back with image data, and having multiple threads decreased the total speed because the process was so memory intensive that the threads were fighting for memory access... and the result was worse than not having threads at all.
What really makes computers fast today is running one very intensive process per computer, not a large number of threads (or processes) on a single computer.
Build a thread pool with a producer-consumer queue and 24 threads hanging off it. Partition your data in two and issue a mergesort task object to the pool; the mergesort object should issue further pairs of mergesorts to the queue and wait on a signal for them to finish, and so on, until a mergesort object finds that it has [L1 cache-size data]. The object then quicksorts its data and signals completion to its parent task; a rough sketch of the recursive split appears after this list.
If that doesn't turn out to be blindingly quick on 24 cores, I'll stop posting about threads..
..and it will handle multiple sorts in parallel.
..and the pool can be used for other tasks.
.. and there is no performance-destroying, deadlock-generating join() or synchronize() (if you except the P-C queue, which only locks for long enough to push an object ref on), no thread-creation overhead, and no dodgy thread-stopping/terminating/destroying code. Like the barbers, there is no waiting: as soon as a thread is finished with a task it can get another.
No thread micro-management, no tuning (you could create 64 threads now, ready for the next generation of boxes). You could make the thread count tuneable: just add more threads at runtime, or remove some by queueing up poison pills.
You don't need a reference to the threads at all; just set 'em off (pass the queue as a parameter).
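As a rough illustration of the recursive split only (this sketch substitutes std::async for the described pool and std::sort for the leaf-level quicksort, so it still has the blocking waits the pool design avoids):
#include <algorithm>
#include <future>
#include <vector>

// Below the cutoff, sort directly; above it, sort the halves in
// parallel and merge. std::async stands in for the task pool here.
void parallel_sort(std::vector<int>& v, size_t lo, size_t hi, size_t cutoff) {
    if (hi - lo <= cutoff) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           parallel_sort, std::ref(v), lo, mid, cutoff);
    parallel_sort(v, mid, hi, cutoff);
    left.wait();
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}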