Can one function operate on multiple threads? - C++

I have a question about threads.
For example, I've got code like this:
void xyz(int x){
    ...
}

int main(){
    for(int i = 0; i < n; i++){
        xyz(i);
    }
}
The question is whether (and how) I can modify the code so that the first thread calls the function with arguments 1 to n/2 and the second thread calls it with arguments n/2 to n.
Thank you in advance

Here is a simple solution using std::async and a lambda function capturing your n:
#include <future>

void xyz(int x); // your function from the question

int main() {
    size_t n = 666;
    auto f1 = std::async(std::launch::async, [n]() {
        for (size_t i = 0; i < n / 2; ++i)
            xyz(i);
    });
    auto f2 = std::async(std::launch::async, [n]() {
        for (size_t i = n / 2; i < n; ++i)
            xyz(i);
    });
    f1.wait();
    f2.wait();
    return 0;
}
Each call to std::async with std::launch::async creates a new thread, and calling wait() on the std::futures returned by async makes sure the program doesn't return before those threads finish.

Sure. You can use <thread> for this:
#include <thread>

// The function we want to execute on the new threads.
void xyz(int start, int end)
{
    for (int i = start; i < end; ++i) {
        // ...
    }
}

int main()
{
    int n = 100; // example value

    // Start threads.
    std::thread t1(xyz, 1, n / 2);
    std::thread t2(xyz, n / 2, n);

    // Wait for threads to finish.
    t1.join();
    t2.join();
}
If you're using GCC or Clang, don't forget to append -pthread to your link command if you get a link error (example: g++ -std=c++14 myfile.cpp -pthread).

You should read a tutorial about multi-threading. I recommend this Pthreads tutorial, since you can apply the concepts to C++11 threads (whose model is close to POSIX threads).
A function can be used in several threads if it is reentrant.
You might protect your critical sections with mutexes and synchronize with condition variables. You should avoid data races.
In some cases, you could use atomic operations.
(your question is too broad and unclear; an entire book is needed to answer it)
You might also be interested by OpenMP, OpenACC, OpenCL.
Be aware that threads are quite expensive resources. Each has its own call stack (of one or a few megabytes), and you generally don't want many more runnable threads than you have available cores. As a rule of thumb, avoid (on a common desktop or laptop computer) having more than a dozen runnable threads (and more than a few hundred idle ones, probably fewer). But YMMV; I prefer to have fewer than a dozen threads, and I try to have fewer threads than what std::thread::hardware_concurrency gives.
Both answers from Nikos C. and from Maroš Beťko use two threads. You could use a few more, but it would probably be unreasonable and inefficient to use a lot more of them (e.g. a hundred), at least on an ordinary computer. The optimal number of threads is computer (and software and system) specific; you might make it a configurable parameter of your program. On supercomputers, you could mix multi-threading with an MPI approach. On datacenters or clusters, you might consider a MapReduce approach.
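A generalized sketch of that idea (a minimal sketch, assuming the xyz(int) function from the question; the thread count comes from std::thread::hardware_concurrency, and the [0, n) range is split into contiguous chunks):

#include <algorithm>
#include <thread>
#include <vector>

void xyz(int x); // the work function from the question

void run_parallel(int n) {
    // Use the number of hardware threads, but at least 1
    // (hardware_concurrency() may return 0 if it cannot tell).
    unsigned k = std::max(1u, std::thread::hardware_concurrency());
    int chunk = (n + static_cast<int>(k) - 1) / static_cast<int>(k); // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < k; ++t) {
        int begin = static_cast<int>(t) * chunk;
        int end = std::min(n, begin + chunk);
        if (begin >= end) break; // no work left for the remaining threads
        workers.emplace_back([begin, end] {
            for (int i = begin; i < end; ++i)
                xyz(i);
        });
    }
    for (auto& w : workers)
        w.join();
}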
When benchmarking, don't forget to enable compiler optimizations. If using GCC or Clang, compile with -O2 -march=native at least.

Related

Why shouldn't I unlock a mutex from a different thread? - C++

Why shouldn't I unlock a mutex from a different thread? The C++ standard says it pretty clearly: if the mutex is not currently locked by the calling thread, it causes undefined behavior. But as far as I can see, everything works as expected on Linux (Fedora 31 with GCC). I seriously tried everything, but I could not get it to behave strangely.
All I'm asking for is an example where something, literally anything, is affected by unlocking a mutex from a different thread.
Here is a quick test I wrote which is super wrong and shouldn't work, but it does:
#include <cassert>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex* testEvent;

int main()
{
    testEvent = new std::mutex[1000];
    for (uint32_t i = 0; i < 1000; ++i) testEvent[i].lock();

    std::thread threads[2000];
    auto lock   = [](uint32_t index) -> void { testEvent[index].lock(); assert(!testEvent[index].try_lock()); };
    auto unlock = [](uint32_t index) -> void { testEvent[index].unlock(); }; // UB: unlocks a mutex locked by another thread

    for (uint32_t j = 0; j < 1000; ++j)
    {
        for (uint32_t i = 0; i < 1000; ++i)
        {
            threads[i]        = std::thread(lock, i);
            threads[i + 1000] = std::thread(unlock, i);
        }
        for (uint32_t i = 0; i < 2000; ++i)
        {
            threads[i].join();
        }
        std::cout << j << std::endl;
    }
    delete[] testEvent;
}
As you already said, it is UB. UB means it may work. Or not. Or randomly switch between working and making your computer sing itself a lullaby. (See also "nasal demons".)
Here are just a few ways someone can break your program on Fedora 31 with GCC on x86-64:
Compile with -fsanitize=thread. It will now crash every time, which is still a valid C++ implementation, because UB.
Run under helgrind (valgrind --tool=helgrind ./a.out). It will crash every time -- still a valid way to host a C++ program, because UB.
The libstdc++/glibc/pthread implementation on the target system switches from using "fast" mutexes by default to "error checking" or "recursive" mutexes (https://manpages.debian.org/jessie/glibc-doc/pthread_mutex_init.3.en.html). Note that this is probably possible in an ABI-compatible manner, meaning that your program does not even have to be recompiled for it to suddenly stop working.
That being said: since you are using a platform on which the C++ mutex boils down to a futex-implemented "fast" pthread mutex, it is no accident that this works. It is just not guaranteed to keep working over time, or in any circumstance that actually checks whether you are doing the right thing.
I really wonder why you would want to do this in the first place ;)
Normally you would want to have something like
lock();
do_critical_task();
unlock();
(In C++, the lock/unlock pair is often hidden by the use of std::lock_guard or similar; a minimal sketch follows.)
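For instance, the RAII version of that pattern (a minimal sketch, assuming some do_critical_task() routine and a shared std::mutex):

#include <mutex>

std::mutex m;            // protects the critical task

void do_critical_task(); // assumed to exist

void safe_call()
{
    std::lock_guard<std::mutex> guard(m); // lock() happens here
    do_critical_task();
}                                         // unlock() happens in guard's destructor,
                                          // even if an exception is thrown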
Let's assume one thread (say, thread A) called this code and is inside the critical task, i.e. it is also holding the lock.
If you then unlock the same mutex from another thread, any thread other than A can enter the critical section simultaneously.
The main purpose of mutexes is mutual exclusion (hence the name), so all you would do is defeat the purpose of the mutex ;)
That said: you should always believe the standard. Just because something works on a certain system doesn't mean it's portable. Plus, especially in a concurrent context, a lot of things can work out a thousand times and then fail the 1,001st time because of race conditions.
In mathematics, your attempt would be comparable to a "proof by example".

How to avoid destroying and recreating threads inside loop?

I have a loop that creates and uses two threads. The threads always do the same thing and I'm wondering how they can be reused instead of created and destroyed each iteration. Some other operations are done inside the loop that affect the data the threads process. Here is a simplified example:
const int args1 = foo1();
const int args2 = foo2();
vector<string> myVec = populateVector();
int a = 1;

for (int i = 0; i < 100; i++)
{
    auto func = [&](const int arg) {
        // do stuff involving variable a
        foo3(myVec[a]);
    };
    thread t1(func, args1);
    thread t2(func, args2);
    t1.join();
    t2.join();
    a = 2 * a;
}
Is there a way to have t1 and t2 restart? Is there a design pattern I should look into? I ask because adding threads made the program slightly slower when I thought it would be faster.
You can use std::async as suggested in the comments.
What you're trying to do is also a very common use case for a thread pool. A simple header-only implementation that I commonly use is here.
To use this library, create the pool outside of the loop with the number of threads set during construction. Then enqueue a function for one of its threads to go off and execute. With this library you'll get back a std::future (much like with the std::async approach), and that is what you'd wait on in your loop.
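A sketch of that flow (the ThreadPool type and its enqueue() member are a hypothetical API, modeled on common header-only pools like the one linked; your library's names may differ):

#include <future>

#include "ThreadPool.h" // hypothetical header-only pool

void work(int arg);     // the per-iteration work, assumed to exist

int main()
{
    ThreadPool pool(2); // create the pool once, outside the loop
    for (int i = 0; i < 100; ++i) {
        // The same two threads are reused on every iteration.
        auto f1 = pool.enqueue(work, /*args1*/ 0);
        auto f2 = pool.enqueue(work, /*args2*/ 1);
        f1.wait(); // wait on the futures, much like with std::async
        f2.wait();
        // ... update the data the threads process here ...
    }
}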
Generally, you'd want to make access to any shared data thread-safe with mutexes (or other means; there are a lot of ways to do this), but under very specific conditions you don't need to.
In this case, as long as:
- the vector isn't growing in size (so it never needs to reallocate), and
- each item is only read, or is modified by at most one thread at a time,
you wouldn't need to worry about synchronization.
Though it's just good habit to do the synchronization anyway: when other people eventually modify the code, they won't know your rules and will cause issues.

Why is std::mutex faster than std::atomic?

I want to put objects in std::vector in multi-threaded mode. So I decided to compare two approaches: one uses std::atomic and the other std::mutex. I see that the second approach is faster than the first one. Why?
I use GCC 4.8.1 and, on my machine (8 threads), I see that the first solution requires 391502 microseconds and the second solution requires 175689 microseconds.
#include <vector>
#include <omp.h>
#include <atomic>
#include <mutex>
#include <thread>
#include <iostream>
#include <chrono>

int main(int argc, char* argv[]) {
    const int size = 1000000;
    std::vector<int> first_result(size);
    std::vector<int> second_result(size);
    std::atomic<bool> sync(false);
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            while (sync.exchange(true)) {
                std::this_thread::yield();
            }
            first_result[counter] = counter;
            sync.store(false);
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    {
        auto start_time = std::chrono::high_resolution_clock::now();
        std::mutex mutex;
        #pragma omp parallel for schedule(static, 1)
        for (int counter = 0; counter < size; counter++) {
            std::unique_lock<std::mutex> lock(mutex);
            second_result[counter] = counter;
        }
        auto end_time = std::chrono::high_resolution_clock::now();
        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() << std::endl;
    }
    return 0;
}
I don't think your question can be answered by referring only to the standard: mutexes are as platform-dependent as they can be. However, there is one thing that should be mentioned.
Mutexes are not slow. You may have seen some articles that compare their performance against custom spin-locks and other "lightweight" stuff, but that's not the right approach: these are not interchangeable.
Spin locks are considerably faster when they are locked (acquired) for a relatively short amount of time. Acquiring them is very cheap, but other threads that are also trying to lock stay active the whole time (spinning in a loop).
A custom spin-lock could be implemented this way:
#include <atomic>

class SpinLock
{
private:
    std::atomic_flag _lockFlag;

public:
    SpinLock()
        : _lockFlag {ATOMIC_FLAG_INIT}
    { }

    void lock()
    {
        while (_lockFlag.test_and_set(std::memory_order_acquire))
        { }
    }

    bool try_lock()
    {
        return !_lockFlag.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        _lockFlag.clear();
    }
};
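Since this SpinLock meets the BasicLockable requirements (it has lock() and unlock()), it also works with the standard RAII wrappers. A usage sketch (the shared counter is illustrative):

#include <mutex> // for std::lock_guard

SpinLock spin;
int counter = 0;

void increment()
{
    std::lock_guard<SpinLock> guard(spin); // spins in lock() until acquired
    ++counter;
}                                          // unlock() runs on scope exit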
A mutex is a primitive that is much more complicated. In particular, on Windows we have two such primitives: the critical section, which works on a per-process basis, and the mutex, which doesn't have that limitation.
Locking a mutex (or critical section) is much more expensive, but the OS has the ability to really put other waiting threads to "sleep", which improves performance and helps the task scheduler manage resources efficiently.
Why do I write this? Because modern mutexes are often so-called "hybrid mutexes". When such a mutex is locked, it behaves like a normal spin-lock: other waiting threads perform some number of "spins", and only then is the heavy mutex locked, to avoid wasting resources.
In your case, the mutex is locked in each loop iteration to perform this instruction:
second_result[counter] = counter;
It looks like a fast one, so the "real" mutex may never be locked. That means that in this case your "mutex" can be as fast as the atomic-based solution (because it becomes an atomic-based solution itself).
Also, in the first solution you used some kind of spin-lock-like behaviour, but I am not sure this behaviour is predictable in a multi-threaded environment. I am pretty sure that "locking" should have acquire semantics, while unlocking should be a release operation; relaxed memory ordering may be too weak for this use case.
I edited the code to be more compact and correct. It uses std::atomic_flag, which is the only type (unlike the std::atomic<> specializations) that is guaranteed to be lock-free (even std::atomic<bool> does not give you that).
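A quick way to check what your platform gives you (a minimal sketch; the result for std::atomic<bool> is implementation-specific):

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<bool> b(false);
    // std::atomic_flag has no is_lock_free() member: it is always lock-free.
    // For std::atomic<bool>, lock-freedom depends on the implementation:
    std::cout << "atomic<bool> is lock-free here: " << b.is_lock_free() << '\n';
    return 0;
}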
Also, referring to the comment below about "not yielding": it is a matter of the specific case and requirements. Spin locks are a very important part of multi-threaded programming, and their performance can often be improved by slightly modifying their behavior. For example, the Boost library implements spinlock::lock() as follows:
void lock()
{
    for (unsigned k = 0; !try_lock(); ++k)
    {
        boost::detail::yield(k);
    }
}
source: boost/smart_ptr/detail/spinlock_std_atomic.hpp
Where detail::yield() is (Win32 version):
inline void yield(unsigned k)
{
    if (k < 4)
    {
    }
#if defined( BOOST_SMT_PAUSE )
    else if (k < 16)
    {
        BOOST_SMT_PAUSE
    }
#endif
#if !BOOST_PLAT_WINDOWS_RUNTIME
    else if (k < 32)
    {
        Sleep(0);
    }
    else
    {
        Sleep(1);
    }
#else
    else
    {
        // Sleep isn't supported on the Windows Runtime.
        std::this_thread::yield();
    }
#endif
}
[source: http://www.boost.org/doc/libs/1_66_0/boost/smart_ptr/detail/yield_k.hpp]
First, the thread spins for some fixed number of times (4 in this case). If the mutex is still locked, the pause instruction is used (if available) or Sleep(0) is called, which basically causes a context switch and allows the scheduler to give another blocked thread a chance to do something useful. Then, Sleep(1) is called to perform an actual (short) sleep. Very nice!
Also, this statement:
The purpose of a spinlock is busy waiting
is not entirely true. The purpose of a spinlock is to serve as a fast, easy-to-implement locking primitive, but it still needs to be written properly, with certain possible scenarios in mind. For example, Intel says (regarding Boost's usage of _mm_pause() as a method of yielding inside lock()):
In the spin-wait loop, the pause intrinsic improves the speed at which
the code detects the release of the lock and provides especially
significant performance gain.
So, implementations like
void lock() { while(m_flag.test_and_set(std::memory_order_acquire)); }
may not be as good as they seem.
There is an additional important issue related to your problem. An efficient spinlock never "spins" on an operation that involves (even potential) modification of a memory location (such as exchange or test_and_set). On typical modern architectures, these operations generate instructions that require the cache line holding the lock's memory location to be in the exclusive state, which is extremely time-consuming (especially when multiple threads are spinning at the same time). Always spin on a load/read only, and try to acquire the lock only when there is a chance that the operation will succeed.
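For illustration, a test-and-test-and-set variant along those lines (a minimal sketch): the expensive read-modify-write is attempted only after a cheap load suggests the lock is free.

#include <atomic>

class TTASSpinLock
{
private:
    std::atomic<bool> _locked {false};

public:
    void lock()
    {
        for (;;)
        {
            // Spin on a plain load; the cache line can stay in the shared state.
            while (_locked.load(std::memory_order_relaxed))
            { }

            // The lock looks free: now attempt the expensive read-modify-write.
            if (!_locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    void unlock()
    {
        _locked.store(false, std::memory_order_release);
    }
};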
A nice relevant article is, for instance, here: Correctly implementing a spinlock in C++

Safe multi-thread counter increment

For example, I've got some work that is computed simultaneously by multiple threads.
For demonstration purposes the work is performed inside a while loop. In a single iteration each thread performs its own portion of the work; before the next iteration begins, a counter should be incremented once.
My problem is that the counter is updated by each thread.
As this seems like a relatively simple thing to want to do, I presume there is a "best practice" or common way to go about it.
Here is some sample code to illustrate the issue and help the discussion along.
(I'm using Boost threads.)
#include <boost/thread/mutex.hpp>

class someTask {
public:
    int mCounter; //initialized to 0
    int mTotal;   //initialized to e.g. 100000
    boost::mutex cntmutex;

    int getCount()
    {
        boost::mutex::scoped_lock lock( cntmutex );
        return mCounter;
    }

    void process( int thread_id, int numThreads )
    {
        while ( getCount() < mTotal )
        {
            // The main task is performed here and is divided
            // into sub-tasks based on the thread_id and numThreads

            // Wait for all threads to get to this point

            cntmutex.lock();
            mCounter++; // < ---- how to ensure this is only updated once?
            cntmutex.unlock();
        }
    }
};
The main problem I see here is that you reason at too low a level. Therefore, I am going to present an alternative solution based on the new C++11 thread API.
The main idea is that you essentially have a schedule -> dispatch -> do -> collect -> loop routine. In your example, you try to reason about all of this within the do phase, which is quite hard. Your pattern can be expressed much more easily using the opposite approach.
First we isolate the work to be done in its own routine:
void process_thread(size_t id, size_t numThreads) {
    // do something
}
Now, we can easily invoke this routine:
#include <future>
#include <thread>
#include <vector>
void process(size_t const total, size_t const numThreads) {
    for (size_t count = 0; count != total; ++count) {
        std::vector< std::future<void> > results;

        // Create all threads, launch the work!
        for (size_t id = 0; id != numThreads; ++id) {
            results.push_back(std::async(process_thread, id, numThreads));
        }

        // The destruction of `std::future`
        // requires waiting for the task to complete (*)
    }
}
(*) See this question.
You can read more about std::async here, and a short introduction is offered here (they appear to be somewhat contradictory on the effect of the launch policy, oh well). It is simpler here to let the implementation decide whether or not to create OS threads: it can adapt depending on the number of available cores.
Note how the code is simplified by removing shared state. Because the threads share nothing, we no longer have to worry about synchronization explicitly!
You protected the counter with a mutex, ensuring that no two threads can access the counter at the same time. Your other option would be using Boost.Atomic, C++11 atomic operations, or platform-specific atomic operations.
However, your code seems to access mCounter without holding the mutex:
while ( mCounter < mTotal )
That's a problem. You need to hold the mutex to access the shared state.
You may prefer to use this idiom (a code sketch follows the list):
1. Acquire lock.
2. Do tests and other things to decide whether we need to do work or not.
3. Adjust accounting to reflect the work we've decided to do.
4. Release lock. Do work. Acquire lock.
5. Adjust accounting to reflect the work we've done.
6. Loop back to step 2 unless we're totally done.
7. Release lock.
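A minimal sketch of that idiom with C++11 primitives (assumed names: a do_work() routine, plus a counter/total pair like the ones in the question):

#include <mutex>

std::mutex m;
int counter = 0;
const int total = 100000;

void do_work(int item); // assumed per-item work

void worker()
{
    std::unique_lock<std::mutex> lock(m); // 1. acquire lock
    while (counter < total) {             // 2. decide whether there is work
        int item = counter++;             // 3. adjust accounting up front
        lock.unlock();                    // 4. release lock, do work...
        do_work(item);
        lock.lock();                      //    ...then reacquire the lock
    }                                     // 5./6. accounting done, loop test
}                                         // 7. lock released by the destructor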
You need to use a message-passing solution. This is more easily enabled by libraries like TBB or PPL. PPL is included for free in Visual Studio 2010 and above, and TBB can be downloaded for free under a FOSS licence from Intel.
concurrent_queue<unsigned int> done;
std::vector<Work> work;
// fill work here
parallel_for(0u, static_cast<unsigned int>(work.size()), [&](unsigned int i) {
    processWorkItem(work[i]);
    done.push(i);
});
It's lockless and you can have an external thread monitor the done variable to see how much, and what, has been completed.
I would like to disagree with David on doing multiple lock acquisitions to do the work.
Mutexes are expensive: with more threads contending for a mutex, it basically falls back to a system call, which results in a user-space to kernel-space context switch, with the calling thread(s) forced to sleep. Thus, a lot of overhead.
So if you are using a multiprocessor system, I would strongly recommend using spin locks instead [1].
So what I would do is:
=> Get rid of the scoped lock acquisition to check the condition.
=> Make your counter volatile to support the above.
=> In the while loop, do the condition check again after acquiring the lock.
#include <boost/thread/mutex.hpp>

class someTask {
public:
    volatile int mCounter; //initialized to 0 : make your counter volatile
    int mTotal;            //initialized to e.g. 100000
    boost::mutex cntmutex;

    void process( int thread_id, int numThreads )
    {
        while ( mCounter < mTotal ) //compare without acquiring the lock
        {
            // The main task is performed here and is divided
            // into sub-tasks based on the thread_id and numThreads

            cntmutex.lock();

            // Now compare again to make sure that the condition still holds.
            // This saves all those acquisitions and lock releases we did
            // just to check whether the condition was true.
            if ( mCounter < mTotal )
            {
                mCounter++;
            }

            cntmutex.unlock();
        }
    }
};
[1] http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock

Avoiding multiple thread spawns in pthreads

I have an application that is parallelized using pthreads. The application has an iterative routine call and a thread spawn within the routine (pthread_create and pthread_join) to parallelize the computation-intensive section of the routine. When I use an instrumenting tool like PIN to collect statistics, the tool reports statistics for several threads (number of threads x number of iterations). I believe it is because it is spawning a new set of threads each time the routine is called.
How can I ensure that I create the threads only once and all successive calls use the threads that were created first?
When I do the same with OpenMP and then try to collect the statistics, I see that the threads are created only once. Is it because of the OpenMP runtime?
EDIT:
I'm just giving a simplified version of the code:
int main()
{
    // some code
    do {
        compute_distance(objects, clusters, &delta); // routine with pthreads
    } while (delta > threshold);
}

void compute_distance(double **objects, double *clusters, double *delta)
{
    // some code again

    // computation moved to a separate parallel routine..
    for (i = 0; i < nthreads; i++)
        pthread_create(&thread[i], &attr, parallel_compute_phase, (void*)&ip);

    for (i = 0; i < nthreads; i++)
        rc = pthread_join(thread[i], &status);
}
I hope this clearly explains the problem.
How do we save the thread id and test whether the thread was already created?
You can make a simple thread pool implementation which creates the threads and puts them to sleep. Once a thread is required, instead of calling pthread_create, you ask the thread pool subsystem to pick up a thread and do the required work. This ensures your control over the number of threads.
An easy thing you can do with minimal code changes is to write some wrappers for pthread_create and _join. Basically you can do something like:
#include <pthread.h>

typedef struct {
    volatile int go;
    volatile int done;
    pthread_t h;
    void* (*fn)(void*);
    void* args;
} pthread_w_t;

void* pthread_w_fn(void* args) {
    pthread_w_t* p = (pthread_w_t*)args;
    // just let the thread be killed at the end
    for (;;) {
        // pthread_yield is non-portable; sched_yield() is the POSIX equivalent
        while (!p->go) { pthread_yield(); } // yields are good
        p->go = 0; // don't want to go again until told to
        p->fn(p->args);
        p->done = 1;
    }
}

int pthread_create_w(pthread_w_t* th, pthread_attr_t* a,
                     void* (*fn)(void*), void* args) {
    if (!th->h) {
        th->done = 0;
        th->go = 0;
        th->fn = fn;
        th->args = args;
        pthread_create(&th->h, a, pthread_w_fn, th);
    }
    th->done = 0; // make sure join won't return too soon
    th->go = 1;   // and let the wrapper function start the real thread code
    return 0;
}

int pthread_join_w(pthread_w_t* th) {
    while (!th->done) { pthread_yield(); }
    return 0;
}
and then you'll have to change your calls and pthread_ts, or create some #define macros to change pthread_create to pthread_create_w, etc. You'll also have to initialize your pthread_w_ts to zero.
Messing with those volatiles can be troublesome, though. You'll probably need to spend some time getting my rough outline to actually work properly.
To ensure something that several threads might try to do happens only once, use pthread_once(). To ensure something happens only once that might be done by a single thread, just use a bool (likely one in static storage).
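A minimal pthread_once sketch (init_shared_state() is a stand-in for whatever must happen exactly once, e.g. creating your worker threads):

#include <pthread.h>

static pthread_once_t once_control = PTHREAD_ONCE_INIT;

static void init_shared_state(void)
{
    // ... create the worker threads, allocate buffers, etc. ...
}

void* thread_fn(void* arg)
{
    // Safe to call from every thread; init_shared_state runs exactly once.
    pthread_once(&once_control, init_shared_state);
    // ... per-thread work ...
    return 0;
}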
Honestly, it would be far easier for everyone to answer your question if you would edit it (not comment, since that destroys formatting) to contain the real code in question, including the OpenMP pragmas.