Pthreads and mutexes; locking part of an array - c++

I am trying to parallelize an operation using pthreads. The process looks something like:
double* doSomething( .... ) {
    double* foo;
    foo = new double[220];
    for(int i = 0; i<20; i++)
    {
        //do something with the elements in foo located between 10*i and 10*(i+2)
    }
    return foo;
}
The stuff happening inside the for-loop can be done in any order, so I want to organize this using threads.
For instance, I could use a number of threads such that each thread goes through parts of the for-loop, but works on different parts of the array. To avoid trouble when working on overlapping parts, I need to lock some memory.
How can I make a mutex (or something else) that locks only part of the array?

If you are using a recent GCC, you can try the parallel versions of the standard algorithms; see the libstdc++ parallel mode.

If you just want to make sure that a section of the array is worked once...
Make a global variable:
int _iNextSection;
Whenever a thread gets ready to operate on a section, the thread gets the next available section this way:
iMySection = __sync_fetch_and_add(&_iNextSection, 1);
__sync_fetch_and_add() (a GCC builtin) returns the current value of _iNextSection and then increments _iNextSection. The operation is atomic: the read and the increment happen as one indivisible step, so no two threads can receive the same section number. No locking, no blocking, simple, fast.
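A minimal sketch of that idea, using std::atomic<int>::fetch_add as a portable stand-in for the GCC builtin (a swapped-in but equivalent call). The function name process_sections and the counting vector are hypothetical, added only to make the effect visible. Note that this only guarantees that each section index is claimed once; it does not by itself protect overlapping ranges between neighbouring sections.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: num_threads workers claim section indices
// 0..num_sections-1 from a shared atomic counter. Returns how many
// times each section was processed.
std::vector<int> process_sections(int num_sections, int num_threads) {
    assert(num_sections >= 0 && num_threads > 0);
    std::atomic<int> next_section{0};  // plays the role of _iNextSection
    std::vector<int> times_processed(num_sections, 0);

    auto worker = [&]() {
        for (;;) {
            // fetch_add returns the old value and increments in one atomic
            // step, so two threads never receive the same index.
            int my_section = next_section.fetch_add(1);
            if (my_section >= num_sections) break;
            ++times_processed[my_section];  // stand-in for the real work
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();
    return times_processed;
}
```

Calling process_sections(20, 4) should give a vector of 20 entries in which every section was handled exactly once, regardless of which thread picked it up.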

If the loop looks exactly like you wrote, I would use an array of 21 mutexes and have each thread lock the ith and (i + 1)th mutex at the beginning of each loop iteration.
So something like:
...
for (i = 0; i < 20; i++) {
    mutex[i].lock();
    mutex[i+1].lock();
    ...
    mutex[i+1].unlock();
    mutex[i].unlock();
}
The logic is that only two neighboring loop executions can access same data (if the limits are [i * 10, (i + 2) * 10)), so you only need to worry about them.
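Here is one way this could look as a complete function, as a sketch only. It assumes std::thread and std::mutex instead of raw pthreads, and the names do_something, block_mutex and the strided split of iterations across threads are my own choices, not part of the question. Each iteration adds 1.0 to its range so the overlap is easy to check.

```cpp
#include <array>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Iteration i touches foo[10*i .. 10*(i+2)), i.e. blocks i and i+1 of ten
// elements each, guarded by block_mutex[i] and block_mutex[i+1].
std::vector<double> do_something(int num_threads) {
    assert(num_threads > 0);
    constexpr int kIterations = 20;
    constexpr int kBlock = 10;
    std::vector<double> foo(220, 0.0);
    std::array<std::mutex, kIterations + 1> block_mutex;  // 21 mutexes

    auto worker = [&](int first) {
        // This thread handles iterations first, first + num_threads, ...
        for (int i = first; i < kIterations; i += num_threads) {
            // Always take the lower index first, so threads cannot end up
            // waiting on each other in a cycle.
            std::lock_guard<std::mutex> low(block_mutex[i]);
            std::lock_guard<std::mutex> high(block_mutex[i + 1]);
            for (int j = kBlock * i; j < kBlock * (i + 2); ++j)
                foo[j] += 1.0;  // stand-in for the real work
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return foo;
}
```

The lock guards release in reverse order of acquisition, which matches the unlock order in the snippet above. Elements in the first and last touched blocks end up at 1.0, everything in between at 2.0, and the final ten elements (never touched by the loop) stay at 0.0.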

Related

Multithreading nested for loop with std::thread

I am quite new to C++ and I would really need some advice on multithreading using std::thread.
I have the following piece of code, which basically splits a for loop of N = 8^L iterations (up to 8^14) across threads:
void Lanczos::Hamil_vector_multiply(vec& initial_vec, vec& result_vec) {
    result_vec.zeros();
    std::vector<arma::vec> result_threaded(num_of_threads);
    std::vector<std::thread> threads;
    threads.reserve(num_of_threads);
    for (int t = 0; t < num_of_threads; t++) {
        u64 start = t * N / num_of_threads;
        u64 stop = ((t + 1) == num_of_threads ? N : N * (t + 1) / num_of_threads);
        result_threaded[t] = arma::vec(stop - start, fill::zeros);
        threads.emplace_back(&Lanczos::Hamil_vector_multiply_kernel, this, start, stop, ref(initial_vec), ref(result_vec));
    }
    for (auto& t : threads) t.join();
}
where Lanczos is my general class (actually it is not necessary to know what it contains), while the member function Hamil_vector_multiply_kernel is of the form:
void Lanczos::Hamil_vector_multiply_kernel(u64 start, u64 stop, vec& initial_vec, vec& result_vec_threaded){
    // some declarations
    for (u64 k = start; k < stop; k++) {
        // some preliminary work
        for (int j = 0; j <= L - 1; j++) {
            // a bunch of if-else statements, where result_vec_threaded(k) += something
        }
    }
}
(The code is quite long, so I didn't paste the whole thing here.) My problem is that I call the function Hamil_vector_multiply 100-150 times in another function, so each time I create a new vector of threads, which then destroys itself. My questions:
1) Is it better to create threads in the function which calls Hamil_vector_multiply and then pass a vector of threads to Hamil_vector_multiply in order to avoid creating new threads each time?
2) Would it be better to asynchronously attack the loop (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point to any literature describing threads asynchronously?
3) Are there maybe better ways of multithreading such a loop? (Without multithreading I have a loop from k=0 to k=N=8^14, which takes up a lot of time.)
I found several attempts to create a threadpool and job queue, would it be useful to use for instance some workpool like this: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
My code works as it is supposed to (gives the correct result), and it speeds up the program by something like 10 times with 16 cores. But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice.
Thank you very much in advance!
PS: The function which calls Hamil_vector_multiply 100-150 times is of the form:
void Lanczos::Build_Lanczos_Hamil(vec& initial_vec) {
    vec tmp(N);
    Hamil_vector_multiply(initial_vec, tmp);
    // some calculations
    for(int j=0; j<100; j++) {
        // something
        vec tmp2 = ...
        Hamil_vector_multiply(tmp2, tmp);
        // do something else -- not related
    }
}
Is it better to create threads in the function which calls Hamil_vector_multiply and then pass a vector of threads to Hamil_vector_multiply in order to avoid creating each time new threads?
If you're worried about performance, yes, it would help. What you're doing right now is essentially allocating a new heap block in every function call (I'm talking about the vector). If you can do that beforehand, it'll give you some performance. There isn't an issue with the way you do it now, but you could gain some performance.
Would it be better to asynchronously attack the loop (for instance, the first thread to finish an iteration starts the next available one)? If yes, can you point to any literature describing threads asynchronously?
This might not be a good idea. You will have to lock resources using mutexes when sharing the same data between multiple threads. This means that you'll get the same amount of performance as processing using one thread because the other thread(s) will have to wait till the resource is unlocked and ready to be used.
Are there maybe better ways of multithreading such a loop? (without multithreading i have a loop from k=0 to k=N=8^14, which takes up a lot of time)
If your goal is to improve performance, if you can split the work across multiple threads, and most importantly if multithreading will help, then there isn't a reason not to do it. From what I can see, your implementation looks pretty neat. But keep in mind that starting a thread is itself a little costly (negligible compared to your performance gain), and load balancing will likely improve performance even further.
But if you have other helpful comments not regarding multithreading, I would be grateful for every piece of advice
If your load per thread might vary, it'll be a good investment to think about load balancing. Other than that, I don't see an issue. The major place to improve would be your logic itself. Threads can only do so much if your logic takes a hell of a lot of time.
Optional:
You can use std::async, which returns a std::future, to implement the same thing. Note that the destructor of a future obtained from std::async with the std::launch::async policy blocks until its task has finished, so when the vector of futures goes out of scope it waits for all tasks, much like join(). But then it might interfere with your first question.
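For illustration, here is roughly what the same chunking could look like with std::async and std::future. This is a sketch under assumptions: kernel and multiply are hypothetical stand-ins (plain std::vector<double> instead of arma::vec, and a trivial doubling instead of the real Hamiltonian work), chosen only to show the launch-and-wait pattern.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for the kernel: fills result[k] for k in [start, stop).
void kernel(std::uint64_t start, std::uint64_t stop,
            const std::vector<double>& in, std::vector<double>& result) {
    for (std::uint64_t k = start; k < stop; ++k) result[k] = 2.0 * in[k];
}

std::vector<double> multiply(const std::vector<double>& in, int num_tasks) {
    assert(num_tasks > 0);
    const std::uint64_t n = in.size();
    std::vector<double> result(n, 0.0);
    std::vector<std::future<void>> futures;
    futures.reserve(num_tasks);
    for (int t = 0; t < num_tasks; ++t) {
        // Same split as in the question: task t gets [start, stop).
        std::uint64_t start = n * t / num_tasks;
        std::uint64_t stop = n * (t + 1) / num_tasks;
        // std::launch::async asks for the task to run on its own thread.
        futures.push_back(std::async(std::launch::async, kernel, start, stop,
                                     std::cref(in), std::ref(result)));
    }
    // get() waits for each task and rethrows any exception it raised.
    for (auto& f : futures) f.get();
    return result;
}
```

Each task writes a disjoint index range of result, so no locking is needed. Compared with raw std::thread, the main practical difference is that exceptions thrown inside a task are delivered through get(); whether it is faster would have to be measured.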

Multithreading is slower than no threading C++

I am new to multi-threaded programming. I am aware that several similar questions have been asked on SO before; however, I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
    if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
        hobj obj(*i, length);
        validobjs.push_back(obj);
    }
}
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
    if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
        hobj obj(*j, length);
        validobjs.push_back(obj);
    }
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
    std::vector<hobj> threaded1; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
        if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
            hobj obj(*i, length);
            threaded1.push_back(obj);
        }
    }
    std::vector<hobj> threaded2; // Each thread has own local vector
    #pragma omp for nowait firstprivate(length)
    for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
        if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
            hobj obj(*j, length);
            threaded2.push_back(obj);
        }
    }
    #pragma omp critical // Insert local vectors to main vector one thread at a time
    {
        validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
        validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
    }
}
In the non-multithreaded case my total time spent doing the operation is around 4x faster than the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that is performing 10,000 - 100,000s of iterations (this loop is not using multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again in the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
Same goes for the merging of the two vectors at the end, which is a critical point, first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and if there is a custom copy-constructor that executes some more code or not. Consider storing only indices of the valid elements or pointers to valid elements.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure if this is entirely legal or if it is undefined behaviour.
You could also think about using std::list (linked list). Concatenating linked lists, or removing elements happens in constant time, however adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this; I hope there was something useful in it.
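To make the reserve-then-merge suggestion concrete, here is a small sketch. Item and collect_valid are hypothetical stand-ins for the question's hobj objects and flags, and the loops use plain indices rather than iterators to keep the OpenMP part simple. Without OpenMP enabled the pragmas are ignored and the function simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's objects.
struct Item { int value; bool ignore; };

std::vector<int> collect_valid(const std::vector<Item>& v1,
                               const std::vector<Item>& v2) {
    std::vector<int> valid;
    // Upper bound on the result size, so the merge never reallocates.
    valid.reserve(v1.size() + v2.size());
    const long n1 = static_cast<long>(v1.size());
    const long n2 = static_cast<long>(v2.size());

    #pragma omp parallel
    {
        std::vector<int> local;  // one per thread, filled from both inputs
        #pragma omp for nowait
        for (long i = 0; i < n1; ++i)
            if (!v1[i].ignore) local.push_back(v1[i].value);
        #pragma omp for nowait
        for (long j = 0; j < n2; ++j)
            if (!v2[j].ignore) local.push_back(v2[j].value);
        // Only the short merge step is serialized.
        #pragma omp critical
        valid.insert(valid.end(), local.begin(), local.end());
    }
    assert(valid.size() <= v1.size() + v2.size());
    return valid;
}
```

The order of elements in the result depends on which thread merges first, so code that needs a stable order would have to sort afterwards or merge by thread number. For element types that are expensive to copy, inserting through std::make_move_iterator is one option worth trying.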
IMHO,
You copy each element twice: into threaded1/2 and after that into validobjs.
It can make your code slower.
You can add elements directly into a single vector by using synchronization.

The behaviour of fftw_execute when called inside an OpenMP parallel region

I'm writing a spectral PDE code I'd like to parallelise, using FFTW to do the FFTs. The main loop in the code looks as below. Let's say I have a real space array, u, and a Fourier space array, uhat, and I've already constructed plans to go between them outside of this for loop (and outside of any parallel region), as well as calling the needed FFTW initialisation functions for parallelisation of fftw_execute.
for
{
    fftw_execute(u_to_uhat);
    // do some things with uhat, the fourier space array - for
    // example, scale the transform in a for loop
    for(int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }
    // transform back
    fftw_execute(uhat_to_u);
}
I want to parallelise everything here, including the manipulations of uhat within that second for loop. My question is how should I use openMP #pragmas to do so? Currently I have a parallel region around the inner loop:
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
for
{
    fftw_execute(u_to_uhat);
    // do some things with uhat, the fourier space array - for
    // example, scale the transform in a for loop
    #pragma omp parallel for
    for(int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }
    // transform back
    fftw_execute(uhat_to_u);
}
But as I understand it this creates and destroys a thread block every time I enter this for loop, which is expensive. I'd rather construct the parallel region outside the for loop once:
fftw_init_threads();
fftw_plan_with_nthreads(omp_get_max_threads());
#pragma omp parallel
for
{
    fftw_execute(u_to_uhat);
    // do some things with uhat, the fourier space array - for
    // example, scale the transform in a for loop
    #pragma omp for
    for(int n = 0; n < Nmax; n++)
    {
        uhat[n] = scale*uhat[n];
    }
    // transform back
    fftw_execute(uhat_to_u);
}
But then I have to worry about the behaviour of fftw_execute called from within the parallel region. I believe the documentation, section 5.4, says fftw_execute is thread safe (i.e safe to call from within a parallel region). But it doesn't tell me whether fftw_execute creates its own block of threads when called - i.e by putting it in a parallel region, am I just making every existing thread construct a load more threads within the fftw_execute function? Alternately, does it know to use the threads that are already there?
In short: can I call fftw_execute() from inside a parallel region, and can I make it work as one would hope, i.e. just use the existing threads to do the work rather than spawning new ones?
Sorry for the long question, I'd really appreciate some advice on this!
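One piece of this can be sketched without touching FFTW at all: inside a single enclosing parallel region, a step that must run exactly once per iteration can be wrapped in an omp single construct, while the element-wise loop is shared with omp for. The sketch below assumes a hypothetical transform_in_place function as a stand-in for the library call; it is not an FFTW function, and the sketch says nothing about whether a threaded FFTW plan would reuse the team's threads when called this way.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a call that must run once per step
// (the role fftw_execute plays above). NOT an FFTW function.
void transform_in_place(std::vector<double>& data) {
    for (double& x : data) x += 1.0;
}

std::vector<double> run_steps(int num_steps, int n, double scale) {
    assert(num_steps >= 0 && n >= 0);
    std::vector<double> uhat(n, 1.0);
    #pragma omp parallel
    {
        // Every thread runs this loop with its own private `step`.
        for (int step = 0; step < num_steps; ++step) {
            // Executed by one thread only; the others wait at the
            // implicit barrier at the end of the single construct.
            #pragma omp single
            transform_in_place(uhat);
            // Iterations are divided among the existing threads; the
            // implicit barrier at the end keeps the steps in order.
            #pragma omp for
            for (int i = 0; i < n; ++i) uhat[i] = scale * uhat[i];
        }
    }
    return uhat;
}
```

With this structure the thread team is created once, but the wrapped call itself runs on one thread of the team unless the library opens its own nested parallelism. Whether that beats the simpler parallel for inside the loop is something to measure, since OpenMP runtimes commonly keep their threads alive between parallel regions.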

OpenMP Single Producer Multiple Consumer

I am trying to achieve something contrived using OpenMP.
I have a multi-core system with N available processors. I want to have a vector of objects of length k*P to be populated in batches of P by a single thread (by reading a file), i.e. a single thread reads this file and writes in vecObj[0 to P-1], then vecObj[P to 2P-1], etc. To make things simple, this vector is pre-resized (i.e. inserting using = operator, no pushbacks, constant length as far as we are concerned).
After a batch is written into the vector, I want the remaining N-1 threads to work on the available data. Since every object can take different time to be worked upon, it would be good to have dynamic scheduling for the remaining threads. The below snippet works really well when all the threads are working on the data.
#pragma omp parallel for schedule(dynamic, per_thread)
for(size_t i = 0; i < dataLength(); ++i) {
    threadWorkOnElement(vecObj, i);
}
Now, the main issue I am facing in thinking up a solution is this: how can I have N-1 threads dynamically scheduled over the range of available data, while another thread just keeps on reading and populating the vector with data?
I am guessing that the issue of writing new data and messaging the remaining threads can be achieved using std atomic.
I think that what I am trying to achieve is along the lines of the following pseudo code
std::atomic<size_t> freshDataEnd;
size_t dataWorkStart = 0;
size_t dataWorkEnd;
#pragma omp parallel
{
    #pragma omp task
    {
        //increment freshDataEnd atomically upon reading every P objects
        //return when end of file is reached
        readData(vecObj, freshDataEnd);
    }
    #pragma omp task
    {
        omp_set_num_threads(N-1);
        while(freshDataEnd <= MAX_VEC_LEN) {
            if (dataWorkStart < freshDataEnd) {
                dataWorkEnd = freshDataEnd;
                #pragma omp parallel for schedule(dynamic, per_thread)
                for(size_t i = dataWorkStart; i < dataWorkEnd; ++i) {
                    threadWorkOnElement(vecObj, i);
                }
                dataWorkStart = dataWorkEnd;
            }
        }
    }
}
Is this the correct approach to achieve what I am trying to do? How can I handle this sort of nested parallelism? Less important: I would have preferred to stick with OpenMP directives and not use std atomics. Is that possible? How?
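One directive-only alternative to the pseudocode above is to let the reading thread create an OpenMP task per element right after each batch is filled; the runtime then hands those tasks to whichever threads are idle, which gives dynamic scheduling without nested parallel regions or std::atomic. The following is a sketch under assumptions: produce_and_consume and work_on_element are hypothetical names, and the file read is replaced by writing the index into each slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for threadWorkOnElement.
int work_on_element(int x) { return x * x; }

std::vector<int> produce_and_consume(std::size_t batches, std::size_t batch_size) {
    assert(batch_size > 0);
    const std::size_t total = batches * batch_size;
    std::vector<int> vecObj(total, 0);   // pre-sized, as in the question
    std::vector<int> results(total, 0);

    #pragma omp parallel
    #pragma omp single
    {
        // Only one thread runs this block: it is the producer.
        for (std::size_t b = 0; b < batches; ++b) {
            const std::size_t begin = b * batch_size;
            const std::size_t end = begin + batch_size;
            // "Read" one batch (stand-in for the file input).
            for (std::size_t i = begin; i < end; ++i)
                vecObj[i] = static_cast<int>(i);
            // Queue one task per element that is now available.
            for (std::size_t i = begin; i < end; ++i) {
                #pragma omp task firstprivate(i) shared(vecObj, results)
                results[i] = work_on_element(vecObj[i]);
            }
        }
    }   // all queued tasks have finished once the parallel region ends
    return results;
}
```

Each task reads and writes only its own element, so no extra synchronization is needed. One caveat: the producing thread is allowed to execute tasks itself at scheduling points, so it is not strictly reserved for reading. Creating one task per element may also be too fine-grained for cheap work; one task per small chunk is a common compromise.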

OpenMP and STL vector

I've got some code for which I'd like to use OpenMP in the following way:
std::vector<int> v(1000);
# pragma omp parallel for
for (int i = 0; i < 1000; ++i) {
    v[i] = i;
}
I have read that STL vector container is not thread-safe in the situation where multiple threads write to a single container, which would imply that I'd need to lock the vector before making any writes; however, I've also been told that the write operation above is somehow "atomic", and so there is no race condition above. Could someone clarify this?
In this particular example, it will be safe.
The reason is that you are not using operations that could cause a reallocation (such as push_back()). You are only changing the contents of the individual elements.
Note that you can just as legally do this:
std::vector<int> v(1000);
int *ptr = &v[0];
# pragma omp parallel for
for (int i = 0; i < 1000; ++i) {
    ptr[i] = i;
}
It becomes not-thread-safe when you start calling methods like push_back(), pop_back(), insert(), etc... from multiple threads.
I'll also add that this particular example isn't well-suited for parallelism since there's hardly any work to be done. But I suppose it's just a dumbed-down example for the purpose of asking this question.
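To contrast the two cases side by side, here is a small sketch using std::thread instead of OpenMP. The function names fill_by_index and fill_by_push_back are hypothetical. The first writes distinct elements of a pre-sized vector and needs no lock; the second grows the vector from several threads, so every push_back is guarded by a mutex.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Safe without locks: the vector is sized up front and each thread
// writes a different set of elements.
std::vector<int> fill_by_index(int n, int num_threads) {
    assert(n >= 0 && num_threads > 0);
    std::vector<int> v(n);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&v, n, num_threads, t] {
            for (int i = t; i < n; i += num_threads) v[i] = i;
        });
    for (auto& th : threads) th.join();
    return v;
}

// push_back may reallocate and changes the size, so concurrent calls
// must be serialized.
std::vector<int> fill_by_push_back(int n, int num_threads) {
    assert(n >= 0 && num_threads > 0);
    std::vector<int> v;
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&v, &m, n, num_threads, t] {
            for (int i = t; i < n; i += num_threads) {
                std::lock_guard<std::mutex> lock(m);
                v.push_back(i);
            }
        });
    for (auto& th : threads) th.join();
    return v;
}
```

The second version ends up with the same elements but in an unspecified order, and the lock around every insertion usually removes most of the benefit of using threads for work this small.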
Multiple reads are safe, but I would recommend avoiding multiple writes to the same container. You can, however, write to memory you manage on your own. The difference from a vector is that you can be sure the memory will not be changed or reallocated at the same time. Otherwise you can also use a semaphore, but this would probably decrease efficiency, and if you use several it can even cause deadlocks if you are not careful.