OpenMP and STL vector - c++

I've got some code for which I'd like to use OpenMP in the following way:
std::vector<int> v(1000);
# pragma omp parallel for
for (int i = 0; i < 1000; ++i) {
v[i] = i;
}
I have read that STL vector container is not thread-safe in the situation where multiple threads write to a single container, which would imply that I'd need to lock the vector before making any writes; however, I've also been told that the write operation above is somehow "atomic", and so there is no race condition above. Could someone clarify this?

In this particular example, it will be safe.
The reason is that you are not using operations that could cause a reallocation (such as push_back()). You are only changing the contents of individual elements, and each iteration writes to a different element.
Note that you can just as legally do this:
std::vector<int> v(1000);
int *ptr = &v[0];
# pragma omp parallel for
for (int i = 0; i < 1000; ++i) {
ptr[i] = i;
}
It stops being thread-safe when you start calling methods like push_back(), pop_back(), insert(), etc. from multiple threads.
I'll also add that this particular example isn't well-suited for parallelism since there's hardly any work to be done. But I suppose it's just a dumbed-down example for the purpose of asking this question.
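A minimal sketch of the safe pattern, assuming the vector is sized before the parallel region starts. The helper name fill_with_indices is made up for illustration. If OpenMP is not enabled, the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: every iteration writes a different element of a
// vector whose size is fixed before the parallel region starts.
std::vector<int> fill_with_indices(int n) {
    assert(n >= 0);
    std::vector<int> v(n);      // sized up front, so nothing reallocates later
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        v[i] = i;               // distinct element per iteration
    }
    return v;
}
```

Calling fill_with_indices(1000) reproduces the loop from the question.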

Concurrent reads are safe, but I would recommend avoiding concurrent writes to the same container. You can, however, write to memory that you manage yourself. The difference from a vector is that you can be sure the memory will not be changed or reallocated at the same time. Otherwise you can use a semaphore, but that would probably hurt efficiency, and if you use several of them carelessly it can even cause deadlocks.

Related

C++ OpenMP: put complex variable in or before loop?

I have a loop that I want to process in parallel. Each thread needs an (independent) chunk of memory, but it can be overwritten in every iteration and needn't be reallocated. See the following example:
vector<int> scratch(size);
for(int i=0; i < count; i++){
f(arguments, scratch);
g(scratch);
}
where f takes scratch as an output parameter. To make this parallelizable, I could do
#pragma omp parallel for
for(int i=0; i < count; i++){
vector<int> scratch(size);
f(arguments, scratch);
g(scratch);
}
or
#pragma omp parallel
{
vector<int> scratch(size);
#pragma omp for
for(int i=0; i < count; i++){
f(arguments, scratch);
g(scratch);
}
}
Will I be wasting time constructing and destroying scratch in the first version? Or will the compiler (with optimization) most likely reuse the memory and refrain from reallocation?
On a mainstream PC, the second code (the one that constructs scratch inside the loop) is inefficient. It generally causes the vector to be reallocated and filled with zeros on every iteration. Depending on your system, the default allocator may not scale (AFAIK this is typically the case on Windows with MSVC, but it should be fine on Linux with jemalloc), and this will reduce the performance of your application. The eager zero-filling of the vector can also cause the same issue if size is big, since RAM is a limited shared resource. Compilers like Clang are able to optimize out some allocations, but in this case neither GCC nor Clang is able to do so (and the overhead of the memset would still be present anyway).
The third example is quite efficient since the vector is allocated and filled only once per thread. Each thread has its own vector, so locality is good. This solution is only worse than the second one if the number of iterations is smaller than the number of threads. However, this is not much of an issue: if the f and g calls are short, both versions are inefficient anyway (because of the overhead of distributing the work between threads), and if the f and g calls are long, the overhead of the vector is negligible in both versions.
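A rough, self-contained sketch of the third variant. Here f and g are hypothetical stand-ins for the functions in the question (f fills scratch, g consumes it), and process_all is an invented wrapper; the only point is where scratch is constructed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for f: overwrites every element of scratch.
void f(int argument, std::vector<int>& scratch) {
    for (std::size_t k = 0; k < scratch.size(); ++k)
        scratch[k] = argument + static_cast<int>(k);
}

// Hypothetical stand-in for g: reduces scratch to a single number.
long g(const std::vector<int>& scratch) {
    long sum = 0;
    for (int x : scratch) sum += x;
    return sum;
}

std::vector<long> process_all(int count, std::size_t size) {
    assert(count >= 0);
    std::vector<long> results(count);   // one slot per iteration
    #pragma omp parallel
    {
        // Constructed once per thread, then reused by every iteration
        // that this thread executes.
        std::vector<int> scratch(size);
        #pragma omp for
        for (int i = 0; i < count; ++i) {
            f(i, scratch);
            results[i] = g(scratch);    // distinct element per iteration
        }
    }
    return results;
}
```

Because f overwrites the whole buffer, it does not matter what the previous iteration left in scratch.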

Is mutex needed when all threads set flags in an array based on search result

I have a code block that looks like this
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for(int i = 0; i <v; i++) {
// If any thread finds it true, its true.
// Max value of j is n.
for(auto& j : vec[i])
flags[j] = true;
}
Work based upon the flags.
Is there any need for a mutex ? I understand cache coherence is going to make sure all the write buffers are synchronized, and conflicting buffers will not be written to memory. Secondly the overhead of cache coherence can be avoided by simply changing
flags[j] = true;
to
if(!flags[j]) flags[j] = true;
The check if flags[j] is already set will reduce the write frequency thus need for cache coherency updates. And even if by any chance flags[j] is read to be false it will only end up in one extra write to flags[j] which is okay.
EDIT :
Yes multiple threads may and will try to write to the same index in flags[j]. Hence the question.
uint32_t has intentionally been used and bool is not used since writing to a boolean in parallel can malfunction as the neighboring booleans share the same byte. But writing to the same uint32_t in parallel from different threads will not malfunction in the same manner as booleans even without mutex.
FWIW, to comply with the standard, I ended up with the code below, which more or less complies with it, though not 100%. The non-standard code shown above did not fail in tests. I thought at first that it would fail on multi-socket machines, but it turns out x86 also provides cache coherence across sockets.
#pragma omp parallel
{
std::vector<uint32_t> flags_local(n, 0);
#pragma omp for
for(int i = 0; i <v; i++) {
for(auto& j : vec[i])
flags_local[j] = true;
}
// No omp directive here, as all threads
// need to traverse their full arrays.
for(int j = 0; j <n; j++) {
if(flags_local[j] && !flags[j]) {
#pragma omp critical
{ flags[j] = true; }
}
}
}
Thread safety in C++ is such that you need not worry about cache coherency and similar hardware-related issues. What matters is what is specified in the C++ standard, and I don't think it mentions cache coherency. That's for the implementers to worry about.
For writing to elements of a std::vector the rules are actually rather simple: writing to distinct elements of a vector is thread-safe. You need to synchronize the access only if two threads write to the same index (and for that it does not matter whether both threads write the same value or not).
As pointed out by Evg, I made a rough simplification. What counts is that all threads access different memory locations. Hence, with a std::vector<bool> things wouldn't be that simple, because typically several elements of a std::vector<bool> are stored in a single byte.
Yes multiple threads may and will try to write to the same index in flags[j].
Then you need to synchronize the access. The fact that all elements are false initially and all writes write true is not relevant. What counts is that you have multiple threads that access the same memory and at least one of them writes to it. When this is the case, you need to synchronize the access or you have a data race.
Accessing a variable concurrently with a write to it is a data race, and so it is undefined behavior.
flags[j] = true;
should be protected.
Alternatively, you might use atomic types (but see how-to-declare-a-vector-of-atomic-in-c++),
or, even simpler, std::atomic_ref (C++20):
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for (int i = 0; i < v; i++) {
for (auto& j : vec[i]) {
auto atom_flag = std::atomic_ref<std::uint32_t>(flags[j]);
atom_flag = true;
}
}
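If C++20 is not available, an OpenMP atomic write is another option. The following is only a sketch under some assumptions: the compiler supports OpenMP 3.1 or later (needed for the write clause), vec[i] holds the indices to flag as in the question, and the function name set_flags is invented for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// vec[i] holds the indices that iteration i wants to flag.
std::vector<std::uint32_t> set_flags(const std::vector<std::vector<int>>& vec, int n) {
    assert(n >= 0);
    std::vector<std::uint32_t> flags(n, 0);
    std::uint32_t* f = flags.data();    // plain pointer for the atomic statement
    const int v = static_cast<int>(vec.size());
    #pragma omp parallel for
    for (int i = 0; i < v; i++) {
        for (int j : vec[i]) {
            // Several threads may hit the same j, so make the store atomic.
            #pragma omp atomic write
            f[j] = 1;
        }
    }
    return flags;
}
```

The flags are only read after the parallel loop has finished, so the reads themselves need no extra synchronization.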

Multithreading is slower than no threading C++

I am new to multi-thread programming and I am aware several similar questions have been asked on SO before however I would like to get an answer specific to my code.
I have two vectors of objects (v1 & v2) that I want to loop through and depending on if they meet some criteria, add these objects to a single vector like so:
Non-Multithread Case
std::vector<hobj> validobjs;
int length = 70;
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
validobjs.push_back(obj);
}
}
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
validobjs.push_back(obj);
}
}
Multithread Case
std::vector<hobj> validobjs;
int length = 70;
#pragma omp parallel
{
std::vector<hobj> threaded1; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto i = this->v1.begin(); i < this->v1.end() ;++i) {
if( !(**i).get_IgnoreFlag() && !(**i).get_ErrorFlag() ) {
hobj obj(*i, length);
threaded1.push_back(obj);
}
}
std::vector<hobj> threaded2; // Each thread has own local vector
#pragma omp for nowait firstprivate(length)
for(auto j = this->v2.begin(); j < this->v2.end() ;++j) {
if( !(**j).get_IgnoreFlag() && !(**j).get_ErrorFlag() ) {
hobj obj(*j, length);
threaded2.push_back(obj);
}
}
#pragma omp critical // Insert local vectors to main vector one thread at a time
{
validobjs.insert(validobjs.end(), threaded1.begin(), threaded1.end());
validobjs.insert(validobjs.end(), threaded2.begin(), threaded2.end());
}
}
In the non-multithreaded case my total time spent doing the operation is around 4x faster than the multithreaded case (~1.5s vs ~6s).
I am aware that the #pragma omp critical directive is a performance hit but since I do not know the size of the validobjs vector beforehand I cannot rely on random insertion by index.
So questions:
1) Is this kind of operation suited for multi-threading?
2) If yes to 1) - does the multithreaded code look reasonable?
3) Is there anything I can do to improve the performance to get it faster than the no-thread case?
Additional info:
The above code is nested within a much larger codebase that performs 10,000 to 100,000s of iterations (this outer loop is not using multithreading). I am aware that spawning threads also incurs a performance overhead, but as far as I am aware these threads are kept alive until the above code is executed again on the next iteration.
omp_set_num_threads is set to 32 (I'm on a 32 core machine).
Ubuntu, gcc 7.4
Cheers!
I'm no expert on multithreading, but I'll give it a try:
Is this kind of operation suited for multi-threading?
I would say yes. Especially if you have huge datasets, you could split them even further, running any number of filtering operations in parallel. But it depends on the amount of data you want to process: thread creation and synchronization are not free, and neither is the merging at the end of the threaded version.
Does the multithreaded code look reasonable?
I think you're on the right path in letting each thread work on independent data.
Is there anything I can do to improve the performance to get it faster than the no-thread case?
I see a few points that might improve performance:
The vectors will need to resize often, which is expensive. You can use reserve() to, well, reserve memory beforehand and thus reduce the number of reallocations (to 0 in the optimal case).
Same goes for the merging of the two vectors at the end, which is a critical point, first reserve:
validobjs.reserve(v1.size() + v2.size());
then merge.
Copying objects from one vector to another can be expensive, depending on the size of the objects you copy and if there is a custom copy-constructor that executes some more code or not. Consider storing only indices of the valid elements or pointers to valid elements.
You could also try to replace elements in parallel in the resulting vector. That could be useful if default-constructing an element is cheap and copying is a bit expensive.
Filter the data in two threads as you do now.
Synchronise them and allocate a vector with a number of elements:
validobjs.resize(v1.size() + v2.size());
Let each thread insert elements into independent parts of the vector. For example, thread one writes to indices 0 to x and thread two writes to indices x + 1 to validobjs.size() - 1.
Although I'm not sure if this is entirely legal or if it is undefined behaviour.
You could also think about using std::list (linked list). Concatenating linked lists, or removing elements happens in constant time, however adding elements is a bit slower than on a std::vector with reserved memory.
Those were my thoughts on this, I hope there was something useful in it.
IMHO,
You copy each element twice: once into threaded1/threaded2 and then again into validobjs.
That can make your code slower.
You could instead add elements to a single vector using synchronization.
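To make the suggestions concrete, here is a hedged sketch of the per-thread-buffer approach with the destination reserved up front. Item is a hypothetical stand-in for the question's objects, and only an int per valid element is stored instead of an hobj. With several threads the order of the merged result is not deterministic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the objects being filtered.
struct Item {
    int value;
    bool ignore;
    bool error;
};

std::vector<int> collect_valid(const std::vector<Item>& v1, const std::vector<Item>& v2) {
    std::vector<int> valid;
    valid.reserve(v1.size() + v2.size());   // upper bound, so merging never reallocates
    const int n1 = static_cast<int>(v1.size());
    const int n2 = static_cast<int>(v2.size());
    #pragma omp parallel
    {
        std::vector<int> local;             // private to each thread
        #pragma omp for nowait
        for (int i = 0; i < n1; ++i)
            if (!v1[i].ignore && !v1[i].error) local.push_back(v1[i].value);
        #pragma omp for nowait
        for (int j = 0; j < n2; ++j)
            if (!v2[j].ignore && !v2[j].error) local.push_back(v2[j].value);
        // One short critical section per thread, not one per element.
        #pragma omp critical
        valid.insert(valid.end(), local.begin(), local.end());
    }
    assert(valid.size() <= v1.size() + v2.size());
    return valid;
}
```

Using one local vector for both loops halves the number of merges compared with the question's threaded1/threaded2 pair. Whether this beats the serial loop still depends on how expensive the per-element work is.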

OpenMP Nested For Loop with STL Containers

I am a little confused about the following case regarding STL containers in C++. Operations such as push_back() are unsafe for threading; otherwise, I think STL containers can be used.
std::vector<int> global_vector;
#pragma omp parallel for
for (int i = 0; i < height; i++)
{
for(std::vector<int>::iterator it = fvec.begin(); it != fvec.end(); it++)
{
// process here with some push_back into global_vector
global_vector.push_back(/*SOMETHING*/);
}
}
Looking at the above code, only the outer for loop is parallel, so I wonder whether the push_back in the inner for loop will be affected, making the code thread-unsafe.
The answer is definitely YES, the code such as it is now is thread-unsafe.
The reason is that push_back() depends on and modifies the internal state of the vector, so there will be race conditions between the threads modifying this internal state. To make the code thread-safe, you would need to make sure that no concurrent calls to this method ever happen.
This can probably be enforced this way:
std::vector<int> global_vector;
#pragma omp parallel for
for (int i = 0; i < height; i++) {
for(std::vector<int>::iterator it = fvec.begin(); it != fvec.end(); it++) {
// process here with some push_back into global_vector
#pragma omp critical
global_vector.push_back(/*SOMETHING*/);
}
}
However, this code would be a disaster in terms of parallel efficiency, since all accesses would be serialised, with a lot of added overhead for managing the lock. So just forget about this approach.
What you could do, however, is compute the size of the final vector in advance, along with the indexes you really want to access, and use only stateless access functions on per-thread disjoint subsets of the indexes. This would correspond to using global_vector[i] = /*SOMETHING*/; instead of global_vector.push_back(/*SOMETHING*/);, since you know the per-thread ranges of i are disjoint.
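A small sketch of that idea, assuming each (i, element of fvec) pair produces exactly one output value, so the final size is height * fvec.size(). The function name fill_by_index and the stored value i * fvec[k] are placeholders for the real computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Every (i, k) pair owns exactly one slot, so no two iterations write the
// same element and no push_back is needed.
std::vector<int> fill_by_index(int height, const std::vector<int>& fvec) {
    assert(height >= 0);
    const std::size_t width = fvec.size();
    std::vector<int> global_vector(static_cast<std::size_t>(height) * width);
    #pragma omp parallel for
    for (int i = 0; i < height; i++) {
        for (std::size_t k = 0; k < width; k++) {
            // Placeholder computation; the index is unique to (i, k).
            global_vector[static_cast<std::size_t>(i) * width + k] = i * fvec[k];
        }
    }
    return global_vector;
}
```

If the number of outputs per iteration is not known in advance, this layout does not apply directly and per-thread buffers are probably the easier route.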

OpenMP parallel thread

I need to parallelize this loop. I thought that using OpenMP was a good idea, but I have never studied it before.
#pragma omp parallel for
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it){
worst_q = std::min(worst_q, mesh->element_quality(*it));
}
In this case the loop is not parallelized because it uses an iterator and the compiler cannot work out how to split it.
Can You help me?
OpenMP requires that the controlling predicate in parallel for loops has one of the following relational operators: <, <=, > or >=. Only random access iterators provide these operators and hence OpenMP parallel loops work only with containers that provide random access iterators. std::set provides only bidirectional iterators. You can overcome that limitation using explicit tasks. The reduction can be performed by first reducing partially into variables private to each thread, followed by a global reduction over the partial values.
double *t_worst_q;
// Cache line size on x86/x64 in number of t_worst_q[] elements
const int cb = 64 / sizeof(*t_worst_q);
#pragma omp parallel
{
#pragma omp single
{
t_worst_q = new double[omp_get_num_threads() * cb];
for (int i = 0; i < omp_get_num_threads(); i++)
t_worst_q[i * cb] = worst_q;
}
// Perform partial min reduction using tasks
#pragma omp single
{
for(std::set<size_t>::const_iterator it=mesh->NEList[vid].begin();
it!=mesh->NEList[vid].end(); ++it) {
size_t elem = *it;
#pragma omp task
{
int tid = omp_get_thread_num();
t_worst_q[tid * cb] = std::min(t_worst_q[tid * cb],
mesh->element_quality(elem));
}
}
}
// Perform global reduction
#pragma omp critical
{
int tid = omp_get_thread_num();
worst_q = std::min(worst_q, t_worst_q[tid * cb]);
}
}
delete [] t_worst_q;
(I assume that mesh->element_quality() returns double)
Some key points:
The loop is executed serially by one thread only, but each iteration creates a new task. These are most likely queued for execution by the idle threads.
Idle threads waiting at the implicit barrier of the single construct begin consuming tasks as soon as they are created.
The iterator it is dereferenced before the task body. If it were dereferenced inside the task body, it would be firstprivate and a copy of the iterator would be created for each task (i.e. on each iteration). This is not what you want.
Each thread performs partial reduction in its private part of the t_worst_q[].
In order to prevent performance degradation due to false sharing, the elements of t_worst_q[] that each thread accesses are spaced out so to end up in separate cache lines. On x86/x64 the cache line is 64 bytes, therefore the thread number is multiplied by cb = 64 / sizeof(double).
The global min reduction is performed inside a critical construct to protect worst_q from being accessed by several threads at once. This is for illustrative purposes only since the reduction could also be performed by a loop in the main thread after the parallel region.
Note that explicit tasks require a compiler that supports OpenMP 3.0 or 3.1. At the time of writing this rules out the Microsoft C/C++ compiler, whose default /openmp mode only supports OpenMP 2.0.
Random-Access Container
The simplest solution is to just throw everything into a random-access container (like std::vector) and use the index-based loops that are favoured by OpenMP:
// Copy elements
std::vector<size_t> neListVector(mesh->NEList[vid].begin(), mesh->NEList[vid].end());
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(neListVector[i]));
}
Apart from being incredibly simple, in your situation (tiny elements of type size_t that can easily be copied) this is also the solution with the best performance and scalability.
Avoiding copies
However, in a different situation than yours you may have elements that aren't copied as easily (larger elements) or cannot be copied at all. In this case you can just throw the corresponding pointers in a random-access container:
// Collect pointers
std::vector<const nonCopiableObjectType *> neListVector;
for (const auto &entry : mesh->NEList[vid]) {
neListVector.push_back(&entry);
}
// Process in a standard OpenMP index-based for loop
#pragma omp parallel for reduction(min : worst_q)
for (int i = 0; i < neListVector.size(); i++) {
worst_q = std::min(worst_q, mesh->element_quality(*neListVector[i]));
}
This is slightly more complex than the first solution, but it still has the same good performance on small elements and better performance on larger elements.
Tasks and Dynamic Scheduling
Since someone else brought up OpenMP tasks in their answer, I want to comment on that too. Tasks are a very powerful construct, but they have a large overhead (which even increases with the number of threads) and in this case just make things more complex.
For the min reduction the use of tasks is never justified because the creation of a task in the main thread costs much more than just doing the std::min itself!
For the more complex operation mesh->element_quality you might think that the dynamic nature of tasks can help you with load-balancing problems, in case the execution time of mesh->element_quality varies greatly between iterations and you don't have enough iterations to even it out. But even in that case, there is a simpler solution: simply use dynamic scheduling by adding the schedule(dynamic) clause to the parallel for line in one of my previous solutions. It achieves the same behaviour with far less overhead.
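Put together, the copy-to-vector solution with dynamic scheduling might look like the sketch below. It assumes a compiler with OpenMP 3.1 or later (needed for the min reduction). element_quality is a hypothetical free-function stand-in for mesh->element_quality, and worst_quality is a made-up wrapper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for mesh->element_quality().
double element_quality(std::size_t element) {
    return 1.0 / (1.0 + static_cast<double>(element));
}

double worst_quality(const std::set<std::size_t>& elements, double worst_q) {
    // Copy into a random-access container so an index-based loop can be used.
    const std::vector<std::size_t> ids(elements.begin(), elements.end());
    const int n = static_cast<int>(ids.size());
    // schedule(dynamic) hands out iterations on demand, which helps when the
    // cost of element_quality varies a lot between elements.
    #pragma omp parallel for schedule(dynamic) reduction(min : worst_q)
    for (int i = 0; i < n; i++) {
        worst_q = std::min(worst_q, element_quality(ids[i]));
    }
    return worst_q;
}
```

For uniform per-element cost the default static schedule is likely the better choice, since dynamic scheduling adds a little bookkeeping per iteration.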