I have this code in test.cpp:
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <new>      // for placement new
#include <thread>
static const int N_ITEMS = 11;
static const int N_WORKERS = 4;
int main(void)
{
    int* const items = (int*)std::malloc(N_ITEMS * sizeof(*items));
    for (int i = 0; i < N_ITEMS; ++i) {
        items[i] = i;
    }
    std::thread* const workers = (std::thread*)std::malloc(N_WORKERS * sizeof(*workers));
    std::atomic<int> place(0);
    for (int w = 0; w < N_WORKERS; ++w) {
        new (&workers[w]) std::thread([items, &place]() {
            int i;
            while ((i = place.fetch_add(1, std::memory_order_relaxed)) < N_ITEMS) {
                items[i] *= items[i];
                std::this_thread::sleep_for(std::chrono::seconds(1));
            }
        });
    }
    for (int w = 0; w < N_WORKERS; ++w) {
        workers[w].join();
        workers[w].~thread();
    }
    std::free(workers);
    for (int i = 0; i < N_ITEMS; ++i) {
        std::cout << items[i] << '\n';
    }
    std::free(items);
}
I compile like so on Linux:
c++ -std=c++11 -Wall -Wextra -pedantic test.cpp -pthread
When run, the program should print this:
0
1
4
9
16
25
36
49
64
81
100
I don't have much experience with C++, nor do I really understand atomic operations. According to the standard, will the worker threads always square item values correctly, and will the main thread always print the correct final values? I'm worried that the atomic variable will be updated out-of-sync with the item values, or something. If this is the case, can I change the memory order used with fetch_add to fix the code?
Looks safe to me. Your i = place.fetch_add(1) hands out each array index exactly once, each to exactly one thread. So any given array element is only ever touched by a single thread, and that's guaranteed to be safe for all types other than bit-fields of a struct [1].
[1]: Or elements of std::vector<bool>, which the standard unfortunately requires to be a packed bit-vector, breaking some of the usual guarantees of std::vector.
There's no need for any ordering of these accesses while the worker threads are working; the main thread join()s the workers before reading the array, so everything done by the workers "happens before" (in ISO C++ standardese) the main thread's std::cout << items[i] accesses.
Of course, the array elements are all written by the main thread before the worker threads are started, but that's also safe, because the std::thread constructor makes sure everything earlier in the parent thread happens-before anything in the new thread:
"The completion of the invocation of the constructor synchronizes-with (as defined in std::memory_order) the beginning of the invocation of the copy of f on the new thread of execution."
There's also no need for any ordering stronger than memory_order_relaxed on the increment: it's the only atomic variable in your program, and you don't need any ordering among your operations beyond the overall thread creation and join.
It's still atomic, so it's guaranteed that the increments hand out the values 0, 1, 2, ... each exactly once; there's simply no guarantee about which thread gets which. (There is a guarantee that each thread sees monotonically increasing values, though: for every atomic object separately, a modification order exists, and that order is consistent with some interleaving of the program-order modifications to it.)
Just for the record, this is hilariously inefficient compared to having each worker process a contiguous range of indices. That would take only one atomic access per thread, or none at all if the main thread simply passed each worker its range when constructing it.
And it would avoid all the false sharing effects of having 4 threads loading and storing into the same cache line at the same time as they move through the array.
Contiguous ranges would also let the compiler auto-vectorize with SIMD to load/square/store multiple elements at once.
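A minimal sketch of the contiguous-range version, keeping the malloc/placement-new style of the question (the chunking arithmetic is mine, not from the original answer):
const int chunk = (N_ITEMS + N_WORKERS - 1) / N_WORKERS;  // ceiling division
for (int w = 0; w < N_WORKERS; ++w) {
    const int begin = w * chunk;
    const int end = (begin + chunk < N_ITEMS) ? begin + chunk : N_ITEMS;
    // Disjoint [begin, end) ranges: no atomic counter needed at all, and
    // each cache line is (mostly) written by only one thread.
    new (&workers[w]) std::thread([items, begin, end]() {
        for (int i = begin; i < end; ++i)
            items[i] *= items[i];
    });
}
The join() calls in the main thread still provide all the ordering needed before the final prints.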
Related
I have a code block that looks like this:
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for(int i = 0; i < v; i++) {
    // If any thread finds it true, it's true.
    // Max value of j is n.
    for(auto& j : vec[i])
        flags[j] = true;
}
Afterwards, work is done based upon the flags.
Is there any need for a mutex? I understand that cache coherence will make sure all the write buffers are synchronized and that conflicting buffers will not be written to memory. Secondly, the overhead of cache coherence can be avoided by simply changing
flags[j] = true;
to
if(!flags[j]) flags[j] = true;
The check whether flags[j] is already set will reduce the write frequency, and thus the need for cache-coherency updates. And even if flags[j] happens to be read as false, it only ends up causing one extra write to flags[j], which is okay.
EDIT:
Yes, multiple threads may and will try to write to the same index in flags[j]. Hence the question.
uint32_t has intentionally been used instead of bool, since writing to a bool in parallel can malfunction when neighboring bools share the same byte, whereas writing to the same uint32_t in parallel from different threads will not malfunction in that particular way, even without a mutex.
FWIW, to comply with the standard, I ended up keeping the code below, which more or less complies (not 100%, though). The non-standard code shown above did not fail in any tests. I thought it might fail on multi-socket machines, but it turns out x86 provides cache coherence across sockets as well.
#pragma omp parallel
{
    std::vector<uint32_t> flags_local(n, 0);
    #pragma omp for
    for(int i = 0; i < v; i++) {
        for(auto& j : vec[i])
            flags_local[j] = true;
    }
    // No omp directive here, as all threads
    // need to traverse their full arrays.
    for(int j = 0; j < n; j++) {
        if(flags_local[j] && !flags[j]) { // unsynchronized read of flags[j]: the "not 100%" part
            #pragma omp critical
            { flags[j] = true; }
        }
    }
}
Thread safety in C++ is such that you need not worry about cache coherency and similar hardware-related issues. What matters is what the C++ standard specifies, and I don't think it mentions cache coherency. That's for the implementers to worry about.
For writing to elements of a std::vector the rules are actually rather simple: writing to distinct elements of a vector is thread-safe. Only if two threads write to the same index do you need to synchronize the access (and for that it does not matter whether both threads write the same value or not).
As pointed out by Evg, I made a rough simplification. What counts is that all threads access different memory locations. Hence, with a std::vector<bool> things wouldn't be that simple, because typically several elements of a std::vector<bool> are stored in a single byte.
Yes multiple threads may and will try to write to the same index in flags[j].
Then you need to synchronize the access. The fact that all elements are initially false and all writes store true is not relevant. What counts is that you have multiple threads accessing the same memory location and at least one of them writes to it. When that is the case, you must synchronize the access, or you have a data race.
Accessing a variable concurrently with a write is a data race, and hence undefined behavior.
flags[j] = true;
should be protected.
Alternatively you might use atomic types (but see how-to-declare-a-vector-of-atomic-in-c++).
Or, even simpler, use std::atomic_ref (C++20):
std::vector<uint32_t> flags(n, 0);
#pragma omp parallel for
for (int i = 0; i < v; i++) {
    for (auto j : vec[i]) {
        auto atom_flag = std::atomic_ref<std::uint32_t>(flags[j]);
        atom_flag = true;
    }
}
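Since only the final flag values matter and no other data is published through them, a relaxed store (atom_flag.store(1, std::memory_order_relaxed)) should be enough here as well; the implicit barrier at the end of the parallel region is what makes the flags visible to the code that uses them afterwards.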
Running the following code hundreds of times, I expected the printed value to always be 3, but it turns out to be 3 only ~75% of the time. This probably means I have a misunderstanding about the purpose of the various memory orders in C++, or about the point of atomic operations. My question is: is there a way to guarantee that the output of the following program is predictable?
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main () {
    std::atomic<int> cnt{0};
    auto f = [&](int n) { cnt.store(n, std::memory_order_seq_cst); };
    std::vector<std::thread> v;
    for (int n = 1; n < 4; ++n)
        v.emplace_back(f, n);
    for (auto& t : v)
        t.join();
    std::cout << cnt.load() << std::endl;
    return 0;
}
For instance, here's the output statistics after 100 runs:
$ clang++ -std=c++20 -Wall foo.cpp -pthread && for i in {1..100}; do ./a.out; done | sort | uniq -c
2 1
21 2
77 3
What you observe is orthogonal to memory orders.
The scheduler cannot guarantee the order of execution of threads with the same priority. Even if CPUs are idle and the threads are assigned to different CPUs, cache misses and lock contention can make threads stall relative to threads on other CPUs. Or, if the CPUs are busy running other threads with the same or higher priority, your new threads will have to wait until the running threads exhaust their time slices or block in the kernel; which happens first is hard for the scheduler to predict. Only if your system has a single CPU will the new threads run in the expected order relative to each other, because they form one queue on that CPU.
std::memory_order_relaxed is enough here, since you don't require any ordering between the store to cnt and stores/loads of other non-atomic variables. std::atomic is always atomic; std::memory_order specifies whether loads and stores of other variables can be reordered relative to the load or store of the std::atomic variable.
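To directly answer the question: the only way to make the output predictable is to impose the ordering yourself, outside the memory model. A minimal sketch (mine, not from the original answer): joining each thread before creating the next forces the stores to happen in program order, so the last store always wins, at the cost of all parallelism.
#include <atomic>
#include <iostream>
#include <thread>

int main() {
    std::atomic<int> cnt{0};
    for (int n = 1; n < 4; ++n) {
        // Each join() happens-before the next thread's store,
        // so the stores are totally ordered and 3 always wins.
        std::thread t([&cnt, n] { cnt.store(n, std::memory_order_relaxed); });
        t.join();
    }
    std::cout << cnt.load() << std::endl;
}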
I'm using C++11 and I'm aware that concurrent writes to std::vector<bool> someArray are not thread-safe due to the specialization of std::vector for bools.
I'm trying to find out if writes to bool someArray[2048] have the same problem:
Suppose all entries in someArray are initially set to false.
Suppose I have a bunch of threads that write at different indices in someArray. In fact, these threads only set different array entries from false to true.
Suppose I have a reader thread that at some point acquires a lock, triggering a memory fence operation.
Q: Will the reader see all the writes to someArray that occurred before the lock was acquired?
Thanks!
You should use std::array<bool, 2048> someArray rather than bool someArray[2048]. If you're in C++11-land, you'll want to modernize your code as much as you're able.
std::array<bool, N> is not specialized in the way that std::vector<bool> is, so there are no concerns there in terms of raw safety.
As for your actual question:
Will the reader see all the writes to someArray that occurred before the lock was acquired?
Only if the writers to the array also interact with the lock, either by releasing it when they finish writing, or by updating a value associated with the lock that the reader then synchronizes with. If the writers never interact with the lock, the reader's accesses race with the writers' and the behavior is undefined.
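A minimal sketch of what that could look like (the done counter is my illustration, not from the question): each writer fills its own slice of the array and then increments a counter under the mutex. Once the reader locks the same mutex and observes done == 8, the release/acquire pairing on the mutex guarantees it also sees all of the array writes.
#include <array>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::array<bool, 2048> someArray{};
    std::mutex m;
    int done = 0;

    std::vector<std::thread> writers;
    for (int i = 0; i < 8; ++i) {
        writers.emplace_back([i, &someArray, &m, &done] {
            for (size_t index = size_t(i) * 256; index < size_t(i + 1) * 256; ++index)
                someArray[index] = true;
            std::lock_guard<std::mutex> lk(m);  // unlocking releases the writes
            ++done;
        });
    }

    for (;;) {
        std::lock_guard<std::mutex> lk(m);      // locking acquires them
        if (done == 8)
            break;  // safe to read every element of someArray from here on
    }
    for (auto& t : writers)
        t.join();
}
(A condition variable would avoid the busy-wait, but the synchronization argument is the same.)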
One thing you'll also want to bear in mind: while it's not unsafe to have multiple threads write to the same array, provided that they are all writing to unique memory addresses, writing could be slowed pretty dramatically by interactions with the cache. For example:
void func_a() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        // Each thread writes one contiguous 256-element block.
        writers.emplace_back([i, &someArray] {
            for (size_t index = i * 256; index < (i + 1) * 256; index++)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join();  // join() also publishes the writes to this thread
}
void func_b() {
    std::array<bool, 2048> someArray{};
    std::vector<std::thread> writers;
    for (int i = 0; i < 8; i++) {
        // Each thread writes every 8th element, so the threads
        // constantly touch the same cache lines.
        writers.emplace_back([i, &someArray] {
            for (size_t index = i; index < 2048; index += 8)
                someArray[index] = true;
        });
    }
    for (auto& writer : writers)
        writer.join();
}
The details will vary with the underlying hardware, but in nearly all situations func_a is going to be dramatically faster than func_b, at least for a sufficiently large array (2048 was chosen as an example and may not be large enough to show the full difference). Both functions produce the same result; one is just considerably faster about it.
First of all, std::vector in general is not thread-safe in the way you might think. The guarantees it does make are already stated here.
Addressing your question: the reader may not see all the writes after acquiring the lock. This is because the writers may never have performed a release operation, which is required to establish a happens-before relationship between the writes and the subsequent reads. In (very) simple terms: every acquire operation (such as a mutex lock) needs a release operation to synchronize with. Every memory operation done before a release on a certain variable will be visible to any thread that acquires the same variable. See also Release-Acquire ordering.
One important caveat: on x86/x64, naturally aligned loads and stores up to the machine word size happen to be atomic at the hardware level. But this is not something ISO C++ lets you rely on: volatile does not make accesses atomic and does not prevent data races in the C++ memory model (and a bool is typically one byte, not 32 bits). If you want per-element atomic access to the array, use std::atomic<> rather than volatile.
I am using OpenMP to parallelize a for loop. I am trying to access a C++ Armadillo vector by thread id, but I am wondering if I have to put the access in a critical section even if the different threads access disjoint areas of memory.
This is my code:
#include <armadillo>
#include <omp.h>
#include <iostream>
int main()
{
    arma::mat A = arma::randu<arma::mat>(1000,700);
    arma::rowvec point = A.row(0);
    arma::vec distances = arma::zeros(omp_get_max_threads());
    #pragma omp parallel shared(A,point,distances)
    {
        arma::vec local_distances = arma::zeros(omp_get_num_threads());
        int thread_id = omp_get_thread_num();
        for(unsigned int l = 0; l < A.n_rows; l++){
            double temp = arma::norm(A.row(l) - point,2);
            if(temp > local_distances[thread_id])
                local_distances[thread_id] = temp;
        }
        // Is it necessary to put a critical section here?
        #pragma omp critical
        if(local_distances[thread_id] > distances[thread_id]){
            distances[thread_id] = local_distances[thread_id];
        }
    }
    std::cout << distances[distances.index_max()] << std::endl;
}
Is it necessary to put reads/writes to the distances vector in a critical section in my case?
Your code is fine. It is important to understand that
Variables declared outside of the parallel region are implicitly shared.
Variables declared inside of the parallel region are implicitly private, so each thread has its own local copy.
So it's not really useful to declare a vector of per-thread distances inside the parallel region. You don't even need a separate local_distances, since access to distances[thread_id] is already correct. (Although it should be noted that access to distances is highly inefficient, because different threads will try to write to the same cache line.) Anyway, the whole thing is called a reduction, and OpenMP has easy support for that. You can write it as follows:
arma::mat A = arma::randu<arma::mat>(1000,700);
arma::rowvec point = A.row(0);
double distance = 0.;
#pragma omp parallel for reduction(max:distance)
for(unsigned int l = 0; l < A.n_rows; l++){
    distance = std::max(distance, arma::norm(A.row(l) - point,2));
}
std::cout << distance << std::endl;
Declaring a variable as a reduction variable means that each thread gets a local copy, and after the parallel region the reduction operation is applied to the set of local copies. This is the most concise, idiomatic, and performance-optimal solution.
P.S. With C++ code, it can sometimes be a bit difficult to figure out whether an access, e.g. through operator[] or arma::mat::row, is safe in a multi-threaded program. You always have to figure out whether your code implies writing to and/or reading from shared data. Either exactly one thread may write, or many threads may read.
The difficulty of multithreading comes from the need to deal with shared mutable state. There is nothing wrong with one thread accessing mutable (changeable) data, or many threads concurrently accessing immutable (constant) data. It is only when multiple threads need to access the same mutable data that synchronization/critical sections are necessary.
Your code falls under the first case: each thread_id indexes unique data, so only one thread ever changes a given element.
I am trying to parallelize an operation using pthreads. The process looks something like:
double* doSomething( .... ) {
    double* foo = new double[220];
    for(int i = 0; i < 20; i++)
    {
        // do something with the elements in foo located between 10*i and 10*(i+2)
    }
    return foo;
}
The stuff happening inside the for-loop can be done in any order, so I want to organize this using threads.
For instance, I could use a number of threads such that each thread goes through part of the for-loop but works on different parts of the array. To avoid trouble when working on overlapping parts, I need to lock some memory.
How can I make a mutex (or something else) that locks only part of the array?
If you are using a recent GCC you can try the parallel versions of the standard algorithms. See the libstdc++ parallel mode.
If you just want to make sure that a section of the array is worked once...
Make a global variable:
int _iNextSection;
Whenever a thread gets ready to operate on a section, the thread gets the next available section this way:
iMySection = __sync_fetch_and_add(&_iNextSection, 1);
__sync_fetch_and_add() returns the current value of _iNextSection and then increments it. The call is atomic, which means it is guaranteed to complete before another thread can perform the same operation, so no two threads can claim the same section. No locking, no blocking; simple and fast.
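A minimal sketch of how the pieces fit together (the worker function and g_foo are illustrative, not from the original answer; note this only guarantees each section is claimed once, it does not by itself protect the overlapping parts of neighboring sections):
#include <pthread.h>

static int _iNextSection = 0;   // next unclaimed section
static double* g_foo;           // the shared array from doSomething()

static void* worker(void*) {
    int iMySection;
    // Claim sections until none are left; each value 0..19 is
    // handed out to exactly one thread.
    while ((iMySection = __sync_fetch_and_add(&_iNextSection, 1)) < 20) {
        // ... work on g_foo[10*iMySection .. 10*(iMySection+2)) ...
    }
    return nullptr;
}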
If the loop looks exactly like you wrote it, I would use an array of 21 mutexes and have each thread lock the i-th and (i+1)-th mutex at the beginning of the loop body.
So something like:
...
for (i = 0; i < 20; i++) {
    mutex[i].lock();
    mutex[i+1].lock();
    ...
    mutex[i+1].unlock();
    mutex[i].unlock();
}
The logic is that only two neighboring loop iterations can access the same data (given the limits [i * 10, (i + 2) * 10)), so those are the only conflicts you need to worry about.
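A more complete sketch of that idea (the thread layout is my illustration; the question doesn't specify how iterations map to threads):
#include <mutex>
#include <thread>
#include <vector>

int main() {
    double* foo = new double[220];
    std::mutex mutex[21];  // one per 10-element boundary region

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) {  // 4 workers, each taking every 4th iteration
        pool.emplace_back([t, foo, &mutex] {
            for (int i = t; i < 20; i += 4) {
                mutex[i].lock();     // always lock the lower index first,
                mutex[i + 1].lock(); // so no deadlock is possible
                // ... work on foo[10*i .. 10*(i+2)) ...
                mutex[i + 1].unlock();
                mutex[i].unlock();
            }
        });
    }
    for (auto& th : pool)
        th.join();
    delete[] foo;
}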