Safe to use int in multithreaded single-writer multi-reader code - C++

I'm writing parallel code that has a single writer and multiple readers. The writer will fill in an array from beginning to end, and the readers will access elements of the array in order. Pseudocode is something like the following:
std::vector<Stuff> vec(knownSize);
int producerIndex = 0;
std::atomic<int> consumerIndex{0};

Producer thread:

for (a while) {
    vec[producerIndex] = someStuff();
    ++producerIndex;
}

Consumer thread:

while (!finished) {
    int myIndex = consumerIndex++;
    while (myIndex >= producerIndex) { spin(); }
    use(vec[myIndex]);
}
Do I need any sort of synchronization around the producerIndex? It seems like the worst thing that could happen is that I would read an old value while it's being updated so I might spin an extra time. Am I missing anything? Can I be sure that each assignment to myIndex will be unique?

As the comments have pointed out, this code has a data race. Instead of speculating about whether it has a chance of doing what you want, just fix it: change the type of producerIndex from int to std::atomic<int> (consumerIndex already is one) and let the compiler and standard library implementors worry about how to make that work correctly on your target platform.
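A sketch of the fixed version (the element type, size, and produced values are placeholders; the memory orderings shown are one reasonable choice, not the only correct one):

#include <atomic>
#include <vector>

constexpr int knownSize = 1000;            // placeholder size
std::vector<int> vec(knownSize);           // int stands in for Stuff
std::atomic<int> producerIndex{0};
std::atomic<int> consumerIndex{0};

// single producer: fills vec front to back and publishes its progress
void producer() {
    for (int i = 0; i < knownSize; ++i) {
        vec[i] = i * 2;                    // stand-in for someStuff()
        // release: the element write above is visible to any reader
        // that observes the incremented index with an acquire load
        producerIndex.store(i + 1, std::memory_order_release);
    }
}

// each consumer claims the next unread index and waits until it is produced
void consumer() {
    for (;;) {
        int myIndex = consumerIndex.fetch_add(1, std::memory_order_relaxed);
        if (myIndex >= knownSize) return;
        while (myIndex >= producerIndex.load(std::memory_order_acquire)) { /* spin */ }
        // safe to use vec[myIndex] here
    }
}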

It's likely that the array will be cached, so every thread will have its own copy of it in its core's cache. Whenever your producer stores a new value into the array, the cache line for that address is marked dirty, so every other thread that uses the value will have to fetch it again into its own copy in the cache.
That means you will get a lot of cache misses but no race conditions. :)

Related

Adding values to an array and reading them at the same time without mutex or atomic

I wanted to get some practice with async operations, so my goal is to write a short program that reads and writes the same array.
Let's say it is an int array with a thousand indices.
Can I make an async function go ahead and write, say, all 1's or some random number into every index from the start to the end of the array, and then make another async function read that array from start to end?
The reading function would check whether the index it's about to read is 0, and if it is, keep checking until it no longer equals zero. That way there is no need for a mutex or atomic operations.
Is that possible? Thank you!
To me this is the simple way to do it. I saw sources online saying I need a mutex or atomic operations, but as I was writing the program I didn't see the sense in having a mutex/atomics, so I thought of the aforementioned approach and scrapped the old implementation. I only have one source/function manipulating the array, so I don't think it needs a mutex.
TL;DR: Is a mutex or atomic operations needed when there is only one writer and one reader of an array?
https://en.cppreference.com/w/cpp/atomic/memory_order says:
If one evaluation modifies a memory location, and the other reads or modifies the same memory location, and if at least one of the evaluations is not an atomic operation, the behavior of the program is undefined (the program has a data race) unless there exists a happens-before relationship between these two evaluations.
You will need at least one atomic variable or call to std::atomic_thread_fence.
Without this, if you wanted to do something similar to this:

extern int index;
extern int *array;

void try_to_read()
{
    // wait till filled
    while (index != -1)
        ;
    // do something with array[index]
}

then it would not even compile to what you expect: since nothing tells the compiler that index can change concurrently, it is allowed to optimize this while loop into a single check or an infinite loop.
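A minimal sketch of one way to fix this, assuming the writer publishes a single index through a std::atomic<int> (the function names and the -1 sentinel are illustrative, not from the original post):

#include <atomic>

extern int *array;

std::atomic<int> ready_index{-1};   // -1 means "nothing published yet"

// writer: fill the slot first, then publish the index
void publish(int i, int value)
{
    array[i] = value;
    ready_index.store(i, std::memory_order_release);
}

// reader: spin until an index has been published
void try_to_read()
{
    int i;
    while ((i = ready_index.load(std::memory_order_acquire)) == -1)
        ;   // an atomic load cannot be optimized out of the loop
    // safe to use array[i]: the release/acquire pair orders the write before this read
}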

Multiple threads writing into a single buffer concurrently

I have an output container similar to this:

#include <cstddef>
#include <cstring>
#include <mutex>

struct cont {
    std::mutex m;
    size_t offset;
    char* data;

    cont(size_t sizeB) : offset(0) {
        data = new char[sizeB];
    }

    void write(char* data, size_t sizeB) {
        m.lock();
        size_t off = offset;
        offset += sizeB;
        m.unlock();
        std::memcpy(this->data + off, data, sizeB);
    }
};
The idea is that I have many threads, each working on a dynamically sized workload and outputting data in no specific order into that container. The threads are triggered by server access and there is no telling how many are in concurrently or how much they will contribute.
The reason I'm questioning this is because as you can see, the main workload is outside the mutex lock since in theory, only the distribution of the available buffer needs to be synchronized and the threads shouldn't collide after that bit.
It's been working fine so far, but from previous experience threading problems can manifest themselves way down the road, so is this considered a thread-safe practice?
Seems OK. If you want to optimize, you could make the offset atomic, to avoid the mutex altogether. So, just declare
std::atomic<size_t> offset;
and the mutex can be removed.
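A sketch of that variant (it keeps the original's assumption that the buffer is large enough; overflow checking and cleanup are omitted just as in the question):

#include <atomic>
#include <cstddef>
#include <cstring>

struct cont {
    std::atomic<size_t> offset{0};
    char* data;

    cont(size_t sizeB) : data(new char[sizeB]) {}

    void write(const char* src, size_t sizeB) {
        // fetch_add returns the old offset, so each thread gets a private region
        size_t off = offset.fetch_add(sizeB, std::memory_order_relaxed);
        std::memcpy(data + off, src, sizeB);
    }
};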
As it stands, I'm afraid this is incomplete: your solution correctly allocates space between multiple threads, but you also need a solution for threads to "commit" their writes. Imagine that one writer thread is indefinitely delayed in the midst of a memcpy (or even prior to commencing its memcpy). How does any other thread ever find out about this so that you can eventually use this buffer at all?
This seems perfectly safe. You're probably worried about trampling on "leftover" bytes when offset changes by a number that is not a multiple of 4 or 8 bytes. I wanted to alleviate your concerns by quoting the Standard, but the entry for memcpy points to the C library reference, which is scant on details. Nevertheless, the function treats the buffers as arrays of unsigned char, so an implementation cannot assume it may copy a wider, aligned word at an unaligned or incomplete tail, as that could constitute an out-of-bounds access.

Thread-safe Settings

I'm writing some settings classes that can be accessed from everywhere in my multithreaded application. I will read these settings very often (so read access should be fast), but they are not written very often.
For primitive datatypes it looks like boost::atomic offers what I need, so I came up with something like this:
class UInt16Setting
{
private:
    boost::atomic<uint16_t> _Value;

public:
    uint16_t getValue() const { return _Value.load(boost::memory_order_relaxed); }
    void setValue(uint16_t value) { _Value.store(value, boost::memory_order_relaxed); }
};
Question 1: I'm not sure about the memory ordering. I think in my application I don't really care about memory ordering (do I?). I just want to make sure that getValue() always returns a non-corrupted value (either the old or the new one). So are my memory ordering settings correct?
Question 2: Is this approach using boost::atomic recommended for this kind of synchronization? Or are there other constructs that offer better read performance?
I will also need some more complex setting types in my application, like std::string or for example a list of boost::asio::ip::tcp::endpoints. I consider all these setting values as immutable. So once I set the value using setValue(), the value itself (the std::string or the list of endpoints itself) does not change anymore. So again I just want to make sure that I get either the old value or the new value, but not some corrupted state.
Question 3: Does this approach work with boost::atomic<std::string>? If not, what are alternatives?
Question 4: How about more complex setting types like the list of endpoints? Would you recommend something like boost::atomic<boost::shared_ptr<std::vector<boost::asio::ip::tcp::endpoint>>>? If not, what would be better?
Q1: Correct, as long as you don't try to read any shared non-atomic variables after reading the atomic. Memory barriers only synchronize access to non-atomic variables that may happen between atomic operations.
Q2: I don't know (but see below).
Q3: Should work (if it compiles at all). However,
atomic<string>
may not be lock-free.
Q4: Should work but, again, the implementation may not be lock-free (implementing a lock-free shared_ptr is challenging and a patent minefield).
So a readers-writer lock (as Damon suggests in the comments) may be simpler and even more effective if your config includes data larger than one machine word (the size for which native CPU atomics usually work).
[EDIT] However,
atomic<shared_ptr<TheWholeStructContainingAll> >
may make sense even without being lock-free: this approach minimizes the collision probability for readers that need more than one coherent value, though the writer has to make a new copy of the whole "parameter sheet" every time it changes something.
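A rough sketch of that copy-and-publish idea using the standard shared_ptr atomic access functions (C++20's std::atomic<std::shared_ptr> works the same way; the Config type and its fields are made up for illustration):

#include <memory>
#include <string>
#include <vector>

struct Config {                       // hypothetical "whole parameter sheet"
    std::string serverName;
    std::vector<int> ports;
};

std::shared_ptr<const Config> g_config = std::make_shared<Config>();

// readers: grab a consistent snapshot, then use it freely
std::shared_ptr<const Config> currentConfig()
{
    return std::atomic_load(&g_config);
}

// single writer: copy, modify the copy, publish it in one atomic step
void updateServerName(const std::string& name)
{
    auto copy = std::make_shared<Config>(*std::atomic_load(&g_config));
    copy->serverName = name;
    std::atomic_store(&g_config, std::shared_ptr<const Config>(std::move(copy)));
}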
For question 1, the answer is "depends, but probably not". If you really only care that a single value isn't garbled, then yes, this is fine, and you don't care about memory order either.
Usually, though, this is a false premise.
For questions 2, 3, and 4 yes, this will work, but it will likely use locking for complex objects such as string (internally, for every access, without you knowing). Only rather small objects which are roughly the size of one or two pointers can normally be accessed/changed atomically in a lockfree manner. This depends on your platform, too.
It's a big difference whether one successfully updates one or two values atomically. Say you have the values left and right which delimit the left and right boundaries of where a task will do some processing in an array. Assume they are 50 and 100, respectively, and you change them to 101 and 150, each atomically. So the other thread picks up the change from 50 to 101 and starts doing its calculation, sees that 101 > 100, finishes, and writes the result to a file. After that, you change the output file's name, again, atomically.
Everything was atomic (and thus, more expensive than normal), but none of it was useful. The result is still wrong, and was written to the wrong file, too.
This may not be a problem in your particular case, but usually it is (and, your requirements may change in the future). Usually you really want the complete set of changes being atomic.
That said, if you have either many or complex (or, both many and complex) updates like this to do, you might want to use one big (reader-writer) lock for the whole config in the first place anyway, since that is more efficient than acquiring and releasing 20 or 30 locks or doing 50 or 100 atomic operations. Do however note that in any case, locking will severely impact performance.
As pointed out in the comments above, I would preferably make a deep copy of the configuration in the one thread that modifies it, and schedule updates of the reference (shared pointer) used by the consumers as normal tasks. That copy-modify-publish approach is a bit similar to how MVCC databases work, too (these, too, have the problem that locking kills their performance).
Modifying a copy ensures that only readers are accessing any shared state, so no synchronization is necessary either for the readers or for the single writer. Reading and writing are fast.
Swapping the configuration set happens only at well-defined points in time when the set is guaranteed to be in a complete, consistent state and the threads are guaranteed not to be doing something else, so no ugly surprises of any kind can happen.
A typical task-driven application would look somewhat like this (in C++-like pseudocode):
// consumer/worker thread(s)
for (;;)
{
    task = queue.pop();
    switch (task.code)
    {
    case EXIT:
        return;
    case SET_CONFIG:
        my_conf = task.data;
        break;
    default:
        task.func(task.data, &my_conf); // can read without sync
    }
}

// thread that interacts with user (also producer)
for (;;)
{
    input = get_input();
    if (input.action == QUIT)
    {
        queue.push(task(EXIT, 0, 0));
        for (auto& thread : threads)
            thread.join();
        return 0;
    }
    else if (input.action == CHANGE_SETTINGS)
    {
        new_config = new config(config); // copy, readonly operation, no sync
        // assume we have operator[] overloaded
        new_config[...] = ...;           // I own this exclusively, no sync
        task t(SET_CONFIG, 0, shared_ptr<...>(input.data));
        queue.push(t);
    }
    else if (input.action == ADD_TASK)
    {
        task t(RUN, input.func, input.data);
        queue.push(t);
    }
    ...
}
For anything more substantial than a pointer, use a mutex. The TBB (open-source) library supports the concept of reader-writer mutexes, which allow multiple simultaneous readers; see the documentation.
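Since C++17 the standard library has the same concept as std::shared_mutex; a minimal sketch of a settings holder guarded by one (the class and member names are invented):

#include <shared_mutex>
#include <string>

class Settings {
    mutable std::shared_mutex mtx_;
    std::string serverName_;          // example setting
public:
    std::string serverName() const {
        std::shared_lock lock(mtx_);  // many readers may hold this at once
        return serverName_;
    }
    void setServerName(std::string name) {
        std::unique_lock lock(mtx_);  // writers get exclusive access
        serverName_ = std::move(name);
    }
};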

C - faster locking of integer when using PThreads

I have a counter that's used by multiple threads to write to a specific element in an array. Here's what I have so far...
int count = 0;
pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;

void *Foo()
{
    // something = random value from I/O redirection

    pthread_mutex_lock(&count_mutex);
    count = count + 1;
    currentCount = count;
    pthread_mutex_unlock(&count_mutex);

    // do quick assignment operation. array[currentCount] = something
}

int main()
{
    // create n pthreads with the task Foo
}
The problem is that it is ungodly slow. I'm accepting a file of integers as I/O redirection and writing them into an array. It seems like each thread spends a lot of time waiting for the lock to be removed. Is there a faster way to increment the counter?
Note: I need to keep the numbers in order which is why I have to use a counter vs giving each thread a specific chunk of the array to write to.
You need to use interlocking. Check out the Interlocked* functions on Windows, Apple's OSAtomic* functions, or maybe libatomic on Linux.
If you have a compiler that supports C++11 well, you may even be able to use std::atomic.
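A sketch of what that could look like with std::atomic, keeping the requirement that every thread gets the next index in order (the array size and the value written are made up):

#include <atomic>
#include <vector>

std::atomic<int> count{0};
std::vector<int> array(1000000);      // hypothetical output array

void foo(int something)
{
    // fetch_add returns the previous value, so every call gets a unique slot
    int currentCount = count.fetch_add(1, std::memory_order_relaxed);
    array[currentCount] = something;  // distinct elements, no lock needed
}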
Well, one option is to batch up the changes locally somewhere before applying the batch to your protected resource.
For example, have each thread gather ten pieces of information (or less if it runs out before it's gathered ten) then modify Foo to take a length and array - that way, you amortise the cost of the locking, making it much more efficient.
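A sketch of that batching idea, reusing the question's count, array and count_mutex globals (declared extern here; the helper name and batch handling are invented):

#include <pthread.h>

extern int count;                     // shared counter from the question
extern int array[];                   // shared output array
extern pthread_mutex_t count_mutex;

void flush_batch(const int *values, int n)
{
    // take the lock once per batch of ~10 values instead of once per value
    pthread_mutex_lock(&count_mutex);
    int base = count;
    count += n;
    pthread_mutex_unlock(&count_mutex);

    // the reserved slots [base, base + n) belong to this thread alone
    for (int i = 0; i < n; ++i)
        array[base + i] = values[i];
}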
I'd also be very wary of doing:
// do quick assignment operation. array[currentCount] = something
outside the protected area - that's a recipe for disaster since another thread may change currentCount from underneath you. That's not a problem if it's a local variable since each thread will have its own copy but it's not clear from the code what scope that variable has.

thread synchronization - delicate issue

Let's say I have this loop:

static int a;

for (static int i = 0; i < 10; i++)
{
    a++;
    ///// point A
}

Two threads enter this loop...
I'm not sure about something: what will happen if thread 1 gets to POINT A and stays there, while thread 2 goes through the loop 10 times, but after the 10th iteration, having incremented i's value to 10 but before checking whether i is less than 10,
thread 1 gets out of point A and is supposed to increment i and get into the loop again?
What value will thread 1 increment (which i will it see)? Will it be 10 or 0?
Is it possible that thread 1 will increment i to 1, and then thread 2 will get into the loop again 9 times (and then maybe 8, 7, etc.)?
Thanks
You have to realize that an increment operation is effectively really:
read the value
add 1
write the value back
You have to ask yourself, what happens if two of these happen in two independent threads at the same time:
static int a = 0;
thread 1 reads a (0)
adds 1 (value is 1)
thread 2 reads a (0)
adds 1 (value is 1)
thread 1 writes (1)
thread 2 writes (1)
For two simultaneous increments, you can see that it is possible that one of them gets lost because both threads read the pre-incremented value.
The example you gave is complicated by the static loop index, which I didn't notice at first.
Since this is C++ code, static variables are shared by all threads, so there is only one loop-counting variable for all the threads. The sane thing to do would be to use a normal automatic variable, because each thread would then have its own copy and no locking would be required.
That means that while you will lose increments sometimes, you also may gain them because the loop itself may lose count and iterate extra times. All in all, a great example of what not to do.
If i is shared between multiple threads, all bets are off. It's possible for any thread to increment i at essentially any point during another thread's execution (including halfway through that thread's increment operation). There is no meaningful way to reason about the contents of i in the above code. Don't do that. Either give each thread its own copy of i, or make the increment and comparison with 10 a single atomic operation.
It's not really a delicate issue because you would never allow this in real code if the synchronization was going to be an issue.
I'm just going to use i++ in your loop:
for (static int i = 0; i < 10; i++)
{
}
Because it mimics a. (Note, static here is very strange)
Consider if Thread A is suspended just as it reaches i++. Thread B gets i all the way to 9, goes into i++ and makes it 10. If it got to move on, the loop would exit. Ah, but now Thread A is resumed! So it continues where it left off: increment i! So i becomes 11, and your loop is borked.
Any time threads share data, it needs to be protected. You could also make i++ and i < 10 happen atomically (never be interrupted), if your platform supports it.
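As a sketch of the "increment and compare atomically" option with C++11 atomics (a compare-exchange loop; this is an illustration, not code from the question):

#include <atomic>

std::atomic<int> i{0};

// Atomically performs "if (i < 10) ++i" and reports whether it succeeded.
bool claim_iteration()
{
    int current = i.load();
    while (current < 10)
    {
        // compare_exchange_weak reloads `current` with the latest value on failure
        if (i.compare_exchange_weak(current, current + 1))
            return true;   // we incremented i from current to current + 1
    }
    return false;          // i already reached 10
}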
You should use mutual exclusion to solve this problem.
And that is why, in a multi-threaded environment, we are supposed to use locks.
In your case, you should write:
bool test_increment(int& i)
{
    lock();
    ++i;
    bool result = i < 10;
    unlock();
    return result;
}

static int a;

for (static int i = -1; test_increment(i); )
{
    ++a;
    // Point A
}

Now the problem disappears. Note that lock() and unlock() are supposed to lock and unlock a mutex common to all threads trying to access i!
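One concrete way to write those lock()/unlock() placeholders, as a sketch with a std::mutex shared by all the threads:

#include <mutex>

static std::mutex i_mutex;                       // shared by every thread touching i

bool test_increment(int& i)
{
    std::lock_guard<std::mutex> guard(i_mutex);  // locks here, unlocks on return
    ++i;
    return i < 10;
}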
Yes, it's possible that either thread can do the majority of the work in that loop. But as Dynite explained, this would (and should) never show up in real code. If synchronization is an issue, you should provide mutual exclusion (a Boost, pthreads, or Windows threads mutex) to prevent race conditions such as this.
Why would you use a static loop counter?
This smells like homework, and a bad one at that.
Both the threads have their own copy of i, so the behavior can be anything at all. That's part of why it's such a problem.
When you use a mutex or critical section the threads will generally sync up, but even that is not absolutely guaranteed if the variable is not volatile.
And someone will no doubt point out "volatile has no use in multithreading!" but people say lots of stupid things. You don't have to have volatile but it is helpful for some things.
If your "int" is not the atomic machine word size (think 64 bit address + data emulating a 32-bit VM) you will "word-tear". In that case your "int" is 32 bits, but the machine addresses 64 atomically. Now you have to read all 64, increment half, and write them all back.
This is a much larger issue; bone up on processor instruction sets, and grep gcc for how it implements "volatile" everywhere if you really want the gory details.
Add "volatile" and see how the machine code changes. If you aren't looking down at the chip registers, please just use boost libraries and be done with it.
If you need to increment a value from multiple threads at the same time, then look up "atomic operations". For Linux, look up "gcc atomic operations". There is hardware support on most platforms to atomically increment, add, compare-and-swap, and more. LOCKING WOULD BE OVERKILL for this... an atomic increment is orders of magnitude faster than lock/increment/unlock. If you have to change a lot of fields at the same time you may need a lock, although you can change up to 128 bits' worth of fields at a time with some atomic operations.
volatile is not the same as an atomic operation. volatile helps the compiler know when it's a bad idea to use a cached copy of a variable. Among its uses, volatile is important when you have multiple threads changing data that you would like to read the most up-to-date version of without locking. volatile will still not fix your a++ problem, because two threads can read the value of a at the same time, both increment that same value, and then the last one to write a wins and you lose an increment. volatile will also slow down optimized code by preventing the compiler from holding values in registers and the like.