Concurrent/Asynchronous access to shared data - concurrency

I searched around a bit, but could not find anything useful. Could someone help me with this concurrency/synchronization problem?
Given five instances of the program below running asynchronously, with s being a shared variable with an initial value of 0 and i a local variable, which values can s end up with?
for (i = 0; i < 5; i++) {
    s = s + 1;
}
(1) 2
(2) 1
(3) 6
I would like to know which values, and why exactly.

The non-answering answer is: Uaaaagh, don't do such a thing.
An answer more in the sense of your question is: In principle, any value is possible, because it is totally undefined. You have no strict guarantee that concurrent writes are atomic in any way and don't result in complete garbage.
In practice, writes of less than machine-word size are atomic pretty much everywhere (as far as I know, at least), but they do not happen in a defined order. Also, you usually don't know in which order threads/processes are scheduled. So you will never see a "random garbage" value, but you also cannot know which value you will get. It will be anything from 5 up to 25.
Since no atomic increment is used, there is a race between reading the value, incrementing it, and writing it back. If another instance writes the value after this one has read it but before it writes its result back, the write (and thus the increment) that finished earlier has no effect. If that does not happen, both increments take effect.
Nevertheless, each instance increments the value at least 5 times, so apart from the theoretical "total garbage" possibility, there is no way a value less than 5 could result in the end. Therefore (1) and (2) are not possible, but (3) is.
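For completeness: if the increments actually need to be correct, they have to be made atomic. The question does not name a language, so the C++11 std::atomic used in this sketch is an assumption:

#include <atomic>

std::atomic<int> s{0};   // the shared counter

// Each of the five instances runs this loop. fetch_add is an atomic
// read-modify-write, so no increment is lost and the final value is 25.
void instance()
{
    for (int i = 0; i < 5; i++) {
        s.fetch_add(1);
    }
}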

Related

Large performance difference between comparing a variable to a fixed value and reading or writing from mapped memory address

I'm developing software that runs on a DE10 board, on an ARM Cortex-A9 processor.
This software has to access physical memory addresses in order to communicate with the FPGA in the DE10, and this is done by mapping /dev/mem; this method is described here.
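(The mapping itself is the usual mmap-of-/dev/mem approach, roughly like the sketch below; the base address and span here are placeholders, not necessarily my actual values.)

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>

// Sketch of the /dev/mem mapping; 0xFF200000 and 0x1000 are placeholders,
// not necessarily the real bridge base/span of the design.
volatile uint8_t *map_bridge()
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return nullptr;
    void *base = mmap(nullptr, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0xFF200000);
    return (base == MAP_FAILED) ? nullptr
                                : static_cast<volatile uint8_t *>(base);
}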
I have a situation where I have to select to which of 4 addresses to send some values, and this could be done in one of 2 ways (both sketched below):
Using an if statement to check an integer variable (which is always 0 or 1 at that part of the loop) and only writing when it's 1.
Multiplying the values that should be sent by the aforementioned variable and writing to all addresses without any conditional, because writing zero doesn't have any effect on my system.
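A rough sketch of the two variants (the names here are just stand-ins for my actual variables):

#include <stdint.h>

// Variant 1: check the flag and write only when it is 1.
void send_conditional(volatile uint8_t *address, int flag, uint8_t value)
{
    if (flag == 1)
        *address = value;
}

// Variant 2: always write, but scale the value by the flag, so a flag of 0
// writes 0, which has no effect on this system.
void send_scaled(volatile uint8_t *address, int flag, uint8_t value)
{
    *address = static_cast<uint8_t>(value * flag);
}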
I was curious about which would be faster, so I tried this:
First, I made this loop:
int test=0;
for(int i=0;i<1000000;i++)
{
    if(test==9)
    {
        test=15;
    }
    test++;
    if(test==9)
    {
        test=0;
    }
}
The first if statement should never be satisfied, so its only contribution to the time taken in the loop is from its comparison itself.
The increment and the second if statement are just things I added in an attempt to prevent the compiler from just "optimizing out" the first if statement.
This loop is run once without being benchmarked (just in case there's any frequency scaling ramp, although I'm pretty sure it has none) and then it's run once again being benchmarked, and it takes around 18350 μs to complete.
Without the first if statement, it takes around 17260 μs.
Now, if I replace that first if statement with a line that sets the value of a memory-mapped address to the value of the integer test, like this:
for(int i=0;i<1000000;i++)
{
    *(uint8_t*)address=test;
    test++;
    if(test==9)
    {
        test=0;
    }
}
This loop takes around 253600 μs to complete, almost 14× slower.
Reading that address instead of writing to it barely changes anything.
Is this difference real, or is there some kind of compiler optimization possibly frustrating my benchmark?
Should I expect this difference in performance (and thus favor the comparison method) in the actual software?

C++ Is writing and reading a variable from different threads undefined behavior [duplicate]

This question already has answers here:
c++ what happens when in one thread write and in second read the same object? (is it safe?) [duplicate]
(4 answers)
Closed 2 years ago.
Since I started multi-threading, I've been asking myself this one question:
Is writing and reading a variable from different threads undefined behavior?
Let's use the minimal example where we increment an integer in a thread and read the integer inside another one.
void thread1()
{
    x++;
}
void thread2()
{
    if (x == 5)
    {
        //doSomething
    }
}
I understand that the addition operation is not atomic and therefore I could make a read from the second thread while the first thread is in the middle of the adding operation, but there is something I'm not quite sure of.
Does x keep its value until the whole addition operation is completed and only then get assigned the new value, or does x have an intermediate state where reading from it would result in undefined behavior?
If the first theory applies, then reading from x while it's being written to would simply return the value before the addition and wouldn't be so problematic.
If the second theory is true, could someone explain in more detail what the process of the addition operation is and why it would be undefined behavior (maybe with an example)?
Thanks
The comments already got the basics right.
The compiler, when compiling a single function, may consider the ways in which a variable is changed. If the function cannot directly or indirectly change a certain variable, then the compiler may assume that there is no change to that variable whatsoever, unless there's thread synchronization. In that case the compiler must deal with the possibility of another thread changing those variables.
If the compiler assumption is violated (i.e. you have a bug), then literally anything may happen. This is not constrained, because that would severely restrict optimizers. You may make some assumptions that x has some unique address in memory, but optimizers are known to move variables around and have multiple variables share a single address (just at different times). Such optimizations may very well be justified based on a single-thread assumption, one that your example is violating. Your second thread may think it's looking at x, but it might also be getting y.
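A classic illustration of that single-thread assumption (a hypothetical sketch, not code from the question):

int x = 0;   // plain int, written by another thread, with no synchronization

// The intent is to wait until another thread sets x to 5. Because nothing in
// this function (or anything it calls) modifies x, the optimizer may legally
// read x once and turn the loop into "if (x != 5) loop forever" - a direct
// consequence of the single-thread assumption described above.
void wait_for_five()
{
    while (x != 5)
    {
    }
}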
x (a 32-bit variable) will always hold a defined value on a 32-bit-or-wider CPU, but not a precisely predictable one. You only know that x can be any value from the starting value up to the end of the range produced by the ++ calls.
For example: if x is initialized to 0 and thread1 is called 5 times, thread2 can see x anywhere in the range from 0 to 5.
In that sense the assignment of an integer to memory can be considered atomic.
There are also reasons why x seen by the two threads is not synchronized, e.g. while x on thread1 is already 5, thread2 may still see 0 at the same time.
One of the reasons is the CPU cache, which is separate for each core. To synchronise the value between the caches you have to use memory barriers. You can, for example, use std::atomic, which does that job for you.
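For example, a minimal sketch of the original example rewritten with std::atomic (assuming C++11 or later):

#include <atomic>

std::atomic<int> x{0};

void thread1()
{
    ++x;         // atomic read-modify-write: no data race, no lost increment
}

void thread2()
{
    if (x == 5)  // atomic load: reading while the other thread writes is well-defined
    {
        // doSomething
    }
}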

How is 'decreases' in JML defined?

The expression after decreases has to get strictly smaller in each loop iteration and always be non-zero. But does it have to reach 0? Does it have to get smaller by one?
As stated in the JML documentation, decreases (you can also write decreasing) means that an int or long expression with that specifier "must be no less than 0 when the loop is executing, and must decrease by at least one (1) each time around the loop."
So it may or may not reach 0, but can't get smaller than that. Also, it has to get smaller by at least, but not necessarily exactly one. Note the example in the documentation for a more precise explanation.

Is adding 1 to a number repeatedly slower than adding everything at the same time in C++? [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
If I have a number a, would it be slower to add 1 to it b times rather than simply adding a + b?
a += b;
or
for (int i = 0; i < b; i++) {
    a += 1;
}
I realize that the second example seems kind of silly, but I have a situation where coding would actually be easier that way, and I am wondering if that would impact performance.
EDIT: Thank you for all your answers. It looks like some posters would like to know what situation I have. I am trying to write a function to shift an inputted character a certain number of characters over (i.e. a cipher) if it is a letter. So, I want to say that one char += the number of shifts, but I also need to account for the jumps between the lowercase characters and uppercase characters on the ASCII table, and also wrapping from z back to A. So, while it is doable in another way (roughly the sketch below), I thought it would be easiest to keep adding one until I get to the end of a block of letter characters, then jump to the next one and keep going.
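For concreteness, the "add everything at once" alternative would look roughly like this sketch; the helper name and the exact wrap-around rules (A..Z then a..z, with 'z' wrapping back to 'A', non-negative shift) are just my illustration:

#include <cctype>

char shift_letter(char c, int shift)
{
    if (!std::isalpha(static_cast<unsigned char>(c)))
        return c;
    // Map A..Z to 0..25 and a..z to 26..51, add the shift modulo 52, map back.
    int index = std::isupper(static_cast<unsigned char>(c)) ? c - 'A'
                                                            : c - 'a' + 26;
    index = (index + shift) % 52;
    return index < 26 ? static_cast<char>('A' + index)
                      : static_cast<char>('a' + (index - 26));
}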
If your loop is really that simple, I don't see any reason why a compiler couldn't optimize it. I have no idea whether any actually would, though. If your compiler doesn't, the single addition will be much faster than the loop.
The language C++ does not describe how long either of those operations take. Compilers are free to turn your first statement into the second, and that is a legal way to compile it.
In practice, many compilers would treat those two snippets as the same expression, assuming everything is of type int. The second, however, would be fragile, in that seemingly innocuous changes could cause massive performance degradation: small changes in type that 'should not matter', extra statements nearby, etc.
It would be extremely rare for the first to be slower than the second, but if the type of a was such that += b was a much slower operation than calling += 1 a bunch of times, it could be. For example:
#include <vector>

struct A {
    std::vector<int> v;
    void operator+=( int x ) {
        // optimize for the common case of adding 1:
        if (x==1 && v.size()==v.capacity()) v.reserve( v.size()*2 );
        // otherwise grow the buffer one element at a time (deliberately slow for large x):
        for (int i = 0; i < x; ++i)
            v.reserve( v.size()+1 );
        v.resize( v.size()+1 );
    }
};
then A a; int b = 100000; a+=b; would take much longer than the loop construct.
But I had to work at it.
The overhead (in CPU instructions) of incrementing a variable in a loop is likely to be insignificant compared to the total number of instructions in that loop (unless the only thing you are doing in the loop is incrementing). Loop variables are likely to remain in the low levels of the CPU cache (if not in CPU registers) and are very fast to increment, since they don't need to be read from RAM over the front-side bus. Anyway, if in doubt just make a quick profile and you'll know whether it makes sense to sacrifice code readability for speed.
Yes, absolutely slower. The second example is beyond silly. I highly doubt you have a situation where it would make sense to do it that way.
Let's say 'b' is 500,000... most computers can add that in a single operation, so why do 500,000 operations (not including the loop overhead)?
If the processor has an increment instruction, the compiler will usually translate the "add one" operation into an increment instruction.
Some processors may have optimized increment instructions to help speed up things like loops. Other processors can combine an increment operation with a load or store instruction.
There is a possibility that a small loop containing only an increment instruction could be replaced by a multiply and add. The compiler is allowed to do so, if and only if the functionality is the same.
This kind of optimization generally produces negligible gains. However, for large data sets and performance-critical applications this kind of operation may be necessary, and the time gained would be significant.
Edit 1:
For adding values other than 1, the compiler would emit processor instructions to use the best addition operations.
The add operation is handled in hardware as a different animal from incrementing. Arithmetic Logic Units (ALUs) have been around for a long time, and the basic addition operation is heavily optimized and a lot faster than incrementing in a loop.

atomic_compare_exchange with greater-than instead of equals?

C++11 has a 'compare and exchange' operation for atomic variables.
The semantics are:
Atomically compares the value pointed to by obj with the value pointed to by expected, and if those are equal, replaces the former with desired (performs read-modify-write operation). Otherwise, loads the actual value pointed to by obj into *expected (performs load operation).
I want to do the same, but instead of setting *obj when the values are equal, I want it to be set when one is greater-than the other (assume we're talking about an ordered type).
Is this supported somehow? Achievable by some hack perhaps?
Note: A CAS loop will not do for me, since both the values I'm comparing might change between non-atomic operations.
I think you misunderstand how compare and swap/exchange works: the basic idea is that, having looked at the current value, you can work out a corresponding new value and attempt that update. If it succeeds, great, continue with whatever you need to do; but if it fails, start all over again: look at the new value some other thread has put there and think about the value you would consequently need now.
I want it to be set when one is greater-than the other (assume we're talking about an ordered type).
So say you want to store 11 but only if the existing value's still atomically less than 11. You won't find an instruction to do that directly, but you can easily do it with the existing compare and swap:
int target_value = 11;
int snapped_x;
do {
    snapped_x = x;
    if (snapped_x >= target_value)
        break;                       // ...or whatever you want to do instead
} while (!compare_and_swap(x, snapped_x, target_value));
// ...or whatever your exact calling convention is...
You still get the behaviour you want, just with a potentially higher failure/spin rate....
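In C++11 terms, that loop amounts to a "store if greater" built from compare_exchange_weak; a minimal sketch (assuming the variable is a std::atomic<int>):

#include <atomic>

// Sets x to target only if target is greater than the current value.
// Returns true if the store happened.
bool store_if_greater(std::atomic<int> &x, int target)
{
    int snapped = x.load();
    while (snapped < target)
    {
        // On failure, compare_exchange_weak reloads snapped with the current value.
        if (x.compare_exchange_weak(snapped, target))
            return true;
    }
    return false;
}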
As requested, here's my comment as an answer:
I, too, wish this existed, but it does not, as far as I know (certainly not for x86/x64), apart from conceptually, of course, and workarounds that (potentially) use more than a single atomic instruction (which work but are not wait-free).
This may be an old question, but I think many people will want this kind of feature.
I came up with an idea; here is the pseudo code (I am a Linux kernel person, so I use some kernel functions).
void update(int new)
{
    int old, v;

    old = atomic_read(&pvalue);
    while (old < new) {
        v = atomic_cmpxchg(&pvalue, old, new);
        if (v == old)
            break;      /* cmpxchg succeeded: pvalue now holds new */
        old = v;        /* lost the race: retry with the freshly read value */
    }
}
The code doesn't issue a cmpxchg when the old value is already greater than or equal to the new value.
If there are concurrency issues, please tell me. Thanks:)