So I'm reading through a book, and in a chapter that goes over multithreading and concurrency there's an exercise whose results don't really make sense to me.
I'm supposed to create 3 functions with a parameter x that simply calculate x * x: one using a mutex, one using atomic types, and one using neither, and create 3 global variables holding the values.
The first two functions will prevent race conditions but the third might not.
After that I create N threads and loop through them, telling each thread to run its x function (3 separate loops, one for each function, so I'm creating N threads 3 times).
Now the book tells me that with functions 1 and 2 I should always get the correct answer, but with function 3 I won't always get the right one. However, I am always getting the right answer for all of them. I assume this is because all I'm calculating is x * x.
As an example, when N=3, the correct value is 0 * 0 + 1 * 1 + 2 * 2 = 5.
This is the atomic function:
void squareAtomic(int x) // accumAtomic is the global std::atomic<int> accumulator
{
accumAtomic += x * x;
}
And this is how I call the function
thread threadsAtomic[N];
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
threadsAtomic[i] = thread(squareAtomic, i);
}
for (int i = 0; i < N; i++)
{
threadsAtomic[i].join();
}
This is the function that should sometimes create race conditions:
void squareNormal(int x)
{
accumNormal += x * x;
}
Here's how I call that:
thread threadsNormal[N];
for (int i = 0; i < N; i++) //i will be the current thread that represents x
{
threadsNormal[i] = thread(squareNormal, i);
}
for (int i = 0; i < N; i++)
{
threadsNormal[i].join();
}
This is all my own code so I might not be doing this question correctly, and in that case I apologize.
One problem with race conditions (and with undefined behavior in general) is that their presence doesn't guarantee that your program will behave incorrectly. Rather, undefined behavior only voids the guarantee that your program will behave according to rules of the C++ language spec. That can make undefined behavior very difficult to detect via empirical testing. (Every multithreading-programmer's worst nightmare is the bug that was never seen once during the program's intensive three-month testing period, and only appears in the form of a mysterious crash during the big on-stage demo in front of a live audience)
In this case your racy program's race condition comes in the form of multiple threads reading and writing accumNormal simultaneously; in particular, you might get an incorrect result if thread A reads the value of accumNormal, and then thread B writes a new value to accumNormal, and then thread A writes a new value to accumNormal, overwriting thread B's value.
If you want to be able to demonstrate to yourself that race conditions really can cause incorrect results, you'd want to write a program where multiple threads hammer on the same shared variable for a long time. For example, you might have half the threads increment the variable 1 million times, while the other half decrement the variable 1 million times, and then check afterwards (i.e. after joining all the threads) to see if the final value is zero (which is what you would expect it to be), and if not, run the test again, and let that test run all night if necessary. (and even that might not be enough to detect incorrect behavior, e.g. if you are running on hardware where increments and decrements are implemented in such a way that they "just happen to work" for this use case)
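For illustration, here is roughly what such a stress test could look like (just a sketch; the thread counts, iteration counts and names are arbitrary):
#include <iostream>
#include <thread>
#include <vector>

long long sharedValue = 0; // deliberately unsynchronized, so this program contains a data race

void hammer(int delta)
{
    for (int i = 0; i < 1000000; i++)
        sharedValue += delta; // racy read-modify-write
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; i++) threads.emplace_back(hammer, +1); // the incrementing half
    for (int i = 0; i < 4; i++) threads.emplace_back(hammer, -1); // the decrementing half
    for (std::thread& t : threads) t.join();
    std::cout << "final value (expected 0): " << sharedValue << std::endl;
    return 0;
}
On most machines this prints a nonzero value within a run or two - though, as noted above, even a long streak of clean runs wouldn't prove the code is correct.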
Related
My prof once said that if-statements are rather slow and should be avoided as much as possible. I'm making a game in OpenGL, where I need a lot of them.
In my tests replacing an if-statement with AND via short-circuiting worked, but is it faster?
#include <cstdlib>
#include <iostream>

bool doSomething();
int main()
{
int randomNumber = std::rand() % 10;
randomNumber == 5 && doSomething();
return 0;
}
bool doSomething()
{
std::cout << "function executed" << std::endl;
return true;
}
My intention is to use this inside the draw function of my renderer. My models are supposed to have flags, if a flag is true, a certain function should execute.
if-statements are rather slow and should be avoided as much as possible.
This is wrong and/or misleading. Most simplified statements about slowness of a program are wrong. There's probably something wrong with this answer too.
C++ statements don't have a speed that can be attributed to them. It's the speed of the compiled program that matters. And that consists of assembly language instructions; not of C++ statements.
What would probably be more correct is to say that branch instructions can be relatively slow (on modern, superscalar CPU architectures) (when the branch cannot be predicted well) (depending on what you are comparing to; there are many things that are much more expensive).
randomNumber == 5 && doSomething();
An if-statement is often compiled into a program that uses a branch instruction. A short-circuiting logical-and operation is also often compiled into a program that uses a branch instruction. Replacing if-statement with a logical-and operator is not a magic bullet that makes the program faster.
If you were to compare the program produced by the logical-and and the corresponding program where it is replaced with if (randomNumber == 5), you would find that the optimiser sees through your trick and produces the same assembly in both cases.
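For comparison, the plain if version of that line is simply:
if (randomNumber == 5) doSomething();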
My models are supposed to have flags, if a flag is true, a certain function should execute.
In order to avoid the branch, you must change the premise. Instead of iterating through a sequence of all models, checking the flag, and conditionally calling a function, you could create a sequence of all models for which the function should be called, iterate that, and call the function unconditionally -> no branching. Is this alternative faster? There is certainly some overhead in maintaining the data structure, and the branch predictor may have made it unnecessary anyway. The only way to know for sure is to measure the program.
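A rough sketch of that idea (Model, flag and specialDraw are made-up names, just to show the shape of the code, not anything from the question):
#include <vector>

struct Model { bool flag = false; /* other per-model data */ };

void specialDraw(Model&) { /* the work that used to be guarded by the flag */ }

// Instead of looping over every model and testing its flag,
// keep a separate list containing only the models whose flag is set...
void drawFlagged(std::vector<Model*>& flaggedModels)
{
    for (Model* m : flaggedModels)
        specialDraw(*m); // ...and call the function unconditionally - no per-model branch
}
The list has to be kept up to date whenever a flag changes, which is exactly the maintenance overhead mentioned above.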
I agree with the comments above that in almost all practical cases, it's OK to use ifs as much as you need without hesitation.
I also agree that it is not an issue important enough for a beginner to waste energy optimizing, and that using logical operators will likely emit code similar to ifs.
However - there is a valid issue here related to branching in general, so those who are interested are welcome to read on.
Modern CPUs use what we call Instruction pipelining.
Without getting too deep into the technical details:
Within each CPU core there is a level of parallelism.
Each assembly instruction is composed of several stages, and while the current instruction is executed, the next instructions are prepared to a certain degree.
This is called instruction pipelining.
This pipelining is disrupted by any kind of branching in general, and by conditionals (ifs) in particular.
It's true that there is a mechanism of branch prediction, but it works only to some extent.
So although in most cases ifs are totally OK, there are cases where this should be taken into account.
As always when it comes to optimizations, one should carefully profile.
Take the following piece of code as an example (similar things are common in image processing and other implementations):
unsigned char * pData = ...; // get data from somewhere
int dataSize = 100000000; // something big
bool cond = ...; // initialize some condition for relevant for all data
for (int i = 0; i < dataSize; ++i, ++pData)
{
if (cond)
{
*pData = 2; // imagine some small calculation
}
else
{
*pData = 3; // imagine some other small calculation
}
}
It might be better to do it like this (even though it contains duplication, which is evil from a software engineering point of view):
if (cond)
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 2; // imagine some small calculation
}
}
else
{
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = 3; // imagine some other small calculation
}
}
We still have an if, but now it branches at most once.
In certain [rare] cases (profiling required, as mentioned above) it can be even more efficient to do something like this:
for (int i = 0; i < dataSize; ++i, ++pData)
{
*pData = (2 * cond + 3 * (!cond));
}
I know it's not common, but I encountered specific HW some years ago on which the cost of 2 multiplications and 1 addition with a negation was less than the cost of branching (due to the instruction pipeline being reset). Also, this "trick" supports using different condition values for different parts of the data.
Bottom line: ifs are usually OK, but it's good to be aware that sometimes there is a cost.
I want to know if there is any difference between std::atomic<int> and int if we are just doing load and store. I am not concerned about memory ordering. For example, consider the code below:
#include <thread>

using namespace std;

int x{1};
void f(int myid) {
while(1){
while(x!= myid){}
//cout<<"thread : "<< myid<<"\n";
//this_thread::sleep_for(std::chrono::duration(3s));
x = (x % 3) + 1;
}
}
int main(){
thread t[3]; // thread handles; using a different name so the global x is not shadowed
for(int i=0;i<3;i++){
t[i] = thread(f,i+1);
}
for(int i=0;i<3;i++){
t[i].join();
}
}
Now the output (if you uncomment the cout) will be
Thread :1
Thread :2
Thread :3
...
I want to know if there is any benefit in changing the int x to atomic<int> x?
Consider your code:
void f(int myid) {
while(1){
while(x!= myid){}
//cout<<"thread : "<< myid<<"\n";
//this_thread::sleep_for(std::chrono::duration(3s));
x = (x % 3) + 1;
}
}
If the program didn't have undefined behaviour, you could expect that when f was called, x would be read from memory at least once. But having done that, the compiler has no reason to think that any changes to x will happen outside the function, or that any changes to x made within the function need to be visible outside the function until after the function returns. So it's entitled to read x into a CPU register once and keep comparing that register value to myid - which means the inner loop will either pass through instantly or be stuck forever.
Then, compilers are allowed to assume they'll make progress (see Forward Progress in the C++ Standard), so they could conclude that, because the inner loop would never make progress while x != myid, the cached x must already be equal to myid, and remove the inner while loop. Similarly, the outer loop - simplified to while (1) x = (x % 3) + 1; where x might be a register - makes no observable progress and could also be eliminated. Or, the compiler could leave the loop but remove the seemingly pointless operations on x.
Putting your code into the online Godbolt compiler explorer and compiling with GCC trunk at -O3 optimisation, f(int) code is:
f(int):
.L2:
jmp .L2
If you make x atomic, then the compiler can't simply use a register while accessing/modifying it, and assume that there will be a good time to update it before the function returns. It will actually have to modify the variable in memory and propagate that change so other threads can read the updated value.
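As a sketch, keeping the question's structure, the atomic version could look like this:
#include <atomic>

std::atomic<int> x{1};

void f(int myid) {
    while (1) {
        while (x != myid) {} // each check is now a real atomic load, not a value cached in a register
        x = (x % 3) + 1;     // and this is a real atomic store that the other threads will observe
    }
}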
I want to know if there is any benefit in changing the int x to atomic<int> x?
You could say that. Turning int into atomic<int> in your example will turn your program from incorrect to correct (*).
Accessing the same int from multiple threads at the same time (without any form of access synchronization) is Undefined Behavior.
*) Well, the program might still be incorrect, but at least it avoids this particular problem.
A question from a job-interview
int count = 0;
void func1()
{
for ( int i =0 ; i < 10; ++i )
count = count + 1;
}
void func2()
{
for ( int i =0 ; i < 10; ++i )
count++;
}
void func3()
{
for ( int i =0 ; i < 10; ++i )
++count;
}
int main()
{
thread(func1);
thread(func2);
thread(func3);
//joining all the threads
return 0;
}
The question is: what's the range of values count might theoretically take? The upper bound apparently is 30, but what's the lower one? They told me it's 10, but I'm not sure about it. Otherwise, why do we need memory barriers?
So, what's the lower bound of the range?
It's undefined behavior, so count could take on any value imaginable. Or the program could crash.
James Kanze's answer is the right one for all practical purposes, but in this particular case, if the code is exactly as written and the thread used here is std::thread from C++11, the behavior is actually defined.
In particular, thread(func1); will start a thread running func1. Then, at the end of the expression, the temporary thread object will be destroyed, without join or detach having been called on it. So the thread is still joinable, and the standard defines that in such a case, the destructor calls std::terminate. (See [thread.thread.destr]: "If joinable() then terminate(), otherwise no effects.") So your program aborts.
Because this happens before the second thread is even started, there is no actual race condition - the first thread is the only one that ever touches count, if it even gets that far.
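For reference, giving the threads names and joining them (which is presumably what the //joining all the threads comment intended) avoids the terminate() call; a sketch, reusing the question's func1/func2/func3:
#include <thread>

int main()
{
    std::thread t1(func1);
    std::thread t2(func2);
    std::thread t3(func3);
    t1.join(); // each thread object is joined before it is destroyed,
    t2.join(); // so the destructor never runs on a joinable thread
    t3.join();
    return 0;
}
Of course, once the three threads really do run concurrently, the data race on count - and the undefined behaviour described in the other answers - is back.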
Starting with the easy part, the obvious upper bound is 30 since, if everything goes right, you have 3 calls to functions; each capable of incrementing count 10 times. Overall: 3*10=30.
As to the lower bound, they are correct, and this is why: the worst-case scenario is that each time one thread tries to increment count, the other threads are doing so at exactly the same time. Keep in mind that ++count actually is the following pseudocode:
count_temp = count;
count_temp = count_temp+1;
count = count_temp;
It should be obvious that if they all perform the same code at the same time, you have only 10 real increments, since they all read the same initial value of count and all write back the same added value.
First of all, I'd like to thank you guys for giving me reason me to read the standard in depth. I would not be able to continue this debate otherwise.
The standard states quite clearly in section 1.10 clause 21: The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
However, the term undefined behavior is also defined in the standard, section 1.3.24: behavior for which this International Standard imposes no requirements... Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment...
Taking Sebastian's answer regarding std::terminate into account, and working under the assumption that these threads will not throw an exception and thereby cause premature termination: while the standard doesn't define the result, it is fairly evident what it may be because of the simplicity of the algorithm. In other words, while the 100% accurate answer would be that the result is undefined, I still maintain that the range of possible outcomes is well defined and is 10-30 due to the characteristics of the environment.
BTW - I really wanted to make this a comment instead of another answer, however it was too long
This is an interview question.
class X
{
int i = 0 ;
public:
Class *foo()
{
for ( ; i < 1000 ; ++i )
{
// some code but do not change value of i
}
}
}
int main()
{
X myX ;
Thread t1 = create_thread( myX.foo() ) ;
Thread t2 = create_thread( myX.foo() ) ;
Start t1 ...
Start t2 ...
join(t1)
joint(t2)
}
Q1: if the code run on 1-cpu processor, how many times can the for-loop run in worst case?
Q2: what if the code run on 2-cpu processor, how many times can the for-loop run in worst case?
My ideas:
The loop may run infinite times, because a thread can run it many times before the other thread updates the value of i.
Or, when t1 is suspended, t2 runs 1000 times and then we have 1000 x 1000 times ?
Is this correct?
create_thread( myX.foo() ) calls create_thread with the return value of myX.foo(). myX.foo() is run on the main thread, so myX.i will eventually have a value of 1000 (which is the value it has after two calls to myX.foo()).
If the code was actually meant to run myX.foo() twice on a two different threads concurrently, then the code would have undefined behaviour (due to the race condition in the access to myX.i). So yes, the loop could run an infinite number of times (or zero times, or the program could decide to get up and eat a bagel).
This is a bad interview question if the code is transcribed accurately.
class X
{
int i = 0;
This notation is not valid C++. G++ says:
3:13: error: ISO C++ forbids initialization of member ‘i’ [-fpermissive]
3:13: error: making ‘i’ static [-fpermissive]
3:13: error: ISO C++ forbids in-class initialization of non-const static member ‘i’
We'll ignore this, assuming that the code was written as something more like:
class X
{
int i;
public:
X() : i(0) { }
The original code continues:
public:
Class *foo()
{
for ( ; i < 1000 ; ++i )
{
// some code but do not change value of i
}
return 0; // Added to remove undefined behaviour
}
}
It is not clear what a Class * is - the type Class is unspecified in the example.
int main()
{
X myX;
Thread t1 = create_thread( myX.foo() );
Since foo() is called here and its return value is passed to create_thread(), the loop will be executed 1000 times here - it matters not whether it is a multi-core system. After the loops are done, the return value is passed to create_thread().
Since we don't have a specification for create_thread(), it is not possible to predict what it will do with the Class * that is returned from myX.foo(), any more than it is possible to tell how myX.foo() actually generates an appropriate Class * or what a Class object is capable of doing. The chances are that the null pointer will cause problems - however, for the sake of the question, we'll assume that the Class * is valid and a new thread is created and placed on hold waiting for the 'start' operation to let it run.
Thread t2 = create_thread( myX.foo() );
Here we have to make some assumptions. We may assume that the Class * returned by myX.foo() does not give access to the member variable i that is in myX. Therefore, even if thread t1 is running before t2 is created, there is no interference from t1 in the value of myX, and when the main thread executes this statement, the loop will execute 0 more times. The result from myX.foo() will be used to create thread t2, which cannot interfere with i in myX any more either. We'll discuss variations on these assumptions below.
Start t1 ...
Start t2 ...
The threads are allowed to run; they do whatever is implied by the Class * returned from myX.foo(). But the threads can neither reference nor (therefore) modify myX; they have not been given access to it unless the Class * somehow provides that access.
join(t1)
joint(t2)
The threads complete...
}
So, the body of the loop executes 1000 times before t1 is created, and is executed an additional 0 times before t2 is created. And it does not matter whether it is a single-core or multi-core machine.
Indeed, even if you assume that the Class * gives the thread access to the i and you assume that t1 starts running immediately (possibly before create_thread() returns to the main thread), as long as it does not modify i, the behaviour is guaranteed to be '1000 and 0 times'.
Clearly, if t1 starts running when create_thread() is called and modifies the i in myX, then the behaviour is indeterminate. However, while the threads are in suspended animation until the 'start' operations, there is no indeterminacy and '1000 and 0 times' remains the correct answer.
Alternative Scenario
If the create_thread() calls have been misremembered and the code was:
Thread t1 = create_thread(myX.foo);
Thread t2 = create_thread(myX.foo);
where a pointer to member function is being passed to create_thread(), then the answer is quite different. Now the function is not executed until the threads are started, and the answer is indeterminate whether there is one CPU or several on the machine. It comes down to thread scheduling issues and also depends on how the code is optimized. Almost any answer between 1000 and 2000 is plausible.
Under sufficiently weird circumstances, the answer might even be larger. For example, suppose t2 executed and read i as 0, then got suspended to let t1 run; t1 processes iterations 0..900, and then writes back i, and transfers control to t2, which increments its internal copy of i to 1 and writes this back, then gets suspended, and t1 runs again and reads i and runs from 1 to 900 again, and then lets t2 have another go... etc. Under this implausible scenario (implausible because the code for t1 and t2 to execute is probably the same - though it all hinges on what the Class * really is), there could be a lot of iterations.
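For what it's worth, in standard C++ (rather than the question's unspecified Thread/create_thread API, and with foo simplified to return void), the "pass the function, not its result" version might look something like this sketch:
#include <thread>

class X
{
    int i;
public:
    X() : i(0) { }
    void foo()
    {
        for ( ; i < 1000; ++i)
        {
            // some code but do not change value of i
        }
    }
};

int main()
{
    X myX;
    std::thread t1(&X::foo, &myX); // foo now runs on the new threads, not on the main thread,
    std::thread t2(&X::foo, &myX); // and both threads share myX.i - the data race discussed above
    t1.join();
    t2.join();
}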
It doesn't matter what type of system this code runs on; switching between threads works the same way.
Worst case for all cases is:
1000 + 1000 * 999 * 998 * 997 * ... * 2 * 1 times. (incorrect!!! correct one is in update)
When the first thread tries to increase the variable (it has already read the value, but not yet written it back), the second thread can run the whole loop from the current value of i; but when the second thread is finishing its last iteration, the first thread increases the value of i, and the second thread starts its long job again :)
* Updated (A little more details)
Sorry, real formula is:
1000 + 1000 + 999 + 998 + ... + 2 + 1 times,
or 501500
Each iteration of the loop looks like this:
Check condition.
Do the work.
Read value from i
Increase read value
Write value to i
Nobody said that step 2 takes constant time, so I'll assume it takes a varying amount of time that suits my worst case.
Here is this worst case:
Iteration 1:
[1st thread] Performs steps 1-4 of the first loop iteration (very long work time)
[2nd thread] Runs the whole loop (1000 times), but doesn't check the condition the last time
Iteration 2:
[1] Performs step 5, so now i == 1, then performs steps 1-4 of the next loop iteration
[2] Runs the whole loop from the current i (999 times)
Iteration 3: the same as before, but i == 2
...
Iteration 1000: the same as before, but i == 999
In the end we will have 1000 Iterations, and each Iteration will have 1 execution of the loop body from the first thread and (1001 - Iteration number) executions from the second thread.
Worst case: 2000 times, assuming main lives until t1 and t2 finish.
It can't be infinite, because even if only a single thread is running, it will increment the value of i.
I must be just having a moment, because this should be easy but I can't seem to get it working right.
What's the correct way to implement an atomic counter in GCC?
i.e. I want a counter that runs from zero to 4 and is thread safe.
I was doing this (which is further wrapped in a class, but not here)
static volatile int _count = 0;
const int limit = 4;
int get_count(){
// Create a local copy of diskid
int save_count = __sync_fetch_and_add(&_count, 1);
if (save_count >= limit){
__sync_fetch_and_and(&_count, 0); // Set it back to zero
}
return save_count;
}
But it's running from 1 through 4 inclusive and then wrapping around to zero.
It should go from 0 - 3. Normally I'd do a counter with a mod operator, but I don't know how to do that safely.
Perhaps this version is better. Can you see any problems with it, or offer a better solution?
int get_count(){
// Create a local copy of diskid
int save_count = _count;
if (save_count >= limit){
__sync_fetch_and_and(&_count, 0); // Set it back to zero
return 0;
}
return save_count;
}
Actually, I should point out that it's not absolutely critical that each thread get a different value. If two threads happened to read the same value at the same time that wouldn't be a problem. But they can't exceed limit at any time.
Your code isn't atomic (and your second get_count doesn't even increment the counter value)!
Say count is 3 at the start and two threads simultaneously call get_count. One of them will get its atomic add done first and increment count to 4. If the second thread is fast enough, it can increment it to 5 before the first thread resets it to zero.
Also, in your wraparound processing, you reset count to 0 but not save_count. This is clearly not what's intended.
This is easiest if limit is a power of 2. Don't ever do the reduction yourself, just use
return (unsigned) __sync_fetch_and_add(&count, 1) % (unsigned) limit;
or alternatively
return __sync_fetch_and_add(&count, 1) & (limit - 1);
This only does one atomic operation per invocation, is safe and very cheap. For generic limits, you can still use %, but that will break the sequence if the counter ever overflows. You can try using a 64-bit value (if your platform supports 64-bit atomics) and just hope it never overflows; this is a bad idea though. The proper way to do this is using an atomic compare-exchange operation. You do this:
int old_count, new_count;
do {
old_count = count;
new_count = old_count + 1;
if (new_count >= limit) new_count = 0; // or use %
} while (!__sync_bool_compare_and_swap(&count, old_count, new_count));
This approach generalizes to more complicated sequences and update operations too.
That said, this type of lockless operation is tricky to get right, relies on undefined behavior to some degree (all current compilers get this right, but no C/C++ standard before C++0x actually has a well-defined memory model) and is easy to break. I recommend using a simple mutex/lock unless you've profiled it and found it to be a bottleneck.
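Along those lines, a plain lock-based version of the counter (a sketch using a pthread mutex, mirroring the question's names) could be:
#include <pthread.h>

static int _count = 0;
static const int limit = 4;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

int get_count()
{
    pthread_mutex_lock(&count_lock);
    int save_count = _count;
    _count = (_count + 1) % limit; // increment and wraparound both happen under the lock
    pthread_mutex_unlock(&count_lock);
    return save_count; // always in the range 0 - 3
}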
You're in luck, because the range you want happens to fit into exactly 2 bits.
Easy solution: Let the volatile variable count up forever. But after you read it, use just the lowest two bits (val & 3). Presto, atomic counter from 0-3.
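With the question's variables, that boils down to something like this sketch:
int get_count()
{
    // let _count grow forever; only its low two bits are handed out
    return __sync_fetch_and_add(&_count, 1) & 3;
}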
It's impossible to create anything atomic in pure C, even with volatile. You need asm. C1x will have special atomic types, but until then you're stuck with asm.
You have two problems.
__sync_fetch_and_add will return the previous value (i.e., before adding one). So at the step where _count becomes 3, your local save_count variable is getting back 2. So you actually have to increment _count up to 4 before it'll come back as a 3.
But even on top of that, you're specifically looking for it to be >= 4 before you reset it back to 0. That's just a question of using the wrong limit if you're only looking for it to get as high as three.