Multithreading and sequence of instructions - c++

While learning multithreaded programming I've written the following code.
#include <thread>
#include <iostream>
#include <cassert>
void check() {
    int a = 0;
    int b = 0;
    {
        std::jthread t2([&]() {
            int i = 0;
            while (a >= b) {
                ++i;
            }
            std::cout << "failed at iteration " << i << "\n"
                      // I know at this point a and b may have changed
                      << a << " >= " << b << "\n";
            std::exit(0);
        });
        std::jthread t1([&]() {
            while (true) {
                ++a;
                ++b;
            }
        });
    }
}

int main() {
    check();
}
Since ++a always happens before ++b, a should always be greater than or equal to b.
But experiments show that sometimes b > a. Why? What causes it? And how can I enforce the ordering?
The same thing happens even when I replace int a = 0; with int a = 1000;, which makes all of this even more puzzling.
The program exits quickly, so no int overflow happens.
I found no instruction reordering in the assembly that might have caused this.

Since ++a always happens before ++b a should be always greater or
equal to b
Only in its own execution thread. And only if that ordering is observable by the execution thread.
C++ requires certain explicit "synchronization" in order for changes made by one execution thread to be visible to other execution threads.
++a;
++b;
With these statements alone, there are no means by which this execution thread would actually be able to "tell the difference" whether a or b was incremented first. As such, C++ allows the compiler to implement whatever optimizations or code reordering steps it wants, as long as it has no observable effects in its execution thread, and if the actual generated code incremented b first there will not be any observable effects. There's no way that this execution thread could possibly tell the difference.
But if there was some intervening statement that "looked" at a, then this wouldn't hold true any more, and the compiler would be required to actually generate code that increments a before using it in some way.
And that's just this execution thread, alone. Even if it were possible to observe the relative order of changes to a and b in this execution thread, the C++ compiler is allowed, by the standard, to actually increment the variables in any order, as long as it also makes whatever other adjustments are needed so that this is not observable. But it could be observable by another execution thread. To prevent that it is necessary to take explicit synchronization steps, using mutexes, atomics, condition variables, and other parts of C++'s execution thread model.

There are non-trivial race conditions between the increment of these different variables and when you read them. If you want strict ordering of these reads and writes you will have to use some sort of synchronization mechanism. std::atomic<> makes it easier.
Try this instead:
#include <iostream>
#include <thread>
#include <cassert>
#include <atomic>

void check() {
    struct AB { int a = 0; int b = 0; };
    std::atomic<AB> ab;
    {
        std::jthread t2([&]() {
            int i = 0;
            AB temp;
            while (true) {
                temp = ab;
                if (temp.a > temp.b) break;
                ++i;
            }
            std::cout << "failed at iteration " << i << "\n"
                      // I know at this point a and b may have changed
                      << temp.a << " >= " << temp.b << "\n";
            std::exit(0);
        });
        std::jthread t1([&]() {
            while (true) {
                AB temp = ab;
                temp.a++;
                temp.b++;
                ab = temp;
            }
        });
    }
}

int main() {
    check();
}
Code: https://godbolt.org/z/Kxeb8d8or
Result:
Program returned: 143
Program stderr
Killed - processing time exceeded
That is, the loop never observes temp.a > temp.b, so the program spins until the sandbox's time limit kills it: with the atomic struct, the ordering now holds.

Related

c++ threads safety and time efficiency: why does thread with mutex check sometimes works faster than without it?

I'm a beginner with threads in C++. I've read the basics about std::thread and mutex, and I think I understand the purpose of using mutexes.
I decided to check whether threads are really so dangerous without mutexes (well, I believe the books, but prefer to see it with my own eyes). As a test case of "what I shouldn't do in the future" I created two versions of the same concept: there are two threads, one of them increments a number several times (NUMBER_OF_ITERATIONS), the other decrements the same number the same number of times, so we expect to see the same value after the code has executed as before it. The code is attached.
First I run two threads which do this in an unsafe manner, without any mutexes, just to see what can happen. After that part has finished, I run two threads which do the same thing, but safely (with mutexes).
Expected results: without mutexes the result can differ from the initial value, because the data can be corrupted when two threads work on it simultaneously. This is especially likely for a huge NUMBER_OF_ITERATIONS, because the probability of corrupting the data is higher. That result I can understand.
I also measured the time spent by both the "safe" and "unsafe" parts. For a huge number of iterations the safe part takes much more time than the unsafe one, as I expected: some time is spent on the mutex operations. But for small numbers of iterations (400, 4000) the safe part's execution time is less than the unsafe one's. Why is that possible? Is it something the operating system does? Or is there some compiler optimization I'm not aware of? I spent some time thinking about it and decided to ask here.
I use Windows and the MSVS12 compiler.
So the question is: why can the safe part execute faster than the unsafe part (for small NUMBER_OF_ITERATIONS < 1000*n)?
And another: why is this related to NUMBER_OF_ITERATIONS: for smaller values (4000) the "safe" part with mutexes is faster, but for huge ones (400000) the "safe" part is slower?
main.cpp
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <windows.h>

/// change number of iterations for different results
const long long NUMBER_OF_ITERATIONS = 400;

/// time check counter
class Counter {
    double PCFreq_ = 0.0;
    __int64 CounterStart_ = 0;
public:
    Counter() {
        LARGE_INTEGER li;
        if (!QueryPerformanceFrequency(&li))
            std::cerr << "QueryPerformanceFrequency failed!\n";
        PCFreq_ = double(li.QuadPart) / 1000.0;
        QueryPerformanceCounter(&li);
        CounterStart_ = li.QuadPart;
    }
    double GetCounter() {
        LARGE_INTEGER li;
        QueryPerformanceCounter(&li);
        return double(li.QuadPart - CounterStart_) / PCFreq_;
    }
};

/// "dangerous" functions for unsafe threads: increment and decrement number
void incr(long long* j) {
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)++;
    std::cout << "incr finished" << std::endl;
}

void decr(long long* j) {
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)--;
    std::cout << "decr finished" << std::endl;
}

/// class for safe thread operations with increment and decrement
template<typename T>
class Safe_number {
public:
    Safe_number(int i) { number_ = T(i); }
    Safe_number(long long i) { number_ = T(i); }
    bool inc() {
        if (m_.try_lock()) {
            number_++;
            m_.unlock();
            return true;
        }
        return false;
    }
    bool dec() {
        if (m_.try_lock()) {
            number_--;
            m_.unlock();
            return true;
        }
        return false;
    }
    T val() { return number_; }
private:
    T number_;
    std::mutex m_;
};

template<typename T>
void incr(Safe_number<T>* n) {
    long long i = 0;
    while (i < NUMBER_OF_ITERATIONS) {
        if (n->inc()) i++;
    }
    std::cout << "incr <T> finished" << std::endl;
}

template<typename T>
void decr(Safe_number<T>* n) {
    long long i = 0;
    while (i < NUMBER_OF_ITERATIONS) {
        if (n->dec()) i++;
    }
    std::cout << "decr <T> finished" << std::endl;
}

using namespace std;

// run increments and decrements of the same number
// in threads in "safe" and "unsafe" way
int main()
{
    // init numbers to 0
    long long number = 0;
    Safe_number<long long> sNum(number);
    Counter cnt; // init time counter

    // run 2 unsafe threads for ++ and --
    std::thread t1(incr, &number);
    std::thread t2(decr, &number);
    t1.join();
    t2.join();
    // check time of execution of unsafe part
    double time1 = cnt.GetCounter();
    cout << "finished first thr" << endl;

    // run 2 safe threads for ++ and --, now we expect final value 0
    std::thread t3(incr<long long>, &sNum);
    std::thread t4(decr<long long>, &sNum);
    t3.join();
    t4.join();
    // check time of execution of safe part
    double time2 = cnt.GetCounter() - time1;
    cout << "unsafe part, number = " << number << " time1 = " << time1 << endl;
    cout << "safe part, Safe number = " << sNum.val() << " time2 = " << time2 << endl << endl;
    return 0;
}
You should not draw conclusions about the speed of any given algorithm if the input size is very small. What defines "very small" can be kind of arbitrary, but on modern hardware, under usual conditions, "small" can refer to any collection size less than a few hundred thousand objects, and "large" can refer to any collection larger than that.
Obviously, Your Mileage May Vary.
In this case, the overhead of constructing threads, which is usually slow and can also be rather inconsistent, could be a larger factor in the speed of your code than what the actual algorithm is doing. It's also possible that the compiler can apply powerful optimizations to smaller input sizes (which it can definitely know about, since the input size is hard-coded into the program) that it cannot then perform on larger inputs.
The broader point is that you should always prefer larger inputs when testing algorithm speed, and also have the same program repeat its tests (preferably in random order!) to try to "smooth out" irregularities in the timings.

atomic read then write with std::atomic

Assume the following code
#include <iostream>
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<unsigned int> val;

void F()
{
    while (true)
    {
        ++val;
    }
}

void G()
{
    while (true)
    {
        ++val;
    }
}

void H()
{
    while (true)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "val=" << val << std::endl;
        val = 0;
    }
}

int main()
{
    std::thread ft(F);
    std::thread gt(G);
    std::thread ht(H);
    ft.join();
    gt.join();
    ht.join();
    return 0;
}
It is basically two threads incrementing val and a third thread which reports this value every second and then resets it. The problem is that between the third thread reading the value and setting it to zero, increments can occur and be lost (they are not included in the report). So we need an atomic read-then-write mechanism. Is there a clean way to do this that I am not aware of?
PS: I don't want to lock anything
The std::atomic::exchange method seems to be what you're after (emphasis mine):
Atomically replaces the underlying value with desired. The operation is read-modify-write operation.
Use as follows:
auto localValue = val.exchange(0);
std::cout << "Value = " << localValue << std::endl;
As others mentioned std::atomic::exchange would work.
To explain why your current code does what you described: between the two lines of execution:
std::cout <<"val="<< val << std::endl;
val = 0;
the other two threads have time to increment the value that ht thread is about to reset.
std::atomic::exchange would do those lines in one "atomic" operation.

Is it safe/efficient to cancel a c++ thread by writing to an outside variable?

I have a search problem, which I want to parallelize. If one thread has found a solution, I want all other threads to stop. Otherwise, if all threads exit regularly, I know, that there is no solution.
The following code (that demonstrates my cancelling strategy) seems to work, but I'm not sure, if it is safe and the most efficient variant:
#include <iostream>
#include <thread>
#include <cstdint>
#include <chrono>

using namespace std;

struct action {
    uint64_t* ii;
    action(uint64_t* ii) : ii(ii) {}
    void operator()() {
        uint64_t k = 0;
        for (; k < *ii; ++k) {
            // do something useful
        }
        cout << "counted to " << k << " in 2 seconds" << endl;
    }
    void cancel() {
        *ii = 0;
    }
};

int main(int argc, char** argv) {
    uint64_t ii = 1000000000;
    action a{&ii};
    thread t(a);
    cout << "start sleeping" << endl;
    this_thread::sleep_for(chrono::milliseconds(2000));
    cout << "finished sleeping" << endl;
    a.cancel();
    cout << "cancelled" << endl;
    t.join();
    cout << "joined" << endl;
}
Can I be sure that the value ii points to always gets properly reloaded? Is there a more efficient variant that doesn't require the dereference at every step? I tried to make the upper bound of the loop a member variable, but since the constructor of thread copies the instance of action, I wouldn't have access to that member later.
Also: If my code is exception safe and does not do I/O (and I am sure, that my platform is Linux), is there a reason not to use pthread_cancel on the native thread?
No, there's no guarantee that this will do anything sensible. The code has one thread reading the value of ii and another thread writing to it, without any synchronization. The result is that the behavior of the program is undefined.
I'd just add a flag to the class:
std::atomic<bool> time_to_stop;
The constructor of action should set that to false, and the cancel member function should set it to true. Then change the loop to look at that value:
for(; !time_to_stop && k < *ii; ++k)
You might, instead, make ii atomic. That would work, but it wouldn't be as clear as having a named member to look at.
First off, there is no reason to make ii a pointer; you can have it as a plain uint64_t.
Secondly, if you have multiple threads and at least one of them writes to a shared variable, then you need some sort of synchronization. In this case you could just use std::atomic<uint64_t> to get it. Otherwise you would have to use a mutex or some sort of memory fence.

Forcing race between threads using C++11 threads

I just got started with the C++11 threading library (and multithreading in general), and wrote this small snippet of code.
#include <iostream>
#include <thread>

int x = 5; // variable to be affected by the race

// This function will be called from a thread
void call_from_thread1() {
    for (int i = 0; i < 5; i++) {
        x++;
        std::cout << "In Thread 1 :" << x << std::endl;
    }
}

int main() {
    // Launch a thread
    std::thread t1(call_from_thread1);
    for (int j = 0; j < 5; j++) {
        x--;
        std::cout << "In Thread 0 :" << x << std::endl;
    }
    // Join the thread with the main thread
    t1.join();
    std::cout << x << std::endl;
    return 0;
}
I was expecting to get different results every time (or nearly every time) I ran this program, due to the race between the two threads. However, the output is always 0, i.e. the two threads run as if they ran sequentially. Why am I getting the same result, and is there any way to simulate or force a race between the two threads?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <algorithm>
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>

void racer(std::atomic_int& cnt, int& val)
{
    for (int i = 0; i < 1000000; ++i)
    {
        ++val;
        ++cnt;
    }
}

int main(int argc, char* argv[])
{
    unsigned int N = std::thread::hardware_concurrency();
    std::atomic_int cnt = ATOMIC_VAR_INIT(0);
    int val = 0;
    std::vector<std::thread> thrds;
    std::generate_n(std::back_inserter(thrds), N,
        [&cnt, &val]() { return std::thread(racer, std::ref(cnt), std::ref(val)); });
    std::for_each(thrds.begin(), thrds.end(),
        [](std::thread& thrd) { thrd.join(); });
    std::cout << "cnt = " << cnt << std::endl;
    std::cout << "val = " << val << std::endl;
    return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a dual-core i7 MacBook Air with hyperthreading, so four threads, thus 4 million). The same cannot be said for the non-atomic counter.
There will be significant startup overhead to get the second thread going, so its execution will almost always begin after the first thread has finished the for loop, which by comparison will take almost no time at all. To see a race condition you will need to run a computation that takes much longer, or includes i/o or other operations that take significant time, so that the execution of the two computations actually overlap.

Safe parallel read-only access to a STL container

I want to access an STL-based container read-only from parallel running threads, without using any user-implemented locking. The basis of the following code is C++11 with a proper implementation of the standard.
http://gcc.gnu.org/onlinedocs/libstdc++/manual/using_concurrency.html
http://www.sgi.com/tech/stl/thread_safety.html
http://www.hpl.hp.com/personal/Hans_Boehm/c++mm/threadsintro.html
http://www.open-std.org/jtc1/sc22/wg21/ (current draft or N3337, which is essentially C++11 with minor errors and typos corrected)
23.2.2 Container data races [container.requirements.dataraces]
For purposes of avoiding data races (17.6.5.9), implementations shall
consider the following functions to be const: begin, end, rbegin,
rend, front, back, data, find, lower_bound, upper_bound, equal_range,
at and, except in associative or unordered associative containers,
operator[].
Notwithstanding (17.6.5.9), implementations are required
to avoid data races when the contents of the contained object in
different elements in the same sequence, excepting vector<bool>, are
modified concurrently.
[ Note: For a vector<int> x with a size greater
than one, x[1] = 5 and *x.begin() = 10 can be executed concurrently
without a data race, but x[0] = 5 and *x.begin() = 10 executed
concurrently may result in a data race. As an exception to the general
rule, for a vector<bool> y, y[0] = true may race with y[1]
= true. — end note ]
and
17.6.5.9 Data race avoidance [res.on.data.races] 1 This section specifies requirements that implementations shall meet to prevent data
races (1.10). Every standard library function shall meet each
requirement unless otherwise specified. Implementations may prevent
data races in cases other than those specified below.
2 A C++ standard
library function shall not directly or indirectly access objects
(1.10) accessible by threads other than the current thread unless
the objects are accessed directly or indirectly via the function's
arguments, including this.
3 A C++ standard library function shall
not directly or indirectly modify objects (1.10) accessible by threads
other than the current thread unless the objects are accessed directly
or indirectly via the function’s non-const arguments, including
this.
4 [ Note: This means, for example, that implementations can’t
use a static object for internal purposes without synchronization
because it could cause a data race even in programs that do not
explicitly share objects between threads. — end note ]
5 A C++ standard library function shall not access objects indirectly
accessible via its arguments or via elements of its container
arguments except by invoking functions required by its specification
on those container elements.
6 Operations on iterators obtained by
calling a standard library container or string member function may
access the underlying container, but shall not modify it. [ Note: In
particular, container operations that invalidate iterators conflict
with operations on iterators associated with that container. — end
note ]
7 Implementations may share their own internal objects between
threads if the objects are not visible to users and are protected
against data races.
8 Unless otherwise specified, C++ standard library
functions shall perform all operations solely within the current
thread if those operations have effects that are visible (1.10) to
users.
9 [ Note: This allows implementations to parallelize operations
if there are no visible side effects. — end note ]
Conclusion
Containers are not thread safe! But it is safe to call const functions on containers from multiple parallel threads. So it is possible to do read-only operations from parallel threads without locking.
Am I right?
I assume that no faulty implementations exist and that every implementation of the C++11 standard is correct.
Sample:
// concurrent thread access to a stl container
// g++ -std=gnu++11 -o p_read p_read.cpp -pthread -Wall -pedantic && ./p_read
#include <iostream>
#include <iomanip>
#include <string>
#include <functional>
#include <unistd.h>
#include <thread>
#include <mutex>
#include <map>
#include <cstdlib>
#include <ctime>

using namespace std;

// new in C++11
using str_map = map<string, string>;
// thread is new in C++11
// to_string() is new in C++11

mutex m;
const unsigned int MAP_SIZE = 10000;

void fill_map(str_map& store) {
    int key_nr;
    string mapped_value;
    string key;
    while (store.size() < MAP_SIZE) {
        // 0 - 9999
        key_nr = rand() % MAP_SIZE;
        // convert number to string
        mapped_value = to_string(key_nr);
        key = "key_" + mapped_value;
        pair<string, string> value(key, mapped_value);
        store.insert(value);
    }
}

void print_map(const str_map& store) {
    str_map::const_iterator it = store.begin();
    while (it != store.end()) {
        pair<string, string> value = *it;
        cout << left << setw(10) << value.first << right << setw(5) << value.second << "\n";
        it++;
    }
}

void search_map(const str_map& store, int thread_nr) {
    m.lock();
    cout << "thread(" << thread_nr << ") launched\n";
    m.unlock();
    // use a straight search or poke around at random
    bool straight = false;
    if ((thread_nr % 2) == 0) {
        straight = true;
    }
    int key_nr;
    string mapped_value;
    string key;
    str_map::const_iterator it;
    string first;
    string second;
    for (unsigned int i = 0; i < MAP_SIZE; i++) {
        if (straight) {
            key_nr = i;
        } else {
            // 0 - 9999, rand is not thread-safe, nrand48 is an alternative
            m.lock();
            key_nr = rand() % MAP_SIZE;
            m.unlock();
        }
        // convert number to string
        mapped_value = to_string(key_nr);
        key = "key_" + mapped_value;
        it = store.find(key);
        // check result
        if (it != store.end()) {
            // pair
            first = it->first;
            second = it->second;
            // m.lock();
            // cout << "thread(" << thread_nr << ") " << key << ": "
            //      << right << setw(10) << first << setw(5) << second << "\n";
            // m.unlock();
            // check mismatch
            if (key != first || mapped_value != second) {
                m.lock();
                cerr << key << ": " << first << second << "\n"
                     << "Mismatch in thread(" << thread_nr << ")!\n";
                exit(1);
                // never reached
                m.unlock();
            }
        } else {
            m.lock();
            cerr << "Warning: key(" << key << ") not found in thread("
                 << thread_nr << ")\n";
            exit(1);
            // never reached
            m.unlock();
        }
    }
}

int main() {
    clock_t start, end;
    start = clock();
    str_map store;
    srand(0);
    fill_map(store);
    cout << "fill_map finished\n";
    // print_map(store);
    // cout << "print_map finished\n";
    // copy for check
    str_map copy_store = store;
    // launch threads; cref shares the one map with all threads
    // instead of copying it into each of them
    thread t[10];
    for (int i = 0; i < 10; i++) {
        t[i] = thread(search_map, cref(store), i);
    }
    // wait for finish
    for (int i = 0; i < 10; i++) {
        t[i].join();
    }
    cout << "search_map threads finished\n";
    if (store == copy_store) {
        cout << "equal\n";
    } else {
        cout << "not equal\n";
    }
    end = clock();
    cout << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << "\n";
    cout << "CPU-TIME START " << start << "\n";
    cout << "CPU-TIME END " << end << "\n";
    cout << "CPU-TIME END - START " << end - start << "\n";
    cout << "TIME(SEC) " << static_cast<double>(end - start) / CLOCKS_PER_SEC << "\n";
    return 0;
}
This code can be compiled with GCC 4.7 and runs fine on my machine.
$ echo $?
0
A data race, as specified in C++11 sections 1.10/4 and 1.10/21, requires at least two threads with non-atomic access to the same set of memory locations, no synchronization between the threads with regard to those locations, and at least one thread writing to or modifying them. So in your case, if the threads are only reading, you are fine: by definition, since none of the threads write to the same set of memory locations, there are no data races even though there is no explicit synchronization mechanism between the threads.
Yes, you are right. You are safe as long as the thread that populates your container finishes doing so before the reader threads start. There was a similar question recently.