Why does my thread counter not always finish?

I think my program has a bug: sometimes when I run it, it prints a number lower than 30000, such as 29999, but other times it runs correctly and reaches 30000. Why is this happening, and how can I fix it?
#include <iostream>
#include <thread>

using namespace std;

int counter;
int i;

void increment()
{
    counter++;
}

int main()
{
    counter = 0;
    cout << "The value in counter is : " << counter << endl;

    thread tarr[30000];
    for (i = 0; i < 30000; i++)
    {
        tarr[i] = thread(increment);
    }
    for (i = 0; i < 30000; i++)
    {
        tarr[i].join(); // main thread waits for tarr to finish
    }

    cout << "After running 30,000 threads ";
    cout << "the value in counter is : " << counter << endl;
    return 0;
}

The problem is that counter++ is not a single atomic operation; it breaks down into three steps:

Load the current value into a register
Increment the value
Store the new value back to memory

A thread may perform the first two steps and then be preempted, letting another thread do the same. That can play out like this:

Thread one reads counter as 5
Thread one increments its private copy to 6
Thread two reads counter as 5
Thread two increments its private copy to 6
Thread two writes 6 back to counter
Thread one writes 6 back to counter

One increment has been lost, which is why the final value sometimes falls short of 30000.
You should make counter a std::atomic<int>, or guard it with a std::mutex:
std::atomic<int> counter;
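For reference, here is a minimal sketch of both fixes applied to the code above (the names plain_counter, counter_mutex and guarded_increment are just illustrative, not part of the original program):

#include <atomic>
#include <mutex>

// Option 1: an atomic counter -- every ++ is a single indivisible read-modify-write.
std::atomic<int> counter{0};

void increment()
{
    counter++; // no increments can be lost
}

// Option 2: a plain int guarded by a mutex.
int plain_counter = 0;
std::mutex counter_mutex;

void guarded_increment()
{
    std::lock_guard<std::mutex> lock(counter_mutex); // unlocked automatically at end of scope
    plain_counter++;
}

Either way, the final output should reliably be 30000.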


In the example of std::atomic<T>::exchange, why is the count of times not 25?

The example I'm talking about is the one for std::atomic<T>::exchange on cppreference.com. The code snippet is pasted below.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <iostream>
#include <syncstream>
#include <thread>
#include <vector>

int main()
{
    const std::size_t ThreadNumber = 5;
    const int Sum = 5;
    std::atomic<int> atom{0};
    std::atomic<int> counter{0};

    // lambda as thread proc
    auto lambda = [&](const int id)
    {
        for (int next = 0; next < Sum;)
        {
            // each thread is writing a value from its own knowledge
            const int current = atom.exchange(next);
            counter++;
            // sync writing to prevent from interrupting by other threads
            std::osyncstream(std::cout)
                << '#' << id << " (" << std::this_thread::get_id()
                << ") wrote " << next << " replacing the old value "
                << current << '\n';
            next = std::max(current, next) + 1;
        }
    };

    std::vector<std::thread> v;
    for (std::size_t i = 0; i < ThreadNumber; ++i)
    {
        v.emplace_back(lambda, i);
    }
    for (auto& tr : v)
    {
        tr.join();
    }

    std::cout << ThreadNumber << " threads adding 0 to "
              << Sum << " takes total "
              << counter << " times\n";
}
To me, the value of counter should be 25, because there are 5 threads and each thread loops 5 times. However, the output shown there is 16. I also ran it myself; the value varies, but it never reaches 25.
Why is the printed value of counter actually smaller?
Consider one of the possible executions: let's say one of the threads finishes the whole loop before the other threads start. That leaves atom == 4. The next thread to enter the loop gets current == 4 from its first exchange, so next becomes max(4, 0) + 1 == 5 and the loop exits after that single iteration.
That thread therefore increments counter only once instead of the 5 times you expect it to.
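Spelled out iteration by iteration (this trace is derived from the exchange semantics above, not part of the original answer):

// Thread A runs alone, Sum == 5 (atom starts at 0):
//   next = 0: exchange(0) returns 0, next = max(0, 0) + 1 = 1
//   next = 1: exchange(1) returns 0, next = max(0, 1) + 1 = 2
//   next = 2: exchange(2) returns 1, next = max(1, 2) + 1 = 3
//   next = 3: exchange(3) returns 2, next = max(2, 3) + 1 = 4
//   next = 4: exchange(4) returns 3, next = max(3, 4) + 1 = 5 -> loop ends, atom == 4
// Thread B starts afterwards:
//   next = 0: exchange(0) returns 4, next = max(4, 0) + 1 = 5 -> exits after one iteration

Thread A contributes 5 increments to counter but thread B contributes only 1, so the total falls short of 25.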
I haven't taken the trouble to analyse the code in detail, but the for loop in the lambda is broken (or, at least, not doing what you are expecting it to do). If you replace it with something more straightforward, namely:
for (int next = 0; next < Sum; ++next){
then 25 is output.
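For illustration, the lambda with that straightforward loop might look like this (a sketch, assuming the next = std::max(current, next) + 1; line in the body is dropped as well, since the loop counter now advances on its own):

auto lambda = [&](const int id)
{
    for (int next = 0; next < Sum; ++next) // every thread now always does Sum iterations
    {
        // the exchange still swaps values, but it no longer controls when the loop ends
        const int current = atom.exchange(next);
        counter++;
        std::osyncstream(std::cout)
            << '#' << id << " wrote " << next
            << " replacing the old value " << current << '\n';
    }
};

With 5 threads each making exactly 5 iterations, counter always ends up at 25, though the example then no longer demonstrates how exchange can be used to coordinate the threads.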

Using a Mutex to Limit the Number of Threads Running at a Time to 2

I have a program that pushes 10 threads into a vector, each of which is supposed to print out a character 5 times before finishing ('A' for the first thread, 'B' for the second, etc). I'm able to get them to either run all at once (using detach()) or have them run one at a time (using join()). Now I want to use a Mutex to limit the number of threads allowed to print at a time to 2. I've been able to declare the mutex and put the lock in place but I'm unsure of how to apply a limit like this. Anyone have any ideas on how to proceed?
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

using namespace std;
using namespace std::this_thread; // sleep_for
using namespace std::chrono;      // milliseconds

deque<int> q;
mutex print_mutex;
mutex queue_mutex;
condition_variable queue_cond;

void begin(int num)
{
    unique_lock<mutex> ul{queue_mutex};
    q.emplace_back(num);
    queue_cond.wait(ul, [num] {
        return q.front() == num; });
    q.pop_front();
    cout << num << " leaves begin " << endl;
}

void end(int num)
{
    lock_guard<mutex> lg{queue_mutex};
    queue_cond.notify_all();
    cout << num << " has ended " << endl;
}

void run(int num, char ch)
{
    begin(num);
    for (int i = 0; i < 5; ++i)
    {
        {
            lock_guard<mutex> lg{print_mutex};
            cout << ch << endl << flush;
        }
        sleep_for(milliseconds(250));
    }
    end(num);
}

int main()
{
    vector<thread> threads{};
    for (int i = 0; i < 10; ++i)
    {
        threads.push_back(thread{run, i, static_cast<char>(65 + i)});
        threads.at(i).join();
    }
}
You have already set up a FIFO for your threads with the global deque<int> q. So let's use that.
Currently, you're trying to restrict execution until the current thread is at the front of the queue. There's a bug, though: begin immediately pops that thread from the deque, so the queue no longer tracks which threads are still running. It's better to remove the value when you call end. Here's that change first:
void end(int num)
{
    {
        lock_guard<mutex> lg{queue_mutex};
        cout << num << " has ended " << endl;
        q.erase(find(q.begin(), q.end(), num));
    }
    queue_cond.notify_all();
}
This uses std::find from <algorithm> to remove the specific value. You could use pop_front, but we're about to change that logic, so this is more generic. Also notice that you don't need to hold the mutex while you notify the condition variable.
So it's not much of a stretch to extend the logic in begin so that a thread may proceed as long as it is within the first two places of the queue. Here:
void begin(int num)
{
    unique_lock<mutex> ul{queue_mutex};
    q.emplace_back(num);
    queue_cond.wait(ul, [num] {
        auto end = q.begin() + std::min(2, static_cast<int>(q.size()));
        return find(q.begin(), end, num) != end;
    });
    cout << num << " leaves begin " << endl;
}
You can change that 2 to anything you want, allowing up to that many threads to pass. At some point, you would probably abandon this approach and use something simpler like a single counter variable, then rely on the thread scheduler to manage which thread is woken, rather than force them into your FIFO. That way you can switch to using notify_one to wake a single thread and reduce switching overhead.
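Such a counter-based variant might look roughly like this (a sketch rather than part of the original answer; active and max_active are made-up names, and it assumes the same headers and using directives as the code above):

int active = 0;           // threads currently between begin() and end()
const int max_active = 2; // at most this many may run their printing loop at once

void begin(int num)
{
    unique_lock<mutex> ul{queue_mutex};
    queue_cond.wait(ul, [] { return active < max_active; });
    ++active;
    cout << num << " leaves begin " << endl;
}

void end(int num)
{
    {
        lock_guard<mutex> lg{queue_mutex};
        --active;
        cout << num << " has ended " << endl;
    }
    queue_cond.notify_one(); // wake just one waiting thread
}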
Anyway, the last thing to do is remove the join from your thread generation loop. The concurrency is now managed by begin and end. So you would do this:
for (int i = 0; i < 10; ++i) {
    threads.push_back(thread{run, i, 'A' + i});
}
for (auto& t : threads) t.join();

Can iterating over an unsorted data structure (like an array or tree) with multiple threads make iteration faster?

Can iterating over an unsorted data structure (like an array or a tree) with multiple threads make it faster?
For example, I have a big array with unsorted data:
int array[1000];
and I'm searching for an element where array[i] == 8.
Can running:
Thread 1:
for (auto i = 0; i < 500; i++)
{
    if (array[i] == 8)
        std::cout << "found" << std::endl;
}
Thread 2:
for (auto i = 500; i < 1000; i++)
{
    if (array[i] == 8)
        std::cout << "found" << std::endl;
}
be faster than normal iteration?
Update: I've written a simple test which describes the problem better.
For searching an array allocated as int* array = new int[100000000]; and repeating the search 1000 times, I got this result:
a
Number of threads = 2
End of multithread iteration
End of normal iteration
Time with 2 threads 73581
Time with 1 thread 154070
Bool values:0
0
0
Process returned 0 (0x0) execution time : 256.216 s
Press any key to continue.
What's more, when the program was running with 2 threads, the CPU usage of the process was around 90%, whereas when iterating with 1 thread it was never more than 50%.
So Smeeheey and erip are right that it can make iteration faster.
Of course, it can be trickier for less trivial problems.
What I've also learned from this test is that the compiler can optimize away work in the main thread (when I was not printing the boolean storing the result, the search loop in the main thread was skipped), but it will not do that for the other threads.
This is the code I used:
#include <cstdlib>
#include <thread>
#include <ctime>
#include <iostream>

#define SIZE_OF_ARRAY 100000000
#define REPEAT 1000

inline bool threadSearch(int* array)
{
    for (auto i = 0; i < SIZE_OF_ARRAY/2; i++)
        if (array[i] == 101) // there is no array[i]==101
            return true;
    return false;
}

int main()
{
    int i;
    std::cin >> i; // stops program enabling to set real time priority of the process
    clock_t with_multi_thread;
    clock_t normal;
    srand(time(NULL));
    std::cout << "Number of threads = "
              << std::thread::hardware_concurrency() << std::endl;
    int* array = new int[SIZE_OF_ARRAY];
    bool true_if_found_t1 = false;
    bool true_if_found_t2 = false;
    bool true_if_found_normal = false;
    for (auto i = 0; i < SIZE_OF_ARRAY; i++)
        array[i] = rand() % 100;
    with_multi_thread = clock();
    for (auto j = 0; j < REPEAT; j++)
    {
        std::thread t([&](){
            if (threadSearch(array))
                true_if_found_t1 = true;
        });
        std::thread u([&](){
            if (threadSearch(array + SIZE_OF_ARRAY/2))
                true_if_found_t2 = true;
        });
        if (t.joinable())
            t.join();
        if (u.joinable())
            u.join();
    }
    with_multi_thread = (clock() - with_multi_thread);
    std::cout << "End of multithread iteration" << std::endl;
    for (auto i = 0; i < SIZE_OF_ARRAY; i++)
        array[i] = rand() % 100;
    normal = clock();
    for (auto j = 0; j < REPEAT; j++)
        for (auto i = 0; i < SIZE_OF_ARRAY; i++)
            if (array[i] == 101) // there is no array[i]==101
                true_if_found_normal = true;
    normal = (clock() - normal);
    std::cout << "End of normal iteration" << std::endl;
    std::cout << "Time with 2 threads " << with_multi_thread << std::endl;
    std::cout << "Time with 1 thread " << normal << std::endl;
    std::cout << "Bool values:" << true_if_found_t1 << std::endl
              << true_if_found_t2 << std::endl
              << true_if_found_normal << std::endl; // showing bool values to prevent compiler from optimization
    return 0;
}
The answer is yes, it can make it faster, but not necessarily. In your case, where you're iterating over a pretty small array, it is likely that the overhead of launching a new thread will be much higher than the benefit gained. If your array were much bigger, this overhead would shrink as a proportion of the overall runtime and eventually become worth it. Note you will only get a speed-up if your system has more than one physical core available.
Additionally, you should note that whilst the code that reads the array in your case is perfectly thread-safe, writing to std::cout is not (you will get very strange-looking output if you try this). Instead, perhaps each thread should return something like an integer indicating the number of instances found.
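A sketch of that idea (not from the original answer; countMatches is a made-up helper), where each thread counts matches in its own half and only the main thread prints:

#include <iostream>
#include <thread>
#include <vector>

// Illustrative helper: count how many elements in [first, last) equal target.
int countMatches(const int* data, int first, int last, int target)
{
    int found = 0;
    for (int i = first; i < last; ++i)
        if (data[i] == target)
            ++found;
    return found;
}

int main()
{
    std::vector<int> data(1000, 0);
    data[123] = 8; // plant one match for the demo

    int count1 = 0, count2 = 0;
    std::thread t1([&] { count1 = countMatches(data.data(), 0,   500,  8); });
    std::thread t2([&] { count2 = countMatches(data.data(), 500, 1000, 8); });
    t1.join();
    t2.join();

    // Only the main thread touches std::cout, so there is no interleaved output.
    std::cout << "found " << (count1 + count2) << " occurrence(s)\n";
}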

Why is my program printing garbage?

My code:
#include <iostream>
#include <thread>

void function_1()
{
    std::cout << "Thread t1 started!\n";
    for (int j = 0; j > -100; j--) {
        std::cout << "t1 says: " << j << "\n";
    }
}

int main()
{
    std::thread t1(function_1); // t1 starts running
    for (int i = 0; i < 100; i++) {
        std::cout << "from main: " << i << "\n";
    }
    t1.join(); // main thread waits for t1 to finish
    return 0;
}
I create a thread that prints numbers in decreasing order while main prints in increasing order.
Sample output here. Why is my code printing garbage?
Both threads are outputting at the same time, thereby scrambling your output.
You need some kind of thread synchronization mechanism on the printing part.
See this answer for an example using a std::mutex combined with std::lock_guard for cout.
It's not "garbage" — it's the output you asked for! It's just jumbled up, because you have used a grand total of zero synchronisation mechanisms to prevent individual std::cout << ... << std::endl lines (which are not atomic) from being interrupted by similar lines (which are still not atomic) in the other thread.
Traditionally we'd lock a mutex around each of those lines.
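A minimal sketch of that traditional approach applied to the code above (the name print_mutex is just illustrative):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex print_mutex; // guards all writes to std::cout

void function_1()
{
    for (int j = 0; j > -100; j--) {
        std::lock_guard<std::mutex> lock(print_mutex); // one full line at a time
        std::cout << "t1 says: " << j << "\n";
    }
}

int main()
{
    std::thread t1(function_1);
    for (int i = 0; i < 100; i++) {
        std::lock_guard<std::mutex> lock(print_mutex);
        std::cout << "from main: " << i << "\n";
    }
    t1.join();
}

The lines from the two threads can still come out in any order, but each line is now printed whole.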

Forcing race between threads using C++11 threads

I just got started with the C++11 threading library (and multithreading in general), and wrote this small snippet of code:
#include <iostream>
#include <thread>

int x = 5; // variable to be affected by the race

// This function will be called from a thread
void call_from_thread1()
{
    for (int i = 0; i < 5; i++) {
        x++;
        std::cout << "In Thread 1 :" << x << std::endl;
    }
}

int main()
{
    // Launch a thread
    std::thread t1(call_from_thread1);

    for (int j = 0; j < 5; j++) {
        x--;
        std::cout << "In Thread 0 :" << x << std::endl;
    }

    // Join the thread with the main thread
    t1.join();

    std::cout << x << std::endl;
    return 0;
}
I was expecting to get different results every time (or nearly every time) I ran this program, due to the race between the two threads. However, the output is always 0, i.e. the two threads run as if they ran sequentially. Why am I getting the same results, and is there any way to simulate or force a race between the two threads?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

void racer(std::atomic_int& cnt, int& val)
{
    for (int i = 0; i < 1000000; ++i)
    {
        ++val;
        ++cnt;
    }
}

int main(int argc, char *argv[])
{
    unsigned int N = std::thread::hardware_concurrency();
    std::atomic_int cnt = ATOMIC_VAR_INIT(0);
    int val = 0;

    std::vector<std::thread> thrds;
    std::generate_n(std::back_inserter(thrds), N,
        [&cnt, &val]() { return std::thread(racer, std::ref(cnt), std::ref(val)); });

    std::for_each(thrds.begin(), thrds.end(),
        [](std::thread& thrd) { thrd.join(); });

    std::cout << "cnt = " << cnt << std::endl;
    std::cout << "val = " << val << std::endl;
    return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a dual-core i7 MacBook Air with hyper-threading, so 4 hardware threads, thus 4 million). The same cannot be said for the non-atomic counter.
There is significant startup overhead in getting the second thread going, so its execution will almost always begin after the first thread has already finished its for loop, which by comparison takes almost no time at all. To see a race condition you will need to run a computation that takes much longer, or one that includes I/O or other operations that take significant time, so that the execution of the two computations actually overlaps.
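For instance, a sketch along those lines: the original snippet with the printing removed from the loops, the iteration counts raised (the value 1000000 is arbitrary), and x started at 0 for simplicity, so that the two loops genuinely overlap:

#include <iostream>
#include <thread>

int x = 0; // shared and deliberately unsynchronised

void call_from_thread1()
{
    for (int i = 0; i < 1000000; i++)
        x++; // no printing here, so this loop takes long enough to overlap with main
}

int main()
{
    std::thread t1(call_from_thread1);
    for (int j = 0; j < 1000000; j++)
        x--;
    t1.join();
    // Without a race the increments and decrements would cancel out to exactly 0;
    // with the race, the printed value typically differs from run to run.
    std::cout << x << std::endl;
    return 0;
}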