While going through a C++ tutorial book (it's in Spanish, so I apologize if my translation to English is not as proper as it should be), I came across a code snippet that I do not fully understand in terms of what is happening in the background. For example, in terms of multiple address spaces, how would I determine whether these threads all run within the context of a single process (given that a new thread is added with each push to the vector)? How would I determine whether each thread is different from the others if they all perform exactly the same computation?
#include <iostream>
#include <vector>
#include <thread>
using namespace std;
int addthreads = 0;
void squarenum(int x) {
addthreads += x * x * x;
}
int main() {
vector<thread> septhread;
for (int i = 1; i <= 9; i++){
septhread.push_back(thread(&squarenum, i));
}
for (auto& th : septhread){
th.join();
}
cout << "Your answer = " << addthreads << endl;
system("pause");
return 0;
}
Every run prints 2025; that much I understand. My basic issue is understanding the first part of my question.
By the way, the compile command required (if you are on Linux):
g++ -std=gnu++11 -pthread threadExample.cpp -o threadExample
A thread is a "thread of execution" within a process, sharing the same address space, resources, etc. Depending on the operating system, hardware, etc, they may or may not run on the same CPU or CPU Thread.
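To check this yourself, here is a minimal sketch. The helper name count_distinct_threads and the global shared_value are made up for illustration. Every thread reports the same address for the global (one address space), while each std::thread object has its own id even though all of them run the same function.

```cpp
#include <cstddef>
#include <set>
#include <thread>
#include <vector>

int shared_value = 0;  // one global, visible to every thread in the process

// Hypothetical helper: launches n threads and returns how many distinct
// thread ids were observed. same_address reports whether every thread saw
// the global at the same address as the launching thread.
std::size_t count_distinct_threads(int n, bool& same_address) {
    const int* expected = &shared_value;
    std::vector<int> address_ok(n, 0);  // one slot per thread, nothing shared
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) {
        threads.emplace_back([i, expected, &address_ok] {
            address_ok[i] = (&shared_value == expected) ? 1 : 0;
        });
    }
    std::set<std::thread::id> ids;
    for (auto& th : threads) {
        ids.insert(th.get_id());  // ids stay unique until the thread is joined
    }
    for (auto& th : threads) {
        th.join();
    }
    same_address = true;
    for (int ok : address_ok) {
        same_address = same_address && (ok == 1);
    }
    return ids.size();
}
```

Calling count_distinct_threads(9, flag) should report nine ids and set flag to true, which matches the description above: many threads, one process.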
A major issue with thread programming, as a result, is managing access to resources. If two threads access the same resource at the same time, undefined behavior can occur. If they are both reading, it may be fine, but if one is writing at the same moment the other is reading, any of several outcomes can follow. The simplest is that both threads are running on separate CPUs or cores, so the reader does not see the change made by the writer because of caching. Another is that the reader sees only part of the write (if it's a 64-bit value, it might see only 32 bits changed).
Your code performs a read-modify-write operation, so the first thread to come along sees the value '0', calculates the result of x*x*x, adds it to 0 and stores the result.
Meanwhile, the next thread comes along and does the same thing: it may also see 0 before performing its calculation, so it writes 0 + x*x*x to the value, overwriting the first thread's result.
The threads might not run in the order in which you launched them; it's possible for thread #9 to get the first execution cycle rather than thread #1.
You may need to consider looking at std::atomic or std::mutex.
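As a rough sketch of the std::atomic route (sum_of_cubes is a name invented here, and the accumulator is local rather than the global from the question), the addition becomes a single atomic read-modify-write, so no update can be lost:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Sums the cubes of 1..n, one thread per term, without a data race.
int sum_of_cubes(int n) {
    std::atomic<int> total(0);
    std::vector<std::thread> workers;
    for (int i = 1; i <= n; ++i) {
        workers.emplace_back([&total, i] {
            total.fetch_add(i * i * i);  // atomic read-modify-write
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return total.load();
}
```

With n = 9 this returns 2025 regardless of the order in which the threads run. A std::mutex around the += would work as well; the atomic is simply the smaller change for a single integer.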
I have multiple threads, two of them being tight polling loops (I need polling) with regular sleeps: 1 second of sleep after every 10 seconds.
The program prints multiple interim updates with:
std::cout << "progress report text" << std::endl;
The body of the polling thread looks pretty much like this:
void PollHardwareFunction ()
{
lastTimeSlept = std::chrono::HighResClock::now();
while (!stopSignal)
{
poll_hardware();
// Process the data received from hardware
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
}
}
The other threads are fairly ordinary: they perform a few logical steps and print a status message after each step.
void LongRunningFunction ()
{
int dataCounter = 0;
while (wait_for_data_from_hardware_in_concurrent_queue)
{
std::cout << "Data received: " << dataCounter++ << std::endl;
// Process the data received from hardware
std::cout << "STEP1 done." << std::endl;
std::cout << "STEP2 done." << std::endl;
std::cout << "STEP3 done." << std::endl;
}
}
This prints all messages as expected, but only in bulk after 10 seconds, making the program look unresponsive or stuck during those 10 seconds.
The program runs in the following environment:
Compiled with GCC 6.2, run on RHEL 7, an 8 core CPU.
I notice that the program prints on the console only when the spinning threads go to sleep/idle. Once the busy threads go to sleep, all of the prints appear on my output console together. To add to it, data received from hardware is regular - say every 100 milliseconds.
With several CPU cores free, why does the program stay in a non-responsive state until the spinning threads stop or pause?
From your comments:
My program is bit better structured - it uses atomic variables and some of lockfree data structures I have implemented.
and
poll_hardware is a function from the hardware vendor's API that reads hardware buffer and pushes data into a concurrent queue.
That sounds dubious. Did you write your own data structure or did you use an existing one? Regardless, please post the code for the queue. If the queue was provided by the vendor, please post the API.
My perspective here is to understand what can cause the program's output to remain (or feel) stuck even though the std::cout << operator has executed (completed execution) with std::endl?
You don't call cout from PollHardwareFunction(), so the issue MUST be from wait_for_data_from_hardware_in_concurrent_queue blocking when it's not supposed to. (If you want to be sure, switch cout to cerr to avoid buffering writes.)
The first thing I would check is whether poll_hardware() is dominating a lock by re-locking as soon as it releases. You may have created what is effectively a spin-lock. This is why user Snps suggested sleeping for 1ms in the comments. One yield is not enough. I understand that your data is time critical, but you said 100ms, so theoretically you could poll every 50ms and be fine. A few ms should be totally OK for debugging purposes.
Lock dominating can be both caused by and solved with a reader/writer lock. Reader/writer locks need to be custom designed with the characteristics of the situation in mind. (how many threads are reading vs writing? how often do reads vs writes occur?)
The second thing I would check are your assumptions about sequential programming and memory caching in your lock-free data structures. Loads and stores can be delayed, rearranged, buffered, etc. as an optimization. Everyone is your "frienemy"--the compiler will do this, then the OS will do it, the CPU will take its turn, and then hardware will do it too.
To prevent this, you have to use a memory barrier (aka memory fence) to keep any of your frienemies from optimizing memory accesses. FYI, mutexes use memory barriers in their implementation. A quick way to see if this fixes your problem is to make your shared variables volatile. HOWEVER, don't trust volatile. It only keeps the compiler from reordering your commands, not necessarily the OS or CPU (depending on compiler, naturally).
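To make the memory-barrier point concrete, here is a small sketch using std::atomic with release/acquire ordering instead of volatile. The names payload, ready and publish_and_consume are invented for this example and have nothing to do with your queue. The release store guarantees that the earlier write to payload is visible to any thread that observes ready == true through an acquire load.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain data, written before publishing
std::atomic<bool> ready(false);  // flag that publishes the payload

// Hypothetical demo: the producer writes the payload and then sets the flag
// with release ordering; the consumer spins on the flag with acquire
// ordering and only then reads the payload.
int publish_and_consume(int value) {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread consumer([&seen] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // be polite while waiting
        }
        seen = payload;  // ordered after the producer's write
    });
    std::thread producer([value] {
        payload = value;
        ready.store(true, std::memory_order_release);
    });
    producer.join();
    consumer.join();
    return seen;
}
```

If ready were a plain or volatile bool, the compiler and CPU would be free to reorder the two writes in the producer, and the consumer could read a stale payload.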
It would be good to know about some of your other atomic variables, because there could be a logic bug there.
Lastly, your use of auto here defines a new block-scoped variable lastTimeSlept that shadows the "actual" lastTimeSlept, so the outer variable is never updated.
if (std::chrono::HighResClock::now() - lastTimeSlept > std::chrono::seconds(10))
{
std::this_thread::sleep_for(std::chrono::seconds(1));
auto lastTimeSlept = std::chrono::HighResClock::now();
}
Yikes! I don't think that's causing your issue, though.
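For reference, a sketch of the loop with the shadowing removed. It assumes std::chrono::steady_clock in place of your HighResClock, a do-nothing poll_hardware_stub in place of the vendor call, and a run duration instead of stopSignal so it can be tried in isolation; all of those names are hypothetical.

```cpp
#include <chrono>
#include <thread>

// Hypothetical stand-in for the vendor's poll_hardware(); does nothing here.
void poll_hardware_stub() {}

// Polls for run_for, napping for nap once every period.
// Returns how many naps were taken.
int poll_with_naps(std::chrono::milliseconds run_for,
                   std::chrono::milliseconds period,
                   std::chrono::milliseconds nap) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto last_time_slept = start;
    int naps = 0;
    while (clock::now() - start < run_for) {
        poll_hardware_stub();
        if (clock::now() - last_time_slept > period) {
            std::this_thread::sleep_for(nap);
            last_time_slept = clock::now();  // assignment, not a new declaration
            ++naps;
        }
    }
    return naps;
}
```

With the shadowed version, the outer timestamp never advances, so once the first period has elapsed the loop sleeps on every iteration instead of once per period.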
Here I use Peterson's algorithm to implement mutual exclusion.
I have two very simple threads, one to increase a counter by 1, another to reduce it by 1.
const int PRODUCER = 0,CONSUMER =1;
int counter;
int flag[2];
int turn;
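void *producer(void *param)
{
flag[PRODUCER]=1;
turn=CONSUMER;
while(flag[CONSUMER] && turn==CONSUMER); // busy-wait until it is our turn
counter++;
flag[PRODUCER]=0;
return NULL; // a void* function must return a value
}
void *consumer(void *param)
{
flag[CONSUMER]=1;
turn=PRODUCER;
while(flag[PRODUCER] && turn==PRODUCER); // busy-wait until it is our turn
counter--;
flag[CONSUMER]=0;
return NULL; // a void* function must return a value
}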
They work fine when I run them just once.
But when I run them again and again in a loop, strange things happen.
Here is my main function.
int main(int argc, char *argv[])
{
int case_count =0;
counter =0;
while(counter==0)
{
printf("Case: %d\n",case_count++);
pthread_t tid[2];
pthread_attr_t attr[2];
pthread_attr_init(&attr[0]);
pthread_attr_init(&attr[1]);
counter=0;
flag[0]=0;
flag[1]=0;
turn = 0;
printf ("Counter is initially set to %d\n",counter);
pthread_create(&tid[0],&attr[0],producer,NULL);
pthread_create(&tid[1],&attr[1],consumer,NULL);
pthread_join(tid[0],NULL);
pthread_join(tid[1],NULL);
printf ("counter is now %d\n",counter);
}
return 0;
}
I run the two threads again and again until, in one case, the counter isn't zero.
After a number of cases, the program always stops! Sometimes after hundreds of cases, sometimes thousands, or even tens of thousands.
That means that in one case the counter wasn't zero. But why? The two threads modify the counter inside the critical section, and each increases or decreases it only once. Why would the counter not be zero?
Then I ran this code on other computers, and stranger things happened: on some computers the program seems to have no problem, while others show the same problem as mine! Why?
By the way, on my computer I run this code in a VMware virtual machine with Ubuntu 16.04. The other computers also run Ubuntu 16.04, but not all of them in virtual machines. The computers with the problem include both virtual machines and real machines.
Peterson's algorithm, implemented with plain variables as it is here, only works reliably on single-core/single-CPU systems.
That's because those systems don't do real parallel processing: two atomic operations never get executed at the same time there.
If you have two or more CPUs or CPU cores, the number of atomic operations that can be executed at the same time increases by one for each CPU (core).
This means that even if an integer assignment is atomic, it can be executed multiple times at the same moment on different CPUs/cores.
In your case, turn=CONSUMER/PRODUCER; is simply executed twice at the same time on different CPUs/cores.
Deactivate all CPU cores but one for your program and it should work fine.
You need hardware support to implement any kind of thread-safe algorithm.
There are many reasons why your code is not working as you intended. The simplest one is that the cores have individual caches. So your program starts on, say, two cores. Both cache flag as 0, 0. They both modify their own copy, so they don't see what the other core is doing.
In addition, memory works in blocks, so writing flag[PRODUCER] will very likely write flag[CONSUMER] as well (because ints are 4 bytes and most of today's processors have memory blocks of 64 bytes).
Another problem would be operation reordering. Both the compiler and the processor are allowed to swap instructions. There are constraints that dictate that the single threaded execution result shouldn't change, but obviously they don't apply here.
The compiler might also figure out that you are setting turn to x and then checking if it is x, which is obviously true in a single threaded world so it can be optimized away.
This list is not exhaustive. There are many more things (some platform specific) that could happen and break your program.
So, at the very least try to use std::atomic types with strong memory ordering (memory_order_seq_cst). All your variables should be std::atomic. This gives you hardware support but it will be a lot slower.
This may still not work, because you might still have some piece of code where you read a value and then change it. That is not atomic, because some other thread might have changed the data after your read and before your change.
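As a sketch of the std::atomic suggestion applied to the code in the question: the class PetersonLock and the function run_peterson are names invented here, and all loads and stores use the default memory_order_seq_cst. The counter itself stays a plain int, because it is only touched inside the critical section.

```cpp
#include <atomic>
#include <thread>

// Two-thread Peterson lock built on sequentially consistent atomics.
class PetersonLock {
public:
    PetersonLock() : turn_(0) {
        wants_[0].store(false);
        wants_[1].store(false);
    }
    void lock(int me) {
        const int other = 1 - me;
        wants_[me].store(true);  // announce interest (seq_cst by default)
        turn_.store(other);      // give way to the other thread
        while (wants_[other].load() && turn_.load() == other) {
            std::this_thread::yield();  // wait without hogging the core
        }
    }
    void unlock(int me) { wants_[me].store(false); }

private:
    std::atomic<bool> wants_[2];
    std::atomic<int> turn_;
};

// Runs a producer (+1) and a consumer (-1) for the given number of
// iterations each and returns the final counter value.
int run_peterson(int iterations) {
    PetersonLock lock;
    int counter = 0;  // plain int, protected only by the lock
    std::thread producer([&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(0);
            ++counter;
            lock.unlock(0);
        }
    });
    std::thread consumer([&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock(1);
            --counter;
            lock.unlock(1);
        }
    });
    producer.join();
    consumer.join();
    return counter;
}
```

The sequentially consistent ordering is what forbids the store/load reordering that breaks the plain-int version on multi-core machines. In real code a std::mutex is the simpler and usually faster choice.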
I tried to use the new synchronization barrier API from Windows 8, but the following simple code sometimes hangs on Windows 8:
#undef WINVER
#define WINVER 0x0603
#include "windows.h"
#include <thread>
#include <vector>
int main()
{
SYNCHRONIZATION_BARRIER barrier;
int count = 32;
InitializeSynchronizationBarrier (&barrier, count, -1);
std::vector<std::thread> threads;
for (int thr_num = 0; thr_num < count; thr_num++)
{
threads.emplace_back ([&barrier, thr_num] // barrier is a local, so it must be captured
{
for (int i = 0; i < 100000; i++)
EnterSynchronizationBarrier (&barrier, 0);
});
}
for (auto &thr : threads)
thr.join ();
return 0;
}
Tested on Windows 8.1 64-bit on 32-core dual-Xeon E5 2630. It hangs roughly one time out of ten launches.
It seems to work normally in Windows 10 (on another machine). Is this a bug in Windows 8 that got fixed, or is this not a correct usage of EnterSynchronizationBarrier (maybe you can't call it in a loop)? There isn't much information about this function; has anybody even used it?
Not that it matters years later, except perhaps to show that some problems are too obscure for Stack Overflow to deliver close attention in useful time, but your usage is correct, if extreme, and the stress you have put the called function to does look to have exposed its problems with memory barriers.
In your fragment, a synchronisation barrier is prepared for 32 threads and you create 32 threads which each proceed to 100000 phases of synchronised work. All 32 reach their call number N to EnterSynchronizationBarrier before all are released on their way to their call number N+1. It should work. It likely would if your phases had any substance.
The stress is that each phase between calls is just however few instructions are involved in looping back to repeat the call. While the last thread to end phase N is in its call, it signals the others to leave, and they have a good chance of leaving (and even of reentering the function to end their phase N+1) while the thread that ends phase N is still doing its internal bookkeeping.
In this bookkeeping are two counters. One, named Barrier according to Microsoft's symbol files, is decremented as threads enter the synchronisation barrier. The other, named LeftBarrier, is incremented as they leave it. The thread that ends a phase resets Barrier from LeftBarrier (which should be the count of all participating threads) and resets LeftBarrier to 1. Or so it goes as a simplification.
The complicated reality is that the Barrier count is overloaded: its high bit signifies the change of phase. If a thread that waits at the synchronisation barrier is spinning rather than blocking on an event, then what it checks for while spinning is whether the high bit in Barrier has changed. It therefore really matters exactly how the counters get reset in the ending thread's bookkeeping. The sequence is: read LeftBarrier; write LeftBarrier as 1; write Barrier as the old LeftBarrier with the high bit toggled.
What I think happens is that without a memory barrier, the Barrier count can be written before LeftBarrier, but because Barrier has a toggled high bit, a spinning thread comes out of its spin and increments LeftBarrier from another processor before the first resets it to 1. The increment gets lost, after which all bets are off because subsequent phases will find that LeftBarrier at the end of a phase is no longer the count of participating threads.
Windows 8 and 8.1 have no memory barrier here. Windows 10 does, though I believe it's in the wrong place and that Windows Vista and Windows 7 had it correctly between the two writes. The implementation was anyway reworked completely for Version 1607 so that it now uses the WaitOnAddress functionality, much as sketched by a later Raymond Chen blog than the one cited by one of your correspondents. At the time of the cited blog, Microsoft, though possibly not Raymond, surely knew of the function's two earlier code changes regarding memory barriers.
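To illustrate the phase idea in portable terms, here is a sketch of a reusable barrier built on a mutex and a condition variable. It is emphatically not Microsoft's implementation: PhaseBarrier and run_phases are names made up for this example. Because the mutex orders the count reset before the phase change, the lost-increment scenario described above cannot arise in it.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: the last thread to arrive resets the count and then
// advances the phase number, all under one mutex.
class PhaseBarrier {
public:
    explicit PhaseBarrier(int participants)
        : participants_(participants), waiting_(0), phase_(0) {}

    void arrive_and_wait() {
        std::unique_lock<std::mutex> guard(mutex_);
        const unsigned my_phase = phase_;
        if (++waiting_ == participants_) {
            waiting_ = 0;  // reset the count before announcing the new phase
            ++phase_;
            released_.notify_all();
        } else {
            released_.wait(guard, [&] { return phase_ != my_phase; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable released_;
    const int participants_;
    int waiting_;
    unsigned phase_;
};

// Stress loop like the one in the question: no work between phases.
// Returns how often a thread left the barrier before all peers had arrived.
int run_phases(int thread_count, int phases) {
    PhaseBarrier barrier(thread_count);
    std::atomic<long> arrivals(0);
    std::atomic<int> violations(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&] {
            for (int p = 0; p < phases; ++p) {
                arrivals.fetch_add(1);
                barrier.arrive_and_wait();
                if (arrivals.load() < static_cast<long>(p + 1) * thread_count) {
                    violations.fetch_add(1);
                }
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return violations.load();
}
```

This trades the spinning fast path for simplicity, so it says nothing about the performance of the real function, only about the bookkeeping order that has to hold.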
In C/C++, how can I make threads (POSIX pthreads/Windows threads) safely report progress back to the main thread, i.e. the progress of the execution or of the work that I've decided to perform in the thread?
Is it possible to report the progress as a percentage?
I'm going to assume a very simple case of a main thread, and one function. What I'd recommend is passing in a pointer to an atomic (as suggested by Kirill above) for each time you launch the thread. Assuming C++11 here.
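#include <atomic>
#include <chrono>
#include <thread>
#include <vector>
using namespace std;
void threadedFunction(atomic<int>* progress)
{
for(int i = 0; i < 100; i++)
{
progress->store(i); // updates the variable safely
chrono::milliseconds dura( 2000 );
this_thread::sleep_for(dura); // Sleeps for a bit
}
}
int main(int argc, char** argv)
{
// Make and launch 10 threads
// atomics cannot be copied or moved, so size the vector up front;
// this also keeps the pointers handed to the threads valid
vector<atomic<int>> atomics(10);
for(auto& a : atomics)
a.store(0);
vector<thread> threads;
for(int i = 0; i < 10; i++)
{
threads.emplace_back(threadedFunction, &atomics[i]);
}
// Monitor the threads down here
// use atomics[n].load() to get the value from the atomics
for(auto& t : threads)
t.join(); // a joinable thread must be joined before it is destroyed
return 0;
}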
I think that'll do what you want. I omitted polling the threads, but you get the idea. I'm passing in an object that both the main thread and the child thread know about (in this case the atomic<int> variable) that they both can update and/or poll for results. If you're not on a full C++11 thread/atomic support compiler, use whatever your platform determines, but there's always a way to pass a variable (at the least a void*) into the thread function. And that's how you get something to pass information back and forth via non-statics.
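Since the question asks about percentages, here is one possible shape for the monitoring side. It is only a sketch: do_work, run_and_monitor and the 1 ms of pretend work per step are all invented for illustration, and it assumes total_steps is greater than zero.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical worker: performs total_steps units of work and publishes
// the completed share as a whole-number percentage.
void do_work(int total_steps, std::atomic<int>& percent) {
    for (int step = 1; step <= total_steps; ++step) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // pretend work
        percent.store(step * 100 / total_steps);
    }
}

// Starts the worker, polls its progress until it reports 100 percent,
// and returns the last value read. Assumes total_steps > 0.
int run_and_monitor(int total_steps) {
    std::atomic<int> percent(0);
    std::thread worker(do_work, total_steps, std::ref(percent));
    int last_seen = 0;
    while ((last_seen = percent.load()) < 100) {
        // A real program would update a progress bar or log line here.
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
    worker.join();
    return last_seen;
}
```

The worker only ever stores and the monitor only ever loads, so a single atomic<int> is enough; no mutex is needed for this one-way reporting.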
The best way to solve this is to use C++ atomics for that. Declare in some visible enough place:
std::atomic<int> my_thread_progress(0);
In a simple case this should be a static variable; in a more complex case it should be a data field of some object that manages the threads, or something similar.
On many platforms this will be slightly paranoid, because almost everywhere read and write operations on integers are atomic. But using atomics still makes sense because:
You get a guarantee that this will work fine on any platform, even on a 16-bit CPU or other unusual hardware;
Your code will be easier to read. A reader will immediately see that this is a shared variable without any comments being needed. Since it is updated with load/store methods, it is easier to follow what is going on.
EDIT
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C (http://download.intel.com/products/processor/manual/325462.pdf)
Volume 3A: 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary
So here's my scenario. First, I have a structure -
struct interval
{
double lower;
double higher;
};
Now my thread function -
void* thread_function(void* i)
{
interval* in = (interval*)i;
double a = in->lower;
cout << a;
pthread_exit(NULL);
}
In main, let's say I create these 2 threads -
pthread_t one,two;
interval i;
i.lower = 0; i.higher = 5;
pthread_create(&one,NULL,thread_function,&i);
i.lower=10; i.higher = 20;
pthread_create(&two,NULL,thread_function, &i);
pthread_join(one,NULL);
pthread_join(two,NULL);
Here's the problem. Ideally, thread "one" should print out 0 and thread "two" should print out 10. However, this doesn't happen. Occasionally, I end up getting two 10s.
Is this by design? In other words, by the time the thread is created, the value in i.lower has been changed already in main, therefore both threads end up using the same value?
Is this by design?
Yes. It's unspecified when exactly the threads start and when they will access that value. You need to give each one of them their own copy of the data.
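One way to hand each thread its own copy, sketched with std::thread instead of pthreads (the function lowers_seen and the result vector are made up for this example): capturing the struct by value in the lambda copies it at thread-creation time, so later changes in main cannot affect it.

```cpp
#include <thread>
#include <vector>

struct interval {
    double lower;
    double higher;
};

// Returns the lower bound each of the two threads observed.
std::vector<double> lowers_seen() {
    std::vector<double> seen(2, -1.0);  // one slot per thread
    interval i;
    i.lower = 0;
    i.higher = 5;
    std::thread one([i, &seen] { seen[0] = i.lower; });  // i copied here
    i.lower = 10;  // modifies main's i only, not thread one's copy
    i.higher = 20;
    std::thread two([i, &seen] { seen[1] = i.lower; });  // second copy
    one.join();
    two.join();
    return seen;
}
```

The result is always 0 for the first thread and 10 for the second, no matter when each thread actually gets scheduled.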
Your application is non-deterministic.
There is no telling when a thread will be scheduled to run.
Note: Creating a thread does not mean it will start executing immediately (or even first). The second thread created may actually start running before the first (it all depends on the OS and hardware).
To get deterministic behavior each thread must be given its own data (that is not modified by the main thread).
pthread_t one,two;
interval oneData,twoData;
oneData.lower = 0; oneData.higher = 5;
pthread_create(&one,NULL,thread_function,&oneData);
twoData.lower=10; twoData.higher = 20;
pthread_create(&two,NULL,thread_function, &twoData);
pthread_join(one,NULL);
pthread_join(two,NULL);
I would not call it by design.
I would rather refer to it as a side-effect of scheduling policy. But the observed behavior is what I would expect.
This is the classic 'race condition'; where the results vary depending on which thread wins the 'race'. You have no way of knowing which thread will 'win' each time.
Your analysis of the problem is correct; you simply don't have any guarantees that the first thread created will be able to read i.lower before the data is changed on the next line of your main function. This is in some sense the heart of why it can be hard to think about multithreaded programming at first.
The straightforward solution to your immediate problem is to keep different intervals with different data, and pass a separate one to each thread, i.e.
interval i, j;
i.lower = 0; j.lower = 10;
pthread_create(&one,NULL,thread_function,&i);
pthread_create(&two,NULL,thread_function,&j);
This will of course solve your immediate problem. But soon you'll probably wonder what to do if you want multiple threads actually using the same data. What if thread 1 wants to make changes to i and thread 2 wants to take these into account? There would hardly be much point in doing multithreaded programming if each thread had to keep its memory separate from the others (well, leaving message passing out of the picture for now). Enter mutex locks! I thought I'd give you a heads up that you'll want to look into this topic sooner rather than later, as it'll also help you understand the basics of threads in general and the required change in mentality that goes along with multithreaded programming.
I seem to recall that this is a decent short introduction to pthreads, including getting started with understanding locking etc.
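To give a first taste of the mutex locks mentioned above, here is a small sketch in which two threads really do share one interval. It uses std::mutex rather than pthread_mutex_t for brevity, and widen_concurrently is a name invented for the example.

```cpp
#include <mutex>
#include <thread>

struct interval {
    double lower;
    double higher;
};

// Two threads widen the same interval; the mutex serializes each
// read-modify-write step so that no update is lost.
interval widen_concurrently(int steps_per_thread) {
    interval shared = {0.0, 0.0};
    std::mutex guard;
    auto widen = [&] {
        for (int s = 0; s < steps_per_thread; ++s) {
            std::lock_guard<std::mutex> lock(guard);  // held for one step
            shared.lower -= 1.0;
            shared.higher += 1.0;
        }
    };
    std::thread one(widen);
    std::thread two(widen);
    one.join();
    two.join();
    return shared;
}
```

Without the lock_guard line, the two threads would race on shared.lower and shared.higher, and some of the updates could be lost, just like the overwritten value in your original example.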