EnterSynchronizationBarrier hangs in Windows 8 - C++

I tried to use the new API for synchronization barriers from Windows 8, but the following simple code sometimes hangs on Windows 8:
#undef WINVER
#define WINVER 0x0603
#include "windows.h"
#include <thread>
#include <vector>
int main()
{
    SYNCHRONIZATION_BARRIER barrier;
    int count = 32;
    InitializeSynchronizationBarrier (&barrier, count, -1);
    std::vector<std::thread> threads;
    for (int thr_num = 0; thr_num < count; thr_num++)
    {
        threads.emplace_back ([&barrier]   // capture the barrier by reference
        {
            for (int i = 0; i < 100000; i++)
                EnterSynchronizationBarrier (&barrier, 0);
        });
    }
    for (auto &thr : threads)
        thr.join ();
    return 0;
}
Tested on Windows 8.1 64-bit on 32-core dual-Xeon E5 2630. It hangs roughly one time out of ten launches.
It seems that in Windows 10 it works normally (on another machine). Is this a bug in Windows 8 that got fixed, or is this not a correct usage of EnterSynchronizationBarrier (maybe you can't call it in a loop?)? There's not much information about this function; has anybody even used it?

Not that it matters years later, except perhaps to show that some problems are too obscure for Stack Overflow to deliver close attention in useful time, but your usage is correct, if extreme, and the stress you have put the called function under does look to have exposed its problems with memory barriers.
In your fragment, a synchronisation barrier is prepared for 32 threads and you create 32 threads which each proceed to 100000 phases of synchronised work. All 32 reach their call number N to EnterSynchronizationBarrier before all are released on their way to their call number N+1. It should work. It likely would if your phases had any substance.
The stress is that each phase between calls is just however few instructions are involved in looping back to repeat the call. While the last thread to end phase N is in its call, it signals the others to leave, and they have a good chance of leaving (and even of reentering the function to end their phase N+1) while the thread that ends phase N is still doing its internal bookkeeping.
In this bookkeeping are two counters. One, named Barrier according to Microsoft's symbol files, is decremented as threads enter the synchronisation barrier. The other, named LeftBarrier, is incremented as they leave it. The thread that ends a phase resets Barrier from LeftBarrier (which should be the count of all participating threads) and resets LeftBarrier to 1. Or so it goes as a simplification.
The complicated reality is that the Barrier count is overloaded: its high bit signifies the change of phase. If a thread that waits at the synchronisation barrier is spinning rather than blocking on an event, then what it checks for while spinning is whether the high bit in Barrier has changed. It therefore really matters exactly how the counters get reset in the ending thread's bookkeeping. The sequence is: read LeftBarrier; write LeftBarrier as 1; write Barrier as the old LeftBarrier with the high bit toggled.
What I think happens is that without a memory barrier, the Barrier count can be written before LeftBarrier, but because Barrier has a toggled high bit, a spinning thread comes out of its spin and increments LeftBarrier from another processor before the first resets it to 1. The increment gets lost, after which all bets are off because subsequent phases will find that LeftBarrier at the end of a phase is no longer the count of participating threads.
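A much-simplified reconstruction of that sequence (not Microsoft's code; Barrier and LeftBarrier are the names from the symbol files, everything else here is illustration) shows where the race bites:
#include <atomic>

std::atomic<long> Barrier;      // high bit encodes the phase
std::atomic<long> LeftBarrier;  // counts threads that have left

void end_phase() {
    long left = LeftBarrier.load(std::memory_order_relaxed);      // the thread count
    LeftBarrier.store(1, std::memory_order_relaxed);               // (A) reset
    Barrier.store(left ^ 0x80000000L, std::memory_order_relaxed);  // (B) release spinners
    // With no memory barrier between (A) and (B), (B) may become visible
    // first: a spinner leaves, re-enters for its next phase, and increments
    // LeftBarrier on another processor before (A) lands, so the store of 1
    // overwrites and loses that increment.
}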
Windows 8 and 8.1 have no memory barrier here. Windows 10 does, though I believe it's in the wrong place and that Windows Vista and Windows 7 had it correctly between the two writes. The implementation was anyway reworked completely for Version 1607 so that it now uses the WaitOnAddress functionality, much as sketched by a later Raymond Chen blog than the one cited by one of your correspondents. At the time of the cited blog, Microsoft, though possibly not Raymond, surely knew of the function's two earlier code changes regarding memory barriers.


Best way to implement a periodic Linux task in C++20

I have a periodic task in C++, running on an embedded Linux platform, that has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received say that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer the C++ standard library... of course unless there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will use an "optimal" implementation on any (supported) platform, and I'm getting the feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

int main() {
    if (not is_submilli_capable())
        exit(1);
    while (true) {
        auto next_time = next_period_start();
        do_the_magic();
        std::this_thread::sleep_until(next_time);
    }
}
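next_period_start() is not shown above; a minimal version consistent with this loop, assumed here for illustration, schedules against a fixed epoch so one late wakeup does not drift all later ones:
#include <chrono>

std::chrono::steady_clock::time_point next_period_start() {
    using namespace std::chrono;
    static const auto start = steady_clock::now();
    static long n = 0;
    return start + (++n) * 5ms;   // absolute deadlines, no cumulative drift
}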
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with gcc 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only regards triggering the task periodically and with minimal jitter
Since the application is supposed to read and process the data every 5 ms, it is possible that a few times, it does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times... But if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk, and with the overhead of processing added, it may well take more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss out on any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes to sleep. Every 5 ms you wake a sleeping thread, which will most likely take far less than 5 ms. This way no invocation of do_the_magic is missed. The number of threads needed depends on how long do_the_magic takes to execute.
#include <chrono>
#include <cstdlib>
#include <ratio>
#include <semaphore>
#include <thread>
#include <vector>

// do_the_magic() and next_period_start() are as in the question.
constexpr int NUM_THREADS = 4;   // tune to how long do_the_magic() takes

// One semaphore per worker; thread i sleeps on sems[i] (C++20).
std::counting_semaphore<1> sems[NUM_THREADS] = {
    std::counting_semaphore<1>{0}, std::counting_semaphore<1>{0},
    std::counting_semaphore<1>{0}, std::counting_semaphore<1>{0}};

bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

void wake_some_thread() {
    static int i = 0;
    sems[i].release();           // release semaphore associated with thread i
    i = (i + 1) % NUM_THREADS;
}

void thread_func(int idx) {
    while (true) {
        sems[idx].acquire();     // wait for our semaphore
        do_the_magic();
    }
}

int main() {
    if (not is_submilli_capable())
        exit(1);
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_THREADS; i++)
        workers.emplace_back(thread_func, i);
    while (true) {
        auto next_time = next_period_start();
        wake_some_thread();      // releases a semaphore to wake a thread
        std::this_thread::sleep_until(next_time);
    }
}
Create as many semaphores as there are threads, where thread i waits on semaphore i. wake_some_thread can then release semaphores starting from index 0 up to NUM_THREADS - 1 and wrap around.
5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses to your process address space, use DMA without interrupts, etc.) - you'll have perfectly real-time processing. Otherwise, any system call will trigger a context switch, and the latency of that context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).
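As a concrete illustration of the pinning and scheduling steps above, a minimal Linux sketch (the CPU index and the priority value 80 are placeholders to adapt to your system; this normally requires root or CAP_SYS_NICE):
#include <pthread.h>
#include <sched.h>

// Pin the given thread to one isolated CPU and switch it to SCHED_FIFO.
bool make_realtime_on_cpu(pthread_t t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (pthread_setaffinity_np(t, sizeof(set), &set) != 0)
        return false;
    sched_param sp{};
    sp.sched_priority = 80;
    return pthread_setschedparam(t, SCHED_FIFO, &sp) == 0;
}
Call it with pthread_self() from the real-time thread (or with a std::thread's native_handle()) before entering the busy-wait loop.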

Why my std::atomic<int> variable isn't thread-safe?

I don't know why my code isn't thread-safe, as it outputs some inconsistent results.
value 48
value 49
value 50
value 54
value 51
value 52
value 53
My understanding of an atomic object is that it prevents its intermediate state from being exposed, so it should solve the problem when one thread is reading it while another thread is writing it.
I used to think I could use std::atomic, without a mutex, to solve the multi-threading counter-increment problem, but it doesn't look like that's the case.
I probably misunderstood what an atomic object is. Can someone explain?
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

void
inc(std::atomic<int>& a)
{
    while (true) {
        a = a + 1;
        printf("value %d\n", a.load());
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int
main()
{
    std::atomic<int> a(0);
    std::thread t1(inc, std::ref(a));
    std::thread t2(inc, std::ref(a));
    std::thread t3(inc, std::ref(a));
    std::thread t4(inc, std::ref(a));
    std::thread t5(inc, std::ref(a));
    std::thread t6(inc, std::ref(a));
    t1.join();
    t2.join();
    t3.join();
    t4.join();
    t5.join();
    t6.join();
    return 0;
}
I used to think I could use std::atomic, without a mutex, to solve the multi-threading counter-increment problem, but it doesn't look like that's the case.
You can, just not the way you have coded it. You have to think about where the atomic accesses occur. Consider this line of code …
a = a + 1;
1. First the value of a is fetched atomically. Let's say the value fetched is 50.
2. We add one to that value, getting 51.
3. Finally we atomically store that value into a using the = operator.
4. a ends up being 51.
5. We atomically load the value of a by calling a.load().
6. We print the value we just loaded by calling printf().
So far so good. But between steps 1 and 3 some other threads may have changed the value of a - for example to the value 54. So, when step 3 stores 51 into a it overwrites the value 54 giving you the output you see.
As @Sopel and @Shawn suggest in the comments, you can atomically increment the value in a using one of the appropriate functions (like fetch_add) or operator overloads (like operator++ or operator+=). See the std::atomic documentation for details.
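For illustration, a corrected sketch of the loop using fetch_add (the rest of the program unchanged):
void
inc(std::atomic<int>& a)
{
    while (true) {
        int value = a.fetch_add(1) + 1;   // one atomic read-modify-write
        printf("value %d\n", value);      // print the value this thread produced
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}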
Update
I added steps 5 and 6 above. Those steps can also lead to results that may not look correct.
Between the store at step 3 and the call to a.load() at step 5, other threads can modify the contents of a. After our thread stores 51 in a at step 3, it may find that a.load() returns some different number at step 5. Thus the thread that set a to the value 51 may not pass the value 51 to printf().
Another source of problems is that nothing coordinates the execution of steps 5. and 6. between two threads. So, for example, imagine two threads X and Y running on a single processor. One possible execution order might be this …
Thread X executes steps 1 through 5 above incrementing a from 50 to 51 and getting the value 51 back from a.load()
Thread Y executes steps 1 through 5 above incrementing a from 51 to 52 and getting the value 52 back from a.load()
Thread Y executes printf() sending 52 to the console
Thread X executes printf() sending 51 to the console
We've now printed 52 on the console, followed by 51.
Finally, there's another problem lurking at step 6. because printf() doesn't make any promises about what happens if two threads call printf() at the same time (at least I don't think it does).
On a multiprocessor system threads X and Y above might call printf() at exactly the same moment (or within a few ticks of exactly the same moment) on two different processors. We can't make any prediction about which printf() output will appear first on the console.
Note The documentation for printf mentions a lock introduced in C++17 "… used to prevent data races when multiple threads read, write, position, or query the position of a stream." In the case of two threads simultaneously contending for that lock we still can't tell which one will win.
Besides the increment of a being done non-atomically, the fetch of the value to display after the increment is non-atomic with respect to the increment. It is possible that one of the other threads increments a after the current thread has incremented it but before the fetch of the value to display. This would possibly result in the same value being shown twice, with the previous value skipped.
Another issue here is that the threads do not necessarily run in the order they were created. Thread 6 could execute its output before threads 3, 4, and 5, but after all four threads have incremented a. Since the thread that did the last increment displays its output earlier, you end up with the output not being sequential. This is more likely to happen on a system with fewer than six hardware threads available to run on.
Adding a small sleep between the various thread creations (e.g., sleep_for(10)) would make this less likely to occur, but would still not eliminate the possibility. The only sure way to keep the output ordered is to use some sort of exclusion (like a mutex) to ensure only one thread at a time has access to the increment and output code, and to treat the increment and output together as a single transaction that must run to completion before another thread tries to do an increment.
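A sketch of that transaction approach, assuming a namespace-scope mutex is added:
#include <mutex>

std::mutex output_mutex;   // guards the increment + print as one transaction

void
inc(std::atomic<int>& a)
{
    while (true) {
        {
            std::lock_guard<std::mutex> lock(output_mutex);
            a = a + 1;
            printf("value %d\n", a.load());
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}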
The other answers point out the non-atomic increment and various problems. I mostly want to point out some interesting practical details about exactly what we see when running this code on a real system. (x86-64 Arch Linux, gcc9.1 -O3, i7-6700k 4c8t Skylake).
It can be useful to understand why certain bugs or design choices lead to certain behaviours, for troubleshooting / debugging.
Use int tmp = ++a; to capture the fetch_add result in a local variable instead of reloading it from the shared variable. (And as 1202ProgramAlarm says, you might want to treat the whole increment and print as an atomic transaction if you insist on having your counts printed in order as well as being done properly.)
Or you might want to have each thread record the values it saw in a private data structure to be printed later, instead of also serializing threads with printf during the increments. (In practice all trying to increment the same atomic variable will serialize them waiting for access to the cache line; ++a will go in order so you can tell from the modification order which thread went in which order.)
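A sketch of that record-privately idea, assuming main owns one vector per thread (e.g. std::vector<std::vector<int>> results(6)) and passes std::ref(results[i]) at thread creation:
void
inc(std::atomic<int>& a, std::vector<int>& seen)
{
    for (int i = 0; i < 100000; i++)
        seen.push_back(++a);   // atomic RMW; keep the result privately
}
After joining, main can print each thread's vector without the threads ever having serialized on stdout during the increments.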
Fun fact: a.store(1 + a.load(std::memory_order_relaxed), std::memory_order_release) is what you might do for a variable that was only written by 1 thread, but read by multiple threads. You don't need an atomic RMW because no other thread ever modifies it. You just need a thread-safe way to publish updates. (Or better, in a loop keep a local counter and just .store() it without loading from the shared variable.)
If you used the default a = ... for a sequentially-consistent store, you might as well have done an atomic RMW on x86. One good way to compile that is with an atomic xchg; mov + mfence is as expensive (or more).
What's interesting is that despite the massive problems with your code, no counts were lost or stepped on (no duplicate counts), merely printing reordered. So in practice the danger wasn't encountered because of other effects going on.
I tried it on my own machine and did lose some counts. But after removing the sleep, I just got reordering. (I copy-pasted about 1000 lines of the output into a file, and sort -u to uniquify the output didn't change the line count. It did move some late prints around though; presumably one thread got stalled for a while.) My testing didn't check for the possibility of lost counts, skipped by not saving the value being stored into a, and instead reloading it. I'm not sure there's a plausible way for that to happen here without multiple threads reading the same count, which would be detected.
Store + reload, even a seq-cst store which has to flush the store buffer before it can reload, is very fast compared to printf making a write() system call. (The format string includes a newline and I didn't redirect output to a file so stdout is line-buffered and can't just append the string to a buffer.)
(write() system calls on the same file descriptor are serializing in POSIX: write(2) is atomic. Also, printf(3) itself is thread-safe on GNU/Linux, as required by C++17, and probably by POSIX long before that.)
Stdio locking in printf happens to be enough serialization in almost all cases: the thread that just unlocked stdout and left printf can do the atomic increment and then try to take the stdout lock again.
The other threads were all blocked trying to take the lock on stdout. One (other?) thread can wake up and take the lock on stdout, but for its increment to race with the other thread it would have to enter and leave printf and load a the first time before that other thread commits its a = ... seq-cst store.
This does not mean it's actually safe
Just that testing this specific version of the program (at least on x86) doesn't easily reveal the lack of safety. Interrupts or scheduling variations, including competition from other things running on the same machine, certainly could block a thread at just the wrong time.
My desktop has 8 logical cores so there were enough for every thread to get one, not having to get descheduled. (Although normally that would tend to happen on I/O or when waiting on a lock anyway).
With the sleep there, it is not unlikely for multiple threads to wake up at nearly the same time and race with each other in practice on real x86 hardware. It's so long that timer granularity becomes a factor, I think. Or something like that.
Redirecting output to a file
With stdout open on a non-TTY file, it's full-buffered instead of line-buffered, and doesn't always make a system call while holding the stdout lock.
(I got a 17MiB file in /tmp from hitting control-C a fraction of a second after running ./a.out > output.)
This makes it fast enough for threads to actually race with each other in practice, showing the expected bugs of duplicate values. (A thread reads a but loses ownership of the cache line before it stores (tmp)+1, resulting in two or more threads doing the same increment. And/or multiple threads reading the same value when they reload a after flushing their store buffer.)
1228589 unique lines (sort -u | wc) out of 1291035 total lines, so ~5% of the output lines were duplicates.
I didn't check if it was usually one value duplicated multiple times or if it was usually only one duplicate. Or how far backward the value ever jumped. If a thread happened to be stalled by an interrupt handler after loading but before storing val+1, it could be quite far. Or if it actually slept or blocked for some reason, it could rewind indefinitely far.

Multi Threading - Peterson's algorithm not working

Here I use Peterson's algorithm to implement mutual exclusion.
I have two very simple threads, one to increase a counter by 1, another to reduce it by 1.
#include <pthread.h>
#include <stdio.h>

const int PRODUCER = 0, CONSUMER = 1;
int counter;
int flag[2];
int turn;

void *producer(void *param)
{
    flag[PRODUCER] = 1;
    turn = CONSUMER;
    while (flag[CONSUMER] && turn == CONSUMER);
    counter++;
    flag[PRODUCER] = 0;
    return NULL;
}

void *consumer(void *param)
{
    flag[CONSUMER] = 1;
    turn = PRODUCER;
    while (flag[PRODUCER] && turn == PRODUCER);
    counter--;
    flag[CONSUMER] = 0;
    return NULL;
}
They work fine when I just run them once.
But when I run them again and again in a loop, strange things happen.
Here is my main function.
int main(int argc, char *argv[])
{
    int case_count = 0;
    counter = 0;
    while (counter == 0)
    {
        printf("Case: %d\n", case_count++);
        pthread_t tid[2];
        pthread_attr_t attr[2];
        pthread_attr_init(&attr[0]);
        pthread_attr_init(&attr[1]);
        counter = 0;
        flag[0] = 0;
        flag[1] = 0;
        turn = 0;
        printf("Counter is initially set to %d\n", counter);
        pthread_create(&tid[0], &attr[0], producer, NULL);
        pthread_create(&tid[1], &attr[1], consumer, NULL);
        pthread_join(tid[0], NULL);
        pthread_join(tid[1], NULL);
        printf("counter is now %d\n", counter);
    }
    return 0;
}
I run the two threads again and again, until in one case the counter isn't zero.
Then, after several cases, the program always stops! Sometimes after hundreds of cases, sometimes thousands, or even tens of thousands.
It means that in one case the counter isn't zero. But why? The two threads modify the counter in the critical section, and each increases or decreases it only once. Why would the counter not be zero?
Then I ran this code on other computers, and stranger things happened - on some computers the program seems to have no problem, while others have the same problem as me! Why?
By the way, on my computer I run this code in a VMware virtual machine with Ubuntu 16.04. The other computers are also Ubuntu 16.04, but not all of them are virtual machines. The computers with the problem include both virtual and real machines.
Peterson's algorithm only works on single-core processors/single-CPU systems.
That's because they don't do real parallel processing: two atomic operations never get executed at the same time there.
If you have 2 or more CPUs/CPU cores, the number of atomic operations that can be executed at the same time increases by one for each CPU (core).
This means that even if an integer assignment is atomic, it can be executed multiple times at the same time on different CPUs/cores.
In your case turn=CONSUMER/PRODUCER; is simply executed twice at the same time on different CPUs/cores.
Deactivate all CPU cores but one for your program and it should work fine.
You need hardware support to implement any kind of thread-safe algorithm.
There are many reasons why your code is not working as you intended. The simplest one is that the cores have individual caches. So your program starts on, say, two cores. Both cache flag as 0, 0. They both modify their own copy, so they don't see what the other core is doing.
In addition, memory works in blocks, so writing flag[PRODUCER] will very likely write flag[CONSUMER] as well (because ints are 4 bytes and most of today's processors have memory blocks of 64 bytes).
Another problem would be operation reordering. Both the compiler and the processor are allowed to swap instructions. There are constraints that dictate that the single threaded execution result shouldn't change, but obviously they don't apply here.
The compiler might also figure out that you are setting turn to x and then checking if it is x, which is obviously true in a single threaded world so it can be optimized away.
This list is not exhaustive. There are many more things (some platform specific) that could happen and break your program.
So, at the very least try to use std::atomic types with strong memory ordering (memory_order_seq_cst). All your variables should be std::atomic. This gives you hardware support but it will be a lot slower.
This will still not work, because you might still have some piece of code where you read a value and then change it. That is not atomic, because some other thread might have changed the data after your read and before your change.
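For illustration, a sketch of the producer side with the protocol variables as std::atomic (operations default to memory_order_seq_cst; the consumer is symmetric). This addresses the visibility and reordering problems described above:
#include <atomic>

const int PRODUCER = 0, CONSUMER = 1;
std::atomic<int> flag[2];   // zero-initialized at namespace scope
std::atomic<int> turn;
int counter;                // only touched inside the critical section

void *producer(void *param)
{
    flag[PRODUCER].store(1);   // seq_cst: cannot be reordered past the loads below
    turn.store(CONSUMER);
    while (flag[CONSUMER].load() && turn.load() == CONSUMER)
        ;                      // spin until it is our turn
    counter++;                 // critical section
    flag[PRODUCER].store(0);
    return nullptr;
}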

C++ multiple threads and processes in vector

While going through a C++ tutorial book (it's in Spanish, so I apologize if my translation to English is not as proper as it should be) I have come across a particular code snippet that I do not fully understand in terms of the different processes that are happening in the background. For example, in terms of multiple address spaces, how would I determine if these are all within the context of a single process (given that multiple threads are being added over each push to the vector)? And how would I determine if each thread is different from the others if they are all performing the exact same computation?
#include <cstdlib>
#include <iostream>
#include <vector>
#include <thread>

using namespace std;

int addthreads = 0;

void squarenum(int x) {   // note: actually accumulates cubes, x * x * x
    addthreads += x * x * x;
}

int main() {
    vector<thread> septhread;
    for (int i = 1; i <= 9; i++) {
        septhread.push_back(thread(&squarenum, i));
    }
    for (auto& th : septhread) {
        th.join();
    }
    cout << "Your answer = " << addthreads << endl;
    system("pause");
    return 0;
}
Every run yields 2025, that much I understand. My basic issue is understanding the first part of my question.
By the way, the compile command required (if you are on Linux):
g++ -std=gnu++ -pthread threadExample.cpp -o threadExample
A thread is a "thread of execution" within a process, sharing the same address space, resources, etc. Depending on the operating system, hardware, etc, they may or may not run on the same CPU or CPU Thread.
A major issue with thread programming, as a result, is managing access to resources. If two threads access the same resource at the same time, Undefined Behavior can occur. If both are reading it may be fine, but if one is writing at the same moment the other is reading, numerous outcomes can ensue. The simplest is that the two threads are running on separate CPUs or cores, and the reader does not see the change made by the writer because of caching. Another is that the reader sees only a portion of the write (if it's a 64-bit value, it might see only 32 bits changed).
Your code performs a read-modify-store operation, so the first thread to come along sees the value '0', calculates the result of x*x*x, adds it to 0 and stores the result.
Meanwhile the next thread comes along and does the same thing, it also sees 0 before performing its calculation, so it writes 0 + x*x*x to the value, overwriting the first thread.
These threads might not run in the order that you launched them; it's possible for thread #9 to get the first execution cycle rather than thread #1.
You may need to consider looking at std::atomic or std::mutex.
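For what it's worth, a minimal sketch of the std::atomic route for this particular program (only the accumulator changes; main stays as it is):
#include <atomic>

std::atomic<int> addthreads(0);   // shared accumulator, now race-free

void squarenum(int x) {
    // fetch_add does the read-modify-write as one atomic step, so no
    // thread can overwrite another's contribution.
    addthreads.fetch_add(x * x * x);
}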

Reporting a thread progress to main thread in C++

In C/C++, how can I make threads (POSIX pthreads/Windows threads) give me a safe way to report progress back to the main thread on the execution of the work I've decided to perform with the thread?
Is it possible to report the progress in terms of percentage ?
I'm going to assume a very simple case of a main thread and one function. What I'd recommend is passing in a pointer to an atomic (as suggested by Kirill above) for each thread you launch. Assuming C++11 here.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

using namespace std;

void threadedFunction(atomic<int>* progress)
{
    for (int i = 0; i < 100; i++)
    {
        progress->store(i); // updates the variable safely
        chrono::milliseconds dura(2000);
        this_thread::sleep_for(dura); // sleeps for a bit
    }
    return;
}

int main(int argc, char** argv)
{
    // Make and launch 10 threads.
    // std::atomic is neither copyable nor movable, so size the vector up
    // front instead of growing it with emplace_back.
    vector<atomic<int>> atomics(10);
    vector<thread> threads;
    for (int i = 0; i < 10; i++)
    {
        threads.emplace_back(threadedFunction, &atomics[i]);
    }
    // Monitor the threads down here
    // use atomics[n].load() to get the value from the atomics
    for (auto& t : threads)
        t.join();
    return 0;
}
I think that'll do what you want. I omitted polling the threads, but you get the idea. I'm passing in an object that both the main thread and the child thread know about (in this case the atomic<int> variable) that they both can update and/or poll for results. If you're not on a compiler with full C++11 thread/atomic support, use whatever your platform provides, but there's always a way to pass a variable (at the least a void*) into the thread function. And that's how you get something to pass information back and forth via non-statics.
The best way to solve this is to use C++ atomics for that. Declare in some visible enough place:
std::atomic<int> my_thread_progress(0);
In a simple case this should be a static variable; in a more complex case this should be a data field of some object that manages threads, or something similar.
On many platforms this will seem slightly paranoid, because almost everywhere read and write operations on integers are atomic. But using atomics still makes sense because:
You have a guarantee that this will work fine on any platform, even on a 16-bit CPU or whatever unusual hardware;
Your code will be easier to read. The reader will immediately see that this is a shared variable, without needing any comments. And once it is updated with load/store methods, it is easier to follow what is going on.
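For example, the main thread can then poll it periodically (a sketch, assuming the worker stores a percentage from 0 to 100 as it progresses):
while (my_thread_progress.load() < 100) {
    printf("progress: %d%%\n", my_thread_progress.load());
    std::this_thread::sleep_for(std::chrono::milliseconds(250));
}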
EDIT
Intel® 64 and IA-32 Architectures Software Developer’s Manual
Combined Volumes: 1, 2A, 2B, 2C, 3A, 3B and 3C (http://download.intel.com/products/processor/manual/325462.pdf)
Volume 3A: 8.1.1 Guaranteed Atomic Operations
The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
Reading or writing a byte
Reading or writing a word aligned on a 16-bit boundary
Reading or writing a doubleword aligned on a 32-bit boundary