how to use std::atomic_signal_fence() with semaphore and volatile? - c++

std::atomic_signal_fence() Establishes memory synchronization ordering ... between a thread and a signal handler executed on the same thread.
-- cppreference
To find an example illustrating this, I looked at bames53's similar question on Stack Overflow. However, the answer may not suit my x86_64 environment, since x86_64 CPUs have a strong memory model that forbids Store-Store reordering ^1. Its example executes correctly even without std::atomic_signal_fence() in my x86_64 environment.
So I made a Store-Load reordering example suitable for x86_64, modeled on Jeff Preshing's post. The example code is not that short, so I opened this question instead of appending it to bames53's similar question.
main() and signal_handler() run in the same thread (i.e. they share the same tid) in a single-core environment. main() can be interrupted at any time by signal_handler(). If no signal fences are used, then in the binary generated by g++ -O2 the order of X = 1; r1 = Y; is swapped (Store(X)-Load(Y) is optimized to Load(Y)-Store(X)). The same happens with Y = 1; r2 = X;. So if main() is interrupted just after Load(Y), the result is r1 == 0 and r2 == 0. Thus, in the following code, the assert on line (C) will fail. But if lines (A) and (B) are uncommented, the assert should never fail, to the best of my understanding, since a signal fence is used to protect synchronization between main() and signal_handler().
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <semaphore.h>
#include <signal.h>
#include <unistd.h>
sem_t endSema;
// volatile int synchronizer;
int X, Y;
int r1, r2;
void signal_handler(int sig) {
signal(sig, SIG_IGN);
Y = 1;
// std::atomic_signal_fence(std::memory_order_seq_cst); // (A) if uncommented, assert still may fail
r2 = X;
signal(SIGINT, signal_handler);
sem_post(&endSema); // if changed to the following, assert never fail
// synchronizer = 1;
}
int main(int argc, char* argv[]) {
std::srand(std::time(nullptr));
sem_init(&endSema, 0, 0);
signal(SIGINT, signal_handler);
for (;;) {
while(std::rand() % std::stol(argv[1]) != 0); // argv[1] ~ 1000'000
X = 1;
// std::atomic_signal_fence(std::memory_order_seq_cst); // (B) if uncommented, assert still may fail.
r1 = Y;
sem_wait(&endSema); // if changed to the following, assert never fail
// while (synchronizer == 0); synchronizer = 0;
std::cout << "r1=" << r1 << " r2=" << r2 << std::endl;
if (r1 == 0) assert(r2 != 0); // (C)
Y = 0; r1 = 0; r2 = 0; X = 0;
}
return 0;
}
First, a semaphore is used to synchronize main() with signal_handler(). In this version, the assert always fails after receiving around 30 SIGINTs, with or without the signal fence. It seems that std::atomic_signal_fence() did not work as I expected.
Second, if the semaphore is replaced with volatile int synchronizer, the program never seems to fail, with or without the signal fence.
What's wrong with the code? Did I misunderstand the cppreference doc? Or is there a more suitable example for this topic on x86_64 in which I can observe the effects of std::atomic_signal_fence?
Below is some relevant info:
compiling & running env: CentOS 8 (Linux 4.18.0) x86_64 single CPU core.
Compiler: g++ (GCC) 8.3.1 20190507
Compiling command: g++ -std=c++17 -o ordering -O2 ordering.cpp -pthread
Run with ./ordering 1000000, then keep pressing Ctrl-C to invoke the signal handler.
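For comparison, here is a minimal sketch of the pattern the cppreference quote describes. It assumes lock-free std::atomic flags and a SIGUSR1 raised synchronously with std::raise, so the outcome is deterministic. The names payload, ready, seen, on_sigusr1 and run_demo are illustrative and not part of the code above. The fence only constrains compiler reordering between a thread and a handler running on that same thread; it issues no CPU fence instruction.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>

int payload = 0;                   // plain data handed to the handler
std::atomic<bool> ready{false};    // lock-free flag, relaxed accesses only
std::atomic<int> seen{-1};         // what the handler observed

extern "C" void on_sigusr1(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Keep the compiler from hoisting the payload read above the flag check.
        std::atomic_signal_fence(std::memory_order_acquire);
        seen.store(payload, std::memory_order_relaxed);
    }
}

int run_demo() {
    std::signal(SIGUSR1, on_sigusr1);
    payload = 42;
    // Keep the compiler from sinking the payload store below the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
    std::raise(SIGUSR1);           // handler runs synchronously on this thread
    return seen.load(std::memory_order_relaxed);
}
```

Calling run_demo() from main() should return 42. With an asynchronously delivered signal (such as Ctrl-C), the fences are what keep the compiler from reordering the payload and flag accesses; they do not fix races in the program's own logic.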

Related

Under which conditions is renaming a file an atomic operation on Linux?

Assumption
According to the documentation, calling rename on Linux performs an atomic replace:
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing.
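The usual way to rely on this guarantee is to write the new content under a temporary name and then rename it over the target. A minimal sketch, assuming both paths live on the same filesystem; replace_file, read_file and the ".tmp" suffix are hypothetical names chosen for illustration:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Write `content` to a temporary sibling file, then rename it over `target`.
// Readers opening `target` see either the old or the new file, never a
// partially written one.
void replace_file(const fs::path& target, const std::string& content) {
    fs::path tmp = target;
    tmp += ".tmp";  // hypothetical temporary name next to the target
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        out << content;
    }  // stream is closed (and flushed) here
    fs::rename(tmp, target);  // throws fs::filesystem_error on failure
}

// Read a whole file into a string.
std::string read_file(const fs::path& p) {
    std::ifstream in(p, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in), {});
}
```

This only illustrates the replace step itself; it makes no claim about how a concurrent link() on the target path behaves.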
Contradiction
However, if I run a simple parallel test, with each thread running the following operations:
create a file foo<thread_id>
rename foo<thread_id> to cache_file (cache_file is the same for every thread)
hard link cache_file to bar<thread_id>
it will eventually fail to create the hard link with the following error:
filesystem error: cannot create hard link: No such file or directory [/app/cache_file] [/app/bar1]. So it seems that the replacement of cache_file is not atomic, as concurrently creating a hard link causes an error. (Note that cache_file is actually stored in a content addressable storage, so the overwrite shouldn't do any harm, as the content of the replaced file and the replacement file is exactly the same.)
Question
Shouldn't the hard link creation always succeed if the replacement operation is atomic, so that the created hard link refers to either the replaced file or the replacement file?
See the minimal working example on godbolt or here:
#include <thread>
#include <vector>
#include <string>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <filesystem>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
auto myrename(std::filesystem::path const& from,
std::filesystem::path const& to, int variant) -> bool {
switch (variant) {
case 0: // c++ rename
std::filesystem::rename(from, to);
return true;
case 1: // c rename
return std::rename(from.c_str(), to.c_str()) == 0;
case 2: // linux rename (same as std::rename?)
return rename(from.c_str(), to.c_str()) == 0;
case 3: // linux link and unlink (no overwrite)
return (link(from.c_str(), to.c_str()) == 0 or errno == EEXIST)
and unlink(from.c_str()) == 0;
case 4: // linux renameat2 without overwrite
return renameat2(0, from.c_str(), 0, to.c_str(), RENAME_NOREPLACE) == 0
or (errno == EEXIST and unlink(from.c_str()) == 0);
default:
return false;
}
}
auto mylink(std::filesystem::path const& from, std::filesystem::path const& to,
int variant) -> bool {
if (std::filesystem::exists(to)) std::filesystem::remove(to);
switch (variant) {
case 0: // c++ hard link
std::filesystem::create_hard_link(from, to);
return true;
case 1: // linux link
return link(from.c_str(), to.c_str()) == 0;
default:
return false;
}
}
auto create_store_stage(std::string const& id) noexcept -> bool {
try {
auto cwd = std::filesystem::current_path();
auto cache = cwd / "cache_file"; // common
auto ifile = cwd / ("foo" + id); // thread local
auto ofile = cwd / ("bar" + id); // thread local
return std::ofstream{ifile}.put('x') // 1. create input file
and myrename(ifile, cache, 0) // 2. store in cache
and mylink(cache, ofile, 0); // 3. hard link to output file
} catch (std::exception const& e) {
std::cout << "caught exception: " << e.what() << std::endl;
return false;
}
}
int main(int argc, const char *argv[]) {
bool fail{};
std::vector<std::thread> threads{};
for (int i{}; i < std::thread::hardware_concurrency(); ++i) {
threads.emplace_back([id = std::to_string(i), &fail]{
while (not fail and create_store_stage(id)) {}
if (errno) perror(("thread " + id + " failed with error").c_str());
fail = true;
});
}
std::for_each(threads.begin(), threads.end(), [](auto& t) { t.join(); });
return 0;
}
Additional Notes
tested on Debian 11 (Kernel 5.10.0) and Ubuntu 20.04 (Kernel 5.8.0)
tested with GCC 9.3/10.2 and Clang 10.0.0/11.0.0 (although I don't expect the compiler to be the issue)
myrename() variants 3 and 4 work correctly (both do not overwrite, which is fine for a content addressable storage)
as expected, neither variant 0 nor variant 1 of mylink() makes any difference (both use link(), according to strace)
interesting: on WSL2 with Ubuntu 20.04 (Kernel 4.4.0) the myrename() variants 0, 1, and 2 work correctly, but 3 and 4 fail with filesystem error: cannot create hard link: Invalid argument [/app/cache_file] [/app/bar3] and Invalid argument, respectively
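The non-overwriting approach of variant 3 can also be written with std::filesystem error codes instead of errno. This is a sketch under the assumption that an already existing cache entry holds identical content (as in a content-addressable store); publish_no_overwrite and store_stage are hypothetical helper names:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Link `from` to `to` unless `to` already exists, then drop the source name.
// An existing target is treated as success and is never replaced.
bool publish_no_overwrite(const fs::path& from, const fs::path& to) {
    std::error_code ec;
    fs::create_hard_link(from, to, ec);
    if (ec && ec != std::errc::file_exists) return false;
    return fs::remove(from, ec) && !ec;
}

// One create/store/link stage, mirroring the three steps of the test above.
bool store_stage(const fs::path& dir, const std::string& id) {
    const fs::path cache = dir / "cache_file";    // common
    const fs::path ifile = dir / ("foo" + id);    // per caller
    const fs::path ofile = dir / ("bar" + id);    // per caller
    if (!std::ofstream{ifile}.put('x')) return false;       // 1. create input
    if (!publish_no_overwrite(ifile, cache)) return false;  // 2. store in cache
    std::error_code ec;
    fs::remove(ofile, ec);                                   // drop a stale output link
    fs::create_hard_link(cache, ofile, ec);                  // 3. link to output
    return !ec;
}
```

Because the cache entry is never replaced once it exists, the later hard link always refers to a name that stays in place.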
Update
as pointed out by the busybee, link() should be atomic as well. The Linux man pages do not mention any atomic properties, while the POSIX specification explicitly does:
The link() function shall atomically create a new link for the existing file and the link count of the file shall be incremented by one.
as mentioned by numzero, this could be an unintended side-effect. But I did some testing and this behavior dates back to at least Kernel version 2.6.32.

create thread but process shows up?

#include <unistd.h>
#include <stdio.h>
#include <cstring>
#include <thread>
void test_cpu() {
printf("thread: test_cpu start\n");
int total = 0;
while (1) {
++total;
}
}
void test_mem() {
printf("thread: test_mem start\n");
int step = 20;
int size = 10 * 1024 * 1024; // 10Mb
for (int i = 0; i < step; ++i) {
char* tmp = new char[size];
memset(tmp, i, size);
sleep(1);
}
printf("thread: test_mem done\n");
}
int main(int argc, char** argv) {
std::thread t1(test_cpu);
std::thread t2(test_mem);
t1.join();
t2.join();
return 0;
}
Compile it with g++ -o test test.cc --std=c++11 -lpthread
I run the program on Linux and use top to monitor it.
I expect to see ONE process, but I see THREE.
It looks like std::thread is creating threads, so why do I end up with processes?
Linux does not implement threads as a separate kernel concept. It only has light-weight processes (LWPs), which the pthread library wraps to provide a POSIX-compatible thread interface. The main LWP creates its own address space, while each subsequent thread LWP shares the address space of the main LWP.
Many utilities, such as htop (which seems to be what the screenshot shows), list LWPs by default. To hide thread LWPs, open Setup (F2) -> Display Options and check the Hide kernel threads and Hide userland process threads options. There is also an option to highlight threads: Display threads in different color.
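On Linux this can be observed directly: threads of one process report the same getpid() but different kernel thread ids. A small sketch, assuming Linux and the raw SYS_gettid syscall; the Ids struct and the helper names are illustrative:

```cpp
#include <cassert>
#include <sys/syscall.h>
#include <sys/types.h>
#include <thread>
#include <unistd.h>

struct Ids {
    pid_t pid;  // process id, shared by all threads of the process
    pid_t tid;  // kernel thread (LWP) id, unique per thread
};

// Ids of the calling thread.
Ids current_ids() {
    return {getpid(), static_cast<pid_t>(syscall(SYS_gettid))};
}

// Ids as seen from a freshly created std::thread.
Ids ids_from_new_thread() {
    Ids ids{};
    std::thread t([&ids] { ids = current_ids(); });
    t.join();
    return ids;
}
```

The tid values are what tools such as top -H or htop show as separate rows, which is why one multithreaded process can look like several processes.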

Boost MPI doesn't free resources when listening for lists?

This is a follow-on question to How do I free a boost::mpi::request?. I'm noticing odd behavior when listening for lists rather than individual items. Is this my error or an error in Boost? I'm using MSVC and MSMPI, Boost 1.62. I'm pretty sure that it's not behaving properly on the wait for a cancelled job.
If you try version B with mpiexec -n 2 then you get a clean exit - if you try version A, it hangs indefinitely. Do you all see this as well? Is this a bug?
#include "boost/mpi.hpp"
#include "mpi.h"
#include <list>
#include "boost/serialization/list.hpp"
int main()
{
MPI_Init(NULL, NULL);
MPI_Comm regional;
MPI_Comm_dup(MPI_COMM_WORLD, &regional);
boost::mpi::communicator comm = boost::mpi::communicator(regional, boost::mpi::comm_attach);
if (comm.rank() == 1)
{
//VERSION A:
std::list<int> q;
boost::mpi::request z = comm.irecv<std::list<int>>(1, 0, q);
z.cancel();
z.wait();
//VERSION B:
// int q;
// boost::mpi::request z = comm.irecv<int>(1, 0, q);
// z.cancel();
// z.wait();
}
MPI_Comm_disconnect(&regional);
MPI_Finalize();
return 0;
}
This is clearly a bug in Boost.MPI.
For serialized types, like std::list, the cancel is forwarded from request::cancel() to request::handle_serialized_irecv, which does not handle ra_cancel properly.

Mach Threads Not Working on x86_64

I'm trying to write a simple "Hello, world!" program using Mach threads on x86_64. Unfortunately, the program crashes with a segmentation fault on my machine, and I can't seem to fix the problem. I couldn't find much documentation about Mach threads online, but I referred to the following C file which also makes use of Mach threads.
As far as I can tell, I'm doing everything correctly. I suspect that the segmentation fault is because I did not set up the thread's stack correctly, but I took the same approach as the reference file, which has the following code.
// This is for alignment. In particular note that the sizeof(void*) is necessary
// since it would usually specify the return address (i.e. we are aligning the call
// frame to a 16 byte boundary as required by the abi, but the stack pointer
// to point to the byte beyond that. Not doing this leads to funny behavior on
// the first access to an external function will fail due to stack misalignment
state.__rsp &= -16;
state.__rsp -= sizeof(void*);
Do you have any idea as to what I could be doing wrong?
#include <cstdint>
#include <iostream>
#include <system_error>
#include <unistd.h>
#include <mach/mach_init.h>
#include <mach/mach_types.h>
#include <mach/task.h>
#include <mach/thread_act.h>
#include <mach/thread_policy.h>
#include <mach/i386/thread_status.h>
void check(kern_return_t err)
{
if (err == KERN_SUCCESS) {
return;
}
auto code = std::error_code{err, std::system_category()};
switch (err) {
case KERN_FAILURE:
throw std::system_error{code, "failure"};
case KERN_INVALID_ARGUMENT:
throw std::system_error{code, "invalid argument"};
default:
throw std::system_error{code, "unknown error"};
}
}
void test()
{
std::cout << "Hello from thread." << std::endl;
}
int main()
{
auto page_size = ::getpagesize();
auto stack = new uint8_t[page_size];
auto thread = ::thread_t{};
auto task = ::mach_task_self();
check(::thread_create(task, &thread));
auto state = ::x86_thread_state64_t{};
auto count = ::mach_msg_type_number_t{x86_THREAD_STATE64_COUNT};
check(::thread_get_state(thread, x86_THREAD_STATE64,
(::thread_state_t)&state, &count));
auto stack_ptr = (uintptr_t)(stack + page_size);
stack_ptr &= -16;
stack_ptr -= sizeof(void*);
state.__rip = (uintptr_t)test;
state.__rsp = (uintptr_t)stack_ptr;
state.__rbp = (uintptr_t)stack_ptr;
check(::thread_set_state(thread, x86_THREAD_STATE64,
(::thread_state_t)&state, x86_THREAD_STATE64_COUNT));
check(::thread_resume(thread));
::sleep(1);
std::cout << "Done." << std::endl;
}
The reference file uses C++11; if compiling with GCC or Clang, you will need to supply the -std=c++11 flag.

Is this kind of optimization a compiler bug or not?

Environment: I use VS 2010/VS 2013 and the clang 3.4 prebuilt binary.
I found a bug in our production code and reduced the reproduction to the following:
#include <windows.h>
#include <process.h>
#include <stdio.h>
using namespace std;
bool s_begin_init = false;
bool s_init_done = false;
void thread_proc(void * arg)
{
DWORD tid = GetCurrentThreadId();
printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
if (!s_begin_init)
{
s_begin_init = true;
Sleep(20);
s_init_done = true;
}
else
{
while(!s_init_done) { ; }
}
printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
int main(int argc, char *argv[])
{
argc = argc ; argv = argv ;
for(int i = 0; i < 30; ++i)
{
_beginthread(thread_proc, 0, reinterpret_cast<void*>(i));
}
getchar();
return 0;
}
To compile and run the code:
cl /O2 /Zi /Favc.asm vc_O2_bug.cpp && vc_O2_bug.exe
Some of the threads spin forever in the while loop. Checking the generated assembly, I found that the code for
while(!s_init_done) {; }
is:
; Line 19
mov al, BYTE PTR ?s_init_done@@3_NA ; s_init_done
$LL2@thread_pro:
; Line 21
test al, al
je SHORT $LL2@thread_pro
; Line 23
It's obvious that with the -O2 optimization flag, VC copies s_init_done into the al register and then repeatedly tests al.
I then used the clang-cl.exe compiler driver to test the code. The result is the same, and the assembly code is equivalent.
It looks like the compiler thinks that s_init_done will never change, because the only statement that changes its value is in the "if" block, which is mutually exclusive with the current "else" branch.
I tried the same code with VS2013; the result is also the same.
My doubt is this: in the C++98/C++03 standard there is no concept of a thread, so the compiler can perform such an optimization as if for a single-threaded machine. But since C++11 has threads, and both clang 3.4 and VC2013 support C++11 well, my question is:
Is this kind of optimization a compiler bug under C++98/C++03, and under C++11, respectively?
BTW: When I use -O1 instead, or add the volatile qualifier to s_init_done, the bug disappears.
Your program contains data races on s_begin_init and s_init_done, and therefore has undefined behavior. Per C++11 §1.10/21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
The fix is to declare both boolean variables to be atomic:
std::atomic<bool> s_begin_init{false};
std::atomic<bool> s_init_done{false};
or to synchronize accesses to them with a mutex (I'll throw in a condition variable to avoid busy-waiting):
std::mutex mtx;
std::condition_variable cvar;
bool s_begin_init = false;
bool s_init_done = false;
void thread_proc(void * arg)
{
DWORD tid = GetCurrentThreadId();
printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
std::unique_lock<std::mutex> lock(mtx);
if (!s_begin_init)
{
s_begin_init = true;
lock.unlock();
Sleep(20);
lock.lock();
s_init_done = true;
cvar.notify_all();
}
else
{
while(!s_init_done) { cvar.wait(lock); }
}
printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
EDIT: I just noticed the mention of VS2010 in the OP. VS2010 does not support C++11 atomics, so you will have to use the mutex solution or take advantage of MSVC's non-standard extension that gives volatile variables acquire-release semantics:
volatile bool s_begin_init = false;
volatile bool s_init_done = false;
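For the std::atomic<bool> fix mentioned earlier in this answer, a portable sketch using std::thread instead of _beginthread might look like this. The use of exchange() to pick exactly one initializing thread, the init_runs counter and the run_threads helper are additions for illustration and not part of the original code:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

std::atomic<bool> s_begin_init{false};
std::atomic<bool> s_init_done{false};
std::atomic<int> init_runs{0};  // illustrative: counts how often init ran

void thread_proc() {
    // exchange() returns the previous value, so exactly one thread sees false.
    if (!s_begin_init.exchange(true)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));  // simulated init
        init_runs.fetch_add(1);
        s_init_done.store(true);
    } else {
        // Each iteration performs a real atomic load; it cannot be hoisted
        // out of the loop the way the plain bool read was.
        while (!s_init_done.load()) {
            std::this_thread::yield();
        }
    }
}

// Start n threads, wait for all of them, and report how often init ran.
int run_threads(int n) {
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i) threads.emplace_back(thread_proc);
    for (auto& t : threads) t.join();
    return init_runs.load();
}
```

With 30 threads, as in the question, run_threads(30) should return 1 and every thread should terminate.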