Environment: I'm using VS2010/VS2013 and the Clang 3.4 prebuilt binary.
I've found a bug in our production code. I minimized the repro to the following:
#include <windows.h>
#include <process.h>
#include <stdio.h>
using namespace std;
bool s_begin_init = false;
bool s_init_done = false;
void thread_proc(void * arg)
{
    DWORD tid = GetCurrentThreadId();
    printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
    if (!s_begin_init)
    {
        s_begin_init = true;
        Sleep(20);
        s_init_done = true;
    }
    else
    {
        while (!s_init_done) { ; }
    }
    printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
int main(int argc, char *argv[])
{
    argc = argc; argv = argv; // self-assignment to silence unused-parameter warnings
    for (int i = 0; i < 30; ++i)
    {
        _beginthread(thread_proc, 0, reinterpret_cast<void*>(i));
    }
    getchar();
    return 0;
}
To compile and run the code:
cl /O2 /Zi /Favc.asm vc_O2_bug.cpp && vc_O2_bug.exe
Some of the threads are stuck busy-waiting in the while loop. By checking the generated assembly, I found that the code for
while (!s_init_done) { ; }
is:
; Line 19
    mov   al, BYTE PTR ?s_init_done@@3_NA    ; s_init_done
$LL2@thread_pro:
; Line 21
    test  al, al
    je    SHORT $LL2@thread_pro
; Line 23
It's obvious that with the -O2 optimization flag, VC loads s_init_done into the al register once and then repeatedly tests al.
I then used the clang-cl.exe compiler driver to test the code. The result is the same, and the generated assembly is equivalent.
It looks like the compiler assumes the variable s_init_done will never change, because the only statement that modifies it is in the "if" block, which is mutually exclusive with the current "else" branch.
I tried the same code with VS2013; the result is the same.
My doubt is this: the C++98/C++03 standard has no concept of threads, so the compiler may perform such an optimization for a single-threaded machine. But C++11 does have threads, and both Clang 3.4 and VC2013 support C++11 well. So my question is:
Is this kind of optimization a compiler bug under C++98/C++03, and under C++11, respectively?
BTW: when I use -O1 instead, or add a volatile qualifier to s_init_done, the bug disappears.
Your program contains data races on s_begin_init and s_init_done, and therefore has undefined behavior. Per C++11 §1.10/21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
The fix is to declare both boolean variables to be atomic:
std::atomic<bool> s_begin_init{false};
std::atomic<bool> s_init_done{false};
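With the flags atomic, every load and store becomes a synchronization operation, so the spin loop is guaranteed to observe the store. As a usage sketch (mine, not part of the original answer; it assumes the two atomic declarations above, <atomic> included, and the original's <windows.h> Sleep), using exchange() also guarantees that exactly one thread takes the initializer branch, which the plain-bool version only achieved by timing luck:

void thread_proc(void * arg)
{
    if (!s_begin_init.exchange(true)) // atomically set the flag; only the first caller sees false
    {
        Sleep(20);          // simulated initialization work
        s_init_done = true; // seq_cst store: becomes visible to the spinning threads
    }
    else
    {
        while (!s_init_done) { ; } // seq_cst load on every iteration: the loop now terminates
    }
}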
Alternatively, synchronize accesses to them with a mutex (I'll throw in a condition variable to avoid busy-waiting):
#include <mutex>
#include <condition_variable>

std::mutex mtx;
std::condition_variable cvar;
bool s_begin_init = false;
bool s_init_done = false;

void thread_proc(void * arg)
{
    DWORD tid = GetCurrentThreadId();
    printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
    std::unique_lock<std::mutex> lock(mtx);
    if (!s_begin_init)
    {
        s_begin_init = true;
        lock.unlock();
        Sleep(20);
        lock.lock();
        s_init_done = true;
        cvar.notify_all();
    }
    else
    {
        while (!s_init_done) { cvar.wait(lock); }
    }
    printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
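(The while loop around cvar.wait(lock) matters: condition-variable waits may wake spuriously, so the predicate must be rechecked after every wakeup.)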
EDIT: I just noticed the mention of VS2010 in the OP. VS2010 does not support C++11 atomics, so you will have to use the mutex solution or take advantage of MSVC's non-standard extension that gives volatile variables acquire-release semantics:
volatile bool s_begin_init = false;
volatile bool s_init_done = false;
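For what it's worth, MSVC from VS2012 onward lets you control this extension explicitly: /volatile:ms keeps the acquire-release semantics described above, while /volatile:iso reverts volatile to its ISO C++ meaning, under which it is not sufficient for inter-thread synchronization. For example:

cl /O2 /volatile:ms vc_O2_bug.cpp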
Related
I change a global variable in one thread, and the change does not take effect in another thread unless I print it
Here's the code:
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

pthread_t thread_test[2];
bool test = true;

void* test1(void*)
{
    while (1)
    {
        //printf("test: %d\n", test);
        if (test)
            continue;
        printf("test test test\n");
        usleep(500000);
    }
}

void* test2(void*)
{
    int i = 0;
    while (i < 5)
    {
        printf("i: %d\n", i++);
        sleep(1);
    }
    test = false;
    return NULL;
}

int main()
{
    pthread_create(&thread_test[0], NULL, &test1, (void *)NULL);
    pthread_create(&thread_test[1], NULL, &test2, (void *)NULL);
    pthread_join(thread_test[0], NULL);
    return 0;
}
If the line printf("test: %d\n", test); is commented out, then the change to test in test2 does not take effect in test1.
The run results are shown below. With the printf commented out (screenshot omitted), "test test test" is never printed; with the printf left in (screenshot omitted), it does appear.
This has troubled me for a long time. I hope someone can explain it.
The compiler is probably optimizing the body of the while(1) loop on the assumption that the global test is always true.
I got the same results with the QNX compiler and the 'release' settings:
-g0 -O3
and got "test test test" (test turned false) with the 'debug' settings:
-O0 -g3 -ggdb
P.S.
volatile bool test = true;
with the 'release' settings also does the trick (despite what others wrote above).
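For completeness, the standard-sanctioned fix is std::atomic rather than volatile. A minimal sketch of the same scenario (my rewrite, using C++11 std::thread in place of the pthread calls):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> test{true}; // atomic: the load cannot be hoisted out of the loop

void reader()
{
    while (test.load()) { }   // re-reads test on every iteration
    std::printf("test test test\n");
}

void writer()
{
    std::this_thread::sleep_for(std::chrono::seconds(5));
    test.store(false);        // guaranteed to become visible to reader()
}

int main()
{
    std::thread t1(reader);
    std::thread t2(writer);
    t1.join();
    t2.join();
    return 0;
}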
Assumption
According to the documentation, calling rename on Linux performs an atomic replace:
If newpath already exists, it will be atomically replaced, so that there is no point at which another process attempting to access newpath will find it missing.
Contradiction
However, if I run a simple parallel test, with each thread running the following operations:
create a file foo<thread_id>
rename foo<thread_id> to cache_file (cache_file is the same for every thread)
hard link cache_file to bar<thread_id>
it will eventually fail to create the hard link with the following error:
filesystem error: cannot create hard link: No such file or directory [/app/cache_file] [/app/bar1]
So it seems that the replacement of cache_file is not atomic, as concurrently creating a hard link causes an error. (Note that cache_file is actually stored in a content-addressable storage, so the overwrite shouldn't do any harm, as the content of the replaced file and the replacement file is exactly the same.)
Question
Shouldn't the hard link creation always succeed if the replacement operation is atomic, so that the created hard link refers to either the replaced file or the replacement file?
See the minimal working example on godbolt or here:
#include <thread>
#include <vector>
#include <string>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <filesystem>
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

auto myrename(std::filesystem::path const& from,
              std::filesystem::path const& to, int variant) -> bool {
    switch (variant) {
    case 0: // c++ rename
        std::filesystem::rename(from, to);
        return true;
    case 1: // c rename
        return std::rename(from.c_str(), to.c_str()) == 0;
    case 2: // linux rename (same as std::rename?)
        return rename(from.c_str(), to.c_str()) == 0;
    case 3: // linux link and unlink (no overwrite)
        return (link(from.c_str(), to.c_str()) == 0 or errno == EEXIST)
            and unlink(from.c_str()) == 0;
    case 4: // linux renameat2 without overwrite
        return renameat2(0, from.c_str(), 0, to.c_str(), RENAME_NOREPLACE) == 0
            or (errno == EEXIST and unlink(from.c_str()) == 0);
    default:
        return false;
    }
}

auto mylink(std::filesystem::path const& from, std::filesystem::path const& to,
            int variant) -> bool {
    if (std::filesystem::exists(to)) std::filesystem::remove(to);
    switch (variant) {
    case 0: // c++ hard link
        std::filesystem::create_hard_link(from, to);
        return true;
    case 1: // linux link
        return link(from.c_str(), to.c_str()) == 0;
    default:
        return false;
    }
}

auto create_store_stage(std::string const& id) noexcept -> bool {
    try {
        auto cwd = std::filesystem::current_path();
        auto cache = cwd / "cache_file"; // common
        auto ifile = cwd / ("foo" + id); // thread local
        auto ofile = cwd / ("bar" + id); // thread local
        return std::ofstream{ifile}.put('x') // 1. create input file
           and myrename(ifile, cache, 0)     // 2. store in cache
           and mylink(cache, ofile, 0);      // 3. hard link to output file
    } catch (std::exception const& e) {
        std::cout << "caught exception: " << e.what() << std::endl;
        return false;
    }
}

int main(int argc, const char *argv[]) {
    bool fail{};
    std::vector<std::thread> threads{};
    for (int i{}; i < std::thread::hardware_concurrency(); ++i) {
        threads.emplace_back([id = std::to_string(i), &fail]{
            while (not fail and create_store_stage(id)) {}
            if (errno) perror(("thread " + id + " failed with error").c_str());
            fail = true;
        });
    }
    std::for_each(threads.begin(), threads.end(), [](auto& t) { t.join(); });
    return 0;
}
Additional Notes
tested on Debian 11 (Kernel 5.10.0) and Ubuntu 20.04 (Kernel 5.8.0)
tested with GCC 9.3/10.2 and Clang 10.0.0/11.0.0 (although I don't expect the compiler to be the issue)
myrename() variants 3 and 4 work correctly (both do not overwrite, which is fine for a content addressable storage)
as expected, neither variant 0 nor 1 of mylink() makes any difference (both use link(), according to strace)
interesting: on WSL2 with Ubuntu 20.04 (Kernel 4.4.0) the myrename() variants 0, 1, and 2 work correctly, but 3 and 4 fail with filesystem error: cannot create hard link: Invalid argument [/app/cache_file] [/app/bar3] and Invalid argument, respectively
Update
as pointed out by the busybee, link() should be atomic as well. The Linux man pages do not mention any atomic properties, while the POSIX specification explicitly does:
The link() function shall atomically create a new link for the existing file and the link count of the file shall be incremented by one.
as mentioned by numzero, this could be an unintended side-effect. But I did some testing and this behavior dates back to at least Kernel version 2.6.32.
std::atomic_signal_fence() Establishes memory synchronization ordering ... between a thread and a signal handler executed on the same thread.
-- cppreference
In order to find an example for this illustration, I looked at bames53's similar question on Stack Overflow. However, the answer there may not suit my x86_64 environment, since the x86_64 CPU has a strong memory model and forbids Store-Store reordering [1]. That example executes correctly even without std::atomic_signal_fence() in my x86_64 environment.
So I made a Store-Load reordering example suitable for x86_64, modeled after Jeff Preshing's post. The example code is not that short, so I opened this question instead of appending to bames53's question.
main() and signal_handler() run in the same thread (i.e. they share the same TID) in a single-core environment. main() can be interrupted at any time by signal_handler(). If no signal fences are used, the compiler may swap the order of X = 1; r1 = Y; in the generated binary when compiling with g++ -O2 (Store(X)-Load(Y) is optimized to Load(Y)-Store(X)); the same goes for Y = 1; r2 = X;. So if main() is interrupted just after Load(Y), the final result is r1 == 0 and r2 == 0, and line (C) in the following code will fail its assert. But if lines (A) and (B) are uncommented, it should never assert-fail, since the signal fences are there to order the accesses between main() and signal_handler(), to the best of my understanding.
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
#include <semaphore.h>
#include <signal.h>
#include <unistd.h>

sem_t endSema;
// volatile int synchronizer;
int X, Y;
int r1, r2;

void signal_handler(int sig) {
    signal(sig, SIG_IGN);
    Y = 1;
    // std::atomic_signal_fence(std::memory_order_seq_cst); // (A) if uncommented, the assert may still fail
    r2 = X;
    signal(SIGINT, signal_handler);
    sem_post(&endSema); // if changed to the following, the assert never fails
    // synchronizer = 1;
}

int main(int argc, char* argv[]) {
    std::srand(std::time(nullptr));
    sem_init(&endSema, 0, 0);
    signal(SIGINT, signal_handler);
    for (;;) {
        while (std::rand() % std::stol(argv[1]) != 0); // argv[1] ~ 1'000'000
        X = 1;
        // std::atomic_signal_fence(std::memory_order_seq_cst); // (B) if uncommented, the assert may still fail
        r1 = Y;
        sem_wait(&endSema); // if changed to the following, the assert never fails
        // while (synchronizer == 0); synchronizer = 0;
        std::cout << "r1=" << r1 << " r2=" << r2 << std::endl;
        if (r1 == 0) assert(r2 != 0); // (C)
        Y = 0; r1 = 0; r2 = 0; X = 0;
    }
    return 0;
}
Firstly, a semaphore is used to synchronize main() with signal_handler(). In this version, the assert always fails after around 30 received SIGINTs, with or without the signal fences. It seems that std::atomic_signal_fence() does not work as I expected.
Secondly, if the semaphore is replaced with the volatile int synchronizer, the program seems never to fail, with or without the signal fences.
What's wrong with the code? Did I misunderstand the cppreference doc? Or is there more suitable example code for this topic in an x86_64 environment, with which I can observe the effects of std::atomic_signal_fence()?
Below is some relevant info:
compiling & running env: CentOS 8 (Linux 4.18.0) x86_64 single CPU core.
Compiler: g++ (GCC) 8.3.1 20190507
Compile command: g++ -std=c++17 -o ordering -O2 ordering.cpp -pthread
Run with ./ordering 1000000, then keep pressing Ctrl-C to invoke the signal handler.
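A diagnostic step I would add (not in the original post): to see whether the compiler really exchanges the store and the load, inspect the generated assembly directly, e.g.:

g++ -std=c++17 -O2 -S -o ordering.s ordering.cpp

and check whether the store to X appears before or after the load of Y in ordering.s.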
I checked similar questions on the site, but I couldn't find anything that matches my scenario here. This is the code I'm trying to run (requires C++14):
#include <iostream>
#include <chrono>
#include <thread>

using namespace std;

class countdownTimer {
public:
    using duration_t = chrono::high_resolution_clock::duration;

    countdownTimer(duration_t duration) : duration{ duration }, paused{ true } {}
    countdownTimer(const countdownTimer&) = default;
    countdownTimer(countdownTimer&&) = default;
    countdownTimer& operator=(countdownTimer&&) = default;
    countdownTimer& operator=(const countdownTimer&) = default;

    void start() noexcept {
        if (started) return;
        startTime = chrono::high_resolution_clock::now();
        endTime = startTime + duration;
        started = true;
        paused = false;
    }
    void pause() noexcept {
        if (paused || !started) return;
        pauseBegin = chrono::high_resolution_clock::now();
        paused = true;
    }
    void resume() noexcept {
        if (!paused || !started) return;
        auto pauseDuration = chrono::high_resolution_clock::now() - pauseBegin;
        startTime += pauseDuration;
        endTime += pauseDuration;
        paused = false;
    }
    double remainingSeconds() const noexcept {
        auto ret = double{ 0.0 };
        if (!started) ret = chrono::duration_cast<chrono::duration<double>>(duration).count();
        else if (paused) ret = chrono::duration_cast<chrono::duration<double>>(duration - (pauseBegin - startTime)).count();
        else ret = chrono::duration_cast<chrono::duration<double>>(duration - (chrono::high_resolution_clock::now() - startTime)).count();
        return (ret < 0.0) ? 0.0 : ret;
    }
    duration_t remainingTime() const noexcept {
        auto ret = duration_t{ 0ms };
        if (!started) ret = chrono::duration_cast<duration_t>(duration);
        else if (paused) ret = chrono::duration_cast<duration_t>(duration - (pauseBegin - startTime));
        else ret = chrono::duration_cast<duration_t>(duration - (chrono::high_resolution_clock::now() - startTime));
        return (ret < 0ms) ? 0ms : ret;
    }
    bool isPaused() const noexcept { return paused; }
    bool hasFinished() const noexcept { return remainingTime() == 0s; }
    void reset() noexcept {
        started = false;
        paused = true;
    }

private:
    chrono::high_resolution_clock::time_point startTime;
    chrono::high_resolution_clock::time_point endTime;
    chrono::high_resolution_clock::time_point pauseBegin;
    duration_t duration;
    bool paused;
    bool started;
};

int main() {
    countdownTimer timer(10s);
    timer.start();
    while (!timer.hasFinished()) {
        cout << timer.remainingSeconds() << endl;
        this_thread::sleep_for(1s);
    }
}
It's a simple countdown timer class that I wrote for one of my projects. The client code in main() is pretty self-explanatory, it should output a countdown from 10 to 0, and then exit the program. With no optimization or -O/-O1, it does exactly that:
10
8.99495
7.98992
6.9849
5.97981
4.9748
3.96973
2.9687
1.9677
0.966752
Program ended with exit code: 0
But if I step up the optimization to >=-O2, the program just keeps outputting 10, and runs forever. The countdown simply doesn't work, it's stuck at the starting value.
I'm using the latest Xcode on OS X. clang --version says Apple LLVM version 7.3.0 (clang-703.0.31).
The strange part is that my code doesn't contain any weird self-written loops, undefined behavior, or anything like that, it's pretty much just standard library calls, so it's very strange that optimization breaks it.
Any ideas?
PS: I haven't tried it on other compilers, but I'm about to. I'll update the question with those results.
bool started is not initialized.
If you initialize it to false, it works with -O2:
live example
You can find errors like this using the Undefined behavior sanitizer:
$ g++ -std=c++14 -O2 -g -fsanitize=undefined -fno-omit-frame-pointer main.cpp && ./a.out
main.cpp:18:9: runtime error: load of value 106, which is not a valid value for type 'bool'
The bug is in your constructor:
countdownTimer(duration_t duration)
: duration{ duration }, paused{ true } {}
You forgot to initialize started. This triggers undefined behavior when you call start().
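Concretely, the fix is a one-line change to the constructor's member initializer list:

countdownTimer(duration_t duration)
    : duration{ duration }, paused{ true }, started{ false } {}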
No version of clang that I have convenient access to will diagnose this error, but GCC versions 5 and 6 (on Linux - I don't have GCC on my Mac anymore) will:
$ g++ -O2 -Wall -Wextra -std=c++14 test.cc
test.cc: In function ‘int main()’:
test.cc:18:13: warning: ‘*((void*)& timer +33)’ is used uninitialized in this function [-Wuninitialized]
if (started) return;
^~~~~~~
test.cc:74:20: note: ‘*((void*)& timer +33)’ was declared here
countdownTimer timer(10s);
^~~~~
(My copy of Xcode seems to be a bit out of date, with Apple LLVM version 7.0.2 (clang-700.1.81); it does not change the behavior of the program at -O2. It's possible that your clang would diagnose this error if you turned on the warnings.)
(I have filed a bug report with GCC about the IR gobbledygook in the diagnostics.)
I'm using Tcl library version 8.6.4 (compiled with Visual Studio 2015, 64-bit) to interpret some Tcl commands from a C/C++ program.
I noticed that if I create interpreters from different threads, the second one ends up in an infinite loop:
#include "tcl.h"
#include <boost/thread.hpp>
#include <boost/filesystem.hpp>
void runScript()
{
Tcl_Interp* pInterp = Tcl_CreateInterp();
std::string sTclPath = boost::filesystem::current_path().string() + "/../../stg/Debug/lib/tcl";
const char* setvalue = Tcl_SetVar( pInterp, "tcl_library", sTclPath.c_str(), TCL_GLOBAL_ONLY );
assert( setvalue != NULL );
int i = Tcl_Init( pInterp );
assert( i == TCL_OK );
int nTclResult = Tcl_Eval( pInterp, "puts \"Hello\"" );
assert( nTclResult == TCL_OK );
Tcl_DeleteInterp( pInterp );
}
int main( int argc, char* argv[] )
{
Tcl_FindExecutable(NULL);
runScript();
runScript();
boost::thread thrd1( runScript );
thrd1.join(); // works OK
boost::thread thrd2( runScript );
thrd2.join(); // never joins
return 1;
}
The infinite loop is here, within the Tcl source code:

void
TclInitNotifier(void)
{
    ThreadSpecificData *tsdPtr;
    Tcl_ThreadId threadId = Tcl_GetCurrentThread();

    Tcl_MutexLock(&listLock);
    for (tsdPtr = firstNotifierPtr; tsdPtr && tsdPtr->threadId != threadId;
            tsdPtr = tsdPtr->nextPtr) {
        /* Empty loop body. */
    }
    // I never exit this loop because, after the first thread was joined,
    // at some point tsdPtr == tsdPtr->nextPtr
Am I doing something wrong? Is there any special function call I'm missing?
Note: TCL_THREADS was not set while I compiled Tcl. However, I feel like I'm doing nothing wrong here. Also, adding
/* Empty loop body. */
if ( tsdPtr != NULL && tsdPtr->nextPtr == tsdPtr )
{
    tsdPtr = NULL;
    break;
}
within the loop apparently fixes the issue. But I'm not very confident in modifying 3rd party library source code...
After reporting the bug to the Tcl team, I was asked to try again with a Tcl library compiled with TCL_THREADS enabled. That fixed the issue.
TCL_THREADS was disabled because I had compiled on Windows using a CMakeLists.txt file found on the web: it was actually written for Linux and disabled thread support because it could not find pthread on my machine. I finally compiled the Tcl libraries using the scripts provided by the Tcl team: threading is enabled by default, and the infinite loop is gone!
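For reference, building on Windows with the Tcl team's own makefile looks something like the following; OPTS=threads is my assumption of how thread support was requested for 8.6-era sources, and the paths will differ on your machine:

cd tcl8.6.4\win
nmake -f makefile.vc OPTS=threads
nmake -f makefile.vc install INSTALLDIR=C:\Tcl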