Why does -O2 or greater optimization in clang break this code?

I checked similar questions on the site, but I couldn't find anything that matches my scenario here. This is the code I'm trying to run (requires C++14):
#include <iostream>
#include <chrono>
#include <thread>
using namespace std;
class countdownTimer {
public:
using duration_t = chrono::high_resolution_clock::duration;
countdownTimer(duration_t duration) : duration{ duration }, paused{ true } {}
countdownTimer(const countdownTimer&) = default;
countdownTimer(countdownTimer&&) = default;
countdownTimer& operator=(countdownTimer&&) = default;
countdownTimer& operator=(const countdownTimer&) = default;
void start() noexcept {
if (started) return;
startTime = chrono::high_resolution_clock::now();
endTime = startTime + duration;
started = true;
paused = false;
}
void pause() noexcept {
if (paused || !started) return;
pauseBegin = chrono::high_resolution_clock::now();
paused = true;
}
void resume() noexcept {
if (!paused || !started) return;
auto pauseDuration = chrono::high_resolution_clock::now() - pauseBegin;
startTime += pauseDuration;
endTime += pauseDuration;
paused = false;
}
double remainingSeconds() const noexcept {
auto ret = double{ 0.0 };
if (!started) ret = chrono::duration_cast<chrono::duration<double>>(duration).count();
else if (paused) ret = chrono::duration_cast<chrono::duration<double>>(duration - (pauseBegin - startTime)).count();
else ret = chrono::duration_cast<chrono::duration<double>>(duration - (chrono::high_resolution_clock::now() - startTime)).count();
return (ret < 0.0) ? 0.0 : ret;
}
duration_t remainingTime() const noexcept {
auto ret = duration_t{ 0ms };
if (!started) ret = chrono::duration_cast<duration_t>(duration);
else if (paused) ret = chrono::duration_cast<duration_t>(duration - (pauseBegin - startTime));
else ret = chrono::duration_cast<duration_t>(duration - (chrono::high_resolution_clock::now() - startTime));
return (ret < 0ms) ? 0ms : ret;
}
bool isPaused() const noexcept { return paused; }
bool hasFinished() const noexcept { return remainingTime() == 0s; }
void reset() noexcept {
started = false;
paused = true;
}
private:
chrono::high_resolution_clock::time_point startTime;
chrono::high_resolution_clock::time_point endTime;
chrono::high_resolution_clock::time_point pauseBegin;
duration_t duration;
bool paused;
bool started;
};
int main() {
countdownTimer timer(10s);
timer.start();
while (!timer.hasFinished()) {
cout << timer.remainingSeconds() << endl;
this_thread::sleep_for(1s);
}
}
It's a simple countdown timer class that I wrote for one of my projects. The client code in main() is pretty self-explanatory, it should output a countdown from 10 to 0, and then exit the program. With no optimization or -O/-O1, it does exactly that:
10
8.99495
7.98992
6.9849
5.97981
4.9748
3.96973
2.9687
1.9677
0.966752
Program ended with exit code: 0
But if I raise the optimization level to -O2 or higher, the program just keeps printing 10 and runs forever. The countdown is stuck at its starting value.
I'm using the latest Xcode on OS X. clang --version says Apple LLVM version 7.3.0 (clang-703.0.31).
The strange part is that my code doesn't contain any weird self-written loops, undefined behavior, or anything like that, it's pretty much just standard library calls, so it's very strange that optimization breaks it.
Any ideas?
PS: I haven't tried it on other compilers, but I'm about to. I'll update the question with those results.

bool started is not initialized.
If you initialize it to false, it works with -O2:
live example
You can find errors like this using the Undefined behavior sanitizer:
$ g++ -std=c++14 -O2 -g -fsanitize=undefined -fno-omit-frame-pointer main.cpp && ./a.out
main.cpp:18:9: runtime error: load of value 106, which is not a valid value for type 'bool'

The bug is in your constructor:
countdownTimer(duration_t duration)
: duration{ duration }, paused{ true } {}
You forgot to initialize started. This triggers undefined behavior when you call start().
No version of clang that I have convenient access to will diagnose this error, but GCC versions 5 and 6 (on Linux - I don't have GCC on my Mac anymore) will:
$ g++ -O2 -Wall -Wextra -std=c++14 test.cc
test.cc: In function ‘int main()’:
test.cc:18:13: warning: ‘*((void*)& timer +33)’ is used uninitialized in this function [-Wuninitialized]
if (started) return;
^~~~~~~
test.cc:74:20: note: ‘*((void*)& timer +33)’ was declared here
countdownTimer timer(10s);
^~~~~
(My copy of Xcode seems to be a bit out of date, with Apple LLVM version 7.0.2 (clang-700.1.81); it does not change the behavior of the program at -O2. It's possible that your clang would diagnose this error if you turned on the warnings.)
(I have filed a bug report with GCC about the IR gobbledygook in the diagnostics.)

Related

std::chrono behaves different in arm

So the following code I am using for a hacky production fix due to time constraints. Basically I have a static function that is called from many places, much more than intended and it is causing another section of the application to choke. So I thought I would come up with a quick fix to limit the calls to the overworked function to once every two seconds. This works just fine in x86 using clang or gcc.
#include <chrono>
#include <iostream>
#include <unistd.h>
#include <mutex>
#include <thread>
static void staticfunction()
{
static std::mutex mutex;
static auto t0(std::chrono::high_resolution_clock::now());
std::unique_lock<std::mutex> lg_mutex(mutex, std::try_to_lock);
if( lg_mutex.owns_lock())
{
auto t1 = std::chrono::high_resolution_clock::now();
if( 2000 <= std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() )
{
// Make a check in other section of application
std::cout << "Check true after " << std::dec
<< std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
<< " ms.\n";
t0 = std::chrono::high_resolution_clock::now();
}
}
}
int main()
{
while(true) {
std::thread t1(staticfunction);
std::thread t2(staticfunction);
std::thread t3(staticfunction);
std::thread t4(staticfunction);
t1.join();
t2.join();
t3.join();
t4.join();
}
return 0;
}
Prints
Check true after 2000 ms.
Check true after 2000 ms.
Check true after 2002 ms.
Check true after 2005 ms.
....
However, for our ARM controller I cross-compiled using Linaro 7.1, and now the condition of the if statement isn't satisfied until 10 seconds have passed. Out of curiosity I compared against 1 second instead of two (using duration_cast to seconds instead of milliseconds doesn't change anything), and if(1 <= ....count()) was true after half a second.
Is this a bug in the Linaro compiler? Or are the clocks on our ARM controller off? The cross-compile flags are -mcpu=cortex-a7 -mfloat-abi=hard -marm -march=armv7ve, if that makes a difference.
EDIT: multithreaded, same output.

C++ in ARM MCU: Need help to set up a simple timer

I'm programming an ATSAME70 and I'm trying to program a simple timer using the SysTick interrupt available in Cortex M MCUs, but I don't know what is going wrong.
If I write this code in a simple main.cpp file:
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"
volatile uint32_t g_ticks = 0;
extern "C" {
void SysTick_Handler(void)
{
g_ticks++;
}
}
class Timer
{
private:
uint32_t start;
public:
Timer() : start(g_ticks) {}
float elapsed() const { return (g_ticks - start) / 1000.0f; }
};
int main()
{
init();
SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */
Timer t;
while (t.elapsed() < 1.0f);
Led::on();
while (true);
}
It works, the led lights up properly after 1 second.
But if I try to keep it clean and separate the program in the following files:
// timer.hpp
#include <cstdint>
class Timer
{
private:
uint32_t start;
public:
Timer();
float elapsed() const;
};
// timer.cpp
#include "timer.hpp"
volatile uint32_t g_ticks = 0;
extern "C" {
void SysTick_Handler(void)
{
g_ticks++;
}
}
Timer::Timer() : start(g_ticks) {}
float Timer::elapsed() const
{
return (g_ticks - start) / 1000.0f;
}
// main.cpp
#include <cstdint>
#include "init.h"
#include "led.hpp"
#include "timer.hpp"
int main()
{
init();
SysTick_Config(300000000 / 1000); /* Clock is running at 300 MHz */
Timer t;
while (t.elapsed() < 1.0f);
Led::on();
while (true);
}
It doesn't work anymore: the program reaches the first while loop and gets stuck there. I think g_ticks is being corrupted when I read it in t.elapsed(), but I don't know what is happening. Does anybody know where I'm going wrong?
init() is just a function in which I initialize all needed registers.
EDIT: here are the command lines used to generate the code:
$toolchain_path = "C:\Program Files (x86)\GNU Tools ARM Embedded\8 2018-q4-major\bin";
$link_file = "source\device\same70_flash.ld"
$c_files = "include\sensors\bmi088\bmi088.c " +
...
"source\utils\syscalls.c";
$cpp_files = "source\device\init.cpp " +
...
"source\main.cpp";
Invoke-Expression "& '$toolchain_path\arm-none-eabi-gcc.exe' -c -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' '-Wl,--entry=Reset_Handler' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 -Isource -Iinclude\CMSIS -D__SAME70N21__ $c_files --specs=nosys.specs"
foreach ($c_file in $c_files.split(" "))
{
if ($objects) { $objects += " "; }
$objects += ($c_file.split("\")[-1]).split(".")[0] + ".o";
}
Invoke-Expression "& '$toolchain_path\arm-none-eabi-ld.exe' -s --entry=Reset_Handler -r $objects -o drivers.o"
foreach ($object in $objects.split(" ")) { Remove-Item $object; }
Move-Item drivers.o bin\drivers.o -force
Invoke-Expression "& '$toolchain_path\arm-none-eabi-g++.exe' -s -O3 -fdata-sections -ffunction-sections '-Wl,--gc-sections' -mthumb -mcpu=cortex-m7 -mfloat-abi=hard -mfpu=fpv5-d16 '-Wl,--entry=Reset_Handler' -std=c++17 -Isource -Iinclude -Iinclude\CMSIS -D__SAME70N21__ bin/drivers.o $cpp_files --specs=nosys.specs -T $link_file -o bin\code.elf"
Invoke-Expression "& '$toolchain_path\arm-none-eabi-objcopy.exe' -O binary bin\code.elf bin\code.bin"
The script is written in PowerShell, so I'll explain it a little. $c_files is just a string with every C file to be compiled, separated by a space. $objects is an array of strings containing every file listed in $c_files but with the ".c" extension replaced by ".o". I've done this to link every compiled C file into "drivers.o". Finally, the C++ code is compiled using this drivers.o as an argument, and then I generate the .bin file to upload to the MCU.
The code is compiled using the latest GNU Arm Embedded toolchain. I must have made a mistake somewhere but I don't know where and I don't have a debugger to debug the code at runtime.
EDIT 2: Both variants work properly without optimizations. If I pass -O1 or higher as argument to the compiler the second variant stops working and I don't understand why.

cec code not working with libcec 4

I am running stretch on raspberry pi 1. Only libcec4 and libcec4-dev are available in repo.
The simple example code I found on GitHub is based on an older version of libcec.
// Build command:
// g++-4.8 -std=gnu++0x -fPIC -g -Wall -march=armv6 -mfpu=vfp -mfloat-abi=hard -isystem /opt/vc/include/ -isystem /opt/vc/include/interface/vcos/pthreads/ -isystem /opt/vc/include/interface/vmcs_host/linux/ -I/usr/local/include -L /opt/vc/lib -lcec -lbcm_host -ldl cec-simplest.cpp -o cec-simplest
//#CXXFLAGS=-I/usr/local/include
//#LINKFLAGS=-lcec -ldl
#include <libcec/cec.h>
// cecloader.h uses std::cout _without_ including iosfwd or iostream
// Furthermore is uses cout and not std::cout
#include <iostream>
using std::cout;
using std::endl;
#include <libcec/cecloader.h>
#include "bcm_host.h"
//#LINKFLAGS=-lbcm_host
#include <algorithm> // for std::min
// The main loop will just continue until a ctrl-C is received
#include <signal.h>
bool exit_now = false;
void handle_signal(int signal)
{
exit_now = true;
}
//CEC::CBCecKeyPressType
int on_keypress(void* not_used, const CEC::cec_keypress msg)
{
std::string key;
switch( msg.keycode )
{
case CEC::CEC_USER_CONTROL_CODE_SELECT: { key = "select"; break; }
case CEC::CEC_USER_CONTROL_CODE_UP: { key = "up"; break; }
case CEC::CEC_USER_CONTROL_CODE_DOWN: { key = "down"; break; }
case CEC::CEC_USER_CONTROL_CODE_LEFT: { key = "left"; break; }
case CEC::CEC_USER_CONTROL_CODE_RIGHT: { key = "right"; break; }
default: break;
};
std::cout << "on_keypress: " << static_cast<int>(msg.keycode) << " " << key << std::endl;
return 0;
}
int main(int argc, char* argv[])
{
// Install the ctrl-C signal handler
if( SIG_ERR == signal(SIGINT, handle_signal) )
{
std::cerr << "Failed to install the SIGINT signal handler\n";
return 1;
}
// Initialise the graphics pipeline for the raspberry pi. Yes, this is necessary.
bcm_host_init();
// Set up the CEC config and specify the keypress callback function
CEC::ICECCallbacks cec_callbacks;
CEC::libcec_configuration cec_config;
cec_config.Clear();
cec_callbacks.Clear();
const std::string devicename("CECExample");
devicename.copy(cec_config.strDeviceName, std::min(devicename.size(),13u) );
cec_config.clientVersion = CEC::LIBCEC_VERSION_CURRENT;
cec_config.bActivateSource = 0;
cec_config.callbacks = &cec_callbacks;
cec_config.deviceTypes.Add(CEC::CEC_DEVICE_TYPE_RECORDING_DEVICE);
cec_callbacks.CBCecKeyPress = &on_keypress;
// Get a cec adapter by initialising the cec library
CEC::ICECAdapter* cec_adapter = LibCecInitialise(&cec_config);
if( !cec_adapter )
{
std::cerr << "Failed loading libcec.so\n";
return 1;
}
// Try to automatically determine the CEC devices
CEC::cec_adapter devices[10];
int8_t devices_found = cec_adapter->FindAdapters(devices, 10, NULL);
if( devices_found <= 0)
{
std::cerr << "Could not automatically determine the cec adapter devices\n";
UnloadLibCec(cec_adapter);
return 1;
}
// Open a connection to the zeroth CEC device
if( !cec_adapter->Open(devices[0].comm) )
{
std::cerr << "Failed to open the CEC device on port " << devices[0].comm << std::endl;
UnloadLibCec(cec_adapter);
return 1;
}
// Loop until ctrl-C occurs
while( !exit_now )
{
// nothing to do. All happens in the CEC callback on another thread
sleep(1);
}
// Close down and cleanup
cec_adapter->Close();
UnloadLibCec(cec_adapter);
return 0;
}
It does not compile against libcec4 and libcec4-dev and produces these errors:
cec-simplest.cpp: In function ‘int main(int, char**)’:
cec-simplest.cpp:76:19: error: ‘CEC::ICECCallbacks {aka struct CEC::ICECCallbacks}’ has no member named ‘CBCecKeyPress’; did you mean ‘keyPress’?
cec_callbacks.CBCecKeyPress = &on_keypress;
^~~~~~~~~~~~~
cec-simplest.cpp:88:41: error: ‘class CEC::ICECAdapter’ has no member named ‘FindAdapters’; did you mean ‘PingAdapter’?
int8_t devices_found = cec_adapter->FindAdapters(devices, 10, NULL);
When I renamed CBCecKeyPress to keyPress and FindAdapters to PingAdapter, I got these errors:
cec-simplest.cpp: In function ‘int main(int, char**)’:
cec-simplest.cpp:76:33: error: invalid conversion from ‘int (*)(void*, CEC::cec_keypress)’ to ‘void (*)(void*, const cec_keypress*) {aka void (*)(void*, const CEC::cec_keypress*)}’ [-fpermissive]
cec_callbacks.keyPress = &on_keypress;
^~~~~~~~~~~~
cec-simplest.cpp:88:70: error: no matching function for call to ‘CEC::ICECAdapter::PingAdapter(CEC::cec_adapter [10], int, NULL)’
int8_t devices_found = cec_adapter->PingAdapter(devices, 10, NULL);
^
In file included from cec-simplest.cpp:5:0:
/usr/include/libcec/cec.h:77:18: note: candidate: virtual bool CEC::ICECAdapter::PingAdapter()
virtual bool PingAdapter(void) = 0;
^~~~~~~~~~~
/usr/include/libcec/cec.h:77:18: note: candidate expects 0 arguments, 3 provided
Some information about keyPress that I found in /usr/include/libcec/cectypes.h:
typedef struct cec_keypress
{
cec_user_control_code keycode; /**< the keycode */
unsigned int duration; /**< the duration of the keypress */
} cec_keypress;
typedef struct ICECCallbacks
{
void (CEC_CDECL* keyPress)(void* cbparam, const cec_keypress* key);
/*!
* @brief Transfer a CEC command from libCEC to the client.
* @param cbparam Callback parameter provided when the callbacks were set up
* @param command The command to transfer.
*/
void Clear(void)
{
keyPress = nullptr;
There is no documentation available for libcec.
What modifications do I need to do to make it work with libcec4 ?

I'm the author of the code you are asking about. I know that a year has passed since you asked the question but I only stumbled onto this stackoverflow question because I was searching for the solution myself! LOL. Luckily I've cracked the problem by myself.
I've updated the github code to work https://github.com/DrGeoff/cec_simplest and I've written a blog post about the code https://drgeoffathome.wordpress.com/2018/10/07/a-simple-libcec4-example-for-the-raspberry-pi/
In short, these are the changes I had to make. First, the on_keypress function is now passed a pointer to a cec_keypress message rather than a by-value copy of the message. Next, the CEC framework has renamed the callback from CBCecKeyPress to simply keyPress. In a similar vein, the FindAdapters function is now DetectAdapters (not PingAdapter, as you tried). Finally, DetectAdapters fills in an array of cec_adapter_descriptor rather than cec_adapter, which has the knock-on effect that the Open call takes strComName rather than simply comm.

"Bad" GCC optimization performance

I am trying to understand why using -O2 -march=native with GCC produces slower code than compiling without them.
Note that I am using MinGW (GCC 4.7.1) under Windows 7.
Here is my code :
struct.hpp :
#ifndef STRUCT_HPP
#define STRUCT_HPP
#include <iostream>
class Figure
{
public:
Figure(char *pName);
virtual ~Figure();
char *GetName();
double GetArea_mm2(int factor);
private:
char name[64];
virtual double GetAreaEx_mm2() = 0;
};
class Disk : public Figure
{
public:
Disk(char *pName, double radius_mm);
~Disk();
private:
double radius_mm;
virtual double GetAreaEx_mm2();
};
class Square : public Figure
{
public:
Square(char *pName, double side_mm);
~Square();
private:
double side_mm;
virtual double GetAreaEx_mm2();
};
#endif
struct.cpp :
#include <cstdio>
#include "struct.hpp"
Figure::Figure(char *pName)
{
sprintf(name, pName);
}
Figure::~Figure()
{
}
char *Figure::GetName()
{
return name;
}
double Figure::GetArea_mm2(int factor)
{
return (double)factor*GetAreaEx_mm2();
}
Disk::Disk(char *pName, double radius_mm_) :
Figure(pName), radius_mm(radius_mm_)
{
}
Disk::~Disk()
{
}
double Disk::GetAreaEx_mm2()
{
return 3.1415926*radius_mm*radius_mm;
}
Square::Square(char *pName, double side_mm_) :
Figure(pName), side_mm(side_mm_)
{
}
Square::~Square()
{
}
double Square::GetAreaEx_mm2()
{
return side_mm*side_mm;
}
main.cpp
#include <iostream>
#include <cstdio>
#include "struct.hpp"
double Do(int n)
{
double sum_mm2 = 0.0;
const int figuresCount = 10000;
Figure **pFigures = new Figure*[figuresCount];
for (int i = 0; i < figuresCount; ++i)
{
if (i % 2)
pFigures[i] = new Disk((char *)"-Disque", i);
else
pFigures[i] = new Square((char *)"-Carré", i);
}
for (int a = 0; a < n; ++a)
{
for (int i = 0; i < figuresCount; ++i)
{
sum_mm2 += pFigures[i]->GetArea_mm2(i);
sum_mm2 += (double)(pFigures[i]->GetName()[0] - '-');
}
}
for (int i = 0; i < figuresCount; ++i)
delete pFigures[i];
delete[] pFigures;
return sum_mm2;
}
int main()
{
double a = 0;
StartChrono(); // home made lib, working fine
a = Do(10000);
double elapsedTime_ms = StopChrono();
std::cout << "Elapsed time : " << elapsedTime_ms << " ms" << std::endl;
return (int)a % 2; // To force the optimizer to keep the Do() call
}
I compile this code twice :
1 : Without optimization
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
2 : With -O2 optimization
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c main.cpp -o main.o
mingw32-g++.exe -Wall -fexceptions -O2 -march=native -std=c++11 -c struct.cpp -o struct.o
mingw32-g++.exe -o program.exe main.o struct.o -s
1 : Execution time :
1196 ms (1269 ms with Visual Studio 2013)
2 : Execution time :
1569 ms (403 ms with Visual Studio 2013) !!!!!!!!!!!!!
Using -O3 instead of -O2 does not improve the results.
I was, and still am, fairly convinced that GCC and Visual Studio are equivalent, so I don't understand this huge difference.
Plus, I don't understand why the optimized version is slower than the non-optimized version with GCC.
Am I missing something here?
(Note that I had the same problem with genuine GCC 4.8.2 on Ubuntu)
Thanks for your help

Considering that I don't see the assembly code, I'm going to speculate the following :
The allocation loop can be optimized (by the compiler) by removing the if clause and causing the following :
for (int i=0;i <10000 ; i+=2)
{
pFigures[i] = new Square(...);
}
for (int i=1;i <10000 ; i +=2)
{
pFigures[i] = new Disk(...);
}
Considering that the end condition is a multiple of 4, it can be made even more "efficient":
for (int i=0;i < 10000 ;i+=2*4)
{
pFigures[i] = ...
pFigures[i+2] = ...
pFigures[i+4] = ...
pFigures[i+6] = ...
}
Memory-wise, this will cause Disks to be allocated four at a time and Squares four at a time.
This means objects of the same type will sit next to each other in memory.
Next, you iterate over the array 10000 times in normal order (by normal I mean index after index).
Think about where these shapes are allocated in memory. You will end up with 4 times more cache misses (consider the boundary case where 4 disks and 4 squares lie in different pages: you will switch between the pages 8 times, whereas in the normal scenario you would switch between the pages only once).
This sort of optimization (if done by the compiler, and in your particular code) optimizes the time for allocation, but not the time of access (which in your example is the biggest load).
Test this by removing the i%2 and see what results you get.
Again this is pure speculation, and it assumes that the reason for lower performance was a loop optimization.

I suspect that you've got an issue unique to the combination of mingw/gcc/glibc on Windows because your code performs faster with optimizations on Linux where gcc is altogether more 'at home'.
On a fairly pedestrian Linux VM using gcc 4.8.2:
$ g++ main.cpp struct.cpp
$ time a.out
real 0m2.981s
user 0m2.876s
sys 0m0.079s
$ g++ -O2 main.cpp struct.cpp
$ time a.out
real 0m1.629s
user 0m1.523s
sys 0m0.041s
...and if you really take the blinkers off the optimizer by deleting struct.cpp and moving the implementation all inline:
$ time a.out
real 0m0.550s
user 0m0.543s
sys 0m0.000s

Is this kind of optimization a compiler bug or not?

Environment: I use VS 2010/VS 2013 and the clang 3.4 prebuilt binary.
I've found a bug in our production code. I minimized the code that reproduces it to the following:
#include <windows.h>
#include <process.h>
#include <stdio.h>
using namespace std;
bool s_begin_init = false;
bool s_init_done = false;
void thread_proc(void * arg)
{
DWORD tid = GetCurrentThreadId();
printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
if (!s_begin_init)
{
s_begin_init = true;
Sleep(20);
s_init_done = true;
}
else
{
while(!s_init_done) { ; }
}
printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
int main(int argc, char *argv[])
{
argc = argc ; argv = argv ;
for(int i = 0; i < 30; ++i)
{
_beginthread(thread_proc, 0, reinterpret_cast<void*>(i));
}
getchar();
return 0;
}
To compile and run the code:
cl /O2 /Zi /Favc.asm vc_O2_bug.cpp && vc_O2_bug.exe
Some of the threads spin forever in the while loop. Checking the generated assembly, I found that the code for
while(!s_init_done) {; }
is:
; Line 19
mov al, BYTE PTR ?s_init_done@@3_NA ; s_init_done
$LL2@thread_pro:
; Line 21
test al, al
je SHORT $LL2@thread_pro
; Line 23
Evidently, with the /O2 flag VC copies s_init_done into the al register once and then repeatedly tests the register.
I then tested the code with the clang-cl.exe compiler driver. The result is the same, and the generated assembly is equivalent.
It looks as if the compiler assumes that s_init_done will never change, because the only statement that changes its value is in the "if" block, which is mutually exclusive with the current "else" branch.
I tried the same code with VS2013. The result is also the same.
My doubt is this: the C++98/C++03 standards have no concept of threads, so the compiler may perform such an optimization as if for a single-threaded machine. But C++11 does have threads, and both clang 3.4 and VC2013 support C++11 well, so my question is:
Is this kind of optimization a compiler bug under C++98/C++03, and under C++11, respectively?
BTW: When I use -O1 instead, or add the volatile qualifier to s_init_done, the bug disappears.

Your program contains data races on s_begin_init and s_init_done, and therefore has undefined behavior. Per C++11 §1.10/21:
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
The fix is to declare both boolean variables to be atomic:
std::atomic<bool> s_begin_init{false};
std::atomic<bool> s_init_done{false};
or to synchronize accesses to them with a mutex (I'll throw in a condition variable to avoid busy-waiting):
std::mutex mtx;
std::condition_variable cvar;
bool s_begin_init = false;
bool s_init_done = false;
void thread_proc(void * arg)
{
DWORD tid = GetCurrentThreadId();
printf("Begin Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
std::unique_lock<std::mutex> lock(mtx);
if (!s_begin_init)
{
s_begin_init = true;
lock.unlock();
Sleep(20);
lock.lock();
s_init_done = true;
cvar.notify_all();
}
else
{
while(!s_init_done) { cvar.wait(lock); }
}
printf("End Thread %2d, TID=%u\n", reinterpret_cast<int>(arg), tid);
}
EDIT: I just noticed the mention of VS2010 in the OP. VS2010 does not support C++11 atomics, so you will have to use the mutex solution or take advantage of MSVC's non-standard extension that gives volatile variables acquire-release semantics:
volatile bool s_begin_init = false;
volatile bool s_init_done = false;
