Are compiler optimizations solving thread safety issues? - c++

I'm writing multi-threaded C++ code. While testing the overhead of different mutex locks, I found that thread-unsafe code seems to yield the correct result when compiled with the Release configuration in Visual Studio, while running much faster than the code with a mutex lock. With the Debug configuration, however, the result is what I expected. I was wondering: is it the compiler that solved this, or is it just that the code compiled in the Release configuration runs so fast that the two threads never access the memory at the same time?
My test code is pasted below.
class Mutex {
public:
    unsigned long long _data;

    bool tryLock() {
        return mtx.try_lock();
    }

    inline void Lock() {
        mtx.lock();
    }

    inline void Unlock() {
        mtx.unlock();
    }

    void safeSet(const unsigned long long &data) {
        Lock();
        _data = data;
        Unlock();
    }

    Mutex& operator++ () {
        Lock();
        _data++;
        Unlock();
        return (*this);
    }

    Mutex operator++(int) {
        Mutex tmp = (*this);
        Lock();
        _data++;
        Unlock();
        return tmp;
    }

    Mutex() {
        _data = 0;
    }

private:
    std::mutex mtx;
    Mutex(Mutex& cpy) {
        _data = cpy._data;
    }
} val;
static DWORD64 val_unsafe = 0;

DWORD WINAPI safeThreads(LPVOID lParam) {
    for (int i = 0; i < 655360; i++) {
        ++val;
    }
    return 0;
}

DWORD WINAPI unsafeThreads(LPVOID lParam) {
    for (int i = 0; i < 655360; i++) {
        val_unsafe++;
    }
    return 0;
}
int main()
{
    val._data = 0;
    vector<HANDLE> hThreads;
    LARGE_INTEGER freq, time1, time2;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&time1);
    for (int i = 0; i < 32; i++) {
        hThreads.push_back(CreateThread(0, 0, safeThreads, 0, 0, 0));
    }
    for each (HANDLE handle in hThreads)
    {
        WaitForSingleObject(handle, INFINITE);
    }
    QueryPerformanceCounter(&time2);
    cout << time2.QuadPart - time1.QuadPart << endl;
    hThreads.clear();

    QueryPerformanceCounter(&time1);
    for (int i = 0; i < 32; i++) {
        hThreads.push_back(CreateThread(0, 0, unsafeThreads, 0, 0, 0));
    }
    for each (HANDLE handle in hThreads)
    {
        WaitForSingleObject(handle, INFINITE);
    }
    QueryPerformanceCounter(&time2);
    cout << time2.QuadPart - time1.QuadPart << endl;
    hThreads.clear();

    cout << val._data << endl << val_unsafe << endl;
    cout << freq.QuadPart << endl;
    return 0;
}

The standard doesn't let you assume that code is thread safe by default. Nevertheless, your code gives the correct result when compiled in release mode for x64.
But why ?
If you look at the assembler file generated for your code, you'll find out that the optimizer simply unrolled the loop and applied constant propagation. So instead of looping 655360 times, it just adds a constant to your counter:
?unsafeThreads@@YAKPEAX@Z PROC ; unsafeThreads, COMDAT
; 74 :     for (int i = 0; i < 655360; i++) {
    add QWORD PTR ?val_unsafe@@3_KA, 655360 ; 000a0000H <======= HERE
; 75 :         val_unsafe++;
; 76 :     }
; 77 :     return 0;
    xor eax, eax
; 78 : }
In this situation, with a single and very fast instruction in each thread, it's much less probable to get a data race: most probably one thread has already finished before the next is launched.
How to see the expected result from your benchmark?
If you want to prevent the optimizer from unrolling your test loops, you need to declare _data and val_unsafe as volatile. You'll then notice that the unsafe value is no longer correct due to data races. Running my own tests with this modified code, I get the correct value for the safe version, and always different (and wrong) values for the unsafe version. For example:
safe time:   5672583
unsafe time: 145092    // <=== much faster
val:         20971520
val_unsafe:  3874844   // <=== OUCH !!!!
freq:        2597654
Want to make your unsafe code safe?
If you want to make your unsafe code safe but without using an explicit mutex, you could just make val_unsafe an atomic. The result will be platform dependent (the implementation could very well introduce a mutex for you), but on the same machine as above, with MSVC15 in release mode, I get:
safe time:   5616282
unsafe time: 798851    // still much faster (6 to 7 times on average)
val:         20971520
val_unsafe:  20971520  // but always correct
freq:        2597654
The only thing that you then still must do: rename the atomic version of the variable from val_unsafe to also_safe_val ;-)
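For reference, here is a minimal, portable sketch of the atomic-counter variant (my own illustration, using std::thread and std::atomic instead of the Win32 API; the thread and loop counts mirror the question):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<unsigned long long> counter{0};

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 32; i++) {
        threads.emplace_back([] {
            for (int j = 0; j < 655360; j++) {
                counter.fetch_add(1, std::memory_order_relaxed); // atomic increment, no lock
            }
        });
    }
    for (auto &t : threads) {
        t.join();
    }
    std::cout << counter << '\n'; // always 32 * 655360 = 20971520
    return 0;
}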

Related

C++ atomic is it safe to replace a mutex with an atomic<int>?

#include <iostream>
#include <thread>
#include <mutex>
#include <atomic>

using namespace std;

const int FLAG1 = 1, FLAG2 = 2, FLAG3 = 3;
int res = 0;
atomic<int> flagger;

void func1()
{
    for (int i = 1; i <= 1000000; i++) {
        while (flagger.load(std::memory_order_relaxed) != FLAG1) {}
        res++; // maybe a bunch of other code here that don't modify flagger
        // code here must not be moved outside the load/store (like mutex lock/unlock)
        flagger.store(FLAG2, std::memory_order_relaxed);
    }
    cout << "Func1 finished\n";
}

void func2()
{
    for (int i = 1; i <= 1000000; i++) {
        while (flagger.load(std::memory_order_relaxed) != FLAG2) {}
        res++; // same
        flagger.store(FLAG3, std::memory_order_relaxed);
    }
    cout << "Func2 finished\n";
}

void func3() {
    for (int i = 1; i <= 1000000; i++) {
        while (flagger.load(std::memory_order_relaxed) != FLAG3) {}
        res++; // same
        flagger.store(FLAG1, std::memory_order_relaxed);
    }
    cout << "Func3 finished\n";
}

int main()
{
    flagger = FLAG1;
    std::thread first(func1);
    std::thread second(func2);
    std::thread third(func3);
    first.join();
    second.join();
    third.join();
    cout << "res = " << res << "\n";
    return 0;
}
My program has a segment that is similar to this example. Basically, there are 3 threads: inputer, processor, and outputer. I found that busy-waiting on an atomic is faster than putting threads to sleep using a condition variable, such as:
std::mutex commonMtx;
std::condition_variable waiter1, waiter2, waiter3;

// then in func1()
unique_lock<std::mutex> uniquer1(commonMtx);
while (flagger != FLAG1) waiter1.wait(uniquer1);
However, is the code in the example safe? When I run it, it gives correct results (with the -std=c++17 -O3 flags). However, I don't know whether the compiler can reorder my instructions to outside the atomic check/set block, especially with std::memory_order_relaxed. In case it's unsafe, is there any way to make it safe while remaining faster than a mutex?
Edit: it's guaranteed that the number of threads is < the number of CPU cores.
std::memory_order_relaxed results in no guarantees on the ordering of memory operations except on the atomic itself.
All your res++; operations therefore are data races and your program has undefined behavior.
Example:
#include <atomic>

int x;
std::atomic<int> a{0};

void f() {
    x = 1;
    a.store(1, std::memory_order_relaxed);
    x = 2;
}
Clang 13 on x86_64 with -O2 compiles this function to
mov dword ptr [rip + a], 1
mov dword ptr [rip + x], 2
ret
(https://godbolt.org/z/hxjYeG5dv)
Even on a cache-coherent platform, between the first and second mov, another thread can observe a set to 1 while x is not set to 1.
You must use memory_order_release and memory_order_acquire (or sequential consistency) to replace a mutex.
(I should add that I have not checked your code in detail, so I can't say that simply replacing the memory order is sufficient in your specific case.)
As mentioned in the other answer, the res++; operations in different threads are not synchronized with each other, which causes a data race and undefined behavior. This can be checked with a thread sanitizer.
To fix this, you need to use memory_order_acquire for the loads and memory_order_release for the stores. The fix, too, can be confirmed with a thread sanitizer.
while (flagger.load(std::memory_order_acquire) != FLAG1) {}
res++;
flagger.store(FLAG2, std::memory_order_release);
Or, flagger.load(std::memory_order_acquire) can be replaced with flagger.load(std::memory_order_relaxed), followed by std::atomic_thread_fence(std::memory_order_acquire); after the loop.
while (flagger.load(std::memory_order_relaxed) != FLAG1) {}
std::atomic_thread_fence(std::memory_order_acquire);
res++;
flagger.store(FLAG2, std::memory_order_release);
I'm not sure how much it improves performance, if at all.
The idea is that only the last load in the loop needs to be an acquire operation, which is what the fence achieves.
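For illustration, here is a sketch of the first fix applied to func1 from the question (func2 and func3 would change the same way, with their respective flags):

void func1()
{
    for (int i = 1; i <= 1000000; i++) {
        // acquire: everything after this load sees the previous thread's writes
        while (flagger.load(std::memory_order_acquire) != FLAG1) {}
        res++;
        // release: our write to res becomes visible to whoever acquires FLAG2
        flagger.store(FLAG2, std::memory_order_release);
    }
    cout << "Func1 finished\n";
}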

Error occurred when using thread_local to maintain a concurrent memory buffer

In the following code, I want to create a memory buffer that multiple threads can read/write concurrently. At any one time, all threads will read the buffer in parallel, and at a later time, all threads will write to it in parallel; there will never be read and write operations at the same time.
To do this, I use a vector of shared_ptr<vector<uint64_t>>. When a new thread arrives, it is allocated a new vector<uint64_t> and only ever writes to that one. Two threads never write to the same vector.
I use thread_local variables to track the vector index and the offset the current thread will write to. When I need to add a new buffer to the memory_ member, I protect it with a mutex.
class TestBuffer {
public:
    thread_local static uint32_t index_;
    thread_local static uint32_t offset_;
    thread_local static bool ready_;
    vector<shared_ptr<vector<uint64_t>>> memory_;
    mutex lock_;

    void init() {
        if (!ready_) {
            new_slab();
            ready_ = true;
        }
    }

    void new_slab() {
        std::lock_guard<mutex> lock(lock_);
        index_ = memory_.size();
        memory_.push_back(make_shared<vector<uint64_t>>(1000));
        offset_ = 0;
    }

    void put(uint64_t value) {
        init();
        if (offset_ == 1000) {
            new_slab();
        }
        if (memory_[index_] == nullptr) {
            cout << "Error" << endl;
        }
        *(memory_[index_]->data() + offset_) = value;
        offset_++;
    }
};
thread_local uint32_t TestBuffer::index_ = 0;
thread_local uint32_t TestBuffer::offset_ = 0;
thread_local bool TestBuffer::ready_ = false;
int main() {
    TestBuffer buffer;
    vector<std::thread> threads;
    for (int i = 0; i < 10; ++i) {
        thread t = thread([&buffer, i]() {
            for (int j = 0; j < 10000; ++j) {
                buffer.put(i * 10000 + j);
            }
        });
        threads.emplace_back(move(t));
    }
    for (auto &t : threads) {
        t.join();
    }
}
The code does not behave as expected and reports the error in the put function. The root cause is that memory_[index_] sometimes returns nullptr. However, I do not understand how this is possible, as I think I have set the values properly. Thanks for the help!
You have a race condition in put, caused by new_slab(). When new_slab calls memory_.push_back(), the memory_ vector may need to resize itself, and if another thread is executing put while the resize is in progress, memory_[index_] might access stale data.
One solution is to protect access to the memory_ vector by locking the mutex:
{
    std::lock_guard<mutex> lock(lock_);
    if (memory_[index_] == nullptr) {
        cout << "Error" << endl;
    }
    *(memory_[index_]->data() + offset_) = value;
}
Another is to reserve the space you need in the memory_ vector ahead of time.
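One way to apply the locking fix to the whole of put() (my sketch, not the answerer's exact code): hold the lock only long enough to copy the shared_ptr, then write outside the lock, since no other thread writes to this thread's slab.

void put(uint64_t value) {
    init();
    if (offset_ == 1000) {
        new_slab();
    }
    shared_ptr<vector<uint64_t>> slab;
    {
        // serialize against the push_back in new_slab()
        std::lock_guard<mutex> lock(lock_);
        slab = memory_[index_];
    }
    (*slab)[offset_] = value; // safe: only this thread writes to its own slab
    offset_++;
}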

If statement passes only when preceded by debug cout line (multi-threading in C)

I created this code for solving CPU-intensive tasks in real time, and potentially as a base for a game engine in the future. For it, I created a system with an array of ints that each thread modifies to signal whether it is done with its current task.
The problem occurs when running with more than 4 threads. When using 6 threads or more, the if (threadone_private == threadcount) check stops working UNLESS I add the debug line cout << threadone_private << endl; before it.
I cannot comprehend why this debug line makes any difference to whether the if conditional works as expected, nor why it works without the line when using 4 threads or fewer.
For this code I'm using:
#include <GL/glew.h>
#include <GLFW/glfw3.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <string>
#include <fstream>
#include <sstream>
using namespace std;
Right now this code only counts up to 60 trillion, in asynchronous steps of 3 billion, really fast.
Here are the relevant parts of the code:
int thread_done[6] = { 0,0,0,0,0,0 };
atomic<long long int> testvar1 = 0;
atomic<long long int> testvar2 = 0;
atomic<long long int> testvar3 = 0;
atomic<long long int> testvar4 = 0;
atomic<long long int> testvar5 = 0;
atomic<long long int> testvar6 = 0;

void task1(long long int testvar, int thread_number)
{
    int continue_work = 1;
    for (; ; ) {
        while (continue_work == 1) {
            for (int i = 1; i < 3000000001; i++) {
                testvar++;
            }
            thread_done[thread_number] = 1;
            if (thread_number == 0) {
                testvar1 = testvar;
            }
            if (thread_number == 1) {
                testvar2 = testvar;
            }
            if (thread_number == 2) {
                testvar3 = testvar;
            }
            if (thread_number == 3) {
                testvar4 = testvar;
            }
            if (thread_number == 4) {
                testvar5 = testvar;
            }
            if (thread_number == 5) {
                testvar6 = testvar;
            }
            continue_work = 0;
        }
        if (thread_done[thread_number] == 0) {
            continue_work = 1;
        }
    }
}
And here is the relevant part of the main thread:
int main() {
    long long int testvar = 0;
    int threadcount = 6;
    int threadone_private = 0;
    thread thread_1(task1, testvar, 0);
    thread thread_2(task1, testvar, 1);
    thread thread_3(task1, testvar, 2);
    thread thread_4(task1, testvar, 3);
    thread thread_5(task1, testvar, 4);
    thread thread_6(task1, testvar, 5);
    for (; ; ) {
        if (threadcount == 0) {
            for (int i = 1; i < 3000001; i++) {
                testvar++;
            }
            cout << testvar << endl;
        }
        else {
            while (testvar < 60000000000000) {
                threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
                cout << threadone_private << endl;
                if (threadone_private == threadcount) {
                    testvar = testvar1 + testvar2 + testvar3 + testvar4 + testvar5 + testvar6;
                    cout << testvar << endl;
                    thread_done[0] = 0;
                    thread_done[1] = 0;
                    thread_done[2] = 0;
                    thread_done[3] = 0;
                    thread_done[4] = 0;
                    thread_done[5] = 0;
                }
            }
        }
    }
}
I expected that since each worker thread only modifies one int out of the thread_done array, and since the main thread only ever reads the array until all worker threads are waiting, this if (threadone_private == threadcount) should be bulletproof... Apparently I'm missing something important that goes wrong whenever I change this:
threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
cout << threadone_private << endl;
if (threadone_private == threadcount) {
To this:
threadone_private = thread_done[0] + thread_done[1] + thread_done[2] + thread_done[3] + thread_done[4] + thread_done[5];
//cout << threadone_private << endl;
if (threadone_private == threadcount) {
Disclaimer: Concurrent code is quite complicated and easy to get wrong, so it's generally a good idea to use higher level abstractions. There are a whole lot of details that are easy to get wrong without ever noticing. You should think very carefully about doing such low-level programming if you're not an expert. Sadly C++ lacks good built-in high level concurrent constructs, but there are libraries out there that handle this.
It's unclear to me what the whole code is supposed to do anyhow. As far as I can see, whether the code ever stops relies purely on timing (even if you did the synchronization correctly), which is completely non-deterministic. Your threads could execute in such a way that thread_done is never all true.
But apart from that, there is at least one correctness issue: you're reading and writing to int thread_done[6] = { 0,0,0,0,0,0 }; without synchronization. This is undefined behavior, so the compiler can do what it wants.
What probably happens is that the compiler sees that it can cache the value of threadone_private, since the thread never writes to it, so the value cannot change (legally). The external call to std::cout means the compiler can't be sure that the value isn't changed behind its back, so it has to re-read the value on each iteration (also, std::cout uses locks, which causes synchronization in most implementations, which again limits what the compiler can assume).
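A minimal sketch of the implied fix (my illustration, not from the answer): making the done flags atomic removes the data race and stops the compiler from caching them. The testvarN globals are already atomic in the question.

#include <atomic>

std::atomic<int> thread_done[6] = {}; // zero-initialized; concurrent access is now well-defined

// Worker side:  thread_done[thread_number].store(1, std::memory_order_release);
// Main thread:  threadone_private = 0;
//               for (auto &done : thread_done)
//                   threadone_private += done.load(std::memory_order_acquire);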
I cannot see any std::mutex, std::condition_variable, or variants of std::lock in your code. Doing multithreading without any of those will never succeed reliably, because whenever multiple threads modify the same data, you need to make sure only one thread (including your main thread) has access to that data at any given time.
Edit: I noticed you use atomic. I do not have any experience with this; however, I know that using mutexes works reliably.
Therefore, you need to guard every access (read or write) to that data with a mutex, like this:
// somewhere
std::mutex myMutex;
std::condition_variable myCondition;
int workersDone = 0;

/* main thread */
createWorkerThread1();
createWorkerThread2();
{
    std::unique_lock<std::mutex> lock(myMutex); // waits until the mutex is locked
    while (workersDone != 2) {
        myCondition.wait(lock); // the mutex is unlocked while waiting
    }
    std::cout << "the data is ready now" << std::endl;
} // the lock is destroyed, unlocking the mutex

/* worker thread */
while (true) {
    {
        std::unique_lock<std::mutex> lock(myMutex); // waits until the mutex is locked
        if (read_or_modify_a_piece_of_shared_data() == DATA_FINISHED) {
            break; // lock leaves the scope, unlocks the mutex
        }
    }
    prepare_everything_for_the_next_piece_of_shared_data(); // DO NOT access the data here
}

// data is processed
{
    std::unique_lock<std::mutex> lock(myMutex);
    ++workersDone; // the shared counter must also be modified under the mutex
}
myCondition.notify_one(); // no mutex needed for the notify itself; this wakes up the waiting thread
I hope this gives you an idea on how to use mutexes and condition variables to gain thread safety.
Disclaimer: 100% pseudo code ;)
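For concreteness, here is a compilable sketch of the same pattern under simple assumptions (two workers, each producing one result; all names are illustrative, not from the question):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex myMutex;
std::condition_variable myCondition;
int workersDone = 0;
long long results[2] = {};

void worker(int id) {
    long long sum = 0;
    for (int i = 0; i < 1000000; ++i) sum += i; // the CPU-intensive part, no shared data
    {
        std::lock_guard<std::mutex> lock(myMutex);
        results[id] = sum; // publish the result under the mutex
        ++workersDone;
    }
    myCondition.notify_one();
}

int main() {
    std::thread t1(worker, 0), t2(worker, 1);
    {
        std::unique_lock<std::mutex> lock(myMutex);
        myCondition.wait(lock, [] { return workersDone == 2; });
        std::cout << results[0] + results[1] << '\n';
    }
    t1.join();
    t2.join();
}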

C++ Reusing a vector of threads that call the same function

I would like to reuse a vector of threads that call the same function several times with different parameters. There is no writing (with the exception of an atomic parameter), so no need for a mutex. To depict the idea, I created a basic example of parallelized code that finds the maximum value of a vector. There are clearly better ways to find the max of a vector, but for the sake of the explanation, and to avoid getting into further details of the real code I am writing, I am going with this silly example.
The code finds the maximum number in a vector by calling a function pFind that checks whether the vector contains the number val (val is initialized with an upper bound). If it does, the execution stops; otherwise val is reduced by one and the process repeats.
The code below generates a vector of threads that parallelize the search for val in the vector. The issue is that, for every value of val, the vector of threads is regenerated and each time the new threads are joined.
Generating the vector of threads and joining them every time comes with an overhead that I want to avoid.
I am wondering if there is a way of generating a vector (a pool) of threads only once and reuse them for the new executions. Any other speedup tip will be appreciated.
void pFind(
    vector<int>& a,
    int n,
    std::atomic<bool>& flag,
    int k,
    int numTh,
    int val
) {
    int i = k;
    while (i < n) {
        if (a[i] == val) {
            flag = true;
            break;
        } else
            i += numTh;
    }
}
int main() {
    // (a and size, the data vector and its length, are assumed to be defined elsewhere)
    std::atomic<bool> flag;
    flag = false;
    int numTh = 8;
    int val = 1000;
    int pos = 0;
    while (!flag) {
        vector<thread> threads;
        for (int i = 0; i < numTh; i++) {
            thread th(&pFind, std::ref(a), size, std::ref(flag), i, numTh, val);
            threads.push_back(std::move(th));
        }
        for (thread& th : threads)
            th.join();
        if (flag)
            break;
        val--;
    }
    cout << val << "\n";
    return 0;
}
There is no way to assign a different execution function (closure) to a std::thread after construction. This is generally true of all thread abstractions, though implementations often try to memoize or cache lower-level resources internally to make thread fork and join fast, so that just constructing new threads is viable. There is a debate in systems programming circles about whether creating a new thread should be incredibly lightweight, or whether clients should be written so as not to fork threads so frequently. (Given that this debate has been going on for a very long time, it should be clear there are a lot of tradeoffs involved.)
There are a lot of other abstractions which try to do what you really want. They have names such as "thread pools," "task executors" (or just "executors"), and "futures." All of them tend to map onto threads by creating some set of threads, often related to the number of hardware cores in the system, and then having each of those threads loop and look for requests.
As the comments indicated, the main way you would do this yourself is to have threads with a top-level loop that accepts execution requests, processes them, and then posts the results. To do this you will need to use other synchronization methods, such as mutexes and condition variables. It is generally faster to do things this way if there are a lot of requests and the requests are not incredibly large.
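A bare-bones sketch of that request loop (illustrative only; a complete pool also needs shutdown signaling and a way to hand results back, as in the thread pool code further down):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>

std::mutex m;
std::condition_variable cv;
std::queue<std::function<void()>> jobs;
bool shutdown = false;

void workerLoop() {
    // started once per pool thread, e.g. std::thread worker(workerLoop);
    while (true) {
        std::function<void()> job;
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return shutdown || !jobs.empty(); });
            if (jobs.empty()) return; // shutting down and nothing left to do
            job = std::move(jobs.front());
            jobs.pop();
        }
        job(); // run the request without holding the lock
    }
}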
As much as standard C++ concurrency support is a good thing, it is also rather significantly lacking for real world high performance work. Something like Intel's TBB is far more of an industrial strength solution.
By piecing together some code from different online searches, the following works, but is not as fast as the approach that regenerates the threads at each iteration of the while loop.
Perhaps someone can comment on this approach.
The following class describes the thread pool
class ThreadPool {
public:
    ThreadPool(int threads) : shutdown_(false) {
        threads_.reserve(threads);
        for (int i = 0; i < threads; ++i)
            threads_.emplace_back(std::bind(&ThreadPool::threadEntry, this, i));
    }

    ~ThreadPool() {
        {
            // Unblock any threads and tell them to stop
            std::unique_lock<std::mutex> l(lock_);
            shutdown_ = true;
            condVar_.notify_all();
        }
        // Wait for all threads to stop
        std::cerr << "Joining threads" << std::endl;
        for (auto &thread : threads_) thread.join();
    }

    void doJob(std::function<void(void)> func) {
        // Place a job on the queue and unblock a thread
        std::unique_lock<std::mutex> l(lock_);
        jobs_.emplace(std::move(func));
        condVar_.notify_one();
    }

    void threadEntry(int i) {
        std::function<void(void)> job;
        while (1) {
            {
                std::unique_lock<std::mutex> l(lock_);
                while (!shutdown_ && jobs_.empty()) condVar_.wait(l);
                if (jobs_.empty()) {
                    // No jobs to do and we are shutting down
                    std::cerr << "Thread " << i << " terminates" << std::endl;
                    return;
                }
                std::cerr << "Thread " << i << " does a job" << std::endl;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            // Do the job without holding any locks
            job();
        }
    }

private:
    std::mutex lock_;                             // guards jobs_ and shutdown_
    std::condition_variable condVar_;             // signals waiting workers
    bool shutdown_;
    std::queue<std::function<void(void)>> jobs_;
    std::vector<std::thread> threads_;
};
Here is the rest of the code:
void pFind(
    vector<int>& a,
    int n,
    std::atomic<bool>& flag,
    int k,
    int numTh,
    int val,
    std::atomic<int>& completed
) {
    int i = k;
    while (i < n) {
        if (a[i] == val) {
            flag = true;
            break;
        } else
            i += numTh;
    }
    completed++;
}

int main() {
    std::atomic<bool> flag;
    flag = false;
    int numTh = 8;
    int val = 1000;
    int pos = 0;
    std::atomic<int> completed;
    completed = 0;
    ThreadPool p(numTh);
    while (!flag) {
        for (int i = 0; i < numTh; i++) {
            p.doJob(std::bind(pFind, std::ref(a), size, std::ref(flag), i, numTh, val, std::ref(completed)));
        }
        while (completed < numTh) {} // busy-wait until all jobs of this round are done
        if (flag) {
            break;
        } else {
            completed = 0;
            val--;
        }
    }
    cout << val << "\n";
    return 0;
}
Your code has a race condition: bool is not an atomic type and is therefore not safe for multiple threads to write to concurrently. You need to use std::atomic_bool or std::atomic_flag.
To answer your question, you're recreating the threads vector each iteration of the loop, which you can avoid by moving its declaration outside the loop body. Reusing the threads themselves is a much more complex topic that's hard to get right or describe concisely.
vector<thread> threads;
threads.reserve(numTh);
while (!flag) {
    for (int i = 0; i < numTh; ++i)
        threads.emplace_back(pFind, std::ref(a), size, std::ref(flag), i, numTh, val);
    for (auto &th : threads)
        th.join();
    threads.clear();
    if (flag)
        break;
    val--; // termination logic as in the question's loop
}

Why do I have worse performance on my spinlock implementation when I use a non-seq_cst memory model?

I have two versions of a spinlock below. The first uses the default, which is memory_order_seq_cst, while the latter uses memory_order_acquire/memory_order_release. Since the latter is more relaxed, I expect it to have better performance. However, that doesn't seem to be the case.
class SimpleSpinLock
{
public:
    inline SimpleSpinLock() : mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {
        int backoff = 0;
        while (mFlag.test_and_set()) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        mFlag.clear();
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};

class SimpleSpinLock2
{
public:
    inline SimpleSpinLock2() : mFlag(ATOMIC_FLAG_INIT) {}

    inline void lock()
    {
        int backoff = 0;
        while (mFlag.test_and_set(std::memory_order_acquire)) { DoWaitBackoff(backoff); }
    }

    inline void unlock()
    {
        mFlag.clear(std::memory_order_release);
    }

private:
    std::atomic_flag mFlag = ATOMIC_FLAG_INIT;
};
const int NUM_THREADS = 8;
const int NUM_ITERS = 5000000;
const int EXPECTED_VAL = NUM_THREADS * NUM_ITERS;

int val = 0;
long j = 0;

SimpleSpinLock spinLock;

void ThreadBody()
{
    for (int i = 0; i < NUM_ITERS; ++i)
    {
        spinLock.lock();
        ++val;
        j = i * 3.5 + val;
        spinLock.unlock();
    }
}

int main()
{
    vector<thread> threads;
    for (int i = 0; i < NUM_THREADS; ++i)
    {
        cout << "Creating thread " << i << endl;
        threads.push_back(std::move(std::thread(ThreadBody)));
    }
    for (thread& thr : threads)
    {
        thr.join();
    }
    cout << "Final value: " << val << "\t" << j << endl;
    assert(val == EXPECTED_VAL);
    return 1;
}
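Note that DoWaitBackoff is not shown in the question; a plausible minimal implementation (purely an assumption, here an exponential backoff around the x86 pause instruction) could be:

#include <immintrin.h> // _mm_pause

const int MAX_BACKOFF_ITERS = 1024; // illustrative cap

inline void DoWaitBackoff(int& backoff)
{
    // Spin for an exponentially growing number of pauses to reduce
    // pressure on the cache line holding the lock flag.
    for (int i = 0; i < (1 << backoff); ++i) _mm_pause();
    if ((1 << backoff) < MAX_BACKOFF_ITERS) ++backoff;
}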
I am running on Ubuntu 12.04 with gcc 4.8.2, using optimization level O3.
-- Spinlock with memory_order_seq_cst:
Run 1:
real 0m1.588s
user 0m4.548s
sys 0m0.052s
Run 2:
real 0m1.577s
user 0m4.580s
sys 0m0.032s
Run 3:
real 0m1.560s
user 0m4.436s
sys 0m0.032s
-- Spinlock with memory_order_acquire/release:
Run 1:
real 0m1.797s
user 0m4.608s
sys 0m0.100s
Run 2:
real 0m1.853s
user 0m4.692s
sys 0m0.164s
Run 3:
real 0m1.784s
user 0m4.552s
sys 0m0.124s
Run 4:
real 0m1.475s
user 0m3.596s
sys 0m0.120s
With the more relaxed model, I see a lot more variability. Sometimes it's better; often it is worse. Does anyone have an explanation for this?
The generated unlock code is different. The seq_cst memory model (with g++ 4.9.0) generates:
movb %sil, spinLock(%rip)
mfence
for the unlock. The acquire/release generates:
movb %sil, spinLock(%rip)
The lock code is the same. Someone else will have to say something about why it's better with the fence, but if I had to guess, I would guess that it reduces bus/cache-coherence contention, possibly by reducing interference on the bus. Sometimes stricter is more orderly, and thus faster.
ADDENDUM: According to this, mfence costs around 100 cycles. So maybe you are reducing bus contention, because when a thread finishes the loop body, it pauses a bit before trying to reacquire the lock, letting the other thread finish. You could try to do the same thing by putting in a short delay loop after the unlock, though you'd have to make sure that it didn't get optimized out.
ADDENDUM2: It does seem to be bus interference/contention caused by looping around too fast. I added a short delay loop like:
spinLock.unlock();
for (int i = 0; i < 5; i++) {
    j = i * 3.5 + val;
}
Now, the acquire/release performs the same.