What's wrong with sequential consistency here? - c++

I'm playing with lock-free algorithms in C and C++ and recently stumbled upon behavior I don't quite understand. Running the code below prints something like:
reader started
writer started
iters=79895047, less=401131, eq=48996928, more=30496988
Aren't std::atomic operations expected to be sequentially consistent? If so, why does the reader sometimes see b updated before a? I also tried various tricks involving memory fences, with no success. The full compilable code can be seen at https://github.com/akamaus/fence_test
What's wrong with the example?
std::atomic<uint> a(0);
std::atomic<uint> b(0);
volatile bool stop = false;
void *reader(void *p) {
uint64_t iter_counter = 0;
uint cnt_less = 0,
cnt_eq = 0,
cnt_more = 0;
uint aa, bb;
printf("reader started\n");
while(!stop) {
iter_counter++;
aa = a.load(std::memory_order_seq_cst);
bb = b.load(std::memory_order_seq_cst);
if (aa < bb) {
cnt_less++;
} else if (aa > bb) {
cnt_more++;
} else {
cnt_eq++;
}
}
printf("iters=%lu, less=%u, eq=%u, more=%u\n", iter_counter, cnt_less, cnt_eq, cnt_more);
return NULL;
}
void *writer(void *p) {
printf("writer started\n");
uint counter = 0;
while(!stop) {
a.store(counter, std::memory_order_seq_cst);
b.store(counter, std::memory_order_seq_cst);
counter++;
}
}

Sequentially consistent ordering means that all threads observe the seq_cst operations on these atomic objects in one consistent order: the program behaves as if all those operations were interleaved in a single total order. It does not make the two stores in the writer, or the two loads in the reader, a single indivisible step, so operations from the other thread can land between them. Consider the following cases:
Writer      Reader
            a == 0
a = 1
b = 1
            b == 1
Result: aa < bb.
Writer      Reader
a = 1
            a == 1
            b == 0
b = 1
Result: aa > bb.
With a lock, e.g. a mutex, you can make sure that the operations don't interleave.
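The mutex approach mentioned above can be sketched as follows. This is a minimal illustration, not the original poster's code: `countMismatches` and `pairMutex` are hypothetical names, and `a` and `b` become plain variables because the mutex now provides the synchronization. Both stores sit in one critical section and both loads in another, so the reader can only observe a state in which `a == b`.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical helper: runs one writer and one reader that share a mutex and
// returns how many times the reader observed a != b.
unsigned countMismatches(unsigned writes) {
    unsigned a = 0, b = 0;            // plain variables, guarded by pairMutex
    std::mutex pairMutex;
    std::atomic<bool> done{false};

    std::thread writer([&] {
        for (unsigned counter = 1; counter <= writes; ++counter) {
            std::lock_guard<std::mutex> lock(pairMutex);
            a = counter;              // both stores happen in one critical section
            b = counter;
        }
        done = true;
    });

    unsigned mismatches = 0;
    while (!done) {
        unsigned aa, bb;
        {
            std::lock_guard<std::mutex> lock(pairMutex);
            aa = a;                   // both loads happen in one critical section
            bb = b;
        }
        if (aa != bb) ++mismatches;
    }
    writer.join();
    return mismatches;
}
```

Calling `countMismatches(100000)` should return 0, at the cost of the code no longer being lock-free.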

Related

Swap the value of two pointers atomically

I've learnt that a semaphore can act as an atomic lock that supports two operations: down and up.
Is there any way to swap the values of two pointers atomically, avoiding race conditions and deadlock?
I first came up with this 'solution', supposing each item carries its own semaphore:
Item a = { value = "A", lock = Semaphore(1) }
Item b = { value = "B", lock = Semaphore(1) }
void atomic_swap(Item* a, Item* b) {
a->lock.down(); // acquire
b->lock.down(); // acquire
non_atomic_swap(&a->value, &b->value);
b->lock.up(); // release
a->lock.up(); // release
}
But if I am not wrong, it will result in deadlock if two atomic_swap calls use the same pointers in opposite order, e.g.:
Item a = ...;
Item b = ...;
thread_start(atomic_swap, {&a, &b}); // start a thread running atomic_swap(&a, &b);
thread_start(atomic_swap, {&b, &a}); // start a thread running atomic_swap(&b, &a);
In the code above, if both calls to atomic_swap pass their first down at the same time, each one's second down blocks forever, which is a deadlock.
One solution I can think of to avoid the deadlock is to assign a 'group' to the items; only items in the same group can be swapped safely (without deadlock):
Group g = { lock = Semaphore(1) };
Item a = { value = "A", group = g };
Item b = { value = "B", group = g };
void atomic_swap(Item* a, Item* b) {
// assume a->group == b->group
a->group.down(); // b->group.down();
non_atomic_swap(&a->value, &b->value);
a->group.up(); // b->group.up();
}
But this of course requires every item to carry a group, and unrelated items might wait on the group because of other calls.
Is there any good way, in theory, to implement atomic_swap using semaphores?
You can use std::less to compare the pointers to ensure that all users acquire the locks in the same order:
void atomic_swap(Item* a, Item* b) {
std::less<Item *> cmp;
if (cmp(a, b)) {
a->lock.down();
b->lock.down();
} else {
b->lock.down();
a->lock.down();
}
non_atomic_swap(&a->value, &b->value);
b->lock.up(); // release
a->lock.up(); // release
}
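If `std::mutex` is acceptable in place of a semaphore, the standard library can do the deadlock-free acquisition itself. The sketch below swaps in `std::lock`, which locks several lockables with a deadlock-avoidance algorithm, instead of the address-ordering trick above. The `Item` layout and the `stressSwap` helper are assumptions made for illustration. It also handles the `a == b` case, where taking the same lock twice would block forever.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hypothetical item type: a value plus the mutex that guards it.
struct Item {
    explicit Item(std::string v) : value(std::move(v)) {}
    std::string value;
    std::mutex lock;
};

void atomic_swap(Item* a, Item* b) {
    assert(a && b);
    if (a == b) return;  // locking the same mutex twice would deadlock
    std::unique_lock<std::mutex> lockA(a->lock, std::defer_lock);
    std::unique_lock<std::mutex> lockB(b->lock, std::defer_lock);
    std::lock(lockA, lockB);  // acquires both without risking deadlock
    std::swap(a->value, b->value);
}   // both locks are released here

// Two threads swap the same pair in opposite argument order.
// After 2 * rounds swaps the values must be back where they started.
bool stressSwap(int rounds) {
    Item a("A"), b("B");
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) atomic_swap(&a, &b); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) atomic_swap(&b, &a); });
    t1.join();
    t2.join();
    return a.value == "A" && b.value == "B";
}
```

With C++17, the two `unique_lock` lines and the `std::lock` call can be replaced by a single `std::scoped_lock`.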

Why am I getting a race condition?

I'm trying to combine multiple CGAL meshes into one single geometry.
I have the following sequential code that works perfectly fine:
while (m_toCombine.size() > 1) {
auto mesh1 = m_toCombine.front();
m_toCombine.pop_front();
auto mesh2 = m_toCombine.front();
m_toCombine.pop_front();
bool result = CGAL::Polygon_mesh_processing::corefine_and_compute_union(mesh1, mesh2, mesh2);
m_toCombine.push_back(mesh2);
}
Where m_toCombine is a std::list<Triangle_mesh_exact>.
Triangle_mesh_exact is a type of CGAL mesh (triangulated polyhedron geometry). But I don't think it's really relevant to the problem.
Unfortunately, this process is way too slow for my intended application, so I decided to use the "divide and conquer" concept and combine meshes in parallel:
class Combiner
{
public:
Combiner(const std::list<Triangle_mesh_exact>& toCombine) :
m_toCombine(toCombine) {};
~Combiner() {};
Triangle_mesh_exact combineMeshes();
void combineMeshes2();
private:
std::mutex m_listMutex, m_threadListMutex;
std::mutex m_eventLock;
std::list<MiniThread> m_threads;
std::list<Triangle_mesh_exact> m_toCombine;
std::condition_variable m_eventSignal;
std::atomic<bool> m_done = false;
//void poll(int threadListIndex);
};
Triangle_mesh_exact Combiner::combineMeshes()
{
std::unique_lock<std::mutex> uniqueLock(m_eventLock, std::defer_lock);
int runningCount = 0, finishedCount = 0;
int toCombineCount = m_toCombine.size();
bool stillRunning = false;
bool stillCombining = true;
while (stillCombining || stillRunning) {
uniqueLock.lock();
//std::lock_guard<std::mutex> lock(m_listMutex);
m_listMutex.lock();
Triangle_mesh_exact mesh1 = std::move(m_toCombine.front());
m_toCombine.pop_front();
toCombineCount--;
Triangle_mesh_exact mesh2 = std::move(m_toCombine.front());
m_toCombine.pop_front();
toCombineCount--;
m_listMutex.unlock();
runningCount++;
auto thread = new std::thread([&, this, mesh1, mesh2]() mutable {
//m_listMutex.lock();
CGAL::Polygon_mesh_processing::corefine_and_compute_union(mesh1, mesh2, mesh2);
std::lock_guard<std::mutex> lock(m_listMutex);
m_toCombine.push_back(mesh2);
toCombineCount++;
finishedCount++;
m_eventSignal.notify_one();
//m_listMutex.unlock();
});
thread->detach();
while (toCombineCount < 2 && runningCount != finishedCount) {
m_eventSignal.wait(uniqueLock);
}
stillRunning = runningCount != finishedCount;
stillCombining = toCombineCount >= 2;
uniqueLock.unlock();
}
return m_toCombine.front();
}
Unfortunately, despite being extra careful, I'm getting memory access violation crashes or errors related to the destructors of mesh1 or mesh2.
Am I missing something?
Instead of complicating things, check what the standard library already offers:
std::reduce - cppreference.com
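#include <execution>
#include <iterator>
#include <numeric>
// Parameters are taken by value: std::reduce may pass temporaries and may
// apply the operation to the elements in any grouping and order.
// Note: with an execution policy, an exception escaping combine calls std::terminate.
Triangle_mesh_exact combine(Triangle_mesh_exact a, Triangle_mesh_exact b)
{
auto success = CGAL::Polygon_mesh_processing::corefine_and_compute_union(a, b, b);
if (!success) throw my_combine_exception{};
return b;
}
Triangle_mesh_exact combineAll()
{
if (m_toCombine.empty()) throw std::invalid_argument("");
if (m_toCombine.size() == 1) return m_toCombine.front();
// std::list iterators are not random access, so use std::next instead of begin() + 1.
return std::reduce(std::execution::par,
std::next(m_toCombine.begin()), m_toCombine.end(),
m_toCombine.front(), combine);
}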
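If a parallel std::reduce is not available in your toolchain, the same pairwise idea can be written with std::async and futures, which avoids detached threads and counters shared across different mutexes. The sketch below is only an illustration under stated assumptions: `Mesh` is a hypothetical stand-in (a `std::set<int>`) for `Triangle_mesh_exact`, and `combine` stands in for the CGAL union call. Each round combines neighbouring pairs concurrently and waits for all of them before starting the next round.

```cpp
#include <cstddef>
#include <future>
#include <set>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for Triangle_mesh_exact.
using Mesh = std::set<int>;

// Stand-in for corefine_and_compute_union: the union of both inputs.
Mesh combine(Mesh a, Mesh b) {
    a.insert(b.begin(), b.end());
    return a;
}

Mesh combineAll(std::vector<Mesh> meshes) {
    if (meshes.empty()) throw std::invalid_argument("nothing to combine");
    while (meshes.size() > 1) {
        // Launch one task per neighbouring pair; each task owns its two inputs.
        std::vector<std::future<Mesh>> jobs;
        for (std::size_t i = 0; i + 1 < meshes.size(); i += 2) {
            jobs.push_back(std::async(std::launch::async, combine,
                                      std::move(meshes[i]), std::move(meshes[i + 1])));
        }
        std::vector<Mesh> next;
        // An unpaired last element is carried over to the next round.
        if (meshes.size() % 2 == 1) next.push_back(std::move(meshes.back()));
        // get() blocks until the task is done and rethrows its exception, if any.
        for (auto& job : jobs) next.push_back(job.get());
        meshes = std::move(next);
    }
    return std::move(meshes.front());
}
```

Because every task works on its own moved-in copies and results come back through futures, no mutex or condition variable is needed.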

While loop - how to remove code duplication

It's not the first time I find myself in the following situation:
bool a = some_very_long_computation;
bool b = another_very_long_computation;
while (a && b) {
...
a = some_very_long_computation;
b = another_very_long_computation;
}
I don't want to compute everything in while condition, since computations are long and I want to give them appropriate names.
I don't want to create helper functions, because computation uses many local variables, and passing them all will make the code much less readable (and it will be some_huge_call).
It's unknown whether loop body will be executed at least once.
What is a good pattern in such situation? Currently I face it in C++, but I've encountered this in other languages as well. I can solve it by using additional variable isFirstPass, but it looks ugly (and, I guess, will cause some warnings):
bool a, b;
bool isFirstPass = true;
do {
if (!isFirstPass) {
...
} else {
isFirstPass = false;
}
a = some_very_long_computation;
b = another_very_long_computation;
} while (a && b);
The direct simplification of your code is:
while (
some_very_long_computation &&
another_very_long_computation
) {
...
}
If you want to keep the variables a and b:
bool a = false, b = false; // b is not assigned when the first computation is false
while (
(a = some_very_long_computation) &&
(b = another_very_long_computation)
) {
...
}
If you don't want to put the conditions into the while condition:
while (true) {
bool a = some_very_long_computation;
bool b = another_very_long_computation;
if (!(a && b)) {
break;
}
...
}
You could also create helper lambdas (which have access to local variables):
auto fa = [&]() { return some_very_long_computation; };
auto fb = [&]() { return another_very_long_computation; };
while (fa() && fb()) {
...
}
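As a concrete illustration of the lambda variant, here is a small sketch with made-up names (`sumWhileSmall`, `hasMore`, `underLimit`). The lambdas capture the locals by reference, so nothing has to be passed as arguments, and `&&` short-circuits, so the second condition is evaluated only when the first one holds.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: adds up leading values while the running sum stays within limit.
int sumWhileSmall(const std::vector<int>& values, int limit) {
    std::size_t i = 0;
    int sum = 0;
    // Named conditions that read the current loop state through references.
    auto hasMore = [&] { return i < values.size(); };
    auto underLimit = [&] { return sum + values[i] <= limit; };
    while (hasMore() && underLimit()) {  // underLimit() runs only if hasMore() is true
        sum += values[i];
        ++i;
    }
    return sum;
}
```

Here the loop body may run zero times, and each condition is written exactly once.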

Waiting time of thread switches systematically between 0 and 30000 microseconds for the same task

I'm writing a little Console-Game-Engine and for better performance I wanted 2 threads (or more but 2 for this task) using two buffers. One thread is drawing the next frame in the first buffer while the other thread is reading the current frame from the second buffer. Then the buffers get swapped.
Of course I can only swap them once both threads have finished their task, and the drawing/writing thread happens to be the one that waits. But its waiting time systematically alternates, more or less, between two values. Here are a few of the measurements I made (in microseconds):
0, 36968, 0, 36260, 0, 35762, 0, 38069, 0, 36584, 0, 36503
It's pretty obvious that this is not a coincidence, but I wasn't able to figure out what the problem is, as this is the first time I'm using threads.
Here is the code. Ask for more if you need it; I think it's too much to post it all:
header-file (Manager currently only adds a pointer to my WinAppBase-class):
class SwapChain : Manager
{
WORD *pScreenBuffer1, *pScreenBuffer2, *pWritePtr, *pReadPtr, *pTemp;
bool isRunning, writingFinished, readingFinished, initialized;
std::mutex lockWriting, lockReading;
std::condition_variable cvWriting, cvReading;
DWORD charsWritten;
COORD startPosition;
int screenBufferWidth;
// THREADS (USES NORMAL THREAD AS SECOND THREAD)
void ReadingThread();
// THIS FUNCTION IS ONLY FOR INTERN USE
void SwapBuffers();
public:
// USE THESE TO CONTROL WHEN THE BUFFERS GET SWAPPED
void BeginDraw();
void EndDraw();
// PUT PIXEL | INLINED FOR BETTER PERFORMANCE
inline void PutPixel(short xPos, short yPos, WORD color)
{
this->pWritePtr[(xPos * 2) + yPos * screenBufferWidth] = color;
this->pWritePtr[(xPos * 2) + yPos * screenBufferWidth + 1] = color;
}
// GENERAL CONTROL OVER SWAP CHAIN
void Initialize();
void Run();
void Stop();
// CONSTRUCTORS
SwapChain(WinAppBase * pAppBase);
virtual ~SwapChain();
};
Cpp-file
SwapChain::SwapChain(WinAppBase * pAppBase)
:
Manager(pAppBase)
{
this->isRunning = false;
this->initialized = false;
this->pReadPtr = NULL;
this->pScreenBuffer1 = NULL;
this->pScreenBuffer2 = NULL;
this->pWritePtr = NULL;
this->pTemp = NULL;
this->charsWritten = 0;
this->startPosition = { 0, 0 };
this->readingFinished = 0;
this->writingFinished = 0;
this->screenBufferWidth = this->pAppBase->screenBufferInfo.dwSize.X;
}
SwapChain::~SwapChain()
{
this->Stop();
if (_CrtIsValidHeapPointer(pReadPtr))
delete[] pReadPtr;
if (_CrtIsValidHeapPointer(pScreenBuffer1))
delete[] pScreenBuffer1;
if (_CrtIsValidHeapPointer(pScreenBuffer2))
delete[] pScreenBuffer2;
if (_CrtIsValidHeapPointer(pWritePtr))
delete[] pWritePtr;
}
void SwapChain::ReadingThread()
{
while (this->isRunning)
{
this->readingFinished = 0;
WriteConsoleOutputAttribute(
this->pAppBase->consoleCursor,
this->pReadPtr,
this->pAppBase->screenBufferSize,
this->startPosition,
&this->charsWritten
);
memset(this->pReadPtr, 0, this->pAppBase->screenBufferSize);
this->readingFinished = true;
this->cvWriting.notify_all();
if (!this->writingFinished)
{
std::unique_lock<std::mutex> lock(this->lockReading);
this->cvReading.wait(lock);
}
}
}
void SwapChain::SwapBuffers()
{
this->pTemp = this->pReadPtr;
this->pReadPtr = this->pWritePtr;
this->pWritePtr = this->pTemp;
this->pTemp = NULL;
}
void SwapChain::BeginDraw()
{
this->writingFinished = false;
}
void SwapChain::EndDraw()
{
TimePoint tpx1, tpx2;
tpx1 = Clock::now();
if (!this->readingFinished)
{
std::unique_lock<std::mutex> lock2(this->lockWriting);
this->cvWriting.wait(lock2);
}
tpx2 = Clock::now();
POST_DEBUG_MESSAGE(std::chrono::duration_cast<std::chrono::microseconds>(tpx2 - tpx1).count(), "EndDraw wating time");
SwapBuffers();
this->writingFinished = true;
this->cvReading.notify_all();
}
void SwapChain::Initialize()
{
if (this->initialized)
{
POST_DEBUG_MESSAGE(Result::CUSTOM, "multiple initialization");
return;
}
this->pScreenBuffer1 = (WORD *)malloc(sizeof(WORD) * this->pAppBase->screenBufferSize);
this->pScreenBuffer2 = (WORD *)malloc(sizeof(WORD) * this->pAppBase->screenBufferSize);
for (int i = 0; i < this->pAppBase->screenBufferSize; i++)
{
this->pScreenBuffer1[i] = 0x0000;
}
for (int i = 0; i < this->pAppBase->screenBufferSize; i++)
{
this->pScreenBuffer2[i] = 0x0000;
}
this->pWritePtr = pScreenBuffer1;
this->pReadPtr = pScreenBuffer2;
this->initialized = true;
}
void SwapChain::Run()
{
this->isRunning = true;
std::thread t1(&SwapChain::ReadingThread, this);
t1.detach();
}
void SwapChain::Stop()
{
this->isRunning = false;
}
This is where I run the SwapChain-class from:
void Application::Run()
{
this->engine.graphicsmanager.swapChain.Initialize();
Sprite<16, 16> sprite(&this->engine);
sprite.LoadSprite("engine/resources/TestData.xml", "root.test.sprites.baum");
this->engine.graphicsmanager.swapChain.Run();
int a, b, c;
for (int i = 0; i < 60; i++)
{
this->engine.graphicsmanager.swapChain.BeginDraw();
for (c = 0; c < 20; c++)
{
for (a = 0; a < 19; a++)
{
for (b = 0; b < 10; b++)
{
sprite.Print(a * 16, b * 16);
}
}
}
this->engine.graphicsmanager.swapChain.EndDraw();
}
this->engine.graphicsmanager.swapChain.Stop();
_getch();
}
The for-loops above simply draw the sprite 20 times from the top-left corner to the bottom-right corner of the console - the buffers don't get swapped during that, and that again for a total of 60 times (so the buffers get swapped 60 times).
sprite.Print uses the PutPixel function of SwapChain.
Here is the WinAppBase (which consists more or less of global-like variables):
class WinAppBase
{
public:
// SCREENBUFFER
CONSOLE_SCREEN_BUFFER_INFO screenBufferInfo;
long screenBufferSize;
// CONSOLE
DWORD consoleMode;
HWND consoleWindow;
HANDLE consoleCursor;
HANDLE consoleInputHandle;
HANDLE consoleHandle;
CONSOLE_CURSOR_INFO consoleCursorInfo;
RECT consoleRect;
COORD consoleSize;
// FONT
CONSOLE_FONT_INFOEX fontInfo;
// MEMORY
char * pUserAccessDataPath;
public:
void reload();
WinAppBase();
virtual ~WinAppBase();
};
There are no errors, simply this alternating waiting time.
Maybe you'd like to start by checking whether I synchronised the threads correctly? I'm not exactly sure how to use a mutex or condition variables, so it might come from that.
Apart from that it is working fine, the sprites are shown as they should.
The clock you are using may have limited resolution. Here is a random example of a clock provided by Microsoft with 15 ms (15000 microsecond) resolution: Why are .NET timers limited to 15 ms resolution?
If one thread is often waiting for the other, it is entirely possible (assuming the above clock resolution) that it sometimes waits two clockticks and sometimes none. Maybe your clock only has 30 ms resolution. We really can't tell from the code. Do you get more precise measurements elsewhere with this clock?
There are also other systems in play such as the OS scheduler or whatever controls your std::threads. That one is (hopefully) much more granular, but how all these interactions play out doesn't have to be obvious or intuitive.
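Independently of clock resolution, the synchronisation in the question is worth tightening. writingFinished and readingFinished are plain bools shared between threads without a lock, and both wait() calls have no predicate, so a notification sent between the flag check and the wait can be missed. A minimal sketch of the usual pattern follows: one mutex, one condition variable, and flags that are only touched while holding the mutex. `DoubleBuffer`, `present`, `consume` and `runFrames` are hypothetical names, not part of the original code, and the console output call is replaced by a callback.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal double buffer; all names here are illustrative.
class DoubleBuffer {
public:
    explicit DoubleBuffer(std::size_t size) : front_(size), back_(size) {}

    // Only the drawing thread touches the back buffer, so no lock is needed.
    std::vector<int>& back() { return back_; }

    // Drawing thread: wait until the previous frame was consumed, then swap.
    void present() {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return !frameReady_; });
        std::swap(front_, back_);
        frameReady_ = true;
        changed_.notify_all();
    }

    // Reading thread: wait for a frame and hand it to f.
    // Returns false once stop() was called and no frame is pending.
    template <class F>
    bool consume(F f) {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return frameReady_ || stopped_; });
        if (!frameReady_) return false;
        f(front_);
        frameReady_ = false;
        changed_.notify_all();
        return true;
    }

    void stop() {
        { std::lock_guard<std::mutex> lock(mutex_); stopped_ = true; }
        changed_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    std::vector<int> front_, back_;
    bool frameReady_ = false;
    bool stopped_ = false;
};

// Draws `frames` frames (frame i stores i in cell 0) and returns the sum the reader saw.
long runFrames(int frames) {
    DoubleBuffer buffers(4);
    long sum = 0;
    std::thread reader([&] {
        while (buffers.consume([&](const std::vector<int>& frame) { sum += frame[0]; })) {}
    });
    for (int i = 1; i <= frames; ++i) {
        buffers.back()[0] = i;   // "draw" into the back buffer
        buffers.present();       // blocks until the reader is done with the old frame
    }
    buffers.stop();
    reader.join();               // joining instead of detaching makes shutdown deterministic
    return sum;
}
```

The predicate passed to wait() is re-checked under the lock, so a notification that arrives before the wait starts is not lost and spurious wakeups are harmless. Timing `present()` with this structure would show whether the alternating values come from the handshake or from the clock.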

Race condition in shared_ptr doesn't happen

Why is there no race condition in my code?
According to http://en.cppreference.com/w/cpp/memory/shared_ptr:
If multiple threads of execution access the same shared_ptr without synchronization and any of those accesses uses a non-const member function of shared_ptr then a data race will occur;
class base
{
public:
std::string val1;
};
class der : public base
{
public:
std::string val2;
int val3;
char val4;
};
int main()
{
std::mutex mm;
std::shared_ptr<der> ms(new der());
std::thread t1 = std::thread([ms, &mm]() {
while (1)
{
//std::lock_guard<std::mutex> lock(mm);
std::string some1 = ms->val2;
int some2 = ms->val3;
char some3 = ms->val4;
ms->val2 = "1232324";
ms->val3 = 1232324;
ms->val4 = '1';
}
});
std::thread t2 = std::thread([ms, &mm]() {
while (1)
{
//std::lock_guard<std::mutex> lock(mm);
std::string some1 = ms->val2;
int some2 = ms->val3;
char some3 = ms->val4;
ms->val2 = "123435";
ms->val3 = 123435;
ms->val4 = '3';
}
});
std::shared_ptr<base> bms = ms;
std::thread t3 = std::thread([bms]() {
while (1)
{
bms->val1 = 434;
}
});
while (1)
{
std::this_thread::sleep_for(std::chrono::milliseconds(1));
}
}
Data races do not yield compilation failure; they yield undefined behavior. That behavior could be "works fine". Or "appears to work fine but subtly breaks something 12 minutes later". Or "immediately fails."
Just because code appears to work doesn't mean it actually does. This is more true for threading code than any other kind.
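Note that in the question each lambda captures its own copy of the shared_ptr, so the quoted rule about the shared_ptr object itself is not what is being violated; the unsynchronized accesses are to the members of the shared object it points to. A minimal sketch of guarding the pointee with a mutex is shown below. `Shared`, `guard` and `runWriters` are hypothetical names chosen for illustration.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical shared state: the mutex lives next to the data it protects.
struct Shared {
    std::mutex guard;
    std::string text;
    int counter = 0;
};

// Two threads update the same object; every access happens under the mutex.
int runWriters(int perThread) {
    auto data = std::make_shared<Shared>();
    auto work = [data, perThread](const char* label) {  // each thread gets its own shared_ptr copy
        for (int i = 0; i < perThread; ++i) {
            std::lock_guard<std::mutex> lock(data->guard);
            data->text = label;
            ++data->counter;
        }
    };
    std::thread t1(work, "first");
    std::thread t2(work, "second");
    t1.join();
    t2.join();
    return data->counter;  // safe to read: both writers have finished
}
```

With the lock in place the final counter is always `2 * perThread`; without it the program has undefined behavior even if it happens to print the same number.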
I would recommend using helgrind, one of the valgrind tools.
Race conditions can be very hard to find when debugging multi-threaded programs.
To run this tool you need valgrind installed on your computer; then run:
valgrind --tool=helgrind ./Your_Compiled_File arg1 arg2 ...