Intel Inspector reports a data race in my spinlock implementation - c++

I made a very simple spinlock using the Interlocked functions in Windows and tested it on a dual-core CPU (two threads that increment a variable).
The program seems to work OK (it gives the same result every time, which is not the case when no synchronization is used), but Intel Parallel Inspector says that there is a race condition at value += j (see the code below). The warning disappears when using Critical Sections instead of my SpinLock.
Is my implementation of SpinLock correct or not? It's really strange, because all the operations used are atomic and carry the proper memory barriers, so it shouldn't lead to race conditions.
#include <windows.h>
#include <iostream>

class SpinLock
{
    int *lockValue;
public:
    SpinLock(int *value) : lockValue(value) { }

    void Lock() {
        while (InterlockedCompareExchange((volatile LONG*)lockValue, 1, 0) != 0) {
            WaitABit();
        }
    }

    void Unlock() { InterlockedExchange((volatile LONG*)lockValue, 0); }
};
The test program:
static const int THREADS = 2;
HANDLE completedEvents[THREADS];
int value = 0;
int lock = 0; // Global.

DWORD WINAPI TestThread(void *param) {
    HANDLE completed = (HANDLE)param;
    SpinLock testLock(&lock);
    for (int i = 0; i < 1000*20; i++) {
        for (int j = 0; j < 10*10; j++) {
            // Add something to the variable.
            testLock.Lock();
            value += j;
            testLock.Unlock();
        }
    }
    SetEvent(completed);
    return 0;
}

int main() {
    for (int i = 0; i < THREADS; i++) {
        completedEvents[i] = CreateEvent(NULL, true, false, NULL);
    }
    for (int i = 0; i < THREADS; i++) {
        DWORD id;
        CreateThread(NULL, 0, TestThread, completedEvents[i], 0, &id);
    }
    WaitForMultipleObjects(THREADS, completedEvents, true, INFINITE);
    std::cout << value;
}

Parallel Inspector's documentation for data races suggests using a critical section or a mutex to fix races on Windows. There's nothing in it which suggests that Parallel Inspector knows how to recognise any other locking mechanism you might invent.
Tools for the analysis of novel locking mechanisms tend to be static tools that look at every possible path through the code; Parallel Inspector's documentation implies that it executes the code once.
If you want to experiment with novel locking mechanisms, the most common tool I've seen used in the academic literature is the Spin model checker. There's also ESP, which might reduce the state space, but I don't know if it's been applied to concurrent problems, and also the Mobility Workbench, which would give an analysis if you can couch your problem in the pi-calculus. Intel Parallel Inspector doesn't seem anything like as complicated as these tools, but rather seems designed to check for commonly occurring issues using heuristics.

For other poor folks in a similar situation to me: Intel DOES provide a set of includes and libraries for doing exactly this sort of thing. Check the Inspector installation directory (you'll see \include, \lib32 and \lib64 there) for those materials. Documentation on how to use them (as of June 2018, though Intel cares nothing about keeping links consistent):
https://software.intel.com/en-us/inspector-user-guide-windows-apis-for-custom-synchronization
There are 3 functions:
void __itt_sync_acquired(void *addr)
void __itt_sync_releasing(void *addr)
void __itt_sync_destroy(void *addr)
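For illustration, here is a minimal sketch of how these annotations might be attached to the spinlock from the question. The header name ittnotify.h is an assumption (it is what ships in that \include directory), and WaitABit() is the question's placeholder backoff:

#include <ittnotify.h> // assumed header from Inspector's \include directory

class SpinLock
{
    long lockValue;
public:
    SpinLock() : lockValue(0) { }
    ~SpinLock() { __itt_sync_destroy(&lockValue); }

    void Lock() {
        while (InterlockedCompareExchange(&lockValue, 1, 0) != 0) {
            WaitABit();
        }
        __itt_sync_acquired(&lockValue);  // tell Inspector this thread now holds the lock
    }

    void Unlock() {
        __itt_sync_releasing(&lockValue); // tell Inspector the lock is about to be released
        InterlockedExchange(&lockValue, 0);
    }
};

With the acquired/releasing pair in place, Inspector can model the spinlock as a synchronization object instead of reporting the protected accesses as races.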

I'm pretty sure it should be implemented as follows:
class SpinLock
{
    long lockValue;
public:
    SpinLock(long value) : lockValue(value) { }

    void Lock() {
        while (InterlockedCompareExchange(&lockValue, 1, 0) != 0) {
            WaitABit();
        }
    }

    void Unlock() { InterlockedExchange(&lockValue, 0); }
};

Related

Pthread synchronization with barrier

I am trying to synchronize a function I am parallelizing with pthreads.
The issue is that I get a deadlock: a thread will exit the function while other threads are still waiting at the barrier for the thread that exited. I am unsure whether the pthread_barrier structure takes care of this. Here is an example:
static pthread_barrier_t barrier;

static void* foo(void* arg) {
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    return NULL;
}

int main() {
    // create pthread barrier
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    // create thread handles
    //...
    // create threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&thread_handles[i], NULL, &foo, (void*) i);
    }
    // join the threads
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(thread_handles[i], NULL);
    }
}
Here is a solution I tried for foo, but it didn't work (note NUM_THREADS_COPY is a copy of the NUM_THREADS constant, and is decremented whenever a thread reaches the end of the function):
static void* foo(void* arg) {
    for (int i = beg; i < end; i++) {
        if (i > 0) {
            pthread_barrier_wait(&barrier);
        }
    }
    pthread_barrier_init(&barrier, NULL, --NUM_THREADS_COPY);
    return NULL;
}
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
You need to decide how many threads it will take to pass the barrier before any threads arrive at it. Undefined behavior results from re-initializing the barrier while there are threads waiting at it. Among the plausible manifestations are that some of the waiting threads are prematurely released or that some of the waiting threads never get released, but those are by no means the only unwanted things that could happen. In any case ...
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
... no, pthreads barriers do not support that.
Since a barrier seems not to be flexible enough for your needs, you probably want to fall back to the general-purpose thread synchronization object: a condition variable (used together with a mutex and some kind of shared variable).
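Such a condition-variable-based replacement might look like the following minimal sketch (all flex_barrier_* names are hypothetical). A thread that finishes its loop calls flex_barrier_leave() instead of re-initializing the barrier, which safely reduces the participant count for the threads still cycling:

#include <pthread.h>

// A "flexible barrier": like pthread_barrier_t, but a finished thread can
// withdraw from it instead of blocking the others forever.
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int participants;    // threads still taking part
    int waiting;         // threads currently blocked at the barrier
    unsigned generation; // cycle counter, guards against spurious wakeups
} flex_barrier_t;

void flex_barrier_init(flex_barrier_t* b, int n) {
    pthread_mutex_init(&b->mtx, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->participants = n;
    b->waiting = 0;
    b->generation = 0;
}

// Call with the mutex held: if everyone still participating has arrived,
// start a new generation and wake all waiters.
static void release_if_complete(flex_barrier_t* b) {
    if (b->waiting == b->participants) {
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->cv);
    }
}

void flex_barrier_wait(flex_barrier_t* b) {
    pthread_mutex_lock(&b->mtx);
    unsigned gen = b->generation;
    b->waiting++;
    release_if_complete(b);
    while (gen == b->generation) // wait until this cycle completes
        pthread_cond_wait(&b->cv, &b->mtx);
    pthread_mutex_unlock(&b->mtx);
}

void flex_barrier_leave(flex_barrier_t* b) {
    // The calling thread stops participating; this may complete the
    // current cycle for the threads that are already waiting.
    pthread_mutex_lock(&b->mtx);
    b->participants--;
    release_if_complete(b);
    pthread_mutex_unlock(&b->mtx);
}

In foo() you would then call flex_barrier_wait() inside the loop and flex_barrier_leave() just before returning.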

How to let different threads fill an array together?

Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also, I need the results of all simulations in a single vector (or array) in the end.
So I came up with the approach below:
int Max{1000000};
// SimResult is some struct with a well-defined default value.
std::vector<SimResult> vec(/*length*/ Max); // Initialized with default values of SimResult.
int LastAdded{0};

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while (LastAdded < Max)
    {
        // Do some work to bring the simulator to the desired state.
        // The duration of this work is subject to randomness.
        vec[LastAdded++] = sim.GetResult(); // Produces SimResult.
    }
}

int main()
{
    // Launch a couple of std::async tasks that run fill.
    auto fut1 = std::async(fill, 1);
    auto fut2 = std::async(fill, 2);
    // maybe some more tasks
    fut1.get();
    fut2.get();
    // do something with the results in vec
}
main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result is immediately in the array; performant.
Reading about various approaches, it seems atomic is a good candidate, but I am not sure which settings will be most performant in my case. And I'm not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your Simulator class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties, and give each one a different random seed. Then you could pool these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    #pragma omp parallel for
    for (long long i = 0; i < (long long)N_runs; i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version at the end of this answer.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is shared by all threads, for example a global variable, or pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex& nextItemMutex)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while (true)
    {
        {
            // Enter the critical section.
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            // Acquire the next item.
            if (LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded++;
            }
            else
            {
                break;
            }
            // The lock is released when nextItemLock goes out of scope.
        }
        // Do some work to bring the simulator to the desired state.
        // The duration of this work is subject to randomness.
        vec[workingIndex] = sim.GetResult(); // Produces SimResult.
    }
}
The problem with this is that synchronisation is quite expensive. But it is probably cheap compared to the simulation you run, so it shouldn't be too bad.
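As an aside, handing out indices is exactly the kind of job std::atomic (which the question asks about) can do without a mutex. A minimal lock-free sketch, assuming the same globals vec, Max and Simulator as in the question, with LastAdded replaced by an atomic counter:

#include <atomic>

std::atomic<int> NextIndex{0}; // replaces the plain int LastAdded

void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    for (;;)
    {
        // fetch_add hands out each index exactly once, so every slot of
        // vec is written by exactly one thread and no lock is needed.
        int workingIndex = NextIndex.fetch_add(1, std::memory_order_relaxed);
        if (workingIndex >= Max)
            break;
        vec[workingIndex] = sim.GetResult();
    }
}

The futures' get() calls in main() synchronize with the completed tasks, so reading vec afterwards is safe.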
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex& nextItemMutex, size_t blockSize)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while (true)
    {
        {
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            if (LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded += blockSize;
            }
            else
            {
                break;
            }
        }
        for (size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
            vec[i] = sim.GetResult(); // Produces SimResult.
    }
}
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
    Simulator sim{RandSeed};
    for (size_t i = partitionStart; i < partitionEnd; i++)
    {
        // Do some work to bring the simulator to the desired state.
        // The duration of this work is subject to randomness.
        vec[i] = sim.GetResult(); // Produces SimResult.
    }
}

int main()
{
    // Launch one task per fixed partition of the vector.
    auto fut1 = std::async(fill, 1, 0, Max / 2);
    auto fut2 = std::async(fill, 2, Max / 2, Max);
    // ...
}

Running fixed number of threads

With the new standard C++17, I wonder if there is a good way to start a process with a fixed number of threads until a batch of jobs is finished.
Can you tell me how I can achieve the desired functionality of this code:
std::vector<std::future<std::string>> futureStore;
const int batchSize = 1000;
const int maxNumParallelThreads = 10;
int threadsTerminated = 0;

while (threadsTerminated < batchSize)
{
    const int& threadsRunning = futureStore.size();
    while (threadsRunning < maxNumParallelThreads)
    {
        futureStore.emplace_back(std::async(someFunction));
    }
    for (std::future<std::string>& readyFuture : std::when_any(futureStore.begin(), futureStore.end()))
    {
        auto retVal = readyFuture.get();
        // (possibly do something with the ret val)
        threadsTerminated++;
    }
}
I read that there was a proposed std::when_any function, but it did not make it into the standard. Is there any support for this functionality (not necessarily for std::future-s) in the current standard library? Is there a way to easily implement it, or do I have to resort to something like this?
This does not seem to me to be the ideal approach:
All your main thread does is wait for the other threads to finish, polling the results of your futures. You are almost wasting this thread somehow...
I don't know how far std::async re-uses the threads' infrastructure in any suitable way, so you risk creating entirely new threads each time (apart from that, you might not create any threads at all, see here, if you do not specify std::launch::async explicitly; a minimal illustration follows).
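For reference, forcing real asynchronous execution is just a matter of spelling out the launch policy (someFunction is the task from the example below):

// Without std::launch::async the implementation is free to defer the call
// and run it lazily on whichever thread later calls get().
auto fut = std::async(std::launch::async, someFunction);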
I personally would prefer another approach:
Create all the threads you want to use at once.
Let each thread run a loop, repeatedly calling someFunction(), until you have reached the number of desired tasks.
The implementation might look similar to this example:
#include <cstdio>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

const int BatchSize = 20;
int tasksStarted = 0;
std::mutex mutex;
std::vector<std::string> results;

std::string someFunction()
{
    puts("worker started"); fflush(stdout);
    std::this_thread::sleep_for(std::chrono::seconds(2));
    puts("worker done"); fflush(stdout);
    return "";
}

void runner()
{
    {
        // Claim the first task, or bail out if the batch is already covered.
        std::lock_guard<std::mutex> lk(mutex);
        if (tasksStarted >= BatchSize)
            return;
        ++tasksStarted;
    }
    for (;;)
    {
        std::string s = someFunction();
        {
            // Store the result and claim the next task, if any is left.
            std::lock_guard<std::mutex> lk(mutex);
            results.push_back(s);
            if (tasksStarted >= BatchSize)
                break;
            ++tasksStarted;
        }
    }
}

int main(int argc, char* argv[])
{
    const int MaxNumParallelThreads = 4;
    std::thread threads[MaxNumParallelThreads - 1]; // the main thread is one, too!
    for (int i = 0; i < MaxNumParallelThreads - 1; ++i)
    {
        threads[i] = std::thread(&runner);
    }
    runner();
    for (int i = 0; i < MaxNumParallelThreads - 1; ++i)
    {
        threads[i].join();
    }
    // use results...
    return 0;
}
This way, you do not repeatedly create new threads; each thread simply continues until all tasks are done.
If these tasks are not all alike as in the above example, you might create a base class Task with a pure virtual function (e.g. execute or operator()) and create subclasses with the required implementation (holding any necessary data).
You could then place the instances into a std::vector or std::list (well, we won't iterate; a list might be appropriate here) as pointers (otherwise you get object slicing!) and let each thread remove one of the tasks when it has finished its previous one (protected against race conditions, of course) and execute it. As soon as no more tasks are left, return; a sketch of this follows.
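A minimal sketch of that Task idea (all names hypothetical):

#include <list>
#include <memory>
#include <mutex>

// Base class for heterogeneous tasks; subclasses carry their own data.
struct Task {
    virtual ~Task() = default;
    virtual void execute() = 0;
};

std::list<std::unique_ptr<Task>> taskList; // pointers, to avoid slicing
std::mutex taskMutex;

void worker()
{
    for (;;)
    {
        std::unique_ptr<Task> task;
        {
            // Take the next task under the lock to avoid races.
            std::lock_guard<std::mutex> lk(taskMutex);
            if (taskList.empty())
                return; // no tasks left: this thread is done
            task = std::move(taskList.front());
            taskList.pop_front();
        }
        task->execute(); // run outside the lock
    }
}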
If you don't care about the exact number of threads, the simplest solution would be:

std::vector<std::future<std::string>> futureStore(batchSize);
std::generate(futureStore.begin(), futureStore.end(),
              [](){ return std::async(someFunction); });
for (auto& future : futureStore) {
    std::string value = future.get();
    doWork(value);
}
From my experience, std::async will reuse threads after a certain number of threads has been spawned; it will not spawn 1000 threads. Also, you will not gain much of a performance boost (if any) from using a thread pool; I did measurements in the past, and the overall runtime was nearly identical.
The only reason I use thread pools now is to avoid the delay of creating threads inside the computation loop. If you have timing constraints, you may miss deadlines when using std::async for the first time, since it creates the threads on the first calls.
There is a good thread pool library for these applications. Have a look here:
https://github.com/vit-vit/ctpl
#include <ctpl.h>

const unsigned int numberOfThreads = 10;
const unsigned int batchSize = 1000;

ctpl::thread_pool pool(numberOfThreads); // ten threads in the pool
std::vector<std::future<std::string>> futureStore(batchSize);

// Note: ctpl passes the worker thread's id as the first argument,
// so someTask is expected to take an int id parameter.
std::generate(futureStore.begin(), futureStore.end(),
              [&pool](){ return pool.push(someTask); });
for (auto& future : futureStore) {
    std::string value = future.get();
    doWork(value);
}

Can boost::mutex lock out an OS if enough are active?

I'm working on a producer consumer problem with an intermediate processing thread. When I run 200 of these applications, it locks the system up in Win7 when lots of connections time out. Unfortunately, not in a way that I know how to debug: the system becomes unresponsive and I have to restart it with the power button. It works fine on my Mac, and oddly enough, it works fine in Windows in safe mode.
I'm using boost 1.44 as that is what the host application uses.
Here is my queue. My intention is that the queues are synchronized on their size. I've manipulated this to use timed_wait to make sure I wasn't losing notifications, though I saw no difference in effect.
class ConcurrentQueue {
public:
    void push(const std::string& str, size_t notify_size, size_t max_size);
    std::string pop();
private:
    std::queue<std::string> queue;
    boost::mutex mutex;
    boost::condition_variable cond;
};

void ConcurrentQueue::push(
        const std::string& str, size_t notify_size, size_t max_size) {
    size_t queue_size;
    {
        boost::mutex::scoped_lock lock(mutex);
        if (queue.size() < max_size) {
            queue.push(str);
        }
        queue_size = queue.size();
    }
    if (queue_size >= notify_size)
        cond.notify_one();
}

std::string ConcurrentQueue::pop() {
    boost::mutex::scoped_lock lock(mutex);
    while (queue.empty())
        cond.wait(lock);
    std::string str = queue.front();
    queue.pop();
    return str;
}
These threads use the below queues to process and send using libcurl.
boost::shared_ptr<ConcurrentQueue> queue_a(new ConcurrentQueue);
boost::shared_ptr<ConcurrentQueue> queue_b(new ConcurrentQueue);

void prod_run(size_t iterations) {
    try {
        // stagger startup
        boost::this_thread::sleep(
            boost::posix_time::seconds(random_num(0, 25)));
        size_t save_frequency = random_num(41, 97);
        for (size_t i = 0; i < iterations; i++) {
            // compute
            size_t v = 1;
            for (size_t j = 2; j < (i % 7890) + 4567; j++) {
                v *= j;
                v = std::max(v % 39484, v % 85783);
            }
            // save
            if (i % save_frequency == 0) {
                std::string iv =
                    boost::str(boost::format("%1%=%2%") % i % v);
                queue_a->push(iv, 1, 200);
            }
            sleep_frame();
        }
    } catch (boost::thread_interrupted&) {
    }
}

void prodcons_run() {
    try {
        for (;;) {
            std::string iv = queue_a->pop();
            queue_b->push(iv, 1, 200);
        }
    } catch (boost::thread_interrupted&) {
    }
}

void cons_run() {
    try {
        for (;;) {
            std::string iv = queue_b->pop();
            send_http_post("http://127.0.0.1", iv);
        }
    } catch (boost::thread_interrupted&) {
    }
}
My understanding is that using mutexes in this way should not make a system unresponsive. If anything, my apps would deadlock and sleep forever.
Is there some way that having 200 of these at once creates a scenario where this isn't the case?
Update:
When I restart the computer, most of the time I need to replug in the USB keyboard to get it to respond. Given the driver comment, I thought that may be relevant. I tried updating the northbridge drivers, though they were up to date. I'll look to see if there are other drivers that need attention.
Update:
I've watched memory, non-paged pool, cpu, handles, ports and none of them are at alarming rates at any time while the system is responsive. It's possible something spikes at the end, though that is not visible to me.
Update:
When the system hangs, it stops rendering and does not respond to the keyboard. The last frame that it rendered stays up though. The system sounds like it is still running and when the system comes back up there is nothing in the event viewer saying that it crashed. There are no crash dump files either. I interpret this as the OS is being blocked out of execution.
A mutex lock only locks out other applications (or threads) that use the same lock. Any mutex used by the OS should not be (directly) available to any application.
Of course, if the mutex is implemented using the OS in some way, it may well call into the OS, and thus using CPU resources. However, a mutex lock should not cause any worse behaviour than the application using CPU resources any other way.
It may of course be that if you use locks in an inappropriate way, different parts of the application become deadlocked: function 1 acquires lock A and function 2 acquires lock B; if function 1 then tries to acquire lock B and function 2 tries to acquire lock A before releasing their respective locks, you have a deadlock. The trick here is to always acquire multiple locks in the same order: if you need two locks at the same time, always acquire lock A first, then lock B. A minimal sketch follows.
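For illustration, the ordering rule looks like this (hypothetical locks lockA and lockB, boost style as in the question):

#include <boost/thread/mutex.hpp>

boost::mutex lockA, lockB;

void function1() {
    // Rule: every code path acquires lockA before lockB.
    boost::mutex::scoped_lock a(lockA);
    boost::mutex::scoped_lock b(lockB);
    // ... work with both shared resources ...
}

void function2() {
    // Same order as function1, so the two functions can never
    // deadlock against each other on this pair of locks.
    boost::mutex::scoped_lock a(lockA);
    boost::mutex::scoped_lock b(lockB);
    // ...
}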
Deadlocking should not affect the OS as such; if anything, a deadlocked application consumes fewer resources. But if the application "misbehaves" in case of a deadlock, it may cause problems by calling into the OS a lot, e.g. if the locking is done by:
while (!trylock(lock))
{
    // do nothing here
}
it may cause peaks in system usage.

Unable to get std::mutex to protect thread access correctly

[C++ using Visual Studio Professional 2012]
Hi all, I am having trouble using std::mutex to prevent main() from changing variables that a second thread is accessing. In the following example (a massively simplified representation of my actual program) the function update() runs from the std::thread t2 in main(). update() checks whether the vector world.m_grid[i][j].vec is empty and, if it is not, modifies the value contained. main() also accesses and occasionally clears this vector; as a result, if main() clears the vector after the empty check in update() but before world.m_grid[i][j].vec[0] is modified, you get a vector subscript out of range error. I am trying to use std::mutex to prevent this by locking barrier before the empty check in update() and releasing it after world.m_grid[i][j].vec[0] has been modified. After extensive browsing of mutex tutorials and examples, I am unable to understand why the following does not have the desired effect:
#include <cstdlib>
#include <thread>
#include <mutex>
#include <vector>
using namespace std;

mutex barrier;

class World
{
public:
    int m_rows;
    int m_columns;

    class Tile
    {
    public:
        vector<int> vec;
        int someVar;
    };

    vector<vector<Tile> > m_grid;

    World(int rows = 100, int columns = 200)
        : m_rows(rows), m_columns(columns), m_grid(rows, vector<Tile>(columns)) {}
};

void update(World& world)
{
    while (true)
    {
        for (int i = 0; i < world.m_rows; ++i)
        {
            for (int j = 0; j < world.m_columns; ++j)
            {
                if (!world.m_grid[i][j].vec.empty())
                {
                    lock_guard<mutex> guard(barrier);
                    world.m_grid[i][j].vec[0] += 5;
                }
            }
        }
    }
}

int main()
{
    World world;
    thread t2(update, ref(world));
    while (true)
    {
        for (int i = 0; i < world.m_rows; ++i)
        {
            for (int j = 0; j < world.m_columns; ++j)
            {
                int random = rand() % 10;
                if (world.m_grid[i][j].vec.empty() && random < 3)
                    world.m_grid[i][j].vec.push_back(1);
                else if (!world.m_grid[i][j].vec.empty() && random < 3)
                    world.m_grid[i][j].vec.clear();
            }
        }
    }
    t2.join();
    return 0;
}
I must be missing something fundamental here. Ideally the solution would just lock down world.m_grid[i][j] (leaving the rest of world.m_grid accessible to main()), which I assume would involve including a mutex in the class "Tile", but I run into the same problem as described here: Why does std::mutex create a C2248 when used in a struct with WIndows SOCKET? and have been unable to adapt the solution described to my project, so it would be extra helpful if someone was able to help me out there too.
-Thankyou for your time.
[edit] spelling
You need to lock the mutex in your main function as well when you access the array:
...
for (int j = 0; j < world.m_columns; ++j) {
    lock_guard<mutex> guard(barrier);
    int random = rand() % 10;
    if (world.m_grid[i][j].vec.empty() && random < 3)
        world.m_grid[i][j].vec.push_back(1);
    else if (!world.m_grid[i][j].vec.empty() && random < 3)
        world.m_grid[i][j].vec.clear();
}
...
With mutexes you need to protect all accesses to your data. So far, only thread 2 takes the mutex when it accesses the data; the main thread knows nothing about the mutex and can simply change the data. Note also that the lock in update() needs to be taken before the empty() check rather than after it; otherwise main() can still clear the vector between the check and the modification.
The problem you are having is that you are using so-called client-side synchronization. In other words: you have several threads, and before any of them reads or writes the shared resource, it has to take the barrier. As tune2fs has already replied, you have to lock_guard<mutex> guard(barrier) before the accesses in the main thread as well.
That said, it would be much better for you to implement server-side synchronization. In other words, every block of accesses (if there is more than one line, as in your main thread, all of them need to go together) would be synchronized by the server (World).
For instance, you could give World a method void modify(std::function<void(std::vector<int>&)> mutator); and send all your logic through this method as lambdas (the easiest way). Inside modify you would use the standard lock_guard with a mutex owned by World. This solution is much more scalable and safe (you really don't want to hunt down every place that modifies the vector). A minimal sketch follows.
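A minimal sketch of such a server-side modify method, extended with tile indices so a single World-level lock can guard any tile (the exact interface is an assumption):

#include <functional>
#include <mutex>
#include <vector>

class World
{
public:
    struct Tile
    {
        std::vector<int> vec;
        int someVar;
    };

    World(int rows = 100, int columns = 200)
        : m_grid(rows, std::vector<Tile>(columns)) {}

    // Server-side synchronization: every access to a tile's vector goes
    // through this method, so no caller can forget to take the lock.
    void modify(int i, int j, const std::function<void(std::vector<int>&)>& mutator)
    {
        std::lock_guard<std::mutex> guard(m_mutex);
        mutator(m_grid[i][j].vec);
    }

private:
    // One mutex owned by World. A per-Tile std::mutex would make Tile
    // non-copyable and break vector<Tile> (the C2248 problem the question
    // links to), so a single World-level lock is the simpler route.
    std::mutex m_mutex;
    std::vector<std::vector<Tile>> m_grid;
};

// Usage from either thread, e.g. the updater:
//   world.modify(i, j, [](std::vector<int>& v) { if (!v.empty()) v[0] += 5; });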