Locking multiple parts of an array - Multithreading

I'm trying to implement a thread-safe locking mechanism for an array with the following intended use case:
Request the indexes you want to lock and try to acquire them. If you fail to acquire ANY index, bail out and try again (essentially spin).
Once the necessary locks have been acquired, perform processing on these indexes.
Release the acquired locks!
I'm using the code below to test the lock. It just increments a test count, with the same indexes specified for each iteration (so access is forced to be sequential). The only problem is that it doesn't work, and I'm kind of stumped...
I have a feeling I'm missing some key race condition, but I can't identify it yet :(
#pragma omp parallel for
for (int i = 0; i < activeItems; i++)
{
std::vector<int> lockedBucketIndexes = {0, 1, 2, 3};
//try and get a lock on the buckets we want, otherwise keep trying
while (!spatialHash->TryAcquireBucketLocks(lockedBucketIndexes))
{
//TODO - do some fancy backoff here
}
testCount++;
spatialHash->DropBucketLocks(lockedBucketIndexes);
}
The class/methods that do the locking:
std::vector<int> _bucketLocks;
SpinLock _bucketLockLock;
bool SpatialHash::TryAcquireBucketLocks(std::vector<int> bucketIndexes)
{
bool success = true;
//try and get a lock to set our bucket locks... lockception
_bucketLockLock.lock();
//quickly check that the buckets we want are free
for each (int bucketIndex in bucketIndexes)
{
if (_bucketLocks[bucketIndex] > 0)
{
success = false;
break;
}
}
//if all the buckets are free, set them to occupied
if (success)
{
for each (int bucketIndex in bucketIndexes)
{
_bucketLocks[bucketIndex] = 1;
}
}
//let go of the lock
_bucketLockLock.unlock();
return success;
}
void SpatialHash::DropBucketLocks(std::vector<int> bucketIndexes)
{
//I have no idea why these locks are required
//It seems to almost work with them though...
_bucketLockLock.lock();
for each (int bucketIndex in bucketIndexes)
{
_bucketLocks[bucketIndex] = 0;
}
_bucketLockLock.unlock();
}
The spinlock class:
class SpinLock {
std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
void lock() {
while (locked.test_and_set(std::memory_order_acquire)) { ; }
}
void unlock() {
locked.clear(std::memory_order_release);
}
};
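For comparison, here is a minimal sketch of the same acquire-all-or-bail-out idea using one atomic flag per bucket instead of a global spinlock. It is not the SpatialHash code above: the class name BucketLocks and its methods are hypothetical, and it assumes the index list contains no duplicates and that every index is in range.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not the SpatialHash class from the question.
class BucketLocks {
    std::vector<std::atomic<bool>> held_;  // true while a bucket is locked
public:
    explicit BucketLocks(std::size_t n) : held_(n) {
        for (auto& h : held_) h.store(false, std::memory_order_relaxed);
    }

    // All-or-nothing: returns true only if every index was acquired.
    // Assumes `indexes` has no duplicates and every index is in range.
    bool try_acquire(const std::vector<int>& indexes) {
        for (std::size_t k = 0; k < indexes.size(); ++k) {
            if (held_[indexes[k]].exchange(true, std::memory_order_acquire)) {
                // Bucket already held: undo the ones taken so far and bail out.
                for (std::size_t j = 0; j < k; ++j)
                    held_[indexes[j]].store(false, std::memory_order_release);
                return false;
            }
        }
        return true;
    }

    // Spin until all indexes are held, yielding between attempts.
    void acquire(const std::vector<int>& indexes) {
        while (!try_acquire(indexes)) std::this_thread::yield();
    }

    void release(const std::vector<int>& indexes) {
        for (int i : indexes) held_[i].store(false, std::memory_order_release);
    }
};
```

The yield in acquire() stands in for the "fancy backoff" TODO. Threads that keep colliding can still make each other retry, so a real backoff policy may be needed under heavy contention.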

Related

flushing thread local buffer at end of parallel loop with TBB

I want to parallelize a loop (using tbb) which contains some expensive but vectorizable iterations (randomly spread). My idea was to buffer those and flush the buffer whenever it reaches the vector size. Such a buffer must be thread-local. For example,
// dummy for testing
void do_vectorized_work(size_t k, size_t*indices)
{}
// dummy for testing
bool requires_expensive_work(size_t k)
{ return (k&7)==0; }
struct buffer
{
size_t K=0, B[vector_size];
void load(size_t i)
{
B[K++]=i;
if(K==vector_size)
flush();
}
void flush()
{
do_vectorized_work(K,B);
K=0;
}
};
void do_work_in_parallel(size_t N)
{
tbb::enumerable_thread_specific<buffer> tl_buffer;
tbb::parallel_for(size_t(0),N,[&](size_t i)
{
if(requires_expensive_work(i))
tl_buffer.local().load(i);
});
}
However, this leaves the buffers non-empty, so I still have to flush each of them a final time
for(auto&b:tl_buffer)
b.flush();
but this is serial! Of course, I can also try to do this in parallel
using tl_range = typename tbb::enumerable_thread_specific<buffer>::range_type;
tbb::parallel_for(tl_buffer.range(),[](tl_range const&range)
{
for(auto r:range)
r->flush();
});
But I'm not sure this is efficient (since there are only as many buffers as there are threads). I was wondering whether it is possible to avoid this final flush after the event. I.e. is it possible to use tbb::tasks (replacing tbb::parallel_for) in such a way that each thread's final task is to flush its buffer?

No. A worker thread does not have complete information about whether a particular task is the last task of the given work (this is how work stealing works), so such a function cannot be implemented at the level of parallel_for or the scheduler itself. I'd therefore recommend going with one of the two approaches you describe.
There are two other things you can do about this, though:
Make it asynchronous, i.e. enqueue a task that flushes everything. This removes the code from the hot path on the main thread. Just be careful if there are any dependencies that need to be set on completion of this task.
Use tbb::task_scheduler_observer to initialize thread-specific data and release it lazily when threads shut down or when no work remains for some time. The latter requires the local observer feature, which is not yet officially supported but has been stable for a few years already.
Example:
#define TBB_PREVIEW_LOCAL_OBSERVER 1
#include <tbb/tbb.h>
#include <assert.h>
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free
#include <unistd.h>  // usleep
typedef void * buffer_t;
const static int bufsz = 1024;
class thread_buffer_allocator: public tbb::task_scheduler_observer {
tbb::enumerable_thread_specific<buffer_t> _buf;
public:
thread_buffer_allocator( )
: tbb::task_scheduler_observer( /*local=*/ true ) {
observe(true); // activate the observer
}
~thread_buffer_allocator( ) {
observe(false); // deactivate the observer
for(auto &b : _buf) {
printf("destructor: cleared: %p\n", b);
free(b);
}
}
/*override*/ void on_scheduler_entry( bool worker ) {
assert(_buf.local() == nullptr);
_buf.local() = malloc(bufsz);
printf("on entry: %p\n", _buf.local());
}
/*override*/ void on_scheduler_exit( bool worker ) {
printf("on exit\n");
if(_buf.local()) {
printf("on exit: cleared %p\n", _buf.local());
free(_buf.local());
_buf.local() = nullptr;
}
}
};
int main() {
thread_buffer_allocator buffers_scope;
tbb::parallel_for(0, 1024*1024*1024, [&](auto i){
usleep(i%3);
});
return 0;
}
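To illustrate the point about work stealing: a worker can only know which iteration is its last when the partition is fixed up front. The sketch below deliberately uses plain std::thread with static chunks instead of TBB (so it gives up TBB's load balancing). The thread count parameter and the processed counter are assumptions added for the demo, and the thread count is assumed to be at least 1. Each worker owns its buffer and flushes it as its final step.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t vector_size = 8;
std::atomic<std::size_t> processed{0};  // demo-only counter of flushed indices

// Stand-in for the real vectorized kernel: just counts the indices.
void do_vectorized_work(std::size_t k, const std::size_t*) { processed += k; }
bool requires_expensive_work(std::size_t k) { return (k & 7) == 0; }

struct buffer {
    std::size_t K = 0, B[vector_size];
    void load(std::size_t i) {
        B[K++] = i;
        if (K == vector_size) flush();
    }
    void flush() {
        do_vectorized_work(K, B);
        K = 0;
    }
};

// Static partition: each thread gets one contiguous chunk of [0, N).
void do_work_in_parallel(std::size_t N, unsigned threads) {
    std::vector<std::thread> pool;
    const std::size_t chunk = (N + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t lo = std::min(N, t * chunk);
        const std::size_t hi = std::min(N, lo + chunk);
        pool.emplace_back([lo, hi] {
            buffer b;  // owned by this worker only
            for (std::size_t i = lo; i < hi; ++i)
                if (requires_expensive_work(i)) b.load(i);
            b.flush();  // the worker's last action: flush the leftovers
        });
    }
    for (auto& th : pool) th.join();
}
```

With a work-stealing scheduler there is no such fixed "end of my chunk", which is why the final flush has to happen after the parallel loop there.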

It occurred to me that this can be solved by reduction.
struct buffer
{
std::size_t K=0, B[vector_size];
void load(std::size_t i)
{
B[K++]=i;
if(K==vector_size) flush();
}
void flush()
{
do_vectorized_work(K,B);
K=0;
}
buffer() = default; // the splitting constructor below suppresses the implicit default constructor
buffer(buffer const&, tbb::split)
{}
void operator()(tbb::blocked_range<std::size_t> const&range)
{ for(std::size_t i=range.begin(); i!=range.end(); ++i) load(i); }
bool empty() const
{ return K==0; }
std::size_t pop()
{ return K? B[--K] : 0; }
void join(buffer&rhs)
{ while(!rhs.empty()) load(rhs.pop()); }
};
void do_work_in_parallel(std::size_t N)
{
buffer buff;
tbb::parallel_reduce(tbb::blocked_range<std::size_t>(0,N,vector_size),buff);
if(!buff.empty())
buff.flush();
}

Deadlock in C++ code

I'm trying to handle a deadlock in my code, but I can't figure out how to prevent it. I have a thread that accesses data and an update method that updates the data. The code looks like this:
thread {
forever {
if (Running) {
LOCK
access data
UNLOCK
}
Running = false;
}
}
update {
Running = false;
LOCK
access data
UNLOCK
Running = true;
}
I tried to fix it with a second access variable but it doesn't change anything.
thread {
forever {
if (!Updating) {
if (Running) {
LOCK
access data
UNLOCK
}
}
Running = false;
}
}
update {
Updating = true;
Running = false;
LOCK
access data
UNLOCK
Updating = false;
Running = true;
}
Thanks for your help.
UPDATE
This is a better description of the problem:
thread {
forever {
if (Running) {
LOCK
if (!Running) leave
access data
UNLOCK
}
Running = false;
}
}
update {
Running = false;
LOCK
access data
UNLOCK
Running = true;
}
My update function is a bit more complex, so I can't see a way to use one of the standard algorithms for this.
UPDATE 2
Here is the simplified C++ source code. Maybe it's easier to read than the pseudocode:
void run() {
forever {
if (mRunning) {
QMutexLocker locker(&mMutex);
for (int i = 0; i < 10; i++) {
qDebug("run %d", i);
sleep(1);
if (!mRunning) break;
}
mRunning = false;
}
}
}
void update() {
mRunning = false;
QMutexLocker locker(&mMutex);
qDebug("update");
mRunning = true;
}
UPDATE 3
OK. The problem is a bit more complex. I forgot that the "access data" part in my thread also starts some child threads to fill the data structure:
datathread {
access data
}
thread {
forever {
if (Running) {
LOCK
if (!Running) leave
forloop
start datathread to fill data to accessdata list
UNLOCK
}
Running = false;
}
}
update {
Running = false;
LOCK
access data
UNLOCK
Running = true;
}

The standard way to restart a reader when a write happens during the read is a seqlock. With a single writer and a single reader, a seqlock is just an atomic integer that the writer increments once when it starts and once more when it finishes. The reader can then periodically check whether the variable has changed since the read started:
atomic<int> seq = 0;
updater() // *writer*
{
seq = seq + 1;
<update data>
seq = seq + 1;
}
thread() // *reader*
{
retry: // Start point of unmodified data processing.
{
int seq_old = seq;
if(seq_old & 1)
{
// odd value of the counter means that updater is in progress
goto retry;
}
for(int i = 0; i < 10; i++)
{
<process data[i]>
if(seq_old != seq)
{
// updater has been started. Restart processing.
goto retry;
}
}
// Data processing is done.
}
}
If several updater() calls can execute concurrently, the whole update code should run with a mutex held:
updater() // *writer*
{
QMutexLocker locker(&updater_Mutex);
seq = seq + 1;
<update data>
seq = seq + 1;
}
If not even a single element of the data may be accessed concurrently with an update, both <update data> and <process data[i]> should be executed with a mutex held.
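The pseudocode above could be made concrete roughly as follows. This is only a sketch under stated assumptions: a single writer, a payload of two ints (the class name SeqLockedPair is made up), and the payload itself stored in atomics so that a reader racing with the writer is not undefined behaviour. All operations use the default sequentially consistent ordering for simplicity.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-writer seqlock around a pair of ints.
class SeqLockedPair {
    std::atomic<unsigned> seq_{0};  // odd while an update is in progress
    std::atomic<int> a_{0}, b_{0};  // payload; atomic so racing reads are defined
public:
    // Must only ever be called from one writer thread at a time.
    void write(int a, int b) {
        seq_.store(seq_.load() + 1);  // becomes odd: update started
        a_.store(a);
        b_.store(b);
        seq_.store(seq_.load() + 1);  // becomes even: update finished
    }

    // One read attempt; returns false if a write overlapped it.
    bool try_read(int& a, int& b) const {
        const unsigned before = seq_.load();
        if (before & 1u) return false;  // writer in progress
        a = a_.load();
        b = b_.load();
        return seq_.load() == before;   // unchanged: snapshot is consistent
    }

    // Retry until a consistent snapshot was read.
    void read(int& a, int& b) const {
        while (!try_read(a, b)) { /* retry */ }
    }
};
```

A reader that fails try_read simply starts over, which corresponds to the goto retry in the pseudocode.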

Fastest and safest way to call functions in extern process

Description of the problem:
We need to call a function in an external process as fast as possible. Boost.Interprocess shared memory is used for communication. The external process is either an MPI master or a single executable. The calculation time of the function lies between 1 ms and 1 s. The function should be called up to 10^8-10^9 times.
I've tried a lot of possibilities, but I still have some problems with each of them. Here I present the two best-working implementations.
Version 1 (using interprocess conditions)
Main-process
bool calculate(double& result, std::vector<double> c){
// data_ptr_ is a structure in shared memory
data_ptr_->validCalculation = false;
bool timeout = false;
// write data (cVec_ is a vector in shared memory )
cVec_->clear();
for (int i = 0; i < c.size(); ++i)
{
cVec_->push_back(c[i]);
}
// cond_input_data is boost interprocess condition
data_ptr_->cond_input_data.notify_one();
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
// lock slave process
scoped_lock<interprocess_mutex> lock_output(data_ptr_->mutex_output);
// wait till data calculated
timeout = !(data_ptr_->cond_output_data.timed_wait(lock_output, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
The external process runs a while loop (until the abort condition is fulfilled):
do {
scoped_lock<interprocess_mutex> lock_input(data_ptr_->mutex_input);
boost::system_time const waittime = boost::get_system_time() + boost::posix_time::seconds(maxWaitTime_in_sec);
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime)); // true if timeout, false if no timeout
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_function_here(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is available or that we did not get the input data
data_ptr_->cond_output_data.notify_one();
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
This is the best-working version, but it sometimes hangs if the calculation time is short (~1 ms). I assume this happens when the main process reaches
data_ptr_->cond_input_data.notify_one();
before the external process is waiting in
timeout = !(data_ptr_->cond_input_data.timed_wait(lock_input, waittime));
So we probably have some kind of synchronisation problem.
A second condition does not help (i.e. waiting only if the input data is not set, similar to the anonymous condition example with the message_in flag), since it is still possible for one process to notify the other before the second one is waiting for the notification.
Version 2 (using a boolean flag and a while loop with some delay)
Main-process
bool calculate(double& result, std::vector<double> c){
data_ptr_->validCalculation = false;
bool timeout = false;
// write data
cVec_->clear();
for (int i = 0; i < c.size(); ++i) //Insert data in the vector
{
cVec_->push_back(c[i]);
}
// this is the flag in shared memory used for communication
*calc_flag_ = true;
clock_t test_begin = clock();
clock_t calc_time_begin = clock();
do
{
calc_time_begin = clock();
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
// wait till data calculated
timeout = (double(calc_time_begin - test_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
} while (*(calc_flag_) && !timeout);
if (!timeout)
{
// get result
result = *result_;
return data_ptr_->validCalculation;
}
else
{
return false;
}
};
and the external process
do {
// we wait till input data is set
wait_begin = clock();
do
{
wait_end = clock();
timeout = (double(wait_end - wait_begin) / CLOCKS_PER_SEC > maxWaitTime_in_sec);
boost::this_thread::sleep(boost::posix_time::milliseconds(while_loop_delay_m_s));
} while (!(*calc_flag_) && !(*abort_flag_) && !timeout);
if (!timeout)
{
if (!*abort_flag_) {
c.clear();
for (int i = 0; i < (*cVec_).size(); ++i) //Insert data in the vector
{
c.push_back(cVec_->at(i));
}
// calculate value
if (call_of_local_function(result, c)) { // valid calculation ?
*result_ = result;
data_ptr_->validCalculation = true;
}
}
}
//Notify the other process that the data is available or that we did not get the input data
*calc_flag_ = false;
} while (!*abort_flag_); // while abort flag is not set, check if some values should be calculated
The problem in this version is the delay time. Since we have calculation times close to 1 ms, we have to set the delay to at least this value. For smaller delays the CPU load is high; for larger delays we lose a lot of performance to unnecessary waiting.
Do you have an idea how to improve one of these versions? Or maybe there is a better solution?
Thanks.
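One common way to avoid the lost notification described for Version 1 is to set a flag while holding the same mutex the waiter uses, and to wait on a predicate over that flag. The sketch below shows the pattern in a single process with std::condition_variable. Carrying it over to Boost.Interprocess conditions in shared memory is an assumption here, and every name in it (Channel, worker, calculate, stop, round_trip_demo) is made up for illustration. The doubling stands in for the real calculation, and timeouts are left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical in-process stand-in for the shared-memory structure.
struct Channel {
    std::mutex m;
    std::condition_variable cv;
    bool has_input = false, has_output = false, stop_requested = false;
    double input = 0.0, output = 0.0;
};

// Plays the role of the external process.
void worker(Channel& ch) {
    std::unique_lock<std::mutex> lk(ch.m);
    for (;;) {
        // The predicate is checked before sleeping, so a notification sent
        // before this call is not lost: the flag is already set.
        ch.cv.wait(lk, [&] { return ch.has_input || ch.stop_requested; });
        if (ch.stop_requested) return;
        const double x = ch.input;
        ch.has_input = false;
        lk.unlock();                // do the calculation without the lock
        const double r = 2.0 * x;   // placeholder for the real function
        lk.lock();
        ch.output = r;
        ch.has_output = true;
        ch.cv.notify_all();
    }
}

// Plays the role of the main process.
double calculate(Channel& ch, double x) {
    std::unique_lock<std::mutex> lk(ch.m);
    ch.input = x;
    ch.has_output = false;
    ch.has_input = true;            // flag is set under the mutex
    ch.cv.notify_all();
    ch.cv.wait(lk, [&] { return ch.has_output; });
    return ch.output;
}

void stop(Channel& ch) {
    { std::lock_guard<std::mutex> lk(ch.m); ch.stop_requested = true; }
    ch.cv.notify_all();
}

// Small driver: start the worker, do one request, shut down.
double round_trip_demo(double x) {
    Channel ch;
    std::thread t(worker, std::ref(ch));
    const double r = calculate(ch, x);
    stop(ch);
    t.join();
    return r;
}
```

The key difference from Version 1 is that both sides use one mutex for the flags, so "flag set" and "waiter asleep" can never interleave in a way that drops the wake-up.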

Synchronization between threads without overload

I can't find a good way to implement mutual exclusion on a common resource shared between different threads.
I've got many methods (from a class) that access a database heavily; this is one of them:
string id = QUERYPHYSICAL + toString(ID);
wait();
mysql_query(connection, id.c_str());
MYSQL_RES *result = mysql_use_result(connection);
while (MYSQL_ROW row = mysql_fetch_row(result)){
Physical[ID - 1].ID = atoi(row[0]);
Physical[ID - 1].NAME = row[1];
Physical[ID - 1].PEOPLE = atoi(row[2]);
Physical[ID - 1].PIRSTATUS = atoi(row[3]);
Physical[ID - 1].LIGHTSTATUS = atoi(row[4]);
}
mysql_free_result(result);
signal();
The methods wait and signal do these things:
void Database::wait(void) {
while(!this->semaphore);
this->semaphore = false;
}
void Database::signal(void) {
this->semaphore = true;
}
But in this case my CPU usage goes above 190% (reading from /proc/loadavg). What should I do to reduce the CPU load and make the system more efficient? I'm on an 800 MHz Raspberry Pi.

You can use a pthread_mutex_t: initialize it in the constructor, lock it in wait, unlock it in signal, and destroy it in the destructor,
like this:
class Mutex{
pthread_mutex_t m;
public:
Mutex(){
pthread_mutex_init(&m,NULL);
}
~Mutex(){
pthread_mutex_destroy(&m);
}
void wait() {
pthread_mutex_lock(&m);
}
void signal() {
pthread_mutex_unlock(&m);
}
} ;
You should also check the return values of the pthread_mutex functions: 0 means success, non-zero means an error.
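A variant of the class above with the return values checked might look like the sketch below. The try_wait method and the ScopedLock guard are additions for illustration, not part of the answer. The guard makes sure signal() runs even if the code between wait and signal returns early or throws.

```cpp
#include <pthread.h>
#include <cassert>
#include <stdexcept>

class Mutex {
    pthread_mutex_t m;
public:
    Mutex() {
        if (pthread_mutex_init(&m, nullptr) != 0)
            throw std::runtime_error("pthread_mutex_init failed");
    }
    ~Mutex() { pthread_mutex_destroy(&m); }
    Mutex(const Mutex&) = delete;             // a mutex must not be copied
    Mutex& operator=(const Mutex&) = delete;

    void wait() {
        if (pthread_mutex_lock(&m) != 0)
            throw std::runtime_error("pthread_mutex_lock failed");
    }
    // Hypothetical addition: returns true if the lock was taken.
    bool try_wait() { return pthread_mutex_trylock(&m) == 0; }
    // Returns true on success; does not throw so it is safe in destructors.
    bool signal() { return pthread_mutex_unlock(&m) == 0; }
};

// Hypothetical RAII guard: locks in the constructor, unlocks in the destructor.
class ScopedLock {
    Mutex& mx;
public:
    explicit ScopedLock(Mutex& m) : mx(m) { mx.wait(); }
    ~ScopedLock() { mx.signal(); }
    ScopedLock(const ScopedLock&) = delete;
    ScopedLock& operator=(const ScopedLock&) = delete;
};
```

In the database method this would replace the wait()/signal() pair with a single ScopedLock at the top of the critical section, and a blocked thread sleeps in the kernel instead of spinning.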

How can I synchronize three threads?

My app consists of the main process and two threads, all running concurrently and making use of three FIFO queues:
The FIFO queues are Qmain, Q1 and Q2. Internally, each queue uses a counter that is incremented when an item is put into the queue and decremented when an item is 'get'ed from the queue.
The processing involves two threads,
QMaster, which gets from Q1 and Q2 and puts into Qmain,
Monitor, which puts into Q2,
and the main process, which gets from Qmain and puts into Q1.
The QMaster thread loop continuously checks the counts of Q1 and Q2, and if any items are in the queues, it gets them and puts them into Qmain.
The Monitor thread loop obtains data from external sources, packages it and puts it into Q2.
The main process of the app also runs a loop checking the count of Qmain and, if there are any items, gets an item
from Qmain at each iteration of the loop and processes it further. During this processing it occasionally
puts an item into Q1 to be processed later (when it is 'get'ed from Qmain in turn).
The problem:
I've implemented everything as described above, and it works for a random (short) time and then hangs.
I've managed to trace the source of the crash to the increment/decrement of the
count of a FIFO queue (it may happen in any of them).
What I've tried:
Using three mutexes, QMAIN_LOCK, Q1_LOCK and Q2_LOCK, which I lock whenever any get/put operation
is done on the relevant FIFO queue. Result: the app doesn't get going, it just hangs.
The main process must continue running all the time and must not be blocked on a 'read' (named pipes fail, socketpair fails).
Any advice?
I think I'm not implementing the mutexes properly. How should it be done?
(Any comments on improving the above design are also welcome.)
[edit] Below are the processes and the fifo_q template:
Where and how in this should I place the mutexes to avoid the problems described above?
main-process:
...
start thread QMaster
start thread Monitor
...
while (!quit)
{
...
if (Qmain.count() > 0)
{
X = Qmain.get();
process(X)
delete X;
}
...
//at some random time:
Q1.put(Y);
...
}
Monitor:
{
while (1)
{
//obtain & package data
Q2.put(data)
}
}
QMaster:
{
while(1)
{
if (Q1.count() > 0)
Qmain.put(Q1.get());
if (Q2.count() > 0)
Qmain.put(Q2.get());
}
}
fifo_q:
template < class X > class fifo_q
{
struct item
{
X* data;
item *next;
item() { data=NULL; next=NULL; }
};
item *head, *tail;
int count;
public:
fifo_q() { head=tail=NULL; count=0; }
~fifo_q() { clear(); /*deletes all items*/ }
void put(X* x) { item *i=new item(); (... adds to tail...); count++; }
X* get() { X *d = head->data; (...deletes head ...); count--; return d; }
void clear() {...}
};

Here is an example of how I would adapt the design and lock the queue access the POSIX way.
Note that I would wrap the mutex to use RAII, or use Boost.Thread, and that I would use std::deque or std::queue as the queue, but I am staying as close as possible to your code:
main-process:
...
start thread Monitor
...
while (!quit)
{
...
if (Qmain.count() > 0)
{
X = Qmain.get();
process(X)
delete X;
}
...
//at some random time:
Qmain.put(Y);
...
}
Monitor:
{
while (1)
{
//obtain & package data
Qmain.put(data)
}
}
fifo_q:
template < class X > class fifo_q
{
struct item
{
X* data;
item *next;
item() { data=NULL; next=NULL; }
};
item *head, *tail;
int count;
pthread_mutex_t m;
public:
fifo_q() { head=tail=NULL; count=0; }
~fifo_q() { clear(); /*deletes all items*/ }
void put(X* x)
{
pthread_mutex_lock(&m);
item *i=new item();
(... adds to tail...);
count++;
pthread_mutex_unlock(&m);
}
X* get()
{
pthread_mutex_lock(&m);
X *d = head->data;
(...deletes head ...);
count--;
pthread_mutex_unlock(&m);
return d;
}
void clear() {...}
};
Note also that the mutex still needs to be initialized (e.g. with pthread_mutex_init in the constructor) and that count() should also take the mutex.

Use the debugger. When your solution with mutexes hangs look at what the threads are doing and you will get a good idea about the cause of the problem.
What is your platform? In Unix/Linux you can use POSIX message queues (you can also use System V message queues, sockets, FIFOs, ...) so you don't need mutexes.
Learn about condition variables. From your description it looks like your QMaster thread is busy looping, burning your CPU.
One of your responses suggests you are doing something like:
Q2_mutex.lock()
Qmain_mutex.lock()
Qmain.put(Q2.get())
Qmain_mutex.unlock()
Q2_mutex.unlock()
but you probably want to do it like:
Q2_mutex.lock()
X = Q2.get()
Q2_mutex.unlock()
Qmain_mutex.lock()
Qmain.put(X)
Qmain_mutex.unlock()
and as Gregory suggested above, encapsulate the logic into the get/put.
EDIT: Now that you posted your code I wonder, is this a learning exercise?
Because I see that you are coding your own FIFO queue class instead of using the C++ standard std::queue. I suppose you have tested your class really well and the problem is not there.
Also, I don't understand why you need three different queues. It seems that the Qmain queue would be enough, and then you will not need the Qmaster thread that is indeed busy waiting.
About the encapsulation, you can create a synch_fifo_q class that encapsulates the fifo_q class. Add a private mutex variable and then the public methods (put, get, clear, count,...) should be like put(X) { lock m_mutex; m_fifo_q.put(X); unlock m_mutex; }
Question: what happens if you have more than one reader on the queue? Is it guaranteed that after a "count() > 0" you can do a "get()" and actually get an element?
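The encapsulation and the condition-variable suggestion could be combined along these lines. This is a sketch built on std::queue, std::mutex and std::condition_variable rather than the hand-written fifo_q. The name synch_fifo_q comes from the text above; the try_get method is an addition, intended for a main loop that must never block.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

template <class T>
class synch_fifo_q {
    std::queue<T> q_;
    mutable std::mutex m_;               // protects q_
    std::condition_variable not_empty_;  // signalled on every put
public:
    void put(T x) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(x));
        }
        not_empty_.notify_one();
    }

    // Non-blocking: check and removal happen under one lock, so the
    // "count() > 0, then get()" race cannot occur.
    bool try_get(T& out) {
        std::lock_guard<std::mutex> lk(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }

    // Blocking: sleeps until an item is available instead of busy looping.
    T get() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [this] { return !q_.empty(); });
        T x = std::move(q_.front());
        q_.pop();
        return x;
    }

    std::size_t count() const {
        std::lock_guard<std::mutex> lk(m_);
        return q_.size();
    }
};
```

With this, the main loop would call try_get() once per iteration, while a consumer thread that has nothing else to do can call get() and sleep until data arrives.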

I wrote a simple application below:
#include <queue>
#include <cstdlib>   // rand
#include <windows.h>
#include <process.h>
#include <tchar.h>   // _tmain, _TCHAR
using namespace std;
queue<int> QMain, Q1, Q2;
CRITICAL_SECTION csMain, cs1, cs2;
unsigned __stdcall TMaster(void*)
{
while(1)
{
if( Q1.size() > 0)
{
::EnterCriticalSection(&cs1);
::EnterCriticalSection(&csMain);
int i1 = Q1.front();
Q1.pop();
//use i1;
i1 = 2 * i1;
//end use;
QMain.push(i1);
::LeaveCriticalSection(&csMain);
::LeaveCriticalSection(&cs1);
}
if( Q2.size() > 0)
{
::EnterCriticalSection(&cs2);
::EnterCriticalSection(&csMain);
int i1 = Q2.front();
Q2.pop();
//use i1;
i1 = 3 * i1;
//end use;
QMain.push(i1);
::LeaveCriticalSection(&csMain);
::LeaveCriticalSection(&cs2);
}
}
return 0;
}
unsigned __stdcall TMoniter(void*)
{
while(1)
{
int irand = ::rand();
if ( irand % 6 >= 3)
{
::EnterCriticalSection(&cs2);
Q2.push(irand % 6);
::LeaveCriticalSection(&cs2);
}
}
return 0;
}
unsigned __stdcall TMain(void)
{
while(1)
{
if (QMain.size() > 0)
{
::EnterCriticalSection(&cs1);
::EnterCriticalSection(&csMain);
int i = QMain.front();
QMain.pop();
i = 4 * i;
Q1.push(i);
::LeaveCriticalSection(&csMain);
::LeaveCriticalSection(&cs1);
}
}
return 0;
}
int _tmain(int argc, _TCHAR* argv[])
{
::InitializeCriticalSection(&cs1);
::InitializeCriticalSection(&cs2);
::InitializeCriticalSection(&csMain);
unsigned threadID;
::_beginthreadex(NULL, 0, &TMaster, NULL, 0, &threadID);
::_beginthreadex(NULL, 0, &TMoniter, NULL, 0, &threadID);
TMain();
return 0;
}

You should not lock a second mutex while you already hold one.
Since the question is tagged C++, I suggest implementing the locking inside the get/add logic of the queue class (e.g. using Boost locks), or writing a wrapper if your queue is not a class.
This lets you simplify the locking logic.
Regarding the sources you have added: the queue size check and the following put/get should be done in one transaction; otherwise another thread can modify the queue in between.

Are you acquiring multiple locks simultaneously? This is generally something you want to avoid. If you must, ensure you are always acquiring the locks in the same order in each thread (this is more restrictive to your concurrency and why you generally want to avoid it).
Other concurrency advice: Are you acquiring the lock prior to reading the queue sizes? If you're using a mutex to protect the queues, then your queue implementation isn't concurrent and you probably need to acquire the lock before reading the queue size.
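If two locks really must be held at once, the standard library can take them without relying on a manually maintained order. The sketch below uses std::lock, which acquires both mutexes with a deadlock-avoidance algorithm. LockedQueue and move_one are hypothetical names, and the two queues are assumed to be distinct objects.

```cpp
#include <cassert>
#include <mutex>
#include <queue>

// Hypothetical queue-plus-mutex pair.
struct LockedQueue {
    std::mutex m;
    std::queue<int> q;
};

// Moves one item from `from` to `to`; returns false if `from` is empty.
// Assumes `from` and `to` are different objects.
bool move_one(LockedQueue& from, LockedQueue& to) {
    std::unique_lock<std::mutex> a(from.m, std::defer_lock);
    std::unique_lock<std::mutex> b(to.m, std::defer_lock);
    std::lock(a, b);                 // takes both locks without deadlocking
    if (from.q.empty()) return false;  // size check happens under the lock
    to.q.push(from.q.front());
    from.q.pop();
    return true;                     // both locks released by the destructors
}
```

Holding one lock at a time, as suggested in the other answers, is still the simpler option when the operation allows it.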

One problem may arise from this rule: "The main-process must continue running all the time, must not be blocked on a 'read'". How did you implement it? What is the difference between 'get' and 'read'?
The problem seems to be in your implementation, not in the logic. And as you stated, you should not be in any deadlock, because you are not acquiring another lock while holding one.
