I'm trying to gain better understanding of controlling memory order when coding for multiple threads. I've used mutexes a lot in the past to serialize variable access, but I'm trying to avoid those where possible to improve performance.
I have a queue of pointers that may be filled by many threads and consumed by many threads. It works fine with a single thread, but crashes when I run with multiple threads. It looks like the consumers may be getting duplicates of the pointers which causes them to be freed twice. It's a little hard to tell since when I put in any print statements, it runs fine without crashing.
To start with, I'm using a pre-allocated vector to hold the pointers. I keep 3 atomic index variables to track which elements in the vector need processing. It may be worth noting that I also tried a _queue type where the elements themselves were atomic, but that did not seem to help. Here is the simpler version:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;
// Write to _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iwrite.load();
    uint32_t inext = (idx+1)%_queue.size();
    if( inext == iread.load() ) return kQUEUE_FULL;
    if( iwrite.compare_exchange_weak(idx, inext) ){
        _queue[idx] = jevent; // jevent is JEvent* passed into this method
        while( !_done ){
            if( iend.compare_exchange_weak(idx, inext) ) break;
        }
        break;
    }
}
and from the same class
// Read from _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iread.load();
    if(idx == iend.load()) return NULL;
    JEvent *Event = _queue[idx];
    uint32_t inext = (idx+1)%_queue.size();
    if( iread.compare_exchange_weak(idx, inext) ){
        _nevents_processed++;
        return Event;
    }
}
I should emphasize that I am really interested in understanding why this doesn't work. Implementing some other pre-made package would get me past this problem, but would not help me avoid making the same type of mistakes again later.
UPDATE
I'm marking Alexandr Konovalov's answer as correct (see my comment in his answer below). In case anyone comes across this page, the corrected code for the "Write" section is:
std::atomic<uint32_t> iread;
std::atomic<uint32_t> iwrite;
std::atomic<uint32_t> iend;
std::vector<JEvent*> _queue;
// Write to _queue (may be thread 1,2,3,...)
while(!_done){
    uint32_t idx = iwrite.load();
    uint32_t inext = (idx+1)%_queue.size();
    if( inext == iread.load() ) return kQUEUE_FULL;
    if( iwrite.compare_exchange_weak(idx, inext) ){
        _queue[idx] = jevent; // jevent is JEvent* passed into this method
        uint32_t save_idx = idx;
        while( !_done ){
            if( iend.compare_exchange_weak(idx, inext) ) break;
            idx = save_idx;
        }
        break;
    }
}
To me, one possible issue can occur when there are 2 writers and 1 reader. Suppose the 1st writer stops just before
_queue[0] = jevent;
and the 2nd writer signals via iend that its _queue[1] is ready to be read. Then the reader sees via iend that _queue[0] is ready to be read, so we have a data race.
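To make the failure concrete, here is one possible interleaving of the original write code (a sketch assuming a fresh queue, so iread == iwrite == iend == 0):

// Writer A: idx = 0, CAS iwrite 0 -> 1 succeeds, then stalls just
//           before the store "_queue[0] = jevent".
// Writer B: idx = 1, CAS iwrite 1 -> 2 succeeds, stores _queue[1].
// Writer B: CAS iend (expected idx == 1) fails because iend == 0;
//           compare_exchange_weak overwrites idx with 0 on failure.
// Writer B: retries CAS iend with expected idx == 0, desired inext == 2,
//           which now succeeds: iend jumps from 0 straight to 2.
// Reader:   sees iread (0) != iend (2) and returns _queue[0], which
//           Writer A has not yet written -- a data race.
// The corrected code restores idx = save_idx before every retry, so
// Writer B spins until Writer A publishes slot 0 and advances iend to 1.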
I recommend you try Relacy Race Detector, that ideally applies to such kind of analysis.
Related
My program has 8 writing threads and one persistence thread. The following code is the core of the persistence thread
std::string longLine;
myMutex.lock();
while (!myQueue.empty()) {
    std::string& head = myQueue.front();
    const int hSize = head.size();
    if (hSize < blockMaxSize)
        break;
    longLine += head;
    myQueue.pop_front();
}
myMutex.unlock();
flushToFile(longLine);
The performance is acceptable (millions of writes finished in hundreds of milliseconds). I still hoped to improve the code by avoiding the string copy, so I changed it as follows:
myMutex.lock();
while (!myQueue.empty()) {
    const int hSize = myQueue.front().size();
    if (hSize < blockMaxSize)
        break;
    std::string head{std::move(myQueue.front())};
    myQueue.pop_front();
    myMutex.unlock();
    flushToFile(head);
    myMutex.lock();
}
myMutex.unlock();
It is surprising that the performance drops sharply: the same millions of writes now take quite a few seconds. Debugging shows most of the time is spent waiting to reacquire the lock after flushing the file.
But I don't understand why. Can anyone help?
Possibly faster: do all your string concatenation inside the flush function. That way the concatenation won't block the writer threads trying to append to the queue. This is possibly a micro-optimization.
While we're at it, let's establish that myQueue should be a vector and not a queue or list class. This will be faster, since the only operations on the collection are an append or a total erase.
std::string longLine;
std::vector<std::string> tempQueue;
myMutex.lock();
if (myQueue.size() >= blockMaxSize) {
    tempQueue = std::move(myQueue);
    myQueue = {}; // not sure if this is needed
}
myMutex.unlock();
flushToFileWithQueue(tempQueue);
Where flushToFileWithQueue is this:
void flushToFileWithQueue(std::vector<std::string>& queue) {
    std::string longLine;
    for (size_t i = 0; i < queue.size(); i++) {
        longLine += queue[i];
    }
    queue.resize(0); // faster than calling .pop() N times
    flushToFile(longLine);
}
You didn't show what wakes up the persistence thread. If it's polling instead of using a proper condition variable, let me know and I'll show you how to use that.
Also make use of the .reserve() method on these vector instances so that the queue has all the memory it needs to grow. Again, possibly a micro-optimization.
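On the condition-variable point, here is a minimal sketch of what the wakeup could look like (the names and the done flag are illustrative, not from your code):

#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

std::mutex myMutex;
std::condition_variable queueCv;
std::deque<std::string> myQueue;
bool done = false;

// Writer threads: push a line, then wake the persistence thread.
void pushLine(std::string line) {
    {
        std::lock_guard<std::mutex> lock(myMutex);
        myQueue.push_back(std::move(line));
    }
    queueCv.notify_one();
}

// Persistence thread: sleep until there is work (or shutdown), then drain.
void persistenceLoop() {
    std::unique_lock<std::mutex> lock(myMutex);
    while (!done) {
        queueCv.wait(lock, [] { return done || !myQueue.empty(); });
        std::deque<std::string> local;
        local.swap(myQueue);   // grab the whole batch under the lock
        lock.unlock();         // flush without holding the lock
        // ... concatenate and flushToFile(...) here ...
        lock.lock();
    }
}

The key design point is the same as in the snippet above: take everything in one swap, then do the slow concatenation and file I/O outside the lock.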
I've coded a relatively simple communication protocol using shared memory and shared mutexes. But then I wanted to expand support to communicate between two .dll's using different run-times. It's quite obvious that if you have some std::vector<__int64> and two dll's - one vs2010, one vs2015 - they won't work politely with each other. Then I thought - why can't I serialize the structure in an IPC manner on one side and de-serialize it on the other - then the vs run-times will work smoothly with each other.
Long story short - I've created a separate interface for sending the next chunk of data and for requesting the next chunk of data. Both work while decoding happens - meaning if you have a vector with 10 entries, each string 1 Mb, and the shared memory is 10 Kb - then it would take 1*10*1024/10 transfers to move the whole data. Each request is followed by multiple outstanding function calls - either SendChunk or GetNextChunk, depending on the transfer direction.
Now I wanted encoding and decoding to happen simultaneously, but without any threading - so I came up with a solution using setjmp and longjmp. I'm attaching part of the code below, just so you get some understanding of what is happening in the whole machinery.
#include "..."
#include <setjmp.h> //setjmp
class Jumper: public IMessageSerializer
{
public:
char lbuf[ sizeof(IpcCommand) + 10 ];
jmp_buf jbuf1;
jmp_buf jbuf2;
bool bChunkSupplied;
Jumper() :
bChunkSupplied(false)
{
memset( lbuf, 0 , sizeof(lbuf) );
}
virtual bool GetNextChunk( bool bSend, int offset )
{
if( !bChunkSupplied )
{
bChunkSupplied = true;
return true;
}
int r = setjmp(jbuf1);
((_JUMP_BUFFER *)&jbuf1)->Frame = 0;
if( r == 0 )
longjmp(jbuf2, 1);
bChunkSupplied = true;
return true;
}
virtual bool SendChunk( bool bLast )
{
bChunkSupplied = false;
int r = setjmp(jbuf2);
((_JUMP_BUFFER *)&jbuf2)->Frame = 0;
if( r == 0 )
longjmp(jbuf1, 1);
return true;
}
bool FlushReply( bool bLast )
{
return true;
}
IpcCommand* getCmd( void )
{
return (IpcCommand*) lbuf;
}
int bufSize( void )
{
return 10;
}
}; //class Jumper
Jumper jumper;
void main(void)
{
EncDecCtx enc(&jumper, true, true);
EncDecCtx dec(&jumper, false, false);
CString s;
if( setjmp(jumper.jbuf1) == 0 )
{
alloca(16*1024);
enc.encodeString(L"My testing my very very long string.");
enc.FlushBuffer(true);
} else {
dec.decodeString(s);
}
wprintf(L"%s\r\n", s.GetBuffer() );
}
There are a couple of issues here. After the first call to setjmp I'm using alloca() - which allocates memory from the stack and is auto-freed on return. alloca can happen only during the first jump, because any function call uses the call stack (to save the return address) and could corrupt the second "thread's" call stack.
There are multiple articles discussing how dangerous setjmp and longjmp are, but this is now a somehow-working solution. The stack size (16 Kb) is a reservation for the function calls to come - decodeString and so on - and can be made bigger if it is not enough.
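For context, here is a minimal standalone sketch of the setjmp/longjmp round trip that the Jumper class builds on (no stack switching here, unlike the code above):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf ctx;

static void jumpBack(void)
{
    // Resume at the setjmp call site; setjmp then returns 1.
    longjmp(ctx, 1);
}

int main(void)
{
    if (setjmp(ctx) == 0)
    {
        printf("first pass: context saved\n");
        jumpBack(); // never returns normally
    }
    else
    {
        printf("resumed via longjmp\n");
    }
    return 0;
}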
After trying out this code I noticed that the x86 build was working fine, but the 64-bit build did not - I got a problem similar to what is described here:
An invalid or unaligned stack was encountered during an unwind operation
As the article suggested, I added the ((_JUMP_BUFFER *)&jbuf1)->Frame = 0; kind of reset - and after that the 64-bit code started to work. Currently the library is not using any exception mechanism and I'm not planning to use any (I will try-catch everything if needed in the encode*/decode* function calls).
So questions:
Is it an acceptable solution to disable unwinding in code (((_JUMP_BUFFER *)&jbuf1)->Frame = 0;)? What does unwinding really mean in the context of setjmp / longjmp?
Do you see any potential problem with the given code snippet?
I just wonder how to convert the following OpenMP program to an OpenCL program.
The parallel section of the algorithm implemented using OpenMP looks like this:
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    //double mt_probThreshold = mt_nProbThreshold_;
    double mt_probThreshold = nProbThreshold;
    int mt_nMaxCandidate = mt_nMaxCandidate_;
    double mt_nMinProb = mt_nMinProb_;
    int has_next = 1;
    std::list<ScrBox3d> mt_detected;
    ScrBox3d sample;
    while(has_next) {
        #pragma omp critical
        { // '{' is very important and defines the block of code that needs the lock.
          // Don't remove this pair of '{' and '}'.
            if(piter_ == box_.end()) {
                has_next = 0;
            } else {
                sample = *piter_;
                ++piter_;
            }
        } // '}' is very important and defines the block of code that needs the lock.
        if(has_next){
            this->SetSample(&sample, thread_id);
            //UpdateSample(sample, thread_id); // May be necessary for more sophisticated features
            sample._prob = (float)this->Prob( true, thread_id, mt_probThreshold);
            //sample._prob = (float)_clf->LogLikelihood( thread_id);
            InsertCandidate( mt_detected, sample, mt_probThreshold, mt_nMaxCandidate, mt_nMinProb );
        }
    }
    #pragma omp critical
    { // '{' is very important and defines the block of code that needs the lock.
      // Don't remove this pair of '{' and '}'.
        if(mt_detected_.size()==0) {
            mt_detected_ = mt_detected;
            //mt_nProbThreshold_ = mt_probThreshold;
            nProbThreshold = mt_probThreshold;
        } else {
            for(std::list<ScrBox3d>::iterator it = mt_detected.begin();
                it!=mt_detected.end(); ++it)
                InsertCandidate( mt_detected_, *it, /*mt_nProbThreshold_*/nProbThreshold,
                                 mt_nMaxCandidate_, mt_nMinProb_ );
        }
    } // '}' is very important and defines the block of code that needs the lock.
}//parallel section end
My question is: can this section be implemented with OpenCL?
I followed a series of OpenCL tutorials and I understood the manner of work; I was writing the code in .cu files (I had previously installed the CUDA toolkit). But in this case the situation is more complicated, because a lot of header files and template classes are used, and the code is object-oriented.
How could I convert this section implemented in OpenMP to OpenCL?
Should I create a new .cu file?
Any advice could help.
Thanks in advance.
Edit:
Using the VS profiler I noticed that most of the execution time is spent in the InsertCandidate() function, so I'm thinking about writing a kernel to execute this function on the GPU. The most expensive part of the function is a for loop. But as can be seen, each loop iteration contains 3 if statements, and this can lead to divergence, resulting in serialization, even when executed on a GPU.
for( iter = detected.begin(); iter != detected.end(); iter++ )
{
    if( nCandidate == nMaxCandidate-1 )
        nProbThreshold = iter->_prob;
    if( box._prob >= iter->_prob )
        break;
    if( nCandidate >= nMaxCandidate && box._prob <= nMinProb )
        break;
    nCandidate ++;
}
In conclusion, can this program be converted to OpenCL?
It may be possible to convert your sample code to OpenCL, however I spotted a couple of issues with doing so.
There doesn't seem to be much parallel execution to begin with. More workers may not help at all.
Adding work to process during execution is a fairly recent feature in OpenCL. You would have to either use OpenCL 2.0, or know in advance how much work will be added and pre-allocate memory to store the new data structures. The calls to InsertCandidate may be the part which "can't" be converted to OpenCL.
If the function is large enough, you may be able to port the calls to this->Prob(...) instead. You need to be able to cache up a bunch of calls by storing the parameters in a suitable data structure. By 'a bunch' I mean at least hundreds, but ideally thousands or more. Again, this is only worth it if this->Prob() is the same for all calls, and complex enough to be worth the round-trip to the OpenCL device and back.
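As an illustration of that caching idea, a hypothetical batching layer might look like this (ProbInput, makeProbInput, and runProbKernelOnDevice are placeholders I made up, not part of the original code):

#include <vector>

// Placeholder types standing in for the real ones (assumptions).
struct ProbInput { float features[4]; };
struct ScrBox3d  { float _prob; /* ... */ };

ProbInput makeProbInput(const ScrBox3d&);              // host-side packing
std::vector<float> runProbKernelOnDevice(const std::vector<ProbInput>&);
                                                       // wraps the OpenCL calls

void scoreSamples(std::vector<ScrBox3d>& samples)
{
    // 1. Collect inputs on the host instead of calling Prob() per sample.
    std::vector<ProbInput> batch;
    batch.reserve(samples.size());
    for (const ScrBox3d& s : samples)
        batch.push_back(makeProbInput(s));

    // 2. One round-trip to the device scores the whole batch
    //    (roughly clCreateBuffer / clEnqueueNDRangeKernel / clEnqueueReadBuffer).
    std::vector<float> probs = runProbKernelOnDevice(batch);

    // 3. The serial InsertCandidate() pass stays on the CPU, using probs[i].
    for (size_t i = 0; i < samples.size(); ++i)
        samples[i]._prob = probs[i];
}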
unordered_map<std::string,unordered_map<std::string, std::string> >* storing_vars;
I have this variable declared in the class scope.
It is initialized in the constructor:
this->storing_vars = new unordered_map<std::string,unordered_map<std::string, std::string> >();
Then I call a function over and over again from my BackgroundWorker:
for(int i2 = 0; i2 < 30; i2++){
    int index_pos_curr = i2;
    //Start the Threads HERE
    this->backgroundWorker2 = gcnew System::ComponentModel::BackgroundWorker;
    this->backgroundWorker2->WorkerReportsProgress = true;
    this->backgroundWorker2->WorkerSupportsCancellation = true;
    //this->backgroundWorker2->FieldSetter(L"std::string",L"test","damnnit");
    backgroundWorker2->DoWork += gcnew DoWorkEventHandler( this, &MainFacebook::backgroundWorker2_DoWork );
    backgroundWorker2->RunWorkerCompleted += gcnew RunWorkerCompletedEventHandler( this, &MainFacebook::backgroundWorker2_RunWorkerCompleted );
    backgroundWorker2->ProgressChanged += gcnew ProgressChangedEventHandler( this, &MainFacebook::backgroundWorker2_ProgressChanged );
    backgroundWorker2->RunWorkerAsync(index_pos_curr);
    Sleep(50); //THE PROBLEM IS HERE. If I comment this out it won't work, probably because many calls are trying to add values to the same variable (even though the indexes are different in each call)
}
After this is done, it calls the DoWork function:
void backgroundWorker2_DoWork(Object^ sender, DoWorkEventArgs^ e ){
    BackgroundWorker^ worker = dynamic_cast<BackgroundWorker^>(sender);
    e->Result = SendThem( safe_cast<Int32>(e->Argument), worker, e );
}

int SendThem(int index, BackgroundWorker^ worker, DoWorkEventArgs^ e){
    stringstream st;
    st << index;
    //...
    (*this->storing_vars)[st.str()]["index"] = "testing1";
    (*this->storing_vars)[st.str()]["rs"] = "testing2";
    return 0;
}
As I noted in the comment on the Sleep(50) line, I believe the problem is that since the background threads all call the same function, storing the data fails when it's called many times without waiting for the previous store to finish. It causes an error in the "xhash.h" file - an error that disappears with Sleep(50). But I can't use that: it freezes my UI, and 50 milliseconds is just how long I'm assuming a store takes, so what if it takes longer on slower computers? It's not the right approach.
How do I fix that?
I want to be able to UPDATE the unordered_map WITHOUT the use of SLEEP.
Thanks in advance.
You can only modify the standard library containers (including, but not limited to, unordered_map) from one thread at a time. The solution is to use critical sections, mutexes, or locks to synchronize access. If you don't know what these are, then you need to learn about them before you try to create multiple threads.
No ifs, buts or whys.
If you have multiple threads, you need a mechanism to synchronize them and to serialize access to shared data. Common synchronization mechanisms are the ones mentioned above, so go look them up.
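For what it's worth, in native C++ the serialized access looks like this (a sketch with std::mutex; the C++/CLI equivalent using System::Threading::Mutex appears in the update below):

#include <mutex>
#include <string>
#include <unordered_map>

std::mutex mapMutex;
std::unordered_map<std::string,
                   std::unordered_map<std::string, std::string>> storing_vars;

void sendThem(int index) {
    std::string key = std::to_string(index);
    // lock_guard serializes all writers, so the map is never
    // modified by two threads at once.
    std::lock_guard<std::mutex> lock(mapMutex);
    storing_vars[key]["index"] = "testing1";
    storing_vars[key]["rs"]    = "testing2";
}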
After so many downvotes I actually started looking into the Mutex people were talking about here, and after a while I found out that it's really simple to use - and it's the correct way, as my fellows here told me. Thank you all for the help =D
Here what I did, I just had to add
//Declare the Mutex
static Mutex^ mut = gcnew Mutex;
//then inside the function called over and over again I used mutex
mut->WaitOne();
//Declare/Update the variables
mut->ReleaseMutex();
//Then I release it.
It works perfectly. Thank you all for the help and criticism. haha
I found another solution: predefining the indexes of the unordered_map I want to use. The problem is only with creating an index; updating seems to be OK with multiple threads.
for(int i2 = 0; i2 < 30; i2++){
    int index_pos_curr = i2;
    //Start the Threads HERE
    this->backgroundWorker2 = gcnew System::ComponentModel::BackgroundWorker;
    this->backgroundWorker2->WorkerReportsProgress = true;
    this->backgroundWorker2->WorkerSupportsCancellation = true;
    backgroundWorker2->DoWork += gcnew DoWorkEventHandler( this, &MainFacebook::backgroundWorker2_DoWork );
    backgroundWorker2->RunWorkerCompleted += gcnew RunWorkerCompletedEventHandler( this, &MainFacebook::backgroundWorker2_RunWorkerCompleted );

    stringstream st;
    st << index_pos_curr;
    (*this->storing_vars)[st.str()]["index"] = "";
    // This ^^^^^ creates the index up front; the BackgroundWorker then only
    // updates the existing entry, and this way it doesn't crash. :)

    backgroundWorker2->ProgressChanged += gcnew ProgressChangedEventHandler( this, &MainFacebook::backgroundWorker2_ProgressChanged );
    backgroundWorker2->RunWorkerAsync(index_pos_curr);
    Sleep(50); //THE PROBLEM IS HERE. If I comment this out it won't work, probably because many calls are trying to add values to the same variable (even though the indexes are different in each call)
}
Is this the proper way to iterate over a read on a socket? I am having a hard time getting this to work properly. data.size is an unsigned int that is populated from the socket as well. It is correct. data.data is an unsigned char *.
if ( data.size > 0 ) {
    data.data = (unsigned char*)malloc(data.size);
    memset(data.data, 0, data.size); // note: data.data, not &data.data
    int remainingSize = data.size;
    unsigned char *iter = data.data;
    int count = 0;
    do {
        count = read(connect_fd, iter, remainingSize);
        iter += count;
        remainingSize -= count;
    } while (count > 0 && remainingSize > 0);
}
else {
    data.data = 0;
}
Thanks in advance.
You need to check the return value from read before you start adding it to other values.
You'll get a zero when the socket reports EOF, and -1 on error. Keep in mind that for a socket EOF is not the same as closed.
Low level socket programming is very tedious and error prone. If you use C++ you should try to use higher level libraries like Boost or ACE.
I would also suggest reading C++ Network Programming: Mastering Complexity Using ACE and Patterns and C++ Network Programming: Systematic Reuse with ACE and Frameworks.
Put the read as part of the while condition.
while((remainingSize > 0) && (count = read(connect_fd, iter, remainingSize)) > 0)
{
    iter += count;
    remainingSize -= count;
}
This way if it fails you immediately stop the loop.
It is a very common pattern to use the read as part of the loop condition; otherwise you need to check the state inside the loop, which makes the code uglier.
Personally:
I would move the whole test above into a separate function for readability, but your mileage may vary.
Also, using malloc (and company) is going to lead to a whole boat of memory management issues. I would use a std::vector. This also future-proofs the code for when you modify it to start throwing exceptions; now it will also be exception safe.
So assuming you change data.data to have a type of std::vector<unsigned char>, then:
if ( data.size > 0 )
{
    std::vector<unsigned char> buffer(data.size);
    unsigned char *iter = &buffer[0];
    while(... read(connect_fd, iter, remainingSize) )
    {
        .....
    }
    ... handle error as required
    buffer.resize(buffer.size() - remainingSize);
    data.data.swap(buffer);
}
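One possible completion of that sketch (my guess at the elided parts, keeping the answer's structure):

#include <unistd.h>
#include <vector>

// data.data is now assumed to be a std::vector<unsigned char>
if (data.size > 0)
{
    std::vector<unsigned char> buffer(data.size);
    unsigned char* iter = &buffer[0];
    size_t remainingSize = buffer.size();
    ssize_t count = 0;
    while (remainingSize > 0 &&
           (count = read(connect_fd, iter, remainingSize)) > 0)
    {
        iter += count;
        remainingSize -= count;
    }
    if (count < 0) { /* handle error as required */ }
    buffer.resize(buffer.size() - remainingSize); // keep bytes actually read
    data.data.swap(buffer);
}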
Keep in mind that read() calls are system calls and thus a possible source of blocking; even if you use non-blocking I/O, they are inherently heavyweight. I would recommend minimising them.
A good approach that has served me well over a decade of BSD socket programming in C is to use non-blocking I/O and issue a FIONREAD ioctl() to get the total amount of data waiting at a given polling interval (assuming you're using some sort of synchronous I/O mux like select()), then read() that amount as many times as necessary to capture all of it, and then return from the function until the next timer tick.
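A rough sketch of that approach (POSIX; assumes the fd is a non-blocking socket you poll from select() or similar):

#include <sys/ioctl.h>
#include <unistd.h>
#include <vector>

// Ask the kernel how many bytes are already buffered, then read
// exactly that much. Returns an empty vector if nothing is waiting.
std::vector<unsigned char> readAvailable(int fd)
{
    int available = 0;
    if (ioctl(fd, FIONREAD, &available) < 0 || available <= 0)
        return {};

    std::vector<unsigned char> buf(available);
    ssize_t n = read(fd, buf.data(), buf.size());
    if (n <= 0)
        return {};                       // EOF or error (check errno)
    buf.resize(static_cast<size_t>(n));  // keep only what we actually got
    return buf;
}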
A good way to go that has always served me well in over a decade of BSD socket programming in C is to use non-blocking I/O and issue a FIONREAD ioctl() to get the total amount of data waiting at a given polling interval (assuming you're using some sort of synchronous I/O mux like select()) and then just read() that amount as many times as necessary to capture all of it, and then return the function for the moment until the next timer tick.