I'm working on a producer-consumer problem with an intermediate processing thread. When I run 200 of these applications, it locks up the system on Windows 7 when lots of connections time out. Unfortunately, it fails in a way I don't know how to debug: the system becomes unresponsive and I have to restart it with the power button. It works fine on my Mac, and oddly enough, it also works fine on Windows in safe mode.
I'm using boost 1.44 as that is what the host application uses.
Here is my queue. My intention is that the queues are synchronized on their size. I've manipulated this to use timed_wait to make sure I wasn't losing notifications, though I saw no difference in effect.
class ConcurrentQueue {
public:
void push(const std::string& str, size_t notify_size, size_t max_size);
std::string pop();
private:
std::queue<std::string> queue;
boost::mutex mutex;
boost::condition_variable cond;
};
void ConcurrentQueue::push(
const std::string& str, size_t notify_size, size_t max_size) {
size_t queue_size;
{
boost::mutex::scoped_lock lock(mutex);
if (queue.size() < max_size) {
queue.push(str);
}
queue_size = queue.size();
}
if (queue_size >= notify_size)
cond.notify_one();
}
std::string ConcurrentQueue::pop() {
boost::mutex::scoped_lock lock(mutex);
while (queue.empty())
cond.wait(lock);
std::string str = queue.front();
queue.pop();
return str;
}
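For reference, the timed_wait variant of pop() would look roughly like this (a sketch; the 100 ms timeout is arbitrary):
std::string ConcurrentQueue::pop() {
    boost::mutex::scoped_lock lock(mutex);
    while (queue.empty()) {
        // Wake periodically even if a notification was lost.
        cond.timed_wait(lock, boost::posix_time::milliseconds(100));
    }
    std::string str = queue.front();
    queue.pop();
    return str;
}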
The threads below use the following queues to process data and send it using libcurl.
boost::shared_ptr<ConcurrentQueue> queue_a(new ConcurrentQueue);
boost::shared_ptr<ConcurrentQueue> queue_b(new ConcurrentQueue);
void prod_run(size_t iterations) {
try {
// stagger startup
boost::this_thread::sleep(
boost::posix_time::seconds(random_num(0, 25)));
size_t save_frequency = random_num(41, 97);
for (size_t i = 0; i < iterations; i++) {
// compute
size_t v = 1;
for (size_t j = 2; j < (i % 7890) + 4567; j++) {
v *= j;
v = std::max(v % 39484, v % 85783);
}
// save
if (i % save_frequency == 0) {
std::string iv =
boost::str( boost::format("%1%=%2%") % i % v );
queue_a->push(iv, 1, 200);
}
sleep_frame();
}
} catch (boost::thread_interrupted&) {
}
}
void prodcons_run() {
try {
for (;;) {
std::string iv = queue_a->pop();
queue_b->push(iv, 1, 200);
}
} catch (boost::thread_interrupted&) {
}
}
void cons_run() {
try {
for (;;) {
std::string iv = queue_b->pop();
send_http_post("http://127.0.0.1", iv);
}
} catch (boost::thread_interrupted&) {
}
}
My understanding is that using mutexes this way should not make a system unresponsive. If anything, my apps would deadlock and sleep forever.
Is there some way that having 200 of these at once creates a scenario where this isn't the case?
Update:
When I restart the computer, most of the time I need to replug in the USB keyboard to get it to respond. Given the driver comment, I thought that may be relevant. I tried updating the northbridge drivers, though they were up to date. I'll look to see if there are other drivers that need attention.
Update:
I've watched memory, non-paged pool, cpu, handles, ports and none of them are at alarming rates at any time while the system is responsive. It's possible something spikes at the end, though that is not visible to me.
Update:
When the system hangs, it stops rendering and does not respond to the keyboard. The last frame that it rendered stays up though. The system sounds like it is still running and when the system comes back up there is nothing in the event viewer saying that it crashed. There are no crash dump files either. I interpret this as the OS is being blocked out of execution.
A mutex lock only blocks other code that uses the same lock. Any mutex used by the OS should not be (directly) available to any application.
Of course, if the mutex is implemented using the OS in some way, it may well call into the OS, and thus use CPU resources. However, a mutex lock should not cause any worse behaviour than the application using CPU resources in any other way.
It may of course be that if you use locks in an inappropriate way, different parts of the application become deadlocked: function 1 acquires lock A, and then function 2 acquires lock B. If function 1 then tries to acquire lock B, and function 2 tries to acquire lock A, before either releases its lock, you have a deadlock. The trick here is to always acquire multiple locks in the same order: if you need two locks at the same time, always acquire lock A first, then lock B.
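A minimal sketch of that consistent ordering (the names are illustrative):
boost::mutex lock_a;
boost::mutex lock_b;

void function1() {
    boost::mutex::scoped_lock a(lock_a);  // always take A first...
    boost::mutex::scoped_lock b(lock_b);  // ...then B
    // use both resources
}

void function2() {
    boost::mutex::scoped_lock a(lock_a);  // same order here, so no deadlock
    boost::mutex::scoped_lock b(lock_b);
    // use both resources
}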
Deadlocking should not affect the OS as such - if anything, it makes things better, since the deadlocked threads consume no CPU. But if the application "misbehaves" in case of a deadlock, it may cause problems by calling into the OS a lot - e.g. if the locking is done by:
while (!trylock(lock))
{
/// do nothing here
}
it may cause peaks in system usage.
Related
My program has 8 writer threads and one persistence thread. The following code is the core of the persistence thread:
std::string longLine;
myMutex.lock();
while (!myQueue.empty()) {
std::string& head = myQueue.front();
const int hSize = head.size();
if(hSize < blockMaxSize)
break;
longLine += head;
myQueue.pop_front();
}
myMutex.unlock();
flushToFile(longLine);
The performance is acceptable (millions of writes finish in hundreds of milliseconds). I still hope to improve the code by avoiding string copying, so I changed the code as follows:
myMutex.lock();
while (!myQueue.empty()) {
const int hsize = myQueue.front().size();
if(hsize < blockMaxSize)
break;
std::string head{std::move(myQueue.front())};
myQueue.pop_front();
myMutex.unlock();
flushToFile(head);
myMutex.lock();
}
myMutex.unlock();
Surprisingly, the performance drops sharply: the same millions of writes now take several seconds. Debugging shows that most of the time is spent waiting for the lock after flushing the file.
But I don't understand why. Can anyone help?
Possibly faster. Do all your string concatenations inside the flush function. That way your string concatenation won't block the writer threads trying to append to the queue. This is possibly a micro-optimization.
While we're at it, let's establish that myQueue is a vector and not a queue or list class. This will be faster, since the only operations on the collection are an append and a total erase.
std::string longLine;
std::vector<std::string> tempQueue;
myMutex.lock();
if (myQueue.size() >= blockMaxSize) {
tempQueue = std::move(myQueue);
myQueue = {}; // the moved-from vector is in an unspecified state, so reset it
}
myMutex.unlock();
flushToFileWithQueue(tempQueue);
Where flushToFileWithQueue is this:
void flushToFileWithQueue(std::vector<std::string>& queue) {
std::string longLine;
for (size_t i = 0; i < queue.size(); i++) {
longLine += queue[i];
}
queue.resize(0); // cheaper than removing the elements one at a time
flushToFile(longLine);
}
You didn't show what wakes up the persistence thread. If it's polling instead of using a proper condition variable, let me know and I'll show you how to use that.
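For completeness, a condition-variable version would look roughly like this (a sketch; queueCond is a hypothetical name, and I'm assuming myMutex is a std::mutex):
std::condition_variable queueCond;

// writer threads, after appending to myQueue:
void appendAndNotify(std::string s) {
    {
        std::lock_guard<std::mutex> lk(myMutex);
        myQueue.push_back(std::move(s));
    }
    queueCond.notify_one();  // wake the persistence thread
}

// persistence thread, instead of polling:
void waitForWork() {
    std::unique_lock<std::mutex> lk(myMutex);
    queueCond.wait(lk, [] { return !myQueue.empty(); });  // sleeps until notified
    // the lock is held here; swap the queue into tempQueue and flush as above
}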
Also make use of the .reserve() method on these instances of the vector collection such that all the queue has all the memory it needs to grow. Again, possibly a micro-optimization.
I am trying to synchronize a function I am parallelizing with pthreads.
The issue is, I am having a deadlock because a thread will exit the function while other threads are still waiting for the thread that exited to reach the barrier. I am unsure whether the pthread_barrier structure takes care of this. Here is an example:
static pthread_barrier_t barrier;
static void* foo(void* arg) {
for(int i = beg; i < end; i++) {
if (i > 0) {
pthread_barrier_wait(&barrier);
}
}
    return NULL;
}
int main() {
// create pthread barrier
pthread_barrier_init(&barrier, NULL, NUM_THREADS);
// create thread handles
//...
// create threads
for (int i = 0; i < NUM_THREADS; i++) {
pthread_create(&thread_handles[i], NULL, &foo, (void*) i);
}
// join the threads
for (int i = 0; i < NUM_THREADS; i++) {
pthread_join(thread_handles[i], NULL);
}
}
Here is a solution I tried for foo, but it didn't work (note NUM_THREADS_COPY is a copy of the NUM_THREADS constant, and is decremented whenever a thread reaches the end of the function):
static void* foo(void* arg) {
for(int i = beg; i < end; i++) {
if (i > 0) {
pthread_barrier_wait(&barrier);
}
}
pthread_barrier_init(&barrier, NULL, --NUM_THREADS_COPY);
    return NULL;
}
Is there a solution to updating the number of threads to wait in a barrier for when a thread exits a function?
You need to decide how many threads it will take to pass the barrier before any threads arrive at it. Undefined behavior results from re-initializing the barrier while there are threads waiting at it. Among the plausible manifestations are that some of the waiting threads are prematurely released or that some of the waiting threads never get released, but those are by no means the only unwanted things that could happen. In any case ...
Is there a solution to updating the number of threads to wait in a
barrier for when a thread exits a function?
... no, pthreads barriers do not support that.
Since a barrier seems not to be flexible enough for your needs, you probably want to fall back to the general-purpose thread synchronization object: a condition variable (used together with a mutex and some kind of shared variable).
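A minimal sketch of such a barrier-like gate, whose participant count can shrink when a thread leaves (the names are illustrative; this is not a pthreads API):
#include <pthread.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t cv;
    int participants;  /* threads still taking part */
    int arrived;       /* threads waiting in the current cycle */
    unsigned cycle;    /* generation counter, so old waiters are not re-released */
} flex_barrier;

void flex_barrier_wait(flex_barrier* b) {
    pthread_mutex_lock(&b->mtx);
    unsigned my_cycle = b->cycle;
    if (++b->arrived == b->participants) {
        b->arrived = 0;  /* last one in: start a new cycle */
        b->cycle++;
        pthread_cond_broadcast(&b->cv);
    } else {
        while (my_cycle == b->cycle)
            pthread_cond_wait(&b->cv, &b->mtx);
    }
    pthread_mutex_unlock(&b->mtx);
}

void flex_barrier_leave(flex_barrier* b) {
    pthread_mutex_lock(&b->mtx);
    b->participants--;
    if (b->participants > 0 && b->arrived == b->participants) {
        b->arrived = 0;  /* everyone still participating is waiting: release them */
        b->cycle++;
        pthread_cond_broadcast(&b->cv);
    }
    pthread_mutex_unlock(&b->mtx);
}
Each thread would call flex_barrier_wait() where it previously called pthread_barrier_wait(), and flex_barrier_leave() once, just before returning from foo().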
Instead of having my threads wait, doing nothing, while other threads finish using data, I'd like them to do something else in the meantime (like checking for input, or re-rendering the previous frame in the queue, and then coming back to check whether the other thread is done with its task).
I think the code I've written does that, and it "seems" to work in the tests I've performed, but I don't really understand how std::memory_order_acquire and std::memory_order_release work exactly, so I'd like some expert advice on whether I'm using them correctly to achieve the behaviour I want.
Also, I've never seen multithreading done this way before, which makes me a bit worried. Are there good reasons not to have a thread do other tasks instead of waiting?
/*test program
intended to test if atomic flags can be used to perform other tasks while shared
data is in use, instead of blocking
each thread enters the flag protected part of the loop 20 times before quitting
if the flag indicates that the if block is already in use, the thread is intended to
execute the code in the else block (only up to 5 times to avoid cluttering the output)
debug note: this doesn't work with std::cout because all the threads are using it at once
and it's not thread safe so it all gets garbled. at least it didn't crash
real world usage
one thread renders and draws to the screen, while the other checks for input and
provides frameData for the renderer to use. neither thread should ever block*/
#include <fstream>
#include <atomic>
#include <thread>
#include <string>
struct ThreadData {
int numTimesToWriteToDebugIfBlockFile;
int numTimesToWriteToDebugElseBlockFile;
};
class SharedData {
public:
SharedData() {
threadData = new ThreadData[10];
for (int a = 0; a < 10; ++a) {
threadData[a] = { 20, 5 };
}
flag.clear();
}
~SharedData() {
delete[] threadData;
}
void runThread(int threadID) {
while (this->threadData[threadID].numTimesToWriteToDebugIfBlockFile > 0) {
if (!this->flag.test_and_set(std::memory_order_acquire)) {
std::string fileName = "debugIfBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", running, output #" << this->threadData[threadID].numTimesToWriteToDebugIfBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugIfBlockFile -= 1;
this->flag.clear(std::memory_order_release);
}
else {
if (this->threadData[threadID].numTimesToWriteToDebugElseBlockFile > 0) {
std::string fileName = "debugElseBlockOutputThread#";
fileName += std::to_string(threadID);
fileName += ".txt";
std::ofstream writeFile(fileName.c_str(), std::ios::app);
writeFile << threadID << ", standing by, output #" << this->threadData[threadID].numTimesToWriteToDebugElseBlockFile << std::endl;
writeFile.close();
writeFile.clear();
this->threadData[threadID].numTimesToWriteToDebugElseBlockFile -= 1;
}
}
}
}
private:
ThreadData* threadData;
std::atomic_flag flag;
};
void runThread(int threadID, SharedData* sharedData) {
sharedData->runThread(threadID);
}
int main() {
SharedData sharedData;
std::thread thread[10];
for (int a = 0; a < 10; ++a) {
thread[a] = std::thread(runThread, a, &sharedData);
}
thread[0].join();
thread[1].join();
thread[2].join();
thread[3].join();
thread[4].join();
thread[5].join();
thread[6].join();
thread[7].join();
thread[8].join();
thread[9].join();
return 0;
}
The memory ordering you're using here is correct.
The acquire memory order when you test and set your flag (to take your hand-written lock) has the effect, informally speaking, of preventing any memory accesses of the following code from becoming visible before the flag is tested. That's what you want, because you want to ensure that those accesses are effectively not done if the flag was already set. Likewise, the release order on the clear at the end prevents any of the preceding accesses from becoming visible after the clear, which is also what you need so that they only happen while the lock is held.
However, it's probably simpler to just use a std::mutex. If you don't want to wait to take the lock, but instead do something else if you can't, that's what try_lock is for.
class SharedData {
// ...
private:
std::mutex my_lock;
};
// ...
if (my_lock.try_lock()) {
// lock was taken, proceed with critical section
my_lock.unlock();
} else {
// lock not taken, do non-critical work
}
This may have a bit more overhead, but avoids the need to think about atomicity and memory ordering. It also gives you the option to easily do a blocking wait if that later becomes useful. If you've designed your program around an atomic_flag and later find a situation where you must wait to take the lock, you may find yourself stuck with either spinning while continually retrying the lock (which is wasteful of CPU cycles), or something like std::this_thread::yield(), which may wait for longer than necessary after the lock is available.
It's true this pattern is somewhat unusual. If there is always non-critical work to be done that doesn't need the lock, commonly you'd design your program to have a separate thread that just does the non-critical work continuously, and then the "critical" thread can just block as it waits for the lock.
In Qt, from the main (GUI) thread, I am creating a worker thread to perform a certain operation which accesses a resource shared by both threads. On a certain GUI action, the main thread has to manipulate the resource. I tried using QMutex to lock that particular resource. The resource is continuously used by the worker thread; how do I notify the main thread about this?
I tried using QWaitCondition, but it crashed the application.
Is there any other option to notify and achieve synchronisation between threads?
Attached the code snippet.
void WorkerThread::IncrementCounter()
{
qDebug() << "In Worker Thread IncrementCounter function" << endl;
while(stop == false)
{
mutex.lock();
for(int i = 0; i < 100; i++)
{
for(int j = 0; j < 100; j++)
{
counter++;
}
}
qDebug() << counter;
mutex.unlock();
}
qDebug() << "In Worker Thread Aborting " << endl;
}
//Manipulating the counter value by main thread.
void WorkerThread::setCounter(int value)
{
waitCondition.wait(&mutex);
counter = value;
waitCondition.notify_one();
}
You are using the wait condition completely wrong.
I urge you to read up on mutexes and conditions, and maybe look at some examples.
wait() will block execution until either notify_one() or notify_all() is called somewhere. Which of course cannot happen in your code.
You cannot wait() a condition on one line and then expect the next two lines to ever be called if they contain the only wake up calls.
What you want is to wait() in one thread and notify_XXX() in another.
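A minimal sketch of that shape, using QWaitCondition's actual calls (wakeOne()/wakeAll()); the counterUpdated flag is illustrative:
// worker thread: wait for a new value
void WorkerThread::waitForCounter()
{
    mutex.lock();
    while (!counterUpdated)          // hypothetical flag, guarded by the mutex
        waitCondition.wait(&mutex);  // atomically releases the mutex while sleeping
    counterUpdated = false;
    mutex.unlock();
}

// main (GUI) thread: change the value and notify
void WorkerThread::setCounter(int value)
{
    mutex.lock();
    counter = value;
    counterUpdated = true;
    mutex.unlock();
    waitCondition.wakeOne();         // wakes the worker waiting above
}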
You could use shared memory from within the same process. Each thread could lock it before writing it, like this:
QSharedMemory *shared=new QSharedMemory("Test Shared Memory");
if(shared->create(dataSize, QSharedMemory::ReadWrite))
{
shared->lock();
// Copy some data to it
char *to = (char*)shared->data();
const char *from = &dataBuffer;
memcpy(to, from, dataSize);
shared->unlock();
}
You should also lock it for reading. If strings are wanted, reading them is easier than writing them, provided they are zero-terminated. You'll want to convert with .toLatin1() to get a zero-terminated string, whose size you can then determine. You might get a handle that multiple threads can read from with shared->attach(), but that's more for reading the shared memory of a different process.
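A matching locked read might look like this (a sketch, assuming the writer stored a zero-terminated Latin-1 string):
shared->lock();
const char *from = static_cast<const char*>(shared->constData());
QString text = QString::fromLatin1(from);  // the size comes from the terminator
shared->unlock();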
You might just use this instead of mutexes. I think if you try to lock it and something else already has it locked, it will just block until the other process unlocks it.
With the new C++17 standard, I wonder if there is a good way to start a process with a fixed number of threads until a batch of jobs is finished.
Can you tell me how I can achieve the desired functionality of this code:
std::vector<std::future<std::string>> futureStore;
const int batchSize = 1000;
const int maxNumParallelThreads = 10;
int threadsTerminated = 0;
while(threadsTerminated < batchSize)
{
const int& threadsRunning = futureStore.size();
while(threadsRunning < maxNumParallelThreads)
{
futureStore.emplace_back(std::async(someFunction));
}
for(std::future<std::string>& readyFuture: std::when_any(futureStore.begin(), futureStore.end()))
{
auto retVal = readyFuture.get();
// (possibly do something with the ret val)
threadsTerminated++;
}
}
I read that a std::when_any function was once proposed, but it did not make it into the standard library.
Is there any support for this functionality (not necessarily for std::future-s) in the current standard libraries? Is there a way to easily implement it, or do I have to resort to something like this?
This does not seem to me to be the ideal approach:
All your main thread does is wait for the other threads to finish, polling the results of your futures. You are almost wasting this thread...
I don't know to what extent std::async reuses the threads' infrastructure in any suitable way, so you risk creating entirely new threads each time... (Apart from that, you might not create any threads at all if you do not specify std::launch::async explicitly; see the example below.)
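To guarantee that a task actually runs on its own thread, you have to request that launch policy explicitly:
auto fut = std::async(std::launch::async, someFunction);  // never deferred
Without the policy argument, the implementation is free to defer execution until fut.get() is called.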
I personally would prefer another approach:
Create all the threads you want to use at once.
Let each thread run a loop, repeatedly calling someFunction(), until you have reached the number of desired tasks.
The implementation might look similar to this example:
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <unistd.h> // for sleep()
#include <vector>

const int BatchSize = 20;
int tasksStarted = 0;
std::mutex mutex;
std::vector<std::string> results;
std::string someFunction()
{
puts("worker started"); fflush(stdout);
sleep(2);
puts("worker done"); fflush(stdout);
return "";
}
void runner()
{
{
std::lock_guard<std::mutex> lk(mutex);
if(tasksStarted >= BatchSize)
return;
++tasksStarted;
}
for(;;)
{
std::string s = someFunction();
{
std::lock_guard<std::mutex> lk(mutex);
results.push_back(s);
if(tasksStarted >= BatchSize)
break;
++tasksStarted;
}
}
}
int main(int argc, char* argv[])
{
const int MaxNumParallelThreads = 4;
std::thread threads[MaxNumParallelThreads - 1]; // main thread is one, too!
for(int i = 0; i < MaxNumParallelThreads - 1; ++i)
{
threads[i] = std::thread(&runner);
}
runner();
for(int i = 0; i < MaxNumParallelThreads - 1; ++i)
{
threads[i].join();
}
// use results...
return 0;
}
This way, you do not repeatedly create new threads; you just continue until all tasks are done.
If these tasks are not all alike, as in the example above, you might create a base class Task with a pure virtual function (e. g. "execute" or "operator ()") and create subclasses with the required implementation (holding any necessary data).
You could then place the instances into a std::vector or std::list (well, we won't iterate, so a list might be appropriate here...) as pointers (otherwise, you get object slicing!) and let each thread remove one of the tasks when it has finished its previous one (do not forget to protect against race conditions!) and execute it. As soon as no tasks are left, return...
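A sketch of that idea (the names are illustrative):
#include <list>
#include <memory>
#include <mutex>

struct Task {
    virtual ~Task() = default;
    virtual void execute() = 0;
};

std::list<std::unique_ptr<Task>> tasks; // filled with subclass instances before the workers start
std::mutex tasksMutex;

void worker() {
    for (;;) {
        std::unique_ptr<Task> task;
        {
            std::lock_guard<std::mutex> lk(tasksMutex); // protect against races
            if (tasks.empty())
                return; // no tasks left: this worker is done
            task = std::move(tasks.front());
            tasks.pop_front();
        }
        task->execute(); // run outside the lock
    }
}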
If you don't care about the exact number of threads, the simplest solution would be:
std::vector<std::future<std::string>> futureStore(
batchSize
);
std::generate(futureStore.begin(), futureStore.end(), [](){return std::async(someTask);});
for(auto& future : futureStore) {
std::string value = future.get();
doWork(value);
}
From my experience, std::async will reuse threads after a certain number of threads has been spawned. It will not spawn 1000 threads. Also, you will not gain much of a performance boost (if any) by using a thread pool. I did measurements in the past, and the overall runtime was nearly identical.
The only reason I use thread pools now is to avoid the delay of creating threads inside the computation loop. If you have timing constraints, you may miss deadlines when using std::async for the first time, since it will create the threads on the first calls.
There is a good thread pool library for these applications. Have a look here:
https://github.com/vit-vit/ctpl
#include <ctpl.h>
const unsigned int numberOfThreads = 10;
const unsigned int batchSize = 1000;
ctpl::thread_pool pool(numberOfThreads);
std::vector<std::future<std::string>> futureStore(
batchSize
);
std::generate(futureStore.begin(), futureStore.end(), [&pool]() { return pool.push([](int /*id*/) { return someTask(); }); });
for(auto& future : futureStore) {
std::string value = future.get();
doWork(value);
}