How do I manage thread shutdown in a multithreaded crawler?

How do I manage thread shutdown in a multithreaded crawler? - c++

Let's say I'm writing a mulithreaded web crawler. Threads get a job (for example, in a form of URL) from a queue, do some work, and then might add some new jobs to the queue. Sounds simple enough, but I'm not sure how to handle the situation where all the jobs are done. Let's say there are currently 0 jobs in the queue, and some thread is trying to get a new job. At this point two situations are possible:
Some other threads are working and might actually produce new jobs for this thread to get. In this case, it is probably possible to just wait for a new task (with a blocking .pop(), if the queue supports it, or just by sleeping and waking up time to time to check if a job is available)
All other threads are also waiting for a job. In this case, no new jobs can be produced, so threads must be terminated.
One solution I can think of is having an integer (behind a mutex), which should serve as a number of "busy" threads - it will be increased when thread gets a job, and decreased once it is finished processing it. This way, if there is 0 jobs and 0 threads working, a thread can safely be terminated. However, I'm not sure it is the best solution possible. Are there any other options to handle such a situation?

Related

why some thread pool implementation doesn't use producer and consumer model

I intend to implement a thread pool to manage threads in my project. The basic structure of thread pool come to my head is queue, and some threads generate tasks into this queue, and some thread managed by thread pool are waiting to handle those task. I think this is class producer and consumer problem. But when I google thread pool implementation on the web, I find those implementation seldom use this classic model, so my question is why they don't use this classic model, does this model has any drawbacks? why they don't use full semaphore and empty semaphore to sync?

If you have multiple threads waiting on a single resource (in this case the semaphores and queue) then you are creating a bottle neck. You are forcing all tasks through one queue, even though you have multiple workers. Logically this might make sense if the workers are usually idle, but the whole point of a thread pool is to deal with a heavily loaded scenario where the workers are kept busy (for maximum through-put). Using a single input queue will be particularly bad on a multi-processor system where all workers read and write the head of the queue when they are trying to get the next task. Even though the lock contention might be low, the queue head pointer will still need to be shared/communicated from one CPU cache to another each time it is updated.
Think about the ideal case: all workers are always busy. When a new task is enqueued you want it to be dispatched to the worker that will complete its current/pending task(s) first.
If, as a client, you had a contention-free oracle that could tell you which worker to enqueue a new task to, and each worker had its own queue, then you could implement each worker with its own multi-writer-single-reader queue and always dispatch new tasks to the best queue, thus eliminating worker contention on a single shared input queue. Of course you don't have such an oracle, but this mechanism still works pretty well until a worker runs out of tasks or the queues get imbalanced. "Work stealing" deals with these cases, while still reducing contention compared to the single queue case.
See also:
Is Work Stealing always the most appropriate user-level thread scheduling algorithm?

Why there's no Producer and Consumer model implementation
This model is very generic and could have lots of different explanations, one of the implementation could be a Queue:
Try Apache APR Queue:
It's documented as Thread Safe FIFO bounded queue.
http://apr.apache.org/docs/apr-util/1.3/apr__queue_8h.html

Find minimum queue size among threads

I am trying to implement a new scheduling technique with Multithreads. Each Thread has it own private local queue. The idea is, each time the task is created from the program thread, it should search the minimum queue sizes ( a queue with less number of tasks) among the queues and enqueue in it.
A way of load balancing among threads, where less busy queues enqueued more.
Can you please suggest some logics (or) idea how to find the minimum size queues among the given queues dynamically in programming point of view.
I am working on visual studio 2008, C++ programming language in our own multithreading library implementing a multi-rate synchronous data flow paradigm .

As you see trying to find the less loaded queue is cumbersome and could be an inefficient method as you may add more work to queues with only one heavy task, whereas queues with small tasks will have nor more jobs and become quickly inactive.
You'd better use a work-stealing heuristic : when a thread is done with its own jobs it will look at the other threads queues and "steal" some work instead of remaining idle or be terminated.
Then the system will be auto-balanced with each thread being active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.

If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.

Why aren't the threads fetching their work from a 'master' work queue ?
If you are really trying to distribute work items from a master source, to a set of workers, you are then doing load balancing, as you say. In that case, you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in Computing, you can easily spend weeks, or months learning about it.

You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy, if you want a single reader/writer pattern, because you probably have to use lock for this operation, which is not what you want. Note: I'm just guessing, that you can't use lock-free queues here; either you have a counter or take the difference of two pointers, either way you have a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling him that worker thread x has consumed a job. Using this approach you have n more queues, each from one worker thread to the dispatcher thread.

Threading in an endless C++ program

I have a web interface where the user submits some data and it gets written to a database. In the background there is a C++ program which periodically checks the database for new entries. It then takes these entries, processes them and writes their result to a directory. It then proceeds to sleep and keep checking for new entries to process.
My question is in regards to adding multithreading to the C++ program. I have read that it's generally a bad idea just to create a new thread every time you need a another job done, but rather add the jobs to a queue and disperse them out to a fixed number of threads that have already been created (say, 5 or so). Is this the proper design route to take for my situation? Also, if I understand pthread_join correctly, I don't actually need to call it because I don't want to wait for all of the jobs to finish before continuing to check for new updates to the database.
I just wanted to make sure I'm headed in the right direction, any affirmations/criticisms/resources?

You should first decide whether you even need more than one thread - it sounds like checking the database and writing files at some given interval can be accomplished using only one thread. Multiple threads would become useful when you start having to write different data to multiple files simultaneously at non-regular intervals. You are correct that using a queue of sorts would be the best way to distribute these 'jobs' to your threads, and that using a thread pool will give you a little more control over how many 'jobs' you want running simultaneously at any given time. The pthread_join method is used when you want to make sure one thread doesn't exit before another - I've used this mostly to make sure that the program's initial thread doesn't exit after creating the thread pool, as when the parent thread exits the program's execution stops. Some psuedo code based on my comments below.
main thread:
spawn child threads
while(some exit condition){
check database for new jobs
if(new jobs){
acquire job queue mutex //mutexes ensures only one thread accesses shared
add job to queue //data at a time
signal on shared condition variable
release job queue mutex
}
sleep(some regular duration)
}
child thread:
while(some exit condition){
acquire job queue mutex
if(job queue's size == 0){
wait on the shared condition variable
}
grab job from queue
release job queue mutex
handle job
}
See here for pthread/mutex/CV usage notes.

In my experience creating a thread will most likely take tens of milliseconds. For your days computers this is not a big deal. Nothing bad will happen if it will be created/destroyed often. Looking for simple and flawless app level design might be more important.
As a possible variant, I would recommend considering a pool of threads, one thread per available CPU core. These threads should simply sleep at the end of the loop and regularly check if there is something to do or not.
This simplistic design will add minimal overhead and allow using all available CPU power at the same time.
My 2 cents.

Possible frameworks/ideas for thread managment and work allocation in C++

I am developing a C++ application that needs to process large amount of data. I am not in position to partition data so that multi-processes can handle each partition independently. I am hoping to get ideas on frameworks/libraries that can manage threads and work allocation among worker threads.
Manage threads should include at least below functionality.
1. Decide on how many workers threads are required. We may need to provide user-defined function to calculate number of threads.
2. Create required number of threads.
3. Kill/stop unnecessary threads to reduce resource wastage.
4. Monitor healthiness of each worker thread.
Work allocation should include below functionality.
1. Using callback functionality, the library should get a piece of work.
2. Allocate the work to available worker thread.
3. Master/slave configuration or pipeline-of-worker-threads should be possible.
Many thanks in advance.

Your question essentially boils down to "how do I implement a thread pool?"
Writing a good thread pool is tricky. I recommend hunting for a library that already does what you want rather than trying to implement it yourself. Boost has a thread-pool library in the review queue, and both Microsoft's concurrency runtime and Intel's Threading Building Blocks contain thread pools.
With regard to your specific questions, most platforms provide a function to obtain the number of processors. In C++0x this is std::thread::hardware_concurrency(). You can then use this in combination with information about the work to be done to pick a number of worker threads.
Since creating threads is actually quite time consuming on many platforms, and blocked threads do not consume significant resources beyond their stack space and thread info block, I would recommend that you just block worker threads with no work to do on a condition variable or similar synchronization primitive rather than killing them in the first instance. However, if you end up with a large number of idle threads, it may be a signal that your pool has too many threads, and you could reduce the number of waiting threads.
Monitoring the "healthiness" of each thread is tricky, and typically platform dependent. The simplest way is just to check that (a) the thread is still running, and hasn't unexpectedly died, and (b) the thread is processing tasks at an acceptable rate.
The simplest means of allocating work to threads is just to use a single shared job queue: all tasks are added to the queue, and each thread takes a task when it has completed the previous task. A more complex alternative is to have a queue per thread, with a work-stealing scheme that allows a thread to take work from others if it has run out of tasks.
If your threads can submit tasks to the work queue and wait for the results then you need to have a scheme for ensuring that your worker threads do not all get stalled waiting for tasks that have not yet been scheduled. One option is to spawn a new thread when a task gets blocked, and another is to run the not-yet-scheduled task that is blocking a given thread on that thread directly in a recursive manner. There are advantages and disadvantages with both these schemes, and with other alternatives.

How can I improve my real-time behavior in multi-threaded app using pthreads and condition variables?

I have a multi-threaded application that is using pthreads. I have a mutex() lock and condition variables(). There are two threads, one thread is producing data for the second thread, a worker, which is trying to process the produced data in a real time fashion such that one chuck is processed as close to the elapsing of a fixed time period as possible.
This works pretty well, however, occasionally when the producer thread releases the condition upon which the worker is waiting, a delay of up to almost a whole second is seen before the worker thread gets control and executes again.
I know this because right before the producer releases the condition upon which the worker is waiting, it does a chuck of processing for the worker if it is time to process another chuck, then immediately upon receiving the condition in the worker thread, it also does a chuck of processing if it is time to process another chuck.
In this later case, I am seeing that I am late processing the chuck many times. I'd like to eliminate this lost efficiency and do what I can to keep the chucks ticking away as close to possible to the desired frequency.
Is there anything I can do to reduce the delay between the release condition from the producer and the detection that that condition is released such that the worker resumes processing? For example, would it help for the producer to call something to force itself to be context switched out?
Bottom line is the worker has to wait each time it asks the producer to create work for itself so that the producer can muck with the worker's data structures before telling the worker it is ready to run in parallel again. This period of exclusive access by the producer is meant to be short, but during this period, I am also checking for real-time work to be done by the producer on behalf of the worker while the producer has exclusive access. Somehow my hand off back to running in parallel again results in significant delay occasionally that I would like to avoid. Please suggest how this might be best accomplished.

I could suggest the following pattern. Generally the same technique could be used, e.g. when prebuffering frames in some real-time renderers or something like that.
First, it's obvious that approach that you describe in your message would only be effective if both of your threads are loaded equally (or almost equally) all the time. If not, multi-threading would actually benefit in your situation.
Now, let's think about a thread pattern that would be optimal for your problem. Assume we have a yielding and a processing thread. First of them prepares chunks of data to process, the second makes processing and stores the processing result somewhere (not actually important).
The effective way to make these threads work together is the proper yielding mechanism. Your yielding thread should simply add data to some shared buffer and shouldn't actually care about what would happen with that data. And, well, your buffer could be implemented as a simple FIFO queue. This means that your yielding thread should prepare data to process and make a PUSH call to your queue:
X = PREPARE_DATA()
BUFFER.LOCK()
BUFFER.PUSH(X)
BUFFER.UNLOCK()
Now, the processing thread. It's behaviour should be described this way (you should probably add some artificial delay like SLEEP(X) between calls to EMPTY)
IF !EMPTY(BUFFER) PROCESS(BUFFER.TOP)
The important moment here is what should your processing thread do with processed data. The obvious approach means making a POP call after the data is processed, but you will probably want to come with some better idea. Anyway, in my variant this would look like
// After data is processed
BUFFER.LOCK()
BUFFER.POP()
BUFFER.UNLOCK()
Note that locking operations in yielding and processing threads shouldn't actually impact your performance because they are only called once per chunk of data.
Now, the interesting part. As I wrote at the beginning, this approach would only be effective if threads act somewhat the same in terms of CPU / Resource usage. There is a way to make these threading solution effective even if this condition is not constantly true and matters on some other runtime conditions.
This way means creating another thread that is called controller thread. This thread would merely compare the time that each thread uses to process one chunk of data and balance the thread priorities accordingly. Actually, we don't have to "compare the time", the controller thread could simply work the way like:
IF BUFFER.SIZE() > T
DECREASE_PRIORITY(YIELDING_THREAD)
INCREASE_PRIORITY(PROCESSING_THREAD)
Of course, you could implement some better heuristics here but the approach with controller thread should be clear.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js