I am essentially using RabbitMQ queues in Celery as a poor man's synchronisation. E.g. when certain objects are updated (and the updates are expensive), I round-robin them to a set of 10 queues based on their object IDs. Firstly, is this a common pattern, or is there a better way?
Secondly, with celeryd, it seems that the concurrency option (CELERY_CONCURRENCY) sets the number of workers across all the queues. This rather defeats the purpose of using the queues for synchronisation, since a queue can be serviced by multiple workers, which means potential race conditions when performing different actions on the same object.
Is there a way to set the concurrency level (or worker pool options) so that we have one worker per N queues?
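For reference, here is roughly the kind of ID-based routing I mean (a sketch in current Celery syntax; the queue names and the task itself are made up):

    # Sketch only: spread objects over 10 queues keyed by object ID.
    # The queue names ("object-queue-0" .. "object-queue-9") and the task
    # are illustrative; apply_async(queue=...) is standard Celery.
    from celery import Celery

    app = Celery("tasks", broker="amqp://localhost")

    @app.task
    def update_object(object_id):
        ...  # the expensive update goes here

    def enqueue_update(object_id):
        queue_name = "object-queue-%d" % (object_id % 10)
        update_object.apply_async(args=[object_id], queue=queue_name)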
Thanks
Sri
Why don't you simply implement a global task lock, using memcached or a NoSQL DB?
That way you avoid any race conditions.
Here is an example:
http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
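The gist of that recipe, as a rough sketch (here using python-memcached directly; the lock key format, the expiry, and the work function are illustrative, not part of the recipe):

    # memcached's add() is atomic: it only succeeds if the key does not already
    # exist, so it can act as a simple distributed lock with an expiry.
    import memcache

    LOCK_EXPIRE = 60 * 5  # seconds, illustrative value
    mc = memcache.Client(["127.0.0.1:11211"])

    def process_object(object_id, do_update):
        lock_id = "object-lock-%d" % object_id
        if not mc.add(lock_id, "locked", time=LOCK_EXPIRE):
            return  # another worker holds the lock; skip or retry later
        try:
            do_update(object_id)  # the expensive work, passed in by the caller
        finally:
            mc.delete(lock_id)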
Related to the first part of your question, I've asked and answered a similar question here: Route to worker depending on result in Celery?
Essentially you can route directly to a worker depending on a key, which in your case is an ID. It avoids any need for a single locking point. Hopefully it's useful, even though this question is 2 years old :)
What is the difference between having:
one worker with concurrency 4 or
two workers with concurrency 2 each
for the same queue?
Thanks
I assume that you are running both workers on the same machine. In that case I would recommend maintaining one worker per queue.
Two workers for the same queue do not benefit you in any way; they just waste memory.
Use two or more workers when you have multiple queues, to maintain priority or to allocate a different number of cores to each worker.
Two or more workers for a single queue are useful if you run the workers on different machines. Workers on different machines consume tasks from the same queue, and you can allocate concurrency based on the cores available on each machine.
I realise I'm responding 2+ years later, but I thought I'd put it here for anyone who still has similar doubts.
Interesting question.
Things that I can think of (I'm sure there are a lot more):
For high availability:
You want more than one machine (in case one goes down), so you need at least one worker per machine.
Even on a single machine, I think it is safer to have 2 workers running in two different processes rather than one worker with high concurrency (correct me if I'm wrong, but I think concurrency is implemented with threads).
In the docs I see the recommendation is to set concurrency to the number of CPUs.
If you want to separate different tasks onto different workers...
Of course, there is a price for that: more processes take more resources (CPU/memory, etc.).
I found this question which is quite similar.
Is the following correct?
The disruptor pattern has better parallel performance and scalability if each entry has to be processed in multiple ways (I/O operations or annotations), since that can be parallelized using multiple consumers without contention.
Conversely, work stealing (i.e. storing entries locally and stealing entries from other threads) has better parallel performance and scalability if each entry has to be processed in a single way only, since disjointly distributing the entries onto multiple threads in the disruptor pattern causes contention.
(And is the disruptor pattern still so much faster than other lockless multi-producer multi-consumer queues (e.g. from boost) when multiple producers (i.e. CAS operations) are involved?)
My situation in detail:
Processing an entry can produce several new entries, which must eventually be processed too. Performance has the highest priority; processing entries in FIFO order is the second priority.
In the current implementation, each thread uses a local FIFO, to which it adds its new entries. Idle threads steal work from other threads' local FIFOs. Dependencies between the threads' processing are resolved using a lockless, mechanically sympathetic hash table (CAS on write, with bucket granularity). This results in pretty low contention, but FIFO order is sometimes broken.
Using the disruptor pattern would guarantee FIFO order. But wouldn't distributing the entries onto the threads cause much higher contention (e.g. CAS on a read cursor) than for local FIFOs with work stealing (each thread's throughput is about the same)?
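For concreteness, the work-stealing side of my current implementation is shaped roughly like the sketch below (simplified Python; the real code is C++ and lockless, so the per-queue lock here only stands in for the lockless structures, and the steal direction is illustrative):

    import collections
    import random
    import threading

    class Worker:
        """Thread-local FIFO plus stealing from other workers when idle."""
        def __init__(self, all_workers):
            self.all_workers = all_workers
            self.local = collections.deque()   # local FIFO of entries
            self.lock = threading.Lock()       # stand-in for the lock-free queue

        def push(self, entry):
            with self.lock:
                self.local.append(entry)       # new entries go to the owner's queue

        def pop_or_steal(self):
            with self.lock:
                if self.local:
                    return self.local.popleft()  # FIFO order within one thread
            victim = random.choice(self.all_workers)
            if victim is not self:
                with victim.lock:
                    if victim.local:
                        return victim.local.pop()  # steal from the other end
            return None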
References I've found
The performance tests in the standard technical paper on the disruptor (Chapter 5 + 6) do not cover disjoint work distribution.
https://groups.google.com/forum/?fromgroups=#!topic/lmax-disruptor/tt3wQthBYd0 is the only reference I've found on disruptor + work stealing. It states that a queue per thread is dramatically slower if there is any shared state, but does not go into detail or explain why. I doubt that this sentence applies to my situation with:
shared state being resolved with a lockless hash table;
having to disjointly distribute entries amongst consumers;
except for work stealing, each thread reads and writes only in its local queue.
Update - bottom line up front: for maximum performance you need to write both, in the idiomatic style of the disruptor and of work stealing, and then benchmark.
To me, I think the distinction is primarily in the split between message vs task focus, and therefore in the way you want to think of the problem. Try to solve your problem, and if it is task-focused then Disruptor is a good fit. If the problem is message focused, then you might be more suited to another technique such as work stealing.
Use work stealing when your implementation is message focused. Each thread can pick up a message and run it through to completion. As an example, take an HTTP server: each inbound HTTP request is allocated a thread, and that thread handles the request start to finish - logging the request, checking security controls, doing the vhost lookup, fetching the file, sending the response, and closing the connection.
Use the disruptor when your implementation is task focused. Each thread works on a particular stage of the processing. Alternative example: with a task focus, the HTTP processing above would be split into stages, so you would have a thread that does logging, a thread for security controls, a thread for vhost lookup, and so on; each thread focuses on its task and passes the request to the next thread in the pipeline. Stages may be parallelised, but the overall structure is a thread focused on a specific task, handing the message along between threads.
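To make the task-focused shape concrete, here is a toy staged pipeline (the stages and queue wiring are made up, and a real Disruptor would use a ring buffer with sequence counters rather than plain queues):

    import queue
    import threading

    def stage(name, inbox, outbox, handle):
        """Run one pipeline stage on its own thread; pass each message downstream."""
        def run():
            while True:
                msg = inbox.get()
                if msg is None:                # shutdown sentinel
                    if outbox is not None:
                        outbox.put(None)
                    break
                handle(msg)
                if outbox is not None:
                    outbox.put(msg)
        threading.Thread(target=run, name=name, daemon=True).start()

    # Wire up: logging -> security checks -> vhost lookup, one thread per stage.
    q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
    stage("logging",  q1, q2, lambda m: print("log", m))
    stage("security", q2, q3, lambda m: None)
    stage("vhost",    q3, None, lambda m: None)
    q1.put({"path": "/index.html"})            # a request flows through every stage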
Of course, you can change your implementation to suit each approach better.
In your case, I would structure the problem differently if you wanted to use Disruptor. Generally you would eliminate shared state by having a single thread own the state and pass all tasks through that thread of work - look up SEDA for lots of diagrams like this. This can have lots of benefits, but again, is really down to your implementation.
Some more verbosity:
Disruptor - very useful when strict ordering of stages is required, with additional benefits when all tasks are of a consistent length, e.g. no blocking on an external system and a very similar amount of processing per task. In this scenario you can assume that all threads will work evenly through the system, and therefore arrange N threads to process every N messages. I like to think of the Disruptor as an efficient way to implement SEDA-like systems where threads process stages. You can of course have an application with a single stage and multiple parallel units performing the same work at each stage, but that is not really the point in my view. This would totally avoid the cost of shared state.
Work stealing - use this when tasks are of varied duration and the order of message processing is not important, as it allows threads that are free and have already consumed their own messages to make progress on another thread's task queue. This way, if for example you have 10 threads and 1 is blocked on IO, the remainder will still complete their processing.
I am trying to implement a new scheduling technique with multiple threads. Each thread has its own private local queue. The idea is that each time a task is created by the program thread, it should find the queue with the fewest tasks among all the queues and enqueue the task there.
It is a way of load balancing among threads, where less busy queues get more tasks enqueued.
Can you please suggest some logic or ideas for dynamically finding the smallest queue among the given queues, from a programming point of view?
I am working with Visual Studio 2008 and C++, in our own multithreading library implementing a multi-rate synchronous dataflow paradigm.
As you can see, trying to find the least loaded queue is cumbersome and can be inefficient: you may add more work to a queue that holds a single heavy task, while queues with many small tasks run out of work and quickly go idle.
You'd be better off using a work-stealing heuristic: when a thread is done with its own jobs, it looks at the other threads' queues and "steals" some work instead of remaining idle or being terminated.
The system then balances itself, with every thread staying active until there is not enough work for everyone.
You should not have a situation with idle threads and work waiting for processing.
If you really want to try this, can each queue not just keep a public 'int count' member, updated with atomic inc/dec as tasks are pushed/popped?
Whether such a design is worth the management overhead and the occasional 'mistakes' when a task is queued to a thread that happens to be running a particularly lengthy job when another thread is just about to dequeue a very short job, is another issue.
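A minimal sketch of that idea (Python for brevity; in the C++ library the count would be a std::atomic<int> updated with fetch_add/fetch_sub):

    import threading
    from collections import deque

    class CountedQueue:
        """Task queue that exposes a public count, updated as tasks are pushed/popped."""
        def __init__(self):
            self._items = deque()
            self._lock = threading.Lock()
            self.count = 0                 # read without the lock when picking a queue

        def push(self, task):
            with self._lock:
                self._items.append(task)
                self.count += 1

        def pop(self):
            with self._lock:
                if not self._items:
                    return None
                self.count -= 1
                return self._items.popleft()

    def enqueue_to_least_loaded(queues, task):
        # The counts may be slightly stale, which is exactly the occasional
        # 'mistake' described above; for load balancing that is usually fine.
        min(queues, key=lambda q: q.count).push(task)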
Why aren't the threads fetching their work from a 'master' work queue?
If you are really trying to distribute work items from a master source, to a set of workers, you are then doing load balancing, as you say. In that case, you really are talking about scheduling, unless you simply do round-robin style balancing. Scheduling is a very deep subject in Computing, you can easily spend weeks, or months learning about it.
You could synchronise a counter among the threads. But I guess this isn't what you want.
Since you want to implement everything using dataflow, everything should be queues.
Your first option is to query the number of jobs inside a queue. I think this is not easy if you want a single-reader/single-writer pattern, because you probably have to use a lock for this operation, which is not what you want. Note: I'm just guessing that you can't use lock-free queues here; whether you keep a counter or take the difference of two pointers, either way you need a lock.
Your second option (which can be done with lock-free code) is to send a command back to the dispatcher thread, telling it that worker thread x has consumed a job. With this approach you have n more queues, one from each worker thread back to the dispatcher thread.
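A rough sketch of that second option (the queue layout and the completion message are made up; the point is the per-worker feedback channel back to the dispatcher):

    import queue

    NUM_WORKERS = 4
    job_queues = [queue.Queue() for _ in range(NUM_WORKERS)]  # dispatcher -> worker i
    done_queue = queue.Queue()                                # workers -> dispatcher

    def worker(worker_id, process):
        while True:
            job = job_queues[worker_id].get()
            process(job)
            done_queue.put(worker_id)       # tell the dispatcher one job was consumed

    def dispatch(jobs):
        outstanding = [0] * NUM_WORKERS     # dispatcher's local view of each queue's load
        for job in jobs:
            while not done_queue.empty():   # drain completion notifications
                outstanding[done_queue.get()] -= 1
            target = min(range(NUM_WORKERS), key=lambda i: outstanding[i])
            job_queues[target].put(job)
            outstanding[target] += 1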
I'm developing a backend for a networking product that serves dozens of clients (N = 10-100). Each connection requires 2 periodic tasks, a heartbeat and downloading of telemetry via SSH, each at H Hz. There are also extra events of various kinds coming from the frontend. By the nature of each task, there is a significant amount of waiting in a select call on each connection's socket, which allows the OS to switch between threads often to serve other clients while waiting for a response.
In my initial implementation, I create 3 threads per connection (heartbeat, telemetry, extra), each waiting on a single condition variable, which is poked every time there is something to do in a workqueue. The workqueue is filled with the above-mentioned periodic events using a timer and commands from the frontend.
I have a few questions here.
Would it be a good idea to switch a worker thread pool approach to Intel TBB tasks? If so, to which value of threads do I need to initialize tbb::task_scheduler_init?
In the current approach, with 300 threads waiting on a condition variable that is signaled N * H * 3 times per second, this is likely to become a scalability bottleneck (especially on the side that calls signal). Are there any better approaches for waking up just one worker per task?
How is waking of a worker thread implemented in TBB?
Thanks for your suggestions!
It's difficult to say whether switching to TBB would be a good approach or not. What are your performance requirements, and what are the performance numbers for the current implementation? If the current solution is good enough, then it's probably not worthwhile to switch.
If you want to compare the two (current implementation vs. TBB) to find out which gives better performance, you could do what is called a "tracer bullet" (from the book The Pragmatic Programmer) for each implementation and compare the results. In simpler terms, build a reduced prototype of each and compare them.
As mentioned in this answer, it's typically not a good idea to attempt performance improvements without concrete evidence that the change you're about to make will actually be an improvement.
Besides all of that, you could consider making a thread pool with the number of threads being some function of the number of CPU cores (maybe a factor of 1 to 1.5 threads per core). The threads would take tasks off a common work queue. There would be 3 types of tasks: heartbeat, telemetry, and extra. This should reduce the negative impact of context switching caused by using large numbers of threads.
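A minimal sketch of that shape (Python for illustration; with TBB the workers would be tasks rather than explicit threads, and the handler bodies are hypothetical stubs):

    import os
    import queue
    import threading

    work_queue = queue.Queue()
    NUM_THREADS = int((os.cpu_count() or 4) * 1.5)   # ~1-1.5 threads per core

    def send_heartbeat(conn): ...      # hypothetical: ping the device
    def fetch_telemetry(conn): ...     # hypothetical: pull telemetry over SSH
    def handle_extra(conn): ...        # hypothetical: frontend-driven event

    def worker():
        while True:
            kind, conn = work_queue.get()   # task types: heartbeat, telemetry, extra
            if kind == "heartbeat":
                send_heartbeat(conn)
            elif kind == "telemetry":
                fetch_telemetry(conn)
            else:
                handle_extra(conn)
            work_queue.task_done()

    for _ in range(NUM_THREADS):
        threading.Thread(target=worker, daemon=True).start()

    # A timer or the frontend enqueues work items, e.g.:
    # work_queue.put(("heartbeat", connection))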
When I write a message-driven app, much like a standard Windows app except that it uses messaging extensively for internal operations, what would be the best approach to threading?
As I see it, there are basically three approaches (if you have any other setup in mind, please share):
Having a single thread process all of the messages.
Having separate threads for separate message types (General, UI, Networking, etc...)
Having multiple threads that share and process a single message queue.
So, would there be any significant performance differences between the three?
Here are some general thoughts:
Obviously, the last two options benefit from a situation where there is more than one processor. Plus, if any thread is waiting for an external event, other threads can still process unrelated messages. But ignoring that, it seems that multiple threads only add overhead (thread switches, not to mention more complicated synchronisation situations).
And another question: Would you recommend to implement such a system upon the standard Windows messaging system, or to implement a separate queue mechanism, and why?
The specific choice of threading model should be driven by the nature of the problem you are trying to solve. There isn't necessarily a single "correct" approach to designing the threading model for such an application. However, if we adopt the following assumptions:
messages arrive frequently
messages are independent and don't rely too heavily on shared resources
it is desirable to respond to an arriving message as quickly as possible
you want the app to scale well across processing architectures (i.e. multicore/multi-CPU systems)
scalability is the key design requirement (e.g. more messages at a faster rate)
resilience to thread failure / long operations is desirable
In my experience, the most effective threading architecture would be to employ a thread pool. All messages arrive on a single queue, multiple threads wait on the queue and process messages as they arrive. A thread pool implementation can model all three thread-distribution examples you have.
#1 Single thread processes all messages => thread pool with only one thread
#2 Thread per N message types => thread pool with N threads, each thread peeks at the queue to find appropriate message types
#3 Multiple threads for all messages => thread pool with multiple threads
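A minimal sketch of how one pool design covers all three options just by varying the thread count (the message shape and handler are illustrative):

    import queue
    import threading

    class MessagePool:
        """Threads share one message queue; num_threads selects the model:
        1 -> option #1, N with per-type filtering -> option #2, N -> option #3."""
        def __init__(self, num_threads, handler):
            self.queue = queue.Queue()
            self.handler = handler
            for _ in range(num_threads):
                threading.Thread(target=self._run, daemon=True).start()

        def post(self, message):
            self.queue.put(message)

        def _run(self):
            while True:
                self.handler(self.queue.get())

    # pool = MessagePool(num_threads=4, handler=lambda msg: print("handled", msg))
    # pool.post({"type": "UI", "payload": "redraw"})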
The benefit of this design is that you can scale the number of threads in the pool in proportion to the processing environment or the message load. The number of threads can even be scaled at runtime to adapt to the real-time message load being experienced.
There are many good thread pooling libraries available for most platforms, including .NET, C++/STL, Java, etc.
As to your second question, whether to use the standard Windows message dispatch mechanism: this mechanism comes with significant overhead and is really only intended for pumping messages through a Windows application's UI loop. Unless that is the problem you are trying to solve, I would advise against using it as a general message dispatching solution. Furthermore, Windows messages carry very little data; it is not an object-based model. Each Windows message has a message code and two integer parameters (wParam and lParam), which may not be enough to base a clean messaging model on. Finally, the Windows message queue is not designed to handle cases like queue saturation, thread starvation, or message re-queuing; these cases often arise when implementing a decent message queuing solution.
We can't tell you much for sure without knowing the workload (i.e., the statistical distribution of events over time), but in general:
A single queue with multiple servers is at least as fast, and usually faster, so options 1 and 3 would be preferable to 2.
Multiple threads in most languages add complexity because of the need to avoid contention and multiple-writer problems.
Long-duration processes can block processing of other things that could otherwise get done quickly.
So my horseback guess is that having a single event queue, with several server threads taking events off the queue, might be a little faster.
Make sure you use a thread-safe data structure for the queue.
It all depends.
For example:
Events in a GUI queue are best handled by a single thread, as there is an implied order in the events, so they need to be processed serially. This is why most GUI apps have a single thread to handle events, though potentially multiple threads to create them (and it does not preclude the event thread from creating a job and handing it off to a worker pool; see below).
Events on a socket can potentially be handled in parallel (assuming HTTP), as each request is stateless and can thus be processed independently (OK, I know that is oversimplifying HTTP).
Work jobs, where each job is independent and placed on a queue: this is the classic case for a set of worker threads. Each thread performs a potentially long operation independently of the other threads, and on completion comes back to the queue for another job.
In general, don't worry about the overhead of threads. It's not going to be an issue if you're talking about merely a handful of them. Race conditions, deadlocks, and contention are a bigger concern, and if you don't know what I'm talking about, you have a lot of reading to do before you tackle this.
I'd go with option 3, using whatever abstractions my language of choice offers.
Note that there are two different performance goals, and you haven't stated which you are targeting: throughput and responsiveness.
If you're writing a GUI app, the UI needs to be responsive. You don't care how many clicks per second you can process, but you do care about showing some response within a tenth of a second or so (ideally less). This is one of the reasons it's best to have a single thread devoted to handling the GUI (other reasons have been mentioned in other answers). The GUI thread needs to basically convert Windows messages into work items and let your worker queue handle the heavy work. Once a worker is done, it notifies the GUI thread, which then updates the display to reflect any changes. The GUI thread does things like painting a window, but not rendering the data to be displayed. This gives the app the quick "snappiness" that most users want when they talk about performance. They don't care if it takes 15 seconds to do something hard, as long as when they click on a button or a menu, it reacts instantly.
The other performance characteristic is throughput. This is the number of jobs you can process in a specific amount of time. Usually this type of performance tuning is only needed for server-type applications or other heavy-duty processing. It measures how many web pages can be served up in an hour, or how long it takes to render a DVD. For these sorts of jobs, you want to have 1 active thread per CPU. Fewer than that, and you're going to be wasting idle clock cycles. More than that, and the threads will be competing for CPU time and tripping over each other. Take a look at the second graph in this DDJ article for the trade-off you're dealing with. Note that the ideal thread count is higher than the number of available CPUs due to things like blocking and locking. The key is the number of active threads.
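A rough sketch of that division of labour (a plain-Python stand-in; a real Windows app would notify the GUI thread via PostMessage or the UI framework's invoke mechanism instead of polling a result queue, and the handlers here are hypothetical stubs):

    import queue
    import threading

    work_queue = queue.Queue()     # GUI thread -> worker
    result_queue = queue.Queue()   # worker -> GUI thread

    def do_heavy_work(job): return job          # hypothetical long-running task
    def update_display(result): print(result)   # hypothetical repaint helper

    def worker():
        while True:
            job = work_queue.get()
            result_queue.put(do_heavy_work(job))   # notify the GUI side when done

    threading.Thread(target=worker, daemon=True).start()

    def on_button_click(job):
        work_queue.put(job)        # returns immediately, so the UI stays responsive

    def gui_idle_tick():
        # Called regularly from the GUI loop: paint results, never do heavy work here.
        while not result_queue.empty():
            update_display(result_queue.get())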
A good place to start is to ask yourself why you need multiple threads.
The well-thought-out answer to this question will lead you to the best answer to the subsequent question, "how should I use multiple threads in my application?"
And that must be a subsequent question; not a primary question. The first question must be why, not how.
I think it depends on how long each message will take to process. Does each message take the same amount of time, or will certain messages take a few seconds, for example? If I knew that message A was going to take 10 seconds to complete, I would definitely use a new thread, because why would I want to hold up the queue for one long-running job...
My 2 cents.
I think option 2 is the best. Having each thread do independent tasks would give you the best results. The third approach can cause more delays if multiple threads are doing I/O operations such as disk reads, reading from common sockets, and so on.
Whether to use the Windows messaging framework for processing requests depends on the workload each thread would have. I think Windows restricts the number of messages that can be queued to at most 10,000. For most cases this should not be an issue, but if you have lots of messages to be queued it might be something to take into consideration.
A separate queue gives you better control, in the sense that you can reorder it the way you want (perhaps depending on priority).
Yes, there will be performance differences between your choices.
(1) introduces a bottleneck for message processing.
(3) introduces lock contention because you'll need to synchronize access to your shared queue.
(2) is starting to go in the right direction... though a queue for each message type is a little extreme. I'd probably recommend starting with a queue for each model in your app and adding queues where it makes sense to do so for improved performance.
If you like option #2, it sounds like you would be interested in implementing a SEDA architecture. It is going to take some reading to understand what is going on, but I think the architecture fits well with your line of thinking.
BTW, Yield is a good C++/Python hybrid implementation.
I'd have a thread pool servicing the message queue, and make the number of threads in the pool easily configurable (perhaps even at runtime). Then test it out with expected load.
That way you can see what the actual correlation is - and if your initial assumptions change, you can easily change your approach.
A more sophisticated approach would be for the system to introspect its own performance traits and adapt its use of resources, threads in particular, as it goes. That is probably overkill for most custom application code, but I'm sure there are products out there that do it.
As for the Windows events question: I think that's probably an application-specific question with no right or wrong answer in the general case. That said, I usually implement my own queue, as I can tailor it to the specific characteristics of the task at hand. Sometimes that might involve routing events via the Windows message queue.