Celery: number of workers vs concurrency

What is the difference between having:
one worker with concurrency 4, or
two workers with concurrency 2 each,
for the same queue?
Thanks

I assume that you are running both workers on the same machine. In that case I would recommend maintaining one worker per queue.
Two workers for the same queue on one machine does not benefit you in any way; it just wastes memory.
Use two or more workers when you have multiple queues, to maintain priority or to allocate a different number of cores to each worker.
Two or more workers for a single queue is useful if you run the workers on different machines. The workers on different machines consume tasks from the same queue, and you can allocate each worker's concurrency based on the cores available on its machine.
I realise I'm responding 2+ years later, but I thought I'd put this here for anyone who still has similar doubts.
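For concreteness, a rough sketch of how the two setups in the question would be started on one machine (the app name proj and queue name myqueue are placeholders, not anything from the original thread):

    # one worker with concurrency 4
    celery -A proj worker -Q myqueue --concurrency=4

    # two workers with concurrency 2 each (distinct node names)
    celery -A proj worker -n w1@%h -Q myqueue --concurrency=2
    celery -A proj worker -n w2@%h -Q myqueue --concurrency=2

Both setups consume the same queue with 4 pool processes in total; the second simply runs two parent worker processes with their own pools, which is where the extra memory overhead mentioned above comes from.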

Interesting question.
Things that I can think of (I'm sure there are a lot more):
For high availability:
You want more than one machine (in case one goes down), so you need at least one worker per machine.
Even on one machine, I think it is safer to have 2 workers running in two different processes instead of one worker with high concurrency (correct me if I'm wrong, but I think the concurrency can be implemented with threads).
In the docs I see that the recommendation is to set concurrency to the number of CPUs.
If you want to separate different tasks onto different workers.
Of course, there is a price for that: more processes take more resources (CPU, memory, etc.).
I found this question which is quite similar.
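As a rough illustration of the "two workers in two different processes" point, celery multi can start and stop them on one machine (proj and myqueue are placeholders):

    # start two workers, each with its own process and a pool of 2
    celery multi start w1 w2 -A proj -Q myqueue -c 2

    # stop them later
    celery multi stopwait w1 w2 -A proj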

Determining the number of parallel processes in a multi-instance subprocess

I am modelling a process which at times will require a very large number of parallel sub-processes (tens of thousands) to be launched. Obviously it’s not possible for these all to run in parallel simultaneously - how will the Camunda process engine handle this? Is it possible to control how many subprocesses will run at a time?
Camunda 7 uses a job executor thread pool. This determines the concurrency level of jobs, such as an asynchronously started call activity.
The number of sub-processes you mentioned is very high, though. What history level did you have in mind? It is likely better to handle this differently.
Camunda 8 was released two days ago. It has a fundamentally different architecture: no relational DB, applying event-streaming concepts, designed for massive volumes. It may be more suitable for your use case.

One asynchronous thread per task?

My application currently has a list of "tasks" which it loops through (each task consists of a function; that function can be as simple as printing something out, but also much more complex). (Additional note: most tasks send a packet after they have been executed.) As some of these tasks could take quite some time, I thought about using a separate, asynchronous thread for each task, thus letting all the tasks run concurrently.
Would that be a smart thing to do or not? One problem is that I can't possibly know the number of tasks beforehand, so it could result in quite a few threads being created, and I read somewhere that each piece of hardware has its limitations. I'm planning to run my application on a Raspberry Pi, and I think I will always have to run between 5 and 20 tasks.
Also, some of the tasks have a lower "priority" of running than others.
Should I just run the important tasks first and then the less important ones? (The problem here is that if the sum of the time needed for all tasks together exceeds the interval at which some specific, important task should run, my application won't be accurate anymore.) Or implement everything with asynchronous threads? Or just try to make everything a little bit faster by only putting the "packet sending" in an asynchronous thread, thus not having to wait until the packets actually get sent?
There are a number of questions you will need to ask yourself before you can reasonably design a solution.
Do you wish to limit the number of concurrent tasks?
Is it possible that in future the number of concurrent tasks will increase in a way that you cannot predict today?
... and probably a whole lot more.
If the answer to any of these is "yes" then you have a few options:
A producer/consumer queue with a fixed number of threads draining the queue (not recommended IMHO)
Write your tasks as asynchronous state machines around an event dispatcher such as boost::io_service (this is much more scalable).
If you know it's only going to be 20 concurrent tasks, you'll probably get away with std::async, but it's a sloppy way to write code.
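The answer above is framed in C++ terms (boost::io_service, std::async). Since the underlying pattern is language-agnostic, here is a minimal sketch of the "tasks driven by one event dispatcher instead of one thread per task" idea using Python's asyncio; the task names, count, and periods are purely illustrative:

    import asyncio

    # Each "task" becomes a coroutine driven by a single event loop,
    # rather than its own OS thread.
    async def periodic_task(name: str, period: float) -> None:
        while True:
            print(f"running {name}")      # stand-in for the real work / packet sending
            await asyncio.sleep(period)   # yields control to the other tasks

    async def main() -> None:
        # 5-20 periodic tasks fit comfortably in one event loop
        tasks = [asyncio.create_task(periodic_task(f"task-{i}", 1.0 + 0.1 * i))
                 for i in range(20)]
        await asyncio.gather(*tasks)

    if __name__ == "__main__":
        asyncio.run(main())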

Understanding the scalability of Erlang

It is said that thousands of processes can be spawned to do similar tasks concurrently, and Erlang is good at handling this. If there is more work to be done, we can simply and safely add more worker processes, and that makes it scalable.
What I fail to understand is that if the work performed by each worker is itself resource-intensive, how will Erlang be able to handle it? For instance, if entries are being made into a table by several sources, and an Erlang application within its hundreds of processes reads rows from the table and does something, this is obviously likely to strain resources. Every worker will try to pull a record from the table.
If this is a bad example, consider a worker that has to perform a highly CPU-intensive computation in memory. Thousands of such workers running concurrently will overwork the CPU.
Please rectify my understanding of the scalability in Erlang:
Erlang processes get time slices of the CPU only if there is work available for them. OS processes on the other hand get time slices regardless of whether they are idle.
The startup and shutdown time of Erlang processes is much lower than that of OS processes.
Apart from the above two points is there something about Erlang that makes it scalable?
Thanks,
Melvyn
Scaling in Erlang is not automatic. The Erlang language and runtime provides some tools which makes it comparatively easy to write concurrent programs. If these are written correctly, then they are able to scale along several different dimensions:
Parallel execution on multiple cores - since the VM knows how to utilize them all.
Capacity - since you can have a process per task and they are lightweight.
The biggest advantage is that Erlang processes are isolated, as in the OS, but unlike in the OS the communication overhead is small. These two traits are what you want to exploit in Erlang programming.
A highly contended data resource is a problem to avoid if you are targeting high parallel execution. The best way around it is to split up your problem so the contention doesn't occur.
I have a blog post, http://jlouisramblings.blogspot.dk/2013/01/how-erlang-does-scheduling.html which describes in some more detail how the Erlang scheduler works. You may want to read that.

Intel TBB tasks for serving network connections - good model?

I'm developing a backend for a networking product that serves dozens of clients (N = 10-100). Each connection requires 2 periodic tasks, a heartbeat and downloading of telemetry via SSH, each at H Hz. There are also extra events of different kinds coming from the frontend. By the nature of each task, a solid part of the time is spent waiting in a select call on each connection's socket, which allows the OS to switch between threads often to serve other clients while waiting for a response.
In my initial implementation, I create 3 threads per connection (heartbeat, telemetry, extra), each waiting on a single condition variable, which is poked every time there is something to do in a workqueue. The workqueue is filled with the above-mentioned periodic events using a timer and commands from the frontend.
I have a few questions here.
Would it be a good idea to switch from the worker thread pool approach to Intel TBB tasks? If so, to what number of threads do I need to initialize tbb::task_scheduler_init?
In the current approach with 300 threads waiting on a condition variable, which is signaled N * H * 3 times per second, it is likely to become a bottleneck for scalability (especially on the side which calls signal). Are there any better approaches for waking up just one worker per task?
How is waking of a worker thread implemented in TBB?
Thanks for your suggestions!
It's difficult to say whether switching to TBB would be a good approach or not. What are your performance requirements, and what are the performance numbers for the current implementation? If the current solution is good enough, then it's probably not worthwhile to switch.
If you want to compare the two (current implementation vs TBB) to know which gives better performance, you could build what is called a "tracer bullet" (from the book The Pragmatic Programmer) for each implementation and compare the results. In simpler terms, build a reduced prototype of each and compare.
As mentioned in this answer, it's typically not a good idea to attempt performance improvements without concrete evidence that what you're going to change will actually improve things.
Besides all of that, you could consider making a thread pool with the number of threads being some function of the number of CPU cores (maybe a factor of 1 or 1.5 threads per core). The threads would take tasks off a common work queue. There would be three types of tasks: heartbeat, telemetry, and extra. This should reduce the negative impact of context switching caused by using a large number of threads.
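A minimal sketch of that last suggestion, shown in Python for brevity (the same shape applies to a C++ thread pool); the handler names and connection count are illustrative, not part of the original question:

    import os
    from concurrent.futures import ThreadPoolExecutor

    def heartbeat(conn_id: int) -> None:
        print(f"heartbeat for connection {conn_id}")

    def telemetry(conn_id: int) -> None:
        print(f"telemetry for connection {conn_id}")

    def extra(conn_id: int, event: str) -> None:
        print(f"extra event {event!r} for connection {conn_id}")

    # roughly 1 to 1.5 threads per core; the executor's internal queue acts
    # as the common work queue that all threads drain
    n_threads = int((os.cpu_count() or 1) * 1.5)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for conn_id in range(10):            # N connections
            pool.submit(heartbeat, conn_id)
            pool.submit(telemetry, conn_id)
        pool.submit(extra, 7, "frontend-command")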

Setting celery concurrency to 1 worker per queue

I am essentially using RabbitMQ queues in Celery as a poor man's synchronisation. E.g. when certain objects are updated (and updates have a high cost), I round-robin them to a set of 10 queues based on their object IDs. Firstly, is this a common pattern, or is there a better way?
Secondly, with celeryd, it seems that the concurrency level option (CELERY_CONCURRENCY) sets the number of worker processes, which service all of the worker's queues. This kind of defeats the purpose of using the queues for synchronisation, as a queue can be serviced by multiple worker processes, which means potential race conditions when performing different actions on the same object.
Is there a way to set the concurrency level (or worker pool options) so that we have one worker per N queues?
Thanks
Sri
Why don't you simply implement a global task lock, using memcache or a NoSQL DB?
That way you avoid any race conditions.
Here is an example:
http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
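A condensed sketch of the pattern from that recipe, assuming a Celery/Django project with a memcached-backed cache; the task name, LOCK_EXPIRE value, and do_expensive_update helper are illustrative only:

    from celery import shared_task
    from django.core.cache import cache

    LOCK_EXPIRE = 60 * 5  # lock expires after 5 minutes in case the task dies

    @shared_task
    def update_object(obj_id):
        lock_id = "update-lock-%s" % obj_id
        # cache.add is atomic: it only succeeds if the key does not exist yet
        if cache.add(lock_id, "locked", LOCK_EXPIRE):
            try:
                do_expensive_update(obj_id)   # hypothetical helper
            finally:
                cache.delete(lock_id)         # release the lock
        else:
            # another worker holds the lock; retry a bit later
            update_object.apply_async((obj_id,), countdown=10)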
Related to the first part of your question, I've asked and answered a similar question here: Route to worker depending on result in Celery?
Essentially you can route directly to a worker depending on a key, which in your case is an ID. It avoids any need for a single locking point. Hopefully it's useful, even though this question is 2 years old :)
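A minimal sketch of routing by key along those lines; the app name, broker URL, obj-N queue names, and the mod-10 split are assumptions for illustration, and each queue would need exactly one worker (with concurrency 1) consuming it to get the serialisation you want:

    from celery import Celery

    app = Celery("proj", broker="amqp://localhost")

    @app.task
    def update_object(obj_id):
        ...  # the expensive update goes here

    def enqueue_update(obj_id):
        # Every update for a given object ID always lands on the same queue,
        # so updates to the same object are serialised without an explicit lock.
        queue_name = "obj-%d" % (obj_id % 10)
        update_object.apply_async(args=(obj_id,), queue=queue_name)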