Celery using only 20% of CPU (at peak) - concurrency

I'm running a Celery + RabbitMQ app. I start up a bunch of EC2 machines, but I find that my Celery worker machines only use about 15% CPU (20% at peak). I've configured 2 Celery workers per machine.
Shouldn't celery workers be close to using 100% CPU utilization?
MORE INFO: I am not using the Celery --concurrency option or eventlet, even though I am using multiple workers. By default, concurrency is set to 8. My tasks are PHP scripts that are mostly I/O-blocking, so running more processes in parallel shouldn't be a problem. Is there any way to configure Celery to run more tasks based on CPU usage?

Shouldn't celery workers be close to using 100% CPU utilization?
Only if you load them up to utilize 100% CPU :)
My tasks run in php mostly io blocking
If your tasks are primarily making I/O calls, then this is most likely the reason why CPU usage isn't high: each process/thread sits mostly idle after making an I/O call, waiting for it to complete.
It's crucial to benchmark your configuration (a rough sketch follows the list below). In practice this could look like:
Choose an initial concurrency level (e.g. the default).
Benchmark throughput / resource usage.
Increase the concurrency level.
Benchmark throughput / resource usage again.
Continue until increasing concurrency no longer provides any benefit.
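A rough sketch of one benchmark iteration, assuming an existing I/O-bound task (the tasks.fetch_url name is hypothetical): dispatch a fixed batch of tasks, time how long the worker takes to drain it, then restart the worker with a higher --concurrency and repeat.

```python
import time

from celery import group

from tasks import fetch_url  # hypothetical IO-bound task in your project


def benchmark(n=1000):
    """Dispatch n tasks and report throughput for the current worker config."""
    start = time.monotonic()
    result = group(fetch_url.s(i) for i in range(n)).apply_async()
    result.join()  # block until the whole batch has been processed
    elapsed = time.monotonic() - start
    print(f"{n} tasks in {elapsed:.1f}s -> {n / elapsed:.1f} tasks/sec")
```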
If your worker tasks are I/O bound, this is a perfect case for eventlet, since it allows you to run many I/O-bound tasks on a single processor. Consider the case where your machine has 64 cores: for I/O-bound tasks you should easily be able to run some multiple of that, but at some point the majority of resources will go to process accounting, overhead, and context switching.
With eventlet, a single processor could handle hundreds or thousands of concurrent workers:
The prefork pool can take use of multiple processes, but how many is
often limited to a few processes per CPU. With Eventlet you can
efficiently spawn hundreds, or thousands of green threads. In an
informal test with a feed hub system the Eventlet pool could fetch and
process hundreds of feeds every second, while the prefork pool spent
14 seconds processing 100 feeds. Note that this is one of the
applications async I/O is especially good at (asynchronous HTTP
requests). You may want a mix of both Eventlet and prefork workers,
and route tasks according to compatibility or what works best.
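A minimal sketch of switching a worker to the eventlet pool, assuming the eventlet package is installed (the settings below go in a celeryconfig.py; the same can be done on the command line with celery -A proj worker -P eventlet -c 500, where proj is an illustrative module name):

```python
# celeryconfig.py - illustrative values, tune them with the benchmarking loop above
worker_pool = "eventlet"   # green-thread pool suited to IO-bound tasks
worker_concurrency = 500   # hundreds of greenlets are cheap when tasks mostly wait on IO
```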

You have two options: increase the concurrency level (using the --concurrency option), or use the (deprecated) auto-scaling option. Most of the time on AWS we oversubscribe by using a concurrency setting of 2 * N, where N is the number of vCPUs on the instance type of your choice. We do not oversubscribe nodes that consume from the special queue where we send our CPU-bound tasks.
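As a sketch of that 2 * N rule (purely illustrative, not a fixed recipe), the concurrency can be derived from the instance's vCPU count at startup:

```python
# celeryconfig.py - derive worker concurrency from the instance's vCPU count
import os

worker_concurrency = 2 * (os.cpu_count() or 1)  # 2 * N vCPUs for IO-heavy queues
```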

Related

Why does PoCo HTTP server consume CPU on complete idle

I've experimented with the PoCo HTTP server and found it consumes some CPU even when completely idle. This is not high usage, but if we have a lot of instances running it may become a problem.
For network services that use poll, it's normal to constantly use a small amount of CPU time. Nginx and Redis also show some CPU consumption when idle. To achieve zero CPU usage on idle, you will have to use another approach to network communication.

Running an async background task on Tornado

I'm using Tornado Async framework for the implementation of a REST Web-Server.
I need to run a high-CPU-load periodic task on the background of the same server.
It is a low-priority periodic task. It should run all the time on all idle cores, but I don't want it to affect the performance of the Web-Server (under a heavy HTTP-request load, it should take lower priority).
Can I do that with the Tornado IOLoop API?
I know I can use tornado.ioloop.PeriodicCallback to call a periodic background task. But if this task is computationally heavy, it may cause performance issues for the web service.
Tornado (or asyncio in Python 3), like any other single-process event-loop based solution, is not meant for CPU-intensive tasks. You should use it only for I/O-intensive tasks.
The word "background" only means that you're not waiting for its result (I sometimes call this an unattended task). Moreover, if a background task blocks, the rest of the application has to wait, just as when a request handler blocks, the other parts, including the background task, are blocked.
You might be thinking of using threads, but in Python this is not a solution either, due to the GIL.
The right solutions are:
decouple the worker from the web server,
use multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor
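A minimal sketch of that combination, assuming Tornado 5+ (where PeriodicCallback accepts a coroutine); the heavy_computation function and the timings are hypothetical. The CPU-heavy work runs in a child process via run_in_executor, so the IOLoop keeps serving requests.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

import tornado.ioloop

pool = ProcessPoolExecutor(max_workers=2)  # leave cores free for the web server


def heavy_computation():
    # CPU-bound work; runs in a child process, outside the server's GIL
    return sum(i * i for i in range(10_000_000))


async def periodic_job():
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(pool, heavy_computation)
    print("background result:", result)


if __name__ == "__main__":
    # enqueue the low-priority job once a minute without blocking request handlers
    tornado.ioloop.PeriodicCallback(periodic_job, 60_000).start()
    tornado.ioloop.IOLoop.current().start()
```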

AWS Elasticache CPU usage exceeding 100%

We have been using AWS ElastiCache for our applications. We had initially set a CPU alarm threshold of 22% (a 4-core node, so effectively ~90% usage of one core), which is based on the recommended thresholds. But we often see CPU utilization crossing well over 25%, to values like 28% or 34%.
What I am trying to understand is how this is theoretically possible, considering Redis is single-threaded. The only way I can think this can happen is if there are maintenance operations happening on other cores, which can bump the CPU usage above 25%. Even if the cluster is highly loaded, it should cap CPU usage at 25% and probably start timing out for clients. Can someone help me understand under what scenarios the CPU usage of a single-threaded Redis instance can exceed 100% of a core?
The Redis event loop is single-threaded, but the Redis process itself is not. There are a couple of extra threads that offload some I/O-bound operations. Normally, these threads should not consume much CPU.
However, Redis also forks child processes to take care of heavy duty operations like AOF rewrite or RDB save. Each forked process generally consumes 100% of a CPU core (except if the operation is slowed down by I/Os), on top of the Redis event loop consumption.
If you find the CPU consumption regularly high, it may be due to a wrong AOF and RDB configuration (i.e. the Redis instance rewrites the AOF or generates a dump too frequently).
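A small sketch of checking those persistence settings with redis-py on a self-managed instance (ElastiCache disables the CONFIG command, so there you would inspect the parameter group instead; the endpoint below is hypothetical):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

print(r.config_get("save"))                          # RDB snapshot schedule
print(r.config_get("auto-aof-rewrite-percentage"))   # AOF rewrite trigger
print(r.info("persistence")["rdb_bgsave_in_progress"])  # is a forked save running right now?
```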

Configure uwsgi server for performance

I am deploying a uWSGI server for a Django app. Each request has a latency of around 2 seconds. I need to handle 100 QPS. On a 4-core machine, how should I configure the number of processes and the number of threads? I tried to play with the values, but I do not understand what I am doing.
Go through the uWSGI Things to know page. 100 requests per second should be easily attainable with uWSGI.
Based on uWSGI behavior I've experienced, I would recommend that you start with only processes and don't use any threads. With both processes and threads, we observed that there seemed to be an affinity to use threads over processes. That resulted in a single process handling all requests until its thread pool was fully occupied, and only then were requests handled by the next process. This resulted in poor utilization of resources, as a single core was maxed out while all others sat idle. Turning off threading resulted in a massive performance boost for our particular use model.
Your experience may be different. The uWSGI authors stress that there isn't any magic config combination: it's completely dependent on your particular use case. You need to benchmark your app against various configurations to find the sweet spot. Additionally, unless you're able to use benchmarks that perfectly model your actual production load, you'll want to continue to monitor performance and methodically tweak settings after you deploy.
From the Things to know page:
There is no magic rule for setting the number of processes or threads
to use. It is very much application and system dependent. Simple math
like processes = 2 * cpucores will not be enough. You need to
experiment with various setups and be prepared to constantly monitor
your apps. uwsgitop could be a great tool to find the best values.
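Before that experimentation, a back-of-the-envelope estimate (Little's law: concurrency ≈ throughput × latency) gives a feel for how much total concurrency 100 QPS at ~2 s per request implies; how you split it between processes and threads is then exactly what you benchmark:

```python
# Rough capacity estimate using the numbers from the question (illustrative only)
qps = 100
latency_s = 2.0
in_flight = qps * latency_s                           # ~200 requests in flight at any moment
processes = 4                                         # e.g. one per core as a starting point
threads_per_process = round(in_flight / processes)    # ~50 if the 2 s is mostly IO wait
print(in_flight, processes, threads_per_process)
```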

Not sure if I should use celery

I have never used Celery before, and I'm also a Django newbie, so I'm not sure if I should use Celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction over the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So my concern is that if I use Celery, I would have a queue on the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.), but the callback support for state changes is not ready. This is where I think Celery might be appropriate: I would have one or several periodic tasks monitoring the job states.
Any advice on how to proceed? No Celery at all? Celery for everything? Celery just for the job states?
I use Celery for a similar purpose and it works well. Basically, I have one node running Celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, and process the results for reporting or for generating dependent tasks.
Each cluster node runs a very small Python server which takes the DB id of its assigned job. It then calls into the main (HTTP) server to request the data it needs, and finally posts the data back when complete. In my case, the individual nodes don't need to message each other, and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a Celery worker on each node, taking tasks directly from the message queue. That approach is appealing, but I have complex dependencies that are easier to work out with centralized control. Also, I sometimes need to segment the cluster, and centralized control makes it possible to do this on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks which are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server, and think about latency, possible timeouts, data transfer time, etc. It also makes sense to queue long-running, computation-intensive tasks for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also allows you to track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, the Celery broker (implementing queues for tasks), and worker processes (consuming queues and executing Celery tasks) all on the same server.
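For the asker's idea of periodic tasks that poll remote job states, a minimal sketch with Celery beat could look like the following (all names here, the gui app and tasks.poll_job_states, are hypothetical):

```python
from celery import Celery

app = Celery("gui", broker="amqp://localhost")

# Ask the beat scheduler to enqueue the polling task once a minute
app.conf.beat_schedule = {
    "poll-job-states": {
        "task": "tasks.poll_job_states",
        "schedule": 60.0,
    },
}


@app.task(name="tasks.poll_job_states")
def poll_job_states():
    # Placeholder: query the SAGA API for each submitted job's state
    # (running, finished, ...) and update the corresponding Django models.
    pass
```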