RabbitMQ on EC2 Consuming Tons of CPU - django

I am trying to get RabbitMQ with Celery and Django going on an EC2 instance to do some pretty basic background processing. I'm running rabbitmq-server 2.5.0 on a large EC2 instance.
I downloaded and installed the test client per the instructions here (at the very bottom of the page). I have been just letting the test script go and am getting the expected output:
recving rate: 2350 msg/s, min/avg/max latency: 588078478/588352905/588588968 microseconds
recving rate: 1844 msg/s, min/avg/max latency: 588589350/588845737/589195341 microseconds
recving rate: 1562 msg/s, min/avg/max latency: 589182735/589571192/589959071 microseconds
recving rate: 2080 msg/s, min/avg/max latency: 589959557/590284302/590679611 microseconds
The problem is that it is consuming an incredible amount of CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
668 rabbitmq 20 0 618m 506m 2340 S 166 6.8 2:31.53 beam.smp
1301 ubuntu 20 0 2142m 90m 9128 S 17 1.2 0:24.75 java
I was testing on a micro instance earlier and it was completely consuming all resources on the instance.
Is this to be expected? Am I doing something wrong?
Thanks.
Edit:
The real reason for this post was that celerybeat seemed to run okay for a while and then suddenly consume all resources on the system. I installed the RabbitMQ management tools and have been investigating how the queues are created by Celery and by the RabbitMQ test suite. It seems to me that Celery is orphaning these queues and they are not going away.
Here is the queue as generated by the test suite: one queue is created, all the messages go into it, and they come back out.
Celerybeat, on the other hand, creates a new queue every time it runs the task.
It sets the auto-delete parameter to true, but I'm not entirely sure when these queues get deleted. They seem to just slowly build up and eat resources.
Does anyone have an idea?
Thanks.

Ok, I figured it out.
Here's the relevant piece of documentation:
http://readthedocs.org/docs/celery/latest/userguide/tasks.html#amqp-result-backend
Old results will not be cleaned automatically, so you must make sure to consume the results or else the number of queues will eventually go out of control. If you’re running RabbitMQ 2.1.1 or higher you can take advantage of the x-expires argument to queues, which will expire queues after a certain time limit after they are unused. The queue expiry can be set (in seconds) by the CELERY_AMQP_TASK_RESULT_EXPIRES setting (not enabled by default).
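For context, here is a minimal sketch of what that looks like in a Django settings module (the backend name and the one-hour value are illustrative, not from the original post):

# settings.py (Celery 2.x-era setting names, per the linked docs)
CELERY_RESULT_BACKEND = "amqp"
# Let RabbitMQ (2.1.1+) delete result queues an hour after they were
# last used; value is in seconds and purely illustrative
CELERY_AMQP_TASK_RESULT_EXPIRES = 3600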

To add to Eric Conner's solution to his own problem, http://docs.celeryproject.org/en/latest/userguide/tasks.html#tips-and-best-practices states:
Ignore results you don’t want
If you don’t care about the results of a task, be sure to set the ignore_result option, as storing results wastes time and resources.
@app.task(ignore_result=True)
def mytask(…):
    something()
Results can even be disabled globally using the CELERY_IGNORE_RESULT setting.
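A minimal sketch of the global switch, assuming a Django-style settings module:

# settings.py
CELERY_IGNORE_RESULT = True  # discard results for all tasks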
That, along with Eric's answer, is probably the bare minimum of best practices for managing your results backend.
If you don't need a results backend, set CELERY_IGNORE_RESULT or don't configure a results backend at all. If you do need one, set CELERY_AMQP_TASK_RESULT_EXPIRES as a safeguard against unused results building up. If you don't need results for a specific task, set ignore_result locally as above.

Related

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time all works fine, but occasionally I'm facing 500s from one of the nodes within a few seconds; the logs only provide the error message given in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. Stackdriver logs do not provide further information either.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. The infrastructure only keeps a request in the queue for a certain amount of time (about 10 s) before aborting it.
This usually happens when:
• traffic suddenly increases sharply
• cold start time is long
• request time is long
We also faced this issue when traffic suddenly increased during business hours. The problem is usually caused by a sudden increase in traffic combined with a longer instance start time. One way to handle this is to keep warm instances always running, i.e. set the --min-instances parameter on the Cloud Run deploy command. Another, recommended, way is to reduce the service's cold start time (which is difficult to achieve in some languages, like Java and Python).
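For illustration, a deploy command with warm instances kept running might look like the following; the service name and image are hypothetical, and --min-instances may require a recent gcloud:

gcloud run deploy my-service --image gcr.io/my-project/my-image \
  --memory 2Gi --concurrency 1 --timeout 8m --platform managed \
  --min-instances 1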
I also experienced the problem, and it is easy to reproduce. I have a Fibonacci container that takes 6 s to process fibo(45). I used Hey to perform 200 requests, with my Cloud Run concurrency set to 1.
Over those 200 requests I got 8 such errors. In my case: a sudden traffic spike and a long processing time. (Cold start is short for me, since it's written in Go.)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals was tying up the 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Why is AWS EC2 CPU usage shooting up to 100% momentarily from IOWait?

I have a large web-based application running in AWS with numerous EC2 instances. Occasionally, about two or three times per week, I receive an alarm notification from my Sensu monitoring system telling me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence and the CPU goes back down to normal levels within seconds. So it seems like something not to get too worried about. But I'm still curious why it is happening. Notice that the CPU is taken up with the 100% IOWaits.
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU and I/O levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying a 5 minute period, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling up the average to mask the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
I would add this as a comment, but I don't have the reputation to do so.
I have noticed my EC2 instances doing this too, but for far longer, and after apt-get update + upgrade.
I thought it was an Apache thing, then started using Nginx on a new instance to test, and it did the same thing: I ran apt-get a few hours ago, then came back to find the instance using full CPU, for hours! Good thing it is just a test machine, but I wonder what it is about Ubuntu/apt-get that might have caused this. From now on I guess I will have to reboot the machine after apt-get, as that seems to be the only way to put it back to normal.

Poor PySpark performance on Stand-alone cluster + Docker

I'm running a Spark slave inside a Docker container on AWS c4.8xlarge machines (one or more) and struggling to get the expected performance compared to just using multiprocessing on my laptop (with a quad-core Intel i7-6820HQ). (See the edits below; there is a huge overhead on the same hardware as well.)
I'm looking for solutions to horizontally scale analytics model training with a "Multiprocessor" which can work in a single thread, multi-process or in a distributed Spark scenario:
import multiprocessing

class Multiprocessor:
    # ... (the elided setup defines has_pyspark, _spark_context
    # and self.max_n_parallel)
    def map(self, func, args):
        if has_pyspark:
            # Distribute via Spark, one partition per arg (capped at 1000)
            n_partitions = min(len(args), 1000)
            return _spark_context.parallelize(args, n_partitions).map(func).collect()
        elif self.max_n_parallel > 1:
            # Local pool of worker processes
            with multiprocessing.Pool(self.max_n_parallel) as pool:
                return list(pool.map(func, args))
        else:
            # Plain single-threaded fallback
            return list(map(func, args))
As you can see, Spark's role is just to distribute calculations and retrieve the results; parallelize().map() is the only API used. args is just a list of integer id tuples, nothing too heavy.
I'm using Docker 1.12.1 (--net host), Spark 2.0.0 (stand-alone cluster), Hadoop 2.7, Python 3.5 and openjdk-7. Results for the same training dataset, every run is CPU-bound:
5.4 minutes with local multiprocessing (4 processes)
5.9 minutes with four c4.8xlarge slaves (10 cores in use / each)
6.9 minutes with local Spark (master local[4])
7.7 minutes with three c4.8xlarge slaves (10 cores in use / each)
25 minutes with a single c4.8xlarge slave (10 cores) (!)
27 minutes with local VM Spark slave (4 cores) (!)
All 36 virtual CPUs seem to be in use, and load averages are 250-350. There were about 360 args values to be mapped; their processing took 15-45 seconds (25th and 75th percentiles). GC times were insignificant. I even tried returning "empty" results to avoid network overhead, but it did not affect the total time. Ping to AWS via VPN is 50-60 ms.
Any tips on which other metrics I should look into? I feel I'm wasting lots of CPU cycles somewhere. I'd really like to build the architecture around Spark, but based on these PoCs, machines on AWS at least are way too expensive. I'll have to run tests on the other local hardware I have access to.
EDIT 1: Tested on a Linux VM on the laptop; using the stand-alone cluster took 27 minutes, 20 minutes more than with local[4].
EDIT 2: There seem to be 7 pyspark daemons for each slave "core", each taking a significant amount of CPU. Is this expected behavior? (picture from the laptop's VM)
EDIT 3: Actually this happens even when starting the slave with just a single core; I get 100% CPU utilization. According to this answer, the red color indicates kernel-level threads, so could Docker play a role here? In any case, I don't remember seeing this issue when I was prototyping with Python 2.7; the performance overhead was very minimal. I have now updated to OpenJDK 8, and it made no difference. I also got the same results with Spark 1.5.0 and Hadoop 2.6.
EDIT 4: I tracked down that by default scipy.linalg.cho_factor uses all available cores; that is why I'm seeing high CPU usage even with a single core for the Spark slave. Must investigate further...
Final edit: The issue seems to have nothing to do with AWS or Spark, I've got poor performance on stand-alone Python within the Docker container. See my answer below.
Had the same problem - for me the root cause was memory allocation.
Make sure you allocate enough memory to your spark instances.
In start-slave.sh, run with --help to see the memory option (the default is 1 GB per node, regardless of the actual memory in the machine).
You can view the allocated memory per node in the UI (port 8080 on the master).
You also need to set the memory per executor when you submit your application, i.e. via spark-submit (again, the default is 1 GB); as before, run with --help to find the memory option.
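For example (all values illustrative, not from the original answer), the worker memory can be raised when starting the slave and the executor memory when submitting:

./sbin/start-slave.sh --memory 16G spark://master-host:7077
spark-submit --master spark://master-host:7077 --executor-memory 8G my_app.py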
Hope this helps.
Sorry for the confusion (I'm the OP); it took me a while to dig down to what was really happening. I did lots of benchmarking and finally realized that the Docker image I was using shipped OpenBLAS, which by default multithreads linalg functions. My code runs cho_solve hundreds of times on matrices ranging from 80 x 80 to 140 x 140. There was simply tons of overhead from launching all these threads, which I don't need in the first place since I'm already parallelizing via multiprocessing or Spark.
# N_CORES=4 python linalg_test.py
72.983 seconds
# OPENBLAS_NUM_THREADS=1 N_CORES=4 python linalg_test.py
9.075 seconds
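If you cannot control the environment on the command line, the same pin can be applied from Python, as long as it happens before numpy/scipy first load OpenBLAS (a sketch, not the OP's actual code; the matrix size mirrors the post):

import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must run before the first numpy/scipy import

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Toy positive-definite system in the 80x80 - 140x140 range from the post
a = np.eye(100) + np.diag(np.full(99, 0.1), k=1) + np.diag(np.full(99, 0.1), k=-1)
c, low = cho_factor(a)
x = cho_solve((c, low), np.ones(100))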

AWS Elastic Beanstalk Worker timing out after inactivity during long computation

I am trying to use Amazon Elastic Beanstalk to run a very long numerical simulation - up to 20 hours. The code works beautifully when I tell it to do a short, 20 second simulation. However, when running a longer one, I get the error "The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own)".
After browsing the web, it seems to me that the issue is that Elastic Beanstalk allows worker processes to run for 30 minutes at most, and then they time out because the instance has not responded (i.e. finished the simulation). The solution some have proposed is to send a message every 30 seconds or so that "pings" Elastic Beanstalk, letting it know that the simulation is going well so it doesn't time out, which would let me run a long worker process. So I have a few questions:
Is this the correct approach?
If so, what code or configuration would I add to the project to make it stop terminating early?
If not, how can I smoothly run a 12+ hour simulation on AWS or more generally, the cloud?
Add on information
Thank you for the feedback, Rohit. To give some more information, I'm using Python with Flask.
• I am indeed using an Elastic Beanstalk worker tier with SQS queues
• In my code, I'm running a simulation of variable length - from as short as 20 seconds to as long as 20 hours. 99% of the work that Elastic Beanstalk does is running the simulation. The other 1% involves saving results, sending emails, etc.
• The simulation itself involves generating many random numbers and working with objects that I defined. I use numpy heavily here.
Let me know if I can provide any more information. I really appreciate the help :)
After talking to a friend who's more in the know about this stuff than me, I solved the problem. It's a little sketchy, but got the job done. For future reference, here is an outline of what I did:
1) Wrote a main script that uses Amazon's boto library to connect to my SQS queue, with an infinite while loop that polls the queue every 60 seconds. When there's a message on the queue, it runs a simulation and then continues with the loop (see the sketch after this list)
2) Borrowed a beautiful /etc/init.d/ template to run my script as a daemon (http://blog.scphillips.com/2013/07/getting-a-python-script-to-run-in-the-background-as-a-service-on-boot/)
3) Made my main script and the script in (2) executable
4) Set up a cron job to make sure the script would start back up if it failed.
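For reference, a rough sketch of the main script from step 1, using the boto 2 SQS API (the region, queue name, and run_simulation are hypothetical):

import time
import boto.sqs

def run_simulation(body):
    pass  # hypothetical: the actual numerical work goes here

conn = boto.sqs.connect_to_region("us-east-1")
queue = conn.get_queue("simulation-queue")

while True:
    message = queue.read(wait_time_seconds=20)  # long poll
    if message is not None:
        run_simulation(message.get_body())
        queue.delete_message(message)  # only acknowledge after success
    else:
        time.sleep(60)  # nothing queued; check again in a minute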
Once again, thank you Rohit for taking the time to help me out. I'm glad I still got to use Amazon even though Elastic Beanstalk wasn't the right tool for the job.
From your question it seems you are running into launches timing out because some commands that run on your instance during launch take more than 30 minutes.
As explained here, you can adjust the Timeout option in the aws:elasticbeanstalk:command namespace. It can have values between 1 and 1800, which means that if your commands finish within 30 minutes you won't see this error. The commands might eventually finish, as the error message says, but since Elastic Beanstalk has not received a response within the specified period it does not know what is going on on your instance.
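For instance, an .ebextensions config file raising the timeout to its maximum could look like this (a sketch based on the documented namespace, not taken from the original answer):

# .ebextensions/timeout.config
option_settings:
  - namespace: aws:elasticbeanstalk:command
    option_name: Timeout
    value: 1800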
It would help if you could add more details about your use case. What commands are you running during startup? Apparently you are using ebextensions to run commands which take a long time. Is it possible to run those commands in the background, or do you need them to run during server startup?
If you are running a Tomcat web app you could also use something like servlet init method to run app bootstrapping code. This code can take however long it needs without giving you this error message.
Unfortunately, there is no way to 'process a message' from an SQS queue for more than 12 hours (see the description of ChangeMessageVisibility and its 12-hour maximum visibility timeout).
With that being the case, this approach doesn't fit your application well. I have run into the same problem.
The correct way to do this: I don't know. However, I would suggest an alternate approach where you grab a message off of your queue, spin off a thread or process to run your long running simulation, and then delete the message (signaling successful processing). In this approach, be careful of spinning off too many threads on one machine and also be wary of machines shutting down before the simulation has ended, because the queue message has already been deleted.
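A sketch of that shape, reusing the boto queue handle and run_simulation from the sketch in the accepted answer (both hypothetical):

import multiprocessing

message = queue.read()
if message is not None:
    body = message.get_body()
    # Deleting first signals success before the work is done; if the
    # machine dies mid-simulation, the message is gone for good
    queue.delete_message(message)
    multiprocessing.Process(target=run_simulation, args=(body,)).start()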
Final note: your question is excellently worded and sufficiently detailed :)
For those looking to run jobs shorter than 10 hours: it should be mentioned that the current inactivity timeout limit is 36000 seconds, i.e. exactly 10 hours, no longer the 30 minutes mentioned in posts all over the web (which led me to think a workaround like the one described above was needed).
Check out the docs: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html
A very nice write-up can be found here: https://dev.to/rizasaputra/understanding-aws-elastic-beanstalk-worker-timeout-42hi

Redis is taking too long to respond

Experiencing very high response latency with Redis, to the point of not being able to output information when using the info command through redis-cli.
This server handles requests from around 200 concurrent processes but it does not store too much information (at least to our knowledge). When the server is responsive, the info command reports used memory around 20 - 30 MB.
When running top on the server, during periods of high response latency, CPU usage hovers around 95 - 100%.
What are some possible causes for this kind of behavior?
It is difficult to propose an explanation only based on the provided data, but here is my guess. I suppose that you have already checked the obvious latency sources (the ones linked to persistence), that no Redis command is hogging the CPU in the slow log, and that the size of the job data pickled by Python-rq is not huge.
According to the documentation, Python-rq inserts the jobs into Redis as hash objects and lets Redis expire the related keys (500 seconds seems to be the default value) to get rid of the jobs. If you have serious throughput, at some point you will have many items in Redis waiting to be expired, and their number will be high compared to the number of pending jobs.
You can check this point by looking at the number of items to be expired in the result of the INFO command.
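With redis-py, for example, the relevant counters can be read like this (a sketch; "expires" is the number of keys carrying a TTL in each database):

import redis

r = redis.Redis(host="localhost", port=6379)
print(r.info("keyspace"))  # e.g. {'db0': {'keys': ..., 'expires': ..., 'avg_ttl': ...}}
print(r.info("stats")["expired_keys"])  # keys already reaped by the expiry mechanisms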
Redis expiration is based on a lazy mechanism (applied when a key is accessed) and an active mechanism based on key sampling, which is run in the event loop (in pseudo-background mode, every 100 ms). The point is that while the active expiration mechanism is running, no Redis command can be processed.
To avoid impacting the performance of the client applications too much, only a limited number of keys are processed each time the active mechanism is triggered (by default, 10 keys). However, if more than 25% of the sampled keys are found to be expired, it tries to expire more keys and loops. This is how this probabilistic algorithm automatically adapts its activity to the number of keys Redis has to expire.
When many keys are to be expired, though, this adaptive algorithm can impact the performance of Redis significantly. You can find more information here.
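To make the adaptive behavior concrete, here is a rough Python paraphrase of the sampling loop described above (not the actual Redis source; sample size and threshold follow the description):

import random

def active_expire_cycle(expiring_keys, now, sample_size=10):
    # expiring_keys maps key -> absolute expiry timestamp
    while expiring_keys:
        sample = random.sample(list(expiring_keys), min(sample_size, len(expiring_keys)))
        expired = [k for k in sample if expiring_keys[k] <= now]
        for k in expired:
            del expiring_keys[k]
        # Stop once 25% or less of the sample was expired; otherwise loop,
        # which is what eats CPU when many keys are due at the same time
        if len(expired) * 4 <= len(sample):
            break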
My suggestion would be to prevent Python-rq from delegating item cleanup to Redis via key expiration. This is a poor design for a queuing system anyway.
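With Python-rq that means controlling result_ttl when enqueuing (a sketch; my_task is hypothetical and would need to live in a module importable by the workers):

from redis import Redis
from rq import Queue

def my_task(x):
    return x * 2  # hypothetical work

q = Queue(connection=Redis())
# result_ttl=0 discards the result immediately, so no expiring result
# keys pile up; the rq default keeps results for 500 seconds
job = q.enqueue(my_task, 21, result_ttl=0)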
I think reducing the TTL is not the right way to avoid CPU usage when Redis expires keys.
Didier makes a good point that the current architecture of Python-rq delegates the cleanup of jobs to Redis via the key-expire feature, and surely, as Didier said, that is not the best way. (This only applies when result_ttl is greater than 0.)
The problem should then arise when you have a set of keys/jobs whose expiration dates are close to one another, which could happen during a burst of job creation.
But Python-rq sets the expiry on a key when the job finishes in a worker, so that doesn't quite make sense: the keys should be spread out over time, with enough time between them to avoid this situation.