Looking for some help with ElastiCache.
We're using ElastiCache Redis to run a Resque-based queueing system.
This means it's a mix of sorted sets and lists.
Under normal operation everything is fine and we're seeing good response times and throughput.
CPU is around 7-10%, and Get+Set commands are around 120-140K operations. (All metrics are CloudWatch based.)
But when the system experiences a (mild) burst of data, enqueuing several thousand messages, we see the server become nearly unresponsive:
the CPU is steady at 100% utilization (the metric says 50%, but Redis is using a single core),
the number of operations drops to ~10K,
response times slow to a matter of SECONDS per request.
We would expect that even if the CPU got loaded to such an extent, throughput would stay the same; this is what we see when running Redis locally: Redis can saturate a CPU, but throughput stays high, and since it is natively single-threaded, no context switching occurs.
As far as we know, we do NOT impose any limits: no persistence, no replication, just the basic config.
The instance size is cache.r3.large.
We are not using periodic snapshotting.
This seems like a characteristic of a rogue Lua script.
A defect in such a script could cause a big CPU load while degrading the overall throughput.
Are you using one? Try looking in the Redis slow log for it.
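If it helps narrow it down, here is a minimal sketch using redis-py to dump the slow log; the host name is a placeholder for your ElastiCache endpoint, and the entry count is arbitrary.

# Minimal sketch (redis-py): list the slowest commands Redis has logged,
# which would surface a long-running EVAL from a Lua script.
import redis

r = redis.Redis(host="your-cluster.xxxxxx.cache.amazonaws.com", port=6379)  # placeholder endpoint

for entry in r.slowlog_get(25):
    # duration is reported in microseconds
    print(entry["id"], entry["duration"], entry["command"])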
Related
We have an application that experiences some pretty short, sharp spikes: generally about 15-20 minutes long, with a peak of 150-250 requests/second but an average of roughly 50-100 requests/second over that time. p50 response times are around 70ms (whereas p90 is around 450ms).
The application generally just serves models from a database/memcached cluster, but sometimes also makes requests to third-party APIs (tracking, Stripe, etc.).
This is a Django application running with uWSGI on Kubernetes.
I'll spare you the full uWSGI/Kubernetes settings, but the TL;DR:
# uwsgi
master = true
listen = 128 # Limited by Kubernetes
workers = 2 # Limited by CPU cores (2)
threads = 1
# Of course much more detail here that I can load test...but will leave it there to keep the question simple
# kube
Pods: 5-7 (horizontal autoscaling)
If we assume a 150ms average response time, I'd roughly calculate a total capacity of 93 requests/second, somewhat short of our peak. In our logs we often see "uWSGI listen queue of socket ... full" messages, which makes sense.
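For what it's worth, a back-of-the-envelope version of that capacity estimate (the numbers are just the figures quoted above, not measurements):

# Rough capacity estimate: concurrent workers / average response time.
# Figures are the ones quoted in the question; adjust to your own numbers.
workers_per_pod = 2
pods = 7                 # upper end of the autoscaling range
avg_response_s = 0.150   # assumed 150 ms average response time

capacity_rps = workers_per_pod * pods / avg_response_s
print(f"~{capacity_rps:.0f} requests/second")  # ~93 requests/second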
My question is...what are our options here to handle this spike? Limitations:
It seems the 128 listen queue limit is determined by the kernel, and the Kubernetes docs suggest it's unsafe to increase this.
Our Kube nodes have 2 cores. The general advice seems to be to set your number of workers to 2 * cores (possibly + 1), so we're pretty much at our limit here. Increasing to 3 doesn't seem to have much impact.
Multiple threads in Django can apparently cause weird bugs.
Is our only option to keep scaling horizontally at the Kubernetes level (aside from making our queries/caching as efficient as possible, of course)?
I have a micro Amazon EC2 instance, and whenever the application hosted on it is given a large load for a couple of hours, it slows down and the CPU credits drop to almost zero.
I have turned the auto scaling option on, but it still does not work. Can someone help me figure out how to get around this?
All t2 instances use a burstable model, which is not really intended for sustained heavy usage. When idling, the instance builds up CPU credits, up to a cap. When the CPU is maxed out, the credits are spent. Once you run out, you are capped at a very low rate. The number of credits you can accumulate and the rate at which you earn them depend on which t2 instance size you are using.
Autoscaling is for horizontal scaling: with it you can launch extra instances based on certain triggers, but you need to use a load balancer to spread traffic across instances.
As to the question of how you can tell from CPU utilization that you are using 100% of your credited CPU: in my experience, you don't. What you see in top or iostat, for example, is a CPU% reported at something quite low, like 30%, while I/O is not bottlenecked, and you wonder why it is stuck at such low CPU usage.
But there is a value you might see in top, at the far right end, something like "68% st": that is the "steal" value. It means that you only get 32% of that CPU, so your 30% CPU value is actually about 94% of what you get.
I have also observed that when you add up the CPU% of the processes in running state (R) in top, you arrive at a number relative to your actually available CPU. For example, I had 24 processes running at 8% each on a t2.medium instance with 2 virtual CPUs; that is 192% actually running, i.e. 96% of the available CPU cycles, not the 32% reported as %user by top and iostat.
If I were to build an autoscaling trigger, I would look at what I can get from the /proc filesystem and take the "steal" amount into account.
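As a rough sketch of that idea, you could sample /proc/stat twice and compute the steal percentage (field order per proc(5); the 5-second interval is arbitrary):

# Sketch: compute the "steal" percentage from /proc/stat, which could feed
# a custom scaling trigger. Field order on the "cpu" line:
# user nice system idle iowait irq softirq steal ...
import time

def cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()

deltas = [b - a for a, b in zip(before, after)]
steal_pct = 100.0 * deltas[7] / sum(deltas)  # index 7 is "steal"
print(f"steal: {steal_pct:.1f}%")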
I am trying to set up scalable background image processing using Elastic Beanstalk.
My setup is the following:
Application server (running on Elastic Beanstalk) receives a file, puts it on S3 and sends a request to process it over SQS.
Worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, loads the original image from S3, processes it into 10 different variants, and stores them back on S3.
These upload events happen at a rate of about 1-2 batches per day, 20-40 pictures per batch, at unpredictable times.
Problem:
I am currently using one micro instance for the worker. Generating one variant of a picture can take anywhere from 3 seconds to 25-30 (the first ones seem to be done in 3, but then the micro instance slows down; I think this is due to its design for short ~2-second bursts of workload). Anyway, when I upload 30 pictures, the job takes 30 pics * 10 variants each * 30 seconds = 9,000 seconds, i.e. about 2.5 hours to process??!?!
Obviously this is unacceptable. I tried using a "small" instance instead; the performance is consistent there, but at about 5 seconds per variant it's still 30 * 10 * 5 = 1,500 seconds, about 25 minutes per batch. Still not really acceptable.
What is the best way to attack this problem so that I get the fastest results while staying price-efficient?
Solutions I can think of:
Rely on Beanstalk auto-scaling. I've tried that, setting up auto-scaling based on CPU utilization. It seems very slow to react and unreliable. I tried setting the measurement period to 1 minute and the breach duration to 1 minute, with thresholds of 70% to scale up and 30% to scale down, in increments of 1 instance. It takes the system a while to scale up and then a while to scale down; I could probably fine-tune it, but it still feels wrong. Ideally I would like to use a faster machine than a micro (small, medium?) for these spikes of work, but with Beanstalk that means I need to run at least one all the time, and since the system is idle most of the time that doesn't make any sense price-wise.
Abandon Beanstalk for the worker: implement my own monitor of the SQS queue running on a micro, have it fire up a larger machine (or a group of larger machines) when there are enough pending messages in the queue, and terminate them the moment the queue is detected to be idle. That seems like a lot of work, unless there is a ready-made solution out there. In any case, I would lose the Beanstalk benefits of deploying code through git, managing environments, etc.
I don't like either of these two solutions.
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger baseline machine, everything will run that much faster.
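As a sketch of what that could look like with boto3 (the queue URL, namespace, and metric name below are placeholders), you could periodically publish the queue depth as a custom CloudWatch metric and attach the scaling alarm to it:

# Sketch: read the SQS queue depth and publish it as a custom CloudWatch
# metric that a scaling policy/alarm can act on. Queue URL and metric
# names are placeholders.
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

cloudwatch.put_metric_data(
    Namespace="ImageWorker",
    MetricData=[{"MetricName": "PendingJobs", "Value": depth, "Unit": "Count"}],
)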
I have an Amazon micro instance and it looks like the CPU is not enough. I'm going to upgrade to the next cheapest instance with more CPU available.
Could that be an m1.small instance? According to the description they have the same number of compute units, and it looks like the micro can even outperform the small instance when more compute becomes available for short CPU bursts.
Update: note that this information is only really applicable to the previous generation t1.micro instance type, which had a cyclical clamping throttle algorithm. The current generation t2 instance class, including the t2.micro, has much better performance than the t1.micro and an entirely different algorithm controlling the throttling. Throttling on the t2 instance class is driven by CPU credits, which are visible in the CloudWatch metrics for the instance, throttling is much more graceful, and kicks in much later. Throttling on the t1.micro was essentially a black box, and the system would repeatedly shift in and out of the throttled mode, under high loads. There is no longer a compelling reason to use a t1 instance, unless you are running a PV AMI. The t2 is HVM.
ECUs are "EC2 Compute Units" and represent, approximately, the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron processor.
This Comparison of t1.micro and m1.small explains that a small instance has 1 ECU continually available, while a Micro can operate in short bursts of up to 2 ECU, but with an ongoing baseline of much less.
In my testing, I've found that consuming 100% CPU for about 10-15 seconds on a micro instance gets you throttled down to a fraction of that -- approximately 0.2 ECU -- for about the next 2-3 minutes, when the throttling lifts for a few seconds, then the cycle repeats, though it only repeats if you are still pulling the hard burst. They accomplish the throttling via the hypervisor "stealing" a large percentage of your available cycles; you can see this in top when it's happening. If you go long enough without demanding 100% CPU, the 2 ECU burst is immediately available when you need it -- it's not as if they are cycling the performance up and down with a timer; the throttling is reactive to the imposed load.
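If you want to reproduce that kind of test, a crude sketch is to burn CPU and log how much work gets done each second; on a throttled micro the per-second count collapses once the burst allowance runs out:

# Crude throttling probe: count busy-loop iterations completed per second.
# On a throttled micro instance the count drops sharply after the burst.
import time

while True:
    count = 0
    deadline = time.time() + 1.0
    while time.time() < deadline:
        count += 1
    print(f"{time.strftime('%H:%M:%S')}  iterations/sec: {count}")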
Over time, the small instance will get more processing done, since the micro is throttled so aggressively after a few seconds of heavy usage, long enough to more than counteract the brief periods of nice burstability. This makes sense, though, since the micro is a lower-cost instance.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html
...so, yes, try a small instance.
I'm experiencing very high response latency with Redis, to the point of not being able to get any output when using the INFO command through redis-cli.
This server handles requests from around 200 concurrent processes, but it does not store too much information (at least to our knowledge). When the server is responsive, the INFO command reports used memory of around 20-30 MB.
When running top on the server, during periods of high response latency, CPU usage hovers around 95 - 100%.
What are some possible causes for this kind of behavior?
It is difficult to propose an explanation based only on the provided data, but here is my guess. I suppose you have already checked the obvious latency sources (the ones linked to persistence), that no Redis command is hogging the CPU in the slow log, and that the size of the job data pickled by Python-rq is not huge.
According to the documentation, Python-rq inserts the jobs into Redis as hash objects and lets Redis expire the related keys (500 seconds seems to be the default value) to get rid of the jobs. If you have serious throughput, at some point you will have many items in Redis waiting to be expired, and their number will be high compared to the number of pending jobs.
You can check this point by looking at the number of items to be expired in the result of the INFO command.
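With redis-py, for example, the keyspace section of INFO reports both the total key count and how many keys have a TTL set (host/port here are placeholders):

# Sketch (redis-py): per database, show the total number of keys and how
# many of them are waiting to be expired.
import redis

r = redis.Redis(host="localhost", port=6379)  # adjust to your server

for db, stats in r.info("keyspace").items():
    print(db, "keys:", stats["keys"], "to expire:", stats["expires"])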
Redis expiration is based on a lazy mechanism (applied when a key is accessed) and an active mechanism based on key sampling, which is run in the event loop (in pseudo-background mode, every 100 ms). The point is that while the active expiration mechanism is running, no Redis command can be processed.
To avoid impacting the performance of client applications too much, only a limited number of keys are processed each time the active mechanism is triggered (by default, 10 keys). However, if more than 25% of the sampled keys are found to be expired, it tries to expire more keys and loops. This is how this probabilistic algorithm automatically adapts its activity to the number of keys Redis has to expire.
When many keys are to be expired, though, this adaptive algorithm can significantly impact the performance of Redis. You can find more information here.
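To make the mechanism concrete, here is a deliberately simplified Python-style sketch of the cycle described above; the real activeExpireCycle in the Redis source also enforces a per-cycle time limit and other safeguards that are omitted here.

# Simplified model of Redis's active expiration cycle: sample a few keys
# that have a TTL, delete the expired ones, and keep looping while more
# than 25% of the sample turned out to be expired.
import random
import time

SAMPLE_SIZE = 10  # keys sampled per iteration (the default mentioned above)

def active_expire_cycle(keys_with_ttl, now=None):
    """keys_with_ttl maps key -> absolute expiration timestamp."""
    now = time.time() if now is None else now
    while keys_with_ttl:
        items = list(keys_with_ttl.items())
        sample = random.sample(items, min(SAMPLE_SIZE, len(items)))
        expired = [k for k, expire_at in sample if expire_at <= now]
        for k in expired:
            del keys_with_ttl[k]
        # Stop once 25% or less of the sample was expired.
        if len(expired) <= len(sample) / 4:
            break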
My suggestion would be to try to prevent Python-rq from delegating item cleanup to Redis by setting expirations. This is a poor design for a queuing system anyway.
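If you go that route, rq exposes this through the result_ttl argument to enqueue; a minimal sketch (my_task and its import path are placeholders for your own job function):

# Sketch: set result_ttl=0 so finished jobs are discarded immediately
# instead of leaving keys behind for Redis to expire later.
from redis import Redis
from rq import Queue

from myapp.tasks import my_task  # placeholder import for your job function

q = Queue(connection=Redis())
q.enqueue(my_task, "some-argument", result_ttl=0)  # do not keep the result around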
I don't think reducing the TTL is the right way to avoid the CPU usage caused by Redis expiring keys.
Didier makes a good point that in the current architecture of Python-rq, the cleanup of jobs is delegated to Redis via the key-expiration feature, and surely, as Didier said, that is not the best way. (This is only used when result_ttl is greater than 0.)
The problem should then arise when you have a set of keys/jobs with expiration dates close to one another, which could happen when you have bursts of job creation.
But Python-rq sets the expiration on a key only when the job has finished in a worker,
so it doesn't make much sense: the key expirations should be spread out over time, with enough time between them to avoid this situation.